
Structured outputs in LLM production work today, but only if you treat the schema as the source of truth and the model as a fallible parser. The reliable pattern across OpenAI, Anthropic, and open-source providers is: define a Zod or Pydantic schema, compile it to JSON Schema, bind it to the provider's native enforcement (OpenAI strict mode, Anthropic tool-use), stream partial JSON for UX, and retry on validation failure with the error fed back into the prompt. Skip any of those steps and you ship a system that breaks at the worst possible moment.
We have spent the last 18 months wiring structured outputs into agent products, RAG pipelines, and form-extraction services. Below is what actually held up under traffic, what looked good in a demo and fell over in production, and the gotchas that cost us the most engineering time.
Free-text LLM output is fine when a human reads it. The moment another system reads it, you need a contract. A pricing extractor that returns {"price": "$1,000"} one call and "approximately one thousand dollars" the next is unshippable.
Structured outputs replace prompt-and-pray with prompt-and-validate. The schema becomes the interface; the model becomes an implementation detail you can swap.
In 2026 there are three serious enforcement mechanisms in production use: OpenAI's strict mode for Structured Outputs, Anthropic's tool-use schemas, and constrained-grammar decoding (vLLM, llama.cpp, Outlines, XGrammar). Each makes a different trade-off between latency, strength of guarantee, and which schema features are supported.
| Approach | How it enforces | Schema dialect | Streaming | Notable limits |
|---|---|---|---|---|
| OpenAI strict mode (response_format: json_schema, strict: true) | Server-side constrained decoding on GPT-4.1, 4o, o-series | JSON Schema subset | Yes, token-by-token | No oneOf at root, no format validation, all fields required, no additionalProperties: true |
| Anthropic tool-use (Claude 3.5/4) | Schema bound to a tool; model emits a tool_use block | JSON Schema (broader subset) | Yes, partial JSON via stream events | No native strict flag, schema is a strong hint not a hard constraint, occasional invalid emits |
| Constrained grammar (Outlines, Instructor with vLLM, llama.cpp) | Logit masking against a regex or CFG derived from the schema | JSON Schema, regex, EBNF | Yes | Adds 10-40% latency, schema must compile to a finite grammar, can deadlock on poorly written schemas |
The honest read: OpenAI gives you the strongest guarantee but the narrowest schema. Anthropic gives you the broadest schema with weaker guarantees. Open-source grammar decoding gives you total control and the worst latency. In practice we run all three, picked per use case.
The pattern that survived contact with every provider is: define the schema once in your application language, compile to JSON Schema, send to the provider, parse the response back through the same schema.
In TypeScript that means Zod plus a thin wrapper. In Python it means Pydantic plus Instructor. Both libraries do the schema-to-JSON-Schema compilation, validation, and retry orchestration so you do not roll your own.
A minimal Zod pattern looks like this:
```typescript
import { z } from "zod";
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

// Single source of truth for the invoice shape.
const Invoice = z.object({
  vendor: z.string(),
  total_cents: z.number().int().nonnegative(),
  line_items: z.array(z.object({
    description: z.string(),
    qty: z.number().int().positive(),
    unit_cents: z.number().int().nonnegative(),
  })),
  paid: z.boolean(),
});

// pdfText is the invoice text extracted upstream.
const result = await openai.chat.completions.parse({
  model: "gpt-4.1",
  messages: [{ role: "user", content: pdfText }],
  response_format: zodResponseFormat(Invoice, "invoice"),
});
// Typed as z.infer<typeof Invoice>; null if the model refused.
const invoice = result.choices[0].message.parsed;
```
Two things matter here. First, parsed is fully typed; downstream code gets autocomplete, not any. Second, if the model refuses or the schema fails, you find out at parse time, not three function calls deeper when something tries to add cents to a string.
The Anthropic equivalent uses tool-use:
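```typescript
import Anthropic from "@anthropic-ai/sdk";
import { zodToJsonSchema } from "zod-to-json-schema";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 2048,
  tools: [{
    name: "save_invoice",
    description: "Save a parsed invoice",
    input_schema: zodToJsonSchema(Invoice),
  }],
  tool_choice: { type: "tool", name: "save_invoice" },
  messages: [{ role: "user", content: pdfText }],
});
// With a forced tool_choice the block should be present, but guard anyway.
const toolUse = response.content.find((b) => b.type === "tool_use");
if (!toolUse || toolUse.type !== "tool_use") throw new Error("No tool_use block in response");
const invoice = Invoice.parse(toolUse.input); // validate even though Anthropic mostly conforms
```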
tool_choice forced to a specific tool is the key. Without it, Claude sometimes returns text instead of calling the tool. With it, you get the structured emit ~99% of the time. The other 1% is why you still wrap with Invoice.parse and a retry.
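A minimal sketch of the guard for the case where the tool call is missing, written against hand-rolled types rather than the SDK's so it stays dependency-free. `ContentBlock` and `extractToolInput` are hypothetical names; only the `type`, `name`, and `input` fields mirror what the example above reads.

```typescript
// Hypothetical, simplified block shapes; the real SDK types carry more fields.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string; input: unknown };

// Returns the raw input of the named tool call, or throws so the caller
// can route the failure into its retry path.
function extractToolInput(content: ContentBlock[], toolName: string): unknown {
  for (const block of content) {
    if (block.type === "tool_use" && block.name === toolName) {
      return block.input;
    }
  }
  throw new Error(`Model did not call ${toolName}`);
}

// Example response content: a text preamble followed by the tool call.
const sampleContent: ContentBlock[] = [
  { type: "text", text: "Saving the invoice." },
  { type: "tool_use", name: "save_invoice", input: { vendor: "Acme", paid: true } },
];
const rawInvoice = extractToolInput(sampleContent, "save_invoice");
```

The returned value is still untyped; it should go through the schema's parse step before anything downstream touches it.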
Users will not wait 8 seconds for a structured response when they can see the equivalent text streaming. The answer is partial-JSON parsing: read the stream token by token, attempt to parse the incomplete JSON, and progressively reveal fields as they finish.
We tried three libraries and standardized on two:
- partial-json (npm) for browser and Node. Tiny, fast, handles incomplete arrays, strings, and objects.
- jiter (Python) ships with Instructor's Partial[Model] support; gives you a streaming async iterator of progressively-completed Pydantic models.

The implementation pattern is the same in both languages: accumulate the token buffer, run a tolerant parse on each chunk, diff against the previous parse, emit a delta. Render the deltas as they arrive. A form with eight fields fills in one field at a time instead of appearing all-at-once after a full generation.
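To make the accumulate, parse, diff, emit loop concrete, here is a dependency-free sketch. It is not how partial-json or jiter work internally: `closeOpenJson`, `parsePartialJson`, and `changedFields` are hypothetical helpers that close whatever is still open and back off one character at a time until the result parses. That is quadratic in the worst case, which is fine for illustration and a reason to use a real library in production.

```typescript
// Append the quote and brackets needed to close a truncated JSON prefix.
function closeOpenJson(prefix: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  return prefix + (inString ? '"' : "") + closers.reverse().join("");
}

// Tolerant parse: trim the buffer until a closed version parses.
// Returns undefined if nothing usable has arrived yet.
function parsePartialJson(buffer: string): unknown {
  for (let end = buffer.length; end > 0; end--) {
    try {
      return JSON.parse(closeOpenJson(buffer.slice(0, end)));
    } catch {
      // Dangling key, comma, or half-written literal: drop a character and retry.
    }
  }
  return undefined;
}

// Top-level keys whose value differs from the previous snapshot.
function changedFields(
  prev: Record<string, unknown>,
  next: Record<string, unknown>,
): string[] {
  return Object.keys(next).filter(
    (key) => JSON.stringify(prev[key]) !== JSON.stringify(next[key]),
  );
}

// Simulated stream: chunk boundaries fall mid-string, mid-key, and mid-literal.
const chunks = ['{"vendor": "Ac', 'me", "total_ce', 'nts": 1250, "paid": tr', "ue}"];
let buffer = "";
let previous: Record<string, unknown> = {};
const deltas: string[][] = [];
for (const chunk of chunks) {
  buffer += chunk;
  const parsed = parsePartialJson(buffer);
  if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) {
    const snapshot = parsed as Record<string, unknown>;
    deltas.push(changedFields(previous, snapshot)); // what the UI should re-render
    previous = snapshot;
  }
}
```

One design caveat: a value cut off mid-number or mid-string shows up in its truncated form first (here `vendor` is briefly "Ac"), so the UI should treat every field as provisional until the stream ends and the full object validates.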
The non-obvious win: streaming structured output is faster than streaming free text plus a post-hoc parse, because the parse happens incrementally and the UI never blocks. We measured a 2.1x improvement in perceived latency on a customer-facing extraction product.
For deeper context on agent UX patterns, our guide on building autonomous coding agents walks through the same streaming-JSON technique applied to multi-step tool calls.
Even with strict mode, schemas fail. The model returns an integer where the schema wants a string with a regex. An enum value drifts. A nested array hits a max-items rule.
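The fix that pays for itself in a week of production: catch the validation error, feed the error message back into the next prompt, and retry up to N times. Instructor ships this out of the box (its max_retries option re-asks the model with the validation error attached). The retries built into the openai SDK cover transport failures such as timeouts and rate limits, not schema validation. If you are rolling your own, the loop is small: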
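```typescript
import { z } from "zod";

type ChatMessage = { role: "user" | "assistant"; content: string };

// callModel is your own provider wrapper: it sends the messages plus the
// schema and returns the model's raw, unvalidated JSON output.
declare function callModel(messages: ChatMessage[], schema: z.ZodTypeAny): Promise<unknown>;

async function generateWithRetry<T>(
  schema: z.ZodType<T>,
  prompt: string,
  maxRetries = 3,
): Promise<T> {
  const messages: ChatMessage[] = [{ role: "user", content: prompt }];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const raw = await callModel(messages, schema);
    const result = schema.safeParse(raw);
    if (result.success) return result.data;
    // Show the model its own output and the exact validation error.
    messages.push({ role: "assistant", content: JSON.stringify(raw) });
    messages.push({
      role: "user",
      content: `Your previous response failed validation: ${result.error.message}. Fix it and respond with valid JSON only.`,
    });
  }
  throw new Error("Max retries exceeded");
}
```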
Two notes from production: cap retries at 3 (after that, the model has dug in and will keep emitting the same wrong thing), and log every retry so you can find systematic failure modes. We have caught at least four schema bugs from retry-rate dashboards alone.
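A rough sketch of what "log every retry" can look like before a real metrics backend exists. `RetryEvent` and `RetryLog` are hypothetical names, and the in-memory array stands in for whatever logging pipeline you already run. With Zod, the failing path could come from `result.error.issues[0].path.join(".")`.

```typescript
// Hypothetical event shape: one record per failed validation attempt.
type RetryEvent = { endpoint: string; attempt: number; errorPath: string };

class RetryLog {
  private events: RetryEvent[] = [];

  record(event: RetryEvent): void {
    this.events.push(event);
  }

  // Count failures per "endpoint:errorPath" so repeat offenders stand out.
  countsByField(): Map<string, number> {
    const counts = new Map<string, number>();
    for (const event of this.events) {
      const key = `${event.endpoint}:${event.errorPath}`;
      counts.set(key, (counts.get(key) || 0) + 1);
    }
    return counts;
  }
}

const retryLog = new RetryLog();
retryLog.record({ endpoint: "invoice", attempt: 1, errorPath: "line_items.0.qty" });
retryLog.record({ endpoint: "invoice", attempt: 2, errorPath: "line_items.0.qty" });
retryLog.record({ endpoint: "invoice", attempt: 1, errorPath: "total_cents" });
const retryCounts = retryLog.countsByField();
```

A field that keeps topping this count usually points at the schema or the prompt rather than at the model.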
These are the failure modes we have hit more than once. Every one of them survived initial code review.
OpenAI strict mode supports enums, but Claude treats them as suggestions. We had a system that emitted severity: "critical" for months on GPT-4, then started occasionally returning severity: "CRITICAL" after we swapped to Claude. The fix is to either normalize before the enum check (in Zod, z.string().transform((s) => s.toLowerCase()).pipe(z.enum([...])); a .transform() chained after z.enum() never runs on a bad value, because the enum check has already failed) or rebuild the schema with a refine that fails loudly.
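The normalize-then-check order is the part that is easy to get wrong, so here it is as a plain TypeScript sketch with no Zod dependency. The severity values other than "critical" are assumptions made up for the example, and `normalizeSeverity` is a hypothetical helper.

```typescript
// Assumed enum values; only "critical" appears in the incident described above.
const SEVERITIES = ["low", "medium", "high", "critical"] as const;
type Severity = (typeof SEVERITIES)[number];

// Normalize casing and whitespace first, then check membership.
// Anything still outside the enum fails loudly instead of leaking downstream.
function normalizeSeverity(raw: string): Severity {
  const candidate = raw.trim().toLowerCase();
  const match = SEVERITIES.find((s) => s === candidate);
  if (match === undefined) {
    throw new Error(`Unknown severity: ${JSON.stringify(raw)}`);
  }
  return match;
}

const normalized = normalizeSeverity("CRITICAL");
```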
OpenAI strict mode requires every field to be in the required array. If a field is genuinely optional, model it as z.union([z.string(), z.null()]) and require it; the model will emit null when it does not apply. If you use z.optional(), the field drops out of the required array, and strict mode rejects the schema at the API boundary, not at runtime, so the failure is loud (good) but unintuitive.
For deeply nested schemas (line items with sub-line-items, for example), the model frequently returns a flat array instead. The reliable fix is to include a one-shot example in the prompt showing the exact nesting. Schema alone is not enough.
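A small sketch of what the one-shot example might look like when it is assembled in code. The `sub_items` field name, the sample values, and `buildExtractionPrompt` are all assumptions for illustration; use the field names from your own schema.

```typescript
// Hypothetical example payload showing the nesting the model should reproduce.
const NESTING_EXAMPLE = {
  line_items: [
    {
      description: "Installation",
      sub_items: [
        { description: "Labor", qty: 2, unit_cents: 5000 },
        { description: "Materials", qty: 1, unit_cents: 12000 },
      ],
    },
  ],
};

// Builds the extraction prompt with the example placed before the document.
function buildExtractionPrompt(documentText: string): string {
  return [
    "Extract the invoice as JSON.",
    "Keep sub-line-items nested under their parent line item, exactly as in this example:",
    JSON.stringify(NESTING_EXAMPLE, null, 2),
    "Document:",
    documentText,
  ].join("\n");
}

const extractionPrompt = buildExtractionPrompt("ACME Corp invoice #42 ...");
```

Serializing the example from an object, rather than pasting a JSON string, keeps it in step with the schema when fields are renamed.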
additionalProperties: false is mandatory for strict mode. OpenAI strict mode rejects any schema where additionalProperties is not explicitly false on every object. Zod-to-JSON-Schema converters default this correctly, but if you hand-write or merge schemas, this trips you. Validate the compiled schema against the OpenAI cookbook examples before you ship.
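One way to catch this before the API does is a small pre-flight check over the compiled schema. The sketch below covers only the two rules discussed here (every property listed in required, additionalProperties: false on every object); it is not a full reimplementation of OpenAI's validation, and `JsonSchema` and `lintStrictSchema` are hypothetical names.

```typescript
// Minimal JSON Schema shape, limited to the keywords this sketch inspects.
type JsonSchema = {
  type?: string | string[];
  properties?: Record<string, JsonSchema>;
  required?: string[];
  additionalProperties?: boolean | JsonSchema;
  items?: JsonSchema;
  anyOf?: JsonSchema[];
};

// Walks the schema and returns one message per violated rule.
function lintStrictSchema(schema: JsonSchema, path = "$"): string[] {
  const problems: string[] = [];
  const isObject = schema.properties !== undefined || schema.type === "object";
  if (isObject) {
    if (schema.additionalProperties !== false) {
      problems.push(`${path}: additionalProperties must be false`);
    }
    const required = schema.required || [];
    const properties = schema.properties || {};
    for (const key of Object.keys(properties)) {
      if (required.indexOf(key) === -1) {
        problems.push(`${path}.${key}: missing from required`);
      }
      problems.push(...lintStrictSchema(properties[key], `${path}.${key}`));
    }
  }
  if (schema.items) {
    problems.push(...lintStrictSchema(schema.items, `${path}[]`));
  }
  for (const branch of schema.anyOf || []) {
    problems.push(...lintStrictSchema(branch, path));
  }
  return problems;
}

// A hand-merged schema with two mistakes: `note` is not required, and the
// nested line-item object forgot additionalProperties: false.
const mergedSchema: JsonSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    note: { type: ["string", "null"] },
    line_items: {
      type: "array",
      items: { type: "object", properties: { qty: { type: "integer" } }, required: ["qty"] },
    },
  },
  required: ["vendor", "line_items"],
  additionalProperties: false,
};
const schemaProblems = lintStrictSchema(mergedSchema);
```

Running a check like this in CI against every compiled schema turns an API-boundary rejection into a failed build.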
Outlines and other constrained-grammar libraries compile enums into regex alternatives. A 5,000-value enum (think SKU codes or country-region pairs) compiles to a multi-megabyte regex that bloats memory and adds seconds of latency per call. The fix: split into a two-step extraction (extract a category first, then constrain the second call to that category's values) or use a fuzzy-match post-processor instead of constrained decoding.
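The two-step split can be sketched as two small schema builders. The catalog below is invented for illustration, and `categorySchema` and `skuSchema` are hypothetical helpers; the point is only that each call sees a short enum instead of the full list.

```typescript
// Hypothetical catalog: category name -> SKU codes in that category.
const SKUS_BY_CATEGORY: Record<string, string[]> = {
  cables: ["CBL-001", "CBL-002"],
  adapters: ["ADP-100", "ADP-200", "ADP-300"],
};

// Step 1: constrain the first call to the handful of category names.
function categorySchema() {
  return {
    type: "object",
    properties: {
      category: { type: "string", enum: Object.keys(SKUS_BY_CATEGORY) },
    },
    required: ["category"],
    additionalProperties: false,
  };
}

// Step 2: constrain the second call to the SKUs of the chosen category only.
function skuSchema(category: string) {
  if (!Object.prototype.hasOwnProperty.call(SKUS_BY_CATEGORY, category)) {
    throw new Error(`Unknown category: ${category}`);
  }
  return {
    type: "object",
    properties: {
      sku: { type: "string", enum: SKUS_BY_CATEGORY[category] },
    },
    required: ["sku"],
    additionalProperties: false,
  };
}

const stepOne = categorySchema();
const stepTwo = skuSchema("adapters"); // assumes step 1 returned "adapters"
```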
If you stream partial JSON and a validation error fires mid-stream, you have already shown the user half a form. Either commit to streaming and accept that a final-validation failure means re-streaming from scratch with a polite "regenerating" toast, or do not stream at all for that endpoint. Mixing the two creates UI states nobody knows how to design for.
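If you do commit to streaming, it helps to write the UI states down explicitly. The reducer below is one possible shape under assumed names (`FormState`, `StreamEvent`, `reduceStream`, and a cap of three attempts to match the retry budget above); it is a sketch of the idea, not a prescription for any UI framework.

```typescript
type FormState = {
  status: "streaming" | "regenerating" | "committed" | "failed";
  fields: Record<string, unknown>; // what the UI currently shows
  attempt: number;
};

type StreamEvent =
  | { type: "delta"; fields: Record<string, unknown> } // partial parse arrived
  | { type: "final"; valid: boolean; fields: Record<string, unknown> }; // stream ended

const MAX_ATTEMPTS = 3; // assumed cap, mirroring the retry budget

function reduceStream(state: FormState, event: StreamEvent): FormState {
  // Terminal states ignore late events.
  if (state.status === "committed" || state.status === "failed") return state;
  if (event.type === "delta") {
    return { status: "streaming", fields: event.fields, attempt: state.attempt };
  }
  if (event.valid) {
    return { status: "committed", fields: event.fields, attempt: state.attempt };
  }
  if (state.attempt >= MAX_ATTEMPTS) {
    return { status: "failed", fields: {}, attempt: state.attempt };
  }
  // Final validation failed: clear the half-filled form and show "regenerating".
  return { status: "regenerating", fields: {}, attempt: state.attempt + 1 };
}

// Walk through one failed stream followed by a successful one.
let formState: FormState = { status: "streaming", fields: {}, attempt: 1 };
formState = reduceStream(formState, { type: "delta", fields: { vendor: "Acme" } });
formState = reduceStream(formState, { type: "final", valid: false, fields: { vendor: "Acme" } });
const afterFailure = formState;
formState = reduceStream(formState, { type: "delta", fields: { vendor: "Acme" } });
formState = reduceStream(formState, { type: "final", valid: true, fields: { vendor: "Acme", paid: true } });
```

Every state here maps to exactly one thing on screen (filling in, regenerating toast, final form, or error).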
For teams refactoring an existing extraction pipeline, our AI-assisted refactoring playbook covers the migration pattern we use to swap providers without breaking schemas in flight.
If you are running LLMs in production without structured outputs, the highest-leverage fix is to wrap your three most-called endpoints in a Zod or Pydantic schema with strict-mode enforcement and a 3-retry budget. You will catch a class of bugs you did not know you had, and your downstream code will get type-safe for free.
If you already have structured outputs and are firefighting reliability, add three dashboards: per-endpoint validation-failure rate, retry-count distribution, and schema-version distribution. Most reliability bugs live in one of those graphs.
If you are picking a stack for a new project: start with OpenAI strict mode plus Zod. Move to Anthropic tool-use when you need a schema feature OpenAI rejects. Move to constrained-grammar decoding only when you are running a local model or need formats neither provider supports (custom DSLs, code snippets, structured math).
Most teams we work with at Cadence hit the same wall: the structured-output layer works locally, breaks under load, and nobody on the team has shipped one to production before. Every engineer on Cadence is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency plus prompt-as-spec discipline before they unlock bookings, so the wiring-up of schemas, retries, and streaming is muscle memory, not a learning curve. Median time-to-first-commit across the platform is 27 hours.
If you have an LLM feature stuck in "works on my machine," book a Mid ($1,000/week) or Senior ($1,500/week) Cadence engineer for a week. You get a 48-hour free trial to see the structured-output layer wired up before the meter starts. Get a Build/Buy/Book recommendation if you are not sure which approach fits your stack.
For deeper background on the broader shift this is part of, see our piece on AI-native engineering ROI and how the disciplined teams pull 2.5-3.5x productivity gains from the same tools everyone else is using.
OpenAI strict mode has the strongest guarantee (constrained decoding, schema enforcement at the token level), but the narrowest supported schema. Anthropic's tool-use schemas accept a wider JSON Schema dialect and conform ~99% of the time, with the remaining 1% caught by an application-side safeParse. For most teams, OpenAI strict mode is the right default.
Use Zod (or Pydantic in Python) as the source of truth and compile to JSON Schema at the API boundary. Maintaining JSON Schema by hand drifts from your TypeScript types and creates a permanent class of bugs. The Zod-to-JSON-Schema converters bundled with the OpenAI and Anthropic SDKs handle the conversion correctly for 95% of schemas.
Use a multi-step prompt ladder: the first call returns a structured "plan" (which tool to call, with what arguments), your code executes the tool, and the result feeds the next structured call. Anthropic's tool-use API and OpenAI's function-calling both support this natively. The pattern matters more than the SDK; the key is one schema per step, not one mega-schema for the whole agent run.
Streaming JSON is structurally incomplete by definition (open braces, unclosed strings) until the final token. Use a tolerant parser like partial-json (npm) or jiter (Python) that knows how to read incomplete JSON; do not feed the raw stream to JSON.parse. The intermediate states are expected, not errors; the parser just has to be designed for them.
Constrained-grammar decoding via Outlines or XGrammar typically adds 10-40% to per-token latency, depending on schema complexity. OpenAI strict mode has near-zero latency overhead (the constraint is enforced server-side as part of normal decoding). For most production workloads where you are already calling a hosted API, strict mode wins on both reliability and cost.