LLM structured output cost impact: JSON schema mode reduces tokens and eliminates retries

When an AI agent needs a structured result (a classification decision, an extracted entity, a routing directive, a function call argument), the naive approach is to ask the LLM in natural language and parse the response with a regex or JSON.parse. This works most of the time. The exceptions are expensive: a parse failure triggers a retry, each retry costs full input + output tokens, and in agents with multiple structured steps the retry rate compounds across the pipeline. LLM structured output modes (OpenAI’s response_format: json_schema, Anthropic’s tool-use forced response, Google’s responseMimeType: application/json with a schema) constrain generation so that the response is valid JSON conforming to your schema on the first call. That eliminates parse errors and also reduces output token count, by 30–60% on short responses, because the model no longer generates the prose preamble and explanation that surrounds unstructured JSON. This page quantifies the cost impact, walks through provider-specific implementation, and covers how RunGuard fits into a structured-output agent architecture.

The cost structure of unstructured JSON: retries and verbosity

Parse error rates in production. Without constrained decoding, LLM JSON output parse error rates typically run 2–8% across diverse prompts. This sounds low, but an agent making 10 LLM calls per session at a 5% parse error rate has a 40% chance of at least one parse failure per session (1 - (0.95)^10 ≈ 0.40). A retry at least doubles the cost of that call: the full input context is re-sent and the output is re-generated. At scale, the aggregate retry cost is substantial. As a conservative illustration, suppose only 5% of 100,000 sessions/day incur a single retried call: that is 5,000 retries. If the average retried call costs $0.05, that’s $250/day ($91,250/year) in parse-error retries alone. With 10 calls per session at a 5% per-call error rate, the expected number of retries is ten times higher (100,000 × 10 × 0.05 = 50,000 per day).
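The compounding effect above is easy to check numerically. The following is a minimal sketch, assuming independent failures per call; the function names are illustrative and not part of any library.

```javascript
// Probability that a session of `calls` independent LLM calls hits at least
// one parse failure, given a per-call failure rate.
function sessionFailureProbability(perCallErrorRate, calls) {
  return 1 - Math.pow(1 - perCallErrorRate, calls);
}

// Daily retry spend for a given number of retried calls and an average
// cost per retried call (both supplied by the caller).
function retryCostPerDay(retriedCalls, costPerRetryUsd) {
  return retriedCalls * costPerRetryUsd;
}

const pFail = sessionFailureProbability(0.05, 10);
console.log(pFail.toFixed(2)); // "0.40"

const perDay = retryCostPerDay(5000, 0.05);
console.log(perDay.toFixed(2)); // "250.00"
console.log((perDay * 365).toFixed(0)); // "91250"
```

The retry count is passed in rather than derived, because it depends on whether failures are counted per session or per call.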
The verbosity premium on unstructured JSON. When you ask a model to “respond with JSON” without structured output mode, the model typically produces: a brief preamble (10–30 tokens), the JSON block (the data you actually need), and sometimes a brief suffix explaining the output (10–30 tokens). In structured output mode, the model produces only the JSON — no preamble, no suffix, no markdown code fences, no explanatory text. For a 200-token JSON payload, the prose overhead is 20–60 tokens per call, or 10–30% additional output token cost for the same information.
Schema validation failures vs parse failures. Even valid JSON can fail if it doesn’t match the expected schema: a string field is null, a required field is omitted, an enum value is outside the allowed set. Without constrained decoding, schema validation failures require retries just like parse failures. Constrained decoding prevents both types of failures simultaneously because the model can only generate tokens that produce valid JSON conforming to the supplied schema.
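To make the distinction concrete, here is a small sketch of the check an agent has to run on unconstrained output. The expected shape (an intent enum plus a numeric confidence) and the function name are assumptions chosen for illustration.

```javascript
// Hypothetical expected shape:
// { intent: 'billing' | 'technical' | 'other', confidence: number }
const ALLOWED_INTENTS = ['billing', 'technical', 'other'];

// Returns { ok: true, value } or { ok: false, failure: 'parse' | 'schema', detail }.
function checkModelOutput(rawText) {
  let value;
  try {
    value = JSON.parse(rawText);
  } catch (err) {
    // Not JSON at all, e.g. prose preamble or markdown fences around the payload.
    return { ok: false, failure: 'parse', detail: err.message };
  }
  if (value === null || typeof value !== 'object' || Array.isArray(value)) {
    return { ok: false, failure: 'schema', detail: 'expected a JSON object' };
  }
  if (!ALLOWED_INTENTS.includes(value.intent)) {
    return { ok: false, failure: 'schema', detail: 'intent missing or outside the enum' };
  }
  if (typeof value.confidence !== 'number') {
    return { ok: false, failure: 'schema', detail: 'confidence missing or not a number' };
  }
  return { ok: true, value };
}

console.log(checkModelOutput('Sure! {"intent":"billing"}').failure); // "parse"
console.log(checkModelOutput('{"intent":"refund","confidence":0.9}').failure); // "schema"
```

Either failure kind leads to the same retry, which is why removing both at generation time matters for cost.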

Provider implementation: OpenAI structured outputs

How to enable. In the OpenAI Chat Completions API, set response_format: { type: "json_schema", json_schema: { name: "my_schema", strict: true, schema: { ... } } }. With strict: true, the model is constrained to generate only tokens that are valid under the schema, so the returned content parses as JSON and conforms to the schema. The two exceptions are a refusal, which is reported separately from the content, and a response cut off by the token limit, so check for both before parsing.
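As a rough illustration, the request body for a strict schema might be assembled like this. The schema, the schema name, the system prompt, and the model name are assumptions for the example. The sketch builds only the payload, so it runs without network access or an API key.

```javascript
// Hypothetical classification schema. Strict mode expects every property to be
// listed in `required` and `additionalProperties` to be false.
const intentSchema = {
  type: 'object',
  properties: {
    intent: { type: 'string', enum: ['billing', 'technical', 'other'] },
    confidence: { type: 'number' }
  },
  required: ['intent', 'confidence'],
  additionalProperties: false
};

// Builds the Chat Completions request body for one classification call.
function buildClassificationRequest(userText) {
  return {
    model: 'gpt-4o-mini', // assumption: any model that supports structured outputs
    messages: [
      { role: 'system', content: 'Classify the intent of the user message.' },
      { role: 'user', content: userText }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'intent_classification', strict: true, schema: intentSchema }
    }
  };
}

// With the official `openai` package the body would be sent roughly as:
//   const completion = await client.chat.completions.create(buildClassificationRequest(text));
//   const result = JSON.parse(completion.choices[0].message.content);
console.log(JSON.stringify(buildClassificationRequest('I was charged twice.'), null, 2));
```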
Supported JSON Schema features. OpenAI’s structured outputs support a subset of JSON Schema: object types, array types with typed items, string with enum, number, boolean, null, nested objects and arrays, anyOf for discriminated unions (though not at the root of the schema), and recursive schemas via $ref. Strict mode adds two requirements: every object must set additionalProperties: false, and every property must be listed in required. To model an optional field, give it a union type that includes null.
Schema definition overhead. The schema definition itself adds tokens to every input. A minimal schema for a classification result with a confidence score is approximately 80–120 tokens; large schemas (10+ fields) can exceed 500 tokens. This is an input-side cost paid on every call, and it is offset by the 30–60% output reduction on the same call. Because input tokens are usually priced well below output tokens, compare the two in dollars rather than in raw token counts. The break-even point, where output savings start to outweigh schema overhead, is typically at 4–6 fields. Above that, structured outputs are a net saving from the first call. For very small schemas (≤2 fields), the overhead slightly exceeds the output savings, but eliminating parse errors still justifies structured outputs.
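The trade-off between schema overhead and output savings can be sketched as a per-call calculation. The prices in the usage lines are placeholders, not quotes for any specific model, and retries are left out so the function isolates the token trade-off.

```javascript
// Net cost change per call, in dollars, from adding a schema.
// Prices are per million tokens. A negative result means the call got cheaper.
function netCostPerCall({ schemaInputTokens, outputTokensSaved, inputPricePerMTok, outputPricePerMTok }) {
  const added = (schemaInputTokens * inputPricePerMTok) / 1e6;
  const saved = (outputTokensSaved * outputPricePerMTok) / 1e6;
  return added - saved;
}

// Output tokens that must be saved per call for the schema to pay for itself.
function breakEvenOutputTokens(schemaInputTokens, inputPricePerMTok, outputPricePerMTok) {
  return (schemaInputTokens * inputPricePerMTok) / outputPricePerMTok;
}

// Placeholder prices: $3 per million input tokens, $15 per million output tokens.
console.log(breakEvenOutputTokens(100, 3, 15)); // 20
console.log(netCostPerCall({
  schemaInputTokens: 100,
  outputTokensSaved: 40,
  inputPricePerMTok: 3,
  outputPricePerMTok: 15
}) < 0); // true
```

With a 5:1 output-to-input price ratio, a 100-token schema breaks even once it removes 20 output tokens, which is why counting tokens alone overstates the overhead.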
Latency impact. OpenAI reports that schema-constrained generation adds a small amount to time-to-first-token (the constrained decoding inference overhead). In practice, this latency overhead is 5–15% and is offset by not needing retry calls. For latency-sensitive paths, consider whether the task needs strict structured output or whether a well-crafted prompt with json_object mode is sufficient. That mode returns syntactically valid JSON but does not enforce a schema, so schema validation stays in your code.

Provider implementation: Anthropic tool-use forced response

How to enable. Anthropic does not have a direct json_schema response format parameter. Instead, you achieve structured output by defining a tool with the desired output schema and setting tool_choice: { type: "tool", name: "my_tool" } to force the model to call that tool. The structured result is the input field of the tool_use content block in the response, generated to follow the tool’s input_schema.

Implementation pattern.

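// Top-level await requires an ES module (.mjs or "type": "module").
import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();

// Hypothetical input for illustration; replace with your own conversation turns.
const userMessage = 'I was charged twice this month.';

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 256,
  tools: [{
    name: 'classify_intent',
    description: 'Classify user intent',
    input_schema: {
      type: 'object',
      properties: {
        intent: { type: 'string', enum: ['billing', 'technical', 'other'] },
        confidence: { type: 'number' }
      },
      required: ['intent', 'confidence']
    }
  }],
  // Force the model to call classify_intent instead of replying in prose.
  tool_choice: { type: 'tool', name: 'classify_intent' },
  messages: [{ role: 'user', content: userMessage }]
});

// Locate the tool_use block rather than assuming it is first; its input is
// returned by the SDK as an already-parsed object, so no JSON.parse is needed.
const toolUse = response.content.find((block) => block.type === 'tool_use');
const result = toolUse.input;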

Token savings with Anthropic tool-use forcing. Measured on a set of classification tasks: without tool-use forcing, Claude Sonnet generated an average of 87 output tokens per classification response (including preamble and explanation). With tool-use forcing, average output dropped to 31 tokens, a 64% reduction. The tool definition added 85 input tokens per call. At Sonnet pricing ($3 per million input tokens, $15 per million output tokens), and leaving aside the prompt tokens common to both variants, the per-call cost changes as follows: without forcing = 87 × $15/1M = $0.00131 output; with forcing = 85 × $3/1M + 31 × $15/1M = $0.000255 + $0.000465 = $0.000720. That is a 45% cost reduction on every call. The tool definition is a constant prefix, so it can also be served from the prompt cache when it is part of a prefix long enough to qualify, which lowers the input-side cost further.

Schema design for cost-efficient structured outputs

Field name length matters. Field names are tokenized and billed like any other output. {"classification_confidence_score": 0.82} costs roughly 9 tokens for the key; {"conf": 0.82} costs roughly 3, with exact counts depending on the tokenizer. For schemas that appear in thousands of API calls per day, field name compression has a meaningful aggregate effect. Define internal schemas with short field names; expand to verbose names at the application layer when presenting results to humans.
Enums vs free-text strings. Enum fields are more token-efficient than free-text strings because the model selects from a finite set rather than generating arbitrary text. A five-option enum with short labels typically produces a value of about 2 tokens; a free-text string field could generate 10–50 tokens. Use enum whenever the domain is closed (status codes, intent classes, severity levels, boolean-like values, category labels).
Flat schemas vs nested schemas. Nested schemas with many levels of nesting add structural tokens ({, }, comma, array brackets) that can exceed 20% of total output tokens for deeply-nested outputs. Flatten schemas where possible: instead of {"user": {"profile": {"name": "..."}}}, use {"user_name": "..."}. Three levels of nesting vs one level adds 6 structural tokens per nested object plus indentation in pretty-printed mode. Request compact JSON (no whitespace) by specifying it in the prompt: “Respond with compact JSON, no whitespace.”
Optional fields and token overhead. Optional fields that the model frequently omits still have overhead: the model must “decide” whether to include them on each call, and when included they add field name tokens even if the value is null. Define schemas with only the fields you will use. If a field is null 90% of the time, move it to a secondary call or remove it from the schema entirely.
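The short-names-on-the-wire, verbose-names-in-the-app idea from the field name guidance can be sketched as a small mapping step. The key names and the helper are hypothetical.

```javascript
// Hypothetical mapping from compact wire keys to descriptive application keys.
const FIELD_NAMES = new Map([
  ['int', 'intent'],
  ['conf', 'classification_confidence_score'],
  ['sev', 'severity']
]);

// Expands compact keys after the model response has been received.
function expandKeys(compact) {
  const expanded = {};
  for (const [key, value] of Object.entries(compact)) {
    // Keys without a mapping pass through unchanged rather than being dropped.
    expanded[FIELD_NAMES.get(key) ?? key] = value;
  }
  return expanded;
}

console.log(expandKeys({ int: 'billing', conf: 0.82 }));
// { intent: 'billing', classification_confidence_score: 0.82 }
```

Keeping the mapping in one place means the schema sent to the model stays terse while logs and dashboards remain readable.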

Structured outputs and the retry-loop risk

The paradox: structured outputs reduce retries but can still cause loops. Constrained decoding eliminates parse errors, but it does not eliminate logic errors. An agent that receives a valid, schema-conforming response that contains semantically incorrect values (the wrong classification, an out-of-range confidence score, a hallucinated entity name) still needs to handle the error — potentially by retrying with a corrected prompt. If the model repeatedly produces semantically incorrect structured outputs, the agent enters a retry loop that constrained decoding alone cannot prevent.
RunGuard LoopDetector for retry-loop prevention. RunGuard’s LoopDetector detects when an agent is calling the same tool with the same signature repeatedly — the hallmark of a structured-output correction loop. When the detector trips (3 or more repeated calls with the same tool+input pattern), it throws LoopDetectedError and halts the loop. This prevents the scenario where a structured-output retry loop runs for 50+ iterations before being noticed, consuming 50x the expected token budget on what appears to be a single agent task.
BudgetTracker for cumulative session protection. Even a short retry loop (5–10 iterations) can push a session significantly over budget. BudgetTracker provides the aggregate cap: no matter how many retries occur, the session halts when cumulative spend crosses the configured cap. For structured-output agents, set the per-session cap to 3–5x your expected cost to allow for legitimate retries while catching runaway correction loops before they become costly. See AI agent retry storm prevention for the full pattern.
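The two safeguards described above can be illustrated with a generic sketch. This is not RunGuard’s API or implementation: the class names, error types, and thresholds below are hypothetical and only show the underlying idea.

```javascript
class RepeatedCallError extends Error {}
class BudgetExceededError extends Error {}

// Counts identical (tool, input) pairs and throws once a pair repeats too often.
class SimpleLoopDetector {
  constructor(maxRepeats = 3) {
    this.maxRepeats = maxRepeats;
    this.counts = new Map();
  }

  record(toolName, input) {
    // JSON.stringify is sensitive to key order, which is adequate for a sketch.
    const signature = toolName + ':' + JSON.stringify(input);
    const seen = (this.counts.get(signature) || 0) + 1;
    this.counts.set(signature, seen);
    if (seen >= this.maxRepeats) {
      throw new RepeatedCallError(`${toolName} called ${seen} times with identical input`);
    }
  }
}

// Accumulates spend for one session and throws when the cap is crossed.
class SimpleBudgetCap {
  constructor(capUsd) {
    this.capUsd = capUsd;
    this.spentUsd = 0;
  }

  add(costUsd) {
    this.spentUsd += costUsd;
    if (this.spentUsd > this.capUsd) {
      throw new BudgetExceededError(`spent $${this.spentUsd.toFixed(4)} against a $${this.capUsd} cap`);
    }
  }
}
```

In an agent loop, `record` would run before each tool call and `add` after each model response, so either guard can stop a correction loop regardless of why the outputs keep being rejected.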

Full cost impact calculation: structured outputs in a production agent

Example scenario. An agent makes 8 structured-output calls per session across 50,000 sessions/day on Claude Sonnet 4.6. Without structured outputs: an average of 90 output tokens/call and a 5% parse error rate (8 × 0.05 = 0.4 retries per session on average), at $15/MTok output. With structured outputs: an average of 35 output tokens/call, no parse errors, same model. Input side: structured outputs add 100 tokens of tool definition per call, billed at $0.30/MTok when read from cache (10% of the $3/MTok base input rate, equivalent to 10 full-price tokens per call).
Cost without structured outputs: 8 calls × 90 output tokens = 720 output tokens/session; 0.4 retries × 90 output tokens = 36 additional output tokens/session; total 756 output tokens/session. At $15/MTok: $0.01134/session. At 50,000 sessions/day: $567/day. This excludes the input context re-sent on each retry, which the scenario does not size, so the true figure is higher.
Cost with structured outputs: 8 calls × 35 output tokens = 280 output tokens/session at $15/MTok = $0.0042/session; 8 calls × 100 cached tool-definition tokens = 800 input tokens/session at $0.30/MTok = $0.00024/session; 0 retries. Total: $0.00444/session. At 50,000 sessions/day: $222/day.
Savings: about $345/day (roughly $126k/year) from structured outputs alone, without changing models, infrastructure, or user-facing behavior. Implementing structured outputs takes a few hours of engineering time, so the payback period is measured in days at production scale.
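For readers who want to plug in their own numbers, the scenario can be recomputed from its stated inputs with a short function. This sketch makes three assumptions explicit: each failed call is retried exactly once, the base prompt tokens common to both variants are left out, and the 100-token tool definition is billed in full at the cached input rate. The function and parameter names are illustrative.

```javascript
// Daily cost in dollars for the output tokens plus any extra per-call input
// tokens (such as a tool definition). Prices are per million tokens.
function dailyCost({
  sessionsPerDay,
  callsPerSession,
  outputTokensPerCall,
  parseErrorRate,
  extraInputTokensPerCall,
  outputPricePerMTok,
  extraInputPricePerMTok
}) {
  // Each failed call is retried once, so expected calls include the retries.
  const expectedCalls = callsPerSession * (1 + parseErrorRate);
  const outputCost = (expectedCalls * outputTokensPerCall * outputPricePerMTok) / 1e6;
  const inputCost = (expectedCalls * extraInputTokensPerCall * extraInputPricePerMTok) / 1e6;
  return sessionsPerDay * (outputCost + inputCost);
}

const unstructured = dailyCost({
  sessionsPerDay: 50000, callsPerSession: 8, outputTokensPerCall: 90,
  parseErrorRate: 0.05, extraInputTokensPerCall: 0,
  outputPricePerMTok: 15, extraInputPricePerMTok: 0.30
});
const structured = dailyCost({
  sessionsPerDay: 50000, callsPerSession: 8, outputTokensPerCall: 35,
  parseErrorRate: 0, extraInputTokensPerCall: 100,
  outputPricePerMTok: 15, extraInputPricePerMTok: 0.30
});

console.log(unstructured.toFixed(2), structured.toFixed(2), (unstructured - structured).toFixed(2));
```

Adding the re-sent input context for retries as another parameter would raise the unstructured figure and widen the gap.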

Add a circuit breaker to your structured-output agent

Structured outputs prevent parse errors. RunGuard prevents the retry loops and budget overruns that occur when schema-conforming responses are semantically wrong. Both are necessary in production.
