LLM structured output cost impact: JSON schema mode reduces tokens and eliminates retries

When an AI agent needs a structured result — a classification decision, an extracted entity, a routing directive, a function call argument — the naive approach is to ask the LLM in natural language and parse the response with a regex or JSON.parse. This works most of the time. The exceptions are expensive: a parse failure triggers a retry, each retry costs full input + output tokens, and in agents with multiple structured steps the retry rate compounds across the pipeline. LLM structured output modes — OpenAI’s response_format: json_schema, Anthropic’s tool-use forced response, Google’s responseMimeType: application/json with a schema — use constrained decoding to guarantee valid JSON that conforms to your schema on the first call, eliminating parse errors entirely while simultaneously reducing output token count by 30–60% because the model no longer generates the prose preamble and explanation that surrounds unstructured JSON. This page quantifies the cost impact, walks through provider-specific implementation, and covers how RunGuard fits into a structured-output agent architecture.

The cost structure of unstructured JSON: retries and verbosity

Provider implementation: OpenAI structured outputs

Provider implementation: Anthropic tool-use forced response

Schema design for cost-efficient structured outputs

Structured outputs and the retry-loop risk

Full cost impact calculation: structured outputs in a production agent

Add a circuit breaker to your structured-output agent

Structured outputs prevent parse errors. RunGuard prevents the retry loops and budget overruns that occur when schema-conforming responses are semantically wrong. Both are necessary in production.

Start free trial →