LangChain Structured Output Cost Control: Stopping with_structured_output Retry Loops
LangChain's with_structured_output() method is one of the most convenient abstractions in the Python AI ecosystem. Pass it a Pydantic model and get back a typed object, with no manual JSON parsing and no format instructions. What the documentation doesn't highlight is that up to four distinct retry loops tend to accumulate around that call, and each one silently multiplies your LLM call count until an exception escapes or a limit is hit.
This is structurally different from the retry and memory failure modes covered in the LangChain LCEL cost control guide. LCEL-level .with_retry() loops operate on the transport layer — network errors and rate limits. The failure modes here operate on the semantic layer: the model produced a response, it passed transmission, and the structured output layer rejected it for format or validation reasons and silently asked again.
Four failure modes to understand:
- Pydantic validation retry loop: a retry wrapper around with_structured_output() turns each failed parse into a new LLM call, and a blind retry resends the identical prompt.
- OutputParserException cascade via RetryWithErrorOutputParser: the "fix my JSON" prompt pattern compounds failures because the error-correction prompt biases the model toward the same invalid form.
- bind_tools() agent executor spiral: AgentExecutor's default max_iterations=15 means 15 full LLM round-trips before it stops, each potentially making a tool call.
- Custom @validator re-call: Pydantic @model_validator and @field_validator raising ValueError look like parse failures to with_structured_output(). The model can't satisfy a validator it doesn't know exists, so every retry fails identically.
Key insight: with_structured_output(method="function_calling") and method="json_mode" differ in where failures occur. Function-calling failures are caught by the tool call parser; JSON mode failures are caught by the JSON parser. Both are opaque to the caller — you see OutputParserException in both cases, but the retry behavior and cost per retry differ.
1. Pydantic validation retry loop
When you call llm.with_structured_output(MyModel), LangChain creates a chain that: (1) attaches your schema to the request, as a tool definition for function calling or as a JSON response format for JSON mode, (2) calls the LLM, (3) parses the response into your Pydantic model. Step 3 raises OutputParserException if the response doesn't parse. By default, that exception propagates to the caller with no retry.
The dangerous pattern is wrapping with_structured_output inside .with_retry():
```python
# DANGEROUS: up to 10 blind retries on any parse failure
from typing import List

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, HttpUrl


class ResearchReport(BaseModel):
    title: str
    sources: List[HttpUrl]  # expects full URLs like "https://..."
    summary: str
    confidence: float  # expects 0.0–1.0


llm = ChatOpenAI(model="gpt-4o-mini")
structured_llm = llm.with_structured_output(ResearchReport)

# This chain retries the ENTIRE LLM call on any exception, parse failures included
chain = structured_llm.with_retry(
    retry_if_exception_type=(Exception,),
    stop_after_attempt=10,  # up to 10 × LLM call cost per invocation
)
result = chain.invoke("Summarize recent AI safety research")
```
The failure mode: if the model returns "sources": ["arxiv.org/abs/2401.0001"] instead of "sources": ["https://arxiv.org/abs/2401.0001"], Pydantic's HttpUrl validator raises a ValidationError wrapped as OutputParserException. The .with_retry() layer catches it and sends the identical prompt again. Without a format-correcting retry prompt, the model repeats the same schema violation. Ten retries = ten LLM calls, all failing at validation.
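To see why a blind retry cannot recover, the sketch below swaps the LLM for a deterministic stub and uses a plain loop in place of .with_retry(). The names fake_llm, validate, and blind_retry are hypothetical stand-ins, not LangChain APIs, and the sketch assumes the model returns the same output for the same prompt, as is typical at temperature 0.

```python
from urllib.parse import urlparse


def fake_llm(prompt: str) -> dict:
    """Hypothetical stand-in for an LLM: same prompt in, same output out."""
    return {"title": "AI safety", "sources": ["arxiv.org/abs/2401.0001"]}


def validate(report: dict) -> None:
    """Rough stand-in for HttpUrl validation: every source needs a scheme."""
    for src in report["sources"]:
        if urlparse(src).scheme not in ("http", "https"):
            raise ValueError(f"invalid URL (missing scheme): {src!r}")


def blind_retry(prompt: str, attempts: int) -> tuple[int, int]:
    """Resend the identical prompt up to `attempts` times; return (calls, failures)."""
    calls = failures = 0
    for _ in range(attempts):
        calls += 1
        try:
            validate(fake_llm(prompt))
            break  # success: stop retrying
        except ValueError:
            failures += 1
    return calls, failures


calls, failures = blind_retry("Summarize recent AI safety research", attempts=10)
print(calls, failures)  # 10 10
```

Every attempt is billed and every attempt fails, because nothing in the prompt changed between calls.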
The fix is a bounded StructuredOutputGuard that intercepts failures after a configurable attempt count and optionally injects the error into the next attempt's prompt:
```python
from typing import List, Optional, Type, TypeVar

from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, HttpUrl

T = TypeVar("T", bound=BaseModel)


class StructuredOutputGuard:
    """
    Bounded structured output with transparent error-context injection.
    Caps the number of LLM calls made per invocation on validation failure.
    """

    def __init__(
        self,
        llm,
        schema: Type[T],
        max_attempts: int = 3,
        error_feedback: bool = True,
    ):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.llm = llm
        self.schema = schema
        self.max_attempts = max_attempts
        self.error_feedback = error_feedback
        self.parser = PydanticOutputParser(pydantic_object=schema)
        self._calls = 0
        self._failures = 0

    def invoke(self, user_input: str) -> T:
        format_instructions = self.parser.get_format_instructions()
        last_error: Optional[str] = None

        for attempt in range(1, self.max_attempts + 1):
            self._calls += 1

            # Inject previous error into prompt on retry
            if last_error and self.error_feedback:
                prompt_text = (
                    f"{user_input}\n\n"
                    f"Your previous response failed validation with this error:\n"
                    f"{last_error}\n\n"
                    f"Correct the error and respond again.\n\n"
                    f"{format_instructions}"
                )
            else:
                prompt_text = f"{user_input}\n\n{format_instructions}"

            messages = [{"role": "user", "content": prompt_text}]
            # Transport errors (timeouts, rate limits) are deliberately not caught:
            # they are not validation failures and should not be fed back to the model.
            response = self.llm.invoke(messages)
            try:
                return self.parser.parse(response.content)
            except OutputParserException as exc:
                self._failures += 1
                last_error = str(exc)[:500]  # truncate long Pydantic traces
                if attempt == self.max_attempts:
                    raise RuntimeError(
                        f"StructuredOutputGuard: {self.max_attempts} attempts "
                        f"exhausted for {self.schema.__name__}. "
                        f"Last error: {last_error}"
                    ) from exc

    def summary(self) -> dict:
        return {
            "total_calls": self._calls,
            "total_failures": self._failures,
            "schema": self.schema.__name__,
            "max_attempts": self.max_attempts,
        }


# Usage
class ResearchReport(BaseModel):
    title: str
    sources: List[HttpUrl]
    summary: str
    confidence: float


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
guard = StructuredOutputGuard(llm, ResearchReport, max_attempts=3, error_feedback=True)
report = guard.invoke("Summarize recent AI safety research with source URLs.")
print(guard.summary())
# When the first attempt validates:
# {'total_calls': 1, 'total_failures': 0, 'schema': 'ResearchReport', 'max_attempts': 3}
```
Three attempts is the right default for structured output validation failures. A schema error that isn't fixed after one corrective-feedback retry is unlikely to be fixed by a third attempt without fundamentally changing the schema or the prompt. If you're seeing more than one retry per request in production, your schema or format instructions need revision — not more retries.
2. OutputParserException cascade via RetryWithErrorOutputParser
LangChain ships RetryWithErrorOutputParser as a convenience wrapper that catches OutputParserException and re-invokes the LLM with an error-correction prompt. The design is sound in theory — give the model the parse error and ask it to fix its response. In practice, it produces two compounding failure modes.
The first failure mode is error anchoring. The error-correction prompt includes the model's previous response verbatim alongside the parse error. When the model sees its previous response plus an error message, it tends to make minimal edits to the existing response rather than regenerating from scratch. If the fundamental schema misunderstanding is in the structure (e.g., the model insists on returning a string where an array is required), minimal edits produce a response with the same structure — which fails validation again.
The second failure mode is retry amplification with nested chains. If a chain that uses RetryWithErrorOutputParser is itself wrapped in .with_retry(), each parse failure first triggers the parser's own re-calls (one LLM call each). When those are exhausted, the OutputParserException reaches the outer .with_retry(), which reruns the whole chain, inner retries included. A single request can then cost up to outer_attempts × (1 + inner_retries) LLM calls.
```python
from typing import List

from langchain.output_parsers import RetryWithErrorOutputParser
from langchain_core.output_parsers import PydanticOutputParser, StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel
from langchain_openai import ChatOpenAI
from pydantic import BaseModel


class AnalysisResult(BaseModel):
    findings: List[str]
    score: int  # expects integer 1–10
    recommendation: str


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
parser = PydanticOutputParser(pydantic_object=AnalysisResult)

# RetryWithErrorOutputParser itself is fine in isolation
retry_parser = RetryWithErrorOutputParser.from_llm(
    parser=parser,
    llm=llm,
    max_retries=2,  # set explicitly so the cap is visible in code
)

prompt = PromptTemplate(
    template="Analyze the following text and respond in the requested format.\n{format_instructions}\n\nText: {text}",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# The retry parser needs the original prompt as well as the completion, so it is
# called through parse_with_prompt(); its plain parse() raises NotImplementedError.
flat_chain = RunnableParallel(
    completion=prompt | llm | StrOutputParser(),
    prompt_value=prompt,
) | RunnableLambda(lambda x: retry_parser.parse_with_prompt(**x))

# BAD: an outer .with_retry() reruns the whole chain, inner retries included.
# With stop_after_attempt=2 that is up to 2 × (1 + 2) = 6 LLM calls per failing request.
# chain = flat_chain.with_retry(stop_after_attempt=2)

# GOOD: use flat_chain as is, with RetryWithErrorOutputParser as the only retry layer
```
The practical rule: use either RetryWithErrorOutputParser or .with_retry() on the structured output layer, never both. RetryWithErrorOutputParser.from_llm(..., max_retries=2) is the safer choice because it injects error context; raw .with_retry() sends the same prompt blindly. Set max_retries explicitly rather than relying on the default, so the cap stays visible in code and does not change when you upgrade LangChain.
Worst-case cost comparison for a schema that fails validation on 30% of requests, assuming every failing request exhausts its retries:
| Configuration | LLM calls per 100 requests | Cost multiplier |
|---|---|---|
| No retry | 100 | 1.0× |
| RetryWithErrorOutputParser(max_retries=2) | 160 (100 + 30 requests × 2 retries) | 1.6× |
| .with_retry(stop_after_attempt=3) only | 160 (100 + 30 × 2 retries; 3 attempts means 2 retries) | 1.6× |
| RetryWithErrorOutputParser(max_retries=2) + .with_retry(stop_after_attempt=3) | 340 (70 + 30 × 3 attempts × 3 calls) | 3.4× |
| No cap (retry loop without a limit) | ∞ (until an exception escapes or a rate limit is hit) | unbounded |
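The worst-case arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions: worst_case_calls is a hypothetical helper, every failing request is assumed to exhaust every retry layer, and stop_after_attempt=3 is counted as three attempts in total (two retries).

```python
def worst_case_calls(
    requests: int,
    failure_rate: float,
    inner_retries: int = 0,
    outer_attempts: int = 1,
) -> int:
    """Total LLM calls if every failing request exhausts every retry layer."""
    failing = round(requests * failure_rate)
    passing = requests - failing
    # Each outer attempt costs one original call plus the inner retries
    calls_per_failing = outer_attempts * (1 + inner_retries)
    return passing + failing * calls_per_failing


print(worst_case_calls(100, 0.30))                                     # 100
print(worst_case_calls(100, 0.30, inner_retries=2))                    # 160
print(worst_case_calls(100, 0.30, outer_attempts=3))                   # 160
print(worst_case_calls(100, 0.30, inner_retries=2, outer_attempts=3))  # 340
```

The nested case grows multiplicatively: each extra outer attempt repeats the full inner retry budget.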
3. bind_tools() agent executor spiral
llm.bind_tools(tools) tells the model it can request tool invocations. AgentExecutor runs the model-tool loop until the model produces a final answer without requesting a tool. The default max_iterations=15 means the loop runs up to 15 times; when the limit is hit, the executor does not raise but returns a placeholder answer saying the agent stopped due to the iteration or time limit. Each iteration makes one full LLM call, and if the model requests tools, those tools run and their results feed the next iteration's LLM call.
The failure mode is tool call spiraling: the model calls a tool, gets a result, calls the same tool again with slightly different parameters, gets a similar result, and so on. This happens most frequently with search tools (the model never decides the search results are "good enough") and with tools that return errors (the model keeps retrying the tool call as if more iterations will fix the underlying error).
```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI


@tool
def web_search(query: str) -> str:
    """Search the web for information."""
    # Your search implementation here
    return f"Search results for: {query}"


@tool
def fetch_url(url: str) -> str:
    """Fetch content from a URL."""
    return f"Content from: {url}"


class SpiralDetected(RuntimeError):
    """Raised when the same tool is requested spiral_window times in a row."""


class BoundedAgentExecutor:
    """
    AgentExecutor wrapper with hard cost ceiling and spiral detection.
    Both checks run inside a callback, so the loop is stopped while it is
    still running rather than inspected after the calls have been paid for.
    """

    def __init__(
        self,
        llm,
        tools: list,
        max_iterations: int = 5,  # far below AgentExecutor default of 15
        max_llm_calls: int = 8,   # ceiling per invoke(), counts LLM calls made inside tools too
        spiral_window: int = 3,   # consecutive same-tool calls = spiral
        verbose: bool = False,
    ):
        self.max_iterations = max_iterations
        self.max_llm_calls = max_llm_calls
        self.spiral_window = spiral_window
        self.verbose = verbose

        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are a helpful research assistant. Use the available tools to answer questions. "
                       "Stop searching once you have enough information to answer confidently; "
                       "do not keep searching for additional confirmation."),
            ("human", "{input}"),
            ("placeholder", "{agent_scratchpad}"),
        ])
        agent = create_tool_calling_agent(llm, tools, prompt)
        self._executor = AgentExecutor(
            agent=agent,
            tools=tools,
            max_iterations=max_iterations,
            return_intermediate_steps=True,
            verbose=verbose,
        )
        self._llm_calls = 0
        self._tool_calls = []
        self._spirals = 0

    def _check_spiral(self, tool_name: str) -> bool:
        self._tool_calls.append(tool_name)
        if len(self._tool_calls) >= self.spiral_window:
            recent = self._tool_calls[-self.spiral_window:]
            if len(set(recent)) == 1:
                self._spirals += 1
                return True
        return False

    def invoke(self, input_text: str) -> dict:
        # Budgets are per invocation, so reset the counters first
        self._llm_calls = 0
        self._tool_calls = []
        try:
            result = self._executor.invoke(
                {"input": input_text},
                config={"callbacks": [self._build_callback()]},
            )
        except SpiralDetected as exc:
            return {
                "output": f"[Stopped: {exc}]",
                "spiral_detected": True,
                "stats": self.summary(),
            }
        return {**result, "stats": self.summary()}

    def _build_callback(self):
        guard = self

        class BudgetCallback(BaseCallbackHandler):
            # Without this flag LangChain logs handler exceptions and keeps running
            raise_error = True

            def on_llm_start(self, *args, **kwargs):
                guard._llm_calls += 1
                if guard._llm_calls > guard.max_llm_calls:
                    raise RuntimeError(
                        f"BoundedAgentExecutor: max_llm_calls={guard.max_llm_calls} exceeded. "
                        f"Agent stopped to prevent cost overrun."
                    )

            def on_agent_action(self, action, **kwargs):
                # Fires before the tool runs, so the repeated call is never executed
                if guard._check_spiral(action.tool):
                    raise SpiralDetected(
                        f"repeated {action.tool} calls detected after "
                        f"{guard.spiral_window} consecutive invocations"
                    )

        return BudgetCallback()

    def summary(self) -> dict:
        return {
            "llm_calls": self._llm_calls,
            "tool_calls": len(self._tool_calls),
            "spirals": self._spirals,
            "max_iterations": self.max_iterations,
        }


# Usage
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
tools = [web_search, fetch_url]
executor = BoundedAgentExecutor(
    llm=llm,
    tools=tools,
    max_iterations=5,  # 5 agent steps maximum
    max_llm_calls=8,   # hard ceiling regardless of step count
    spiral_window=3,   # stop if same tool called 3× in a row
)
result = executor.invoke("What are the latest developments in quantum computing?")
print(result["stats"])
# Example shape (actual counts vary by run):
# {'llm_calls': 3, 'tool_calls': 2, 'spirals': 0, 'max_iterations': 5}
```
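The spiral rule used above (the same tool requested spiral_window times in a row) can be exercised without an LLM. The function below is a hypothetical standalone version of that check, operating on a plain list of tool names.

```python
from typing import List, Optional


def find_spiral(tool_names: List[str], window: int = 3) -> Optional[int]:
    """Return the index where `window` consecutive identical calls complete, else None."""
    run = 0
    for i, name in enumerate(tool_names):
        # Extend the current run if this call repeats the previous one
        run = run + 1 if i > 0 and name == tool_names[i - 1] else 1
        if run >= window:
            return i
    return None


print(find_spiral(["web_search", "fetch_url", "web_search"]))                 # None
print(find_spiral(["web_search", "web_search", "web_search", "fetch_url"]))  # 2
```

This rule only catches runs of a single tool. An agent that alternates between two tools (search, fetch, search, fetch) passes the check, which is why the LLM call ceiling is still needed as a backstop.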
The key insight with AgentExecutor cost control is that max_iterations alone is insufficient: you need both a step ceiling and a call ceiling. An agent that requests three tools per iteration at max_iterations=15 makes up to 15 LLM calls and 45 tool invocations, which is 60 billable operations if the tools themselves are metered. Set max_iterations between 4 and 6 for most research agents; reserve 10+ only for complex multi-step workflows where you understand the expected call depth.
4. Custom @validator infinite re-call
Pydantic @field_validator and @model_validator functions give you arbitrary Python logic at parse time. This is powerful but creates a subtle trap with with_structured_output(): to the retry layer, a validator that raises ValueError looks identical to a schema violation. The model has no visibility into your validator logic unless the same constraint appears in the schema or the prompt, so it cannot change its output to satisfy a validator it doesn't know exists.
```python
from typing import List

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, field_validator


# PROBLEM: custom validators that the model cannot satisfy
class ContentAnalysis(BaseModel):
    topics: List[str]
    sentiment: str
    keywords: List[str]

    @field_validator("sentiment")
    @classmethod
    def sentiment_must_be_valid(cls, v: str) -> str:
        allowed = {"positive", "negative", "neutral", "mixed"}
        if v.lower() not in allowed:
            # Model returns "somewhat positive" or "largely negative"
            # These fail validation, and any retry layer repeats the same failure
            raise ValueError(f"sentiment must be one of {allowed}, got: {v!r}")
        return v.lower()

    @field_validator("keywords")
    @classmethod
    def keywords_must_be_lowercase(cls, v: List[str]) -> List[str]:
        for kw in v:
            if kw != kw.lower():
                raise ValueError(f"All keywords must be lowercase, got: {kw!r}")
        return v


# Fails whenever the model returns "mostly positive" or a capitalized keyword.
# Under a retry wrapper, every attempt is billed and fails the same way.
llm = ChatOpenAI(model="gpt-4o-mini")
structured = llm.with_structured_output(ContentAnalysis)
# result = structured.invoke("Analyze this text: ...")  # raises if the model outputs free-form sentiment
```
There are two fixes. The first is schema-side normalization — move validation logic into a post-parse normalizer so the model never fails, you just clean the output:
```python
from typing import List

from pydantic import BaseModel, field_validator


class ContentAnalysis(BaseModel):
    topics: List[str]
    sentiment: str
    keywords: List[str]

    @field_validator("sentiment", mode="before")
    @classmethod
    def normalize_sentiment(cls, v: str) -> str:
        """Normalize free-form sentiment to canonical values; never raises."""
        v_lower = str(v).lower().strip()
        if any(w in v_lower for w in ("positive", "good", "great", "excellent")):
            return "positive"
        if any(w in v_lower for w in ("negative", "bad", "poor", "terrible")):
            return "negative"
        if any(w in v_lower for w in ("mixed", "both", "ambivalent")):
            return "mixed"
        return "neutral"  # safe fallback, never raises ValueError

    @field_validator("keywords", mode="before")
    @classmethod
    def normalize_keywords(cls, v) -> List[str]:
        """Lowercase all keywords; never raises."""
        if isinstance(v, list):
            return [str(kw).lower().strip() for kw in v]
        return []
```
The second fix is prompt-side enumeration — if the model knows the exact allowed values, it reliably returns them. Add the constraint directly to the format instructions via a Literal type hint; Pydantic generates the enum constraint in the JSON schema, and with_structured_output() includes that schema in its function-calling definition:
```python
from typing import List, Literal

from pydantic import BaseModel


class ContentAnalysis(BaseModel):
    topics: List[str]
    sentiment: Literal["positive", "negative", "neutral", "mixed"]  # model sees exact enum
    keywords: List[str]
```
Literal annotations are the cleanest solution: the JSON schema includes an enum constraint, modern function-calling models (GPT-4o, Claude 3.5+) follow enum constraints closely, and validation failures on properly enumerated fields become rare. Reserve @field_validator for post-parse cleanup that doesn't raise, not for enforcing constraints the model must guess.
Comparison: structured output failure modes across LangChain approaches
| Approach | Where failure occurs | Default retry behavior | Recommended cap |
|---|---|---|---|
| with_structured_output() | Pydantic parse / validation | Raises OutputParserException; no retry unless wrapped | Wrap in StructuredOutputGuard(max_attempts=3) |
| RetryWithErrorOutputParser | OutputParserException catch | Re-calls the LLM up to max_retries times; set the value explicitly | max_retries=2; never nest with .with_retry() |
| bind_tools() + AgentExecutor | Tool result → LLM → next action | max_iterations=15 (15 LLM calls + tool calls) | max_iterations=5, max_llm_calls=8 |
| @field_validator raising ValueError | Post-parse application logic | Identical to a schema error; every retry fails the same way | Use Literal types; make validators normalize, not raise |
| json_mode + manual parse | JSON parse → application parse | None; you control retry | 3 attempts max; explicit error-feedback prompt |
Composite guard: SafeStructuredChain
For production use, compose the retry cap, error feedback, a session token budget, and a per-call timeout into a single drop-in:
```python
import time
from typing import Any, Dict, Optional, Type, TypeVar

import httpx
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


class SafeStructuredChain:
    """
    Single drop-in for safe structured output calls.

    Protections:
    - Capped validation retries (max_attempts)
    - Error-feedback injection on retry (not blind re-call)
    - Session token budget ceiling
    - Per-call timeout

    Does NOT use RetryWithErrorOutputParser or .with_retry() internally
    to avoid double-retry stacking.
    """

    def __init__(
        self,
        llm,
        schema: Type[T],
        max_attempts: int = 3,
        session_token_ceiling: int = 50_000,
        timeout_seconds: float = 30.0,
    ):
        self.llm = llm
        self.schema = schema
        self.max_attempts = max_attempts
        self.session_token_ceiling = session_token_ceiling
        self.timeout_seconds = timeout_seconds
        self.parser = PydanticOutputParser(pydantic_object=schema)
        self._session_tokens = 0
        self._total_calls = 0
        self._total_failures = 0
        self._session_start = time.monotonic()

    def _check_budget(self) -> None:
        if self._session_tokens >= self.session_token_ceiling:
            raise RuntimeError(
                f"SafeStructuredChain: session token ceiling "
                f"({self.session_token_ceiling:,}) reached. "
                f"Total calls this session: {self._total_calls}."
            )

    def invoke(self, user_input: str, context: Optional[str] = None) -> T:
        format_instructions = self.parser.get_format_instructions()
        base_prompt = f"{user_input}\n\n{format_instructions}"
        if context:
            base_prompt = f"Context:\n{context}\n\n{base_prompt}"

        last_error: Optional[str] = None
        for attempt in range(1, self.max_attempts + 1):
            # Re-check before every call so retries cannot run past the ceiling
            self._check_budget()
            self._total_calls += 1

            prompt = base_prompt
            if last_error:
                prompt = (
                    f"{base_prompt}\n\n"
                    f"NOTE: Your previous response had this validation error:\n"
                    f"{last_error}\n"
                    f"Please correct it."
                )

            messages = [{"role": "user", "content": prompt}]
            # Timeouts and transport errors propagate; only parse failures are retried.
            # timeout is passed as a call kwarg; whether it is honored depends on the provider.
            response = self.llm.invoke(messages, timeout=self.timeout_seconds)
            usage = getattr(response, "usage_metadata", None)
            if usage:
                self._session_tokens += (
                    usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
                )

            try:
                return self.parser.parse(response.content)
            except OutputParserException as exc:
                self._total_failures += 1
                last_error = str(exc)[:400]
                if attempt == self.max_attempts:
                    raise RuntimeError(
                        f"SafeStructuredChain: exhausted {self.max_attempts} "
                        f"attempts for {self.schema.__name__}. "
                        f"Last error: {last_error}"
                    ) from exc

    def summary(self) -> Dict[str, Any]:
        return {
            "schema": self.schema.__name__,
            "total_calls": self._total_calls,
            "total_failures": self._total_failures,
            "session_tokens_used": self._session_tokens,
            "session_token_ceiling": self.session_token_ceiling,
            "session_elapsed_s": round(time.monotonic() - self._session_start, 1),
        }


# Example: report to RunGuard SDK
async def report_to_runguard(chain: SafeStructuredChain, app_id: str, key: str) -> None:
    summary = chain.summary()
    async with httpx.AsyncClient() as client:
        await client.post(
            "https://runguard.dev/api/v1/report",
            headers={"Authorization": f"Bearer {key}"},
            json={"app": app_id, "stats": summary},
            timeout=2.0,
        )
```
Wire SafeStructuredChain.summary() into your observability pipeline. Track total_failures / total_calls as a parse failure rate. A rate above 0.05 (5%) means your schema, format instructions, or model selection needs attention — retries are masking a systematic prompt engineering problem.
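As a minimal sketch of that check, the helpers below compute the parse failure rate from a summary dict and compare it with the 5% threshold. The function names are hypothetical; only the total_calls and total_failures keys come from summary().

```python
def parse_failure_rate(summary: dict) -> float:
    """Failures divided by calls; 0.0 when no calls have been made yet."""
    calls = summary.get("total_calls", 0)
    return summary.get("total_failures", 0) / calls if calls else 0.0


def needs_attention(summary: dict, threshold: float = 0.05) -> bool:
    """True when the failure rate is above the alerting threshold."""
    return parse_failure_rate(summary) > threshold


print(needs_attention({"total_calls": 200, "total_failures": 6}))   # False (3%)
print(needs_attention({"total_calls": 200, "total_failures": 14}))  # True (7%)
```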
Frequently asked questions
Does with_structured_output(method="function_calling") behave differently from method="json_mode" for retry cost?
Yes, meaningfully. With method="function_calling" (the default for models that support it), the model returns JSON via a structured tool call definition. The model has the JSON schema at generation time and is trained to follow it — parse failure rates are typically under 2% for well-specified schemas. With method="json_mode", the model is only told to return valid JSON; it infers the schema from format instructions in the prompt. Failure rates are typically 5–15% higher, meaning you pay for more retries. Use method="function_calling" where available and reserve json_mode for models that don't support function calling.
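To translate a failure rate into retry cost, the sketch below computes the expected number of LLM calls per request under a capped retry policy. It assumes each attempt fails independently with probability p, which is optimistic: as described earlier, repeated attempts often fail for the same reason. The failure rates shown are illustrative, and expected_calls is a hypothetical helper.

```python
def expected_calls(failure_rate: float, max_attempts: int) -> float:
    """Expected LLM calls per request: 1 + p + p^2 + ... + p^(max_attempts - 1)."""
    # Attempt k+1 only happens if the first k attempts all failed (probability p^k)
    return sum(failure_rate ** k for k in range(max_attempts))


for p in (0.02, 0.10, 0.30):
    print(f"p={p:.2f}: {expected_calls(p, max_attempts=3):.3f} calls/request")
# p=0.02: 1.020 calls/request
# p=0.10: 1.110 calls/request
# p=0.30: 1.390 calls/request
```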
Is there a safe way to use RetryWithErrorOutputParser alongside with_structured_output()?
The two are redundant and should not be composed. with_structured_output() already handles the LLM call and parse step; RetryWithErrorOutputParser wraps a parser, not a complete chain. If you chain them, you end up with two independent retry layers on the same failure type. Choose one: use with_structured_output() with a bounded outer guard (like StructuredOutputGuard above), or use PydanticOutputParser with RetryWithErrorOutputParser(max_retries=2) inside a custom LCEL chain. Don't mix both approaches on the same schema.
What's a safe max_iterations for AgentExecutor in production?
Four to six for most research and question-answering agents. The model typically identifies the relevant tools in the first two iterations, retrieves information in iterations three and four, and synthesizes in the final iteration. Agents that legitimately need more iterations — multi-step planning agents, agents orchestrating other agents — should be structured as explicit workflow graphs (LangGraph) rather than AgentExecutor, because the graph gives you per-node cost visibility and deterministic retry control. AgentExecutor is designed for open-ended tool use; its cost profile is inherently unpredictable at max_iterations=15.
If I use Literal types in my Pydantic schema, do all LLM providers respect the enum constraints?
Modern providers that support function calling (OpenAI GPT-4o, Anthropic Claude 3.5+, Google Gemini 1.5+) generally follow JSON schema enum constraints derived from Literal. Where the provider also offers constrained decoding (for example, OpenAI's strict structured outputs), tokens that would violate the constraint are suppressed outright. Older models and providers using JSON mode without structured decoding treat the schema as advisory; the model may still return values outside the enum. For providers without constrained decoding, use the normalizer-validator pattern (mode="before" validators that never raise) as a fallback alongside Literal type hints.
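A fallback normalizer can read its allowed values from the Literal itself, so the two never drift apart. The sketch below uses only the standard library; Sentiment and coerce_sentiment are hypothetical names, and in a Pydantic model the function would be called from a mode="before" field validator.

```python
from typing import Literal, get_args

Sentiment = Literal["positive", "negative", "neutral", "mixed"]


def coerce_sentiment(raw: object, default: str = "neutral") -> str:
    """Map a free-form value onto the Literal's allowed values; never raises."""
    allowed = get_args(Sentiment)
    text = str(raw).lower().strip()
    if text in allowed:
        return text
    # Fall back to substring matching for outputs like "largely negative"
    for value in allowed:
        if value in text:
            return value
    return default


print(coerce_sentiment("Positive"))          # positive
print(coerce_sentiment("largely negative"))  # negative
print(coerce_sentiment("unclear"))           # neutral
```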
How do I integrate SafeStructuredChain with LangSmith or Langfuse for observability?
LangSmith traces are attached via the LANGCHAIN_TRACING_V2=true environment variable and captured automatically on every llm.invoke() call. SafeStructuredChain calls llm.invoke() directly, so all LLM calls appear in LangSmith traces with their token counts. Add the chain summary to the trace root span by passing a RunTree via the langsmith.traceable decorator on your invoke() method. For Langfuse, use the observe decorator from langfuse.decorators. Neither integration changes the retry logic — observability and circuit breaking are complementary: observability tells you what happened, RunGuard stops it before it gets expensive.