Instructor (Python) Cost Control: Validation Retry Storms, Batch Extraction Loops, and Context Accumulation
Instructor has become the de facto standard for structured output extraction from LLMs in Python. Its model is elegant: annotate a Pydantic model with field descriptions, call client.chat.completions.create(response_model=MyModel), and get back a validated, type-safe Python object. When the LLM's output fails Pydantic validation, Instructor automatically retries up to max_retries times, passing the validation error back to the model so it can correct its output. This self-healing pattern is what makes Instructor so popular in production — it quietly handles the noise of LLM output inconsistency without requiring the caller to write retry logic.
The same mechanism that makes Instructor reliable is also the mechanism most likely to silently multiply your LLM spend. max_retries=3 means up to four full LLM calls per extraction for any schema that consistently fails validation. In an agent pipeline where a single agent step calls several Instructor extractors in sequence, a validator logic bug in one extractor doesn't raise an exception — it retries three times, pays four times the tokens, and then either returns a best-effort result or raises after exhausting retries. The budget impact is invisible in the success path and only shows up in your provider dashboard at the end of the month.
Four structural failure modes account for the majority of unexpected Instructor costs in production:
- Validator retry storm: a Pydantic field validator or model-level validator that encodes business logic (date ranges, enum sets, cross-field constraints) consistently rejects the LLM's output because the prompt doesn't adequately constrain it; each rejection triggers a retry at full call cost, with validation error context appended to the prompt for each attempt.
- Batch extraction all-or-nothing retry: extracting list[Item] from a document where a single invalid item in the list causes Instructor to retry the entire extraction, re-paying the full document prompt cost plus all the tokens for all valid items that were already correctly extracted.
- Context accumulation in retry chains: each Instructor retry appends the previous attempt's output and validation error to the message history, so retry N costs more than retry N-1 in absolute tokens; for schemas with long field descriptions and complex error messages, the third retry can cost 3–4× the original call.
- Multi-provider fallback amplification: pipelines that configure a provider fallback chain (try OpenAI, then Anthropic, then Mistral) with shared max_retries budgets exhaust all retries on the primary provider before cascading to the next, paying the full retry budget across every provider before surfacing the error.
Instructor's cost model
Instructor is a thin wrapper over your provider's chat API. Its cost footprint is entirely a function of LLM calls: every client.chat.completions.create() call that Instructor issues — whether original or retry — is billed by your provider at the standard per-token rate. Instructor itself does not charge. The cost model has three components:
- Base extraction cost: one LLM call combining your system prompt (including the Pydantic model schema as JSON), the user-provided document or context, and any tool-use plumbing Instructor injects for structured output. For a typical extraction with a 200-token schema definition and an 800-token document, this is 1,000+ input tokens plus 200–400 output tokens per call.
- Retry surcharge: each retry passes the previous attempt's output and the Pydantic ValidationError message as additional context before asking the model to try again. Retry 1 adds ~100–300 extra input tokens. Retry 2 adds the retry 1 output plus the retry 2 validation error, another 200–500 tokens. With max_retries=3, the third retry is paying for the original context plus three rounds of failed outputs and error messages, which can be 40–60% more tokens than the original call.
- Provider tooling overhead: Instructor's structured output mode uses provider-specific mechanisms (OpenAI's response_format, Anthropic's tool_use with a schema-shaped tool, Gemini's generationConfig.responseSchema). Anthropic's tool-use mechanism for structured output is particularly relevant: it injects a tool schema into the API call that Instructor must pass in every request, including retries, adding 100–200 extra input tokens per call compared to plain text extraction.
The practical implication: a pipeline that calls five Instructor extractors per user request, where one extractor consistently fails and retries three times, is paying 4× the expected cost for that extractor on every request — not on failures only. At $0.03 per original call, that extractor alone costs $0.12 per request instead of $0.03. At 10,000 requests per day, the cost difference is $900/day from a single misconfigured validator.
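The arithmetic above can be reproduced with a few lines. This is only a sketch: the function name and parameters are illustrative rather than part of Instructor, and it assumes every retry costs the same as the original call, which understates the real figure once retry context starts to accumulate.

```python
def daily_retry_overhead(base_call_usd: float, max_retries: int, requests_per_day: int) -> float:
    """Extra daily spend when one extractor exhausts its retries on every request.

    Simplifying assumption: each retry costs the same as the original call.
    """
    calls_per_request = max_retries + 1           # original call plus every retry
    actual = base_call_usd * calls_per_request    # what the failing extractor costs
    expected = base_call_usd                      # what one clean call would cost
    return (actual - expected) * requests_per_day


overhead = daily_retry_overhead(base_call_usd=0.03, max_retries=3, requests_per_day=10_000)
print(f"${overhead:,.2f} per day")
```

Plugging in your own per-call cost and traffic gives a quick upper bound on what a single always-failing validator can add to the bill.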
Failure mode 1: validator retry storm
The validator retry storm is the most common Instructor cost amplifier, and also the hardest to notice because it looks like normal operation. The scenario: you define a Pydantic model with a field that has a custom validator checking business logic — a date that must be in the future, a status field that must be one of a specific set of values, a numeric field that must be within a range. The LLM's training data doesn't include your specific enum or range constraint, so it consistently generates values that fail the validator. Instructor retries with the error message, the model generates a different invalid value, Instructor retries again. This continues until max_retries is exhausted, at which point Instructor either raises a ValidationError or (in some modes) returns the last best-effort object.
The trap is that the validator looks correct in isolation. The problem is at the intersection of prompt design and validator semantics: the prompt doesn't tell the model which values are valid, so the model generates plausible-but-wrong values, and the validator correctly rejects them. The fix requires either improving the prompt or relaxing the validator — but without monitoring, you never know the retry rate is high.
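One way to keep the prompt and the validator from drifting apart is to derive both from the same constant. The sketch below uses only the standard library; the status values and function names are hypothetical. In a Pydantic model, the first function's return value would be passed as the field description and the second would be called from a field validator.

```python
ALLOWED_STATUSES = ("open", "in_progress", "closed")  # hypothetical business enum


def status_description() -> str:
    """Text meant for the field description, so the model sees the valid values."""
    return "Ticket status. Must be exactly one of: " + ", ".join(ALLOWED_STATUSES) + "."


def validate_status(value: str) -> str:
    """The constraint the validator enforces, driven by the same constant."""
    if value not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {ALLOWED_STATUSES}, got {value!r}")
    return value


print(status_description())
```

Because the description and the check read the same tuple, adding a new status updates what the model is told and what the validator accepts in one place.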
The guard wraps Instructor calls and tracks validation failure rates per schema type. If a given schema's retry rate exceeds a threshold, it raises a circuit-open error rather than allowing more retries.
import time
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Type, TypeVar

import anthropic
import instructor
from pydantic import BaseModel
from runguard import BudgetTracker, BudgetExceededError

T = TypeVar("T", bound=BaseModel)


@dataclass
class SchemaStats:
    total_calls: int = 0
    total_retries: int = 0
    retry_capacity: int = 0  # sum of max_retries over all calls (rate denominator)
    consecutive_failures: int = 0
    last_failure_ts: float = 0.0


class ValidatorRetryGuard:
    """
    Wraps Instructor extract calls with per-schema retry rate monitoring.
    Raises RuntimeError when a schema's retry rate exceeds the threshold,
    preventing further spend on a consistently-failing extractor.
    """

    def __init__(
        self,
        max_retry_rate: float = 0.4,  # trip if >40% of available retries were consumed
        circuit_reset_seconds: int = 300,  # reset after 5 minutes quiet
        session_budget_usd: float = 1.0,
        base_call_cost_usd: float = 0.025,  # cost per original LLM call (input+output estimate)
        retry_cost_multiplier: float = 1.35,  # each retry costs ~35% more than the previous call
    ):
        self.max_retry_rate = max_retry_rate
        self.circuit_reset_seconds = circuit_reset_seconds
        self.budget = BudgetTracker(cap=session_budget_usd)
        self.base_cost = base_call_cost_usd
        self.retry_multiplier = retry_cost_multiplier
        self._stats: dict[str, SchemaStats] = defaultdict(SchemaStats)
        self._open_circuits: set[str] = set()
        self._client = instructor.from_anthropic(anthropic.Anthropic())

    def _schema_key(self, response_model: Type[BaseModel]) -> str:
        # Hash of the JSON schema: two models with the same schema share stats
        schema_repr = repr(response_model.model_json_schema())
        return hashlib.md5(schema_repr.encode()).hexdigest()[:12]

    def _estimate_call_cost(self, retry_count: int) -> float:
        """Estimates total cost of an extraction including retries."""
        cost = self.base_cost  # original call
        for i in range(retry_count):
            cost += self.base_cost * (self.retry_multiplier ** (i + 1))
        return cost

    def _maybe_reset_circuit(self, schema_key: str) -> None:
        stats = self._stats[schema_key]
        if (
            schema_key in self._open_circuits
            and time.time() - stats.last_failure_ts > self.circuit_reset_seconds
        ):
            self._open_circuits.discard(schema_key)
            stats.consecutive_failures = 0

    def extract(
        self,
        response_model: Type[T],
        messages: list[dict],
        model: str = "claude-sonnet-4-6",
        max_retries: int = 2,
    ) -> T:
        """
        Extracts a structured object using Instructor with retry storm protection.
        Raises RuntimeError if the schema's circuit is open (too many recent retries).
        """
        schema_key = self._schema_key(response_model)
        stats = self._stats[schema_key]
        self._maybe_reset_circuit(schema_key)
        if schema_key in self._open_circuits:
            raise RuntimeError(
                f"Circuit open for schema '{response_model.__name__}': retry rate exceeded "
                f"{self.max_retry_rate:.0%} threshold. Check your Pydantic validators and prompt "
                "to ensure the LLM can generate valid outputs. Circuit resets after "
                f"{self.circuit_reset_seconds}s of inactivity."
            )

        # Pre-flight budget check: reserve the worst-case cost for this call
        worst_case_cost = self._estimate_call_cost(max_retries)
        try:
            self.budget.add(worst_case_cost)
        except BudgetExceededError:
            raise RuntimeError(
                f"Session budget would be exceeded by worst-case extraction cost "
                f"(${worst_case_cost:.4f} for {max_retries} retries). "
                f"Budget remaining: ${self.budget.cap - self.budget.spent:.4f}"
            )

        stats.total_calls += 1
        stats.retry_capacity += max_retries
        try:
            result = self._client.messages.create(
                model=model,
                max_tokens=1024,
                messages=messages,
                response_model=response_model,
                max_retries=max_retries,
            )
            # Success: refund the retry reservation. The attempt count is not
            # observed here, so this assumes no retries were needed.
            unused = worst_case_cost - self._estimate_call_cost(0)
            self.budget.spent = max(0.0, self.budget.spent - unused)
            stats.consecutive_failures = 0
            return result
        except Exception as exc:
            # Retries exhausted (or another error surfaced): record the failure
            # and open the circuit if the schema's retry rate is too high.
            stats.total_retries += max_retries
            stats.consecutive_failures += 1
            stats.last_failure_ts = time.time()
            retry_rate = stats.total_retries / max(1, stats.retry_capacity)
            if retry_rate > self.max_retry_rate and stats.total_calls >= 3:
                self._open_circuits.add(schema_key)
            raise RuntimeError(
                f"Instructor extraction failed after {max_retries} retries for "
                f"'{response_model.__name__}'. "
                f"Schema retry rate: {retry_rate:.1%} (threshold: {self.max_retry_rate:.0%}). "
                f"{'Circuit opened: fix validators before next call.' if schema_key in self._open_circuits else ''}"
            ) from exc

    def schema_stats(self, response_model: Type[BaseModel]) -> dict:
        key = self._schema_key(response_model)
        s = self._stats[key]
        # Same definition of retry rate as the circuit breaker in extract()
        rate = s.total_retries / max(1, s.retry_capacity)
        return {
            "schema": response_model.__name__,
            "total_calls": s.total_calls,
            "total_retries": s.total_retries,
            "retry_rate": rate,
            "circuit_open": key in self._open_circuits,
        }
The guard operates on two tracks simultaneously. The per-call track estimates the worst-case cost of a call before issuing it: max_retries=2 means up to three LLM calls, and the budget deduction happens upfront. When the call succeeds, the retry portion of that reservation is refunded; because the guard does not observe how many attempts Instructor actually used, the refund treats a successful call as having needed no retries. This prevents the situation where the last call in a session burns through the remaining budget on retries. The per-schema track accumulates the retry rate for each Pydantic model class, counting a call's retries once they have been exhausted, and opens the circuit when the rate exceeds the threshold. A schema with a 45% retry rate after 10 calls has a validation problem that no amount of retrying will fix. It needs a prompt or validator change, not another attempt.
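To make the upfront reservation concrete, here is the same estimate as a standalone function, evaluated with the guard's default parameters (a base call cost of $0.025 and a 1.35 retry multiplier). The function name is illustrative, and the numbers are estimates rather than measured provider costs.

```python
def worst_case_cost(base_cost: float, max_retries: int, retry_multiplier: float = 1.35) -> float:
    """Original call plus every retry, each retry costlier than the last."""
    cost = base_cost
    for i in range(max_retries):
        cost += base_cost * retry_multiplier ** (i + 1)
    return cost


reserved = worst_case_cost(0.025, max_retries=2)           # deducted before the call
refund = reserved - worst_case_cost(0.025, max_retries=0)  # returned if no retries are used
print(f"reserved ${reserved:.4f}, refund on clean success ${refund:.4f}")
```

With these defaults a clean first-try success ends up charged only the base cost, while a call that exhausts both retries keeps the full reservation.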
Failure mode 2: batch extraction all-or-nothing retry
Extracting a list of structured objects from a document is one of the most common Instructor use cases: entity extraction from text, line-item parsing from invoices, event extraction from logs. The natural model is response_model=list[Item]. Instructor treats the entire list as a single extraction unit — if any item in the returned list fails Pydantic validation, the entire list extraction is retried from scratch. The model re-generates all N items, paying input tokens for the full document context again, plus output tokens for all N items.
The cost amplification is proportional to list size. A document that should produce 30 items but where one item consistently fails validation re-extracts all 30 items on each of up to max_retries retries. At $0.02 per extraction of 30 items, with max_retries=3, the cost is $0.08 instead of $0.02: 4× for a single failing item out of 30. The failing item is usually the edge case: a date in an unusual format, a name with special characters, a numeric field where the source text is ambiguous. The 29 valid items are re-generated on every retry, paying their token cost with zero benefit.
The guard breaks batch extractions into smaller chunks and applies chunk-level retry limits, preventing a single bad item from triggering a full-batch re-extraction.
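A rough comparison of all-or-nothing and chunked retries follows. It is a sketch under loud assumptions: cost is split evenly across chunks, and the fixed per-call overhead (schema and instructions repeated in every chunk call) is ignored, so real chunked costs will be somewhat higher. Function names are illustrative.

```python
def all_or_nothing_cost(extraction_usd: float, max_retries: int) -> float:
    """One failing item forces the whole extraction to be repeated on every retry."""
    return extraction_usd * (max_retries + 1)


def chunked_cost(extraction_usd: float, n_chunks: int, failing_chunks: int, max_retries: int) -> float:
    """Only the chunks containing a failing item pay for retries."""
    per_chunk = extraction_usd / n_chunks  # assumes cost divides evenly across chunks
    clean = (n_chunks - failing_chunks) * per_chunk
    failing = failing_chunks * per_chunk * (max_retries + 1)
    return clean + failing


full = all_or_nothing_cost(0.02, max_retries=3)
chunked = chunked_cost(0.02, n_chunks=3, failing_chunks=1, max_retries=3)
print(f"all-or-nothing ${full:.3f} vs chunked ${chunked:.3f}")
```

Under these assumptions, confining the bad item to one of three chunks halves the worst-case spend for the 30-item example.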
from typing import Type, TypeVar

import anthropic
import instructor
from pydantic import BaseModel
from runguard import BudgetTracker, BudgetExceededError

T = TypeVar("T", bound=BaseModel)


class BatchExtractionGuard:
    """
    Breaks Instructor list[Item] extractions into chunks so that a validation
    failure in one chunk does not trigger a full-document re-extraction.
    Items from chunks that validate are preserved; a chunk that still fails
    after its retry is skipped as a whole.
    """

    def __init__(
        self,
        chunk_size: int = 10,
        max_item_retries: int = 1,
        session_budget_usd: float = 2.0,
        cost_per_item_usd: float = 0.002,
    ):
        self.chunk_size = chunk_size
        self.max_item_retries = max_item_retries
        self.budget = BudgetTracker(cap=session_budget_usd)
        self.cost_per_item = cost_per_item_usd

    def _build_chunk_prompt(
        self,
        document: str,
        item_model: Type[BaseModel],
        chunk_index: int,
        total_chunks: int,
        target_count: int,
    ) -> list[dict]:
        # One line per schema field so the model sees the field descriptions
        schema_description = "\n".join(
            f"- {name}: {info.get('description', 'required field')}"
            for name, info in item_model.model_json_schema()
            .get("properties", {}).items()
        )
        return [
            {
                "role": "user",
                "content": (
                    f"Extract structured {item_model.__name__} items from the following document "
                    f"(chunk {chunk_index + 1} of {total_chunks}, targeting ~{target_count} items).\n\n"
                    f"Schema fields:\n{schema_description}\n\n"
                    f"Document:\n{document}\n\n"
                    f"Return exactly the items found in this chunk. Do not invent items. "
                    f"If a field value is unclear or missing, use null rather than guessing."
                ),
            }
        ]

    def extract_list(
        self,
        item_model: Type[T],
        documents: list[str],
        model: str = "claude-haiku-4-5-20251001",
    ) -> list[T]:
        """
        Extracts a list of item_model instances from each document.
        Processes documents in chunks; preserves valid items from partially-failed chunks.
        """
        base_client = anthropic.Anthropic()
        client = instructor.from_anthropic(base_client)

        # Wrapper model for list extraction
        class ItemList(BaseModel):
            items: list[item_model]  # type: ignore[valid-type]

        all_results: list[T] = []
        for doc_idx, document in enumerate(documents):
            # Budget check per document: original call plus allowed retries
            estimated_doc_cost = (
                self.cost_per_item * self.chunk_size * (self.max_item_retries + 1)
            )
            try:
                self.budget.add(estimated_doc_cost)
            except BudgetExceededError:
                break  # Stop processing documents when budget exhausted

            # Try full document extraction first
            try:
                full_result = client.messages.create(
                    model=model,
                    max_tokens=2048,
                    messages=self._build_chunk_prompt(
                        document, item_model, 0, 1, self.chunk_size
                    ),
                    response_model=ItemList,
                    max_retries=self.max_item_retries,
                )
                all_results.extend(full_result.items)
                # Refund part of the reservation: the full pass succeeded
                self.budget.spent = max(0.0, self.budget.spent - estimated_doc_cost * 0.7)
                continue
            except Exception:
                pass  # Full extraction failed: fall through to chunked approach

            # Chunked fallback: split document into sections and extract separately
            words = document.split()
            chunk_word_count = max(200, len(words) // max(1, len(words) // 400 + 1))
            chunks = []
            for i in range(0, len(words), chunk_word_count):
                chunks.append(" ".join(words[i:i + chunk_word_count]))

            for chunk_idx, chunk in enumerate(chunks):
                try:
                    chunk_result = client.messages.create(
                        model=model,
                        max_tokens=1024,
                        messages=self._build_chunk_prompt(
                            chunk, item_model, chunk_idx, len(chunks),
                            max(3, self.chunk_size // len(chunks))
                        ),
                        response_model=ItemList,
                        max_retries=1,  # One retry per chunk maximum
                    )
                    all_results.extend(chunk_result.items)
                except Exception:
                    # This chunk failed even with retry: skip rather than re-extract.
                    # Valid items from other chunks are preserved.
                    continue
        return all_results
The two-pass design is the key: a single full-document extraction attempt first, which succeeds for well-structured documents and costs the minimum, followed by chunked extraction only when the full pass fails. Chunking prevents a single bad item from forcing re-extraction of the entire document — each chunk succeeds or fails independently. The cost_per_item estimate is intentionally conservative and applied upfront so the budget ceiling is enforced before any API calls are made for a given document, not after. Items successfully extracted from completed chunks are preserved in all_results regardless of what happens with later chunks.
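One detail worth checking in any word-count splitter is the size of the final chunk, since a very short trailing chunk still costs a full call. The sketch below shows balanced splitting as an alternative; it is not the guard's own code, and the function name and 400-word target are assumptions.

```python
import math


def split_words(document: str, target_words: int = 400) -> list[str]:
    """Split a document into roughly equal word-count chunks near target_words."""
    words = document.split()
    if not words:
        return []
    n_chunks = math.ceil(len(words) / target_words)  # how many chunks are needed
    size = math.ceil(len(words) / n_chunks)          # spread the words evenly across them
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


doc = " ".join(f"w{i}" for i in range(1000))
chunks = split_words(doc)
print([len(c.split()) for c in chunks])  # [334, 334, 332]
```

Splitting on word counts can still cut an item across a chunk boundary, so documents with natural separators (lines, rows, paragraphs) are usually better split on those.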
Failure mode 3: context accumulation in retry chains
Instructor's retry mechanism passes the previous failed attempt's output back to the model as additional context. The message history for retry N looks like: original system prompt + original user message + [assistant output 1 that failed validation] + [user message: "that output failed validation with error: X, please try again"] + [assistant output 2 that failed validation] + [its error message] + ... up to failed output N. This accumulation means each successive retry is paying for all previous failed outputs plus their error messages in the input token count.
For simple schemas with short field descriptions, this overhead is minor: 100–200 extra input tokens per retry. For complex schemas with rich field descriptions, nested objects, and multi-field validators that produce verbose error messages, the overhead compounds significantly. A schema with a 400-token description (and failed outputs of similar size) and a validator that generates 200-token error messages produces retry context of 600, 1,200, 1,800 extra tokens for retries 1, 2, 3. At Claude Sonnet 4.6 input rates ($0.000003/token), retry 3 costs $0.0054 more than the original call in context overhead alone, before counting the output tokens for the re-generated object.
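The overhead figures can be reproduced directly. This sketch reads the 400-token figure as the size of each failed output carried forward, which is an assumption; the function name is illustrative and the input rate is the one quoted in the text.

```python
INPUT_USD_PER_TOKEN = 0.000003  # input rate quoted in the text


def retry_overhead_tokens(failed_output_tokens: int, error_tokens: int, retry_number: int) -> int:
    """Extra input tokens carried by retry N: N failed outputs plus N error messages."""
    return retry_number * (failed_output_tokens + error_tokens)


overheads = [retry_overhead_tokens(400, 200, n) for n in (1, 2, 3)]
extra_cost_retry_3 = overheads[-1] * INPUT_USD_PER_TOKEN
total_extra_cost = sum(overheads) * INPUT_USD_PER_TOKEN
print(overheads, f"${extra_cost_retry_3:.4f}", f"${total_extra_cost:.4f}")
```

Summed over all three retries the overhead is 3,600 input tokens, so the accumulated context roughly doubles the extra input cost compared with looking at retry 3 in isolation.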
The guard monitors context size growth across retries and short-circuits when context accumulation indicates the error is structural (the model is not converging) rather than transient.
import anthropic
import instructor
from pydantic import BaseModel
from runguard import BudgetTracker, BudgetExceededError


class ContextAccumulationGuard:
    """
    Monitors context size growth across Instructor retries.
    Short-circuits when the accumulated retry context signals a structural
    validation failure (model not converging toward a valid output).
    Also caps absolute context size to prevent runaway token spend.
    """

    def __init__(
        self,
        max_context_tokens: int = 4000,
        max_context_growth_factor: float = 2.0,
        session_budget_usd: float = 0.50,
    ):
        self.max_context_tokens = max_context_tokens
        self.max_growth_factor = max_context_growth_factor
        self.budget = BudgetTracker(cap=session_budget_usd)
        self._base_client = anthropic.Anthropic()
        self._instructor_client = instructor.from_anthropic(self._base_client)

    def _estimate_tokens(self, messages: list[dict]) -> int:
        total_chars = sum(
            len(m.get("content", "") if isinstance(m.get("content"), str)
                else str(m.get("content", "")))
            for m in messages
        )
        return total_chars // 4  # rough char-to-token estimate

    def extract_with_context_guard(
        self,
        response_model: type[BaseModel],
        messages: list[dict],
        model: str = "claude-sonnet-4-6",
        max_retries: int = 2,
    ) -> BaseModel:
        """
        Extracts a structured object, monitoring context token growth across retries.
        Raises RuntimeError if context exceeds ceiling or grows faster than expected
        for a converging model (indicating structural rather than transient failure).
        """
        base_tokens = self._estimate_tokens(messages)
        # Pre-check: reject if base context already near ceiling
        if base_tokens > self.max_context_tokens * 0.8:
            raise RuntimeError(
                f"Base context ({base_tokens} tokens estimated) already near ceiling "
                f"({self.max_context_tokens} tokens). Reduce document size or simplify schema "
                "before attempting extraction."
            )

        retry_messages = list(messages)
        last_error = None
        last_output = ""
        for attempt in range(max_retries + 1):
            # Context size check before each attempt
            current_tokens = self._estimate_tokens(retry_messages)
            if current_tokens > self.max_context_tokens:
                raise RuntimeError(
                    f"Context size ({current_tokens} tokens) exceeded ceiling "
                    f"({self.max_context_tokens} tokens) after {attempt} retries. "
                    "Structural validation failure: model is not converging. "
                    f"Last error: {str(last_error)[:200] if last_error else 'unknown'}"
                )

            # Growth factor check: if context grew more than expected, model is not converging
            if attempt > 0:
                growth_factor = current_tokens / max(1, base_tokens)
                if growth_factor > self.max_growth_factor:
                    raise RuntimeError(
                        f"Context grew {growth_factor:.1f}× from base ({base_tokens} → "
                        f"{current_tokens} tokens) after {attempt} retries. "
                        "Model is not converging toward a valid output; likely a structural "
                        "mismatch between the schema and the source document."
                    )

            # Budget check: estimated input cost plus ~400 output tokens
            call_cost_estimate = current_tokens * 0.000003 + 400 * 0.000015
            try:
                self.budget.add(call_cost_estimate)
            except BudgetExceededError:
                raise RuntimeError(
                    f"Budget exhausted on attempt {attempt + 1}. "
                    f"Spent: ${self.budget.spent:.4f} / ${self.budget.cap:.2f}"
                )

            # Attempt extraction
            try:
                result = self._instructor_client.messages.create(
                    model=model,
                    max_tokens=1024,
                    messages=retry_messages,
                    response_model=response_model,
                    max_retries=0,  # We handle retries manually to track context
                )
                return result
            except Exception as exc:  # includes validation failures surfaced by Instructor
                last_error = exc
                # Assumption: Instructor's retry exception may carry the failed
                # completion as `last_completion`; fall back to a placeholder if absent.
                last_output = str(getattr(exc, "last_completion", "") or "")
                if attempt == max_retries:
                    raise RuntimeError(
                        f"Extraction failed after {max_retries + 1} attempts. "
                        f"Final context size: {current_tokens} tokens. "
                        f"Last error: {str(exc)[:300]}"
                    ) from exc
                # Add error context for next retry (this is what causes accumulation)
                error_msg = str(exc)[:300]
                retry_messages = retry_messages + [
                    {"role": "assistant", "content": last_output[:500] if last_output else "[no output]"},
                    {
                        "role": "user",
                        "content": (
                            f"That output failed validation with the following error:\n{error_msg}\n\n"
                            "Please correct only the failing fields and return the full valid object."
                        ),
                    },
                ]
Taking manual control of the retry loop — setting max_retries=0 on the Instructor call and handling retries in the guard's own loop — is necessary to insert the growth factor check between attempts. With Instructor's built-in retry mechanism, there is no hook between individual retry attempts. The max_growth_factor=2.0 threshold works as follows: a model that is genuinely converging toward a valid output should produce shorter, more targeted corrections on each retry, which means context should grow sub-linearly. A model that is not converging — because the schema constraint cannot be satisfied with the available context — produces equally long (or longer) failed outputs on each retry, causing context to grow roughly linearly with attempt count. When growth exceeds 2×, the loop is structural rather than convergent, and the guard raises rather than allowing more attempts.
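The growth check can be explored in isolation with a small simulation. It assumes, for simplicity, that every retry adds the same number of tokens, which is the non-converging case described above; the function name and parameters are illustrative.

```python
from typing import Optional


def first_tripping_attempt(
    base_tokens: int,
    tokens_added_per_retry: int,
    max_growth_factor: float = 2.0,
    max_retries: int = 5,
) -> Optional[int]:
    """Return the retry number at which the growth check would raise, or None."""
    for attempt in range(1, max_retries + 1):
        current = base_tokens + attempt * tokens_added_per_retry  # linear growth
        if current / max(1, base_tokens) > max_growth_factor:
            return attempt
    return None


heavy = first_tripping_attempt(base_tokens=1000, tokens_added_per_retry=600)
light = first_tripping_attempt(base_tokens=1000, tokens_added_per_retry=150, max_retries=3)
print(heavy, light)
```

With a 1,000-token base, adding 600 tokens per retry reaches 2.2× on the second retry and trips the check, while adding 150 per retry never gets there within three retries. The same per-retry growth trips sooner on a small base context, so the threshold may need tuning for short prompts.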
Failure mode 4: multi-provider fallback amplification
Teams building resilient extraction pipelines often configure a provider fallback chain: if OpenAI is unavailable or returns an error, try Anthropic; if Anthropic fails, try Mistral or Cohere. This is sensible for availability failures (rate limits, API outages, transient errors) but becomes a cost amplifier for validation failures. A Pydantic validator that rejects the OpenAI output will also reject the Anthropic output and the Mistral output — the constraint is in the schema, not in any provider's specific output style. When the fallback chain is combined with max_retries, you pay the full retry budget on every provider before learning that the problem is the schema, not the provider.
The amplification: three providers × (three retries + one original call) = 12 LLM calls for a single schema validation failure. At $0.025 per call, one extraction that should cost $0.025 costs $0.30 in the worst case. In pipelines that process thousands of documents per day, a single overly-strict validator with a three-provider fallback burns through budgets that were sized for single-provider operation.
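The multiplication is easy to get wrong by one, because each provider makes an original call in addition to its retries. A small sketch, with an illustrative function name and a flat per-call cost as the simplifying assumption:

```python
def worst_case_calls(n_providers: int, max_retries: int) -> int:
    """Every provider makes one original call plus max_retries retries."""
    return n_providers * (max_retries + 1)


calls = worst_case_calls(n_providers=3, max_retries=3)
cost = calls * 0.025  # flat per-call cost, ignoring retry context growth
print(calls, f"${cost:.2f}")
```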
The guard distinguishes validation failures (schema constraint mismatch — same on all providers) from infrastructure failures (rate limits, timeouts — provider-specific) before triggering a fallback.
import anthropic
import instructor
# Assumption: recent Instructor versions wrap exhausted validation retries in
# InstructorRetryException rather than raising the Pydantic error directly.
from instructor.exceptions import InstructorRetryException
from pydantic import BaseModel, ValidationError
from runguard import BudgetTracker, BudgetExceededError


class ProviderConfig:
    def __init__(self, name: str, client, model: str, cost_per_call: float):
        self.name = name
        self.client = client
        self.model = model
        self.cost_per_call = cost_per_call


class MultiProviderExtractionGuard:
    """
    Manages multi-provider fallback for Instructor extractions.
    Distinguishes validation failures (schema issue: don't fall back) from
    infrastructure failures (rate limit, timeout: do fall back).
    Prevents the fallback chain from amplifying schema validation costs.
    """

    VALIDATION_ERRORS = (ValidationError, InstructorRetryException)
    INFRASTRUCTURE_ERRORS = (
        anthropic.RateLimitError,
        anthropic.APIConnectionError,
        anthropic.APIStatusError,
    )

    def __init__(
        self,
        providers: list[ProviderConfig],
        max_retries_per_provider: int = 1,
        session_budget_usd: float = 1.0,
    ):
        self.providers = providers
        self.max_retries = max_retries_per_provider
        self.budget = BudgetTracker(cap=session_budget_usd)
        self._validation_failure_counts: dict[str, int] = {}

    def _schema_key(self, response_model: type[BaseModel]) -> str:
        return response_model.__name__

    def extract(
        self,
        response_model: type[BaseModel],
        messages: list[dict],
    ) -> BaseModel:
        """
        Attempts extraction across providers in order. Falls back to the next
        provider only on infrastructure errors, not on validation failures.
        """
        schema_key = self._schema_key(response_model)
        last_exception = None
        last_validation_error = None
        for provider in self.providers:
            # Check if this schema has a known validation failure pattern
            if self._validation_failure_counts.get(schema_key, 0) >= 3:
                raise RuntimeError(
                    f"Schema '{response_model.__name__}' has failed validation "
                    f"{self._validation_failure_counts[schema_key]} times across providers. "
                    "This is a schema constraint issue, not a provider issue. "
                    "Review your Pydantic validators and prompt before retrying."
                )

            # Budget check: reserve the original call plus all retries
            worst_case = provider.cost_per_call * (self.max_retries + 1)
            try:
                self.budget.add(worst_case)
            except BudgetExceededError:
                break

            try:
                result = provider.client.messages.create(
                    model=provider.model,
                    max_tokens=1024,
                    messages=messages,
                    response_model=response_model,
                    max_retries=self.max_retries,
                )
                # Success: refund unused retry budget
                self.budget.spent = max(0.0, self.budget.spent - worst_case + provider.cost_per_call)
                return result
            except self.VALIDATION_ERRORS as exc:
                # Validation failure: record and do NOT fall back to next provider
                # (the next provider will produce the same validation failure)
                self._validation_failure_counts[schema_key] = (
                    self._validation_failure_counts.get(schema_key, 0) + 1
                )
                last_validation_error = exc
                raise RuntimeError(
                    f"Validation failure on provider '{provider.name}' for schema "
                    f"'{response_model.__name__}'. Skipping fallback: validation failures "
                    "are schema-level issues, not provider-specific. "
                    f"Error: {str(exc)[:300]}"
                ) from exc
            except self.INFRASTRUCTURE_ERRORS as exc:
                # Infrastructure failure: fall back to next provider
                last_exception = exc
                # Refund the unused retry budget for this provider since we're falling back
                self.budget.spent = max(0.0, self.budget.spent - worst_case + provider.cost_per_call)
                continue
            except Exception as exc:
                # Unknown error: treat as infrastructure failure and fall back
                last_exception = exc
                self.budget.spent = max(0.0, self.budget.spent - worst_case + provider.cost_per_call)
                continue

        raise RuntimeError(
            f"All providers exhausted for schema '{response_model.__name__}'. "
            f"Last infrastructure error: {str(last_exception)[:200] if last_exception else 'none'}. "
            f"Last validation error: {str(last_validation_error)[:200] if last_validation_error else 'none'}."
        )


def build_anthropic_provider(budget_per_call: float = 0.03) -> ProviderConfig:
    base = anthropic.Anthropic()
    client = instructor.from_anthropic(base)
    return ProviderConfig(
        name="anthropic",
        client=client,
        model="claude-sonnet-4-6",
        cost_per_call=budget_per_call,
    )
The key logic is in the except hierarchy. Validation errors are caught before the infrastructure error exceptions. When Instructor surfaces a validation failure after exhausting retries on a consistently-failing schema (depending on the version, as a Pydantic ValidationError or wrapped in Instructor's own retry exception, so the validation clause must match whichever your version raises), the guard records the failure and raises immediately. It does not fall through to the next provider. The counter tracks cross-provider validation failures: if the same schema fails validation three times across the full session (including across different requests, not just within one call), the circuit opens and all future calls for that schema raise before any provider is contacted. This prevents the sustained burn of a systematically misconfigured extractor running across thousands of documents.
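The ordering of the except clauses is the whole technique, so it is worth seeing stripped of provider details. The sketch below uses stand-in exception classes and plain callables instead of real clients; every name in it is hypothetical.

```python
from typing import Callable


class SchemaValidationFailure(Exception):
    """Stand-in for a validation failure surfaced by the extraction library."""


class ProviderUnavailable(Exception):
    """Stand-in for rate limits, timeouts, and connection errors."""


def extract_with_fallback(providers: list[tuple[str, Callable[[], dict]]]) -> dict:
    """Try providers in order; fall back only when a provider is unavailable."""
    last_error = None
    for name, call in providers:
        try:
            return call()
        except SchemaValidationFailure:
            # Another provider would hit the same validator, so stop here.
            raise
        except ProviderUnavailable as exc:
            last_error = exc  # provider-specific problem: try the next one
    raise RuntimeError("all providers unavailable") from last_error


def down() -> dict:
    raise ProviderUnavailable("rate limited")


def healthy() -> dict:
    return {"status": "ok"}


result = extract_with_fallback([("primary", down), ("secondary", healthy)])
print(result)
```

An unavailable primary falls through to the secondary, while a validation failure on the primary propagates without the secondary ever being called.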
Combining guards in a production Instructor pipeline
In a real extraction pipeline, these four failure modes can occur simultaneously at different points. A recommended layering for production:
- The ValidatorRetryGuard wraps every individual client.messages.create(response_model=...) call, providing per-schema retry rate monitoring and circuit breakers at the extractor level.
- The BatchExtractionGuard handles all list[Item] extractions where documents may contain variable numbers of items, replacing the default all-or-nothing retry with chunked extraction.
- The ContextAccumulationGuard replaces Instructor's built-in retry mechanism for complex nested schemas where retry context is known to grow quickly, typically schemas with 5+ fields, cross-field validators, or custom validator methods that produce multi-line error messages.
- The MultiProviderExtractionGuard wraps the outermost extraction call only when your pipeline genuinely requires multi-provider fallback; if you're using a single provider, this guard is unnecessary overhead.
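A single BudgetTracker instance shared by all four guards ensures that they collectively enforce one per-session ceiling rather than each running its own independent sub-budget that can add up to a surprise total at session end. As written above, each guard constructs its own BudgetTracker in __init__, so sharing requires creating one tracker and passing it into every guard, for example through an extra constructor argument.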
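A minimal sketch of the shared-ceiling idea, using a hypothetical stand-in tracker rather than the runguard API (whose constructor and methods are not shown beyond cap, spent, and add). The Guard class here is likewise a placeholder for any of the four guards.

```python
class SharedBudget:
    """Hypothetical stand-in for a budget tracker; not the runguard API."""

    def __init__(self, cap_usd: float):
        self.cap = cap_usd
        self.spent = 0.0

    def add(self, amount_usd: float) -> None:
        # Refuse the charge if it would push total spend past the cap
        if self.spent + amount_usd > self.cap:
            raise RuntimeError(f"budget exceeded: {self.spent + amount_usd:.2f} > {self.cap:.2f}")
        self.spent += amount_usd


class Guard:
    """Placeholder guard that charges a tracker it is given instead of creating one."""

    def __init__(self, name: str, budget: SharedBudget):
        self.name = name
        self.budget = budget

    def charge(self, amount_usd: float) -> None:
        self.budget.add(amount_usd)


budget = SharedBudget(cap_usd=1.0)
retry_guard = Guard("validator-retry", budget)
batch_guard = Guard("batch-extraction", budget)

retry_guard.charge(0.6)
try:
    batch_guard.charge(0.6)  # would bring the session total to 1.20
except RuntimeError as exc:
    print(exc)
```

With separate trackers capped at $1.00 each, both charges would have been accepted and the session would have spent $1.20; with one shared tracker the second charge is refused.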
Instructor v1.x hooks note: Recent Instructor versions expose an event hook API on the patched client (for example client.on("completion:response", callback)) that fires after every LLM call, including retries. You can use this hook to record actual token counts and update a shared BudgetTracker with measured costs rather than estimates. This gives tighter budget enforcement than cost estimates and removes the need to refund unused retry budget, because the tracker knows exactly what each call cost. The error hooks are the cleanest insertion point for retry rate monitoring without taking manual control of the retry loop: Instructor's documentation lists parse:error for responses that fail parsing or validation and completion:error for errors raised by the completion call itself, so confirm which one fires on validation failures in the version you run.
Summary: Instructor cost amplification patterns
| Pattern | Cost multiplier | Guard |
|---|---|---|
| Validator retry storm: max_retries=3, consistent validation failure | Up to (max_retries + 1)× per call | Per-schema retry rate circuit breaker; upfront budget deduction |
| Batch all-or-nothing retry: list[Item] with one failing item | (max_retries + 1)× full document tokens | Chunked extraction; per-chunk independent retry; preserve valid items |
| Context accumulation: complex schemas with long error messages | 1.35–4× per retry for input tokens | Context growth factor check; circuit break at 2× base context |
| Multi-provider fallback: providers × retries for schema failures | N providers × (max_retries + 1) | Distinguish validation vs infrastructure errors; no fallback on ValidationError |
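The multipliers in the table are simple arithmetic, which the following sketch works through. The function names and the token counts (a 2,000-token base prompt with 700 tokens of output plus error text appended per retry) are illustrative assumptions, and the linear growth model is a simplification of how retry context accumulates.

```python
def worst_case_calls(max_retries: int, providers: int = 1) -> int:
    """Calls paid for when every attempt fails validation on every provider."""
    return providers * (max_retries + 1)


def input_tokens_per_attempt(
    base_input: int, appended_per_retry: int, max_retries: int
) -> list[int]:
    """Input size of each attempt when every retry appends the previous
    output and its validation error to the message history."""
    return [base_input + k * appended_per_retry for k in range(max_retries + 1)]


single_provider = worst_case_calls(max_retries=3)              # 4 calls
two_providers = worst_case_calls(max_retries=3, providers=2)   # 8 calls

attempts = input_tokens_per_attempt(
    base_input=2_000, appended_per_retry=700, max_retries=3
)
total_input = sum(attempts)
first_retry_growth = attempts[1] / attempts[0]   # 1.35x the base input
input_multiplier = total_input / attempts[0]     # 6.1x the base input
```

Under these assumed numbers the first retry already carries 1.35× the base input, and a fully exhausted retry chain pays for 6.1× the input tokens of a single clean call, which is more than the 4× a flat call count would suggest.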
Frequently asked questions
Does Instructor have built-in cost controls or token budgets?
Instructor does not include native cost controls, token budgets, or circuit breakers. It provides max_retries to cap the number of retry attempts and, in recent versions, a hooks system for observing calls. max_retries prevents infinite retry loops but does not distinguish between validation failures that will never succeed and transient failures that will. A max_retries=3 on a consistently-failing schema always burns all four calls before raising — there is no early exit based on failure patterns or accumulated spend. The guards above build that logic on top of the hooks system or by wrapping the client directly.
What is the most common cause of high Instructor retry rates in production?
In order of frequency: (1) enum constraints where the prompt does not enumerate valid values, so the LLM generates a plausible synonym that fails the Literal["A", "B", "C"] validator; (2) date and time validators that enforce business-specific formats (ISO 8601 vs. "Month DD, YYYY") without specifying the format in the field description; (3) cross-field validators using @model_validator(mode="after") that check relationships between fields, where the LLM generates each field independently without understanding the constraint; (4) numeric range validators without range hints in the field description. All four are fixable by adding constraint information to Field(description="...") rather than encoding it only in the validator logic.
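As an illustration of stating each constraint where the model can see it, here is a sketch of a schema that covers all four cases. The `Invoice` model, its fields, and the `is_valid` helper are hypothetical; the code assumes Pydantic v2.

```python
from datetime import date
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, model_validator


class Invoice(BaseModel):
    # (1) Enum: list the allowed values in the description, not only in the type.
    status: Literal["paid", "unpaid", "overdue"] = Field(
        description='Exactly one of "paid", "unpaid", "overdue" (lowercase).'
    )
    # (2) Date format: say which format the validator expects.
    issued: date = Field(description="Issue date in ISO 8601 format, YYYY-MM-DD.")
    # (3) Cross-field rule: repeat it in the description of the dependent field.
    due: date = Field(
        description="Due date in ISO 8601 format, YYYY-MM-DD. "
        "Must be on or after the issue date."
    )
    # (4) Numeric range: give the bounds in words as well as in ge/le.
    total: float = Field(
        ge=0,
        le=1_000_000,
        description="Invoice total in USD, between 0 and 1000000.",
    )

    @model_validator(mode="after")
    def due_not_before_issued(self) -> "Invoice":
        if self.due < self.issued:
            raise ValueError("due must be on or after issued")
        return self


def is_valid(payload: dict) -> bool:
    """Return True when the payload passes every validator."""
    try:
        Invoice.model_validate(payload)
        return True
    except ValidationError:
        return False


good = {"status": "paid", "issued": "2024-01-01", "due": "2024-01-31", "total": 10.0}
synonym = {**good, "status": "settled"}      # plausible synonym, not in the enum
inverted = {**good, "due": "2023-12-31"}     # violates the cross-field rule
```

The validators stay in place as the safety net; the descriptions reduce how often they fire, because the field descriptions are part of the schema the model sees.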
How does Instructor handle validation failures differently for Anthropic vs OpenAI?
Instructor's structured output mechanism depends on the provider and on the mode the client is configured with. With OpenAI, when Instructor is set to use strict structured outputs (response_format={"type": "json_schema", "json_schema": ...}), schema compliance is enforced at the API level: the model is constrained to produce valid JSON matching the schema, so only Pydantic business-logic validators (not JSON structure validators) can fail. In plain tool-calling mode, which is Instructor's default for OpenAI, that hard constraint does not apply. With Anthropic, Instructor uses the tool_use API with a schema-shaped tool, where the model is instructed to call the tool with valid arguments but is not hard-constrained. Anthropic extractions therefore tend to have a higher baseline validation failure rate than strictly constrained OpenAI extractions, because the model can deviate from the schema structure in ways that constrained decoding prevents. The retry storm failure mode is correspondingly more pronounced with Anthropic for structurally complex schemas.
Should I use Claude Haiku or Claude Sonnet for Instructor extractions?
For extractions with well-defined schemas and clear source text, Haiku extracts reliably at roughly 10× lower cost than Sonnet (the exact ratio depends on which model versions you compare) and is the right default. Reserve Sonnet for: schemas with complex cross-field validators where understanding the semantic relationship between fields matters; documents with ambiguous or noisy source text where Haiku produces higher retry rates; and extraction tasks where the output quality difference justifies the cost differential. A practical approach is to run a shadow comparison: extract a sample of your production documents with both models, compare retry rates and output quality, and use the result to pick the model per schema class. Compare expected cost per extraction (per-call price multiplied by the expected number of calls, retries included) rather than list price. A higher retry rate on Haiku narrows its price advantage, and where the price gap between the two models is small it can erase that advantage entirely; it also adds latency and more extractions that exhaust their retries.
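To make the retry-adjusted comparison concrete, here is a minimal sketch of expected cost per extraction. It assumes attempts fail independently at a fixed rate and that every attempt costs the same (ignoring context growth on retries); the prices and failure rates are made-up inputs, not measurements of any model.

```python
def expected_calls(fail_rate: float, max_retries: int) -> float:
    """Expected number of LLM calls per extraction.

    Attempt k+1 only happens if the first k attempts all failed,
    which has probability fail_rate ** k.
    """
    return sum(fail_rate ** k for k in range(max_retries + 1))


def expected_cost(price_per_call: float, fail_rate: float, max_retries: int = 3) -> float:
    """Per-call price scaled by the expected number of calls."""
    return price_per_call * expected_calls(fail_rate, max_retries)


# Large price gap: the cheaper model wins despite a higher failure rate.
cheap_wide = expected_cost(price_per_call=0.0010, fail_rate=0.25)
strong = expected_cost(price_per_call=0.0030, fail_rate=0.10)

# Narrow price gap: the same failure rates flip the result.
cheap_narrow = expected_cost(price_per_call=0.0027, fail_rate=0.25)

cheaper_model_wins_wide = cheap_wide < strong
cheaper_model_wins_narrow = cheap_narrow < strong
```

The break-even point is where the ratio of expected calls equals the ratio of per-call prices, so the retry rates from a shadow comparison can be plugged in directly to decide per schema class.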
How does RunGuard integrate with Instructor?
RunGuard's BudgetTracker and LoopDetector primitives are the building blocks used in all four guards above. BudgetTracker provides a thread-safe accumulator with a configurable cap that raises BudgetExceededError when adding a cost would exceed the cap; it works at the call site (before the API call) or after (with actual token counts from the response). LoopDetector identifies repeated signatures across a session: pass it the schema key on every extraction call and it trips after N consecutive extractions of the same schema with the same validation error signature. For Instructor specifically, the cleanest integration is via the hooks system: a handler registered with client.on("parse:error", handler) runs on every failed validation attempt and calls budget.add(estimated_cost) and loop_detector.step(schema_key), while a handler on completion:error covers failures raised by the completion call itself.
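The consecutive-signature check can be sketched as follows. This is a hypothetical stand-in, not RunGuard's LoopDetector: the two-argument `step` signature, the `threshold` parameter, and the `error_signature` helper are assumptions made so the example is self-contained.

```python
from typing import Optional, Tuple


class LoopDetector:
    """Hypothetical detector: trips after `threshold` identical consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._last: Optional[Tuple[str, str]] = None
        self._streak = 0

    def step(self, schema_key: str, error_signature: str) -> bool:
        """Record one failed attempt; return True once the streak hits the threshold."""
        signature = (schema_key, error_signature)
        if signature == self._last:
            self._streak += 1
        else:
            # A different schema or a different error starts a new streak.
            self._last = signature
            self._streak = 1
        return self._streak >= self.threshold


def error_signature(exc: Exception) -> str:
    """Reduce an exception to its type plus the first line of its message."""
    message = str(exc)
    first_line = message.splitlines()[0] if message else ""
    return f"{type(exc).__name__}:{first_line}"


detector = LoopDetector(threshold=3)
same_failure = ValueError("status: input should be 'paid', 'unpaid' or 'overdue'")

# The same schema failing the same way three times in a row trips the detector.
trips = [detector.step("Invoice", error_signature(same_failure)) for _ in range(3)]

# A different error on the next attempt resets the streak.
after_change = detector.step("Invoice", error_signature(ValueError("due before issued")))
```

Keying on the error signature as well as the schema means a schema that fails in varied ways (likely noisy input) is treated differently from one that fails identically every time (likely a prompt or validator bug that retries will not fix).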