Instructor (Python) Cost Control: Validation Retry Storms, Batch Extraction Loops, and Context Accumulation

Instructor has become the de facto standard for structured output extraction from LLMs in Python. Its model is elegant: annotate a Pydantic model with field descriptions, call client.chat.completions.create(response_model=MyModel), and get back a validated, type-safe Python object. When the LLM's output fails Pydantic validation, Instructor automatically retries up to max_retries times, passing the validation error back to the model so it can correct its output. This self-healing pattern is what makes Instructor so popular in production — it quietly handles the noise of LLM output inconsistency without requiring the caller to write retry logic.

The same mechanism that makes Instructor reliable is also the mechanism most likely to silently multiply your LLM spend. max_retries=3 means up to four full LLM calls per extraction for any schema that consistently fails validation. In an agent pipeline where a single agent step calls several Instructor extractors in sequence, a validator logic bug in one extractor doesn't raise an exception — it retries three times, pays four times the tokens, and then either returns a best-effort result or raises after exhausting retries. The budget impact is invisible in the success path and only shows up in your provider dashboard at the end of the month.

Four structural failure modes account for the majority of unexpected Instructor costs in production:

  • Validator retry storm — a Pydantic field validator or model-level validator that encodes business logic (date ranges, enum sets, cross-field constraints) consistently rejects the LLM's output because the prompt doesn't adequately constrain it; each rejection triggers a retry at full call cost, with validation error context appended to the prompt for each attempt.
  • Batch extraction all-or-nothing retry — extracting list[Item] from a document where a single invalid item in the list causes Instructor to retry the entire extraction, re-paying the full document prompt cost plus all the tokens for all valid items that were already correctly extracted.
  • Context accumulation in retry chains — each Instructor retry appends the previous attempt's output and validation error to the message history, so retry N costs more than retry N-1 in absolute tokens; for schemas with long field descriptions and complex error messages, the third retry can cost 3–4× the original call.
  • Multi-provider fallback amplification — pipelines that configure a provider fallback chain (try OpenAI, then Anthropic, then Mistral) with shared max_retries budgets exhaust all retries on the primary provider before cascading to the next, paying the full retry budget across every provider before surfacing the error.

Instructor's cost model

Instructor is a thin wrapper over your provider's chat API. Its cost footprint is entirely a function of LLM calls: every client.chat.completions.create() call that Instructor issues — whether original or retry — is billed by your provider at the standard per-token rate. Instructor itself does not charge. The cost model has three components:

  • Base extraction cost: one LLM call combining your system prompt (including Pydantic model schema as JSON), user-provided document or context, and any tool-use plumbing Instructor injects for structured output. For a typical extraction with a 200-token schema definition and 800-token document, this is 1,000+ input tokens plus 200–400 output tokens per call.
  • Retry surcharge: each retry passes the previous attempt's output and the Pydantic ValidationError message as additional context before asking the model to try again. Retry 1 adds ~100–300 extra input tokens. Retry 2 adds the retry 1 output plus the retry 2 validation error — another 200–500 tokens. With max_retries=3, the third retry is paying for the original context plus three rounds of failed outputs and error messages, which can be 40–60% more tokens than the original call.
  • Provider tooling overhead: Instructor's structured output mode uses provider-specific mechanisms (OpenAI's response_format, Anthropic's tool_use with a schema-shaped tool, Gemini's generationConfig.responseSchema). Anthropic's tool-use mechanism for structured output is particularly relevant: it injects a tool schema into the API call that Instructor must pass in every request, including retries, adding 100–200 extra input tokens per call compared to plain text extraction.

The practical implication: a pipeline that calls five Instructor extractors per user request, where one extractor consistently fails and retries three times, is paying 4× the expected cost for that extractor on every request — not on failures only. At $0.03 per original call, that extractor alone costs $0.12 per request instead of $0.03. At 10,000 requests per day, the cost difference is $900/day from a single misconfigured validator.
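The arithmetic above can be checked with a short calculation. This is a minimal sketch: the function name `retry_overhead_per_day` is illustrative, and it assumes every retry is billed at the same price as the original call, which understates the real figure because retries also carry the failed output and error message.

```python
def retry_overhead_per_day(
    cost_per_call_usd: float,
    max_retries: int,
    requests_per_day: int,
) -> float:
    """Extra daily spend when one extractor exhausts its retries on every request.

    Assumption: each retry costs the same as the original call (a lower bound).
    """
    expected = cost_per_call_usd                    # one call, no retries
    actual = cost_per_call_usd * (1 + max_retries)  # original call plus every retry
    return (actual - expected) * requests_per_day


overhead = retry_overhead_per_day(0.03, max_retries=3, requests_per_day=10_000)
print(f"${overhead:,.2f} per day")
```

With the figures from the text ($0.03 per call, max_retries=3, 10,000 requests per day) this reproduces the $900/day difference.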

Failure mode 1: validator retry storm

The validator retry storm is the most common Instructor cost amplifier, and also the hardest to notice because it looks like normal operation. The scenario: you define a Pydantic model with a field that has a custom validator checking business logic — a date that must be in the future, a status field that must be one of a specific set of values, a numeric field that must be within a range. The LLM's training data doesn't include your specific enum or range constraint, so it consistently generates values that fail the validator. Instructor retries with the error message, the model generates a different invalid value, Instructor retries again. This continues until max_retries is exhausted, at which point Instructor either raises a ValidationError or (in some modes) returns the last best-effort object.

The trap is that the validator looks correct in isolation. The problem is at the intersection of prompt design and validator semantics: the prompt doesn't tell the model which values are valid, so the model generates plausible-but-wrong values, and the validator correctly rejects them. The fix requires either improving the prompt or relaxing the validator — but without monitoring, you never know the retry rate is high.
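One way to close the gap on the prompt side is to generate the field's instruction text from the same set of values the validator checks, so the two cannot drift apart. The sketch below uses only the standard library; `TicketStatus` and `allowed_values_hint` are hypothetical names, and in a real model the resulting string would typically go into the field's description.

```python
from enum import Enum
from typing import Type


class TicketStatus(str, Enum):
    """Hypothetical enum: the exact set of values the validator accepts."""
    OPEN = "open"
    PENDING = "pending"
    RESOLVED = "resolved"


def allowed_values_hint(enum_cls: Type[Enum]) -> str:
    """Render the enum's values as an explicit instruction for the model."""
    values = ", ".join(f'"{member.value}"' for member in enum_cls)
    return f"Must be exactly one of: {values}. Do not use any other value."


hint = allowed_values_hint(TicketStatus)
print(hint)
```

Constraints that live only in custom validator code never reach the model through the schema, so stating them in the description is usually the cheaper fix compared with paying for retries.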

The guard wraps Instructor calls and tracks validation failure rates per schema type. If a given schema's retry rate exceeds a threshold, it raises a circuit-open error rather than allowing more retries.

Python — validator retry storm guard for Instructor extraction pipelines
import time
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Type, TypeVar
import anthropic
import instructor
from pydantic import BaseModel
from runguard import BudgetTracker, BudgetExceededError

T = TypeVar("T", bound=BaseModel)

@dataclass
class SchemaStats:
    total_calls: int = 0
    total_retries: int = 0
    consecutive_failures: int = 0
    last_failure_ts: float = 0.0


class ValidatorRetryGuard:
    """
    Wraps Instructor extract calls with per-schema retry rate monitoring.
    Raises RuntimeError (circuit open) when a schema's retry rate exceeds the threshold,
    preventing further spend on a consistently-failing extractor.
    """
    def __init__(
        self,
        max_retry_rate: float = 0.4,       # trip if >40% of calls for a schema are retries
        circuit_reset_seconds: int = 300,   # reset after 5 minutes quiet
        session_budget_usd: float = 1.0,
        base_call_cost_usd: float = 0.025,  # cost per original LLM call (input+output estimate)
        retry_cost_multiplier: float = 1.35, # each retry costs ~35% more than original
    ):
        self.max_retry_rate = max_retry_rate
        self.circuit_reset_seconds = circuit_reset_seconds
        self.budget = BudgetTracker(cap=session_budget_usd)
        self.base_cost = base_call_cost_usd
        self.retry_multiplier = retry_cost_multiplier
        self._stats: dict[str, SchemaStats] = defaultdict(SchemaStats)
        self._open_circuits: set[str] = set()
        self._client = instructor.from_anthropic(anthropic.Anthropic())

    def _schema_key(self, response_model: Type[BaseModel]) -> str:
        # Short, stable key derived from the schema contents (not used for security)
        return hashlib.md5(repr(response_model.model_json_schema()).encode()).hexdigest()[:12]

    def _estimate_call_cost(self, retry_count: int) -> float:
        """Estimates total cost of an extraction including retries."""
        cost = self.base_cost  # original call
        for i in range(retry_count):
            cost += self.base_cost * (self.retry_multiplier ** (i + 1))
        return cost

    def _maybe_reset_circuit(self, schema_key: str) -> None:
        stats = self._stats[schema_key]
        if (
            schema_key in self._open_circuits
            and time.time() - stats.last_failure_ts > self.circuit_reset_seconds
        ):
            self._open_circuits.discard(schema_key)
            stats.consecutive_failures = 0

    def extract(
        self,
        response_model: Type[T],
        messages: list[dict],
        model: str = "claude-sonnet-4-6",
        max_retries: int = 2,
    ) -> T:
        """
        Extracts a structured object using Instructor with retry storm protection.
        Raises RuntimeError if the schema's circuit is open (too many recent retries).
        """
        schema_key = self._schema_key(response_model)
        stats = self._stats[schema_key]

        self._maybe_reset_circuit(schema_key)

        if schema_key in self._open_circuits:
            raise RuntimeError(
                f"Circuit open for schema '{response_model.__name__}': retry rate exceeded "
                f"{self.max_retry_rate:.0%} threshold. Check your Pydantic validators and prompt "
                "to ensure the LLM can generate valid outputs. Circuit resets after "
                f"{self.circuit_reset_seconds}s of inactivity."
            )

        # Pre-flight budget check: estimate worst-case cost for this call
        worst_case_cost = self._estimate_call_cost(max_retries)
        try:
            self.budget.add(worst_case_cost)
        except BudgetExceededError:
            raise RuntimeError(
                f"Session budget would be exceeded by worst-case extraction cost "
                f"(${worst_case_cost:.4f} for {max_retries} retries). "
                f"Budget remaining: ${self.budget.cap - self.budget.spent:.4f}"
            )

        stats.total_calls += 1
        actual_retries = 0

        try:
            result = self._client.messages.create(
                model=model,
                max_tokens=1024,
                messages=messages,
                response_model=response_model,
                max_retries=max_retries,
            )
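            # Success: refund the reserved retry budget. NOTE: this sketch cannot see
            # how many retries Instructor used internally, so actual_retries stays 0
            # and the refund assumes a first-try success. Calls that succeed only
            # after retrying are under-counted in both the budget and the stats.
            unused = self._estimate_call_cost(max_retries) - self._estimate_call_cost(actual_retries)
            self.budget.spent = max(0.0, self.budget.spent - unused)
            return result

        except Exception as exc:
            # Treated as "retries exhausted": record failure and potentially open the
            # circuit. This also catches API/network errors, which count as failures here.
            actual_retries = max_retries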
            stats.total_retries += actual_retries
            stats.consecutive_failures += 1
            stats.last_failure_ts = time.time()

            retry_rate = stats.total_retries / max(1, stats.total_calls * max_retries)
            if retry_rate > self.max_retry_rate and stats.total_calls >= 3:
                self._open_circuits.add(schema_key)

            raise RuntimeError(
                f"Instructor extraction failed after {max_retries} retries for "
                f"'{response_model.__name__}'. "
                f"Schema retry rate: {retry_rate:.1%} (threshold: {self.max_retry_rate:.0%}). "
                f"{'Circuit opened — fix validators before next call.' if schema_key in self._open_circuits else ''}"
            ) from exc

    def schema_stats(self, response_model: Type[BaseModel]) -> dict:
        key = self._schema_key(response_model)
        s = self._stats[key]
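        # Retries per call. Note that extract() also divides by max_retries when it
        # checks the circuit threshold, so this figure can be higher (and exceed 1.0).
        rate = s.total_retries / max(1, s.total_calls)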
        return {
            "schema": response_model.__name__,
            "total_calls": s.total_calls,
            "total_retries": s.total_retries,
            "retry_rate": rate,
            "circuit_open": key in self._open_circuits,
        }

The guard operates on two tracks simultaneously. The per-call track estimates the worst-case cost of a call before issuing it — max_retries=2 means up to three LLM calls, and the budget deduction happens upfront. If retries are not actually consumed, the unused portion is refunded. This prevents the situation where the last call in a session burns through the remaining budget on retries. The per-schema track accumulates the actual retry rate for each Pydantic model class and opens the circuit when the rate exceeds the threshold. A schema with a 45% retry rate after 10 calls has a validation problem that no amount of retrying will fix — it needs a prompt or validator change, not another attempt.
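The reserve-then-refund accounting on the per-call track can be sketched in isolation. `SimpleBudget` below is a hypothetical stand-in for a budget tracker, not the API of any particular library, and `extraction_cost` mirrors the geometric estimate used by the guard (each retry 1.35× the previous call).

```python
class SimpleBudget:
    """Hypothetical stand-in for a budget tracker: reserve up front, refund later."""

    def __init__(self, cap_usd: float) -> None:
        self.cap = cap_usd
        self.spent = 0.0

    def reserve(self, amount: float) -> None:
        if self.spent + amount > self.cap:
            raise RuntimeError("reservation would exceed the session budget")
        self.spent += amount

    def refund(self, amount: float) -> None:
        self.spent = max(0.0, self.spent - amount)


def extraction_cost(base: float, retries: int, multiplier: float = 1.35) -> float:
    """Original call plus `retries` retries, each `multiplier` times the previous call."""
    return sum(base * multiplier**i for i in range(retries + 1))


budget = SimpleBudget(cap_usd=1.0)
worst_case = extraction_cost(0.025, retries=2)  # reserve for original + 2 retries
budget.reserve(worst_case)

retries_used = 1  # pretend the first retry produced a valid object
budget.refund(worst_case - extraction_cost(0.025, retries_used))
print(f"reserved ${worst_case:.4f}, final charge ${budget.spent:.4f}")
```

The refund is only as accurate as the retry count fed into it. The guard above treats every successful call as a first-try success, so an accurate version would need to observe how many attempts were actually made.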

Failure mode 2: batch extraction all-or-nothing retry

Extracting a list of structured objects from a document is one of the most common Instructor use cases: entity extraction from text, line-item parsing from invoices, event extraction from logs. The natural model is response_model=list[Item]. Instructor treats the entire list as a single extraction unit — if any item in the returned list fails Pydantic validation, the entire list extraction is retried from scratch. The model re-generates all N items, paying input tokens for the full document context again, plus output tokens for all N items.

The cost amplification scales with list size. If a document should produce 30 items and one of them consistently fails validation, every retry regenerates all 30 items, so the extraction is paid for 1 + max_retries times. At $0.02 per extraction of 30 items, with max_retries=3, the cost is $0.08 instead of $0.02: 4× for a single failing item out of 30. The failing item is usually the edge case: a date in an unusual format, a name with special characters, a numeric field where the source text is ambiguous. The 29 valid items are regenerated on every retry, paying their token cost again with zero benefit.
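To make the effect of chunking concrete, the sketch below counts how many items the model has to generate when one chunk never validates. The function name and the simplification are assumptions for illustration: it counts output items only and ignores input tokens, which are also re-paid on every retry.

```python
from typing import List


def items_generated(
    total_items: int,
    chunk_size: int,
    failing_chunks: int,
    max_retries: int,
) -> int:
    """Items the model must generate when `failing_chunks` chunks never validate.

    A chunk_size equal to total_items models one all-or-nothing list extraction.
    """
    sizes: List[int] = [
        min(chunk_size, total_items - start)
        for start in range(0, total_items, chunk_size)
    ]
    # A failing chunk is generated once per attempt; a valid chunk only once.
    attempts = [1 + max_retries if i < failing_chunks else 1 for i in range(len(sizes))]
    return sum(size * n for size, n in zip(sizes, attempts))


whole = items_generated(30, chunk_size=30, failing_chunks=1, max_retries=3)
chunked = items_generated(30, chunk_size=10, failing_chunks=1, max_retries=3)
print(whole, chunked)
```

Under these assumptions the all-or-nothing extraction generates 120 items while three chunks of ten generate 60, because only the chunk containing the bad item is repeated.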

The guard tries one full-document extraction first and, if that fails, splits the document into smaller chunks with a tight per-chunk retry limit, so a single bad item can only force re-extraction of its own chunk rather than the whole batch.

Python — batch extraction guard for Instructor list[Item] extractions
from typing import Type, TypeVar
import anthropic
import instructor
from pydantic import BaseModel
from runguard import BudgetTracker, BudgetExceededError

T = TypeVar("T", bound=BaseModel)

class BatchExtractionGuard:
    """
    Tries a full-document list[Item] extraction first, then falls back to
    per-chunk extraction so one failing item cannot force a full-batch retry.
    Items from chunks that validate are preserved; a chunk that still fails
    after its retry is skipped as a whole.
    """
    def __init__(
        self,
        chunk_size: int = 10,
        max_item_retries: int = 1,
        session_budget_usd: float = 2.0,
        cost_per_item_usd: float = 0.002,
    ):
        self.chunk_size = chunk_size
        self.max_item_retries = max_item_retries
        self.budget = BudgetTracker(cap=session_budget_usd)
        self.cost_per_item = cost_per_item_usd

    def _build_chunk_prompt(
        self,
        document: str,
        item_model: Type[BaseModel],
        chunk_index: int,
        total_chunks: int,
        target_count: int,
    ) -> list[dict]:
        schema_description = "\n".join(
            f"- {name}: {info.get('description', 'required field')}"
            for name, info in item_model.model_json_schema()
            .get("properties", {}).items()
        )
        return [
            {
                "role": "user",
                "content": (
                    f"Extract structured {item_model.__name__} items from the following document "
                    f"(chunk {chunk_index + 1} of {total_chunks}, targeting ~{target_count} items).\n\n"
                    f"Schema fields:\n{schema_description}\n\n"
                    f"Document:\n{document}\n\n"
                    f"Return exactly the items found in this chunk. Do not invent items. "
                    f"If a field value is unclear or missing, use null rather than guessing."
                ),
            }
        ]

    def extract_list(
        self,
        item_model: Type[T],
        documents: list[str],
        model: str = "claude-haiku-4-5-20251001",
    ) -> list[T]:
        """
        Extracts a list of item_model instances from each document.
        Processes documents in chunks; preserves valid items from partially-failed chunks.
        """

        base_client = anthropic.Anthropic()
        client = instructor.from_anthropic(base_client)

        # Wrapper model for list extraction
        class ItemList(BaseModel):
            items: list[item_model]  # type: ignore[valid-type]

        all_results: list[T] = []

        for doc_idx, document in enumerate(documents):
            # Budget check per document: reserve the first pass plus its retries
            estimated_doc_cost = self.cost_per_item * self.chunk_size * (self.max_item_retries + 1)
            try:
                self.budget.add(estimated_doc_cost)
            except BudgetExceededError:
                break  # Stop processing documents when budget exhausted

            # Try full document extraction first
            try:
                full_result = client.messages.create(
                    model=model,
                    max_tokens=2048,
                    messages=self._build_chunk_prompt(
                        document, item_model, 0, 1, self.chunk_size
                    ),
                    response_model=ItemList,
                    max_retries=self.max_item_retries,
                )
                all_results.extend(full_result.items)
                # Refund unused budget — extraction succeeded on first try
                self.budget.spent = max(0.0, self.budget.spent - estimated_doc_cost * 0.7)
                continue

            except Exception:
                pass  # Full extraction failed — fall through to chunked approach

            # Chunked fallback: split document into sections and extract separately
            words = document.split()
            chunk_word_count = max(200, len(words) // max(1, len(words) // 400 + 1))
            chunks = []
            for i in range(0, len(words), chunk_word_count):
                chunks.append(" ".join(words[i:i + chunk_word_count]))

            for chunk_idx, chunk in enumerate(chunks):
                try:
                    chunk_result = client.messages.create(
                        model=model,
                        max_tokens=1024,
                        messages=self._build_chunk_prompt(
                            chunk, item_model, chunk_idx, len(chunks),
                            max(3, self.chunk_size // len(chunks))
                        ),
                        response_model=ItemList,
                        max_retries=1,  # One retry per chunk maximum
                    )
                    all_results.extend(chunk_result.items)
                except Exception:
                    # This chunk failed even with retry — skip rather than re-extract
                    # Valid items from other chunks are preserved
                    continue

        return all_results

The two-pass design is the key: a single full-document extraction attempt first, which succeeds for well-structured documents and costs the minimum, followed by chunked extraction only when the full pass fails. Chunking prevents a single bad item from forcing re-extraction of the entire document, because each chunk succeeds or fails independently. The cost_per_item estimate is intentionally conservative and is reserved upfront, so the budget ceiling is checked before the first API call for a given document rather than after; as written, the chunked fallback calls are not reserved separately, so a production version would add a per-chunk check. Items extracted from chunks that succeed are preserved in all_results regardless of what happens with later chunks; a chunk that still fails after its one retry is skipped entirely, including any valid items it contained.

Failure mode 3: context accumulation in retry chains

Instructor's retry mechanism passes each failed attempt's output back to the model as additional context. The message history for retry N looks like: original system prompt + original user message + [assistant output 1 that failed validation] + [user message: "that output failed validation with error: X, please try again"] + [assistant output 2 that failed validation] + [its error message] + ... and so on through the N failed attempts so far. This accumulation means each successive retry pays for all previous failed outputs plus their error messages in the input token count.

For simple schemas with short field descriptions, this overhead is minor: 100–200 extra input tokens per retry. For complex schemas with rich field descriptions, nested objects, and multi-field validators that produce verbose error messages, the overhead compounds significantly. A schema with a 400-token description and a validator that generates 200-token error messages produces retry context of 600, 1,200, and 1,800 extra tokens for retries 1, 2, and 3. At Claude Sonnet 4.6 input rates ($0.000003/token), retry 3 costs $0.0054 more than the original call in context overhead alone (1,800 × $0.000003), before counting the output tokens for the regenerated object.
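The progression above can be reproduced with a few lines. This sketch assumes, as one reading of the figures in the text, that each failed round appends roughly 400 tokens of regenerated output plus a 200-token error message; the function name is illustrative.

```python
from typing import List

INPUT_USD_PER_TOKEN = 0.000003  # input rate quoted in the text above


def retry_context_overhead(output_tokens: int, error_tokens: int, retries: int) -> List[int]:
    """Extra input tokens carried by retries 1..N when every failed round is appended."""
    per_round = output_tokens + error_tokens
    return [per_round * n for n in range(1, retries + 1)]


overhead = retry_context_overhead(400, 200, retries=3)
extra_cost_retry_3 = overhead[-1] * INPUT_USD_PER_TOKEN
total_extra_cost = sum(overhead) * INPUT_USD_PER_TOKEN
print(overhead, f"${extra_cost_retry_3:.4f}", f"${total_extra_cost:.4f}")
```

Summed over all three retries, the accumulated context adds 3,600 input tokens, or about $0.0108 per fully failed extraction at that rate, on top of the three repeated base calls.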

The guard monitors context size growth across retries and short-circuits when context accumulation indicates the error is structural (the model is not converging) rather than transient.

Python — context accumulation guard for deep retry chains in Instructor
import anthropic
import instructor
from pydantic import BaseModel, ValidationError
from runguard import BudgetTracker, BudgetExceededError

class ContextAccumulationGuard:
    """
    Monitors context size growth across Instructor retries.
    Short-circuits when the accumulated retry context signals a structural
    validation failure (model not converging toward a valid output).
    Also caps absolute context size to prevent runaway token spend.
    """
    def __init__(
        self,
        max_context_tokens: int = 4000,
        max_context_growth_factor: float = 2.0,
        session_budget_usd: float = 0.50,
    ):
        self.max_context_tokens = max_context_tokens
        self.max_growth_factor = max_context_growth_factor
        self.budget = BudgetTracker(cap=session_budget_usd)
        self._base_client = anthropic.Anthropic()
        self._instructor_client = instructor.from_anthropic(self._base_client)

    def _estimate_tokens(self, messages: list[dict]) -> int:
        total_chars = sum(
            len(m.get("content", "") if isinstance(m.get("content"), str)
               else str(m.get("content", "")))
            for m in messages
        )
        return total_chars // 4  # rough char-to-token estimate

    def extract_with_context_guard(
        self,
        response_model: type[BaseModel],
        messages: list[dict],
        model: str = "claude-sonnet-4-6",
        max_retries: int = 2,
    ) -> BaseModel:
        """
        Extracts a structured object, monitoring context token growth across retries.
        Raises RuntimeError if context exceeds ceiling or grows faster than expected
        for a converging model (indicating structural rather than transient failure).
        """
        base_tokens = self._estimate_tokens(messages)

        # Pre-check: reject if base context already near ceiling
        if base_tokens > self.max_context_tokens * 0.8:
            raise RuntimeError(
                f"Base context ({base_tokens} tokens estimated) already near ceiling "
                f"({self.max_context_tokens} tokens). Reduce document size or simplify schema "
                "before attempting extraction."
            )

        retry_messages = list(messages)
        last_error = None
        last_output = ""

        for attempt in range(max_retries + 1):
            # Context size check before each attempt
            current_tokens = self._estimate_tokens(retry_messages)
            if current_tokens > self.max_context_tokens:
                raise RuntimeError(
                    f"Context size ({current_tokens} tokens) exceeded ceiling "
                    f"({self.max_context_tokens} tokens) after {attempt} retries. "
                    "Structural validation failure: model is not converging. "
                    f"Last error: {str(last_error)[:200] if last_error else 'unknown'}"
                )

            # Growth factor check: if context grew more than expected, model is not converging
            if attempt > 0:
                growth_factor = current_tokens / max(1, base_tokens)
                if growth_factor > self.max_growth_factor:
                    raise RuntimeError(
                        f"Context grew {growth_factor:.1f}× from base ({base_tokens} → "
                        f"{current_tokens} tokens) after {attempt} retries. "
                        "Model is not converging toward a valid output — likely a structural "
                        "mismatch between the schema and the source document."
                    )

            # Budget check
            call_cost_estimate = current_tokens * 0.000003 + 400 * 0.000015
            try:
                self.budget.add(call_cost_estimate)
            except BudgetExceededError:
                raise RuntimeError(
                    f"Budget exhausted on attempt {attempt + 1}. "
                    f"Spent: ${self.budget.spent:.4f} / ${self.budget.cap:.2f}"
                )

            # Attempt extraction
            try:
                result = self._instructor_client.messages.create(
                    model=model,
                    max_tokens=1024,
                    messages=retry_messages,
                    response_model=response_model,
                    max_retries=0,  # We handle retries manually to track context
                )
                return result

            except (ValidationError, Exception) as exc:
                last_error = exc

                if attempt == max_retries:
                    raise RuntimeError(
                        f"Extraction failed after {max_retries + 1} attempts. "
                        f"Final context size: {current_tokens} tokens. "
                        f"Last error: {str(exc)[:300]}"
                    ) from exc

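                # Add error context for next retry (this is what causes accumulation).
                # Assumption: recent Instructor versions attach the failed response to
                # the exception as `last_completion`; if the attribute is missing, the
                # "[no output]" placeholder below is sent instead.
                last_output = str(getattr(exc, "last_completion", None) or "")
                error_msg = str(exc)[:300]
                retry_messages = retry_messages + [
                    {"role": "assistant", "content": last_output[:500] if last_output else "[no output]"},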
                    {
                        "role": "user",
                        "content": (
                            f"That output failed validation with the following error:\n{error_msg}\n\n"
                            "Please correct only the failing fields and return the full valid object."
                        ),
                    },
                ]

Taking manual control of the retry loop — setting max_retries=0 on the Instructor call and handling retries in the guard's own loop — is necessary to insert the growth factor check between attempts. With Instructor's built-in retry mechanism, there is no hook between individual retry attempts. The max_growth_factor=2.0 threshold works as follows: a model that is genuinely converging toward a valid output should produce shorter, more targeted corrections on each retry, which means context should grow sub-linearly. A model that is not converging — because the schema constraint cannot be satisfied with the available context — produces equally long (or longer) failed outputs on each retry, causing context to grow roughly linearly with attempt count. When growth exceeds 2×, the loop is structural rather than convergent, and the guard raises rather than allowing more attempts.
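The growth-factor rule can be examined on its own with simulated token counts. The helper below is a hypothetical sketch of the same comparison the guard performs (current context divided by base context, checked before each retry); the token figures are made up for illustration.

```python
from typing import List, Optional


def first_tripped_attempt(
    base_tokens: int,
    tokens_added_per_retry: List[int],
    max_growth_factor: float = 2.0,
) -> Optional[int]:
    """Return the retry number at which the growth check would fire, else None.

    Attempt 0 is the original call; the check runs before retries 1, 2, ...
    """
    current = base_tokens
    for attempt, added in enumerate(tokens_added_per_retry, start=1):
        current += added
        if current / max(1, base_tokens) > max_growth_factor:
            return attempt
    return None


# Non-converging: every failed round appends a full 600-token output plus error.
stalled = first_tripped_attempt(1000, [600, 600, 600])
# Converging: corrections shrink, so total context stays under 2x the base.
converging = first_tripped_attempt(1000, [600, 250, 100])
print(stalled, converging)
```

Because the factor is relative to the base context, a short prompt paired with a long structured output can trip the check on the very first retry (for example, a 300-token base with 600 tokens appended is already 3×), so the threshold may need tuning per schema.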

Failure mode 4: multi-provider fallback amplification

Teams building resilient extraction pipelines often configure a provider fallback chain: if OpenAI is unavailable or returns an error, try Anthropic; if Anthropic fails, try Mistral or Cohere. This is sensible for availability failures (rate limits, API outages, transient errors) but becomes a cost amplifier for validation failures. A Pydantic validator that rejects the OpenAI output will also reject the Anthropic output and the Mistral output — the constraint is in the schema, not in any provider's specific output style. When the fallback chain is combined with max_retries, you pay the full retry budget on every provider before learning that the problem is the schema, not the provider.

The amplification: three providers × (1 original call + 3 retries) = 12 LLM calls for a single schema validation failure. At $0.025 per call, one extraction that should cost $0.025 costs $0.30 in the worst case. In pipelines that process thousands of documents per day, a single overly strict validator with a three-provider fallback burns through budgets that were sized for single-provider operation.
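The worst case is easy to compare against a chain that stops at the first validation failure. The function below is an illustrative sketch with a hypothetical name; it counts calls only and assumes the flat $0.025 per-call figure from the text.

```python
def worst_case_calls(providers: int, max_retries: int, fallback_on_validation: bool) -> int:
    """LLM calls made when every attempt fails schema validation."""
    calls_per_provider = 1 + max_retries
    providers_tried = providers if fallback_on_validation else 1
    return calls_per_provider * providers_tried


COST_PER_CALL_USD = 0.025  # per-call figure used in the text above

naive = worst_case_calls(3, max_retries=3, fallback_on_validation=True)
guarded = worst_case_calls(3, max_retries=3, fallback_on_validation=False)
print(naive, f"${naive * COST_PER_CALL_USD:.2f}")
print(guarded, f"${guarded * COST_PER_CALL_USD:.2f}")
```

Refusing to fall back on validation failures caps the worst case at 4 calls ($0.10) instead of 12 ($0.30), and lowering the per-provider retry count reduces it further.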

The guard distinguishes validation failures (schema constraint mismatch — same on all providers) from infrastructure failures (rate limits, timeouts — provider-specific) before triggering a fallback.

Python — multi-provider fallback guard for Instructor extraction
import anthropic
import instructor
from pydantic import BaseModel, ValidationError
from runguard import BudgetTracker, BudgetExceededError

class ProviderConfig:
    def __init__(self, name: str, client, model: str, cost_per_call: float):
        self.name = name
        self.client = client
        self.model = model
        self.cost_per_call = cost_per_call


class MultiProviderExtractionGuard:
    """
    Manages multi-provider fallback for Instructor extractions.
    Distinguishes validation failures (schema issue — don't fallback) from
    infrastructure failures (rate limit, timeout — do fallback).
    Prevents the fallback chain from amplifying schema validation costs.
    """
    VALIDATION_ERRORS = (ValidationError,)
    INFRASTRUCTURE_ERRORS = (
        anthropic.RateLimitError,
        anthropic.APIConnectionError,
        anthropic.APIStatusError,
    )

    def __init__(
        self,
        providers: list[ProviderConfig],
        max_retries_per_provider: int = 1,
        session_budget_usd: float = 1.0,
    ):
        self.providers = providers
        self.max_retries = max_retries_per_provider
        self.budget = BudgetTracker(cap=session_budget_usd)
        self._validation_failure_counts: dict[str, int] = {}

    def _schema_key(self, response_model: type[BaseModel]) -> str:
        return response_model.__name__

    def extract(
        self,
        response_model: type[BaseModel],
        messages: list[dict],
    ) -> BaseModel:
        """
        Attempts extraction across providers in order. Falls back to the next
        provider only on infrastructure errors, not on validation failures.
        """
        schema_key = self._schema_key(response_model)
        last_exception = None
        last_validation_error = None

        for provider in self.providers:
            # Check if this schema has a known validation failure pattern
            if self._validation_failure_counts.get(schema_key, 0) >= 3:
                raise RuntimeError(
                    f"Schema '{response_model.__name__}' has failed validation "
                    f"{self._validation_failure_counts[schema_key]} times across providers. "
                    "This is a schema constraint issue, not a provider issue. "
                    "Review your Pydantic validators and prompt before retrying."
                )

            # Budget check
            worst_case = provider.cost_per_call * (self.max_retries + 1)
            try:
                self.budget.add(worst_case)
            except BudgetExceededError:
                break

            try:
                result = provider.client.messages.create(
                    model=provider.model,
                    max_tokens=1024,
                    messages=messages,
                    response_model=response_model,
                    max_retries=self.max_retries,
                )
                # Success: refund unused retry budget
                self.budget.spent = max(0.0, self.budget.spent - worst_case + provider.cost_per_call)
                return result

            except self.VALIDATION_ERRORS as exc:
                # Validation failure: record and do NOT fall back to next provider
                # (the next provider will produce the same validation failure)
                self._validation_failure_counts[schema_key] = (
                    self._validation_failure_counts.get(schema_key, 0) + 1
                )
                last_validation_error = exc
                raise RuntimeError(
                    f"Validation failure on provider '{provider.name}' for schema "
                    f"'{response_model.__name__}'. Skipping fallback — validation failures "
                    "are schema-level issues, not provider-specific. "
                    f"Error: {str(exc)[:300]}"
                ) from exc

            except self.INFRASTRUCTURE_ERRORS as exc:
                # Infrastructure failure: fall back to next provider
                last_exception = exc
                # Refund the unused retry budget for this provider since we're falling back
                self.budget.spent = max(0.0, self.budget.spent - worst_case + provider.cost_per_call)
                continue

            except Exception as exc:
                # Unknown error — treat as infrastructure failure and fall back
                last_exception = exc
                continue

        raise RuntimeError(
            f"All providers exhausted for schema '{response_model.__name__}'. "
            f"Last infrastructure error: {str(last_exception)[:200] if last_exception else 'none'}. "
            f"Last validation error: {str(last_validation_error)[:200] if last_validation_error else 'none'}."
        )


def build_anthropic_provider(budget_per_call: float = 0.03) -> ProviderConfig:
    base = anthropic.Anthropic()
    client = instructor.from_anthropic(base)
    return ProviderConfig(
        name="anthropic",
        client=client,
        model="claude-sonnet-4-6",
        cost_per_call=budget_per_call,
    )

The key logic is in the order of the except clauses. VALIDATION_ERRORS is caught before INFRASTRUCTURE_ERRORS and before the catch-all Exception handler. When an extraction exhausts its retries on a consistently failing schema and the resulting error matches VALIDATION_ERRORS, the guard records the failure and raises immediately; it does not fall through to the next provider. (Depending on your Instructor version, that error may surface as InstructorRetryException rather than a bare ValidationError. If so, add it to VALIDATION_ERRORS, or the catch-all handler will treat it as an infrastructure failure and fall back.) The counter is kept per schema on the guard instance, so it persists across calls to extract(): once the same schema has failed validation three times over the guard's lifetime (across different requests, not just within one call), the circuit opens and all future calls for that schema raise before any provider is contacted. This prevents the sustained burn of a systematically misconfigured extractor running across thousands of documents.
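
The per-schema circuit described above can be isolated into a small helper. The following is a minimal sketch rather than the guard's actual implementation: SchemaCircuitBreaker, SchemaCircuitOpen, and the reset() method are hypothetical names introduced for illustration (the guard above never resets its counter).

```python
from collections import Counter


class SchemaCircuitOpen(RuntimeError):
    """Raised when a schema has failed validation too many times."""


class SchemaCircuitBreaker:
    """Counts validation failures per schema and opens after a threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._failures: Counter = Counter()

    def check(self, schema_key: str) -> None:
        """Call before contacting any provider; raises if the circuit is open."""
        count = self._failures[schema_key]  # Counter returns 0 for unseen keys
        if count >= self.threshold:
            raise SchemaCircuitOpen(
                f"Schema '{schema_key}' failed validation {count} times; "
                "review validators and prompt before retrying."
            )

    def record_failure(self, schema_key: str) -> None:
        """Call once per extraction that ended in a validation failure."""
        self._failures[schema_key] += 1

    def reset(self, schema_key: str) -> None:
        """Hypothetical addition: close the circuit after the schema is fixed."""
        self._failures.pop(schema_key, None)


breaker = SchemaCircuitBreaker(threshold=3)
for _ in range(3):
    breaker.check("Invoice")           # circuit is still closed here
    breaker.record_failure("Invoice")  # the extraction failed validation

try:
    breaker.check("Invoice")
    circuit_open = False
except SchemaCircuitOpen:
    circuit_open = True                # fourth call is rejected before any API call
```

Keeping the counter keyed by schema name means one misconfigured extractor cannot block unrelated schemas that share the same guard.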

Combining guards in a production Instructor pipeline

In a real extraction pipeline, these four failure modes can occur at the same time, at different points in the pipeline. A recommended layering for production:

  • The ValidatorRetryGuard wraps every individual client.messages.create(response_model=...) call, providing per-schema retry rate monitoring and circuit breakers at the extractor level.
  • The BatchExtractionGuard handles all list[Item] extractions where documents may contain variable numbers of items, replacing the default all-or-nothing retry with chunked extraction.
  • The ContextAccumulationGuard replaces Instructor's built-in retry mechanism for complex nested schemas where retry context is known to grow quickly — typically schemas with 5+ fields, cross-field validators, or custom validator methods that produce multi-line error messages.
  • The MultiProviderExtractionGuard wraps the outermost extraction call only when your pipeline genuinely requires multi-provider fallback; if you're using a single provider, this guard is unnecessary overhead.

The shared BudgetTracker instance — passed to all four guards — ensures that the guards collectively enforce a single per-session ceiling rather than each running its own independent sub-budget that can add up to a surprise total at session end.
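
As a rough illustration of why sharing matters, the sketch below wires two guards to one tracker. The BudgetTracker and BudgetExceededError here are simplified stand-ins that mimic the interface used in this article (cap, spent, add()); DemoGuard and its reserve() method are hypothetical and exist only for the demonstration.

```python
import threading


class BudgetExceededError(RuntimeError):
    """Raised when adding a cost would push spend past the cap."""


class BudgetTracker:
    """Simplified stand-in: a thread-safe accumulator with a hard cap."""

    def __init__(self, cap: float) -> None:
        self.cap = cap
        self.spent = 0.0
        self._lock = threading.Lock()

    def add(self, cost: float) -> None:
        with self._lock:
            # Check before committing so a rejected cost is never recorded.
            if self.spent + cost > self.cap:
                raise BudgetExceededError(
                    f"adding {cost:.4f} would exceed cap {self.cap:.4f} "
                    f"(spent {self.spent:.4f})"
                )
            self.spent += cost


class DemoGuard:
    """Hypothetical guard that reserves an estimated cost before each call."""

    def __init__(self, name: str, budget: BudgetTracker) -> None:
        self.name = name
        self.budget = budget

    def reserve(self, estimated_cost: float) -> None:
        self.budget.add(estimated_cost)


shared = BudgetTracker(cap=1.0)
retry_guard = DemoGuard("validator-retry", shared)
batch_guard = DemoGuard("batch-extraction", shared)

retry_guard.reserve(0.6)
try:
    batch_guard.reserve(0.5)  # 0.6 + 0.5 would exceed the shared 1.0 cap
    blocked = False
except BudgetExceededError:
    blocked = True
```

With two independent trackers capped at 1.0 each, both reservations would have succeeded and the session could have spent up to 2.0 in total.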

Instructor v1.x hooks note: Recent Instructor versions expose a hooks API on the patched client, for example client.on("completion:response", callback), which fires after every LLM call, including retries. You can use this hook to record actual token counts and update a shared BudgetTracker with measured costs rather than estimates. This gives tighter budget enforcement than cost estimates and removes the need to refund unused retry budget, because the tracker knows exactly what each call cost. For retry rate monitoring, the error hooks are the cleanest insertion point because they do not require taking manual control of the retry loop: per the Instructor hooks documentation, parse:error is emitted when a response fails parsing or Pydantic validation, while completion:error covers errors raised by the completion call itself. Check which events your installed version emits before relying on either.
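
A handler for the completion:response hook might look like the sketch below. It assumes the hook passes the raw provider response to the callback and that the response exposes Anthropic-style usage.input_tokens and usage.output_tokens (OpenAI responses name these fields differently). MeasuredCost is a hypothetical helper, and the per-million-token prices are placeholders, not real rates. The registration line is left as a comment so the example runs without an API client.

```python
from types import SimpleNamespace


class MeasuredCost:
    """Accumulates measured spend from response usage metadata."""

    def __init__(self, usd_per_mtok_in: float, usd_per_mtok_out: float) -> None:
        self.usd_per_mtok_in = usd_per_mtok_in
        self.usd_per_mtok_out = usd_per_mtok_out
        self.calls = 0
        self.spent_usd = 0.0

    def on_response(self, response) -> None:
        """Intended as the callback for the completion:response hook."""
        usage = getattr(response, "usage", None)
        if usage is None:
            return  # nothing to measure on this response
        self.calls += 1
        self.spent_usd += (
            usage.input_tokens * self.usd_per_mtok_in
            + usage.output_tokens * self.usd_per_mtok_out
        ) / 1_000_000


# Placeholder prices in USD per million tokens (assumption, not real rates).
meter = MeasuredCost(usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)

# With a real Instructor client the registration would be:
#   client.on("completion:response", meter.on_response)

# Simulated responses for one extraction that needed a single retry; the
# second call has a larger prompt because the error context was appended.
meter.on_response(
    SimpleNamespace(usage=SimpleNamespace(input_tokens=1000, output_tokens=200))
)
meter.on_response(
    SimpleNamespace(usage=SimpleNamespace(input_tokens=1400, output_tokens=200))
)
```

Because the handler sees every call, including retries, meter.calls divided by the number of extractions gives a measured retry overhead for free.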

Summary: Instructor cost amplification patterns

| Pattern | Cost multiplier | Guard |
| --- | --- | --- |
| Validator retry storm (max_retries=3, consistent validation failure) | Up to (max_retries + 1)× per call | Per-schema retry rate circuit breaker; upfront budget deduction |
| Batch all-or-nothing retry (list[Item] with one failing item) | (max_retries + 1)× full document tokens | Chunked extraction; per-chunk independent retry; preserve valid items |
| Context accumulation (complex schemas with long error messages) | 1.35–4× per retry for input tokens | Context growth factor check; circuit break at 2× base context |
| Multi-provider fallback (providers × retries for schema failures) | N providers × (max_retries + 1)× | Distinguish validation vs infrastructure errors; no fallback on ValidationError |
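
To see how these multipliers compound, the sketch below estimates worst-case input tokens for one extraction under a simple assumed model: every failed attempt appends the model's previous output and the validation error to the next prompt, and every provider in a fallback chain repeats the full retry sequence. The function name and token figures are illustrative assumptions, not measurements.

```python
def worst_case_input_tokens(
    base_prompt_tokens: int,
    output_tokens: int,
    error_tokens: int,
    max_retries: int,
    providers: int = 1,
) -> int:
    """Total input tokens billed if every attempt fails validation.

    Attempt 0 sends the base prompt; attempt k additionally carries k rounds
    of (previous output + validation error) as retry context.
    """
    per_provider = sum(
        base_prompt_tokens + attempt * (output_tokens + error_tokens)
        for attempt in range(max_retries + 1)
    )
    return providers * per_provider


single_call = 1_000
total = worst_case_input_tokens(
    base_prompt_tokens=1_000, output_tokens=200, error_tokens=100, max_retries=3
)
multiplier = total / single_call  # input-token multiplier vs. one clean call
```

Under these assumptions the input-token multiplier exceeds (max_retries + 1) because each retry is more expensive than the one before it, which is the context accumulation effect layered on top of the retry storm.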

Frequently asked questions

Does Instructor have built-in cost controls or token budgets?

Instructor does not include native cost controls, token budgets, or circuit breakers. It provides max_retries to cap the number of retry attempts and, in recent versions, a hooks system for observing calls. max_retries prevents infinite retry loops but does not distinguish between validation failures that will never succeed and transient failures that will. A max_retries=3 on a consistently-failing schema always burns all four calls before raising — there is no early exit based on failure patterns or accumulated spend. The guards above build that logic on top of the hooks system or by wrapping the client directly.

What is the most common cause of high Instructor retry rates in production?

In order of frequency: (1) enum constraints where the prompt does not enumerate valid values, so the LLM generates a plausible synonym that fails the Literal["A", "B", "C"] validator; (2) date and time validators that enforce business-specific formats (ISO 8601 vs. "Month DD, YYYY") without specifying the format in the field description; (3) cross-field validators using @model_validator(mode="after") that check relationships between fields, where the LLM generates each field independently without understanding the constraint; (4) numeric range validators without range hints in the field description. All four are fixable by adding constraint information to Field(description="...") rather than encoding it only in the validator logic.
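
A schema that states its constraints in the field descriptions might look like the following sketch. It assumes Pydantic v2; the Invoice model, its field names, and its limits are hypothetical examples chosen to cover the four cases above.

```python
from datetime import date
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, model_validator


class Invoice(BaseModel):
    # (1) Enum: list the allowed values in the description, not only in the type.
    status: Literal["paid", "unpaid", "overdue"] = Field(
        description='Exactly one of "paid", "unpaid", "overdue". Do not use synonyms.'
    )
    # (2) Dates: state the expected format explicitly.
    issued_on: date = Field(description="Issue date in ISO 8601 format, YYYY-MM-DD.")
    # (3) Cross-field rule: repeat it in the description of the dependent field.
    due_on: date = Field(
        description="Due date in ISO 8601 format, YYYY-MM-DD. "
        "Must be on or after issued_on."
    )
    # (4) Numeric range: give the bounds as a hint as well as a constraint.
    total_usd: float = Field(
        ge=0,
        le=1_000_000,
        description="Invoice total in US dollars, between 0 and 1,000,000.",
    )

    @model_validator(mode="after")
    def due_not_before_issue(self) -> "Invoice":
        if self.due_on < self.issued_on:
            raise ValueError("due_on must be on or after issued_on")
        return self


# The descriptions travel with the JSON schema that is sent to the model.
schema = Invoice.model_json_schema()

# A plausible synonym still fails validation, which is what triggers a retry.
try:
    Invoice(
        status="settled",
        issued_on="2024-01-05",
        due_on="2024-02-05",
        total_usd=120.0,
    )
except ValidationError as exc:
    failing_field = exc.errors()[0]["loc"]
```

The validators stay in place as a safety net; the descriptions exist to make the first attempt succeed so the retry path is rarely taken.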

How does Instructor handle validation failures differently for Anthropic vs OpenAI?

Instructor uses different structured output mechanisms depending on the provider and on the mode the client is configured with. With OpenAI in a structured-outputs mode, the request carries response_format={"type": "json_schema", "json_schema": ...}, which enforces schema compliance at the API level: the model is constrained to produce valid JSON matching the schema, so only Pydantic business-logic validators (not JSON structure validators) can fail. With Anthropic, Instructor uses the tool_use API with a schema-shaped tool, where the model is instructed to call the tool with valid arguments but is not hard-constrained. Anthropic extractions therefore tend to have a higher baseline validation failure rate, because the model can deviate from the schema structure in ways that OpenAI's constrained decoding prevents. As a result, the retry storm failure mode is more pronounced with Anthropic than with OpenAI for structurally complex schemas. OpenAI tool-calling modes without strict schema enforcement behave more like the Anthropic case, so check which mode your client actually uses.

Should I use Claude Haiku or Claude Sonnet for Instructor extractions?

For extractions with well-defined schemas and clear source text, Haiku extracts reliably at a substantially lower per-token price than Sonnet (the exact ratio depends on the model generation, so check current pricing) and is the right default. Reserve Sonnet for: schemas with complex cross-field validators where understanding the semantic relationship between fields matters; documents with ambiguous or noisy source text where Haiku produces higher retry rates; and extraction tasks where the output quality difference justifies the cost differential. A practical approach is to run a shadow comparison: extract a sample of your production documents with both models, compare retry rates and output quality, and use the result to pick the model per schema class. Compare effective cost per extraction (per-call cost multiplied by the expected number of calls, retries included) rather than list price alone. Haiku only becomes the more expensive choice for a schema when its retry overhead outweighs the price gap, which takes far more than a modestly higher retry rate when the gap is several-fold.
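
The shadow comparison can be reduced to a small calculation. The sketch below assumes each attempt fails validation independently with a fixed probability and ignores the growth of retry context; the function names, per-call prices, and failure rates are placeholders to be replaced with measured values.

```python
def expected_calls(failure_rate: float, max_retries: int) -> float:
    """Expected LLM calls per extraction.

    Attempt k + 1 only happens if the first k attempts all failed, which has
    probability failure_rate ** k, for k = 0 .. max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


def cost_per_extraction(
    cost_per_call: float, failure_rate: float, max_retries: int = 3
) -> float:
    """Effective cost once retries are factored in."""
    return cost_per_call * expected_calls(failure_rate, max_retries)


# Placeholder inputs: a cheaper model that fails more often vs. a pricier one.
cheap = cost_per_extraction(cost_per_call=0.001, failure_rate=0.25)
pricey = cost_per_extraction(cost_per_call=0.010, failure_rate=0.10)

# The cheaper model wins as long as the price ratio exceeds this value.
break_even_price_ratio = expected_calls(0.25, 3) / expected_calls(0.10, 3)
cheaper_model_still_wins = cheap < pricey
```

With max_retries=3 the expected call count can never exceed 4, so under this model a cheaper model with a price advantage above 4× stays cheaper regardless of its retry rate; output quality, not cost, is then the deciding factor.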

How does RunGuard integrate with Instructor?

RunGuard's BudgetTracker and LoopDetector primitives are the building blocks used in all four guards above. BudgetTracker provides a thread-safe accumulator with a configurable cap that raises BudgetExceededError when adding a cost would exceed the cap. It works at the call site (before the API call) or after (with actual token counts from the response). LoopDetector identifies repeated signatures across a session: pass it the schema key on every extraction call and it trips after N consecutive extractions of the same schema with the same validation error signature. For Instructor specifically, the cleanest integration is via the client's hooks: register a handler with client.on("parse:error", handler) for validation failures (and client.on("completion:error", handler) for errors raised by the completion call itself), where the handler calls budget.add(estimated_cost) and loop_detector.step(schema_key) on every failed attempt.

