LLM agent fault tolerance patterns: retry, circuit breaker, fallback, and graceful degradation
LLM agents fail in ways that traditional web services do not. A web service either returns a response or throws an exception — faults are discrete, per-request events. An LLM agent runs over multiple turns, accumulates state, and has failure modes that compound: a 429 rate-limit error on turn 3 triggers a retry loop that burns budget on turns 4 through 8; a context window approaching its limit causes model quality to silently degrade across turns 10 through 15 before a hard truncation error fires on turn 16; a tool that returns an unexpected error type causes the model to try every variant it can think of, each variant burning tokens and money. Standard microservice fault tolerance patterns — retry with exponential backoff, circuit breakers, health checks — address transient infrastructure failures. They do not address the unique failure modes of multi-turn, stateful, cost-accumulating systems. This guide covers four fault tolerance patterns adapted for LLM agents: (1) retry with jitter for transient provider errors, (2) circuit breaking for tool-call loops and retry storms, (3) model fallback for provider outages and budget exhaustion, and (4) graceful degradation when the current session cannot complete its task within constraints.
Pattern 1: retry with jitter for transient LLM API errors
- The transient errors that need retry. LLM APIs return four categories of transient errors that warrant retry: 429 rate-limit (requests per minute or tokens per minute exceeded), 500 internal server error (provider-side transient fault), 503 service unavailable (provider maintenance or overload), and 529 overloaded (Anthropic-specific). Each of these resolves within seconds to minutes without any change to the request. A 400 bad request or a 401 authentication error does not warrant retry: an unchanged request fails the same way every time, so retrying only wastes requests and wall-clock time.
Python: retry with exponential backoff and jitter
```python
import random
import time

import anthropic
from runguard import guard, LoopDetectedError, BudgetExceededError

# Disable the SDK's built-in retries so they do not stack on top of this policy.
client = anthropic.Anthropic(max_retries=0)

RETRYABLE_STATUS_CODES = {429, 500, 503, 529}
MAX_RETRIES = 4
BASE_DELAY_S = 1.0
MAX_DELAY_S = 60.0


def call_with_retry(
    messages: list,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 1024,
) -> anthropic.types.Message:
    """
    Call the Anthropic API with exponential backoff + full jitter.

    Full jitter: sleep = random(0, min(cap, base * 2**attempt))
    This avoids thundering-herd when many agents retry simultaneously.
    """
    for attempt in range(MAX_RETRIES + 1):
        try:
            return client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages,
            )
        except anthropic.APIStatusError as e:
            if e.status_code not in RETRYABLE_STATUS_CODES:
                raise  # non-retryable: bad request, auth error, etc.
            if attempt == MAX_RETRIES:
                raise  # exhausted retries
            cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            delay = random.uniform(0, cap)
            time.sleep(delay)
        except anthropic.APIConnectionError:
            if attempt == MAX_RETRIES:
                raise
            cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            delay = random.uniform(0, cap)
            time.sleep(delay)


def run_agent(task: str, budget_usd: float = 5.0) -> str:
    """Agent that retries transient errors without burning budget on loops."""
    messages = [{"role": "user", "content": task}]

    def inner_turn(turn_input):
        response = call_with_retry(messages)
        return response.content[0].text

    guarded = guard(
        inner_turn,
        max_budget_usd=budget_usd,
        max_loop_repeats=3,
    )

    for turn in range(20):
        try:
            result = guarded(f"turn_{turn}")
            messages.append({"role": "assistant", "content": result})
            if "[DONE]" in result:
                return result
            # Keep user/assistant turns alternating so the next call is a new turn,
            # not a continuation of the previous assistant message.
            messages.append({"role": "user", "content": "Continue. Reply with [DONE] when finished."})
        except LoopDetectedError:
            return "[FAULT] Loop detected: tool call pattern repeated 3 times."
        except BudgetExceededError:
            return "[FAULT] Budget exceeded: agent stopped before overspend."

    return "[FAULT] Max turns reached without completing task."
```
- Why full jitter beats fixed backoff for agents. If your production system runs 50 agents simultaneously and all of them hit a 429 at the same time (token-per-minute limit exhausted by a spike), fixed-delay retry causes all 50 to retry at the same second, hitting the rate limit again. Full jitter distributes retries across the backoff window, reducing the chance of a synchronized second hit. The `random.uniform(0, cap)` formula is the simplest correct implementation. Compared with equal jitter (`cap / 2 + random.uniform(0, cap / 2)`), it has a lower expected delay (`cap / 2` versus `3 * cap / 4`) and a wider spread, and the wider spread is what breaks up synchronized retries.
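To make the difference between the two jitter strategies concrete, the following sketch samples both and compares their mean and spread. The helper names (`full_jitter`, `equal_jitter`, `summarize`) and the 8-second cap are assumptions chosen for illustration, not part of any library.

```python
import random
import statistics


def full_jitter(cap: float, rng: random.Random) -> float:
    """Sleep anywhere in [0, cap]."""
    return rng.uniform(0, cap)


def equal_jitter(cap: float, rng: random.Random) -> float:
    """Sleep at least cap / 2, plus a random amount up to cap / 2."""
    return cap / 2 + rng.uniform(0, cap / 2)


def summarize(strategy, cap: float, samples: int = 20_000, seed: int = 0) -> tuple[float, float]:
    """Return (mean, standard deviation) of the sampled delays."""
    rng = random.Random(seed)  # seeded so the comparison is repeatable
    delays = [strategy(cap, rng) for _ in range(samples)]
    return statistics.mean(delays), statistics.pstdev(delays)


full_mean, full_std = summarize(full_jitter, cap=8.0)
equal_mean, equal_std = summarize(equal_jitter, cap=8.0)
print(f"full jitter:  mean={full_mean:.2f}s  stdev={full_std:.2f}s")
print(f"equal jitter: mean={equal_mean:.2f}s  stdev={equal_std:.2f}s")
```

Full jitter should show the lower mean and the larger standard deviation. Equal jitter guarantees a minimum wait of half the cap, which can be preferable when retrying too early is itself costly.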
Pattern 2: circuit breaking for tool-call loops and retry storms
- Why standard retry logic creates agent loops. Standard retry-with-backoff works well for transient network errors. It does not work for tool-call loops: when an agent calls the same tool with the same arguments multiple times because the tool returned an unexpected result the model does not know how to handle, each retry is not addressing a transient fault — it is re-running the same failed strategy. The loop may look like success from a latency standpoint (no 5xx errors) while silently burning hundreds of tool invocations and thousands of tokens per minute.
Python: loop circuit breaker with RunGuard
```python
from runguard import LoopDetector, LoopDetectedError

# The circuit breaker tracks tool call signatures across turns.
# A "signature" is a canonical representation of what was called
# and what the critical input parameters were.
detector = LoopDetector(repeats=3, max_cycle_len=4)


def tool_router(tool_name: str, tool_args: dict) -> str:
    """
    Route a tool call, checking for loops before execution.
    Returns tool output or raises LoopDetectedError if a loop is detected.
    """
    # Build a canonical signature: tool name + key discriminating args.
    # Drop volatile args (timestamps, request IDs) that would prevent loop detection.
    discriminating_args = {
        k: v
        for k, v in tool_args.items()
        if k not in {"request_id", "timestamp", "trace_id"}
    }
    sig = f"{tool_name}:{sorted(discriminating_args.items())}"

    match = detector.record(sig)
    if match:
        raise LoopDetectedError(
            f"Tool call loop detected: '{tool_name}' called in pattern "
            f"repeating {match.repeats}x with cycle length {match.cycle_length}",
            match=match,
        )

    # Execute the actual tool
    return dispatch_tool(tool_name, tool_args)


def dispatch_tool(name: str, args: dict) -> str:
    """Placeholder: replace with your actual tool dispatch."""
    if name == "search":
        return f"Search results for: {args.get('query', '')}"
    if name == "read_file":
        return f"File contents of: {args.get('path', '')}"
    raise ValueError(f"Unknown tool: {name}")
```
TypeScript: circuit breaker for tool loops
```typescript
import { LoopDetector, LoopDetectedError } from "@runguard/sdk";

const detector = new LoopDetector({ repeats: 3, maxCycleLen: 4 });

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function toolSignature(call: ToolCall): string {
  // Strip volatile args so repeated calls produce the same signature.
  const { request_id, timestamp, trace_id, ...discriminating } = call.args;
  return `${call.name}:${JSON.stringify(Object.entries(discriminating).sort())}`;
}

function executeWithCircuitBreaker(call: ToolCall): string {
  const sig = toolSignature(call);
  const match = detector.record(sig);
  if (match) {
    throw new LoopDetectedError(
      `Tool loop: '${call.name}' repeated ${match.repeats}x`,
      match,
    );
  }
  return dispatchTool(call);
}

function dispatchTool(call: ToolCall): string {
  switch (call.name) {
    case "search":
      return `Results for: ${call.args.query}`;
    case "read_file":
      return `Contents of: ${call.args.path}`;
    default:
      throw new Error(`Unknown tool: ${call.name}`);
  }
}
```
- Choosing the right cycle length for your agent. The `max_cycle_len=4` setting catches sequences of up to 4 distinct tool calls that repeat as a block. For a coding agent that calls `read_file` → `edit_file` → `run_tests` → `read_test_output` in a loop, `max_cycle_len=4` catches the pattern after 3 complete cycles. For agents with longer deliberation sequences, increase this value. For agents that primarily call one tool, `max_cycle_len=1` with `repeats=3` is sufficient and trips faster.
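To show what `repeats` and `max_cycle_len` mean mechanically, here is a minimal self-contained detector. It is a hypothetical stand-in written for illustration (`SimpleLoopDetector` is an assumed name) and makes no claim about how RunGuard's `LoopDetector` is implemented.

```python
from collections import deque
from typing import Optional


class SimpleLoopDetector:
    """Illustrative cycle detector over tool-call signatures (not RunGuard's code)."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 4) -> None:
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # Only the most recent repeats * max_cycle_len signatures can form a match.
        self.history: deque = deque(maxlen=repeats * max_cycle_len)

    def record(self, signature: str) -> Optional[int]:
        """Record one call; return the detected cycle length, or None."""
        self.history.append(signature)
        calls = list(self.history)
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * self.repeats
            if len(calls) < window:
                break  # longer cycles need even more history
            tail = calls[-window:]
            block = tail[:cycle_len]
            # A loop means the tail is the same block repeated `repeats` times.
            if all(tail[i] == block[i % cycle_len] for i in range(window)):
                return cycle_len
        return None


detector = SimpleLoopDetector(repeats=3, max_cycle_len=4)
cycle = ["read_file", "edit_file", "run_tests", "read_test_output"]
results = [detector.record(sig) for sig in cycle * 3]
print(results[-1])  # cycle length reported on the call that completes the third pass
```

With these settings the detector stays silent until the third full pass through the four-call block completes. A single-tool agent with `max_cycle_len=1` would trip on the third identical call.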
Pattern 3: model fallback for provider outages and cost escalation
- When to fall back to a cheaper or alternative model. Model fallback is appropriate in two scenarios: (1) the primary model’s API is experiencing elevated error rates or timeouts that exceed your retry budget, and (2) the session’s budget utilization rate is on track to exceed the cap before the task completes, and a cheaper model can complete the remaining work at lower cost. The first case is a reliability fallback; the second is a cost-efficiency fallback. Both should be implemented explicitly — silently falling back to a cheaper model without logging the fallback makes cost attribution impossible.
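The cost-efficiency trigger ("on track to exceed the cap") can be sketched as a linear projection from spend so far. This is one possible heuristic under stated assumptions: `project_fallback` and `FallbackDecision` are hypothetical names, and `fraction_done` is an estimate the agent must supply itself.

```python
from dataclasses import dataclass


@dataclass
class FallbackDecision:
    fall_back: bool
    projected_total_usd: float
    reason: str  # log this alongside the model that was selected


def project_fallback(spent_usd: float, cap_usd: float, fraction_done: float) -> FallbackDecision:
    """Assume cost scales linearly with progress: total = spent / fraction_done."""
    if fraction_done <= 0.0:
        # No progress signal yet, so there is nothing to extrapolate from.
        return FallbackDecision(False, spent_usd, "no progress estimate yet")
    projected = spent_usd / min(fraction_done, 1.0)
    if projected > cap_usd:
        return FallbackDecision(
            True, projected, f"projected ${projected:.2f} exceeds cap ${cap_usd:.2f}"
        )
    return FallbackDecision(False, projected, "projected spend within cap")


decision = project_fallback(spent_usd=2.0, cap_usd=5.0, fraction_done=0.25)
print(decision.fall_back, decision.reason)
```

Returning a decision record rather than a bare boolean keeps the trigger condition available for logging, which is what makes later cost attribution possible.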
Python: two-tier model fallback with budget trigger
```python
import anthropic
from dataclasses import dataclass
from runguard import BudgetExceededError


@dataclass
class FallbackSession:
    cap_usd: float
    spent_usd: float = 0.0
    turns: int = 0
    fallback_triggered: bool = False

    # Pricing (per token); illustrative rates, check your provider's current pricing
    SONNET_IN = 3.0 / 1_000_000    # $3/M input
    SONNET_OUT = 15.0 / 1_000_000  # $15/M output
    HAIKU_IN = 0.8 / 1_000_000     # $0.80/M input
    HAIKU_OUT = 4.0 / 1_000_000    # $4/M output

    def record(self, in_tok: int, out_tok: int, model: str) -> None:
        rate_in = self.HAIKU_IN if "haiku" in model else self.SONNET_IN
        rate_out = self.HAIKU_OUT if "haiku" in model else self.SONNET_OUT
        self.spent_usd += in_tok * rate_in + out_tok * rate_out
        self.turns += 1

    @property
    def utilization(self) -> float:
        return self.spent_usd / self.cap_usd if self.cap_usd else 0.0


client = anthropic.Anthropic()

FALLBACK_UTILIZATION_THRESHOLD = 0.65  # fall back when 65% of budget is used


def call_with_fallback(
    messages: list,
    session: FallbackSession,
    task_fraction_remaining: float = 0.5,
) -> str:
    """
    Call Sonnet unless budget utilization suggests we should fall back to Haiku.

    task_fraction_remaining: estimated fraction of work left (0.0–1.0).
    Falls back once utilization reaches the threshold while more than 10%
    of the task is still outstanding.
    """
    # Trigger fallback if most of the budget is gone and real work remains
    if (session.utilization >= FALLBACK_UTILIZATION_THRESHOLD
            and task_fraction_remaining > 0.1):
        if not session.fallback_triggered:
            print(f"[FALLBACK] Switching to Haiku at {session.utilization:.0%} budget used")
            session.fallback_triggered = True
        model = "claude-haiku-4-5-20251001"
    else:
        model = "claude-sonnet-4-6"

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
    )
    usage = response.usage
    session.record(usage.input_tokens, usage.output_tokens, model)

    if session.spent_usd > session.cap_usd:
        raise BudgetExceededError(
            f"Budget exceeded after fallback: ${session.spent_usd:.4f} > ${session.cap_usd}"
        )
    return response.content[0].text
```
- Fallback ordering and graceful handling. A practical fallback chain for production agents in 2026: Claude Sonnet 4.6 (primary, best quality) → Claude Haiku 4.5 (cost fallback, somewhat lower quality, about 3.75× cheaper at the rates used in the code above) → cached partial result (return what was computed so far). Never fall back silently: log the trigger condition (utilization rate, error type, turn count) and the model selected, so post-incident analysis can determine whether the fallback was appropriate and whether the task quality degraded.
Pattern 4: graceful degradation when the task cannot complete within constraints
- Defining graceful degradation for agents. Graceful degradation means returning a useful partial result rather than throwing an unhandled exception when an agent runs out of budget, hits a loop, or exhausts its turn limit. The partial result should be clearly marked as incomplete, include what was accomplished, and indicate why the task stopped. A naked exception caught at the API boundary is not graceful; returning `{"status": "partial", "completed_steps": [...], "stopped_reason": "budget_exceeded", "remaining_steps": [...]}` is.
Python: structured partial result on budget or loop trip
```python
from dataclasses import dataclass, field
from typing import Optional

from runguard import LoopDetectedError, BudgetExceededError


@dataclass
class AgentResult:
    status: str  # "complete" | "partial_budget" | "partial_loop" | "partial_turns"
    completed_steps: list = field(default_factory=list)
    remaining_steps: list = field(default_factory=list)
    stopped_reason: Optional[str] = None
    budget_used_usd: float = 0.0
    turns_used: int = 0
    last_output: Optional[str] = None


def run_task_with_degradation(
    steps: list[str],
    budget_usd: float = 3.0,
    max_turns: int = 20,
) -> AgentResult:
    """
    Run a multi-step task. Returns partial result on fault rather than crashing.
    """
    result = AgentResult(
        status="partial_turns",
        remaining_steps=list(steps),
    )
    session_cost = 0.0

    for turn_idx, step in enumerate(steps):
        if turn_idx >= max_turns:
            result.status = "partial_turns"
            result.stopped_reason = f"Turn limit {max_turns} reached"
            break
        try:
            output = execute_step_with_guard(step, session_cost, budget_usd)
            session_cost += output["cost"]
            result.budget_used_usd = session_cost
            result.turns_used = turn_idx + 1
            result.completed_steps.append({"step": step, "output": output["text"]})
            result.remaining_steps.remove(step)
            result.last_output = output["text"]
        except BudgetExceededError as e:
            result.status = "partial_budget"
            result.stopped_reason = str(e)
            result.budget_used_usd = session_cost
            break
        except LoopDetectedError as e:
            result.status = "partial_loop"
            result.stopped_reason = str(e)
            result.budget_used_usd = session_cost
            break
    else:
        # Loop finished without a break: every step ran.
        result.status = "complete"
        result.remaining_steps = []

    return result


def execute_step_with_guard(
    step: str,
    current_cost: float,
    budget_usd: float,
) -> dict:
    """Execute one step with budget pre-check. Returns {"text": ..., "cost": ...}."""
    if current_cost >= budget_usd * 0.95:
        raise BudgetExceededError(
            f"Pre-step budget check: ${current_cost:.4f} is ≥95% of ${budget_usd} cap"
        )
    # Replace with real LLM call
    return {"text": f"Completed: {step}", "cost": 0.01}
```
- What to include in the partial result. The minimum useful partial result for an AI agent has four fields: the steps that completed successfully (enough for a human or downstream system to avoid re-running them), the steps that did not run (so the caller can resume from the right point), the stop reason (budget, loop, turn limit, or provider error), and the cost incurred. Anything less shifts the recovery burden to the caller, who now has to figure out both what happened and where to restart.
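On the caller's side, those fields are what make resumption cheap. The sketch below shows one way a caller might continue from a partial result without re-running completed steps. `resume_from_partial` and the `run_step` callback are hypothetical names, and the dict layout mirrors the JSON shape described earlier rather than any library type.

```python
from typing import Callable


def resume_from_partial(partial: dict, run_step: Callable[[str], str]) -> dict:
    """Continue a task from a partial result, skipping steps that already completed."""
    completed = list(partial.get("completed_steps", []))
    remaining = list(partial.get("remaining_steps", []))  # copy: leave the input untouched

    while remaining:
        step = remaining[0]
        try:
            output = run_step(step)
        except Exception as exc:  # assumption: narrow this to your own fault types
            return {
                "status": "partial",
                "completed_steps": completed,
                "remaining_steps": remaining,
                "stopped_reason": str(exc),
            }
        completed.append({"step": step, "output": output})
        remaining.pop(0)

    return {
        "status": "complete",
        "completed_steps": completed,
        "remaining_steps": [],
        "stopped_reason": None,
    }


partial = {
    "status": "partial",
    "completed_steps": [{"step": "fetch", "output": "ok"}],
    "remaining_steps": ["parse", "summarize"],
    "stopped_reason": "budget_exceeded",
}
resumed = resume_from_partial(partial, run_step=lambda step: f"Completed: {step}")
print(resumed["status"], [c["step"] for c in resumed["completed_steps"]])
```

If a step fails again during the resume, the function returns another partial result in the same shape, so the caller can repeat the process after raising the budget or fixing the fault.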
LLM agent fault tolerance pattern comparison
| Pattern | Failure mode addressed | Without pattern | With pattern |
|---|---|---|---|
| Retry with jitter | Transient 429 / 5xx errors | Task fails on first transient error; no retry budget management | Distributed retry avoids thundering-herd; non-retryable errors fail fast |
| Circuit breaker (loop detection) | Tool-call loops and retry storms | Agent loops indefinitely, billing thousands of tokens per minute | Breaker trips after N repeats; loop detected before the bill lands |
| Model fallback | Budget exhaustion and provider outages | Task hard-fails or overspends when primary model is unavailable or expensive | Cheaper model completes remaining work within budget; outage path has an alternate |
| Graceful degradation | Any fault that prevents task completion | Unhandled exception; caller has no idea what completed and where to resume | Structured partial result; caller can resume from the correct checkpoint |
For the broader cost control architecture these patterns sit inside, see autonomous agent cost control best practices. For retry-storm-specific circuit breaking with RunGuard, see AI agent retry storm prevention. For the graceful degradation model-fallback approach in detail, see AI agent graceful degradation patterns.
Add fault tolerance to your LLM agent in minutes
RunGuard installs in one command: `pip install runguard` for Python, `npm install @runguard/sdk` for TypeScript. The circuit breaking and budget tracking patterns above use RunGuard’s `LoopDetector` and `BudgetExceededError` directly, with no wrappers or config files required. Start with the circuit breaker (Pattern 2) as it catches the highest-cost failure mode: tool loops. Add the budget cap next. Then layer in retry with jitter and graceful degradation as your agent moves toward production.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.