LLM agent fault tolerance patterns: retry, circuit breaker, fallback, and graceful degradation

LLM agents fail in ways that traditional web services do not. A web service either returns a response or throws an exception: faults are discrete, per-request events. An LLM agent runs over multiple turns, accumulates state, and has failure modes that compound: a 429 rate-limit error on turn 3 triggers a retry loop that burns budget on turns 4 through 8; a context window approaching its limit causes model quality to silently degrade across turns 10 through 15 before a hard truncation error fires on turn 16; a tool that returns an unexpected error type causes the model to try every variant it can think of, each variant burning tokens and money.

Standard microservice fault tolerance patterns (retry with exponential backoff, circuit breakers, health checks) address transient infrastructure failures. They do not address the unique failure modes of multi-turn, stateful, cost-accumulating systems. This guide covers four fault tolerance patterns adapted for LLM agents: (1) retry with jitter for transient provider errors, (2) circuit breaking for tool-call loops and retry storms, (3) model fallback for provider outages and budget exhaustion, and (4) graceful degradation when the current session cannot complete its task within constraints.

Pattern 1: retry with jitter for transient LLM API errors

The transient errors that need retry. LLM APIs return four categories of transient errors that warrant retry: 429 rate-limit (requests per minute or tokens per minute exceeded), 500 internal server error (provider-side transient fault), 503 service unavailable (provider maintenance or overload), and 529 overloaded (Anthropic-specific). Each of these typically resolves within seconds to minutes without any change to the request. A 400 bad request or a 401 authentication error does not warrant retry: the same request will fail the same way, so retrying only adds latency without a resolution path.

Python: retry with exponential backoff and jitter

import anthropic
import time
import random
from runguard import guard, LoopDetectedError, BudgetExceededError

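# The SDK also retries some failures on its own (max_retries defaults to 2).
# Disable that so the loop below is the only retry layer and attempts do not multiply.
client = anthropic.Anthropic(max_retries=0)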

RETRYABLE_STATUS_CODES = {429, 500, 503, 529}
MAX_RETRIES = 4
BASE_DELAY_S = 1.0
MAX_DELAY_S = 60.0

def call_with_retry(
    messages: list,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 1024,
) -> anthropic.types.Message:
    """
    Call the Anthropic API with exponential backoff + full jitter.
    Full jitter: sleep = random(0, min(cap, base * 2**attempt))
    This avoids thundering-herd when many agents retry simultaneously.
    """
    for attempt in range(MAX_RETRIES + 1):
        try:
            return client.messages.create(
                model=model,
                max_tokens=max_tokens,
                messages=messages,
            )
        except anthropic.APIStatusError as e:
            if e.status_code not in RETRYABLE_STATUS_CODES:
                raise  # non-retryable: bad request, auth error, etc.
            if attempt == MAX_RETRIES:
                raise  # exhausted retries
            cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            delay = random.uniform(0, cap)
            time.sleep(delay)
        except anthropic.APIConnectionError:
            if attempt == MAX_RETRIES:
                raise
            cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            delay = random.uniform(0, cap)
            time.sleep(delay)

def run_agent(task: str, budget_usd: float = 5.0) -> str:
    """Agent that retries transient errors without burning budget on loops."""
    messages = [{"role": "user", "content": task}]

    def inner_turn(turn_input):
        response = call_with_retry(messages)
        return response.content[0].text

    guarded = guard(
        inner_turn,
        max_budget_usd=budget_usd,
        max_loop_repeats=3,
    )

    for turn in range(20):
        try:
            result = guarded(f"turn_{turn}")
            messages.append({"role": "assistant", "content": result})
            if "[DONE]" in result:
                return result
        except LoopDetectedError:
            return "[FAULT] Loop detected — tool call pattern repeated 3 times."
        except BudgetExceededError:
            return "[FAULT] Budget exceeded — agent stopped before overspend."
    return "[FAULT] Max turns reached without completing task."

Why full jitter beats fixed backoff for agents. If your production system runs 50 agents simultaneously and all of them hit a 429 at the same time (token-per-minute limit exhausted by a spike), fixed-delay retry causes all 50 to retry at the same second, hitting the rate limit again. Full jitter distributes retries across the backoff window, reducing the chance of a synchronized second hit. The random.uniform(0, cap) formula is the simplest correct implementation. Compared with equal jitter (cap/2 plus a random value in [0, cap/2]), it has a lower expected delay (cap/2 versus 3·cap/4) and a wider spread, and that wider spread is what de-synchronizes the retries.
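To make the difference between the two jitter strategies concrete, the sketch below samples both for a single backoff window and compares their mean and spread. The function names, the 8-second window, and the sample count are illustrative assumptions, not part of any library.

```python
import random
import statistics

def full_jitter(cap: float, rng: random.Random) -> float:
    """Full jitter: delay drawn uniformly from the whole window [0, cap]."""
    return rng.uniform(0, cap)

def equal_jitter(cap: float, rng: random.Random) -> float:
    """Equal jitter: half the window is fixed, half is random, giving [cap/2, cap]."""
    return cap / 2 + rng.uniform(0, cap / 2)

def summarize(strategy, cap: float = 8.0, samples: int = 100_000, seed: int = 0):
    """Return (mean, standard deviation) of the delays a strategy produces."""
    rng = random.Random(seed)  # seeded so the comparison is repeatable
    delays = [strategy(cap, rng) for _ in range(samples)]
    return statistics.mean(delays), statistics.pstdev(delays)

full_mean, full_sd = summarize(full_jitter)
equal_mean, equal_sd = summarize(equal_jitter)
print(f"full jitter : mean={full_mean:.2f}s  sd={full_sd:.2f}s")
print(f"equal jitter: mean={equal_mean:.2f}s  sd={equal_sd:.2f}s")
```

For an 8-second window, full jitter averages about 4 seconds and equal jitter about 6 seconds, and full jitter's standard deviation is roughly twice as large. Full jitter therefore waits less on average and spreads concurrent retries more widely.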

Pattern 2: circuit breaking for tool-call loops and retry storms

Why standard retry logic creates agent loops. Standard retry-with-backoff works well for transient network errors. It does not work for tool-call loops: when an agent calls the same tool with the same arguments multiple times because the tool returned an unexpected result the model does not know how to handle, each retry is not addressing a transient fault — it is re-running the same failed strategy. The loop may look like success from a latency standpoint (no 5xx errors) while silently burning hundreds of tool invocations and thousands of tokens per minute.

Python: loop circuit breaker with RunGuard

from runguard import LoopDetector, LoopDetectedError

# The circuit breaker tracks tool call signatures across turns.
# A "signature" is a canonical representation of what was called
# and what the critical input parameters were.
detector = LoopDetector(repeats=3, max_cycle_len=4)

def tool_router(tool_name: str, tool_args: dict) -> str:
    """
    Route a tool call, checking for loops before execution.
    Returns tool output or raises LoopDetectedError if a loop is detected.
    """
    # Build a canonical signature: tool name + key discriminating args
    # Drop volatile args (timestamps, request IDs) that would prevent loop detection
    discriminating_args = {
        k: v for k, v in tool_args.items()
        if k not in {"request_id", "timestamp", "trace_id"}
    }
    sig = f"{tool_name}:{sorted(discriminating_args.items())}"

    match = detector.record(sig)
    if match:
        raise LoopDetectedError(
            f"Tool call loop detected: '{tool_name}' called in pattern "
            f"repeating {match.repeats}x with cycle length {match.cycle_length}",
            match=match,
        )

    # Execute the actual tool
    return dispatch_tool(tool_name, tool_args)

def dispatch_tool(name: str, args: dict) -> str:
    """Placeholder — replace with your actual tool dispatch."""
    if name == "search":
        return f"Search results for: {args.get('query', '')}"
    if name == "read_file":
        return f"File contents of: {args.get('path', '')}"
    raise ValueError(f"Unknown tool: {name}")

TypeScript: circuit breaker for tool loops

import { LoopDetector, LoopDetectedError } from "@runguard/sdk";

const detector = new LoopDetector({ repeats: 3, maxCycleLen: 4 });

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function toolSignature(call: ToolCall): string {
  const { request_id, timestamp, trace_id, ...discriminating } = call.args;
  return `${call.name}:${JSON.stringify(Object.entries(discriminating).sort())}`;
}

function executeWithCircuitBreaker(call: ToolCall): string {
  const sig = toolSignature(call);
  const match = detector.record(sig);
  if (match) {
    throw new LoopDetectedError(
      `Tool loop: '${call.name}' repeated ${match.repeats}x`,
      match,
    );
  }
  return dispatchTool(call);
}

function dispatchTool(call: ToolCall): string {
  switch (call.name) {
    case "search": return `Results for: ${call.args.query}`;
    case "read_file": return `Contents of: ${call.args.path}`;
    default: throw new Error(`Unknown tool: ${call.name}`);
  }
}

Choosing the right cycle length for your agent. The max_cycle_len=4 setting catches sequences of up to 4 distinct tool calls that repeat as a block. For a coding agent that calls read_file → edit_file → run_tests → read_test_output in a loop, max_cycle_len=4 catches the pattern after 3 complete cycles. For agents with longer deliberation sequences, increase this value. For agents that primarily call one tool, max_cycle_len=1 with repeats=3 is sufficient and trips faster.
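To see how the two settings interact, here is a minimal, library-free sketch of block-repeat detection. SimpleCycleDetector is a hypothetical class written for illustration; it is not RunGuard's implementation, and RunGuard's actual matching rules may differ.

```python
from collections import deque
from typing import Optional

class SimpleCycleDetector:
    """Illustrative detector: flags a block of 1..max_cycle_len signatures
    that has just repeated `repeats` times back to back."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 4) -> None:
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # Only the most recent repeats * max_cycle_len signatures can matter.
        self.history = deque(maxlen=repeats * max_cycle_len)

    def record(self, sig: str) -> Optional[int]:
        """Record one signature; return the detected cycle length, or None."""
        self.history.append(sig)
        items = list(self.history)
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * self.repeats
            if len(items) < window:
                break  # longer cycles need even more history
            tail = items[-window:]
            block = tail[:cycle_len]
            if all(tail[i] == block[i % cycle_len] for i in range(window)):
                return cycle_len
        return None

# The four-step coding-agent loop from the text, run three times.
cycle = ["read_file", "edit_file", "run_tests", "read_test_output"]
detector = SimpleCycleDetector(repeats=3, max_cycle_len=4)
hits = [detector.record(sig) for sig in cycle * 3]
print(hits)
```

In this sketch the detector stays silent for the first 11 calls and reports a cycle of length 4 on the 12th, which is the end of the third complete cycle. The same sequence fed to a detector built with max_cycle_len=1 never trips, because no single signature repeats three times in a row.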

Pattern 3: model fallback for provider outages and cost escalation

When to fall back to a cheaper or alternative model. Model fallback is appropriate in two scenarios: (1) the primary model’s API is experiencing elevated error rates or timeouts that exceed your retry budget, and (2) the session’s budget utilization rate is on track to exceed the cap before the task completes, and a cheaper model can complete the remaining work at lower cost. The first case is a reliability fallback; the second is a cost-efficiency fallback. Both should be implemented explicitly — silently falling back to a cheaper model without logging the fallback makes cost attribution impossible.
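The two triggers can be written as one small decision function. The sketch below is an assumption-laden illustration: the function names, the linear cost projection, and the 0.5 error-rate limit are all hypothetical choices, and a real agent would need its own estimate of how much of the task is done.

```python
from typing import Tuple

def projected_total_cost(spent_usd: float, fraction_done: float) -> float:
    """Linear projection: if fraction_done of the work cost spent_usd,
    the whole task is assumed to cost spent_usd / fraction_done."""
    if fraction_done <= 0:
        # No measurable progress: any spend projects to an unbounded total.
        return float("inf") if spent_usd > 0 else 0.0
    return spent_usd / fraction_done

def should_fall_back(
    spent_usd: float,
    cap_usd: float,
    fraction_done: float,
    recent_error_rate: float,
    error_rate_limit: float = 0.5,  # hypothetical threshold
) -> Tuple[bool, str]:
    """Return (fall_back, reason). Reason is 'reliability', 'cost', or 'none'."""
    if recent_error_rate >= error_rate_limit:
        return True, "reliability"
    if projected_total_cost(spent_usd, fraction_done) > cap_usd:
        return True, "cost"
    return False, "none"

# Halfway through the task with $3.00 spent against a $5.00 cap:
decision = should_fall_back(
    spent_usd=3.0, cap_usd=5.0, fraction_done=0.5, recent_error_rate=0.0
)
print(decision)
```

Returning the reason alongside the decision keeps the two cases distinguishable in logs, which is what makes cost attribution possible later.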

Python: two-tier model fallback with budget trigger

import anthropic
from dataclasses import dataclass, field
from runguard import BudgetExceededError

@dataclass
class FallbackSession:
    cap_usd: float
    spent_usd: float = 0.0
    turns: int = 0
    fallback_triggered: bool = False

    # Pricing (per token)
    SONNET_IN  = 3.0  / 1_000_000   # $3/M input
    SONNET_OUT = 15.0 / 1_000_000   # $15/M output
    HAIKU_IN   = 1.0  / 1_000_000   # $1/M input (Haiku 4.5 list price; verify against current pricing)
    HAIKU_OUT  = 5.0  / 1_000_000   # $5/M output

    def record(self, in_tok: int, out_tok: int, model: str) -> None:
        rate_in = self.HAIKU_IN if "haiku" in model else self.SONNET_IN
        rate_out = self.HAIKU_OUT if "haiku" in model else self.SONNET_OUT
        self.spent_usd += in_tok * rate_in + out_tok * rate_out
        self.turns += 1

    @property
    def utilization(self) -> float:
        return self.spent_usd / self.cap_usd if self.cap_usd else 0.0

client = anthropic.Anthropic()

FALLBACK_UTILIZATION_THRESHOLD = 0.65  # fall back when 65% of budget is used

def call_with_fallback(
    messages: list,
    session: FallbackSession,
    task_fraction_remaining: float = 0.5,
) -> str:
    """
    Call Sonnet unless budget utilization suggests we should fall back to Haiku.
    task_fraction_remaining: estimated fraction of work left (0.0–1.0).
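    Falls back once utilization reaches FALLBACK_UTILIZATION_THRESHOLD while
    more than 10% of the task remains. This is a fixed-threshold heuristic,
    not a burn-rate projection.
    """
    # Trigger fallback when most of the budget is spent but real work remains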
    if (session.utilization >= FALLBACK_UTILIZATION_THRESHOLD
            and task_fraction_remaining > 0.1):
        if not session.fallback_triggered:
            print(f"[FALLBACK] Switching to Haiku at {session.utilization:.0%} budget used")
            session.fallback_triggered = True
        model = "claude-haiku-4-5-20251001"
    else:
        model = "claude-sonnet-4-6"

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
    )
    usage = response.usage
    session.record(usage.input_tokens, usage.output_tokens, model)

    if session.spent_usd > session.cap_usd:
        raise BudgetExceededError(
            f"Budget exceeded after fallback: ${session.spent_usd:.4f} > ${session.cap_usd}"
        )

    return response.content[0].text

Fallback ordering and graceful handling. A practical fallback chain for production agents in 2026: Claude Sonnet 4.6 (primary, best quality) → Claude Haiku 4.5 (cost fallback, lower quality on complex reasoning, roughly 3 to 4× cheaper per token) → cached partial result (return what was computed so far). Never fall back silently. Log the trigger condition (utilization rate, error type, turn count) and the model selected, so post-incident analysis can determine whether the fallback was appropriate and whether the task quality degraded.
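One way to express such a chain is as an ordered list of callables with a logged hand-off between tiers. The sketch below is a minimal illustration: run_fallback_chain, the tier names, and the logger name are hypothetical, and the model calls are injected as plain functions so the chain logic stays independent of any provider SDK.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("agent.fallback")  # hypothetical logger name

def run_fallback_chain(
    tiers: Sequence[Tuple[str, Callable[[], str]]],
    cached_partial: Optional[str] = None,
) -> Tuple[str, str]:
    """Try each (name, call) tier in order and return (tier_used, output).
    If every tier fails, return the cached partial result when one exists."""
    for name, call in tiers:
        try:
            return name, call()
        except Exception as exc:  # narrow this to provider errors in real code
            logger.warning(
                "fallback: tier=%s failed with %s; trying next tier",
                name, type(exc).__name__,
            )
    if cached_partial is not None:
        logger.warning("fallback: all model tiers failed; returning cached partial")
        return "cached_partial", cached_partial
    raise RuntimeError("all fallback tiers failed and no cached partial result exists")

# Stand-ins for real model calls.
def primary_call() -> str:
    raise TimeoutError("primary model unavailable")

def cheap_call() -> str:
    return "answer from cheaper model"

tier_used, output = run_fallback_chain(
    [("primary", primary_call), ("cheap", cheap_call)],
    cached_partial="draft so far",
)
print(tier_used, output)
```

Returning the tier name with the output lets the caller record which model produced the result, so quality can be reviewed per tier afterwards.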

Pattern 4: graceful degradation when the task cannot complete within constraints

Defining graceful degradation for agents. Graceful degradation means returning a useful partial result rather than throwing an unhandled exception when an agent runs out of budget, hits a loop, or exhausts its turn limit. The partial result should be clearly marked as incomplete, include what was accomplished, and indicate why the task stopped. A naked exception caught at the API boundary is not graceful; returning {"status": "partial", "completed_steps": [...], "stopped_reason": "budget_exceeded", "remaining_steps": [...]} is.

Python: structured partial result on budget or loop trip

from dataclasses import dataclass, field
from typing import Optional
from runguard import LoopDetectedError, BudgetExceededError

@dataclass
class AgentResult:
    status: str  # "complete" | "partial_budget" | "partial_loop" | "partial_turns"
    completed_steps: list = field(default_factory=list)
    remaining_steps: list = field(default_factory=list)
    stopped_reason: Optional[str] = None
    budget_used_usd: float = 0.0
    turns_used: int = 0
    last_output: Optional[str] = None

def run_task_with_degradation(
    steps: list[str],
    budget_usd: float = 3.0,
    max_turns: int = 20,
) -> AgentResult:
    """
    Run a multi-step task. Returns partial result on fault rather than crashing.
    """
    result = AgentResult(
        status="partial_turns",
        remaining_steps=list(steps),
    )
    session_cost = 0.0

    for turn_idx, step in enumerate(steps):
        if turn_idx >= max_turns:
            result.status = "partial_turns"
            result.stopped_reason = f"Turn limit {max_turns} reached"
            break

        try:
            output = execute_step_with_guard(step, session_cost, budget_usd)
            session_cost += output["cost"]
            result.budget_used_usd = session_cost
            result.turns_used = turn_idx + 1
            result.completed_steps.append({"step": step, "output": output["text"]})
            result.remaining_steps.remove(step)
            result.last_output = output["text"]

        except BudgetExceededError as e:
            result.status = "partial_budget"
            result.stopped_reason = str(e)
            result.budget_used_usd = session_cost
            break

        except LoopDetectedError as e:
            result.status = "partial_loop"
            result.stopped_reason = str(e)
            result.budget_used_usd = session_cost
            break

    else:
        result.status = "complete"
        result.remaining_steps = []

    return result

def execute_step_with_guard(
    step: str,
    current_cost: float,
    budget_usd: float,
) -> dict:
    """Execute one step with budget pre-check. Returns {"text": ..., "cost": ...}."""
    if current_cost >= budget_usd * 0.95:
        raise BudgetExceededError(
            f"Pre-step budget check: ${current_cost:.4f} is ≥95% of ${budget_usd} cap"
        )
    # Replace with real LLM call
    return {"text": f"Completed: {step}", "cost": 0.01}

What to include in the partial result. The minimum useful partial result for an AI agent has four fields: the steps that completed successfully (enough for a human or downstream system to avoid re-running them), the steps that did not run (so the caller can resume from the right point), the stop reason (budget, loop, turn limit, or provider error), and the cost incurred. Anything less shifts the recovery burden to the caller, who now has to figure out both what happened and where to restart.
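From the caller's side, those four fields are enough to decide where to restart. The sketch below assumes the partial result arrives as JSON using the same field names as AgentResult above. resume_plan and the sample values are hypothetical.

```python
import json
from typing import List

def resume_plan(partial: dict) -> List[str]:
    """Return the steps still to run, skipping anything already completed."""
    if partial.get("status") == "complete":
        return []
    done = {entry["step"] for entry in partial.get("completed_steps", [])}
    return [step for step in partial.get("remaining_steps", []) if step not in done]

# A partial result as it might cross an API boundary (sample values only).
wire = json.dumps({
    "status": "partial_budget",
    "completed_steps": [{"step": "fetch data", "output": "Completed: fetch data"}],
    "remaining_steps": ["clean data", "write report"],
    "stopped_reason": "Pre-step budget check failed",
    "budget_used_usd": 2.87,
})

todo = resume_plan(json.loads(wire))
print(todo)
```

The caller resumes with the two steps that never ran and does not repeat the step that already completed.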

LLM agent fault tolerance pattern comparison

Pattern	Failure mode addressed	Without pattern	With pattern
Retry with jitter	Transient 429 / 5xx errors	Task fails on first transient error; no retry budget management	Distributed retry avoids thundering-herd; non-retryable errors fail fast
Circuit breaker (loop detection)	Tool-call loops and retry storms	Agent loops indefinitely, billing thousands of tokens per minute	Breaker trips after N repeats; loop detected before the bill lands
Model fallback	Budget exhaustion and provider outages	Task hard-fails or overspends when primary model is unavailable or expensive	Cheaper model completes remaining work within budget; outage path has an alternate
Graceful degradation	Any fault that prevents task completion	Unhandled exception; caller has no idea what completed and where to resume	Structured partial result; caller can resume from the correct checkpoint

For the broader cost control architecture these patterns sit inside, see autonomous agent cost control best practices. For retry-storm-specific circuit breaking with RunGuard, see AI agent retry storm prevention. For the graceful degradation model-fallback approach in detail, see AI agent graceful degradation patterns.

Add fault tolerance to your LLM agent in minutes

RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. The circuit breaking and budget tracking patterns above use RunGuard’s LoopDetector and BudgetExceededError directly — no wrappers or config files required. Start with the circuit breaker (Pattern 2) as it catches the highest-cost failure mode: tool loops. Add the budget cap next. Then layer in retry with jitter and graceful degradation as your agent moves toward production.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
