AI agent retry storms: how a transient tool failure becomes a four-figure LLM bill

A retry storm is what happens when your agent’s retry logic encounters a condition it cannot recover from, and the retry loop runs until the process dies, the budget is exhausted, or someone notices. Unlike a simple tool-call loop (where the agent calls the same tool with the same arguments because it believes the call might eventually succeed), a retry storm has an explicit retry mechanism — usually a tenacity or backoff decorator, a for attempt in range(n) loop, or a framework-level retry setting — that triggers repeatedly. The result is the same: the LLM is called again and again, the accumulated cost climbs, and the agent produces no useful output. This page explains the three retry-storm archetypes, how to configure retry logic correctly for agent workloads, and how to add a circuit breaker that catches storms that correct retry config alone won’t stop.

The three retry-storm archetypes

Archetype 1: the permanently failed tool. A tool raises an exception on every call because the underlying dependency is down, the API key is wrong, or the target resource no longer exists. The agent’s retry decorator retries 5 times per tool call with exponential backoff. On each retry, the agent calls the LLM again to re-plan (because the tool result is an error string that goes into the context). The LLM plans to call the same tool again (it has no information that the tool is permanently broken). The retry decorator fires again. The net effect: 5 retries × N planning calls × M agent steps = potentially hundreds of LLM calls before max_steps fires. The correct fix is to distinguish permanent failures (raise immediately, do not retry) from transient ones (retry with backoff). A ToolUnavailableError should never be retried by the agent loop; a RateLimitError or TimeoutError should be retried with backoff.
Archetype 2: the self-reinforcing error-string loop. A tool returns an error string (“Error: no results found for query”) instead of raising an exception. The LLM sees the error string in its context, interprets it as a result that should be retried, and calls the same tool with a slightly modified query. That query also returns an error string. The agent retries. This is the most insidious retry storm because there is no explicit retry decorator — the retry is driven by the LLM’s own reasoning. The LLM is doing what it was trained to do: retry when things fail. But the failure is permanent, and each retry costs a full planning call. Detection: the tool-call signature (tool name + argument hash) repeats in the agent’s recent call window. RunGuard fires after the third repetition and prevents the fourth call from going out.
Archetype 3: the cascading downstream failure. Tool A calls Tool B calls an external API. The external API is rate-limiting. Tool B returns a 429. Tool A propagates the error. The agent re-plans, decides to call Tool A again (because the task requires it), which calls Tool B again, which hits the 429 again. Each planning step costs a full LLM call. The agent is effectively DDoSing the external API with its own retry logic while simultaneously spending on LLM calls. The correct fix: Tool B should catch 429 and raise a typed error that tells the agent to wait, not to retry immediately. If the agent framework does not support typed error back-propagation, use a circuit breaker on Tool B itself: after 2 consecutive 429s, open the circuit (raise immediately for the next N seconds) and close it on success.

Correct retry configuration for agent workloads

Distinguish exception types. Never retry on all exceptions. Use typed exceptions and retry only on retriable ones:

import tenacity
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class TransientToolError(Exception): pass
class PermanentToolError(Exception): pass

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    retry=retry_if_exception_type(TransientToolError),
    reraise=True,
)
def call_external_api(query: str) -> str:
    try:
        response = requests.get(API_URL, params={"q": query}, timeout=5)
        response.raise_for_status()
        return response.json()["result"]
    except requests.Timeout:
        raise TransientToolError(f"Timeout for query: {query!r}")
    except requests.HTTPError as e:
        if e.response.status_code == 429:
            raise TransientToolError(f"Rate limit for query: {query!r}")
        # 404, 403, 400 — permanent
        raise PermanentToolError(f"Permanent error {e.response.status_code}: {query!r}")

Cap total retry time, not just attempts. Exponential backoff with stop_after_attempt(5) can run for 30+ seconds if each attempt waits. In an agent loop that calls the LLM on each step, 30 seconds of retry delay means 30 seconds before the next planning call. Use stop_after_delay(10) or combine: stop=stop_after_attempt(3) | stop_after_delay(8). This prevents a single stuck tool call from blocking the entire agent run for minutes.
Do not retry inside the agent loop if the agent framework already retries. LangChain’s AgentExecutor with handle_parsing_errors=True retries failed tool calls internally. If you also add @retry to the tool function, retries compound. One retry layer is correct; two is a storm risk. Pick the right layer (framework or function, not both).
Add jitter to prevent synchronized retries. In multi-agent or multi-user systems where many agents hit the same tool simultaneously, synchronized retry backoff can DDoS the downstream service. Add wait=wait_random_exponential(min=1, max=10) instead of wait_exponential to spread retries across time.

Adding a circuit breaker: catching storms that correct config won’t stop

Why correct retry config is not enough. Correct retry config prevents Archetype 1 (typed exception) and Archetype 3 (cascading downstream) storms if you implement it perfectly. But Archetype 2 — the LLM-driven self-reinforcing loop — is not caught by retry config because there is no explicit retry decorator to configure. The LLM is replanning on each step; the retry is implicit in the agent’s reasoning. For Archetype 2, you need a pattern detector at the LLM call layer that notices the same tool is being called with the same arguments repeatedly.

RunGuard as a retry-storm circuit breaker.

from runguard import guard, LoopDetectedError, BudgetExceededError

def agent_llm_call(messages: list, tools: list) -> dict:
    """Your LLM call — returns {"response": ..., "usd": float, "sig": str}"""
    response = llm_client.chat(messages=messages, tools=tools)
    tool_name = response.tool_calls[0].function.name if response.tool_calls else "end_turn"
    usd = compute_cost(response.usage)
    return {"response": response, "usd": usd, "sig": tool_name}

# Create guard once per agent run
run_guard = guard(
    agent_llm_call,
    # Fire after 3 consecutive calls with same tool signature
    loop={"repeats": 3, "max_cycle_len": 8},
    # $2 hard cap catches budget blow-through from any storm type
    budget={"max_usd": 2.00},
)

def run_agent(task: str):
    messages = [{"role": "user", "content": task}]
    for step in range(50):  # outer iteration limit
        try:
            result = run_guard(messages, tools=AGENT_TOOLS)
        except LoopDetectedError as e:
            return f"Retry storm detected: tool={e.pattern!r} called {e.repeats}x. Stopping."
        except BudgetExceededError as e:
            return f"Budget exceeded after {e.spent:.4f} USD. Partial result: {messages[-1]}"
        # Process result, update messages, check for completion...
        if result["response"].content and not result["response"].tool_calls:
            return result["response"].content
        messages.append(result["response"])

Per-tool circuit breaker for Archetype 3. For cascading downstream failures, add a circuit breaker at the tool layer rather than the LLM layer. The circuit opens after N failures and closes after a timeout:

from runguard import ToolCircuitBreaker

# Opens after 3 consecutive failures, stays open for 30 seconds
@ToolCircuitBreaker(fail_threshold=3, recovery_timeout=30)
def call_search_api(query: str) -> str:
    return search_client.search(query)

Retry storm prevention: comparison table

Storm type	Root cause	Prevention	Backstop
Permanently failed tool	Retry on PermanentError	Typed exceptions, retry_if_exception_type	RunGuard budget cap
Self-reinforcing error string	LLM-driven replanning on error text	Raise exceptions, never return error strings	RunGuard loop detector (repeats=3)
Cascading downstream 429	Uncoordinated retry on rate limit	Jittered backoff, circuit breaker on tool	RunGuard tool circuit breaker
Compounding retry layers	Framework + function both retry	Single retry layer (framework OR function)	RunGuard per-run step count limit