AI agent retry storms: how a transient tool failure becomes a four-figure LLM bill
A retry storm is what happens when your agent’s retry logic encounters a condition it cannot recover from, and the retry loop runs until the process dies, the budget is exhausted, or someone notices. Unlike a simple tool-call loop (where the agent calls the same tool with the same arguments because it believes the call might eventually succeed), a retry storm has an explicit retry mechanism — usually a tenacity or backoff decorator, a for attempt in range(n) loop, or a framework-level retry setting — that triggers repeatedly. The result is the same: the LLM is called again and again, the accumulated cost climbs, and the agent produces no useful output. This page explains the three retry-storm archetypes, how to configure retry logic correctly for agent workloads, and how to add a circuit breaker that catches storms that correct retry config alone won’t stop.
The three retry-storm archetypes
- Archetype 1: the permanently failed tool. A tool raises an exception on every call because the underlying dependency is down, the API key is wrong, or the target resource no longer exists. The agent’s retry decorator retries 5 times per tool call with exponential backoff. On each retry, the agent calls the LLM again to re-plan (because the tool result is an error string that goes into the context). The LLM plans to call the same tool again (it has no information that the tool is permanently broken). The retry decorator fires again. The net effect: 5 retries × N planning calls × M agent steps = potentially hundreds of LLM calls before
max_stepsfires. The correct fix is to distinguish permanent failures (raise immediately, do not retry) from transient ones (retry with backoff). AToolUnavailableErrorshould never be retried by the agent loop; aRateLimitErrororTimeoutErrorshould be retried with backoff. - Archetype 2: the self-reinforcing error-string loop. A tool returns an error string (“Error: no results found for query”) instead of raising an exception. The LLM sees the error string in its context, interprets it as a result that should be retried, and calls the same tool with a slightly modified query. That query also returns an error string. The agent retries. This is the most insidious retry storm because there is no explicit retry decorator — the retry is driven by the LLM’s own reasoning. The LLM is doing what it was trained to do: retry when things fail. But the failure is permanent, and each retry costs a full planning call. Detection: the tool-call signature (tool name + argument hash) repeats in the agent’s recent call window. RunGuard fires after the third repetition and prevents the fourth call from going out.
- Archetype 3: the cascading downstream failure. Tool A calls Tool B calls an external API. The external API is rate-limiting. Tool B returns a 429. Tool A propagates the error. The agent re-plans, decides to call Tool A again (because the task requires it), which calls Tool B again, which hits the 429 again. Each planning step costs a full LLM call. The agent is effectively DDoSing the external API with its own retry logic while simultaneously spending on LLM calls. The correct fix: Tool B should catch 429 and raise a typed error that tells the agent to wait, not to retry immediately. If the agent framework does not support typed error back-propagation, use a circuit breaker on Tool B itself: after 2 consecutive 429s, open the circuit (raise immediately for the next N seconds) and close it on success.
Correct retry configuration for agent workloads
- Distinguish exception types. Never retry on all exceptions. Use typed exceptions and retry only on retriable ones:
import tenacity from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type class TransientToolError(Exception): pass class PermanentToolError(Exception): pass @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10), retry=retry_if_exception_type(TransientToolError), reraise=True, ) def call_external_api(query: str) -> str: try: response = requests.get(API_URL, params={"q": query}, timeout=5) response.raise_for_status() return response.json()["result"] except requests.Timeout: raise TransientToolError(f"Timeout for query: {query!r}") except requests.HTTPError as e: if e.response.status_code == 429: raise TransientToolError(f"Rate limit for query: {query!r}") # 404, 403, 400 — permanent raise PermanentToolError(f"Permanent error {e.response.status_code}: {query!r}") - Cap total retry time, not just attempts. Exponential backoff with
stop_after_attempt(5)can run for 30+ seconds if each attempt waits. In an agent loop that calls the LLM on each step, 30 seconds of retry delay means 30 seconds before the next planning call. Usestop_after_delay(10)or combine:stop=stop_after_attempt(3) | stop_after_delay(8). This prevents a single stuck tool call from blocking the entire agent run for minutes. - Do not retry inside the agent loop if the agent framework already retries. LangChain’s
AgentExecutorwithhandle_parsing_errors=Trueretries failed tool calls internally. If you also add@retryto the tool function, retries compound. One retry layer is correct; two is a storm risk. Pick the right layer (framework or function, not both). - Add jitter to prevent synchronized retries. In multi-agent or multi-user systems where many agents hit the same tool simultaneously, synchronized retry backoff can DDoS the downstream service. Add
wait=wait_random_exponential(min=1, max=10)instead ofwait_exponentialto spread retries across time.
Adding a circuit breaker: catching storms that correct config won’t stop
- Why correct retry config is not enough. Correct retry config prevents Archetype 1 (typed exception) and Archetype 3 (cascading downstream) storms if you implement it perfectly. But Archetype 2 — the LLM-driven self-reinforcing loop — is not caught by retry config because there is no explicit retry decorator to configure. The LLM is replanning on each step; the retry is implicit in the agent’s reasoning. For Archetype 2, you need a pattern detector at the LLM call layer that notices the same tool is being called with the same arguments repeatedly.
- RunGuard as a retry-storm circuit breaker.
from runguard import guard, LoopDetectedError, BudgetExceededError def agent_llm_call(messages: list, tools: list) -> dict: """Your LLM call — returns {"response": ..., "usd": float, "sig": str}""" response = llm_client.chat(messages=messages, tools=tools) tool_name = response.tool_calls[0].function.name if response.tool_calls else "end_turn" usd = compute_cost(response.usage) return {"response": response, "usd": usd, "sig": tool_name} # Create guard once per agent run run_guard = guard( agent_llm_call, # Fire after 3 consecutive calls with same tool signature loop={"repeats": 3, "max_cycle_len": 8}, # $2 hard cap catches budget blow-through from any storm type budget={"max_usd": 2.00}, ) def run_agent(task: str): messages = [{"role": "user", "content": task}] for step in range(50): # outer iteration limit try: result = run_guard(messages, tools=AGENT_TOOLS) except LoopDetectedError as e: return f"Retry storm detected: tool={e.pattern!r} called {e.repeats}x. Stopping." except BudgetExceededError as e: return f"Budget exceeded after {e.spent:.4f} USD. Partial result: {messages[-1]}" # Process result, update messages, check for completion... if result["response"].content and not result["response"].tool_calls: return result["response"].content messages.append(result["response"]) - Per-tool circuit breaker for Archetype 3. For cascading downstream failures, add a circuit breaker at the tool layer rather than the LLM layer. The circuit opens after N failures and closes after a timeout:
from runguard import ToolCircuitBreaker # Opens after 3 consecutive failures, stays open for 30 seconds @ToolCircuitBreaker(fail_threshold=3, recovery_timeout=30) def call_search_api(query: str) -> str: return search_client.search(query)
Retry storm prevention: comparison table
| Storm type | Root cause | Prevention | Backstop |
|---|---|---|---|
| Permanently failed tool | Retry on PermanentError | Typed exceptions, retry_if_exception_type | RunGuard budget cap |
| Self-reinforcing error string | LLM-driven replanning on error text | Raise exceptions, never return error strings | RunGuard loop detector (repeats=3) |
| Cascading downstream 429 | Uncoordinated retry on rate limit | Jittered backoff, circuit breaker on tool | RunGuard tool circuit breaker |
| Compounding retry layers | Framework + function both retry | Single retry layer (framework OR function) | RunGuard per-run step count limit |