AI agent retry storms: how a transient tool failure becomes a four-figure LLM bill

A retry storm is what happens when your agent’s retry logic encounters a condition it cannot recover from, and the retry loop runs until the process dies, the budget is exhausted, or someone notices. Unlike a simple tool-call loop (where the agent calls the same tool with the same arguments because it believes the call might eventually succeed), a retry storm is driven by an explicit retry mechanism that triggers repeatedly: usually a tenacity or backoff decorator, a for attempt in range(n) loop, or a framework-level retry setting. The result is the same in both cases: the LLM is called again and again, the accumulated cost climbs, and the agent produces no useful output.

This page explains the three retry-storm archetypes, how to configure retry logic correctly for agent workloads, and how to add a circuit breaker that catches the storms that correct retry configuration alone won’t stop.
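As a rough illustration of the "condition it cannot recover from" problem, the sketch below separates failures that are worth retrying from failures that are not, and bounds the retry loop either way. It uses only the standard library. The names TransientError, PermanentError, and call_with_retry are hypothetical and chosen for this example; they are not taken from tenacity, backoff, or any agent framework.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: a failure that may clear on retry (timeout, 429, 503)."""


class PermanentError(Exception):
    """Hypothetical: a failure retrying cannot fix (bad credentials, 404)."""


def call_with_retry(fn, max_attempts=3, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call fn(), retrying only TransientError, with capped jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                # Attempts exhausted: surface the failure instead of looping.
                raise
            # Full jitter: wait a random time up to the capped exponential
            # delay so concurrent callers do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
        # PermanentError is deliberately not caught: it propagates on the
        # first attempt, so no further calls are spent on it.
```

With this shape, a tool that raises PermanentError costs exactly one call, and a tool that keeps raising TransientError costs at most max_attempts calls. A comparable policy can be written with tenacity by combining retry_if_exception_type, stop_after_attempt, and wait_random_exponential.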

The three retry-storm archetypes

Correct retry configuration for agent workloads

Adding a circuit breaker: catching storms that correct config won’t stop

Retry storm prevention: comparison table

| Storm type | Root cause | Prevention | Backstop |
| --- | --- | --- | --- |
| Permanently failed tool | Retry on PermanentError | Typed exceptions, retry_if_exception_type | RunGuard budget cap |
| Self-reinforcing error string | LLM-driven replanning on error text | Raise exceptions, never return error strings | RunGuard loop detector (repeats=3) |
| Cascading downstream 429 | Uncoordinated retry on rate limit | Jittered backoff, circuit breaker on tool | RunGuard tool circuit breaker |
| Compounding retry layers | Framework + function both retry | Single retry layer (framework OR function) | RunGuard per-run step count limit |
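To make the "circuit breaker on tool" idea more concrete, here is a minimal, generic sketch of a per-tool circuit breaker. It is not RunGuard's API, which this page does not show. The class name ToolCircuitBreaker, the CircuitOpenError exception, and the failure_threshold and cooldown_seconds parameters are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Hypothetical: raised when the breaker refuses a call outright."""


class ToolCircuitBreaker:
    """Stops calling a tool after repeated consecutive failures."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Open: fail fast without touching the tool.
                raise CircuitOpenError("tool disabled until cooldown elapses")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            was_trial = self.opened_at is not None
            if was_trial or self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

One breaker instance per tool would typically wrap the tool function, so that once the threshold is reached further attempts fail immediately until the cooldown elapses, whichever retry layer issued them. Whether the agent should then abort the run or replan without that tool is a separate design decision that this sketch leaves open.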