LLM agent rate limit backoff strategy: why naive retry loops fail and how to stop the 429 storm

LLM APIs enforce two independent rate limits simultaneously: requests per minute (RPM) and tokens per minute (TPM). When an agent hits a 429 and retries without proper backoff, it creates a retry storm — the agent loops on 429 errors indefinitely, burning capacity, amplifying downstream load, and potentially increasing costs through repeated failed attempts that still consume your quota window. The problem is especially acute for autonomous agents that fire parallel tool calls or operate with large context windows: a single agent with a 100k-token context can exhaust a tier-1 TPM budget on the first request of the minute. This page explains why naive retry logic fails, how to implement correct exponential backoff with jitter, and how RunGuard’s LoopDetector catches 429 retry storms as loops — because even well-implemented backoff does not prevent an agent from cycling on the same failing pattern across multiple retry windows.

The two rate limit types and why they trip agents differently

Every major LLM provider enforces rate limits at two levels simultaneously. Understanding both is essential for implementing a backoff strategy that actually works.

RPM limits: parallel tool calls blow through the ceiling. Requests-per-minute limits count individual API calls regardless of size. Agents that fire multiple parallel tool calls (a research agent fetching three web pages simultaneously, a code agent running four tool invocations in parallel) can hit the RPM ceiling almost instantly even on modest workloads. At OpenAI tier 1, GPT-4o is limited to 500 RPM. An agent that fires 10 parallel calls per step and runs 60 steps per minute issues 600 requests per minute, which is over the ceiling on its own; a second concurrent session on the same key only makes the 429s arrive sooner.
TPM limits: large context windows exhaust token budgets before you make many calls. Tokens-per-minute limits count all tokens across input and output for every request in the current minute window. A single call with a 100k-token context consumes 100,000 tokens of that budget in one shot. At OpenAI tier 1 (30,000 TPM for GPT-4o), that call alone is about 3.3× the entire per-minute budget. Even at tier 2 (450,000 TPM), five such calls in one minute exceed the budget. Agents that accumulate tool results and conversation history hit TPM limits faster as the session grows, and those long sessions are the most expensive runs to restart.

Provider rate limits reference: what the real ceilings look like.

Provider / Model	Tier	RPM limit	TPM limit
OpenAI GPT-4o	Tier 1	500 RPM	30,000 TPM
OpenAI GPT-4o	Tier 2	5,000 RPM	450,000 TPM
OpenAI GPT-4o-mini	Tier 1	500 RPM	200,000 TPM
Anthropic claude-sonnet-4	Tier 1	50 RPM	40,000 TPM
Anthropic claude-haiku-3.5	Tier 1	50 RPM	50,000 TPM

Limits as of June 2026. Always verify against provider documentation as they change with account tier upgrades.

How the two limits interact in practice. An agent can hit either limit independently. A very fast agent with small prompts hits RPM. A slow agent with large prompts hits TPM. An agent with large prompts firing parallel requests hits both simultaneously, which produces cascading 429s: the retry attempts themselves consume RPM quota, which triggers more 429s, which consume more RPM quota. This self-amplifying pattern is the retry storm.
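A rough way to see which ceiling a workload hits first is to compare its per-minute demand against both limits. The sketch below is illustrative only: `binding_limits` and its parameters are hypothetical names, and it assumes a steady request rate with a fixed token count per call.

```python
def binding_limits(rpm_limit: int, tpm_limit: int,
                   calls_per_min: int, tokens_per_call: int) -> list:
    """Return the limits ("RPM", "TPM") that a steady workload would exceed."""
    exceeded = []
    if calls_per_min > rpm_limit:
        exceeded.append("RPM")
    if calls_per_min * tokens_per_call > tpm_limit:
        exceeded.append("TPM")
    return exceeded


# Tier 1 GPT-4o figures from the table above: 500 RPM, 30,000 TPM
print(binding_limits(500, 30_000, calls_per_min=600, tokens_per_call=40))       # ['RPM']
print(binding_limits(500, 30_000, calls_per_min=2, tokens_per_call=100_000))    # ['TPM']
print(binding_limits(500, 30_000, calls_per_min=600, tokens_per_call=1_000))    # ['RPM', 'TPM']
```

The third case is the one that tends to cascade: both ceilings are exceeded, so every retry adds to the request count while the token budget is still exhausted.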

How retry storms form: the naive retry loop

Most engineers’ first instinct when they see a 429 is to add a sleep and retry. The result is code like this:

from openai import OpenAI, RateLimitError
import time

client = OpenAI()

# BUG: this creates a retry storm
while True:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )
        break
    except RateLimitError:
        time.sleep(1)  # BUG: 1 second is not enough for the rate limit window to reset
        continue       # BUG: retries forever: no max retries, no backoff

This pattern has four compounding problems:

(a) Fixed sleep of 1 second is not enough. In the simplest model, a rate limit window lasts 60 seconds. If you exhausted your RPM quota at second 0, a 1-second sleep puts you at second 1 of the same 60-second window. The limit has not reset, so the retry fails immediately with another 429. You sleep 1 second, retry, get another 429, sleep 1 second, and so on indefinitely. You need to wait until the window has actually reset (or for the interval the provider reports, if it sends one), not for an arbitrary fixed delay.
(b) No maximum retry count means infinite loop. The while True combined with a catch-and-continue on RateLimitError has no exit condition. If the rate limit persists (because the account has been suspended, the quota is too low for the request, or the rate limit error is actually a different error misclassified as 429), the loop never terminates. The process runs forever, consuming CPU and potentially accumulating costs through other mechanisms.
(c) Concurrent agents sharing the same API key amplify the problem. In production, you typically have many agent sessions running concurrently. Each session that hits a 429 enters its own retry loop. All of them retry at nearly the same time (because they all slept for the same 1-second fixed delay), which means they all hit the rate limit simultaneously again, which means they all sleep 1 second again. This synchronized retry pattern — called the thundering herd — keeps the rate limit window perpetually exhausted.
(d) The 429 response fills the retry queue with the same request. When the 429 occurs, the agent re-queues the exact same request with the exact same messages. If the rate limit is being triggered by a large context (TPM), re-queueing the same large context on every retry means every retry attempt costs the same number of tokens against the TPM window. The very act of retrying continues to exhaust the quota.

The result: an agent that hit a momentary rate limit 30 seconds ago is still looping, making no progress, while the rate-limited API continues to see a stream of failing requests.
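To put a number on problem (a), the following sketch counts how many retries land inside an already exhausted window. It is a simplified model, not provider behavior: it assumes a fixed 60-second window that was exhausted at t=0, and `failed_retries_in_window` is a hypothetical helper introduced only for this illustration.

```python
def failed_retries_in_window(delays, window_s: float = 60.0) -> int:
    """Count retries that fire before the exhausted window resets.

    Assumes the quota was exhausted at t=0 and resets at t=window_s.
    """
    elapsed = 0.0
    failed = 0
    for delay in delays:
        elapsed += delay
        if elapsed >= window_s:
            break  # this retry lands in the fresh window
        failed += 1
    return failed


naive = [1.0] * 1000                       # fixed 1-second sleep, repeated
backoff = [2.0 ** n for n in range(10)]    # 1, 2, 4, 8, ... seconds, no jitter

print(failed_retries_in_window(naive))     # 59
print(failed_retries_in_window(backoff))   # 5
```

Under this model the fixed 1-second sleep sends 59 doomed requests per agent before the window resets, while doubling delays send 5. Multiply the first figure by the number of concurrent sessions and the thundering herd in (c) follows directly.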

Proper exponential backoff with jitter: the correct implementation

The standard fix for 429 retry loops is exponential backoff with jitter. Each successive retry waits exponentially longer than the previous one, with a small random component that desynchronizes concurrent retriers:

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI()

def retry_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """
    Retry fn() on RateLimitError with exponential backoff and jitter.

    Parameters
    ----------
    fn          : callable — the LLM call to retry (zero-argument lambda or partial)
    max_retries : int     — maximum number of attempts before raising (default 5)
    base_delay  : float   — initial sleep in seconds before first retry (default 1.0)

    Backoff schedule (with base_delay=1.0):
      attempt 0: sleep ~1s  (1.0 * 2^0 + jitter)
      attempt 1: sleep ~2s  (1.0 * 2^1 + jitter)
      attempt 2: sleep ~4s  (1.0 * 2^2 + jitter)
      attempt 3: sleep ~8s  (1.0 * 2^3 + jitter)
      attempt 4: no sleep with the default max_retries=5; the error is re-raised
      each sleep is capped at 60s to prevent multi-minute waits
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # re-raise on the final attempt — do not swallow
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(min(delay, 60))  # cap at 60s — never wait more than a minute
    raise RuntimeError("max retries exceeded")  # only reached if max_retries <= 0


# Usage
def make_completion(messages):
    return retry_with_backoff(
        lambda: client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=4096,
        )
    )

Key decisions in this implementation:

max_retries=5: bounded loop, not infinite. Five attempts means at most four sleeps, so the minimum total wait before the final attempt is 1 + 2 + 4 + 8 = 15 seconds (plus up to 4 seconds of jitter). That is enough to ride out a short rate limit spike and gives a 60-second window time to partially reset. For spikes closer to 30–60 seconds, raise max_retries or base_delay. If the error persists after the final attempt, it is not transient and should propagate to the caller.
base_delay * (2 ** attempt): exponential growth spreads retries across time. The first retry fires after ~1 second (fast enough to catch a short burst), the second after a further ~2 seconds, and so on. Before the fifth and final attempt you sleep ~8 seconds; with a larger max_retries the next sleep would be ~16 seconds, long enough for almost any rate-limit window to have rotated at least partially.
random.uniform(0, 1) jitter: prevents thundering herd. Without jitter, all concurrent agents that entered the retry loop at the same time sleep for exactly the same duration and retry simultaneously — recreating the original burst. Adding a random 0–1 second offset to each agent’s sleep time staggers the retries across a 1-second window, which is usually enough to prevent them from all firing at the same instant.
min(delay, 60) cap: prevents multi-minute waits. Without the cap, base_delay * (2 ** attempt) for large attempt counts becomes impractically long (attempt 10 = 1024 seconds). Capping at 60 seconds ensures that even repeated failures do not block the agent for more than one minute per retry. If the rate limit is still not cleared after 60 seconds, there is likely a structural problem (quota too low, wrong tier, suspended account) that exponential backoff cannot solve.
if attempt == max_retries - 1: raise: re-raise after the last attempt. Never swallow the RateLimitError silently. After exhausting retries, raise the original exception so the caller can decide what to do: log it, return a partial result, increment an error counter, or trigger an alert.
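The sleep schedule that `retry_with_backoff` produces can be checked in isolation. The helper below is a sketch written for this page (`backoff_schedule` and its `jitter` and `rng` parameters are hypothetical, not part of any library); it reuses the same formula and cap, and makes the randomness injectable so the result is reproducible.

```python
import random
from typing import List, Optional


def backoff_schedule(max_retries: int = 5, base_delay: float = 1.0,
                     cap: float = 60.0, jitter: float = 1.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Sleeps performed between attempts: one fewer than max_retries."""
    rng = rng or random.Random()
    return [
        min(base_delay * (2 ** attempt) + rng.uniform(0, jitter), cap)
        for attempt in range(max_retries - 1)
    ]


no_jitter = backoff_schedule(jitter=0.0)
print(no_jitter, sum(no_jitter))   # [1.0, 2.0, 4.0, 8.0] 15.0

# Two agents that hit the limit at the same instant drift apart once jitter is on.
agent_a = backoff_schedule(rng=random.Random(1))
agent_b = backoff_schedule(rng=random.Random(2))
print(agent_a != agent_b)          # True
```

With the defaults there are four sleeps, not five, because the final attempt re-raises instead of sleeping. If a 0 to 1 second jitter range proves too narrow for a large fleet, widening `jitter` is a simple adjustment to try.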

RunGuard loop detection for 429 retry storms

Proper exponential backoff prevents the naive infinite-loop pattern. But there is a second failure mode that backoff alone does not catch: the cross-window retry storm. This happens when an agent retries successfully after backoff (the rate limit window resets), makes the same request, hits the rate limit again in the new window, backs off again, retries, hits the limit again. The agent is not spinning in a tight loop — it is patiently sleeping and retrying — but it is cycling on the same failure pattern indefinitely across multiple rate limit windows. From a cost perspective, each retry attempt is an LLM call that may partially consume tokens before the 429 fires. From a UX perspective, the agent is making no progress.

RunGuard’s LoopDetector catches this by treating the 429 error response as a loop signature. When the same error type appears at consecutive positions in the agent’s call history, the detector fires — regardless of how long the agent slept between attempts:

from openai import OpenAI, RateLimitError
from runguard import guard, LoopDetector, LoopDetectedError, BudgetExceededError, BudgetTracker
import time
import random

client = OpenAI()

# detector fires after 3 consecutive calls with the same error/tool signature
# max_cycle_len=2 also catches A-B-A-B patterns (alternating failures)
detector = LoopDetector(repeats=3, max_cycle_len=2)
budget = BudgetTracker(cap_usd=5.0)

def make_llm_call(messages: list) -> dict:
    """
    Wraps an OpenAI chat completion with error-signature extraction.
    Returns a dict with keys 'response', 'usd', and 'sig'.
    The 'sig' field is what LoopDetector uses to identify loop patterns.
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=4096,
        )
        usage = response.usage
        usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
        # sig = tool being called, or "end_turn" if the model is responding
        choice = response.choices[0]
        sig = "end_turn"
        if choice.message.tool_calls:
            sig = choice.message.tool_calls[0].function.name
        return {"response": choice.message, "usd": usd, "sig": sig}

    except RateLimitError as e:
        # A 429 yields no completion, so there is nothing to return here.
        # Record the failed attempt, then re-raise so the guard and the caller
        # both observe the error. After 3 consecutive RateLimitError responses,
        # LoopDetector fires.
        budget.record(usd=0.0)  # 429 costs $0 but counts as an attempt
        raise  # propagate after recording

# One guard instance per agent run — wraps make_llm_call
run_guard = guard(
    make_llm_call,
    loop={"repeats": 3, "max_cycle_len": 2},
    budget={"max_usd": 5.0},
)

def run_agent_with_rate_limit_guard(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    backoff_base = 1.0
    consecutive_429s = 0  # drives the backoff exponent; reset after each success

    for step in range(50):
        try:
            result = run_guard(messages)
        except LoopDetectedError as e:
            # The same failure pattern repeated e.repeats times: abort
            return (
                f"Rate limit retry storm detected: pattern {e.pattern!r} "
                f"repeated {e.repeats}x. The API may be consistently over-quota "
                f"for the current request size. Consider reducing context length "
                f"or upgrading to a higher rate-limit tier."
            )
        except BudgetExceededError as e:
            return f"Budget cap reached at ${e.spent:.4f} of $5.00. Partial result above."
        except RateLimitError:
            # Back off based on consecutive 429s, not on the step index
            delay = backoff_base * (2 ** min(consecutive_429s, 5)) + random.uniform(0, 1)
            consecutive_429s += 1
            time.sleep(min(delay, 60))
            continue  # retry with the same messages; RunGuard will detect if this repeats 3x

        # Successful response: reset the backoff and advance agent state
        consecutive_429s = 0
        response = result["response"]
        if not response.tool_calls:
            return response.content  # final answer
        messages.append(response)
        # ... execute tool calls and append results ...

    return "Max steps reached."

The LoopDetector(repeats=3, max_cycle_len=2) configuration is calibrated for rate-limit detection specifically:

repeats=3: fire after the third consecutive repeat. A single retry is legitimate; two retries across a window reset could be coincidence. Three consecutive rate-limit errors in the same error position is a pattern that warrants halting — the request is too large for the current quota tier, or the quota has not been provisioned correctly.
max_cycle_len=2: detect alternating failure patterns. Some agents cycle between two different endpoints or tool calls, each hitting a rate limit in turn. max_cycle_len=2 catches A-B-A-B patterns as well as A-A-A patterns, which covers the common case of an agent that tries a large-context call, gets a 429, retries with a slightly different prompt, gets another 429, and alternates indefinitely.
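To make the repeats and max_cycle_len semantics concrete, here is a minimal sketch of how a signature-based detector of this kind could work. It is not RunGuard's implementation: `find_repeating_cycle` is a hypothetical function, and it assumes the call history is a plain list of signature strings with the newest entry last.

```python
def find_repeating_cycle(history, repeats=3, max_cycle_len=2):
    """Return the cycle (as a tuple) that ends the history `repeats` times in a row, else None."""
    for cycle_len in range(1, max_cycle_len + 1):
        needed = cycle_len * repeats
        if len(history) < needed:
            continue
        tail = history[-needed:]
        cycle = tail[:cycle_len]
        # The tail must be the same cycle laid end to end `repeats` times.
        if all(tail[i] == cycle[i % cycle_len] for i in range(needed)):
            return tuple(cycle)
    return None


# A-A-A: three consecutive rate-limit signatures
print(find_repeating_cycle(["search", "rate_limit", "rate_limit", "rate_limit"]))
# ('rate_limit',)

# A-B-A-B-A-B: two signatures alternating three times
print(find_repeating_cycle(["big_call", "small_call"] * 3))
# ('big_call', 'small_call')

# An isolated retry does not trip the detector
print(find_repeating_cycle(["rate_limit", "search", "rate_limit"]))
# None
```

Note that the check depends only on the order of signatures, not on timestamps, which is why sleeping between attempts does not hide the pattern.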

TypeScript: wrapping OpenAI ChatCompletions with retry-storm detection

The same pattern in TypeScript, using @runguard/sdk with the OpenAI Node SDK:

import OpenAI from "openai";
import { guard, LoopDetectedError, BudgetExceededError } from "@runguard/sdk";

const client = new OpenAI();

// Backoff helper — same logic as the Python version
async function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: unknown) {
      if (!(err instanceof OpenAI.RateLimitError)) throw err;
      if (attempt === maxRetries - 1) throw err;
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000;
      await sleep(Math.min(delay, 60_000));
    }
  }
  throw new Error("max retries exceeded");
}

// Inner function that RunGuard wraps
async function callLLM(
  messages: OpenAI.ChatCompletionMessageParam[],
): Promise<{ response: OpenAI.ChatCompletionMessage; usd: number; sig: string }> {
  const raw = await retryWithBackoff(() =>
    client.chat.completions.create({
      model: "gpt-4o",
      messages,
      max_tokens: 4096,
    }),
  );

  const usage = raw.usage!;
  const usd = (usage.prompt_tokens * 2.5 + usage.completion_tokens * 10.0) / 1_000_000;
  const sig = raw.choices[0].message.tool_calls?.[0]?.function?.name ?? "end_turn";
  return { response: raw.choices[0].message, usd, sig };
}

// One guard instance per agent run — detects repeated 429-driven signatures
const runGuard = guard(callLLM, {
  loop: { repeats: 3, maxCycleLen: 2 },
  budget: { maxUsd: 5.0 },
});

async function runAgent(task: string): Promise<string> {
  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "user", content: task },
  ];

  for (let step = 0; step < 50; step++) {
    try {
      const result = await runGuard(messages);
      const { response } = result;

      if (!response.tool_calls) {
        return response.content ?? "";  // final answer
      }
      messages.push(response);
      // ... execute tool calls, push results ...

    } catch (e) {
      if (e instanceof LoopDetectedError) {
        return (
          `Rate limit storm detected: ${e.pattern} repeated ${e.repeats}x. ` +
          `Request size may exceed tier quota — consider chunking input or upgrading tier.`
        );
      }
      if (e instanceof BudgetExceededError) {
        return `Budget cap: $${e.spent.toFixed(4)} of $5.00 spent.`;
      }
      throw e;
    }
  }
  return "Max steps reached.";
}

Rate limit budgeting: calculating the right TPM cap for your agent

The most actionable intervention for agents hitting TPM limits is matching the agent’s context size to its tier’s token budget before deploying. Here is the math for a typical research agent running on GPT-4o tier 2 at $2.50/M input tokens and $10.00/M output tokens:

Estimate tokens per call. A research agent with a 100k-token context and 2,000 output tokens per call uses 102,000 tokens per API call. At GPT-4o tier 2 (450,000 TPM), that is 4.4 calls per minute before hitting the TPM ceiling.
Estimate cost per call. 100,000 input tokens × $2.50/M = $0.25 input. 2,000 output tokens × $10.00/M = $0.02 output. Total: $0.27 per call.
Estimate cost per session. At 10 calls per session, total = 10 × $0.27 = $2.70. A $3.00 budget cap per session is tight but safe: it allows all 10 calls to complete and fires only on anomalous runs that exceed 11 calls (typically a loop).

Set the dollar cap accordingly.

import math

from runguard import guard, BudgetExceededError

# 100k tokens/call * $2.50/M + 2k tokens/call * $10.00/M = $0.27/call
# 10 calls/session (expected) = $2.70 expected cost
# 11 calls cost $2.97; rounding up to $3.00 lets call 11 finish and catches loops at step 12
COST_PER_CALL_USD = (100_000 * 2.50 + 2_000 * 10.0) / 1_000_000  # $0.27
EXPECTED_CALLS    = 10
BUDGET_CAP_USD    = float(math.ceil(COST_PER_CALL_USD * (EXPECTED_CALLS + 1)))  # $3.00

run_guard = guard(
    make_llm_call,
    budget={"max_usd": BUDGET_CAP_USD},
    loop={"repeats": 3, "max_cycle_len": 2},
)

Chunking as an alternative to tier upgrades. If your agent’s context regularly exceeds the tier’s TPM budget, the first solution is to reduce context size rather than upgrade tier. Trim old messages with a sliding window (see context window truncation handling), summarize earlier turns, or split long documents into chunks before passing them to the agent. A 100k context is often 90% tool results that the agent no longer needs for the current step; pruning that context to 20k reduces your TPM footprint by roughly 80% and cuts cost per call from $0.27 to about $0.07 ($0.05 of input plus the unchanged $0.02 of output).
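A sliding-window trim can be sketched in a few lines. The version below is an illustration under stated assumptions, not a production implementation: `estimate_tokens` uses a crude 4-characters-per-token heuristic in place of a real tokenizer, `trim_to_budget` is a hypothetical helper, and messages are assumed to be dicts with a string `content` field.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token. Swap in a real tokenizer for production."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list, max_tokens: int) -> list:
    """Keep the first message (the task or system prompt) plus the newest messages that fit."""
    if not messages:
        return []
    head, rest = messages[0], messages[1:]
    used = estimate_tokens(head["content"])
    kept = []
    for msg in reversed(rest):              # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return [head] + kept[::-1]              # restore chronological order


history = [
    {"role": "user", "content": "x" * 40},      # ~10 tokens: the task
    {"role": "tool", "content": "a" * 400},     # ~100 tokens: oldest tool result
    {"role": "tool", "content": "b" * 400},     # ~100 tokens
    {"role": "tool", "content": "c" * 40},      # ~10 tokens: newest
]
trimmed = trim_to_budget(history, max_tokens=125)
print(len(trimmed))   # 3: the task plus the two newest tool results
```

One caveat when adapting this: chat APIs that pair tool results with the assistant message that requested them may reject a history in which that pairing has been cut, so a real trimmer would likely need to drop or keep such messages together.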

Comparison: naive retry vs. exponential backoff vs. RunGuard loop-aware retry

Capability	Naive retry (sleep 1s, loop)	Exponential backoff + jitter	RunGuard loop-aware retry
429 handling	Retries immediately — stays in same rate-limit window	Waits exponentially; usually clears the window by attempt 3–4	Backoff + loop detection; aborts if pattern repeats 3×
Infinite loop prevention	None — loops indefinitely on persistent 429	Bounded by `max_retries`; raises after N attempts	Both `max_retries` and `LoopDetectedError` after 3 repeated signatures
Thundering herd protection	None — all concurrent agents retry at the same time	Yes — jitter desynchronizes concurrent retriers	Yes — same jitter; guard instance is per-run so sessions are independent
Cross-window retry storm detection	None	None — backoff resets between invocations	Yes — `LoopDetector` observes history across retry windows
Cost cap	None	None	Yes — `BudgetTracker` fires before each call when cap is reached
Per-session budget	None	None	Yes — `cap_usd` set per agent run; resets with each new guard instance
Real-time alert on trip	None	None	Yes — `LoopDetectedError` / `BudgetExceededError` with structured context for webhook integration

Exponential backoff with jitter is necessary but not sufficient. It prevents the naive retry storm in a single-call retry loop, but it does not detect the pattern of a broader agent loop where the same 429 error recurs across multiple agent steps and multiple retry windows. Retry storm prevention requires a guard that observes the full history of the agent run, not just the current retry invocation. RunGuard’s LoopDetector is that guard: it sees every call signature the agent produces — including error signatures from 429 responses — and fires when the pattern crosses the repeat threshold.

Related patterns

Retry storm prevention (general). The 429 retry storm is one of three retry-storm archetypes. The others are the permanently-failed-tool loop and the self-reinforcing error-string loop. All three are covered on the AI agent retry storm prevention page, including per-tool circuit breakers for cascading downstream failures.
Cost control at scale. Rate-limit 429s are a cost signal as well as a reliability signal: if your agent is hitting TPM limits, your per-session cost is likely higher than expected. The autonomous agent cost control best practices page covers the five layers every production agent needs, including per-run budget enforcement and cost observability.
Setting the right dollar cap. The how to set max cost per LLM request page explains the 2–3× P95 calibration rule in detail, including how to instrument your agent to measure P95 run cost before setting the cap.
Runaway cost prevention. When a rate-limit loop escapes backoff detection, it becomes a runaway cost event. The prevent AI agent runaway cost in real time page covers the full guard configuration that pairs loop detection with cost caps and context overflow detection.
LangChain-specific budget limiting. If you are running agents through LangChain’s AgentExecutor, the LangChain agent budget limit page shows how to apply RunGuard at the framework callback layer rather than directly on the OpenAI client.

Add rate-limit storm protection to your agent today

RunGuard installs with pip install runguard (Python) or npm install @runguard/sdk (TypeScript/JavaScript). The LoopDetector and BudgetTracker shown on this page are part of the core SDK: no additional dependencies, no external API calls, no telemetry by default. A complete integration is five lines of code and two new return fields (usd and sig) from your existing LLM call function.

RunGuard plans: Solo at $19 / mo covers one agent project with full loop detection and budget tracking. Team at $79 / mo adds multiple projects, webhook alerts, and team seat sharing. Both plans include a 14-day free trial — no credit card required to start.

Get started with RunGuard — or read about retry storm prevention in depth, autonomous agent cost control, and real-time runaway cost prevention.