LLM agent rate limit backoff strategy: why naive retry loops fail and how to stop the 429 storm

LLM APIs enforce two independent rate limits simultaneously: requests per minute (RPM) and tokens per minute (TPM). When an agent hits a 429 and retries without proper backoff, it creates a retry storm — the agent loops on 429 errors indefinitely, burning capacity, amplifying downstream load, and potentially increasing costs through repeated failed attempts that still consume your quota window. The problem is especially acute for autonomous agents that fire parallel tool calls or operate with large context windows: a single agent with a 100k-token context can exhaust a tier-1 TPM budget on the first request of the minute. This page explains why naive retry logic fails, how to implement correct exponential backoff with jitter, and how RunGuard’s LoopDetector catches 429 retry storms as loops — because even well-implemented backoff does not prevent an agent from cycling on the same failing pattern across multiple retry windows.

The two rate limit types and why they trip agents differently

Every major LLM provider enforces rate limits at two levels simultaneously. Understanding both is essential for implementing a backoff strategy that actually works.
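As a rough illustration of why the two limits trip independently, here is a minimal client-side sketch that tracks both budgets over a sliding 60-second window. The class name DualWindowLimiter and the limit values are hypothetical, and providers enforce limits server-side with their own algorithms, so treat this as an estimate of when a 429 is likely rather than a model of any provider's limiter.

```python
from collections import deque
from typing import Optional


class DualWindowLimiter:
    """Hypothetical client-side tracker for RPM and TPM over a sliding window."""

    def __init__(self, rpm_limit: int, tpm_limit: int, window_s: float = 60.0):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window_s = window_s
        self.events = deque()  # (timestamp, tokens) for each recorded request

    def _evict(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_s:
            self.events.popleft()

    def check(self, tokens: int, now: float) -> Optional[str]:
        """Return None if the request fits, else which limit it would trip."""
        self._evict(now)
        if len(self.events) + 1 > self.rpm_limit:
            return "rpm"
        if sum(t for _, t in self.events) + tokens > self.tpm_limit:
            return "tpm"
        return None

    def record(self, tokens: int, now: float) -> None:
        self.events.append((now, tokens))


# Placeholder limits, not quoted from any provider tier.
limiter = DualWindowLimiter(rpm_limit=3, tpm_limit=30_000)
print(limiter.check(100_000, now=0.0))  # tpm: one oversized request trips TPM alone

for t in (0.0, 1.0, 2.0):
    limiter.record(1_000, now=t)
print(limiter.check(1_000, now=3.0))    # rpm: small requests trip RPM first
```

The first check shows the large-context case (a single request over the token budget); the second shows the parallel-tool-call case (many small requests over the request budget).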

How retry storms form: the naive retry loop

Most engineers’ first instinct when they see a 429 is to add a sleep and retry. The result is code like this:

from openai import OpenAI, RateLimitError
import time

client = OpenAI()

# BUG: this creates a retry storm
while True:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )
        break
    except RateLimitError:
        time.sleep(1)  # BUG: a fixed 1-second sleep is not enough for the rate limit window to reset
        continue       # BUG: retries forever at the same interval (no max retries, no backoff)

This pattern has four compounding problems:

The result: an agent that hit a momentary rate limit 30 seconds ago is still looping, making no progress, while the rate-limited API continues to see a stream of failing requests.
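To put a number on that stream of failing requests, the sketch below counts how many attempts a retrier sends during one 60-second window in which every attempt returns a 429. The helper attempts_in_window is hypothetical and ignores request latency and jitter, so the counts are upper bounds; the doubling schedule is included only for contrast with the fixed 1-second sleep.

```python
from typing import Callable


def attempts_in_window(delay_for: Callable[[int], float], window_s: float = 60.0) -> int:
    """Count requests sent within window_s seconds if every attempt gets a 429.

    delay_for(n) is the sleep after the n-th failed attempt (0-based).
    """
    elapsed = 0.0
    attempts = 0
    while elapsed < window_s:
        attempts += 1                       # a request goes out at time `elapsed`
        elapsed += delay_for(attempts - 1)  # then the retrier sleeps
    return attempts


naive = attempts_in_window(lambda n: 1.0)                         # fixed 1s sleep
doubling = attempts_in_window(lambda n: min(1.0 * 2 ** n, 60.0))  # 1s, 2s, 4s, ...

print(naive, doubling)  # 60 6
```

Under these assumptions a single naive retrier sends ten times as many doomed requests per window as a doubling one, and that multiplies by the number of concurrent agents.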

Proper exponential backoff with jitter: the correct implementation

The standard fix for 429 retry loops is exponential backoff with jitter. Each successive retry waits exponentially longer than the previous one, with a small random component that desynchronizes concurrent retriers:

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI()

def retry_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """
    Retry fn() on RateLimitError with exponential backoff and jitter.

    Parameters
    ----------
    fn          : callable — the LLM call to retry (zero-argument lambda or partial)
    max_retries : int     — maximum number of attempts before raising (default 5)
    base_delay  : float   — initial sleep in seconds before first retry (default 1.0)

    Backoff schedule (with base_delay=1.0; jitter adds 0-1s to each sleep):
      after failed attempt 0: sleep 1-2s  (1.0 * 2^0 + jitter)
      after failed attempt 1: sleep 2-3s  (1.0 * 2^1 + jitter)
      after failed attempt 2: sleep 4-5s  (1.0 * 2^2 + jitter)
      after failed attempt 3: sleep 8-9s  (1.0 * 2^3 + jitter)
      failed attempt 4 is the last one with max_retries=5: it re-raises, no sleep
      each sleep is capped at 60s to prevent multi-minute waits
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # re-raise on the final attempt — do not swallow
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(min(delay, 60))  # cap at 60s — never wait more than a minute
    raise RuntimeError("max retries exceeded")  # unreachable, but satisfies type checkers


# Usage
def make_completion(messages):
    return retry_with_backoff(
        lambda: client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=4096,
        )
    )

Key decisions in this implementation:
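One decision worth checking numerically is how long the default schedule actually waits in total. The sketch below mirrors the arithmetic of retry_with_backoff (doubling delay, 0 to 1 second of jitter, 60-second cap) and reports the minimum and maximum sleep before each retry. The helper name sleep_bounds is hypothetical.

```python
def sleep_bounds(max_retries: int = 5, base_delay: float = 1.0,
                 cap: float = 60.0, max_jitter: float = 1.0) -> list[tuple[float, float]]:
    """(min, max) sleep before each retry, mirroring retry_with_backoff.

    There are max_retries - 1 sleeps: the final failed attempt re-raises
    instead of sleeping.
    """
    bounds = []
    for attempt in range(max_retries - 1):
        core = base_delay * (2 ** attempt)
        bounds.append((min(core, cap), min(core + max_jitter, cap)))
    return bounds


bounds = sleep_bounds()
total_min = sum(low for low, _ in bounds)
total_max = sum(high for _, high in bounds)

print(bounds)                # [(1.0, 2.0), (2.0, 3.0), (4.0, 5.0), (8.0, 9.0)]
print(total_min, total_max)  # 15.0 19.0
```

With the defaults, the five attempts span roughly 15 to 19 seconds of sleeping, which is shorter than a full one-minute window. If the limit you are hitting only clears after a minute, a larger max_retries or base_delay may be needed for the retries to outlast it.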

RunGuard loop detection for 429 retry storms

Proper exponential backoff prevents the naive infinite-loop pattern. But there is a second failure mode that backoff alone does not catch: the cross-window retry storm. This happens when an agent retries successfully after backoff (the rate limit window resets), makes the same request, hits the rate limit again in the new window, backs off again, retries, hits the limit again. The agent is not spinning in a tight loop — it is patiently sleeping and retrying — but it is cycling on the same failure pattern indefinitely across multiple rate limit windows. From a cost perspective, each retry attempt is an LLM call that may partially consume tokens before the 429 fires. From a UX perspective, the agent is making no progress.

RunGuard’s LoopDetector catches this by treating the 429 error response as a loop signature. When the same error type appears at consecutive positions in the agent’s call history, the detector fires — regardless of how long the agent slept between attempts:

from openai import OpenAI, RateLimitError
from runguard import guard, LoopDetector, LoopDetectedError, BudgetTracker, BudgetExceededError
import time
import random

client = OpenAI()

# detector fires after 3 consecutive calls with the same error/tool signature
# max_cycle_len=2 also catches A-B-A-B patterns (alternating failures)
detector = LoopDetector(repeats=3, max_cycle_len=2)
budget = BudgetTracker(cap_usd=5.0)

def make_llm_call(messages: list) -> dict:
    """
    Wraps an OpenAI chat completion with error-signature extraction.
    Returns a dict with keys 'response', 'usd', and 'sig'.
    The 'sig' field is what LoopDetector uses to identify loop patterns.
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=4096,
        )
        usage = response.usage
        usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
        # sig = tool being called, or "end_turn" if the model is responding
        choice = response.choices[0]
        sig = "end_turn"
        if choice.message.tool_calls:
            sig = choice.message.tool_calls[0].function.name
        return {"response": choice.message, "usd": usd, "sig": sig}

    except RateLimitError:
        # A 429 produces no usable response, so nothing is returned here.
        # Record the attempt at $0 and re-raise so the caller can back off.
        # The guard is expected to treat the repeated error type as the loop
        # signature and fire after 3 consecutive RateLimitError responses.
        budget.record(usd=0.0)  # 429 costs $0 but counts as an attempt
        raise  # propagate after recording

# One guard instance per agent run — wraps make_llm_call
run_guard = guard(
    make_llm_call,
    loop={"repeats": 3, "max_cycle_len": 2},
    budget={"max_usd": 5.0},
)

def run_agent_with_rate_limit_guard(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    backoff_base = 1.0
    consecutive_429s = 0  # drives the backoff exponent; reset after any success

    for step in range(50):
        try:
            result = run_guard(messages)
        except LoopDetectedError as e:
            # The same failure pattern repeated e.repeats times: abort
            return (
                f"Rate limit retry storm detected: pattern {e.pattern!r} "
                f"repeated {e.repeats}x. The API may be consistently over-quota "
                f"for the current request size. Consider reducing context length "
                f"or upgrading to a higher rate-limit tier."
            )
        except BudgetExceededError as e:
            return f"Budget cap reached at ${e.spent:.4f} of $5.00. Partial result above."
        except RateLimitError:
            # Back off based on consecutive 429s, not the agent step index
            delay = backoff_base * (2 ** min(consecutive_429s, 5)) + random.uniform(0, 1)
            consecutive_429s += 1
            time.sleep(min(delay, 60))
            continue  # retries the same request (uses one of the 50 steps); RunGuard will detect if this repeats 3x

        # Successful response: advance agent state
        consecutive_429s = 0
        response = result["response"]
        if not response.tool_calls:
            return response.content  # final answer
        messages.append(response)
        # ... execute tool calls and append results ...

    return "Max steps reached."

The LoopDetector(repeats=3, max_cycle_len=2) configuration is calibrated for rate-limit detection specifically:
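To make the two parameters concrete, here is a standalone sketch of one way a signature-based detector with repeats and max_cycle_len could work. It is not RunGuard's implementation; the class SimpleLoopDetector and its observe method are hypothetical and exist only to illustrate the idea that detection depends on the sequence of signatures, not on how long the agent slept between calls.

```python
from typing import Optional


class SimpleLoopDetector:
    """Hypothetical detector: fires when the tail of the signature history
    is one cycle (of length 1..max_cycle_len) repeated `repeats` times."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 2):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        self.history: list[str] = []

    def observe(self, sig: str) -> Optional[tuple[str, ...]]:
        """Record a signature; return the repeating cycle if one is found."""
        self.history.append(sig)
        for cycle_len in range(1, self.max_cycle_len + 1):
            needed = cycle_len * self.repeats
            if len(self.history) < needed:
                continue
            tail = self.history[-needed:]
            cycle = tail[:cycle_len]
            if all(tail[i] == cycle[i % cycle_len] for i in range(needed)):
                return tuple(cycle)
        return None


# Three 429s in a row, however far apart in time, trip a length-1 cycle.
detector = SimpleLoopDetector(repeats=3, max_cycle_len=2)
for _ in range(3):
    hit = detector.observe("RateLimitError")
print(hit)  # ('RateLimitError',)

# An alternating success/failure pattern trips a length-2 cycle on the 6th call.
alternating = SimpleLoopDetector(repeats=3, max_cycle_len=2)
hits = [alternating.observe(s) for s in ["search", "RateLimitError"] * 3]
print(hits[-1])  # ('search', 'RateLimitError')
```

No timestamps appear anywhere in the sketch, which is why a patient retrier that sleeps a full minute between attempts is caught the same way as a tight loop.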

TypeScript: wrapping OpenAI ChatCompletions with retry-storm detection

The same pattern in TypeScript, using @runguard/sdk with the OpenAI Node SDK:

import OpenAI from "openai";
import { guard, LoopDetectedError, BudgetExceededError } from "@runguard/sdk";

const client = new OpenAI();

// Backoff helper — same logic as the Python version
async function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: unknown) {
      if (!(err instanceof OpenAI.RateLimitError)) throw err;
      if (attempt === maxRetries - 1) throw err;
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000;
      await sleep(Math.min(delay, 60_000));
    }
  }
  throw new Error("max retries exceeded");
}

// Inner function that RunGuard wraps
async function callLLM(
  messages: OpenAI.ChatCompletionMessageParam[],
): Promise<{ response: OpenAI.ChatCompletionMessage; usd: number; sig: string }> {
  const raw = await retryWithBackoff(() =>
    client.chat.completions.create({
      model: "gpt-4o",
      messages,
      max_tokens: 4096,
    }),
  );

  const usage = raw.usage!;
  const usd = (usage.prompt_tokens * 2.5 + usage.completion_tokens * 10.0) / 1_000_000;
  const sig = raw.choices[0].message.tool_calls?.[0]?.function?.name ?? "end_turn";
  return { response: raw.choices[0].message, usd, sig };
}

// One guard instance per agent run — detects repeated 429-driven signatures
const runGuard = guard(callLLM, {
  loop: { repeats: 3, maxCycleLen: 2 },
  budget: { maxUsd: 5.0 },
});

async function runAgent(task: string): Promise<string> {
  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "user", content: task },
  ];

  for (let step = 0; step < 50; step++) {
    try {
      const result = await runGuard(messages);
      const { response } = result;

      if (!response.tool_calls) {
        return response.content ?? "";  // final answer
      }
      messages.push(response);
      // ... execute tool calls, push results ...

    } catch (e) {
      if (e instanceof LoopDetectedError) {
        return (
          `Rate limit storm detected: ${e.pattern} repeated ${e.repeats}x. ` +
          `Request size may exceed tier quota — consider chunking input or upgrading tier.`
        );
      }
      if (e instanceof BudgetExceededError) {
        return `Budget cap: $${e.spent.toFixed(4)} of $5.00 spent.`;
      }
      throw e;
    }
  }
  return "Max steps reached.";
}

Rate limit budgeting: calculating the right TPM cap for your agent

The most actionable intervention for agents hitting TPM limits is matching the agent’s context size to its tier’s token budget before deploying. Here is the math for a typical research agent running on GPT-4o tier 2 at $2.50/M input tokens and $10.00/M output tokens:
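As a rough sketch of that arithmetic, the function below takes a TPM limit, a context size, and an output cap, and estimates calls per minute, cost per call, and how many calls fit under a dollar cap. The prices come from the paragraph above; the 450,000 TPM limit and the 100k-token context are placeholder inputs, not quoted tier limits, so substitute the values from your provider's limits page. It also assumes the limiter counts input tokens plus the requested max output tokens.

```python
INPUT_USD_PER_M = 2.50    # input price per million tokens, from the text above
OUTPUT_USD_PER_M = 10.00  # output price per million tokens, from the text above


def plan(tpm_limit: int, context_tokens: int, max_output_tokens: int, cap_usd: float) -> dict:
    """Estimate per-minute throughput and per-run spend for one agent."""
    tokens_per_call = context_tokens + max_output_tokens
    usd_per_call = (
        context_tokens * INPUT_USD_PER_M + max_output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000
    return {
        "tokens_per_call": tokens_per_call,
        "calls_per_minute": tpm_limit // tokens_per_call,
        "usd_per_call": round(usd_per_call, 5),
        "calls_before_cap": int(cap_usd // usd_per_call),
    }


# Placeholder inputs: assumed TPM limit, 100k-token context, max_tokens=4096, $5 cap.
estimate = plan(tpm_limit=450_000, context_tokens=100_000,
                max_output_tokens=4_096, cap_usd=5.0)
print(estimate)
# {'tokens_per_call': 104096, 'calls_per_minute': 4, 'usd_per_call': 0.29096, 'calls_before_cap': 17}
```

Under these assumed inputs the agent can make about four full-context calls per minute before TPM becomes the bottleneck, and about 17 calls before a $5.00 cap is reached. A worst-case output of 4,096 tokens per call is assumed for cost, so real spend per call is usually lower.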

Comparison: naive retry vs. exponential backoff vs. RunGuard loop-aware retry

| Capability | Naive retry (sleep 1s, loop) | Exponential backoff + jitter | RunGuard loop-aware retry |
| --- | --- | --- | --- |
| 429 handling | Retries immediately; stays in same rate-limit window | Waits exponentially; usually clears the window by attempt 3–4 | Backoff + loop detection; aborts if pattern repeats 3× |
| Infinite loop prevention | None; loops indefinitely on persistent 429 | Bounded by max_retries; raises after N attempts | Both max_retries and LoopDetectedError after 3 repeated signatures |
| Thundering herd protection | None; all concurrent agents retry at the same time | Yes; jitter desynchronizes concurrent retriers | Yes; same jitter, and the guard instance is per-run so sessions are independent |
| Cross-window retry storm detection | None | None; backoff resets between invocations | Yes; LoopDetector observes history across retry windows |
| Cost cap | None | None | Yes; BudgetTracker fires before each call when cap is reached |
| Per-session budget | None | None | Yes; cap_usd set per agent run, resets with each new guard instance |
| Real-time alert on trip | None | None | Yes; LoopDetectedError / BudgetExceededError with structured context for webhook integration |

Exponential backoff with jitter is necessary but not sufficient. It prevents the naive retry storm in a single-call retry loop, but it does not detect the pattern of a broader agent loop where the same 429 error recurs across multiple agent steps and multiple retry windows. Retry storm prevention requires a guard that observes the full history of the agent run, not just the current retry invocation. RunGuard’s LoopDetector is that guard: it sees every call signature the agent produces — including error signatures from 429 responses — and fires when the pattern crosses the repeat threshold.

Add rate-limit storm protection to your agent today

RunGuard installs with pip install runguard (Python) or npm install @runguard/sdk (TypeScript/JavaScript). The LoopDetector and BudgetTracker shown on this page are part of the core SDK: no additional dependencies, no external API calls, no telemetry by default. A complete integration is five lines of code and two new return fields (usd and sig) from your existing LLM call function.

RunGuard plans: Solo at $19 / mo covers one agent project with full loop detection and budget tracking. Team at $79 / mo adds multiple projects, webhook alerts, and team seat sharing. Both plans include a 14-day free trial — no credit card required to start.

Get started with RunGuard — or read about retry storm prevention in depth, autonomous agent cost control, and real-time runaway cost prevention.