LLM API timeout cost impact: what you actually pay when a request times out

A common misconception about LLM API timeouts is that killing a request cancels the bill. It doesn’t, at least not for streaming responses. For the major providers (Anthropic, OpenAI, Google Gemini), billing follows the tokens the model generates, not the tokens your client finishes reading. A GPT-4o streaming call that generated 800 output tokens before your 30-second timeout fired costs the same in output charges as a fully completed 800-token response: 800 × $10.00/MTok = $0.008. Add a retry-on-timeout policy that retries three times, and a single logical request can cost 4 × $0.008 = $0.032 in output tokens alone (the original attempt plus three retries, each stalling at the same point) while returning nothing useful. This page breaks down how LLM providers bill for timeouts, the two failure modes that destroy budgets, how to calculate a mathematically correct timeout window, and how to wire a circuit breaker so timeout-induced retry storms don’t ruin your month.

How LLM providers bill for timed-out requests

The billing mechanics vary by failure point, and the distinction matters enormously for your retry logic.

Connection timeout (server never responds). If your TCP connection times out before the server sends a single byte of the HTTP response, you are charged nothing. The model never started inference. This is the safest failure mode: fire the request, set a short connect timeout (5–10 seconds), and if the server doesn’t acknowledge, retry freely because no cost was incurred. Anthropic’s API typically responds within 1–3 seconds under normal load; if you haven’t received an HTTP 200 status within 10 seconds, you can safely classify it as a connection failure.
Stream timeout (server responds, then goes silent). Once the server begins streaming tokens, you are charged for every token delivered to the wire — whether your client consumed them or not. If the stream stalls after 800 tokens and your read timeout fires, Anthropic and OpenAI record that request as 800 output tokens in their billing systems. Retrying that same request without reducing max_tokens means you’re paying for 800 wasted output tokens per attempt. Three retries = 2,400 charged output tokens + the final successful call’s tokens on top. For GPT-4o at $10/MTok output, a 2,400-token retry tax costs $0.024 — before you’ve even gotten a useful answer.
Rate-limit 429 responses. A 429 is not a timeout — it’s a clean HTTP error returned before inference begins. No tokens are billed. However, agents that receive 429s and immediately retry (without exponential backoff) create a different class of storm. See AI agent retry storm prevention for the full treatment of that failure mode.
Server-side errors (500, 503). When the provider’s infrastructure faults mid-generation, providers generally do not charge for the partial output. This varies: verify with each provider’s usage policy documentation and cross-check against your actual billing dashboard. Treat these as free retries, but still apply backoff to avoid exacerbating a provider outage.
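The four cases above can be summarized as a small lookup table. The following is a minimal sketch, not any provider's API: FailureKind, RetryPolicy, POLICIES, and retry_tax_usd are hypothetical names, and the billing flags encode the behavior described above, which should still be checked against your own billing dashboard.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    CONNECT_TIMEOUT = "connect_timeout"  # no response bytes received
    STREAM_TIMEOUT = "stream_timeout"    # stream started, then went silent
    RATE_LIMIT = "rate_limit"            # HTTP 429
    SERVER_ERROR = "server_error"        # HTTP 500 / 503


@dataclass(frozen=True)
class RetryPolicy:
    output_billed: bool      # assumed: is partial output charged?
    halve_max_tokens: bool   # shrink the request before retrying?


# Assumed mapping, following the four cases described in this section.
POLICIES = {
    FailureKind.CONNECT_TIMEOUT: RetryPolicy(output_billed=False, halve_max_tokens=False),
    FailureKind.STREAM_TIMEOUT: RetryPolicy(output_billed=True, halve_max_tokens=True),
    FailureKind.RATE_LIMIT: RetryPolicy(output_billed=False, halve_max_tokens=False),
    FailureKind.SERVER_ERROR: RetryPolicy(output_billed=False, halve_max_tokens=False),
}


def retry_tax_usd(tokens_before_timeout: int, failed_attempts: int,
                  output_price_per_mtok: float) -> float:
    """Output-token cost of attempts that timed out mid-stream."""
    return tokens_before_timeout * failed_attempts * output_price_per_mtok / 1_000_000


# Three failed retries that each stall after 800 tokens, at $10/MTok output
print(f"{retry_tax_usd(800, 3, 10.0):.3f}")  # 0.024
```

Keeping the policy in data rather than in scattered if-statements makes it easy to adjust a single flag when a provider's observed billing differs from the assumption.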

The two timeout failure modes that drain budgets

Getting timeout configuration wrong costs money in two opposite directions. Most teams discover only one of them.

Failure mode A: timeout too short (10 seconds or less for long outputs). If you’re asking GPT-4o to generate a 2,000-token analysis and set a 10-second read timeout, you will time out on nearly every request. GPT-4o generates roughly 100 tokens/second under normal load; a 2,000-token response takes approximately 20 seconds. A 10-second timeout fires halfway through, charges you for ~1,000 output tokens ($0.010), and delivers nothing. If your agent makes three attempts before failing, that is $0.030 in output charges, plus the input tokens billed again on every attempt, for a task that never completed. The correct timeout for a 2,000-token expected response on GPT-4o is at least 30 seconds (20 seconds × 1.5 safety margin), before adding a time-to-first-token buffer.

Teams that set low timeouts “to fail fast” often discover this failure mode during load testing when costs spike with no corresponding successes. The fix is to derive the timeout from max_tokens and model output speed, not from a gut-feel number like 10 seconds that happens to be too short for long generations.
Failure mode B: timeout too long (>120 seconds for most workloads). The opposite problem is a timeout that allows a request to block for minutes. When an LLM provider experiences infrastructure degradation, responses can slow dramatically, from 100 tokens/second to 5 tokens/second, or stall entirely. If your timeout is 300 seconds, your agent will happily hold a connection open for five minutes. During those five minutes, any tokens the stream does deliver keep adding to the session’s cost, other agent tasks are blocked waiting for this call to resolve, and your budget guard, if it checks balances only between calls, can’t intervene because the call is still in flight.

The correct upper bound for a read timeout is derived from your budget guard’s polling interval. If RunGuard checks the budget every 30 seconds, set your maximum read timeout to 60 seconds. This guarantees the budget guard gets at least one check between the call start and a forced abort. For very long expected outputs (4,000+ tokens), use streaming with a per-chunk timeout (e.g., 10 seconds between chunks) rather than a wall-clock read timeout on the entire response.
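A per-chunk timeout can be sketched independently of any HTTP library by wrapping each read of an async iterator in asyncio.wait_for. This is an illustrative sketch only: iter_with_chunk_timeout, ChunkStallError, and the fake_stream stand-in for a provider stream are hypothetical names, not part of any SDK.

```python
import asyncio
from typing import AsyncIterator


class ChunkStallError(Exception):
    """Raised when no chunk arrives within the per-chunk window."""


async def iter_with_chunk_timeout(
    source: AsyncIterator[str], per_chunk_s: float
) -> AsyncIterator[str]:
    """Yield chunks from source, failing if the gap between chunks exceeds per_chunk_s."""
    iterator = source.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so total duration is unbounded
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=per_chunk_s)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError as exc:
            raise ChunkStallError(f"no chunk for {per_chunk_s}s") from exc
        yield chunk


async def fake_stream(chunks_before_stall: int) -> AsyncIterator[str]:
    """Stand-in for a provider stream: a few quick chunks, then a long silence."""
    for i in range(chunks_before_stall):
        await asyncio.sleep(0.01)
        yield f"tok{i} "
    await asyncio.sleep(10)  # simulated stall
    yield "never delivered"


async def consume(per_chunk_s: float) -> tuple[list[str], bool]:
    """Return the chunks received and whether the stream stalled."""
    received: list[str] = []
    try:
        async for chunk in iter_with_chunk_timeout(fake_stream(3), per_chunk_s):
            received.append(chunk)
    except ChunkStallError:
        return received, True
    return received, False


if __name__ == "__main__":
    print(asyncio.run(consume(0.2)))
```

The chunks received before the stall are still available to the caller, which is what makes it possible to estimate how many output tokens were billed for the failed attempt.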

Calculating the mathematically correct timeout window

The formula for a safe read timeout is straightforward. What varies is the model speed constant you use.

Model output speed baselines. These are approximate median throughput values under normal load; actual speeds vary with prompt complexity, load balancing, and infrastructure state. GPT-4o: ~100 tokens/second. GPT-4o-mini: ~120 tokens/second. Claude 3.5 Sonnet: ~80 tokens/second. Claude 3 Haiku: ~100 tokens/second. Gemini 1.5 Pro: ~60–80 tokens/second. These are generation speeds, not time-to-first-token (TTFT). TTFT for these models is typically 0.5–3 seconds and should be accounted for separately.
The timeout formula. timeout_seconds = (max_tokens / model_speed_tps) + ttft_buffer + safety_margin. Working example for a GPT-4o call with max_tokens=2000: base generation time = 2000 ÷ 100 = 20 seconds. Add 2 seconds TTFT buffer. Add 50% safety margin on generation time = 10 seconds. Result: timeout = 20 + 2 + 10 = 32 seconds. Round up to 35 seconds. For Claude Sonnet with max_tokens=2000: 2000 ÷ 80 = 25 seconds base + 2 seconds TTFT + 12.5 seconds safety = 39.5 seconds, round to 40 seconds.
Halving max_tokens on retry. The most effective cost-control technique for stream timeouts is to retry with max_tokens cut in half. If a 2,000-token generation timed out, the task likely doesn’t need 2,000 tokens, or the model is padding. Retry with max_tokens=1000, which also cuts the retry timeout to about 17 seconds under the formula above (10 seconds generation + 2 seconds TTFT + 5 seconds safety). If the second attempt times out again, abort. Two attempts at progressively smaller max_tokens is more cost-effective than three attempts at the original size.
Per-chunk stream timeout. For streaming responses, set a timeout on how long to wait between chunks, not a wall-clock timeout on the entire response. In Python, aiohttp exposes this as sock_read on its ClientTimeout object and httpx as read on its Timeout object; both bound each individual socket read rather than the whole response. A 10-second per-chunk timeout means: if 10 seconds pass with no new token delivered, abort. This is far more precise than a wall-clock timeout because it doesn’t penalize legitimately long responses that are streaming steadily.
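The formula and the halving rule above can be checked with a few lines of Python. This is a sketch under the same assumptions as the worked examples (2-second TTFT buffer, 50% safety margin on generation time); read_timeout_seconds and retry_schedule are hypothetical helper names, and the 256-token floor is borrowed from the retry code later on this page.

```python
def read_timeout_seconds(
    max_tokens: int,
    model_speed_tps: float,
    ttft_buffer_s: float = 2.0,
    safety_fraction: float = 0.5,
) -> float:
    """timeout = generation time + TTFT buffer + safety margin on generation time."""
    generation_s = max_tokens / model_speed_tps
    return generation_s + ttft_buffer_s + generation_s * safety_fraction


def retry_schedule(
    max_tokens: int,
    model_speed_tps: float,
    attempts: int = 2,
    floor_tokens: int = 256,
) -> list[tuple[int, float]]:
    """(max_tokens, timeout) for each attempt, halving max_tokens after each failure."""
    schedule = []
    tokens = max_tokens
    for _ in range(attempts):
        schedule.append((tokens, read_timeout_seconds(tokens, model_speed_tps)))
        tokens = max(tokens // 2, floor_tokens)
    return schedule


print(read_timeout_seconds(2000, 100))  # 32.0 (GPT-4o example)
print(read_timeout_seconds(2000, 80))   # 39.5 (Claude Sonnet example)
print(retry_schedule(2000, 100))        # [(2000, 32.0), (1000, 17.0)]
```

Rounding the results up to a convenient value (35 and 40 seconds in the examples above) is a policy choice left to the caller.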

Python implementation: stream timeout with RunGuard budget guard

The following implementation uses aiohttp for streaming with granular timeout control (aiohttp’s client speaks HTTP/1.1, not HTTP/2), exponential backoff that distinguishes stream timeouts from connection timeouts, and RunGuard’s BudgetGuard to abort before a retry storm accumulates cost.

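import asyncio
import json
import os
from dataclasses import dataclass

import aiohttp
import runguard

# API key read from the environment; the request headers below reference this name
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")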

# RunGuard budget guard — aborts the agent if session cost exceeds $0.50
guard = runguard.BudgetGuard(
    session_budget_usd=0.50,
    on_budget_exceeded=lambda ctx: (_ for _ in ()).throw(
        runguard.BudgetExceededError(f"Session budget exceeded: {ctx.total_usd:.4f}")
    )
)

@dataclass
class TimeoutConfig:
    connect_seconds: float = 8.0      # Connection timeout — no charge if this fires
    first_token_seconds: float = 15.0  # Time to first token
    between_chunks_seconds: float = 10.0  # Max silence between streamed chunks

def build_timeout(max_tokens: int, model_speed_tps: float = 100.0) -> TimeoutConfig:
    """Derive timeout from max_tokens and model speed."""
    generation_seconds = max_tokens / model_speed_tps
    safety = generation_seconds * 0.5
    ttft_buffer = 3.0
    return TimeoutConfig(
        connect_seconds=8.0,
        first_token_seconds=ttft_buffer + 5.0,
        between_chunks_seconds=min(generation_seconds + safety, 60.0)
    )

async def stream_openai_with_timeout(
    messages: list[dict],
    max_tokens: int,
    model: str = "gpt-4o",
    timeout_cfg: TimeoutConfig | None = None,
) -> str:
    """
    Stream a GPT-4o response with granular timeout control.
    Returns the completed text or raises on timeout/budget exceeded.
    """
    if timeout_cfg is None:
        speed = 100.0 if "gpt-4o" in model else 80.0
        timeout_cfg = build_timeout(max_tokens, speed)

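    # aiohttp has no separate time-to-first-token setting: sock_read bounds every
    # socket read, including the wait for the first chunk, so
    # timeout_cfg.first_token_seconds is not applied here.
    aiohttp_timeout = aiohttp.ClientTimeout(
        sock_connect=timeout_cfg.connect_seconds,
        sock_read=timeout_cfg.between_chunks_seconds,
    )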

    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
    }

    chunks: list[str] = []
    token_count = 0
    connection_established = False

    async with aiohttp.ClientSession(timeout=aiohttp_timeout) as session:
        try:
            async with session.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                json=payload,
            ) as resp:
                connection_established = True
                resp.raise_for_status()

                async for line in resp.content:
                    line_str = line.decode().strip()
                    if not line_str.startswith("data: "):
                        continue
                    data = line_str[6:]
                    if data == "[DONE]":
                        break

                    parsed = json.loads(data)
                    delta = parsed["choices"][0]["delta"].get("content", "")
                    if delta:
                        chunks.append(delta)
                        token_count += 1

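                        # Report cost to RunGuard in batches of 50. Each streamed delta is
                        # counted as one token (an approximation); input tokens and the
                        # final partial batch are not recorded here.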
                        if token_count % 50 == 0:
                            guard.record_tokens(
                                input_tokens=0,
                                output_tokens=50,
                                cost_per_output_mtok=10.00,  # GPT-4o output
                                model=model,
                            )

        except aiohttp.ServerTimeoutError as exc:
            if not connection_established:
                # Connection timeout: no charge, safe to retry
                raise ConnectionTimeoutError("Connection timeout — no tokens billed") from exc
            else:
                # Stream timeout: tokens already billed for what was generated
                partial = "".join(chunks)
                raise StreamTimeoutError(
                    f"Stream timeout after {token_count} tokens (${token_count * 10 / 1_000_000:.6f} billed). "
                    f"Partial: {partial[:100]}..."
                ) from exc

    return "".join(chunks)


class ConnectionTimeoutError(Exception):
    """No tokens billed — safe to retry with same parameters."""

class StreamTimeoutError(Exception):
    """Tokens billed up to timeout point — retry with reduced max_tokens."""


async def call_with_smart_retry(
    messages: list[dict],
    max_tokens: int,
    model: str = "gpt-4o",
    max_retries: int = 3,
) -> str:
    """
    Retry strategy that distinguishes timeout types:
    - ConnectionTimeoutError: full retry, same max_tokens
    - StreamTimeoutError: retry with max_tokens halved
    - BudgetExceededError: abort immediately, no retry
    """
    attempt_max_tokens = max_tokens
    backoff = 1.0

    for attempt in range(max_retries):
        try:
            timeout_cfg = build_timeout(attempt_max_tokens)
            return await stream_openai_with_timeout(
                messages=messages,
                max_tokens=attempt_max_tokens,
                model=model,
                timeout_cfg=timeout_cfg,
            )

        except runguard.BudgetExceededError:
            # Budget guard tripped — do not retry under any circumstances
            raise

        except ConnectionTimeoutError:
            # Free retry — no tokens were charged
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)
            continue

        except StreamTimeoutError:
            # Partial output was already billed: retry with max_tokens halved
            if attempt == max_retries - 1:
                raise
            attempt_max_tokens = max(attempt_max_tokens // 2, 256)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)
            continue

    raise RuntimeError("Max retries exceeded")

The key insight in this implementation is the connection_established flag. It separates the two timeout classes before any retry decision is made. Connection timeouts get a free retry with unchanged parameters; stream timeouts get a retry with max_tokens halved, cutting both the retry cost and the retry timeout duration. RunGuard’s BudgetGuard fires before the retry loop can compound cost beyond the session budget. For more on preventing retry accumulation, see AI agent retry storm prevention.
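The budget guard caps spend within one session, but a provider-wide stall can still trigger paid retries across many sessions. One way to limit that is a small circuit breaker that stops issuing calls after several consecutive stream timeouts. The sketch below is a generic illustration and not part of RunGuard’s API; TimeoutCircuitBreaker, CircuitOpenError, and the default thresholds are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class TimeoutCircuitBreaker:
    """Opens after N consecutive stream timeouts; allows one trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def before_call(self) -> None:
        """Call before each request; raises while the breaker is open."""
        if self._opened_at is None:
            return
        if self._clock() - self._opened_at < self.cooldown_s:
            raise CircuitOpenError("breaker open: skipping a retry that would be billed")
        # Cooldown elapsed: permit one trial call; a single failure re-opens the breaker
        self._opened_at = None
        self._consecutive_failures = self.failure_threshold - 1

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_stream_timeout(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

In the retry loop above, before_call() would run before each attempt, record_stream_timeout() in the StreamTimeoutError handler, and record_success() after a completed response. Connection timeouts are deliberately not counted, since they are treated here as unbilled.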

TypeScript implementation: AbortController with timeout classification

The TypeScript approach uses the AbortController API for fine-grained timeout control, with RunGuard’s TypeScript SDK integrated for budget enforcement.

import RunGuard from "@runguard/sdk";

const guard = new RunGuard.BudgetGuard({
  sessionBudgetUsd: 0.5,
  onBudgetExceeded: (ctx) => {
    throw new Error(`Budget exceeded: $${ctx.totalUsd.toFixed(4)}`);
  },
});

interface TimeoutConfig {
  connectMs: number;
  streamMs: number;  // per-chunk timeout for streaming
}

function buildTimeout(maxTokens: number, modelSpeedTps = 100): TimeoutConfig {
  const generationMs = (maxTokens / modelSpeedTps) * 1000;
  const safetyMs = generationMs * 0.5;
  return {
    connectMs: 8_000,
    streamMs: Math.min(generationMs + safetyMs + 3_000, 60_000),
  };
}

class ConnectionTimeoutError extends Error {
  constructor(msg: string) { super(msg); this.name = "ConnectionTimeoutError"; }
}

class StreamTimeoutError extends Error {
  public readonly tokensGenerated: number;
  constructor(msg: string, tokensGenerated: number) {
    super(msg);
    this.name = "StreamTimeoutError";
    this.tokensGenerated = tokensGenerated;
  }
}

async function streamWithTimeout(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number,
  model = "gpt-4o",
  timeoutCfg?: TimeoutConfig,
): Promise<string> {
  const cfg = timeoutCfg ?? buildTimeout(maxTokens);

  // Connect timeout: abort if no response within connectMs
  const connectController = new AbortController();
  const connectTimer = setTimeout(
    () => connectController.abort("connect-timeout"),
    cfg.connectMs,
  );

  let response: Response;
  try {
    response = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages, max_tokens: maxTokens, stream: true }),
      signal: connectController.signal,
    });
    clearTimeout(connectTimer);
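  } catch (err: unknown) {
    clearTimeout(connectTimer);
    // fetch rejects with the abort reason (a plain string here) or a generic
    // AbortError, so check the signal instead of matching on err.message
    if (connectController.signal.aborted) {
      throw new ConnectionTimeoutError("Server did not respond: no tokens billed");
    }
    throw err;
  }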

  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  // Switch to per-chunk stream timeout after connection established
  const chunks: string[] = [];
  let tokenCount = 0;
  let streamTimer: ReturnType<typeof setTimeout>;
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

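  const resetStreamTimer = (): void => {
    clearTimeout(streamTimer);
    streamTimer = setTimeout(() => {
      // Abort the underlying fetch so the pending reader.read() rejects.
      // reader.cancel() would instead resolve it with done: true, making a
      // stall look like a normal end of stream.
      connectController.abort();
    }, cfg.streamMs);
  };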

  try {
    resetStreamTimer();

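    let sseBuffer = ""; // holds a partial SSE line that spans two network chunks
    while (true) {
      const { done, value } = await reader.read();
      if (done) { clearTimeout(streamTimer); break; }

      resetStreamTimer();
      sseBuffer += decoder.decode(value, { stream: true });
      const lines = sseBuffer.split("\n");
      sseBuffer = lines.pop() ?? ""; // keep the trailing incomplete line for the next chunk

      for (const line of lines) {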
        if (!line.startsWith("data: ")) continue;
        const data = line.slice(6).trim();
        if (data === "[DONE]") continue;

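        // Only JSON parsing is guarded here, so an error thrown by the budget
        // guard below is not swallowed along with malformed SSE lines
        let parsed: { choices?: Array<{ delta?: { content?: string } }> };
        try {
          parsed = JSON.parse(data);
        } catch {
          continue; // skip malformed SSE lines
        }

        const delta = parsed?.choices?.[0]?.delta?.content ?? "";
        if (delta) {
          chunks.push(delta);
          tokenCount++; // approximation: one streamed delta counted as one token

          if (tokenCount % 50 === 0) {
            guard.recordTokens({
              inputTokens: 0,
              outputTokens: 50,
              costPerOutputMtok: 10.0,
              model,
            });
          }
        }
      }
    }
  } catch (err: unknown) {
    clearTimeout(streamTimer);
    // Budget errors must propagate unchanged so the retry loop can abort
    if (err instanceof Error && err.message.startsWith("Budget exceeded")) throw err;
    if (tokenCount > 0) {
      const billedUsd = (tokenCount * 10) / 1_000_000;
      throw new StreamTimeoutError(
        `Stream went silent after ${tokenCount} tokens ($${billedUsd.toFixed(6)} billed)`,
        tokenCount,
      );
    }
    throw new ConnectionTimeoutError("Stream never started: no output tokens received");
  }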

  return chunks.join("");
}

async function callWithSmartRetry(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number,
  model = "gpt-4o",
  maxRetries = 3,
): Promise<string> {
  let attemptMaxTokens = maxTokens;
  let backoffMs = 1_000;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const cfg = buildTimeout(attemptMaxTokens);
      return await streamWithTimeout(messages, attemptMaxTokens, model, cfg);
    } catch (err: unknown) {
      if ((err as Error).message?.startsWith("Budget exceeded")) throw err;

      if (err instanceof ConnectionTimeoutError) {
        if (attempt === maxRetries - 1) throw err;
        await sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 30_000);
        continue;
      }

      if (err instanceof StreamTimeoutError) {
        if (attempt === maxRetries - 1) throw err;
        // Halve max_tokens on stream timeout: reduces retry cost + timeout duration
        attemptMaxTokens = Math.max(Math.floor(attemptMaxTokens / 2), 256);
        await sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 30_000);
        continue;
      }

      throw err;
    }
  }

  throw new Error("Max retries exceeded");
}

const sleep = (ms: number): Promise<void> => new Promise((r) => setTimeout(r, ms));

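The resetStreamTimer pattern provides per-chunk timeout semantics: the timer resets every time a chunk arrives, so a healthy streaming response that takes 45 seconds total won’t be interrupted as long as the gap between chunks stays under cfg.streamMs. Note that buildTimeout as written derives streamMs from max_tokens (33 seconds for a 2,000-token request, capped at 60 seconds); to get the 10-second stall detection recommended earlier, pass a TimeoutConfig with streamMs: 10_000. Either way, only a genuine stall (no chunk for the whole window) fires the timer, which makes this approach more precise than a wall-clock timeout on the full response. Pair it with real-time runaway cost prevention to cap session spend end-to-end.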

Cost arithmetic: why wrong timeouts are so expensive

These numbers illustrate why timeout misconfiguration compounds rapidly at production scale.

Scenario: 10-second timeout on 2,000-token GPT-4o call, retry ×3. Model speed 100 tok/s, so at 10 seconds approximately 1,000 output tokens generated per attempt. Cost per attempt: 1,000 × $10/MTok = $0.010. Three failed attempts: $0.030. If the task runs 500 times/day under this misconfiguration: $15.00/day in pure waste. Monthly: $450 in charges that returned zero useful output. The fix — setting the correct 30-second timeout — costs nothing but a configuration change.
Scenario: 300-second timeout allowing a stalled stream. Claude 3.5 Sonnet stalls mid-generation. Input tokens: 1,500 at $3/MTok = $0.0045. Output tokens before stall: 400 at $15/MTok = $0.006. Total per stalled call: $0.0105. With a 300-second timeout and no circuit breaker, your agent blocks for up to five minutes per stall, preventing other work while the charge for the partial output stands. With RunGuard capping the session at $0.50 and a 60-second max timeout, the stall is detected within 60 seconds and budget spend is bounded.
Anthropic prompt caching reduces the input-side timeout tax. If your system prompt is cached (cache read at $0.30/MTok vs $3.00/MTok), a timeout-and-retry that re-sends the same system prompt only costs $0.30/MTok for the cached prefix. This means caching also reduces the cost of timeout-induced retries by up to 90% on the input side. See Anthropic Claude API cost optimization for full caching guidance.
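The arithmetic in these scenarios is easy to reproduce and adapt to your own volumes. The sketch below simply restates the figures above; the usd helper is a hypothetical name, the 30-day month matches the $450 figure, and the caching comparison assumes the entire 1,500-token input is a cached prefix.

```python
def usd(tokens: float, price_per_mtok: float) -> float:
    """Convert a token count to dollars at a per-million-token price."""
    return tokens * price_per_mtok / 1_000_000


# Scenario 1: ~1,000 wasted output tokens per attempt, 3 attempts, 500 tasks/day
wasted_tokens_per_day = 1_000 * 3 * 500
daily_waste = usd(wasted_tokens_per_day, 10.0)
monthly_waste = daily_waste * 30
print(f"${daily_waste:.2f}/day, ${monthly_waste:.2f}/month")  # $15.00/day, $450.00/month

# Scenario 2: one stalled call with 1,500 input and 400 output tokens
stalled_call = usd(1_500, 3.0) + usd(400, 15.0)
print(f"${stalled_call:.4f} per stalled call")  # $0.0105 per stalled call

# Retry input cost with and without a cached prefix (assumes all input is cached)
uncached_retry_input = usd(1_500, 3.0)
cached_retry_input = usd(1_500, 0.30)
input_saving = 1 - cached_retry_input / uncached_retry_input
print(f"{input_saving:.0%} lower input cost on retry")  # 90% lower input cost on retry
```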

Timeout strategy comparison: cost risk, retry behavior, and latency impact

Strategy	Timeout window	Cost risk on timeout	Retry behavior	Latency impact	Recommended for
Fixed short (<10s)	10 seconds hard	High — frequent partial charges + retry storms	Retries full cost each time	Fast failure, high retry rate	Not recommended for LLM calls
Fixed long (>120s)	120–300 seconds hard	High — stalled calls block budget guard	Infrequent but expensive	Very high tail latency	Not recommended without circuit breaker
Formula-derived wall-clock	max_tokens ÷ speed × 1.5	Medium — some partial charges on stall	Retry with same max_tokens	Moderate, bounded by formula	Non-streaming calls
Per-chunk stream timeout	10s between chunks	Low — catches stalls early	Retry with halved max_tokens	Minimal on healthy streams	Streaming calls (recommended)
Per-chunk + RunGuard budget cap	10s between chunks + $0.50 session cap	Very low — budget guard caps retry accumulation	Retries bounded by session budget	Minimal on healthy streams	Production agents (recommended)
Connection-only timeout (no stream timeout)	8s connect, unlimited stream	Extreme — unbounded stream cost	N/A (no stream retries triggered)	Unbounded tail latency	Never in production

Stop paying for timed-out tokens

Timeout cost is one of the most overlooked sources of wasted spend in production agent systems. The mechanics are clear: connection timeouts are free, stream timeouts are charged at full output rates, and a retry policy that doesn’t distinguish between them multiplies your bill on every timeout event (by up to four with three retries). The fix requires three things working together: a per-chunk stream timeout derived from max_tokens and model output speed, a retry strategy that halves max_tokens on stream timeout retries, and a session-level budget guard that aborts before retry accumulation exceeds your cost threshold. RunGuard provides the budget guard layer with one-line integration into any Python or TypeScript agent.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
