LLM API timeout cost impact: what you actually pay when a request times out

A common misconception about LLM API timeouts is that killing a request cancels the bill. It doesn't, at least not for streaming responses. For the major providers (Anthropic, OpenAI, Google Gemini), the billing clock starts when the model begins generating tokens, not when your client acknowledges receipt. A GPT-4o streaming call that generated 800 output tokens before your 30-second timeout fired costs exactly the same as a fully completed 800-token response: 800 × $10.00/MTok = $0.008. Add a retry-on-timeout policy that retries three times, and a single logical request makes four billed attempts: 4 × $0.008 = $0.032, with nothing useful returned.

This page breaks down how LLM providers bill for timeouts, the two failure modes that destroy budgets, how to calculate a mathematically correct timeout window, and how to wire a circuit breaker so timeout-induced retry storms don't ruin your month.

How LLM providers bill for timed-out requests

The billing mechanics vary by failure point, and the distinction matters enormously for your retry logic.
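
As a rough illustration of that distinction, the sketch below estimates the output-token charge for a request by the point at which the timeout fired. It assumes, as this page describes, that a timeout before the connection is established bills nothing and that a timeout mid-stream bills every token generated so far. The `FailurePoint` enum and `billed_output_usd` helper are hypothetical names, and input-token charges are ignored.

```python
from enum import Enum


class FailurePoint(Enum):
    CONNECT = "connect"  # timeout fired before the connection was established
    STREAM = "stream"    # timeout fired after the model started generating


def billed_output_usd(
    failure: FailurePoint, tokens_generated: int, usd_per_mtok: float
) -> float:
    """Estimate the output-token charge for a timed-out request."""
    if failure is FailurePoint.CONNECT:
        # Assumption from this page: nothing was generated, so nothing is billed
        return 0.0
    # Tokens generated before the timeout are billed at the normal output rate
    return tokens_generated * usd_per_mtok / 1_000_000
```

With the GPT-4o figures from the introduction, `billed_output_usd(FailurePoint.STREAM, 800, 10.00)` comes to $0.008, while the same request failing at connect time comes to $0.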

The two timeout failure modes that drain budgets

Getting timeout configuration wrong costs money in two opposite directions. Most teams discover only one of them.

Calculating the mathematically correct timeout window

The formula for a safe read timeout is straightforward. What varies is the model speed constant you use.
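
A minimal sketch of one such formula, assuming the same constants that the implementations on this page use: a 1.5x multiplier on expected generation time, a 3-second time-to-first-token buffer, and a 60-second cap. The function name and defaults are illustrative, and the tokens-per-second figure should come from your own measurements rather than from this example.

```python
def wall_clock_timeout_s(
    max_tokens: int,
    tokens_per_second: float = 100.0,  # assumed model speed, measure your own
    safety_factor: float = 1.5,        # 50% headroom over expected generation time
    ttft_buffer_s: float = 3.0,        # allowance for time to first token
    cap_s: float = 60.0,               # never wait longer than this
) -> float:
    """Derive a read timeout from max_tokens and model output speed."""
    generation_s = max_tokens / tokens_per_second
    return min(generation_s * safety_factor + ttft_buffer_s, cap_s)
```

For a 1,000-token response at 100 tokens per second this gives 10 s × 1.5 + 3 s = 18 s. A request with `max_tokens=8000` would compute to 123 s and is clamped to the 60-second cap.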

Python implementation: stream timeout with RunGuard budget guard

The following implementation uses aiohttp for streaming with granular timeout control (aiohttp speaks HTTP/1.1; it does not support HTTP/2), exponential backoff that distinguishes stream timeouts from connection timeouts, and RunGuard's BudgetGuard to abort before a retry storm accumulates cost.

import asyncio
import json
import os
from dataclasses import dataclass

import aiohttp
import runguard

# Read the API key from the environment rather than hard-coding it
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

def _abort_on_budget(ctx) -> None:
    """Raise so that the retry loop stops as soon as the budget is exceeded."""
    raise runguard.BudgetExceededError(f"Session budget exceeded: {ctx.total_usd:.4f}")

# RunGuard budget guard: aborts the agent if session cost exceeds $0.50
guard = runguard.BudgetGuard(
    session_budget_usd=0.50,
    on_budget_exceeded=_abort_on_budget,
)

@dataclass
class TimeoutConfig:
    connect_seconds: float = 8.0      # Connection timeout — no charge if this fires
    first_token_seconds: float = 15.0  # Time to first token
    between_chunks_seconds: float = 10.0  # Max silence between streamed chunks

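def build_timeout(max_tokens: int, model_speed_tps: float = 100.0) -> TimeoutConfig:
    """Derive timeouts from max_tokens and model speed (tokens per second)."""
    generation_seconds = max_tokens / model_speed_tps  # expected time for the full response
    safety = generation_seconds * 0.5                  # 50% headroom (1.5x overall)
    ttft_buffer = 3.0                                  # allowance for time to first token
    return TimeoutConfig(
        connect_seconds=8.0,
        # Informational only: aiohttp.ClientTimeout has no first-token setting,
        # so this value is not enforced by the code below.
        first_token_seconds=ttft_buffer + 5.0,
        # Passed to sock_read, so it bounds the silence before each read.
        # It scales with the full expected generation time, capped at 60 seconds.
        between_chunks_seconds=min(generation_seconds + safety, 60.0),
    )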

async def stream_openai_with_timeout(
    messages: list[dict],
    max_tokens: int,
    model: str = "gpt-4o",
    timeout_cfg: TimeoutConfig | None = None,
) -> str:
    """
    Stream a GPT-4o response with granular timeout control.
    Returns the completed text or raises on timeout/budget exceeded.
    """
    if timeout_cfg is None:
        speed = 100.0 if "gpt-4o" in model else 80.0
        timeout_cfg = build_timeout(max_tokens, speed)

    aiohttp_timeout = aiohttp.ClientTimeout(
        sock_connect=timeout_cfg.connect_seconds,
        sock_read=timeout_cfg.between_chunks_seconds,
    )

    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
    }

    chunks: list[str] = []
    token_count = 0
    connection_established = False

    async with aiohttp.ClientSession(timeout=aiohttp_timeout) as session:
        try:
            async with session.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                json=payload,
            ) as resp:
                connection_established = True
                resp.raise_for_status()

                async for line in resp.content:
                    line_str = line.decode().strip()
                    if not line_str.startswith("data: "):
                        continue
                    data = line_str[6:]
                    if data == "[DONE]":
                        break

                    parsed = json.loads(data)
                    delta = parsed["choices"][0]["delta"].get("content", "")
                    if delta:
                        chunks.append(delta)
                        token_count += 1

                        # Report token cost to RunGuard every 50 tokens
                        if token_count % 50 == 0:
                            guard.record_tokens(
                                input_tokens=0,
                                output_tokens=50,
                                cost_per_output_mtok=10.00,  # GPT-4o output
                                model=model,
                            )

        except aiohttp.ServerTimeoutError as exc:
            if not connection_established:
                # Connection timeout: no charge, safe to retry
                raise ConnectionTimeoutError("Connection timeout — no tokens billed") from exc
            else:
                # Stream timeout: tokens already billed for what was generated
                partial = "".join(chunks)
                raise StreamTimeoutError(
                    f"Stream timeout after {token_count} tokens (${token_count * 10 / 1_000_000:.6f} billed). "
                    f"Partial: {partial[:100]}..."
                ) from exc

    return "".join(chunks)


class ConnectionTimeoutError(Exception):
    """No tokens billed — safe to retry with same parameters."""

class StreamTimeoutError(Exception):
    """Tokens billed up to timeout point — retry with reduced max_tokens."""


async def call_with_smart_retry(
    messages: list[dict],
    max_tokens: int,
    model: str = "gpt-4o",
    max_retries: int = 3,
) -> str:
    """
    Retry strategy that distinguishes timeout types:
    - ConnectionTimeoutError: full retry, same max_tokens
    - StreamTimeoutError: retry with max_tokens halved
    - BudgetExceededError: abort immediately, no retry
    """
    attempt_max_tokens = max_tokens
    backoff = 1.0

    for attempt in range(max_retries):
        try:
            timeout_cfg = build_timeout(attempt_max_tokens)
            return await stream_openai_with_timeout(
                messages=messages,
                max_tokens=attempt_max_tokens,
                model=model,
                timeout_cfg=timeout_cfg,
            )

        except runguard.BudgetExceededError:
            # Budget guard tripped — do not retry under any circumstances
            raise

        except ConnectionTimeoutError:
            # Free retry — no tokens were charged
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)
            continue

        except StreamTimeoutError:
            # Partial tokens were already billed: retry with halved max_tokens
            if attempt == max_retries - 1:
                raise
            attempt_max_tokens = max(attempt_max_tokens // 2, 256)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)
            continue

    raise RuntimeError("Max retries exceeded")

The key insight in this implementation is the connection_established flag. It separates the two timeout classes before any retry decision is made. Connection timeouts get a free retry with unchanged parameters; stream timeouts get a retry with max_tokens halved, cutting both the retry cost and the retry timeout duration. RunGuard’s BudgetGuard fires before the retry loop can compound cost beyond the session budget. For more on preventing retry accumulation, see AI agent retry storm prevention.
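
To see what the halving rule does across attempts, here is a small standalone sketch that mirrors the `max(attempt_max_tokens // 2, 256)` step from `call_with_smart_retry`. The `retry_token_schedule` helper is hypothetical and assumes every attempt ends in a stream timeout.

```python
def retry_token_schedule(
    max_tokens: int, max_retries: int = 3, floor: int = 256
) -> list[int]:
    """Return the max_tokens value used on each attempt."""
    schedule: list[int] = []
    current = max_tokens
    for _ in range(max_retries):
        schedule.append(current)
        # Same rule as the retry loop: halve, but never go below the floor
        current = max(current // 2, floor)
    return schedule
```

Starting from 2,048 tokens the attempts use 2,048, 1,024, and 512 tokens, so each retry can bill at most half of what the previous attempt could. One quirk of the expression is worth knowing: for a starting value below the floor, such as 200, the second attempt is raised to 256 rather than lowered.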

TypeScript implementation: AbortController with timeout classification

The TypeScript approach uses the AbortController API for fine-grained timeout control, with RunGuard’s TypeScript SDK integrated for budget enforcement.

import RunGuard from "@runguard/sdk";

const guard = new RunGuard.BudgetGuard({
  sessionBudgetUsd: 0.5,
  onBudgetExceeded: (ctx) => {
    throw new Error(`Budget exceeded: $${ctx.totalUsd.toFixed(4)}`);
  },
});

interface TimeoutConfig {
  connectMs: number;
  streamMs: number;  // per-chunk timeout for streaming
}

function buildTimeout(maxTokens: number, modelSpeedTps = 100): TimeoutConfig {
  const generationMs = (maxTokens / modelSpeedTps) * 1000;
  const safetyMs = generationMs * 0.5;
  return {
    connectMs: 8_000,
    streamMs: Math.min(generationMs + safetyMs + 3_000, 60_000),
  };
}

class ConnectionTimeoutError extends Error {
  constructor(msg: string) { super(msg); this.name = "ConnectionTimeoutError"; }
}

class StreamTimeoutError extends Error {
  public readonly tokensGenerated: number;
  constructor(msg: string, tokensGenerated: number) {
    super(msg);
    this.name = "StreamTimeoutError";
    this.tokensGenerated = tokensGenerated;
  }
}

async function streamWithTimeout(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number,
  model = "gpt-4o",
  timeoutCfg?: TimeoutConfig,
): Promise<string> {
  const cfg = timeoutCfg ?? buildTimeout(maxTokens);

  // Connect timeout: abort if no response within connectMs
  const connectController = new AbortController();
  const connectTimer = setTimeout(
    () => connectController.abort("connect-timeout"),
    cfg.connectMs,
  );

  let response: Response;
  try {
    response = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages, max_tokens: maxTokens, stream: true }),
      signal: connectController.signal,
    });
    clearTimeout(connectTimer);
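  } catch (err: unknown) {
    clearTimeout(connectTimer);
    // fetch rejects with the abort reason itself (here a plain string, not an Error),
    // so check the signal instead of inspecting err.message
    if (connectController.signal.aborted) {
      throw new ConnectionTimeoutError("Server did not respond: no tokens billed");
    }
    throw err;
  }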

  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  // Switch to per-chunk stream timeout after connection established
  const chunks: string[] = [];
  let tokenCount = 0;
  let stalled = false;
  let buffer = "";
  let streamTimer: ReturnType<typeof setTimeout> | undefined;
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  const resetStreamTimer = (): void => {
    clearTimeout(streamTimer);
    streamTimer = setTimeout(() => {
      // cancel() resolves the pending read() with done: true instead of rejecting,
      // so record the stall explicitly before cancelling
      stalled = true;
      void reader.cancel().catch(() => {});
    }, cfg.streamMs);
  };

  try {
    resetStreamTimer();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      resetStreamTimer();
      // Buffer partial lines: one SSE line can be split across network chunks
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? "";

      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        const data = line.slice(6).trim();
        if (data === "[DONE]") continue;

        let parsed: any;
        try {
          parsed = JSON.parse(data);
        } catch {
          continue; // skip malformed SSE lines
        }

        const delta = parsed.choices?.[0]?.delta?.content ?? "";
        if (!delta) continue;
        chunks.push(delta);
        tokenCount++; // each content delta is counted as roughly one token

        // Kept outside the JSON try/catch so a budget error is not swallowed
        if (tokenCount % 50 === 0) {
          guard.recordTokens({
            inputTokens: 0,
            outputTokens: 50,
            costPerOutputMtok: 10.0,
            model,
          });
        }
      }
    }
  } finally {
    clearTimeout(streamTimer);
  }

  // Budget and network errors propagate unchanged; only a stall is reclassified
  if (stalled) {
    if (tokenCount > 0) {
      const billedUsd = (tokenCount * 10) / 1_000_000;
      throw new StreamTimeoutError(
        `Stream went silent after ${tokenCount} tokens ($${billedUsd.toFixed(6)} billed)`,
        tokenCount,
      );
    }
    throw new ConnectionTimeoutError("Stream never started: no tokens billed");
  }

  return chunks.join("");
}

async function callWithSmartRetry(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number,
  model = "gpt-4o",
  maxRetries = 3,
): Promise<string> {
  let attemptMaxTokens = maxTokens;
  let backoffMs = 1_000;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const cfg = buildTimeout(attemptMaxTokens);
      return await streamWithTimeout(messages, attemptMaxTokens, model, cfg);
    } catch (err: unknown) {
      if ((err as Error).message?.startsWith("Budget exceeded")) throw err;

      if (err instanceof ConnectionTimeoutError) {
        if (attempt === maxRetries - 1) throw err;
        await sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 30_000);
        continue;
      }

      if (err instanceof StreamTimeoutError) {
        if (attempt === maxRetries - 1) throw err;
        // Halve max_tokens on stream timeout: reduces retry cost + timeout duration
        attemptMaxTokens = Math.max(Math.floor(attemptMaxTokens / 2), 256);
        await sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 30_000);
        continue;
      }

      throw err;
    }
  }

  throw new Error("Max retries exceeded");
}

const sleep = (ms: number): Promise<void> => new Promise((r) => setTimeout(r, ms));

The resetStreamTimer pattern provides per-chunk timeout semantics: the timer resets every time a chunk arrives, so a healthy streaming response that takes 45 seconds in total won't be interrupted as long as consecutive chunks arrive within the streamMs window. Only a genuine stall, with no chunk for the whole window, triggers the abort. Note that buildTimeout as written derives streamMs from max_tokens and caps it at 60 seconds; pass an explicit TimeoutConfig with streamMs: 10_000 to get the 10-second window used in the comparison table below. For streaming calls this catches stalls earlier than a wall-clock timeout on the full response. Pair it with real-time runaway cost prevention to cap session spend end-to-end.
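
The same per-chunk idea can be sketched in Python without any HTTP client, which makes it easy to test in isolation. The example below wraps an arbitrary async iterator with `asyncio.wait_for` so that the timeout applies to each chunk rather than to the whole stream. `StreamStalled`, `iter_with_chunk_timeout`, and the `fake_stream` stand-in are hypothetical names for illustration; a real provider stream would take the place of `fake_stream`.

```python
import asyncio
from typing import AsyncIterator


class StreamStalled(Exception):
    """No chunk arrived within the per-chunk window."""


async def iter_with_chunk_timeout(
    source: AsyncIterator[str], chunk_timeout_s: float
) -> AsyncIterator[str]:
    """Yield chunks from source, raising StreamStalled on a silent gap."""
    iterator = source.__aiter__()
    while True:
        try:
            # The timeout covers one chunk at a time, not the whole stream
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=chunk_timeout_s)
        except StopAsyncIteration:
            return  # stream finished normally
        except asyncio.TimeoutError:
            raise StreamStalled(f"no chunk for {chunk_timeout_s}s") from None
        yield chunk


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for a provider stream: two prompt chunks, then a stall."""
    yield "Hello"
    await asyncio.sleep(0.01)
    yield " world"
    await asyncio.sleep(5.0)  # simulated stall, longer than the window
    yield "!"


async def collect(chunk_timeout_s: float) -> tuple[list[str], bool]:
    """Return the chunks received and whether the stream stalled."""
    received: list[str] = []
    try:
        async for chunk in iter_with_chunk_timeout(fake_stream(), chunk_timeout_s):
            received.append(chunk)
    except StreamStalled:
        return received, True
    return received, False
```

Running `asyncio.run(collect(0.2))` receives the first two chunks and then reports a stall, because the third chunk does not arrive within the 0.2-second window. The partial chunks are kept, which is what lets a caller estimate how many tokens were billed before the stall.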

Cost arithmetic: why wrong timeouts are so expensive

These numbers illustrate why timeout misconfiguration compounds rapidly at production scale.
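
The sketch below works through the arithmetic using the figures from the introduction: 800 output tokens billed per timed-out attempt at $10.00/MTok, with three retries. The daily event count in the usage note is a made-up volume for illustration, not a measurement, and `wasted_usd` is a hypothetical helper.

```python
def wasted_usd(
    tokens_before_timeout: int,
    usd_per_mtok: float,
    retries: int,
    events: int = 1,
) -> float:
    """Output-token spend on attempts that all time out mid-stream."""
    attempts = 1 + retries  # the original call plus each retry
    per_event = tokens_before_timeout * usd_per_mtok / 1_000_000 * attempts
    return per_event * events
```

One timeout event with three retries costs `wasted_usd(800, 10.00, retries=3)`, or $0.032. At an assumed 10,000 such events per day, `wasted_usd(800, 10.00, retries=3, events=10_000)` comes to about $320 per day spent on responses that were never delivered.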

Timeout strategy comparison: cost risk, retry behavior, and latency impact

| Strategy | Timeout window | Cost risk on timeout | Retry behavior | Latency impact | Recommended for |
|---|---|---|---|---|---|
| Fixed short (<10s) | 10 seconds hard | High: frequent partial charges + retry storms | Retries full cost each time | Fast failure, high retry rate | Not recommended for LLM calls |
| Fixed long (>120s) | 120–300 seconds hard | High: stalled calls block budget guard | Infrequent but expensive | Very high tail latency | Not recommended without circuit breaker |
| Formula-derived wall-clock | max_tokens ÷ speed × 1.5 | Medium: some partial charges on stall | Retry with same max_tokens | Moderate, bounded by formula | Non-streaming calls |
| Per-chunk stream timeout | 10s between chunks | Low: catches stalls early | Retry with halved max_tokens | Minimal on healthy streams | Streaming calls (recommended) |
| Per-chunk + RunGuard budget cap | 10s between chunks + $0.50 session cap | Very low: budget guard caps retry accumulation | Retries bounded by session budget | Minimal on healthy streams | Production agents (recommended) |
| Connection-only timeout (no stream timeout) | 8s connect, unlimited stream | Extreme: unbounded stream cost | N/A (no stream retries triggered) | Unbounded tail latency | Never in production |


Stop paying for timed-out tokens

Timeout cost is one of the most overlooked sources of wasted spend in production agent systems. The mechanics are clear: connection timeouts are free, stream timeouts are charged at full output rates, and a retry policy that doesn't distinguish between them multiplies your bill on every timeout event (four billed attempts with three retries). The fix requires three things working together: a per-chunk stream timeout derived from max_tokens and model output speed, a retry strategy that halves max_tokens on stream timeout retries, and a session-level budget guard that aborts before retry accumulation exceeds your cost threshold. RunGuard provides the budget guard layer with one-line integration into any Python or TypeScript agent.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.

Start your 14-day free trial — or explore related: retry storm prevention, context window truncation alerts, and autonomous agent cost control best practices.