LLM API timeout cost impact: what you actually pay when a request times out
A common misconception about LLM API timeouts is that killing a request cancels the bill. It doesn’t, at least not for streaming responses. For the major providers (Anthropic, OpenAI, Google Gemini), the billing clock starts ticking when the model begins generating tokens, not when your client acknowledges receipt. A GPT-4o streaming call that generated 800 output tokens before your 30-second timeout fired costs exactly the same as a fully-completed 800-token response: 800 × $10.00/MTok = $0.008. Add a retry-on-timeout policy that retries three times, and a single logical request (the original attempt plus three retries, four attempts at $0.008 each) costs $0.032 in output tokens while returning nothing useful. This page breaks down how LLM providers bill for timeouts, the two failure modes that destroy budgets, how to calculate a mathematically correct timeout window, and how to wire a circuit breaker so timeout-induced retry storms don’t ruin your month.
How LLM providers bill for timed-out requests
The billing mechanics vary by failure point, and the distinction matters enormously for your retry logic.
- Connection timeout (server never responds). If your TCP connection times out before the server sends a single byte of the HTTP response, you are charged nothing. The model never started inference. This is the safest failure mode: fire the request, set a short connect timeout (5–10 seconds), and if the server doesn’t acknowledge, retry freely because no cost was incurred. Anthropic’s API typically responds within 1–3 seconds under normal load; if you haven’t received an HTTP 200 status within 10 seconds, you can safely classify it as a connection failure.
- Stream timeout (server responds, then goes silent). Once the server begins streaming tokens, you are charged for every token delivered to the wire, whether your client consumed them or not. If the stream stalls after 800 tokens and your read timeout fires, Anthropic and OpenAI record that request as 800 output tokens in their billing systems. Retrying that same request without reducing max_tokens means you’re paying for 800 wasted output tokens per attempt. Three retries = 2,400 charged output tokens + the final successful call’s tokens on top. For GPT-4o at $10/MTok output, a 2,400-token retry tax costs $0.024 before you’ve even gotten a useful answer.
- Rate-limit 429 responses. A 429 is not a timeout; it’s a clean HTTP error returned before inference begins. No tokens are billed. However, agents that receive 429s and immediately retry (without exponential backoff) create a different class of storm. See AI agent retry storm prevention for the full treatment of that failure mode.
- Server-side errors (500, 503). When the provider’s infrastructure faults mid-generation, providers generally do not charge for the partial output. This varies: verify with each provider’s usage policy documentation and cross-check against your actual billing dashboard. Treat these as free retries, but still apply backoff to avoid exacerbating a provider outage.
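The four cases above can be collapsed into a small lookup that a retry layer consults before deciding what to do. This is a minimal sketch that encodes the billing behavior as this section describes it, not a provider guarantee; `FailurePoint`, `RetryAdvice`, `classify`, and `retry_tax_usd` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailurePoint(Enum):
    CONNECT_TIMEOUT = "connect_timeout"  # no HTTP response arrived at all
    STREAM_TIMEOUT = "stream_timeout"    # HTTP 200 arrived, then the stream stalled
    RATE_LIMITED = "rate_limited"        # HTTP 429, returned before inference
    SERVER_ERROR = "server_error"        # HTTP 500/503


@dataclass(frozen=True)
class RetryAdvice:
    likely_billed: bool     # per the billing rules described in this section
    halve_max_tokens: bool  # shrink the request before retrying


# Assumption: these flags mirror the article's description, not a provider contract.
ADVICE = {
    FailurePoint.CONNECT_TIMEOUT: RetryAdvice(likely_billed=False, halve_max_tokens=False),
    FailurePoint.STREAM_TIMEOUT: RetryAdvice(likely_billed=True, halve_max_tokens=True),
    FailurePoint.RATE_LIMITED: RetryAdvice(likely_billed=False, halve_max_tokens=False),
    FailurePoint.SERVER_ERROR: RetryAdvice(likely_billed=False, halve_max_tokens=False),
}


def classify(status: Optional[int]) -> FailurePoint:
    """Map what the client saw when the request failed onto a failure point.

    status is None when the timeout fired before any HTTP status arrived.
    """
    if status is None:
        return FailurePoint.CONNECT_TIMEOUT
    if status == 429:
        return FailurePoint.RATE_LIMITED
    if status >= 500:
        return FailurePoint.SERVER_ERROR
    if status == 200:
        return FailurePoint.STREAM_TIMEOUT
    raise ValueError(f"status {status} is not one of the cases covered here")


def retry_tax_usd(tokens_per_attempt: int, failed_attempts: int, usd_per_mtok: float) -> float:
    """Output charges for failed stream attempts that returned nothing useful."""
    return tokens_per_attempt * failed_attempts * usd_per_mtok / 1_000_000
```

With the numbers from the stream-timeout bullet, `retry_tax_usd(800, 3, 10.0)` reproduces the $0.024 retry tax. Verify the `likely_billed` flags against your own billing dashboard before relying on them.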
The two timeout failure modes that drain budgets
Getting timeout configuration wrong costs money in two opposite directions. Most teams discover only one of them.
- Failure mode A: timeout too short (<10 seconds for long outputs). If you’re asking GPT-4o to generate a 2,000-token analysis and set a 10-second read timeout, you will time out on nearly every request. GPT-4o generates roughly 100 tokens/second under normal load; a 2,000-token response takes approximately 20 seconds. A 10-second timeout fires halfway through, charges you for ~1,000 output tokens ($0.010), and delivers nothing. If your agent retries three times before failing: $0.030 in output charges + $0.030 in input charges for a task that never completed. The correct timeout for a 2,000-token expected response on GPT-4o is at least 30 seconds (20 seconds × 1.5 safety margin).
Teams that set low timeouts “to fail fast” often discover this failure mode during load testing when costs spike with no corresponding successes. The fix is to derive the timeout from max_tokens and model output speed, not from a gut-feel number like 30 seconds that happens to be too short for long generations.
- Failure mode B: timeout too long (>120 seconds for most workloads). The opposite problem is a timeout that allows a request to block for minutes. When an LLM provider experiences infrastructure degradation, responses can slow dramatically, from 100 tokens/second to 5 tokens/second, or stall entirely. If your timeout is 300 seconds, your agent will happily hold a connection open for five minutes. During those five minutes, the cost of that user session keeps accumulating, other agent tasks are blocked waiting for this call to resolve, and your budget guard (if it checks balances between calls) can’t intervene because the call is still in flight.
The correct upper bound for a read timeout is derived from your budget guard’s polling interval. If RunGuard checks the budget every 30 seconds, set your maximum read timeout to 60 seconds. This guarantees the budget guard gets at least one check between the call start and a forced abort. For very long expected outputs (4,000+ tokens), use streaming with a per-chunk timeout (e.g., 10 seconds between chunks) rather than a wall-clock read timeout on the entire response.
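Both bounds can be checked mechanically before a timeout value ships. The sketch below is illustrative only: `check_read_timeout` is a hypothetical helper, the 1.5× lower bound comes from failure mode A, and the upper bound of twice the budget-guard polling interval is an assumption that generalizes the 30-second/60-second example above.

```python
def check_read_timeout(
    timeout_s: float,
    max_tokens: int,
    model_speed_tps: float,
    guard_poll_interval_s: float,
) -> str:
    """Classify a wall-clock read timeout against the two failure modes."""
    # Lower bound: expected generation time plus a 50% safety margin.
    lower = (max_tokens / model_speed_tps) * 1.5
    # Upper bound (assumed rule): twice the budget guard's polling interval,
    # so at least one budget check lands inside the call.
    upper = guard_poll_interval_s * 2

    if lower > upper:
        # No wall-clock value satisfies both bounds; stream with a per-chunk timeout.
        return "use_per_chunk_timeout"
    if timeout_s < lower:
        return "too_short"
    if timeout_s > upper:
        return "too_long"
    return "ok"
```

For a 2,000-token GPT-4o call with a 30-second polling interval, a 10-second timeout is reported as `too_short`, a 300-second timeout as `too_long`, and 35 seconds as `ok`. An 8,000-token request cannot satisfy both bounds, which is the case where the per-chunk approach takes over.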
Calculating the mathematically correct timeout window
The formula for a safe read timeout is straightforward. What varies is the model speed constant you use.
- Model output speed baselines. These are approximate median throughput values under normal load; actual speeds vary with prompt complexity, load balancing, and infrastructure state. GPT-4o: ~100 tokens/second. GPT-4o-mini: ~120 tokens/second. Claude Sonnet 3.5: ~80 tokens/second. Claude Haiku 3: ~100 tokens/second. Gemini 1.5 Pro: ~60–80 tokens/second. These are generation speeds, not time-to-first-token (TTFT). TTFT for these models is typically 0.5–3 seconds and should be accounted for separately.
- The timeout formula. timeout_seconds = (max_tokens / model_speed_tps) + ttft_buffer + safety_margin. Working example for a GPT-4o call with max_tokens=2000: base generation time = 2000 ÷ 100 = 20 seconds. Add 2 seconds TTFT buffer. Add 50% safety margin on generation time = 10 seconds. Result: timeout = 20 + 2 + 10 = 32 seconds. Round up to 35 seconds. For Claude Sonnet with max_tokens=2000: 2000 ÷ 80 = 25 seconds base + 2 seconds TTFT + 12.5 seconds safety = 39.5 seconds, round to 40 seconds.
- Halving max_tokens on retry. The most effective cost-control technique for stream timeouts is to retry with max_tokens cut in half. If a 2,000-token generation timed out, the task likely doesn’t need 2,000 tokens, or the model is padding. Retry with max_tokens=1000, which also cuts the retry timeout to ~17 seconds by the same formula (10 + 2 + 5). If the second attempt times out again, abort. Two attempts at progressively smaller max_tokens is more cost-effective than three attempts at the original size.
- Per-chunk stream timeout. For streaming responses, set a timeout on how long to wait between chunks, not a wall-clock timeout on the entire response. In Python, aiohttp exposes this as sock_read on ClientTimeout, and httpx as the read field of httpx.Timeout. A 10-second per-chunk timeout means: if 10 seconds pass with no new token delivered, abort. This is far more precise than a wall-clock timeout because it doesn’t penalize legitimately long responses that are streaming steadily.
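The formula and the halving rule are easy to get subtly wrong by hand, so here is a minimal sketch of both. The function names `read_timeout_seconds` and `retry_schedule` are hypothetical; the 2-second TTFT buffer and 50% safety margin are the values from the worked example, and the 256-token floor is an assumption borrowed from the retry code later on this page.

```python
def read_timeout_seconds(
    max_tokens: int,
    model_speed_tps: float,
    ttft_buffer_s: float = 2.0,
    safety_fraction: float = 0.5,
) -> float:
    """timeout = generation time + TTFT buffer + safety margin on generation time."""
    generation_s = max_tokens / model_speed_tps
    return generation_s + ttft_buffer_s + generation_s * safety_fraction


def retry_schedule(
    max_tokens: int,
    model_speed_tps: float,
    attempts: int = 2,
    floor_tokens: int = 256,
) -> list[tuple[int, float]]:
    """Return (max_tokens, timeout_seconds) per attempt, halving max_tokens each time."""
    schedule = []
    tokens = max_tokens
    for _ in range(attempts):
        schedule.append((tokens, read_timeout_seconds(tokens, model_speed_tps)))
        tokens = max(tokens // 2, floor_tokens)
    return schedule


print(read_timeout_seconds(2000, 100.0))  # 32.0 (GPT-4o example)
print(read_timeout_seconds(2000, 80.0))   # 39.5 (Claude Sonnet example)
print(retry_schedule(2000, 100.0))
```

Rounding up to a friendlier number (35 or 40 seconds) is left to the caller, since the raw value is what the formula actually produces.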
Python implementation: stream timeout with RunGuard budget guard
The following implementation uses aiohttp for streaming with granular timeout control, exponential backoff that distinguishes stream timeouts from connection timeouts, and RunGuard’s BudgetGuard to abort before a retry storm accumulates cost.
```python
import asyncio
import json
import os
from dataclasses import dataclass

import aiohttp
import runguard

# Read the API key from the environment rather than hard-coding it.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]


def _raise_budget_exceeded(ctx) -> None:
    raise runguard.BudgetExceededError(f"Session budget exceeded: {ctx.total_usd:.4f}")


# RunGuard budget guard: aborts the agent if session cost exceeds $0.50
guard = runguard.BudgetGuard(
    session_budget_usd=0.50,
    on_budget_exceeded=_raise_budget_exceeded,
)


class ConnectionTimeoutError(Exception):
    """No tokens billed: safe to retry with same parameters."""


class StreamTimeoutError(Exception):
    """Tokens billed up to timeout point: retry with reduced max_tokens."""


@dataclass
class TimeoutConfig:
    connect_seconds: float = 8.0           # Connection timeout: no charge if this fires
    first_token_seconds: float = 15.0      # Time to first token
    between_chunks_seconds: float = 10.0   # Max silence between streamed chunks


def build_timeout(max_tokens: int, model_speed_tps: float = 100.0) -> TimeoutConfig:
    """Derive timeout from max_tokens and model speed."""
    generation_seconds = max_tokens / model_speed_tps
    safety = generation_seconds * 0.5
    ttft_buffer = 3.0
    return TimeoutConfig(
        connect_seconds=8.0,
        first_token_seconds=ttft_buffer + 5.0,
        between_chunks_seconds=min(generation_seconds + safety, 60.0),
    )


async def stream_openai_with_timeout(
    messages: list[dict],
    max_tokens: int,
    model: str = "gpt-4o",
    timeout_cfg: TimeoutConfig | None = None,
) -> str:
    """
    Stream a GPT-4o response with granular timeout control.
    Returns the completed text or raises on timeout/budget exceeded.
    """
    if timeout_cfg is None:
        speed = 100.0 if "gpt-4o" in model else 80.0
        timeout_cfg = build_timeout(max_tokens, speed)

    aiohttp_timeout = aiohttp.ClientTimeout(
        sock_connect=timeout_cfg.connect_seconds,
        sock_read=timeout_cfg.between_chunks_seconds,
    )
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
    }

    chunks: list[str] = []
    token_count = 0  # counts content deltas, used as an approximation of output tokens
    connection_established = False

    async with aiohttp.ClientSession(timeout=aiohttp_timeout) as session:
        try:
            async with session.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                json=payload,
            ) as resp:
                connection_established = True
                resp.raise_for_status()
                async for line in resp.content:
                    line_str = line.decode().strip()
                    if not line_str.startswith("data: "):
                        continue
                    data = line_str[6:]
                    if data == "[DONE]":
                        break
                    parsed = json.loads(data)
                    delta = parsed["choices"][0]["delta"].get("content", "")
                    if delta:
                        chunks.append(delta)
                        token_count += 1
                        # Report token cost to RunGuard every 50 tokens
                        if token_count % 50 == 0:
                            guard.record_tokens(
                                input_tokens=0,
                                output_tokens=50,
                                cost_per_output_mtok=10.00,  # GPT-4o output
                                model=model,
                            )
        except aiohttp.ServerTimeoutError as exc:
            if not connection_established:
                # Connection timeout: no charge, safe to retry
                raise ConnectionTimeoutError("Connection timeout: no tokens billed") from exc
            # Stream timeout: tokens already billed for what was generated
            partial = "".join(chunks)
            raise StreamTimeoutError(
                f"Stream timeout after {token_count} tokens "
                f"(${token_count * 10 / 1_000_000:.6f} billed). "
                f"Partial: {partial[:100]}..."
            ) from exc

    return "".join(chunks)


async def call_with_smart_retry(
    messages: list[dict],
    max_tokens: int,
    model: str = "gpt-4o",
    max_retries: int = 3,
) -> str:
    """
    Retry strategy that distinguishes timeout types:
    - ConnectionTimeoutError: full retry, same max_tokens
    - StreamTimeoutError: retry with max_tokens halved
    - BudgetExceededError: abort immediately, no retry
    """
    attempt_max_tokens = max_tokens
    backoff = 1.0
    for attempt in range(max_retries):
        try:
            timeout_cfg = build_timeout(attempt_max_tokens)
            return await stream_openai_with_timeout(
                messages=messages,
                max_tokens=attempt_max_tokens,
                model=model,
                timeout_cfg=timeout_cfg,
            )
        except runguard.BudgetExceededError:
            # Budget guard tripped: do not retry under any circumstances
            raise
        except ConnectionTimeoutError:
            # Free retry: no tokens were charged
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)
        except StreamTimeoutError:
            # Paid for partial tokens: retry with halved max_tokens
            if attempt == max_retries - 1:
                raise
            attempt_max_tokens = max(attempt_max_tokens // 2, 256)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)
    raise RuntimeError("Max retries exceeded")
```
The key insight in this implementation is the connection_established flag. It separates the two timeout classes before any retry decision is made. Connection timeouts get a free retry with unchanged parameters; stream timeouts get a retry with max_tokens halved, cutting both the retry cost and the retry timeout duration. RunGuard’s BudgetGuard fires before the retry loop can compound cost beyond the session budget. For more on preventing retry accumulation, see AI agent retry storm prevention.
TypeScript implementation: AbortController with timeout classification
The TypeScript approach uses the AbortController API for fine-grained timeout control, with RunGuard’s TypeScript SDK integrated for budget enforcement.
```typescript
import RunGuard from "@runguard/sdk";

// Dedicated error type so budget aborts are never mistaken for timeouts.
class BudgetExceededError extends Error {
  constructor(msg: string) { super(msg); this.name = "BudgetExceededError"; }
}

const guard = new RunGuard.BudgetGuard({
  sessionBudgetUsd: 0.5,
  onBudgetExceeded: (ctx) => {
    throw new BudgetExceededError(`Budget exceeded: $${ctx.totalUsd.toFixed(4)}`);
  },
});

interface TimeoutConfig {
  connectMs: number;
  streamMs: number; // per-chunk timeout for streaming
}

function buildTimeout(maxTokens: number, modelSpeedTps = 100): TimeoutConfig {
  const generationMs = (maxTokens / modelSpeedTps) * 1000;
  const safetyMs = generationMs * 0.5;
  return {
    connectMs: 8_000,
    streamMs: Math.min(generationMs + safetyMs + 3_000, 60_000),
  };
}

class ConnectionTimeoutError extends Error {
  constructor(msg: string) { super(msg); this.name = "ConnectionTimeoutError"; }
}

class StreamTimeoutError extends Error {
  public readonly tokensGenerated: number;
  constructor(msg: string, tokensGenerated: number) {
    super(msg);
    this.name = "StreamTimeoutError";
    this.tokensGenerated = tokensGenerated;
  }
}

const sleep = (ms: number): Promise<void> => new Promise((r) => setTimeout(r, ms));

async function streamWithTimeout(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number,
  model = "gpt-4o",
  timeoutCfg?: TimeoutConfig,
): Promise<string> {
  const cfg = timeoutCfg ?? buildTimeout(maxTokens);

  // Connect timeout: abort if no response within connectMs
  const connectController = new AbortController();
  const connectTimer = setTimeout(() => connectController.abort(), cfg.connectMs);

  let response: Response;
  try {
    response = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages, max_tokens: maxTokens, stream: true }),
      signal: connectController.signal,
    });
  } catch (err: unknown) {
    // fetch rejects with the abort reason, so inspect the signal instead of the error message
    if (connectController.signal.aborted) {
      throw new ConnectionTimeoutError("Server did not respond: no tokens billed");
    }
    throw err;
  } finally {
    clearTimeout(connectTimer);
  }
  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  // Switch to per-chunk stream timeout after connection established
  const chunks: string[] = [];
  let tokenCount = 0;
  let stalled = false;
  let buffer = ""; // holds an SSE line that was split across two network chunks
  let streamTimer: ReturnType<typeof setTimeout> | undefined;
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  const resetStreamTimer = (): void => {
    clearTimeout(streamTimer);
    streamTimer = setTimeout(() => {
      // cancel() resolves the pending read() with done: true rather than throwing,
      // so record the stall and let the read loop raise the error.
      stalled = true;
      void reader.cancel();
    }, cfg.streamMs);
  };

  try {
    resetStreamTimer();
    while (true) {
      const { done, value } = await reader.read();
      if (stalled) throw new Error("stream-stalled");
      if (done) break;
      resetStreamTimer();
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        const data = line.slice(6).trim();
        if (data === "[DONE]") continue;
        let parsed: any;
        try {
          parsed = JSON.parse(data);
        } catch {
          continue; // skip malformed SSE lines
        }
        const delta = parsed.choices?.[0]?.delta?.content ?? "";
        if (!delta) continue;
        chunks.push(delta);
        tokenCount++;
        // Kept outside the JSON try/catch so a budget error is not swallowed
        if (tokenCount % 50 === 0) {
          guard.recordTokens({
            inputTokens: 0,
            outputTokens: 50,
            costPerOutputMtok: 10.0,
            model,
          });
        }
      }
    }
  } catch (err: unknown) {
    if (err instanceof BudgetExceededError) throw err;
    if (tokenCount > 0) {
      const billedUsd = (tokenCount * 10) / 1_000_000;
      throw new StreamTimeoutError(
        `Stream went silent after ${tokenCount} tokens ($${billedUsd.toFixed(6)} billed)`,
        tokenCount,
      );
    }
    throw new ConnectionTimeoutError("Stream never started: no tokens billed");
  } finally {
    clearTimeout(streamTimer);
  }
  return chunks.join("");
}

async function callWithSmartRetry(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number,
  model = "gpt-4o",
  maxRetries = 3,
): Promise<string> {
  let attemptMaxTokens = maxTokens;
  let backoffMs = 1_000;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const cfg = buildTimeout(attemptMaxTokens);
      return await streamWithTimeout(messages, attemptMaxTokens, model, cfg);
    } catch (err: unknown) {
      // Budget guard tripped: never retry
      if (err instanceof BudgetExceededError) throw err;
      if (err instanceof ConnectionTimeoutError) {
        if (attempt === maxRetries - 1) throw err;
        await sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 30_000);
        continue;
      }
      if (err instanceof StreamTimeoutError) {
        if (attempt === maxRetries - 1) throw err;
        // Halve max_tokens on stream timeout: reduces retry cost + timeout duration
        attemptMaxTokens = Math.max(Math.floor(attemptMaxTokens / 2), 256);
        await sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 30_000);
        continue;
      }
      throw err;
    }
  }
  throw new Error("Max retries exceeded");
}
```
The resetStreamTimer pattern provides true per-chunk timeout semantics: the timer resets every time a chunk arrives, so a healthy streaming response that takes 45 seconds total won’t be interrupted as long as each chunk arrives within the configured window (cfg.streamMs, which buildTimeout derives from max_tokens and caps at 60 seconds; a fixed 10-second window is a reasonable alternative). Only genuine stream stalls, where no chunk arrives for the whole window, trigger the abort. For streaming responses this is more accurate than a wall-clock timeout on the full response. Pair it with real-time runaway cost prevention to cap session spend end-to-end.
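The same reset-on-every-chunk idea can be sketched in Python without any HTTP library, which makes the semantics easy to test in isolation. This is an illustrative sketch, not part of the implementation above: `StreamStalled`, `read_with_chunk_timeout`, and `fake_stream` are hypothetical names, and the fake stream stands in for a real token stream.

```python
import asyncio
from typing import AsyncIterator


class StreamStalled(Exception):
    """Raised when no chunk arrives within the per-chunk window."""

    def __init__(self, chunks_received: int):
        super().__init__(f"no chunk within window after {chunks_received} chunks")
        self.chunks_received = chunks_received


async def read_with_chunk_timeout(stream: AsyncIterator[str], per_chunk_s: float) -> list[str]:
    """Collect chunks, applying the timeout to each wait rather than to the whole stream."""
    chunks: list[str] = []
    iterator = stream.__aiter__()
    while True:
        try:
            # A fresh timeout starts for every chunk, so total duration is unbounded
            # as long as the stream keeps making progress.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=per_chunk_s)
        except StopAsyncIteration:
            return chunks
        except asyncio.TimeoutError:
            raise StreamStalled(len(chunks)) from None
        chunks.append(chunk)


async def fake_stream(gaps_s: list[float]) -> AsyncIterator[str]:
    """Stand-in for a token stream: waits gaps_s[i] seconds before yielding chunk i."""
    for i, gap in enumerate(gaps_s):
        await asyncio.sleep(gap)
        yield f"chunk-{i}"


if __name__ == "__main__":
    # 15 chunks, 0.02 s apart: 0.3 s total, longer than the 0.2 s window, yet it completes.
    steady = asyncio.run(read_with_chunk_timeout(fake_stream([0.02] * 15), per_chunk_s=0.2))
    print(len(steady))
```

A stream that delivers one chunk and then goes quiet for longer than `per_chunk_s` raises `StreamStalled` with `chunks_received == 1`, which is the signal a retry layer would use to treat the attempt as a billed stream timeout.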
Cost arithmetic: why wrong timeouts are so expensive
These numbers illustrate why timeout misconfiguration compounds rapidly at production scale.
- Scenario: 10-second timeout on 2,000-token GPT-4o call, retry ×3. Model speed 100 tok/s, so at 10 seconds approximately 1,000 output tokens generated per attempt. Cost per attempt: 1,000 × $10/MTok = $0.010. Three failed attempts: $0.030. If the task runs 500 times/day under this misconfiguration: $15.00/day in pure waste. Monthly: $450 in charges that returned zero useful output. The fix — setting the correct 30-second timeout — costs nothing but a configuration change.
- Scenario: 300-second timeout allowing a stalled stream. Claude Sonnet 3.5 stalls mid-generation. Input tokens: 1,500 at $3/MTok = $0.0045. Output tokens before stall: 400 at $15/MTok = $0.006. Total per stalled call: $0.0105. With a 300-second timeout and no circuit breaker, your agent blocks for five minutes per stall, accumulating charges and preventing other work. With RunGuard capping the session at $0.50 and a 60-second max timeout, the stall is detected within 60 seconds and budget spend is bounded.
- Anthropic prompt caching reduces the input-side timeout tax. If your system prompt is cached (cache read at $0.30/MTok vs $3.00/MTok), a timeout-and-retry that re-sends the same system prompt only costs $0.30/MTok for the cached prefix. This means caching also reduces the cost of timeout-induced retries by up to 90% on the input side. See Anthropic Claude API cost optimization for full caching guidance.
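The three scenarios above reduce to a few multiplications, sketched here so the figures can be re-run with your own volumes and prices. The helper names `usd` and `timeout_waste` are hypothetical; the prices and token counts are the ones quoted in the scenarios, and the 30-day month is an assumption.

```python
def usd(tokens: int, usd_per_mtok: float) -> float:
    """Cost of a token count at a per-million-token price."""
    return tokens * usd_per_mtok / 1_000_000


def timeout_waste(
    tokens_at_timeout: int,
    failed_attempts: int,
    usd_per_mtok: float,
    runs_per_day: int,
    days_per_month: int = 30,
) -> dict[str, float]:
    """Output-token spend on attempts that timed out and returned nothing."""
    per_run = usd(tokens_at_timeout, usd_per_mtok) * failed_attempts
    per_day = per_run * runs_per_day
    return {"per_run": per_run, "per_day": per_day, "per_month": per_day * days_per_month}


# Scenario 1: 10-second timeout on a 2,000-token GPT-4o call, three failed attempts.
waste = timeout_waste(tokens_at_timeout=1000, failed_attempts=3,
                      usd_per_mtok=10.0, runs_per_day=500)

# Scenario 2: one stalled Claude Sonnet call (1,500 input tokens, 400 output tokens).
stalled_call = usd(1500, 3.0) + usd(400, 15.0)

# Scenario 3: re-sending a 1,500-token prompt on retry, uncached vs cache read.
uncached_input = usd(1500, 3.0)
cached_input = usd(1500, 0.30)
input_saving = 1 - cached_input / uncached_input

print(round(waste["per_run"], 4), round(waste["per_day"], 2), round(waste["per_month"], 2))
print(round(stalled_call, 4), round(input_saving, 2))
```

The output-side numbers ignore input tokens re-sent on each retry, so real waste is somewhat higher unless the prompt is cached.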
Timeout strategy comparison: cost risk, retry behavior, and latency impact
| Strategy | Timeout window | Cost risk on timeout | Retry behavior | Latency impact | Recommended for |
|---|---|---|---|---|---|
| Fixed short (<10s) | 10 seconds hard | High — frequent partial charges + retry storms | Retries full cost each time | Fast failure, high retry rate | Not recommended for LLM calls |
| Fixed long (>120s) | 120–300 seconds hard | High — stalled calls block budget guard | Infrequent but expensive | Very high tail latency | Not recommended without circuit breaker |
| Formula-derived wall-clock | max_tokens ÷ speed × 1.5 | Medium — some partial charges on stall | Retry with same max_tokens | Moderate, bounded by formula | Non-streaming calls |
| Per-chunk stream timeout | 10s between chunks | Low — catches stalls early | Retry with halved max_tokens | Minimal on healthy streams | Streaming calls (recommended) |
| Per-chunk + RunGuard budget cap | 10s between chunks + $0.50 session cap | Very low — budget guard caps retry accumulation | Retries bounded by session budget | Minimal on healthy streams | Production agents (recommended) |
| Connection-only timeout (no stream timeout) | 8s connect, unlimited stream | Extreme — unbounded stream cost | N/A (no stream retries triggered) | Unbounded tail latency | Never in production |
Stop paying for timed-out tokens
LLM API timeout cost impact is one of the most overlooked sources of wasted spend in production agent systems. The mechanics are clear: connection timeouts are free, stream timeouts are charged at full output rates, and a retry policy that doesn’t distinguish between them will triple your bill on every timeout event. The fix requires three things working together: a per-chunk stream timeout derived from max_tokens and model output speed, a retry strategy that halves max_tokens on stream timeout retries, and a session-level budget guard that aborts before retry accumulation exceeds your cost threshold. RunGuard provides the budget guard layer with one-line integration into any Python or TypeScript agent.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
Start your 14-day free trial — or explore related: retry storm prevention, context window truncation alerts, and autonomous agent cost control best practices.