Context window overflow is the silent failure mode your agent monitoring probably misses.

Every LLM has a context window limit: Claude Sonnet/Opus 4.x at 200,000 tokens, GPT-4o at 128,000, Llama 3.1 70B at 131,072. When an AI agent accumulates enough messages (tool call results, system prompts, long documents retrieved from RAG, multi-turn conversation history), it eventually approaches that limit. What happens next depends on the provider and the SDK version, and in most cases it is worse than a clean error.

The Anthropic API returns HTTP 400 with "prompt is too long" when you exceed the hard limit, and the request is rejected outright. That is at least observable: you get an exception, your error handler fires, you log the event.

The silent failure happens before the hard limit. When your messages array plus the max_tokens headroom you reserved for the completion pushes the projected token count past the model’s effective context, some providers and SDK configurations will silently drop the oldest messages from the conversation history before sending the request. Your agent continues running. It still gets a response. But the response quality has degraded, because the tool call results from three turns ago, which the current turn depends on, were truncated out of the context window before the generation fired. No exception. No alert. Just wrong output and a bill.

RunGuard’s ContextGuard detects this projected overflow before the API call goes out, so you can compact the context, escalate to a human, or trip the circuit breaker rather than continue with corrupted agent state.
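To make the projected-overflow idea concrete, here is a minimal sketch of the comparison described above: count the full messages array, subtract the completion reserve from the model limit, and compare. The ChatMessage type, estimateTokens, and checkProjectedOverflow are hypothetical names for illustration, not RunGuard’s API, and the 4-characters-per-token estimate is an assumption you would replace with your provider’s tokenizer.

```typescript
// Hypothetical message shape; real provider SDKs use richer types.
type ChatMessage = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Crude estimator (about 4 characters per token). An assumption for illustration:
// swap in your provider's tokenizer or token-counting endpoint for real use.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

interface OverflowCheck {
  projectedTokens: number;
  effectiveLimit: number;
  overflows: boolean;
}

// Compare the un-trimmed projection against the limit minus the completion reserve.
function checkProjectedOverflow(
  messages: ChatMessage[],
  maxContextTokens: number,
  headroom: number,
): OverflowCheck {
  const projectedTokens = estimateTokens(messages);
  const effectiveLimit = maxContextTokens - headroom;
  return { projectedTokens, effectiveLimit, overflows: projectedTokens > effectiveLimit };
}

// A single tool result of 800,000 characters projects to 200,000 tokens.
const bigToolResult: ChatMessage = { role: "tool", content: "x".repeat(800_000) };
const check = checkProjectedOverflow([bigToolResult], 200_000, 8_192);
console.log(check.overflows); // true: 200,000 > 191,808
```

The point of running this before the HTTP call is that the caller still holds the full array and can decide what to do with it, rather than discovering the problem from degraded output.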

How context window overflow actually manifests in production agents

Why most observability stacks miss the truncation case

LangSmith, Langfuse, Braintrust, and Helicone all record the token count from each LLM call’s usage field after the generation returns. The usage.input_tokens field tells you how many tokens the model processed for the request that completed. It does not tell you which tokens were in the request before a framework trimmed the messages array. If your orchestration layer silently dropped the five oldest tool-call results before sending the request (because the accumulated context was over the threshold), the observability platform records the token count of the trimmed request: the number that fit, not the number that should have fit. The truncation is invisible in the trace because the observability platform captures data at the API boundary, after the trimming has already happened at the SDK or framework layer.

The only place that knows the projected token count before any trimming, and before the HTTP call, is the code that assembles the messages array. That is exactly where RunGuard’s ContextGuard operates: the opts.tokens(input) function you provide is called with the full, un-trimmed messages array before the HTTP call, and the guard fires if the projected count exceeds maxContextTokens - headroom. The observability platform never sees a truncated request, because the guard fires before the request is sent.
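The gap between the two numbers is easy to reproduce. The sketch below simulates a framework that drops the oldest messages until the request fits, then compares the count taken before trimming with the count of what was actually sent. trimOldest and roughTokens are hypothetical stand-ins (no specific framework is being modelled), and the token estimate is again an assumed 4 characters per token.

```typescript
type TraceMessage = { role: string; content: string };

// Crude estimator, about 4 characters per token (an assumption for illustration).
function roughTokens(messages: TraceMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Hypothetical framework behaviour: drop the oldest messages until the request fits.
function trimOldest(messages: TraceMessage[], limit: number): TraceMessage[] {
  const kept = [...messages];
  while (kept.length > 1 && roughTokens(kept) > limit) {
    kept.shift();
  }
  return kept;
}

const conversation: TraceMessage[] = [
  { role: "tool", content: "a".repeat(4_000) }, // 1,000 tokens, needed by the final turn
  { role: "tool", content: "b".repeat(4_000) }, // 1,000 tokens
  { role: "user", content: "c".repeat(2_000) }, // 500 tokens
];

// What a pre-call guard sees: the full, un-trimmed array.
const projectedBeforeTrim = roughTokens(conversation); // 2,500

// What goes over the wire, and therefore what a trace of the request reflects.
const sentMessages = trimOldest(conversation, 2_000);
const reportedAfterTrim = roughTokens(sentMessages); // 1,500

const droppedTokens = projectedBeforeTrim - reportedAfterTrim; // 1,000
console.log({ projectedBeforeTrim, reportedAfterTrim, droppedTokens });
```

A trace that only records the 1,500 figure looks healthy. The 1,000 missing tokens, which here are the first tool result, appear nowhere unless the count is taken before trimming.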

Using RunGuard’s ContextGuard

What ContextOverflowError tells you that a 400 does not

An HTTP 400 from Anthropic’s API with "prompt is too long" tells you that the request you sent exceeded the hard limit. It does not tell you by how many tokens, which messages in the array contributed the most tokens, whether a smaller headroom reservation would have let the call proceed, or what the effective limit was after accounting for the max_tokens you reserved for the response.

RunGuard’s ContextOverflowError carries four fields: e.projectedTokens (the token count your opts.tokens() function returned), e.maxContextTokens (the model’s hard limit you configured), e.headroom (the completion reserve you set), and e.effectiveLimit (maxContextTokens - headroom, the actual ceiling the guard checks against). Armed with these four numbers, your error handler can make an informed decision: if projectedTokens - effectiveLimit < 10_000, try compacting the last tool-call result (the marginal contributor) and retry; if projectedTokens > effectiveLimit * 1.5, the context has grown too large for this run and a full summarization pass is needed. A 400 from the API gives you none of this structured data. It gives you a string error message, after you have already paid for the tokens in every preceding call in the run.
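The decision rule above can be sketched as a small function. OverflowErrorSketch is a local stand-in that carries the four fields named in the text; it is not the SDK’s ContextOverflowError class, which you would catch in a real integration. The two thresholds come from the paragraph above. The middle branch, for overshoots between those two cases, is not prescribed by the text and is an assumption added so the function is total.

```typescript
// Local stand-in with the four fields described above. Not the SDK's class.
class OverflowErrorSketch extends Error {
  readonly effectiveLimit: number;

  constructor(
    readonly projectedTokens: number,
    readonly maxContextTokens: number,
    readonly headroom: number,
  ) {
    super(`projected ${projectedTokens} tokens exceeds limit ${maxContextTokens - headroom}`);
    this.effectiveLimit = maxContextTokens - headroom;
  }
}

type Recovery = "compact-last-tool-result" | "summarize-run" | "compact-older-turns";

function chooseRecovery(e: OverflowErrorSketch): Recovery {
  const overshoot = e.projectedTokens - e.effectiveLimit;
  // Small overshoot: the most recent tool result is the marginal contributor.
  if (overshoot < 10_000) return "compact-last-tool-result";
  // Far past the ceiling: the run needs a full summarization pass.
  if (e.projectedTokens > e.effectiveLimit * 1.5) return "summarize-run";
  // Assumption: the source does not say what to do in between.
  return "compact-older-turns";
}

// With a 200,000 limit and 8,192 headroom, the effective limit is 191,808.
console.log(chooseRecovery(new OverflowErrorSketch(195_000, 200_000, 8_192))); // "compact-last-tool-result"
console.log(chooseRecovery(new OverflowErrorSketch(300_000, 200_000, 8_192))); // "summarize-run"
```

None of these branches is possible from a bare 400, because the handler would have to re-count the messages itself to learn the overshoot.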

Context compaction strategies after a ContextOverflowError

Context limits by model (2026)

| Model | Context window | Recommended headroom |
| --- | --- | --- |
| Claude Opus 4.7 / Sonnet 4.6 | 200,000 tokens | 8,192–16,384 |
| GPT-4.1 / GPT-4.1-mini | 1,047,576 tokens | 4,096–8,192 |
| GPT-4o / GPT-4o-mini | 128,000 tokens | 4,096–8,192 |
| Llama 3.1 70B / 405B | 131,072 tokens | 4,096–8,192 |
| Gemini 1.5 Pro | 1,048,576 tokens | 32,768 |
| Mistral Large | 128,000 tokens | 4,096–8,192 |

These limits are the numbers you should pass to maxContextTokens. The headroom should equal at least your configured max_tokens for completions, plus a safety buffer (4,096–8,192 tokens) for prompt growth between the token-count call and the API call. Token counting is an approximation: different tokenizers return slightly different counts for the same text, so the buffer also absorbs tokenizer mismatch.
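As a worked example of the headroom rule, the sketch below adds a completion reserve to a safety buffer and subtracts the result from the model limit. headroomFor and effectiveLimitFor are hypothetical helper names, and the specific values (8,192 for max_tokens, 4,096 for the buffer) are illustrative choices, not recommendations from the SDK.

```typescript
// Headroom = completion reserve (your max_tokens) + safety buffer for prompt growth
// and tokenizer mismatch.
function headroomFor(maxCompletionTokens: number, safetyBuffer: number): number {
  return maxCompletionTokens + safetyBuffer;
}

// The ceiling a pre-call check compares against.
function effectiveLimitFor(maxContextTokens: number, headroom: number): number {
  if (headroom >= maxContextTokens) {
    throw new RangeError("headroom must be smaller than the context window");
  }
  return maxContextTokens - headroom;
}

// A 200,000-token model with max_tokens = 8,192 and a 4,096-token buffer.
const headroom = headroomFor(8_192, 4_096); // 12,288
const effectiveLimit = effectiveLimitFor(200_000, headroom); // 187,712
console.log({ headroom, effectiveLimit });
```

With these values, any projected input above 187,712 tokens should be treated as an overflow even though the model nominally accepts 200,000.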

What this is not

The minimum integration

One npm i @runguard/sdk (TypeScript) or pip install runguard (Python). One guard() wrap around the function that calls your LLM provider SDK. Three new option fields: context.maxContextTokens (the model’s hard limit), context.headroom (your completion reserve), and context.tokens (a function that counts tokens for your guarded function’s input). One catch block that handles ContextOverflowError. That is the entire integration delta for context window protection.

The full API is documented in llms.txt. The LangChain circuit breaker page and the LangGraph infinite loop guard page cover the same guard wrapper applied to framework-specific call sites, and the loop detection fundamentals page explains how the loop guard and context guard interact when both are active.
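To show where those pieces sit relative to each other, here is a self-contained sketch that uses local stand-ins instead of the SDK. guardSketch, ContextOverflowSketch, and callModel are hypothetical names written for this example; they mirror the option names in the text but are not RunGuard’s actual guard() signature or error class, which are documented in llms.txt. The tiny limits and the 4-characters-per-token counter are assumptions that keep the example small.

```typescript
// The three option fields named in the text, on a local stand-in type.
interface ContextOptions<I> {
  maxContextTokens: number;
  headroom: number;
  tokens: (input: I) => number;
}

class ContextOverflowSketch extends Error {
  constructor(
    readonly projectedTokens: number,
    readonly effectiveLimit: number,
  ) {
    super(`projected ${projectedTokens} tokens exceeds effective limit ${effectiveLimit}`);
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, ContextOverflowSketch.prototype);
  }
}

// Stand-in wrapper: count first, and only call the wrapped function if the input fits.
function guardSketch<I, O>(fn: (input: I) => Promise<O>, context: ContextOptions<I>) {
  return async (input: I): Promise<O> => {
    const projected = context.tokens(input);
    const limit = context.maxContextTokens - context.headroom;
    if (projected > limit) throw new ContextOverflowSketch(projected, limit);
    return fn(input);
  };
}

// Fake provider call so the sketch runs without network access.
let providerCalls = 0;
async function callModel(prompt: string): Promise<string> {
  providerCalls += 1;
  return `echo:${prompt.length}`;
}

const guardedCall = guardSketch(callModel, {
  maxContextTokens: 1_000,
  headroom: 200,
  tokens: (prompt: string) => Math.ceil(prompt.length / 4),
});

// The one catch block: handle the overflow, rethrow anything else.
async function runOnce(prompt: string): Promise<string> {
  try {
    return await guardedCall(prompt);
  } catch (e) {
    if (e instanceof ContextOverflowSketch) return "compact-and-retry";
    throw e;
  }
}

// 4,000 characters project to 1,000 tokens, above the effective limit of 800.
runOnce("x".repeat(4_000)).then((result) => console.log(result)); // logs "compact-and-retry"
```

In the overflow case the fake provider is never called, which is the property that matters: the decision is made before any tokens are sent or billed.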