Context window overflow is the silent failure mode your agent monitoring probably misses.
Every LLM has a context window limit: Claude Sonnet/Opus 4.x at 200,000 tokens, GPT-4o at 128,000, Llama 3.1 70B at 131,072. When an AI agent accumulates enough messages (tool call results, system prompts, long documents retrieved from RAG, multi-turn conversation history), it eventually approaches that limit. What happens next depends on the provider, the framework, and the SDK version, and in most cases it is worse than a clean error.

The Anthropic API returns HTTP 400 with "prompt is too long" when you exceed the hard limit, and the request is rejected outright. That is at least observable: you get an exception, your error handler fires, you log the event.

The silent failure happens before the hard limit. When your messages array plus the max_tokens headroom you reserved for the completion pushes the projected token count past the model’s effective context, some providers, frameworks, and SDK configurations will drop the oldest messages from the conversation history before sending the request. Your agent continues running. It still gets a response. But the response quality has degraded, because the tool call results from three turns ago, which the current turn depends on, were trimmed out of the context window before the generation fired. No exception. No alert. Just wrong output and a bill.

RunGuard’s ContextGuard detects this projected overflow before the API call goes out, so you can compact the context, escalate to a human, or trip the circuit breaker rather than continue with corrupted agent state.
How context window overflow actually manifests in production agents
- The silent truncation case: oldest messages dropped, no exception thrown. The Anthropic Python SDK’s client.messages.create() does not validate context size before the HTTP call. If your messages array contains more tokens than the model’s context minus max_tokens, the API rejects the request with HTTP 400, but at the SDK layer there is no pre-call count. Some agent orchestration frameworks (LangChain, AutoGen, older CrewAI versions) implement their own context management that silently trims the messages array before each call when the projected token count exceeds a configured threshold. If you are using one of these frameworks and the threshold is set too low, or if the trimming strategy drops tool-call result messages rather than the equivalent number of tokens from earlier turns, your agent receives a response generated without the context it expected. The response is not marked as truncated. The token count in response.usage looks normal. The quality problem is invisible at the API boundary.
- The hard-limit 400 case: a valid run aborts mid-task. If you are not using a framework that pre-trims, the Anthropic and OpenAI APIs return a 400 when the request body exceeds the hard context limit. For a multi-step agent that has been running for thirty seconds and accumulating tool-call results, a mid-run 400 means the entire run aborts without completing the task. The user sees an error. The LLM calls that ran before the 400 are billed at normal rates. The task requires a retry from scratch. This is observable (you get an exception), but it is expensive in both money and latency, and it is entirely avoidable if you detect the projected overflow before the call that triggers the 400.
- The near-limit degradation case: the model loses relevant context without exceeding the limit. Even before you hit the hard 200,000-token limit on Claude 4.x, a context approaching 150,000–180,000 tokens can cause noticeable quality degradation on tasks that require tracking state across the full conversation. Recall tends to weaken on very long contexts: the model still technically “sees” the early messages, but their effective influence on the generation decreases as the context grows. A research agent that called a search tool in turn 3 and accumulated 100,000 more tokens of context by turn 15 may fail to reference the search results from turn 3 in its turn-15 synthesis, not because the tokens were dropped but because the model gives those early tokens little weight. Detecting projected overflow before the limit lets you compact the context at a defined threshold (say, when projected tokens exceed 80% of the limit, which is 160,000 tokens for a 200,000-token model) rather than letting the context grow unchecked.
- The compound failure case: a loop generates context overflow. A loop detector and a context guard are complementary. The most common path to context overflow in a production agent is a tool-call loop: the agent calls the same tool with the same arguments repeatedly, each call appends the tool result to the messages array, and the accumulating tool results fill the context window in a fraction of the time a well-behaved run would take. A loop detector catches the repetition pattern early (at the third repetition) and throws LoopDetectedError; a context guard catches the overflow condition even if the loop produces non-identical outputs that don’t trigger the signature-based detector. Both guards are in the RunGuard SDK. Adding both, with guard(fn, { loop: { repeats: 3 }, context: { maxContextTokens: 200000, headroom: 4096 } }), covers both failure modes with a single wrapper.
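To make the silent truncation case concrete, here is a minimal sketch of a front-trimming strategy of the kind described above. It is not taken from any specific framework: `trim_oldest`, the word-count tokenizer, and the sample conversation are all hypothetical, chosen only to show how a tool result can disappear without any error being raised.

```python
from typing import Callable

Message = dict  # {"role": str, "content": str}


def rough_tokens(message: Message) -> int:
    """Crude stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(message["content"].split())


def trim_oldest(messages: list[Message], limit: int,
                count: Callable[[Message], int] = rough_tokens) -> list[Message]:
    """Drop messages from the front until the total fits under `limit`."""
    kept = list(messages)
    while kept and sum(count(m) for m in kept) > limit:
        kept.pop(0)  # the oldest message goes first, whatever it contains
    return kept


history = [
    {"role": "user", "content": "find the order id for customer 42"},
    {"role": "assistant", "content": "calling lookup_order tool"},
    {"role": "user", "content": "tool_result order_id=A-1001 status=shipped"},
    {"role": "assistant", "content": "order found, checking refund policy next"},
    {"role": "user", "content": "now issue the refund for that order"},
]

# The trimmed list is what would actually be sent to the model.
sent = trim_oldest(history, limit=14)
dropped = len(history) - len(sent)
tool_result_survived = any("order_id=A-1001" in m["content"] for m in sent)
print(f"dropped {dropped} messages; tool result still present: {tool_result_survived}")
```

Nothing in this path raises: the final user turn still asks for a refund on “that order”, but the message carrying the order id is no longer in the request.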
Why most observability stacks miss the truncation case
LangSmith, Langfuse, Braintrust, and Helicone all record the token count from each LLM call’s usage field after the generation returns. The usage.input_tokens field tells you how many tokens the model processed for the request that completed. What it does not tell you is which tokens were in the request before a framework trimmed the messages array. If your orchestration layer silently dropped the five oldest tool-call results before sending the request (because the accumulated context was over the threshold), the observability platform records the token count of the trimmed request: the number that fit, not the number you intended to send. The truncation is invisible in the trace because the observability platform captures data at the API boundary, after the trimming has already happened at the SDK or framework layer.

The only place that knows the projected token count before any trimming, and before the HTTP call, is the code that assembles the messages array. That is exactly where RunGuard’s ContextGuard operates: the opts.tokens(input) function you provide is called with the full, un-trimmed messages array before the HTTP call, and the guard fires if the projected count exceeds maxContextTokens - headroom. The observability platform never sees a truncated request, because the guard fires before the request is sent.
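The pre-call check described above comes down to one comparison. The following is a minimal sketch of that comparison, not RunGuard’s implementation: `ProjectedOverflow`, `check_before_call`, and `word_count` are hypothetical names, and the word-count tokenizer is a placeholder for a real one.

```python
class ProjectedOverflow(Exception):
    """Hypothetical error type; field names mirror the ones described in the text."""

    def __init__(self, projected_tokens: int, max_context_tokens: int, headroom: int):
        self.projected_tokens = projected_tokens
        self.max_context_tokens = max_context_tokens
        self.headroom = headroom
        self.effective_limit = max_context_tokens - headroom
        super().__init__(
            f"projected {projected_tokens} tokens exceeds "
            f"effective limit {self.effective_limit}"
        )


def check_before_call(messages, count_tokens, max_context_tokens: int, headroom: int) -> int:
    """Count the full, un-trimmed input and fail before any HTTP request is made."""
    projected = count_tokens(messages)
    if projected > max_context_tokens - headroom:
        raise ProjectedOverflow(projected, max_context_tokens, headroom)
    return projected


def word_count(messages) -> int:
    """Placeholder tokenizer (assumption): one token per whitespace-separated word."""
    return sum(len(m["content"].split()) for m in messages)


big = [{"role": "user", "content": "word " * 195_000}]
try:
    check_before_call(big, word_count, max_context_tokens=200_000, headroom=8_192)
except ProjectedOverflow as err:
    print(err)  # the request was never sent
```

Because the check runs on the array as assembled, the number it reports is the one a trace recorded at the API boundary cannot show.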
Using RunGuard’s ContextGuard
- TypeScript: project token count before each call

  ```typescript
  import { guard } from "@runguard/sdk";
  import Anthropic from "@anthropic-ai/sdk";
  import { encode } from "gpt-tokenizer"; // or tiktoken, or your own counter

  const client = new Anthropic();

  const MODEL_CONTEXT_TOKENS = 200_000; // Claude Sonnet/Opus 4.x
  const COMPLETION_HEADROOM = 8_192; // reserved for the response

  async function callClaude(messages: Anthropic.MessageParam[]) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: COMPLETION_HEADROOM,
      messages,
    });
    // Per-million-token rates are hard-coded here (3 input, 15 output); adjust for your model.
    const usd =
      (response.usage.input_tokens * 3 + response.usage.output_tokens * 15) / 1_000_000;
    // Loop signature: the first tool the model requested, wherever it sits in the content array.
    const toolUse = response.content.find(
      (block): block is Anthropic.ToolUseBlock => block.type === "tool_use"
    );
    const sig = toolUse ? toolUse.name : "end_turn";
    return { response, usd, sig };
  }

  const guarded = guard(callClaude, {
    budget: { maxUsd: 10 },
    loop: { repeats: 3, maxCycleLen: 8 },
    context: {
      maxContextTokens: MODEL_CONTEXT_TOKENS,
      headroom: COMPLETION_HEADROOM,
      // tokens() receives the same input your guarded function receives
      tokens: (messages: Anthropic.MessageParam[]) =>
        messages.reduce(
          (acc, m) =>
            acc +
            encode(typeof m.content === "string" ? m.content : JSON.stringify(m.content)).length,
          0
        ),
    },
  });

  // In your agent loop:
  try {
    const { response } = await guarded(messages);
  } catch (e: any) {
    if (e.reason === "context") {
      console.error(
        `Context overflow: projected ${e.projectedTokens} tokens ` +
          `(limit ${e.maxContextTokens} - headroom ${e.headroom} = ${e.effectiveLimit})`
      );
      // compact messages, escalate, or fail gracefully
    }
  }
  ```

- Python: same guard with a token-counting function

  ```python
  from runguard import guard, ContextOverflowError
  import anthropic
  import tiktoken

  client = anthropic.Anthropic()
  # cl100k_base is an OpenAI encoding, so counts for Anthropic models are approximate;
  # the headroom below has to absorb the mismatch.
  enc = tiktoken.get_encoding("cl100k_base")


  def count_tokens(messages: list) -> int:
      """Approximate token count across all message contents."""
      return sum(
          len(enc.encode(m["content"] if isinstance(m["content"], str) else str(m["content"])))
          for m in messages
      )


  def call_claude(messages: list) -> dict:
      response = client.messages.create(
          model="claude-sonnet-4-6",
          max_tokens=8192,
          messages=messages,
      )
      usd = (response.usage.input_tokens * 3 + response.usage.output_tokens * 15) / 1_000_000
      # content[0] can be a text block even when stop_reason is "tool_use",
      # so look for the first tool_use block instead of indexing.
      tool_use = next((b for b in response.content if b.type == "tool_use"), None)
      sig = tool_use.name if tool_use else "end_turn"
      return {"response": response, "usd": usd, "sig": sig}


  guarded = guard(
      call_claude,
      budget={"max_usd": 10},
      loop={"repeats": 3},
      context={
          "max_context_tokens": 200_000,
          "headroom": 8_192,
          "tokens": count_tokens,
      },
  )

  try:
      result = guarded(messages)
  except ContextOverflowError as e:
      print(f"Context overflow: {e.projected_tokens} projected, effective limit {e.effective_limit}")
      # compact messages here and retry
  ```
What ContextOverflowError tells you that a 400 does not
An HTTP 400 from Anthropic’s API with "prompt is too long" tells you that the request you sent exceeded the hard limit. It does not tell you by how many tokens, which messages in the array contributed the most tokens, whether a smaller headroom reservation would have let the call proceed, or what the effective limit was accounting for the max_tokens you reserved for the response. RunGuard’s ContextOverflowError carries four fields: e.projectedTokens (the token count your opts.tokens() function returned), e.maxContextTokens (the model’s hard limit you configured), e.headroom (the completion reserve you set), and e.effectiveLimit (maxContextTokens - headroom — the actual ceiling the guard checks against). Armed with these four numbers, your error handler can make an informed decision: if projectedTokens - effectiveLimit < 10_000, try compacting the last tool-call result (the marginal contributor) and retry; if projectedTokens > effectiveLimit * 1.5, the context has grown too large for this run and a full summarization pass is needed. A 400 from the API gives you none of this structured data — it gives you a string error message after you have already paid for the tokens in every preceding call in the run.
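The decision rule in the paragraph above can be sketched as a small function. The two thresholds (an overshoot under 10,000 tokens, and a projected count above 1.5 times the effective limit) come from the text; the function name, the returned labels, the precedence between the two rules, and the handling of the band in between are assumptions made for illustration.

```python
def choose_compaction(projected_tokens: int, effective_limit: int) -> str:
    """Pick a compaction strategy from the numbers carried by the overflow error."""
    overshoot = projected_tokens - effective_limit
    if overshoot <= 0:
        return "proceed"  # no overflow, nothing to compact
    if projected_tokens > effective_limit * 1.5:
        return "summarize"  # far over the ceiling: full summarization pass
    if overshoot < 10_000:
        return "compact_last_tool_result"  # marginal overflow: trim the latest contributor
    # Assumption: the text does not cover this middle band; dropping old pairs is one option.
    return "drop_oldest_pairs"


# Effective limit for a 200,000-token model with 8,192 tokens of headroom.
limit = 200_000 - 8_192
for projected in (195_000, 250_000, 300_000):
    print(projected, choose_compaction(projected, limit))
```

A handler built this way needs only the fields on the error object, which is the information a bare HTTP 400 does not provide.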
Context compaction strategies after a ContextOverflowError
- Drop the oldest tool-call result pairs. If your messages array contains many tool-use / tool-result pairs from earlier turns, dropping the oldest pair and retrying is often safe: the agent’s current reasoning is based on recent context, and early retrieval results may already be incorporated into the conversation narrative. Before dropping individual messages, check that your message list invariants are preserved. The Anthropic API, for example, requires every tool-use block to be answered by the corresponding tool-result block in the next user message, so the two must be removed together.
- Summarize the oldest N turns with a short LLM call. Make a separate LLM call (a fast, cheap model) that summarizes the first half of the messages array into a compact paragraph, replace those messages with a single {"role": "user", "content": "Summary of earlier context: ..."} message, and retry the guarded call. This preserves the semantic content of the early context while dramatically reducing token count. The summary call itself is a new LLM invocation, so make sure it passes through a separate RunGuard instance and the summary generation is also budget-tracked.
- Trip the circuit breaker and escalate to a human. For some tasks, context overflow means the run has grown beyond what can be completed automatically. A coding agent that has accumulated 180,000 tokens of context over forty tool calls is probably in a failure mode that no compaction strategy will cleanly resolve. In that case, letting ContextOverflowError propagate and escalating to a human review queue is the correct response. The e.projectedTokens and e.effectiveLimit fields give the reviewer the concrete information needed to understand why the run aborted.
- Set the headroom conservatively and compact proactively. Rather than waiting for a ContextOverflowError at the true limit, you can turn the guard into an early-warning threshold by setting headroom larger than your actual max_tokens reservation. For example, if your completions are always under 2,048 tokens but you set headroom: 20_000, RunGuard fires the error at 180,000 projected tokens for a 200,000-token model, giving you roughly 18,000 tokens of buffer to compact. This proactive approach prevents the near-limit quality degradation problem as well as the hard truncation case.
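As an illustration of the first strategy above, here is a minimal sketch that removes the oldest tool-use / tool-result pair as a unit. It assumes Anthropic-style messages, where an assistant message carries `tool_use` blocks and the following user message carries the matching `tool_result` blocks. The function names are hypothetical, and the sketch never removes the final message, on the assumption that the model has not yet responded to it.

```python
def _has_block(message: dict, block_type: str) -> bool:
    """True if the message content is a block list containing `block_type`."""
    content = message.get("content")
    return isinstance(content, list) and any(
        isinstance(b, dict) and b.get("type") == block_type for b in content
    )


def drop_oldest_tool_pair(messages: list[dict]) -> list[dict]:
    """Return a copy without the oldest assistant tool_use / user tool_result pair."""
    # Stop two short of the end so the final message is never removed.
    for i in range(len(messages) - 2):
        first, second = messages[i], messages[i + 1]
        if (first["role"] == "assistant" and _has_block(first, "tool_use")
                and second["role"] == "user" and _has_block(second, "tool_result")):
            # Removing both keeps user/assistant alternation and tool pairing intact.
            return messages[:i] + messages[i + 2:]
    return list(messages)  # nothing safe to drop


messages = [
    {"role": "user", "content": "What is the weather in Paris and Rome?"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "get_weather", "input": {"city": "Paris"}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "18C, cloudy"}]},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t2", "name": "get_weather", "input": {"city": "Rome"}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t2", "content": "24C, sunny"}]},
]

compacted = drop_oldest_tool_pair(messages)
print(len(messages), "->", len(compacted))
```

Whether dropping the Paris lookup is acceptable depends on whether later turns still need it, which is why the summarization strategy is the safer choice when earlier results feed the final answer.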
Context limits by model (2026)
| Model | Context window | Recommended headroom |
|---|---|---|
| Claude Opus 4.7 / Sonnet 4.6 | 200,000 tokens | 8,192–16,384 |
| GPT-4.1 / GPT-4.1-mini | 1,047,576 tokens | 4,096–8,192 |
| GPT-4o / GPT-4o-mini | 128,000 tokens | 4,096–8,192 |
| Llama 3.1 70B / 405B | 131,072 tokens | 4,096–8,192 |
| Gemini 1.5 Pro | 1,048,576 tokens | 32,768 |
| Mistral Large | 128,000 tokens | 4,096–8,192 |
These limits are the numbers you should pass to maxContextTokens. The headroom should equal at least your configured max_tokens for completions, plus a safety buffer (4,096–8,192 tokens) for prompt growth between the token-count call and the API call. Token counting is an approximation — different tokenizers return slightly different counts for the same text — so the buffer also absorbs tokenizer mismatch.
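The headroom rule above is simple arithmetic, sketched here for two of the models in the table. `plan_headroom` is a hypothetical helper, not part of RunGuard; the default buffer of 4,096 is the low end of the range suggested above.

```python
def plan_headroom(max_context_tokens: int, max_tokens: int,
                  safety_buffer: int = 4_096) -> tuple[int, int]:
    """Return (headroom, effective_limit) for a model and completion size."""
    headroom = max_tokens + safety_buffer
    if headroom >= max_context_tokens:
        raise ValueError("headroom leaves no room for input tokens")
    return headroom, max_context_tokens - headroom


# 200,000-token model with an 8,192-token completion reserve.
claude_headroom, claude_limit = plan_headroom(200_000, max_tokens=8_192)

# 131,072-token model with a 4,096-token completion reserve.
llama_headroom, llama_limit = plan_headroom(131_072, max_tokens=4_096)

print(claude_headroom, claude_limit)
print(llama_headroom, llama_limit)
```

The first value is what goes into the headroom option; the second is the ceiling the projected count is compared against.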
What this is not
- Not a token counting library. RunGuard does not implement its own tokenizer. The opts.tokens(input) function is yours to provide: it receives the same input your guarded function receives and should return an integer token count. You are responsible for choosing the right tokenizer for your model (tiktoken for OpenAI models, gpt-tokenizer for a JS port, a rough character-based estimate for quick integrations). RunGuard uses the number your function returns to compute the guard decision. If your token counter is imprecise (off by 10%, say), size your headroom so that it absorbs that imprecision.
- Not a context compression service. RunGuard throws ContextOverflowError and gives you the structured numbers to make a compaction decision. It does not compress or summarize your context for you. The compaction strategy (drop oldest messages, summarize, escalate) is application logic that depends on your agent’s domain, your acceptable quality tradeoffs, and your cost constraints. RunGuard provides the trigger point and the diagnostic data; your error handler provides the response.
- Not a replacement for prompt engineering. If your agent consistently hits context limits, that is a signal to redesign the prompt architecture: retrieve fewer documents per turn, write more compact tool-call result summaries, use a tiered approach that routes short tasks to a cheaper short-context model. RunGuard’s context guard is a circuit breaker for overflow events that slip past your prompt architecture, not a substitute for designing an agent that fits within context limits under normal operation.
- Not limited to context overflow: RunGuard also handles loops and budget. The same guard() wrapper that enables ContextGuard also enables LoopDetector (catches repeated tool-call signatures before they grow the context window) and BudgetTracker (catches per-run cost before it exceeds your cap). For most production agents, all three guards should be active simultaneously: loops generate context overflow, context overflow is expensive, and runaway cost is the downstream effect of both. One wrapper, three guards: guard(fn, { loop: { repeats: 3 }, budget: { maxUsd: 10 }, context: { maxContextTokens: 200_000, headroom: 8192, tokens: countFn } }).
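For the “rough character-based estimate” mentioned in the first point above, a counter can be as small as the sketch below. The four-characters-per-token ratio is an assumption (a common rule of thumb for English text) and it will be further off for code, JSON, or non-English content, so pair it with generous headroom. `estimate_tokens` is a hypothetical name.

```python
import json
import math

CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb for English prose


def estimate_tokens(messages: list[dict]) -> int:
    """Estimate tokens from character counts; no tokenizer dependency."""
    total_chars = 0
    for message in messages:
        content = message["content"]
        # Structured content (tool blocks, etc.) is measured by its JSON form.
        text = content if isinstance(content, str) else json.dumps(content)
        total_chars += len(text)
    return math.ceil(total_chars / CHARS_PER_TOKEN)


sample = [{"role": "user", "content": "a" * 400}]
print(estimate_tokens(sample))
```

A function with this shape (messages in, integer out) is what the tokens option expects, so it can be swapped for a real tokenizer later without touching the guard configuration.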
The minimum integration
One npm i @runguard/sdk (TypeScript) or pip install runguard (Python). One guard() wrap around the function that calls your LLM provider SDK. Three new option fields: context.maxContextTokens (the model’s hard limit), context.headroom (your completion reserve), and context.tokens (a function that counts tokens for your guarded function’s input). One catch block that handles ContextOverflowError. That is the entire integration delta for context window protection. The full API is documented in llms.txt. The LangChain circuit breaker page and the LangGraph infinite loop guard page cover the same guard wrapper applied to framework-specific call sites, and the loop detection fundamentals page explains how the loop guard and context guard interact when both are active.