LLM agent token limit exceeded in TypeScript: detect before the provider throws, recover with partial output
A TypeScript agent running a long research or coding task accumulates tool results in its context window. At some point the projected input tokens for the next call will exceed the model’s limit: 128k for GPT-4o, 200k for Claude Sonnet 4, 131k for Llama 3.1 70B. What happens next depends on which provider you’re using and how you’ve wired the call. OpenAI returns a 400 with code `context_length_exceeded`. Anthropic returns a 400 `invalid_request_error` whose message says the prompt is too long. Some older hosted models silently truncate from the beginning, corrupting earlier context without any error at all. None of these outcomes gives you a graceful recovery path. RunGuard’s context guard projects the token count before each call and trips a recoverable error with the full accumulated context still in memory, so you can compact, checkpoint, or summarize instead of losing the run.
Why token-limit overruns happen in agent loops
- Tool results are unbounded. An agent that calls a web search tool or a code execution tool appends the full result to its conversation history. A single web-search result can be 2,000–5,000 tokens. After 30 iterations, the accumulated context can easily reach 100k tokens even if the initial prompt was small. The agent has no native mechanism to evict old results — it adds to the history, never removes from it.
- The model’s effective output window shrinks as context grows. Most providers reserve headroom for the completion: a 128k model with 120k tokens in the prompt can only produce an 8k-token response. At 127k prompt tokens, the response budget is 1k — often too small for a meaningful tool call or answer. The agent is degrading silently long before the hard limit is hit.
- Loops accelerate context growth. A tool-call loop that repeats the same search query 15 times appends 15 identical (or near-identical) results to the context, each 2–4k tokens. By the time the budget guard or loop detector fires, the context has already been polluted with redundant content that eats into the effective window of every later call.
- The crash is non-recoverable without pre-call detection. Once the provider throws a 400, the entire run state is inside the exception’s stack frame. Your catch block can read the error type, but the accumulated work — every tool result, every intermediate reasoning step — is gone unless you serialized it elsewhere. Pre-call detection fires before the request leaves your process, so the context is still in memory and addressable.
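The growth arithmetic in the list above can be made concrete. This is a minimal sketch, not part of RunGuard: the helper names `completionBudget` and `iterationsUntilOverflow` are hypothetical, and it assumes that prompt and completion share one context window and that each loop iteration appends a roughly constant number of tokens.

```typescript
// Hypothetical planning helpers (not a RunGuard API).

/** Tokens left for the completion when prompt and completion share one window. */
function completionBudget(contextWindow: number, promptTokens: number): number {
  return Math.max(0, contextWindow - promptTokens);
}

/**
 * Smallest number of loop iterations after which the prompt exceeds `threshold`,
 * assuming every iteration appends about `tokensPerIteration` tokens.
 */
function iterationsUntilOverflow(
  initialTokens: number,
  tokensPerIteration: number,
  threshold: number,
): number {
  if (initialTokens > threshold) return 0;
  if (tokensPerIteration <= 0) return Infinity;
  return Math.floor((threshold - initialTokens) / tokensPerIteration) + 1;
}

console.log(completionBudget(128_000, 120_000)); // 8000
console.log(completionBudget(128_000, 127_000)); // 1000
console.log(iterationsUntilOverflow(2_000, 3_500, 120_000)); // 34
```

With a 2k-token starting prompt and about 3.5k tokens per tool result, a 120k effective limit is crossed on the 34th iteration, which matches the order of magnitude described above.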
The two failure modes: hard 400 vs. silent truncation
- Hard 400: OpenAI `context_length_exceeded`. The OpenAI API returns `HTTP 400` with `error.code === "context_length_exceeded"` and a message listing the prompt token count vs. the model’s limit. The TypeScript SDK surfaces this as an `APIError` with `status: 400`. You can catch it, but by this point the request has already been sent, the tokens have been counted, and any tool results appended to the conversation since the last successful call are already part of the rejected payload. There is no partial output: the call either succeeds or fails entirely.
- Hard 400: Anthropic `invalid_request_error` (“prompt is too long”). Anthropic returns `HTTP 400` with `error.type === "invalid_request_error"` and `error.message` containing “prompt is too long”. Same situation: the request was sent, the payload was counted, no output was produced. Anthropic’s context limit for Claude Sonnet 4 and Opus 4 is 200k tokens, much larger than GPT-4o’s 128k, but an agent accumulating tool results can still reach it within an hour of operation.
- Silent truncation: older hosted models and some proxies. Some open-source model hosts and proxy layers (Ollama, LiteLLM in certain configurations, older vLLM deployments) truncate the prompt from the beginning rather than throwing. This means the model receives a conversation history with the oldest turns silently dropped. The agent appears to succeed, since it gets a response, but it has lost context about its goals, its earlier tool results, and any constraints that were in the system prompt. The downstream output is wrong in ways that are hard to detect without re-reading the full conversation.
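If you do end up catching the provider error, the two hard-400 shapes described above can be told apart structurally. The following is a sketch under stated assumptions: `classifyContextOverflow` and `ApiErrorLike` are hypothetical names, and the fields checked (`status`, `code`, `message`) are the ones this article describes rather than a guaranteed SDK contract.

```typescript
// Sketch only: errors are inspected structurally, so no provider SDK import is needed.
type OverflowSource = "openai" | "anthropic" | null;

interface ApiErrorLike {
  status?: number;
  code?: string | null;
  message?: string;
}

/** Returns which provider rejected the prompt for length, or null for any other error. */
function classifyContextOverflow(err: unknown): OverflowSource {
  if (typeof err !== "object" || err === null) return null;
  const e = err as ApiErrorLike;
  if (e.status !== 400) return null;
  if (e.code === "context_length_exceeded") return "openai";
  if (typeof e.message === "string" && e.message.toLowerCase().includes("prompt is too long")) {
    return "anthropic";
  }
  return null;
}

// Plain objects stand in for SDK error instances here.
console.log(classifyContextOverflow({ status: 400, code: "context_length_exceeded" }));
console.log(classifyContextOverflow({ status: 400, message: "prompt is too long" }));
console.log(classifyContextOverflow({ status: 429, message: "rate limited" }));
```

Silent truncation never reaches a `catch` block, so a classifier like this cannot see it. One possible check, where the host reports prompt token usage, is to compare that figure with your own estimate and treat a large shortfall as a sign that turns were dropped.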
Detecting token overruns before the call: RunGuard context guard in TypeScript
- How pre-call detection works. Before sending each request, RunGuard’s context guard calls a user-supplied `tokens(input)` function to estimate the projected input token count. If `projected > maxContextTokens - headroom`, it throws a `ContextOverflowError` before the HTTP request is made. The guard never calls the provider, so the context is still in your process, addressable and compressible.
- TypeScript: wiring the context guard.

```typescript
import OpenAI from "openai";
import { guard, ContextOverflowError, LoopDetectedError, BudgetExceededError } from "@runguard/sdk";
import { encode } from "gpt-tokenizer"; // or tiktoken, your choice

const client = new OpenAI();

type Message = OpenAI.Chat.ChatCompletionMessageParam;

async function callModel(messages: Message[]): Promise<OpenAI.Chat.ChatCompletion> {
  return client.chat.completions.create({
    model: "gpt-4o",
    messages,
  });
}

const guardedCall = guard(callModel, {
  // Project the token count before each call; this fires before the HTTP request
  context: {
    maxContextTokens: 128_000,
    headroom: 8_000, // reserve 8k for the completion
  },
  tokens: (messages: Message[]) => {
    // Counts are exact only if the tokenizer uses the model's encoding (o200k_base for GPT-4o).
    // Only text content is counted; tool_calls arguments are not included in this estimate.
    return messages.reduce((total, msg) => {
      const text =
        typeof msg.content === "string"
          ? msg.content
          : (msg.content ?? []).map((c) => (c.type === "text" ? c.text : "")).join("");
      return total + encode(text).length + 4; // 4 overhead per message
    }, 3); // 3 for reply priming
  },
  loop: { repeats: 3, maxCycleLen: 4 },
  budget: { maxUsd: 3.0 },
});

async function runAgent(initialMessages: Message[]): Promise<string> {
  const history: Message[] = [...initialMessages];
  while (true) {
    try {
      const completion = await guardedCall(history);
      const reply = completion.choices[0].message;
      history.push(reply);
      if (!reply.tool_calls?.length) {
        return reply.content ?? "";
      }
      // Execute tools, append results...
      for (const tc of reply.tool_calls) {
        const result = await executeTool(tc);
        history.push({
          role: "tool",
          tool_call_id: tc.id,
          content: JSON.stringify(result),
        });
      }
    } catch (err) {
      if (err instanceof ContextOverflowError) {
        // Context is still in memory: compact and retry
        console.warn(
          `Context guard fired: projected ${err.projectedTokens} tokens ` +
            `vs ${err.maxContextTokens - err.headroom} limit. Compacting.`
        );
        return await compactAndResume(history);
      }
      if (err instanceof LoopDetectedError) {
        return `Agent loop detected (pattern: ${err.pattern}). Run aborted.`;
      }
      if (err instanceof BudgetExceededError) {
        return `Budget cap reached ($${err.spent.toFixed(3)}). Run aborted.`;
      }
      throw err;
    }
  }
}

async function compactAndResume(history: Message[]): Promise<string> {
  // Strategy: summarize the middle of the conversation, keep system + last 10 messages
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  let cut = Math.max(0, rest.length - 10);
  // Don't start the tail on a tool result whose assistant tool_calls message was cut off
  while (cut < rest.length && rest[cut].role === "tool") cut++;
  const middle = rest.slice(0, cut);
  const tail = rest.slice(cut);
  if (middle.length === 0) {
    // Nothing to summarize: retrying would overflow again and recurse forever
    throw new Error("Context overflow with no compactable history");
  }
  const summary = await client.chat.completions.create({
    model: "gpt-4o-mini", // cheap model for summarization
    messages: [
      {
        role: "system",
        content:
          "Summarize the following conversation history concisely, preserving key findings and decisions.",
      },
      {
        role: "user",
        content: middle
          .map((m) => `${m.role}: ${typeof m.content === "string" ? m.content : "[multi-part]"}`)
          .join("\n"),
      },
    ],
  });
  const compacted: Message[] = [
    ...system,
    { role: "assistant", content: `[Conversation summary: ${summary.choices[0].message.content}]` },
    ...tail,
  ];
  // Re-run with compacted history
  return runAgent(compacted);
}

declare function executeTool(tc: OpenAI.Chat.ChatCompletionMessageToolCall): Promise<unknown>;
```

- Why `headroom` matters. Setting `headroom: 8000` means the guard fires when projected input tokens exceed `128,000 - 8,000 = 120,000`. This ensures the model always has at least 8k tokens for its response, enough for a substantive tool call or answer. Without headroom, you could successfully send a 127,999-token prompt and get a 1-token response that truncates mid-sentence. The guard fires while the completion budget is still meaningful.
- Token counting accuracy. `gpt-tokenizer` produces accurate counts for GPT-4o when it is configured for o200k_base, the encoding GPT-4o uses (cl100k_base is the GPT-4 and GPT-3.5 encoding and will miscount). For Anthropic models, `@anthropic-ai/tokenizer` is only a rough approximation for Claude 3 and later; Anthropic’s token-counting endpoint returns exact counts at the cost of a network call. For open-source models, `tiktoken` with the appropriate encoding gives a close approximation. A 5–10% overcount is acceptable because the guard fires before the limit, not at the limit: a false positive costs a compaction, not a crash.
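To show how small the pre-call check is, here is a hand-rolled sketch with no dependencies. It is not RunGuard’s implementation: `checkContext`, `estimateTokens`, and the types are hypothetical names, and the default estimator is the crude four-characters-per-token heuristic rather than a real tokenizer.

```typescript
// Hypothetical hand-rolled pre-call check; RunGuard's real implementation may differ.
interface ChatMessage {
  role: string;
  content: string;
}

interface ContextLimits {
  maxContextTokens: number;
  headroom: number; // tokens reserved for the completion
}

interface ContextCheck {
  ok: boolean;
  projected: number;
  limit: number;
}

/** Rough estimator: ~4 characters per token, 4 tokens of overhead per message, 3 for priming. */
function estimateTokens(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4) + 4, 3);
}

/** Compare the projected prompt size with the effective limit before sending anything. */
function checkContext(
  messages: ChatMessage[],
  limits: ContextLimits,
  estimate: (m: ChatMessage[]) => number = estimateTokens,
): ContextCheck {
  const projected = estimate(messages);
  const limit = limits.maxContextTokens - limits.headroom;
  return { ok: projected <= limit, projected, limit };
}

const limits: ContextLimits = { maxContextTokens: 128_000, headroom: 8_000 };
console.log(checkContext([{ role: "user", content: "hello" }], limits));
console.log(checkContext([{ role: "user", content: "x".repeat(500_000) }], limits));
```

Returning a result object instead of throwing is a design choice for the sketch: the caller decides whether an overflow means compaction, a checkpoint, or an abort. Swapping in a real tokenizer only requires passing a different `estimate` function.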
Recovery strategies when the context guard fires
- Strategy 1: summarize the middle, keep head and tail. This is the example above. The system prompt (head) and the last N turns (tail) are preserved verbatim; the middle section is summarized by a cheap model. This works well when the agent’s goal is in the system prompt and the most recent tool results are the most relevant. The risk: the summary loses specific details from early tool calls that the agent may still need.
- Strategy 2: checkpoint and restart. When the context guard fires, serialize the current history to disk (or a key-value store like Redis or SQLite) with a task ID. Return a partial result to the caller with a resumption token. On the next invocation, restore the serialized history, apply your compaction strategy, and continue. This is appropriate for long-horizon agents where a single run may take hours and the partial output has value even if the full task is not complete.
- Strategy 3: evict redundant tool results. Many agent runs accumulate duplicate or superseded tool results. If a search was run five times with slightly different queries, the earlier three results may be fully redundant given the fourth and fifth. A simple deduplication pass over tool results by content hash or semantic similarity can free 30–50% of context without losing any unique information. Implement this as a context pre-processor that runs on every call, not just when the guard fires, to prevent the problem from building up.
- Strategy 4: structured context budget allocation. Allocate explicit token budgets to each type of context: system prompt (2k), task description (1k), tool results (50k), conversation history (30k), reserved for completion (8k). When any budget is exceeded, evict within that category first (oldest tool results, oldest conversation turns). This prevents any single category from crowding out others and makes the compaction logic deterministic and testable.
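Strategies 3 and 4 are both deterministic passes over the history, so they are easy to sketch and test. The code below is illustrative only: `ContextItem`, `dedupeToolResults`, and `enforceBudgets` are hypothetical names, exact matching on whitespace-normalized content stands in for a content hash or semantic similarity, and token counts are assumed to be precomputed.

```typescript
// Sketch of Strategies 3 and 4. All names here are hypothetical.
type Category = "system" | "task" | "tool" | "history";

interface ContextItem {
  category: Category;
  content: string;
  tokens: number; // precomputed with whatever tokenizer you use
}

/** Strategy 3: drop older tool results whose normalized content is repeated by a newer one. */
function dedupeToolResults(items: ContextItem[]): ContextItem[] {
  const seen = new Set<string>();
  const keptReversed: ContextItem[] = [];
  // Walk newest-to-oldest so the most recent copy of a duplicate survives.
  for (let i = items.length - 1; i >= 0; i--) {
    const item = items[i];
    if (item.category === "tool") {
      const key = item.content.trim().replace(/\s+/g, " ");
      if (seen.has(key)) continue;
      seen.add(key);
    }
    keptReversed.push(item);
  }
  return keptReversed.reverse();
}

/** Strategy 4: within each category, evict the oldest items until the rest fit its budget. */
function enforceBudgets(items: ContextItem[], budgets: Record<Category, number>): ContextItem[] {
  const used: Record<Category, number> = { system: 0, task: 0, tool: 0, history: 0 };
  const full = new Set<Category>();
  const keptReversed: ContextItem[] = [];
  // Newest-to-oldest: recent items claim budget first; once a category overflows,
  // everything older in that category is evicted.
  for (let i = items.length - 1; i >= 0; i--) {
    const item = items[i];
    if (full.has(item.category)) continue;
    if (used[item.category] + item.tokens > budgets[item.category]) {
      full.add(item.category);
      continue;
    }
    used[item.category] += item.tokens;
    keptReversed.push(item);
  }
  return keptReversed.reverse();
}

const demo: ContextItem[] = [
  { category: "system", content: "You are a researcher.", tokens: 10 },
  { category: "tool", content: "result A", tokens: 100 },
  { category: "tool", content: "result  A ", tokens: 100 }, // duplicate after normalization
  { category: "tool", content: "result B", tokens: 100 },
];
console.log(dedupeToolResults(demo).length); // 3
```

Running `dedupeToolResults` on every call and `enforceBudgets` only when a category is over budget keeps the two concerns separate, which is what makes each pass testable in isolation.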
Token-limit handling approaches compared
| Approach | When it fires | Context preserved? | Recovery possible? | Cost impact |
|---|---|---|---|---|
| Catch provider 400 | After HTTP request fails | No (only error type) | Manual, state lost | Round trip wasted, no output produced |
| max_tokens on completion | Never for input overrun | N/A | N/A (wrong problem) | No protection |
| Manual token count before call | Pre-call (if implemented) | Yes | Yes | Full prompt counted, not sent |
| RunGuard context guard | Pre-call, automatic | Yes — ContextOverflowError exposes projectedTokens | Yes — compaction or checkpoint | Zero tokens billed on trip |
Multi-provider token limits in TypeScript agents (2026 reference)
- OpenAI GPT-4o: 128k context window. Token counting via `gpt-tokenizer` (o200k_base). Set `maxContextTokens: 128_000, headroom: 8_000` for a safe effective limit of 120k. A request rejected with a 400 returns no output, so the round trip is wasted.
- Anthropic Claude Sonnet 4 / Opus 4: 200k context window. Token counting via Anthropic’s token-counting endpoint (exact, network call) or `@anthropic-ai/tokenizer` (rough approximation for Claude 3 and later). Set `maxContextTokens: 200_000, headroom: 16_000` for a safe effective limit of 184k. Anthropic validates prompt length before generating any output, so an over-limit request fails with a 400 “prompt is too long” and no partial completion.
- Meta Llama 3.1 70B (hosted): 131k context window. Token counting approximated via the model’s own tokenizer where available (Llama 3 uses a tiktoken-style BPE with its own vocabulary) or a character-count heuristic (÷ 4). Set `maxContextTokens: 131_000, headroom: 8_000`. Host behaviour on overflow varies: Groq throws a 400, Together.ai may truncate silently.
- Google Gemini 1.5 Pro: 1M token context window. This is rarely hit in practice, but tool-result accumulation can still reach it for long-running crawlers or code analysis agents. Token counting via the Vertex AI `countTokens` API (network call) or a character-count heuristic for speed.
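The per-provider settings above can live in one lookup table so the guard configuration is not scattered across call sites. This is a sketch: the model keys, `PROVIDER_LIMITS`, and `effectiveLimit` are hypothetical names, and the numbers are copied from the list above. Gemini is left out because no headroom value is given for it.

```typescript
// Values copied from the reference list above; keys and helper names are hypothetical.
interface ProviderLimit {
  maxContextTokens: number;
  headroom: number; // tokens reserved for the completion
}

const PROVIDER_LIMITS: Record<string, ProviderLimit> = {
  "gpt-4o": { maxContextTokens: 128_000, headroom: 8_000 },
  "claude-sonnet-4": { maxContextTokens: 200_000, headroom: 16_000 },
  "llama-3.1-70b": { maxContextTokens: 131_000, headroom: 8_000 },
};

/** Largest prompt size the guard should allow for a given model key. */
function effectiveLimit(model: string): number {
  const limit = PROVIDER_LIMITS[model];
  if (!limit) {
    throw new Error(`No context limit configured for model "${model}"`);
  }
  return limit.maxContextTokens - limit.headroom;
}

console.log(effectiveLimit("gpt-4o")); // 120000
console.log(effectiveLimit("claude-sonnet-4")); // 184000
```

Throwing on an unknown model key is deliberate: falling back to a default limit would hide a misconfiguration until the provider rejects a request.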