LLM agent token limit exceeded in TypeScript: detect before the provider throws, recover with partial output

A TypeScript agent running a long research or coding task accumulates tool results in its context window. At some point the projected input tokens for the next call will exceed the model’s limit: 128k for GPT-4o, 200k for Claude Sonnet 4, 131k for Llama 3.1 70B. What happens next depends on the provider and on how you’ve wired the call. OpenAI rejects the request with a 400 context_length_exceeded error. Anthropic rejects it with a 400 invalid_request_error whose message says the prompt is too long. Some older hosted models silently truncate from the beginning, corrupting earlier context without any error at all. None of these outcomes gives you a graceful recovery path. RunGuard’s context guard projects the token count before each call and trips a recoverable error with the full accumulated context still in memory, so you can compact, checkpoint, or summarize instead of losing the run.

Why token-limit overruns happen in agent loops

Tool results are unbounded. An agent that calls a web search tool or a code execution tool appends the full result to its conversation history. A single web-search result can be 2,000–5,000 tokens. After 30 iterations, the accumulated context can easily reach 100k tokens even if the initial prompt was small. The agent has no native mechanism to evict old results — it adds to the history, never removes from it.
The model’s effective output window shrinks as context grows. Prompt and completion share one context window: a 128k model with 120k tokens in the prompt has at most 8k tokens left for the response. At 127k prompt tokens, the response budget is 1k, often too small for a meaningful tool call or answer. The agent degrades silently long before the hard limit is hit.
Loops accelerate context growth. A tool-call loop that repeats the same search query 15 times appends 15 identical (or near-identical) results to the context, each 2–4k tokens. By the time the budget guard or loop detector fires, the context has already been polluted with redundant content that shrinks the effective window of every subsequent call.
The crash is non-recoverable without pre-call detection. Once the provider throws a 400, your catch block can read the error type, but the accumulated work (every tool result, every intermediate reasoning step) is gone unless the history is still in a scope the handler can reach or you serialized it elsewhere. Pre-call detection fires before the request leaves your process, so the context is guaranteed to be in memory and addressable.
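
The growth and headroom arithmetic above can be checked with a few lines of TypeScript. This is a minimal sketch under simplifying assumptions: each iteration appends a fixed number of tokens, and the function names are illustrative rather than part of any library.

```typescript
// Rough projection of context growth in an agent loop.
// All numbers are illustrative inputs, not measurements.

/** Tokens left for the completion once the prompt is counted. */
function completionBudget(contextLimit: number, promptTokens: number): number {
  return Math.max(0, contextLimit - promptTokens);
}

/**
 * Number of further iterations that fit before the prompt exceeds
 * (contextLimit - headroom), assuming a fixed token cost per iteration.
 */
function iterationsUntilOverflow(
  contextLimit: number,
  headroom: number,
  currentTokens: number,
  tokensPerIteration: number,
): number {
  if (tokensPerIteration <= 0) return Infinity;
  const usable = contextLimit - headroom - currentTokens;
  return usable <= 0 ? 0 : Math.floor(usable / tokensPerIteration);
}

console.log(completionBudget(128_000, 120_000)); // 8000
console.log(completionBudget(128_000, 127_000)); // 1000
// 2k initial prompt, 3.5k tokens appended per iteration, 8k headroom:
console.log(iterationsUntilOverflow(128_000, 8_000, 2_000, 3_500)); // 33
```

With a 3.5k average tool result, a 128k model reaches its usable limit after roughly 33 iterations, which matches the order of magnitude described above.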

The two failure modes: hard 400 vs. silent truncation

Hard 400: OpenAI context_length_exceeded. The OpenAI API returns HTTP 400 with error.code === "context_length_exceeded" and a message listing the prompt token count vs. the model’s limit. The TypeScript SDK surfaces this as an APIError with status: 400. You can catch it, but by this point the request has already been sent, the tokens have been counted, and any tool results appended to the conversation since the last successful call are already part of the rejected payload. There is no partial output — the call either succeeds or fails entirely.
Hard 400: Anthropic prompt_too_long. Anthropic returns HTTP 400 with error.type === "invalid_request_error" and error.message containing “prompt is too long”. Same situation: the request was sent, the payload was counted, no output was produced. Anthropic’s context limit for Claude Sonnet 4 and Opus 4 is 200k tokens, larger than GPT-4o’s 128k, but an agent accumulating tool results can still reach it within an hour of operation.
Silent truncation: older hosted models and some proxies. Some open-source model hosts and proxy layers (Ollama, LiteLLM in certain configurations, older vLLM deployments) truncate the prompt from the beginning rather than throwing. This means the model receives a conversation history with the oldest turns silently dropped. The agent appears to succeed (it gets a response), but it has lost context about its goals, its earlier tool results, and any constraints that were in the system prompt. The downstream output is wrong in ways that are hard to detect without re-reading the full conversation.
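
If you do end up catching provider errors, it helps to classify the two hard-400 shapes in one place. The sketch below works on a normalized error object; the ProviderErrorLike shape and the function name are assumptions for illustration, so map your SDK's error object into it yourself. The string checks are the ones described above.

```typescript
// Normalized view of a provider error. The field layout is an assumption:
// copy the relevant fields from your SDK's error object into this shape.
interface ProviderErrorLike {
  status?: number;
  code?: string | null; // OpenAI-style error.code
  type?: string | null; // Anthropic-style error.type
  message?: string;
}

type OverflowKind =
  | "openai_context_length"
  | "anthropic_prompt_too_long"
  | "not_overflow";

function classifyOverflow(err: ProviderErrorLike): OverflowKind {
  if (err.status !== 400) return "not_overflow";
  if (err.code === "context_length_exceeded") return "openai_context_length";
  const message = (err.message ?? "").toLowerCase();
  if (err.type === "invalid_request_error" && message.includes("prompt is too long")) {
    return "anthropic_prompt_too_long";
  }
  return "not_overflow";
}

console.log(classifyOverflow({ status: 400, code: "context_length_exceeded" }));
// "openai_context_length"
```

Silent truncation produces no error at all, so no classifier of this kind can catch it; that case needs the pre-call check.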

Detecting token overruns before the call: RunGuard context guard in TypeScript

How pre-call detection works. Before sending each request, RunGuard’s context guard calls a user-supplied tokens(input) function to estimate the projected input token count. If projected > maxContextTokens - headroom, it throws a ContextOverflowError before the HTTP request is made. The guard never calls the provider — the context is still in your process, addressable and compressible.
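
To make the mechanism concrete, here is a minimal sketch of a pre-call check written from the description above. It is not RunGuard's implementation: SketchOverflowError, exceedsLimit and withContextCheck are hypothetical names, and the only behaviour modelled is "count first, call second".

```typescript
// Hypothetical stand-in for an overflow error; not RunGuard's class.
class SketchOverflowError extends Error {
  projectedTokens: number;
  limit: number;
  constructor(projectedTokens: number, limit: number) {
    super(`projected ${projectedTokens} tokens exceeds usable limit ${limit}`);
    this.name = "SketchOverflowError";
    this.projectedTokens = projectedTokens;
    this.limit = limit;
    Object.setPrototypeOf(this, SketchOverflowError.prototype);
  }
}

interface ContextOptions {
  maxContextTokens: number;
  headroom: number;
}

/** True when the projected prompt would eat into the reserved headroom. */
function exceedsLimit(projected: number, opts: ContextOptions): boolean {
  return projected > opts.maxContextTokens - opts.headroom;
}

/** Wraps `call` so the token check runs before `call` is ever invoked. */
function withContextCheck<I, O>(
  call: (input: I) => Promise<O>,
  tokens: (input: I) => number,
  opts: ContextOptions,
): (input: I) => Promise<O> {
  return async (input: I) => {
    const projected = tokens(input);
    if (exceedsLimit(projected, opts)) {
      // Thrown before any request is made; the caller still holds `input`.
      throw new SketchOverflowError(projected, opts.maxContextTokens - opts.headroom);
    }
    return call(input);
  };
}
```

Because the check runs inside the wrapper, the wrapped function is never reached on overflow, and the caller still owns the input it passed in.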

TypeScript: wiring the context guard.

import OpenAI from "openai";
import { guard, ContextOverflowError, LoopDetectedError, BudgetExceededError } from "@runguard/sdk";
import { encode } from "gpt-tokenizer/model/gpt-4o"; // o200k_base, the GPT-4o encoding; tiktoken also works

const client = new OpenAI();

type Message = OpenAI.Chat.ChatCompletionMessageParam;

async function callModel(messages: Message[]): Promise<OpenAI.Chat.ChatCompletion> {
  return client.chat.completions.create({
    model: "gpt-4o",
    messages,
  });
}

const guardedCall = guard(callModel, {
  // Project token count before each call — fires before the HTTP request
  context: {
    maxContextTokens: 128_000,
    headroom: 8_000,  // reserve 8k for the completion
  },
  tokens: (messages: Message[]) => {
    // Counts text content with the GPT-4o encoding. The per-message overhead is an
    // estimate, and tool-call arguments and image parts are not counted here.
    return messages.reduce((total, msg) => {
      const text = typeof msg.content === "string"
        ? msg.content
        : (msg.content ?? [])
            .map((c) => (c.type === "text" ? c.text : ""))
            .join("");
      return total + encode(text).length + 4; // 4 overhead per message
    }, 3); // 3 for reply priming
  },
  loop: { repeats: 3, maxCycleLen: 4 },
  budget: { maxUsd: 3.0 },
});

async function runAgent(initialMessages: Message[]): Promise<string> {
  const history: Message[] = [...initialMessages];

  while (true) {
    try {
      const completion = await guardedCall(history);
      const reply = completion.choices[0].message;
      history.push(reply);

      if (!reply.tool_calls?.length) {
        return reply.content ?? "";
      }

      // Execute tools, append results...
      for (const tc of reply.tool_calls) {
        const result = await executeTool(tc);
        history.push({
          role: "tool",
          tool_call_id: tc.id,
          content: JSON.stringify(result),
        });
      }

    } catch (err) {
      if (err instanceof ContextOverflowError) {
        // Context is still in memory — compact and retry
        console.warn(
          `Context guard fired: projected ${err.projectedTokens} tokens ` +
          `vs ${err.maxContextTokens - err.headroom} limit. Compacting.`
        );
        return await compactAndResume(history);
      }
      if (err instanceof LoopDetectedError) {
        return `Agent loop detected (pattern: ${err.pattern}). Run aborted.`;
      }
      if (err instanceof BudgetExceededError) {
        return `Budget cap reached ($${err.spent.toFixed(3)}). Run aborted.`;
      }
      throw err;
    }
  }
}

async function compactAndResume(history: Message[]): Promise<string> {
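  // Strategy: summarize the middle of the conversation, keep system + last 10 messages.
  // Caveat: a fixed-size tail can start with a "tool" message whose assistant
  // tool_calls turn fell into the middle; move the cut earlier in production code.
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  const tail = rest.slice(-10);
  const middle = rest.slice(0, -10);
  if (middle.length === 0) {
    // Nothing left to summarize: stop instead of recursing forever.
    throw new Error("Context overflow with no compactable history.");
  }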

  const summary = await client.chat.completions.create({
    model: "gpt-4o-mini",  // cheap model for summarization
    messages: [
      { role: "system", content: "Summarize the following conversation history concisely, preserving key findings and decisions." },
      { role: "user", content: middle.map((m) => `${m.role}: ${typeof m.content === "string" ? m.content : "[multi-part]"}`).join("\n") },
    ],
  });

  const compacted: Message[] = [
    ...system,
    { role: "assistant", content: `[Conversation summary: ${summary.choices[0].message.content}]` },
    ...tail,
  ];

  // Re-run with compacted history
  return runAgent(compacted);
}

declare function executeTool(tc: OpenAI.Chat.ChatCompletionMessageToolCall): Promise<unknown>;

Why headroom matters. Setting headroom: 8000 means the guard fires when projected input tokens exceed 128,000 - 8,000 = 120,000. This ensures the model always has at least 8k tokens for its response — enough for a substantive tool call or answer. Without headroom, you could successfully send a 127,999-token prompt and get a 1-token response that truncates mid-sentence. The guard fires while the completion budget is still meaningful.
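Token counting accuracy. GPT-4o uses the o200k_base encoding, not cl100k_base (which belongs to GPT-4 and GPT-3.5), so load gpt-tokenizer with the GPT-4o encoding; with the wrong encoding the counts are only approximate. For Anthropic models, @anthropic-ai/tokenizer is only a rough approximation for Claude 3 and later; Anthropic’s token counting endpoint gives exact counts at the cost of a network call. For open-source models, use the model’s own tokenizer where possible; tiktoken’s OpenAI encodings give only an approximation. A 5–10% overcount is acceptable because the guard fires before the limit, not at the limit: a false positive costs a compaction, not a crash.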
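
When no exact tokenizer is available, one option is to bias the estimate upward on purpose. The sketch below combines a character-count heuristic with an integer safety margin; the 4-characters-per-token ratio and the 10% default are assumptions that hold roughly for English prose and should be tuned for code or other languages.

```typescript
/** Character-count heuristic: roughly 4 characters per token (assumption). */
function estimateTokensByChars(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

/**
 * Inflates an estimate by a percentage so that errors lean toward
 * overcounting. Integer math avoids floating-point surprises.
 */
function padEstimate(tokens: number, marginPercent = 10): number {
  return Math.ceil((tokens * (100 + marginPercent)) / 100);
}

const sample = "a".repeat(4_000);
console.log(estimateTokensByChars(sample)); // 1000
console.log(padEstimate(estimateTokensByChars(sample))); // 1100
```

An overcount only triggers compaction slightly early, whereas an undercount lets an oversized request through, so the margin should always be added rather than subtracted.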

Recovery strategies when the context guard fires

Strategy 1: summarize the middle, keep head and tail. This is the example above. The system prompt (head) and the last N turns (tail) are preserved verbatim; the middle section is summarized by a cheap model. This works well when the agent’s goal is in the system prompt and the most recent tool results are the most relevant. The risk: the summary loses specific details from early tool calls that the agent may still need.
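
One detail worth handling when cutting the tail: OpenAI-style chat APIs reject a tool message that does not follow the assistant message containing the matching tool call, so a fixed-size slice can produce an invalid history. A minimal sketch of a safer split follows; TailMsg and splitForSummary are hypothetical names, and the message shape is reduced to what the split needs.

```typescript
// Reduced message shape for the sketch; real SDK types carry more fields.
interface TailMsg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

/**
 * Splits history into system / middle / tail, keeping about `keep` trailing
 * messages but moving the cut earlier so the tail never begins with a
 * "tool" message separated from the assistant turn that requested it.
 */
function splitForSummary(
  history: TailMsg[],
  keep: number,
): { system: TailMsg[]; middle: TailMsg[]; tail: TailMsg[] } {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  let cut = Math.max(0, rest.length - keep);
  // Walk back past tool results to the assistant message that precedes them.
  while (cut > 0 && rest[cut]?.role === "tool") cut--;
  return { system, middle: rest.slice(0, cut), tail: rest.slice(cut) };
}
```

The tail may end up a few messages longer than requested, which is the price of keeping each tool call and its results together.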
Strategy 2: checkpoint and restart. When the context guard fires, serialize the current history to disk (or a key-value store like Redis or SQLite) with a task ID. Return a partial result to the caller with a resumption token. On the next invocation, restore the serialized history, apply your compaction strategy, and continue. This is appropriate for long-horizon agents where a single run may take hours and the partial output has value even if the full task is not complete.
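
A checkpoint can be as small as a serialized history keyed by task ID. The sketch below uses an in-memory Map as a stand-in for disk, Redis or SQLite; the CheckpointStore interface, the key format and the PartialResult shape are all assumptions made for illustration.

```typescript
interface StoredMsg {
  role: string;
  content: string;
}

// Storage interface; swap the in-memory version for disk, Redis or SQLite.
interface CheckpointStore {
  put(key: string, value: string): void;
  get(key: string): string | undefined;
}

class MemoryStore implements CheckpointStore {
  private data = new Map<string, string>();
  put(key: string, value: string): void {
    this.data.set(key, value);
  }
  get(key: string): string | undefined {
    return this.data.get(key);
  }
}

interface PartialResult {
  status: "partial";
  resumeToken: string;
  lastAssistantText: string;
}

/** Saves the history and returns a partial result carrying a resumption token. */
function saveCheckpoint(store: CheckpointStore, taskId: string, history: StoredMsg[]): PartialResult {
  store.put(`run:${taskId}`, JSON.stringify(history));
  const last = [...history].reverse().find((m) => m.role === "assistant");
  return { status: "partial", resumeToken: taskId, lastAssistantText: last?.content ?? "" };
}

/** Restores the history saved under a resumption token. */
function restoreCheckpoint(store: CheckpointStore, resumeToken: string): StoredMsg[] {
  const raw = store.get(`run:${resumeToken}`);
  if (raw === undefined) throw new Error(`no checkpoint for ${resumeToken}`);
  return JSON.parse(raw) as StoredMsg[];
}
```

On resume, the restored history would go through whichever compaction strategy you chose before the next model call.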
Strategy 3: evict redundant tool results. Many agent runs accumulate duplicate or superseded tool results. If a search was run five times with slightly different queries, the earlier three results may be fully redundant given the fourth and fifth. A simple deduplication pass over tool results by content hash or semantic similarity can free 30–50% of context without losing any unique information. Implement this as a context pre-processor that runs on every call, not just when the guard fires, to prevent the problem from building up.
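
A minimal sketch of exact-duplicate eviction follows. It keys on whitespace-normalized content (a hash of that string would work equally well as the key) and replaces earlier duplicates with a short placeholder instead of deleting them, so every tool call keeps its paired result. The names and the placeholder text are assumptions, and semantic similarity is out of scope here.

```typescript
interface DedupMsg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

const DUPLICATE_PLACEHOLDER = "[duplicate tool result omitted; identical content appears later]";

/**
 * Replaces earlier copies of identical tool results with a placeholder.
 * The most recent copy of each result is kept verbatim.
 */
function dedupeToolResults(history: DedupMsg[]): DedupMsg[] {
  const seen = new Set<string>();
  // Walk newest-first so the latest copy claims the key.
  const newestFirst = [...history].reverse().map((msg) => {
    if (msg.role !== "tool") return msg;
    const key = msg.content.replace(/\s+/g, " ").trim();
    if (seen.has(key)) return { ...msg, content: DUPLICATE_PLACEHOLDER };
    seen.add(key);
    return msg;
  });
  return newestFirst.reverse();
}
```

The function returns a new array of the same length and leaves the input untouched, which makes it safe to run as a pre-processor on every call.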
Strategy 4: structured context budget allocation. Allocate explicit token budgets to each type of context: system prompt (2k), task description (1k), tool results (50k), conversation history (30k), reserved for completion (8k). When any budget is exceeded, evict within that category first (oldest tool results, oldest conversation turns). This prevents any single category from crowding out others and makes the compaction logic deterministic and testable.
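
The allocation above translates into a small, deterministic eviction routine. This sketch uses the category budgets from the paragraph (the 8k completion reserve is handled separately by headroom); the BudgetItem shape and function name are hypothetical, and items are assumed to arrive oldest-first with token counts already computed.

```typescript
type Category = "system" | "task" | "toolResults" | "conversation";

interface BudgetItem {
  category: Category;
  tokens: number;
  text: string;
}

const BUDGETS: Record<Category, number> = {
  system: 2_000,
  task: 1_000,
  toolResults: 50_000,
  conversation: 30_000,
};

/** Evicts the oldest items of each over-budget category. Input is oldest-first. */
function enforceBudgets(
  items: BudgetItem[],
  budgets: Record<Category, number> = BUDGETS,
): BudgetItem[] {
  const used: Record<Category, number> = { system: 0, task: 0, toolResults: 0, conversation: 0 };
  const closed = new Set<Category>();
  const keep = new Set<BudgetItem>();
  // Walk newest-first so recent items claim budget before older ones.
  for (const item of [...items].reverse()) {
    if (closed.has(item.category)) continue;
    if (used[item.category] + item.tokens > budgets[item.category]) {
      closed.add(item.category); // everything older in this category is evicted
      continue;
    }
    used[item.category] += item.tokens;
    keep.add(item);
  }
  return items.filter((item) => keep.has(item));
}
```

Because the routine is a pure function of the item list and the budgets, it can be unit-tested without a model call. Evicted tool results still need their paired tool-call messages handled if the target API requires that pairing.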

Token-limit handling approaches compared

Approach	When it fires	Context preserved?	Recovery possible?	Cost impact
Catch provider 400	After HTTP request fails	No (only error type)	Manual, state lost	Wasted round trip; billing for rejected requests is provider-specific
max_tokens on completion	Never for input overrun	N/A	N/A (wrong problem)	No protection
Manual token count before call	Pre-call (if implemented)	Yes	Yes	Full prompt counted, not sent
RunGuard context guard	Pre-call, automatic	Yes — ContextOverflowError exposes projectedTokens	Yes — compaction or checkpoint	Zero tokens billed on trip

Multi-provider token limits in TypeScript agents (2026 reference)

OpenAI GPT-4o: 128k context window. Token counting via gpt-tokenizer (o200k_base). Set maxContextTokens: 128_000, headroom: 8_000 for a safe effective limit of 120k. A request rejected with a 400 produces no output; confirm against current OpenAI billing documentation whether it is charged.
Anthropic Claude Sonnet 4 / Opus 4: 200k context window. Token counting via Anthropic’s token counting endpoint (exact, network call) or @anthropic-ai/tokenizer (a rough approximation for Claude 3 and later). Set maxContextTokens: 200_000, headroom: 16_000 for a safe effective limit of 184k. Anthropic counts input tokens before generating any output; whether a rejected prompt_too_long request is billed depends on your billing arrangement, so confirm against Anthropic’s documentation.
Meta Llama 3.1 70B (hosted): 131k context window (131,072 tokens). Token counting via the Llama 3 tokenizer where available, or approximated with a character-count heuristic (÷ 4); tiktoken does not ship a Llama 3 encoding. Set maxContextTokens: 131_000, headroom: 8_000. Host behaviour on overflow varies: Groq throws a 400, Together.ai may truncate silently.
Google Gemini 1.5 Pro: 1M token context window — rarely hit in practice, but tool-result accumulation can still reach it for long-running crawlers or code analysis agents. Token counting via the Vertex AI countTokens API (network call) or a character-count heuristic for speed.
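
The per-provider settings above can live in one typed table so the guard configuration is looked up rather than hard-coded. This is a sketch: the keys are illustrative labels, not provider model IDs, and Gemini is left out because the text gives no headroom value for it.

```typescript
interface ProviderLimit {
  maxContextTokens: number;
  headroom: number;
}

// Values copied from the reference list above; keys are hypothetical labels.
const LIMITS = {
  "gpt-4o": { maxContextTokens: 128_000, headroom: 8_000 },
  "claude-sonnet-4": { maxContextTokens: 200_000, headroom: 16_000 },
  "llama-3.1-70b": { maxContextTokens: 131_000, headroom: 8_000 },
} as const;

type ModelKey = keyof typeof LIMITS;

/** Largest projected prompt the guard should let through for a model. */
function effectiveLimit(model: ModelKey): number {
  const limit: ProviderLimit = LIMITS[model];
  return limit.maxContextTokens - limit.headroom;
}

console.log(effectiveLimit("gpt-4o")); // 120000
console.log(effectiveLimit("claude-sonnet-4")); // 184000
console.log(effectiveLimit("llama-3.1-70b")); // 123000
```

Spreading LIMITS[model] into the context option keeps the guard configuration and this reference table from drifting apart.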