LLM context window exceeded: recovery strategies for long-running agents that hit the 128k limit

GPT-4o has a 128k context window. Claude 3.5 Sonnet has 200k. Gemini 1.5 Pro has 1M. They all have a limit, and long-running agents hit it. When an agent’s message history — system prompt, conversation turns, tool call results — exceeds the model’s context window, you get a hard API error: context_length_exceeded or max_tokens must be less than or reduce your prompt size. The agent fails mid-task with no partial output. This page explains four recovery strategies, when to use each, and how to prevent the condition before it fires using token tracking and context compaction.

Why agents hit the context limit

Recovery strategy 1: sliding window truncation

Recovery strategy 2: context summarization

Recovery strategy 3: external memory with retrieval

Prevention strategy: token tracking with RunGuard

Recovery strategy comparison

StrategyBest forDrawback
Sliding window truncationConversational agents, simple Q&ALoses early steps — bad for multi-step task solvers
Context summarizationMost task-solving agentsCompression may lose specific data points; adds latency
External memory + retrievalVery long-running or multi-session agentsHigh implementation overhead; adds retrieval latency per step
RunGuard token tracking (prevention)All agentsRequires configuring token limit alongside USD budget
Hard API error recoveryLast resort — catch and retry with truncationLoses the failed step entirely; bad user experience