LLM context window exceeded: recovery strategies for long-running agents that hit the 128k limit
GPT-4o has a 128k context window. Claude 3.5 Sonnet has 200k. Gemini 1.5 Pro has 1M. They all have a limit, and long-running agents hit it. When an agent’s message history (system prompt, conversation turns, tool call results) exceeds the model’s context window, you get a hard API error containing context_length_exceeded, “max_tokens must be less than”, or “reduce your prompt size”. The agent fails mid-task with no partial output. This page explains four recovery strategies, when to use each, and how to prevent the condition before it fires using token tracking and context compaction.
Why agents hit the context limit
- Tool result accumulation. Every tool call appends a tool result message to the conversation. Web scraping tools return thousands of characters per page. Database queries return rows that serialize to long JSON blobs. Code execution tools return stdout, stderr, and tracebacks. A 10-step agent that scrapes web pages can easily accumulate 80k–100k tokens of tool results before the model has produced significant output.
- The superlinear growth pattern. Context growth is not linear. Early steps add modest context (the system prompt, a few messages, small tool results). Later steps add more: the model’s reasoning gets more verbose as it synthesizes more information, tool results reference data from prior steps (causing repetition), and planning outputs include summaries of prior work. A 20-step agent often uses 4x the context of a 10-step agent, not 2x.
- Re-injected outputs. Some agent patterns take prior outputs and pass them back in as inputs — code agents that pass prior code into a rewrite step, research agents that pass a prior draft into a refinement step, multi-agent pipelines that pass one agent’s output to the next. Every re-injection adds the full output size to the receiving context. A 5k-token research summary passed through 4 refinement steps contributes 20k tokens of re-injected content.
- The loop-and-context-explosion pattern. When an agent enters a loop (see retry storm prevention), every loop iteration appends more messages to the history. A loop that runs 20 iterations before hitting max_iter can triple the context size, and the final iterations are the most expensive per call because they send the largest context. RunGuard’s loop detection fires at 3 iterations, before the context explosion reaches critical mass.
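To make the faster-than-linear growth concrete, here is a toy model in Python. The token sizes (`system`, `tool_result`, `recap_per_step`) are illustrative assumptions, not measurements: each step appends a fixed-size tool result plus a recap that grows with the step number, which is one plausible way the repetition described above could play out.

```python
def context_after(
    steps: int,
    system: int = 1_000,
    tool_result: int = 500,
    recap_per_step: int = 300,
) -> int:
    """Tokens in the history after `steps` steps under a toy growth model.

    Step i appends a fixed-size tool result plus a recap whose size grows
    with i, because the model restates more prior work as the task goes on.
    """
    total = system
    for i in range(1, steps + 1):
        total += tool_result + recap_per_step * i
    return total


ten = context_after(10)
twenty = context_after(20)
# Doubling the step count more than triples the context in this model.
print(ten, twenty, round(twenty / ten, 2))
```

The closer the per-step recap comes to dominating the fixed-size additions, the closer the 20-step to 10-step ratio gets to 4x.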
Recovery strategy 1: sliding window truncation
- How it works. Keep only the most recent N messages in the conversation history sent to the model. Older messages are dropped. The system prompt is always preserved. This is the simplest recovery strategy and the one most frameworks implement as a built-in option (ConversationBufferWindowMemory in LangChain, for example).
- When it works. Sliding window is appropriate for conversational agents where the agent’s task in the current turn doesn’t depend on the full history: customer service agents, Q&A agents, simple ReAct agents where each step is mostly independent.
- When it fails. Task-solving agents where each step builds on prior steps — code agents, research agents, multi-step planners — cannot use simple truncation. Dropping early steps means the agent loses the original task specification, prior sub-results, and reasoning that informs the current step. The agent may re-do work it already completed, contradict prior outputs, or ask for information it already obtained.
- Implementation with LangChain.
```python
from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

# `tools` and `prompt` are assumed to be defined elsewhere: your tool list and
# a ReAct prompt that includes a {chat_history} placeholder.

memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    k=10,  # keep last 10 exchanges
    return_messages=True,
)

agent_executor = AgentExecutor(
    agent=create_react_agent(llm, tools, prompt),
    tools=tools,
    memory=memory,
    max_iterations=15,
    verbose=True,
)
```
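If you are not using a framework, the same idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming OpenAI-style message dicts with a `role` key; the function name `sliding_window` is hypothetical. It keeps the system prompt, keeps the last `k` other messages, and avoids starting the window on a tool result whose originating assistant message was dropped.

```python
def sliding_window(messages: list[dict], k: int) -> list[dict]:
    """Keep all system messages plus the most recent k non-system messages."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    start = max(len(rest) - k, 0)
    # A tool result must follow the assistant message that requested it, so
    # move the cut forward past any tool results orphaned by the truncation.
    while start < len(rest) and rest[start].get("role") == "tool":
        start += 1
    return system + rest[start:]


history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "Where is my order?"},
    {"role": "assistant", "content": "", "tool_calls": ["lookup_order"]},
    {"role": "tool", "content": "{\"status\": \"shipped\"}"},
    {"role": "assistant", "content": "It shipped yesterday."},
    {"role": "user", "content": "Thanks. Can I change the address?"},
]
trimmed = sliding_window(history, k=3)
print([m["role"] for m in trimmed])
```

With `k=3` the naive cut would begin on the tool result, so the sketch skips it and keeps only the final assistant and user turns after the system prompt.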
Recovery strategy 2: context summarization
- How it works. When the conversation history approaches the context limit (e.g., at 75% of the model’s max tokens, the threshold used in the code below), trigger a summarization step: call the LLM to produce a compact summary of the prior conversation, replace the full history with the summary, and continue. The summary preserves the essential information from prior steps while dramatically reducing token count.
- Implementation: proactive summarization checkpoint.
```python
import tiktoken
from openai import AsyncOpenAI

client = AsyncOpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

CONTEXT_LIMIT = 128_000
SUMMARIZE_AT = int(CONTEXT_LIMIT * 0.75)  # trigger at 75%


def count_tokens(messages: list) -> int:
    """Approximate token count: content tokens plus per-message overhead."""
    total = 0
    for m in messages:
        total += 4 + len(enc.encode(m.get("content") or ""))
        if m.get("role") == "tool":
            total += 2
    return total


async def maybe_compact(messages: list, system: str) -> list:
    """Summarize and compact history if approaching context limit."""
    token_count = count_tokens(messages) + len(enc.encode(system))
    if token_count < SUMMARIZE_AT:
        return messages

    # Summarize everything except the last ~4 messages (keep recent context intact).
    split = max(len(messages) - 4, 0)
    # Don't start the kept tail on a tool result: the API rejects a tool
    # message that is not preceded by the assistant message that requested it.
    while split > 0 and messages[split].get("role") == "tool":
        split -= 1
    to_summarize, recent = messages[:split], messages[split:]
    if not to_summarize:
        return messages  # nothing old enough to summarize

    summary_resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # use cheap model for summarization
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize the following agent conversation history. "
                    "Preserve: task goals, sub-results, decisions made, and any data "
                    "the agent needs to complete the task. Be comprehensive but concise."
                ),
            },
            {"role": "user", "content": str(to_summarize)},
        ],
        max_tokens=2000,
    )
    summary_text = summary_resp.choices[0].message.content

    compacted = [
        {"role": "user", "content": f"[Prior conversation summary]: {summary_text}"},
        {"role": "assistant", "content": "Understood. I have the context from prior steps. Continuing..."},
    ] + recent
    return compacted
```
- Using gpt-4o-mini for summarization. The summarization call doesn’t need the full reasoning capability of your primary model. Use a cheap, fast model (gpt-4o-mini at $0.15/M input tokens vs. $2.50/M for gpt-4o). A typical summarization of a 50k-token history costs roughly $0.008, a trivial price for preventing a context overflow failure.
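The cost figure can be checked with a short calculation. This sketch uses the $0.15/M input price quoted above; the $0.60/M output price is an assumption taken from published gpt-4o-mini pricing at the time of writing and should be checked against current rates. The function name `summarization_cost` is hypothetical.

```python
INPUT_USD_PER_M = 0.15   # gpt-4o-mini input price quoted in the text
OUTPUT_USD_PER_M = 0.60  # assumed gpt-4o-mini output price; verify before relying on it


def summarization_cost(history_tokens: int, summary_tokens: int) -> float:
    """USD cost of one summarization call: history in, summary out."""
    return (
        history_tokens * INPUT_USD_PER_M + summary_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000


# 50k-token history, summary capped at 2,000 tokens as in the code above.
print(f"${summarization_cost(50_000, 2_000):.4f}")
```

The input side alone is $0.0075; the summary output adds a little on top, so the total stays under one cent per compaction.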
Recovery strategy 3: external memory with retrieval
- How it works. Instead of keeping all history in the message context, store it in a vector database or key-value store. Before each agent step, retrieve only the most relevant prior context chunks and inject them into the system message. The message history itself stays short — only the last N turns.
- When to use it. Appropriate for very long-running agents (hundreds of steps), multi-session agents (the agent can resume a task from a prior session), or agents that need to reference specific prior results rather than summarized history. The overhead of vector embedding and retrieval adds latency and cost per step, so it’s only worthwhile for tasks where the alternative (context overflow or summarization loss) would be worse.
- Practical limitation. Retrieval-augmented agent memory requires a vector database, an embedding model, and retrieval logic. For most agents, context summarization (strategy 2) provides 90% of the benefit with 10% of the implementation complexity. Use retrieval-augmented memory only when summarization loses critical information that the agent needs to reference at arbitrary later steps.
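The retrieve-then-inject flow can be sketched without any infrastructure. In this minimal illustration everything is a stand-in: `embed` is a toy bag-of-words counter rather than a real embedding model, and `AgentMemory` is a hypothetical in-process store where a real agent would use a vector database. Only the shape of the flow (store each result, retrieve the top matches, inject them into the system message) is the point.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class AgentMemory:
    """Hypothetical in-process store; a real agent would use a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self._items.append((embed(text), text))

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


def build_system_message(base_prompt: str, memory: AgentMemory, goal: str) -> str:
    """Inject only the most relevant prior results into the system message."""
    notes = "\n".join(f"- {chunk}" for chunk in memory.retrieve(goal))
    return f"{base_prompt}\n\nRelevant prior results:\n{notes}"


memory = AgentMemory()
memory.add("Step 3: the pricing page lists the Pro plan at 49 USD per month.")
memory.add("Step 7: the API rate limit is 600 requests per minute.")
memory.add("Step 9: the changelog mentions a breaking change in v2 auth.")
print(build_system_message("You are a research agent.", memory, "What is the API rate limit?"))
```

The message history passed to the model can then stay limited to the last few turns, since older results are reachable through retrieval.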
Prevention strategy: token tracking with RunGuard
- Track tokens, not just dollars. RunGuard’s budget guard tracks both USD spend and token consumption per call. Configure a max_tokens limit alongside max_usd; the guard fires when either limit is reached, whichever comes first. This gives you an early warning before the context window overflows.
- Setting a context headroom limit.
```python
from openai import AsyncOpenAI
from runguard import guard

client = AsyncOpenAI()


async def guarded_agent_step(messages, tools):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    usage = response.usage
    return {
        "response": response,
        # gpt-4o pricing: $2.50/M input tokens, $10.00/M output tokens
        "usd": (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000,
        "sig": extract_tool_sig(response),  # helper assumed to be defined elsewhere
        "tokens": usage.prompt_tokens,
    }


agent = guard(
    guarded_agent_step,
    budget={"max_usd": 1.50, "max_input_tokens": 90_000},  # fire at 90k tokens, before the 128k limit
    loop={"repeats": 3, "window": 6},
)

# RunGuard raises BudgetExceededError with e.spent_tokens when limit is hit
# Your catch block can trigger summarization compaction at this point
```
- The compaction checkpoint pattern. The cleanest pattern is: run the agent loop, catch BudgetExceededError when e.reason == "tokens", compact the history using strategy 2, reset the RunGuard instance, and resume. The agent never hits the hard API context error: RunGuard’s proactive limit fires first at 90k tokens, triggering controlled compaction, so the model has roughly 38k tokens of headroom for its response.
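The control flow of the compaction checkpoint can be sketched independently of any guard library. The sketch below does not use RunGuard’s API: `TokenBudgetExceeded`, `guarded_step`, `estimate_tokens`, and `run_with_compaction` are all hypothetical stand-ins, and the 4-characters-per-token estimate is a crude assumption. It shows only the catch, compact, resume loop, with a check that stops if compaction fails to shrink the history.

```python
class TokenBudgetExceeded(Exception):
    """Hypothetical stand-in for a guard's token-limit error."""


def estimate_tokens(messages: list[dict]) -> int:
    # Crude assumption: about 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4


def guarded_step(messages: list[dict], step, max_input_tokens: int) -> dict:
    """Run one step, refusing to call the model once the history is too large."""
    used = estimate_tokens(messages)
    if used > max_input_tokens:
        raise TokenBudgetExceeded(used)
    return step(messages)


def run_with_compaction(messages, step, compact, max_input_tokens, max_steps=50):
    """Agent loop that compacts the history whenever the token limit fires."""
    for _ in range(max_steps):
        try:
            reply = guarded_step(messages, step, max_input_tokens)
        except TokenBudgetExceeded:
            compacted = compact(messages)
            if estimate_tokens(compacted) >= estimate_tokens(messages):
                raise  # compaction did not help; stop instead of looping
            messages = compacted
            continue  # resume the task with the compacted history
        messages = messages + [reply]
        if reply.get("done"):
            break
    return messages
```

Here `step` is any callable that takes the history and returns the next message, and `compact` would be a summarization routine such as the one in strategy 2. Checking that compaction actually reduced the history guards against an endless compact-and-retry cycle.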
Recovery strategy comparison
| Strategy | Best for | Drawback |
|---|---|---|
| Sliding window truncation | Conversational agents, simple Q&A | Loses early steps — bad for multi-step task solvers |
| Context summarization | Most task-solving agents | Compression may lose specific data points; adds latency |
| External memory + retrieval | Very long-running or multi-session agents | High implementation overhead; adds retrieval latency per step |
| RunGuard token tracking (prevention) | All agents | Requires configuring token limit alongside USD budget |
| Hard API error recovery | Last resort — catch and retry with truncation | Loses the failed step entirely; bad user experience |
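For the last-resort row in the table, catch-and-retry with truncation might look like the sketch below. It is deliberately library-agnostic: `call_model` and `is_context_overflow` are hypothetical callables you supply. With the OpenAI Python SDK, the predicate would presumably check for a bad-request error whose code is context_length_exceeded, but that mapping is an assumption to verify against the SDK you use.

```python
from typing import Callable


def call_with_overflow_retry(
    call_model: Callable[[list[dict]], dict],
    messages: list[dict],
    is_context_overflow: Callable[[Exception], bool],
    keep_last: int = 8,
    max_retries: int = 3,
) -> dict:
    """Last-resort recovery: on a context overflow, drop old turns and retry."""
    for _ in range(max_retries + 1):
        try:
            return call_model(messages)
        except Exception as exc:
            if not is_context_overflow(exc) or keep_last < 1:
                raise
            # Keep the system prompt and only the most recent turns.
            system = [m for m in messages if m.get("role") == "system"]
            rest = [m for m in messages if m.get("role") != "system"]
            messages = system + rest[-keep_last:]
            keep_last //= 2  # truncate harder if the retry overflows again
    raise RuntimeError("context still too large after truncation retries")
```

This inherits the weakness of sliding window truncation: early task context is lost, and a production version would also need to avoid cutting between a tool call and its result. It is a fallback for when the proactive limits did not fire, not a substitute for them.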