AI agent context pruning strategies: remove low-value tokens before they cost you
Every LLM API call charges for every token in the prompt, including tokens from tool results retrieved 20 turns ago, exploratory reasoning that led nowhere, and failed sub-attempts that the agent has already moved past. In long-running agents (research agents, coding assistants, customer support bots that handle extended sessions), this accumulated history can easily dominate the total token cost of a session. An agent that runs for 30 turns and accumulates 80,000 tokens of history pays about $0.24 in input tokens on every call that carries the full history (80,000 tokens at Sonnet pricing of $3 per million input tokens), for context that is largely irrelevant to the final 10 turns where the actual work is happening.
Context pruning is the systematic removal of low-relevance tokens from the active context window before each API call. Implemented correctly, pruning can reduce input token costs by 40–70% on long-running agent sessions, typically with no detectable impact on task completion. This page covers the five primary pruning strategies, the relevance scoring methods that drive them, and how RunGuard’s ContextGuard serves as the last-resort safety net when pruning fails to prevent context overflow.
Why context accumulates: the causes of unbounded growth
- Tool result verbosity without eviction. Most agent frameworks append tool results to the message history in full and never evict them. A research agent that calls `web_search` 15 times accumulates 15 full search-result payloads, each 2,000–5,000 tokens. By turn 20, the tool-result history alone exceeds the combined size of the system prompt, the user’s original request, and all the assistant’s reasoning steps. The search results from turn 3 are irrelevant to turn 20; they exist in the context only because nothing evicted them.
- Dead-end exploration paths. Agents frequently explore approaches that fail and backtrack. A coding agent that tries to solve a bug with approach A, discovers it doesn’t work, and pivots to approach B still carries the entire approach-A exploration in its context for all of approach B. That exploration history is irrelevant to the current work and may even confuse the model (the model sees failed attempts alongside the current attempt and may repeat the failed approach).
- Repetitive tool calls. Agents in loops (which RunGuard’s LoopDetector catches) or agents that make related but slightly different calls to the same tool accumulate multiple copies of similar tool results. If the agent calls `read_file(path)` three times with slightly different parameters, it has three copies of largely overlapping file content in the context. Deduplication of near-duplicate tool results is a form of context pruning that requires no relevance scoring.
- System prompt anti-patterns. Dynamic system prompts that embed session-specific data (current time, user preferences, conversation summary) create novel prefixes that defeat provider-side prefix caching and cause the entire prompt to be charged at full input rates. Static system prompts with stable prefixes enable caching; dynamic data should be injected in the user message, not the system message.
Pruning strategy 1: sliding window (recency-based eviction)
- How it works. The simplest pruning strategy: keep only the last N turns (or last N tokens) of conversation history, plus the system prompt. Older turns are evicted regardless of their content. This takes one line to implement: slice the message array to the last N elements before every API call.
- Window size selection. Window size should be calibrated to your task’s “attention span” — how many turns back the model needs to look to produce a correct response. For most tasks, the model’s effective attention is the last 3–7 turns; beyond that, earlier context rarely changes the response. Measure empirically: run 50 sessions with a 3-turn window and 50 with a 10-turn window; compare task completion rates. Most teams find that 5–7 turns captures 95% of the quality benefit of an unbounded window.
- Limitation: abrupt loss of earlier context. Sliding windows lose all context before the window boundary, even if that context is highly relevant. A research agent that established a key fact in turn 2 and needs it again in turn 25 will fail if the window is 5 turns. Mitigate by pinning high-importance messages (explicitly mark certain messages as unprunable) or by combining the sliding window with a working memory object (see AI agent prompt compression cost savings) that carries key facts across window boundaries.
- Token-based windows vs turn-based windows. Turn-based windows (last N messages) can be unpredictable in token count because turns vary widely in length. Token-based windows (last M tokens) are more predictable for cost budgeting. Implement with: `while (estimateTokens(history) > maxHistoryTokens) history.shift()`. Always remove from the front (oldest first), never from the middle, to preserve narrative coherence.
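The token-based window can be sketched as follows. This is a minimal illustration, not a reference implementation: `estimateTokens` here is a crude characters-divided-by-four heuristic rather than a real tokenizer, and the message shape (`content`, optional `pinned`) is an assumption. The system prompt is assumed to live outside `history`. Pinned messages are skipped during eviction, which is the mitigation described in the limitation above.

```javascript
// Rough token estimate: about 4 characters per token (an assumption, not a real tokenizer).
function estimateTokens(messages) {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Evict the oldest unpinned messages until the history fits the token budget.
// Returns a new array; the input array is not mutated.
function slidingWindow(history, maxHistoryTokens) {
  const kept = [...history];
  while (estimateTokens(kept) > maxHistoryTokens) {
    const oldestUnpinned = kept.findIndex((m) => !m.pinned);
    if (oldestUnpinned === -1) break; // only pinned messages remain; nothing left to evict
    kept.splice(oldestUnpinned, 1);
  }
  return kept;
}
```

If only pinned messages remain and the budget is still exceeded, the function returns them as-is, so a caller still needs an overflow check before sending the request.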
Pruning strategy 2: relevance scoring
- How it works. Rather than evicting by age, you score each message in the history by its relevance to the current task step, then retain the top-K most relevant messages plus the most recent N messages. Messages that are old AND low-relevance are evicted; messages that are old but high-relevance (the original problem statement, a key fact established early) are retained.
- Relevance scoring methods.
  - Embedding cosine similarity: embed the current step description and each historical message; compute cosine similarity; retain messages above a threshold. Requires an embedding model call for each pruning pass, which is cheap: at about $0.02/MTok for text-embedding-3-small, a 100-token message costs roughly $0.000002 to score.
  - Keyword overlap: extract key terms from the current task step (nouns, technical terms, entity names); score each historical message by keyword overlap. Deterministic, fast, and needs no LLM call, but misses semantic similarity (two messages about the same topic with different wording may score low on keyword overlap).
  - LLM-judged relevance: ask a cheap model to score each message’s relevance to the current step on a 1–5 scale. Accurate but adds latency and cost; only worthwhile for large, high-stakes sessions.
- Always-retain candidates. Some messages should always be retained regardless of relevance score: the original user request, the most recent user message, any message containing an explicit constraint (“do not use library X”, “the budget is $500”), and messages the agent has explicitly referenced in a previous step (“as I found in step 3, the error is...”). Build an “always retain” flag into your message objects and filter before relevance scoring.
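As a rough sketch of the keyword-overlap variant combined with always-retain flags, consider the following. The `alwaysRetain` field, the `topK` and `recentN` defaults, and the regex-based term extraction are all assumptions made for illustration; a production system would likely use a proper term extractor or embeddings.

```javascript
// Lowercased word set, ignoring tokens shorter than 3 characters
// (a crude stand-in for real key-term extraction).
function keyTerms(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9_]{3,}/g) || []);
}

// Fraction of the current step's key terms that also appear in the message.
function keywordOverlap(stepTerms, messageText) {
  if (stepTerms.size === 0) return 0;
  const msgTerms = keyTerms(messageText);
  let hits = 0;
  for (const term of stepTerms) if (msgTerms.has(term)) hits++;
  return hits / stepTerms.size;
}

// Keep always-retain messages, the last `recentN` messages,
// and the `topK` highest-scoring messages among the rest.
function pruneByRelevance(history, currentStep, { topK = 3, recentN = 2 } = {}) {
  const stepTerms = keyTerms(currentStep);
  const recentStart = Math.max(0, history.length - recentN);
  const keep = new Set();
  const candidates = [];
  history.forEach((m, i) => {
    if (m.alwaysRetain || i >= recentStart) keep.add(i);
    else candidates.push({ i, score: keywordOverlap(stepTerms, m.content) });
  });
  candidates.sort((a, b) => b.score - a.score);
  for (const c of candidates.slice(0, topK)) keep.add(c.i);
  return history.filter((_, i) => keep.has(i)); // original order preserved
}
```

Filtering by index at the end keeps the surviving messages in their original order, so the pruned history still reads chronologically.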
Pruning strategy 3: message-type segmentation
- Segment by message type before applying retention rules. Different message types have different relevance decay curves. Tool results decay fast (a search result from 10 turns ago is rarely needed); user messages decay slowly (they encode the task requirements); assistant reasoning messages decay at an intermediate rate. Applying the same eviction policy to all message types discards high-value user messages too early and retains low-value stale tool results too long.
- Retention policy by type.
  - System prompt: always retained; cached if possible
  - Initial user messages: always retained for the session lifetime
  - Subsequent user messages: retain last 5 turns
  - Tool results: retain last 3 results per tool; evict older results of the same tool type first
  - Assistant reasoning: retain last 7 turns; summarize older reasoning into a compressed summary block
  - Error messages / failed attempts: retain only a summary, such as “approach X failed at turn N: [one-sentence reason]” (10–20 tokens), instead of the full failure transcript (200–2,000 tokens)
- Implementation. Tag each message with a type at insertion time. Your pruning function iterates over the type-segmented history and applies per-type policies before the final token count check. This is more code than a sliding window but significantly better quality per token than pure recency-based eviction.
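A per-type pruning pass along these lines might look like the sketch below. The message fields (`type`, `tool`, `summary`) and the `LIMITS` table are assumptions for illustration. Two simplifications are worth noting: "last N turns" is approximated as "last N messages of that type", and older assistant reasoning is simply dropped rather than summarized.

```javascript
// Per-type retention limits, mirroring the policy list above.
// Tool results are counted separately per tool name.
const LIMITS = { user: 5, tool: 3, assistant: 7 };

function pruneByType(history) {
  const firstUser = history.findIndex((m) => m.type === "user");
  const kept = new Map(); // bucket key -> number of messages kept so far
  const result = [];
  // Walk newest to oldest so "last N" is a simple counter per bucket.
  for (let i = history.length - 1; i >= 0; i--) {
    const m = history[i];
    if (m.type === "system" || i === firstUser) {
      result.push(m); // system prompt and initial request are always retained
    } else if (m.type === "error") {
      // Replace the failure transcript with its one-line summary when available.
      result.push({ ...m, content: m.summary ?? m.content });
    } else {
      const key = m.type === "tool" ? `tool:${m.tool}` : m.type;
      const limit = LIMITS[m.type] ?? Infinity; // unknown types are kept
      const n = kept.get(key) ?? 0;
      if (n < limit) {
        result.push(m);
        kept.set(key, n + 1);
      }
    }
  }
  return result.reverse(); // restore chronological order
}
```

A final token-count check would still run after this pass, as the implementation note above describes.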
Pruning strategy 4: importance sampling and pinning
- Explicit importance tagging. During agent execution, certain messages are clearly high-importance at the time they are created: the discovery of a critical constraint, the resolution of a blocking ambiguity, an intermediate result that multiple later steps depend on. Tag these messages as “pinned” at creation time; pruning logic never evicts pinned messages regardless of age or relevance score.
- Agent-driven importance signaling. An agent that reasons with chain-of-thought often signals its own importance assessments: “This is the key insight...”, “Note: this constraint applies throughout...”. A lightweight classifier on assistant messages can detect these phrases and auto-pin the containing message. This makes importance tagging automatic rather than requiring explicit instrumentation of every tool call.
- Pinning budget. Without a pinning budget, overly aggressive importance classifiers pin everything and pruning never reduces context. Set a maximum number of pinned messages (e.g., 10) and a maximum pinned token budget (e.g., 4,000 tokens). When the pinning budget is exceeded, the oldest or lowest-importance pinned message is demoted and becomes eligible for eviction.
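The auto-pinning and budget rules above could be sketched like this. The cue phrases, the `pinned` field, and the `approxTokens` heuristic are assumptions for illustration, and this version demotes the oldest pinned messages first rather than ranking by importance.

```javascript
// Crude token estimate: about 4 characters per token (assumption, not a real tokenizer).
function approxTokens(text) {
  return Math.ceil(text.length / 4);
}

// Phrase-based stand-in for a lightweight importance classifier.
const IMPORTANCE_CUES = [/key insight/i, /this constraint applies/i, /^note:/im];
function shouldAutoPin(text) {
  return IMPORTANCE_CUES.some((re) => re.test(text));
}

// Demote the oldest pinned messages until both pin budgets are satisfied.
// Assumes `history` is ordered oldest to newest. Returns shallow copies.
function enforcePinBudget(history, { maxPinned = 10, maxPinnedTokens = 4000 } = {}) {
  const result = history.map((m) => ({ ...m }));
  const pinned = result.filter((m) => m.pinned); // oldest first
  let count = pinned.length;
  let tokens = pinned.reduce((sum, m) => sum + approxTokens(m.content), 0);
  for (const m of pinned) {
    if (count <= maxPinned && tokens <= maxPinnedTokens) break;
    m.pinned = false; // demoted: now eligible for normal eviction
    count--;
    tokens -= approxTokens(m.content);
  }
  return result;
}
```

Demoted messages are not deleted here; they simply lose protection and are handled by whichever eviction strategy runs next.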
Pruning strategy 5: tool-result deduplication
- Near-duplicate detection. Agents frequently issue semantically similar tool calls: `search("LangChain circuit breaker")` followed later by `search("LangChain loop guard")`. The results heavily overlap. Before adding a new tool result to context, compute its similarity to existing tool results of the same type using a lightweight hash (for exact duplicates) or embedding similarity (for near-duplicates). If similarity exceeds a threshold, merge the results rather than appending a second copy.
- Stale result eviction. Tool results that are time-sensitive (web searches, API calls for live data) have a natural freshness window. A search result from 5 turns ago may be stale if the agent is still researching the same topic. Implement a “stale after N turns” policy for tool results: once a result is older than N turns, replace it with a summary annotation (“[search at turn 5 returned 8 results; most recent result for this topic is retained below]”) and the single most relevant result from the stale set.
- File and code deduplication. A coding agent that reads the same file multiple times (to recheck a function, to verify a previous change) accumulates multiple copies of the file content. Implement a “last read wins” policy for file tool results: when a file is read again, replace the previous file-read result in the context rather than appending a new copy. The saving is proportional to file size; for a 3,000-token file read 4 times, deduplication saves 9,000 input tokens per call after the fourth read.
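A minimal sketch of "last read wins" plus exact-duplicate detection follows. The message shape (`type`, `tool`, `args.path`, `content`) and the `read_file` tool name are assumptions. Exact duplicates are detected by comparing whitespace-normalized content directly, where a real system might compare hashes; near-duplicate detection via embeddings is left out.

```javascript
// Collapse whitespace so trivially different copies compare equal.
function normalize(text) {
  return text.trim().replace(/\s+/g, " ");
}

// Append a tool result to the history, replacing or skipping duplicates.
// Returns a new array when the history changes.
function addToolResult(history, result) {
  // "Last read wins": drop any earlier read of the same file path.
  if (result.tool === "read_file") {
    const withoutOld = history.filter(
      (m) => !(m.type === "tool" && m.tool === "read_file" && m.args.path === result.args.path)
    );
    return [...withoutOld, result];
  }
  // Exact duplicate from the same tool: keep the existing copy, skip the new one.
  const isDuplicate = history.some(
    (m) =>
      m.type === "tool" &&
      m.tool === result.tool &&
      normalize(m.content) === normalize(result.content)
  );
  return isDuplicate ? history : [...history, result];
}
```

Applying this at insertion time keeps the history at one copy per file, which is where the 9,000-token saving in the example above comes from.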
RunGuard ContextGuard: the last-resort overflow catcher
- Why pruning alone is insufficient. Context pruning reduces expected context size on the average session. It does not guarantee context stays below the model’s limit on every session. Edge cases — unusually complex tasks requiring more exploration, users who inject large documents, tool results that exceed expected size bounds — can still drive context beyond the pruning logic’s budget. When that happens, the LLM API returns a 400 context-length-exceeded error, the agent fails, and the user loses all progress from the session.
- ContextGuard as a pre-call circuit breaker. RunGuard’s ContextGuard intercepts LLM calls before they are sent. You supply a `tokens()` function that estimates the context size; ContextGuard compares it to `maxContextTokens - headroom`. If the projection exceeds the budget, ContextGuard throws `ContextOverflowError` before the API call is made. Your agent loop catches this and triggers its most aggressive pruning path (or compresses with a rolling summary) before retrying. This converts a hard API failure into a graceful degradation.
- Configuring the headroom parameter. `headroom` specifies the buffer between your projected context and the model’s limit. Set it to at least the maximum output you expect from the model on this call (2,000 tokens, for example). Without adequate headroom, the context can pass the ContextGuard check but still fail at the API layer because the model’s combined input+output limit is exceeded during generation. Headroom = `max_tokens` + 500 (for safety margin) is a good default.
- Integration with the pruning pipeline.
    // Assumes guard and ContextOverflowError come from RunGuard, and that
    // buildContext, estimateTokens, aggressivePrune, and llm are defined by your agent.
    async function agentCall(history, input, retried = false) {
      const context = buildContext(history, input);
      try {
        return await guard(
          () => llm.call(context),
          {
            context: { maxContextTokens: 180000, headroom: 4000 },
            tokens: () => estimateTokens(context),
          }
        );
      } catch (e) {
        if (e instanceof ContextOverflowError && !retried) {
          // Trigger hard pruning, then retry exactly once.
          return agentCall(aggressivePrune(history), input, true);
        }
        throw e;
      }
    }
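The integration code calls `aggressivePrune` without defining it. One possible shape for that hard-pruning path is sketched below, together with the headroom default suggested above. Both functions are hypothetical helpers, not part of RunGuard; the `type` and `pinned` fields and the `keepRecent` default are assumptions.

```javascript
// Hypothetical hard-pruning path: keep only the original user request,
// pinned messages, and the most recent `keepRecent` messages.
function aggressivePrune(history, keepRecent = 3) {
  const firstUser = history.findIndex((m) => m.type === "user");
  const recentStart = Math.max(0, history.length - keepRecent);
  return history.filter((m, i) => i === firstUser || m.pinned || i >= recentStart);
}

// Headroom default from the text: expected maximum output plus a 500-token margin.
function defaultHeadroom(maxOutputTokens, safetyMargin = 500) {
  return maxOutputTokens + safetyMargin;
}
```

Because the retry happens only once, the hard-pruning path should cut deeply enough that the second attempt is very likely to fit.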
Prune aggressively. Catch what escapes.
Context pruning reduces expected costs on long sessions. RunGuard’s ContextGuard catches the edge cases pruning misses before they produce provider errors or unexpected bills. Both belong in a production agent stack.