AI agent prompt compression: cut token costs 40–60% without losing context
In a multi-turn AI agent, the prompt grows with every tool call and every assistant reply. After 10 turns, what started as a 1,000-token system prompt has often ballooned to 15,000–40,000 tokens of accumulated conversation history, tool results, and intermediate reasoning. At Claude Sonnet pricing, a single session can then cost $0.12–$0.50, and if the agent runs thousands of sessions per day, the monthly bill arrives as a shock. Prompt compression is the practice of systematically reducing the token count passed to the model on each turn without degrading the quality of the model’s responses. Implemented correctly, compression routinely reduces per-session token spend by 40–60% with no measurable regression in output quality. This page covers the four primary compression techniques, the tradeoffs between them, and how RunGuard’s ContextGuard enforces a hard token budget so that context bloat is caught at the guardrail layer rather than on your provider invoice.
Why prompts bloat: the multi-turn compounding problem
- Tool result verbosity. Agent frameworks typically append the full text of every tool result to the conversation history. A single web_search call returning 10 results adds 2,000–5,000 tokens. A file read returning a 200-line Python module adds another 3,000 tokens. After three or four tool calls, the history is dominated by raw tool output rather than distilled insight. Providers charge the same per token whether the token encodes a novel insight or a redundant HTML boilerplate tag from a web scrape.
- Repeated system context. Many agent frameworks include the full system prompt on every API call. If the system prompt is 2,000 tokens and you run 15 turns, you pay for 30,000 tokens of system prompt alone: 15 copies, 14 of them redundant. Provider-level prompt caching (Anthropic’s cache_control, OpenAI’s automatic prefix caching) reduces but does not eliminate this, because the cached prefix must still appear in the request and is still billed, at a reduced rate.
- Intermediate reasoning accumulation. Chain-of-thought and ReAct-style agents emit multi-paragraph reasoning traces that become part of the history. This reasoning was useful when the model produced it but becomes progressively less relevant to the current step as the task advances. Yet frameworks faithfully append it to every subsequent prompt.
- Conversation history as a liability after context is resolved. Early turns often contain goal-clarification and scoping exchanges. Once the task is scoped, those early turns add tokens without adding decision-relevant information. The model keeps no state between API calls, so the scoping outcome does have to stay in the prompt, but a one-line statement of the agreed goal carries it as well as the raw exchange does; replaying the full text does not improve response quality, it only drives up cost.
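The compounding effect is easy to underestimate, so here is a minimal arithmetic sketch. It assumes the full history is resent on every turn and that each turn appends a fixed number of tokens; the `SessionShape` type and the 2,000-tokens-per-turn figure are illustrative assumptions, not measurements.

```typescript
// Cumulative input tokens for a session that resends its full history each turn.
// All sizes are illustrative assumptions.
interface SessionShape {
  systemPromptTokens: number; // resent on every call
  tokensAddedPerTurn: number; // tool results, reasoning, and replies appended per turn
  turns: number;
}

function cumulativeInputTokens(s: SessionShape): { system: number; history: number; total: number } {
  let system = 0;
  let history = 0;
  for (let turn = 1; turn <= s.turns; turn++) {
    system += s.systemPromptTokens;
    // On turn k the request carries everything appended during turns 1..k-1.
    history += (turn - 1) * s.tokensAddedPerTurn;
  }
  return { system, history, total: system + history };
}

const fifteenTurns = cumulativeInputTokens({
  systemPromptTokens: 2000,
  tokensAddedPerTurn: 2000,
  turns: 15,
});
// fifteenTurns.system  === 30000   (15 copies of the 2,000-token system prompt)
// fifteenTurns.history === 210000  (2,000 * (0 + 1 + ... + 14))
```

Under these assumptions the repeated system prompt is the smaller problem: the resent history grows with the square of the turn count, which is why the techniques below target the history first.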
Compression technique 1: rolling summary replacement
- How it works. Instead of keeping the full verbatim history, you maintain a running summary of completed work. After every N turns (commonly 3–5), you call the LLM with the accumulated turns and ask it to produce a concise summary of: (a) what has been established, (b) what actions have been taken, (c) what was found, and (d) what remains to do. That summary (typically 200–400 tokens) replaces the N turns it summarizes (typically 3,000–15,000 tokens). The summary becomes the new “history head” and subsequent turns continue from there.
- Compression ratio. Summarization typically achieves a 10:1 to 30:1 compression ratio on old turns. For an agent that runs 20 turns with heavy tool use, rolling summaries typically reduce total context by 50–70%, since the first 15 turns are replaced by about 3 rolling summaries (one per 5 turns) totaling 600–1,200 tokens, where the verbatim representation would be 30,000–60,000 tokens.
- The summary call cost. Rolling summaries require an LLM call to generate the summary. This call costs tokens too. For cost efficiency, generate summaries using a smaller, cheaper model (GPT-4o mini, Claude 3 Haiku) rather than the primary reasoning model. The summary-generation call typically costs 1/10 of what it saves on subsequent turns.
- Tradeoff: verbatim recall. If your agent task requires verbatim recall of specific facts from early turns — exact URLs, precise numbers, code snippets — rolling summaries can lose them. Mitigate by extracting key facts into a structured “working memory” dict that is maintained separately and injected alongside the summary. The dict carries only the facts; the summary carries the narrative.
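The rolling-summary mechanics above can be sketched as three small pure functions. This is one possible shape, not a reference implementation: the names `planCompaction`, `buildSummaryPrompt`, and `applySummary` are hypothetical, and the actual call to the cheaper summarizer model is left to the caller so that no provider API is assumed.

```typescript
interface Turn { role: "user" | "assistant" | "tool"; content: string }

interface HistoryState {
  summary: string; // running summary of everything already folded
  recent: Turn[];  // turns still kept verbatim
}

interface CompactionPlan { toFold: Turn[]; keep: Turn[] }

// Decide which turns to fold. Returns null until at least `summarizeEvery`
// turns have accumulated beyond the verbatim window.
function planCompaction(state: HistoryState, summarizeEvery = 4, keepVerbatim = 2): CompactionPlan | null {
  const foldable = state.recent.length - keepVerbatim;
  if (foldable < summarizeEvery) return null;
  return { toFold: state.recent.slice(0, foldable), keep: state.recent.slice(foldable) };
}

// Prompt for the summarizer model, covering points (a) to (d) above.
function buildSummaryPrompt(previousSummary: string, toFold: Turn[]): string {
  const transcript = toFold.map((t) => `${t.role}: ${t.content}`).join("\n");
  return [
    "Update the running summary of this agent session. Cover:",
    "(a) what has been established, (b) what actions have been taken,",
    "(c) what was found, (d) what remains to do.",
    `Previous summary:\n${previousSummary || "(none)"}`,
    `New turns:\n${transcript}`,
  ].join("\n\n");
}

// Once the summarizer replies, swap the folded turns for the new summary.
function applySummary(plan: CompactionPlan, newSummary: string): HistoryState {
  return { summary: newSummary, recent: plan.keep };
}
```

In an agent loop you would call `planCompaction` after each turn; when it returns a plan, send `buildSummaryPrompt(...)` to the summarizer model and pass its reply to `applySummary`. Keeping the last two turns verbatim is an assumption chosen to match the "last 2–3 turns" guidance later on this page.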
Compression technique 2: selective tool-result truncation
- The 80/20 of tool verbosity. In most agent workloads, 80% of the token bloat comes from 20% of tool calls: web searches returning full page text, code execution returning verbose stack traces, database queries returning full row sets, file reads returning entire documents. Targeted truncation of these outputs — rather than broad compression of the entire history — is the highest-leverage first step and requires no LLM calls to implement.
- Truncation strategies by tool type. For web search results: extract the title, URL, and first 2–3 sentences of each result; discard boilerplate HTML, navigation, and footer text. For code execution: keep the final output lines and any exception traceback; discard verbose logging if the run succeeded. For database results: if the query returns more than 20 rows, include the first 10 rows plus a “...and N more rows” note; include column names and types once, not per row. For file reads: if the file exceeds a line threshold, include the first 100 lines plus a summary line; the model can request more with a targeted sub-range tool call.
- Semantic filtering vs hard truncation. Hard truncation (take the first N characters) is simple but can cut mid-sentence. Semantic filtering uses a lightweight embedding similarity check to score each chunk of a tool result against the current task description and retains only the top-K most relevant chunks. Libraries like LLMLingua and recomp implement token-level compression that preserves semantic content more faithfully than hard truncation for a modest additional cost.
- Annotating tool results with quality metadata. A simple optimization: prepend each tool result with a brief “relevance annotation” added by a heuristic or classifier. If the relevance score is below a threshold, replace the full result with the annotation alone (“[web search returned 8 results; none appear relevant to the current subtask; retrying with different query]”). The model can read the annotation and decide to re-query rather than processing thousands of tokens of irrelevant content.
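The database and file-read rules above need no model call at all. A minimal sketch follows, using the thresholds from the text (20 rows, 10 kept, 100 lines); the function names and the exact wording of the truncation notes are assumptions.

```typescript
type Row = Record<string, unknown>;

// Database results: column names once, then at most `keep` rows when the
// result set exceeds `maxRows`, plus a note with the number of omitted rows.
function truncateRows(rows: Row[], maxRows = 20, keep = 10): string {
  if (rows.length === 0) return "(0 rows)";
  const columns = Object.keys(rows[0]);
  const shown = rows.length > maxRows ? rows.slice(0, keep) : rows;
  const lines = [columns.join(" | ")];
  for (const row of shown) {
    lines.push(columns.map((c) => String(row[c])).join(" | "));
  }
  if (shown.length < rows.length) {
    lines.push(`...and ${rows.length - shown.length} more rows`);
  }
  return lines.join("\n");
}

// File reads: first `maxLines` lines plus one summary line telling the model
// how much was left out, so it can request a sub-range if it needs more.
function truncateFile(text: string, maxLines = 100): string {
  const lines = text.split("\n");
  if (lines.length <= maxLines) return text;
  const note = `[truncated: showing lines 1-${maxLines} of ${lines.length}]`;
  return lines.slice(0, maxLines).concat(note).join("\n");
}
```

Applying these at the point where a tool result is appended to the history, rather than when the prompt is assembled, means the oversized result never enters the history in the first place.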
Compression technique 3: provider-level prompt caching
- Anthropic prompt caching. Anthropic’s cache_control: {"type": "ephemeral"} marker tells the API to cache everything up to that marker for 5 minutes, with the timer refreshed on each hit. On a cache hit, cached input tokens are billed at 10% of the standard input rate; the request that writes the cache pays a 25% premium on those tokens. For a 4,000-token system prompt that appears on every turn of a 15-turn session, caching saves 90% of the system-prompt input cost on turns 2–15. The savings compound across high-frequency agents: assuming $3 per million input tokens for Sonnet, each cached turn saves about $0.011 per session (4,000 tokens × 90% × $3 per million), so an agent running 1,000 such 15-turn sessions per day saves roughly $150/day on the system prompt alone, before the small cache-write premium.
- OpenAI automatic prefix caching. OpenAI automatically caches common prefixes for prompts of 1,024 tokens or more. The cached portion is billed at a discount: 50% of the standard input price on GPT-4o, with the exact rate varying by model. Unlike Anthropic, this is automatic and requires no code change, but the cache hit rate depends on how stable the prefix is across requests. The more the beginning of your prompt changes between requests (e.g., if you embed a timestamp or user-specific data early in the system prompt), the lower your cache hit rate. Structure prompts so that stable, reusable content (instructions, persona, output format) appears before dynamic content (user data, session state).
- Stacking caching with compression. Caching and compression are complementary, not alternatives. Use caching to reduce the per-request cost of the stable prefix; use compression (rolling summaries, truncation) to keep the dynamic tail of the conversation short. A well-tuned agent stack applies both: the system prompt is cached, the tool-heavy middle turns are truncated and summarized, and only the last 2–3 turns are sent verbatim.
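The "stable content first" rule can be made concrete with a request-body sketch. It builds a plain object in the shape of an Anthropic Messages API request, where `system` is a list of text blocks and the cache marker sits on the last stable block. The `buildRequest` helper and its parameters are hypothetical, the model ID is left to the caller, and nothing is sent over the network.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface Message { role: "user" | "assistant"; content: string }

interface MessagesRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: Message[];
}

function buildRequest(
  model: string,
  stableInstructions: string, // persona, rules, output format: identical on every call
  sessionState: string,       // rolling summary, working memory: changes per turn
  messages: Message[],        // last few verbatim turns
): MessagesRequest {
  return {
    model,
    max_tokens: 1024,
    system: [
      // Everything up to and including this block is eligible for caching.
      { type: "text", text: stableInstructions, cache_control: { type: "ephemeral" } },
      // Dynamic content goes after the marker so it never invalidates the prefix.
      { type: "text", text: sessionState },
    ],
    messages,
  };
}
```

The same ordering helps with OpenAI's automatic prefix caching even though no marker is involved there: the stable block simply has to come first and stay byte-identical between requests.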
Compression technique 4: structured working memory
- What working memory is. Instead of keeping a free-form conversation history, you maintain a structured JSON object that records what the agent has learned. Each tool call updates the object rather than appending to a flat history. The object is injected into each prompt as a compact JSON block. Because a key-value record drops the narrative around each fact, it is far more compact than prose: 20 key-value facts fit in roughly 200 tokens where the turns that established them would take 800–1,200 tokens, so working memory compresses information-dense state by 4–6x compared to narrative history.
- Example schema for a research agent. A research agent tracking a multi-step investigation might maintain:

    {
      "goal": "compare pricing of competitors X, Y, Z",
      "confirmed": {"X_price": "$49/mo", "Y_price": "$79/mo"},
      "pending": ["Z_price"],
      "sources": ["x.com/pricing", "y.com/pricing"],
      "dead_ends": ["z.com/pricing — paywall"]
    }

  This 100-token block replaces several turns of tool calls and results that established the same facts. The narrative of how the facts were found is discarded; only the facts are kept.
- Memory staleness and eviction. Working memory requires a staleness policy: facts that were relevant in step 3 may be irrelevant in step 12. A simple time-to-live (TTL) policy tags each fact with the turn at which it was added and evicts facts older than N turns. More sophisticated policies use relevance scoring: before each API call, score each fact against the current step description and evict facts below a relevance threshold.
- Combining working memory with rolling summaries. Working memory handles structured facts; rolling summaries handle narrative continuity. The two are complementary. Inject working memory as a JSON block at the start of the context, followed by the rolling summary, followed by the last 2–3 verbatim turns. This three-layer structure gives the model: (1) the current known facts in compact form, (2) the narrative of what has been done, and (3) the immediate exchange for turn-level continuity. It also keeps total context within a small working budget such as 8,000 tokens. Because every turn resends the whole context, cumulative input cost grows roughly with the square of session length when history is left unchecked, so a fixed per-turn budget pays off more the longer the session runs.
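A minimal sketch of the TTL eviction policy and the three-layer assembly described above. The `Fact` shape, the function names, and the section labels in the assembled text are assumptions chosen for illustration.

```typescript
interface Fact { value: string; addedAtTurn: number }
type WorkingMemory = Record<string, Fact>;

// TTL policy: keep a fact only if it was added within the last `ttlTurns` turns.
function evictStale(memory: WorkingMemory, currentTurn: number, ttlTurns: number): WorkingMemory {
  const kept: WorkingMemory = {};
  for (const key of Object.keys(memory)) {
    const fact = memory[key];
    if (currentTurn - fact.addedAtTurn <= ttlTurns) kept[key] = fact;
  }
  return kept;
}

// Three layers, in order: compact facts, narrative summary, recent verbatim turns.
function assembleContext(memory: WorkingMemory, summary: string, recentTurns: string[]): string {
  const facts: Record<string, string> = {};
  for (const key of Object.keys(memory)) facts[key] = memory[key].value;
  return [
    "KNOWN FACTS (JSON):",
    JSON.stringify(facts), // turn metadata is dropped; only values reach the model
    "SUMMARY OF PRIOR WORK:",
    summary,
    "RECENT TURNS:",
    ...recentTurns,
  ].join("\n");
}
```

A turn-count TTL is the bluntest possible policy: a fact such as the session goal should usually be exempt, which you could handle by re-stamping `addedAtTurn` whenever a fact is used.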
Measuring compression effectiveness: the metrics that matter
- Input tokens per session. The primary cost driver. Log usage.input_tokens from every API response and compute the median and 95th percentile per session. Establish a baseline before implementing compression, then compare against it afterward. A 40% reduction in total input tokens translates directly to a 40% reduction in uncached input cost (output costs are unaffected by input compression).
- Task completion rate. Compression only has value if it does not degrade output quality. Track the percentage of agent sessions that complete their task successfully (however your system defines success: tool call succeeded, final answer accepted, no error exit). If compression causes the completion rate to drop by more than 2–3 percentage points, the compression is too aggressive and you are discarding context the model genuinely needs.
- Context window headroom. Track the gap between your context used per session and the model’s context window limit. If you regularly hit 80%+ of the context limit, you are at risk of context overflow errors mid-session. Compression should keep your median session well below 50% of the context window, leaving headroom for tasks that run longer than expected.
- Summary quality degradation over long sessions. For rolling summaries, periodically spot-check whether the summaries accurately capture the key facts from the turns they replace. The easiest check: run 20 sessions with summaries enabled and 20 with full verbatim history, then compare task completion rates. A sample that small only exposes large regressions; confirming a difference of 2–3 points takes on the order of a thousand or more sessions per arm. If completion rate drops with summaries, improve the summarization prompt or increase the verbatim window.
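The first two metrics can be computed from per-session logs with a few lines. This sketch uses the nearest-rank percentile definition (so the median of an even-sized sample is the lower of the two middle values); the `SessionLog` shape and function names are assumptions.

```typescript
interface SessionLog {
  inputTokens: number; // sum of usage.input_tokens over the session's API calls
  completed: boolean;  // whatever your system counts as success
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarizeSessions(sessions: SessionLog[]): {
  medianInputTokens: number;
  p95InputTokens: number;
  completionRate: number;
} {
  const tokens = sessions.map((s) => s.inputTokens);
  const completed = sessions.filter((s) => s.completed).length;
  return {
    medianInputTokens: percentile(tokens, 50),
    p95InputTokens: percentile(tokens, 95),
    completionRate: completed / sessions.length,
  };
}
```

Run it once on the baseline logs and once on the post-compression logs, and compare the two results side by side rather than tracking either number in isolation.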
RunGuard ContextGuard: enforcing token budgets as a circuit breaker
- Why a guardrail layer is necessary. Compression techniques require engineering effort to implement and tune. During development, they typically work well on the happy path. Under real-world load — unusually long sessions, edge-case tool results, user inputs that expand context faster than expected — context can still grow beyond budget even with compression in place. RunGuard’s ContextGuard provides the last line of defense: a hard token cap that trips before the context overflow causes a provider 400 error or a four-figure invoice.
- How ContextGuard works. You wrap your agent’s LLM call with guard(fn, { context: { maxContextTokens: 32000, headroom: 4000 }, tokens: (input) => estimateTokens(input) }). Before each LLM call, RunGuard projects the context size using your tokens() function. If the projection exceeds maxContextTokens - headroom, RunGuard throws a ContextOverflowError before the API call is made, giving your agent loop the opportunity to trigger compression, discard history, or alert the user rather than sending an oversized request that either fails or costs unexpectedly.
- Combining ContextGuard with your compression pipeline. The integration point is the catch (e) block around your agent’s LLM call: if (e instanceof ContextOverflowError) { history = applyRollingCompression(history); continue; }. This makes compression reactive rather than proactive: context grows freely until it approaches the budget, then compression fires automatically. The result is that you compress only when necessary (lower overhead on short sessions) while still preventing overflow on long sessions.
- Budget-per-session enforcement. ContextGuard addresses context-window limits; RunGuard’s BudgetTracker addresses dollar-cost limits. Use both together: set a context cap per call (ContextGuard) and a dollar cap per session (BudgetTracker). If token costs are spiraling because context is large even after compression, BudgetTracker trips and halts the session before the user’s monthly allocation is consumed. See AI agent cost per user session for the math on per-session budget allocation.
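To show how the reactive loop fits together, here is a self-contained sketch. The `guard` function and `ContextOverflowError` class below are local stand-ins written from the description above, not RunGuard's implementation, and they are synchronous for brevity. `estimateTokens` (a crude 4-characters-per-token heuristic) and `applyRollingCompression` (drops the oldest half) are hypothetical placeholders for your own helpers.

```typescript
// Stand-in for the behavior described above; NOT the RunGuard implementation.
class ContextOverflowError extends Error {
  readonly projected: number;
  readonly limit: number;
  constructor(projected: number, limit: number) {
    super(`projected ${projected} tokens exceeds limit ${limit}`);
    this.name = "ContextOverflowError";
    this.projected = projected;
    this.limit = limit;
  }
}

interface GuardOptions<I> {
  context: { maxContextTokens: number; headroom: number };
  tokens: (input: I) => number;
}

// Wraps fn so that oversized inputs throw before fn is ever called.
function guard<I, O>(fn: (input: I) => O, opts: GuardOptions<I>): (input: I) => O {
  const limit = opts.context.maxContextTokens - opts.context.headroom;
  return (input: I): O => {
    const projected = opts.tokens(input);
    if (projected > limit) throw new ContextOverflowError(projected, limit);
    return fn(input);
  };
}

// Hypothetical helpers: rough token estimate and a placeholder compressor.
const estimateTokens = (history: string[]): number => Math.ceil(history.join("\n").length / 4);
const applyRollingCompression = (history: string[]): string[] =>
  history.slice(Math.floor(history.length / 2));

// One agent turn: retry with compressed history until the call fits or cannot shrink further.
function runTurn(
  history: string[],
  callModel: (h: string[]) => string,
  opts: GuardOptions<string[]>,
): { reply: string; history: string[] } {
  const guarded = guard(callModel, opts);
  let current = history;
  for (;;) {
    try {
      return { reply: guarded(current), history: current };
    } catch (e) {
      if (e instanceof ContextOverflowError && current.length > 1) {
        current = applyRollingCompression(current);
        continue;
      }
      throw e; // not an overflow, or nothing left to compress
    }
  }
}
```

The `current.length > 1` check is a design choice worth keeping in any real loop: without a stopping condition, a single oversized message would make the retry spin forever.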
Implementation checklist: rolling out compression to an existing agent
- Step 1 — instrument first. Before writing a line of compression code, add token logging to every API call: logger.info({ input_tokens: response.usage.input_tokens, output_tokens: response.usage.output_tokens, session_id, turn_number }). Run for one week. Plot the per-session token distribution. Identify the P90 session length and token count. This baseline tells you where compression will have the most impact.
- Step 2 — identify the top token contributors. Break down tokens by source: system prompt, tool results, conversation history. Typically one of these dominates. If tool results dominate, start with selective truncation (no LLM calls required, highest ROI). If history dominates, implement rolling summaries. If system prompt dominates, implement prompt caching.
- Step 3 — implement one technique at a time. Rolling out all four techniques simultaneously makes it impossible to attribute quality changes to a specific change. Implement, measure impact on both token count and task completion rate, ship, then implement the next technique.
- Step 4 — add ContextGuard as a safety net. After compression is in place, add ContextGuard with maxContextTokens set about 20% above your post-compression P95 per-call context size. This gives compression room to breathe on normal sessions while catching edge cases where context escapes the compression logic.
- Step 5 — review quarterly. Token pricing changes. Model context windows expand. New providers offer better caching terms. Review your compression architecture quarterly and update truncation thresholds, summary prompts, and token budgets based on current provider pricing and your actual usage patterns.
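Steps 2 and 4 reduce to two small decisions that can be written down directly. The sketch below assumes you already have a per-source token breakdown from your logs; the type names, the mapping table, and `contextGuardLimit` are illustrative, not part of any library.

```typescript
type Source = "systemPrompt" | "toolResults" | "history";
type Breakdown = Record<Source, number>;

// Step 2: which technique to start with, keyed by the dominant token source.
const firstTechnique: Record<Source, string> = {
  toolResults: "selective truncation",
  history: "rolling summaries",
  systemPrompt: "prompt caching",
};

function dominantSource(b: Breakdown): Source {
  const sources: Source[] = ["systemPrompt", "toolResults", "history"];
  // Ties resolve to the earlier entry in the list.
  return sources.reduce((best, s) => (b[s] > b[best] ? s : best));
}

// Step 4: a guard limit set a fixed percentage above the observed P95.
function contextGuardLimit(p95Tokens: number, marginPercent = 20): number {
  return Math.ceil((p95Tokens * (100 + marginPercent)) / 100);
}

const week1: Breakdown = { systemPrompt: 30000, toolResults: 140000, history: 70000 };
const startWith = firstTechnique[dominantSource(week1)]; // "selective truncation"
const limit = contextGuardLimit(10000);                  // 12000
```

Re-running the breakdown after each technique ships is what makes Step 3's one-at-a-time rollout measurable: the dominant source usually changes once the first one is under control.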
Stop paying for tokens you don’t need
RunGuard’s ContextGuard trips before context overflow costs you a provider error or an unexpected bill. Add it alongside your compression pipeline in one line of code — no infrastructure changes required.