AI agent memory consolidation cost optimization: cut context token spend 60–80% in long-running agents
The cost model for LLM API calls is simple: you pay for every input token and every output token in each call. In a multi-turn agent session, the input tokens include the full conversation history (every user message, every assistant response, every tool call result) accumulated since the session started. For short sessions (3–5 turns), this is negligible.

For long-running agents (research tasks, autonomous coding sessions, customer support bots that handle 30-turn conversations), the cost dynamics are different. Turn N costs tokens proportional to the sum of all previous turns: a 30-turn conversation where each turn adds 500 tokens means turn 30 carries roughly 15,000 tokens of history in its input alone. The average cost per turn grows linearly with conversation length; the total session cost grows quadratically.

Memory consolidation breaks this growth curve. By summarizing and pruning older conversation turns, you reduce the context passed into each LLM call while preserving the information the model needs to continue the task. This guide covers three consolidation strategies, when to trigger consolidation, how to measure consolidation effectiveness, and how to use RunGuard’s context-size budget alerts to automate consolidation before token costs spike.
Why context cost grows quadratically without consolidation
- The accumulated context problem, illustrated. Consider a research agent that completes 20 tool calls per session (web searches, document reads, code executions), one per turn. Each tool call adds approximately 200 tokens of result to the context. By turn 20, the context includes: system prompt (1,000 tokens) + 20 user/assistant turn pairs (~400 tokens each = 8,000 tokens) + 20 tool call results (~200 tokens each = 4,000 tokens) = 13,000 input tokens for turn 20’s call. Turn 1 costs about 1,200 input tokens. Turn 20 costs 13,000, nearly 11× more per call. Averaging the first and last turn, the input cost per turn across the session is ~7,100 tokens, or ~142,000 input tokens in total. Without consolidation, a 20-turn session at $0.015 per 1K input tokens therefore costs about $0.015 × 142,000 / 1,000 = $2.13. With consolidation at turn 10 (compressing turns 1–10 to a 500-token summary), turns 11–20 see a much smaller context. Rough savings: 40–60% on input token cost for turns 11–20, before subtracting the cost of the summarization call.
- Three conditions that accelerate context growth. (1) Large tool call outputs: if your agent runs code and captures stdout, a single tool call can inject 10,000+ tokens of output into the context. (2) Verbose model responses: models like Claude tend to produce thorough responses; each assistant turn may add 1,000–2,000 tokens to context. (3) Document ingestion: if the agent reads files or web pages, those documents land in context at their full length. Any of these three conditions alone can drive per-turn context growth well above the ~600 tokens per turn (a ~400-token turn pair plus a ~200-token tool result) assumed in the example above.
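The arithmetic in the example above can be reproduced with a few lines of Python. This is a minimal sketch, not a billing calculator: it assumes every turn adds a flat 600 tokens (400 for the turn pair, 200 for the tool result), ignores output tokens and the cost of the summarization call, and uses a hypothetical helper name, `input_tokens_per_turn`. Because it counts every turn exactly instead of averaging the first and last turn, its totals land slightly above the rounded figures in the text.

```python
PRICE_PER_1K_INPUT = 0.015  # USD; the illustrative rate from the example above


def input_tokens_per_turn(turns, system=1_000, per_turn=600,
                          consolidate_at=None, summary_tokens=500):
    """Return the input token count for each turn of a session.

    Assumes a flat `per_turn` growth and, if `consolidate_at` is set, that all
    turns up to and including that turn are replaced by a summary of
    `summary_tokens` tokens from the following turn onward.
    """
    counts = []
    for n in range(1, turns + 1):
        if consolidate_at is not None and n > consolidate_at:
            # Summary of the old turns plus the turns added since consolidation.
            history = summary_tokens + per_turn * (n - consolidate_at)
        else:
            history = per_turn * n
        counts.append(system + history)
    return counts


baseline = input_tokens_per_turn(20)
consolidated = input_tokens_per_turn(20, consolidate_at=10)

baseline_cost = sum(baseline) / 1_000 * PRICE_PER_1K_INPUT
late_saving = 1 - sum(consolidated[10:]) / sum(baseline[10:])

print(f"baseline: {sum(baseline):,} input tokens, ${baseline_cost:.2f}")
print(f"input savings on turns 11-20: {late_saving:.0%}")
```

Under these assumptions the baseline session uses 146,000 input tokens ($2.19), and consolidating at turn 10 cuts input tokens for turns 11–20 by about 53%, which sits inside the 40–60% range quoted above.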
Three memory consolidation strategies
- Strategy 1: rolling summary (most common). At a defined trigger point (e.g., every 10 turns, or when context exceeds 50% of the context window), call the LLM with the oldest N turns and ask it to produce a dense summary. Replace those N turns with the summary. New turns are added normally after the summary. The summary is typically 3–10× shorter than the content it replaces, reducing context by 65–90%. Tradeoff: the summarization call itself costs tokens. Budget the consolidation call at roughly 0.2–0.5× the cost of the turns being consolidated. Net savings are still strongly positive if the consolidated turns would otherwise have appeared in 5+ future turns.
- Strategy 2: selective pruning (most token-efficient). Rather than summarizing, identify turns that are no longer relevant and drop them entirely. Tool call results that were used to make a decision (but the decision is now recorded in a later turn) are prime pruning candidates. System-generated status messages ("Tool executed successfully") are often safe to prune once acknowledged. Pruning has zero consolidation call cost but requires identifying which turns are safe to drop — which requires either heuristics or a classification call. Heuristic-based pruning (drop tool results older than 5 turns unless they contain the word "ERROR" or "IMPORTANT") works well for structured agents where tool output types are predictable.
- Strategy 3: hierarchical memory (most information-preserving). Maintain two memory tiers: working memory (recent 5–10 turns in context) and long-term memory (a structured knowledge store outside context, queried by the model on demand). When turns fall out of working memory, they are compressed into the long-term store. The model retrieves from long-term memory via a tool call ("recall what I found about X"). This approach is the most complex to implement but preserves the most information: long-term memory is never pruned, just summarized into structured form. Token cost is dramatically lower because only working memory is in context, plus the cost of any recall queries.
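Strategies 2 and 3 can be sketched in a few lines each. The code below is an illustration under stated assumptions, not a reference implementation: the message shape (`turn`, `kind`, `content` keys), the function `prune_tool_results`, and the class `TieredMemory` are all hypothetical names introduced here. The pruning rule is the heuristic described above (drop tool results older than 5 turns unless they contain "ERROR" or "IMPORTANT"); the tiered memory stores evicted turns verbatim where a real system would summarize them first.

```python
from dataclasses import dataclass, field

KEEP_MARKERS = ("ERROR", "IMPORTANT")


def prune_tool_results(messages, max_age=5):
    """Selective pruning: drop old tool results unless they are flagged.

    `messages` is a list of dicts with hypothetical keys "turn" (int),
    "kind" ("user", "assistant" or "tool_result") and "content" (str).
    """
    if not messages:
        return []
    latest = max(m["turn"] for m in messages)
    kept = []
    for m in messages:
        is_old_tool = m["kind"] == "tool_result" and latest - m["turn"] > max_age
        flagged = any(marker in m["content"] for marker in KEEP_MARKERS)
        if is_old_tool and not flagged:
            continue  # safe to drop under the heuristic
        kept.append(m)
    return kept


@dataclass
class TieredMemory:
    """Hierarchical memory: a bounded working tier plus an out-of-context store."""
    working_size: int = 8
    working: list = field(default_factory=list)    # (topic, content) pairs kept in context
    long_term: dict = field(default_factory=dict)  # topic -> list of stored notes

    def add(self, topic, content):
        self.working.append((topic, content))
        while len(self.working) > self.working_size:
            old_topic, old_content = self.working.pop(0)
            # A real system would compress old_content before storing it.
            self.long_term.setdefault(old_topic, []).append(old_content)

    def recall(self, topic):
        """Back a 'recall what I found about X' tool call."""
        return self.long_term.get(topic, [])
```

Only `working` would be serialized into the prompt; `recall` would be exposed to the model as a tool, so long-term content costs tokens only when it is actually requested.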
When to trigger consolidation: context budget thresholds
- Threshold-based triggers. The simplest trigger: consolidate when the current context token count exceeds a threshold. Common thresholds: 25% of context window (early consolidation, minimal quality impact), 50% of context window (balanced consolidation), 75% of context window (late consolidation, risk of hitting window limit before consolidation completes). The 50% threshold is the standard recommendation: it leaves enough room for the consolidation call’s output without risking truncation, and intervenes early enough that turns 1–N/2 are still compressible without losing important recent context.
- Cost-based triggers (more precise). Rather than token count, trigger consolidation when the cumulative session cost exceeds a threshold. If your per-session budget is $2.00, trigger consolidation at $0.80 spent (40% of budget). This ties consolidation to the actual cost driver rather than a proxy metric. Cost-based triggers handle the case where tokens are cheap (consolidation can wait) versus expensive (consolidate aggressively). RunGuard’s on_budget_threshold callback is the natural integration point for cost-based consolidation triggers.
- Turn-count triggers (most predictable). Consolidate every N turns regardless of context size or cost. Predictable behavior, easy to reason about, easy to test. Downside: may consolidate too early (wasting consolidation call cost when context is still small) or too late (if turns are large). Best suited for agents with consistent turn sizes where N can be calibrated experimentally.
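The three trigger types can be combined in one small decision function. The sketch below is independent of RunGuard and uses hypothetical names (`SessionState`, `should_consolidate`); the default fractions are the 50% context-window and 40% budget thresholds discussed above, and the 200,000-token window in the usage lines is an assumed value, not a property of any particular model.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    context_tokens: int   # tokens currently sent as input each call
    context_window: int   # model context window, in tokens
    spent_usd: float      # cumulative session cost so far
    budget_usd: float     # per-session budget
    turn: int             # number of completed turns


def should_consolidate(state, window_fraction=0.50, budget_fraction=0.40,
                       every_n_turns=None):
    """Return the list of triggers that fire for the current session state."""
    reasons = []
    if state.context_tokens > window_fraction * state.context_window:
        reasons.append("context")
    if state.spent_usd >= budget_fraction * state.budget_usd:
        reasons.append("cost")
    if every_n_turns and state.turn > 0 and state.turn % every_n_turns == 0:
        reasons.append("turn_count")
    return reasons


# Usage: a session at 60% of an assumed 200K window, well under budget.
state = SessionState(context_tokens=120_000, context_window=200_000,
                     spent_usd=0.30, budget_usd=2.00, turn=7)
print(should_consolidate(state))
```

Returning the reasons instead of a bare boolean makes it easy to log which trigger fired. A caller would also need to remember that the cost trigger has already fired, since cumulative spend stays above the threshold for the rest of the session.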
Python: cost-triggered memory consolidation with RunGuard
```python
import runguard
from anthropic import AsyncAnthropic

# The handlers below are coroutines, so use the async client and await each call.
client = AsyncAnthropic()

guard = runguard.Guard(
    app_id="research-agent",
    budget_usd=2.00,
    on_budget_threshold=[
        # Trigger consolidation at 40% of budget
        {"threshold_usd": 0.80, "callback": "consolidate_memory"},
    ],
)


async def consolidate_memory(session_context: dict) -> dict:
    """Summarize the oldest half of the conversation turns."""
    messages = session_context["messages"]
    if len(messages) < 6:
        return session_context  # Too few turns to consolidate

    # Oldest half of turns (excluding the system prompt at index 0)
    midpoint = len(messages) // 2
    turns_to_summarize = messages[1:midpoint]
    recent_turns = messages[midpoint:]

    # Render the turns as a plain transcript instead of a Python list repr
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in turns_to_summarize)

    # Summarization call (modest cost, big savings on future turns)
    summary_response = await client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use a cheap model for summaries
        max_tokens=500,
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize these conversation turns concisely, preserving all "
                    "decisions made, key findings, and unresolved questions:\n\n"
                    f"{transcript}"
                ),
            }
        ],
    )
    summary = summary_response.content[0].text

    # Replace old turns with the summary. It is sent with the user role so the
    # message list passed to the API still starts with a user turn.
    session_context["messages"] = [
        messages[0],  # System prompt preserved
        {"role": "user", "content": f"[Memory summary of earlier conversation]: {summary}"},
        *recent_turns,
    ]
    return session_context


@guard.protect
async def run_agent_turn(user_message: str, session_id: str, context: dict) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=context["messages"][0]["content"],
        messages=context["messages"][1:] + [{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```

- Why use a cheap model for consolidation. The consolidation call reads long content and produces a short summary, which is the kind of task where a smaller, cheaper model (Haiku, GPT-4o-mini) typically performs close to a frontier model. The frontier model’s strengths (reasoning, code generation, nuanced response) are mostly not needed for summarization. Using Haiku for consolidation instead of Claude Sonnet cuts the consolidation call cost several-fold (up to ~10×, depending on the model versions and current per-token pricing), making the net economics of consolidation strongly positive even for moderate context lengths.
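To check whether consolidation pays for itself in a given session, the net economics can be estimated with a short calculation. This is a rough sketch with a hypothetical helper, `consolidation_net_savings`; the $0.015 per 1K input rate is the illustrative figure used earlier, while the summary-model prices in the usage lines are placeholder values, not published pricing.

```python
def consolidation_net_savings(consolidated_tokens, summary_tokens, future_turns,
                              main_input_price_per_1k,
                              summary_input_price_per_1k,
                              summary_output_price_per_1k):
    """Estimate USD saved by consolidating, net of the summarization call.

    Savings: every future turn sends the summary instead of the original turns.
    Cost: the summary model reads the original turns and writes the summary.
    """
    saved_per_turn = (consolidated_tokens - summary_tokens) / 1_000 * main_input_price_per_1k
    call_cost = (consolidated_tokens / 1_000 * summary_input_price_per_1k
                 + summary_tokens / 1_000 * summary_output_price_per_1k)
    return saved_per_turn * future_turns - call_cost


# Usage: compress 6,000 tokens of old turns into a 500-token summary with
# 10 turns still to come. Summary-model prices are placeholders.
net = consolidation_net_savings(
    consolidated_tokens=6_000, summary_tokens=500, future_turns=10,
    main_input_price_per_1k=0.015,
    summary_input_price_per_1k=0.001,
    summary_output_price_per_1k=0.005,
)
print(f"net savings: ${net:.4f}")
```

With these placeholder prices the summarization call costs under one cent while each future turn saves about eight cents of input, so the call pays for itself within the first turn after consolidation.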
Memory consolidation strategy comparison
| Strategy | Consolidation call cost | Context reduction | Information loss risk | Best for |
|---|---|---|---|---|
| Rolling summary | Low (cheap model) | 65–90% | Low (summary preserves key facts) | Research agents, long coding sessions |
| Selective pruning | Zero (heuristics) to low (classifier) | 20–50% | Medium (heuristics can drop needed context) | Structured agents with predictable tool output types |
| Hierarchical memory | Low (on-demand recall calls) | 70–90% | Very low (older turns stay retrievable from the long-term store, in summarized form) | Long multi-session agents, knowledge workers |
| No consolidation | Zero | 0% | None (full context preserved) | Short sessions (<8 turns), tight latency requirements |
For context window budget alerts, see AI agent context window truncation alert. For broader token cost optimization, see Anthropic Claude API cost optimization.
Automate memory consolidation with RunGuard budget triggers
For long-running agents, memory consolidation is one of the highest-ROI cost optimizations available. A rolling summary triggered at 40% of budget spend can cut the remaining 60% of session cost in half. The key is automating the trigger so consolidation happens consistently, not manually. RunGuard’s on_budget_threshold callback wires that trigger directly to your session cost, ensuring consolidation fires at the right time regardless of how many turns the session has taken.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.