AI agent memory leak detection: how unbounded context growth crashes agents and amplifies costs

Every AI agent accumulates state as it runs. Conversation turns pile up. Tool results get appended to the message history. Retrieved documents are injected inline. Chain-of-thought traces are kept for the model to reason over. None of this is inherently wrong, but when that state is never evicted, bounded, or summarised, you have a memory leak. The symptoms are predictable: a context window overflow error that kills the agent mid-run; input token counts that grow linearly per call (turn 100 sends roughly 100× more tokens than turn 1), so cumulative spend grows quadratically; unbounded RAM usage in the host process as the message list grows; and cost amplification loops where the agent re-retrieves and re-processes the same documents on every step. This page explains the four memory leak patterns, how to instrument token tracking at runtime, how RunGuard’s ContextGuard fires before the provider returns a 400, and the three eviction strategies you can combine to keep context bounded without losing agent coherence.

The four AI agent memory leak patterns

Pattern 1: conversation history leak

The most common leak. Your agent loop appends every user message and assistant response to a list that is passed to the LLM on every call. No turn is ever removed. This is correct behaviour in a short session — the model needs recent context to reason — but it becomes a leak when sessions run for tens or hundreds of turns.

At a rough average of 1,000 tokens per turn (user message + assistant response + tool call + tool result), the context size grows linearly:

Turn 10: ~10,000 input tokens per call. Inexpensive, no problem.
Turn 50: ~50,000 input tokens per call, about 39% of GPT-4o’s 128k context. At $5/M input tokens, that is $0.25 per planning step.
Turn 100: ~100,000 input tokens per call. Every single planning step now costs $0.50 in input tokens alone, and the agent is 28,000 tokens from a hard overflow error.

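The cost is not just the final turn. Because every prior turn is re-sent on every call, the total input cost is the triangular sum 1 + 2 + 3 + … + N = N(N + 1)/2 turn-sized units. For a 100-turn session that is 5,050 units without eviction, against 955 units with a 10-turn sliding window (1 + 2 + … + 10 for the first ten calls, then 10 units for each of the remaining 90). The unbounded session therefore costs roughly 5× more in input tokens than the windowed one, and roughly 50× more than it would if each call sent only the current turn.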
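The cumulative totals can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes every turn adds exactly 1,000 tokens and that the $5/M input price used above applies; `cumulative_input_tokens` is a name introduced here for illustration only.

```python
from typing import Optional

def cumulative_input_tokens(
    turns: int,
    tokens_per_turn: int = 1_000,
    window: Optional[int] = None,
) -> int:
    """Total input tokens sent over a whole session.

    On call k the agent re-sends min(k, window) turns of history,
    or all k turns when window is None (no eviction).
    """
    total = 0
    for k in range(1, turns + 1):
        turns_sent = k if window is None else min(k, window)
        total += turns_sent * tokens_per_turn
    return total

PRICE_PER_M_INPUT = 5.00  # USD per million input tokens (assumed)

unbounded = cumulative_input_tokens(100)            # 5,050,000 tokens
windowed = cumulative_input_tokens(100, window=10)  # 955,000 tokens

for label, tokens in [("no eviction", unbounded), ("10-turn window", windowed)]:
    cost = tokens / 1_000_000 * PRICE_PER_M_INPUT
    print(f"{label:>15}: {tokens:>9,} input tokens, ${cost:.2f}")
```

Real sessions have uneven turn sizes, so treat the output as an order-of-magnitude guide rather than a forecast.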

Pattern 2: tool result accumulation

An agent using a web-scraping tool appends the full HTML of each scraped page to its context so it can reason over the content. At 5–50 KB per page and many pages scraped per session, the context fills rapidly. The agent was not designed to be a document store; the LLM’s context window was not designed to hold raw HTML. The correct pattern is to extract the relevant span of text before injecting it, or to store the full content in a vector store and inject only the top-k retrieved chunks. Injecting full tool outputs verbatim is the tool-result equivalent of a memory leak.
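One way to apply that pattern is to strip markup and cap the size of each tool result before it reaches the message list. The sketch below uses only the standard library; `compact_tool_result` and the 4,000-character cap are illustrative assumptions, and a production agent would more likely extract the passage relevant to the current query than truncate blindly.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.parts.append(text)

def compact_tool_result(html: str, max_chars: int = 4_000) -> str:
    """Return plain text from an HTML page, capped at max_chars characters."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    if len(text) > max_chars:
        text = text[:max_chars] + " [truncated]"
    return text
```

The agent then appends `compact_tool_result(raw_html)` to its history instead of the raw page, which bounds the per-call growth no matter how large the scraped document is.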

Pattern 3: vector retrieval loop

An agent using RAG (retrieval-augmented generation) embeds a query, retrieves the top-k document chunks, and injects them into the context. On the next turn it reasons over the retrieved chunks, decides it needs more information, embeds another query that is semantically similar to the first, retrieves overlapping chunks, and injects them again. After ten turns, the same five document chunks have been injected ten times. The context is full of duplicate content. The retrieval cost (embedding calls) and injection cost (tokens) compound without producing new information. This is a retrieval loop — a memory leak caused by the agent re-doing work it already did and appending the results rather than deduplicating them.
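A retrieval loop of this kind can be blunted by remembering which chunks have already been injected during the run. The following is a minimal sketch, assuming chunks arrive as plain strings; `ChunkDeduper` is a hypothetical helper, not part of any retrieval library.

```python
import hashlib

class ChunkDeduper:
    """Remembers which retrieved chunks were already injected in this run."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def filter_new(self, chunks: list[str]) -> list[str]:
        """Return only the chunks that have not been injected before."""
        fresh: list[str] = []
        for chunk in chunks:
            # Normalise whitespace and case so trivially different copies collide.
            normalised = " ".join(chunk.split()).lower()
            key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
            if key not in self._seen:
                self._seen.add(key)
                fresh.append(chunk)
        return fresh

deduper = ChunkDeduper()
turn_1 = deduper.filter_new(["Refund policy: 30 days.", "Shipping takes 5 days."])
turn_2 = deduper.filter_new(["refund policy:  30 days.", "Returns need a receipt."])
# turn_2 now holds only "Returns need a receipt."
```

An empty list from `filter_new` is itself a useful signal: the agent asked a new question and learned nothing new, which is a reasonable point to stop retrieving.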

Pattern 4: sub-agent result injection

An orchestrator agent spawns sub-agents and appends their full output — including the sub-agent’s internal reasoning and intermediate steps — to its own context. If the sub-agent runs a 20-turn inner loop and produces a 15,000-token reasoning trace, the parent agent’s context grows by 15,000 tokens for each sub-task. With five sub-tasks, the parent is carrying 75,000 tokens of intermediate reasoning that it cannot use and the model struggles to attend to. The correct pattern is to inject only the sub-agent’s final answer, not its full trace. See autonomous agent cost control best practices for guidance on structuring multi-agent context boundaries.
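The boundary between parent and sub-agent can be made explicit in code. In this sketch, `SubAgentResult` and `inject_sub_agent_result` are hypothetical names, and reporting the result back as a `user` message is an assumption; an orchestrator that invokes sub-agents as tools would use a `tool` message instead.

```python
from dataclasses import dataclass, field

@dataclass
class SubAgentResult:
    """Hypothetical container for what a sub-agent hands back to its parent."""
    final_answer: str
    trace: list[dict] = field(default_factory=list)  # inner-loop messages

def inject_sub_agent_result(
    parent_messages: list[dict],
    task: str,
    result: SubAgentResult,
    max_chars: int = 2_000,
) -> list[dict]:
    """Append only the (capped) final answer to the parent's context.

    The trace is deliberately left out; persist it elsewhere if it is
    needed for debugging.
    """
    answer = result.final_answer
    if len(answer) > max_chars:
        answer = answer[:max_chars] + " [truncated]"
    parent_messages.append({
        "role": "user",
        "content": f"Result of sub-task {task!r}: {answer}",
    })
    return parent_messages
```

With this shape, the parent's context grows by the size of the answer rather than by the size of the sub-agent's inner loop.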

Detecting memory leaks at runtime: token tracking and warning thresholds

The first step is visibility. Before you can enforce limits, you need to know how many tokens are in the context on each call. Most LLM provider SDKs return usage counts in the response; you can also estimate pre-call with a tokenizer. Here is a minimal instrumentation pattern in Python:

import tiktoken
from dataclasses import dataclass

enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: 4 overhead per message + content tokens."""
    total = 0
    for msg in messages:
        total += 4  # per-message overhead
        content = msg.get("content") or ""
        if isinstance(content, str):
            total += len(enc.encode(content))
        elif isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "text":
                    total += len(enc.encode(block.get("text", "")))
    return total

@dataclass
class ContextTracker:
    max_tokens: int = 128_000
    warn_at: int = 100_000   # warn when 78 % full
    peak: int = 0

    def check(self, messages: list[dict]) -> int:
        tokens = estimate_tokens(messages)
        if tokens > self.peak:
            self.peak = tokens
        if tokens >= self.max_tokens:
            raise RuntimeError(
                f"Context overflow: {tokens} tokens >= {self.max_tokens} limit. "
                "Apply eviction before next call."
            )
        if tokens >= self.warn_at:
            print(f"[ContextTracker] WARNING: {tokens}/{self.max_tokens} tokens "
                  f"({tokens / self.max_tokens:.0%} full, peak={self.peak})")
        return tokens

This is useful for visibility, but it is manual and error-prone — you must remember to call tracker.check(messages) before every LLM call. RunGuard’s ContextGuard wraps this at the guard layer so it fires automatically on every invocation. The guard fires before the provider call goes out, which means you get a clean exception you can handle rather than a provider 400 with no recovery path. For more on recovering from 400-class errors mid-run, see LLM context window exceeded agent recovery.

from runguard import guard, ContextOverflowError

# ContextGuard fires when estimated tokens exceed (max_context_tokens - headroom).
# headroom=10_000 means it fires at 118,000 tokens for a 128k model —
# giving you room to inject an eviction summary before the hard limit.
run_guard = guard(
    my_llm_call,
    context={"max_context_tokens": 128_000, "headroom": 10_000},
)

def agent_step(messages: list[dict]) -> dict:
    try:
        return run_guard(messages)
    except ContextOverflowError as e:
        print(f"Context overflow at step — peak was {e.peak_tokens} tokens. "
              "Triggering eviction.")
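        # evict_oldest_turns is a placeholder for your own eviction helper;
        # any of the eviction strategies described later on this page fits here.
        messages = evict_oldest_turns(messages, keep=20)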
        return run_guard(messages)  # retry after eviction

After the run, state().contextTokensPeak gives you the high-water mark across all guard() invocations — useful for capacity planning and alerting. See also AI agent context window truncation alerts for setting up proactive monitoring before overflow happens.
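A fixed threshold catches overflow, but a leak is also visible earlier as a trend: input tokens that rise on every call. The sketch below flags that pattern from provider-reported usage. `GrowthMonitor` is a hypothetical helper; `prompt_tokens` is the field name in OpenAI-style usage objects, and other providers name it differently.

```python
from dataclasses import dataclass, field

@dataclass
class GrowthMonitor:
    """Flags a suspected leak when input tokens rise on each of the last `window` calls."""
    window: int = 5
    min_growth: int = 500  # ignore increases smaller than this many tokens
    history: list[int] = field(default_factory=list)

    def record(self, prompt_tokens: int) -> bool:
        """Record one call's input token count; return True if a leak is suspected."""
        self.history.append(prompt_tokens)
        recent = self.history[-(self.window + 1):]
        if len(recent) <= self.window:
            return False  # not enough calls observed yet
        deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
        return all(delta >= self.min_growth for delta in deltas)

monitor = GrowthMonitor(window=3, min_growth=500)
flags = [monitor.record(tokens) for tokens in (1_000, 2_000, 3_000, 4_000)]
# flags == [False, False, False, True]: three consecutive increases of 1,000 tokens
```

A call that follows an eviction breaks the streak, so the monitor goes quiet again once the context is actually being bounded.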

RunGuard ContextGuard integration: combining overflow prevention with loop detection

Memory leaks and retrieval loops often co-occur: the agent that keeps re-querying the same documents is also the agent whose context fills with duplicate content. RunGuard lets you combine ContextGuard and LoopDetector in a single guard so both conditions are caught in one place.

from runguard import guard, ContextOverflowError, LoopDetectedError, BudgetExceededError

def my_llm_call(messages: list[dict], tools: list[dict]) -> dict:
    """Thin wrapper around your LLM provider call."""
    response = llm_client.chat(messages=messages, tools=tools)
    tool_sig = (
        response.tool_calls[0].function.name
        if response.tool_calls else "end_turn"
    )
    usd = compute_cost(response.usage)
    return {
        "response": response,
        "usd": usd,
        "sig": tool_sig,
        # tokens() tells ContextGuard how many tokens are in the current context.
        # Pass a callable so it is evaluated lazily (before the call goes out).
        "tokens": lambda: estimate_tokens(messages),
    }

run_guard = guard(
    my_llm_call,
    # Fire at 118k tokens (128k limit minus 10k headroom)
    context={"max_context_tokens": 128_000, "headroom": 10_000},
    # Fire after the same tool signature appears 4 times in a 10-step window
    loop={"repeats": 4, "max_cycle_len": 10},
    # Hard budget backstop catches any cost amplification that slips through
    budget={"max_usd": 5.00},
)

def run_agent(task: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for step in range(200):
        try:
            result = run_guard(messages, tools=AGENT_TOOLS)
        except ContextOverflowError as e:
            # Apply sliding-window eviction and continue
            messages = sliding_window_evict(messages, keep_turns=15)
            continue
        except LoopDetectedError as e:
            return f"Retrieval loop detected: {e.pattern!r} repeated {e.repeats}x."
        except BudgetExceededError as e:
            return f"Budget cap reached after ${e.spent:.4f}."

        response = result["response"]
        if response.content and not response.tool_calls:
            peak = run_guard.state().contextTokensPeak
            print(f"Run complete. Context peak: {peak:,} tokens.")
            return response.content

        messages.append({"role": "assistant", "content": response.content,
                         "tool_calls": response.tool_calls})
        for tc in (response.tool_calls or []):
            tool_result = dispatch_tool(tc)
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": tool_result})
    return "Max steps reached."

The "tokens" key in the return dict is the integration point for ContextGuard. Passing a callable means the token estimate is computed from the current messages list at the moment the guard evaluates it — before the outbound call. This lets RunGuard intercept the call and raise ContextOverflowError cleanly rather than forwarding a request that will be rejected with a provider 400. For real-time cost alerting alongside context tracking, see preventing AI agent runaway costs in real time.
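The difference between passing a number and passing a callable is plain Python behaviour and can be seen without any guard library. In this sketch, `count_chars` is a stand-in for a real token estimator and counts characters purely to keep the numbers easy to follow.

```python
messages = [{"role": "user", "content": "hello"}]

def count_chars(msgs: list[dict]) -> int:
    """Stand-in for a token estimator: counts characters for simplicity."""
    return sum(len(m.get("content") or "") for m in msgs)

eager = count_chars(messages)         # computed once, right now
lazy = lambda: count_chars(messages)  # recomputed each time it is called

messages.append({"role": "tool", "content": "x" * 100})

print(eager)   # 5: a snapshot taken before the append
print(lazy())  # 105: sees the appended tool result
```

The eager value goes stale as soon as the list is mutated, which is why a callable is the safer thing to hand to anything that inspects the context later.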

Memory eviction strategies: sliding window, summarisation checkpoint, and selective eviction

Detecting overflow is necessary but not sufficient — you also need a strategy for what to remove from the context when the guard fires. The three main strategies are not mutually exclusive; most production agents combine all three.

Strategy 1: sliding window

Keep only the system prompt and the last N turns. Simple, predictable, and fast. The trade-off is that the agent loses access to early turns — fine for task-focused agents where recent context is most relevant, problematic for agents that need to refer back to earlier decisions.

def sliding_window_evict(
    messages: list[dict],
    keep_turns: int = 15,
) -> list[dict]:
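    """
    Keep system prompt(s) at head + the last `keep_turns * 2` non-system messages.
    A "turn" is approximated as one user+assistant exchange (2 messages), so
    keep_turns=15 retains the last 30 messages. Turns that include tool calls
    span more than 2 messages, so fewer than 15 full turns may survive.
    """
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    # Guard keep_turns <= 0: a slice of [-0:] would keep everything.
    keep_messages = non_system[-(keep_turns * 2):] if keep_turns > 0 else []
    # Drop leading tool results whose assistant tool call was evicted;
    # OpenAI-style chat APIs reject tool messages with no preceding tool call.
    while keep_messages and keep_messages[0]["role"] == "tool":
        keep_messages = keep_messages[1:]
    return system + keep_messages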

Strategy 2: summarisation checkpoint

When the context reaches the warning threshold, call the LLM to produce a one-paragraph summary of everything up to that point, then replace the evicted turns with the summary as a synthetic assistant message. The agent retains a compressed representation of earlier work. This is more expensive than sliding window (one extra LLM call per eviction) but preserves continuity for long-running agents.

def summarisation_evict(
    messages: list[dict],
    keep_recent: int = 10,
) -> list[dict]:
    """
    Summarise all but the last `keep_recent` non-system messages,
    then replace them with a single summary assistant message.
    """
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    to_summarise = non_system[:-keep_recent]
    to_keep = non_system[-keep_recent:]

    if not to_summarise:
        return messages  # nothing to evict

    summary_prompt = (
        "Summarise the following conversation history in 3-5 sentences, "
        "preserving all decisions made, tools called, and key findings:\n\n"
        + "\n".join(
            f"{m['role'].upper()}: {m.get('content') or ''}" for m in to_summarise
        )
    )
    summary_response = llm_client.chat(
        messages=[{"role": "user", "content": summary_prompt}],
        max_tokens=400,
    )
    summary_msg = {
        "role": "assistant",
        "content": f"[Summary of earlier context]: {summary_response.content}",
    }
    return system + [summary_msg] + to_keep

Strategy 3: selective eviction

Keep the system prompt, the last N turns, and a single copy of each distinct tool result. Duplicate tool results, the signature of a retrieval loop, are the most expensive content to carry because they are large and provide no marginal information. Deduplicating them before each LLM call is the most efficient eviction strategy when the leak comes from repeated identical results, as in Pattern 3 above, or in Pattern 2 or Pattern 4 when the same tool or sub-agent output is injected more than once.

import hashlib, json

def selective_evict(
    messages: list[dict],
    keep_turns: int = 20,
) -> list[dict]:
    """
    Sliding window + deduplication of tool result messages.
    Identical tool content (same hash) is reduced to a single occurrence.
    """
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    windowed = non_system[-(keep_turns * 2):]

    seen_tool_hashes: set[str] = set()
    deduped: list[dict] = []
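    for msg in windowed:
        if msg["role"] == "tool":
            # md5 serves only as a fast content fingerprint, not for security.
            content_hash = hashlib.md5(
                json.dumps(msg.get("content", ""), sort_keys=True).encode()
            ).hexdigest()
            if content_hash in seen_tool_hashes:
                # Keep the message so its tool_call_id still pairs with the
                # assistant's tool call, but replace the duplicate payload.
                msg = {**msg, "content": "[duplicate of an earlier tool result]"}
            seen_tool_hashes.add(content_hash)
        deduped.append(msg)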

    return system + deduped

For further guidance on bounding context costs across your entire agent fleet, see autonomous agent cost control best practices and how to set max cost per LLM request.
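The three strategies above trim by message count, which only approximates the quantity that actually matters. A complementary approach is to evict until a token estimate fits a target budget. The sketch below assumes OpenAI-style message dicts; `evict_to_budget` is a hypothetical helper, and `count_tokens` can be any estimator, such as the `estimate_tokens` function defined earlier on this page.

```python
from typing import Callable

def evict_to_budget(
    messages: list[dict],
    target_tokens: int,
    count_tokens: Callable[[list[dict]], int],
) -> list[dict]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Re-counting on every iteration is quadratic in the worst case, which is
    # acceptable for histories of a few hundred messages.
    while rest and count_tokens(system + rest) > target_tokens:
        rest.pop(0)
        # Never leave a tool result without the assistant call that requested it.
        while rest and rest[0]["role"] == "tool":
            rest.pop(0)
    return system + rest
```

Setting `target_tokens` well below the guard threshold (for example, half of it) avoids evicting again on the very next call.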

TypeScript: memory leak detection with `@runguard/sdk`

The same patterns apply in TypeScript. Use gpt-tokenizer (a zero-dependency port of tiktoken) for pre-call token estimation, and expose the token count to RunGuard’s guard() through the tokens callable on the object your wrapped function returns.

import { guard, ContextOverflowError, LoopDetectedError } from "@runguard/sdk";
import { encode } from "gpt-tokenizer";

interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  tool_call_id?: string;
}

function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, msg) => sum + 4 + encode(msg.content ?? "").length, 0);
}

async function myLLMCall(
  messages: Message[],
  tools: object[]
): Promise<{ response: object; usd: number; sig: string; tokens: () => number }> {
  const response = await llmClient.chat({ messages, tools });
  const toolSig =
    response.tool_calls?.[0]?.function?.name ?? "end_turn";
  const usd = computeCost(response.usage);
  return {
    response,
    usd,
    sig: toolSig,
    tokens: () => estimateTokens(messages),
  };
}

const runGuard = guard(myLLMCall, {
  context: { maxContextTokens: 128_000, headroom: 10_000 },
  loop:    { repeats: 4, maxCycleLen: 10 },
  budget:  { maxUsd: 5.00 },
});

async function runAgent(task: string): Promise<string> {
  const messages: Message[] = [
    { role: "system",  content: SYSTEM_PROMPT },
    { role: "user",    content: task },
  ];

  for (let step = 0; step < 200; step++) {
    try {
      const result = await runGuard(messages, AGENT_TOOLS);
      const { response } = result as { response: any };

      if (response.content && !response.tool_calls?.length) {
        const peak = runGuard.state().contextTokensPeak;
        console.log(`Run complete. Context peak: ${peak.toLocaleString()} tokens.`);
        return response.content;
      }

      messages.push({ role: "assistant", content: response.content ?? "",
                      ...response.tool_calls && { tool_calls: response.tool_calls } } as any);
      for (const tc of response.tool_calls ?? []) {
        const toolResult = await dispatchTool(tc);
        messages.push({ role: "tool", tool_call_id: tc.id, content: toolResult });
      }
    } catch (err) {
      if (err instanceof ContextOverflowError) {
        // Apply sliding-window eviction in TypeScript
        const system  = messages.filter(m => m.role === "system");
        const nonSys  = messages.filter(m => m.role !== "system");
        messages.splice(0, messages.length, ...system, ...nonSys.slice(-30));
        continue;
      }
      if (err instanceof LoopDetectedError) {
        return `Retrieval loop detected: ${err.pattern} repeated ${err.repeats}x.`;
      }
      throw err;
    }
  }
  return "Max steps reached.";
}

Install with npm install @runguard/sdk gpt-tokenizer. The tokens function is evaluated lazily by the guard just before the outbound call, so it always reflects the current message list including any tool results appended in the current step. For token-limit handling specific to TypeScript agent stacks, see LLM agent token limit exceeded TypeScript.

Manual memory management vs RunGuard ContextGuard

Capability	Manual tracking	RunGuard ContextGuard
Overflow prevention	Must call `check()` before every LLM call; easy to forget	Fires automatically on every `guard()` invocation — no per-call code
Token peak tracking	Must maintain a max variable manually across all code paths	`state().contextTokensPeak` — high-water mark updated automatically
Runtime alerts before 400	Provider returns 400 only after the request has been sent, so the failure surfaces mid-run	Raises `ContextOverflowError` before the call goes out, so no round trip is wasted
Eviction policy enforcement	Eviction must be wired manually in every exception handler	Catch `ContextOverflowError` in one place; eviction strategy is centralised
Retrieval loop detection	Requires separate instrumentation; not composable with context tracking	Combine `context` + `loop` options in a single guard — both fire from one wrapper
Cost projection	No built-in spend tracking alongside token tracking	Budget cap + token guard co-located; `state().spentUsd` available alongside peak tokens

Getting started

Install RunGuard in your existing agent with one command — no framework lock-in, no required refactor of your agent loop:

# Python
pip install runguard

// TypeScript / Node
npm install @runguard/sdk

Wrap your LLM call function with guard(), pass context={"max_context_tokens": 128_000, "headroom": 10_000}, catch ContextOverflowError, and apply your preferred eviction strategy. The guard handles the rest: token estimation before each call, peak tracking across the run, and clean exceptions you can recover from rather than opaque provider 400 errors that kill the agent. For a broader look at what can go wrong in production agents and how to guard against it, see AI agent retry storm prevention and context window truncation alerting.

Stop context overflow before your provider does

RunGuard catches memory leaks, retrieval loops, and cost amplification before they become incidents. Solo plan $19/mo · Team plan $79/mo · 14-day free trial, no credit card required.
