Letta (MemGPT) Cost Control: Loop Detection and Budget Enforcement in Production

Letta (formerly MemGPT) takes a different architectural stance than most agent frameworks. Rather than treating memory as ephemeral context that disappears when the context window fills, Letta gives every agent three persistent memory tiers: core memory (always in-context — persona and human sections the agent reads on every turn), archival memory (vector-indexed external store the agent searches with archival_memory_search), and recall memory (searchable conversation history via recall_memory_search). The agent itself manages all three through built-in memory tools, deciding when to write, search, replace, and evict.

This architecture enables genuinely long-running agents that remember context across thousands of turns. It also creates a new class of failure modes that don't exist in stateless frameworks. When a stateless agent loops, it repeats tool calls against external APIs. When a Letta agent loops, it can spiral through its own memory — calling search tools against memory stores it already queried, rewriting core memory sections it already updated, paginating through recall history it has already read. Each cycle costs LLM tokens on the agent turn, embedding tokens on the memory search, and sometimes multiple retrieved passages sent back into context. Letta's max_steps parameter — the primary built-in limit — counts steps without detecting whether those steps are making progress or iterating in place.

This post covers four Letta-specific failure modes and shows how to build a LettaBreaker circuit breaker that intercepts all of them.

Why max_steps is not a circuit breaker

Letta exposes a max_steps parameter on its message-sending calls (client.agents.messages.create() in the Python SDK used later in this post, and related calls) that limits the total number of agent steps before the server halts execution. A step is one LLM inference call: the agent produces a response that may include tool calls, and each such inference counts as one step. The default is generous (often 10–100 depending on the configuration), and teams frequently raise it to avoid premature termination on legitimate long-running tasks.

A circuit breaker detects a pattern (a sequence of behavior indicating the agent is spending without making progress) and halts before the pattern runs its full course. max_steps is a counter, not a pattern detector. It answers "has this agent taken too many steps?", not "is this agent making progress?" The distinction matters because:

An agent performing 30 distinct productive steps should not be halted. An agent repeating the same memory search 30 times should.
max_steps applies the same limit to both. A pattern-aware breaker can trip after 3 repeated searches and leave the budget intact for real work.
In multi-agent setups, each participating agent has its own max_steps. A message exchange between two agents each with max_steps=20 can generate 40 LLM calls while each individual agent's counter stays within bounds.
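The difference between a counter and a pattern detector can be shown on two toy traces. This is a minimal sketch: the function names and the trace format (one string per step) are illustrative assumptions, not part of Letta.

```python
def step_counter_halts(trace: list[str], max_steps: int) -> bool:
    """A pure step limit: halts once the trace reaches max_steps, whatever it contains."""
    return len(trace) >= max_steps


def repeat_detector_halts(trace: list[str], max_repeats: int = 3) -> bool:
    """A toy pattern check: halts when the same action occurs max_repeats times in a row."""
    streak = 1
    for prev, cur in zip(trace, trace[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak >= max_repeats:
            return True
    return False


productive = [f"tool_{i}" for i in range(30)]       # 30 distinct actions
looping = ['archival_memory_search("q")'] * 30      # the same search 30 times

# The counter treats both traces identically; the detector halts only the loop.
print(step_counter_halts(productive, 30), repeat_detector_halts(productive))  # True False
print(step_counter_halts(looping, 30), repeat_detector_halts(looping))        # True True
```

The detector would have stopped the looping trace at its third step, leaving the remaining 27 steps of budget unspent.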

The four failure modes below each consume steps that look, individually, like valid agent behavior. Only a pattern-level view reveals the loop.

Failure mode 1: Archival memory search spiral

Archival memory search spirals occur when an agent is looking for information that either doesn't exist in its memory store or exists under a slightly different representation than the query can reach. The pattern: the agent calls archival_memory_search(query="…"), receives results that don't satisfy its reasoning, then generates a slightly modified query — different wording, more specific, broader — and searches again. And again. And again.

This is the memory equivalent of a browser agent refreshing a page hoping for different content. Each call incurs embedding inference on the query string, retrieval of the top-K passages (sent back into context as additional tokens), and one LLM inference step to process the results and decide to search again. With a medium-sized embedding model and a few retrieved passages, a single search cycle might cost 500–1500 tokens. A 20-cycle spiral costs 10,000–30,000 tokens on what looks like "memory search" — a step category that seems cheap in isolation.
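The arithmetic above can be written out as a small helper. The token range comes from the paragraph; the price per million tokens and the number of spiraling runs per day are placeholder assumptions chosen only to show how the cost scales, not real pricing.

```python
def spiral_tokens(cycles: int, low_per_cycle: int, high_per_cycle: int) -> tuple[int, int]:
    """Token range consumed by a spiral of `cycles` search cycles."""
    return cycles * low_per_cycle, cycles * high_per_cycle


def tokens_to_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert tokens to dollars at a flat blended rate (placeholder pricing)."""
    return tokens * usd_per_million_tokens / 1_000_000


low, high = spiral_tokens(cycles=20, low_per_cycle=500, high_per_cycle=1500)
print(f"{low:,} to {high:,} tokens per spiral")  # 10,000 to 30,000 tokens per spiral

# Hypothetical rate and volume, purely illustrative.
RATE_USD_PER_MILLION = 3.00   # assumed blended token price
SPIRALS_PER_DAY = 200         # assumed number of runs that spiral
daily = tokens_to_usd(high, RATE_USD_PER_MILLION) * SPIRALS_PER_DAY
print(f"worst case: ${daily:.2f} per day")
```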

The failure mode is particularly common when:

The agent was asked about an event that happened before its memory was populated (the information genuinely isn't there).
The archival memory store uses embedding-based similarity and the agent's query vocabulary doesn't match how the stored passages were indexed.
The agent is following an instruction to "check memory for X before proceeding" and interprets each failed search as evidence it should try harder rather than give up.

Detection requires tracking the sequence of archival search queries and detecting when consecutive queries are near-identical. Exact string matching misses reworded queries; token-set Jaccard similarity above a threshold (0.65 is a good starting point) catches reorderings and small edits, and serves as a cheap lexical proxy for the agent covering the same semantic ground. It will not catch a paraphrase that shares few words with the previous query, so treat it as a first line of defense. After N consecutive near-identical queries, trip the breaker and return the best result seen so far.
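To get a feel for what a 0.65 threshold catches, here is a minimal sketch of token-set Jaccard on three made-up queries. The queries and helper names are illustrative assumptions; the tokenization (lowercase, punctuation stripped) is the same kind of normalization described in this post.

```python
import re


def token_set(text: str) -> set[str]:
    """Lowercase, replace punctuation with spaces, and split into a set of tokens."""
    return set(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    set_a, set_b = token_set(a), token_set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


q1 = "user previous employer before 2023"
q2 = "previous employer of user before 2023"   # reordered, one word added
q3 = "where did the user work earlier"         # same intent, different words

print(round(jaccard(q1, q2), 3))  # 0.833
print(round(jaccard(q1, q3), 3))  # 0.1
```

The reordered query clears the threshold comfortably, while the full rewording shares only one token and scores far below it. A high score is strong evidence of repetition; a low score does not prove the agent has moved on.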

Failure mode 2: Core memory contradiction rewrite loop

Core memory rewrite loops are unique to Letta's architecture. Core memory — the persona section and human section that appear in every agent system prompt — is fully agent-editable via core_memory_replace and core_memory_append. This enables agents to update their understanding of the user over time, which is the feature. The failure mode is the agent editing core memory, reading it back on the next turn, finding it incomplete or contradictory relative to new context, and rewriting it again.

A typical sequence: the agent receives new information about the user ("I changed jobs last month"). It appends this to the human section of core memory. On the next turn, the core memory section now contains both the old job and the new job. The agent determines this is contradictory and uses core_memory_replace to fix it. The replacement is slightly ambiguous. On the following turn, the agent notices the ambiguity and rewrites again. If the incoming context keeps providing information that the agent judges as inconsistent with core memory, the rewrite loop can continue indefinitely.

The cost structure here is different from search spirals: each rewrite is cheap in tokens (core memory is compact), but the rewrites cause the agent to defer its actual task ("I need to sort out my understanding first"), meaning the useful work never progresses. In terms of max_steps consumption, each rewrite consumes a step, and the task that was supposed to be completed within the step budget gets crowded out.

Detection requires counting core_memory_replace calls per agent run and tripping when the count exceeds a threshold relative to total steps. An agent that rewrites core memory more than twice in a 10-step run is probably looping. The secondary signal is the target field: if the same core memory key (e.g., human.occupation) is being replaced on consecutive turns, the contradiction loop hypothesis is confirmed.

Failure mode 3: Recall memory pagination deadlock

Recall memory in Letta is searchable conversation history. The recall_memory_search tool accepts a query and an optional page parameter, returning a paginated slice of matching messages. The pagination deadlock occurs when an agent that cannot find what it's looking for in page 0 advances to page 1, then page 2, eventually reaches the end of history, and — because nothing in its reasoning tells it the search is exhausted — wraps back to page 0 and begins again.

Letta doesn't automatically signal "you've reached the end of history" in a way that guarantees the agent will interpret it as a stopping condition. If the tool returns an empty page, the agent's LLM reasoning might conclude "the results are empty, I should try a different query" — which means it starts the pagination cycle again with a new query, potentially looping through all pages under each of a sequence of reformulated queries. This compounds with the archival search spiral failure mode: both can fire in the same run, each consuming steps independently.

The cost of recall pagination is lower than archival search per call (conversation history is already indexed by Letta; there's no embedding inference on recall searches). But the pattern can run for many more cycles because the "end of history" signal is weak and the set of possible query reformulations is effectively unbounded. An agent that runs 50 recall searches over 5 query variations × 10 pages each will consume 50 steps on zero-value work.

Detection requires tracking the (query_normalized, page) pairs seen in the current run. When the same pair appears for the second time, meaning the agent has looped back to a query-page combination it already read, the breaker trips. The normalization step is important: lowercase + strip punctuation + deduplicate and sort tokens, so that "What did we discuss about pricing?" and "what did we discuss, about pricing" hash to the same signature. A reworded query such as "pricing discussion?" still produces a different signature; catching those requires a similarity measure like the Jaccard check used for archival searches.
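The pair-tracking idea can be sketched in a few lines. The call log below is made up, and first_repeated_call is a hypothetical helper that only illustrates the check; it is not a Letta API.

```python
import re
from typing import Optional


def normalize(query: str) -> str:
    """Lowercase, strip punctuation, deduplicate tokens, and sort them."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", query.lower()).split()
    return " ".join(sorted(set(tokens)))


def first_repeated_call(calls: list[tuple[str, int]]) -> Optional[int]:
    """Return the index of the first (query, page) call already seen, or None."""
    seen = set()
    for index, (query, page) in enumerate(calls):
        key = (normalize(query), page)
        if key in seen:
            return index
        seen.add(key)
    return None


calls = [
    ("pricing discussion", 0),
    ("pricing discussion", 1),
    ("pricing discussion", 2),     # empty page: end of history
    ("Discussion, pricing?", 0),   # wraps back to page 0 with cosmetic changes
]
print(first_repeated_call(calls))  # 3
```

The fourth call differs from the first only in case, punctuation, and word order, so it maps to the same key and is flagged as a wrap-around.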

Failure mode 4: Multi-agent message ping-pong

Letta supports multi-agent setups where one agent can send messages to another via send_message. The receiving agent processes the message, potentially generates its own tool calls and memory operations, and sends a response back. This is architecturally useful for decomposing complex tasks across specialized agents. It is also the most expensive failure mode in Letta's design space.

The ping-pong failure mode occurs when two agents exchange messages without either reaching a conclusion that terminates the exchange. Agent A asks Agent B for clarification. Agent B's response is ambiguous or introduces new uncertainty. Agent A sends a follow-up question. Agent B responds with another ambiguous answer. The exchange continues until one agent's max_steps is exhausted. But because each send on Agent A's side triggers a full Agent B run (which may itself consume multiple steps on B's side), the total cost grows with Agent A sends × Agent B steps per reply, on top of Agent A's own steps.

With two agents each configured for max_steps=10, a ping-pong sequence of 5 message exchanges costs: 5 Agent A steps (the send_message calls) + 5 × Agent B runs, where each Agent B run might take 2–4 steps to formulate a response. Total: 15–25 steps consuming resources from both agents' quotas. With more message exchanges and more complex Agent B processing, the multiplication becomes significant.
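The step count above follows a simple formula, sketched here. The worst-case line assumes that Agent B's max_steps applies separately to each run triggered by a message, which is an assumption about the setup rather than something guaranteed for every configuration.

```python
def ping_pong_steps(exchanges: int, b_steps_per_reply: int) -> int:
    """Total LLM steps: one Agent A step per send plus Agent B's steps for each reply."""
    return exchanges * (1 + b_steps_per_reply)


# The example from the text: 5 exchanges, 2 to 4 Agent B steps per reply.
print(ping_pong_steps(5, 2), ping_pong_steps(5, 4))  # 15 25

# Assumed worst case with max_steps=10 on both sides: Agent A spends every
# step on a send, and each Agent B run uses its full 10 steps.
print(ping_pong_steps(10, 10))  # 110
```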

Detection at the sending-agent side requires tracking the message send count per run and the response pattern. If the agent sends more than N messages in a single run without receiving a task_complete or equivalent terminal signal in any response, the exchange has failed to converge. Additionally, if consecutive responses from Agent B are semantically near-identical (Jaccard similarity above threshold), the receiving agent is producing templated non-answers — the classic sign of a stuck LLM responding to ambiguous queries with plausible-but-useless hedges.

Building LettaBreaker

The breaker tracks all four failure modes through instrumentation of Letta's tool call stream. Letta surfaces agent actions via its streaming API: each message in the stream carries a message_type field, and tool call messages include the name and arguments of the tool being called. The breaker intercepts this stream, classifies each tool call, and trips before the next LLM inference if a pattern threshold is crossed.

from __future__ import annotations

import hashlib
import re
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class LettaBreakerConfig:
    # Archival search spiral (failure mode 1)
    archival_spiral_jaccard_threshold: float = 0.65
    archival_spiral_window: int = 3       # near-duplicate repeats in a row; trips on the 4th similar query

    # Core memory rewrite loop (failure mode 2)
    max_core_rewrites_per_run: int = 3
    max_same_key_rewrites: int = 2        # repeats of the same key in a row; trips on the 3rd consecutive rewrite

    # Recall pagination deadlock (failure mode 3)
    # trips on first repeated (query_norm, page) pair

    # Multi-agent ping-pong (failure mode 4)
    max_send_message_calls: int = 6
    ping_pong_jaccard_threshold: float = 0.70

    # Global step budget (independent of Letta max_steps)
    max_steps: int = 40

    # Recovery
    reset_timeout_seconds: float = 90.0


@dataclass
class LettaBreakerState:
    breaker_state: BreakerState = BreakerState.CLOSED
    trip_reason: Optional[str] = None
    opened_at: Optional[float] = None

    # Per-run counters
    step_count: int = 0

    # Archival search tracking
    archival_queries: list = field(default_factory=list)
    last_archival_query_norm: Optional[str] = None
    archival_near_dup_streak: int = 0

    # Core memory tracking
    core_rewrite_count: int = 0
    last_core_key: Optional[str] = None
    same_key_streak: int = 0

    # Recall pagination tracking
    recall_seen: set = field(default_factory=set)

    # Multi-agent tracking
    send_message_count: int = 0
    last_response_norm: Optional[str] = None
    response_near_dup_streak: int = 0

    def reset(self) -> None:
        self.breaker_state = BreakerState.CLOSED
        self.trip_reason = None
        self.opened_at = None
        self.step_count = 0
        self.archival_queries.clear()
        self.last_archival_query_norm = None
        self.archival_near_dup_streak = 0
        self.core_rewrite_count = 0
        self.last_core_key = None
        self.same_key_streak = 0
        self.recall_seen.clear()
        self.send_message_count = 0
        self.last_response_norm = None
        self.response_near_dup_streak = 0


def _normalize(text: str) -> str:
    """Lowercase, strip punctuation, deduplicate tokens, sort."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def _jaccard(a: str, b: str) -> float:
    set_a = set(a.split())
    set_b = set(b.split())
    if not set_a and not set_b:
        return 1.0
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union if union else 0.0


def _sig(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]


class LettaBreaker:
    """Circuit breaker for Letta (MemGPT) agents."""

    def __init__(self, config: Optional[LettaBreakerConfig] = None):
        self.config = config or LettaBreakerConfig()
        self._state = LettaBreakerState()

    # ── state machine ────────────────────────────────────────────────────────

    def _trip(self, reason: str) -> None:
        self._state.breaker_state = BreakerState.OPEN
        self._state.opened_at = time.monotonic()
        self._state.trip_reason = reason

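    def _check_open(self) -> None:
        if self._state.breaker_state == BreakerState.OPEN:
            elapsed = time.monotonic() - (self._state.opened_at or 0)
            if elapsed >= self.config.reset_timeout_seconds:
                # Clear the counters before the probe run; otherwise a
                # count-based trip would re-fire on the first recorded event.
                reason = self._state.trip_reason
                self._state.reset()
                self._state.trip_reason = reason
                self._state.breaker_state = BreakerState.HALF_OPEN
            else:
                raise RuntimeError(
                    f"LettaBreaker OPEN: {self._state.trip_reason} "
                    f"(resets in {self.config.reset_timeout_seconds - elapsed:.0f}s)"
                )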

    def on_probe_success(self) -> None:
        if self._state.breaker_state == BreakerState.HALF_OPEN:
            self._state.reset()

    def reset(self) -> None:
        self._state.reset()

    @property
    def is_open(self) -> bool:
        return self._state.breaker_state == BreakerState.OPEN

    @property
    def trip_reason(self) -> Optional[str]:
        return self._state.trip_reason

    # ── step counter ─────────────────────────────────────────────────────────

    def record_step(self) -> None:
        """Call once per agent LLM inference step."""
        self._check_open()
        self._state.step_count += 1
        if self._state.step_count >= self.config.max_steps:
            self._trip(
                f"step budget: {self._state.step_count} steps "
                f"(limit={self.config.max_steps})"
            )

    # ── failure mode 1: archival search spiral ───────────────────────────────

    def record_archival_search(self, query: str) -> None:
        """Call each time the agent calls archival_memory_search."""
        self._check_open()
        norm = _normalize(query)
        self._state.archival_queries.append(norm)

        if self._state.last_archival_query_norm is not None:
            similarity = _jaccard(norm, self._state.last_archival_query_norm)
            if similarity >= self.config.archival_spiral_jaccard_threshold:
                self._state.archival_near_dup_streak += 1
            else:
                self._state.archival_near_dup_streak = 0

            if self._state.archival_near_dup_streak >= self.config.archival_spiral_window:
                self._trip(
                    f"archival search spiral: {self._state.archival_near_dup_streak + 1} "
                    f"consecutive near-identical queries "
                    f"(jaccard >= {self.config.archival_spiral_jaccard_threshold})"
                )

        self._state.last_archival_query_norm = norm

    # ── failure mode 2: core memory rewrite loop ─────────────────────────────

    def record_core_rewrite(self, key: str) -> None:
        """Call each time the agent calls core_memory_replace or core_memory_append."""
        self._check_open()
        self._state.core_rewrite_count += 1

        if self._state.last_core_key == key:
            self._state.same_key_streak += 1
        else:
            self._state.same_key_streak = 0
        self._state.last_core_key = key

        if self._state.core_rewrite_count >= self.config.max_core_rewrites_per_run:
            self._trip(
                f"core memory rewrite loop: {self._state.core_rewrite_count} rewrites "
                f"this run (limit={self.config.max_core_rewrites_per_run})"
            )
        elif self._state.same_key_streak >= self.config.max_same_key_rewrites:
            self._trip(
                f"core memory rewrite loop: key '{key}' rewritten "
                f"{self._state.same_key_streak + 1} consecutive turns"
            )

    # ── failure mode 3: recall pagination deadlock ────────────────────────────

    def record_recall_search(self, query: str, page: int = 0) -> None:
        """Call each time the agent calls recall_memory_search."""
        self._check_open()
        pair = (_sig(_normalize(query)), page)
        if pair in self._state.recall_seen:
            self._trip(
                f"recall pagination deadlock: (query_sig={pair[0]}, page={page}) "
                f"already seen this run — agent has looped back to a prior page"
            )
        self._state.recall_seen.add(pair)

    # ── failure mode 4: multi-agent ping-pong ─────────────────────────────────

    def record_send_message(self) -> None:
        """Call each time the agent calls send_message to another agent."""
        self._check_open()
        self._state.send_message_count += 1
        if self._state.send_message_count >= self.config.max_send_message_calls:
            self._trip(
                f"multi-agent ping-pong: {self._state.send_message_count} send_message "
                f"calls this run (limit={self.config.max_send_message_calls})"
            )

    def record_received_response(self, response_text: str) -> None:
        """Call with the text of a response received from another agent."""
        self._check_open()
        norm = _normalize(response_text[:400])  # first 400 chars is sufficient for pattern detection
        if self._state.last_response_norm is not None:
            similarity = _jaccard(norm, self._state.last_response_norm)
            if similarity >= self.config.ping_pong_jaccard_threshold:
                self._state.response_near_dup_streak += 1
            else:
                self._state.response_near_dup_streak = 0
            if self._state.response_near_dup_streak >= 2:
                self._trip(
                    f"multi-agent ping-pong: received {self._state.response_near_dup_streak + 1} "
                    f"near-identical responses (jaccard >= {self.config.ping_pong_jaccard_threshold})"
                )
        self._state.last_response_norm = norm

Wiring LettaBreaker into a Letta agent run

Letta's Python client exposes a streaming interface that yields typed message objects. Each object carries a message_type indicating whether it is a tool call, a tool response, an internal monologue, or an assistant message. Intercepting tool call objects by name gives you all the hooks LettaBreaker needs.

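import json

from letta_client import Letta


def _tool_call_parts(msg) -> tuple:
    """Return (tool_name, arguments_dict) from a tool call message.

    Handles both object-style and dict-style tool_call payloads, since the
    exact shape depends on the letta-client version.
    """
    call = getattr(msg, "tool_call", None)
    if call is None:
        return "", {}
    if isinstance(call, dict):
        name = call.get("name", "")
        raw = call.get("arguments", call.get("input", {}))
    else:
        name = getattr(call, "name", "")
        raw = getattr(call, "arguments", None) or {}
    if isinstance(raw, str):  # arguments often arrive as a JSON string
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            raw = {}
    return name or "", raw if isinstance(raw, dict) else {}


def run_with_breaker(
    client: Letta,
    agent_id: str,
    user_message: str,
    breaker: LettaBreaker,
) -> str:
    """Run a Letta agent turn with LettaBreaker protection."""
    breaker._check_open()

    collected_text = []
    awaiting_agent_reply = False  # True right after an inter-agent send

    # Use the streaming call so messages arrive as the agent produces them and
    # a tripped breaker raises mid-run instead of after the run has finished.
    # The method and attribute names depend on the letta-client version;
    # adjust to match the SDK you're using.
    stream = client.agents.messages.create_stream(
        agent_id=agent_id,
        messages=[{"role": "user", "content": user_message}],
    )

    for msg in stream:
        msg_type = getattr(msg, "message_type", None)

        if msg_type == "tool_call_message":
            tool_name, tool_input = _tool_call_parts(msg)
            awaiting_agent_reply = False

            breaker.record_step()

            if tool_name == "archival_memory_search":
                query = tool_input.get("query", "")
                breaker.record_archival_search(query)

            elif tool_name in ("core_memory_replace", "core_memory_append"):
                key = tool_input.get("name", tool_input.get("label", "unknown"))
                breaker.record_core_rewrite(key)

            elif tool_name == "recall_memory_search":
                query = tool_input.get("query", "")
                page = int(tool_input.get("page") or 0)
                breaker.record_recall_search(query, page)

            elif tool_name.startswith("send_message"):
                # Distinguish inter-agent sends from the user-facing send_message
                # by checking for a recipient field in tool_input. Tool and field
                # names vary by Letta version; these keys are assumptions.
                recipient_keys = ("recipient_agent_id", "other_agent_id", "agent_id")
                if any(k in tool_input for k in recipient_keys):
                    breaker.record_send_message()
                    awaiting_agent_reply = True

        elif msg_type == "tool_return_message":
            # Only the return value of an inter-agent send counts as a response;
            # returns from memory tools would pollute the ping-pong detector.
            tool_return = getattr(msg, "tool_return", "")
            if tool_return and awaiting_agent_reply:
                breaker.record_received_response(str(tool_return))
            awaiting_agent_reply = False

        elif msg_type == "assistant_message":
            text = getattr(msg, "content", "") or ""
            if isinstance(text, str):
                collected_text.append(text)

    result = " ".join(collected_text).strip()
    breaker.on_probe_success()
    return result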

The run_with_breaker wrapper is intentionally thin. It delegates all state management to LettaBreaker and does not modify the Letta client, the agent configuration, or the message flow. When a threshold is crossed the breaker moves to OPEN, and the next recorded event raises RuntimeError; the caller catches this and decides how to handle it: return a partial result, log the trip, restart with a different query, or escalate to a human.
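One way a caller might handle a trip is sketched below. run_or_fallback is a hypothetical helper, and matching on the "LettaBreaker OPEN" message prefix is a workaround for the breaker raising a plain RuntimeError; a dedicated exception subclass would be a cleaner design if you control the breaker code.

```python
import logging
from typing import Callable

logger = logging.getLogger("letta_breaker")


def run_or_fallback(run: Callable[[str], str], message: str, fallback: str) -> str:
    """Call run(message); on a breaker trip, log it and return the fallback text."""
    try:
        return run(message)
    except RuntimeError as exc:
        # Re-raise anything that is not a breaker trip so real errors stay visible.
        if not str(exc).startswith("LettaBreaker OPEN"):
            raise
        logger.warning("breaker tripped, returning fallback: %s", exc)
        return fallback
```

In practice `run` would be a closure over run_with_breaker with the client, agent ID, and breaker already bound, for example `lambda m: run_with_breaker(client, agent_id, m, breaker)`.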

Tuning the thresholds

The defaults in LettaBreakerConfig are conservative enough to let healthy agents run freely while tripping on clear loops. Three guidelines for production tuning:

Archival spiral threshold (0.65). If your agent legitimately makes several archival searches with overlapping vocabulary (e.g., a research agent exploring closely related topics), raise this to 0.80. If you're seeing false negatives where spirals escape detection, lower to 0.55. The Jaccard threshold should be calibrated against your actual query vocabulary, not set by intuition alone.
Core rewrite limit (3 per run). An agent that performs onboarding interviews and aggressively updates its understanding of the user might legitimately write 4–5 core memory updates in a single long session. In that case, raise the limit or scope it to consecutive same-key rewrites only. For agents with stable personas and users, 2 is sufficient.
Send message limit (6 per run). If your multi-agent setup uses Letta for legitimate peer review where several message exchanges are expected, tune this to the maximum exchanges a healthy flow requires plus two. A 4-exchange review flow should have a limit of 6, not 4 — you want headroom for normal variance without hitting the breaker on every production run.
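Calibrating the Jaccard threshold against real queries can be as simple as replaying logged healthy runs and measuring how often consecutive queries would count as near-duplicates. This is a minimal sketch: the logged queries are made up, and flag_rate is a hypothetical helper, not part of LettaBreaker.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity after lowercasing and stripping punctuation."""
    set_a = set(re.sub(r"[^a-z0-9\s]", " ", a.lower()).split())
    set_b = set(re.sub(r"[^a-z0-9\s]", " ", b.lower()).split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def flag_rate(query_runs: list[list[str]], threshold: float) -> float:
    """Fraction of consecutive query pairs whose similarity is at or above threshold."""
    pairs = [(a, b) for run in query_runs for a, b in zip(run, run[1:])]
    if not pairs:
        return 0.0
    flagged = sum(1 for a, b in pairs if jaccard(a, b) >= threshold)
    return flagged / len(pairs)


# Archival queries from runs you have judged healthy (illustrative data).
healthy_runs = [
    ["acme contract renewal date", "acme contract renewal terms",
     "acme support escalation contact"],
    ["user travel preferences", "user dietary restrictions"],
]

for threshold in (0.55, 0.65, 0.80):
    print(threshold, round(flag_rate(healthy_runs, threshold), 2))
```

On this sample, a threshold of 0.55 already counts one of three healthy pairs as a near-duplicate, while 0.65 and 0.80 count none. A threshold that flags a noticeable share of healthy pairs will produce false trips once the streak window is reached.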

Pre-flight: memory store health check

Many archival search spirals are caused not by agent behavior but by an archival memory store that cannot answer the agent's queries: it is empty, or it has grown too large or too noisy to produce useful results. Running a pre-flight check before starting a long agent task can prevent the spiral entirely. The check below covers the simplest case, a store that returns nothing for the task's query:

def preflight_memory_check(
    client: Letta,
    agent_id: str,
    test_query: str,
    min_useful_results: int = 1,
) -> bool:
    """
    Returns True if archival memory responds usefully to test_query.
    Returns False if the store is empty or returns zero results —
    indicating the agent will likely spiral trying to find information
    that isn't there.
    """
    result = client.agents.archival_memory.list(
        agent_id=agent_id,
        query=test_query,
        limit=3,
    )
    passages = getattr(result, "items", result) if result else []
    return len(passages) >= min_useful_results


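# Usage before starting a long task (for example, inside your request handler):
def answer(client: Letta, agent_id: str, user_message: str, breaker: LettaBreaker) -> str:
    if not preflight_memory_check(client, agent_id, user_message):
        # Short-circuit: tell the user memory doesn't have this information
        # rather than letting the agent spiral for 20 steps discovering the same thing.
        return "I don't have information about that in my memory yet."
    return run_with_breaker(client, agent_id, user_message, breaker)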

This is the same pre-flight pattern we use in our Python circuit breaker guide — check preconditions before entering the loop, not after you've already consumed budget discovering them. A zero-result memory store is a precondition failure, not a runtime event the agent should handle through repeated search.

Connecting to RunGuard

The LettaBreaker above is a standalone, zero-dependency implementation. If you want persistent trip state across restarts, multi-agent breaker sharing, and a dashboard showing which agents are tripping and why, that's what RunGuard's SDK provides. The integration is a guard() wrapper around your run_with_breaker function — the same API described in our cost engineering guide.

from runguard import guard, BudgetTracker

budget = BudgetTracker(max_cost_usd=2.00)

@guard(budget=budget, loop_detector=True)
def run_letta_task(agent_id: str, message: str) -> str:
    breaker = LettaBreaker()
    return run_with_breaker(client, agent_id, message, breaker)

The guard() wrapper adds the budget-enforcement layer that LettaBreaker doesn't track: total cost per invocation across all tool calls, with a hard cap before any single agent run can exceed your per-task budget. Combined with the four pattern-based trip conditions above, this covers both the "agent looping without spending much per step" and "agent spending a lot per step even without looping" failure modes.

FAQ

Does LettaBreaker work with the Letta server (self-hosted) and Letta Cloud?

Yes. The breaker intercepts tool call messages from Letta's streaming API, which is the same protocol for both the open-source server and Letta Cloud. The only difference is the client initialization (Letta(base_url=…) for self-hosted vs. Letta(token=…) for Cloud). The run_with_breaker wrapper works with either client.

What happens to the agent's state when the breaker trips mid-run?

Letta persists agent state on the server side — memory updates that were committed before the breaker tripped are permanent. The breaker halts future steps by raising RuntimeError, but it does not roll back memory writes. If an in-progress core memory rewrite was the last tool call before the trip, that write stands. For situations where you need transactional memory updates, add explicit rollback logic using the Letta client's memory management API before raising.

Can I use LettaBreaker per-agent in a multi-agent setup?

Yes, and that is the recommended approach. Each agent that orchestrates others should have its own LettaBreaker instance configured with appropriate max_send_message_calls for that agent's role. An orchestrator agent that fans out to 5 specialists might legitimately send 10+ messages; a specialist agent that receives tasks should rarely send more than 2–3 messages back. Match the thresholds to each agent's expected behavior, not a single global threshold.

The archival memory spiral detection trips during legitimate research tasks. What should I adjust?

Two options: raise archival_spiral_jaccard_threshold toward 0.85 (requires near-verbatim query repetition to trip) or increase archival_spiral_window to 5 (tolerates more consecutive near-identical queries before tripping). Also verify that your agent's queries are genuinely distinct — many "legitimate research" spirals turn out to be the agent paraphrasing the same question rather than exploring genuinely different facets of a topic. The spiral detection often surfaces real agent reasoning failures that the developer hadn't noticed because the output eventually looked correct.

How does this compare to Letta's built-in step limits?

Letta's max_steps is a hard upper bound that applies uniformly across all step types. LettaBreaker is a pattern detector that trips early on specific failure signatures. In a healthy run, LettaBreaker never fires — the agent completes its task well within max_steps. In a looping run, LettaBreaker trips after 3–6 repetitions rather than at the step limit, preserving most of your step budget for the retry. Use both: max_steps as the absolute backstop, LettaBreaker as the early-warning pattern detector.

Stop Letta agents from billing you for loops

RunGuard wraps any Letta agent run with persistent trip state, budget caps, and a dashboard that shows which agents are spiraling and why — without touching your Letta server configuration.
