LlamaIndex Agent Cost Control: Loop Detection and Budget Enforcement in Production

LlamaIndex ships with max_iterations on ReActAgent and FunctionCallingAgent. Teams set it to 10 or 15 and consider cost risk managed. Then a research agent targets a URL that returns empty content: the tool call succeeds (HTTP 200, empty body), the agent decides to retry with a slightly different query, the tool returns another empty body, and the agent reasons its way through 10 near-identical tool calls in 10 iterations before max_iterations finally fires. Or a multi-agent pipeline has an orchestrator dispatch to a specialist, the specialist's router tool decides the query belongs back with the orchestrator, and the two agents take turns delegating to each other. Each agent's max_iterations counter sees only its own half of every round trip, so neither counter climbs high enough to trip.

The problem is that max_iterations is a step counter, not a progress detector. It fires after N reasoning steps regardless of whether those steps moved the agent closer to a result. It counts turns, not value. A single iteration that calls five tools costs more than five iterations that call no tools — but max_iterations treats them identically. Budget control requires pattern detection: is the agent calling the same tool with the same failing arguments? Is the per-step token cost growing faster than expected? Is the reasoning loop generating extended chain-of-thought without ever choosing an action? max_iterations cannot answer any of these questions.

This post builds a production circuit breaker for LlamaIndex: ReAct reasoning cycle detection, tool call storm prevention, multi-agent back-delegation tracking, and chat history token inflation monitoring — all wired through LlamaIndex's native CallbackManager without modifying your agent definitions or tool implementations. At the end you'll see how RunGuard's runguard.install() wraps any AgentRunner with one call and handles all four failure modes automatically.

What you'll build: A circuit breaker that detects when ReAct reasoning steps cycle without committing to an action, catches the same tool called repeatedly with identical arguments, tracks delegation depth across multi-agent pipelines where agents can route back to their callers, and monitors per-turn prompt token growth across AgentRunner.chat() sessions — all via CallbackManager event hooks, no internal patching required.

Why LlamaIndex's step counter fails more expensively than it should

LlamaIndex's AgentRunner architecture separates planning (AgentWorker) from execution orchestration (AgentRunner). This is a clean design: the worker decides what to do, the runner manages the loop. max_iterations lives in the runner — it caps the number of _run_step() calls. The problem is that each step can contain wildly different amounts of LLM work:

A pure reasoning step asks the model to produce a THOUGHT and ACTION. If the model produces only a THOUGHT (no action chosen), many LlamaIndex agent implementations treat this as an incomplete step and retry the LLM call within the same iteration. The max_iterations counter increments once, but the LLM was called twice — once for the incomplete response and once for the retry.
A tool-calling step invokes an LLM, parses the function call, executes the tool, then invokes the LLM again to process the tool result. One iteration: two LLM calls plus one tool call. If the tool call raises an exception and the agent retries within the same step, that's three LLM calls in one iteration.
A streaming step accumulates tokens before parsing. If the model streams 2,000 tokens of reasoning before the first tool call, those 2,000 tokens become input tokens on the next turn and on every turn after it, a cost that compounds with every subsequent step.

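In the worst case, a 10-iteration cap on a ReAct agent that retries within each step and accumulates tool results in context could produce up to 30 LLM calls (three per iteration), each on a longer prompt than the last, so cumulative prompt cost grows roughly quadratically with the number of calls. A budget model that assumes one fixed-size LLM call per iteration at max_iterations=10 could be off by 5–10×.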
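To make the gap between step count and spend concrete, here is a back-of-the-envelope sketch. The function name and the figures (a 1,000-token base prompt that grows by 100 tokens per LLM call) are illustrative assumptions, not measurements from LlamaIndex.

```python
def cumulative_prompt_tokens(base: int, growth_per_call: int, calls: int) -> int:
    """Total prompt tokens across `calls` LLM calls when each prompt is
    `growth_per_call` tokens longer than the one before it."""
    return sum(base + growth_per_call * i for i in range(calls))


# Naive budget model: one LLM call per iteration, prompt size never grows.
naive = cumulative_prompt_tokens(base=1_000, growth_per_call=0, calls=10)

# Pessimistic case: three LLM calls per iteration, prompt grows on every call.
pessimistic = cumulative_prompt_tokens(base=1_000, growth_per_call=100, calls=30)

print(f"naive={naive} pessimistic={pessimistic} ratio={pessimistic / naive:.2f}")
```

Under these assumed numbers the pessimistic total is roughly 7× the naive estimate, even though both runs stop at the same max_iterations=10.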

The four failure modes LlamaIndex's built-in controls miss

1. ReAct reasoning cycle: thought loops with no committed action

LlamaIndex's ReActAgent (and ReActAgentWorker) follows the standard Thought / Action / Observation loop. On each step the model generates a Thought: block, then either an Action: block (tool call) or an Answer: block (final response). When the model produces only a Thought: block — no action, no answer — the agent interprets this as an incomplete step. Depending on the implementation version, it either retries the LLM call within the same step or marks the step as a partial response and continues to the next iteration.

The failure mode: the model is stuck in reasoning. It produces a Thought: that expresses uncertainty, can't decide which tool addresses the uncertainty, and produces another Thought: on the retry. Five consecutive thought-only steps with no action chosen is a clear signal the model has entered a reasoning loop. Every step in the loop costs at least one LLM call on a prompt that grows with each accumulated thought. max_iterations eventually fires, but by then the agent has spent 5–8× what a single tool call would cost — and produced no useful output.

Detection signal: consecutive steps where the AgentChatResponse or parsed step output contains no tool call and no final answer. Track via CBEventType.AGENT_STEP events in the CallbackManager. Three consecutive thought-only steps is a reasonable default trip threshold: a legitimate agent might need one or two rounds of reasoning before choosing a tool, but three rounds without a concrete action is far more likely to be a loop than planning.
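Stripped of any framework wiring, the streak logic is small. The sketch below is a standalone illustration; the class name ThoughtStreak and its observe() method are hypothetical and not part of LlamaIndex.

```python
from dataclasses import dataclass


@dataclass
class ThoughtStreak:
    """Counts consecutive steps that produced neither a tool call nor an answer."""

    limit: int = 3
    streak: int = 0

    def observe(self, made_tool_call: bool, gave_answer: bool) -> bool:
        """Record one step; return True once the streak reaches the limit."""
        if made_tool_call or gave_answer:
            # Any committed action ends the run of thought-only steps.
            self.streak = 0
            return False
        self.streak += 1
        return self.streak >= self.limit


detector = ThoughtStreak(limit=3)
steps = [(False, False), (True, False), (False, False), (False, False), (False, False)]
tripped_at = next(
    (i for i, step in enumerate(steps, start=1) if detector.observe(*step)), None
)
print(tripped_at)
```

The tool call at step 2 resets the counter, so the trip only happens after three uninterrupted thought-only steps.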

2. Tool call storm: identical tool invocations with the same failing arguments

LlamaIndex's tool system wraps Python callables as FunctionTool instances. The agent calls them via the standard function-calling or ReAct mechanism. When a tool raises an exception, most LlamaIndex agent workers catch it, format the error into an observation, and continue to the next reasoning step — giving the agent a chance to try a different approach. The failure mode: the model's next approach is identical to the approach that just failed. Same tool, same arguments, same exception.

This happens when the agent's only tool for a task consistently fails on a class of inputs — a web scraper tool that times out on a specific domain, a database query tool that returns zero results for a specific entity, a file reader that raises FileNotFoundError for a path the agent keeps generating. The agent sees an error observation, reasons that it should try again, and calls the same tool with the same arguments. The cycle continues until max_iterations fires. In a 15-iteration agent that retries a failing tool every other step, that's 7–8 calls to the same tool with the same arguments, plus 7–8 LLM calls to process the same error message.

Detection signal: the same (tool_name, args_hash) pair appearing in the tool call trace N or more times within a single chat() session. Track via CBEventType.FUNCTION_CALL events. Four identical (name, args) pairs in one session is a storm. Legitimate retry behavior might repeat once or twice with different arguments (trying alternative phrasings). Identical arguments across four calls with the same result is a structural failure, not a retry.
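One way to build the (tool_name, args_hash) key is sketched below. It assumes the arguments arrive as a JSON-serializable dict and canonicalizes them with sorted keys so that argument order does not produce different hashes. The helper names call_key and is_storm are hypothetical, and the choice of SHA-256 is an assumption; any stable hash would do.

```python
import hashlib
import json
from collections import Counter
from typing import Any, Dict


def call_key(tool_name: str, arguments: Dict[str, Any]) -> str:
    """Stable key for a (tool, arguments) pair, independent of dict key order."""
    canonical = json.dumps(arguments, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"{tool_name}::{digest}"


def is_storm(
    counts: Counter, tool_name: str, arguments: Dict[str, Any], max_repeats: int = 4
) -> bool:
    """Record one call; return True when this exact call has been seen
    `max_repeats` times in the session tracked by `counts`."""
    key = call_key(tool_name, arguments)
    counts[key] += 1
    return counts[key] >= max_repeats


session_counts: Counter = Counter()
args = {"path": "/src/utils/formatter.py"}
results = [is_storm(session_counts, "read_file", args) for _ in range(4)]
print(results)
```

Keeping the counter outside the helper makes the session boundary explicit: a new chat() session gets a fresh Counter.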

3. Multi-agent back-delegation: orchestrators and specialists routing to each other

LlamaIndex multi-agent patterns typically use one of two wiring approaches: tools-as-agents (the orchestrator has a tool whose implementation calls specialist_agent.chat()) or a shared router (a dispatcher function that takes an intent and routes to the appropriate agent). Both patterns have a back-delegation failure mode.

In the tools-as-agents pattern, the orchestrator calls the specialist via a tool. The specialist, faced with a query outside its scope, has access to an escalate_to_orchestrator tool that routes the query back. The orchestrator receives it, decides the specialist is the right handler, calls the specialist tool again. Neither agent's max_iterations counter sees more than 2–3 steps — the delegation round-trips are each a single tool call per agent. Together they form an unbounded loop.

In the shared-router pattern, the router is a global function that maps intent labels to agents. Agent A routes to Agent B. Agent B's classifier produces the same intent label Agent A had, so the router sends it back to Agent A. The loop runs until something external stops it — a global timeout, a process kill, or an OOM error as the accumulated context fills memory.

Detection signal: delegation depth — the count of nested agent.chat() calls on the current execution stack. Track with a contextvars.ContextVar that increments before each chat() call and decrements after. A depth of 2 is a legitimate orchestrator → specialist dispatch. A depth of 5 in a system with two agents is a cycle. No LlamaIndex built-in mechanism tracks this because each AgentRunner is independent — the framework has no global call-stack concept.

4. Chat history token inflation: per-turn prompt cost growing super-linearly

LlamaIndex's AgentRunner.chat() maintains a ChatMemoryBuffer that accumulates the conversation history. Every turn, the full history is prepended to the prompt, including all prior tool calls, tool results, reasoning steps, and agent responses. A chat session that starts at 1,000 input tokens per turn grows to 4,500 input tokens per turn by turn 8 if each turn adds 500 tokens of context (tool results, observations, reasoning).

The failure mode isn't that history grows; that's expected. The failure mode is super-linear growth caused by a loop: a tool that returns large payloads on every call, a reasoning pattern that generates extensive chain-of-thought, or a tool error message that includes the full stack trace on each failure. In these cases, each turn adds significantly more context than a typical turn, so the per-turn prompt climbs steeply and the cumulative cost of the session grows at least quadratically with the number of turns. By turn 15, each LLM call might cost 10× what it cost at turn 3. max_iterations=15 allows this to run to completion, by which point the entire conversation history has become a cost multiplier.

Detection signal: the ratio of (current turn's prompt token count) to (average prompt token count across the first half of the session) exceeds a configurable threshold. This is a drift signal: a session that started with 2,000-token prompts and is now averaging 6,000-token prompts has a 3× drift ratio — something is inflating the history faster than expected. Track via CBEventType.LLM events, which carry prompt and completion token counts in the EventPayload.
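The drift ratio described above reduces to a few lines. This is a minimal standalone sketch; the function name drift_ratio and the min_samples parameter are illustrative assumptions.

```python
from typing import List, Optional


def drift_ratio(prompt_tokens: List[int], min_samples: int = 4) -> Optional[float]:
    """Latest prompt size divided by the average of the first half of the session.

    Returns None while there are too few samples to form a baseline.
    """
    if len(prompt_tokens) < min_samples:
        return None
    midpoint = len(prompt_tokens) // 2
    early_avg = sum(prompt_tokens[:midpoint]) / midpoint
    if early_avg == 0:
        return None
    return prompt_tokens[-1] / early_avg


# A session that started at 2,000-token prompts and now sends 6,000-token prompts.
print(drift_ratio([2_000, 2_000, 6_000, 6_000]))
```

Using the first half as the baseline means the baseline itself drifts upward as the session lengthens, which makes the check more tolerant of slow, steady growth than a comparison against turn 1 alone.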

Building the circuit breaker

LlamaIndex's CallbackManager is the right instrumentation layer. It fires events for every LLM call, function call, and agent step — before and after. You can attach a custom BaseCallbackHandler subclass to any agent without modifying the agent's code. The breaker attaches at construction time and intercepts events:

from __future__ import annotations
import contextvars
import hashlib
import time
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional

from llama_index.core.callbacks import CBEventType, EventPayload
# BaseCallbackHandler is defined in the base_handler module; importing it from
# there avoids relying on a re-export from the package root.
from llama_index.core.callbacks.base_handler import BaseCallbackHandler


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class LlamaIndexBreaker(BaseCallbackHandler):
    """Circuit breaker as a LlamaIndex CallbackHandler."""

    budget_usd: float = 5.0
    max_thought_only_steps: int = 3
    max_tool_repeats: int = 4
    max_delegation_depth: int = 4
    cost_drift_ratio: float = 2.5
    cooldown_seconds: float = 60.0
    cost_per_1k_input_tokens: float = 0.002
    cost_per_1k_output_tokens: float = 0.006

    # State
    state: BreakerState = field(default=BreakerState.CLOSED, init=False)
    total_cost_usd: float = field(default=0.0, init=False)
    trips: int = field(default=0, init=False)
    _opened_at: Optional[float] = field(default=None, init=False)

    # Per-session tracking
    _thought_only_streak: int = field(default=0, init=False)
    _tool_call_counts: Dict[str, int] = field(
        default_factory=lambda: defaultdict(int), init=False
    )
    _per_turn_input_tokens: List[int] = field(default_factory=list, init=False)
    _active_event_ids: Dict[str, Any] = field(default_factory=dict, init=False)

    # Delegation depth via contextvar (coroutine-safe)
    _depth_var: contextvars.ContextVar[int] = field(
        default_factory=lambda: contextvars.ContextVar(
            "_llama_breaker_depth", default=0
        ),
        init=False,
    )

    def __post_init__(self) -> None:
        super().__init__(
            event_starts_to_ignore=[],
            event_ends_to_ignore=[],
        )

    # --- BaseCallbackHandler interface ---

    def on_event_start(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> str:
        self._gate()
        if event_type == CBEventType.FUNCTION_CALL:
            self._handle_function_call_start(payload or {}, event_id)
        elif event_type == CBEventType.LLM:
            self._handle_llm_start(payload or {}, event_id)
        return event_id

    def on_event_end(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> None:
        if event_type == CBEventType.FUNCTION_CALL:
            self._handle_function_call_end(payload or {}, event_id)
        elif event_type == CBEventType.LLM:
            self._handle_llm_end(payload or {}, event_id)
        elif event_type == CBEventType.AGENT_STEP:
            self._handle_agent_step_end(payload or {}, event_id)

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        pass

    def end_trace(
        self,
        trace_id: Optional[str] = None,
        trace_map: Optional[Dict[str, List[str]]] = None,
    ) -> None:
        pass

    # --- Internal handlers ---

    def _handle_function_call_start(
        self, payload: Dict[str, Any], event_id: str
    ) -> None:
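        # Tool metadata arrives under EventPayload.TOOL; EventPayload defines no
        # FUNCTION_NAME member, so read the tool name from the metadata object.
        tool_meta = payload.get(EventPayload.TOOL)
        fn_name = getattr(tool_meta, "name", None) or "unknown_tool"
        fn_args = str(payload.get(EventPayload.FUNCTION_CALL, ""))
        call_key = f"{fn_name}::{hashlib.md5(fn_args.encode()).hexdigest()[:8]}"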
        self._tool_call_counts[call_key] += 1
        count = self._tool_call_counts[call_key]
        if count >= self.max_tool_repeats:
            self._trip(
                f"tool storm: '{fn_name}' called {count}x with identical arguments"
            )

    def _handle_function_call_end(
        self, payload: Dict[str, Any], event_id: str
    ) -> None:
        # Successful tool call resets the thought-only streak
        self._thought_only_streak = 0

    def _handle_llm_start(
        self, payload: Dict[str, Any], event_id: str
    ) -> None:
        self._active_event_ids[event_id] = time.monotonic()

    def _handle_llm_end(
        self, payload: Dict[str, Any], event_id: str
    ) -> None:
        response = payload.get(EventPayload.RESPONSE)
        if response is None:
            return
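        # `raw` is provider-specific: a dict for some integrations, an SDK response
        # object for others. Read usage fields without assuming either shape.
        def _read(obj: Any, key: str) -> Any:
            if isinstance(obj, dict):
                return obj.get(key)
            return getattr(obj, key, None)

        usage = _read(getattr(response, "raw", None), "usage")
        input_tokens = _read(usage, "prompt_tokens") or 0
        output_tokens = _read(usage, "completion_tokens") or 0
        if input_tokens == 0:
            # Fallback: try additional_kwargs
            additional = getattr(response, "additional_kwargs", {}) or {}
            input_tokens = additional.get("prompt_tokens", 0)
            output_tokens = additional.get("completion_tokens", 0)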
        cost = (
            (input_tokens / 1000) * self.cost_per_1k_input_tokens
            + (output_tokens / 1000) * self.cost_per_1k_output_tokens
        )
        self.total_cost_usd += cost
        if input_tokens > 0:
            self._per_turn_input_tokens.append(input_tokens)
            self._check_cost_drift()
        self._active_event_ids.pop(event_id, None)

    def _handle_agent_step_end(
        self, payload: Dict[str, Any], event_id: str
    ) -> None:
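        # Assumption: per-step output is published under a STEP_OUTPUT payload key.
        # EventPayload may not define that member in your llama_index.core version,
        # so look it up defensively instead of raising AttributeError. When the
        # member is absent this detector is a no-op and needs another step signal.
        step_key = getattr(EventPayload, "STEP_OUTPUT", None)
        step_output = payload.get(step_key) if step_key is not None else None
        if step_output is None:
            return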
        # If the step produced no tool call and no final answer, it's a thought-only step
        output = getattr(step_output, "output", None)
        is_done = getattr(step_output, "is_done", False)
        sources = getattr(output, "sources", []) if output else []
        if not is_done and not sources:
            self._thought_only_streak += 1
            if self._thought_only_streak >= self.max_thought_only_steps:
                self._trip(
                    f"ReAct reasoning cycle: {self._thought_only_streak} consecutive "
                    f"thought-only steps with no tool call or final answer"
                )
        else:
            self._thought_only_streak = 0

    # --- Delegation depth tracking (used externally) ---

    def delegation_context(self):
        """Context manager for tracking multi-agent delegation depth."""
        return _DelegationContext(self)

    # --- Budget and state ---

    def _gate(self) -> None:
        if self.state == BreakerState.OPEN:
            elapsed = time.monotonic() - (self._opened_at or 0)
            remaining = self.cooldown_seconds - elapsed
            if remaining <= 0:
                self.state = BreakerState.HALF_OPEN
            else:
                raise RuntimeError(
                    f"[LlamaIndex breaker OPEN] Spent: ${self.total_cost_usd:.4f}. "
                    f"Cooldown remaining: {remaining:.0f}s"
                )
        if self.total_cost_usd >= self.budget_usd:
            self._trip(
                f"budget exhausted (${self.total_cost_usd:.4f} >= ${self.budget_usd})"
            )

    def _trip(self, reason: str) -> None:
        self.state = BreakerState.OPEN
        self._opened_at = time.monotonic()
        self.trips += 1
        raise RuntimeError(f"[LlamaIndex breaker TRIPPED] {reason}")

    def _check_cost_drift(self) -> None:
        n = len(self._per_turn_input_tokens)
        if n < 4:
            return
        midpoint = n // 2
        early_avg = sum(self._per_turn_input_tokens[:midpoint]) / midpoint
        if early_avg == 0:
            return
        current = self._per_turn_input_tokens[-1]
        ratio = current / early_avg
        if ratio >= self.cost_drift_ratio:
            self._trip(
                f"prompt token inflation: current turn {current} tokens "
                f"is {ratio:.1f}x early average {early_avg:.0f} tokens"
            )

    def reset(self) -> None:
        """Reset per-session state. Call between independent chat sessions."""
        if self.state == BreakerState.HALF_OPEN:
            self.state = BreakerState.CLOSED
        self._thought_only_streak = 0
        self._tool_call_counts.clear()
        self._per_turn_input_tokens.clear()
        self._active_event_ids.clear()


class _DelegationContext:
    def __init__(self, breaker: LlamaIndexBreaker) -> None:
        self._breaker = breaker

    def __enter__(self) -> "_DelegationContext":
        depth = self._breaker._depth_var.get(0) + 1
        if depth >= self._breaker.max_delegation_depth:
            # Trip before recording the new depth: __exit__ never runs when
            # __enter__ raises, so an increment made here would leak.
            self._breaker._trip(
                f"multi-agent back-delegation: depth {depth} "
                f">= max {self._breaker.max_delegation_depth}"
            )
        self._breaker._depth_var.set(depth)
        return self

    def __exit__(self, *_: Any) -> None:
        depth = self._breaker._depth_var.get(0)
        self._breaker._depth_var.set(max(0, depth - 1))

Wiring the breaker to your agents

Attach the breaker to an agent's CallbackManager at construction time. The handler intercepts all LLM and function call events automatically:

from llama_index.core import Settings
from llama_index.core.agent import ReActAgent
from llama_index.core.callbacks import CallbackManager
from llama_index.core.tools import FunctionTool


def search_web(query: str) -> str:
    """Search the web for information."""
    # ... implementation
    return "search results"


def read_file(path: str) -> str:
    """Read a file from disk."""
    # ... implementation
    return "file contents"


# Construct the breaker once per session
breaker = LlamaIndexBreaker(
    budget_usd=3.0,
    max_thought_only_steps=3,
    max_tool_repeats=4,
    max_delegation_depth=4,
    cost_drift_ratio=2.5,
    cooldown_seconds=60.0,
)

callback_manager = CallbackManager([breaker])

agent = ReActAgent.from_tools(
    tools=[
        FunctionTool.from_defaults(fn=search_web),
        FunctionTool.from_defaults(fn=read_file),
    ],
    max_iterations=15,
    callback_manager=callback_manager,
    verbose=False,
)

# Use the agent
try:
    response = agent.chat("Research the history of Byzantine fault tolerance")
    print(response)
except RuntimeError as exc:
    print(f"Breaker fired: {exc}")
    # Log, alert, return graceful degradation response

For multi-agent pipelines where one agent calls another via a tool, wrap the inner chat() call with the delegation context manager to track depth:

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool


# Shared breaker for the full pipeline
pipeline_breaker = LlamaIndexBreaker(
    budget_usd=8.0,
    max_delegation_depth=3,
)

specialist_agent = ReActAgent.from_tools(
    tools=[...],
    callback_manager=CallbackManager([pipeline_breaker]),
)

def dispatch_to_specialist(query: str) -> str:
    """Route a query to the specialist agent."""
    with pipeline_breaker.delegation_context():
        result = specialist_agent.chat(query)
    return str(result)


orchestrator_agent = ReActAgent.from_tools(
    tools=[FunctionTool.from_defaults(fn=dispatch_to_specialist)],
    callback_manager=CallbackManager([pipeline_breaker]),
    max_iterations=10,
)

The shared pipeline_breaker instance tracks delegation depth across both agents. When dispatch_to_specialist is called, the context manager increments the depth counter. If the specialist tries to call back into the orchestrator (which calls dispatch_to_specialist again), the depth counter hits max_delegation_depth and the breaker fires before the recursive call reaches the LLM.

Failure mode examples with breaker output

ReAct reasoning cycle: the web research agent that gets stuck

Consider a research agent that should search, then read a URL, then synthesize. The search tool returns results. The agent reasons about which URL to read. The reasoning step produces three alternative URLs and can't decide which to prioritize. The next step produces the same three alternatives with different hedging language. The third step expresses the same uncertainty again — no tool call chosen in any of three consecutive steps:

# Step 1: agent produces THOUGHT with no ACTION
# Observation: [no tool call]
# _thought_only_streak → 1

# Step 2: agent produces THOUGHT, considers options, defers ACTION
# Observation: [no tool call]
# _thought_only_streak → 2

# Step 3: agent produces THOUGHT, still no commitment
# Observation: [no tool call]
# _thought_only_streak → 3 → TRIP

# RuntimeError: [LlamaIndex breaker TRIPPED] ReAct reasoning cycle:
#   3 consecutive thought-only steps with no tool call or final answer

Without the breaker, this loop runs to max_iterations. At 15 iterations of thought-only reasoning with no tool calls, the agent generates 15 LLM calls with an ever-growing chain-of-thought prompt. The breaker fires at 3 thought-only steps — before any value-destroying accumulation occurs. The fix: reduce the tool set to one specific tool for this sub-task, or add an explicit "choose the first URL" instruction to the system prompt to force commitment.

Tool call storm: the file reader hitting a non-existent path

A code analysis agent has a read_file tool. The agent is given a task that references a module path that doesn't exist in the project. The agent infers a plausible path from the task description and calls read_file("/src/utils/formatter.py"). The tool raises FileNotFoundError. The agent reasons that the path might have a different casing, then calls read_file("/src/utils/formatter.py") again: identical path, same error. It attempts the same call twice more:

# Call 1: read_file("/src/utils/formatter.py") → FileNotFoundError
# _tool_call_counts["read_file::a3f2c1b9"] = 1

# Call 2: same path → FileNotFoundError
# _tool_call_counts["read_file::a3f2c1b9"] = 2

# Call 3: same path → FileNotFoundError
# _tool_call_counts["read_file::a3f2c1b9"] = 3

# Call 4: same path requested again
# _tool_call_counts["read_file::a3f2c1b9"] = 4 → TRIP (at event start, before the tool runs)

# RuntimeError: [LlamaIndex breaker TRIPPED] tool storm:
#   'read_file' called 4x with identical arguments

The fix: add a directory listing step before the file read, or give the tool access to glob-style path resolution so it can find the actual path. The breaker stops the storm before the agent exhausts its iteration budget on a structurally impossible task.
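A hedged sketch of the second option follows: a file-reading helper that resolves the requested path case-insensitively and, when nothing matches, returns a message listing real files instead of raising. The function names and the project-root convention are assumptions for illustration, not part of LlamaIndex.

```python
from pathlib import Path
from typing import List


def resolve_case_insensitive(root: Path, requested: str) -> List[Path]:
    """Files under `root` whose relative path equals `requested`, ignoring case."""
    wanted = requested.strip("/").lower()
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.relative_to(root).as_posix().lower() == wanted
    )


def read_file_resolved(root: Path, requested: str) -> str:
    """Read a file, or return a message the agent can act on if it is missing."""
    matches = resolve_case_insensitive(root, requested)
    if len(matches) == 1:
        return matches[0].read_text(encoding="utf-8")
    if not matches:
        available = sorted(
            p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
        )[:20]
        return f"No file matches {requested!r}. Files available: {available}"
    candidates = [m.relative_to(root).as_posix() for m in matches]
    return f"Ambiguous path {requested!r}; candidates: {candidates}"
```

Returning the list of available files gives the model new information on the very first failure, so its next call has a reason to differ from the last one.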

Multi-agent back-delegation: orchestrator and specialist in a routing loop

An orchestrator agent and a data-specialist agent share a route_query function. The orchestrator calls dispatch_to_specialist("explain the revenue trend"). The specialist receives the query, determines it requires financial modeling context it doesn't have, and calls an escalate tool that routes back to the orchestrator. The orchestrator receives the same query again and dispatches to the specialist again:

# Orchestrator calls dispatch_to_specialist
# delegation_context.__enter__: depth 0 → 1

# Specialist calls escalate (routes back to orchestrator)
# Orchestrator calls dispatch_to_specialist again
# delegation_context.__enter__: depth 1 → 2

# Specialist calls escalate again
# Orchestrator calls dispatch_to_specialist again
# delegation_context.__enter__: depth 2 → 3

# Specialist calls escalate again
# Orchestrator calls dispatch_to_specialist a fourth time
# delegation_context.__enter__: depth 3 → 4 >= max_delegation_depth=4 → TRIP

# RuntimeError: [LlamaIndex breaker TRIPPED] multi-agent back-delegation:
#   depth 4 >= max 4

The breaker fires on the fourth nested dispatch, before the specialist's LLM is called again. The fix: give the specialist a fallback response for queries it can't handle rather than an escalation tool that routes back to the caller. Or restructure the router so escalation always adds context rather than re-dispatching the raw query.

Chat history token inflation: tool results ballooning the context

A data pipeline agent's fetch_records tool returns full JSON payloads, sometimes 200 records at 500 bytes each. In turn 1, the prompt is 1,200 tokens. By turn 4, accumulated tool results have pushed it to 5,800 tokens, nearly 5× the turn-1 baseline. The detector needs at least four samples, so turn 4 is the first check: it compares the current prompt against the average of the first half of the session, (1,200 + 1,400) / 2 = 1,300 tokens. The drift ratio is 5,800 / 1,300 ≈ 4.5×, well above the 2.5 threshold, so the breaker trips:

# Turn 1: 1,200 input tokens → per_turn_input_tokens=[1200]
# Turn 2: 1,400 → [1200, 1400]
# Turn 3: 3,600 → [1200, 1400, 3600]
# Turn 4: 5,800 → [1200, 1400, 3600, 5800]
#   midpoint=2, early_avg=(1200+1400)/2=1300
#   current=5800, ratio=5800/1300=4.5 >= 2.5 → TRIP

# RuntimeError: [LlamaIndex breaker TRIPPED] prompt token inflation:
#   current turn 5800 tokens is 4.5x early average 1300 tokens

The fix: truncate or summarize tool results before appending to the conversation. Return record counts and a summary rather than full JSON payloads. Or use LlamaIndex's ChatMemoryBuffer with a token_limit to cap the retained history to a fixed window rather than the unbounded default.
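As a rough illustration of the first suggestion, the sketch below replaces a full record dump with a count, the field names, and a small sample. The function name summarize_records and its default limits are assumptions chosen for the example.

```python
import json
from typing import Any, Dict, List


def summarize_records(
    records: List[Dict[str, Any]], sample_size: int = 3, max_chars: int = 1_000
) -> str:
    """Compact, size-capped description of a record set for the agent's context."""
    summary = {
        "record_count": len(records),
        "fields": sorted({key for record in records for key in record}),
        "sample": records[:sample_size],
    }
    text = json.dumps(summary, default=str)
    if len(text) > max_chars:
        # Hard cap as a last resort; the result is no longer valid JSON.
        text = text[:max_chars] + "...[truncated]"
    return text


records = [{"id": i, "name": "x" * 50} for i in range(200)]
compact = summarize_records(records)
print(len(json.dumps(records)), len(compact))
```

A tool that returns this summary keeps the per-call contribution to chat history roughly constant regardless of how many records the query matched.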

Cost savings in practice

Failure mode	Configuration	Without breaker	With breaker	Savings
ReAct reasoning cycle	max_iterations=15, max_thought_only_steps=3	15 LLM calls, growing prompt	3 LLM calls, then trip	~80%
Tool call storm (same args)	max_tool_repeats=4	7–8 identical tool calls + 7–8 LLM calls	3 tool calls, trip on the 4th attempt	~50%
Multi-agent back-delegation	max_delegation_depth=4	Unbounded (external timeout or OOM)	Trip on the 4th delegation hop	>90%
Chat history token inflation	cost_drift_ratio=2.5	Quadratic cost growth to session end	Trip at 2.5× early average	~65%

Plugging in RunGuard

The breaker above handles all four failure modes correctly. In production you also need: persistent trip logs across restarts, Slack or PagerDuty alerts when the breaker fires, a dashboard showing trip rate by agent and failure mode, and HALF_OPEN probe management so recovery is automatic. RunGuard provides all of this as a one-line install on top of your existing LlamaIndex agent:

import runguard
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools(tools=[...], max_iterations=15)

# Wraps the agent's CallbackManager, adds all four breaker detectors,
# logs trips to the RunGuard dashboard, and sends alerts on your configured channels
runguard.install(agent, budget_usd=5.0)

response = agent.chat("your query")

runguard.install() detects that agent is a LlamaIndex AgentRunner, attaches the LlamaIndexBreaker to the agent's existing CallbackManager (preserving any existing handlers), and wraps agent.chat() and agent.achat() with delegation depth tracking. No changes to your tools, your agent definition, or your Settings.

If you have a multi-agent pipeline, pass the pipeline root:

import runguard

# Shared breaker for the whole pipeline
guard = runguard.install(orchestrator_agent, budget_usd=10.0, max_delegation_depth=3)

# Attach the same guard to specialist agents so they share state
runguard.attach(specialist_agent, guard=guard)

All agents in the pipeline share the budget counter, the delegation depth tracker, and the tool call storm detector. A budget exhausted by the specialist fires the breaker in the orchestrator. A delegation loop detected in the specialist fires before the orchestrator makes its next LLM call. The full LlamaIndex integration guide covers async agents, streaming responses, and custom ChatMemoryBuffer truncation strategies.

Frequently asked questions

Does this work with FunctionCallingAgent and the newer AgentRunner/AgentWorker architecture?

Yes. FunctionCallingAgent, ReActAgent, and any custom AgentRunner subclass all use the same CallbackManager event system. The breaker attaches to the CallbackManager and intercepts CBEventType.FUNCTION_CALL, CBEventType.LLM, and CBEventType.AGENT_STEP events regardless of which agent worker implementation fires them. The only difference: FunctionCallingAgent produces native function-call tool events rather than ReAct text-parsed events, so the AGENT_STEP handler needs to check sources (tool call results) rather than parsing ReAct step text. The implementation above uses sources — it's already compatible with both.

How does the thought-only step detector handle legitimate reasoning steps before the first tool call?

The trip threshold is consecutive thought-only steps — the streak resets to zero whenever any tool call completes (tracked in _handle_function_call_end). An agent that spends two steps reasoning before choosing its first tool call has a streak of 2 — under the default threshold of 3. An agent that alternates between one thought step and one tool call never accumulates a streak at all. The detector specifically targets runs of reasoning steps with no action committed between them, which is the structural signature of a stuck reasoning loop rather than legitimate planning behavior. In practice, agents that do legitimate multi-step planning almost always commit to a tool call within 2 steps of starting a reasoning sequence.

My tool legitimately needs to be called with the same arguments multiple times (e.g., polling a status endpoint). How do I prevent false positives?

Set max_tool_repeats higher for agents with polling tools, or exclude specific tool names from storm detection by subclassing LlamaIndexBreaker and overriding _handle_function_call_start to skip tools in an allowlist. For polling specifically, the better fix is to build the polling logic into the tool itself (loop internally, return when done or timeout) rather than having the agent call the tool repeatedly — that way the agent makes one call, the tool polls internally for up to N seconds, and the agent gets one result. This is architecturally cleaner and eliminates the need for a storm-detection exception for that tool.
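The internal-polling approach might look like the sketch below. The function name wait_for_status, the terminal state names, and the timing defaults are all assumptions; check stands in for whatever call queries your status endpoint.

```python
import time
from typing import Callable, FrozenSet


def wait_for_status(
    check: Callable[[], str],
    done_states: FrozenSet[str] = frozenset({"succeeded", "failed"}),
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
) -> str:
    """Poll `check` until it reports a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    status = check()
    while status not in done_states and time.monotonic() < deadline:
        time.sleep(interval_s)
        status = check()
    if status in done_states:
        return f"finished with status: {status}"
    # Tell the agent explicitly not to call again with the same arguments.
    return f"still '{status}' after {timeout_s:.0f}s; stop polling and report this"


# Simulated endpoint that completes on the third check.
states = iter(["pending", "running", "succeeded"])
print(wait_for_status(lambda: next(states), timeout_s=5.0, interval_s=0.0))
```

From the agent's point of view this is one tool call with one result, so the storm detector never sees repeated identical invocations.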

Does the token inflation detector work with LlamaIndex's streaming API?

Streaming complicates token counting because usage stats arrive at the end of the stream, not during. The CBEventType.LLM on_event_end fires after the stream completes, carrying the full token count in the response payload. The inflation detector therefore works correctly with streaming — it records the token count after each streaming call completes and checks the drift ratio. The only gap: you can't trip the breaker mid-stream based on a growing token count because you don't have the final count until the stream ends. For very long streaming responses (10,000+ tokens), this means you absorb the cost of one expensive streaming call before the drift detector has enough data. Mitigate this by setting a low max_tokens parameter on your LLM to cap individual response length, independent of the breaker.

How should I set the cost_drift_ratio for agents that intentionally get more context-heavy over a session?

A legitimate data-intensive session might start light (query planning, tool selection) and get heavier (retrieving and processing large payloads). In that pattern, prompt tokens should grow roughly linearly — each turn adds a fixed amount of context. A cost_drift_ratio of 3.0–4.0 accommodates linear growth without tripping on a session where context legitimately doubles from turn 1 to turn 8. The ratio is designed to catch super-linear growth — a session where the growth rate is accelerating, not just the absolute size. If your sessions routinely hit the drift detector but don't exhibit runaway behavior, audit which tool results are largest and consider summarizing or paginating them. The drift detector is most valuable when the absolute token count isn't obviously wrong but the growth trajectory is.
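To sanity-check a threshold before deploying it, you can replay the first-half-average rule against a synthetic session. The sketch below uses two made-up sessions (one adding 500 tokens per turn, one doubling every turn); first_trip_turn is a hypothetical helper that mirrors the drift check rather than calling the breaker itself.

```python
from typing import List, Optional


def first_trip_turn(prompt_tokens: List[int], threshold: float) -> Optional[int]:
    """1-based turn at which the drift check would first trip, or None if never."""
    for n in range(4, len(prompt_tokens) + 1):  # the check needs four samples
        window = prompt_tokens[:n]
        midpoint = n // 2
        early_avg = sum(window[:midpoint]) / midpoint
        if early_avg and window[-1] / early_avg >= threshold:
            return n
    return None


linear = [1_000 + 500 * i for i in range(8)]   # steady growth
doubling = [1_000 * 2**i for i in range(8)]    # accelerating growth

for threshold in (2.5, 3.0, 4.0):
    print(threshold, first_trip_turn(linear, threshold), first_trip_turn(doubling, threshold))
```

Under these assumptions the linear session trips the 2.5 default at turn 7 but never reaches 3.0 within eight turns, while the doubling session trips at turn 4 for all three thresholds. That is consistent with raising the ratio to 3.0 or above for sessions that are expected to grow steadily.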

Stop runaway LlamaIndex agents before the bill lands

RunGuard attaches to your agent's CallbackManager with one runguard.install() call. ReAct reasoning cycle detection, tool call storm prevention, multi-agent delegation depth tracking, and prompt token inflation monitoring — none of it requires changing your tools, your agent definitions, or your LlamaIndex settings.

See pricing →