LlamaIndex Cost Control: Sub-Question Fan-Out, ReAct Tool Loops, and Workflow Cycles

LlamaIndex is one of the most widely deployed RAG and agent frameworks in production. Its core design philosophy — composable query engines, pluggable retrievers, and declarative agent workflows — gives developers powerful primitives for building complex information retrieval and reasoning pipelines. The same composability that makes LlamaIndex expressive also makes it easy to accidentally build a cost amplifier. Every composition layer is a potential multiplier: a query engine that calls three sub-engines, a retriever that retries on low relevance, an agent that routes to five per-document agents, each paying its own synthesis call.

Unlike single-shot LLM call failures — where a bad prompt wastes one call — LlamaIndex's layered architecture means a misconfigured component can turn one user query into 30 or 40 billable LLM calls before returning a response. At GPT-4o or Claude Sonnet rates, a well-tuned pipeline costs $0.01–$0.05 per query. A pipeline caught in a fan-out or retry loop costs $0.40–$2.00 for that same query. Multiply by a few hundred concurrent users and the difference between a well-guarded pipeline and an unguarded one is thousands of dollars per day.

Four structural failure modes account for the majority of unexpected costs in LlamaIndex deployments:

  • Sub-question decomposition fan-out: SubQuestionQueryEngine decomposes a top-level query into N sub-questions and routes each to a separate sub-engine; ambiguous or broad queries routinely produce 8–15 sub-questions, each paying a full retrieval and synthesis call, plus a final aggregation call over all sub-answers.
  • ReAct agent tool repeat loop — LlamaIndex's ReActAgent operates within a max_iterations ceiling, but tool invocation errors and low-confidence observations can cause the agent to re-invoke the same tool with cosmetically different arguments across consecutive steps without advancing toward a final answer.
  • RetryQueryEngine reformulation storm — the RetryQueryEngine and RetryWithErrorFeedback evaluator retry any query that scores below a relevance threshold; when the underlying index is sparse or the query is out-of-distribution, every reformulation pays a new embedding call, a new retrieval pass, and a new synthesis call before failing again.
  • Multi-document agent routing amplification — LlamaIndex's multi-document agent pattern builds a per-document ReActAgent for each document in the corpus and routes queries via an ObjectIndex; when the router judges multiple documents as potentially relevant, it dispatches to each independently, paying a full agent invocation per selected document.

LlamaIndex's cost model

LlamaIndex is model-agnostic: it routes calls to whichever LLM and embedding provider you configure. The cost structure at each layer is additive:

  • Retrieval: one embedding call per query (input tokens only; roughly $0.000002–$0.00001 for a 100-token query with text-embedding-3-small or similar).
  • Synthesis: one LLM call combining the query with the retrieved context chunks; cost is dominated by context size. A typical RAG response with 5 retrieved chunks at 300 tokens each costs roughly $0.01–$0.04 at Claude Sonnet 4.6 rates.
  • Re-ranking: a cross-encoder re-ranking pass over the top-K candidates; adds another embedding or LLM call per query.
  • Agent iteration: each ReAct step pays one LLM call to generate the Thought and Action; the tool's Observation is appended to the prompt for the next step, so input tokens grow with every iteration. A 10-step agent pays at least 10 LLM calls, each over a longer context than the one before.

The total per-query cost is the sum across every layer activated. A SubQuestionQueryEngine with 10 sub-questions, each hitting a RetryQueryEngine that retries twice, pays 30 synthesis calls plus 30 embedding calls plus one aggregation synthesis call for a single user query. At $0.015 per synthesis call, that is $0.46 before accounting for context size. The same query through a single-engine pipeline costs $0.015.
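
The arithmetic above can be reproduced with a small estimator. This is a sketch under stated assumptions: the flat $0.015 per synthesis call is the figure used in this article, the per-embedding cost is an illustrative placeholder, and the function name is hypothetical rather than part of LlamaIndex.

```python
def estimate_query_cost(
    sub_questions: int,
    retries_per_sub_question: int,
    synthesis_cost_usd: float = 0.015,    # flat per-call figure used in this article
    embedding_cost_usd: float = 0.00001,  # illustrative assumption; negligible either way
) -> dict:
    """Estimate call counts and cost for a sub-question pipeline with retries."""
    # Every sub-question runs once, then once more per retry.
    attempts = sub_questions * (1 + retries_per_sub_question)
    synthesis_calls = attempts + 1  # +1 for the final aggregation call
    embedding_calls = attempts      # one query embedding per attempt
    cost_usd = (
        synthesis_calls * synthesis_cost_usd
        + embedding_calls * embedding_cost_usd
    )
    return {
        "synthesis_calls": synthesis_calls,
        "embedding_calls": embedding_calls,
        "cost_usd": cost_usd,
    }


if __name__ == "__main__":
    fan_out = estimate_query_cost(sub_questions=10, retries_per_sub_question=2)
    print(fan_out["synthesis_calls"], fan_out["embedding_calls"])
    print(f"estimated cost: ${fan_out['cost_usd']:.4f}")
```

With 10 sub-questions and two retries each, the estimator counts 31 synthesis calls and 30 embedding calls. Because the flat per-call figure ignores context size, the result is a floor rather than a forecast.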

Failure mode 1: sub-question decomposition fan-out

SubQuestionQueryEngine is one of LlamaIndex's most powerful high-level abstractions: it takes a complex query, asks an LLM to decompose it into a list of simpler sub-questions that can each be answered by a specific sub-engine, answers each sub-question independently using the appropriate engine, and synthesizes a final answer from all sub-answers. The design assumption is that the sub-question count is bounded and proportional to the query's genuine complexity.

In practice, the LLM generating sub-questions has no awareness of your cost structure and will generate as many sub-questions as seem useful. A query like "compare the approaches to agent memory management across LangChain, LlamaIndex, and AutoGen, including their persistence options and context window handling" can produce 12–18 sub-questions: one per framework per dimension. A broader question about an entire software category or an open-ended research topic produces even more. Each sub-question is dispatched to a separate engine invocation — embedding + retrieval + synthesis — and the responses are then aggregated in a final synthesis call that itself receives all sub-answers as context, making the aggregation call the most expensive single call in the chain.

The guard intercepts the decomposition step before sub-questions are dispatched. It counts the proposed sub-questions and either caps them or raises before any engine invocations fire.

Python — sub-question fan-out guard for LlamaIndex SubQuestionQueryEngine
import anthropic
from runguard import BudgetTracker, BudgetExceededError, LoopDetector

class SubQuestionFanOutGuard:
    """
    Wraps the sub-question generation step of SubQuestionQueryEngine.
    Caps the number of sub-questions dispatched and detects query patterns
    that repeatedly produce oversized fan-outs (repeated broad queries).
    """
    def __init__(
        self,
        max_sub_questions: int = 6,
        session_budget_usd: float = 1.0,
        warn_threshold: int = 4,
    ):
        self.max_sub_questions = max_sub_questions
        self.budget = BudgetTracker(cap=session_budget_usd)
        self.warn_threshold = warn_threshold
        # LoopDetector tracks repeated oversized fan-out queries
        self.loop_detector = LoopDetector(repeats=3, max_cycle_len=3)
        self._client = anthropic.Anthropic()

    def decompose_query(self, query: str, tool_descriptions: list[str]) -> list[dict]:
        """
        Generates sub-questions for a query against a list of sub-engine tools.
        Returns a list of {"sub_question": str, "tool_name": str} dicts, capped
        at max_sub_questions. Raises if session budget is exhausted or if the
        query pattern is stuck in a fan-out loop.
        """
        tools_text = "\n".join(
            f"- {desc}" for desc in tool_descriptions
        )

        # Check LoopDetector: if this query signature keeps producing oversized
        # fan-outs, the caller is stuck sending the same broad query repeatedly
        query_sig = query.lower().strip()[:60]
        loop_state = self.loop_detector.step(query_sig)
        if loop_state.tripped:
            raise RuntimeError(
                f"Sub-question fan-out loop detected: query pattern '{query_sig}' "
                f"has triggered oversized decomposition {loop_state.repeats} times. "
                "Narrow the query or cap sub-engines before retrying."
            )

        # Generate sub-questions via the LLM
        decomp_response = self._client.messages.create(
            model="claude-haiku-4-5-20251001",  # Haiku is fast + cheap for decomposition
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": (
                    f"You are decomposing a complex query for a multi-source retrieval system.\n\n"
                    f"Available information sources:\n{tools_text}\n\n"
                    f"Query: {query}\n\n"
                    f"Decompose this query into at most {self.max_sub_questions} specific "
                    f"sub-questions, each answerable by exactly one of the listed sources. "
                    f"Only create a sub-question if it is genuinely necessary to answer "
                    f"the original query. Prefer fewer, more focused sub-questions over "
                    f"exhaustive coverage.\n\n"
                    f"Return one sub-question per line in the format:\n"
                    f"TOOL: <tool_name> | QUESTION: <sub_question>"
                ),
            }],
        )

        decomp_cost = (
            decomp_response.usage.input_tokens * 0.000001 +
            decomp_response.usage.output_tokens * 0.000005
        )
        try:
            self.budget.add(decomp_cost)
        except BudgetExceededError as e:
            raise RuntimeError(
                f"Session budget exhausted during query decomposition. "
                f"Spent: ${self.budget.spent:.4f} / ${self.budget.cap:.2f}"
            ) from e

        sub_questions = []
        for line in decomp_response.content[0].text.strip().splitlines():
            line = line.strip()
            if "TOOL:" not in line or "QUESTION:" not in line:
                continue
            parts = line.split("|", 1)  # split once so a "|" inside the question is kept
            if len(parts) != 2:
                continue
            tool_part = parts[0].replace("TOOL:", "").strip()
            question_part = parts[1].replace("QUESTION:", "").strip()
            if tool_part and question_part:
                sub_questions.append({
                    "sub_question": question_part,
                    "tool_name": tool_part,
                })

        # Hard cap: never dispatch more than max_sub_questions
        capped = sub_questions[:self.max_sub_questions]

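        if len(capped) >= self.warn_threshold:
            # Record an extra "high fan-out" signal for loop detection.
            # step() already ran once at the top of this method, so a high
            # fan-out query advances the detector twice per request.
            self.loop_detector.step(query_sig)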

        return capped

    def record_synthesis_cost(self, cost_usd: float):
        try:
            self.budget.add(cost_usd)
        except BudgetExceededError as e:
            raise RuntimeError(
                "Session budget exhausted during sub-question synthesis."
            ) from e


def run_guarded_subquestion_query(
    query: str,
    sub_engines: dict,  # {tool_name: callable(question) -> str}
    model: str = "claude-sonnet-4-6",
    max_sub_questions: int = 6,
    session_budget_usd: float = 1.0,
) -> str:
    """
    Executes a SubQuestionQueryEngine-style pipeline with fan-out guard.
    sub_engines: dict mapping tool name → callable that takes a sub-question
    and returns a string answer.
    """
    client = anthropic.Anthropic()
    guard = SubQuestionFanOutGuard(
        max_sub_questions=max_sub_questions,
        session_budget_usd=session_budget_usd,
    )

    tool_descriptions = [
        f"{name}: answers questions about {name.replace('_', ' ')}"
        for name in sub_engines
    ]

    # Decompose — guarded
    sub_questions = guard.decompose_query(query, tool_descriptions)
    if not sub_questions:
        # Fallback: answer directly without decomposition
        fallback = client.messages.create(
            model=model,
            max_tokens=512,
            messages=[{"role": "user", "content": query}],
        )
        return fallback.content[0].text.strip()

    # Execute each sub-question against its designated engine
    sub_answers = []
    for item in sub_questions:
        engine_fn = sub_engines.get(item["tool_name"])
        if engine_fn is None:
            sub_answers.append(f"[{item['tool_name']}] No engine found.")
            continue
        answer = engine_fn(item["sub_question"])
        sub_answers.append(f"[{item['tool_name']}] {item['sub_question']}\nAnswer: {answer}")

    # Aggregate synthesis — the most expensive call
    combined_answers = "\n\n".join(sub_answers)
    agg_response = client.messages.create(
        model=model,
        max_tokens=768,
        messages=[{
            "role": "user",
            "content": (
                f"Original query: {query}\n\n"
                f"Sub-answers from {len(sub_answers)} sources:\n\n"
                f"{combined_answers}\n\n"
                "Synthesize a comprehensive, coherent answer to the original query "
                "using the sub-answers above."
            ),
        }],
    )
    agg_cost = (
        agg_response.usage.input_tokens * 0.000003 +
        agg_response.usage.output_tokens * 0.000015
    )
    guard.record_synthesis_cost(agg_cost)
    return agg_response.content[0].text.strip()

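The guard enforces three controls at different points in the pipeline. The sub-question count cap (max_sub_questions=6) fires before any engine invocations, limiting the worst-case cost to 6 sub-engine invocations plus one aggregation call regardless of how many sub-questions the decomposition LLM proposes. The budget tracker accumulates the cost of the decomposition call and the aggregation call; the sub-engine callables are opaque to the guard in this example, so their spend is counted only if they report it through record_synthesis_cost. Because each cost is computed from the usage returned by the API, the tracker raises immediately after the call that crosses the session budget and blocks every later call, rather than before that call is made. The LoopDetector catches the session-level pattern: if the same broad query keeps triggering high fan-outs across multiple requests, it means the upstream caller is misconfigured or the user is stuck in a retry loop at the application layer. For that to work, one guard instance has to be shared across requests; run_guarded_subquestion_query constructs a new guard per call, which is enough for the cap and the budget but resets the detector each time.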
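
Refusing a call before any tokens are billed requires a cost estimate made in advance, since actual usage is only known once the response returns. One option is to reserve a worst-case figure up front and record the real cost afterwards. The sketch below assumes a hypothetical PreflightBudget helper (not part of runguard or LlamaIndex), a caller-supplied input-token estimate, and the per-token rates used for the aggregation call above.

```python
class PreflightBudgetError(RuntimeError):
    """Raised when a call's worst-case cost would exceed the remaining budget."""


class PreflightBudget:
    """Hypothetical helper: check a worst-case cost before each LLM call."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def reserve(
        self,
        est_input_tokens: int,
        max_output_tokens: int,
        input_rate: float,
        output_rate: float,
    ) -> float:
        """Return the worst-case cost of a call, or raise before it is made."""
        # Worst case: the model uses every output token it is allowed.
        worst_case = est_input_tokens * input_rate + max_output_tokens * output_rate
        if self.spent_usd + worst_case > self.cap_usd:
            raise PreflightBudgetError(
                f"Call could cost ${worst_case:.4f}; "
                f"only ${self.cap_usd - self.spent_usd:.4f} remains."
            )
        return worst_case

    def record(self, actual_cost_usd: float) -> None:
        """Record the real cost once the response's usage is known."""
        self.spent_usd += actual_cost_usd


if __name__ == "__main__":
    budget = PreflightBudget(cap_usd=0.02)
    budget.reserve(2000, 768, input_rate=0.000003, output_rate=0.000015)
    budget.record(0.012)  # actual cost computed from the response usage
    try:
        budget.reserve(2000, 768, input_rate=0.000003, output_rate=0.000015)
    except PreflightBudgetError as exc:
        print(f"blocked before the call: {exc}")
```

The trade-off is conservatism: reserving against max_tokens rejects some calls that would have fit, which is usually the right bias for a cost guard.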

Failure mode 2: ReAct agent tool repeat loop

LlamaIndex's ReActAgent implements the standard Reason-Act-Observe cycle: the agent generates a Thought and an Action (tool call), executes the tool, observes the result, and either generates the final answer or continues to the next step. The max_iterations parameter caps the number of steps before the agent is forced to produce a final answer. This cap works correctly when each iteration genuinely advances toward the goal.

The repeat loop emerges from two distinct sources. First, tool execution errors: when a tool call fails or returns an error response, the LlamaIndex agent logs the error as an Observation, but the failed step is not always counted separately from successful ones. The agent retries the tool with a slightly modified argument (perhaps changing capitalization or adding a qualifier), pays another full synthesis call for the new Thought-Action generation, and observes the same error again. Second, low-confidence results: when the tool returns a valid but uncertain response ("I could not find specific information about X"), the agent may rephrase the query and call the same tool repeatedly. Each of these iterations looks valid and may cost nothing at the tool itself, yet each still pays one synthesis call.

The guard tracks tool invocations by tool name and argument signature. After a configurable number of consecutive calls to the same tool with semantically similar arguments, it injects a forced termination or a tool-change directive into the agent context.
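
The "consecutive calls with the same signature" rule can be illustrated without any dependencies. The class below is a hypothetical stand-in written for this illustration; it is not the runguard LoopDetector, whose internals are not shown in this article, and it only handles straight repeats rather than longer cycles.

```python
from typing import Optional


class ConsecutiveRepeatCounter:
    """Counts consecutive identical signatures; a different one resets the run."""

    def __init__(self, max_consecutive: int = 3):
        self.max_consecutive = max_consecutive
        self._last_sig: Optional[str] = None
        self._count = 0

    def step(self, sig: str) -> bool:
        """Record one call. Returns True once the same signature has been
        seen max_consecutive times in a row."""
        if sig == self._last_sig:
            self._count += 1
        else:
            self._last_sig = sig
            self._count = 1
        return self._count >= self.max_consecutive


if __name__ == "__main__":
    counter = ConsecutiveRepeatCounter(max_consecutive=3)
    calls = [
        "search:agents", "search:agents",  # two repeats: below the limit
        "lookup:pricing",                  # a different call resets the run
        "search:agents", "search:agents", "search:agents",
    ]
    for sig in calls:
        print(sig, counter.step(sig))
```

Only the final call trips the counter, because the intervening lookup call breaks the earlier run of two.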

Python — ReAct tool repeat guard for LlamaIndex ReActAgent
import hashlib
import anthropic
from runguard import BudgetTracker, BudgetExceededError, LoopDetector

class ReActToolRepeatGuard:
    """
    Tracks tool invocations in a LlamaIndex ReActAgent step loop.
    Detects when the agent is calling the same tool with near-identical
    arguments across consecutive steps without advancing.
    """
    def __init__(
        self,
        max_same_tool_consecutive: int = 3,
        session_budget_usd: float = 0.50,
    ):
        self.max_consecutive = max_same_tool_consecutive
        self.budget = BudgetTracker(cap=session_budget_usd)
        # LoopDetector on (tool_name, arg_sig) pairs
        self.loop_detector = LoopDetector(repeats=max_same_tool_consecutive, max_cycle_len=2)
        self._step_history: list[dict] = []

    def _arg_signature(self, tool_name: str, tool_input: str) -> str:
        """
        Produces a signature for (tool_name, tool_input) that collapses
        near-identical inputs. Strips punctuation and lowercases before hashing
        so "search for X" and "search for X." map to the same signature.
        """
        normalized = "".join(
            c for c in f"{tool_name}:{tool_input}".lower()
            if c.isalnum() or c == ":"
        )
        return hashlib.sha256(normalized.encode()).hexdigest()[:16]

    def before_tool_call(
        self,
        tool_name: str,
        tool_input: str,
        step_index: int,
    ) -> None:
        """
        Call before each tool invocation. Raises RuntimeError if the agent
        is stuck repeating the same tool call. Records the call in history.
        """
        sig = self._arg_signature(tool_name, tool_input)
        loop_state = self.loop_detector.step(sig)

        self._step_history.append({
            "step": step_index,
            "tool": tool_name,
            "input": tool_input[:120],
            "sig": sig,
        })

        if loop_state.tripped:
            history_summary = "\n".join(
                f"  Step {h['step']}: {h['tool']}({h['input'][:60]}...)"
                for h in self._step_history[-5:]
            )
            raise RuntimeError(
                f"ReAct tool repeat loop detected: '{tool_name}' called with "
                f"near-identical arguments {self.max_consecutive} consecutive times.\n"
                f"Recent tool calls:\n{history_summary}\n"
                "Agent is not making progress. Force a final answer or change tool."
            )

    def record_step_cost(self, input_tokens: int, output_tokens: int) -> None:
        cost = (input_tokens * 0.000003) + (output_tokens * 0.000015)
        try:
            self.budget.add(cost)
        except BudgetExceededError as e:
            raise RuntimeError(
                f"ReAct session budget exhausted after {len(self._step_history)} steps."
            ) from e

    def reset_tool_counter(self) -> None:
        """Call when the agent successfully calls a different tool to reset repeat tracking."""
        self.loop_detector = LoopDetector(
            repeats=self.max_consecutive, max_cycle_len=2
        )


def run_guarded_react_agent(
    query: str,
    tools: list[dict],  # [{"name": str, "fn": callable, "description": str}]
    model: str = "claude-sonnet-4-6",
    max_iterations: int = 10,
    session_budget_usd: float = 0.50,
) -> str:
    """
    LlamaIndex-style ReAct loop with tool-repeat guard and budget tracker.
    tools: list of dicts with "name", "fn" (callable: str -> str), "description".
    """
    client = anthropic.Anthropic()
    guard = ReActToolRepeatGuard(
        max_same_tool_consecutive=3,
        session_budget_usd=session_budget_usd,
    )

    tool_index = {t["name"]: t["fn"] for t in tools}
    tools_description = "\n".join(
        f"- {t['name']}: {t['description']}" for t in tools
    )

    history = []
    observations = []

    system_prompt = (
        "You are a ReAct agent. For each step, output:\n"
        "Thought: <your reasoning about the next step>\n"
        "Action: <name of one available tool>\n"
        "Action Input: <input string for that tool>\n\n"
        "When you have enough information, output:\n"
        "Thought: I now have enough information to answer.\n"
        "Final Answer: <your answer to the query>\n\n"
        f"Available tools:\n{tools_description}"
    )

    for step in range(max_iterations):
        context = "\n\n".join(observations[-4:]) if observations else "No observations yet."

        response = client.messages.create(
            model=model,
            max_tokens=512,
            system=system_prompt,
            messages=[
                *history[-8:],
                {
                    "role": "user",
                    "content": (
                        f"Query: {query}\n\n"
                        f"Recent observations:\n{context}\n\n"
                        "What is your next thought and action?"
                    ),
                },
            ],
        )

        response_text = response.content[0].text.strip()
        guard.record_step_cost(
            response.usage.input_tokens, response.usage.output_tokens
        )

        history.extend([
            {"role": "user", "content": f"Query: {query}"},
            {"role": "assistant", "content": response_text},
        ])

        if "Final Answer:" in response_text:
            final = response_text.split("Final Answer:")[-1].strip()
            return final

        # Parse action and input
        action_name = ""
        action_input = ""
        for line in response_text.splitlines():
            if line.startswith("Action:"):
                action_name = line.replace("Action:", "").strip()
            elif line.startswith("Action Input:"):
                action_input = line.replace("Action Input:", "").strip()

        if not action_name or action_name not in tool_index:
            observations.append(f"Step {step}: No valid action found in response.")
            continue

        # Guard: check before invoking the tool
        try:
            guard.before_tool_call(action_name, action_input, step)
        except RuntimeError as e:
            # Force final answer with current observations
            return (
                f"Agent terminated by loop guard: {str(e)[:200]}. "
                f"Last observations: {observations[-1] if observations else 'none'}"
            )

        # Invoke tool
        tool_fn = tool_index[action_name]
        try:
            observation = tool_fn(action_input)
        except Exception as exc:
            observation = f"Tool error: {type(exc).__name__}: {str(exc)[:200]}"

        observations.append(f"Step {step} [{action_name}]: {observation[:300]}")

    return f"Max iterations ({max_iterations}) reached. Last observation: {observations[-1] if observations else 'none'}"

The before_tool_call interception is the key insertion point. Rather than wrapping the entire agent, the guard sits at the tool dispatch layer — the one place where the agent's abstract reasoning transitions into a concrete (and billed) action. The argument normalization step is load-bearing: without collapsing punctuation and case differences, search("AI agent frameworks") and search("ai agent frameworks.") would be treated as distinct calls and the loop would never trip. The reset_tool_counter method lets the host reset the repeat counter when the agent genuinely pivots to a new tool, preventing false positives when an agent legitimately calls the same tool twice in non-consecutive steps.
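
The effect of the normalization step is easy to check in isolation. The function below repeats the logic of _arg_signature as a free function (the name arg_signature is only for this illustration) and compares three inputs.

```python
import hashlib


def arg_signature(tool_name: str, tool_input: str) -> str:
    """Same normalization as _arg_signature: lowercase, keep only
    alphanumerics and ':', then hash."""
    normalized = "".join(
        c for c in f"{tool_name}:{tool_input}".lower()
        if c.isalnum() or c == ":"
    )
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]


if __name__ == "__main__":
    base = arg_signature("search", "AI agent frameworks")
    cosmetic = arg_signature("search", "ai agent frameworks.")
    reworded = arg_signature("search", "frameworks for AI agents")

    print(base == cosmetic)  # True: case and punctuation are collapsed
    print(base == reworded)  # False: different words give a different signature
```

The normalization is lexical, so it catches cosmetic variation but not genuine rewording. Catching reworded repeats would need a similarity measure such as embedding distance, which this guard does not attempt.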

Failure mode 3: RetryQueryEngine reformulation storm

LlamaIndex's RetryQueryEngine wraps an inner query engine with an evaluator that scores the quality of each response. If the score falls below a threshold, the engine automatically reformulates the query and retries — the evaluator generates a critique of the failed response and suggests a revised query, which is then sent through the full retrieval + synthesis pipeline again. This is the correct behavior for a single transient failure. It becomes a cost storm when the underlying cause of the low score is not transient.

Three conditions produce persistent low scores: queries about topics not covered in the indexed documents (out-of-distribution queries), very specific queries that don't match the chunking granularity of the index (a question about a specific line in a document that was chunked at the paragraph level), and adversarially vague queries where no retrieval result can produce a high-confidence synthesis (queries like "what does this document mean for me" or "what should I do"). In all three cases, every reformulation pays a new embedding call, a new retrieval pass, and a new synthesis call — and scores below the threshold again. The RetryQueryEngine default configuration allows up to 3 retries; with an always-failing query, you pay 4 full pipeline invocations per user question.

The guard tracks retry attempts per query signature. If the same query (or near-identical reformulations of it) keeps failing the evaluator threshold across multiple retry cycles, the guard raises rather than allowing another reformulation.

Python — retry storm guard for LlamaIndex RetryQueryEngine pattern
import hashlib
import anthropic
from runguard import BudgetTracker, BudgetExceededError

class RetryStormGuard:
    """
    Guards the RetryQueryEngine retry loop. Tracks how many times a given
    query signature has been reformulated and failed the evaluator threshold.
    Raises if reformulation count exceeds max_reformulations — indicating
    an out-of-distribution or structurally unanswerable query.
    """
    def __init__(
        self,
        max_reformulations: int = 2,
        min_score_threshold: float = 0.6,
        session_budget_usd: float = 0.30,
    ):
        self.max_reformulations = max_reformulations
        self.threshold = min_score_threshold
        self.budget = BudgetTracker(cap=session_budget_usd)
        self._retry_counts: dict[str, int] = {}
        self._client = anthropic.Anthropic()

    def _query_sig(self, query: str) -> str:
        normalized = "".join(c for c in query.lower() if c.isalnum() or c.isspace())
        return hashlib.sha256(normalized[:200].encode()).hexdigest()[:16]

    def evaluate_and_maybe_retry(
        self,
        original_query: str,
        current_query: str,
        response_text: str,
        query_engine_fn,  # callable: (query: str) -> str
        model: str = "claude-haiku-4-5-20251001",
    ) -> str:
        """
        Evaluates a query response. If the score is below threshold and
        reformulation budget remains, generates a revised query and retries.
        Returns the latest available response (scores are not compared
        across attempts), with a warning prefix once the limit is reached.
        """
        orig_sig = self._query_sig(original_query)
        reformulation_count = self._retry_counts.get(orig_sig, 0)

        # Evaluate current response
        score = self._evaluate_response(original_query, response_text, model)

        if score >= self.threshold:
            # Good response — return it
            return response_text

        # Low score — check if we can reformulate
        if reformulation_count >= self.max_reformulations:
            # Exhausted reformulations — return best-effort response with warning
            return (
                f"[RetryGuard: {reformulation_count} reformulations exhausted, "
                f"best score {score:.2f} < {self.threshold}] {response_text}"
            )

        self._retry_counts[orig_sig] = reformulation_count + 1

        # Generate reformulated query
        reformulation = self._generate_reformulation(
            original_query, current_query, response_text, score, model
        )

        if not reformulation or reformulation == current_query:
            return response_text  # no useful reformulation possible

        # Retry with reformulated query
        try:
            new_response = query_engine_fn(reformulation)
            return self.evaluate_and_maybe_retry(
                original_query,
                reformulation,
                new_response,
                query_engine_fn,
                model,
            )
        except RuntimeError:
            return response_text  # budget exhausted mid-retry — keep best effort

    def _evaluate_response(
        self, query: str, response: str, model: str
    ) -> float:
        eval_cost_estimate = 0.00005  # Haiku evaluation call
        try:
            self.budget.add(eval_cost_estimate)
        except BudgetExceededError as e:
            raise RuntimeError("Session budget exhausted during evaluation.") from e

        eval_response = self._client.messages.create(
            model=model,
            max_tokens=16,
            messages=[{
                "role": "user",
                "content": (
                    f"Query: {query}\n\n"
                    f"Response: {response[:500]}\n\n"
                    "Score the response on how well it answers the query "
                    "(0.0 = does not answer, 1.0 = fully answers). "
                    "Reply with only a decimal number between 0.0 and 1.0."
                ),
            }],
        )
        raw = eval_response.content[0].text.strip()
        try:
            return min(1.0, max(0.0, float(raw)))
        except ValueError:
            return 0.5

    def _generate_reformulation(
        self,
        original_query: str,
        current_query: str,
        failed_response: str,
        score: float,
        model: str,
    ) -> str:
        reform_cost_estimate = 0.00008
        try:
            self.budget.add(reform_cost_estimate)
        except BudgetExceededError as e:
            raise RuntimeError("Session budget exhausted during reformulation.") from e

        reform_response = self._client.messages.create(
            model=model,
            max_tokens=128,
            messages=[{
                "role": "user",
                "content": (
                    f"Original question: {original_query}\n"
                    f"Current query sent to retrieval: {current_query}\n"
                    f"Response received (score {score:.2f}/1.0 — insufficient):\n"
                    f"{failed_response[:300]}\n\n"
                    "Rewrite the query to be more specific and more likely to match "
                    "indexed content. Focus on concrete keywords rather than abstract "
                    "phrasing. Return only the rewritten query, nothing else."
                ),
            }],
        )
        return reform_response.content[0].text.strip()

The critical design choice is tracking retry counts by original query signature rather than by the reformulated query. Without this, each reformulation looks like a fresh query to the guard and the ceiling never applies: the guard would reset its counter on every reformulation and allow unlimited retries. By anchoring to the original query, the guard correctly accumulates the count across the full chain of reformulations that all stem from the same user request. The max_reformulations=2 default allows two reformulations (three pipeline invocations in total), enough to recover from transient poor retrieval, before the guard surfaces a best-effort response with a warning prefix rather than silently returning low-quality output.
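
A small simulation shows why the choice of key matters. It models an always-failing query whose every reformulation is new text, and counts how many reformulations each keying strategy permits. The function name and the safety limit are assumptions made for this illustration only.

```python
def reformulations_allowed(
    key_by_original: bool,
    max_reformulations: int = 2,
    safety_limit: int = 10,  # stops the simulation if no ceiling ever applies
) -> int:
    """Simulate an always-failing query and count permitted reformulations."""
    counts: dict[str, int] = {}
    original = "what should I do"
    current = original
    allowed = 0
    while allowed < safety_limit:
        # The only difference between the two strategies is the counter key.
        key = original if key_by_original else current
        if counts.get(key, 0) >= max_reformulations:
            break  # ceiling reached for this key
        counts[key] = counts.get(key, 0) + 1
        allowed += 1
        current = f"{original} (rewrite {allowed})"  # each rewrite is new text
    return allowed


if __name__ == "__main__":
    print(reformulations_allowed(key_by_original=True))   # 2
    print(reformulations_allowed(key_by_original=False))  # 10
```

Keyed by the original query, the ceiling applies after two reformulations. Keyed by the current query, every rewrite starts from a count of zero and only the artificial safety limit stops the loop.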

Failure mode 4: multi-document agent routing amplification

LlamaIndex's multi-document agent pattern is designed for knowledge bases where different documents require different reasoning styles: technical documentation, legal contracts, and financial reports each benefit from different prompting strategies. The pattern builds one ReActAgent per document, indexes the agents themselves in a top-level ObjectIndex, and routes each incoming query to the most relevant agents. Queries that clearly target one document hit one agent. Queries that reference multiple documents or use vague language are routed to several agents simultaneously, each executing a full agent invocation — retrieval, reasoning, synthesis — before the parent agent aggregates their outputs.

The amplification compounds in two ways. First, routing threshold: if the relevance threshold for agent selection is set too low, queries are dispatched to more agents than necessary. A corpus of 20 documents with a 0.3 cosine similarity threshold may route a moderately ambiguous query to 8 or 10 agents. Second, nested agent invocations: each per-document ReActAgent may itself iterate over multiple steps before returning, making the total cost agents_selected × average_steps_per_agent × cost_per_step. A 10-document routing with 3-step per-document agents pays 30 synthesis calls for one user query.
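
The multiplication above is worth making concrete, because it also shows what a cap on selected agents buys. This is a sketch with a hypothetical function name; the $0.015 per step is the flat per-synthesis-call figure used earlier in this article, and the parent agent's aggregation call is left out.

```python
def routing_fan_out(
    agents_selected: int,
    avg_steps_per_agent: int,
    cost_per_step_usd: float = 0.015,  # flat per-call figure used in this article
) -> tuple[int, float]:
    """Return (synthesis_calls, cost_usd) for one routed query.

    Excludes the parent agent's aggregation call and any embedding cost.
    """
    synthesis_calls = agents_selected * avg_steps_per_agent
    return synthesis_calls, synthesis_calls * cost_per_step_usd


if __name__ == "__main__":
    for label, agents in [("uncapped", 10), ("capped at 3", 3)]:
        calls, cost = routing_fan_out(agents_selected=agents, avg_steps_per_agent=3)
        print(f"{label}: {calls} synthesis calls, about ${cost:.3f}")
```

Ten agents at three steps each gives the 30 calls quoted above; capping the router at three agents brings the same query down to 9.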

The guard sits at the routing layer and enforces a maximum number of agents that can be simultaneously dispatched for a single query, plus a cost ceiling that aborts remaining dispatches if the budget is exhausted mid-routing.

Python — multi-document routing guard for LlamaIndex ObjectIndex agent pattern
import anthropic
from runguard import BudgetTracker, BudgetExceededError

class MultiDocAgentRoutingGuard:
    """
    Caps the number of per-document agents dispatched for a single query.
    Enforces a query-level and session-level budget.
    Prevents the routing fan-out from scaling with corpus size.
    """
    def __init__(
        self,
        max_agents_per_query: int = 3,
        query_budget_usd: float = 0.15,
        session_budget_usd: float = 2.0,
    ):
        self.max_agents = max_agents_per_query
        self.query_budget = query_budget_usd
        self.session_budget = BudgetTracker(cap=session_budget_usd)
        self._client = anthropic.Anthropic()

    def route_and_execute(
        self,
        query: str,
        candidate_agents: list[dict],  # [{name, description, invoke_fn}]
        aggregation_model: str = "claude-sonnet-4-6",
    ) -> str:
        """
        Routes a query to at most max_agents_per_query agents from the candidate list.
        candidate_agents: pre-retrieved candidates from ObjectIndex (already ranked).
        Executes each selected agent and aggregates results.
        """
        if not candidate_agents:
            return "No relevant agents found for this query."

        # Cap routing fan-out
        selected = candidate_agents[:self.max_agents]

        # Per-query budget tracks spend across all selected agent invocations
        query_budget = BudgetTracker(cap=self.query_budget)

        agent_responses = []
        for agent_info in selected:
            agent_name = agent_info["name"]
            invoke_fn = agent_info["invoke_fn"]

            try:
                # Invoke per-document agent
                agent_response = invoke_fn(query)
                agent_responses.append({
                    "agent": agent_name,
                    "response": agent_response,
                })

                # Estimate per-agent cost (1 synthesis call at ~500 tokens average)
                estimated_cost = 0.015  # conservative estimate per agent invocation
                try:
                    query_budget.add(estimated_cost)
                    self.session_budget.add(estimated_cost)
                except BudgetExceededError as e:
                    # Budget exhausted — aggregate what we have so far
                    agent_responses.append({
                        "agent": "__budget_limit__",
                        "response": f"[Budget exhausted after {len(agent_responses)} agents]",
                    })
                    break

            except Exception as exc:
                agent_responses.append({
                    "agent": agent_name,
                    "response": f"[Agent error: {type(exc).__name__}: {str(exc)[:100]}]",
                })

        # Separate real agent outputs from the budget marker entry so the
        # marker is never counted as a response or sent to the aggregator.
        real_responses = [
            r for r in agent_responses if r["agent"] != "__budget_limit__"
        ]

        if not real_responses:
            return "No agent responses received."

        if len(real_responses) == 1:
            # Single agent: return directly without an aggregation call
            return str(real_responses[0]["response"])

        # Aggregate responses; this is the most expensive call due to combined context.
        # str() guards against invoke_fn returning a response object rather than text.
        combined = "\n\n".join(
            f"[{r['agent']}]: {str(r['response'])[:400]}"
            for r in real_responses
        )

        agg_response = self._client.messages.create(
            model=aggregation_model,
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": (
                    f"Query: {query}\n\n"
                    f"Responses from {len(real_responses)} relevant document agents:\n\n"
                    f"{combined}\n\n"
                    "Synthesize a unified answer to the query using the agent responses above. "
                    "Cite which document(s) each key fact comes from."
                ),
            }],
        )

        agg_cost = (
            agg_response.usage.input_tokens * 0.000003 +
            agg_response.usage.output_tokens * 0.000015
        )
        try:
            query_budget.add(agg_cost)
            self.session_budget.add(agg_cost)
        except BudgetExceededError:
            pass  # aggregation cost is unavoidable at this point — accept it

        return agg_response.content[0].text.strip()

    def check_session_budget(self) -> tuple[float, float]:
        """Returns (spent_usd, remaining_usd) for the session."""
        spent = self.session_budget.spent
        remaining = self.session_budget.cap - spent
        return spent, remaining

The max_agents_per_query=3 ceiling is the primary cost control: it turns routing from an O(corpus_size) worst case into a constant bound of three agent invocations, regardless of how many agents score above the relevance threshold. The per-query budget provides a secondary ceiling that stops dispatching further agents once the query's accumulated cost reaches the cap. As written, the guard charges a flat estimate per agent invocation, so a per-document agent that runs many steps will cost more than the guard records; substituting measured token usage for the estimate closes that gap. The session budget accumulates across all queries, providing the outermost ceiling for multi-turn sessions where routing amplification can compound across dozens of user messages.

Combining guards in a production LlamaIndex pipeline

These four guards are independent and composable. A production LlamaIndex deployment might use all four simultaneously at different points in the pipeline:

  • The SubQuestionFanOutGuard wraps the SubQuestionQueryEngine.query() call at the outermost layer.
  • The ReActToolRepeatGuard is injected into each ReActAgent's tool dispatch callback via LlamaIndex's callback_manager hooks.
  • The RetryStormGuard replaces the RetryQueryEngine's built-in retry logic for queries routed to sparse or out-of-distribution sub-engines.
  • The MultiDocAgentRoutingGuard wraps the ObjectIndex query results before they are dispatched to per-document agents.

The shared session budget — either passed as a single BudgetTracker instance or as individual caps that sum to the session total — ensures that the guards collectively enforce a single per-session ceiling rather than independently allowing their own sub-budgets to add up to a surprise total.
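The single-instance option can be sketched as follows. SharedBudget, BudgetExceeded, and StageGuard are hypothetical stand-ins written for this example so that it runs without any dependency; they are not RunGuard's classes, and the check-before-record behaviour of add() is an assumption.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push spend past the cap."""


class SharedBudget:
    """Hypothetical stand-in for a session-wide budget tracker."""

    def __init__(self, cap: float):
        self.cap = cap
        self.spent = 0.0

    def add(self, cost: float) -> None:
        # Reject the charge before recording it, so spent never exceeds cap.
        if self.spent + cost > self.cap:
            raise BudgetExceeded(f"cap {self.cap:.2f} USD would be exceeded")
        self.spent += cost


class StageGuard:
    """Hypothetical pipeline-stage guard that charges a shared budget."""

    def __init__(self, name: str, session_budget: SharedBudget):
        self.name = name
        self.session_budget = session_budget

    def charge(self, cost: float) -> None:
        self.session_budget.add(cost)


# One budget object is handed to every guard in the pipeline.
session = SharedBudget(cap=2.0)
fan_out = StageGuard("sub_question_fan_out", session)
routing = StageGuard("multi_doc_routing", session)

fan_out.charge(1.20)
routing.charge(0.60)

blocked = False
try:
    routing.charge(0.30)  # 1.80 already spent, so this would exceed 2.00
except BudgetExceeded:
    blocked = True

print(f"spent={session.spent:.2f} blocked={blocked}")
```

With separate per-guard caps, each stage could spend up to its own limit before anything tripped. Passing one object means the third charge is refused no matter which stage issues it.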

LlamaIndex 0.10+ Workflows note: The new LlamaIndex Workflows framework introduces an event-driven step graph that can contain back-edges (a step emitting an event that routes back to an earlier step). A back-edge with no exit condition is a cycle that will run indefinitely. Guard Workflow cycles by tracking event type counts per execution ID — if the same event type fires more than N times in a single workflow execution, the workflow has entered a loop and should be halted. The LoopDetector works directly here: use the (workflow_id, event_type) pair as the step signature, with repeats=5 to allow legitimate retry patterns while catching genuine loops.
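The counting rule described in the note can be sketched without any framework dependency. EventRepeatGuard and WorkflowLoopError are hypothetical names used only for illustration; this is not the LoopDetector API, and wiring record() into an actual Workflow step is left to the reader.

```python
from collections import Counter


class WorkflowLoopError(RuntimeError):
    """Raised when one event type fires too often in a single execution."""


class EventRepeatGuard:
    """Counts events per (workflow_id, event_type) and trips past a ceiling."""

    def __init__(self, max_repeats: int = 5):
        self.max_repeats = max_repeats
        self._counts: Counter = Counter()

    def record(self, workflow_id: str, event_type: str) -> int:
        """Record one event; raise if this signature exceeds max_repeats."""
        key = (workflow_id, event_type)
        self._counts[key] += 1
        if self._counts[key] > self.max_repeats:
            raise WorkflowLoopError(
                f"{event_type} fired {self._counts[key]} times in {workflow_id}"
            )
        return self._counts[key]

    def reset(self, workflow_id: str) -> None:
        """Drop counters for a finished execution."""
        for key in [k for k in self._counts if k[0] == workflow_id]:
            del self._counts[key]


# Simulate a back-edge with no exit condition.
guard = EventRepeatGuard(max_repeats=5)
fired = 0
halted = False
try:
    while True:
        guard.record("run-42", "RetryEvent")
        fired += 1
except WorkflowLoopError:
    halted = True

print(f"events allowed before halt: {fired}, halted: {halted}")
```

Keying on the execution ID keeps concurrent workflow runs from tripping each other, and calling reset() when a run completes stops the counter table from growing without bound.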

Summary: LlamaIndex cost amplification patterns

| Component (default) | Cost multiplier | Guard |
| --- | --- | --- |
| SubQuestionQueryEngine (default: unbounded sub-questions) | N sub-questions × (embed + retrieve + synthesize) | Cap sub-question count before dispatch |
| ReActAgent (default: max_iterations=10) | Repeat tool calls × synthesis cost per step | Track (tool, normalized-arg) pairs; trip on repeats |
| RetryQueryEngine (default: max_retries=3) | Retries × (embed + retrieve + evaluate + synthesize) | Track reformulation count per original query signature |
| Multi-doc agent routing (default: relevance threshold 0.3+) | Agents selected × steps per agent × synthesis cost | Cap agents dispatched per query; per-query budget |
| Workflow back-edges (event-driven cycles) | Unbounded step repetitions until kill | LoopDetector on (workflow_id, event_type) signature |

Frequently asked questions

Does LlamaIndex have built-in cost controls for these patterns?

LlamaIndex provides max_iterations for ReActAgent, max_retries for RetryQueryEngine, and callback hooks throughout the pipeline. These built-in limits prevent the absolute worst cases but they don't detect semantic loops (the same tool call with different phrasing), don't aggregate costs across composed components, and don't provide session-level or query-level budget enforcement that spans multiple pipeline stages. The guards above complement the built-in limits rather than replacing them.

How do I estimate the cost of a LlamaIndex RAG pipeline before deploying it?

Run the pipeline against a representative sample of queries with DRY_RUN=1 or by mocking the LLM calls and counting invocations instead of spending tokens. Log every LLM call — model, input tokens, output tokens, call type (embedding/synthesis/evaluation) — to a local file. Multiply by your provider's pricing. The key number is not average cost per query but the 95th percentile cost per query, which is where the fan-out and retry patterns appear. A pipeline with a $0.02 median cost and a $0.80 P95 cost will surprise you at scale.
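A minimal sketch of the percentile calculation, assuming each logged call is a dict with query_id, call_type, input_tokens, and output_tokens fields. That log schema and the helper names are assumptions made for this example. The per-token prices mirror the rates used in the aggregation code earlier in this article and should be replaced with your provider's current pricing.

```python
from collections import defaultdict

# Assumed USD prices per token as (input, output); add embedding rates as needed.
PRICES = {
    "synthesis": (0.000003, 0.000015),
    "evaluation": (0.000003, 0.000015),
}


def cost_per_query(call_log: list[dict]) -> dict[str, float]:
    """Sum the cost of every logged LLM call, grouped by query_id."""
    totals: dict[str, float] = defaultdict(float)
    for call in call_log:
        in_price, out_price = PRICES[call["call_type"]]
        totals[call["query_id"]] += (
            call["input_tokens"] * in_price + call["output_tokens"] * out_price
        )
    return dict(totals)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


# Synthetic sample: 18 single-call queries and 2 queries that fan out to 30 calls.
one_call = {"call_type": "synthesis", "input_tokens": 1000, "output_tokens": 200}
call_log: list[dict] = []
for q in range(18):
    call_log.append({"query_id": f"q{q}", **one_call})
for q in (18, 19):
    call_log.extend({"query_id": f"q{q}", **one_call} for _ in range(30))

costs = list(cost_per_query(call_log).values())
p50 = percentile(costs, 50)
p95 = percentile(costs, 95)
print(f"median: ${p50:.3f}  P95: ${p95:.3f}")
```

In this synthetic sample the median reflects the single-call queries while the P95 lands on a fan-out query, which is the gap the paragraph above warns about.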

What is the safest default configuration for SubQuestionQueryEngine in production?

Set the sub-question generation prompt to explicitly cap the count ("generate at most 4 sub-questions"), provide the LLM with context about your indexing structure so it generates questions that match your chunking granularity, and wrap the engine with a fan-out guard that rejects outputs above your cap rather than silently truncating them (truncation can produce incoherent aggregations if the LLM assumed all sub-questions would be answered). A hard cap of 4–6 sub-questions covers the majority of legitimate complex queries without enabling pathological fan-outs.

Does the multi-document agent pattern scale to large document corpora?

The multi-document agent pattern is designed for corpora where each document benefits from specialized reasoning — typically 5–50 documents that differ significantly in domain or structure. For larger corpora (hundreds or thousands of documents), the per-document agent routing overhead becomes prohibitive even with a routing guard, because the ObjectIndex retrieval itself grows and the aggregation call receives too many agent responses to synthesize coherently. For large corpora, prefer a single well-parameterized retrieval engine with metadata filtering over per-document agents.

How does RunGuard's LoopDetector map to LlamaIndex's callback system?

LlamaIndex exposes a CallbackManager that fires before and after LLM calls, tool calls, embedding calls, and query engine invocations. The cleanest integration is to call guard.before_tool_call() inside a custom callback registered for the CBEventType.LLM or CBEventType.FUNCTION_CALL event types. The guard instance is held by the callback class and accumulates state across the agent's iteration loop. This avoids modifying the agent internals while still intercepting every billable call before it fires.

Guard your LlamaIndex pipeline before it bills you

RunGuard's LoopDetector, BudgetTracker, and ContextGuard are the same primitives used in the guards above — one install, works with any LlamaIndex version.
