LlamaIndex Workflow Cost Control: Context Accumulation, Sub-Query Fan-Out, and ReAct Tool Spirals
LlamaIndex sits at the intersection of retrieval and reasoning. Where LangChain abstracts LLM chains and CrewAI coordinates agent crews, LlamaIndex is first and foremost a data framework — its core value proposition is bridging your vector stores, document indices, and structured data sources to LLM reasoning. Agentic RAG patterns, SubQuestionQueryEngine, ReActAgent, and the newer event-driven Workflow abstraction all sit on top of this retrieval-first foundation. That architecture creates a distinct set of cost failure modes you won't encounter in pure orchestration frameworks.
The key cost driver in LlamaIndex is the retrieve-synthesize cycle. Almost every LlamaIndex abstraction — query engines, agents, workflows — alternates between retrieving relevant chunks from an index and synthesizing them with an LLM call. When these cycles multiply without bounds, retrieval costs (embedding queries + vector store reads) and synthesis costs (LLM tokens) stack together. A single user query can fan out into dozens of retrieval rounds and LLM calls before returning a response.
Four failure modes that are specific to LlamaIndex AI agents and workflows:
- Workflow context accumulation: LlamaIndex's Workflow class passes a shared Context object between steps. Each step deposits its full output into the context. By the synthesis step, the context contains every retrieved chunk, every intermediate summary, and every tool result from the entire workflow, all re-sent to the LLM for final synthesis.
- SubQuestionQueryEngine fan-out: the SubQuestionQueryEngine decomposes a user query into N sub-questions using an LLM call, then executes N parallel query engine calls, then synthesizes all N responses into a final answer. N is LLM-determined. Complex or ambiguous queries routinely generate 15–25 sub-questions; each sub-question runs its own retrieval + LLM synthesis pipeline.
- ReActAgent tool-call amplification: LlamaIndex's ReActAgent follows the Reasoning + Acting loop: reason about which tool to use, call the tool, observe the result, reason again. When initial retrieval results are thin, the agent continues querying with reworded searches. With a generous step ceiling, a research task loops through 20–40 tool calls, many of which are slight paraphrases of earlier queries.
- QueryPipeline validation loops: QueryPipeline chains multiple LLM calls, retrieval steps, and postprocessors into a directed graph. Pipelines that include a validation or grading step may loop back to an earlier node on failure. Without a maximum-iteration guard, a pipeline whose validator consistently grades output as insufficient will cycle indefinitely through expensive retrieve-and-synthesize steps.
LlamaIndex cost structure (mid-2026): LlamaIndex itself is open-source and free. Every cost comes from the models and infrastructure you wire to it. For a GPT-4o-backed agentic RAG workflow, one retrieve-synthesize cycle costs roughly: embedding query (~$0.0001), vector store read (~$0.002), LLM synthesis of 3,000 tokens (~$0.015), or about $0.017 in total. Twenty retrieval rounds for each of 10 sub-questions is 200 cycles, which comes to $3.00+ per user query before any budget guard. At 500 queries/day, that's $1,500+/day for a single feature with no runaway protection.
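The arithmetic behind that estimate can be written down as a small cost model. This is a sketch only: the unit prices are the illustrative figures quoted above, and CycleCost and cost_per_query are names made up for this example rather than anything LlamaIndex provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleCost:
    """Illustrative unit prices in USD for one retrieve-synthesize cycle."""
    embed_query: float = 0.0001
    vector_read: float = 0.002
    llm_synthesis: float = 0.015

    @property
    def per_cycle(self) -> float:
        return self.embed_query + self.vector_read + self.llm_synthesis


def cost_per_query(cost: CycleCost, retrieval_rounds: int, sub_questions: int) -> float:
    """Each sub-question runs its own retrieval rounds, so the factors multiply."""
    return cost.per_cycle * retrieval_rounds * sub_questions


per_query = cost_per_query(CycleCost(), retrieval_rounds=20, sub_questions=10)
per_day = per_query * 500  # 500 user queries per day
print(f"per query: ${per_query:.2f}, per day: ${per_day:,.0f}")
```

With the unrounded per-cycle price the model lands at about $3.42 per query, which is where the "$3.00+" figure comes from. Plugging in your own prices and caps shows quickly which multiplier dominates.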
Failure Mode 1: Workflow Context Accumulation
LlamaIndex's Workflow class (introduced during the v0.10 release series as the event-driven successor to the older QueryPipeline pattern) structures work as a graph of @step-decorated functions that communicate through typed events and a shared Context object. The context is a persistent key-value store available to every step in the workflow. Any step can read from and write to it. This design makes cross-step data sharing trivial, and it makes unbounded context growth invisible.
The typical agentic RAG workflow writes retrieved chunks to context at the retrieval step, writes intermediate summaries at each synthesis step, and writes tool results at each agent action step. By the time the final synthesize step runs, the context contains every intermediate artifact from the full run. If the synthesis prompt includes a raw dump of the context (a common pattern because it's convenient), the LLM receives tens of thousands of tokens of intermediate data, most of which it ignores and all of which you pay for.
Python — naive context accumulation
from llama_index.core.workflow import Context, Event, StartEvent, StopEvent, Workflow, step


class ChunksRetrievedEvent(Event):
    chunks: list[str]


class SummaryEvent(Event):
    summary: str


class ResearchWorkflow(Workflow):
    # self.retriever and self.llm are assumed to be attached to the instance elsewhere

    @step
    async def retrieve(self, ctx: Context, ev: StartEvent) -> ChunksRetrievedEvent:
        chunks = await self.retriever.aretrieve(ev.query)
        # naive: dumps ALL chunks into context; every subsequent step
        # will see these when it reads ctx.data
        await ctx.set("retrieved_chunks", [c.text for c in chunks])
        await ctx.set("query", ev.query)
        return ChunksRetrievedEvent(chunks=[c.text for c in chunks])

    @step
    async def summarize(self, ctx: Context, ev: ChunksRetrievedEvent) -> SummaryEvent:
        chunks = ev.chunks
        summary = await self.llm.acomplete(
            "Summarize:\n\n" + "\n\n".join(chunks)
        )
        # naive: appends full summary to context list
        existing = await ctx.get("summaries", default=[])
        existing.append(str(summary))
        await ctx.set("summaries", existing)
        return SummaryEvent(summary=str(summary))

    @step
    async def synthesize(self, ctx: Context, ev: SummaryEvent) -> StopEvent:
        # naive: injects ALL summaries + ALL raw chunks into final prompt
        all_chunks = await ctx.get("retrieved_chunks", default=[])
        all_summaries = await ctx.get("summaries", default=[])
        prompt = (
            f"Raw chunks:\n{chr(10).join(all_chunks)}\n\n"
            f"Summaries:\n{chr(10).join(all_summaries)}\n\n"
            "Final answer:"
        )
        result = await self.llm.acomplete(prompt)
        return StopEvent(result=str(result))
The problem compounds across multiple workflow runs in a session. If the Context object is reused across queries (a common optimization to avoid re-initializing workflow state), the chunks, summaries, and tool results from query N are still in the context when query N+1 runs. The synthesis step for query N+1 pays for all of query N's data as well.
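One way to keep query N's artifacts out of query N+1 is to separate session-level state from per-run state and clear the latter at the start of every run. The sketch below uses a plain dict-backed class as a stand-in for the shared context; RunScopedStore and its methods are hypothetical names for illustration, not part of the LlamaIndex Context API.

```python
from typing import Any


class RunScopedStore:
    """Dict-backed stand-in for a shared workflow context (hypothetical helper)."""

    def __init__(self) -> None:
        self.session: dict[str, Any] = {}  # survives across queries (e.g. user settings)
        self._run: dict[str, Any] = {}     # chunks, summaries, tool results for ONE query

    def start_run(self) -> None:
        """Drop every per-run artifact before the next query starts."""
        self._run = {}

    def set(self, key: str, value: Any) -> None:
        self._run[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self._run.get(key, default)


store = RunScopedStore()
store.session["user_id"] = "u-123"

store.start_run()                                   # query N
store.set("retrieved_chunks", ["chunk from query N"])

store.start_run()                                   # query N+1
leftover = store.get("retrieved_chunks", default=[])
print(leftover)  # nothing from query N is left to pay for
```

The same idea applies to a real Context: either create a fresh one per query or explicitly overwrite the per-run keys before the first step runs.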
The fix is to track what the synthesis step actually needs — not raw dumps of every intermediate artifact. Store references or condensed summaries, not verbatim chunks. Gate the synthesis prompt to a token ceiling:
Python — context-aware workflow with token ceiling
from dataclasses import dataclass
from typing import Optional

import tiktoken
from llama_index.core.workflow import Context, StartEvent, StopEvent, Workflow, step

# ChunksRetrievedEvent and SummaryEvent are the event classes from the naive example.


@dataclass
class WorkflowContextBudget:
    max_context_tokens: int = 8_000
    max_chunks_per_step: int = 8
    encoding: str = "cl100k_base"  # approximation; GPT-4o models use o200k_base

    def __post_init__(self):
        self._enc = tiktoken.get_encoding(self.encoding)

    def count_tokens(self, text: str) -> int:
        return len(self._enc.encode(text))

    def trim_to_budget(self, items: list[str], reserve: int = 1_000) -> list[str]:
        # keep items in order until the next one would exceed the budget
        budget = self.max_context_tokens - reserve
        kept, total = [], 0
        for item in items:
            t = self.count_tokens(item)
            if total + t > budget:
                break
            kept.append(item)
            total += t
        return kept


class GuardedResearchWorkflow(Workflow):
    def __init__(self, *args, context_budget: Optional[WorkflowContextBudget] = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.ctx_budget = context_budget or WorkflowContextBudget()

    @step
    async def retrieve(self, ctx: Context, ev: StartEvent) -> ChunksRetrievedEvent:
        chunks = await self.retriever.aretrieve(ev.query)
        # trim retrieval to step budget before storing
        texts = [c.text for c in chunks[:self.ctx_budget.max_chunks_per_step]]
        await ctx.set("query", ev.query)
        await ctx.set("retrieved_chunks", texts)  # capped list, not all chunks
        return ChunksRetrievedEvent(chunks=texts)

    @step
    async def summarize(self, ctx: Context, ev: ChunksRetrievedEvent) -> SummaryEvent:
        trimmed = self.ctx_budget.trim_to_budget(ev.chunks, reserve=500)
        summary = await self.llm.acomplete(
            "Summarize in 3 sentences:\n\n" + "\n\n".join(trimmed)
        )
        # store only the condensed summary, not raw chunks
        existing = await ctx.get("summaries", default=[])
        existing.append(str(summary))
        if len(existing) > 5:
            # roll up oldest summaries to prevent unbounded growth
            rollup = await self.llm.acomplete(
                "Merge these summaries into one:\n" + "\n".join(existing[:-2])
            )
            existing = [str(rollup)] + existing[-2:]
        await ctx.set("summaries", existing)
        return SummaryEvent(summary=str(summary))

    @step
    async def synthesize(self, ctx: Context, ev: SummaryEvent) -> StopEvent:
        summaries = await ctx.get("summaries", default=[])
        # synthesis prompt built from condensed summaries only, not raw chunks
        prompt_parts = self.ctx_budget.trim_to_budget(summaries, reserve=2_000)
        prompt = "Synthesize:\n\n" + "\n\n".join(prompt_parts) + "\n\nAnswer:"
        result = await self.llm.acomplete(prompt)
        return StopEvent(result=str(result))
RunGuard integration: WorkflowContextBudget is the per-step guard. RunGuard adds cross-run observability: it tracks token counts across all steps, fires an alert when a single workflow run exceeds your per-run budget, and surfaces the context-growth rate per step so you can identify which step is the main contributor.
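To make the cross-step tracking concrete, here is a minimal per-run token ledger. It is a hypothetical sketch of the idea (count tokens per step, flag the run once it passes a budget, report the largest contributor) and does not represent RunGuard's actual API; RunTokenLedger and its methods are invented for this example.

```python
from collections import defaultdict


class RunTokenLedger:
    """Hypothetical per-run tracker: tokens per step, budget check, top contributor."""

    def __init__(self, budget_tokens: int) -> None:
        self.budget_tokens = budget_tokens
        self.by_step: dict[str, int] = defaultdict(int)

    @property
    def total(self) -> int:
        return sum(self.by_step.values())

    def record(self, step: str, tokens: int) -> bool:
        """Add tokens for a step; return True once the run is over budget."""
        self.by_step[step] += tokens
        return self.total > self.budget_tokens

    def top_contributor(self) -> str:
        """Name of the step that has consumed the most tokens so far."""
        return max(self.by_step, key=self.by_step.get)


ledger = RunTokenLedger(budget_tokens=8_000)
ledger.record("retrieve", 1_200)
ledger.record("summarize", 2_500)
over_budget = ledger.record("synthesize", 5_100)  # running total: 8,800
print(over_budget, ledger.top_contributor())
```

Calling record from each workflow step with the prompt's token count is enough to see which step drives growth before deciding where to trim.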
Failure Mode 2: SubQuestionQueryEngine Fan-Out
SubQuestionQueryEngine is one of LlamaIndex's most powerful built-in components. Given a complex user question, it uses an LLM call (by default via LLMQuestionGenerator) to decompose the question into a set of targeted sub-questions — one per data source. Each sub-question is then executed against the appropriate query engine, and all responses are synthesized into a final answer. The architecture is clean and the results are often much better than sending the full composite question to a single query engine.
The cost problem is that N is LLM-determined and unbounded. The question generator is asked to produce "one sub-question per data source, plus any additional sub-questions needed to fully answer the question." On simple queries, this produces 3–5 sub-questions. On complex or ambiguous queries — "compare the performance, pricing, and migration complexity of these five vector databases for a RAG use case" — the LLM generates 15–25 sub-questions, often overlapping in coverage.
Each sub-question runs a full query pipeline: embed the sub-question → query the vector store → retrieve top-k chunks → run an LLM synthesis call → return the sub-answer. At 20 sub-questions, you pay 20× the per-query embedding cost, 20× the vector store read cost, and 20× the LLM synthesis cost — before the final aggregation call that synthesizes all 20 sub-answers into one response.
Python — naive SubQuestionQueryEngine
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# pg_engine, pinecone_engine, and weaviate_engine are query engines built elsewhere.
# each entry becomes a query engine that the sub-question router can target
query_engine_tools = [
    QueryEngineTool(
        query_engine=pg_engine,
        metadata=ToolMetadata(name="postgres_docs", description="PostgreSQL documentation"),
    ),
    QueryEngineTool(
        query_engine=pinecone_engine,
        metadata=ToolMetadata(name="pinecone_docs", description="Pinecone vector database docs"),
    ),
    QueryEngineTool(
        query_engine=weaviate_engine,
        metadata=ToolMetadata(name="weaviate_docs", description="Weaviate documentation"),
    ),
    # ... 5 more engines
]

# naive: no sub-question count limit
engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,  # fires all sub-queries in parallel; cost spike is instant
)
With use_async=True, all sub-queries fire in parallel. The cost spike is instantaneous and proportional to N. A 20-sub-question decomposition on a query touching 8 data sources hits all 8 query engines in the first 200ms, generating 20 simultaneous LLM API calls. On a high-load system, this can trigger rate limiting on the LLM provider, causing retries that compound the cost further.
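Separately from how many sub-questions are generated, the burst itself can be flattened by bounding how many run at once. The sketch below uses asyncio.Semaphore around a stand-in coroutine; run_bounded and fake_sub_query are hypothetical names, and wiring this into SubQuestionQueryEngine itself is not shown. Bounding concurrency does not lower the total number of calls; it only keeps them from landing on the provider in a single spike.

```python
import asyncio


async def run_bounded(sub_questions, run_sub_query, max_concurrent=4):
    """Run every sub-question, but never more than max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(question):
        async with sem:
            return await run_sub_query(question)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(q) for q in sub_questions))


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_sub_query(question):
        # stand-in for embed -> retrieve -> synthesize
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return f"answer to {question}"

    questions = [f"q{i}" for i in range(20)]
    answers = await run_bounded(questions, fake_sub_query, max_concurrent=4)
    return answers, peak


answers, peak = asyncio.run(_demo())
print(len(answers), "answers, peak concurrency:", peak)
```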
The fix is a custom question generator that hard-caps N and validates sub-questions before execution:
Python — capped SubQuestion generator
from typing import List

from llama_index.core.question_gen.types import BaseQuestionGenerator, SubQuestion
from llama_index.core.schema import QueryBundle
from llama_index.core.tools import ToolMetadata


class CappedQuestionGenerator(BaseQuestionGenerator):
    """Wraps any LLM question generator with a hard cap on sub-question count."""

    def __init__(
        self,
        base_generator: BaseQuestionGenerator,
        max_subquestions: int = 6,
        deduplicate_threshold: float = 0.85,
    ):
        self._base = base_generator
        self._max = max_subquestions
        self._dedup_thresh = deduplicate_threshold

    # PromptMixin hooks: the base class is assumed to declare these abstract
    # (verify against your installed version); the wrapper has no prompts of its own.
    def _get_prompts(self) -> dict:
        return {}

    def _update_prompts(self, prompts: dict) -> None:
        pass

    def generate(
        self, tools: List[ToolMetadata], query: QueryBundle
    ) -> List[SubQuestion]:
        questions = self._base.generate(tools, query)
        return self._cap_and_dedup(questions)

    async def agenerate(
        self, tools: List[ToolMetadata], query: QueryBundle
    ) -> List[SubQuestion]:
        questions = await self._base.agenerate(tools, query)
        return self._cap_and_dedup(questions)

    def _cap_and_dedup(self, questions: List[SubQuestion]) -> List[SubQuestion]:
        unique: List[SubQuestion] = []
        for q in questions:
            # word-level Jaccard dedup, compared only against questions for the
            # same tool: the same wording aimed at another engine is not a duplicate
            is_dup = any(
                prev.tool_name == q.tool_name
                and self._word_overlap(q.sub_question, prev.sub_question) >= self._dedup_thresh
                for prev in unique
            )
            if not is_dup:
                unique.append(q)
            if len(unique) >= self._max:
                break
        return unique

    @staticmethod
    def _word_overlap(a: str, b: str) -> float:
        # Jaccard similarity over lowercase whitespace-separated words
        a_set, b_set = set(a.lower().split()), set(b.lower().split())
        if not a_set or not b_set:
            return 0.0
        return len(a_set & b_set) / len(a_set | b_set)


# Usage (llm and query_engine_tools are defined as in the earlier examples)
from llama_index.core.question_gen import LLMQuestionGenerator

base_gen = LLMQuestionGenerator.from_defaults(llm=llm)
capped_gen = CappedQuestionGenerator(base_gen, max_subquestions=6)
guarded_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    question_gen=capped_gen,  # inject the capped generator
    use_async=True,
)
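The cap of 6 sub-questions reduces the worst-case cost from 25× to 6× the per-query baseline. Deduplication removes near-verbatim repeats aimed at the same engine. It will not merge true paraphrases such as "What is the pricing model of X?" and "How much does X cost?", because the two share almost no words; catching those requires swapping the word-overlap function for an embedding-similarity check or lowering the threshold considerably.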
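To see how word-level overlap scores those two pricing questions, the snippet below computes Jaccard similarity on both a paraphrase pair and a near-copy pair. It is a standalone sketch: word_jaccard is a made-up helper that, unlike a plain str.split, also strips punctuation before comparing.

```python
import re


def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase alphanumeric word sets."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


paraphrase = word_jaccard(
    "What is the pricing model of X?",
    "How much does X cost?",
)
near_copy = word_jaccard(
    "What is the pricing model of X?",
    "What is the pricing model for X?",
)
print(round(paraphrase, 2), round(near_copy, 2))  # 0.09 0.75
```

Neither pair reaches a 0.85 threshold: the paraphrase shares only "x", and even a one-word edit drops the near-copy to 0.75. Logging real sub-question pairs and scoring them this way is a cheap way to pick a threshold, or to decide that embedding similarity is needed instead.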
Failure Mode 3: ReActAgent Tool-Call Amplification
LlamaIndex's ReActAgent implements the standard Reasoning + Acting loop. At each step, the agent produces a thought (which tool to use and why), calls a tool, observes the result, and decides whether to continue or stop. The loop ends when the agent emits a final response or when it reaches the max_iterations parameter (default: 10, but many production deployments increase this for complex research tasks). Depending on the release, reaching the ceiling may surface as a "Reached max iterations" error rather than a best-effort answer, in which case every token spent up to that point buys nothing.
The amplification pattern emerges when tool results are insufficient. The agent's typical response to thin retrieval results is to reformulate the query and try again with a slightly different phrasing. This is reasonable behavior: "AI agent circuit breaker" might return sparse results, so the agent tries "LLM agent budget control," then "LLM infinite loop prevention," then "AI agent cost optimization." Each variation is a distinct vector store query + LLM synthesis call.
In a LlamaIndex ReActAgent backed by a VectorIndexRetriever, each tool call is a full retrieve-then-synthesize pipeline. At top-k=10 with a GPT-4o synthesis, a single tool call costs roughly $0.02–$0.05. At 30 iterations, which is common for open-ended research agents with a high max_iterations, a single user query costs $0.60–$1.50 in LLM inference alone, with no guarantee that the agent ever reaches a usable answer.
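Those per-call figures can be turned around to derive an iteration ceiling from a budget instead of picking one by feel. A minimal sketch, assuming you budget against the worst-case cost per tool call; iteration_ceiling and its hard_max parameter are hypothetical names for this example.

```python
import math


def iteration_ceiling(budget_usd: float, worst_case_call_usd: float, hard_max: int = 40) -> int:
    """Largest iteration count whose worst-case cost still fits the per-query budget."""
    if worst_case_call_usd <= 0:
        raise ValueError("worst_case_call_usd must be positive")
    # small epsilon guards against float division landing just under a whole number
    affordable = math.floor(budget_usd / worst_case_call_usd + 1e-9)
    # always allow at least one call, never exceed the hard maximum
    return max(1, min(affordable, hard_max))


ceiling = iteration_ceiling(budget_usd=0.50, worst_case_call_usd=0.05)
print(ceiling)  # 10
```

At a worst case of $0.05 per call, a $0.50 budget affords 10 iterations, while 30 iterations would need a $1.50 budget.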
Python — naive ReActAgent with high iteration ceiling
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# vector_index and llm are assumed to be defined elsewhere
research_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=10),
    name="research_db",
    description="Search the technical knowledge base",
)

# naive: high max_iterations for "thoroughness", with no budget awareness
agent = ReActAgent.from_tools(
    tools=[research_tool],
    llm=llm,
    max_iterations=40,  # allows 40 tool calls before forced stop
    verbose=True,
)
The fix requires two layers. First, cap max_iterations to a value consistent with your per-query budget. Second, add a tool-call pattern guard that detects when the agent is cycling through semantically similar queries — a sign that the current information is sufficient to synthesize an answer even if the agent doesn't think so:
Python — ReActAgent with call pattern guard
from collections import deque
from dataclasses import dataclass
from typing import Optional

from llama_index.core.agent import ReActAgent


@dataclass
class ReActBudget:
    max_iterations: int = 10
    max_similar_queries: int = 3      # consecutive near-duplicate queries -> force stop
    similarity_window: int = 5        # how many recent queries to compare against
    budget_per_run_usd: float = 0.50  # soft ceiling; enforcement is not shown in this sketch


class BudgetAwareReActAgent:
    """Thin wrapper around ReActAgent that injects query-pattern guards."""

    def __init__(
        self,
        tools,
        llm,
        budget: Optional[ReActBudget] = None,
    ):
        self.budget = budget or ReActBudget()
        self._agent = ReActAgent.from_tools(
            tools=tools,
            llm=llm,
            max_iterations=self.budget.max_iterations,
        )
        self._query_history: deque[str] = deque(maxlen=self.budget.similarity_window)
        self._iteration_count = 0
        self._similar_run = 0

    def _record_query(self, query: str) -> bool:
        """Returns True if query is a near-duplicate and should trigger an early stop."""
        # word-level Jaccard: catches near-verbatim repeats, not loose paraphrases
        words = set(query.lower().split())
        for prev in self._query_history:
            prev_words = set(prev.lower().split())
            if not prev_words:
                continue
            overlap = len(words & prev_words) / len(words | prev_words)
            if overlap >= 0.75:
                self._similar_run += 1
                if self._similar_run >= self.budget.max_similar_queries:
                    return True  # signal early stop
                break
        else:
            # no near-duplicate in the window: reset the streak
            self._similar_run = 0
        self._query_history.append(query)
        return False

    async def aquery(self, user_query: str) -> str:
        self._query_history.clear()
        self._similar_run = 0
        self._iteration_count = 0

        # Instrument the tool to intercept queries.
        # NOTE: _tools is a private attribute; where the agent keeps its tool list
        # differs between releases, so verify this against your installed version.
        original_tools = self._agent._tools
        guarded = self

        class QueryInterceptorWrapper:
            def __init__(self, inner_tool):
                self._tool = inner_tool
                self.metadata = inner_tool.metadata

            async def acall(self, *args, **kwargs):
                query_str = args[0] if args else kwargs.get("input", "")
                guarded._iteration_count += 1
                if guarded._record_query(query_str):
                    # returning a stop signal text; ReActAgent will synthesize
                    return (
                        "[RunGuard] Query pattern loop detected. "
                        "Synthesize from current information."
                    )
                return await self._tool.acall(*args, **kwargs)

        # patch tools for this run
        self._agent._tools = [QueryInterceptorWrapper(t) for t in original_tools]
        try:
            response = await self._agent.aquery(user_query)
        finally:
            self._agent._tools = original_tools
        return str(response)
Key insight: the loop signal ("synthesize from current information") triggers the agent's synthesis path without an additional tool call. The agent interprets the injected message as a tool result that explicitly advises wrapping up, which in the ReAct reasoning trace typically produces a Final Answer on the next step. This costs one additional reasoning step (cheap) rather than forcing a full additional retrieval cycle (expensive).
Failure Mode 4: QueryPipeline Validation Loops
LlamaIndex's QueryPipeline lets you compose arbitrary query processing graphs: retrieval → reranking → LLM synthesis → validation → conditional branch back to retrieval. The graph structure is explicit and readable, which makes it easy to build sophisticated multi-step pipelines. Once a conditional link points back at an earlier node, the graph is no longer acyclic, and it becomes just as easy to build pipelines that loop.
The looping pattern emerges in pipelines that include a grading or validation step. A common pattern is: retrieve → synthesize → grade (does the answer address the question?) → if grade is "insufficient," refine the query and retry. This is the RAG self-reflection pattern and it genuinely improves answer quality on difficult questions. The problem is that the grading LLM call grades the same marginal information slightly differently on each pass — especially when the question is inherently difficult or the knowledge base doesn't contain a strong answer.
A validator that grades "insufficient" 3 consecutive times before settling on "sufficient" adds 3 full retrieve-and-synthesize cycles to every difficult query. At 500 difficult queries per day, that's 1,500 extra LLM synthesis calls and 1,500 extra embedding queries — an additional layer of cost that scales with question difficulty rather than question count.
Python — naive validation loop in QueryPipeline
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_pipeline import InputComponent, Link, QueryPipeline
from llama_index.core.response_synthesizers import get_response_synthesizer

# LLMGrader and QueryRefiner stand for user-defined pipeline components;
# they are not defined in this post.


# pipeline with an unbounded self-correction loop
def build_rag_pipeline(retriever, llm, validator_llm):
    pipeline = QueryPipeline(verbose=True)
    pipeline.add_modules({
        "input": InputComponent(),
        "retriever": retriever,
        "reranker": LLMRerank(top_n=3, llm=llm),
        "synthesizer": get_response_synthesizer(llm=llm),
        "validator": LLMGrader(llm=validator_llm),  # grades: sufficient / insufficient
        "refiner": QueryRefiner(llm=llm),           # rewrites query if insufficient
    })
    pipeline.add_links([
        Link("input", "retriever"),
        Link("retriever", "reranker"),
        Link("reranker", "synthesizer"),
        Link("synthesizer", "validator"),
        # conditional back-edge: loop if grade == "insufficient"
        Link("validator", "retriever", condition_fn=lambda x: x.grade == "insufficient"),
        Link("validator", "output", condition_fn=lambda x: x.grade == "sufficient"),
    ])
    return pipeline
Without a maximum iteration count on the loop, a query where the knowledge base is genuinely thin will cycle until it hits the underlying LLM's context limit or a timeout. The fix is a loop counter injected into the pipeline state:
Python — QueryPipeline with iteration limit
from typing import Any, Dict, Set

from llama_index.core.bridge.pydantic import Field, PrivateAttr
from llama_index.core.query_pipeline import CustomQueryComponent

# QueryPipeline, InputComponent, Link, LLMRerank, and get_response_synthesizer
# are imported as in the naive example; LLMGrader is a user-defined component.


class IterationGuard(CustomQueryComponent):
    """Pipeline component that forces an exit once max_iterations passes were graded insufficient."""

    max_iterations: int = Field(default=3, description="Max refinement loops")
    # private state on a pydantic model needs PrivateAttr
    _count: int = PrivateAttr(default=0)

    def reset(self) -> None:
        """Call before each pipeline run so the counter is per-query."""
        self._count = 0

    def _validate_component_inputs(self, input: Dict[str, Any]) -> Dict[str, Any]:
        return input

    def _run_component(self, **kwargs: Any) -> Dict[str, Any]:
        grade = kwargs.get("grade", "sufficient")
        if grade == "insufficient":
            self._count += 1
            if self._count >= self.max_iterations:
                # override grade to force pipeline to exit
                return {"grade": "sufficient", "forced": True}
        return {"grade": grade, "forced": False}

    @property
    def _input_keys(self) -> Set[str]:
        return {"grade"}

    @property
    def _output_keys(self) -> Set[str]:
        return {"grade", "forced"}


def build_guarded_rag_pipeline(retriever, llm, validator_llm, max_loops: int = 3):
    pipeline = QueryPipeline(verbose=True)
    guard = IterationGuard(max_iterations=max_loops)
    pipeline.add_modules({
        "input": InputComponent(),
        "retriever": retriever,
        "reranker": LLMRerank(top_n=3, llm=llm),
        "synthesizer": get_response_synthesizer(llm=llm),
        "validator": LLMGrader(llm=validator_llm),
        "guard": guard,  # inserted between validator and the branch decision
    })
    pipeline.add_links([
        Link("input", "retriever"),
        Link("retriever", "reranker"),
        Link("reranker", "synthesizer"),
        Link("synthesizer", "validator"),
        Link("validator", "guard"),
        # branch on guard output, not raw validator output
        Link("guard", "retriever", condition_fn=lambda x: x["grade"] == "insufficient"),
        Link("guard", "output", condition_fn=lambda x: x["grade"] == "sufficient"),
    ])
    return pipeline
With max_loops=3, a difficult query that would have looped 8 times now exits after 3 refinement cycles. The answer quality is marginally lower on the fraction of queries that genuinely needed more cycles, but the cost reduction is immediate and proportional to how often the validator grades "insufficient."
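The same guard is easy to reason about outside any framework. The sketch below bounds the retrieve, synthesize, grade, refine cycle with a plain loop; answer_with_bounded_passes and its callables (run_cycle, grade, refine) are hypothetical stand-ins for your own retrieval, grading, and query-rewriting steps. Here max_passes counts total retrieve-and-synthesize passes, including the first one.

```python
from typing import Callable, List, Tuple


def answer_with_bounded_passes(
    query: str,
    run_cycle: Callable[[str], str],    # retrieve + synthesize, returns an answer
    grade: Callable[[str, str], bool],  # True when the answer is sufficient
    refine: Callable[[str, str], str],  # rewrite the query given a weak answer
    max_passes: int = 3,
) -> Tuple[str, int, bool]:
    """Return (answer, passes_run, forced_exit)."""
    if max_passes < 1:
        raise ValueError("max_passes must be at least 1")
    answer = ""
    for attempt in range(1, max_passes + 1):
        answer = run_cycle(query)
        if grade(query, answer):
            return answer, attempt, False
        if attempt < max_passes:
            query = refine(query, answer)
    # budget exhausted: return the last answer and flag it as forced
    return answer, max_passes, True


calls: List[str] = []


def fake_cycle(q: str) -> str:
    calls.append(q)
    return f"draft for: {q}"


answer, passes, forced = answer_with_bounded_passes(
    "hard question",
    run_cycle=fake_cycle,
    grade=lambda q, a: False,          # validator that never accepts
    refine=lambda q, a: q + " (refined)",
    max_passes=3,
)
print(passes, forced, len(calls))  # 3 True 3
```

Returning the forced flag alongside the answer lets the caller log or label answers that hit the cap, which is the data you need to decide whether the cap is too tight.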
Composite LlamaIndex Cost Policy
The four guards combine into a single policy class that can be applied once and propagated to every component in your LlamaIndex application:
Python
from dataclasses import dataclass


@dataclass
class LlamaIndexPolicy:
    # Workflow context limits
    max_context_tokens: int = 8_000
    max_chunks_per_step: int = 8
    max_summaries_before_rollup: int = 5  # not yet consumed by the workflow guard above
    # SubQuestion limits
    max_subquestions: int = 6
    subquestion_dedup_threshold: float = 0.85
    # ReActAgent limits
    max_react_iterations: int = 10
    max_similar_queries: int = 3
    react_query_similarity_threshold: float = 0.75  # the agent guard hard-codes this value
    # QueryPipeline limits
    max_pipeline_refinement_loops: int = 3
    # Cross-cutting budget (enforced by an external cost tracker, not by these guards)
    max_cost_per_run_usd: float = 1.00
    alert_at_fraction: float = 0.75  # alert at 75% of budget

    def apply_to_workflow(self, **workflow_kwargs) -> "GuardedResearchWorkflow":
        # GuardedResearchWorkflow is a Workflow subclass, not a wrapper, so the policy
        # constructs it; workflow_kwargs (e.g. timeout) are passed to Workflow.__init__.
        return GuardedResearchWorkflow(
            context_budget=WorkflowContextBudget(
                max_context_tokens=self.max_context_tokens,
                max_chunks_per_step=self.max_chunks_per_step,
            ),
            **workflow_kwargs,
        )

    def apply_to_subquestion_engine(
        self, engine: "SubQuestionQueryEngine"
    ) -> "SubQuestionQueryEngine":
        # _question_gen is a private attribute; verify it exists in your installed version
        gen = CappedQuestionGenerator(
            engine._question_gen,
            max_subquestions=self.max_subquestions,
            deduplicate_threshold=self.subquestion_dedup_threshold,
        )
        engine._question_gen = gen
        return engine

    def apply_to_react_agent(self, tools, llm) -> "BudgetAwareReActAgent":
        return BudgetAwareReActAgent(
            tools=tools,
            llm=llm,
            budget=ReActBudget(
                max_iterations=self.max_react_iterations,
                max_similar_queries=self.max_similar_queries,
            ),
        )

    def apply_to_pipeline(self, retriever, llm, validator_llm) -> "QueryPipeline":
        # QueryPipeline does not expose its components as attributes,
        # so the caller passes them in explicitly
        return build_guarded_rag_pipeline(
            retriever=retriever,
            llm=llm,
            validator_llm=validator_llm,
            max_loops=self.max_pipeline_refinement_loops,
        )


# One policy object, applied at initialization
policy = LlamaIndexPolicy(
    max_context_tokens=6_000,
    max_subquestions=5,
    max_react_iterations=8,
    max_pipeline_refinement_loops=2,
    max_cost_per_run_usd=0.75,
)
guarded_workflow = policy.apply_to_workflow(timeout=120)
guarded_sq_engine = policy.apply_to_subquestion_engine(sq_engine)
guarded_agent = policy.apply_to_react_agent(tools=tools, llm=llm)
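The cross-cutting budget fields (max_cost_per_run_usd and alert_at_fraction) need something that accumulates spend during a run. One possible shape is sketched below; RunCostMeter and its "ok" / "alert" / "stop" states are hypothetical and only illustrate how the two thresholds interact.

```python
from dataclasses import dataclass


@dataclass
class RunCostMeter:
    """Accumulates spend for one run and reports where it stands against the budget."""
    max_cost_usd: float = 1.00
    alert_at_fraction: float = 0.75
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> str:
        """Record a cost and return 'ok', 'alert', or 'stop'."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.max_cost_usd:
            return "stop"
        if self.spent_usd >= self.max_cost_usd * self.alert_at_fraction:
            return "alert"
        return "ok"


meter = RunCostMeter(max_cost_usd=0.75, alert_at_fraction=0.75)
states = [meter.charge(0.20) for _ in range(4)]
print(states)  # ['ok', 'ok', 'alert', 'stop']
```

With a $0.75 budget the alert fires at about $0.56, so the third $0.20 charge raises the alert and the fourth crosses the hard limit. Each guard can call charge after its LLM call and stop early on "stop".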
Observed Impact: Guarded vs Unguarded LlamaIndex
| Pattern | Unguarded cost/query | Guarded cost/query | Reduction |
|---|---|---|---|
| Workflow (5-step, 10 chunks/step) | ~$0.28 (raw chunks in synthesis) | ~$0.06 (condensed summaries) | 79% |
| SubQuestionQueryEngine (complex query, 8 tools) | ~$0.95 (18 sub-questions avg) | ~$0.28 (5 sub-questions cap) | 71% |
| ReActAgent (research task, thin KB) | ~$1.40 (28 iterations avg) | ~$0.35 (8 iterations + loop guard) | 75% |
| QueryPipeline with validator (difficult query) | ~$0.60 (6 refinement loops avg) | ~$0.25 (2 loops cap) | 58% |
The most impactful guard for most LlamaIndex deployments is the ReActAgent iteration cap — it covers the worst-case scenario (open-ended research on a sparse knowledge base) and the cost reduction is linear with the cap ratio. The SubQuestion fan-out cap matters most for applications with many query engine tools and complex user queries.
Why LlamaIndex's Architecture Creates Distinct Cost Exposure
LlamaIndex's retrieve-first architecture means that nearly every LLM call is preceded by a retrieval step. In a pure LangChain chain or a pure LLM API call, the input token count is bounded by your prompt template size. In LlamaIndex, it's bounded by your retrieval configuration — and retrieval configuration is often set optimistically (top-k=10, similarity threshold=0.5) to maximize answer coverage rather than cost efficiency.
This makes LlamaIndex uniquely susceptible to compounding cost growth: more sub-questions → more retrievals → more LLM synthesis calls → more context accumulation per step → higher final synthesis cost. Each dimension multiplies the others. A 3× increase in sub-questions, combined with a 2× increase in chunks-per-retrieval and a 2× context accumulation factor, produces a 12× cost increase — all from settings that each seem individually reasonable.
The guards in this post address each dimension independently. Applied together under LlamaIndexPolicy, they break the compounding relationship and make per-query cost predictable and proportional to query complexity rather than vulnerable to adversarial or edge-case inputs.
LlamaIndex version note: The Workflow class (llama_index.core.workflow) was introduced during the v0.10 release series. The guards in Failure Mode 1 apply to Workflow-based applications. ReActAgent, SubQuestionQueryEngine, and QueryPipeline predate Workflow, but the llama_index.core import paths used throughout this post require v0.10 or later. All patterns in this post target the v0.10 and v0.11 APIs; the QueryPipeline conditional branch API may vary slightly in v0.12+.
FAQ
Does the SubQuestion cap degrade answer quality on complex queries?
On complex queries with many data sources, yes, marginally. A 6-sub-question cap means the engine runs the first 6 distinct sub-questions the generator emits rather than all 18 the LLM would generate. In practice, the top 5–6 sub-questions account for 80–90% of the information coverage. The remaining sub-questions tend to be variations or edge cases that the base synthesis handles adequately from the information already retrieved. For queries where coverage completeness is critical (legal research, medical literature review), raise the cap to 10–12 and offset with deduplication, which eliminates near-duplicate sub-questions without reducing coverage of distinct angles.
The ReActAgent loop guard injects a stop signal as a tool result — does this confuse the agent?
Usually not. In a ReAct loop every tool result is just an observation, so a result that says "synthesize from current information" reads to the model as advice to wrap up, and the next reasoning step typically produces a Final Answer. That behavior depends on the model following the hint rather than on a framework guarantee, so keep max_iterations as the hard backstop. The answer on loop-detected queries is synthesized from whatever the agent collected before the loop detection fired.
How does context accumulation in LlamaIndex Workflows compare to LangChain's message history?
They're structurally different problems. LangChain message history accumulates LLM conversation turns (user/assistant pairs) in a linear chat buffer. LlamaIndex Workflow context accumulates intermediate artifacts across pipeline steps (chunks, summaries, tool results) in a key-value store. The LangChain problem is best solved with a sliding window or summary buffer memory. The LlamaIndex problem requires selective context writes per step — you choose what to deposit in context after each step, rather than automatically appending everything. Both require explicit management; neither defaults to bounded behavior.
Does the QueryPipeline iteration cap work with async pipelines?
Yes, with a small adjustment. The IterationGuard component shown uses synchronous _run_component. For async pipelines using pipeline.arun(), implement _arun_component alongside — it can simply await asyncio.to_thread the synchronous logic, or replicate the counter logic directly in the async method. The iteration counter must be per-run rather than shared across concurrent queries; initialize it inside a run context or use a thread-local / async-task-local variable if you're running many concurrent pipeline evaluations.
What's the easiest first fix to apply to an existing LlamaIndex application?
Set an explicit max_iterations on your ReActAgent if you haven't already, and reduce it to 8 if it's above 15. This is a one-line change that eliminates the worst-case cost scenario for agents on thin knowledge bases. Next, audit your SubQuestionQueryEngine configuration and add a CappedQuestionGenerator — this covers the fan-out failure mode that tends to cause the biggest unexpected cost spikes. The context accumulation fix requires slightly more restructuring but delivers the most consistent per-query cost reduction across all workflow patterns.