Multi-agent orchestration cost control: why costs grow non-linearly and how to cap every agent independently
Multi-agent systems multiply both capability and cost. An orchestrator that spawns five worker agents for a single task consumes at least five times the tokens of a single agent, before counting the orchestrator’s own calls. If one worker enters a tool-call loop and the orchestrator retries it, the loop is amplified: the looping worker accumulates cost on every iteration, and the orchestrator pays for each delegation call on top. Teams running OpenAI Swarm, AutoGen multi-agent pipelines, CrewAI hierarchical crews, or LangGraph multi-node graphs all face the same structural problem: cost grows non-linearly with the number of agents, the number of delegation levels, and the number of parallel workers. A single orchestrator-level budget cap does not solve this. It can only stop the pipeline after the orchestrator itself has crossed its limit, by which point the workers may already have done most of the damage. The only reliable defence is per-agent budget caps that fire inside each agent’s own execution context, combined with loop detection that catches repeated delegation patterns before they compound.
The cost amplification math
The numbers make the problem concrete. Consider a simple agent that costs $0.30 per LLM call, with a maximum of 20 iterations before a hard stop:
- Single agent, no orchestration. Maximum cost exposure is $0.30 × 20 iterations = $6.00. A bug that causes the agent to loop runs up a $6 bill before the iteration limit fires.
- 5-agent pipeline (orchestrator + 4 workers). If every agent can run up to 20 iterations, the maximum exposure for one pass is $0.30 × 20 × 5 agents = $30.00. If a worker loops and the orchestrator re-delegates, each retry starts a fresh 20-iteration budget, so the exposure climbs by up to another $6.00 per retry on top of the orchestrator’s own calls.
- 3-level hierarchy (orchestrator → manager → workers). With three delegation levels and a fan-out of three workers per manager, a rough worst-case estimate is $0.30 × 20 iterations × 3 levels × 3 fan-out = $54.00 per manager branch, which puts the full tree at roughly $90–$180 depending on how many manager branches run. A single runaway loop at the worker level propagates up through two delegation layers before any level-specific cap fires.
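The arithmetic above can be reproduced with a small calculator. This is a sketch under simplifying assumptions: every agent costs the same per call, every agent runs to its iteration limit, and a retry re-runs every agent once more. The function names and the 1 → 3 → 9 tree shape are hypothetical, chosen only for illustration.

```python
def tree_agent_count(fan_outs: list[int]) -> int:
    """Count agents in a delegation tree; fan_outs[i] is children per agent at level i."""
    total, level_size = 1, 1  # start with the single root orchestrator
    for fan_out in fan_outs:
        level_size *= fan_out
        total += level_size
    return total


def max_exposure(cost_per_call: float, max_iterations: int,
                 n_agents: int = 1, retries: int = 0) -> float:
    """Worst-case spend if every agent hits its iteration limit on every attempt."""
    return cost_per_call * max_iterations * n_agents * (1 + retries)


single = max_exposure(0.30, 20)                       # one agent, no orchestration
flat = max_exposure(0.30, 20, n_agents=5)             # orchestrator + 4 workers
tree_agents = tree_agent_count([3, 3])                # 1 orchestrator, 3 managers, 9 workers
tree = max_exposure(0.30, 20, n_agents=tree_agents)
tree_retried = max_exposure(0.30, 20, n_agents=tree_agents, retries=1)

print(f"single: ${single:.2f}, flat: ${flat:.2f}")
print(f"tree ({tree_agents} agents): ${tree:.2f}, with one retry each: ${tree_retried:.2f}")
```

Counting agents explicitly makes it easy to see which term dominates: adding a delegation level multiplies the agent count, and each retry multiplies the whole total again.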
The critical insight is why capping at the orchestrator level is insufficient: if a worker agent enters a tool-call loop, the orchestrator does not see the individual LLM calls inside the worker — it only sees whether the worker returned a result. The orchestrator keeps re-delegating to the looping worker, each delegation spawning a fresh iteration budget, until the orchestrator itself hits its own cap. By then, the worker has run its full iteration budget on every delegation attempt. The correct architecture caps each agent at the point where its LLM calls are made, not at the level above it in the delegation hierarchy.
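A short simulation makes the difference concrete. It is a sketch, not a model of any particular framework: the worker never converges, costs are tracked in integer cents to avoid rounding noise, and the helper names are hypothetical. It compares a worker that gets a fresh iteration budget on every delegation with a worker whose spend cap persists across delegations.

```python
from typing import Optional

COST_PER_CALL_CENTS = 30   # $0.30 per LLM call, as in the figures above
MAX_ITERATIONS = 20        # hard iteration limit per delegation


def delegate_to_looping_worker(worker_spent: int, worker_cap: Optional[int]) -> int:
    """Simulate one delegation to a worker that never converges.

    Returns the cents spent during this delegation. With a cap, the worker
    stops as soon as the next call would push its lifetime spend past the cap.
    """
    spent_now = 0
    for _ in range(MAX_ITERATIONS):
        if worker_cap is not None and worker_spent + spent_now + COST_PER_CALL_CENTS > worker_cap:
            break
        spent_now += COST_PER_CALL_CENTS
    return spent_now


def total_worker_spend(delegations: int, worker_cap: Optional[int]) -> int:
    """Total cents one looping worker burns across repeated delegations."""
    total = 0
    for _ in range(delegations):
        total += delegate_to_looping_worker(total, worker_cap)
    return total


uncapped = total_worker_spend(delegations=4, worker_cap=None)  # fresh budget every time
capped = total_worker_spend(delegations=4, worker_cap=500)     # persistent $5.00 cap
print(f"uncapped: ${uncapped / 100:.2f}, capped: ${capped / 100:.2f}")
# prints: uncapped: $24.00, capped: $4.80
```

With the persistent cap, the second, third, and fourth delegations cost nothing because the worker refuses to make another call, which is the behaviour a per-agent cap is meant to provide.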
The three multi-agent cost failure modes
- Cascading delegation. The orchestrator sends a task to a worker. The worker fails (wrong output format, tool error, incomplete result). The orchestrator re-delegates to a second worker with more context added to the task description (the first worker’s failed output, additional instructions, error details). The second worker also fails. The orchestrator re-delegates to a third worker or retries the first with even more context. Each delegation cycle costs more than the previous one because the task description grows with accumulated failure context. After four delegation attempts, the orchestrator’s context has grown to include three failed outputs, three sets of additional instructions, and the original task — each of which adds to the input token count for the next delegation call. The orchestrator-level cost grows super-linearly with each delegation round even if each worker’s individual cost is bounded.
- Cross-agent context explosion. A common multi-agent pattern: the parent agent passes its full accumulated context to each child; each child’s response is appended back to the parent’s context; the parent passes the now-larger context to the next child. After processing three children, the parent’s context contains all three children’s outputs in addition to its original state. If each child’s output is roughly as large as the original context, the fourth child receives about 4× the context the first child received. Input token cost scales linearly with context size, so the fourth child costs about 4× the first, and in an 8-agent pipeline where every output is fed back to the orchestrator, the eighth call costs about 8× the first. Without per-agent tracking, this growth is invisible until the invoice arrives.
- Redundant parallel work. The orchestrator spawns N workers in parallel for the same subtask, a common choice in fan-out patterns where multiple perspectives or approaches are desired. All N workers complete and return similar results (because they were given the same task and the same context). The orchestrator then spawns N more workers to validate or synthesize the first round of results. Without deduplication logic and without per-agent caps, the orchestrator may launch multiple rounds of N-parallel workers, paying N× the per-worker cost per round. If the validation workers also determine the synthesis is insufficient and trigger another round, costs compound geometrically. This pattern is especially common in AutoGen’s GroupChat and LangGraph’s map-reduce subgraphs.
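The context-explosion case is easy to quantify. The sketch below assumes a hypothetical 2,000-token starting context and 2,000-token child outputs, and uses the $2.50 per million input tokens price that appears in this article’s examples; only input tokens are counted.

```python
INPUT_PRICE_PER_MTOK = 2.50  # USD per million input tokens (price used in this article)


def child_input_tokens(base_tokens: int, output_tokens: int, n_children: int) -> list[int]:
    """Input size seen by each child when every earlier output is appended to the context."""
    return [base_tokens + i * output_tokens for i in range(n_children)]


def input_cost_usd(tokens: int) -> float:
    """Input-side cost of a single call with the given prompt size."""
    return tokens * INPUT_PRICE_PER_MTOK / 1_000_000


# Hypothetical sizes: 2,000-token starting context, 2,000-token child outputs.
sizes = child_input_tokens(base_tokens=2_000, output_tokens=2_000, n_children=8)
for i, tokens in enumerate(sizes, start=1):
    print(f"child {i}: {tokens:>6} input tokens, ${input_cost_usd(tokens):.4f}")

growth = sizes[-1] / sizes[0]            # last child's input relative to the first
total_accumulating = sum(sizes)          # context is fed back after every child
total_isolated = sizes[0] * len(sizes)   # every child receives only the base context
print(f"last/first: {growth:.0f}x, total tokens: {total_accumulating} vs {total_isolated}")
```

The per-call ratio grows linearly, but the total across the pipeline grows quadratically with the number of children, which is why passing each child only the context it needs is worth the extra plumbing.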
Per-agent budget caps with RunGuard: CrewAI example
The solution is to give each agent its own BudgetTracker and LoopDetector instance. When a worker raises LoopDetectedError or BudgetExceededError, the orchestrator catches it and handles it explicitly — rather than silently retrying with a fresh iteration budget. This is the crucial distinction: silent retries compound cost; explicit exceptions let the orchestrator make a policy decision (abort, degrade gracefully, escalate to human) without spending another cent on the broken worker.
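To make the loop-detection idea concrete, here is a minimal signature-based detector. It is an illustrative stand-in, not RunGuard’s implementation: the class and exception names are hypothetical, and the only behaviour borrowed from this article is the pair of parameters (repeats and max_cycle_len). It raises as soon as the most recent calls consist of one short cycle of signatures repeated the configured number of times.

```python
class LoopDetected(Exception):
    """Hypothetical stand-in for a typed loop exception."""

    def __init__(self, pattern: list[str], repeats: int):
        super().__init__(f"pattern {pattern} repeated {repeats}x")
        self.pattern = pattern
        self.repeats = repeats


class SketchLoopDetector:
    """Flags a cycle of up to max_cycle_len signatures repeated `repeats` times in a row."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 4):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        self.history: list[str] = []

    def record(self, sig: str) -> None:
        self.history.append(sig)
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * self.repeats
            if len(self.history) < window:
                break  # longer cycles need even more history
            tail = self.history[-window:]
            cycle = tail[:cycle_len]
            if tail == cycle * self.repeats:
                raise LoopDetected(cycle, self.repeats)


detector = SketchLoopDetector(repeats=3, max_cycle_len=4)
caught = None
try:
    for sig in ["search", "fetch", "search", "fetch", "search", "fetch", "search"]:
        detector.record(sig)
except LoopDetected as e:
    caught = e
    print(f"halted after {len(detector.history)} calls: {e}")
```

Because the exception carries the repeated pattern, the orchestrator can decide what to do with it instead of blindly re-delegating.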
Python: CrewAI-like pipeline with per-agent guards

```python
from runguard import BudgetTracker, LoopDetector, guard, BudgetExceededError, LoopDetectedError
import openai

client = openai.OpenAI()

# Prices in USD per million tokens (GPT-4o as of 2026)
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00


def make_llm_call(messages: list, sig_fn=None) -> dict:
    """Base LLM call that returns response + cost metadata."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    usage = response.usage
    usd = (usage.prompt_tokens * INPUT_PRICE + usage.completion_tokens * OUTPUT_PRICE) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    if sig_fn:
        sig = sig_fn(response)
    return {"response": response, "usd": usd, "sig": sig}


# Each agent gets independent guard instances
orchestrator_guard = guard(
    make_llm_call,
    budget={"max_usd": 8.0},  # orchestrator gets $8; it makes fewer calls
    loop={"repeats": 3, "max_cycle_len": 6},
)
researcher_guard = guard(
    make_llm_call,
    budget={"max_usd": 5.0},  # researcher: $5 cap, the most expensive role
    loop={"repeats": 3, "max_cycle_len": 4},
)
writer_guard = guard(
    make_llm_call,
    budget={"max_usd": 3.0},  # writer: $3 cap
    loop={"repeats": 3, "max_cycle_len": 4},
)


def run_researcher(topic: str) -> str:
    """Researcher agent with its own budget and loop guard."""
    messages = [
        {"role": "system", "content": "You are a research analyst. Use search tools to gather information."},
        {"role": "user", "content": f"Research this topic thoroughly: {topic}"},
    ]
    result = researcher_guard(messages)
    return result["response"].choices[0].message.content


def run_writer(research: str, brief: str) -> str:
    """Writer agent with its own budget and loop guard."""
    messages = [
        {"role": "system", "content": "You are a professional writer. Turn research into clear prose."},
        {"role": "user", "content": f"Brief: {brief}\n\nResearch notes:\n{research}"},
    ]
    result = writer_guard(messages)
    return result["response"].choices[0].message.content


def run_pipeline(topic: str, brief: str) -> dict:
    """Orchestrator that catches worker errors explicitly instead of retrying silently."""
    try:
        research = run_researcher(topic)
    except LoopDetectedError as e:
        # Researcher entered a loop: use a reduced, focused fallback task
        print(f"Researcher loop detected ({e.pattern} repeated {e.repeats}x). Using fallback.")
        research = f"[Research incomplete due to loop at step: {e.pattern}]"
    except BudgetExceededError as e:
        # Researcher exceeded its $5 cap: proceed with what was accumulated
        print(f"Researcher budget exceeded (${e.spent:.4f}). Proceeding with partial research.")
        research = getattr(e, "partial_output", "[Research budget exhausted]")

    try:
        article = run_writer(research, brief)
    except (LoopDetectedError, BudgetExceededError) as e:
        print(f"Writer halted: {e}")
        return {"status": "partial", "research": research, "article": None, "error": str(e)}

    return {"status": "complete", "research": research, "article": article}
```
The key difference from a naive implementation: each guard() call is a separate instance with its own accumulated spend counter. The researcher’s $5 budget and the writer’s $3 budget are independent — the researcher exhausting its budget does not affect the writer’s remaining headroom. The orchestrator catches typed exceptions from each worker and makes an explicit decision, rather than letting the worker’s error propagate silently or triggering an automatic retry that starts the budget clock over.
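The independence of the spend counters can be shown with a toy tracker. The class below is a hypothetical stand-in written for this illustration (it is not RunGuard’s BudgetTracker, though it mirrors the check/record/spent names used elsewhere in this article), and the dollar amounts are made up.

```python
class BudgetExceeded(Exception):
    """Hypothetical stand-in for a typed budget exception."""

    def __init__(self, spent: float, cap: float):
        super().__init__(f"spent ${spent:.2f} of ${cap:.2f} cap")
        self.spent = spent
        self.cap = cap


class SketchBudget:
    """Tracks cumulative spend for one agent and refuses further calls at the cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent = 0.0

    def check(self) -> None:
        """Call before each LLM request; raises once the cap has been reached."""
        if self.spent >= self.cap_usd:
            raise BudgetExceeded(self.spent, self.cap_usd)

    def record(self, usd: float) -> None:
        """Call after each LLM request with its actual cost."""
        self.spent += usd


researcher = SketchBudget(cap_usd=5.0)
writer = SketchBudget(cap_usd=3.0)

researcher_halted = False
try:
    for call_cost in [2.0, 2.0, 1.5, 1.0]:   # made-up per-call costs
        researcher.check()
        researcher.record(call_cost)
except BudgetExceeded as e:
    researcher_halted = True
    print(f"researcher halted: {e}")

writer.check()  # does not raise: the writer's counter is untouched
print(f"writer headroom: ${writer.cap_usd - writer.spent:.2f}")
```

Note that a check-before-call tracker like this one can overshoot its cap by at most one call, since the cost is only known after the call returns.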
For a deeper look at per-role budgets in CrewAI specifically, see CrewAI budget per agent.
LangGraph multi-agent cost control
LangGraph’s StateGraph model makes it straightforward to give each node its own guard. Because nodes are just Python callables, you can route each node’s LLM calls through a guard()-wrapped function before passing the node to graph.add_node(). A shared BudgetTracker at the graph level gives you a single total spend counter across all nodes, while per-node loop detectors and per-call ceilings limit how quickly any single node can draw down that shared budget.
Python: LangGraph StateGraph with per-node RunGuard

```python
from langgraph.graph import StateGraph, END
from runguard import BudgetTracker, LoopDetector, guard, BudgetExceededError, LoopDetectedError
from typing import TypedDict
import openai

client = openai.OpenAI()


class GraphState(TypedDict):
    task: str
    research: str
    draft: str
    final: str
    error: str


# Shared graph-level budget: total session cap across all nodes
graph_budget = BudgetTracker(cap_usd=20.0)


def make_call(messages: list) -> dict:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    usage = response.usage
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    return {"response": response, "usd": usd, "sig": sig}


# Per-node guards share the graph_budget tracker, but have individual loop detectors
researcher_guard = guard(
    make_call,
    budget=graph_budget,                      # shared session cap
    loop={"repeats": 3, "max_cycle_len": 4},  # independent loop detector
    per_call_budget={"max_usd": 0.50},        # single-call hard ceiling
)
writer_guard = guard(
    make_call,
    budget=graph_budget,
    loop={"repeats": 3, "max_cycle_len": 4},
    per_call_budget={"max_usd": 0.50},
)
reviewer_guard = guard(
    make_call,
    budget=graph_budget,
    loop={"repeats": 2, "max_cycle_len": 3},
    per_call_budget={"max_usd": 0.30},
)


def researcher_fn(state: GraphState) -> GraphState:
    try:
        result = researcher_guard([
            {"role": "system", "content": "You are a research analyst."},
            {"role": "user", "content": f"Research: {state['task']}"},
        ])
        return {**state, "research": result["response"].choices[0].message.content}
    except (LoopDetectedError, BudgetExceededError) as e:
        return {**state, "research": "", "error": f"researcher: {e}"}


def writer_fn(state: GraphState) -> GraphState:
    if state.get("error"):
        return state
    try:
        result = writer_guard([
            {"role": "system", "content": "You are a content writer."},
            {"role": "user", "content": f"Task: {state['task']}\nResearch: {state['research']}"},
        ])
        return {**state, "draft": result["response"].choices[0].message.content}
    except (LoopDetectedError, BudgetExceededError) as e:
        return {**state, "draft": "", "error": f"writer: {e}"}


def reviewer_fn(state: GraphState) -> GraphState:
    if state.get("error"):
        return state
    try:
        result = reviewer_guard([
            {"role": "system", "content": "You are an editor. Review for accuracy and clarity."},
            {"role": "user", "content": f"Review this draft:\n{state['draft']}"},
        ])
        return {**state, "final": result["response"].choices[0].message.content}
    except (LoopDetectedError, BudgetExceededError) as e:
        return {**state, "final": state["draft"], "error": f"reviewer: {e}"}


# Build the graph
graph = StateGraph(GraphState)
graph.add_node("researcher", researcher_fn)
graph.add_node("writer", writer_fn)
graph.add_node("reviewer", reviewer_fn)
graph.set_entry_point("researcher")
graph.add_edge("researcher", "writer")
graph.add_edge("writer", "reviewer")
graph.add_edge("reviewer", END)
pipeline = graph.compile()

result = pipeline.invoke({
    "task": "Summarize recent AI safety research",
    "research": "",
    "draft": "",
    "final": "",
    "error": "",
})
if result.get("error"):
    print(f"Pipeline halted: {result['error']}")
else:
    print(f"Final output: {result['final']}")
```
The graph_budget instance is shared across all three node guards. If the researcher and writer together spend $18 of the $20 session cap, the reviewer has only $2 of headroom before the shared tracker raises BudgetExceededError. This prevents any combination of node spend from exceeding the total session budget. Per-node protection in this example comes from each guard’s own loop detector and per_call_budget ceiling; a node that also needs a hard cumulative cap would need its own BudgetTracker in addition to the shared one.
AutoGen multi-agent cost control
AutoGen’s ConversableAgent and AssistantAgent call the model via a configurable client. The cleanest injection point for RunGuard is the model client’s call method, which all agents share. For per-agent cost tracking, each agent needs its own guard instance wrapping its own model client call, so that their spend counters are independent.
Python: AutoGen v0.2 with per-agent guards

```python
import contextvars

import openai as _openai
from autogen import ConversableAgent
from runguard import guard, BudgetExceededError, LoopDetectedError

_orig_create = _openai.chat.completions.create


def make_call(messages, **kwargs):
    response = _orig_create(messages=messages, **kwargs)
    usage = response.usage
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    return {"response": response, "usd": usd, "sig": sig}


# Independent guards for each agent role
researcher_guard = guard(
    make_call,
    budget={"max_usd": 4.0},
    loop={"repeats": 3, "max_cycle_len": 5},
)
writer_guard = guard(
    make_call,
    budget={"max_usd": 2.0},
    loop={"repeats": 3, "max_cycle_len": 4},
)

# Patch per-agent: store the active guard in a context variable
_active_guard = contextvars.ContextVar("active_guard", default=None)


def patched_create(messages, **kwargs):
    active = _active_guard.get()
    if active:
        return active(messages, **kwargs)["response"]
    return _orig_create(messages=messages, **kwargs)


# Note: this patch only intercepts calls made through the module-level openai
# client. If your AutoGen version builds its own client objects, apply the same
# wrapper to that client's create method instead.
_openai.chat.completions.create = patched_create


def run_agent_with_guard(agent: ConversableAgent, guard_instance, task: str):
    """Run an AutoGen agent with a specific guard active in the call context."""
    token = _active_guard.set(guard_instance)
    try:
        result = agent.generate_reply(
            messages=[{"role": "user", "content": task}]
        )
        return result
    except LoopDetectedError as e:
        return f"[loop detected: {e.pattern}]"
    except BudgetExceededError as e:
        return f"[budget exceeded: ${e.spent:.4f}]"
    finally:
        _active_guard.reset(token)


llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}
researcher_agent = ConversableAgent(
    "researcher", llm_config=llm_config, system_message="You are a research analyst."
)
writer_agent = ConversableAgent(
    "writer", llm_config=llm_config, system_message="You are a professional writer."
)

research = run_agent_with_guard(
    researcher_agent, researcher_guard, "Research recent developments in AI safety."
)
article = run_agent_with_guard(
    writer_agent, writer_guard, f"Write a 500-word article based on:\n{research}"
)
print(article)
```
The contextvars.ContextVar approach ensures that the correct guard instance is active for each agent’s calls, whether agents run sequentially in one thread or concurrently in separate threads or asyncio tasks. Each agent’s guard accumulates spend only for that agent’s calls, giving you independent per-agent budget tracking without changing how AutoGen manages its agent objects. For a deeper treatment of AutoGen loop detection across agent boundaries, see AutoGen loop guard and circuit breaker.
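The isolation that ContextVar provides can be checked with the standard library alone. In this sketch the guard is represented by a plain string and the agent names are placeholders; the point is only that two concurrently running asyncio tasks each see the value they set, and neither leaks into the caller’s context.

```python
import asyncio
import contextvars

active_guard = contextvars.ContextVar("active_guard", default=None)
seen = []


async def agent(name: str, guard_name: str) -> None:
    active_guard.set(guard_name)   # visible only inside this task's context
    await asyncio.sleep(0)         # yield so the two tasks interleave
    seen.append((name, active_guard.get()))


async def main() -> None:
    # gather() runs each coroutine as its own task with a copy of the current context
    await asyncio.gather(
        agent("researcher", "researcher_guard"),
        agent("writer", "writer_guard"),
    )


asyncio.run(main())
print(sorted(seen))
print(active_guard.get())  # the module-level context never saw either value
```

A plain module-level global would not survive the interleaving: whichever agent set it last would win for both.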
Budget hierarchy design: session, agent, and call-level caps
A robust multi-agent cost control strategy operates at three levels simultaneously. Each level catches failures that the others cannot:
- Session-level cap: Total budget for the entire multi-agent run. For example, $20. If the cumulative cost across all agents, all delegation rounds, and all parallel workers exceeds $20, the entire pipeline halts. This is the outer safety net that prevents catastrophic runaway even if per-agent caps are misconfigured or bypassed.
- Agent-level cap: Per-agent budget that prevents any single agent from consuming a disproportionate share of the session budget. For example, $5 per worker agent and $8 for the orchestrator. An agent that exhausts its own budget raises an exception that the caller handles explicitly, so the rest of the session budget stays available to the other agents. The agent’s spend still counts toward the session total.
- Call-level cap: Per-LLM-call budget that prevents any single API call from being unexpectedly expensive. For example, $0.50 maximum per call. This catches cases where an agent’s context has grown extremely large (cross-agent context explosion), making individual calls very expensive even without a loop pattern. A $0.50 call ceiling fires on the first anomalously expensive call, before it can recur.
Python: three-level budget hierarchy with RunGuard

```python
from runguard import BudgetTracker, guard, BudgetExceededError, LoopDetectedError
import openai

client = openai.OpenAI()

# Level 1: session-level cap, shared across all agents
session_budget = BudgetTracker(cap_usd=20.0)

# Level 3: call-level ceiling
CALL_CEILING_USD = 0.50


def make_call(messages: list, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    return {"response": response, "usd": usd, "sig": sig}


def make_agent_guard(agent_cap_usd: float, loop_repeats: int = 3, cycle_len: int = 4):
    """Level 2: agent-level cap. Each agent guard has its own spend counter,
    but all agents also contribute to the shared session_budget."""
    agent_budget = BudgetTracker(cap_usd=agent_cap_usd)

    def guarded_call(messages: list, model: str = "gpt-4o") -> dict:
        # Check session and agent budgets before every call
        session_budget.check()
        agent_budget.check()
        result = make_call(messages, model)
        # Record cost in both trackers first: the money is spent once the call returns
        session_budget.record(result["usd"])
        agent_budget.record(result["usd"])
        # Level 3: the cost is only known after the call returns, so the ceiling
        # halts the agent before an oversized call can recur
        if result["usd"] > CALL_CEILING_USD:
            raise BudgetExceededError(
                f"Single call cost ${result['usd']:.4f} exceeds ${CALL_CEILING_USD:.2f} call ceiling"
            )
        return result

    return guard(
        guarded_call,
        loop={"repeats": loop_repeats, "max_cycle_len": cycle_len},
    )


# Create per-agent guards with independent agent-level caps
orchestrator_guard = make_agent_guard(agent_cap_usd=8.0, loop_repeats=3, cycle_len=6)
worker_a_guard = make_agent_guard(agent_cap_usd=5.0, loop_repeats=3, cycle_len=4)
worker_b_guard = make_agent_guard(agent_cap_usd=5.0, loop_repeats=3, cycle_len=4)
worker_c_guard = make_agent_guard(agent_cap_usd=5.0, loop_repeats=3, cycle_len=4)


def parse_subtasks(response) -> list:
    # Extract subtask list from orchestrator response
    content = response.choices[0].message.content
    lines = [
        line.strip("- ").strip()
        for line in content.splitlines()
        if line.strip().startswith("-")
    ]
    return lines[:3] or [content]


# Example: orchestrator runs, then dispatches to workers
def run_orchestrated_pipeline(task: str) -> dict:
    # Orchestrator call
    try:
        orch_result = orchestrator_guard([
            {"role": "system", "content": "You are an orchestrator. Break the task into subtasks."},
            {"role": "user", "content": task},
        ])
        subtasks = parse_subtasks(orch_result["response"])
    except (BudgetExceededError, LoopDetectedError) as e:
        return {"status": "orchestrator_failed", "error": str(e)}

    # Worker calls: each worker has its own $5 cap, but all share the $20 session cap
    results = []
    for subtask, worker_guard in zip(subtasks, [worker_a_guard, worker_b_guard, worker_c_guard]):
        try:
            result = worker_guard([
                {"role": "system", "content": "You are a specialist agent."},
                {"role": "user", "content": subtask},
            ])
            results.append(result["response"].choices[0].message.content)
        except BudgetExceededError as e:
            # Use str(e): the call-ceiling error above is raised with a message only
            results.append(f"[worker budget exceeded: {e}]")
        except LoopDetectedError as e:
            results.append(f"[worker loop detected: {e.pattern}]")

    session_total = session_budget.spent
    return {"status": "complete", "results": results, "session_cost_usd": session_total}
```
This three-level design provides defense in depth. A single misconfigured agent-level cap does not expose the full session budget. A surprisingly expensive single call (due to context explosion) is caught at the call level before it can recur. And the session-level backstop ensures the entire run cannot exceed $20 regardless of how individual agents are configured. For guidance on calibrating these numbers from profiling data, see autonomous agent cost control best practices and how to set max cost per LLM request.
No guardrails vs. orchestrator-level cap vs. RunGuard per-agent caps
| Capability | No guardrails | Orchestrator-level cap only | RunGuard per-agent caps |
|---|---|---|---|
| Loop detection | None — loop runs until hard iteration limit | None — orchestrator counts turns, not patterns | Per-agent signature-based detection (repeats + max_cycle_len) |
| Per-agent cost cap | Not supported | Not supported — single cap at orchestrator level only | Independent BudgetTracker per agent; each fires before next call |
| Cascading failure prevention | None — failed worker triggers silent retry with growing context | Partial — orchestrator cap eventually fires, but after workers have spent | Worker raises typed exception; orchestrator handles explicitly before re-delegating |
| Hierarchical budget enforcement | Not supported | Not supported — no session or call-level layers | Three-level hierarchy: session cap + agent cap + per-call ceiling |
| Real-time alert on budget exceeded | Not supported — discovered on monthly invoice | Not supported — alert fires after spend is committed | BudgetExceededError fires before the call that would cross the cap |
| Cross-agent context explosion detection | Not supported | Not supported | Call-level cap catches anomalously large single calls before they recur |
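A cap can only fire before a call if the call’s cost is estimated up front. One way to do that, sketched below, is to bound the cost from the prompt length and the maximum output length. The four-characters-per-token ratio is a rough rule of thumb and an assumption here (a real tokenizer gives exact counts), the function names are hypothetical, and the prices are the GPT-4o figures used in this article’s examples.

```python
INPUT_PRICE = 2.50    # USD per million input tokens (price used in this article)
OUTPUT_PRICE = 10.00  # USD per million output tokens


def estimate_call_usd(messages: list[dict], max_output_tokens: int,
                      chars_per_token: float = 4.0) -> float:
    """Rough upper bound on one call's cost, computed before the call is made."""
    prompt_chars = sum(len(m["content"]) for m in messages)
    est_input_tokens = prompt_chars / chars_per_token
    return (est_input_tokens * INPUT_PRICE + max_output_tokens * OUTPUT_PRICE) / 1_000_000


def would_exceed(spent_usd: float, cap_usd: float,
                 messages: list[dict], max_output_tokens: int) -> bool:
    """True if the next call could push cumulative spend past the cap."""
    return spent_usd + estimate_call_usd(messages, max_output_tokens) > cap_usd


# Hypothetical oversized prompt: about 100,000 tokens of accumulated context.
big_prompt = [{"role": "user", "content": "x" * 400_000}]
estimate = estimate_call_usd(big_prompt, max_output_tokens=1_000)
print(f"estimated worst case: ${estimate:.2f}")
print(would_exceed(4.80, 5.00, big_prompt, 1_000))  # True: refuse before calling
print(would_exceed(1.00, 5.00, big_prompt, 1_000))  # False: plenty of headroom
```

The estimate is deliberately pessimistic on the output side, since it assumes the model uses its full output allowance.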
For framework-specific context on how these patterns apply to LangChain agent pipelines, see LangChain agent budget limit. For real-time prevention of the runaway cost patterns discussed here, see prevent AI agent runaway cost in real time.
Add per-agent cost control to your multi-agent pipeline
RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. Wrap each agent’s LLM call function with guard(), give each guard its own BudgetTracker(cap_usd=...), and catch BudgetExceededError and LoopDetectedError in your orchestrator. The three-level hierarchy (session, agent, call) is three additional lines of configuration. No changes to your existing agent definitions, no new infrastructure, no proxy layer between you and your LLM provider.
RunGuard pricing: Solo plan at $19/month covers individual developers and small pipelines. Team plan at $79/month adds shared dashboards, multi-user access, and webhook alerts for Slack and PagerDuty. Both plans include a 14-day free trial — no credit card required.
Start your 14-day free trial — or explore related patterns: CrewAI per-agent budget, AutoGen loop guard, LangChain agent budget limit, autonomous agent cost control best practices, and prevent AI agent runaway cost in real time.