Multi-agent orchestration cost control: why costs grow non-linearly and how to cap every agent independently

Multi-agent systems multiply both capability and cost. An orchestrator that spawns five worker agents for a single task consumes at least five times the tokens of a single agent, plus its own delegation calls. If one worker enters a tool-call loop and the orchestrator retries it, the costs stack: the looping worker accumulates cost on every iteration, and the orchestrator pays for each delegation call on top. Teams running OpenAI Swarm, AutoGen multi-agent pipelines, CrewAI hierarchical crews, or LangGraph multi-node graphs all face the same structural problem: cost grows non-linearly with the number of agents, the number of delegation levels, and the number of parallel workers. A single orchestrator-level budget cap does not solve this. It can only stop the pipeline after the orchestrator itself has crossed its limit, by which point the workers may already have done most of the damage. The reliable defence is a per-agent budget cap that fires inside each agent’s own execution context, combined with loop detection that catches repeated delegation patterns before they compound.

The cost amplification math

The numbers make the problem concrete. Consider a simple agent that costs $0.30 per LLM call, with a maximum of 20 iterations before a hard stop:

Single agent, no orchestration. Maximum cost exposure is $0.30 × 20 iterations = $6.00. A bug that causes the agent to loop runs up a $6 bill before the iteration limit fires.
5-agent pipeline (orchestrator + 4 workers). If every agent, the orchestrator included, can run up to 20 iterations, the maximum exposure is $0.30 × 20 × 5 agents = $30.00. If one worker loops and the orchestrator re-delegates the task, each retry adds another full worker iteration budget ($6.00) on top of that.
3-level hierarchy (orchestrator → manager → workers). Assume three managers with three workers each: 1 orchestrator + 3 managers + 9 workers = 13 agents. If every agent runs its full iteration budget once, the exposure is $0.30 × 20 × 13 = $78.00. Each manager branch (one manager plus three workers) accounts for $24.00 of that, so a few re-delegated branches push the total into the $90–$180 range. A single runaway loop at the worker level propagates up through two delegation layers before any cap above it fires.
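The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the tree count assumes the same fan-out at every level and no retries, which the text does not specify.

```python
def flat_exposure(cost_per_call: float, max_iterations: int, agents: int) -> float:
    """Worst-case spend when every agent runs its full iteration budget once."""
    return cost_per_call * max_iterations * agents


def tree_agent_count(levels: int, fan_out: int) -> int:
    """Agents in a delegation tree: one root, then fan_out children per agent."""
    return sum(fan_out ** depth for depth in range(levels))


single = flat_exposure(0.30, 20, agents=1)
pipeline = flat_exposure(0.30, 20, agents=5)
tree_agents = tree_agent_count(levels=3, fan_out=3)   # 1 + 3 + 9
tree = flat_exposure(0.30, 20, agents=tree_agents)

print(f"single agent:  ${single:.2f}")
print(f"5-agent flat:  ${pipeline:.2f}")
print(f"3-level tree:  ${tree:.2f} across {tree_agents} agents, before any retries")
```

Every retry or re-delegation adds to these figures, so they are a floor on the worst case rather than a ceiling.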

Capping at the orchestrator level is insufficient for a structural reason: if a worker agent enters a tool-call loop, the orchestrator does not see the individual LLM calls inside the worker. It only sees whether the worker returned a result. The orchestrator keeps re-delegating to the looping worker, each delegation spawning a fresh iteration budget, until the orchestrator itself hits its own cap. By then, the worker has run its full iteration budget on every delegation attempt. The correct architecture caps each agent at the point where its LLM calls are made, not at the level above it in the delegation hierarchy.
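A rough way to see the effect of re-delegation is to compare a looping worker's spend when every delegation starts with a fresh iteration budget against a cap that persists across delegations. The sketch below is illustrative only: the function names and the four-delegation scenario are assumptions, not measurements.

```python
def spend_with_fresh_budgets(cost_per_call: float, max_iterations: int,
                             delegations: int) -> float:
    """Looping worker burns its full iteration budget on every delegation."""
    return cost_per_call * max_iterations * delegations


def spend_with_persistent_cap(cost_per_call: float, max_iterations: int,
                              delegations: int, cap_usd: float) -> float:
    """Same loop, but one spend counter survives across all delegations."""
    spent = 0.0
    for _ in range(delegations):
        for _ in range(max_iterations):
            if spent >= cap_usd:       # checked before each call is made
                return spent
            spent += cost_per_call
    return spent


uncapped = spend_with_fresh_budgets(0.30, 20, delegations=4)
capped = spend_with_persistent_cap(0.30, 20, delegations=4, cap_usd=5.0)

print(f"fresh budget per delegation: ${uncapped:.2f}")
print(f"persistent $5 worker cap:    ${capped:.2f}")
```

Because the cap is checked before each call and the cost is only known afterwards, the counter can overshoot the cap by at most one call.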

The three multi-agent cost failure modes

Cascading delegation. The orchestrator sends a task to a worker. The worker fails (wrong output format, tool error, incomplete result). The orchestrator re-delegates to a second worker with more context added to the task description (the first worker’s failed output, additional instructions, error details). The second worker also fails. The orchestrator re-delegates to a third worker or retries the first with even more context. Each delegation cycle costs more than the previous one because the task description grows with accumulated failure context. After four delegation attempts, the orchestrator’s context has grown to include three failed outputs, three sets of additional instructions, and the original task — each of which adds to the input token count for the next delegation call. The orchestrator-level cost grows super-linearly with each delegation round even if each worker’s individual cost is bounded.
Cross-agent context explosion. A common multi-agent pattern: the parent agent passes its full accumulated context to each child; each child’s response is appended back to the parent’s context; the parent passes the now-larger context to the next child. After processing three children, the parent’s context contains all three children’s outputs in addition to its original state. If each output is roughly as large as the original context, the fourth child receives about 4× what the first child received. Input token cost scales linearly with context size, so the fourth call costs about 4× the first. In an 8-agent pipeline where each agent’s output is fed back to the orchestrator, the eighth call costs about 8× the first, and the eight calls together cost about 36× the first call rather than 8×. Without per-agent tracking, this growth is invisible until the invoice arrives.
Redundant parallel work. The orchestrator spawns N workers in parallel for the same subtask, which is common in fan-out patterns where multiple perspectives or approaches are desired. All N workers complete and return similar results, because they were given the same task and the same context. The orchestrator then spawns N more workers to validate or synthesize the first round of results. Without deduplication logic and without per-agent caps, the orchestrator may launch multiple rounds of N parallel workers, paying N× the per-worker cost in every round. If the validation workers decide the synthesis is insufficient and trigger another round, the bill keeps growing by another N workers each time, with no natural stopping point. This pattern is especially common in AutoGen’s GroupChat and LangGraph’s map-reduce subgraphs.
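The context-explosion pattern can be made concrete with a small model. It assumes every child's output is about the same size as the original context and that cost is proportional to input tokens; both are simplifications, and the function name is hypothetical.

```python
def relative_call_costs(children: int) -> list[int]:
    """Cost of each child call relative to the first one."""
    context_units = 1                  # original parent context counts as 1 unit
    costs = []
    for _ in range(children):
        costs.append(context_units)    # child pays for the context it receives
        context_units += 1             # its output is appended to the parent
    return costs


costs = relative_call_costs(8)
total = sum(costs)

print("per-call cost relative to the first call:", costs)
print("total relative cost:", total, "versus", len(costs), "with a constant context")
```

Under these assumptions the per-call cost grows linearly and the cumulative cost grows quadratically with the number of children.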

Per-agent budget caps with RunGuard: CrewAI example

The solution is to give each agent its own BudgetTracker and LoopDetector instance. When a worker raises LoopDetectedError or BudgetExceededError, the orchestrator catches it and handles it explicitly — rather than silently retrying with a fresh iteration budget. This is the crucial distinction: silent retries compound cost; explicit exceptions let the orchestrator make a policy decision (abort, degrade gracefully, escalate to human) without spending another cent on the broken worker.
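To illustrate the idea independently of any library, here is a minimal pure-Python sketch of per-agent spend counters with a typed exception. The class and function names (AgentBudget, BudgetExceeded, run_worker) are hypothetical stand-ins and do not describe RunGuard's actual implementation.

```python
class BudgetExceeded(Exception):
    """Hypothetical typed error raised when an agent's cap is reached."""

    def __init__(self, agent: str, spent: float, cap: float):
        super().__init__(f"{agent} spent ${spent:.2f} of ${cap:.2f}")
        self.agent, self.spent, self.cap = agent, spent, cap


class AgentBudget:
    """One spend counter per agent; it is never reset between retries."""

    def __init__(self, agent: str, cap_usd: float):
        self.agent, self.cap_usd, self.spent = agent, cap_usd, 0.0

    def charge(self, usd: float) -> None:
        if self.spent >= self.cap_usd:          # refuse before the next call
            raise BudgetExceeded(self.agent, self.spent, self.cap_usd)
        self.spent += usd


def run_worker(budget: AgentBudget, calls: int, usd_per_call: float) -> str:
    for _ in range(calls):
        budget.charge(usd_per_call)             # stands in for one LLM call
    return "done"


researcher = AgentBudget("researcher", cap_usd=5.0)
writer = AgentBudget("writer", cap_usd=3.0)

try:
    outcome = run_worker(researcher, calls=100, usd_per_call=0.30)
except BudgetExceeded as err:
    outcome = f"aborted: {err}"                 # explicit decision, no silent retry

writer_outcome = run_worker(writer, calls=2, usd_per_call=0.30)
print(outcome)
print(writer_outcome, f"(writer spent ${writer.spent:.2f})")
```

The runaway researcher stops near its own cap, and the writer's counter is untouched by it.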

Python: CrewAI-like pipeline with per-agent guards

from runguard import guard, BudgetExceededError, LoopDetectedError
import openai

client = openai.OpenAI()

# Prices in USD per million tokens (GPT-4o as of 2026)
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00

def make_llm_call(messages: list, sig_fn=None) -> dict:
    """Base LLM call that returns response + cost metadata."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    usage = response.usage
    usd = (usage.prompt_tokens * INPUT_PRICE + usage.completion_tokens * OUTPUT_PRICE) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    if sig_fn:
        sig = sig_fn(response)
    return {"response": response, "usd": usd, "sig": sig}

# Each agent gets independent guard instances
orchestrator_guard = guard(
    make_llm_call,
    budget={"max_usd": 8.0},      # orchestrator gets $8 — it makes fewer calls
    loop={"repeats": 3, "max_cycle_len": 6},
)

researcher_guard = guard(
    make_llm_call,
    budget={"max_usd": 5.0},      # researcher: $5 cap — most expensive role
    loop={"repeats": 3, "max_cycle_len": 4},
)

writer_guard = guard(
    make_llm_call,
    budget={"max_usd": 3.0},      # writer: $3 cap
    loop={"repeats": 3, "max_cycle_len": 4},
)


def run_researcher(topic: str) -> str:
    """Researcher agent with its own budget and loop guard."""
    messages = [
        {"role": "system", "content": "You are a research analyst. Use search tools to gather information."},
        {"role": "user", "content": f"Research this topic thoroughly: {topic}"},
    ]
    result = researcher_guard(messages)
    return result["response"].choices[0].message.content


def run_writer(research: str, brief: str) -> str:
    """Writer agent with its own budget and loop guard."""
    messages = [
        {"role": "system", "content": "You are a professional writer. Turn research into clear prose."},
        {"role": "user", "content": f"Brief: {brief}\n\nResearch notes:\n{research}"},
    ]
    result = writer_guard(messages)
    return result["response"].choices[0].message.content


def run_pipeline(topic: str, brief: str) -> dict:
    """Orchestrator that catches worker errors explicitly instead of retrying silently."""
    try:
        research = run_researcher(topic)
    except LoopDetectedError as e:
        # Researcher entered a loop — use a reduced, focused fallback task
        print(f"Researcher loop detected ({e.pattern} repeated {e.repeats}x). Using fallback.")
        research = f"[Research incomplete due to loop at step: {e.pattern}]"
    except BudgetExceededError as e:
        # Researcher exceeded its $5 cap — proceed with what was accumulated
        print(f"Researcher budget exceeded (${e.spent:.4f}). Proceeding with partial research.")
        research = getattr(e, "partial_output", "[Research budget exhausted]")

    try:
        article = run_writer(research, brief)
    except (LoopDetectedError, BudgetExceededError) as e:
        print(f"Writer halted: {e}")
        return {"status": "partial", "research": research, "article": None, "error": str(e)}

    return {"status": "complete", "research": research, "article": article}

The key difference from a naive implementation: each guard() call is a separate instance with its own accumulated spend counter. The researcher’s $5 budget and the writer’s $3 budget are independent — the researcher exhausting its budget does not affect the writer’s remaining headroom. The orchestrator catches typed exceptions from each worker and makes an explicit decision, rather than letting the worker’s error propagate silently or triggering an automatic retry that starts the budget clock over.
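The loop={"repeats": ..., "max_cycle_len": ...} settings in the example are not explained in the text. One plausible reading is that a loop is flagged when the most recent call signatures consist of a cycle of at most max_cycle_len steps repeated `repeats` times in a row. The sketch below implements that reading with a hypothetical SimpleLoopDetector class; it is an assumption about the semantics, not RunGuard's algorithm.

```python
from collections import deque


class SimpleLoopDetector:
    """Flags a cycle of call signatures repeated back to back."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 4):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # Only the longest window we could ever need is kept.
        self.history = deque(maxlen=repeats * max_cycle_len)

    def observe(self, sig: str):
        """Record one signature; return the repeating cycle, or None."""
        self.history.append(sig)
        seq = list(self.history)
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * self.repeats
            if len(seq) < window:
                break                  # longer cycles need even more history
            tail = seq[-window:]
            cycle = tail[:cycle_len]
            if tail == cycle * self.repeats:
                return cycle
        return None


looping = SimpleLoopDetector(repeats=3, max_cycle_len=4)
loop_results = [looping.observe(s) for s in ["search", "fetch"] * 3]

healthy = SimpleLoopDetector(repeats=3, max_cycle_len=4)
healthy_results = [healthy.observe(s)
                   for s in ["search", "fetch", "summarize", "end_turn"]]

print(loop_results[-1])
print(healthy_results)
```

In this reading the two-step cycle is only reported on its third full repetition, and a run of distinct signatures never triggers the detector.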

For a deeper look at per-role budgets in CrewAI specifically, see CrewAI budget per agent.

LangGraph multi-agent cost control

LangGraph’s StateGraph model makes it straightforward to wrap each node function with its own guard. Because nodes are just Python callables, you can route each node’s LLM calls through a guard() instance before passing the node to graph.add_node(). A shared BudgetTracker at the graph level gives you a single total spend counter across all nodes, while per-node guards add an independent loop detector and a per-call ceiling for each node.

Python: LangGraph StateGraph with per-node RunGuard

from langgraph.graph import StateGraph, END
from runguard import BudgetTracker, guard, BudgetExceededError, LoopDetectedError
from typing import TypedDict
import openai

client = openai.OpenAI()

class GraphState(TypedDict):
    task: str
    research: str
    draft: str
    final: str
    error: str

# Shared graph-level budget: total session cap across all nodes
graph_budget = BudgetTracker(cap_usd=20.0)

def make_call(messages: list) -> dict:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    usage = response.usage
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    return {"response": response, "usd": usd, "sig": sig}

# Per-node guards share the graph_budget tracker, but have individual loop detectors
researcher_guard = guard(
    make_call,
    budget=graph_budget,                        # shared session cap
    loop={"repeats": 3, "max_cycle_len": 4},    # independent loop detector
    per_call_budget={"max_usd": 0.50},          # single-call hard ceiling
)

writer_guard = guard(
    make_call,
    budget=graph_budget,
    loop={"repeats": 3, "max_cycle_len": 4},
    per_call_budget={"max_usd": 0.50},
)

reviewer_guard = guard(
    make_call,
    budget=graph_budget,
    loop={"repeats": 2, "max_cycle_len": 3},
    per_call_budget={"max_usd": 0.30},
)

def researcher_fn(state: GraphState) -> GraphState:
    try:
        result = researcher_guard([
            {"role": "system", "content": "You are a research analyst."},
            {"role": "user", "content": f"Research: {state['task']}"},
        ])
        return {**state, "research": result["response"].choices[0].message.content}
    except (LoopDetectedError, BudgetExceededError) as e:
        return {**state, "research": "", "error": f"researcher: {e}"}

def writer_fn(state: GraphState) -> GraphState:
    if state.get("error"):
        return state
    try:
        result = writer_guard([
            {"role": "system", "content": "You are a content writer."},
            {"role": "user", "content": f"Task: {state['task']}\nResearch: {state['research']}"},
        ])
        return {**state, "draft": result["response"].choices[0].message.content}
    except (LoopDetectedError, BudgetExceededError) as e:
        return {**state, "draft": "", "error": f"writer: {e}"}

def reviewer_fn(state: GraphState) -> GraphState:
    if state.get("error"):
        return state
    try:
        result = reviewer_guard([
            {"role": "system", "content": "You are an editor. Review for accuracy and clarity."},
            {"role": "user", "content": f"Review this draft:\n{state['draft']}"},
        ])
        return {**state, "final": result["response"].choices[0].message.content}
    except (LoopDetectedError, BudgetExceededError) as e:
        return {**state, "final": state["draft"], "error": f"reviewer: {e}"}

# Build the graph
graph = StateGraph(GraphState)
graph.add_node("researcher", researcher_fn)
graph.add_node("writer", writer_fn)
graph.add_node("reviewer", reviewer_fn)
graph.set_entry_point("researcher")
graph.add_edge("researcher", "writer")
graph.add_edge("writer", "reviewer")
graph.add_edge("reviewer", END)

pipeline = graph.compile()
result = pipeline.invoke({"task": "Summarize recent AI safety research", "research": "", "draft": "", "final": "", "error": ""})

if result.get("error"):
    print(f"Pipeline halted: {result['error']}")
else:
    print(f"Final output: {result['final']}")

The graph_budget instance is shared across all three node guards. If the researcher and writer together spend $18 of the $20 session cap, the reviewer has only $2 of headroom before the shared tracker raises BudgetExceededError. This prevents any combination of node spend from exceeding the total session budget, while each node still gets its own loop detector and per_call_budget ceiling. Note that these per-node settings bound individual calls and repeated patterns, not a node’s cumulative spend: a node that makes many cheap, non-repeating calls can still use up the shared budget.
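If a cumulative cap per node is wanted alongside the shared session tracker, the two can be layered. The following pure-Python sketch shows the idea with a hypothetical Tracker class and made-up per-node caps; it is not the RunGuard API, and the $0.50 call cost is chosen only to keep the arithmetic exact.

```python
class Tracker:
    """Minimal spend counter with a cap (hypothetical stand-in)."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent = 0.0

    @property
    def headroom(self) -> float:
        return max(self.cap_usd - self.spent, 0.0)

    def check(self) -> None:
        if self.spent >= self.cap_usd:
            raise RuntimeError(f"cap of ${self.cap_usd:.2f} reached")

    def record(self, usd: float) -> None:
        self.spent += usd


def guarded_node_call(shared: Tracker, node: Tracker, usd: float) -> None:
    """Check both caps before the call, then record the cost in both."""
    shared.check()
    node.check()
    shared.record(usd)
    node.record(usd)


session = Tracker(cap_usd=20.0)
researcher, writer, reviewer = Tracker(12.0), Tracker(12.0), Tracker(5.0)

for _ in range(20):
    guarded_node_call(session, researcher, 0.50)   # researcher spends $10
for _ in range(16):
    guarded_node_call(session, writer, 0.50)       # writer spends $8
headroom_before_review = session.headroom

reviewer_calls = 0
try:
    while True:
        guarded_node_call(session, reviewer, 0.50)
        reviewer_calls += 1
except RuntimeError as err:
    stop_reason = str(err)

print(f"headroom before review: ${headroom_before_review:.2f}")
print(f"reviewer made {reviewer_calls} calls, then: {stop_reason}")
```

Here the reviewer is stopped by the session cap after $2 even though its own $5 cap still has room, which matches the headroom behaviour described above.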

AutoGen multi-agent cost control

AutoGen’s ConversableAgent and AssistantAgent call the model via a configurable client. The cleanest injection point for RunGuard is the model client’s call method, which all agents share. For per-agent cost tracking, each agent needs its own guard instance wrapping its own model client call, so that their spend counters are independent.

Python: AutoGen v0.2 with per-agent guards

import openai as _openai
from runguard import guard, BudgetExceededError, LoopDetectedError

_orig_create = _openai.chat.completions.create

def make_call(messages, **kwargs):
    response = _orig_create(messages=messages, **kwargs)
    usage = response.usage
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    return {"response": response, "usd": usd, "sig": sig}

# Independent guards for each agent role
researcher_guard = guard(
    make_call,
    budget={"max_usd": 4.0},
    loop={"repeats": 3, "max_cycle_len": 5},
)
writer_guard = guard(
    make_call,
    budget={"max_usd": 2.0},
    loop={"repeats": 3, "max_cycle_len": 4},
)

# Patch per-agent: store the active guard in a context variable
import contextvars
_active_guard = contextvars.ContextVar("active_guard", default=None)

def patched_create(messages, **kwargs):
    active = _active_guard.get()
    if active:
        return active(messages, **kwargs)["response"]
    return _orig_create(messages=messages, **kwargs)

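# Caveat: this replaces create() only on the module-level default client.
# AutoGen v0.2 builds its own openai.OpenAI instances from llm_config, so confirm
# that the patch is actually hit in your setup. Patching the Completions class in
# openai.resources.chat.completions may be needed instead.
_openai.chat.completions.create = patched_create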

from autogen import ConversableAgent

def run_agent_with_guard(agent: ConversableAgent, guard_instance, task: str):
    """Run an AutoGen agent with a specific guard active in the call context."""
    token = _active_guard.set(guard_instance)
    try:
        result = agent.generate_reply(
            messages=[{"role": "user", "content": task}]
        )
        return result
    except LoopDetectedError as e:
        return f"[loop detected: {e.pattern}]"
    except BudgetExceededError as e:
        return f"[budget exceeded: ${e.spent:.4f}]"
    finally:
        _active_guard.reset(token)

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

researcher_agent = ConversableAgent("researcher", llm_config=llm_config,
                                     system_message="You are a research analyst.")
writer_agent = ConversableAgent("writer", llm_config=llm_config,
                                 system_message="You are a professional writer.")

research = run_agent_with_guard(researcher_agent, researcher_guard,
                                "Research recent developments in AI safety.")
article = run_agent_with_guard(writer_agent, writer_guard,
                               f"Write a 500-word article based on:\n{research}")

print(article)

The contextvars.ContextVar approach ensures that the correct guard instance is active for each agent’s calls. When agents run sequentially in the same thread, the set/reset pair scopes the guard to one agent at a time; when they run in separate threads or asyncio tasks, each context sees only its own guard. Each agent’s guard accumulates spend only for that agent’s calls, giving you independent per-agent budget tracking without changing how AutoGen manages its agent objects. For a deeper treatment of AutoGen loop detection across agent boundaries, see AutoGen loop guard and circuit breaker.
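The scoping behaviour of ContextVar can be seen in isolation with the standard library alone. In this sketch the guard is represented by a plain string and llm_call is a hypothetical stand-in for the patched client call.

```python
import contextvars
import threading

active_guard = contextvars.ContextVar("active_guard", default=None)
seen = {}


def llm_call(agent_name: str) -> None:
    # A patched client would look up the active guard here; we only record it.
    seen[agent_name] = active_guard.get()


def run_with_guard(agent_name: str, guard_name: str) -> None:
    token = active_guard.set(guard_name)
    try:
        llm_call(agent_name)
    finally:
        active_guard.reset(token)      # restore whatever was active before


threads = [
    threading.Thread(target=run_with_guard, args=("researcher", "researcher_guard")),
    threading.Thread(target=run_with_guard, args=("writer", "writer_guard")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_thread_guard = active_guard.get()
print(seen)
print(main_thread_guard)
```

Each thread observes only the guard it set, and the main thread's value is left at the default.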

Budget hierarchy design: session, agent, and call-level caps

A robust multi-agent cost control strategy operates at three levels simultaneously. Each level catches failures that the others cannot:

Session-level cap: Total budget for the entire multi-agent run. For example, $20. If the cumulative cost across all agents, all delegation rounds, and all parallel workers exceeds $20, the entire pipeline halts. This is the outer safety net that prevents catastrophic runaway even if per-agent caps are misconfigured or bypassed.
Agent-level cap: Per-agent budget that prevents any single agent from consuming a disproportionate share of the session budget. For example, $5 per worker agent and $8 for the orchestrator. An agent that exhausts its own budget raises an exception that the caller handles explicitly. Its spend still counts toward the session total, but the rest of the session budget stays available to the other agents.
Call-level cap: Per-LLM-call budget that prevents any single API call from being unexpectedly expensive. For example, $0.50 maximum per call. This catches cases where an agent’s context has grown extremely large (cross-agent context explosion), making individual calls very expensive even without a loop pattern. A $0.50 call ceiling fires on the first anomalously expensive call, before it can recur.

Python: three-level budget hierarchy with RunGuard

from runguard import BudgetTracker, guard, BudgetExceededError, LoopDetectedError
import openai

client = openai.OpenAI()

# Level 1: session-level cap — shared across all agents
session_budget = BudgetTracker(cap_usd=20.0)

CALL_CEILING_USD = 0.50  # Level 3: ceiling for any single LLM call

def make_call(messages: list, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    tool_calls = getattr(response.choices[0].message, "tool_calls", None)
    sig = tool_calls[0].function.name if tool_calls else "end_turn"
    return {"response": response, "usd": usd, "sig": sig}

def make_agent_guard(agent_cap_usd: float, loop_repeats: int = 3, cycle_len: int = 4):
    """Level 2: agent-level cap — each agent guard has its own spend counter,
    but all agents also contribute to the shared session_budget."""
    agent_budget = BudgetTracker(cap_usd=agent_cap_usd)

    def guarded_call(messages: list, model: str = "gpt-4o") -> dict:
        # Check session and agent budgets before every call
        session_budget.check()
        agent_budget.check()
        # Make the call
        result = make_call(messages, model)
        # Record the cost first: the provider has already billed this call
        session_budget.record(result["usd"])
        agent_budget.record(result["usd"])
        # Level 3: call-level ceiling. The check runs after the response, so it
        # cannot undo this call; it stops an oversized call from recurring.
        if result["usd"] > CALL_CEILING_USD:
            raise BudgetExceededError(
                f"Single call cost ${result['usd']:.4f} exceeds "
                f"${CALL_CEILING_USD:.2f} call ceiling"
            )
        return result

    return guard(
        guarded_call,
        loop={"repeats": loop_repeats, "max_cycle_len": cycle_len},
    )

# Create per-agent guards with independent agent-level caps
orchestrator_guard = make_agent_guard(agent_cap_usd=8.0, loop_repeats=3, cycle_len=6)
worker_a_guard    = make_agent_guard(agent_cap_usd=5.0, loop_repeats=3, cycle_len=4)
worker_b_guard    = make_agent_guard(agent_cap_usd=5.0, loop_repeats=3, cycle_len=4)
worker_c_guard    = make_agent_guard(agent_cap_usd=5.0, loop_repeats=3, cycle_len=4)

# Example: orchestrator runs, then dispatches to workers
def run_orchestrated_pipeline(task: str) -> dict:
    # Orchestrator call
    try:
        orch_result = orchestrator_guard([
            {"role": "system", "content": "You are an orchestrator. Break the task into subtasks."},
            {"role": "user", "content": task},
        ])
        subtasks = parse_subtasks(orch_result["response"])
    except (BudgetExceededError, LoopDetectedError) as e:
        return {"status": "orchestrator_failed", "error": str(e)}

    # Worker calls — each worker has its own $5 cap, but all share the $20 session cap
    results = []
    for subtask, worker_guard in zip(subtasks, [worker_a_guard, worker_b_guard, worker_c_guard]):
        try:
            result = worker_guard([
                {"role": "system", "content": "You are a specialist agent."},
                {"role": "user", "content": subtask},
            ])
            results.append(result["response"].choices[0].message.content)
        except BudgetExceededError as e:
            # Use str(e): errors raised by the call-level check may not carry .spent
            results.append(f"[worker budget exceeded: {e}]")
        except LoopDetectedError as e:
            results.append(f"[worker loop detected: {e.pattern}]")

    session_total = session_budget.spent
    return {"status": "complete", "results": results, "session_cost_usd": session_total}

def parse_subtasks(response) -> list:
    # Extract subtask list from orchestrator response
    content = response.choices[0].message.content
    lines = [l.strip("- ").strip() for l in content.splitlines() if l.strip().startswith("-")]
    return lines[:3] or [content]

This three-level design provides defense in depth. A single misconfigured agent-level cap does not expose the full session budget. A surprisingly expensive single call (due to context explosion) is caught at the call level before it can recur. And the session-level backstop ensures the entire run cannot exceed $20 regardless of how individual agents are configured. For guidance on calibrating these numbers from profiling data, see autonomous agent cost control best practices and how to set max cost per LLM request.
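The call-level check in the example runs after the response arrives, so the oversized call itself has already been billed. A pre-flight estimate can reject a request before it is sent. The sketch below uses a crude four-characters-per-token heuristic and an assumed max_output_tokens bound; both are assumptions, the function names are hypothetical, and a real tokenizer would give better numbers.

```python
INPUT_PRICE_PER_M = 2.50     # USD per million input tokens, as in the examples above
OUTPUT_PRICE_PER_M = 10.00   # USD per million output tokens
CHARS_PER_TOKEN = 4          # rough heuristic (assumption), not a real tokenizer


def estimate_call_usd(messages: list, max_output_tokens: int = 1024) -> float:
    """Upper-bound estimate: approximate input tokens plus the full output allowance."""
    input_chars = sum(len(m["content"]) for m in messages)
    input_tokens = input_chars / CHARS_PER_TOKEN
    return (input_tokens * INPUT_PRICE_PER_M
            + max_output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def preflight_ok(messages: list, ceiling_usd: float = 0.50,
                 max_output_tokens: int = 1024) -> bool:
    """True if the estimated cost stays at or below the per-call ceiling."""
    return estimate_call_usd(messages, max_output_tokens) <= ceiling_usd


small = [{"role": "user", "content": "x" * 400}]
huge = [{"role": "user", "content": "x" * 1_000_000}]

small_estimate = estimate_call_usd(small)
huge_estimate = estimate_call_usd(huge)

print(f"small: ${small_estimate:.5f} ok={preflight_ok(small)}")
print(f"huge:  ${huge_estimate:.5f} ok={preflight_ok(huge)}")
```

A pre-flight check like this complements the post-call accounting rather than replacing it, since the estimate is only approximate.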

No guardrails vs. orchestrator-level cap vs. RunGuard per-agent caps

Capability	No guardrails	Orchestrator-level cap only	RunGuard per-agent caps
Loop detection	None; loop runs until hard iteration limit	None; orchestrator counts turns, not patterns	Per-agent signature-based detection (repeats + max_cycle_len)
Per-agent cost cap	Not supported	Not supported; single cap at orchestrator level only	Independent BudgetTracker per agent; each is checked before the next call
Cascading failure prevention	None; failed worker triggers silent retry with growing context	Partial; orchestrator cap eventually fires, but after workers have spent	Worker raises typed exception; orchestrator handles it explicitly before re-delegating
Hierarchical budget enforcement	Not supported	Not supported; no session or call-level layers	Three-level hierarchy: session cap + agent cap + per-call ceiling
Real-time alert on budget exceeded	Not supported; discovered on monthly invoice	Delayed; cap fires only after worker spend is committed	BudgetExceededError is raised in-process as soon as a cap is reached, before the next call
Cross-agent context explosion detection	Not supported	Not supported	Call-level cap flags an anomalously large single call so it does not recur

For framework-specific context on how these patterns apply to LangChain agent pipelines, see LangChain agent budget limit. For real-time prevention of the runaway cost patterns discussed here, see prevent AI agent runaway cost in real time.

Add per-agent cost control to your multi-agent pipeline

RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. Wrap each agent’s LLM call function with guard(), give each guard its own BudgetTracker(cap_usd=...), and catch BudgetExceededError and LoopDetectedError in your orchestrator. The three-level hierarchy (session, agent, call) is three additional lines of configuration. No changes to your existing agent definitions, no new infrastructure, no proxy layer between you and your LLM provider.

RunGuard pricing: Solo plan at $19/month covers individual developers and small pipelines. Team plan at $79/month adds shared dashboards, multi-user access, and webhook alerts for Slack and PagerDuty. Both plans include a 14-day free trial — no credit card required.
