Autonomous agent cost control: five layers you need before going to production

Autonomous agents are cost-unbounded by nature: they decide how many tool calls to make, how many planning steps to take, and how many tokens to generate in each response. An agent that works perfectly in development — running 3-step tasks that cost $0.04 each — can, in production, encounter an edge case that sends it into a 50-step loop costing $3.00. If ten users trigger that edge case simultaneously, you’ve spent $30 before your alerting fires. The engineers who ship autonomous agents without a cost-control stack are the ones who come in Monday morning to find an unexpected invoice. This page covers the five layers of cost control that every production autonomous agent needs, with specific implementation patterns for each layer in Python and TypeScript.

Layer 1 — Provider billing caps (infrastructure backstop)

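What to configure. Every major LLM provider (OpenAI, Anthropic, Google) lets you set monthly spend limits at the account or project level. Where the provider enforces the limit as a hard cap, the API returns errors once it is hit and keeps doing so until the limit is raised or the next billing cycle starts. Some settings are only notification thresholds, so confirm which kind you are configuring. Set project-level limits so different agents have separate budgets. OpenAI: Settings → Billing → Usage limits → Set monthly budget. Anthropic: Account Settings → Usage. Console menus change over time, so check the provider's current documentation if these paths have moved. Treat these caps as the last-resort backstop: they prevent catastrophic runaway, but they fire at the billing layer, not the application layer.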
What they don’t do. Provider caps don’t tell you which agent or run triggered the spend. They don’t fire in real time — billing lag means you see the damage an hour or a day later. They don’t protect individual users from runaway (one user can exhaust the project cap for all users). Layer 1 is necessary but not sufficient. You need Layers 2–5 for per-run and per-user granularity.

Layer 2 — Per-run budget enforcement (real-time circuit breaker)

The correct primitive. Per-run budget enforcement fires before each LLM call: if the accumulated cost of the current run has already reached max_usd, the call is blocked and a BudgetExceededError is raised. This differs from checking only after a call returns, which cannot stop a call that has already been paid for. A pre-call check on accumulated cost still lets the final permitted call overshoot the budget by up to its own cost. To tighten that bound, add an estimate of the next call's cost to the accumulated total before comparing.
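The pre-call check can be sketched without any library. This is a minimal illustration, not RunGuard's implementation: `RunBudget`, `BudgetExceeded`, and the `estimated_next_usd` parameter are hypothetical names introduced here.

```python
class BudgetExceeded(Exception):
    """Raised when a run may not make another LLM call (hypothetical name)."""

    def __init__(self, spent: float, max_usd: float):
        super().__init__(f"spent ${spent:.4f} of ${max_usd:.2f}")
        self.spent = spent
        self.max_usd = max_usd


class RunBudget:
    """Tracks accumulated spend for a single agent run."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def check(self, estimated_next_usd: float = 0.0) -> None:
        """Call BEFORE each LLM call; raises if the run is out of budget.

        With the default estimate of 0 this blocks only once the cap has
        been reached. Passing an estimate blocks earlier, so the last call
        cannot push the run past the cap (assuming the estimate holds).
        """
        if self.spent + estimated_next_usd >= self.max_usd:
            raise BudgetExceeded(self.spent, self.max_usd)

    def record(self, usd: float) -> None:
        """Call AFTER each LLM call with its actual cost."""
        self.spent += usd
```

In an agent loop the order is `budget.check()`, then the LLM call, then `budget.record(cost)`. Keeping the check and the bookkeeping as separate steps makes it obvious that the check runs on cost already incurred.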
Setting the budget. Run 20–30 sample tasks from your production distribution and measure the P95 cost per run. Multiply P95 by 2.5. If the P95 cost of a successful task is $0.80, set max_usd=2.00. Runs that cost more than $2.00 are anomalous (loop or context blow-through) and should be terminated, not allowed to continue. Revisit the budget whenever you change the agent’s tool set, model, or prompt.
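The budget rule above is simple arithmetic. The sketch below assumes the nearest-rank definition of a percentile, and `suggest_max_usd` is a hypothetical helper name. The sample costs are made up to reproduce the $0.80 example.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_max_usd(sample_costs: list[float], multiplier: float = 2.5) -> float:
    """P95 of the sampled run costs times the safety multiplier."""
    return round(percentile_nearest_rank(sample_costs, 95) * multiplier, 2)


# Made-up sample: 18 cheap runs, one run at the P95 mark, one outlier.
sample_costs = [0.40] * 18 + [0.80, 3.00]
budget = suggest_max_usd(sample_costs)
```

With 20 samples the P95 is the 19th-smallest cost, so the single $3.00 outlier does not inflate the budget.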

Python implementation with RunGuard.

from runguard import guard, BudgetExceededError, LoopDetectedError

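def llm_call(messages, tools) -> dict:
    # openai_client is assumed to be an openai.OpenAI() instance created elsewhere.
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    message = response.choices[0].message
    # Loop signature: name of the first requested tool, or "end_turn" for a final answer.
    sig = message.tool_calls[0].function.name if message.tool_calls else "end_turn"
    return {
        "response": message,
        "usd": compute_usd(response.usage, model="gpt-4o"),
        "sig": sig,
    }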

def run_task(task: str, max_usd: float = 2.00):
    guarded = guard(llm_call,
                    budget={"max_usd": max_usd},
                    loop={"repeats": 3})
    messages = [{"role": "user", "content": task}]
    try:
        while True:
            result = guarded(messages, tools=TOOLS)
            if not result["response"].tool_calls:
                return result["response"].content
            messages.append(result["response"])
            # ... execute tool calls, append results
    except BudgetExceededError as e:
        return f"[Budget cap reached: ${e.spent:.3f} of ${max_usd:.2f}] Partial result above."
    except LoopDetectedError as e:
        return f"[Loop detected: {e.pattern!r} × {e.repeats}] Stopping."
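The snippet above calls `compute_usd` without defining it. One possible shape is sketched below. The price table is a placeholder with invented numbers and an invented model name, so substitute your provider's published per-token prices. The `Usage` dataclass only mirrors the two fields the helper reads.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for a provider usage object (only the fields used here)."""
    prompt_tokens: int
    completion_tokens: int


# PLACEHOLDER prices in USD per 1M tokens. These are not real prices.
PRICES_PER_MTOK = {
    "example-model": {"input": 1.00, "output": 4.00},
}


def compute_usd(usage, model: str, prices: dict = PRICES_PER_MTOK) -> float:
    """Cost of one call: input and output tokens priced separately."""
    price = prices[model]  # a KeyError for an unknown model is deliberate
    return (
        usage.prompt_tokens * price["input"]
        + usage.completion_tokens * price["output"]
    ) / 1_000_000


cost = compute_usd(Usage(prompt_tokens=1_000_000, completion_tokens=500_000),
                   model="example-model")
```

Failing loudly on an unknown model is a deliberate choice in this sketch: a silent default price would make every budget downstream wrong.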

Layer 3 — Loop detection (pattern circuit breaker)

Why budgets alone are not enough. A budget cap stops a loop eventually — when it hits the dollar ceiling. But a loop that costs $0.05 per step and has a $2.00 cap runs for 40 steps before terminating. If the loop produces no useful output (which is almost always the case), you’ve spent $2.00 and gotten nothing. A pattern detector fires after the third repetition regardless of cost: it catches the loop at step 3, not step 40.
How signature-based loop detection works. Each LLM call returns a tool-call identifier (the name of the function the model wants to call, or “end_turn” if the model is returning a final answer). A loop of period k is detected when a block of k identifiers repeats back to back: the block at positions i through i+k−1 matches the blocks starting at i+k and i+2k. Matching a single identifier at i, i+k, and i+2k is not enough, because A-B-A-C-A-D has A at every second position without being a loop. Catching period-1 loops (same tool called three times in a row) is the minimum. RunGuard supports max_cycle_len to also catch period-2 and period-3 loops (A-B-A-B-A-B or A-B-C-A-B-C).
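A period-k check over the signature history might look like the sketch below. It is an illustration of the idea, not RunGuard's code. `detect_loop` is a hypothetical name, and its `repeats` and `max_cycle_len` parameters only borrow the names used on this page.

```python
from typing import Optional


def detect_loop(signatures: list[str], repeats: int = 3,
                max_cycle_len: int = 3) -> Optional[tuple]:
    """Return the cycle repeated at the tail of `signatures`, or None.

    A cycle of length k counts as a loop when the last k * repeats
    signatures consist of the same k-item block repeated back to back.
    """
    for k in range(1, max_cycle_len + 1):
        needed = k * repeats
        if len(signatures) < needed:
            break  # longer cycles need even more history
        tail = signatures[-needed:]
        cycle = tail[:k]
        if all(tail[i] == cycle[i % k] for i in range(needed)):
            return tuple(cycle)
    return None
```

Checking only the tail keeps the test cheap enough to run after every call, and it means a loop is reported as soon as its third repetition completes.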
What to do when a loop is detected. On LoopDetectedError, extract the accumulated messages from the error, pass them to a cheap summarizer model, and return the partial result. Do not simply raise — most loops produce some useful intermediate output that the user can work with. The graceful degradation path is: catch → summarize → return partial.
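The catch, summarize, return-partial path can be sketched as follows. Everything here is hypothetical: `LoopDetected` stands in for whatever exception your guard raises, the `messages` attribute is an assumed way of carrying the history, and `summarize` is a placeholder for a call to a cheap summarizer model.

```python
from typing import Callable


class LoopDetected(Exception):
    """Stand-in loop error that carries the history (hypothetical shape)."""

    def __init__(self, pattern: tuple, messages: list):
        super().__init__(f"loop detected: {pattern!r}")
        self.pattern = pattern
        self.messages = messages


def run_with_fallback(run_agent: Callable[[str], str],
                      summarize: Callable[[list], str],
                      task: str) -> str:
    """Run the agent; on a loop, return a summarized partial result."""
    try:
        return run_agent(task)
    except LoopDetected as e:
        # Degrade gracefully: summarize what was gathered before the loop.
        summary = summarize(e.messages)
        return f"[Stopped early: repeated {e.pattern!r}] Partial result: {summary}"
```

Passing the summarizer in as a callable keeps the fallback testable without a model call and lets you swap the summarizer model independently of the agent.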

Layer 4 — Token budgets (context management)

Context window blow-through is a cost multiplier. As an agent accumulates tool call results in its message history, the prompt grows with every step. A 10-step agent that starts with a 1,000-token prompt can reach a 40,000-token prompt by step 10. That final call costs about 40× as much in input tokens as the first, and every intermediate call is re-billed for the history accumulated so far. Token budgets cap this growth.
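To see the cumulative effect, the arithmetic can be worked through under one simplifying assumption: the prompt grows linearly from 1,000 to 40,000 tokens across the 10 steps. Real growth depends on tool output sizes, and `prompt_sizes` is a hypothetical helper.

```python
def prompt_sizes(first: int, last: int, steps: int) -> list[int]:
    """Prompt size at each step, assuming linear growth (illustrative only)."""
    growth = (last - first) / (steps - 1)
    return [round(first + i * growth) for i in range(steps)]


sizes = prompt_sizes(1_000, 40_000, 10)
total_input = sum(sizes)             # input tokens billed across the run
flat_total = sizes[0] * len(sizes)   # what the run would bill with no growth
ratio = total_input / flat_total
```

Under that assumption the run is billed for 205,000 input tokens, 20.5 times the 10,000 it would consume if the prompt stayed at 1,000 tokens.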
Sliding window vs. summarization. The two standard approaches are: (1) keep only the last N messages in context (sliding window) — simple but loses early tool results that may be relevant; (2) periodically summarize older messages with a cheap model and compress them into a summary message — more expensive to implement but retains information. For most agent workloads, a sliding window of 20 messages is the right default. Add summarization only if you observe the agent losing context in long tasks.
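The summarization approach can be sketched as a compaction step that runs before each call. `compact_history` is a hypothetical name, `summarize` is a placeholder for a call to a cheap model, and putting the summary in a "user" message is one choice among several.

```python
from typing import Callable


def compact_history(messages: list[dict],
                    summarize: Callable[[list[dict]], str],
                    keep_last: int = 20) -> list[dict]:
    """Replace all but the last `keep_last` non-system messages with a summary.

    Assumes messages are plain dicts with "role" and "content" keys.
    Returns a new list; the caller's history is not modified.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return list(messages)
    older, recent = rest[:-keep_last], rest[-keep_last:]
    summary = {
        "role": "user",
        "content": "[Summary of earlier steps] " + summarize(older),
    }
    return system + [summary] + recent
```

One caveat applies to both this and a plain sliding window: if the cut falls between an assistant message that requests a tool call and the tool result that answers it, some provider APIs reject the request, so you may need to move the cut to a safe boundary.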

Implementation.

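def trim_context(messages: list, max_messages: int = 20) -> list:
    """Keep system messages + the last max_messages (>= 1) non-system messages.

    Assumes every message is a plain dict with a "role" key.
    """
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    if len(non_system) <= max_messages:
        return system + non_system
    omitted = len(non_system) - max_messages
    trimmed = non_system[omitted:]
    # Copy the first kept message so the caller's history is not mutated
    # (otherwise the note is prepended again on every call). Content can be
    # None on assistant messages that only carry tool calls.
    first = dict(trimmed[0])
    first["content"] = f"[{omitted} earlier messages omitted] " + (first.get("content") or "")
    return system + [first] + trimmed[1:]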

Layer 5 — Cost observability (logging and alerting)

Log cost per run, not just per call. Most LLM observability tools (Langfuse, Braintrust, Weave) show you cost per individual LLM call. This is useful for debugging, but production cost control needs cost per agent run — the sum of all calls that serve a single user task. Tag each LLM call with a run_id and aggregate cost by run_id in your database or observability platform.
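Aggregating by run_id is a single GROUP BY. The sketch below uses an in-memory SQLite database. The `llm_calls` table, its columns, and the costs are made up for illustration.

```python
import sqlite3


def cost_per_run(db: sqlite3.Connection) -> dict[str, float]:
    """Sum per-call cost into per-run cost, keyed by run_id."""
    rows = db.execute(
        "SELECT run_id, SUM(usd) FROM llm_calls GROUP BY run_id"
    ).fetchall()
    return {run_id: total for run_id, total in rows}


# Demo with an in-memory database and invented costs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE llm_calls (run_id TEXT, step INTEGER, usd REAL)")
db.executemany(
    "INSERT INTO llm_calls VALUES (?, ?, ?)",
    [("run-a", 1, 0.25), ("run-a", 2, 0.5), ("run-b", 1, 0.125)],
)
totals = cost_per_run(db)
```

Here run-a made two calls and run-b one, so a per-call view shows three cheap calls while the per-run view shows that run-a cost six times as much as run-b.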
Alert on P99 run cost, not average. Average cost per run is a vanity metric for cost control. Alert when P99 cost exceeds 2× your target median. A rising P99 means your agent is encountering edge cases more frequently — potential loops, context blow-throughs, or unexpected tool chains that the average doesn’t capture.
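The alert rule can be sketched as follows, assuming the nearest-rank definition of P99. `p99_breach` is a hypothetical helper and the cost samples are invented.

```python
import math


def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_breach(run_costs_usd: list[float], target_median_usd: float,
               factor: float = 2.0) -> bool:
    """True when P99 run cost exceeds factor x the target median."""
    p99 = nearest_rank_percentile(run_costs_usd, 99)
    return p99 > factor * target_median_usd


# Invented data: 200 runs, of which 4 hit an expensive edge case.
healthy = [0.10] * 200
drifting = [0.10] * 196 + [1.50] * 4
mean_drifting = sum(drifting) / len(drifting)
```

In the drifting sample the mean only moves from $0.10 to roughly $0.13, while P99 jumps to $1.50 and trips the alert, which is the gap between the two metrics that the paragraph above describes.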

Minimal cost-logging implementation.

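import sqlite3
import uuid  # callers can mint run IDs with uuid.uuid4().hex
from datetime import datetime, timezone

DB = sqlite3.connect("agent_costs.db")
DB.execute("""
  CREATE TABLE IF NOT EXISTS run_costs (
    run_id TEXT,
    task_hash TEXT,
    model TEXT,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    usd REAL,
    ts TEXT,
    outcome TEXT
  )
""")

def log_run_cost(run_id: str, model: str, usage, outcome: str):
    usd = compute_usd(usage, model)
    DB.execute(
        "INSERT INTO run_costs VALUES (?,?,?,?,?,?,?,?)",
        # task_hash is left empty here; fill it in if you hash task inputs.
        (run_id, "", model, usage.prompt_tokens, usage.completion_tokens,
         usd, datetime.now(timezone.utc).isoformat(), outcome)
    )
    DB.commit()

# Daily P99 of per-run cost. If a run logs several rows, they are summed by
# run_id first. Stock SQLite builds lack percentile_cont(), so this uses the
# nearest-rank method with window functions (SQLite 3.25+).
p99_query = """
  WITH runs AS (
    SELECT run_id, date(min(ts)) AS day, sum(usd) AS run_usd
    FROM run_costs
    GROUP BY run_id
  ),
  ranked AS (
    SELECT day, run_usd,
           row_number() OVER (PARTITION BY day ORDER BY run_usd) AS rn,
           count(*) OVER (PARTITION BY day) AS n
    FROM runs
  )
  SELECT day, min(run_usd) AS p99
  FROM ranked
  WHERE rn >= 0.99 * n
  GROUP BY day
"""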

Slack alert on budget breach. Integrate with your incident response by sending a Slack message on every BudgetExceededError or LoopDetectedError. Include the run ID, task description (truncated), cost at breach, and the loop pattern if applicable. RunGuard’s alert config handles this in one line: guard(fn, budget={"max_usd": 2.0}, alerts={"slack_webhook": WEBHOOK_URL}).
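If you are not using a built-in alert option, a Slack incoming webhook accepts a JSON POST with a `text` field, so the alert can be sent with the standard library alone. The sketch below is one possible shape: `build_alert_text` and `send_slack_alert` are hypothetical names, and the 120-character truncation is an arbitrary choice.

```python
import json
import urllib.request
from typing import Optional


def build_alert_text(run_id: str, task: str, spent_usd: float,
                     loop_pattern: Optional[tuple] = None,
                     max_task_chars: int = 120) -> str:
    """Format the alert: run ID, truncated task, cost, optional loop pattern."""
    short_task = task if len(task) <= max_task_chars else task[:max_task_chars] + "..."
    lines = [
        f"Agent run {run_id} stopped",
        f"Task: {short_task}",
        f"Cost at breach: ${spent_usd:.3f}",
    ]
    if loop_pattern is not None:
        lines.append(f"Loop pattern: {loop_pattern!r}")
    return "\n".join(lines)


def send_slack_alert(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """POST the message to a Slack incoming webhook; returns the HTTP status."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Call `send_slack_alert` from the `except` branches that handle budget and loop errors, and wrap it in its own try/except so a failed alert never masks the original error. Truncating the task also limits how much user input ends up in a chat channel.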

The five layers at a glance

Layer	What it prevents	Granularity	Fires in real time?
1. Provider billing cap	Account-level catastrophe	Monthly, project	No (billing lag)
2. Per-run budget	Individual run blow-through	Per agent run	Yes (before each call)
3. Loop detection	Repetition waste	Per agent run	Yes (after N repeats)
4. Token budget	Context-growth cost multiplier	Per run, rolling	Yes (trim before call)
5. Cost observability	Gradual drift, edge-case surfacing	Per run, aggregated	Async (alerting)