Autonomous agent cost control: five layers you need before going to production
Autonomous agents are cost-unbounded by nature: they decide how many tool calls to make, how many planning steps to take, and how many tokens to generate in each response. An agent that works perfectly in development — running 3-step tasks that cost $0.04 each — can, in production, encounter an edge case that sends it into a 50-step loop costing $3.00. If ten users trigger that edge case simultaneously, you’ve spent $30 before your alerting fires. The engineers who ship autonomous agents without a cost-control stack are the ones who come in Monday morning to find an unexpected invoice. This page covers the five layers of cost control that every production autonomous agent needs, with specific implementation patterns for each layer in Python and TypeScript.
Layer 1 — Provider billing caps (infrastructure backstop)
- What to configure. Every major LLM provider (OpenAI, Anthropic, Google) allows you to set monthly spend limits at the account or project level. These are hard limits: when hit, the API returns errors until the next billing cycle. Set project-level limits so different agents have separate budgets. OpenAI: Settings → Billing → Usage limits → Set monthly budget. Anthropic: Account Settings → Usage. Configure these as the last-resort backstop — they prevent catastrophic runaway, but they fire at the billing layer, not the application layer.
- What they don’t do. Provider caps don’t tell you which agent or run triggered the spend. They don’t fire in real time — billing lag means you see the damage an hour or a day later. They don’t protect individual users from runaway (one user can exhaust the project cap for all users). Layer 1 is necessary but not sufficient. You need Layers 2–5 for per-run and per-user granularity.
Layer 2 — Per-run budget enforcement (real-time circuit breaker)
- The correct primitive. Per-run budget enforcement fires before each LLM call: if the accumulated cost of the current run has already reached `max_usd`, the call is blocked and a `BudgetExceededError` is raised. This is fundamentally different from checking the accumulated cost after the call (which lets the last call overshoot the budget by however much it costs).
- Setting the budget. Run 20–30 sample tasks from your production distribution and measure the P95 cost per run. Multiply P95 by 2.5. If the P95 cost of a successful task is $0.80, set `max_usd=2.00`. Runs that cost more than $2.00 are anomalous (loop or context blow-through) and should be terminated, not allowed to continue. Revisit the budget whenever you change the agent’s tool set, model, or prompt.
- Python implementation with RunGuard.
```python
from openai import OpenAI
from runguard import guard, BudgetExceededError, LoopDetectedError

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_call(messages, tools) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    message = response.choices[0].message
    return {
        "response": message,
        # compute_usd is your own pricing helper (token usage -> dollars); not shown here.
        "usd": compute_usd(response.usage, model="gpt-4o"),
        # Loop-detection signature: name of the first requested tool, or "end_turn".
        "sig": message.tool_calls[0].function.name if message.tool_calls else "end_turn",
    }


def run_task(task: str, max_usd: float = 2.00):
    guarded = guard(llm_call, budget={"max_usd": max_usd}, loop={"repeats": 3})
    messages = [{"role": "user", "content": task}]
    try:
        while True:
            result = guarded(messages, tools=TOOLS)  # TOOLS: your tool schemas, defined elsewhere
            if not result["response"].tool_calls:
                return result["response"].content
            messages.append(result["response"])
            # ... execute tool calls, append results
    except BudgetExceededError as e:
        return f"[Budget cap reached: ${e.spent:.3f} of ${max_usd:.2f}] Partial result above."
    except LoopDetectedError as e:
        return f"[Loop detected: {e.pattern!r} × {e.repeats}] Stopping."
```
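To show the pre-call check and the P95 × 2.5 calibration without depending on any particular library, here is a minimal sketch. The names `RunBudget`, `BudgetExceeded`, and `budget_from_samples` are hypothetical and introduced only for illustration; they are not RunGuard's API. The percentile uses a simple nearest-rank rule, which is one reasonable choice among several.

```python
import math


class BudgetExceeded(Exception):
    """Raised when a run has already used up its budget (hypothetical name)."""

    def __init__(self, spent: float, max_usd: float):
        super().__init__(f"spent ${spent:.3f} of ${max_usd:.2f}")
        self.spent = spent
        self.max_usd = max_usd


class RunBudget:
    """Tracks the accumulated cost of a single agent run."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def check(self) -> None:
        # Call BEFORE each LLM call: refuse to start a call once the budget is spent.
        if self.spent >= self.max_usd:
            raise BudgetExceeded(self.spent, self.max_usd)

    def record(self, usd: float) -> None:
        # Call AFTER each LLM call with the cost of that call.
        self.spent += usd


def budget_from_samples(run_costs: list[float], multiplier: float = 2.5) -> float:
    """Nearest-rank P95 of sampled run costs, times a safety multiplier."""
    ordered = sorted(run_costs)
    rank = math.ceil(len(ordered) * 95 / 100)  # 1-based nearest rank
    return ordered[rank - 1] * multiplier
```

In an agent loop, this would mean calling `budget.check()` immediately before each model request and `budget.record(usd)` immediately after it, so that no new call starts once the ceiling has been reached.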
Layer 3 — Loop detection (pattern circuit breaker)
- Why budgets alone are not enough. A budget cap stops a loop eventually — when it hits the dollar ceiling. But a loop that costs $0.05 per step and has a $2.00 cap runs for 40 steps before terminating. If the loop produces no useful output (which is almost always the case), you’ve spent $2.00 and gotten nothing. A pattern detector fires after the third repetition regardless of cost: it catches the loop at step 3, not step 40.
- How signature-based loop detection works. Each LLM call returns a tool-call identifier (the name of the function the model wants to call, or “end_turn” if the model is returning a final answer). A loop is detected when the same identifier appears at positions `i`, `i+k`, `i+2k` in the call history, i.e., a repeating pattern of period `k`. Catching period-1 loops (same tool called three times in a row) is the minimum. RunGuard supports `max_cycle_len` to also catch period-2 and period-3 loops (A-B-A-B-A-B or A-B-C-A-B-C).
- What to do when a loop is detected. On `LoopDetectedError`, extract the accumulated messages from the error, pass them to a cheap summarizer model, and return the partial result. Do not simply raise: most loops produce some useful intermediate output that the user can work with. The graceful degradation path is: catch → summarize → return partial.
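The signature-based check described above can be sketched in a few lines. This is an illustrative implementation, not RunGuard's: the function name `detect_loop` is hypothetical, and the parameter names `repeats` and `max_cycle_len` are borrowed from the text only for readability. It checks the slightly stricter condition that the whole `k`-length block at the tail of the history repeats, which coincides with the definition above for period 1.

```python
def detect_loop(signatures: list[str], repeats: int = 3, max_cycle_len: int = 3):
    """Return the repeating pattern at the tail of the history, or None.

    A pattern of length k counts as a loop when the last repeats * k
    signatures are the same k-length block repeated `repeats` times.
    """
    for k in range(1, max_cycle_len + 1):
        window = repeats * k
        if len(signatures) < window:
            continue  # not enough history yet for this period
        tail = signatures[-window:]
        pattern = tail[:k]
        if all(tail[i] == pattern[i % k] for i in range(window)):
            return tuple(pattern)
    return None
```

Calling this after every step with the list of signatures seen so far means a period-1 loop is caught on its third repetition and a period-2 loop on its sixth call, well before a dollar ceiling would intervene.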
Layer 4 — Token budgets (context management)
- Context window blow-through is a cost multiplier. As an agent accumulates tool call results in its message history, the prompt size grows. A 10-step agent that starts with a 1,000-token prompt can end step 10 with a 40,000-token prompt as tool results accumulate. Each LLM call in the final steps costs 40× more in input tokens than the first call. Token budgets cap this growth.
- Sliding window vs. summarization. The two standard approaches are: (1) keep only the last N messages in context (sliding window) — simple but loses early tool results that may be relevant; (2) periodically summarize older messages with a cheap model and compress them into a summary message — more expensive to implement but retains information. For most agent workloads, a sliding window of 20 messages is the right default. Add summarization only if you observe the agent losing context in long tasks.
- Implementation.
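```python
def trim_context(messages: list, max_messages: int = 20) -> list:
    """Keep system messages + the last max_messages non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    trimmed = non_system[-max_messages:]
    if len(non_system) > max_messages:
        omitted = len(non_system) - max_messages
        # Copy the first kept message so the caller's history is not mutated
        # (otherwise the note is prepended again on every call), and tolerate
        # content=None on assistant messages that only carry tool calls.
        first = dict(trimmed[0])
        first["content"] = f"[{omitted} earlier messages omitted] " + (first.get("content") or "")
        trimmed[0] = first
    return system + trimmed
```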
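A message-count window does not bound prompt size when individual tool results are large, so a variant that trims by estimated tokens may be useful. The sketch below is an assumption-laden illustration: `estimate_tokens` and `trim_to_token_budget` are hypothetical names, messages are assumed to be plain dicts with string (or missing) `content`, and the 4-characters-per-token ratio is a rough heuristic rather than a real tokenizer.

```python
def estimate_tokens(message: dict) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    content = message.get("content") or ""
    return max(1, len(content) // 4)


def trim_to_token_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep system messages plus the newest messages that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    remaining = max_tokens - sum(estimate_tokens(m) for m in system)
    kept = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(non_system):
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return system + kept
```

For exact budgets, the estimator could be swapped for the tokenizer that matches your model; the trimming logic stays the same.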
Layer 5 — Cost observability (logging and alerting)
- Log cost per run, not just per call. Most LLM observability tools (Langfuse, Braintrust, Weave) show you cost per individual LLM call. This is useful for debugging, but production cost control needs cost per agent run: the sum of all calls that serve a single user task. Tag each LLM call with a `run_id` and aggregate cost by `run_id` in your database or observability platform.
- Alert on P99 run cost, not average. Average cost per run is a vanity metric for cost control. Alert when P99 cost exceeds 2× your target median. A rising P99 means your agent is encountering edge cases more frequently: potential loops, context blow-throughs, or unexpected tool chains that the average doesn’t capture.
- Minimal cost-logging implementation.
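```python
import math
import sqlite3
import uuid
from datetime import datetime, timezone

DB = sqlite3.connect("agent_costs.db")
DB.execute("""
    CREATE TABLE IF NOT EXISTS run_costs (
        run_id TEXT,
        task_hash TEXT,
        model TEXT,
        prompt_tokens INTEGER,
        completion_tokens INTEGER,
        usd REAL,
        ts TEXT,
        outcome TEXT
    )
""")

# Generate one ID per agent run, e.g. run_id = uuid.uuid4().hex, and reuse it for every call.


def log_run_cost(run_id: str, model: str, usage, outcome: str):
    """Insert one row per LLM call; rows sharing a run_id are summed at query time."""
    usd = compute_usd(usage, model)  # same pricing helper as in Layer 2
    DB.execute(
        "INSERT INTO run_costs VALUES (?,?,?,?,?,?,?,?)",
        (run_id, "", model, usage.prompt_tokens, usage.completion_tokens,
         usd, datetime.now(timezone.utc).isoformat(), outcome),
    )
    DB.commit()


# Daily P99 of cost per run. percentile_cont(...) WITHIN GROUP is PostgreSQL
# syntax and is unavailable in most SQLite builds, so total each run in SQL
# and take a nearest-rank percentile in Python.
per_run_query = """
    SELECT date(min(ts)) AS day, sum(usd) AS run_usd
    FROM run_costs
    GROUP BY run_id
    ORDER BY day, run_usd
"""


def daily_p99() -> dict:
    by_day = {}
    for day, run_usd in DB.execute(per_run_query):
        by_day.setdefault(day, []).append(run_usd)  # already sorted by cost within a day
    return {day: costs[math.ceil(len(costs) * 99 / 100) - 1] for day, costs in by_day.items()}
```

- Slack alert on budget breach. Integrate with your incident response by sending a Slack message on every `BudgetExceededError` or `LoopDetectedError`. Include the run ID, task description (truncated), cost at breach, and the loop pattern if applicable. RunGuard’s alert config handles this in one line: `guard(fn, budget={"max_usd": 2.0}, alerts={"slack_webhook": WEBHOOK_URL})`.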
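The P99 alerting rule from this layer can be expressed independently of any storage or alerting backend. The following is a minimal sketch under stated assumptions: the function names are hypothetical, input rows are `(run_id, usd)` pairs with one row per LLM call, and the percentile uses the nearest-rank rule.

```python
import math
from collections import defaultdict


def cost_per_run(call_rows: list[tuple[str, float]]) -> dict[str, float]:
    """Sum per-call costs (run_id, usd) into one total per run."""
    totals: dict[str, float] = defaultdict(float)
    for run_id, usd in call_rows:
        totals[run_id] += usd
    return dict(totals)


def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 99 / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def should_alert(run_totals: dict[str, float], target_median_usd: float) -> bool:
    """Alert when the P99 run cost exceeds 2x the target median."""
    if not run_totals:
        return False
    return p99(list(run_totals.values())) > 2 * target_median_usd
```

Note that with nearest-rank P99, a single outlier among 100 or more runs in a window will not move the percentile; that is usually the desired behavior for an alert meant to track how often edge cases occur rather than individual spikes, which the per-run budget already handles.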
The five layers at a glance
| Layer | What it prevents | Granularity | Fires in real time? |
|---|---|---|---|
| 1. Provider billing cap | Account-level catastrophe | Monthly, project | No (billing lag) |
| 2. Per-run budget | Individual run blow-through | Per agent run | Yes (before each call) |
| 3. Loop detection | Repetition waste | Per agent run | Yes (after N repeats) |
| 4. Token budget | Context-growth cost multiplier | Per run, rolling | Yes (trim before call) |
| 5. Cost observability | Gradual drift, edge-case surfacing | Per run, aggregated | Async (alerting) |