Your cost dashboard tells you what you spent. RunGuard stops the run that is actively spending it.
Every major LLM observability platform (LangSmith, Langfuse, Braintrust, Helicone, and the native dashboards from Anthropic and OpenAI) provides cost tracking. You can see daily spend, cost per trace, cost per model, and cost per user. You can set fleet-level alerts: “notify me if daily spend exceeds $500.” These are retrospective instruments. By the time the alert fires, the money is already on your invoice. The dashboard tells you that you spent $437 in the last 24 hours. It does not tell you that there is a single agent run currently in its forty-eighth tool-call iteration that has spent $38 in the last six minutes, and that without intervention it will cross $100 before the on-call engineer reads the Slack message.
The gap between fleet-level alerting and per-run circuit breaking is not a gap in dashboard coverage; it is a gap in architecture. A dashboard reads historical data. A circuit breaker reads current state, in-process, before the next LLM API call goes out. RunGuard’s BudgetTracker is that circuit breaker: it accumulates the USD cost you compute from each generation’s token usage and throws BudgetExceededError once the run reaches your cap, before the next call goes out. The bill for that next call never lands on your invoice because the call is never made.
Why fleet-level cost alerts are insufficient for runaway agent runs
- Alerts fire after the spend, not before the next call. A Slack alert configured to fire when daily spend exceeds $500 fires at the moment the $500 threshold is crossed. By that point, the run that crossed it is still executing. A multi-step agent in a tool-call loop can burn $50 per minute on an expensive model like Claude Opus at scale. The two minutes between the alert firing and the on-call engineer reading it can add another $100 to the invoice. A per-run circuit breaker acts inside the run: once accumulated spend reaches the cap, the next generation is never sent.
- Fleet-level thresholds cannot distinguish a legitimate expensive run from a runaway one. If your application legitimately runs a research agent that costs $15 per successful run, a fleet-level alert at $500/day is calibrated around a normal run volume. A runaway agent that loops 50 times and spends $750 on a single run will blow through the $500 fleet alert in the middle of that run, but by the time the alert fires, daily spend has already crossed $500 and the run is still accumulating. A per-run cap at maxUsd: 30 (2× the expected $15/run cost) would have caught the runaway run at its second loop iteration and stopped it at roughly $30, leaving the fleet-level alert untriggered and the day's total spend almost unchanged.
- The LLM provider’s own hard spending limits are blunt and per-account. Anthropic, OpenAI, and other providers let you set hard account-level spending limits that cut off API access when the monthly budget is exceeded. These limits are useful for preventing account-level disasters, but they fire at the wrong granularity: they are per-account rather than per-run, they fire only after the month's budget has already been spent, and when they fire they cut off all API access for all applications using that account key simultaneously. A per-run circuit breaker that fires inside your application process prevents the runaway run from eating into the account limit, leaving capacity for legitimate runs to proceed.
- Cost spikes in agentic runs are structurally different from cost spikes in batch jobs. A batch job that processes documents one at a time has a predictable cost per item: if a document costs $0.05 to process, 1,000 documents cost $50. The per-item cost is roughly constant. An AI agent in a tool-call loop has an unbounded cost per run: each additional loop iteration adds a full LLM generation, and the context grows with each iteration (accumulated tool results), so the per-generation cost actually increases as the loop continues. A ten-iteration loop does not cost 10× the one-iteration cost; it costs more than 10× because the context is larger on each iteration and token pricing is linear with input tokens. This super-linear cost growth is why a per-run hard cap matters more for agents than for batch jobs.
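The super-linear growth described in the last bullet can be made concrete with a small calculation. The sketch below is illustrative only: the 10,000-token starting context, the 1,000 tokens added per iteration, and the $3 per million input tokens are assumed figures, and loopCostUsd is a hypothetical helper, not part of RunGuard. It counts input tokens only and ignores output tokens.

```typescript
// Cumulative input-token cost of an agent loop whose context grows each iteration.
// All figures are illustrative assumptions, not measured prices or RunGuard output.
function loopCostUsd(
  iterations: number,
  baseContextTokens: number,
  tokensAddedPerIteration: number,
  usdPerMillionInputTokens: number
): number {
  let totalTokens = 0;
  for (let i = 0; i < iterations; i++) {
    // Iteration i resends the base context plus everything accumulated so far.
    totalTokens += baseContextTokens + i * tokensAddedPerIteration;
  }
  return (totalTokens * usdPerMillionInputTokens) / 1_000_000;
}

const oneIterationUsd = loopCostUsd(1, 10_000, 1_000, 3);
const tenIterationsUsd = loopCostUsd(10, 10_000, 1_000, 3);
const growthRatio = tenIterationsUsd / oneIterationUsd;

console.log(`1 iteration:   $${oneIterationUsd.toFixed(3)}`);
console.log(`10 iterations: $${tenIterationsUsd.toFixed(3)}`);
console.log(`ratio:         ${growthRatio.toFixed(2)}x`);
```

Under these assumptions ten iterations cost about 14.5 times one iteration rather than 10 times, and the gap widens the longer the loop runs.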
How BudgetTracker works: accumulate, check, throw or pass
RunGuard’s BudgetTracker follows a three-step lifecycle for every guarded call. Step one (preflight): before your inner function runs, the tracker checks whether the accumulated spend for this run has already reached maxUsd. If it has, it throws BudgetExceededError immediately, before any API call is made. That call is never sent; no tokens are consumed; no cost lands on the invoice. Step two (invocation): your inner function runs. The LLM API call goes out, the response comes back, and your function returns a result object that includes a usd field (the cost of this specific generation, computed by you from response.usage and your per-token rate). Step three (record): the tracker adds usd to the run’s accumulated total. If the new total exceeds maxUsd, it throws BudgetExceededError with the current e.spent value, ensuring the next call never fires. The important invariant is that once the cap is reached, no further call goes out. The tracker cannot know what a generation will cost before it runs, so the generation that crosses the cap is still billed: if the accumulated spend is $4.95, your cap is $5.00, and the next generation costs $0.20, preflight passes, the call goes out, the total becomes $5.15, and the record step throws. The overshoot is bounded by the cost of one generation, and every call after that is blocked before it is sent. If the total must stay strictly under a figure, set maxUsd below that figure by the cost of your most expensive single generation.
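The three steps above can be sketched in a few lines. This is a minimal illustration of the preflight, invoke, record lifecycle as described, not the RunGuard source: SketchBudgetTracker, SketchBudgetExceededError, and guardSketch are hypothetical names, and the wrapper is synchronous for brevity where a real LLM call would be async.

```typescript
// Hypothetical names throughout; a sketch of the described lifecycle only.
class SketchBudgetExceededError extends Error {
  readonly spent: number;
  readonly maxUsd: number;

  constructor(spent: number, maxUsd: number) {
    super(`budget exceeded: spent $${spent.toFixed(4)}, cap $${maxUsd.toFixed(2)}`);
    this.name = "SketchBudgetExceededError";
    this.spent = spent;
    this.maxUsd = maxUsd;
    // Keeps instanceof working if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, SketchBudgetExceededError.prototype);
  }
}

class SketchBudgetTracker {
  private spent = 0;
  private readonly maxUsd: number;

  constructor(maxUsd: number) {
    this.maxUsd = maxUsd;
  }

  // Step one: refuse to start a call once the cap has been reached.
  preflight(): void {
    if (this.spent >= this.maxUsd) {
      throw new SketchBudgetExceededError(this.spent, this.maxUsd);
    }
  }

  // Step three: add the reported cost, then throw if the total exceeds the cap.
  record(usd: number): void {
    this.spent += usd;
    if (this.spent > this.maxUsd) {
      throw new SketchBudgetExceededError(this.spent, this.maxUsd);
    }
  }
}

function guardSketch<A extends unknown[], R extends { usd: number }>(
  fn: (...args: A) => R,
  maxUsd: number
): (...args: A) => R {
  const tracker = new SketchBudgetTracker(maxUsd);
  return (...args: A): R => {
    tracker.preflight();
    const result = fn(...args); // Step two: the (stand-in) LLM call runs.
    tracker.record(result.usd);
    return result;
  };
}

// Stand-in for an LLM call that always costs $0.20.
let generationsSent = 0;
const fakeGeneration = (): { usd: number } => {
  generationsSent += 1;
  return { usd: 0.2 };
};

const guardedFake = guardSketch(fakeGeneration, 0.5);
let abortedAtUsd = NaN; // stays NaN until the guard aborts the run
try {
  for (let i = 0; i < 10; i++) guardedFake();
} catch (e) {
  if (e instanceof SketchBudgetExceededError) abortedAtUsd = e.spent;
  else throw e;
}
console.log(`generations sent: ${generationsSent}, aborted at $${abortedAtUsd.toFixed(2)}`);
```

In this sketch, with a $0.50 cap and $0.20 per generation, three generations are sent, the record step throws at about $0.60, and any later call is rejected in preflight without the stand-in function being invoked.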
Implementation in TypeScript and Python
TypeScript: per-run budget cap with real token prices

    import { guard, BudgetExceededError } from "@runguard/sdk";
    import Anthropic from "@anthropic-ai/sdk";

    const client = new Anthropic();

    // Token prices (USD per million tokens) as of 2026
    const PRICES = {
      "claude-sonnet-4-6": { input: 3, output: 15 },
      "claude-opus-4-7": { input: 15, output: 75 },
    };

    async function callClaude(
      model: keyof typeof PRICES,
      messages: Anthropic.MessageParam[]
    ) {
      const response = await client.messages.create({
        model,
        max_tokens: 4096,
        messages,
      });
      // Cost of this generation, from reported usage and the price table
      const p = PRICES[model];
      const usd =
        (response.usage.input_tokens * p.input +
          response.usage.output_tokens * p.output) /
        1_000_000;
      // Loop signature: name of the first tool call (it may follow a text block)
      const toolUse = response.content.find((b) => b.type === "tool_use");
      const sig = toolUse && toolUse.type === "tool_use" ? toolUse.name : "end_turn";
      return { response, usd, sig };
    }

    const guarded = guard(callClaude, {
      budget: { maxUsd: 5 },
      loop: { repeats: 3, maxCycleLen: 8 },
    });

    // In your agent loop:
    async function runAgent(task: string) {
      const messages: Anthropic.MessageParam[] = [{ role: "user", content: task }];
      while (true) {
        try {
          const { response } = await guarded("claude-sonnet-4-6", messages);
          if (response.stop_reason === "end_turn") {
            return { status: "complete", output: response.content };
          }
          // process tool calls and append results
          messages.push({ role: "assistant", content: response.content });
          // ... append tool results ...
        } catch (e) {
          if (e instanceof BudgetExceededError) {
            console.error(`Run aborted: spent $${e.spent.toFixed(4)}, cap was $5.00`);
            // notify on-call, log to analytics, or fail gracefully
            return { status: "budget_exceeded", spent: e.spent };
          }
          throw e;
        }
      }
    }

Python: same pattern with async guard support

    from runguard import guard_async, BudgetExceededError
    import anthropic

    client = anthropic.AsyncAnthropic()

    PRICES = {
        "claude-sonnet-4-6": (3, 15),  # input, output USD/1M tokens
        "claude-opus-4-7": (15, 75),
    }

    async def call_claude(model: str, messages: list) -> dict:
        response = await client.messages.create(
            model=model,
            max_tokens=4096,
            messages=messages,
        )
        # Cost of this generation, from reported usage and the price table
        p_in, p_out = PRICES[model]
        usd = (response.usage.input_tokens * p_in
               + response.usage.output_tokens * p_out) / 1_000_000
        # Loop signature: name of the first tool call (it may follow a text block)
        tool_use = next((b for b in response.content if b.type == "tool_use"), None)
        sig = tool_use.name if tool_use else "end_turn"
        return {"response": response, "usd": usd, "sig": sig}

    guarded = guard_async(
        call_claude,
        budget={"max_usd": 5},
        loop={"repeats": 3},
    )

    async def run_agent(task: str) -> dict:
        messages = [{"role": "user", "content": task}]
        while True:
            try:
                result = await guarded("claude-sonnet-4-6", messages)
                response = result["response"]
                if response.stop_reason == "end_turn":
                    return {"status": "complete", "output": response.content}
                # process tool calls
                messages.append({"role": "assistant", "content": response.content})
                # ... append tool results ...
            except BudgetExceededError as e:
                return {
                    "status": "budget_exceeded",
                    "spent": e.spent,
                    "message": f"Run aborted at ${e.spent:.4f} (cap: $5.00)",
                }
Calibrating the per-run cap: the 2–3× rule
The right value for maxUsd depends on your agent’s cost profile on successful runs. The goal is to let legitimate long runs finish while stopping runaway loops before they become expensive. A practical calibration process: run your agent on a representative sample of real tasks and record the cost of each run by summing the usd values your function returns (or by reading it from your observability platform). Find the 95th percentile cost across those runs, which stands in for the cost of your most expensive legitimate runs. Set maxUsd to 2–3× that value. A cap at 2× the 95th-percentile cost means that runs in the top 5% of your normal distribution usually still complete without hitting the guard, while a run that has already doubled the cost of an expensive legitimate run is stopped. This heuristic works because loop-induced costs are super-linear: a loop that repeats three times has already spent at least 3× the per-iteration cost, not counting the growing context overhead. In practice, a cap at 2–3× the 95th-percentile legitimate cost separates most looping runs from expensive-but-legitimate ones. If your cost distribution has several modes (some tasks are inherently cheap and others inherently expensive), set different caps per task type rather than a single fleet-wide cap: wrap your LLM call function separately for cheap-task callers and expensive-task callers, or use RunGuard’s reset capability (guard.reset()) to start a fresh budget accumulator for each new task invocation within the same process.
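The calibration step can be sketched as a small calculation over recorded run costs. The sample costs below are made-up figures for illustration, percentile and suggestMaxUsd are hypothetical helpers rather than RunGuard functions, and the nearest-rank percentile is one reasonable choice among several.

```typescript
// Nearest-rank percentile over observed per-run costs in USD.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("need at least one observed run cost");
  const sorted = [...values].sort((a, b) => a - b);
  // 1-based nearest rank, clamped to the valid range.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// Suggested per-run cap: a multiple (2 to 3) of the 95th-percentile run cost.
function suggestMaxUsd(runCostsUsd: number[], multiplier = 2): number {
  return percentile(runCostsUsd, 95) * multiplier;
}

// Made-up costs of 20 successful runs of one task type.
const observedRunCostsUsd = [
  0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9,
  0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.8, 2.5,
];

const p95Usd = percentile(observedRunCostsUsd, 95);
const capAt2x = suggestMaxUsd(observedRunCostsUsd, 2);
const capAt3x = suggestMaxUsd(observedRunCostsUsd, 3);

console.log(`p95: $${p95Usd.toFixed(2)}`);
console.log(`maxUsd at 2x: $${capAt2x.toFixed(2)}, at 3x: $${capAt3x.toFixed(2)}`);
```

With this sample the 95th-percentile cost is $1.80, so the suggested cap falls between $3.60 and $5.40, and the most expensive observed run ($2.50) would still complete under either value. For several task types, the same calculation would be run once per type on that type's own costs.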
Pairing budget caps with loop detection
- The same guard() wrapper activates both detectors. guard(fn, { budget: { maxUsd: 5 }, loop: { repeats: 3, maxCycleLen: 8 } }) runs both checks before each call. The loop detector fires first (on the preflight step) because a loop that has already repeated three times should be stopped regardless of cost: a very cheap loop on a small model can still exhaust the context window even if it hasn’t crossed the budget cap. The budget tracker fires second (after the loop check passes). This ordering means the loop detector catches the structural problem (repeated behavior) and the budget tracker catches the cost problem (accumulated spend), and neither guard fires unnecessarily when the other guard should have fired first.
- A loop that the loop detector misses (dissimilar outputs) is caught by the budget tracker. The LoopDetector compares signatures: a 64-byte hash of the tool name and a truncated slice of the tool input. If a looping agent produces slightly different tool inputs on each iteration (slight rephrasing of a search query, different parameter values for the same tool), the signatures may not repeat and the loop detector may not fire. But the cost still accumulates: a loop with varying inputs is still a loop, and still burns money. The budget tracker is the backstop for this case: even if the loop detector passes, the budget cap fires when accumulated spend exceeds maxUsd.
- Adding context overflow detection closes the third failure mode. A complete RunGuard integration for runaway cost prevention activates all three guards: guard(fn, { loop: { repeats: 3 }, budget: { maxUsd: 10 }, context: { maxContextTokens: 200_000, headroom: 8192, tokens: countFn } }). The loop detector catches structural repetition. The budget tracker catches accumulated spend. The context guard catches near-limit degradation and prevents the hard-limit 400 that would cost you the tokens from all preceding calls in the run. All three failure modes (structural loops, cost runaway, context overflow) are covered by one wrapper with three option blocks.
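The way the two detectors back each other up can be illustrated by replaying recorded calls through a simplified model of both checks. Everything here is an assumption made for illustration: callSignature uses a plain truncated string instead of a hash, tailRepeats only detects back-to-back identical signatures (cycle length 1), and simulateGuards evaluates each call after it is made. It is not how RunGuard's LoopDetector is implemented.

```typescript
type RecordedCall = { tool: string; input: unknown; usd: number };
type StopReason = "loop" | "budget" | "none";

// Simplified signature: tool name plus a truncated slice of the JSON input.
function callSignature(tool: string, input: unknown, sliceLen = 64): string {
  return `${tool}:${JSON.stringify(input).slice(0, sliceLen)}`;
}

// True when the last `repeats` signatures are all identical.
function tailRepeats(signatures: string[], repeats: number): boolean {
  if (signatures.length < repeats) return false;
  const tail = signatures.slice(-repeats);
  return tail.every((s) => s === tail[0]);
}

// Replays a trace and reports which check would have stopped the run first.
function simulateGuards(
  trace: RecordedCall[],
  repeats: number,
  maxUsd: number
): { reason: StopReason; callsMade: number; spentUsd: number } {
  const signatures: string[] = [];
  let spentUsd = 0;
  let callsMade = 0;
  for (const call of trace) {
    spentUsd += call.usd;
    callsMade += 1;
    signatures.push(callSignature(call.tool, call.input));
    if (tailRepeats(signatures, repeats)) return { reason: "loop", callsMade, spentUsd };
    if (spentUsd > maxUsd) return { reason: "budget", callsMade, spentUsd };
  }
  return { reason: "none", callsMade, spentUsd };
}

// Ten identical searches: the signature repeats, so the loop check stops the run.
const identicalTrace: RecordedCall[] = Array.from({ length: 10 }, () => ({
  tool: "search",
  input: { query: "pricing" },
  usd: 0.5,
}));

// Ten slightly rephrased searches: signatures never repeat, so only spend stops it.
const varyingTrace: RecordedCall[] = Array.from({ length: 10 }, (_, i) => ({
  tool: "search",
  input: { query: `pricing attempt ${i}` },
  usd: 0.5,
}));

const identicalOutcome = simulateGuards(identicalTrace, 3, 2);
const varyingOutcome = simulateGuards(varyingTrace, 3, 2);
console.log(identicalOutcome, varyingOutcome);
```

In this model the identical trace stops on the loop check after three calls ($1.50), while the rephrased trace slips past it and is stopped by the $2 budget after five calls ($2.50).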
The horror story: six consecutive failed sessions before the circuit breaker
We built RunGuard while running a bespoke daily script that posts a launch thread via the X API. The script had no loop detection and no budget cap (it made a single API call per day, not an agent loop). The API call came back HTTP 402 CreditsDepleted on day one. Days two through six: same error. Six consecutive sessions, zero progress, but also zero cost escalation because the X API returned an error before we spent anything. The X API failure was not a cost problem; it was a reliability problem. But the same failure pattern, if it had been a looping agent calling an LLM instead of a daily script calling a social API, would have looked like this: the first call returns a wrong result, the agent calls the LLM again with the same input, gets the same wrong result, calls again, and again, and again. At $0.015 per 1,000 tokens, six iterations of a 10,000-token context (60,000 tokens) cost $0.90. With a growing context (each iteration adds 1,000 tokens of accumulated results, so the context runs from 10,000 up to 15,000 tokens, 75,000 in total), the same six iterations cost about $1.13. At a rate five times higher, $0.075 per 1,000 tokens, six iterations of a 10,000-token context cost $4.50, growing to about $5.63 with the accumulating context. Six iterations is a short loop; a run that nobody stops keeps adding $0.75 or more per iteration at the higher rate. With maxUsd: 5, that run is stopped at the sixth iteration in the growing-context case and at the seventh in the flat case, instead of running until something else intervenes. At the lower rate the same loop stays well under the cap, which is why identical repeated calls are the loop detector's job (repeats: 3) and the budget cap is the backstop. The full incident writeup is on the 30-day log, including the exact signature detection timeline.
What this is not
- Not a replacement for fleet-level cost monitoring. RunGuard’s per-run budget cap prevents individual runaway runs from driving up your bill. Fleet-level cost monitoring (LangSmith dashboards, Anthropic’s usage console, CloudWatch cost alerts) is still necessary for detecting trends across many runs: if your 95th-percentile legitimate run cost has drifted upward from $1 to $3 over a month (because your prompts got longer or you switched to a more expensive model), you want to know that. Use RunGuard’s per-run cap as the per-job fence; use fleet-level monitoring as the fleet-wide trend detector. The two instruments are complementary at different scales.
- Not a billing proxy or a server-side rate limiter. RunGuard runs entirely in your application process. The budget accumulator is an in-memory counter for the duration of a run. It does not communicate with your LLM provider to enforce the cap at the API layer, and it does not persist across process restarts. If your agent runs in a stateless serverless environment where each function invocation is a new process, the accumulator starts at zero for each invocation. This is usually correct behavior for serverless agents (each invocation is a new “run”), but if you need cross-invocation tracking, you are responsible for initializing the guard with the accumulated spend from prior invocations (not a RunGuard feature today).
- Not a smart cost optimizer. RunGuard does not decide which LLM call to make cheaper (shorter prompts, cheaper models, cached responses). It decides whether to make the call at all. Prompt optimization, model routing, and response caching are separate concerns. RunGuard is the last line of defense when optimization hasn’t prevented a run from approaching the cap, not a substitute for building efficient agents.
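For the cross-invocation case mentioned above, one workable approach is to persist spend yourself and give each fresh invocation only the remainder of the overall cap. The sketch below assumes a hypothetical SpendStore interface backed by an in-memory map; in a serverless deployment it would be backed by an external store and would be async. None of these names are RunGuard APIs.

```typescript
// Hypothetical persistence interface; a real one would wrap a database or cache.
interface SpendStore {
  get(runId: string): number;
  add(runId: string, usd: number): void;
}

class InMemorySpendStore implements SpendStore {
  private totals = new Map<string, number>();

  get(runId: string): number {
    return this.totals.get(runId) ?? 0;
  }

  add(runId: string, usd: number): void {
    this.totals.set(runId, this.get(runId) + usd);
  }
}

// Budget left for a logical run that spans several process invocations.
function remainingCapUsd(totalCapUsd: number, store: SpendStore, runId: string): number {
  return Math.max(0, totalCapUsd - store.get(runId));
}

const spendStore = new InMemorySpendStore();

// Invocation 1 spent $3.50 of a $5.00 overall cap and recorded it before exiting.
spendStore.add("run-42", 3.5);
const capForSecondInvocation = remainingCapUsd(5, spendStore, "run-42");

// Invocation 2 spent another $2.00, so nothing is left for a third.
spendStore.add("run-42", 2);
const capForThirdInvocation = remainingCapUsd(5, spendStore, "run-42");

console.log(capForSecondInvocation, capForThirdInvocation);
```

Each invocation would pass the remaining amount as its own maxUsd and skip the run entirely when the remainder is zero. Concurrent invocations of the same run would need an atomic update in the store, which this sketch does not attempt.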
The minimum integration
Two lines of new code: import { guard } from "@runguard/sdk" and const guarded = guard(yourLlmCallFn, { budget: { maxUsd: 5 } }). One new return field from your guarded function: usd (the cost of this specific generation, computed from response.usage and your per-token rate). One catch block for BudgetExceededError. That’s the entire budget guard integration. The loop guard and context guard are additive options on the same guard() call. RunGuard ships as @runguard/sdk on npm and runguard on PyPI. The full API surface is in llms.txt. The LangChain circuit breaker page and CrewAI loop detection page show the same budget + loop integration applied to specific frameworks. For context-overflow protection alongside the budget cap, see the context window truncation alert page.