How to set a max cost per LLM request — and why “per request” is the wrong granularity for agents
Every team that has shipped an AI agent eventually discovers the same problem: a single API call to GPT-4o or Claude costs fractions of a cent, so it feels too small to worry about. Then the agent loops fourteen times on the same tool call and the run costs $12. The instinct is to find a “max cost per LLM request” setting and turn it on. That’s a good instinct, but it targets the wrong level of granularity. A single LLM request in an agentic loop can cost $0.03 and still be a problem if it repeats 400 times. The right primitive is max cost per agent run, not per individual request. This page explains three approaches, their failure modes, and the right architecture for each stage of production maturity.
Approach 1 — Provider-level spend limits (billing caps)
- What it is. OpenAI, Anthropic, Google, and most LLM providers allow you to set monthly spend limits in their billing dashboard. When your account hits the limit, API calls start returning 429 errors. Some providers (OpenAI as of 2025) support project-level limits so you can cap one application without affecting others on the same account.
- When it works. Provider limits are the right backstop for limiting the damage from an account compromise (for example, a leaked API key that triggers mass generation). They are account-wide or project-wide and enforce hard monthly ceilings, which makes them the correct tool for billing-level protection at the infrastructure layer.
- Where it fails for agents. Provider limits fire at the billing layer, not the application layer. If your monthly cap is $200 and a runaway agent spends $50 in one afternoon, the agent keeps running until the cap is hit — you see the damage only when the bill arrives or when the 429s start cascading into your application error logs. Provider limits do not let you say “this agent run should cost no more than $3.” They enforce “this account should cost no more than $200 this month.”
- How to set it. OpenAI: Project Settings → Limits → Set a monthly budget limit. Anthropic: Usage → Usage limits. These are billing-layer controls, not per-request or per-run controls. Set them as a backstop and add a runtime guard for finer granularity.
Approach 2 — Token-count guards (max_tokens + prompt budget)
- What it is. Most LLM APIs accept a `max_tokens` parameter (newer OpenAI models name it `max_completion_tokens`) that caps the number of completion tokens the model generates. This indirectly caps the cost of a single API call, since completion tokens are typically priced several times higher than input tokens (4× for GPT-4o at the prices below). You can combine this with a max prompt size (in tokens) to bound the input cost per call as well.
- The calculation for a single call. For GPT-4o at $2.50/1M input and $10/1M output: a call with 10,000 input tokens and 2,000 output tokens costs $0.025 + $0.020 = $0.045. Setting `max_tokens=2000` bounds the output cost per call but not the accumulated cost across a multi-step agent run. An agent with a 10-step loop still accumulates $0.45 even with `max_tokens=2000` per call, and more if the prompt grows at each step.
- The right way to implement prompt budget tracking.

```python
from openai import OpenAI

client = OpenAI()


class BudgetedChatClient:
    """Wraps OpenAI client with per-session USD accounting."""

    # (input, output) price in USD per 1M tokens
    _PRICE = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "o3-mini": (1.10, 4.40),  # o-series models expect max_completion_tokens
    }

    def __init__(self, max_usd: float, model: str = "gpt-4o"):
        self._max_usd = max_usd
        self._model = model
        self._spent = 0.0

    def chat(self, messages: list, **kwargs) -> str:
        # Unknown models fall back to gpt-4o pricing.
        input_price, output_price = self._PRICE.get(self._model, (2.50, 10.00))
        response = client.chat.completions.create(
            model=self._model,
            messages=messages,
            max_tokens=kwargs.pop("max_tokens", 4096),
            **kwargs,
        )
        usage = response.usage
        cost = (
            usage.prompt_tokens * input_price
            + usage.completion_tokens * output_price
        ) / 1_000_000
        self._spent += cost
        # The check runs after the call, so the breaching call is already paid for.
        if self._spent > self._max_usd:
            raise RuntimeError(
                f"Budget exceeded: ${self._spent:.4f} spent (cap ${self._max_usd:.2f})"
            )
        return response.choices[0].message.content


# Usage
llm = BudgetedChatClient(max_usd=2.0, model="gpt-4o")
# Each call to llm.chat() accumulates cost and raises on breach
```

- Why this is still incomplete for loops. The client above raises after the call returns, so the cost of the breaching call has already been incurred. It also does not detect loops: if the same call repeats eight times while the total stays under budget, all eight calls go through and you pay for all eight, because nothing inspects what is being called. For a real-time circuit breaker that fires before a call goes out when a loop is detected, you need pattern detection in the guard layer, not just cost accumulation.
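One way to close the one-call lag described above is to check a worst-case estimate before the request goes out. The sketch below is illustrative only: `PreflightBudget`, `call_cost_usd`, and the caller-supplied input-token estimate are hypothetical names, not part of any SDK, and the prices are the GPT-4o figures quoted in this article.

```python
# USD per 1M tokens as (input, output), copied from the figures in this article.
PRICES = {"gpt-4o": (2.50, 10.00)}


def call_cost_usd(input_tokens: int, output_tokens: int, model: str = "gpt-4o") -> float:
    """Cost of a single call for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


class PreflightBudget:
    """Hypothetical helper: refuses a call whose worst case would breach the cap."""

    def __init__(self, max_usd: float, model: str = "gpt-4o"):
        self.max_usd = max_usd
        self.model = model
        self.spent = 0.0

    def check(self, estimated_input_tokens: int, max_tokens: int) -> None:
        """Call before the request. Assumes the model uses all max_tokens of output."""
        worst_case = call_cost_usd(estimated_input_tokens, max_tokens, self.model)
        if self.spent + worst_case > self.max_usd:
            raise RuntimeError(
                f"Refusing call: ${self.spent:.4f} spent + ${worst_case:.4f} worst case "
                f"exceeds cap ${self.max_usd:.2f}"
            )

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Call after the request with the real usage numbers from the response."""
        cost = call_cost_usd(input_tokens, output_tokens, self.model)
        self.spent += cost
        return cost
```

The trade-off is that the worst-case estimate is pessimistic: a run may be stopped slightly under its cap because the next call could have breached it, even if it would not have in practice.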
Approach 3 — SDK-level runtime guard (RunGuard)
- What it is. RunGuard wraps your agent’s tool-call or model-call function and maintains two pieces of state across the entire run: accumulated USD and a sliding window of call signatures. Before each call goes out, RunGuard checks both: if accumulated cost exceeds `max_usd`, it raises `BudgetExceededError` before the HTTP request leaves your process; if the tool-call signature matches a pattern seen `repeats` times in the recent window, it raises `LoopDetectedError`. Both errors fire at the layer that is cheapest to intercept: before the call, not after.
- Python integration pattern.

```python
from runguard import guard, BudgetExceededError, LoopDetectedError
from openai import OpenAI

client = OpenAI()


def call_llm(messages: list, tools: list) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        max_tokens=4096,
    )
    choice = response.choices[0]
    usage = response.usage
    # GPT-4o pricing: $2.50 input / $10.00 output per 1M tokens
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    # Signature used for loop detection: the first tool name, or "end_turn"
    sig = "end_turn"
    if choice.message.tool_calls:
        sig = choice.message.tool_calls[0].function.name
    return {"response": choice.message, "usd": usd, "sig": sig}


# One guard instance per agent run, not per call
run_guard = guard(
    call_llm,
    budget={"max_usd": 3.00},  # $3 per-run hard cap
    loop={"repeats": 3, "max_cycle_len": 10},
)

try:
    # messages and tools come from your agent loop
    result = run_guard(messages, tools)
except BudgetExceededError as e:
    print(f"Budget hit: ${e.spent:.4f} (cap $3.00)")
except LoopDetectedError as e:
    print(f"Loop: {e.pattern!r} repeated {e.repeats}x")
```

- TypeScript integration pattern.

```typescript
import { guard, BudgetExceededError, LoopDetectedError } from "@runguard/sdk";
import OpenAI from "openai";

const client = new OpenAI();

async function callLLM(
  messages: OpenAI.ChatCompletionMessageParam[],
  tools: OpenAI.ChatCompletionTool[],
) {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
    max_tokens: 4096,
  });
  const choice = response.choices[0];
  const usage = response.usage!;
  // GPT-4o pricing: $2.50 input / $10.00 output per 1M tokens
  const usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000;
  // Signature used for loop detection: the first tool name, or "end_turn"
  const sig = choice.message.tool_calls?.[0]?.function?.name ?? "end_turn";
  return { response: choice.message, usd, sig };
}

const runGuard = guard(callLLM, {
  budget: { maxUsd: 3.00 },
  loop: { repeats: 3, maxCycleLen: 10 },
});

try {
  // messages and tools come from your agent loop
  const result = await runGuard(messages, tools);
} catch (e) {
  if (e instanceof BudgetExceededError) {
    console.log(`Budget: $${e.spent.toFixed(4)} of $3.00`);
  } else if (e instanceof LoopDetectedError) {
    console.log(`Loop: ${e.pattern} × ${e.repeats}`);
  }
}
```

- Setting the right budget value. A good starting `max_usd` for a given agent task is 2–3× the 95th-percentile cost of a successful run. Instrument 20–30 real runs with the cost-accumulation approach from Approach 2, take the P95, and at least double it. If P95 is $0.90, doubling gives $1.80, so round up and set `max_usd=2.00`. This catches genuine runaway behaviour (a run that costs $5 when it should cost $0.90 is a loop or runaway context growth) without false positives on legitimately complex runs that cost more than average. Revisit the cap every time you change the agent’s tool set or prompt.
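The budget rule above (P95 of observed run costs, multiplied by two) can be sketched in a few lines. The function names, the nearest-rank percentile method, and the $0.50 rounding step are assumptions chosen to reproduce the $0.90 to $2.00 example; they are not part of RunGuard.

```python
import math


def percentile_nearest_rank(values: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-indexed rank
    return ordered[max(rank, 1) - 1]


def suggest_max_usd(run_costs: list, multiplier: float = 2.0, round_to: float = 0.50) -> float:
    """Return multiplier × P95 of the observed run costs, rounded up to the next round_to step."""
    p95 = percentile_nearest_rank(run_costs, 95)
    return math.ceil(p95 * multiplier / round_to) * round_to
```

With only 20 to 30 samples the P95 is effectively one of the two most expensive runs, so it is worth re-running the calculation as more production data arrives.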
Comparison: the approaches at a glance (Approach 2 is split into its two mechanisms)
| Approach | Granularity | Fires before or after cost? | Detects loops? | Best for |
|---|---|---|---|---|
| Provider billing cap | Monthly, account or project | After (billing lag) | No | Infrastructure backstop |
| max_tokens per call | Per-call output size | After (call completes) | No | Output size control |
| Accumulated cost check | Per-run accumulated | After each call | No | Cost visibility with 1-call lag |
| RunGuard budget guard | Per-run accumulated | Before each call | Yes (signature window) | Real-time circuit breaker |
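The "signature window" in the last row can be illustrated with a small detector that looks for a repeating cycle at the tail of the recent call history. This is a minimal sketch of the general technique, not RunGuard's actual implementation; `LoopDetector` and `observe` are hypothetical names, and the `repeats` and `max_cycle_len` parameters only mirror the config keys shown earlier.

```python
from collections import deque
from typing import Optional, Tuple


class LoopDetector:
    """Flags a loop when the most recent signatures are one cycle repeated `repeats` times."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 10):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # Keep just enough history to see the longest cycle repeated `repeats` times.
        self._window = deque(maxlen=repeats * max_cycle_len)

    def observe(self, sig: str) -> Optional[Tuple[str, ...]]:
        """Record one call signature; return the repeating cycle if found, else None."""
        self._window.append(sig)
        history = list(self._window)
        for cycle_len in range(1, self.max_cycle_len + 1):
            span = cycle_len * self.repeats
            if span > len(history):
                break  # not enough history for this or any longer cycle
            tail = history[-span:]
            cycle = tail[:cycle_len]
            if all(tail[i] == cycle[i % cycle_len] for i in range(span)):
                return tuple(cycle)
        return None
```

A guard would call `observe` with each signature and refuse the next call once it returns a cycle. Matching on the tool name alone can flag legitimate repeated calls with different arguments, so a real signature would probably include a hash of the arguments as well.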
What to do when the budget fires
- Graceful degradation: summarize what you have. On `BudgetExceededError`, catch the exception in your agent loop and pass the accumulated conversation history to a cheap model (GPT-4o-mini or Claude Haiku) with a prompt asking it to summarize the partial results. Return the summary to the user with a notice that the task was partially completed. This is almost always better than returning nothing.
- Checkpoint and resume. After each tool call, serialize the agent’s accumulated state (messages, tool results, intermediate variables) to a checkpoint store (Redis, SQLite, S3). On `BudgetExceededError`, save the checkpoint and return a resume token to the caller. The caller can start a new run with the resume token and pick up from the checkpoint, using a fresh budget. This pattern is most useful for long-running research or data-processing agents where partial results have value.
- Soft-limit warning + hard cap. Set `warn_at_fraction=0.7` in the RunGuard budget config to receive a warning callback when 70% of the budget is spent. Use the warning to inject a “please wrap up” message into the agent’s context, giving the model a chance to produce a partial answer before the hard cap fires. Set the hard cap at 100% as before. This reduces abrupt cutoffs without raising the ceiling.
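The checkpoint-and-resume pattern from the list above might look roughly like the following. Everything here is a simplified stand-in: `CheckpointStore`, `BudgetExceeded`, and `run_steps` are hypothetical names, the agent is simulated by a list of per-step costs, and for brevity the sketch writes the checkpoint only at the moment the budget would be breached rather than after every tool call.

```python
import json
import sqlite3
import uuid
from typing import Optional


class BudgetExceeded(Exception):
    """Carries the resume token for the checkpoint saved at the point of failure."""

    def __init__(self, resume_token: str):
        super().__init__(f"budget exceeded, resume with token {resume_token}")
        self.resume_token = resume_token


class CheckpointStore:
    """Stores JSON-serializable agent state in SQLite, keyed by a random token."""

    def __init__(self, path: str = ":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints (token TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def save(self, state: dict) -> str:
        token = uuid.uuid4().hex
        self._db.execute(
            "INSERT INTO checkpoints (token, state) VALUES (?, ?)", (token, json.dumps(state))
        )
        self._db.commit()
        return token

    def load(self, token: str) -> dict:
        row = self._db.execute(
            "SELECT state FROM checkpoints WHERE token = ?", (token,)
        ).fetchone()
        if row is None:
            raise KeyError(f"unknown resume token: {token}")
        return json.loads(row[0])


def run_steps(step_costs: list, max_usd: float, store: CheckpointStore,
              resume_token: Optional[str] = None) -> list:
    """Simulated agent run: each entry in step_costs stands in for one tool call."""
    state = store.load(resume_token) if resume_token else {"next_step": 0, "results": []}
    spent = 0.0  # every run, including a resumed one, starts with a fresh budget
    while state["next_step"] < len(step_costs):
        cost = step_costs[state["next_step"]]
        if spent + cost > max_usd:
            raise BudgetExceeded(store.save(state))
        spent += cost
        state["results"].append(f"step-{state['next_step']}")
        state["next_step"] += 1
    return state["results"]
```

A caller would catch `BudgetExceeded`, hand `resume_token` back to the user or a scheduler, and later call `run_steps` again with that token to continue from the saved step.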