How to set a max cost per LLM request — and why “per request” is the wrong granularity for agents

Every team that has shipped an AI agent eventually discovers the same problem: a single API call to GPT-4o or Claude costs fractions of a cent, so it feels too small to worry about. Then the agent loops fourteen times on the same tool call and the run costs $12. The instinct is to find a “max cost per LLM request” setting and turn it on. That’s a good instinct, but it targets the wrong level of granularity. A single LLM request in an agentic loop can cost $0.03 and still be a problem if it repeats 400 times. The right primitive is max cost per agent run, not per individual request. This page explains three approaches, their failure modes, and the right architecture for each stage of production maturity.

Approach 1 — Provider-level spend limits (billing caps)

What it is. OpenAI, Anthropic, Google, and most LLM providers allow you to set monthly spend limits in their billing dashboard. When your account hits the limit, API calls start returning 429 errors. Some providers (OpenAI as of 2025) support project-level limits so you can cap one application without affecting others on the same account.
When it works. Provider limits are the right backstop for preventing total account compromise (a leaked API key that triggers mass generation). They are account-wide or project-wide and enforce hard monthly ceilings. They are the correct tool for billing-level protection at the infrastructure layer.
Where it fails for agents. Provider limits fire at the billing layer, not the application layer. If your monthly cap is $200 and a runaway agent spends $50 in one afternoon, the agent keeps running until the cap is hit — you see the damage only when the bill arrives or when the 429s start cascading into your application error logs. Provider limits do not let you say “this agent run should cost no more than $3.” They enforce “this account should cost no more than $200 this month.”
How to set it. OpenAI: Project Settings → Limits → Set a monthly budget limit. Anthropic: Usage → Usage limits. These are billing-layer controls, not per-request or per-run controls. Set them as a backstop and add a runtime guard for finer granularity.

Approach 2 — Token-count guards (max_tokens + prompt budget)

What it is. Every LLM API accepts a max_tokens parameter that caps the number of completion tokens the model generates. This indirectly caps the cost of a single API call, since completion tokens are typically priced at 2–4× input tokens. You can combine this with a max prompt size (in tokens) to bound the input cost per call as well.
The calculation for a single call. For GPT-4o at $2.50/1M input and $10/1M output: a call with 10,000 input tokens and 2,000 output tokens costs $0.025 + $0.020 = $0.045. Setting max_tokens=2000 bounds the output cost per call but not the accumulated cost across a multi-step agent run. An agent with a 10-step loop still accumulates $0.45 even with max_tokens=2000 per call.

The right way to implement prompt budget tracking.

from openai import OpenAI

client = OpenAI()

class BudgetedChatClient:
    """Wraps OpenAI client with per-session USD accounting."""

    def __init__(self, max_usd: float, model: str = "gpt-4o"):
        self._max_usd = max_usd
        self._model = model
        self._spent = 0.0
        self._PRICE = {
            "gpt-4o":      (2.50, 10.00),   # (input, output) per 1M tokens
            "gpt-4o-mini": (0.15,  0.60),
            "o3-mini":     (1.10,  4.40),
        }

    def chat(self, messages: list, **kwargs) -> str:
        input_price, output_price = self._PRICE.get(self._model, (2.50, 10.00))
        response = client.chat.completions.create(
            model=self._model,
            messages=messages,
            max_tokens=kwargs.pop("max_tokens", 4096),
            **kwargs,
        )
        usage = response.usage
        cost = (usage.prompt_tokens * input_price + usage.completion_tokens * output_price) / 1_000_000
        self._spent += cost
        if self._spent > self._max_usd:
            raise RuntimeError(
                f"Budget exceeded: ${self._spent:.4f} spent (cap ${self._max_usd:.2f})"
            )
        return response.choices[0].message.content

# Usage
llm = BudgetedChatClient(max_usd=2.0, model="gpt-4o")
# Each call to llm.chat() accumulates cost and raises on breach

Why this is still incomplete for loops. The client above raises after the call returns — the cost has already been incurred. It also does not detect loops: if the same call repeats eight times under budget, all eight calls go through and you pay for all eight before the next breach check. For a real-time circuit breaker that fires before a call goes out when a loop is detected, you need pattern detection in the guard layer, not just cost accumulation.

Approach 3 — SDK-level runtime guard (RunGuard)

What it is. RunGuard wraps your agent’s tool-call or model-call function and maintains two states across the entire run: accumulated USD and a sliding window of call signatures. Before each call goes out, RunGuard checks both: if accumulated cost exceeds max_usd, it raises BudgetExceededError before the HTTP request leaves your process; if the tool-call signature matches a pattern seen repeats times in the recent window, it raises LoopDetectedError. Both errors fire at the layer that is cheapest to intercept: before the call, not after.

Python integration pattern.

from runguard import guard, BudgetExceededError, LoopDetectedError
from openai import OpenAI

client = OpenAI()

def call_llm(messages: list, tools: list) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        max_tokens=4096,
    )
    choice = response.choices[0]
    usage = response.usage
    usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
    sig = "end_turn"
    if choice.message.tool_calls:
        sig = choice.message.tool_calls[0].function.name
    return {"response": choice.message, "usd": usd, "sig": sig}

# One guard instance per agent run — not per call
run_guard = guard(
    call_llm,
    budget={"max_usd": 3.00},          # $3 per-run hard cap
    loop={"repeats": 3, "max_cycle_len": 10},
)

try:
    result = run_guard(messages, tools)
except BudgetExceededError as e:
    print(f"Budget hit: ${e.spent:.4f} (cap $3.00)")
except LoopDetectedError as e:
    print(f"Loop: {e.pattern!r} repeated {e.repeats}x")

TypeScript integration pattern.

import { guard, BudgetExceededError, LoopDetectedError } from "@runguard/sdk";
import OpenAI from "openai";

const client = new OpenAI();

async function callLLM(messages: OpenAI.ChatCompletionMessageParam[], tools: OpenAI.ChatCompletionTool[]) {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
    max_tokens: 4096,
  });
  const choice = response.choices[0];
  const usage = response.usage!;
  const usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000;
  const sig = choice.message.tool_calls?.[0]?.function?.name ?? "end_turn";
  return { response: choice.message, usd, sig };
}

const runGuard = guard(callLLM, {
  budget: { maxUsd: 3.00 },
  loop: { repeats: 3, maxCycleLen: 10 },
});

try {
  const result = await runGuard(messages, tools);
} catch (e) {
  if (e instanceof BudgetExceededError) {
    console.log(`Budget: $${e.spent.toFixed(4)} of $3.00`);
  } else if (e instanceof LoopDetectedError) {
    console.log(`Loop: ${e.pattern} × ${e.repeats}`);
  }
}

Setting the right budget value. The correct max_usd for a given agent task is 2–3× the 95th-percentile cost of a successful run. Instrument 20–30 real runs with the cost-accumulation approach from Approach 2, take the P95, and double it. If P95 is $0.90, set max_usd=2.00. This catches genuine runaway behaviour (a run that costs $5 when it should cost $0.90 is a loop or a context blow-through) without false positives on legitimately complex runs that take longer than average. Revisit the cap every time you change the agent’s tool set or prompt.

Comparison: three approaches at a glance

Approach	Granularity	Fires before or after cost?	Detects loops?	Best for
Provider billing cap	Monthly, account or project	After (billing lag)	No	Infrastructure backstop
max_tokens per call	Per-call output size	After (call completes)	No	Output size control
Accumulated cost check	Per-run accumulated	After each call	No	Cost visibility with 1-call lag
RunGuard budget guard	Per-run accumulated	Before each call	Yes (signature window)	Real-time circuit breaker

What to do when the budget fires

Graceful degradation: summarize what you have. On BudgetExceededError, catch the exception in your agent loop and pass the accumulated conversation history to a cheap model (GPT-4o-mini or Claude Haiku) with a prompt asking it to summarize the partial results. Return the summary to the user with a notice that the task was partially completed. This is almost always better than returning nothing.
Checkpoint and resume. After each tool call, serialize the agent’s accumulated state (messages, tool results, intermediate variables) to a checkpoint store (Redis, SQLite, S3). On BudgetExceededError, save the checkpoint and return a resume token to the caller. The caller can start a new run with the resume token and pick up from the checkpoint, using a fresh budget. This pattern is most useful for long-running research or data-processing agents where partial results have value.
Soft-limit warning + hard cap. Set warn_at_fraction=0.7 in the RunGuard budget config to receive a warning callback when 70% of the budget is spent. Use the warning to inject a “please wrap up” message into the agent’s context, giving the model a chance to produce a partial answer before the hard cap fires. Set the hard cap at 100% as before. This reduces abrupt cutoffs without raising the ceiling.