How to set a max cost per LLM request — and why “per request” is the wrong granularity for agents

Every team that has shipped an AI agent eventually discovers the same problem: a single API call to GPT-4o or Claude costs fractions of a cent, so it feels too small to worry about. Then the agent loops fourteen times on the same tool call and the run costs $12. The instinct is to find a “max cost per LLM request” setting and turn it on. That’s a good instinct, but it targets the wrong level of granularity. A single LLM request in an agentic loop can cost only $0.03 and still be a problem if it repeats 400 times, because 400 × $0.03 = $12 and no individual request ever looks expensive. The right primitive is max cost per agent run, not max cost per individual request. This page explains three approaches, their failure modes, and the right architecture for each stage of production maturity.
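To make the arithmetic concrete, here is a minimal sketch using the figures from the paragraph above ($0.03 per call, 400 calls). The $0.05 per-request ceiling and both function names are assumptions chosen for illustration; they do not correspond to any provider setting.

```python
# Illustrative numbers only: the per-call cost and call count come from the
# text above; the per-request cap is a hypothetical value, not real pricing.
PER_REQUEST_CAP_USD = 0.05
COST_PER_CALL_USD = 0.03
CALLS_IN_RUN = 400


def per_request_cap_trips(cost_per_call: float, cap: float) -> bool:
    """A per-request cap only ever sees one call at a time."""
    return cost_per_call > cap


def run_total(cost_per_call: float, calls: int) -> float:
    """Total cost of a run in which every call costs the same amount."""
    return round(cost_per_call * calls, 2)


tripped = per_request_cap_trips(COST_PER_CALL_USD, PER_REQUEST_CAP_USD)
total = run_total(COST_PER_CALL_USD, CALLS_IN_RUN)
print(f"per-request cap tripped: {tripped}, run total: ${total:.2f}")
```

Every call passes the per-request check, yet the run as a whole spends far more than any single-request ceiling would suggest.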

Approach 1 — Provider-level spend limits (billing caps)

Approach 2 — Token-count guards (max_tokens + prompt budget)

Approach 3 — SDK-level runtime guard (RunGuard)

Comparison: three approaches at a glance

| Approach | Granularity | Fires before or after cost? | Detects loops? | Best for |
|---|---|---|---|---|
| Provider billing cap | Monthly, account or project | After (billing lag) | No | Infrastructure backstop |
| max_tokens per call | Per-call output size | After (call completes) | No | Output size control |
| Accumulated cost check | Per-run accumulated | After each call | No | Cost visibility with 1-call lag |
| RunGuard budget guard | Per-run accumulated | Before each call | Yes (signature window) | Real-time circuit breaker |
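The last row of the comparison describes a guard that checks an accumulated per-run budget before each call and detects loops with a signature window. The sketch below shows one way such a guard could work. It is not RunGuard's actual API: the class, method, and parameter names are hypothetical, and it assumes that "signature window" means a sliding window over hashes of recent tool calls.

```python
import hashlib
import json
from collections import deque
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised before a call that would push the run over its budget."""


class LoopDetected(Exception):
    """Raised when the same call signature repeats too often in the window."""


@dataclass
class RunBudgetGuard:
    # All names and defaults here are illustrative assumptions.
    max_run_cost_usd: float
    window_size: int = 5      # how many recent call signatures to remember
    max_repeats: int = 3      # identical calls tolerated within the window
    spent_usd: float = 0.0
    _recent: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        self._recent = deque(maxlen=self.window_size)

    @staticmethod
    def _signature(tool_name: str, tool_args: dict) -> str:
        # Stable hash of the tool name and its arguments.
        payload = json.dumps([tool_name, tool_args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def check(self, estimated_cost_usd: float, tool_name: str, tool_args: dict) -> None:
        """Call BEFORE the LLM or tool request is sent."""
        if self.spent_usd + estimated_cost_usd > self.max_run_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f}, next call would exceed "
                f"${self.max_run_cost_usd:.2f}"
            )
        sig = self._signature(tool_name, tool_args)
        if self._recent.count(sig) >= self.max_repeats:
            raise LoopDetected(f"{tool_name} repeated with identical arguments")
        self._recent.append(sig)

    def record(self, actual_cost_usd: float) -> None:
        """Call AFTER the request completes, with the measured cost."""
        self.spent_usd += actual_cost_usd
```

In an agent loop, `check` would run before each request and `record` after it. Because the budget test happens first, the call that would cross the limit is never sent, which is the difference between the "before each call" and "after each call" rows in the table.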

What to do when the budget fires