LLM inference cost benchmarking: compare provider and model spend at production scale

Provider pricing pages list cost per million tokens. That number tells you almost nothing about what your agent will actually spend per session, per user, or per feature. Real inference cost is a function of your specific prompt templates, your tool definitions, your expected output lengths, your cache hit rate, your error rate, and your traffic distribution across model tiers. Two teams running nominally similar agents on the same provider can have 4–10x different per-session costs because of differences in prompt engineering and model routing decisions. LLM inference cost benchmarking — systematically measuring what your specific workload costs across models and providers, with quality held constant — is how you distinguish marketing page pricing from actual per-task cost. This page covers the benchmarking methodology, the key metrics to capture, how to account for quality in cost comparisons, and how to automate continuous cost benchmarking so provider pricing changes or model updates don’t silently alter your unit economics.

Why token-price comparisons mislead

Output tokens dominate, not input. For most AI agent workloads, output token cost is 3–5x higher per token than input cost on frontier models. An agent that uses 2,000 input tokens and 400 output tokens per call at Anthropic Sonnet pricing ($3/$15 per MTok input/output) pays $0.012 per call, half of it for the 400 output tokens, for an effective blended rate of $5.00/MTok, about 1.7x the headline input rate. Token-price comparisons that cite input cost without modeling the input-to-output ratio can substantially overstate the savings from switching providers.
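The blended-rate arithmetic can be sketched in a few lines. This is a minimal illustration, not a pricing tool: the function name and the example call shape (1,500 input and 500 output tokens) are assumptions chosen for the example, and the $3/$15 prices are the ones quoted above.

```python
def blended_rate_per_mtok(input_tokens: int, output_tokens: int,
                          input_price: float, output_price: float) -> float:
    """Effective USD price per million tokens for one call shape.

    Prices are USD per million tokens, as listed on provider pricing pages.
    """
    total_tokens = input_tokens + output_tokens
    if total_tokens == 0:
        raise ValueError("call has no tokens")
    # Weight each price by the number of tokens billed at that price.
    weighted = input_tokens * input_price + output_tokens * output_price
    return weighted / total_tokens


# Illustrative call shape: 1,500 input tokens and 500 output tokens.
rate = blended_rate_per_mtok(1_500, 500, input_price=3.0, output_price=15.0)
print(f"blended: ${rate:.2f}/MTok ({rate / 3.0:.1f}x the input rate)")
```

Running the same function over the input-to-output ratio observed in your own logs shows how far your effective rate sits from the headline input price.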
Cache hit rate changes the effective rate dramatically. Anthropic charges $0.30/MTok for cache reads vs $3/MTok for full input, a 10x reduction. OpenAI charges 50% of input rate for cached tokens. If your prompt has a stable 2,000-token system prompt and you achieve a 70% cache hit rate on it, the effective input rate on those tokens drops to $1.11/MTok (0.7 × $0.30 + 0.3 × $3.00), about 63% below the uncached rate, before any cache-write surcharge. A provider with a 25% higher headline price but better cache semantics can be significantly cheaper in practice. Benchmarking without measuring cache hit rates produces misleading comparisons.
Error rates add hidden cost. LLM API errors (rate limits, timeouts, validation errors, context length exceeded) trigger retries. At a 3% error rate with one retry, 3% of your calls cost 2x. At a 3% rate with structured output validation failures causing a correction loop, the overhead is higher still. Provider reliability at your traffic volume and in your geographic region directly impacts effective cost. A provider with a 10% lower token price but 2x your current error rate may cost 5–8% more in total once retries and engineer time are factored in.
Latency affects throughput cost. For batch workloads, slower inference directly reduces throughput, which increases infrastructure cost per unit of work. For real-time agents, P95 latency (not average) determines user-visible quality. A model that is 20% cheaper per token but takes 3x longer per call may cost more in infrastructure and user experience terms even if token spend is lower. Benchmarking must capture latency distributions, not just token costs.

Benchmarking methodology: the task-based approach

Define a representative task corpus. Select 50–200 representative tasks drawn from production logs. For an AI agent, a “task” is a complete agent run — from initial user input through tool calls to final response. Use stratified sampling to cover the full distribution of task complexity (simple queries, medium research tasks, complex multi-step tasks). Avoid cherry-picking easy or hard tasks; the corpus should reflect your actual traffic mix.
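One way to draw such a corpus is proportional stratified sampling. The sketch below assumes each logged task already carries a complexity label; the `complexity` field, the 70/20/10 traffic mix, and the function name are hypothetical and only serve the illustration.

```python
import random
from collections import defaultdict


def stratified_sample(tasks, strata_key, sample_size, seed=0):
    """Sample tasks so each stratum keeps roughly its share of traffic."""
    rng = random.Random(seed)  # fixed seed keeps the corpus reproducible
    by_stratum = defaultdict(list)
    for task in tasks:
        by_stratum[strata_key(task)].append(task)

    sample = []
    for members in by_stratum.values():
        # Proportional allocation, with at least one task per stratum.
        # Rounding means the total can differ slightly from sample_size.
        quota = max(1, round(sample_size * len(members) / len(tasks)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical production log: 70% simple, 20% medium, 10% complex tasks.
logged_tasks = (
    [{"id": i, "complexity": "simple"} for i in range(700)]
    + [{"id": i, "complexity": "medium"} for i in range(700, 900)]
    + [{"id": i, "complexity": "complex"} for i in range(900, 1000)]
)
corpus = stratified_sample(logged_tasks, lambda t: t["complexity"], sample_size=100)
```

Fixing the seed is a deliberate choice here: a stable corpus makes benchmark runs comparable from one night to the next.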
Capture full call traces, not just final outputs. For each task in the corpus, record the full sequence of LLM calls: input tokens, output tokens, model, latency, error flag, retry count, and cache hit. This gives you the full cost tree per task, not just the nominal token count of the final answer. Tasks that look similar in final output token count can differ 3x in total cost due to intermediate reasoning calls and tool use patterns.
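A trace record along these lines could hold the fields listed above. The class names, field names, price table, and the `example-model` identifier are assumptions made for this sketch, and it assumes every recorded call was billed; calls the provider did not charge for (for example, rejected rate-limited requests) should be excluded or priced at zero.

```python
from dataclasses import dataclass, field


@dataclass
class LLMCall:
    model: str
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int  # subset of input_tokens served from cache
    latency_ms: float
    error: bool = False
    retry_count: int = 0


@dataclass
class TaskTrace:
    task_id: str
    calls: list[LLMCall] = field(default_factory=list)

    def cost_usd(self, prices: dict[str, dict[str, float]]) -> float:
        """Sum cost over every call in the task, including failed attempts."""
        total = 0.0
        for call in self.calls:
            p = prices[call.model]  # USD per million tokens
            uncached = call.input_tokens - call.cached_input_tokens
            total += (uncached * p["input"]
                      + call.cached_input_tokens * p["cache_read"]
                      + call.output_tokens * p["output"]) / 1_000_000
        return total


# Hypothetical price table and a two-call agent run.
PRICES = {"example-model": {"input": 3.0, "cache_read": 0.30, "output": 15.0}}
trace = TaskTrace("task-001", [
    LLMCall("example-model", 2_000, 400, 0, latency_ms=850.0),
    LLMCall("example-model", 2_600, 300, 2_000, latency_ms=640.0),
])
task_cost = trace.cost_usd(PRICES)
```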
Quality scoring is mandatory. Cost without quality is meaningless; a model that achieves a task in 200 tokens but produces wrong output is not cheaper than one that uses 800 tokens and produces correct output. Establish a quality metric for your task type: ROUGE score for summarization, pass/fail for code generation, binary correctness for structured extraction, human-in-the-loop rating for open-ended tasks. Compute cost-per-quality-point: cost_usd / quality_score. This normalized metric enables fair comparison across models with different quality levels.
Run each task on each candidate model. Execute the full task corpus against each model you want to compare. Use the same prompt templates, same tool definitions, same system prompt. Hold everything constant except model and provider. Record cost and quality for each run. Average over the corpus to get mean cost-per-task and mean quality-per-task. The winner on cost-per-quality-point is not always the cheapest-per-token model.
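The comparison step might look like the sketch below. It computes cost-per-quality-point as mean cost divided by mean quality over the corpus, which is one reasonable reading of cost_usd / quality_score; averaging per-task ratios instead is another option and breaks down when a task scores zero. The model names and run data are invented for illustration.

```python
from statistics import mean


def cost_per_quality_point(runs):
    """runs: (cost_usd, quality_score) pairs for one model over the corpus."""
    mean_cost = mean(cost for cost, _ in runs)
    mean_quality = mean(quality for _, quality in runs)
    if mean_quality <= 0:
        return float("inf")  # a model that never succeeds has no finite cost per point
    return mean_cost / mean_quality


# Hypothetical results: pass/fail quality (1.0 or 0.0) on a five-task corpus.
results = {
    "small-model": [(0.004, 0.0), (0.004, 1.0), (0.004, 0.0), (0.004, 1.0), (0.004, 0.0)],
    "large-model": [(0.009, 1.0)] * 5,
}
ranking = sorted(results, key=lambda m: cost_per_quality_point(results[m]))
for model in ranking:
    print(f"{model}: ${cost_per_quality_point(results[model]):.4f} per quality point")
```

In this made-up data the model that costs more per task still ranks first, because the cheaper one passes only two tasks in five.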

Key metrics to track in benchmarks

Cost per successful task (CPST). Total cost (including all retries and failed attempts) divided by the number of tasks completed successfully. CPST captures the full economics of a model including its error rate. Assuming failures are independent and failed attempts are billed in full, a model with a 5% error rate and one retry has a CPST multiplier of about 1.05× its mean task cost; a model with a 15% error rate and two retries has a multiplier of about 1.18×. CPST is the single most honest cost metric for production AI agents.
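A small model of the retry overhead helps sanity-check measured CPST. The sketch below rests on two stated assumptions that may not hold for your workload: attempts fail independently with the same probability, and every attempt, failed or not, is billed in full. Function names are illustrative.

```python
def cpst(total_cost_usd: float, successful_tasks: int) -> float:
    """Measured cost per successful task from benchmark or production logs."""
    if successful_tasks == 0:
        return float("inf")
    return total_cost_usd / successful_tasks


def cpst_multiplier(error_rate: float, max_retries: int) -> float:
    """Expected CPST as a multiple of the cost of a single attempt."""
    # Attempt k (0-based) only happens if the previous k attempts all failed.
    expected_attempts = sum(error_rate ** k for k in range(max_retries + 1))
    # The task fails only if the first attempt and every retry fail.
    success_probability = 1 - error_rate ** (max_retries + 1)
    return expected_attempts / success_probability


print(f"{cpst_multiplier(0.10, max_retries=3):.3f}")
```

Under these assumptions the ratio simplifies to 1 / (1 - error_rate) whatever the retry cap, so a measured multiplier well above that value suggests correlated failures or correction loops rather than simple retries.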
Token efficiency ratio. useful output tokens / total output tokens billed (reasoning plus final output). A model that generates 1,500 tokens of reasoning to produce 200 tokens of useful output has a token efficiency ratio of 0.12; one that produces 300 tokens of reasoning and 200 tokens of output has a ratio of 0.40. Extended thinking models and verbose reasoning models have low token efficiency ratios. If you are paying for reasoning tokens but not using the reasoning output downstream, low efficiency is pure cost waste. See AI agent token efficiency optimization for reducing reasoning verbosity.
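As a quick check of the ratio, here is a minimal sketch that treats billed tokens as reasoning tokens plus useful output tokens. How a provider reports reasoning tokens varies, so the two-argument signature is an assumption.

```python
def token_efficiency(useful_output_tokens: int, reasoning_tokens: int) -> float:
    """Share of billed output tokens that end up in the final, useful output."""
    billed = useful_output_tokens + reasoning_tokens
    if billed == 0:
        raise ValueError("no output tokens billed")
    return useful_output_tokens / billed


# 300 reasoning tokens followed by 200 tokens of final output.
ratio = token_efficiency(useful_output_tokens=200, reasoning_tokens=300)
print(f"token efficiency: {ratio:.2f}")
```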
P95 latency at benchmark traffic. Run the benchmark corpus at 3× your expected production QPS to simulate peak load. Measure P50, P95, and P99 latency for each provider. P95 latency at 3× QPS is a reasonable predictor of P95 latency during your actual peak periods. Providers that look identical at low QPS can diverge significantly under load due to rate limit throttling, shared infrastructure saturation, or regional capacity constraints.
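Percentiles can be computed with the Python standard library alone. The sketch below uses `statistics.quantiles` with 100 cut points; the latency sample is invented to show how a slow tail is invisible in the mean.

```python
from statistics import mean, quantiles


def latency_percentiles(latencies_ms):
    """Return P50, P95 and P99 from a list of per-call latencies (needs >= 2 values)."""
    # n=100 yields 99 cut points; index i holds the (i + 1)th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical sample: 90 fast calls and 10 throttled calls.
sample = [120.0] * 90 + [900.0] * 10
pcts = latency_percentiles(sample)
print(f"mean={mean(sample):.0f} ms, p50={pcts['p50']:.0f} ms, p95={pcts['p95']:.0f} ms")
```

Here the mean stays under 200 ms while P95 sits at the throttled latency, which is the gap the benchmark is meant to expose.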
Cache utilization and effective rate. Instrument your benchmark runs to capture cache hit rates on stable prompt components (system prompts, tool definitions). Compute the effective input token rate: (cached_tokens × cache_rate + uncached_tokens × full_rate) / total_input_tokens. Compare effective rates across providers, not headline rates. For agents with large stable system prompts, a provider with aggressive caching may achieve effective rates 40–60% below headline.
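The effective-rate formula translates directly into code. In this sketch the token counts are illustrative, the prices are the $0.30 and $3.00 per MTok figures quoted earlier, and any cache-write surcharge is ignored, which is a simplifying assumption.

```python
def effective_input_rate(cached_tokens: int, uncached_tokens: int,
                         cache_rate: float, full_rate: float) -> float:
    """Effective USD per million input tokens across cached and uncached tokens."""
    total = cached_tokens + uncached_tokens
    if total == 0:
        raise ValueError("no input tokens recorded")
    return (cached_tokens * cache_rate + uncached_tokens * full_rate) / total


# Illustrative benchmark totals: half of all input tokens were cache reads.
rate = effective_input_rate(1_000_000, 1_000_000, cache_rate=0.30, full_rate=3.00)
discount = 1 - rate / 3.00
print(f"effective input rate: ${rate:.2f}/MTok ({discount:.0%} below headline)")
```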

Automating continuous cost benchmarking

Nightly benchmark pipeline. Provider pricing and model performance change continuously. Model updates can increase or decrease token counts for the same prompt; provider pricing changes during contract renewals. A benchmark run from six months ago is stale. Automate your benchmark corpus to run nightly against your primary model and weekly against all candidate models. Alert if the CPST for your primary model increases by more than 10% week-over-week — this signals either a model update that changed output verbosity or a prompt regression that increased token consumption.
Shadow benchmarking in production. Run 1–2% of production traffic in shadow mode against a candidate model alongside your primary model. Both models process the same input; only the primary model’s output is returned to the user. Record cost and quality for both. Shadow benchmarking is far more representative than offline corpora because it reflects live user behavior, live prompt templates, and live tool definitions. It is the gold standard for validating a model switch before full rollout.
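Selecting the shadow slice can be done deterministically by hashing a request identifier, so the same request always gets the same decision. The sketch below is an assumption-laden outline: `primary` and `candidate` are hypothetical callables standing in for your model clients, and a real deployment would run the candidate call off the request path rather than inline.

```python
import hashlib


def in_shadow_sample(request_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically select roughly sample_rate of requests."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def handle(request_id, prompt, primary, candidate, shadow_log, sample_rate=0.02):
    """Serve the primary model's answer; record the candidate's answer on the side."""
    primary_result = primary(prompt)
    if in_shadow_sample(request_id, sample_rate):
        try:
            shadow_log.append((request_id, primary_result, candidate(prompt)))
        except Exception:
            pass  # a shadow failure must never affect the user-facing response
    return primary_result
```

Only the primary result is ever returned; the logged pairs are what feed cost and quality scoring for the candidate.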
Cost regression detection in CI. Include a cost benchmark step in your CI pipeline that runs your task corpus against your current production model with the modified prompt templates or agent code. If the mean CPST increases by more than 5% compared to the main branch, block the merge and flag the PR for cost review. See LLM agent canary deployment strategy for integrating cost gates into deployment pipelines. This prevents cost regressions from reaching production silently.
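A cost gate of this kind reduces to a threshold check whose result becomes the CI step's exit code. The function name and the example CPST values below are hypothetical; the 5% threshold is the one described above.

```python
def cost_gate(baseline_cpst: float, candidate_cpst: float,
              max_increase: float = 0.05) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    if baseline_cpst <= 0:
        raise ValueError("baseline CPST must be positive")
    increase = (candidate_cpst - baseline_cpst) / baseline_cpst
    print(f"CPST change vs main: {increase:+.1%} (limit {max_increase:+.1%})")
    return 1 if increase > max_increase else 0


# Hypothetical benchmark results for the main branch and the PR branch.
exit_code = cost_gate(baseline_cpst=0.080, candidate_cpst=0.090)
```

In a pipeline the script would end with `sys.exit(exit_code)` so the CI runner fails the step. The same check with `max_increase=0.10` could serve a week-over-week alert.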
Cost trend dashboards. Aggregate daily benchmark results into a cost trend dashboard showing CPST by model, effective token rate by provider, and quality scores over time. This makes it immediately visible when a model update or pricing change alters your unit economics. Teams that lack trend dashboards typically discover cost changes only when the monthly invoice arrives; trend dashboards enable detection within 24–48 hours.

RunGuard BudgetTracker for production cost monitoring

From benchmarks to production guardrails. Once your benchmark establishes an expected CPST, use that as the basis for a RunGuard BudgetTracker ceiling. If the benchmark shows your agent costs $0.08 per task at P95, set a $0.16 ceiling in BudgetTracker, which is 2× the P95 as a safety margin. Any session that exceeds $0.16 is a statistical outlier that warrants investigation, not a normal operational cost. BudgetTracker’s BudgetExceededError surfaces these outliers in real time rather than letting them accumulate silently in your monthly bill.
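The ceiling logic itself is simple enough to sketch without any vendor library. The code below does not use RunGuard's API: `SessionBudget` and `CostCeilingExceeded` are hypothetical stand-ins written for this example, and the $0.05 benchmark P95 is an illustrative figure.

```python
class CostCeilingExceeded(Exception):
    """Raised when a session's accumulated cost passes its ceiling."""


class SessionBudget:
    """Hypothetical per-session budget guard (a stand-in, not RunGuard's API)."""

    def __init__(self, ceiling_usd: float):
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        if self.spent_usd > self.ceiling_usd:
            raise CostCeilingExceeded(
                f"session spent ${self.spent_usd:.3f}, ceiling ${self.ceiling_usd:.3f}"
            )


def ceiling_from_benchmark(p95_cost_usd: float, margin: float = 2.0) -> float:
    """Derive a session ceiling as a multiple of the benchmark P95 cost."""
    return margin * p95_cost_usd


budget = SessionBudget(ceiling_from_benchmark(0.05))  # ceiling of $0.10
budget.record(0.04)
budget.record(0.04)  # $0.08 so far, still under the ceiling
```

A third $0.04 call would push the session to $0.12 and raise `CostCeilingExceeded`, which is the point at which an outlier session gets flagged.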
Per-model cost tracking. If you route traffic across multiple models (a router sends simple tasks to Haiku, complex tasks to Sonnet), instrument each model tier with its own RunGuard context tagged with model_tier. This gives you per-tier cost visibility: “Haiku tier: 7,200 tasks at $0.003/task average; Sonnet tier: 800 tasks at $0.09/task average; blended: $0.0117/task ($93.60 across 8,000 tasks).” Divergence from benchmark baselines by tier reveals whether routing logic is misdirecting traffic to the wrong tier. See LLM model routing cost optimization for routing implementation patterns.
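The blended figure is a task-weighted average across tiers, not an average of the per-tier averages. A minimal sketch with invented tier names and volumes:

```python
def blended_cost_per_task(tiers: dict[str, dict[str, float]]) -> float:
    """Task-weighted average cost across model tiers."""
    total_cost = sum(t["tasks"] * t["avg_cost_usd"] for t in tiers.values())
    total_tasks = sum(t["tasks"] for t in tiers.values())
    if total_tasks == 0:
        raise ValueError("no tasks recorded")
    return total_cost / total_tasks


# Hypothetical daily totals per tier.
tiers = {
    "small-tier": {"tasks": 9_000, "avg_cost_usd": 0.002},
    "large-tier": {"tasks": 1_000, "avg_cost_usd": 0.050},
}
blended = blended_cost_per_task(tiers)
large_share = tiers["large-tier"]["tasks"] / sum(t["tasks"] for t in tiers.values())
print(f"blended: ${blended:.4f}/task, large-tier share of traffic: {large_share:.0%}")
```

Tracking the large-tier share alongside the blended cost makes routing drift visible: if the share rises while per-tier averages hold steady, the router rather than the models is driving the increase.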
Benchmark-to-production cost correlation. RunGuard’s cost data can feed back into your benchmarking pipeline. If production P95 CPST is consistently 1.4× your benchmark P95 CPST, your benchmark corpus is not representative enough — add more hard tasks to bring the benchmark in line with production. RunGuard’s daily cost summaries are the ground truth; benchmarks should be calibrated against them, not the other way around.

Benchmark once. Monitor forever.

LLM inference cost benchmarking tells you what your workload actually costs per task. RunGuard’s BudgetTracker enforces those baselines in production so provider pricing changes, model updates, and prompt regressions surface within hours instead of months.
