LLM inference provider cost comparison 2026: OpenAI, Anthropic, Groq, Together AI, Fireworks, Mistral
LLM inference pricing in 2026 spans a roughly 20× range from the most expensive frontier model APIs to the cheapest open-model inference providers. Picking the right provider for your use case can reduce your inference bill significantly. But there’s a cost risk that no amount of provider-shopping eliminates: a looping agent or an unbounded context window will burn your budget on any provider, regardless of per-token price. This page covers both — where to find cheaper inference, and how to cap spend so that a bad run doesn’t wipe out the savings from careful provider selection.
How to read LLM pricing tables
LLM providers charge separately for input tokens (the prompt + context you send) and output tokens (the model’s response). For agent workloads, this distinction matters a lot:
- Input tokens dominate in agents. In a multi-turn agent with a large context window, the same base context is re-sent on every call. A 10-turn agent with a 4,000-token system prompt sends those 4,000 tokens 10 times. Input token price compounds with context growth; output token price is roughly proportional to response length, which is easier to control.
- Cached input pricing changes the math. OpenAI, Anthropic, and some other providers offer prompt caching (also called prefix caching), where repeated prefixes in your prompt are stored server-side and charged at a reduced rate on subsequent calls. If your agent reuses a large system prompt or tool schema across many calls, cached pricing can reduce your input token costs by 50–90% — which changes which provider is cheapest for your specific workload.
- Context window size affects maximum exposure. A provider offering a 128K context window doesn’t cost more per token than one with 8K, but it means your agent can run up a much larger input token bill before context management kicks in. Larger context windows increase the potential blast radius of an uncontrolled run.
Provider comparison: major frontier models
| Provider / Model | Input ($/1M tokens) | Output ($/1M tokens) | Cached input | Context window | Best for |
|---|---|---|---|---|---|
| OpenAI GPT-4o | $2.50 | $10.00 | $1.25 (50% off) | 128K | General-purpose reasoning + tool use |
| OpenAI GPT-4o mini | $0.15 | $0.60 | $0.075 (50% off) | 128K | High-volume simpler tasks |
| Anthropic Claude Sonnet 4 | $3.00 | $15.00 | $0.30 (90% off) | 200K | Long-context agents, code generation |
| Anthropic Claude Haiku 4 | $0.80 | $4.00 | $0.08 (90% off) | 200K | Fast, low-cost agentic routing |
| Google Gemini 2.0 Flash | $0.10 | $0.40 | $0.025 (75% off) | 1M | Ultra-long context, high throughput |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | $0.31 (75% off) | 1M | Complex reasoning, very long context |
| Mistral Large | $2.00 | $6.00 | Not available | 128K | European data residency, multilingual |
| Mistral Small 3.1 | $0.10 | $0.30 | Not available | 128K | Cost-sensitive workloads, Europe |
Prices current as of June 2026. Always verify with provider pricing pages — this market moves fast.
Open-model inference providers: lower cost, different tradeoffs
For teams willing to use open-weight models (Llama, Mistral, Qwen, DeepSeek), specialized inference providers offer dramatically lower token prices at the cost of some quality reduction on complex reasoning tasks.
| Provider | Representative model | Input ($/1M tokens) | Output ($/1M tokens) | Notes |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | $0.59 | $0.79 | Fastest latency (LPU hardware), free tier |
| Together AI | Llama 3.1 405B | $3.50 | $3.50 | Largest open models, fine-tuning available |
| Together AI | Llama 3.2 3B | $0.06 | $0.06 | Cheapest option for simple tasks |
| Fireworks AI | Llama 3.3 70B | $0.90 | $0.90 | Structured output specialization, fast |
| Fireworks AI | Qwen2.5 72B | $0.90 | $0.90 | Strong multilingual + code |
| Cerebras | Llama 3.3 70B | $0.85 | $1.20 | Extreme throughput (wafer-scale chip) |
| DeepInfra | Llama 3.3 70B | $0.23 | $0.40 | Lowest published price for Llama 3.3 70B |
The cheapest option for a simple classification task (Together AI Llama 3.2 3B at $0.06/1M tokens) is roughly 167× cheaper per token than Claude Sonnet 4, and about 42× cheaper than GPT-4o. For agents that make hundreds of LLM calls per task, this gap is material.
When provider-switching actually helps
Switching providers saves money when:
- Your task complexity is lower than you assumed. Many teams default to GPT-4o or Claude Sonnet because they tested with those models. But for routing decisions, simple classification, or short-context extraction tasks, a smaller model (GPT-4o mini, Haiku, Llama 3.2 3B) produces equivalent results at a fraction of the cost. The key test: run quality evals on a sample of real production inputs before switching.
- You have high cache hit rates. Anthropic’s 90% cached input discount is exceptional. If your agent uses a large, stable system prompt (tool schemas, instructions, knowledge base excerpts) and sends it on many consecutive calls in the same session, prompt caching can make Claude Haiku cheaper per actual call than any uncached alternative.
- Latency is a constraint. Groq’s LPU hardware delivers inference at 300+ tokens/second, which is 5–10× faster than standard GPU inference on comparable models. For user-facing agents where response latency matters, faster inference may allow you to make more calls within the same user-perceived latency budget, changing your cost calculation.
When provider-switching does not help
Lower per-token prices do not reduce costs when:
- Your agent is looping. If an agent calls
read_file("config.yaml")30 times instead of once, switching from GPT-4o to Together AI reduces the cost per iteration but does not reduce the 30-iteration count. You still pay 30× what you should for that run. A cheaper provider makes a looping run more affordable, not safe. - Your context window is growing unbounded. An agent that carries full conversation history across 50 turns will spend 50× more input tokens per call by the final turn than by the first. No per-token discount eliminates quadratic cost growth.
- You have no per-run spend limit. Without a hard ceiling, any run can become arbitrarily expensive if it hits an edge case. Provider selection optimizes the expected cost; it does nothing for the tail risk.
Enforcing per-run budgets across any provider
RunGuard’s BudgetTracker tracks accumulated cost for the current run against a configurable ceiling. It includes a built-in price table for major providers so you don’t have to maintain price lookups yourself:
from runguard import guard, BudgetTracker, BudgetExceededError
# Set per-run ceiling at $2.00 regardless of which provider you use
tracker = BudgetTracker(
max_usd=2.0,
# RunGuard knows prices for: gpt-4o, gpt-4o-mini, claude-sonnet-4,
# claude-haiku-4, gemini-2.0-flash, llama-3.3-70b (groq), and more
default_model="gpt-4o"
)
@guard(budget=tracker, loop_window=20, loop_threshold=3)
async def call_llm(messages: list, model: str = "gpt-4o") -> str:
response = await openai_client.chat.completions.create(
model=model,
messages=messages
)
# Report actual token usage to tracker for accurate accounting
tracker.record_usage(
model=model,
input_tokens=response.usage.prompt_tokens,
output_tokens=response.usage.completion_tokens
)
return response.choices[0].message.content
# When accumulated cost crosses $2.00, next call raises BudgetExceededError
try:
result = await call_llm(messages, model="gpt-4o")
except BudgetExceededError as e:
print(f"Run halted: ${e.accumulated_usd:.3f} of ${e.max_usd:.2f} limit reached")
# RunGuard has already sent a Slack alert if configured
Switching your agent from GPT-4o to Together AI’s Llama 3.3 70B only requires changing the model parameter. RunGuard’s price table updates automatically and the $2.00 ceiling applies regardless. The circuit breaker is provider-agnostic by design.
The combined approach: pick a cheaper provider, then add a ceiling
The highest-impact cost reduction for most AI agent applications is a combination of provider optimization and per-run enforcement:
- Identify your high-volume, lower-complexity tasks. Run quality evals on a representative sample. If a smaller model (Haiku, GPT-4o mini, Llama 3.2 3B) matches your quality bar, switch. This is a one-time optimization with ongoing savings.
- Enable prompt caching where available. For agents with large, stable system prompts, Anthropic’s 90% cached input discount is the single largest cost lever after model selection.
- Add per-run budget enforcement with RunGuard. Set the ceiling at 3–5× your expected per-run cost. Normal runs stay well under the ceiling; pathological runs (loops, context explosions, retry storms) are stopped before they do real damage. This is ongoing protection that compounds over time — every future edge case is capped, not just the ones you anticipated in your provider selection analysis.
Provider selection optimizes your expected cost. Runtime enforcement limits your worst-case cost. Both are necessary if you’re running AI agents in production with real money at stake.
Cap your per-run spend across any LLM provider
RunGuard’s BudgetTracker works with OpenAI, Anthropic, Google, Groq, Together, Fireworks, and any other provider — you just report token usage and RunGuard handles the math. Add a loop breaker and a Slack alert in the same install.
Get started with RunGuard — or read about real-time LLM spend alerts, LLM cost per feature tracking, and token budget enforcement in Python.