LLM caching cost savings calculation: how to measure your prompt cache ROI and maximize it
Prompt caching is one of the most effective cost optimization tools available in modern LLM APIs. Anthropic’s Claude prompt caching charges cache reads at 10% of the normal input token price (a 90% discount). OpenAI’s prompt caching charges cached input tokens at 50% of the normal input token price (a 50% discount). In theory, an agent that repeats the same long system prompt on every call saves 50–90% of its input token cost on the cached prefix. In practice, the savings depend on three factors that are easy to get wrong: cache hit rate (what fraction of calls actually hit the cache), cacheable prefix length (how many tokens are actually identical across calls), and cache write overhead (on Claude, the call that writes the cache costs more than an uncached call). This guide shows the formula for calculating prompt cache ROI for any workload, how to measure each factor in practice, and how to use RunGuard’s per-session cost tracking to verify that caching is delivering the expected savings.
The prompt caching cost formula
- Without caching (baseline). For each LLM API call, you pay `input_cost = total_input_tokens × price_per_input_token`. Example: a system prompt of 2,000 tokens plus 500 tokens of user message plus 300 tokens of conversation history gives 2,800 total input tokens. At Claude Sonnet’s $3/M input tokens, that is $0.0084 per call. 1,000 calls/day = $8.40/day = $252/month (30 days).
- With caching (cache hit scenario). The first call (cache write) costs slightly more than normal: the cached prefix is charged at the cache write price, which is typically 25% higher than the normal input price for the cached tokens. Subsequent calls (cache hits) charge the cached prefix at the cache read price (10% of normal for Claude). Formula: `cache_write_cost = cached_tokens × price_write`; `cache_read_cost = cached_tokens × price_read`; `uncached_cost = uncached_tokens × price_input`. Example: 2,000 cached system prompt tokens, 800 uncached tokens (user message + history), 1,000 calls/day. Cache write call: (2,000 × $3.75/M) + (800 × $3/M) = $0.0075 + $0.0024 = $0.0099. Cache read calls (999 calls): (2,000 × $0.30/M) + (800 × $3/M) = $0.0006 + $0.0024 = $0.0030/call. Daily cost: $0.0099 + (999 × $0.0030) = $0.0099 + $2.997 = $3.007. Savings vs. no caching: $8.40 − $3.007 = $5.393/day = $161.79/month (64% reduction).
- The hit rate factor: why real savings are often lower. The calculation above assumes a 99.9% cache hit rate (1 write, 999 reads). In practice, cache hit rate depends on: (1) cache TTL (Claude’s cache expires after 5 minutes of inactivity; if your agent is low-traffic, every call may be a cache miss); (2) prefix stability (if your system prompt varies per user, each user gets a separate cache entry, so each user’s first call is a cache write and only that user’s later calls within the TTL can hit); (3) concurrency (parallel calls that start before the first cache write completes can each miss and pay the write price for the same prefix). Measure your actual hit rate from API response metadata (`cache_read_input_tokens` in Claude’s usage object) before projecting savings.
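The arithmetic above can be sketched as a small calculator. This is a minimal sketch, not a billing tool: the `Pricing` class and both function names are made up for illustration, and the default prices are simply the example figures used in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Defaults mirror the example figures above."""
    input_per_m: float = 3.00        # normal input price
    cache_write_per_m: float = 3.75  # 25% premium on cached tokens
    cache_read_per_m: float = 0.30   # 10% of the normal input price


def cost_without_cache(total_tokens: int, calls: int, p: Pricing) -> float:
    """Every call pays the normal input price on every token."""
    return calls * total_tokens * p.input_per_m / 1_000_000


def cost_with_cache(cached_tokens: int, uncached_tokens: int,
                    writes: int, reads: int, p: Pricing) -> float:
    """Write calls pay the write price on the prefix, read calls the read price."""
    uncached = uncached_tokens * p.input_per_m
    write_call = (cached_tokens * p.cache_write_per_m + uncached) / 1_000_000
    read_call = (cached_tokens * p.cache_read_per_m + uncached) / 1_000_000
    return writes * write_call + reads * read_call


pricing = Pricing()
baseline = cost_without_cache(total_tokens=2_800, calls=1_000, p=pricing)
cached = cost_with_cache(2_000, 800, writes=1, reads=999, p=pricing)
savings = baseline - cached
print(f"baseline ${baseline:.2f}/day, cached ${cached:.3f}/day, "
      f"saved ${savings:.3f}/day ({savings / baseline:.0%})")
```

Raising `writes` and lowering `reads` by the same amount is a quick way to see how cache expiry or unstable prefixes erode the projected savings.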
How to measure your cache hit rate in practice
- Claude API: read `cache_read_input_tokens` from `usage`. Every Claude API response includes a `usage` object. When prompt caching is enabled, this object contains `cache_creation_input_tokens` (tokens written to cache on this call) and `cache_read_input_tokens` (tokens read from cache on this call). A cache hit is any call where `cache_read_input_tokens > 0`. Hit rate = (calls with `cache_read_input_tokens > 0`) / (total calls). Log this metric per session and per deployment to track cache effectiveness over time.
- OpenAI API: read `cached_tokens` from `prompt_tokens_details`. OpenAI’s usage object includes `prompt_tokens_details.cached_tokens`, which is greater than zero for calls where the prefix was cached. Same calculation: hit rate = (calls with `cached_tokens > 0`) / (total calls).
- Calculating effective savings rate from measured hit rate. Once you have the measured hit rate H, the effective savings rate S is approximately `S = H × (1 − cache_read_price_ratio) × cached_fraction`, where `cache_read_price_ratio` = cache read price / normal input price (0.10 for Claude, 0.50 for OpenAI) and `cached_fraction` = cached_tokens / total_input_tokens. Example: H = 0.85, cache_read_price_ratio = 0.10, cached_fraction = 0.72. S = 0.85 × 0.90 × 0.72 = 0.55. This workload saves about 55% of its input token costs with caching, not the 64% from the earlier example, mainly because the hit rate is 85% rather than 99.9%. The 28% of tokens that are not cached (user messages, recent history) cap the achievable savings at any hit rate. The formula also ignores the cache write premium: if every miss on Claude pays the 25% write premium on the cached tokens, S drops by a further (1 − H) × 0.25 × cached_fraction, about 0.03 in this example.
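The measurement and the savings formula can be sketched together as follows. This is a minimal, provider-agnostic sketch under stated assumptions: the function names are hypothetical, the input is a list of per-call cached-read token counts that you extract yourself (from `cache_read_input_tokens` for Claude or `prompt_tokens_details.cached_tokens` for OpenAI), and the optional `write_premium` term assumes every miss pays the write price on the cached prefix.

```python
from typing import Sequence


def cache_hit_rate(cached_read_tokens: Sequence[int]) -> float:
    """Fraction of calls that read at least one token from the cache."""
    if not cached_read_tokens:
        return 0.0
    hits = sum(1 for n in cached_read_tokens if n > 0)
    return hits / len(cached_read_tokens)


def effective_savings_rate(hit_rate: float, read_price_ratio: float,
                           cached_fraction: float,
                           write_premium: float = 0.0) -> float:
    """Savings as a fraction of uncached input cost.

    With write_premium=0 this is S = H * (1 - ratio) * cached_fraction.
    A non-zero write_premium charges each miss the extra write cost.
    """
    per_cached_token = (hit_rate * (1.0 - read_price_ratio)
                        - (1.0 - hit_rate) * write_premium)
    return per_cached_token * cached_fraction


def break_even_hit_rate(read_price_ratio: float, write_premium: float) -> float:
    """Hit rate below which caching costs more than not caching."""
    return write_premium / ((1.0 - read_price_ratio) + write_premium)


# Hypothetical log: 17 of 20 calls read 2,000 tokens from the cache.
logged = [2000] * 17 + [0] * 3
h = cache_hit_rate(logged)
print(f"hit rate: {h:.0%}")
print(f"S (simple): {effective_savings_rate(h, 0.10, 0.72):.3f}")
print(f"S (with write premium): {effective_savings_rate(h, 0.10, 0.72, 0.25):.3f}")
print(f"break-even hit rate: {break_even_hit_rate(0.10, 0.25):.1%}")
```

Under these assumptions the break-even hit rate is worth checking for low-traffic agents: below it, the write premium on misses outweighs the discount on hits.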
Maximizing cache ROI: prefix length and stability
- Maximize cached prefix length. The larger the cached prefix relative to total input tokens, the higher the potential savings. Move as much stable content as possible to the beginning of the prompt: system instructions, tool definitions, examples, background context. Everything that varies per call (user message, recent history) goes at the end. For a typical agent, 60–80% of input tokens are stable (system prompt + tool definitions) and cacheable. Agents that inject per-user context into the system prompt reduce this fraction significantly; consider moving per-user context to a user turn instead.
- Cache stability: the single-character trap. Prompt caching matches on exact byte prefixes. A single character difference invalidates the cached prefix. Common mistakes that cause cache misses: timestamp injection into system prompt ("Current time: 2026-06-02T14:32:55Z" — changes every second), dynamic tool definitions (tool descriptions that include per-user IDs or preferences), session IDs injected into system prompts. Audit your system prompt for any dynamic content before assuming your cache hit rate will be high.
- Claude cache warmup: the 5-minute TTL strategy. Claude’s prompt cache expires after 5 minutes of inactivity. For batch workloads that process requests in bursts with gaps between bursts, the cache may expire between bursts. Strategy: send a cheap warmup call (minimal user message, same system prompt) before each burst to ensure the cache is warm when the batch starts. The warmup call costs one cache write; the entire batch then hits the warm cache. Alternatively, use extended cache TTL if your API tier supports it.
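One way to audit prefix stability is to compare the prompts actually sent across several calls and measure how much of the beginning they share. The sketch below is illustrative only: `shared_prefix_chars` is a hypothetical helper, the prompt text is invented, and character counts are just a proxy for the token-level prefix the provider matches on.

```python
import os
from typing import Iterable


def shared_prefix_chars(prompts: Iterable[str]) -> int:
    """Length, in characters, of the prefix common to every prompt."""
    return len(os.path.commonprefix(list(prompts)))


# Invented stable content standing in for instructions and tool definitions.
STABLE = "You are a support agent.\n" + "Tool: search(query) -> results\n" * 50
STAMPS = ["2026-06-02T14:32:55Z", "2026-06-02T14:32:56Z", "2026-06-02T14:32:57Z"]

# Layout 1: timestamp injected at the top of the system prompt.
timestamp_first = [f"Current time: {t}\n{STABLE}" for t in STAMPS]
# Layout 2: stable content first, per-call content at the end.
timestamp_last = [f"{STABLE}Current time: {t}\n" for t in STAMPS]

for name, prompts in [("timestamp first", timestamp_first),
                      ("timestamp last", timestamp_last)]:
    shared = shared_prefix_chars(prompts)
    print(f"{name}: {shared} of {len(prompts[0])} chars shared")
```

With the timestamp first, the shared prefix ends inside the timestamp, so none of the stable content is cacheable. With the timestamp last, the whole stable block is shared.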
Tracking cache savings per session with RunGuard
- RunGuard’s cache_savings metric. RunGuard automatically tracks `cache_read_input_tokens` and `cache_creation_input_tokens` from API responses and exposes them as per-session metrics: `session.cache_hit_rate`, `session.cache_saved_usd`, and `session.effective_cost_per_token`. This lets you see at the session level whether caching is working as expected.
- Python: logging cache savings with RunGuard.

```python
import anthropic
import runguard

guard = runguard.Guard(
    app_id="my-agent",
    budget_usd=2.00,
    track_cache_savings=True,  # enable cache savings tracking
)

# Async client, so the request can be awaited inside the async function below
client = anthropic.AsyncAnthropic()

SYSTEM_PROMPT = """You are a helpful AI assistant..."""  # large, stable system prompt


@guard.protect
async def agent_call(user_message: str, session_id: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # mark for caching
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )
    # RunGuard reads response.usage automatically when track_cache_savings=True
    return response.content[0].text


# After a batch of calls, check cache performance
# (session_id is the same ID that was passed to agent_call):
session_stats = guard.session_stats(session_id)
print(f"Cache hit rate: {session_stats.cache_hit_rate:.1%}")
print(f"Cache saved: ${session_stats.cache_saved_usd:.4f}")
print(f"Effective $/call: ${session_stats.effective_cost_per_call:.4f}")
```

- When to alert on low cache hit rate. Configure RunGuard to alert when cache hit rate drops below your expected threshold. A sudden drop in cache hit rate (e.g., from 85% to 20%) usually indicates a change that broke prefix matching, possibly an accidental one such as a newly injected dynamic timestamp. Catching this early prevents the silent cost increase from running for days before someone notices the higher bill.
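The alerting idea can also be sketched without any particular tool. The class below is a standalone illustration, not RunGuard’s alerting API: `HitRateMonitor` and its parameters are hypothetical, and the window size, threshold, and minimum sample count are assumptions to tune for your workload.

```python
from collections import deque


class HitRateMonitor:
    """Rolling-window cache hit rate with a simple low-rate alert."""

    def __init__(self, window: int = 100, threshold: float = 0.70,
                 min_calls: int = 20) -> None:
        self._hits: deque[bool] = deque(maxlen=window)  # oldest calls drop off
        self.threshold = threshold
        self.min_calls = min_calls

    @property
    def hit_rate(self) -> float:
        return sum(self._hits) / len(self._hits) if self._hits else 0.0

    def record(self, cache_read_tokens: int) -> bool:
        """Record one call; return True when the alert condition holds."""
        self._hits.append(cache_read_tokens > 0)
        if len(self._hits) < self.min_calls:
            return False  # too few samples to judge
        return self.hit_rate < self.threshold


monitor = HitRateMonitor(window=10, threshold=0.70, min_calls=5)
# Ten warm-cache calls, then a run of misses (e.g. after a prompt change).
for tokens in [2000] * 10 + [0] * 4:
    if monitor.record(tokens):
        print(f"ALERT: cache hit rate fell to {monitor.hit_rate:.0%}")
```

Feeding `record` the per-call cached-read token count from the API response keeps the monitor independent of the provider. The `min_calls` guard avoids alerting on the first few calls of a session, when a cache write is expected.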
Prompt caching cost savings by provider and scenario
| Provider | Cache read discount | Cache write premium | TTL | Minimum cached tokens |
|---|---|---|---|---|
| Anthropic Claude | 90% off (cache reads at 10% of input price) | 25% over input price | 5 minutes (extendable) | 1,024 tokens (model-dependent) |
| OpenAI GPT-4o | 50% off (cache reads at 50% of input price) | None (automatic) | ~5–10 minutes | 1,024 tokens |
| Google Gemini (via Vertex) | 75% off (context caching) | Storage cost per hour | Configurable (minutes to hours) | 32,768 tokens (model-dependent) |
| Self-hosted (vLLM prefix caching) | ~95% off (GPU compute only) | None | Until evicted from GPU memory | 1 token (block-level) |
For overall Claude API cost optimization, see Anthropic Claude API cost optimization. For memory-based cost reduction, see AI agent memory consolidation cost optimization.
Measure your cache ROI, then optimize from data
The prompt caching calculation is straightforward, but the inputs (hit rate, cached fraction, TTL utilization) are workload-specific. Measure first: add cache_read_input_tokens logging to every LLM call. Then calculate your actual hit rate and effective savings percentage. If hit rate is below 70%, find the dynamic content that’s breaking your prefix before adding more cached content. RunGuard’s cache tracking surfaces these metrics automatically per session, so you know exactly what you saved and where the leakage is.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.