LLM caching cost savings calculation: how to measure your prompt cache ROI and maximize it

Prompt caching is one of the most effective cost optimization tools in modern LLM APIs. Anthropic’s Claude prompt caching bills cache reads at 10% of the normal input token price (a 90% discount). OpenAI’s prompt caching bills cached input tokens at 50% of the normal price (a 50% discount). In theory, an agent that resends the same long system prompt on every call saves 50–90% of the input token cost of that cached prefix. In practice, the savings depend on three factors that are easy to get wrong: cache hit rate (the fraction of calls that actually hit the cache), cacheable prefix length (how many leading tokens are identical across calls), and cache write overhead (on Anthropic, the call that writes the cache is billed at a 25% premium on the cached tokens). This guide covers the formula for calculating prompt cache ROI for any workload, how to measure each factor in practice, and how to use RunGuard’s per-session cost tracking to verify that caching is delivering the expected savings.
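As a rough sketch, the three factors above can be combined into a single cost estimate. The Python below assumes Anthropic-style billing, where every cache miss rewrites the prefix at the write premium and every hit reads it at the discounted rate. The function names, the $3 per million token price, and the workload numbers are illustrative assumptions, not figures from any provider's price list.

```python
def uncached_cost(calls, prefix_tokens, dynamic_tokens, price_per_mtok):
    """Input cost in dollars when every token is billed at the normal price."""
    return calls * (prefix_tokens + dynamic_tokens) * price_per_mtok / 1_000_000


def cached_cost(calls, prefix_tokens, dynamic_tokens, hit_rate, price_per_mtok,
                read_mult=0.10, write_mult=1.25):
    """Input cost in dollars with prompt caching.

    Assumption: each miss writes the prefix (billed at write_mult) and each
    hit reads it (billed at read_mult). Dynamic tokens are never cached.
    """
    per_token = price_per_mtok / 1_000_000
    hits = calls * hit_rate
    misses = calls - hits
    prefix = prefix_tokens * per_token * (hits * read_mult + misses * write_mult)
    dynamic = calls * dynamic_tokens * per_token
    return prefix + dynamic


def savings_fraction(calls, prefix_tokens, dynamic_tokens, hit_rate, price_per_mtok,
                     read_mult=0.10, write_mult=1.25):
    """Fraction of the uncached input bill that caching removes (can be negative)."""
    base = uncached_cost(calls, prefix_tokens, dynamic_tokens, price_per_mtok)
    with_cache = cached_cost(calls, prefix_tokens, dynamic_tokens, hit_rate,
                             price_per_mtok, read_mult, write_mult)
    return (base - with_cache) / base


def break_even_hit_rate(read_mult=0.10, write_mult=1.25):
    """Hit rate below which caching the prefix costs more than not caching it."""
    return (write_mult - 1.0) / (write_mult - read_mult)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 calls, 10k-token prefix, 500 dynamic tokens,
    # 90% hit rate, $3 per million input tokens (assumed price).
    args = (1000, 10_000, 500, 0.9, 3.0)
    print(f"uncached: ${uncached_cost(1000, 10_000, 500, 3.0):.2f}")
    print(f"cached:   ${cached_cost(*args):.2f}")
    print(f"savings:  {savings_fraction(*args):.1%}")
    print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

With these assumed numbers the estimate comes to about $7.95 with caching versus $31.50 without, roughly a 75% saving rather than the headline 90%. The break-even function shows why hit rate matters: with a 25% write premium and a 10% read price, caching only pays for itself once roughly 22% of calls hit the cache.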

The prompt caching cost formula

How to measure your cache hit rate in practice

Maximizing cache ROI: prefix length and stability

Tracking cache savings per session with RunGuard

Prompt caching cost savings by provider and scenario

| Provider | Cache read discount | Cache write premium | TTL | Minimum cached tokens |
| --- | --- | --- | --- | --- |
| Anthropic Claude | 90% off (cache reads at 10% of input price) | 25% over input price | 5 minutes (extendable) | 1,024 tokens |
| OpenAI GPT-4o | 50% off (cache reads at 50% of input price) | None (automatic) | ~5–10 minutes | 1,024 tokens |
| Google Gemini (via Vertex) | 75% off (context caching) | Storage cost per hour | Configurable (minutes to hours) | 32,768 tokens (model-dependent) |
| Self-hosted (vLLM prefix caching) | ~95% off (GPU compute only) | None | Until evicted from GPU memory | 1 token (block-level) |
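To see how the read discount and write premium in the table interact, the sketch below computes the effective price multiplier on the cached prefix at a given hit rate. It is a simplified model under stated assumptions: every miss is billed at the write multiplier, Gemini's hourly storage fee and self-hosted GPU costs are left out, and the names `PROVIDERS`, `prefix_cost_multiplier`, and `crossover_hit_rate` are hypothetical. The multipliers are relative to each provider's own input price, so this compares pricing schemes, not absolute dollar costs.

```python
# (read multiplier, write multiplier) relative to the normal input price,
# taken from the table above. Storage fees and GPU costs are not modeled.
PROVIDERS = {
    "anthropic": (0.10, 1.25),
    "openai": (0.50, 1.00),
}


def prefix_cost_multiplier(hit_rate, read_mult, write_mult):
    """Average price multiplier paid on the cached prefix across all calls."""
    return hit_rate * read_mult + (1.0 - hit_rate) * write_mult


def crossover_hit_rate(scheme_a, scheme_b):
    """Hit rate at which two (read, write) schemes give the same multiplier.

    Returns None when the two schemes never cross (parallel lines).
    """
    (read_a, write_a), (read_b, write_b) = scheme_a, scheme_b
    denom = (write_a - read_a) - (write_b - read_b)
    if denom == 0:
        return None
    return (write_a - write_b) / denom


if __name__ == "__main__":
    for hit_rate in (0.3, 0.8):
        for name, (read_mult, write_mult) in PROVIDERS.items():
            m = prefix_cost_multiplier(hit_rate, read_mult, write_mult)
            print(f"hit rate {hit_rate:.0%} {name}: {m:.3f}x of normal prefix price")
    x = crossover_hit_rate(PROVIDERS["anthropic"], PROVIDERS["openai"])
    print(f"schemes cross at a hit rate of about {x:.1%}")
```

Under this model, the deeper read discount only wins once the hit rate is high enough to absorb the write premium: the two schemes cross at a hit rate of roughly 38%, and below that the scheme with no write premium yields the lower multiplier.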

For overall Claude API cost optimization, see Anthropic Claude API cost optimization. For memory-based cost reduction, see AI agent memory consolidation cost optimization.

Measure your cache ROI, then optimize from data

The prompt caching calculation is straightforward, but its inputs (hit rate, cached fraction, TTL utilization) are workload-specific. Measure first: log the cached-token counts your provider returns on every LLM call (cache_read_input_tokens in Anthropic’s usage object; OpenAI reports cached_tokens under prompt_tokens_details). Then calculate your actual hit rate and effective savings percentage. If the hit rate is below 70%, find the dynamic content that’s breaking your prefix before adding more cached content. RunGuard’s cache tracking surfaces these metrics automatically per session, so you can see what you saved and where the leakage is.
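A minimal sketch of that measurement step, assuming you have logged each call's usage as a plain dict with the field names Anthropic returns (input_tokens, cache_creation_input_tokens, cache_read_input_tokens). The function name, the default multipliers, and the sample records are illustrative assumptions; this is not RunGuard's API.

```python
def summarize_cache_usage(usages, read_mult=0.10, write_mult=1.25, target_hit_rate=0.70):
    """Summarize logged usage records into hit rate and effective savings.

    Each record is a dict of token counts for one call. Missing or None
    counts are treated as zero.
    """
    def tok(usage, key):
        return usage.get(key) or 0

    calls = len(usages)
    hits = sum(1 for u in usages if tok(u, "cache_read_input_tokens") > 0)
    read = sum(tok(u, "cache_read_input_tokens") for u in usages)
    written = sum(tok(u, "cache_creation_input_tokens") for u in usages)
    fresh = sum(tok(u, "input_tokens") for u in usages)

    total = read + written + fresh  # all input tokens sent
    # Token-equivalents actually billed at the normal input price.
    billed = fresh + read * read_mult + written * write_mult

    hit_rate = hits / calls if calls else 0.0
    return {
        "hit_rate": hit_rate,
        "cached_token_fraction": read / total if total else 0.0,
        "savings_fraction": 1.0 - billed / total if total else 0.0,
        "below_target": hit_rate < target_hit_rate,
    }


if __name__ == "__main__":
    # Hypothetical session: one cache write followed by three cache reads.
    session = [
        {"input_tokens": 500, "cache_creation_input_tokens": 10_000, "cache_read_input_tokens": 0},
        {"input_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
        {"input_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
        {"input_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000},
    ]
    for key, value in summarize_cache_usage(session).items():
        print(key, value)
```

Tracking both a call-level hit rate and a token-level cached fraction is a deliberate choice in this sketch: a session can hit the cache on most calls and still save little if the cached prefix is small relative to the dynamic tokens.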

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
