Anthropic Claude API cost optimization: prompt caching, batch API, model selection, and loop prevention
The Anthropic Claude API offers three first-party cost reduction mechanisms that most teams underuse: prompt caching (90% discount on cached tokens, 5-minute TTL), the Message Batches API (50% discount for offline/async workloads), and model tier selection (Claude Haiku 4.5 at $0.80/M input vs Claude Sonnet 4.6 at $3/M input). Used together, these can reduce costs by 80% or more on the right workloads. But all three optimizations share a critical vulnerability: a single agent loop erases them. A tool-call loop that runs 50 iterations burns the equivalent of a full week of optimized runs in minutes, whether or not prompt caching is active. This guide covers each optimization with practical Python and TypeScript implementations, the cost math for each, and how RunGuard’s circuit breaking ensures that a loop cannot undo your cost work.
Optimization 1: prompt caching with cache_control
When prompt caching applies. Prompt caching works on the static prefix of your prompt: the system message, large reference documents, tool definitions, or few-shot examples that are identical across multiple calls. The first call that includes a cache_control block writes the prefix to the cache and is billed at the cache-write rate ($3.75/M for Sonnet 4.6, a 25% premium over the standard $3/M input rate). Subsequent calls that hit the cache within the 5-minute TTL are billed at $0.30/M for those tokens, a 90% discount. For an agent that re-sends a 5,000-token system prompt on every turn, this reduces the system-prompt rate from $3/M to $0.30/M on every cache-hit call.
Cache math for a coding agent:
- System prompt: 4,000 tokens (tool definitions + instructions)
- Without cache: 100 turns × 4,000 tokens × $3/M = $1.20 system-prompt cost
- With cache (1 write + 99 hits, every call inside the 5-min TTL): 4,000 × $3.75/M + 99 × 4,000 × $0.30/M = $0.015 + $0.119 = $0.134
- Savings: $1.07 (89%)
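The arithmetic above can be reproduced with a short sketch. The function name and its defaults are illustrative assumptions: the rates are the Sonnet 4.6 figures quoted in this guide, and the model assumes every call after the first lands inside the TTL.

```python
def prefix_cost(turns: int, prefix_tokens: int,
                base_rate: float = 3.00,    # $/M input (figure quoted in this guide)
                write_rate: float = 3.75,   # $/M cache write (figure quoted in this guide)
                read_rate: float = 0.30):   # $/M cache read (figure quoted in this guide)
    """Return (uncached_usd, cached_usd) for re-sending a static prefix.

    Assumes turns >= 1 and that no gap between calls exceeds the cache TTL.
    """
    per_million = prefix_tokens / 1_000_000
    uncached = turns * per_million * base_rate
    # One cache write on the first call, then a cache read on every later call.
    cached = per_million * write_rate + (turns - 1) * per_million * read_rate
    return uncached, cached


uncached, cached = prefix_cost(turns=100, prefix_tokens=4_000)
savings = uncached - cached
print(f"uncached=${uncached:.3f} cached=${cached:.3f} "
      f"savings=${savings:.2f} ({savings / uncached:.0%})")
```

Changing `turns` shows how quickly the write premium is amortized: with these rates the cached path is already cheaper on the second call.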
Python: adding cache_control to system prompt and tools
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a code review assistant with access to the following tools.
Your task is to analyze the provided code and suggest improvements.
[... full system instructions, 2000+ words ...]"""

# Tool definitions (expensive to re-send each turn)
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a source file from the repository",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path relative to repo root"}
            },
            "required": ["path"],
        },
    },
    # ... more tools
]


def call_with_caching(messages: list) -> anthropic.types.Message:
    """
    Send a request with prompt caching on the static prefix.
    The system prompt and tool definitions are cached after the first call;
    subsequent calls within 5 minutes pay 90% less for those tokens.
    """
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this block
            }
        ],
        tools=[
            # Mark the last tool so that all preceding tool definitions are cached too
            {**tool, "cache_control": {"type": "ephemeral"}} if i == len(TOOLS) - 1 else tool
            for i, tool in enumerate(TOOLS)
        ],
        messages=messages,
    )


def check_cache_hit(response: anthropic.types.Message) -> dict:
    """Extract cache hit/miss stats from response usage."""
    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    regular = usage.input_tokens
    total = cache_read + cache_write + regular
    return {
        "cache_read_tokens": cache_read,
        "cache_write_tokens": cache_write,
        "regular_input_tokens": regular,
        "cache_hit_rate": cache_read / total if total else 0,
    }
```
Cache TTL management. Each cache hit refreshes the 5-minute TTL, so the cache stays warm as long as calls arrive less than 5 minutes apart; if the gap is longer, the entry expires and the next call pays the write rate again. For interactive agents with human-in-the-loop turns, the cache may expire between turns. For batch processing or high-frequency agents, it stays warm. You can tell whether the cache is hitting by checking usage.cache_read_input_tokens in the response: a non-zero value means those tokens were billed at the discounted cache-read rate.
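To get a feel for how call spacing affects the cache, the TTL behavior can be modeled offline. This is a simplified sketch, not the API's actual eviction logic: the function name is hypothetical, and it assumes a single cached prefix, that every hit restarts the TTL window, and that a gap of exactly the TTL still counts as a hit.

```python
def simulate_cache(call_times: list[float], ttl_seconds: float = 300.0) -> dict:
    """Count expected cache writes and reads for calls at the given times.

    call_times are seconds since an arbitrary start, in ascending order.
    """
    writes = reads = 0
    last_call = None
    for t in call_times:
        if last_call is not None and t - last_call <= ttl_seconds:
            reads += 1   # prefix still cached: billed at the read rate
        else:
            writes += 1  # first call or expired entry: billed at the write rate
        last_call = t    # both hits and writes restart the TTL window
    return {"writes": writes, "reads": reads}


# Three quick calls, a 10-minute pause, then two more calls
print(simulate_cache([0, 60, 200, 800, 830]))
```

Running it against real call timestamps from your logs gives a rough estimate of the hit rate before you enable caching; the authoritative numbers are still the usage fields in each response.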
Optimization 2: Message Batches API for offline workloads
When to use the Batches API. The Batches API processes requests asynchronously and returns results within 24 hours. In exchange, you get a 50% discount on all tokens. It is appropriate for: document processing pipelines, overnight report generation, bulk data extraction, evaluation runs, and any workload where low latency is not required. It is not appropriate for: interactive agents, real-time applications, or anything that needs a response in under 30 seconds.
Python: submit a batch job and retrieve results
```python
import time

import anthropic

client = anthropic.Anthropic()


def submit_batch_job(prompts: list[dict]) -> str:
    """
    Submit a batch of requests at 50% off standard pricing.
    prompts: list of {"custom_id": str, "user_message": str}
    Returns batch ID to poll later.
    """
    requests = [
        {
            "custom_id": p["custom_id"],
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": p["user_message"]}],
            },
        }
        for p in prompts
    ]
    batch = client.messages.batches.create(requests=requests)
    return batch.id


def retrieve_batch_results(batch_id: str) -> list[dict]:
    """
    Poll until batch is complete and return results.
    In production, add backoff and an overall timeout to this polling loop.
    """
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(30)  # poll every 30s; adjust for your latency requirements

    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            results.append({
                "custom_id": result.custom_id,
                "text": result.result.message.content[0].text,
                "usage": result.result.message.usage,
            })
        else:
            # errored, canceled, or expired requests carry no message
            results.append({
                "custom_id": result.custom_id,
                "error": result.result.type,
            })
    return results


# Cost comparison: 1,000 document summaries (input tokens only)
# Standard:  1,000 × avg 2,000 input tokens × $3/M    = $6.00
# Batch API: 1,000 × avg 2,000 input tokens × $1.50/M = $3.00
# Savings:   $3.00 (50%)
```
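The comparison above counts input tokens only. A slightly fuller sketch that also prices output tokens might look like the following; the function name is hypothetical, the 300-token average output is an assumed figure for illustration, and the rates are the Sonnet 4.6 prices quoted in this guide.

```python
def workload_cost(n_requests: int, in_tokens: int, out_tokens: int,
                  in_rate: float, out_rate: float, discount: float = 0.0) -> float:
    """Total USD for n_requests with the given average token counts.

    in_rate and out_rate are $/M tokens; discount is a fraction (0.5 = 50% off).
    """
    per_request = (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
    return n_requests * per_request * (1.0 - discount)


# 1,000 summaries, avg 2,000 input tokens, assumed avg 300 output tokens
standard = workload_cost(1_000, 2_000, 300, in_rate=3.00, out_rate=15.00)
batched = workload_cost(1_000, 2_000, 300, in_rate=3.00, out_rate=15.00, discount=0.5)
print(f"standard=${standard:.2f} batch=${batched:.2f} saved=${standard - batched:.2f}")
```

Because the discount applies to input and output tokens alike, the saving stays at 50% regardless of the input/output mix; only the absolute dollar figure changes.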
Optimization 3: model tier selection
Claude model pricing in 2026.
| Model | Input $/M | Output $/M | Context window | Best for |
|---|---|---|---|---|
| Claude Opus 4.7 | $15.00 | $75.00 | 200k | Complex reasoning, long documents |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200k | Production agents, balanced cost/quality |
| Claude Haiku 4.5 | $0.80 | $4.00 | 200k | Simple extraction, classification, routing |
Python: model router that selects tier by task complexity
```python
# Task categories → appropriate model tier
TASK_MODEL_MAP = {
    "classify": "claude-haiku-4-5-20251001",     # binary/categorical output
    "extract": "claude-haiku-4-5-20251001",      # structured extraction
    "summarize": "claude-haiku-4-5-20251001",    # condensation of short content
    "reason": "claude-sonnet-4-6",               # multi-step reasoning
    "code_review": "claude-sonnet-4-6",          # code quality assessment
    "analyze": "claude-sonnet-4-6",              # complex analysis
    "research": "claude-opus-4-7",               # deep synthesis across sources
}


def select_model(task_type: str, content_length: int) -> str:
    """
    Select the cheapest model that can handle the task.
    content_length is the input size in tokens. Upgrades to Sonnet for
    long inputs even on simple tasks because Haiku quality degrades on
    very long context.
    """
    base_model = TASK_MODEL_MAP.get(task_type, "claude-sonnet-4-6")
    # For very long inputs (>50k tokens), upgrade classification/extraction
    # to Sonnet: Haiku quality degrades on long context
    if content_length > 50_000 and base_model == "claude-haiku-4-5-20251001":
        return "claude-sonnet-4-6"
    return base_model


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD for a given model and token counts."""
    rates = {  # model: ($/M input, $/M output)
        "claude-haiku-4-5-20251001": (0.80, 4.00),
        "claude-sonnet-4-6": (3.00, 15.00),
        "claude-opus-4-7": (15.00, 75.00),
    }
    # Unknown models fall back to Sonnet rates
    in_rate, out_rate = rates.get(model, (3.00, 15.00))
    return input_tokens * in_rate / 1_000_000 + output_tokens * out_rate / 1_000_000
```
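To see what a routing decision is worth, the per-tier rates can be compared directly. The sketch below is standalone: the short tier keys and function names are illustrative, and the prices are the ones listed in this guide, so the percentages are only as accurate as that price list.

```python
RATES = {  # tier: ($/M input, $/M output), as quoted in this guide
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.80, 4.00),
}


def cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request on the given tier."""
    in_rate, out_rate = RATES[tier]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def savings_pct(from_tier: str, to_tier: str,
                input_tokens: int = 1_000, output_tokens: int = 200) -> float:
    """Percentage saved by moving the same request to a cheaper tier."""
    before = cost(from_tier, input_tokens, output_tokens)
    after = cost(to_tier, input_tokens, output_tokens)
    return 100 * (before - after) / before


for src, dst in [("opus", "haiku"), ("sonnet", "haiku"), ("opus", "sonnet")]:
    print(f"{src} -> {dst}: {savings_pct(src, dst):.1f}% cheaper")
```

With this price list the input and output ratios between tiers are identical, so the percentage does not depend on the token mix. That would stop being true if a tier's input and output prices were changed independently.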
Why loop prevention is the most important cost optimization
A single loop erases every optimization. Prompt caching cuts the cost of the cached prefix by up to 90%, but a tool-call loop that runs 100 iterations burns 100× your per-turn cost, which not only erases the savings but generates net-new overspend. The math: if your normal cost is $0.006/turn (with caching), a 100-turn loop costs $0.60. If the loop runs twice per day for a week, that’s 14 × $0.60 = $8.40 in loop costs on top of your optimized baseline. For a low-volume agent, a single longer-running incident can exceed an entire month’s optimized budget.
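The loop arithmetic, together with the simplest possible defense (a hard cap on iterations and spend), can be sketched without any dependencies. Both function names are hypothetical, and the flat per-turn cost is a simplifying assumption; a real agent would record actual usage after each call.

```python
def loop_cost(per_turn_usd: float, iterations: int,
              incidents_per_day: int = 1, days: int = 1) -> float:
    """USD burned by loop incidents of the given length and frequency."""
    return per_turn_usd * iterations * incidents_per_day * days


def run_capped(step, max_iterations: int, max_spend_usd: float, per_turn_usd: float) -> dict:
    """Call step() until it returns True, or until a cap would be exceeded."""
    spent = 0.0
    for i in range(max_iterations):
        if spent + per_turn_usd > max_spend_usd:
            return {"status": "budget_cap", "iterations": i, "spent": spent}
        spent += per_turn_usd
        if step():  # step() returns True when the task is finished
            return {"status": "done", "iterations": i + 1, "spent": spent}
    return {"status": "iteration_cap", "iterations": max_iterations, "spent": spent}


print(f"one incident:   ${loop_cost(0.006, 100):.2f}")
print(f"2/day for 7d:   ${loop_cost(0.006, 100, incidents_per_day=2, days=7):.2f}")

# A step that never finishes is stopped by the spend cap, not after 100 turns
print(run_capped(lambda: False, max_iterations=100, max_spend_usd=0.05, per_turn_usd=0.006))
```

A fixed cap is blunt: it bounds the damage but cannot tell a legitimate long task from a loop, which is the gap that pattern-based loop detection is meant to fill.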
Python: combining all optimizations with RunGuard
```python
import anthropic
from runguard import LoopDetector, BudgetExceededError, LoopDetectedError

# SYSTEM_PROMPT, select_model, and estimate_cost are defined in the earlier examples.

client = anthropic.Anthropic()
detector = LoopDetector(repeats=3, max_cycle_len=3)


class OptimizedAgentSession:
    def __init__(self, budget_usd: float = 5.0):
        self.budget = budget_usd
        self.spent = 0.0
        self.messages = []

    def turn(self, task_type: str, user_message: str) -> str:
        # select_model expects a token count; ~4 characters per token is a rough heuristic
        model = select_model(task_type, len(user_message) // 4)

        # Pre-check budget
        estimated = estimate_cost(model, 4000, 500)
        if self.spent + estimated > self.budget:
            raise BudgetExceededError(
                f"Estimated turn cost ${estimated:.4f} would exceed "
                f"remaining budget ${self.budget - self.spent:.4f}"
            )

        # Loop detection: record intent before call
        sig = f"{model}:{user_message[:80]}"
        match = detector.record(sig)
        if match:
            raise LoopDetectedError(
                f"Repeated query pattern detected ({match.repeats}×); stopping.",
                match=match,
            )

        self.messages.append({"role": "user", "content": user_message})

        # Use caching for the system prompt
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=[{"type": "text", "text": SYSTEM_PROMPT,
                     "cache_control": {"type": "ephemeral"}}],
            messages=self.messages,
        )
        text = response.content[0].text
        self.messages.append({"role": "assistant", "content": text})

        # Record actual cost. usage.input_tokens already excludes cached tokens,
        # so cache reads and writes are priced separately.
        usage = response.usage
        cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
        actual = estimate_cost(model, usage.input_tokens, usage.output_tokens)
        # Cache reads bill at 10% of the model's input rate, cache writes at 125%
        cache_read_cost = 0.10 * estimate_cost(model, cache_read, 0)
        cache_write_cost = 1.25 * estimate_cost(model, cache_write, 0)
        self.spent += actual + cache_read_cost + cache_write_cost
        return text
```
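For readers who want to see what a repeat-based detector does, here is a minimal dependency-free sketch. It is not RunGuard's implementation: the class name and return value are hypothetical, and it only interprets the repeats and max_cycle_len parameters in the most literal way, flagging the history when it ends in a cycle of up to max_cycle_len signatures repeated repeats times in a row.

```python
from collections import deque
from typing import Optional


class SimpleLoopDetector:
    """Flags a loop when recent call signatures end in a short repeated cycle."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 3):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # Only the last repeats * max_cycle_len signatures can matter
        self.history = deque(maxlen=repeats * max_cycle_len)

    def record(self, signature: str) -> Optional[int]:
        """Record a signature; return the detected cycle length, or None."""
        self.history.append(signature)
        items = list(self.history)
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * self.repeats
            if len(items) < window:
                break  # longer cycles need even more history
            tail = items[-window:]
            # A loop means the tail is one cycle repeated back to back
            if tail == tail[:cycle_len] * self.repeats:
                return cycle_len
        return None


detector = SimpleLoopDetector(repeats=3, max_cycle_len=3)
for sig in ["read:a.py", "grep:foo", "read:a.py", "grep:foo", "read:a.py", "grep:foo"]:
    hit = detector.record(sig)
print("cycle length detected:", hit)
```

The signature is whatever string identifies "the same call" for your agent, such as the tool name plus normalized arguments. Too coarse a signature trips on legitimate repetition; too fine a one misses loops whose arguments vary slightly.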
Claude API cost optimization summary
| Optimization | Max savings | Best for | Constraint |
|---|---|---|---|
| Prompt caching (cache_control) | Up to 90% on cached tokens | Agents with large static system prompts or tool definitions | 5-minute TTL; only saves on static prefix, not dynamic messages |
| Message Batches API | 50% on all tokens | Offline processing, bulk jobs, evaluation pipelines | Up to 24-hour latency; not for interactive agents |
| Model tier selection | Up to ~95% (Opus → Haiku) | Routing simple tasks to cheaper models | Quality trade-off; requires task categorization |
| RunGuard loop detection | Prevents total loss on runaway loops | All agents, as a defensive baseline | Trips after N repeats; legitimate repetition also trips it, so set N appropriately |
For the complete cost control architecture for Claude-based agents, see autonomous agent cost control best practices. For loop detection specifically for the Claude Agents SDK, see Claude Agents SDK runaway prevention. For per-request cost caps, see how to set max cost per LLM request.
Optimize Claude API costs while preventing the loops that undo them
RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. Start with prompt caching (the biggest saving for the least code change), then add model tier routing for tasks where Haiku is sufficient, then add RunGuard loop detection as the circuit breaker that keeps a runaway agent from erasing those gains. The optimizations are complementary: together they can reduce Claude API costs by 80% or more on the right workloads.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.