Anthropic Claude API cost optimization: prompt caching, batch API, model selection, and loop prevention

The Anthropic Claude API offers three first-party cost reduction mechanisms that most teams underuse: prompt caching (90% discount on cache reads, 5-minute TTL), the Message Batches API (50% discount for offline/async workloads), and model tier selection (Claude Haiku 4.5 at $0.80/M input vs Claude Sonnet 4.6 at $3/M input). Used together, these can reduce costs by 80% or more on the right workloads. But all three optimizations share a critical vulnerability: a single runaway agent loop can erase them. A tool-call loop that runs 100 iterations costs roughly 100 times a normal turn, whether or not prompt caching is active, and a few such incidents can outweigh a week of savings. This guide covers each optimization with practical Python implementations, the cost math for each, and how RunGuard’s circuit breaking keeps a loop from undoing your cost work.

Optimization 1: prompt caching with cache_control

When prompt caching applies. Prompt caching works on the static prefix of your prompt: the system message, large reference documents, tool definitions, or few-shot examples that are identical across multiple calls. On the first call that includes a cache_control block, the marked prefix is billed at the cache-write rate ($3.75/M for Sonnet 4.6, which is 1.25× the standard $3/M input rate) instead of the standard rate. Subsequent calls that hit the cache within the 5-minute TTL are billed at $0.30/M for those tokens, a 90% discount. For an agent that re-sends a 4,000-token system prompt on every turn, this reduces system-prompt cost from $3/M to $0.30/M on every cache-hit call.
Cache math for a coding agent:
- System prompt: 4,000 tokens (tool definitions + instructions)
- Without cache: 100 turns × 4,000 tokens × $3/M = $1.20 system-prompt cost
- With cache (1 write + 99 hits, all within the 5-min TTL): 1 write (4,000 × $3.75/M = $0.015) + 99 reads × 4,000 × $0.30/M = $0.015 + $0.119 = $0.134
- Savings: $1.07 (89%)
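
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not part of any SDK: the function name caching_cost and its parameters are made up for illustration, and it assumes every turn after the first lands inside the cache TTL, so there is exactly one write and the rest are reads.

```python
def caching_cost(turns, prefix_tokens, input_rate=3.00,
                 write_multiplier=1.25, read_multiplier=0.10):
    """Compare the cost of re-sending a static prefix with and without caching.

    Rates are USD per million tokens. Assumes one cache write on the first
    turn and a cache read on every later turn (no TTL expiry in between).
    """
    per_million = prefix_tokens / 1_000_000
    uncached = turns * per_million * input_rate
    write = per_million * input_rate * write_multiplier
    reads = (turns - 1) * per_million * input_rate * read_multiplier
    cached = write + reads
    return {
        "uncached": uncached,
        "cached": cached,
        "savings": uncached - cached,
        "savings_pct": 100 * (uncached - cached) / uncached,
    }


result = caching_cost(turns=100, prefix_tokens=4_000)
```

For 100 turns and a 4,000-token prefix this gives about $1.20 uncached, $0.134 cached, and savings of roughly 89%. If the cache expires partway through, each expiry adds another write at the 1.25× rate, so the real figure sits slightly below this estimate.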

Python: adding cache_control to system prompt and tools

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a code review assistant with access to the following tools.
Your task is to analyze the provided code and suggest improvements.
[... full system instructions, 2000+ words ...]"""

# Tool definitions (expensive to re-send each turn)
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a source file from the repository",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path relative to repo root"}
            },
            "required": ["path"],
        },
    },
    # ... more tools
]

def call_with_caching(messages: list) -> anthropic.types.Message:
    """
    Send a request with prompt caching on the static prefix.
    The system prompt and tool definitions are cached after the first call;
    subsequent calls within 5 minutes pay 90% less for those tokens.
    """
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this block
            }
        ],
        tools=[
            {**tool, "cache_control": {"type": "ephemeral"}}
            if i == len(TOOLS) - 1  # mark last tool to cache all preceding content
            else tool
            for i, tool in enumerate(TOOLS)
        ],
        messages=messages,
    )

def check_cache_hit(response: anthropic.types.Message) -> dict:
    """Extract cache hit/miss stats from response usage."""
    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    regular = usage.input_tokens
    return {
        "cache_read_tokens": cache_read,
        "cache_write_tokens": cache_write,
        "regular_input_tokens": regular,
        "cache_hit_rate": cache_read / (cache_read + cache_write + regular) if (cache_read + cache_write + regular) else 0,
    }

Cache TTL management. The 5-minute TTL is reset each time the cached prefix is read, so the cache stays warm as long as calls arrive less than 5 minutes apart. If calls are more than 5 minutes apart, the entry expires and the next call pays the cache-write rate again. For interactive agents with human-in-the-loop turns, the cache may expire between turns. For batch processing or high-frequency agents, the cache stays warm. You can tell whether the cache is hitting by checking usage.cache_read_input_tokens in the response: a non-zero value means you paid 90% less for those tokens.
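
To reason about whether a given call pattern keeps the cache warm, it can help to model the TTL offline. The sketch below is a simplified local model, not an API call: classify_calls is a hypothetical helper, and it assumes that both a write and a read restart the 5-minute window.

```python
def classify_calls(timestamps, ttl_seconds=300):
    """Label each call as a cache "write" or "hit" given call times in seconds.

    Simplified model: the first call writes the cache, and any call arriving
    less than ttl_seconds after the previous call is a hit. Every call,
    whether write or hit, restarts the TTL window.
    """
    outcomes = []
    last_use = None
    for t in timestamps:
        if last_use is not None and t - last_use < ttl_seconds:
            outcomes.append("hit")
        else:
            outcomes.append("write")
        last_use = t
    return outcomes


steady = classify_calls([0, 240, 480, 720])   # one call every 4 minutes
bursty = classify_calls([0, 60, 400, 800])    # gaps longer than 5 minutes
```

Under this model the steady pattern pays for one write and then hits on every later call, even though the session runs for 12 minutes, while the bursty pattern pays for three writes out of four calls.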

Optimization 2: Message Batches API for offline workloads

When to use the Batches API. The Batches API processes requests asynchronously and returns results within 24 hours. In exchange, you get a 50% discount on all tokens, input and output. It is appropriate for document processing pipelines, overnight report generation, bulk data extraction, evaluation runs, and any other workload where low latency is not required. It is not appropriate for interactive agents, real-time applications, or anything else that needs a response within seconds or minutes.
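
To see what the 50% discount means for a specific job, a rough estimate like the following can be useful. It is an illustrative sketch: job_cost is a made-up helper, the default rates are the Sonnet 4.6 prices quoted in this guide, and the token counts per request are assumptions you would replace with your own averages.

```python
def job_cost(n_requests, input_tokens, output_tokens,
             input_rate=3.00, output_rate=15.00, batch=False):
    """Estimate the USD cost of a job, with rates in USD per million tokens.

    When batch=True, the 50% batch discount is applied to both input
    and output tokens.
    """
    discount = 0.5 if batch else 1.0
    per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return n_requests * per_request * discount


standard = job_cost(1_000, input_tokens=2_000, output_tokens=300)
batched = job_cost(1_000, input_tokens=2_000, output_tokens=300, batch=True)
```

With 1,000 requests averaging 2,000 input and 300 output tokens, the standard cost is about $10.50 and the batched cost about $5.25. Output tokens make up a large share of the bill even at modest lengths, so it is worth including them in any estimate.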

Python: submit a batch job and retrieve results

import anthropic

client = anthropic.Anthropic()

def submit_batch_job(prompts: list[dict]) -> str:
    """
    Submit a batch of requests at 50% off standard pricing.
    prompts: list of {"custom_id": str, "user_message": str}
    Returns batch ID to poll later.
    """
    requests = [
        {
            "custom_id": p["custom_id"],
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": p["user_message"]}],
            },
        }
        for p in prompts
    ]
    batch = client.messages.batches.create(requests=requests)
    return batch.id

def retrieve_batch_results(batch_id: str) -> list[dict]:
    """
    Poll until the batch has ended, then return one result dict per request.
    In production, poll with backoff and an overall timeout instead of a fixed sleep.
    """
    import time
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(30)  # poll every 30s; adjust for your latency requirements

    results = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            results.append({
                "custom_id": result.custom_id,
                "text": result.result.message.content[0].text,
                "usage": result.result.message.usage,
            })
        else:
            results.append({
                "custom_id": result.custom_id,
                "error": result.result.type,
            })
    return results

# Cost comparison: 1,000 document summaries (input tokens only)
# Standard:  1,000 × avg 2,000 input tokens × $3/M = $6.00
# Batch API: 1,000 × avg 2,000 input tokens × $1.50/M = $3.00
# Savings: $3.00 (50%); output tokens get the same 50% discount
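
The fixed 30-second sleep in retrieve_batch_results is fine for a demo, but a long-running job is usually better served by polling with backoff and a timeout. The following is one possible sketch under stated assumptions: backoff_delays and poll_until are hypothetical helper names, and the is_done callback stands in for whatever completion check you use.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial: float = 5.0, factor: float = 2.0,
                   cap: float = 300.0) -> Iterator[float]:
    """Yield an endless sequence of delays that grows by `factor`, capped at `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay *= factor


def poll_until(
    is_done: Callable[[], bool],
    timeout: float,
    initial: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check is_done() with growing pauses between checks.

    Returns True as soon as is_done() is true, or False once the next
    pause would push the total wait past `timeout` seconds.
    """
    waited = 0.0
    for delay in backoff_delays(initial=initial):
        if is_done():
            return True
        if waited + delay > timeout:
            return False
        sleep(delay)
        waited += delay
    return False  # not reached: backoff_delays never ends
```

In the batch example, is_done could be a small function that retrieves the batch and returns whether processing_status equals "ended". The sleep parameter is injectable so the logic can be tested without actually waiting.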

Optimization 3: model tier selection

Claude model pricing in 2026.

Model	Input $/M	Output $/M	Context window	Best for
Claude Opus 4.7	$15.00	$75.00	200k	Complex reasoning, long documents
Claude Sonnet 4.6	$3.00	$15.00	200k	Production agents, balanced cost/quality
Claude Haiku 4.5	$0.80	$4.00	200k	Simple extraction, classification, routing
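
The relative savings between tiers follow directly from the table. As a quick check, the sketch below computes them from the listed prices; the dictionary keys and the savings_pct helper are illustrative names only.

```python
# (input $/M, output $/M), copied from the pricing table in this guide
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.80, 4.00),
}


def savings_pct(from_tier, to_tier, kind="input"):
    """Percentage saved per token when moving from one tier to a cheaper one."""
    idx = 0 if kind == "input" else 1
    return 100 * (1 - PRICES[to_tier][idx] / PRICES[from_tier][idx])


opus_to_haiku = savings_pct("opus", "haiku")
sonnet_to_haiku = savings_pct("sonnet", "haiku")
opus_to_sonnet = savings_pct("opus", "sonnet")
```

At these prices, moving a task from Opus to Haiku saves about 94.7% per token, Sonnet to Haiku about 73.3%, and Opus to Sonnet 80%. These are per-token figures; they only translate into real savings when the cheaper model produces acceptable output for the task.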

Python: model router that selects tier by task complexity

# Task categories → appropriate model tier
TASK_MODEL_MAP = {
    "classify":    "claude-haiku-4-5-20251001",   # binary/categorical output
    "extract":     "claude-haiku-4-5-20251001",   # structured extraction
    "summarize":   "claude-haiku-4-5-20251001",   # condensation of short content
    "reason":      "claude-sonnet-4-6",           # multi-step reasoning
    "code_review": "claude-sonnet-4-6",           # code quality assessment
    "analyze":     "claude-sonnet-4-6",           # complex analysis
    "research":    "claude-opus-4-7",             # deep synthesis across sources
}

def select_model(task_type: str, content_length: int) -> str:
    """
    Select the cheapest model that can handle the task.
    content_length is the approximate input size in tokens.
    Upgrades to Sonnet for long inputs even on simple tasks
    because Haiku quality degrades on very long context.
    """
    base_model = TASK_MODEL_MAP.get(task_type, "claude-sonnet-4-6")

    # For very long inputs (>50k tokens), upgrade classification/extraction
    # to Sonnet — Haiku quality degrades on long context
    if content_length > 50_000 and base_model == "claude-haiku-4-5-20251001":
        return "claude-sonnet-4-6"

    return base_model

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD for a given model and token counts."""
    rates = {
        "claude-haiku-4-5-20251001": (0.80, 4.00),
        "claude-sonnet-4-6":         (3.00, 15.00),
        "claude-opus-4-7":           (15.00, 75.00),
    }
    in_rate, out_rate = rates.get(model, (3.00, 15.00))
    return input_tokens * in_rate / 1_000_000 + output_tokens * out_rate / 1_000_000
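
How much routing saves depends on the task mix. The sketch below estimates it for a hypothetical workload; the 70/30 split, the per-task token counts, and the helper names are all assumptions for illustration, and the rates are the ones listed in this guide.

```python
RATES = {  # (input $/M, output $/M)
    "haiku": (0.80, 4.00),
    "sonnet": (3.00, 15.00),
}


def tasks_cost(tier, n_tasks, input_tokens, output_tokens):
    """USD cost of running n_tasks on one tier at the given tokens per task."""
    in_rate, out_rate = RATES[tier]
    per_task = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return n_tasks * per_task


def routed_vs_flat(mix, input_tokens=2_000, output_tokens=300, flat_tier="sonnet"):
    """Compare a routed mix ({tier: task count}) with sending every task to flat_tier."""
    routed = sum(tasks_cost(tier, n, input_tokens, output_tokens)
                 for tier, n in mix.items())
    flat = tasks_cost(flat_tier, sum(mix.values()), input_tokens, output_tokens)
    return routed, flat, 100 * (flat - routed) / flat


routed, flat, pct = routed_vs_flat({"haiku": 700, "sonnet": 300})
```

With 700 simple tasks routed to Haiku and 300 kept on Sonnet, the routed cost is about $5.11 versus $10.50 for sending everything to Sonnet, a saving of roughly 51%. The saving shrinks as the share of tasks that genuinely need the larger model grows.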

Why loop prevention is the most important cost optimization

A single loop erases every optimization. Prompt caching cuts the cost of cached prefix tokens by about 90%, but a tool-call loop that runs 100 iterations still burns 100× your per-turn cost, which wipes out the savings and adds net-new overspend. The math: if your normal cost is $0.006/turn (with caching), a 100-turn loop costs $0.60. If a loop like that runs twice per day for a week, that’s $8.40 in loop costs on top of your optimized baseline. This is a floor: each iteration usually appends tool output to the conversation, so per-turn cost grows as the loop runs, and a long unattended loop can exceed an entire month’s optimized budget.
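
The loop math can be sketched in two ways: a flat model that matches the figures above, and a growing-context model in which every iteration re-sends a longer conversation. Both functions are illustrative, and the token counts in the second are assumptions chosen only to show the shape of the growth. The second model ignores caching and output tokens.

```python
def flat_loop_cost(per_turn_usd, iterations, incidents=1):
    """Cost when every loop iteration costs the same as a normal turn."""
    return per_turn_usd * iterations * incidents


def growing_loop_cost(base_input_tokens, tokens_added_per_turn, iterations,
                      input_rate=3.00):
    """Input-token cost when each iteration appends tokens to the context.

    Iteration i (starting at 0) sends base_input_tokens plus
    i * tokens_added_per_turn input tokens. input_rate is USD per million.
    """
    total_tokens = sum(base_input_tokens + i * tokens_added_per_turn
                       for i in range(iterations))
    return total_tokens * input_rate / 1_000_000


one_loop = flat_loop_cost(0.006, 100)                    # about $0.60
one_week = flat_loop_cost(0.006, 100, incidents=14)      # about $8.40
growing = growing_loop_cost(2_000, 500, 100)             # about $8.03
```

Under the growing-context assumptions (2,000 starting tokens, 500 added per iteration, Sonnet input pricing), a single 100-iteration loop costs about $8.03 in input tokens alone, more than ten times the flat estimate for one loop.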

Python: combining all optimizations with RunGuard

import anthropic
from runguard import LoopDetector, BudgetExceededError, LoopDetectedError

client = anthropic.Anthropic()
detector = LoopDetector(repeats=3, max_cycle_len=3)

class OptimizedAgentSession:
    def __init__(self, budget_usd: float = 5.0):
        self.budget = budget_usd
        self.spent = 0.0
        self.messages = []

    def turn(self, task_type: str, user_message: str) -> str:
        # Rough token estimate (~4 characters per token); select_model expects tokens
        approx_tokens = len(user_message) // 4
        model = select_model(task_type, approx_tokens)

        # Pre-check budget
        estimated = estimate_cost(model, 4000, 500)
        if self.spent + estimated > self.budget:
            raise BudgetExceededError(
                f"Estimated turn cost ${estimated:.4f} would exceed "
                f"remaining budget ${self.budget - self.spent:.4f}"
            )

        # Loop detection: record intent before call
        sig = f"{model}:{user_message[:80]}"
        match = detector.record(sig)
        if match:
            raise LoopDetectedError(
                f"Repeated query pattern detected ({match.repeats}×) — stopping.",
                match=match,
            )

        self.messages.append({"role": "user", "content": user_message})

        # Use caching for the system prompt
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=[{"type": "text", "text": SYSTEM_PROMPT,
                      "cache_control": {"type": "ephemeral"}}],
            messages=self.messages,
        )
        text = response.content[0].text
        self.messages.append({"role": "assistant", "content": text})

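        # Record actual cost. usage.input_tokens already excludes cached tokens,
        # so cache reads and writes are added on top rather than subtracted.
        usage = response.usage
        cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
        # Cache reads bill at 0.1× and cache writes at 1.25× the model's input rate,
        # so convert them to an equivalent number of regular input tokens.
        billable_input = usage.input_tokens + 0.10 * cache_read + 1.25 * cache_write
        self.spent += estimate_cost(model, billable_input, usage.output_tokens)
        return text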
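
For readers who want to see what repeat-based loop detection involves, here is a generic, self-contained sketch. It is not RunGuard's implementation and makes no claim about how its LoopDetector works internally; SimpleLoopDetector is a hypothetical class that only borrows the parameter names repeats and max_cycle_len for familiarity. It flags a loop when the most recent signatures consist of one short cycle repeated back to back.

```python
from collections import deque
from typing import Optional


class SimpleLoopDetector:
    """Flag a loop when the recent history ends in a short repeated cycle."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 3):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # Only the last repeats * max_cycle_len signatures can matter.
        self.history = deque(maxlen=repeats * max_cycle_len)

    def record(self, signature: str) -> Optional[int]:
        """Record a call signature; return the cycle length if a loop is found."""
        self.history.append(signature)
        items = list(self.history)
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * self.repeats
            if len(items) < window:
                break  # longer cycles need even more history
            tail = items[-window:]
            cycle = tail[:cycle_len]
            if all(tail[i] == cycle[i % cycle_len] for i in range(window)):
                return cycle_len
        return None


detector = SimpleLoopDetector(repeats=3, max_cycle_len=3)
results = [detector.record(s) for s in ["a", "b", "a", "b", "a", "b"]]
```

In this run the first five calls return None and the sixth returns 2, because the two-step cycle "a", "b" has now occurred three times in a row. The quality of detection depends heavily on how the signature is built: too coarse and legitimate repeated work trips it, too fine and a real loop with slightly varying arguments slips through.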

Claude API cost optimization summary

Optimization	Max savings	Best for	Constraint
Prompt caching (`cache_control`)	Up to 90% on cached tokens	Agents with large static system prompts or tool definitions	5-minute TTL; only saves on static prefix, not dynamic messages
Message Batches API	50% on all tokens	Offline processing, bulk jobs, evaluation pipelines	Up to 24-hour latency; not for interactive agents
Model tier selection	Up to ~95% (Opus → Haiku)	Routing simple tasks to cheaper models	Quality trade-off; requires task categorization
RunGuard loop detection	Prevents total loss on runaway loops	All agents, as a defensive baseline	Trips after N repeats; legitimate repeated calls can also trip it, so tune N to your workload

For the complete cost control architecture for Claude-based agents, see autonomous agent cost control best practices. For loop detection specifically for the Claude Agents SDK, see Claude Agents SDK runaway prevention. For per-request cost caps, see how to set max cost per LLM request.

Optimize Claude API costs while preventing the loops that undo them

RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. Start with prompt caching (biggest bang for the least code change), then add model tier routing for tasks where Haiku is sufficient, then add RunGuard loop detection as the circuit breaker that keeps optimization gains from being erased by a runaway agent. The three optimizations are complementary: together they can reduce Claude API costs by 80% or more on the right workloads.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
