Anthropic Claude API Cost Control: Loop Detection and Budget Enforcement in Production

Building agentic loops directly on the Anthropic Messages API — using the anthropic Python package or @anthropic-ai/sdk for TypeScript — gives you maximum control over every aspect of the interaction. You define the tools, structure the messages, control the loop termination, and decide exactly when to stop. That control comes with a cost: there is no framework layer enforcing loop budgets, no built-in similarity check on tool inputs, no max-turns guard, and no mechanism that fires before the billing meter does.

The Messages API tool use flow is straightforward to describe and expensive to get wrong. You send a message with tools defined, the model returns a response with stop_reason: "tool_use" and one or more ToolUseBlock items in content, you execute the tools, append the results as a role: "user" message with type: "tool_result" blocks, and call the API again. That loop continues until the model issues a stop_reason: "end_turn" response. Nothing in the SDK stops the loop from running indefinitely. There is no max_turns parameter on messages.create(), no token budget enforcement, no Slack notification when your per-run spend crosses a threshold.

This post covers four failure modes specific to raw Claude API agents and provides complete Python implementations for each. The guards use only the standard library plus the anthropic package, with no additional dependencies. A TypeScript SpiralGuard equivalent using @anthropic-ai/sdk follows the Python implementations. A later section explains RunGuard's managed API as an alternative to maintaining your own guard infrastructure. If you're not familiar with the general principles behind these patterns, the AI agent cost control pattern reference covers the universal failure modes before diving into SDK-specific implementations.

How the Claude Messages API tool use loop works

The anthropic Python SDK (install via pip install anthropic) exposes a synchronous and async client. A minimal tool use agent loop looks like this:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

tools = [
    {
        "name": "web_search",
        "description": "Search the web for up-to-date information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]

messages = [{"role": "user", "content": "What are the latest AI agent frameworks released in 2026?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )

    if response.stop_reason == "end_turn":
        # Model issued a final text response
        text = "".join(b.text for b in response.content if b.type == "text")
        print(text)
        break

    if response.stop_reason == "max_tokens":
        # Output was cut off — handle truncation
        break

    # stop_reason == "tool_use": execute each tool call
    messages.append({"role": "assistant", "content": response.content})

    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })

    messages.append({"role": "user", "content": tool_results})

The loop above is correct and complete. It is also unbounded. Four structural properties of the Messages API create cost exposure that no amount of correct loop structure can prevent on its own:

  • No turn limit: messages.create() has no max_turns parameter. The loop runs until your code breaks out of it or the application process crashes.
  • No input similarity check: the SDK sends whatever input dict the model returns. If the model calls web_search with {"query": "AI agent frameworks 2026"} ten times in a row, all ten calls go through.
  • Growing messages list: each turn appends the assistant content and tool results. A 100-turn loop against a search tool can accumulate 80,000+ input tokens before the model ever issues end_turn.
  • SDK retries are hidden from your code: by default the SDK retries failed requests twice (max_retries=2, three total attempts). An application-level retry wrapped around the call creates a silent multiplier: 3 SDK attempts × 3 application attempts = 9 actual API calls per logical failure.
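Because messages.create() has no turn cap, the simplest bound is a counter checked at the top of the loop. The following is a minimal sketch: TurnLimitGuard and its default of 25 turns are hypothetical choices for illustration, not part of the anthropic package.

```python
class TurnLimitGuard:
    """Hypothetical helper: caps the number of model calls in one agent run."""

    def __init__(self, max_turns: int = 25):  # 25 is an arbitrary illustrative default
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.max_turns = max_turns
        self.turns = 0

    def tick(self) -> int:
        """Call once before each messages.create(). Raises when the cap is hit."""
        if self.turns >= self.max_turns:
            raise RuntimeError(
                f"Turn limit reached: {self.turns} model calls in this run "
                f"(max_turns={self.max_turns})."
            )
        self.turns += 1
        return self.turns
```

Like the other guards in this post, it would be instantiated once per run, with tick() called as the first statement inside the while loop.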

Failure mode 1: Tool use spiral

A tool use spiral occurs when the model calls the same tool with semantically identical or very similar arguments across multiple consecutive turns. The most common trigger is a retrieval or search tool: the model searches for something, the result doesn't satisfy the prompt, so it searches again with a nearly identical query. Claude's instruction-following is strong enough that it usually doesn't repeat the exact same string, but "AI agent frameworks 2026", "latest AI agent frameworks", and "new agent frameworks released 2026" are semantically the same search with different surface text.

Exact-match deduplication catches nothing here. The right detection primitive is a Jaccard similarity coefficient on the normalized token set of the tool input arguments, compared across a sliding window of recent calls to the same tool. A threshold of 0.72 catches near-duplicate calls while tolerating legitimate parameter variation — a threshold tuned across the same patterns described in the pattern reference and validated across framework-specific implementations in this series.
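To get a feel for the scores involved, the sketch below computes token-level Jaccard similarity for the three rephrasings quoted earlier, plus one closer variation. The helper name token_jaccard and the fourth query are illustrative assumptions; the tokenization (lowercase, whitespace split) matches the guard described in this section.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase, whitespace-split token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

rephrasings = [
    "AI agent frameworks 2026",
    "latest AI agent frameworks",
    "new agent frameworks released 2026",
]

# Pairwise scores between the three rephrasings, keyed by index pair
scores = {
    (i, j): round(token_jaccard(rephrasings[i], rephrasings[j]), 2)
    for i in range(len(rephrasings))
    for j in range(i + 1, len(rephrasings))
}

# A one-word variation of the first query (illustrative input)
near_duplicate = round(
    token_jaccard("AI agent frameworks 2026", "AI agent frameworks in 2026"), 2
)
print(scores, near_duplicate)
```

On these inputs the three rephrasings score 0.60, 0.50, and 0.29 against each other, all below 0.72, while the one-word variation scores 0.80. Token-level Jaccard therefore catches small surface edits more reliably than looser paraphrases, which is worth checking against logged inputs from your own tools before settling on a threshold.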

The guard keeps no state beyond its per-tool sliding window. Instantiate one guard per agent run, not per application startup: shared instances cross-contaminate spiral detection across independent runs.

import json
from collections import defaultdict, deque

class ToolSpiralGuard:
    """Detects tool use spirals via Jaccard similarity on a sliding window."""

    def __init__(self, window: int = 4, threshold: float = 0.72):
        self.window = window
        self.threshold = threshold
        self._history: dict[str, deque] = defaultdict(lambda: deque(maxlen=window))

    def _fingerprint(self, tool_input: dict) -> frozenset:
        """Normalize tool input to a token set for Jaccard comparison."""
        tokens = set()
        for v in tool_input.values():
            if isinstance(v, str):
                tokens.update(v.lower().split())
            elif isinstance(v, (int, float, bool)):
                tokens.add(str(v))
            elif isinstance(v, (list, dict)):
                tokens.update(json.dumps(v, sort_keys=True).lower().split())
        return frozenset(tokens)

    def _jaccard(self, a: frozenset, b: frozenset) -> float:
        if not a and not b:
            return 1.0
        union = a | b
        if not union:
            return 0.0
        return len(a & b) / len(union)

    def check(self, tool_name: str, tool_input: dict) -> None:
        """Raises RuntimeError if a spiral is detected."""
        fp = self._fingerprint(tool_input)
        history = self._history[tool_name]

        for prior_fp in history:
            sim = self._jaccard(fp, prior_fp)
            if sim >= self.threshold:
                raise RuntimeError(
                    f"Tool use spiral detected on '{tool_name}': "
                    f"Jaccard similarity {sim:.2f} >= {self.threshold} "
                    f"with a recent call. Halting to prevent runaway costs."
                )

        history.append(fp)

    def record(self, tool_name: str, tool_input: dict) -> None:
        """Record a tool call without checking it. check() already records
        on success, so do not call both for the same tool call."""
        fp = self._fingerprint(tool_input)
        self._history[tool_name].append(fp)

Usage in the agent loop — call check() before executing each tool call block:

spiral_guard = ToolSpiralGuard()

for block in response.content:
    if block.type == "tool_use":
        # Raises RuntimeError if spiral detected
        spiral_guard.check(block.name, block.input)
        result = execute_tool(block.name, block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": result,
        })

When RuntimeError is raised, catch it outside the loop and return whatever partial result you have, or surface a structured error to the caller. Do not catch and continue: the guard tripped because the next call would spend money on a near-duplicate result.
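One way to follow that advice is to wrap the whole loop, not the individual call, and return a small structured result. This is a sketch only: run_steps_until_halt and the result keys are hypothetical, and the steps stand in for loop iterations.

```python
from typing import Callable, Iterable

def run_steps_until_halt(steps: Iterable[Callable[[], str]]) -> dict:
    """Run steps in order; on the first RuntimeError, stop and return the
    outputs collected so far instead of continuing the loop."""
    collected: list[str] = []
    try:
        for step in steps:
            collected.append(step())
    except RuntimeError as exc:
        # The guard tripped: report why and hand back the partial work
        return {"status": "halted", "reason": str(exc), "partial": collected}
    return {"status": "ok", "reason": None, "partial": collected}
```

Because the try block surrounds the loop, nothing after the tripped step runs, which is the behavior the guard relies on.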

Failure mode 2: Context window accumulation

Every tool call appends two entries to the messages list: the assistant's ToolUseBlock response and your tool_result user message. In a research agent that calls a search tool 40 times before producing a final answer, the messages list at turn 40 includes 80 additional entries plus all their content. Tool results from web search, document retrieval, or code execution are often long — a single result may be 2,000 tokens.

Claude 4 models support a 200,000-token context window. That sounds large until you run a research agent against a knowledge base for 30 minutes and find the 185th turn failing with a 400 invalid_request_error because the prompt is too long. By then you have already paid for the input tokens of every preceding turn, including the growing context re-sent in full on every call.
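The billing effect of re-sending the history is quadratic in the number of turns. If the first request is B tokens and each turn adds T tokens, call k sends B + (k − 1)·T input tokens, so n calls are billed n·B + T·n·(n − 1)/2 in total. The sketch below evaluates that for the 40-search scenario above; the 1,000-token base prompt is an assumption, and the $3/MTok rate is the Sonnet input price quoted later in this post.

```python
def cumulative_input_tokens(turns: int, base_tokens: int, tokens_per_turn: int) -> int:
    """Total input tokens billed across `turns` calls when every call
    re-sends the full history and each turn adds `tokens_per_turn` tokens."""
    # Call k (1-indexed) sends base_tokens + (k - 1) * tokens_per_turn tokens;
    # summing over k = 1..turns gives the closed form below.
    return turns * base_tokens + tokens_per_turn * turns * (turns - 1) // 2

# 40 calls, assumed 1,000-token starting prompt, 2,000 tokens added per turn
total = cumulative_input_tokens(turns=40, base_tokens=1_000, tokens_per_turn=2_000)
cost_usd = total * 3.00 / 1_000_000  # $3/MTok input rate
print(total, cost_usd)
```

Under these assumptions the final request is only 79,000 tokens, well inside the window, yet the run has been billed for 1,600,000 input tokens in total.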

The Anthropic SDK provides a client.messages.count_tokens() method that returns a token count estimate without making a model call. It accepts the prompt-side parameters of messages.create() (model, messages, system, tools) but not generation parameters such as max_tokens, and it is fast enough to call on every turn. Use it to check context size before sending each model request:

class ContextGuard:
    """Guards against context window accumulation."""

    def __init__(self, client: anthropic.Anthropic, model: str,
                 warn_fraction: float = 0.70, hard_fraction: float = 0.85):
        self.client = client
        self.model = model
        self.warn_fraction = warn_fraction
        self.hard_fraction = hard_fraction
        # Model context limits in tokens
        self._limits = {
            "claude-opus-4-7":   200_000,
            "claude-sonnet-4-6": 200_000,
            "claude-haiku-4-5":  200_000,
        }

    def _limit(self) -> int:
        for prefix, limit in self._limits.items():
            if self.model.startswith(prefix):
                return limit
        return 200_000  # safe default for current Claude 4 generation

    def check(self, messages: list, tools: list | None = None,
              system: str | None = None) -> int:
        """Check context size. Returns token count. Raises on hard limit."""
        # count_tokens() does not accept max_tokens; pass only prompt-side fields
        kwargs: dict = {"model": self.model, "messages": messages}
        if tools:
            kwargs["tools"] = tools
        if system:
            kwargs["system"] = system

        response = self.client.messages.count_tokens(**kwargs)
        count = response.input_tokens
        limit = self._limit()

        if count >= limit * self.hard_fraction:
            raise RuntimeError(
                f"Context window hard limit: {count:,} tokens is "
                f"{count/limit:.0%} of {limit:,}-token limit for {self.model}. "
                f"Halting to prevent truncation and wasted spend."
            )

        if count >= limit * self.warn_fraction:
            import warnings
            warnings.warn(
                f"Context window warning: {count:,} tokens ({count/limit:.0%} of limit). "
                f"Consider summarizing earlier turns.",
                stacklevel=2,
            )

        return count

Call context_guard.check() at the top of each loop iteration, before calling messages.create():

context_guard = ContextGuard(client=client, model="claude-sonnet-4-6")

while True:
    # Check context before every API call — count_tokens is fast and free
    context_guard.check(messages, tools=tools, system=system_prompt)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=system_prompt,  # send the same system prompt that was counted
        tools=tools,
        messages=messages,
    )
    # ... rest of loop

The count_tokens() call is not billed: it counts tokens server-side without invoking the model. The overhead per turn is one extra network round trip (roughly 50 to 150ms on a warm connection), which is small relative to the model call latency. The endpoint has its own rate limits, separate from those of messages.create(). Do not skip the check to save time; a single context overflow costs far more than the latency of thousands of counting calls.

Important: The count_tokens() method requires the same tools and system parameters you pass to messages.create(). Omitting them will undercount tokens, making your limit check inaccurate. The tool definitions themselves consume tokens — complex tool schemas with many properties can add 500–2,000 tokens to every request.
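When the warning threshold fires, one cheap alternative to summarizing is to blank out the bodies of older tool results while keeping their tool_use_id pairing intact. The sketch below assumes messages are plain dicts shaped like the ones built in the loop earlier in this post; trim_old_tool_results, keep_last, and the placeholder text are hypothetical.

```python
import copy

def trim_old_tool_results(messages: list, keep_last: int = 3,
                          placeholder: str = "[earlier tool result omitted]") -> list:
    """Return a copy of `messages` in which all but the last `keep_last`
    tool_result blocks have their content replaced by `placeholder`.
    tool_use_id fields are left untouched so each result still pairs
    with its tool_use block."""
    trimmed = copy.deepcopy(messages)  # never mutate the caller's history
    results = [
        block
        for msg in trimmed
        if msg.get("role") == "user" and isinstance(msg.get("content"), list)
        for block in msg["content"]
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    cutoff = max(0, len(results) - keep_last)
    for block in results[:cutoff]:
        block["content"] = placeholder
    return trimmed
```

The trade-off is that the model loses the detail of older results, and editing earlier messages changes the prompt prefix, so any prompt cache built on the old prefix no longer applies.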

Failure mode 3: Retry cascade multiplication

The Anthropic Python SDK retries failed requests automatically. The default is max_retries=2, meaning the SDK makes up to three attempts (one original plus two retries) on connection errors, 408 (request timeout), 409 (conflict), 429 (rate limit), and 5xx responses. This is correct behavior for a single-call client, where you want transient failures to self-heal.

The problem arises at the agent loop level. If your application wraps the entire loop in a retry block (a common pattern for handling downstream failures), the multiplication is invisible:

  • SDK attempt 1 fails → SDK retries → SDK attempt 2 fails → SDK retries → SDK attempt 3 fails → SDK raises
  • Application retry block catches the exception → starts the loop again from the top
  • Result: 3 application attempts × 3 SDK attempts = 9 actual API calls for a single logical failure, each sending the entire accumulated messages list

On a 100-turn research agent with 50,000 accumulated input tokens at the failure point, nine redundant calls to claude-opus-4-7 at $15/MTok input come to $6.75 (9 × 50,000 tokens × $15/MTok) in the worst case where every attempt is billed for its input. A production system hitting this failure mode hourly can accumulate hundreds of dollars per day in invisible retry spend.
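The same worst-case arithmetic can be parameterized to compare retry configurations. This is a sketch under the assumption that every attempt is billed for the full context; retry_cascade_cost is a hypothetical helper.

```python
def retry_cascade_cost(context_tokens: int, sdk_attempts: int,
                       app_attempts: int, usd_per_mtok: float) -> float:
    """Worst-case input cost if every attempt is billed for the full context."""
    calls = sdk_attempts * app_attempts  # attempts multiply across layers
    return calls * context_tokens * usd_per_mtok / 1_000_000

# SDK default (3 attempts) under a 3-attempt application retry
default_cost = retry_cascade_cost(50_000, sdk_attempts=3, app_attempts=3, usd_per_mtok=15.00)
# Same application retry with SDK retries disabled (max_retries=0)
no_sdk_retry = retry_cascade_cost(50_000, sdk_attempts=1, app_attempts=3, usd_per_mtok=15.00)
print(default_cost, no_sdk_retry)
```

With these inputs, removing one retry layer cuts the worst case from $6.75 to $2.25 per failure event.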

The fix is a circuit breaker that tracks consecutive failures per run and refuses to make further calls once the threshold is exceeded:

import time

class ClaudeCircuitBreaker:
    """Per-run circuit breaker for the Anthropic SDK."""

    def __init__(self, failure_threshold: int = 2, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._consecutive_failures = 0
        self._tripped_at: float | None = None

    def _is_open(self) -> bool:
        if self._tripped_at is None:
            return False
        if time.monotonic() - self._tripped_at >= self.cooldown_seconds:
            self._consecutive_failures = 0
            self._tripped_at = None
            return False
        return True

    def before_call(self) -> None:
        """Call before every messages.create(). Raises if breaker is open."""
        if self._is_open():
            wait = self.cooldown_seconds - (time.monotonic() - self._tripped_at)
            raise RuntimeError(
                f"Circuit breaker open: {self._consecutive_failures} consecutive "
                f"API failures. Cooldown has {wait:.0f}s remaining."
            )

    def on_success(self) -> None:
        """Call after a successful messages.create() response."""
        self._consecutive_failures = 0
        self._tripped_at = None

    def on_failure(self, exc: Exception) -> None:
        """Call when messages.create() raises. Always raises: the original
        error below the threshold, RuntimeError once the breaker trips."""
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._tripped_at = time.monotonic()
            raise RuntimeError(
                f"Circuit breaker tripped after {self._consecutive_failures} "
                f"consecutive failures. Last error: {exc}"
            ) from exc
        raise exc

Usage — wrap every messages.create() call with the circuit breaker:

circuit_breaker = ClaudeCircuitBreaker(failure_threshold=2, cooldown_seconds=60.0)

while True:
    circuit_breaker.before_call()
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        circuit_breaker.on_success()
    except anthropic.APIError as e:
        circuit_breaker.on_failure(e)  # always raises; wraps the error once the threshold is hit

    # ... process response

Set max_retries=0 on the SDK client when using this circuit breaker if you want full control over retry logic:

client = anthropic.Anthropic(max_retries=0)  # circuit breaker handles retries

With max_retries=0, every failed call raises immediately to your on_failure() handler. Any re-attempt is then an explicit decision in your own code, and the circuit breaker caps the number of consecutive failures, so there is no hidden multiplication.
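With SDK retries disabled, transient failures need an explicit, bounded re-attempt policy somewhere in your code. The helper below is one possible sketch with exponential backoff; call_with_backoff and its parameters are hypothetical, and the sleep function is injectable so the schedule can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn() up to max_attempts times, sleeping base_delay * 2**n
    between attempts. Re-raises the last error once attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

In an agent loop, fn would be a lambda around client.messages.create(...), and retry_on would typically be narrowed to transient SDK errors such as anthropic.APIConnectionError, anthropic.RateLimitError, and anthropic.InternalServerError. The total number of API calls per logical failure is then exactly max_attempts.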

Failure mode 4: Budget breach

The Anthropic SDK returns token counts in every response via the usage attribute: response.usage.input_tokens and response.usage.output_tokens. These are exact counts, not estimates. A per-run budget guard accumulates spend from these counts and halts the loop before the next call once the budget ceiling has been reached.

Claude 4 model pricing as of mid-2026:

Model               Input ($/MTok)   Output ($/MTok)   Context
claude-opus-4-7     $15.00           $75.00            200k
claude-sonnet-4-6   $3.00            $15.00            200k
claude-haiku-4-5    $0.80            $4.00             200k

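Cache-aware pricing note: the Anthropic API supports prompt caching for long static prefixes (system prompts, tool definitions). Cache reads are billed at approximately 10% of the normal input rate, and cache writes at a premium over it. With caching enabled, response.usage.input_tokens covers only the uncached portion of the prompt; cached tokens are reported separately as cache_creation_input_tokens and cache_read_input_tokens. The guard below prices input_tokens and output_tokens only, so it matches actual spend when caching is off and undercounts when caching is on.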

class BudgetGuard:
    """Per-run budget enforcement using exact token counts from API responses."""

    # USD per token (not per MTok)
    PRICING = {
        "claude-opus-4-7":   {"input": 15.00 / 1_000_000, "output": 75.00 / 1_000_000},
        "claude-sonnet-4-6": {"input":  3.00 / 1_000_000, "output": 15.00 / 1_000_000},
        "claude-haiku-4-5":  {"input":  0.80 / 1_000_000, "output":  4.00 / 1_000_000},
    }

    def __init__(self, model: str, budget_usd: float):
        self.model = model
        self.budget_usd = budget_usd
        self._spent_usd = 0.0
        self._input_tokens = 0
        self._output_tokens = 0

        # Find pricing for this model (prefix match)
        self._rates = None
        for prefix, rates in self.PRICING.items():
            if model.startswith(prefix):
                self._rates = rates
                break
        if self._rates is None:
            # Fall back to Sonnet pricing for unknown Claude 4 models
            self._rates = self.PRICING["claude-sonnet-4-6"]

    def check(self) -> None:
        """Call before messages.create(). Raises if at or over budget."""
        if self._spent_usd >= self.budget_usd:
            raise RuntimeError(
                f"Budget ceiling reached: ${self._spent_usd:.4f} spent "
                f"against ${self.budget_usd:.4f} limit for this run. "
                f"Halting to prevent further charges."
            )

    def record(self, response) -> float:
        """Record token usage from a response. Returns incremental cost."""
        input_tok = response.usage.input_tokens
        output_tok = response.usage.output_tokens
        cost = (input_tok * self._rates["input"] +
                output_tok * self._rates["output"])
        self._spent_usd += cost
        self._input_tokens += input_tok
        self._output_tokens += output_tok
        return cost

    @property
    def spent(self) -> float:
        return self._spent_usd

    @property
    def remaining(self) -> float:
        return max(0.0, self.budget_usd - self._spent_usd)

    def summary(self) -> dict:
        return {
            "spent_usd": round(self._spent_usd, 6),
            "budget_usd": self.budget_usd,
            "input_tokens": self._input_tokens,
            "output_tokens": self._output_tokens,
            "model": self.model,
        }

Call budget_guard.check() before each API call and budget_guard.record(response) immediately after:

budget_guard = BudgetGuard(model="claude-sonnet-4-6", budget_usd=0.50)

while True:
    budget_guard.check()  # halt if budget exceeded

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )

    cost = budget_guard.record(response)
    # print(f"Turn cost: ${cost:.4f} | Remaining: ${budget_guard.remaining:.4f}")

    if response.stop_reason == "end_turn":
        break
    # ... process tool calls
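
The check() method above only trips after the ceiling has already been reached, so the final call can overshoot. A stricter variant projects the worst-case cost of the next call before making it. The sketch below is an assumption-laden illustration: both function names are hypothetical, the input token count would come from a counting call such as ContextGuard.check(), and the projection assumes the response uses the full max_tokens allowance.

```python
def projected_call_cost(input_tokens: int, max_tokens: int,
                        input_rate: float, output_rate: float) -> float:
    """Worst-case USD cost of the next call: all counted input tokens plus
    a response that uses the full max_tokens allowance. Rates are USD/token."""
    return input_tokens * input_rate + max_tokens * output_rate

def would_exceed_budget(spent_usd: float, budget_usd: float,
                        input_tokens: int, max_tokens: int,
                        input_rate: float, output_rate: float) -> bool:
    """True if the next call could push total spend past the budget."""
    projected = projected_call_cost(input_tokens, max_tokens, input_rate, output_rate)
    return spent_usd + projected > budget_usd

# Illustrative numbers: $0.45 already spent of a $0.50 budget, Sonnet rates
SONNET_IN, SONNET_OUT = 3.00 / 1_000_000, 15.00 / 1_000_000
blocked = would_exceed_budget(0.45, 0.50, 20_000, 4096, SONNET_IN, SONNET_OUT)
allowed = not would_exceed_budget(0.10, 0.50, 20_000, 4096, SONNET_IN, SONNET_OUT)
```

The projection is deliberately pessimistic; most responses use far fewer than max_tokens output tokens, so this variant halts slightly early in exchange for never overshooting.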

Integrating all four guards

The four guards are independent — each handles a distinct failure mode and can be used without the others. In production, all four should run together. The application order matters: check budget and circuit breaker before making the API call, check spiral before executing tool calls, check context before constructing the next API call.

def run_claude_agent(
    messages: list,
    tools: list,
    model: str = "claude-sonnet-4-6",
    system: str | None = None,
    budget_usd: float = 0.50,
) -> str:
    """Run a tool-use agent loop with all four cost guards active."""
    spiral_guard   = ToolSpiralGuard(window=4, threshold=0.72)
    context_guard  = ContextGuard(client=client, model=model)
    circuit_breaker = ClaudeCircuitBreaker(failure_threshold=2, cooldown_seconds=60.0)
    budget_guard   = BudgetGuard(model=model, budget_usd=budget_usd)

    create_kwargs: dict = {
        "model": model,
        "max_tokens": 4096,
        "tools": tools,
        "messages": messages,
    }
    if system:
        create_kwargs["system"] = system

    while True:
        # 1. Budget check — before any API call
        budget_guard.check()

        # 2. Context check — before constructing the model request
        context_guard.check(
            messages, tools=tools,
            system=system,
        )

        # 3. Circuit breaker — before the SDK call
        circuit_breaker.before_call()
        try:
            response = client.messages.create(**create_kwargs)
            circuit_breaker.on_success()
        except anthropic.APIError as e:
            circuit_breaker.on_failure(e)

        # 4. Record cost
        budget_guard.record(response)

        # Terminal conditions
        if response.stop_reason in ("end_turn", "stop_sequence"):
            text_blocks = [b.text for b in response.content if hasattr(b, "text")]
            return " ".join(text_blocks)

        if response.stop_reason == "max_tokens":
            raise RuntimeError("Response truncated by max_tokens limit.")

        # Process tool calls
        messages.append({"role": "assistant", "content": response.content})

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                # 5. Spiral check — before executing each tool call
                spiral_guard.check(block.name, block.input)
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result),
                })

        messages.append({"role": "user", "content": tool_results})

The guard initialization is at the top of the function — one instance per call to run_claude_agent(). Shared guard instances across calls would cause false positives: a spiral detected in a previous run's tool history would trip the guard on the first tool call of a new run. Per-run instantiation is the invariant all four guards rely on.

TypeScript implementation with @anthropic-ai/sdk

The @anthropic-ai/sdk package (install via npm install @anthropic-ai/sdk) mirrors the Python API closely. A TypeScript SpiralGuard for the tool use spiral failure mode:

import Anthropic from "@anthropic-ai/sdk";

class SpiralGuard {
  private window: number;
  private threshold: number;
  private history: Map<string, Array<Set<string>>> = new Map();

  constructor(window = 4, threshold = 0.72) {
    this.window = window;
    this.threshold = threshold;
  }

  private fingerprint(input: Record<string, unknown>): Set<string> {
    const tokens = new Set<string>();
    for (const v of Object.values(input)) {
      if (typeof v === "string") {
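        // filter(Boolean) drops empty tokens produced by leading/trailing whitespace
        v.toLowerCase().split(/\s+/).filter(Boolean).forEach((t) => tokens.add(t));
      } else if (typeof v === "number" || typeof v === "boolean") {
        tokens.add(String(v));
      } else if (v !== null && typeof v === "object") {
        // Unlike Python's json.dumps(sort_keys=True), JSON.stringify keeps insertion order
        JSON.stringify(v).toLowerCase().split(/\s+/).filter(Boolean).forEach((t) => tokens.add(t));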
      }
    }
    return tokens;
  }

  private jaccard(a: Set<string>, b: Set<string>): number {
    if (a.size === 0 && b.size === 0) return 1.0;
    const intersection = new Set([...a].filter((x) => b.has(x)));
    const union = new Set([...a, ...b]);
    return union.size === 0 ? 0 : intersection.size / union.size;
  }

  check(toolName: string, toolInput: Record<string, unknown>): void {
    const fp = this.fingerprint(toolInput);
    const history = this.history.get(toolName) ?? [];

    for (const prior of history) {
      const sim = this.jaccard(fp, prior);
      if (sim >= this.threshold) {
        throw new Error(
          `Tool use spiral on '${toolName}': Jaccard similarity ` +
          `${sim.toFixed(2)} >= ${this.threshold} with recent call.`
        );
      }
    }

    const updated = [...history, fp].slice(-this.window);
    this.history.set(toolName, updated);
  }
}

// Usage in agent loop
const client = new Anthropic();
const spiralGuard = new SpiralGuard();

for (const block of response.content) {
  if (block.type === "tool_use") {
    spiralGuard.check(block.name, block.input as Record<string, unknown>);
    const result = await executeTool(block.name, block.input);
    toolResults.push({
      type: "tool_result" as const,
      tool_use_id: block.id,
      content: String(result),
    });
  }
}

For the BudgetGuard in TypeScript, use response.usage.input_tokens and response.usage.output_tokens from the Message response type, the same fields as in the Python SDK. Token counting is available as client.messages.countTokens(), which takes model and messages plus optional system and tools, but not max_tokens.

Prompt caching and extended thinking

Two Anthropic-specific features interact with cost guards in ways worth understanding before deploying to production:

Prompt caching: you can mark static content (system prompts, long tool definitions, reference documents) with {"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}} in the system or messages parameters. Cache reads cost approximately 10% of the standard input token price on subsequent calls, while cache writes cost more than standard input. When caching is active, response.usage.input_tokens excludes cached tokens; they are reported in response.usage.cache_creation_input_tokens and response.usage.cache_read_input_tokens. The BudgetGuard above prices only input_tokens and output_tokens, so it underestimates cost when caching is active. To track accurately, add both cache fields to the spend calculation at their respective rates.
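A cache-aware cost function can price the three input fields separately. The sketch below is not part of the BudgetGuard above: cache_aware_cost is a hypothetical helper, and the multipliers (1.25× the input rate for 5-minute cache writes, 0.10× for cache reads) are assumptions that should be checked against the current pricing page.

```python
from types import SimpleNamespace

CACHE_WRITE_MULTIPLIER = 1.25  # assumed: 5-minute cache write vs. base input rate
CACHE_READ_MULTIPLIER = 0.10   # assumed: cache read vs. base input rate

def cache_aware_cost(usage, input_rate: float, output_rate: float) -> float:
    """USD cost of one response, pricing uncached input, cache writes,
    and cache reads separately. Rates are USD per token."""
    uncached = usage.input_tokens
    # The cache fields may be absent or None when caching is not in use
    written = getattr(usage, "cache_creation_input_tokens", None) or 0
    read = getattr(usage, "cache_read_input_tokens", None) or 0
    input_cost = input_rate * (
        uncached + written * CACHE_WRITE_MULTIPLIER + read * CACHE_READ_MULTIPLIER
    )
    return input_cost + usage.output_tokens * output_rate

# Stand-in for response.usage with made-up token counts
usage = SimpleNamespace(input_tokens=1_000, output_tokens=500,
                        cache_creation_input_tokens=0,
                        cache_read_input_tokens=20_000)
cost = cache_aware_cost(usage, 3.00 / 1_000_000, 15.00 / 1_000_000)
```

In this example the 20,000 cached tokens contribute as much as 2,000 uncached ones, which a guard reading only input_tokens would miss entirely.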

Extended thinking: enabling extended thinking (thinking: {"type": "enabled", "budget_tokens": N}) adds thinking tokens to every response. Thinking tokens are billed as output tokens and are included in response.usage.output_tokens, so the BudgetGuard already counts them at the output rate. Extended thinking can increase per-turn cost significantly for complex tasks. Re-evaluate your budget_usd ceiling when enabling it, since the same number of turns will consume more of the budget.

RunGuard managed API

The four guards above require you to maintain spiral detection state, context counting logic, circuit breaker state machines, and budget accounting across your application codebase. As your agent fleet grows — multiple models, multiple tool sets, multiple application teams — that maintenance burden compounds. RunGuard's managed API exposes the same circuit-breaking logic as a hosted service with two HTTP calls per turn:

import httpx

RUNGUARD_API_KEY = "rg_..."
RUNGUARD_BASE = "https://api.runguard.dev/v1"

def runguard_check(run_id: str, tool_name: str, tool_input: dict,
                   context_tokens: int, spent_usd: float) -> dict:
    """Check guards before a tool execution. Returns {allowed: bool, reason: str}"""
    resp = httpx.post(
        f"{RUNGUARD_BASE}/check",
        json={
            "run_id": run_id,
            "tool_name": tool_name,
            "tool_input": tool_input,
            "context_tokens": context_tokens,
            "spent_usd": spent_usd,
        },
        headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
        timeout=2.0,
    )
    resp.raise_for_status()
    return resp.json()

def runguard_record(run_id: str, response) -> None:
    """Record token usage after a successful API call."""
    httpx.post(
        f"{RUNGUARD_BASE}/record",
        json={
            "run_id": run_id,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        },
        headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
        timeout=2.0,
    )

RunGuard stores run state server-side — no local guard objects to instantiate per run, no state to serialize for multi-process deployments. The dashboard shows trips per app, cost per run, and spiral frequency across your entire fleet. The Solo plan at $19/month covers one app with 1M guarded invocations per month — less than the cost of a single runaway research agent hitting claude-opus-4-7 for 200 turns.

Frequently asked questions

Does count_tokens() actually not cost anything?

Correct: the Anthropic count_tokens() endpoint returns a token count estimate without invoking the model. It is not billed. The call still traverses the network (expect 50 to 150ms latency on a warm connection), so it adds per-turn overhead, but that overhead is small relative to model call latency. The endpoint is available on all Claude 4 models and accepts model and messages plus optional tools and system; it does not take generation parameters such as max_tokens.

Why 0.72 as the Jaccard similarity threshold?

0.72 is the threshold that catches near-duplicate tool inputs (same query with trivial surface rephrasing) while allowing legitimate tool call variation (same tool, different parameters for different subtasks). The calibration is described in detail in the pattern reference: below 0.65 you get false positives that trip the guard on legitimate multi-step tool use; above 0.80 you miss the most common spiral pattern (model generates slightly varied query text for semantically identical searches). 0.72 is the empirical midpoint validated across multiple frameworks in this series. If your tools have very uniform input schemas (e.g., a single query string field), consider raising to 0.78; if they have complex structured inputs, lower to 0.68.

What happens when the circuit breaker trips — does the run die permanently?

No. The ClaudeCircuitBreaker above has a cooldown_seconds parameter (default 60 seconds). After the cooldown expires, _is_open() returns False and the consecutive failure counter resets to zero. The next before_call() succeeds. This means transient outages (rate limit bursts, brief service disruptions) don't permanently kill a run; they pause it for 60 seconds. Because the counter is fully reset, the breaker then tolerates up to failure_threshold further consecutive failures (two by default) before tripping again with a fresh cooldown. If you want a strict half-open probe, where a single failure after cooldown re-trips the breaker, leave the counter at failure_threshold - 1 instead of resetting it to zero. The pattern is intentionally conservative: a legitimate outage should pause the run rather than fill your error budget with retry spend.

How do these guards interact with the Anthropic Batch API?

They don't, and that's intentional. The Batch API (client.messages.batches.create()) is for offline, non-agentic workloads where you submit a collection of independent requests and retrieve results hours later. Agentic tool use loops require synchronous turn-by-turn control: you need to inspect each response before deciding the next action. If you're running batch inference for classification, extraction, or other stateless tasks, you don't need these guards, because batch jobs have a fixed input set and don't accumulate context across turns. The guards in this post are specifically for the interactive, iterative messages.create() loop pattern.

Do I need all four guards, or can I start with just one?

Each guard protects against a distinct failure mode, and the failure modes don't overlap: a spiral can happen without a context overflow, and a retry cascade can amplify spend without any spiral. That said, if you're just starting to instrument a new agent, prioritize based on what your agent actually does. Search/retrieval agents are most exposed to spirals, so start with ToolSpiralGuard. Long-horizon research agents are most exposed to context accumulation, so start with ContextGuard. Any agent with application-level retry logic is exposed to cascade multiplication; ClaudeCircuitBreaker is the highest-ROI guard if you have retry wrappers anywhere in your stack. BudgetGuard is always worth adding because it is simple and catches catastrophic failures that slip past the other three.

Stop debugging runaway Claude API agents after the bill lands

RunGuard wraps your existing messages.create() calls with hosted circuit-breaking logic — no local state, no guard objects to manage. Two HTTP calls per turn. Dashboard shows trips per app and cost-per-run across your fleet. Try the 14-day free trial with no card required.
