Anthropic Claude API Cost Control: Loop Detection and Budget Enforcement in Production
Building agentic loops directly on the Anthropic Messages API — using the anthropic Python package or @anthropic-ai/sdk for TypeScript — gives you maximum control over every aspect of the interaction. You define the tools, structure the messages, control the loop termination, and decide exactly when to stop. That control comes with a cost: there is no framework layer enforcing loop budgets, no built-in similarity check on tool inputs, no max-turns guard, and no mechanism that fires before the billing meter does.
The Messages API tool use flow is straightforward to describe and expensive to get wrong. You send a message with tools defined, the model returns a response with stop_reason: "tool_use" and one or more ToolUseBlock items in content, you execute the tools, append the results as a role: "user" message with type: "tool_result" blocks, and call the API again. That loop continues until the model issues a stop_reason: "end_turn" response. Nothing in the SDK stops the loop from running indefinitely. There is no max_turns parameter on messages.create(), no token budget enforcement, no Slack notification when your per-run spend crosses a threshold.
This post covers four failure modes specific to raw Claude API agents and provides complete Python implementations for each. The guards use only the standard library plus the anthropic package — no additional dependencies. A TypeScript SpiralGuard equivalent using @anthropic-ai/sdk is included in the final implementation section. The last section explains RunGuard's managed API as an alternative to maintaining your own guard infrastructure. If you're not familiar with the general principles behind these patterns, the AI agent cost control pattern reference covers the universal failure modes before diving into SDK-specific implementations.
How the Claude Messages API tool use loop works
The anthropic Python SDK (install via pip install anthropic) exposes a synchronous and async client. A minimal tool use agent loop looks like this:
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env
tools = [
    {
        "name": "web_search",
        "description": "Search the web for up-to-date information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]
messages = [{"role": "user", "content": "What are the latest AI agent frameworks released in 2026?"}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason == "end_turn":
        # Model issued a final text response
        text = next((b.text for b in response.content if hasattr(b, "text")), "")
        print(text)
        break
    if response.stop_reason == "max_tokens":
        # Output was cut off; handle truncation
        break
    if response.stop_reason != "tool_use":
        # Any other stop reason carries no tool calls to execute
        break
    # stop_reason == "tool_use": execute each tool call
    # (execute_tool is your own dispatcher; it is not part of the SDK)
    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    messages.append({"role": "user", "content": tool_results})
The loop above is correct and complete. It is also unbounded. Four structural properties of the Messages API create cost exposure that no amount of correct loop structure can prevent on its own:
- No turn limit: messages.create() has no max_turns parameter. The loop runs until you break it or your application process crashes.
- No input similarity check: the loop executes whatever input dict the model returns. If the model calls web_search with {"query": "AI agent frameworks 2026"} ten times in a row, all ten calls go through.
- Growing messages list: each turn appends the assistant content and tool results. A 100-turn loop against a search tool can accumulate 80,000+ input tokens before the model ever issues end_turn.
- SDK retries are hidden from your code: by default the SDK retries failed requests twice (max_retries=2, three total attempts). An application-level retry wrapping the call creates a silent multiplier: 3 SDK attempts × 3 application attempts = 9 actual API calls per logical failure.
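The first gap in the list above is the cheapest to close. The following is a minimal sketch of a per-run turn cap, assuming a hypothetical TurnLimitGuard class (it is not part of the anthropic SDK) and an arbitrary default of 25 turns. The loop at the bottom is a stand-in for the agent loop, so no API call is made.

```python
class TurnLimitGuard:
    """Hypothetical per-run turn cap; not part of the anthropic SDK."""

    def __init__(self, max_turns: int = 25):
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.max_turns = max_turns
        self.turns = 0

    def check(self) -> None:
        """Call once per loop iteration, before messages.create()."""
        if self.turns >= self.max_turns:
            raise RuntimeError(
                f"Turn limit reached: {self.turns} of {self.max_turns} turns used."
            )
        self.turns += 1


# Stand-in for the agent loop: each iteration represents one API call.
guard = TurnLimitGuard(max_turns=3)
completed = 0
halt_reason = ""
try:
    while True:
        guard.check()
        completed += 1  # a real loop would call client.messages.create() here
except RuntimeError as exc:
    halt_reason = str(exc)
```

A turn cap is blunt: it cannot tell a productive long run from a stuck one. It is best treated as a backstop behind the more specific guards described in the rest of this post.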
Failure mode 1: Tool use spiral
A tool use spiral occurs when the model calls the same tool with semantically identical or very similar arguments across multiple consecutive turns. The most common trigger is a retrieval or search tool: the model searches for something, the result doesn't satisfy the prompt, so it searches again with a nearly identical query. Claude's instruction-following is strong enough that it usually doesn't repeat the exact same string, but "AI agent frameworks 2026", "latest AI agent frameworks", and "new agent frameworks released 2026" are semantically the same search with different surface text.
Exact-match deduplication catches nothing here. The right detection primitive is a Jaccard similarity coefficient on the normalized token set of the tool input arguments, compared across a sliding window of recent calls to the same tool. A threshold of 0.72 catches near-duplicate calls while tolerating legitimate parameter variation — a threshold tuned across the same patterns described in the pattern reference and validated across framework-specific implementations in this series.
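To make the metric concrete, here is a small sketch that scores the example queries from the previous paragraph using plain lowercase, whitespace-split token sets. The helper names token_set and jaccard are illustrative, and the fourth query is an added near-duplicate for comparison.

```python
def token_set(text: str) -> frozenset:
    """Lowercase, whitespace-split token set."""
    return frozenset(text.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection over size of the union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


q1 = token_set("AI agent frameworks 2026")
q2 = token_set("latest AI agent frameworks")
q3 = token_set("new agent frameworks released 2026")
q4 = token_set("AI agent frameworks in 2026")  # added near-duplicate

print(f"{jaccard(q1, q2):.2f}")  # 0.60 (3 shared tokens out of 5)
print(f"{jaccard(q1, q3):.2f}")  # 0.50 (3 out of 6)
print(f"{jaccard(q1, q4):.2f}")  # 0.80 (4 out of 5)
```

With whitespace tokens alone, only the last pair crosses 0.72; the looser rephrasings score 0.60 and 0.50. If those looser rephrasings are the spirals you see in practice, it is worth measuring the threshold against your own tool-call logs, or normalizing further (for example, dropping stop words) before comparing.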
The guard holds no state beyond its per-tool sliding window. Instantiate one guard per agent run, not per application startup; shared instances cross-contaminate spiral detection across independent runs.
import json
from collections import defaultdict, deque
class ToolSpiralGuard:
    """Detects tool use spirals via Jaccard similarity on a sliding window."""
    def __init__(self, window: int = 4, threshold: float = 0.72):
        self.window = window
        self.threshold = threshold
        # One bounded history of fingerprints per tool name
        self._history: dict[str, deque] = defaultdict(lambda: deque(maxlen=window))
    def _fingerprint(self, tool_input: dict) -> frozenset:
        """Normalize tool input to a token set for Jaccard comparison."""
        tokens = set()
        for v in tool_input.values():
            if isinstance(v, str):
                tokens.update(v.lower().split())
            elif isinstance(v, (int, float, bool)):
                tokens.add(str(v))
            elif isinstance(v, (list, dict)):
                tokens.update(json.dumps(v, sort_keys=True).lower().split())
        return frozenset(tokens)
    def _jaccard(self, a: frozenset, b: frozenset) -> float:
        if not a and not b:
            return 1.0  # two empty inputs count as identical
        return len(a & b) / len(a | b)
    def check(self, tool_name: str, tool_input: dict) -> None:
        """Raises RuntimeError if a spiral is detected; otherwise records the call."""
        fp = self._fingerprint(tool_input)
        history = self._history[tool_name]
        for prior_fp in history:
            sim = self._jaccard(fp, prior_fp)
            if sim >= self.threshold:
                raise RuntimeError(
                    f"Tool use spiral detected on '{tool_name}': "
                    f"Jaccard similarity {sim:.2f} >= {self.threshold} "
                    f"with a recent call. Halting to prevent runaway costs."
                )
        history.append(fp)
    def record(self, tool_name: str, tool_input: dict) -> None:
        """Record a tool call without checking.
        check() already records on success, so use this only for calls
        that bypass check(); calling both would record the call twice.
        """
        fp = self._fingerprint(tool_input)
        self._history[tool_name].append(fp)
Usage in the agent loop — call check() before executing each tool call block:
spiral_guard = ToolSpiralGuard()
for block in response.content:
    if block.type == "tool_use":
        # Raises RuntimeError if spiral detected
        spiral_guard.check(block.name, block.input)
        result = execute_tool(block.name, block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": result,
        })
When RuntimeError is raised, catch it outside the loop and return whatever partial result you have, or surface a structured error to the caller. Do not catch and continue: the guard tripped because the next call would waste money on a near-duplicate result.
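One way to return a partial result is to wrap the loop in a small envelope. The sketch below assumes a hypothetical RunResult dataclass and run_guarded helper, neither of which comes from the SDK, and a loop function that appends findings to a list as it goes. The demo loop simulates a guard trip rather than calling the API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Hypothetical result envelope returned to the caller."""
    status: str                   # "completed" or "halted"
    text: str                     # final answer, or whatever was gathered so far
    reason: Optional[str] = None  # guard message when halted


def run_guarded(loop: Callable[[list], str]) -> RunResult:
    """Run `loop`, which appends partial findings to the list it is given."""
    partial: list = []
    try:
        return RunResult("completed", loop(partial))
    except RuntimeError as exc:
        # A guard tripped: stop here; do not retry or continue the loop.
        return RunResult("halted", "\n".join(partial), str(exc))


def demo_loop(partial: list) -> str:
    """Stand-in agent loop that gathers one result, then trips a guard."""
    partial.append("first search result")
    raise RuntimeError("Tool use spiral detected on 'web_search'")


result = run_guarded(demo_loop)
```

Catching bare RuntimeError also swallows unrelated runtime errors. If that matters in your codebase, a dedicated exception subclass raised by the guards would be a tighter contract.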
Failure mode 2: Context window accumulation
Every tool call appends two entries to the messages list: the assistant's ToolUseBlock response and your tool_result user message. In a research agent that calls a search tool 40 times before producing a final answer, the messages list at turn 40 includes 80 additional entries plus all their content. Tool results from web search, document retrieval, or code execution are often long — a single result may be 2,000 tokens.
Claude 4 models support a 200,000-token context window. That sounds large until you run a research agent against a knowledge base for 30 minutes and find the 185th turn failing with a 400 invalid_request_error because the prompt is too long. At that point you have already paid for the input tokens of every preceding turn, with the growing context resent in full on each call.
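Because the full history is resent on every call, billed input grows roughly quadratically with turn count. The sketch below works through the arithmetic under simplifying assumptions: a 1,000-token starting prompt, a constant 2,000 tokens added per turn, and the $3.00/MTok Sonnet input rate quoted later in this post. The function names are illustrative.

```python
def context_at_turn(turn: int, base_tokens: int, tokens_per_turn: int) -> int:
    """Input tokens sent on a given 1-indexed turn."""
    return base_tokens + (turn - 1) * tokens_per_turn


def cumulative_input_tokens(turns: int, base_tokens: int, tokens_per_turn: int) -> int:
    """Total input tokens billed across all turns (sum of an arithmetic series)."""
    return turns * base_tokens + tokens_per_turn * turns * (turns - 1) // 2


final_context = context_at_turn(40, base_tokens=1_000, tokens_per_turn=2_000)
total = cumulative_input_tokens(40, base_tokens=1_000, tokens_per_turn=2_000)
cost_usd = total * 3.00 / 1_000_000  # assumed Sonnet input rate, USD per MTok

print(final_context)        # 79000
print(total)                # 1600000
print(f"${cost_usd:.2f}")   # $4.80
```

Under these assumptions the context on turn 40 is only 79,000 tokens, but the run has been billed for 1.6 million input tokens in total. The final context size understates what the run cost.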
The Anthropic SDK provides a client.messages.count_tokens() method that returns a token count estimate without making a model call. It accepts the prompt-shaping parameters of messages.create() (model, messages, system, tools) but not generation parameters such as max_tokens, and it is fast enough to call on every turn. Use it to check context size before sending each model request:
import warnings
import anthropic
class ContextGuard:
    """Guards against context window accumulation."""
    def __init__(self, client: anthropic.Anthropic, model: str,
                 warn_fraction: float = 0.70, hard_fraction: float = 0.85):
        self.client = client
        self.model = model
        self.warn_fraction = warn_fraction
        self.hard_fraction = hard_fraction
        # Model context limits in tokens
        self._limits = {
            "claude-opus-4-7": 200_000,
            "claude-sonnet-4-6": 200_000,
            "claude-haiku-4-5": 200_000,
        }
    def _limit(self) -> int:
        for prefix, limit in self._limits.items():
            if self.model.startswith(prefix):
                return limit
        return 200_000  # safe default for current Claude 4 generation
    def check(self, messages: list, tools: list | None = None,
              system: str | None = None) -> int:
        """Check context size. Returns token count. Raises on hard limit."""
        # count_tokens() has no max_tokens argument; pass only prompt-shaping fields
        kwargs: dict = {"model": self.model, "messages": messages}
        if tools:
            kwargs["tools"] = tools
        if system:
            kwargs["system"] = system
        response = self.client.messages.count_tokens(**kwargs)
        count = response.input_tokens
        limit = self._limit()
        if count >= limit * self.hard_fraction:
            raise RuntimeError(
                f"Context window hard limit: {count:,} tokens is "
                f"{count/limit:.0%} of {limit:,}-token limit for {self.model}. "
                f"Halting to prevent truncation and wasted spend."
            )
        if count >= limit * self.warn_fraction:
            warnings.warn(
                f"Context window warning: {count:,} tokens ({count/limit:.0%} of limit). "
                f"Consider summarizing earlier turns.",
                stacklevel=2,
            )
        return count
Call context_guard.check() at the top of each loop iteration, before calling messages.create():
context_guard = ContextGuard(client=client, model="claude-sonnet-4-6")
while True:
    # Check context before every API call; count_tokens is fast and free
    context_guard.check(messages, tools=tools, system=system_prompt)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=system_prompt,  # must match what was counted above
        tools=tools,
        messages=messages,
    )
    # ... rest of loop
The count_tokens() call is not billed: it counts tokens server-side without invoking the model, although it is subject to its own rate limits. The overhead per turn is typically under 100ms on a warm connection, which is negligible relative to the model call latency. Do not skip it to save time; the cost of a single context overflow far exceeds thousands of counting calls.
Important: The count_tokens() method requires the same tools and system parameters you pass to messages.create(). Omitting them will undercount tokens, making your limit check inaccurate. The tool definitions themselves consume tokens — complex tool schemas with many properties can add 500–2,000 tokens to every request.
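When the warning threshold fires, one inexpensive mitigation is to replace the bodies of older tool results with a short placeholder while keeping each tool_use_id, so every tool_use block still has its matching tool_result. The sketch below assumes messages are plain dicts shaped like the ones in this post; the function name, placeholder text, and keep_last default are illustrative choices, not SDK features.

```python
def elide_old_tool_results(messages: list, keep_last: int = 5,
                           placeholder: str = "[earlier tool result elided]") -> list:
    """Return a copy of messages with all but the newest tool results replaced."""
    # Locate every tool_result block as (message index, block index).
    positions = [
        (i, j)
        for i, msg in enumerate(messages)
        if msg["role"] == "user" and isinstance(msg["content"], list)
        for j, block in enumerate(msg["content"])
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    # positions[:-0] would be empty, so keep_last == 0 needs its own branch.
    to_elide = set(positions[:-keep_last]) if keep_last > 0 else set(positions)
    trimmed = []
    for i, msg in enumerate(messages):
        if not any(pos[0] == i for pos in to_elide):
            trimmed.append(msg)  # untouched messages are reused as-is
            continue
        new_content = [
            {**block, "content": placeholder} if (i, j) in to_elide else block
            for j, block in enumerate(msg["content"])
        ]
        trimmed.append({**msg, "content": new_content})
    return trimmed


history = [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": []},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "long result 1"}]},
    {"role": "assistant", "content": []},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t2", "content": "long result 2"}]},
]
trimmed = elide_old_tool_results(history, keep_last=1)
```

This discards information the model may still need, so it suits tools whose older results go stale quickly (search, status checks) better than tools whose output is referenced many turns later. Rewriting earlier messages also changes the prompt prefix, which may reduce prompt cache hits if you rely on caching.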
Failure mode 3: Retry cascade multiplication
The Anthropic Python SDK retries failed requests automatically. The default is max_retries=2, meaning the SDK makes up to three attempts (one original plus two retries) on connection errors, 408 (timeout), 409 (conflict), 429 (rate limit), and 5xx responses. This is correct behavior for a single-call client, where you want transient failures to self-heal.
The problem arises at the agent loop level. If your application wraps the entire loop in a retry block (a common pattern for handling downstream failures), the multiplication is invisible:
- SDK attempt 1 fails → SDK retries → SDK attempt 2 fails → SDK retries → SDK attempt 3 fails → SDK raises
- Application retry block catches the exception → starts the loop again from the top
- Result: 3 application attempts × 3 SDK attempts = 9 actual API calls for a single logical failure, each paying full input token cost for the entire accumulated messages list
On a 100-turn research agent with 50,000 accumulated input tokens at the failure point, nine redundant calls to claude-opus-4-7 at $15/MTok input = $6.75 from a single failure event. A production system hitting this failure mode hourly can accumulate hundreds of dollars per day in invisible retry spend.
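The arithmetic behind that figure can be checked in a few lines. The sketch below uses the numbers from the paragraph above; retry_cascade_cost is an illustrative helper name, and it models the worst case described here, in which every attempt is charged for the full context.

```python
def retry_cascade_cost(app_attempts: int, sdk_attempts: int,
                       input_tokens: int, usd_per_mtok: float) -> float:
    """Worst-case input spend if every attempt is charged for the full context."""
    calls = app_attempts * sdk_attempts
    return calls * input_tokens * usd_per_mtok / 1_000_000


cost = retry_cascade_cost(app_attempts=3, sdk_attempts=3,
                          input_tokens=50_000, usd_per_mtok=15.00)
print(f"${cost:.2f}")  # $6.75
```

The multiplier is the product of the two attempt counts, so removing either retry layer cuts the worst case from nine calls to three.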
The fix is a circuit breaker that tracks consecutive failures per run and refuses to make further calls once the threshold is exceeded:
import time
class ClaudeCircuitBreaker:
    """Per-run circuit breaker for the Anthropic SDK."""
    def __init__(self, failure_threshold: int = 2, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._consecutive_failures = 0
        self._tripped_at: float | None = None
    def _is_open(self) -> bool:
        if self._tripped_at is None:
            return False
        if time.monotonic() - self._tripped_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and reset the counter
            self._consecutive_failures = 0
            self._tripped_at = None
            return False
        return True
    def before_call(self) -> None:
        """Call before every messages.create(). Raises if breaker is open."""
        if self._is_open():
            wait = self.cooldown_seconds - (time.monotonic() - self._tripped_at)
            raise RuntimeError(
                f"Circuit breaker open: {self._consecutive_failures} consecutive "
                f"API failures. Cooldown has {wait:.0f}s remaining."
            )
    def on_success(self) -> None:
        """Call after a successful messages.create() response."""
        self._consecutive_failures = 0
        self._tripped_at = None
    def on_failure(self, exc: Exception) -> None:
        """Call when messages.create() raises. Always raises:
        the original exception below the threshold, RuntimeError once tripped.
        """
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._tripped_at = time.monotonic()
            raise RuntimeError(
                f"Circuit breaker tripped after {self._consecutive_failures} "
                f"consecutive failures. Last error: {exc}"
            ) from exc
        raise exc
Usage — wrap every messages.create() call with the circuit breaker:
circuit_breaker = ClaudeCircuitBreaker(failure_threshold=2, cooldown_seconds=60.0)
while True:
    circuit_breaker.before_call()
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        circuit_breaker.on_success()
    except anthropic.APIError as e:
        # Always re-raises: the original error, or RuntimeError at the threshold
        circuit_breaker.on_failure(e)
    # ... process response
Set max_retries=0 on the SDK client when using this circuit breaker if you want full control over retry logic:
client = anthropic.Anthropic(max_retries=0) # circuit breaker handles retries
With max_retries=0, every failed call raises immediately and reaches your on_failure() handler. The circuit breaker then owns the failure count, and any re-attempt happens only in code you wrote, so there is no hidden multiplication.
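With SDK retries disabled, transient failures need an explicit, bounded retry of your own. The sketch below shows one common shape, jittered exponential backoff with a hard attempt cap. The call_with_backoff helper and its parameters are illustrative and not part of the anthropic package; the demo uses a stand-in callable and an injected sleep function so it runs without a network or real delays.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn() up to max_attempts times with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # Delay doubles each attempt, capped at max_delay, with 50-100% jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")


# Demo with a stand-in callable that fails twice, then succeeds.
attempts = {"n": 0}
sleeps: list = []


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


outcome = call_with_backoff(flaky, max_attempts=3, sleep=sleeps.append)
```

In a real loop, fn would wrap the messages.create() call and retry_on would be narrowed to transient error types such as anthropic.APIConnectionError, anthropic.RateLimitError, and anthropic.InternalServerError. The important property is that the attempt count lives in one place, so total calls per logical failure are bounded by max_attempts.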
Failure mode 4: Budget breach
The Anthropic SDK returns token counts in every response via the usage attribute: response.usage.input_tokens and response.usage.output_tokens. These are exact counts, not estimates. A per-run budget guard accumulates spend from these counts and halts the loop before the next call once the budget ceiling has been reached.
Claude 4 model pricing as of mid-2026:
| Model | Input ($/MTok) | Output ($/MTok) | Context |
|---|---|---|---|
| claude-opus-4-7 | $15.00 | $75.00 | 200k |
| claude-sonnet-4-6 | $3.00 | $15.00 | 200k |
| claude-haiku-4-5 | $0.80 | $4.00 | 200k |
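Cache-aware pricing note: the Anthropic API supports prompt caching for long static prefixes (system prompts, tool definitions). Cache reads are billed at approximately 10% of the normal input token cost, and cache writes carry a surcharge. When caching is active, response.usage.input_tokens counts only the uncached portion of the prompt; cached tokens are reported separately in cache_read_input_tokens and cache_creation_input_tokens. The guard below reads only input_tokens, so it is accurate without caching but undercounts once caching is enabled. Add the cache fields to record() before relying on it in a cached deployment.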
class BudgetGuard:
    """Per-run budget enforcement using exact token counts from API responses."""
    # USD per token (not per MTok)
    PRICING = {
        "claude-opus-4-7": {"input": 15.00 / 1_000_000, "output": 75.00 / 1_000_000},
        "claude-sonnet-4-6": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
        "claude-haiku-4-5": {"input": 0.80 / 1_000_000, "output": 4.00 / 1_000_000},
    }
    def __init__(self, model: str, budget_usd: float):
        self.model = model
        self.budget_usd = budget_usd
        self._spent_usd = 0.0
        self._input_tokens = 0
        self._output_tokens = 0
        # Find pricing for this model (prefix match)
        self._rates = None
        for prefix, rates in self.PRICING.items():
            if model.startswith(prefix):
                self._rates = rates
                break
        if self._rates is None:
            # Fall back to Sonnet pricing for unknown Claude 4 models
            self._rates = self.PRICING["claude-sonnet-4-6"]
    def check(self) -> None:
        """Call before messages.create(). Raises if at or over budget."""
        if self._spent_usd >= self.budget_usd:
            raise RuntimeError(
                f"Budget ceiling reached: ${self._spent_usd:.4f} spent "
                f"against ${self.budget_usd:.4f} limit for this run. "
                f"Halting to prevent further charges."
            )
    def record(self, response) -> float:
        """Record token usage from a response. Returns incremental cost."""
        input_tok = response.usage.input_tokens
        output_tok = response.usage.output_tokens
        cost = (input_tok * self._rates["input"] +
                output_tok * self._rates["output"])
        self._spent_usd += cost
        self._input_tokens += input_tok
        self._output_tokens += output_tok
        return cost
    @property
    def spent(self) -> float:
        return self._spent_usd
    @property
    def remaining(self) -> float:
        return max(0.0, self.budget_usd - self._spent_usd)
    def summary(self) -> dict:
        return {
            "spent_usd": round(self._spent_usd, 6),
            "budget_usd": self.budget_usd,
            "input_tokens": self._input_tokens,
            "output_tokens": self._output_tokens,
            "model": self.model,
        }
Call budget_guard.check() before each API call and budget_guard.record(response) immediately after:
budget_guard = BudgetGuard(model="claude-sonnet-4-6", budget_usd=0.50)
while True:
    budget_guard.check()  # halt if budget exceeded
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    cost = budget_guard.record(response)
    # print(f"Turn cost: ${cost:.4f} | Remaining: ${budget_guard.remaining:.4f}")
    if response.stop_reason == "end_turn":
        break
    # ... process tool calls
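Because check() compares only what has already been spent, the final call of a run can overshoot the ceiling by up to one turn's cost. A pre-flight projection narrows that gap. The sketch below assumes you already have an input token count for the pending request (for example from count_tokens()) and treats max_tokens as the worst-case output size; the function names and the example figures are illustrative.

```python
def worst_case_turn_cost(input_tokens: int, max_tokens: int,
                         input_rate: float, output_rate: float) -> float:
    """Upper bound on one call: known input size plus a fully used output allowance."""
    return input_tokens * input_rate + max_tokens * output_rate


def can_afford_next_call(spent_usd: float, budget_usd: float,
                         input_tokens: int, max_tokens: int,
                         input_rate: float, output_rate: float) -> bool:
    """True if the worst-case next call still fits under the budget."""
    projected = spent_usd + worst_case_turn_cost(
        input_tokens, max_tokens, input_rate, output_rate
    )
    return projected <= budget_usd


SONNET_IN = 3.00 / 1_000_000    # USD per input token, from the pricing table
SONNET_OUT = 15.00 / 1_000_000  # USD per output token, from the pricing table

# A 50,000-token request with max_tokens=4096 costs at most about $0.21.
ok_early = can_afford_next_call(0.10, 0.50, 50_000, 4096, SONNET_IN, SONNET_OUT)
ok_late = can_afford_next_call(0.30, 0.50, 50_000, 4096, SONNET_IN, SONNET_OUT)
```

Here ok_early is True and ok_late is False: with $0.30 already spent, a worst-case call would land just over the $0.50 ceiling. The projection is pessimistic because most responses use far fewer than max_tokens output tokens, so it trades a little unused budget for a hard upper bound.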
Integrating all four guards
The four guards are independent — each handles a distinct failure mode and can be used without the others. In production, all four should run together. The application order matters: check budget and circuit breaker before making the API call, check spiral before executing tool calls, check context before constructing the next API call.
def run_claude_agent(
    messages: list,
    tools: list,
    model: str = "claude-sonnet-4-6",
    system: str | None = None,
    budget_usd: float = 0.50,
) -> str:
    """Run a tool-use agent loop with all four cost guards active."""
    spiral_guard = ToolSpiralGuard(window=4, threshold=0.72)
    context_guard = ContextGuard(client=client, model=model)
    circuit_breaker = ClaudeCircuitBreaker(failure_threshold=2, cooldown_seconds=60.0)
    budget_guard = BudgetGuard(model=model, budget_usd=budget_usd)
    # create_kwargs holds a reference to messages, so later appends are included
    create_kwargs: dict = {
        "model": model,
        "max_tokens": 4096,
        "tools": tools,
        "messages": messages,
    }
    if system:
        create_kwargs["system"] = system
    while True:
        # 1. Budget check: before any API call
        budget_guard.check()
        # 2. Context check: before constructing the model request
        context_guard.check(
            messages, tools=tools,
            system=system,
        )
        # 3. Circuit breaker: before the SDK call
        circuit_breaker.before_call()
        try:
            response = client.messages.create(**create_kwargs)
            circuit_breaker.on_success()
        except anthropic.APIError as e:
            circuit_breaker.on_failure(e)  # always raises
        # 4. Record cost
        budget_guard.record(response)
        # Terminal conditions
        if response.stop_reason in ("end_turn", "stop_sequence"):
            text_blocks = [b.text for b in response.content if hasattr(b, "text")]
            return " ".join(text_blocks)
        if response.stop_reason == "max_tokens":
            raise RuntimeError("Response truncated by max_tokens limit.")
        if response.stop_reason != "tool_use":
            # Avoid sending an empty tool_result message on an unexpected stop reason
            raise RuntimeError(f"Unhandled stop_reason: {response.stop_reason}")
        # Process tool calls
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                # 5. Spiral check: before executing each tool call
                spiral_guard.check(block.name, block.input)
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result),
                })
        messages.append({"role": "user", "content": tool_results})
The guard initialization is at the top of the function — one instance per call to run_claude_agent(). Shared guard instances across calls would cause false positives: a spiral detected in a previous run's tool history would trip the guard on the first tool call of a new run. Per-run instantiation is the invariant all four guards rely on.
TypeScript implementation with @anthropic-ai/sdk
The @anthropic-ai/sdk package (install via npm install @anthropic-ai/sdk) mirrors the Python API closely. A TypeScript SpiralGuard for the tool use spiral failure mode:
import Anthropic from "@anthropic-ai/sdk";
class SpiralGuard {
  private window: number;
  private threshold: number;
  private history: Map<string, Array<Set<string>>> = new Map();
  constructor(window = 4, threshold = 0.72) {
    this.window = window;
    this.threshold = threshold;
  }
  private fingerprint(input: Record<string, unknown>): Set<string> {
    const tokens = new Set<string>();
    // filter(Boolean) drops the empty strings split() yields on leading/trailing whitespace
    const addTokens = (s: string) =>
      s.toLowerCase().split(/\s+/).filter(Boolean).forEach((t) => tokens.add(t));
    for (const v of Object.values(input)) {
      if (typeof v === "string") {
        addTokens(v);
      } else if (typeof v === "number" || typeof v === "boolean") {
        tokens.add(String(v));
      } else if (v !== null && typeof v === "object") {
        // Note: JSON.stringify does not sort keys or insert spaces, so nested
        // values tokenize more coarsely than in the Python version
        addTokens(JSON.stringify(v));
      }
    }
    return tokens;
  }
  private jaccard(a: Set<string>, b: Set<string>): number {
    if (a.size === 0 && b.size === 0) return 1.0;
    const intersection = new Set([...a].filter((x) => b.has(x)));
    const union = new Set([...a, ...b]);
    return union.size === 0 ? 0 : intersection.size / union.size;
  }
  check(toolName: string, toolInput: Record<string, unknown>): void {
    const fp = this.fingerprint(toolInput);
    const history = this.history.get(toolName) ?? [];
    for (const prior of history) {
      const sim = this.jaccard(fp, prior);
      if (sim >= this.threshold) {
        throw new Error(
          `Tool use spiral on '${toolName}': Jaccard similarity ` +
          `${sim.toFixed(2)} >= ${this.threshold} with recent call.`
        );
      }
    }
    // Keep only the most recent `window` fingerprints for this tool
    const updated = [...history, fp].slice(-this.window);
    this.history.set(toolName, updated);
  }
}
// Usage in agent loop (response, toolResults, and executeTool come from your own loop)
const client = new Anthropic();
const spiralGuard = new SpiralGuard();
for (const block of response.content) {
  if (block.type === "tool_use") {
    spiralGuard.check(block.name, block.input as Record<string, unknown>);
    const result = await executeTool(block.name, block.input);
    toolResults.push({
      type: "tool_result" as const,
      tool_use_id: block.id,
      content: String(result),
    });
  }
}
For the BudgetGuard in TypeScript, use response.usage.input_tokens and response.usage.output_tokens from the Message response type, the same fields as in the Python SDK. Token counting is available as client.messages.countTokens(), which takes the same model, messages, system, and tools parameters you pass to client.messages.create().
Prompt caching and extended thinking
Two Anthropic-specific features interact with cost guards in ways worth understanding before deploying to production:
Prompt caching: you can mark static content (system prompts, long tool definitions, reference documents) with {"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}} in the system or messages parameters. Cached tokens cost approximately 10% of the standard input token price when read on subsequent calls, and writing to the cache carries a surcharge. When caching is enabled, response.usage.input_tokens covers only the uncached part of the prompt; cached tokens are reported separately as response.usage.cache_read_input_tokens and response.usage.cache_creation_input_tokens. The BudgetGuard above reads only input_tokens, so it undercounts spend once caching is active. To track accurately, add both cache fields to the spend calculation at their respective rates. For a conservative ceiling, bill all three input fields at the cache-write rate, the highest of the three.
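A cache-aware cost calculation prices each input bucket separately. The sketch below is one way to do that, assuming multipliers of 1.25× the base input rate for cache writes and 0.10× for cache reads; treat both as assumptions to verify against current Anthropic pricing. The cache_aware_cost name is illustrative, and a SimpleNamespace stands in for response.usage so the example runs without an API call.

```python
from types import SimpleNamespace

# Assumed multipliers relative to the base input rate; verify against current pricing.
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: short-lived (5-minute) cache writes
CACHE_READ_MULTIPLIER = 0.10   # assumption: cache reads


def cache_aware_cost(usage, input_rate: float, output_rate: float) -> float:
    """USD cost of one response, pricing each input bucket separately."""
    uncached = usage.input_tokens
    # These fields may be absent or None when caching is not in use.
    written = getattr(usage, "cache_creation_input_tokens", None) or 0
    read = getattr(usage, "cache_read_input_tokens", None) or 0
    input_cost = input_rate * (
        uncached
        + written * CACHE_WRITE_MULTIPLIER
        + read * CACHE_READ_MULTIPLIER
    )
    return input_cost + usage.output_tokens * output_rate


# Stand-in for response.usage on a turn that reads a 10,000-token cached prefix.
usage = SimpleNamespace(input_tokens=2_000, output_tokens=500,
                        cache_creation_input_tokens=0,
                        cache_read_input_tokens=10_000)
cost = cache_aware_cost(usage, input_rate=3.00 / 1_000_000,
                        output_rate=15.00 / 1_000_000)
```

For this turn the 10,000 cached tokens are billed like 1,000 ordinary input tokens, giving a total of $0.0165. A guard that looked at input_tokens alone would record $0.0135 and miss the cached portion entirely.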
Extended thinking: enabling extended thinking (thinking: {"type": "enabled", "budget_tokens": N}) adds thinking tokens to every response. These are billed as output tokens and are included in response.usage.output_tokens, so the BudgetGuard already counts them at the output rate. Extended thinking can increase per-turn cost significantly for complex tasks, so revisit your budget_usd ceiling when enabling it: either raise it to accommodate the extra output, or keep it fixed and accept that runs will halt after fewer turns.
RunGuard managed API
The four guards above require you to maintain spiral detection state, context counting logic, circuit breaker state machines, and budget accounting across your application codebase. As your agent fleet grows — multiple models, multiple tool sets, multiple application teams — that maintenance burden compounds. RunGuard's managed API exposes the same circuit-breaking logic as a hosted service with two HTTP calls per turn:
import httpx
RUNGUARD_API_KEY = "rg_..."
RUNGUARD_BASE = "https://api.runguard.dev/v1"
def runguard_check(run_id: str, tool_name: str, tool_input: dict,
                   context_tokens: int, spent_usd: float) -> dict:
    """Check guards before a tool execution. Returns {allowed: bool, reason: str}"""
    resp = httpx.post(
        f"{RUNGUARD_BASE}/check",
        json={
            "run_id": run_id,
            "tool_name": tool_name,
            "tool_input": tool_input,
            "context_tokens": context_tokens,
            "spent_usd": spent_usd,
        },
        headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
        timeout=2.0,
    )
    resp.raise_for_status()
    return resp.json()
def runguard_record(run_id: str, response) -> None:
    """Record token usage after a successful API call."""
    httpx.post(
        f"{RUNGUARD_BASE}/record",
        json={
            "run_id": run_id,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        },
        headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
        timeout=2.0,
    )
RunGuard stores run state server-side — no local guard objects to instantiate per run, no state to serialize for multi-process deployments. The dashboard shows trips per app, cost per run, and spiral frequency across your entire fleet. The Solo plan at $19/month covers one app with 1M guarded invocations per month — less than the cost of a single runaway research agent hitting claude-opus-4-7 for 200 turns.
Frequently asked questions
Does count_tokens() actually not cost anything?
Correct: the Anthropic count_tokens() endpoint returns a token count estimate without invoking the model. It is not billed. The call still traverses the network (expect 50–150ms latency on a warm connection), so it adds per-turn overhead, but that overhead is negligible relative to model call latency. The endpoint is available on all Claude 4 models and accepts the prompt-shaping parameters of messages.create(), including tools and system.
Why 0.72 as the Jaccard similarity threshold?
0.72 is the threshold that catches near-duplicate tool inputs (same query with trivial surface rephrasing) while allowing legitimate tool call variation (same tool, different parameters for different subtasks). The calibration is described in detail in the pattern reference: below 0.65 you get false positives that trip the guard on legitimate multi-step tool use; above 0.80 you miss the most common spiral pattern (model generates slightly varied query text for semantically identical searches). 0.72 is the empirical midpoint validated across multiple frameworks in this series. If your tools have very uniform input schemas (e.g., a single query string field), consider raising to 0.78; if they have complex structured inputs, lower to 0.68.
What happens when the circuit breaker trips — does the run die permanently?
No. The ClaudeCircuitBreaker above has a cooldown_seconds parameter (default 60 seconds). After the cooldown expires, _is_open() returns False and the consecutive failure counter resets to zero, so the next before_call() succeeds. Because the counter resets fully, it then takes failure_threshold new consecutive failures (two by default) to trip the breaker again with a fresh cooldown; it does not limit itself to a single probe attempt. Transient outages (rate limit bursts, brief service disruptions) therefore don't permanently kill a run, provided the caller keeps the same breaker instance and tries again after the cooldown. The pattern is intentionally conservative: a legitimate outage should pause the run rather than fill your error budget with retry spend.
How do these guards interact with the Anthropic Batch API?
They don't — and that's intentional. The Batch API (client.beta.messages.batches.create()) is for offline, non-agentic workloads where you submit a collection of independent requests and retrieve results hours later. Agentic tool use loops require synchronous turn-by-turn control: you need to inspect each response before deciding the next action. If you're running batch inference for classification, extraction, or other stateless tasks, you don't need these guards — batch jobs have a fixed input set and don't accumulate context across turns. The guards in this post are specifically for the interactive, iterative messages.create() loop pattern.
Do I need all four guards, or can I start with just one?
Each guard protects against a distinct failure mode, and the failure modes don't overlap: a spiral can happen without a context overflow, and a retry cascade can amplify spend without any spiral. That said, if you're just starting to instrument a new agent, prioritize based on what your agent actually does. Search/retrieval agents are most exposed to spirals, so start with ToolSpiralGuard. Long-horizon research agents are most exposed to context accumulation, so start with ContextGuard. Any agent with application-level retry logic is exposed to cascade multiplication, which makes ClaudeCircuitBreaker the highest-ROI guard if you have retry wrappers anywhere in your stack. BudgetGuard is always worth adding because it is simple to wire in and catches catastrophic failures that slip past the other three.