Amazon Bedrock Converse API Cost Control: Loop Detection and Budget Enforcement in Production

The Amazon Bedrock Converse API, introduced in May 2024, solves a genuine interoperability problem: instead of writing model-specific request/response code for Claude, Nova, Llama, Titan, and Mistral separately, you call one client.converse() method and swap the modelId string to change providers. Tool use, multi-turn conversation history, streaming, and inference parameters are normalized behind a single SDK contract.

That portability is the source of the API's cost control problem. The Bedrock Agents APIs — invoke_agent for pre-configured agents and invoke_inline_agent for dynamic configurations — manage conversation state server-side. Bedrock deletes session data after idle timeout; the orchestration layer enforces a step limit; quota governors cap total steps per invocation. None of those safeguards exist in the Converse API. You own the full messages list. You control the loop. You decide when to stop.

Teams building tool-use agents on client.converse() directly — common when the goal is model portability or when Bedrock Agents' opinionated orchestration doesn't fit the use case — inherit all the loop risk that Bedrock Agents' managed runtime normally absorbs. This post covers the four failure modes specific to Converse API agent patterns and provides a complete Python ConverseBreaker implementation you can drop in without restructuring your agent loop.

Bedrock API trilogy. This post covers the Converse API (converse / converse_stream). For pre-configured Bedrock Agents using a fixed agentId + agentAliasId, see AWS Bedrock Agents Cost Control. For dynamically configured agents using invoke_inline_agent, see Bedrock Inline Agents Cost Control. The failure modes and circuit breaker code are distinct for each API.

The Converse API surface

A minimal tool-use loop using client.converse() looks like this:

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

messages = [
    {"role": "user", "content": [{"text": "Find the top-3 trending Python packages today."}]}
]

tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "web_search",
                "description": "Search the web for current information.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string", "description": "Search query"}
                        },
                        "required": ["query"]
                    }
                }
            }
        }
    ]
}

while True:
    response = client.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        messages=messages,
        toolConfig=tool_config,
        inferenceConfig={"maxTokens": 4096, "temperature": 0.1}
    )

    assistant_message = response["output"]["message"]
    messages.append(assistant_message)

    stop_reason = response["stopReason"]
    usage = response["usage"]  # inputTokens, outputTokens, totalTokens

    if stop_reason == "end_turn":
        break

    if stop_reason == "tool_use":
        # Extract all tool use blocks from the assistant message
        tool_results = []
        for block in assistant_message["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                result = execute_tool(tool_use["name"], tool_use["input"])
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"text": str(result)}]
                    }
                })

        # Tool results go back as a user-role message
        messages.append({"role": "user", "content": tool_results})
        continue

    break  # max_tokens, content_filtered, stop_sequence

Three structural facts about this loop determine where cost failures occur:

  1. You own the messages list. Every turn appends two messages: the assistant response and your tool results. That list is sent in full on every subsequent converse() call. Input token count grows with every round trip.
  2. There is no step limit. The while True loop above runs until stopReason == "end_turn", which the model controls. A model in a broken reasoning state can return "tool_use" indefinitely.
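  3. Cost depends on which modelId you pass. At the on-demand rates in the pricing reference below, Claude 3.5 Sonnet ($3.00 per 1M input tokens) costs roughly 85× more per input token than Nova Micro ($0.035). A loop that's cheap on Nova Micro may be catastrophic if you swap in Sonnet for quality.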
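Since nothing in the loop caps the number of round trips, the simplest safeguard is a hard step limit. The following is a minimal sketch and is separate from the guards described later in this post: run_with_step_cap, call_model, run_tools, and StepLimitExceeded are hypothetical names, and the model call is injected as a plain callable so the sketch runs without AWS credentials.

```python
from typing import Callable


class StepLimitExceeded(RuntimeError):
    """Raised when the loop hits max_steps without the model finishing."""


def run_with_step_cap(
    call_model: Callable[[list[dict]], dict],
    run_tools: Callable[[dict], list[dict]],
    messages: list[dict],
    max_steps: int = 12,
) -> dict:
    """Run a Converse-style tool loop for at most max_steps model calls.

    call_model(messages) must return a dict shaped like a converse() response;
    run_tools(assistant_message) must return a list of toolResult blocks.
    Both are hypothetical stand-ins for your own code.
    """
    for _ in range(max_steps):
        response = call_model(messages)
        assistant_message = response["output"]["message"]
        messages.append(assistant_message)
        if response["stopReason"] != "tool_use":
            return response  # end_turn, max_tokens, stop_sequence, ...
        # Tool results go back as a user-role message, as in the loop above
        messages.append({"role": "user", "content": run_tools(assistant_message)})
    raise StepLimitExceeded(f"no final answer after {max_steps} model calls")
```

In a real agent, call_model would wrap client.converse() with your modelId and toolConfig. The default of 12 steps is an arbitrary placeholder; a reasonable cap depends on how many tool calls your longest legitimate task needs.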

Failure mode 1: tool call spiral

The Converse API's normalized tool-use contract is identical to the raw Anthropic Messages API in terms of loop mechanics: the model returns stopReason: "tool_use", you execute the tool, you append the result as a user message, and you call converse() again. There is no forced termination. The model decides when it has enough information to return "end_turn".

Tool call spirals happen when the model enters a planning state where each tool result leads to another nearly-identical tool call. The most common trigger: tool output that the model misinterprets as incomplete, causing it to re-issue the same search query with minor variation indefinitely. At Claude 3.5 Sonnet pricing ($3/$15 per 1M input/output tokens), a 40-step spiral that sends an average of 10k tokens of context per call consumes 400k input tokens, roughly $1.20 in input cost alone, before your first "end_turn".

Detection uses Jaccard similarity on normalized token sets built from each tool input, compared against a sliding window of recent calls to the same tool. A similarity of 0.85 or above against any call in that window is a strong spiral signal:

import json
import re
from collections import defaultdict
from typing import Any

class ToolSpiralGuard:
    """Detects tool call spirals by comparing normalized input similarity."""

    def __init__(self, similarity_threshold: float = 0.85, window_size: int = 4):
        self.threshold = similarity_threshold
        self.window = window_size
        # per-tool-name history of normalized token sets
        self._history: dict[str, list[frozenset]] = defaultdict(list)

    def _normalize(self, value: Any) -> frozenset:
        """Flatten nested structure to a set of normalized string tokens."""
        text = json.dumps(value, sort_keys=True, ensure_ascii=False).lower()
        # Remove punctuation, split on whitespace
        tokens = re.findall(r"[a-z0-9_]+", text)
        return frozenset(tokens)

    def _jaccard(self, a: frozenset, b: frozenset) -> float:
        if not a and not b:
            return 1.0
        intersection = len(a & b)
        union = len(a | b)
        return intersection / union if union else 0.0

    def check(self, tool_name: str, tool_input: Any) -> bool:
        """Returns True if this call looks like a spiral. Call BEFORE executing the tool."""
        tokens = self._normalize(tool_input)
        history = self._history[tool_name]

        for prev_tokens in history[-self.window:]:
            if self._jaccard(tokens, prev_tokens) >= self.threshold:
                return True  # spiral detected

        history.append(tokens)
        if len(history) > self.window * 2:
            del history[:self.window]
        return False

Integrate into the loop before each tool execution:

spiral_guard = ToolSpiralGuard(similarity_threshold=0.85, window_size=4)

# Inside the tool_use handling block:
for block in assistant_message["content"]:
    if "toolUse" in block:
        tool_use = block["toolUse"]
        if spiral_guard.check(tool_use["name"], tool_use["input"]):
            raise RuntimeError(
                f"Tool call spiral detected on '{tool_use['name']}' — "
                "aborting before executing duplicate call"
            )
        result = execute_tool(tool_use["name"], tool_use["input"])

The threshold of 0.85 is intentionally stricter than the 0.72 used for raw Anthropic API loops in our Anthropic SDK cost control guide: two inputs must overlap more before they count as a repeat. Converse API agents often legitimately refine search queries slightly across a multi-step research task, so a lower threshold produces false positives. The 0.85 threshold targets near-verbatim repetition, which is the actual spiral case.
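To get a feel for where a threshold sits, it helps to score a few of your own tool inputs. The sketch below re-implements the same token-set Jaccard comparison as standalone functions (token_set and jaccard are illustrative names, and the sample queries are made up) so it can be run on its own.

```python
import json
import re


def token_set(tool_input) -> frozenset:
    """Lowercase the JSON form of a tool input and split it into word tokens."""
    text = json.dumps(tool_input, sort_keys=True).lower()
    return frozenset(re.findall(r"[a-z0-9_]+", text))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


first = {"query": "top trending python packages today"}
repeat = {"query": "Top trending Python packages, today"}           # same words, new punctuation
refined = {"query": "top trending python packages pypi downloads"}  # genuinely new terms

print(jaccard(token_set(first), token_set(repeat)))   # 1.0
print(jaccard(token_set(first), token_set(refined)))  # 0.625
```

Note that the JSON key ("query") is itself a token, and that short inputs move the score in large steps: with only six or seven tokens, changing two words already drops the similarity well below either threshold. Inputs with many fixed fields behave the opposite way, so it is worth checking scores on real traffic before settling on a value.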

Failure mode 2: conversation history explosion

The Bedrock Agents APIs manage conversation state server-side using a sessionId. You send the new user turn; Bedrock loads the prior history internally. The Converse API does not work this way. You send the entire messages list on every call. If your agent maintains a persistent conversation across many turns (a research assistant with hours-long sessions, an autonomous coding agent that accumulates context across subtasks), the input size of each call grows linearly with the number of turns, and the cumulative input tokens billed over the session grow quadratically.

The growth pattern: Turn 1 sends N₁ tokens of history. Turn 2 sends N₁ + N₂ tokens. Turn K sends ∑Nᵢ tokens. If each turn adds 2,000 tokens on average (assistant response plus tool results), the final call of a 30-turn session sends about 30 × 2,000 = 60,000 input tokens, and the session as a whole sends 2,000 × (1 + 2 + … + 30) = 2,000 × 465 = 930,000 input tokens. On Claude 3.5 Sonnet that is about $2.79 of input cost for the session. On Mistral Large it's $3.72.
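The growth pattern is a triangular-number sum, which is easy to put into a helper for sizing your own sessions. This sketch assumes every turn adds the same number of tokens, which real sessions only approximate; the function names are illustrative.

```python
def input_tokens_on_turn(turn: int, tokens_per_turn: int) -> int:
    """Input tokens sent on a given 1-indexed turn when each turn adds a fixed amount."""
    return turn * tokens_per_turn


def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over the whole session: tokens_per_turn * (1 + 2 + ... + turns)."""
    return tokens_per_turn * turns * (turns + 1) // 2


def input_cost_usd(tokens: int, price_per_1m_input: float) -> float:
    """Input-side cost only; output tokens are billed separately."""
    return tokens / 1_000_000 * price_per_1m_input


# Example: 10 turns that each add 1,000 tokens of history
print(input_tokens_on_turn(10, 1_000))     # 10000
print(cumulative_input_tokens(10, 1_000))  # 55000
```

Doubling the number of turns roughly quadruples the cumulative figure, which is why long-lived sessions dominate the bill even when each individual call looks modest.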

The Converse API does not expose a standalone token-counting endpoint. You have to estimate. A reliable estimate for the Converse API is 4 characters per token, applied to the JSON-serialized messages array:

class ConversationGuard:
    """Estimates token count of the full messages list before each call."""

    CHARS_PER_TOKEN = 4.0

    def __init__(self, warn_threshold: int = 80_000, hard_threshold: int = 120_000):
        # Thresholds in estimated tokens
        self.warn_threshold = warn_threshold
        self.hard_threshold = hard_threshold

    def estimated_tokens(self, messages: list[dict]) -> int:
        raw = json.dumps(messages, ensure_ascii=False)
        return int(len(raw) / self.CHARS_PER_TOKEN)

    def check(self, messages: list[dict]) -> tuple[str, int]:
        """Returns ('ok'|'warn'|'hard_stop', estimated_token_count)."""
        est = self.estimated_tokens(messages)
        if est >= self.hard_threshold:
            return "hard_stop", est
        if est >= self.warn_threshold:
            return "warn", est
        return "ok", est

    @staticmethod
    def trim_oldest_tool_turns(messages: list[dict], keep_last_n: int = 10) -> list[dict]:
        """
        Emergency trim: keep the original user query (first user message)
        and roughly the last N turn-pairs. System prompts are not part of
        messages; the Converse API takes them in the separate system parameter.
        Preserves conversation coherence better than sliding window.
        """
        if len(messages) <= keep_last_n * 2 + 1:
            return messages

        # Always keep first user message (the original task)
        first_user = next((m for m in messages if m["role"] == "user"), None)
        if first_user is None:
            return messages

        recent = messages[-(keep_last_n * 2):]  # last N full turn-pairs
        # Start the kept tail on an assistant message so roles still alternate
        # after first_user and no toolResult is left without its toolUse.
        while recent and recent[0]["role"] != "assistant":
            recent = recent[1:]
        return [first_user] + recent

The warn/hard thresholds of 80k/120k estimated tokens are conservative defaults for models with context windows of 200k tokens or more (Claude 3.5 Sonnet at 200k, Nova Pro at 300k). For models with smaller context windows, reduce both thresholds proportionally: Mistral Large 2 and Nova Micro have a 128k context limit, and the older Mistral Large (24.02) has 32k. The estimation is deliberately pessimistic for English text: JSON serialization adds quotes and escape overhead, so actual token counts tend to be slightly lower than the estimate.
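One way to apply the proportional reduction is to express the thresholds as fractions of the context window. In this sketch, scaled_thresholds is a hypothetical helper, and the default fractions of 0.40 and 0.60 are simply what 80k/120k work out to against a 200k window.

```python
def scaled_thresholds(
    context_window: int,
    warn_fraction: float = 0.40,
    hard_fraction: float = 0.60,
) -> tuple[int, int]:
    """Return (warn, hard) thresholds in estimated tokens for a given context window.

    The default fractions reproduce 80k/120k for a 200k-token window.
    """
    if not 0 < warn_fraction < hard_fraction < 1:
        raise ValueError("need 0 < warn_fraction < hard_fraction < 1")
    return round(context_window * warn_fraction), round(context_window * hard_fraction)


warn, hard = scaled_thresholds(128_000)
print(warn, hard)  # 51200 76800
```

The resulting pair can be passed straight to ConversationGuard(warn_threshold=warn, hard_threshold=hard). Remember that the window must also hold the system prompt, the tool definitions, and the model's output, none of which the messages-based estimate counts.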

Failure mode 3: cross-model retry amplification

The Converse API is model-agnostic, which is also why it's frequently used for cost optimization experiments: run the same workload against Nova Micro, Nova Pro, and Claude 3.5 Sonnet and compare quality vs. cost. The retry behavior across models is not uniform, and the Boto3 default retry configuration compounds the problem.

Boto3's default retry mode is legacy, which makes up to 5 total attempts per request; the newer standard mode makes up to 3. Taking standard mode as the example, when a Bedrock endpoint returns a ThrottlingException or ServiceUnavailableException, Boto3 retries up to 2 additional times with exponential backoff. This is appropriate for idempotent reads. For a Converse API call that already completed a 15-step tool-use loop before hitting throttling on step 16, each Boto3 retry sends the full 15-turn messages list again, paying for all 15 turns' worth of input tokens a second time before the retry might also fail.

Two patterns amplify this further:

  • Application-level retry wrapper. If your code wraps the entire agent loop in a try/except that retries on any exception, a throttle on turn 16 reruns turns 1–16 from scratch. With three application-level attempts on top of three Boto3 attempts per call, a single persistent throttle can cause up to 3 × 3 = 9 sends of the throttled request and up to 3 full executions of the loop.
  • Fallback model switching without loop reset. A pattern common in cost optimization pipelines: try Claude 3.5 Sonnet, fall back to Nova Pro on 429, fall back to Nova Micro on second 429. If the fallback logic doesn't reset the messages list to the turn where the throttle happened, each fallback re-sends the full accumulated history at the new model's pricing.

The circuit breaker pattern addresses this. Track consecutive failures per model and open the circuit after a threshold, refusing new calls during a cooldown period:

import time

class ConverseCircuitBreaker:
    """
    Opens after threshold consecutive failures (any exception type),
    rejects calls during cooldown_seconds, then closes and tries again.
    """

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 60.0):
        self.threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if time.monotonic() - self._opened_at >= self.cooldown:
            # Half-open: allow one trial; a single failure re-opens the circuit
            self._opened_at = None
            self._failures = self.threshold - 1
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = time.monotonic()

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            raise RuntimeError(
                f"ConverseCircuitBreaker open — "
                f"{self.cooldown}s cooldown after {self.threshold} failures. "
                "Check Bedrock service health and quota limits."
            )
        try:
            result = fn(*args, **kwargs)
            self.record_success()
            return result
        except Exception:
            self.record_failure()
            raise

Configure Boto3's own retry behavior explicitly to avoid compounding with the circuit breaker:

from botocore.config import Config

# total_max_attempts=1 means one attempt and no Boto3-level retries. In a Config
# retries dict, max_attempts counts retries AFTER the first request, so
# max_attempts=1 would still allow one retry.
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(retries={"total_max_attempts": 1, "mode": "standard"})
)

breaker = ConverseCircuitBreaker(failure_threshold=3, cooldown_seconds=60.0)

# Wrap the converse call:
response = breaker.call(
    client.converse,
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig={"maxTokens": 4096}
)

Setting total_max_attempts=1 leaves retry decisions to your own code and the circuit breaker, prevents Boto3's exponential backoff from masking transient errors as long pauses, and eliminates the cost multiplication from Boto3 retrying a fully-accumulated conversation history. Passing max_attempts=0 has the same effect, because that key counts retries after the initial request rather than total attempts.

Failure mode 4: streaming accumulation and mid-stream abort

client.converse_stream() returns an event stream instead of a single response object. The stream delivers content incrementally via contentBlockDelta events, followed by a metadata event at the very end that contains the usage object (inputTokens, outputTokens, totalTokens). This ordering creates a subtle cost accounting trap.

Consider an agent loop that reads the stream and checks output length to decide whether to abort early. If you break out of the event stream before consuming the metadata event, two things happen:

  1. You have no usage data, so your per-run budget tracker never records this call's cost.
  2. You are still billed for the output tokens generated up to the point you aborted, and possibly for more: closing the stream on the client does not guarantee that generation stops on the server at that instant.

The correct pattern: always drain the full stream through to metadata before acting on any termination condition. Capture delta events in a buffer and the usage event separately:

def converse_stream_with_usage(client, **kwargs) -> tuple[list[dict], dict]:
    """
    Consume a converse_stream response completely.
    Returns (content_blocks, usage_dict).
    Always reads through to the metadata event — never aborts mid-stream.
    """
    response = client.converse_stream(**kwargs)
    stream = response["stream"]

    content_blocks: list[dict] = []
    current_block: dict | None = None
    usage: dict = {}

    for event in stream:
        if "contentBlockStart" in event:
            start = event["contentBlockStart"]["start"]
            current_block = start.copy()

        elif "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if current_block is None:
                current_block = {}
            if "text" in delta:
                current_block.setdefault("text", "")
                current_block["text"] += delta["text"]
            elif "toolUse" in delta:
                current_block.setdefault("toolUse", {})
                current_block["toolUse"].setdefault("input", "")
                current_block["toolUse"]["input"] += delta["toolUse"].get("input", "")

        elif "contentBlockStop" in event:
            if current_block is not None:
                # Parse accumulated tool input JSON; empty input means a call with no arguments
                if "toolUse" in current_block:
                    raw_input = current_block["toolUse"].get("input", "")
                    if isinstance(raw_input, str):
                        try:
                            current_block["toolUse"]["input"] = json.loads(raw_input) if raw_input else {}
                        except json.JSONDecodeError:
                            pass  # leave the raw string in place for the caller to inspect
                content_blocks.append(current_block)
                current_block = None

        elif "messageStop" in event:
            # stopReason is here, not in a separate field
            pass

        elif "metadata" in event:
            # ALWAYS arrives last — contains usage
            usage = event["metadata"].get("usage", {})

    return content_blocks, usage

For cases where you genuinely need to respond to content before the stream ends — for example, streaming output to a user's terminal — track usage in a pre-call estimate rather than relying on post-stream accounting:

class StreamingBudgetGuard:
    """
    Pre-estimates cost before a converse_stream call using input token estimate.
    Output token budget allocated at call time; adjusted after stream completes.
    """

    def __init__(self, max_usd_per_run: float, price_per_1m_input: float, price_per_1m_output: float):
        self.max_usd = max_usd_per_run
        self.input_price = price_per_1m_input  # USD per 1M tokens
        self.output_price = price_per_1m_output
        self._spent = 0.0

    def check_input(self, messages: list[dict], max_output_tokens: int) -> None:
        """Raise if estimated cost would exceed budget."""
        raw = json.dumps(messages, ensure_ascii=False)
        estimated_input = int(len(raw) / 4.0)
        estimated_cost = (
            estimated_input / 1_000_000 * self.input_price
            + max_output_tokens / 1_000_000 * self.output_price
        )
        if self._spent + estimated_cost > self.max_usd:
            raise RuntimeError(
                f"Budget would exceed ${self.max_usd:.2f} — "
                f"estimated call cost ${estimated_cost:.4f}, "
                f"spent so far ${self._spent:.4f}"
            )

    def record_usage(self, usage: dict, model_id: str) -> None:
        """Record actual usage from metadata event."""
        # model_id is accepted for logging or future per-model pricing but is not
        # used here: cost is computed from the prices passed to __init__
        input_tokens = usage.get("inputTokens", 0)
        output_tokens = usage.get("outputTokens", 0)
        cost = (
            input_tokens / 1_000_000 * self.input_price
            + output_tokens / 1_000_000 * self.output_price
        )
        self._spent += cost

    @property
    def spent(self) -> float:
        return self._spent
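For the terminal-streaming case, the two goals can be combined: forward each text delta as it arrives, but keep reading until the metadata event so the usage is never lost. The sketch below is one way to do that. forward_text_and_capture_usage is a hypothetical helper, and the event list is simulated so the example runs without calling Bedrock.

```python
from typing import Callable, Iterable


def forward_text_and_capture_usage(
    events: Iterable[dict],
    on_text: Callable[[str], None],
) -> tuple[str, dict]:
    """Pass text deltas to on_text as they arrive, then return (stop_reason, usage).

    events is the iterable found under response["stream"] in a
    converse_stream() result. The loop has no early exit, so the trailing
    metadata event is always consumed.
    """
    stop_reason = ""
    usage: dict = {}
    for event in events:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                on_text(delta["text"])
        elif "messageStop" in event:
            stop_reason = event["messageStop"].get("stopReason", "")
        elif "metadata" in event:
            usage = event["metadata"].get("usage", {})
    return stop_reason, usage


# Simulated events in the order converse_stream() delivers them
fake_events = [
    {"contentBlockDelta": {"delta": {"text": "Hello"}, "contentBlockIndex": 0}},
    {"contentBlockDelta": {"delta": {"text": " world"}, "contentBlockIndex": 0}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"messageStop": {"stopReason": "end_turn"}},
    {"metadata": {"usage": {"inputTokens": 12, "outputTokens": 2, "totalTokens": 14}}},
]
chunks: list[str] = []
stop_reason, usage = forward_text_and_capture_usage(fake_events, chunks.append)
```

In a real loop, on_text might be a function that writes to the terminal, and the returned usage dict would go to StreamingBudgetGuard.record_usage(). This sketch only forwards text; tool-use blocks still need the accumulation logic shown earlier.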

ConverseBreaker: composing all four guards

The four guards combine into a single ConverseBreaker wrapper that replaces direct client.converse() calls. The wrapper intercepts the call, runs pre-call checks, executes the Bedrock request through the circuit breaker, and records post-call accounting:

from dataclasses import dataclass, field

@dataclass
class ConverseConfig:
    # Spiral detection
    spiral_threshold: float = 0.85
    spiral_window: int = 4
    # Conversation guard
    conversation_warn_tokens: int = 80_000
    conversation_hard_tokens: int = 120_000
    # Circuit breaker
    circuit_failure_threshold: int = 3
    circuit_cooldown_seconds: float = 60.0
    # Budget (per-run)
    max_usd_per_run: float = 2.00
    price_per_1m_input: float = 3.00   # Claude 3.5 Sonnet default
    price_per_1m_output: float = 15.00

class ConverseBreaker:
    """Drop-in wrapper around client.converse() with all four cost guards."""

    def __init__(self, client, config: ConverseConfig | None = None):
        self._client = client
        cfg = config or ConverseConfig()
        self._spiral = ToolSpiralGuard(cfg.spiral_threshold, cfg.spiral_window)
        self._conv = ConversationGuard(cfg.conversation_warn_tokens, cfg.conversation_hard_tokens)
        self._breaker = ConverseCircuitBreaker(cfg.circuit_failure_threshold, cfg.circuit_cooldown_seconds)
        self._budget = StreamingBudgetGuard(
            cfg.max_usd_per_run,
            cfg.price_per_1m_input,
            cfg.price_per_1m_output
        )

    def check_tool_calls(self, assistant_message: dict) -> None:
        """Call before executing any tool from the assistant message."""
        for block in assistant_message.get("content", []):
            if "toolUse" in block:
                tool = block["toolUse"]
                if self._spiral.check(tool["name"], tool["input"]):
                    raise RuntimeError(
                        f"Tool spiral detected on '{tool['name']}'. "
                        "Aborting before executing near-duplicate call."
                    )

    def converse(self, messages: list[dict], **kwargs) -> dict:
        """
        Guarded replacement for client.converse().
        Signature: converse(messages, modelId=..., toolConfig=..., inferenceConfig=...)
        """
        # 1. Conversation history check
        status, est_tokens = self._conv.check(messages)
        if status == "hard_stop":
            raise RuntimeError(
                f"Conversation history too large (~{est_tokens:,} tokens). "
                "Trim history before continuing."
            )
        if status == "warn":
            import warnings
            warnings.warn(
                f"Conversation history approaching limit (~{est_tokens:,} estimated tokens). "
                "Consider trimming oldest turns."
            )

        # 2. Budget pre-check
        max_output = kwargs.get("inferenceConfig", {}).get("maxTokens", 4096)
        self._budget.check_input(messages, max_output)

        # 3. Circuit breaker call
        response = self._breaker.call(
            self._client.converse,
            messages=messages,
            **kwargs
        )

        # 4. Record usage
        if "usage" in response:
            self._budget.record_usage(response["usage"], kwargs.get("modelId", ""))

        return response

    @property
    def total_spent_usd(self) -> float:
        return self._budget.spent

Usage replaces client.converse() with breaker.converse() throughout the agent loop:

from botocore.config import Config

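# total_max_attempts=1 disables Boto3-level retries (max_attempts=1 would still allow one retry)
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(retries={"total_max_attempts": 1, "mode": "standard"})
)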

cfg = ConverseConfig(
    spiral_threshold=0.85,
    max_usd_per_run=3.00,
    price_per_1m_input=3.00,   # Claude 3.5 Sonnet
    price_per_1m_output=15.00
)
breaker = ConverseBreaker(client, cfg)

messages = [{"role": "user", "content": [{"text": task}]}]

while True:
    response = breaker.converse(
        messages=messages,
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        toolConfig=tool_config,
        inferenceConfig={"maxTokens": 4096, "temperature": 0.1}
    )

    assistant_message = response["output"]["message"]
    messages.append(assistant_message)

    if response["stopReason"] == "end_turn":
        break

    if response["stopReason"] == "tool_use":
        # Check for spirals before executing
        breaker.check_tool_calls(assistant_message)

        tool_results = []
        for block in assistant_message["content"]:
            if "toolUse" in block:
                tool = block["toolUse"]
                result = execute_tool(tool["name"], tool["input"])
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool["toolUseId"],
                        "content": [{"text": str(result)}]
                    }
                })

        messages.append({"role": "user", "content": tool_results})
        continue

    break

print(f"Run cost: ${breaker.total_spent_usd:.4f}")

Cross-model pricing reference

The Converse API's primary value proposition is model portability. The cost implications of switching models mid-deployment are significant. Reference pricing at AWS us-east-1 on-demand rates:

Model | Model ID | Input / 1M tokens | Output / 1M tokens | 10-step loop cost (est.)
Claude 3.5 Sonnet | anthropic.claude-3-5-sonnet-20241022-v2:0 | $3.00 | $15.00 | ~$0.36
Claude 3 Haiku | anthropic.claude-3-haiku-20240307-v1:0 | $0.25 | $1.25 | ~$0.03
Amazon Nova Pro | amazon.nova-pro-v1:0 | $0.80 | $3.20 | ~$0.09
Amazon Nova Lite | amazon.nova-lite-v1:0 | $0.06 | $0.24 | ~$0.007
Amazon Nova Micro | amazon.nova-micro-v1:0 | $0.035 | $0.14 | ~$0.004
Meta Llama 3.1 70B | meta.llama3-1-70b-instruct-v1:0 | $0.72 | $0.72 | ~$0.06
Mistral Large (24.02) | mistral.mistral-large-2402-v1:0 | $4.00 | $12.00 | ~$0.42

The 10-step loop estimate assumes an average of 8,000 input tokens (a growing conversation) and 800 output tokens per step, with all 10 steps completing before detection, so each figure is 10 × (8,000 × input price + 800 × output price) / 1M. For Claude 3.5 Sonnet that is 10 × ($0.024 + $0.012) = $0.36. The ConverseConfig.price_per_1m_input and price_per_1m_output fields accept the rates for whatever model you're running. Set them correctly when you initialize ConverseBreaker or you'll miscount budget consumption.
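If you switch models at runtime, a single pair of configured prices is easy to get wrong. One option is a lookup keyed by modelId. The sketch below copies the rates from the table above into a dict; PRICES_PER_1M and call_cost_usd are hypothetical names, and the rates should be checked against the current AWS pricing page before you rely on them.

```python
# USD per 1M tokens as (input, output), copied from the pricing table in this post.
# These are assumptions to verify, not authoritative prices.
PRICES_PER_1M: dict[str, tuple[float, float]] = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": (3.00, 15.00),
    "anthropic.claude-3-haiku-20240307-v1:0": (0.25, 1.25),
    "amazon.nova-pro-v1:0": (0.80, 3.20),
    "amazon.nova-lite-v1:0": (0.06, 0.24),
    "amazon.nova-micro-v1:0": (0.035, 0.14),
    "meta.llama3-1-70b-instruct-v1:0": (0.72, 0.72),
    "mistral.mistral-large-2402-v1:0": (4.00, 12.00),
}


def call_cost_usd(model_id: str, usage: dict) -> float:
    """Cost of one call from its usage dict; raises KeyError for an unknown modelId."""
    input_price, output_price = PRICES_PER_1M[model_id]
    return (
        usage.get("inputTokens", 0) / 1_000_000 * input_price
        + usage.get("outputTokens", 0) / 1_000_000 * output_price
    )
```

Raising on an unknown modelId is a deliberate choice in this sketch: silently falling back to a default rate is how a budget tracker ends up miscounting after a model swap.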

Converse vs. Bedrock Agents APIs: cost control properties

Understanding why the Converse API requires more explicit cost control than the Agents APIs helps you choose the right API surface for new workloads:

Property | Converse API | invoke_agent | invoke_inline_agent
Session state management | Client-side (you own messages[]) | Server-side (sessionId) | Server-side (sessionId)
Step limit enforcement | None: client-controlled loop | Yes: orchestration layer | Yes: orchestration layer
Model portability | Any supported Bedrock model | Configured per agent alias | Per-invocation modelId
Quota governors | Bedrock service quotas only | Service quotas + step limits | Service quotas + step limits
History truncation on context blow-through | Not handled: your responsibility | Managed by Bedrock | Managed by Bedrock
Streaming support | converse_stream(): full | Partial (invoke_agent only) | Partial (inline_agent only)
Circuit breaker needed | Yes: all four failure modes | Partial (trip detection) | Partial (four inline-specific modes)

The Converse API is the right choice when you need model-agnostic code, fine-grained control over conversation management, or streaming. It requires more explicit cost control precisely because it gives you more control. The Agents APIs are appropriate when you want the managed runtime to handle orchestration complexity and are willing to accept its model selection and step-limit constraints.

FAQ

Does the Converse API support prompt caching like the native Anthropic API?

Yes, on models that support it. The Anthropic Messages API exposes cache_control blocks that enable prefix caching, reducing input costs by up to 90% on repeated system prompts. The Bedrock Converse API surfaces the same capability through cachePoint blocks, which you place in system, messages, or toolConfig to mark the end of a cacheable prefix. Support is model-dependent and subject to change, so check the Bedrock prompt caching documentation for the modelId you run. If the model you need does not support caching through Bedrock, using the Anthropic SDK directly (see Anthropic Claude API Cost Control) remains an option.

Why is the spiral detection threshold 0.85 for Converse but 0.72 for the raw Anthropic SDK?

Converse API agents frequently use web search or retrieval tools where legitimate query refinement produces inputs that are 75–80% similar to the prior call. A threshold of 0.72 flags these legitimate refinements as spirals, causing false positives. The 0.85 threshold targets near-verbatim repetition — the actual uncontrolled loop case where the model re-issues essentially the same tool call. If your agent doesn't do iterative query refinement, you can lower the threshold to 0.78 for earlier detection.

How should I handle the circuit breaker in a multi-model fallback chain?

Instantiate one ConverseCircuitBreaker per model in your fallback chain, not one shared instance. A ThrottlingException on Claude 3.5 Sonnet should open the Sonnet circuit breaker and route to Nova Pro — but Nova Pro's circuit should only open on its own failures. A shared circuit breaker that opens on Sonnet throttling would also block Nova Pro calls, defeating the fallback logic. Reset the messages list to the turn where the throttle occurred before retrying on the fallback model to avoid replaying already-completed turns at the fallback model's pricing.
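A compact way to picture the per-model arrangement is sketched below. It uses a stripped-down breaker (ModelCircuit) rather than the full ConverseCircuitBreaker so the example is self-contained, and the model call is injected as a callable. Both names, along with converse_with_fallback, are hypothetical.

```python
import time
from typing import Callable


class ModelCircuit:
    """Minimal per-model breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown_seconds: float = 60.0):
        self.threshold = threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp while open, else None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()


def converse_with_fallback(
    call: Callable[[str, list[dict]], dict],
    model_ids: list[str],
    circuits: dict[str, ModelCircuit],
    messages: list[dict],
) -> tuple[str, dict]:
    """Try each model in order, skipping any whose own circuit is open.

    Every model receives a copy of the same messages snapshot, and nothing
    is appended on failure, so completed turns are not replayed.
    """
    last_error = None
    for model_id in model_ids:
        circuit = circuits.setdefault(model_id, ModelCircuit())
        if not circuit.available():
            continue
        try:
            response = call(model_id, list(messages))
        except Exception as exc:  # narrow this to throttling errors in real code
            circuit.record(ok=False)
            last_error = exc
            continue
        circuit.record(ok=True)
        return model_id, response
    raise RuntimeError("all models failed or have open circuits") from last_error
```

Because each modelId maps to its own ModelCircuit, repeated throttling on the first model eventually stops that model from being tried at all, while the fallback models keep their own independent failure counts.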

The conversation guard uses 4 characters per token as the estimate. How accurate is this?

Empirically, 4 characters per token is accurate to within ±15% for English-language text in Converse API messages. The estimate runs high for code (which has more whitespace and punctuation than the model's tokenizer costs it) and low for non-Latin scripts. The 80k/120k warn/hard thresholds are set conservatively enough that the ±15% range doesn't cause misses on the hard stop — a 120k-token estimate at ±15% corresponds to 102k–138k actual tokens, which is safely below Claude's 200k context limit. For non-English workloads, apply a language-specific calibration factor (Japanese/Chinese text runs approximately 1 token per 1.5 characters).
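Rather than trusting a fixed ratio, you can derive one from your own traffic, since every converse() response reports the actual inputTokens. The sketch below assumes you have logged pairs of (messages sent, inputTokens reported); calibrate_chars_per_token is a hypothetical helper.

```python
import json


def calibrate_chars_per_token(samples: list[tuple[list[dict], int]]) -> float:
    """Estimate characters per token from (messages, actual inputTokens) pairs.

    Each pair is the messages list sent on one call and the inputTokens value
    that call's usage reported. Total characters are divided by total tokens,
    so long calls carry proportionally more weight.
    """
    total_chars = sum(len(json.dumps(msgs, ensure_ascii=False)) for msgs, _ in samples)
    total_tokens = sum(tokens for _, tokens in samples)
    if total_tokens <= 0:
        raise ValueError("need at least one sample with a positive token count")
    return total_chars / total_tokens
```

One caveat: inputTokens also counts the system prompt and tool definitions, which are not part of messages, so the ratio computed this way comes out somewhat low. That errs toward overestimating tokens, which is the safe direction for a guard. The result can replace ConversationGuard.CHARS_PER_TOKEN for that model and workload.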

Can I use ConverseBreaker with converse_stream() as well?

The pre-call checks (conversation guard, budget pre-check, circuit breaker) apply directly to converse_stream() — call breaker._conv.check(messages), breaker._budget.check_input(messages, max_output), and wrap in breaker._breaker.call() before the stream call. For post-call accounting, use converse_stream_with_usage() from Failure Mode 4 to drain the full stream and get the metadata usage event, then pass it to breaker._budget.record_usage(). The spiral guard requires that you accumulate the full stream-reconstructed tool use block before calling breaker.check_tool_calls(). A future version of ConverseBreaker will wrap converse_stream() natively.

Stop the loop before the bill lands

RunGuard's SDK wraps your Converse API agent loop with production-grade circuit breakers. One integration, four guards, zero rewrite of your existing agent code.
