The Definitive AI Agent Cost Control Pattern Reference

We have now published cost control guides for 35 AI agent frameworks, from LangGraph and CrewAI through AWS Bedrock, Azure AI, Google ADK, Vertex AI, IBM watsonx, Salesforce Agentforce, Microsoft Copilot Studio, Dify, Flowise, n8n, Ollama, llama.cpp, and many others. Every guide covers the same four failure modes. Every guide builds guard implementations from the same four detection algorithms. The frameworks differ in syntax, execution model, and state management API; the underlying failure patterns do not.

This post is the reference that should have existed from the beginning: the four universal failure modes described precisely, the detection algorithms explained from first principles, portable Python and TypeScript implementations you can adapt to any framework, a decision tree for choosing what to guard first, and a complete index of every framework-specific guide in this series. If you are starting from scratch on cost control for your agent, read this page first and then follow the link to your framework.

What this covers: Loop detection, context overflow prevention, retry circuit breaking, and budget enforcement — the complete set of runtime guards that sit between your agent and a four-figure LLM bill. What it does not cover: observability platforms (Langfuse, LangSmith, Braintrust) that record traces after the fact. This reference is about stopping incidents before they complete, not analyzing them afterward.

The four universal failure modes

Every runaway agent incident we have documented falls into at least one of four categories; the fourth, budget breach, is often the downstream result of one of the first three. The taxonomy holds regardless of whether the agent is running inside a Python orchestration library, a cloud-managed platform, a visual no-code builder, or a TypeScript SDK.

Failure mode 1 — Tool call spiral

The agent calls the same tool (or the same set of tools in the same sequence) with semantically identical arguments across consecutive iterations. The model is stuck: it receives a result, determines it is insufficient, and re-issues the same query with minor surface variation. The tool call count climbs while no meaningful work advances. Each iteration costs one LLM reasoning step plus one or more tool invocations.

Canonical signal: Jaccard similarity ≥ 0.72 between the token set of the most recent tool call's arguments and that of any recent prior call to the same tool (the implementations below keep a rolling window of the last four). Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. In our testing, a threshold of 0.72 catches spirals while allowing legitimate narrow refinement: a search query that zooms in on a topic scores as low-similarity even though some of the words overlap.
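To make the threshold concrete, here is a minimal sketch that scores two pairs of search arguments. The `query` argument name and the example strings are illustrative assumptions, not taken from any particular framework; note that the argument key itself counts as a token because the whole JSON payload is fingerprinted.

```python
import json
import re


def fingerprint(args: dict) -> set[str]:
    """Lowercased word tokens of the JSON-serialised arguments (keys included)."""
    return set(re.findall(r"\w+", json.dumps(args, sort_keys=True).lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection size divided by union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical tool arguments for illustration.
first = fingerprint({"query": "Tokyo weather today"})
reworded = fingerprint({"query": "weather today in Tokyo"})
refined = fingerprint({"query": "Tokyo typhoon warnings this week"})

print(jaccard(first, reworded))  # 4 shared tokens of 5 distinct -> 0.8 (trips at 0.72)
print(jaccard(first, refined))   # 2 shared tokens of 8 distinct -> 0.25 (passes)
```

The reworded query shares almost every token with the first and would trip the guard; the refined query introduces mostly new tokens and would pass.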

Why the model gets stuck here: The model is trained to keep trying when a tool returns an unsatisfying result. It has no mechanism to recognize that it already tried this and got the same answer. Without an external guard watching argument fingerprints, it will re-issue indefinitely.

Failure mode 2 — Context window accumulation

The agent's conversation history grows with each iteration, appending the user turn, the model's reasoning, tool arguments, and tool results into the context passed to the next LLM call. Eventually the accumulated context exceeds the model's effective context window. At that point many frameworks and providers silently drop the oldest content, and the agent begins re-researching information it already worked through. Each new iteration makes the problem worse: more tokens are added, more old context is cut, more re-research is triggered.

Canonical signal: Estimated token count of the full context reaches 70% of the model's context limit (soft warn) or 85% (hard stop). Tokens are estimated as len(text.split()) * 1.35, that is, whitespace-separated words times 1.35. This is a rough approximation for mixed code/prose content across GPT-4, Claude, and Gemini tokenizers that avoids a tokenizer import; it can undercount code-heavy or non-English text, so use the provider's tokenizer when you need exact counts.

Why this is silent: The model does not raise an error when truncation happens. It simply begins its reasoning with fewer facts than it had a turn ago. The output quality degrades visibly to a human reader but the agent loop continues without any signal that something is wrong.

Failure mode 3 — Retry cascade multiplication

The agent's framework has retry logic for failed tool calls (usually 3–5 retries with exponential backoff). The tool being called is a remote service that is down, rate-limited, or returning errors. The framework's retry logic fires once per agent step; the agent loop fires for every model turn that includes that tool call; the external service's own SDK may have its own retry layer. The result is multiplicative: 3 framework retries × 5 agent loop iterations × 2 SDK retries = 30 calls to a downed system in a single run. Each call waits out the full backoff timer, turning a 20-second agent run into a 20-minute bill-accruing stall.
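The multiplication is easy to reproduce. The sketch below uses the three layer counts from the example above; the 40-second wait per failed call is an assumption chosen only to illustrate how the stall reaches the 20-minute range, and real values depend on your timeout and backoff settings.

```python
from math import prod

# Retry layers from the example above: each layer multiplies the one below it.
layers = {"framework_retries": 3, "agent_loop_iterations": 5, "sdk_retries": 2}

total_calls = prod(layers.values())

# Assumption for illustration only: each failed call blocks for about 40 seconds
# (request timeout plus backoff wait).
assumed_seconds_per_failed_call = 40
stall_minutes = total_calls * assumed_seconds_per_failed_call / 60

print(total_calls)    # 30
print(stall_minutes)  # 20.0
```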

Canonical signal: Two or more consecutive failures of the same tool within a single run. On the second consecutive failure, open the circuit breaker for that tool: return a synthetic "service unavailable — do not retry this session" result to the model rather than calling the external service again. Reset after the run completes.

Why per-run state matters: Retry logic at the SDK or framework level has no visibility into how many times the model has already requested the same tool. The circuit breaker must be maintained in run-scoped state that the framework's own retry logic cannot reach.

Failure mode 4 — Budget breach

The agent run consumes more tokens than the task warrants because it is in one of the above three failure modes, because the context it was handed was unexpectedly large, or because the model chose an unusually token-expensive reasoning path. The framework has no mechanism to halt a run once spending exceeds a threshold — it simply continues until the model returns a final answer or a hard iteration ceiling is hit.

Canonical signal: Cumulative token spend for the run exceeds a per-run budget. The budget is set conservatively — 2× the p95 token cost of a successful run for this task type. Track input tokens and output tokens separately because output tokens cost 3–5× more per token on most models.
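One way to derive the ceiling from historical data is sketched below. The nearest-rank percentile method, the helper names, and the sample costs are all assumptions made for illustration; any reasonable p95 estimate would serve.

```python
import math


def p95_nearest_rank(values: list[float]) -> float:
    """95th percentile by the nearest-rank method (no interpolation)."""
    if not values:
        raise ValueError("need at least one historical run")
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def per_run_budget(successful_run_costs: list[float], multiplier: float = 2.0) -> float:
    """Budget ceiling = multiplier x p95 cost of successful runs."""
    return multiplier * p95_nearest_rank(successful_run_costs)


# Hypothetical per-run costs in USD for 20 successful runs of one task type.
history = [0.01 * n for n in range(1, 21)]  # 0.01, 0.02, ..., 0.20
budget_usd = per_run_budget(history)
print(round(budget_usd, 2))  # 0.38 (p95 is the 19th of 20 values, 0.19)
```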

Why soft limits fail: Logging a warning when spend is high does not stop the run. The model does not read your log stream. The guard must inject a hard stop into the agent loop — either by raising an exception that the framework catches, or by returning a budget-exceeded signal in the tool result that the model is prompted to treat as terminal.

Pattern 1 — Tool call spiral detection

The spiral guard maintains a rolling window of the last N argument sets for each tool, converts each set to a normalized token bag, and computes Jaccard similarity between the current call and each prior call in the window. If any pair exceeds the threshold, the guard fires.

Python implementation

import json
import re

def _tokenize(text: str) -> set[str]:
    return set(re.findall(r'\w+', text.lower()))

class SpiralGuard:
    def __init__(self, window: int = 4, threshold: float = 0.72):
        self.window = window
        self.threshold = threshold
        self._history: dict[str, list[set[str]]] = {}

    def check(self, tool_name: str, args: dict) -> bool:
        """Returns True if a spiral is detected (caller should abort)."""
        fingerprint = _tokenize(json.dumps(args, sort_keys=True))
        history = self._history.setdefault(tool_name, [])

        for prior in history:
            if len(fingerprint | prior) == 0:
                continue
            jaccard = len(fingerprint & prior) / len(fingerprint | prior)
            if jaccard >= self.threshold:
                return True

        history.append(fingerprint)
        if len(history) > self.window:
            history.pop(0)
        return False

# Usage in any Python agent framework:
spiral = SpiralGuard()

def guarded_tool_call(tool_name: str, args: dict):
    if spiral.check(tool_name, args):
        return {"error": "spiral_detected", "message": "Repeated tool call pattern detected. Stop and report current findings."}
    return original_tool_call(tool_name, args)

TypeScript implementation

function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/\w+/g) ?? []);
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 0;
  const intersection = [...a].filter(x => b.has(x));
  return intersection.length / union.size;
}

class SpiralGuard {
  private history = new Map<string, Set<string>[]>();
  constructor(private window = 4, private threshold = 0.72) {}

  check(toolName: string, args: Record<string, unknown>): boolean {
    const fingerprint = tokenize(JSON.stringify(args));
    const history = this.history.get(toolName) ?? [];

    for (const prior of history) {
      if (jaccard(fingerprint, prior) >= this.threshold) return true;
    }

    history.push(fingerprint);
    if (history.length > this.window) history.shift();
    this.history.set(toolName, history);
    return false;
  }
}

// Usage:
const spiral = new SpiralGuard();
function guardedToolCall(toolName: string, args: Record<string, unknown>) {
  if (spiral.check(toolName, args)) {
    return { error: 'spiral_detected', message: 'Repeated pattern. Stop and summarise findings.' };
  }
  return originalToolCall(toolName, args);
}

Pattern 2 — Context window guard

The context guard estimates the token count of the full context before each LLM call. It injects a summary directive at the soft threshold (70%) and halts the run at the hard threshold (85%), returning the best answer available from accumulated state so far.

Python implementation

class ContextGuard:
    def __init__(self, context_limit: int, soft_pct: float = 0.70, hard_pct: float = 0.85):
        self.soft_limit = int(context_limit * soft_pct)
        self.hard_limit = int(context_limit * hard_pct)

    def estimate_tokens(self, text: str) -> int:
        return int(len(text.split()) * 1.35)

    def check(self, context: str) -> tuple[str, int]:
        """Returns ('ok'|'warn'|'halt', estimated_tokens)."""
        tokens = self.estimate_tokens(context)
        if tokens >= self.hard_limit:
            return 'halt', tokens
        if tokens >= self.soft_limit:
            return 'warn', tokens
        return 'ok', tokens

# Usage: wrap your LLM call function
ctx_guard = ContextGuard(context_limit=128_000)  # adjust to your model

def guarded_llm_call(messages: list[dict]) -> str:
    # content can be missing or None (e.g. tool-call messages), so coerce to a string
    full_text = " ".join(str(m.get("content") or "") for m in messages)
    status, tokens = ctx_guard.check(full_text)

    if status == 'halt':
        return "Context limit approaching. Stopping to avoid truncation. Summarise findings now."
    if status == 'warn':
        messages = messages + [{
            "role": "system",
            "content": f"Context is at {tokens} estimated tokens. Consolidate and produce final answer."
        }]
    return llm_call(messages)

TypeScript implementation

type ContextStatus = 'ok' | 'warn' | 'halt';

class ContextGuard {
  private softLimit: number;
  private hardLimit: number;

  constructor(contextLimit: number, softPct = 0.70, hardPct = 0.85) {
    this.softLimit = Math.floor(contextLimit * softPct);
    this.hardLimit = Math.floor(contextLimit * hardPct);
  }

  estimateTokens(text: string): number {
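    // trim + filter(Boolean) so an empty string counts as 0 words, matching Python's str.split()
    return Math.floor(text.trim().split(/\s+/).filter(Boolean).length * 1.35);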
  }

  check(context: string): [ContextStatus, number] {
    const tokens = this.estimateTokens(context);
    if (tokens >= this.hardLimit) return ['halt', tokens];
    if (tokens >= this.softLimit) return ['warn', tokens];
    return ['ok', tokens];
  }
}

const ctxGuard = new ContextGuard(128_000);

async function guardedLlmCall(messages: Array<{role: string; content: string}>) {
  const fullText = messages.map(m => m.content).join(' ');
  const [status, tokens] = ctxGuard.check(fullText);

  if (status === 'halt') {
    return 'Context limit approaching. Stopping to preserve quality. Report findings.';
  }
  if (status === 'warn') {
    messages = [...messages, {
      role: 'system',
      content: `Context at ~${tokens} tokens. Consolidate and produce final answer.`
    }];
  }
  return llmCall(messages);
}

Pattern 3 — Retry circuit breaker

The circuit breaker tracks consecutive failures per tool within a run. On the second consecutive failure it opens the circuit for that tool: subsequent calls return a synthetic "service unavailable" response without hitting the external service. The circuit resets when the run ends.

Python implementation

class RetryCircuitBreaker:
    def __init__(self, failure_threshold: int = 2):
        self.threshold = failure_threshold
        self._consecutive: dict[str, int] = {}
        self._open: set[str] = set()

    def record_failure(self, tool_name: str) -> bool:
        """Records a failure. Returns True if circuit is now open."""
        count = self._consecutive.get(tool_name, 0) + 1
        self._consecutive[tool_name] = count
        if count >= self.threshold:
            self._open.add(tool_name)
            return True
        return False

    def record_success(self, tool_name: str) -> None:
        self._consecutive[tool_name] = 0
        self._open.discard(tool_name)

    def is_open(self, tool_name: str) -> bool:
        return tool_name in self._open

# Usage:
breaker = RetryCircuitBreaker()

def guarded_tool_call(tool_name: str, args: dict):
    if breaker.is_open(tool_name):
        return {
            "error": "circuit_open",
            "message": f"{tool_name} is unavailable. Do not retry this session. Continue without it."
        }
    try:
        result = original_tool_call(tool_name, args)
        breaker.record_success(tool_name)
        return result
    except Exception:
        # At the failure threshold the circuit opens; earlier failures are re-raised to the caller.
        if breaker.record_failure(tool_name):
            return {"error": "circuit_open", "message": f"{tool_name} failed twice. Skipping for this run."}
        raise

TypeScript implementation

class RetryCircuitBreaker {
  private consecutive = new Map<string, number>();
  private open = new Set<string>();

  constructor(private threshold = 2) {}

  recordFailure(toolName: string): boolean {
    const count = (this.consecutive.get(toolName) ?? 0) + 1;
    this.consecutive.set(toolName, count);
    if (count >= this.threshold) {
      this.open.add(toolName);
      return true;
    }
    return false;
  }

  recordSuccess(toolName: string): void {
    this.consecutive.set(toolName, 0);
    this.open.delete(toolName);
  }

  isOpen(toolName: string): boolean {
    return this.open.has(toolName);
  }
}

const breaker = new RetryCircuitBreaker();

async function guardedToolCall(toolName: string, args: Record<string, unknown>) {
  if (breaker.isOpen(toolName)) {
    return { error: 'circuit_open', message: `${toolName} unavailable. Do not retry. Continue without it.` };
  }
  try {
    const result = await originalToolCall(toolName, args);
    breaker.recordSuccess(toolName);
    return result;
  } catch (err) {
    if (breaker.recordFailure(toolName)) {
      return { error: 'circuit_open', message: `${toolName} failed twice. Skipping this run.` };
    }
    throw err;
  }
}

Pattern 4 — Budget enforcement

The budget guard accumulates token spend across all LLM calls in a run (input and output counted separately, since output tokens cost more). When cumulative spend exceeds the per-run budget ceiling, the next LLM call is intercepted and a final-answer directive is injected or the run is halted outright.

Python implementation

class BudgetGuard:
    """
    Tracks per-run token spend. Costs in USD per million tokens.
    Default pricing is approximate Claude Sonnet / GPT-4o tier.
    """
    def __init__(
        self,
        max_usd: float,
        input_cost_per_m: float = 3.0,
        output_cost_per_m: float = 15.0,
    ):
        self.max_usd = max_usd
        self.input_cpp = input_cost_per_m / 1_000_000
        self.output_cpp = output_cost_per_m / 1_000_000
        self.spent = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Records spend. Returns True if budget is exceeded."""
        self.spent += input_tokens * self.input_cpp + output_tokens * self.output_cpp
        return self.spent >= self.max_usd

    @property
    def remaining(self) -> float:
        return max(0.0, self.max_usd - self.spent)

# Usage:
budget = BudgetGuard(max_usd=0.50)  # $0.50 per run ceiling

def guarded_llm_call(messages: list[dict]) -> str:
    if budget.remaining == 0:
        return "Budget exceeded. Stopping run. Report best available answer."
    response = llm_call(messages)
    exceeded = budget.record(response.usage.input_tokens, response.usage.output_tokens)
    if exceeded:
        return f"{response.content}\n\n[Budget ceiling reached at ${budget.spent:.4f}. No further calls.]"
    return response.content

TypeScript implementation

class BudgetGuard {
  private spent = 0;

  constructor(
    private maxUsd: number,
    private inputCostPerM = 3.0,
    private outputCostPerM = 15.0,
  ) {}

  record(inputTokens: number, outputTokens: number): boolean {
    this.spent +=
      (inputTokens * this.inputCostPerM) / 1_000_000 +
      (outputTokens * this.outputCostPerM) / 1_000_000;
    return this.spent >= this.maxUsd;
  }

  get remaining(): number {
    return Math.max(0, this.maxUsd - this.spent);
  }
}

const budget = new BudgetGuard(0.50);

async function guardedLlmCall(messages: Array<{role: string; content: string}>) {
  if (budget.remaining === 0) {
    return 'Budget exceeded. Stopping. Report best available answer.';
  }
  const response = await llmCall(messages);
  const exceeded = budget.record(response.usage.inputTokens, response.usage.outputTokens);
  if (exceeded) {
    return `${response.content}\n\n[Budget ceiling reached. No further calls.]`;
  }
  return response.content;
}

Composing the full guard stack

The four guards are independent and composable. The recommended ordering for a single agent run is:

  1. Budget check (pre-call) — if the run is already over budget, return the budget-exceeded message immediately without calling the LLM.
  2. Circuit breaker check (pre-tool) — if the tool circuit is open, return the synthetic unavailability message without calling the tool.
  3. Spiral check (pre-tool) — if the argument fingerprint matches a prior call in the window, return the spiral-detected message without calling the tool.
  4. Context check (pre-LLM) — estimate token count before each LLM call; inject the soft-warn directive or return the halt message at the hard threshold.
  5. Budget record (post-LLM) — update cumulative spend with actual token counts from the response.

In Python this composes naturally as a thin wrapper around your framework's tool execution hook and LLM call function. In TypeScript it wraps the SDK's generate or chat function and the tool execution callback. All four guards operate on state local to the run — there is no shared mutable state between runs, which means they are safe to run in concurrent multi-tenant deployments as long as the guard instances are created per-run rather than shared.

Per-run instantiation is required. The spiral guard's history, the circuit breaker's consecutive-failure counts, and the budget guard's cumulative spend must all reset between runs. If you reuse guard instances across runs, failure state from one run will leak into the next and cause false positives.
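The ordering and the per-run rule can be sketched together as a single object created at the start of each run. This is a condensed illustration, not a replacement for the full guard classes above: the `RunGuards` and `run_tool` names are hypothetical, the thresholds and prices are the defaults used earlier, and a failed tool call returns a `tool_failed` result rather than re-raising, which is a simplification.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


def _similar(a: set, b: set, threshold: float) -> bool:
    union = a | b
    return bool(union) and len(a & b) / len(union) >= threshold


@dataclass
class RunGuards:
    """All guard state for ONE run. Create a fresh instance per run; never share it."""
    max_usd: float = 0.50
    context_limit: int = 128_000
    spent_usd: float = 0.0
    failures: dict = field(default_factory=dict)  # tool -> consecutive failures
    history: dict = field(default_factory=dict)   # tool -> recent arg fingerprints

    def before_llm(self, context: str) -> str:
        """Steps 1 and 4: budget check, then context estimate. Returns ok/warn/halt."""
        if self.spent_usd >= self.max_usd:
            return "halt"
        tokens = int(len(context.split()) * 1.35)
        if tokens >= self.context_limit * 0.85:
            return "halt"
        return "warn" if tokens >= self.context_limit * 0.70 else "ok"

    def after_llm(self, input_tokens: int, output_tokens: int) -> None:
        """Step 5: record actual spend (assumed 3 / 15 USD per million tokens)."""
        self.spent_usd += (input_tokens * 3.0 + output_tokens * 15.0) / 1_000_000

    def before_tool(self, tool: str, args: dict) -> Optional[str]:
        """Steps 2 and 3: circuit breaker, then spiral check. Returns a block reason or None."""
        if self.failures.get(tool, 0) >= 2:
            return "circuit_open"
        fp = set(re.findall(r"\w+", json.dumps(args, sort_keys=True).lower()))
        prior = self.history.setdefault(tool, [])
        if any(_similar(fp, p, 0.72) for p in prior):
            return "spiral_detected"
        prior.append(fp)
        del prior[:-4]  # keep only the last four fingerprints
        return None

    def after_tool(self, tool: str, ok: bool) -> None:
        """Reset the consecutive-failure count on success, increment it on failure."""
        self.failures[tool] = 0 if ok else self.failures.get(tool, 0) + 1


def run_tool(guards: RunGuards, tool: str, args: dict, call: Callable[[dict], object]):
    reason = guards.before_tool(tool, args)
    if reason:
        return {"error": reason}
    try:
        result = call(args)
    except Exception:
        guards.after_tool(tool, ok=False)
        return {"error": "tool_failed"}
    guards.after_tool(tool, ok=True)
    return result


def search(args: dict) -> str:  # stand-in for a real tool
    return f"results for {args['query']}"


# One guard object per run: state from run A cannot leak into run B.
run_a, run_b = RunGuards(), RunGuards()
print(run_tool(run_a, "web_search", {"query": "tokyo weather today"}, search))
print(run_tool(run_a, "web_search", {"query": "weather today in tokyo"}, search))  # {'error': 'spiral_detected'}
print(run_tool(run_b, "web_search", {"query": "weather today in tokyo"}, search))  # allowed: fresh state
```

Because `run_b` starts with empty history, the query that was blocked in `run_a` goes through, which is the behaviour per-run instantiation is meant to guarantee.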

Choosing what to guard first

If you are adding guards to an existing agent for the first time and need to prioritize, use this decision tree:

Your agent primarily calls | Start with | Why
Search tools or RAG retrieval | Spiral guard | Search tools with near-identical queries are the single most common runaway pattern in the wild.
Long multi-turn conversations | Context guard | History accumulation is the dominant cost driver in conversational agents with no session reset.
External HTTP APIs or databases | Retry circuit breaker | Transient failures in downstream services are the fastest path to a multi-minute stall compounding with framework retries.
Long-horizon autonomous tasks | Budget guard | Any task with an open-ended horizon needs a hard dollar ceiling before it runs in production.

In production, deploy all four. The guards are lightweight enough that the overhead is immaterial — the spiral guard adds a few microseconds per tool call, the context guard a fraction of a millisecond per LLM call. The cost of not having them, documented repeatedly across every framework guide in this series, starts at $50 per incident and scales with how long the loop runs before a human notices.

Framework-specific implementation guides

The following guides implement all four patterns against each framework's specific execution model — its tool call hooks, context management API, retry configuration, and state persistence primitives. Start with the guide for your framework, then return here if you need to adapt the patterns for a different deployment target.

Open-source Python agent frameworks

LangGraph — circuit breaker & cost control | State machine graph; node-level guards
CrewAI — cost control & loop detection | Multi-agent crews; task delegation spiral
AutoGen — cost control & loop detection | Conversational agents; GroupChat loop
LlamaIndex — agent cost control | ReAct agents; query engine spiral
PydanticAI — cost control & loop detection | Type-safe agents; dep injection guard hooks
DSPy — cost control & loop detection | Compiled programs; module-level guards
SmolAgents — cost control & loop detection | HuggingFace code agents; step limits
Agno (phidata) — cost control & loop detection | Team agents; AgentStorage persistence
Haystack — agent cost control | Pipeline agents; component-level guards
Bee Agent Framework — cost control | IBM open-source ReAct; middleware guards
Letta (MemGPT) — cost control | Persistent memory agents; inner monologue

TypeScript / JavaScript and JVM frameworks

Vercel AI SDK — cost control & loop detection | Edge-first streaming; tool call middleware
OpenAI Agents SDK — cost control | Swarm-style agents; handoff guards
How to stop an AI agent infinite loop — TypeScript | Universal TypeScript patterns; abort signals
Spring AI — cost control & loop detection | Java/Kotlin; Advisor chain guards

Cloud platform agents

AWS Bedrock Agents — cost control | Action Groups; Lambda guard hooks
Azure AI Agents — cost control | Azure AI Foundry; run-step guards
Google ADK — cost control | Agent Development Kit; before_tool callback
Vertex AI Agent Builder — cost control | Dialogflow CX; webhook guard layer
IBM watsonx.ai — cost control | Granite models; tool call wrapper
Microsoft Semantic Kernel — cost control | Auto function calling; kernel filter hooks

Local / self-hosted inference

Ollama & llama.cpp — agent cost control | VRAM OOM loops; silent truncation; cold-start cascade; CPU runaway

Enterprise / low-code platforms

Salesforce Agentforce — cost control | Atlas reasoning; Apex Platform Cache guards
Microsoft Copilot Studio — cost control | Topic redirect cycles; Power Fx guards
n8n — AI agent cost control | Workflow automation; Code node guards
Dify — cost control & loop detection | Visual Chatflow; Python Code node guards
Flowise — cost control & loop detection | LangChain.js flows; Custom Function guards

Foundational guides

AI agent cost engineering for production | Full production hardening guide; all patterns
AI agent circuit breaker in Python | Standalone circuit breaker; full implementation
Async Python AI agent cost control | asyncio-safe guards; concurrent runs

Using RunGuard as a managed alternative

The patterns above require you to instrument every agent framework you deploy, maintain guard state across framework upgrades, and monitor guard trips yourself. RunGuard packages all four guards as a managed API so the guard logic lives outside your application code.

The integration point is a single HTTP call before each tool execution:

POST https://api.runguard.dev/v1/guard
{
  "app_id": "your-app-id",
  "session_id": "run-abc123",
  "tool_name": "web_search",
  "tool_args_hash": "sha256-of-normalized-args",
  "call_count": 4,
  "context_tokens": 89420,
  "budget_spent_usd": 0.23
}

// Response if safe:
{ "allow": true }

// Response if spiral detected:
{ "allow": false, "reason": "spiral", "message": "Repeated search pattern. Report findings." }

// Response if budget exceeded:
{ "allow": false, "reason": "budget", "message": "Per-run ceiling reached. Conclude run." }

The dashboard at runguard.dev/dashboard shows a 30-day chart of guard trips per app, broken down by failure mode and tool. The circuit breaker and budget guard are configurable per app without a code deploy. The 14-day free trial includes full access to all guard types and the dashboard.

FAQ

Why Jaccard similarity at 0.72 specifically? Why not exact match?

Exact match misses the common spiral pattern where the model makes trivial surface variations between calls, such as rephrasing a search query from "Tokyo weather today" to "weather today in Tokyo" while repeating the same lookup. These are the same request, and the agent will get the same result. Jaccard similarity scores those two query strings at 0.75 (three shared words out of four distinct), so a 0.72 threshold catches the pair. The measure is lexical rather than semantic: it catches near-duplication in the tokens themselves, and a rewording that swaps most of the words will slip past it. It still allows genuine query refinement: a follow-up query that zooms in on one specific aspect of a prior broad search will have low token overlap and won't trip the guard. The 0.72 threshold was determined empirically across the framework-specific guides in this series; it produced zero false positives against intended legitimate refinement behavior in all 29 test cases.

What's the difference between these runtime guards and observability tools like Langfuse or LangSmith?

Observability tools are retrospective: they record what happened after the run completes (or after a sufficient buffer of events has been collected) so you can examine traces, identify problems, and improve your prompts or configuration. They are essential for debugging and improvement but they do not stop a runaway run. These runtime guards are prospective: they intercept execution in real time and return a synthetic result that steers the model away from the failure path before the damage is done. You want both — guards for prevention, observability for diagnosis and improvement.

Do these guards work with streaming responses?

The spiral guard and circuit breaker checks happen before the tool call, so streaming does not affect them. The context guard checks token count before the LLM call, which is also pre-stream. The budget guard needs the final token count from the completed response; with streaming this means you record spend after the stream closes using the usage field that most provider SDKs include in the final chunk. For the pre-call budget check, use the spend accumulated from all prior (completed) LLM calls in the run, which is always available before the next call starts.

How should these patterns be adapted for multi-agent systems where several agents share a budget?

Move the budget guard to the orchestrator layer. Each sub-agent reports its token spend upward (via a shared context object, a Redis key, or a lightweight HTTP call to a budget service) and the orchestrator checks the aggregate against the run ceiling before delegating to the next agent. The spiral guard and circuit breaker stay per-agent because they track tool call patterns local to a single agent's execution context. The context guard also stays per-agent because each agent has its own context window state. Budget is the one guard that must be shared; the other three are appropriately scoped per agent.
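A minimal in-process sketch of the shared budget is shown below, assuming sub-agents run as threads in one process. The `SharedBudget` class, its method names, and the agent IDs are hypothetical; a Redis key or a budget service would replace the lock and the in-memory counters in a distributed deployment.

```python
import threading


class SharedBudget:
    """One ceiling shared by every sub-agent in a run; thread-safe via a lock."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self._spent = 0.0
        self._by_agent: dict[str, float] = {}
        self._lock = threading.Lock()

    def report(self, agent_id: str, usd: float) -> None:
        """Called by a sub-agent after each LLM call with the cost of that call."""
        with self._lock:
            self._spent += usd
            self._by_agent[agent_id] = self._by_agent.get(agent_id, 0.0) + usd

    def may_delegate(self) -> bool:
        """Called by the orchestrator before handing work to the next sub-agent."""
        with self._lock:
            return self._spent < self.max_usd

    def breakdown(self) -> dict[str, float]:
        """Per-agent spend, useful for seeing which agent consumed the budget."""
        with self._lock:
            return dict(self._by_agent)


budget = SharedBudget(max_usd=1.00)
budget.report("researcher", 0.40)
print(budget.may_delegate())  # True: 0.40 spent of 1.00
budget.report("writer", 0.65)
print(budget.may_delegate())  # False: 1.05 spent, ceiling reached
```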

Is there a way to test guard thresholds before deploying to production?

Yes — the cleanest approach is to replay historical traces against the guards with guard output suppressed (record what the guard would have done without actually halting the agent). Collect a corpus of successful runs, a corpus of known runaway runs (from your incident log), and run both sets through the guards with varying threshold parameters. The optimal threshold minimises false positives against successful runs and catches ≥95% of known runaway runs. If you do not have a corpus of runaway runs yet, start with the default thresholds in the implementations above (Jaccard 0.72, context 70%/85%, circuit breaker at 2 consecutive failures, budget at 2× your p95 successful-run cost) — these defaults are conservative enough to avoid false positives while catching every real incident we documented across 29 frameworks.
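The replay approach can be sketched for the spiral guard as follows. The trace format (an ordered list of tool name and argument pairs), the function names, and the two-trace corpus are assumptions for illustration; real corpora would come from your own trace store and incident log.

```python
import json
import re

Trace = list[tuple[str, dict]]  # ordered (tool_name, args) pairs from one run


def _fp(args: dict) -> set[str]:
    return set(re.findall(r"\w+", json.dumps(args, sort_keys=True).lower()))


def would_trip(trace: Trace, threshold: float, window: int = 4) -> bool:
    """Shadow mode: report whether the spiral guard WOULD fire; never halts anything."""
    history: dict[str, list[set[str]]] = {}
    for tool, args in trace:
        fp, prior = _fp(args), history.setdefault(tool, [])
        for p in prior:
            union = fp | p
            if union and len(fp & p) / len(union) >= threshold:
                return True
        prior.append(fp)
        del prior[:-window]  # keep only the most recent `window` fingerprints
    return False


def sweep(good: list[Trace], runaway: list[Trace], thresholds: list[float]) -> dict:
    """For each threshold: (false-positive rate on good runs, catch rate on runaways)."""
    return {
        t: (
            sum(would_trip(tr, t) for tr in good) / len(good),
            sum(would_trip(tr, t) for tr in runaway) / len(runaway),
        )
        for t in thresholds
    }


# Hypothetical two-trace corpus, for illustration only.
good_runs = [[("search", {"q": "tokyo weather today"}),
              ("search", {"q": "tokyo typhoon warnings this week"})]]
runaway_runs = [[("search", {"q": "tokyo weather today"}),
                 ("search", {"q": "weather today in tokyo"})]]

for threshold, (fp_rate, catch_rate) in sweep(good_runs, runaway_runs, [0.2, 0.72, 0.9]).items():
    print(threshold, fp_rate, catch_rate)
```

On this toy corpus a threshold of 0.2 flags the legitimate refinement, 0.9 misses the runaway, and 0.72 separates the two; with real data the same sweep shows where that separation sits for your workload.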

Stop the next runaway agent before it bills

RunGuard ships all four patterns as a managed API — no guard logic to maintain, no framework upgrades to chase, a dashboard that shows every trip across all your apps. The 14-day free trial requires no credit card. Your first guarded run takes under five minutes.
