Azure AI Agents Cost Control: Loop Detection and Budget Enforcement in Production

Azure AI Agent Service, in public preview since late 2024 and generally available since May 2025, is Microsoft's managed platform for building autonomous agents on top of Azure OpenAI models, accessed through the azure-ai-projects SDK. You define an agent with a system prompt, attach tools (function calling, file search, code interpreter, Bing grounding), and the service handles the run loop (model inference, tool dispatch, result injection, and final response generation), all orchestrated as a durable Run against a persistent Thread.

The managed runtime means you don't own the orchestration loop. The trade-off is familiar to anyone who has used OpenAI Assistants: the loop is a black box between create_run and COMPLETED/FAILED status. Azure exposes max_completion_tokens and max_prompt_tokens on each run — per-run token ceilings — as the primary built-in cost controls. Teams building production agents routinely raise these or omit them entirely to avoid premature termination on complex tasks. The higher you set them, the more expensive a runaway run becomes.

This post covers four failure modes specific to Azure AI Agent Service's architecture and shows how to build an AzureAgentBreaker circuit breaker that wraps the azure-ai-projects run polling loop and trips before the bill lands.

Why max_completion_tokens is not a circuit breaker

Azure's max_completion_tokens parameter sets an upper bound on total completion tokens the agent can generate in a single run — model output across all internal inference steps. Combined with max_prompt_tokens, it creates a per-run token budget. A run that exceeds either limit is terminated with status incomplete and reason max_completion_tokens or max_prompt_tokens.

A circuit breaker detects a behavioral pattern — repeated or escalating behavior that signals the agent is spending without converging — and halts specifically because progress has stalled. Token limits are spending caps: they stop the agent after consuming N tokens regardless of whether those tokens produced any forward progress. The distinction matters in both directions:

An agent legitimately working through a complex multi-document research task should not be blocked mid-reasoning because it hit an arbitrarily set token ceiling. Raising the ceiling to accommodate legitimate tasks also raises the blast radius when the agent loops.
An agent calling the same function tool with identical arguments five times in a row has entered a spiral. A pattern-aware breaker trips after the third repeat; a token limit lets the spiral consume the entire budget.
A coordinator agent re-delegating the same subtask to a sub-agent three times because it doesn't like the response generates three full sub-agent runs, each with its own token budget. The per-run limit on the coordinator offers no protection against coordinator-level over-retry.

The four failure modes below each operate within what looks like valid Azure agent behavior at the individual step level. Only a pattern-level view across run steps — or across connected-agent invocation boundaries — reveals the loop.

Azure AI Agent Service architecture overview

Understanding the four failure modes requires a brief orientation to how Azure AI Agent Service routes calls:

Threads carry the conversation history as a durable list of ThreadMessage objects. When you create a new run on a thread, the entire thread history is injected as input context — the longer the thread, the more input tokens every run consumes.
Runs are the execution units. Each run creates a series of RunStep objects — one per reasoning cycle. Each step has a type (message_creation or tool_calls) and a step_details payload that identifies which tools were called and with what arguments.
Function tools are developer-defined callables registered on the agent. When a step has type tool_calls, the run pauses at requires_action status and waits for your code to submit tool outputs before resuming. Every pause-resume cycle is an additional model inference round.
File search is a built-in RAG tool backed by Azure's managed vector stores. The agent queries it automatically — you don't call it explicitly. Search queries and retrieved chunk counts are visible in run step details when you list steps with include=["step_details.tool_calls[*].file_search.results[*].content"].
Connected agents (introduced in 2025) let one agent act as a coordinator that delegates to named sub-agents. The coordinator invokes a sub-agent tool, the sub-agent runs its own full loop on its own thread, and returns a result. Each actor has independent token budgets and billing.

Each of these components introduces a distinct looping failure mode.

Failure mode 1: Run-step tool-call spiral

Tool-call spirals occur when the model calls the same function tool with near-identical arguments across consecutive requires_action cycles without the task advancing. The pattern: the agent calls search_documents(query="Q3 revenue forecast"), you submit the result, the model processes it, determines it's insufficient, and calls the same function again — perhaps with a trivially rephrased query ("revenue forecast Q3"). The function returns essentially the same data. The model tries once more.

This failure mode appears most often when:

The function returns data that partially matches the model's expectations but leaves an unresolved question the model was instructed to fully answer before proceeding.
The function's return schema includes optional fields the model looks for; those fields are absent; the model retries hoping they appear in the next call.
The agent instructions say "verify all retrieved data before summarizing" but the verification tool is the same data retrieval function, creating a confirmation loop with no exit condition.

The cost structure is non-obvious. Each requires_action pause represents a full model inference round that produced the tool call decision. When you submit outputs and resume, the run executes another inference round to process the results. A spiral that repeats the same tool call eight times before a token limit fires costs eight full model inference rounds on the input side (each incorporating the full thread history plus accumulated run context), plus eight output inference rounds. On gpt-4o via Azure OpenAI at standard throughput pricing, a moderately complex thread can make each inference round cost $0.05–$0.20. Eight spiral cycles on a task that should have taken two rounds costs $0.40–$1.60 instead of $0.10–$0.40.
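
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses the illustrative per-round price range from the paragraph above; the function name spiral_cost is hypothetical and the prices are not a quote of Azure's rate card.

```python
def spiral_cost(rounds: int, low_per_round: float, high_per_round: float) -> tuple:
    """Return the (low, high) dollar cost of `rounds` inference rounds."""
    return (round(rounds * low_per_round, 2), round(rounds * high_per_round, 2))

# Illustrative range from the text: $0.05 to $0.20 per inference round.
expected = spiral_cost(2, 0.05, 0.20)   # the task as it should have run
spiral = spiral_cost(8, 0.05, 0.20)     # the same task after eight spiral cycles
wasted = (round(spiral[0] - expected[0], 2), round(spiral[1] - expected[1], 2))

print("expected:", expected)
print("spiral:  ", spiral)
print("wasted:  ", wasted)
```

The difference between the two ranges is the spend that a pattern-aware breaker is meant to recover.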

Detection requires reading the RunStep list after each tool submission and comparing the tool call arguments of consecutive steps targeting the same function. Token-set Jaccard similarity on the JSON-serialized argument map works reliably: if the same function is called three or more times in a sliding window of five steps with argument similarity above 0.80, the agent is spiraling.
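
To make the similarity measure concrete, here is a small self-contained sketch of token-set Jaccard applied to the search_documents example above. The helper names token_set and jaccard are illustrative, and the third query is an invented contrast case.

```python
import json
import re

def token_set(args: dict) -> set:
    """Lowercased word tokens of the JSON-serialized argument map."""
    return set(re.findall(r"\w+", json.dumps(args, sort_keys=True).lower()))

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

first = token_set({"query": "Q3 revenue forecast"})
rephrased = token_set({"query": "revenue forecast Q3"})
unrelated = token_set({"query": "Q4 headcount plan"})  # invented contrast case

same_intent = jaccard(first, rephrased)   # word order is ignored by a token set
different = jaccard(first, unrelated)     # only the key name "query" overlaps

print(same_intent, different)
```

Because the comparison runs on sets, a reordered query scores as identical, which is the behavior the spiral detector wants. Note that argument key names count as tokens too, so very short argument maps score slightly higher than their values alone would suggest.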

Failure mode 2: File search query fixation

File search query fixation occurs when the agent issues semantically identical queries to its attached vector stores across consecutive run steps, each time receiving the same top-K retrieved passages and failing to advance. Unlike function tools, file search queries are not submitted by your code — they're issued automatically by the Azure orchestrator when the model determines retrieval is needed. This makes them harder to intercept but not impossible: the queries and results are visible in the run step details API when you enable the include parameter.

To inspect file search queries in real time, poll run steps with:

steps = client.agents.list_run_steps(
    thread_id=thread_id,
    run_id=run_id,
    include=["step_details.tool_calls[*].file_search.results[*].content"]
)
# Some SDK versions return a pageable wrapper with the items under .data.
for step in getattr(steps, "data", steps):
    if step.type == "tool_calls":
        for tc in step.step_details.tool_calls:
            if tc.type == "file_search":
                # tc.file_search.queries contains the query strings
                # tc.file_search.results contains retrieved chunks
                pass

The cost structure for file search fixation compounds through three billing layers: per-query retrieval pricing on the Azure AI Search instance backing the vector store, the cost of injecting retrieved chunks (each chunk adds input tokens to the next inference round), and the inference round itself. An agent that fires eight semantically identical file search queries in one run is paying for eight retrieval operations, eight batches of retrieved content inflating its context window, and eight inference rounds to process redundant content. The per-chunk context inflation is particularly expensive: top-K=5 retrieved passages at ~500 tokens each add 2,500 input tokens per query to a context that already contains the thread history.
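
The context-inflation figure is simple multiplication, sketched below with the illustrative numbers from the paragraph above (top-K of 5, roughly 500 tokens per chunk). The function name is hypothetical, and real chunk sizes depend on how the vector store was chunked.

```python
def retrieved_context_tokens(queries: int, top_k: int, tokens_per_chunk: int) -> int:
    """Input tokens added to the context by retrieved chunks."""
    return queries * top_k * tokens_per_chunk

per_query = retrieved_context_tokens(1, top_k=5, tokens_per_chunk=500)
fixated_run = retrieved_context_tokens(8, top_k=5, tokens_per_chunk=500)
redundant = fixated_run - per_query  # everything after the first query repeats it

print(per_query, fixated_run, redundant)
```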

Query fixation typically appears when:

The vector store doesn't contain a document that satisfies the model's question — the query hits a coverage gap and the model retries with trivial reformulations that still miss.
The retrieved passages partially answer the question, leaving a follow-up that the model resolves by re-querying with refinements that semantically map to the same document chunks.
The agent instructions tell it to "find comprehensive evidence before answering" without a mechanism to stop if evidence isn't found after N attempts.

Failure mode 3: Thread token drift

Thread token drift is not a loop in the traditional sense — it doesn't repeat a specific action. It's a cost accumulation failure mode that compounds across runs within a long-lived thread. Each new run on a thread carries the entire prior message history as input context. A thread that has grown to 40 messages, each with multi-paragraph content, can add 15,000–30,000 input tokens to every subsequent run before the model processes a single character of the new request.

Azure's truncation_strategy parameter offers partial mitigation: you can set type="last_messages" with a last_messages value to cap the number of prior messages included. But this is a coarse control: it truncates by message count rather than by semantic relevance, and it doesn't prevent drift within a single long run as the run's own step outputs accumulate in the context.
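
A simplified model shows why a message-count cap is coarse. The sketch below assumes input size is just the sum of per-message token counts (it ignores the system prompt, tool outputs, and retrieved chunks); run_input_tokens and keep_last are hypothetical names, not SDK parameters.

```python
from typing import Optional

def run_input_tokens(message_tokens: list, keep_last: Optional[int] = None) -> int:
    """Tokens carried into a run when only the last `keep_last` messages are kept."""
    if keep_last is None:
        kept = message_tokens
    else:
        kept = message_tokens[max(len(message_tokens) - keep_last, 0):]
    return sum(kept)

uniform = [500] * 40                 # 40 messages of about 500 tokens each
skewed = [200] * 39 + [12_000]       # one pasted document among the recent turns

full = run_input_tokens(uniform)
capped = run_input_tokens(uniform, keep_last=10)
capped_skewed = run_input_tokens(skewed, keep_last=10)

print(full, capped, capped_skewed)
```

The cap bounds the number of messages, not the number of tokens: a single large recent message still dominates the input.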

The drift becomes a budget problem when:

The agent is handling a long customer support session where each new user message triggers a run with the entire prior transcript as context, and the transcript grows unbounded.
The agent revisits resolved topics, re-asking a question it already answered five turns earlier because the prior answer is buried deep in a long context where the model attends to it less reliably.
The model starts producing recap summaries of prior steps in its responses, inflating output tokens on steps where the actual task output is small, which then feeds back into the next run's input context.

Detection uses the usage object returned on completed runs and on individual run steps. A circuit breaker that tracks per-run input token counts across a thread and computes a rolling growth rate can trip when the projected cost for the next N runs, extrapolated from the current growth trend, exceeds a session budget threshold.
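
One way to sketch the projection described above is to extrapolate from the average per-run growth ratio. The geometric-mean growth model and the names projected_input_tokens and would_exceed are assumptions made for illustration; a production breaker might prefer a linear fit or a cost-weighted estimate.

```python
def projected_input_tokens(history: list, next_runs: int) -> float:
    """Extrapolate total input tokens for the next `next_runs` runs."""
    if not history:
        return 0.0
    if len(history) < 2 or history[0] <= 0:
        return float(history[-1] * next_runs)  # no trend yet: assume flat usage
    # Geometric-mean growth ratio across the observed runs.
    growth = (history[-1] / history[0]) ** (1 / (len(history) - 1))
    total, current = 0.0, float(history[-1])
    for _ in range(next_runs):
        current *= growth
        total += current
    return total

def would_exceed(history: list, next_runs: int, budget_tokens: int) -> bool:
    return projected_input_tokens(history, next_runs) > budget_tokens

growing = [10_000, 12_000, 14_400]   # 1.2x growth per run
flat = [5_000, 5_000, 5_000]

print(projected_input_tokens(growing, 3), would_exceed(growing, 3, 60_000))
print(projected_input_tokens(flat, 3), would_exceed(flat, 3, 60_000))
```

Tripping on the projection rather than on the running total lets the breaker stop a thread a few runs before the budget is actually spent.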

Failure mode 4: Connected agent re-delegation loop

Azure AI Agent Service's connected agents feature (introduced in 2025) allows a coordinator agent to invoke named sub-agents as tools. The coordinator calls a sub-agent tool, the sub-agent runs its own complete autonomous loop on its own thread, and returns a text result. The coordinator then continues its run incorporating the sub-agent's result.

The re-delegation failure mode occurs when the coordinator repeatedly invokes the same sub-agent because it doesn't find the returned result satisfactory — either due to under-specified task instructions, a quality criterion the sub-agent consistently misses, or a validation pattern where the coordinator uses a second sub-agent to check the first sub-agent's work, which flags it as incomplete, triggering re-delegation.

The cost structure is multiplicative:

total_cost ≈ coordinator_inference_rounds
           × (sub_agent_inference_rounds × token_cost_per_round)
           + sub_agent_invocations × thread_creation_overhead

A coordinator that invokes the same sub-agent four times generates four complete sub-agent runs, each potentially consuming thousands of tokens across multiple internal inference steps. If the sub-agent itself uses file search or code interpreter tools, those costs stack on top. On complex research or code-generation tasks, each sub-agent invocation can cost $0.50–$2.00; four invocations on a task that should have needed one cost $2.00–$8.00 instead of $0.50–$2.00.

Re-delegation loops appear most frequently when:

The coordinator's instructions use quality criteria that the sub-agent's responses never fully satisfy (e.g., "delegate to the analyst until you have a definitive answer — if the analysis is ambiguous, delegate again").
A validator sub-agent always returns improvement suggestions even on high-quality outputs, because its system prompt instructs it to "always find areas for refinement."
Two sub-agents are checking each other's work in a cycle where neither is empowered to declare the result final.

Building the AzureAgentBreaker

The AzureAgentBreaker wraps the standard azure-ai-projects run polling loop. Instead of a simple create_and_process_run call, it polls run steps explicitly, applying the four detection methods after each step is completed. When any method trips, it cancels the active run via the Azure API and raises an AgentBreakerTripError with a diagnostic payload.

from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    RunStepStatus, RunStatus, ToolCallType,
    RequiredActionType
)
from azure.identity import DefaultAzureCredential
from dataclasses import dataclass, field
from enum import Enum
import json, time

class TripReason(Enum):
    TOOL_CALL_SPIRAL    = "tool_call_spiral"
    FILE_SEARCH_FIXATION = "file_search_fixation"
    THREAD_TOKEN_DRIFT  = "thread_token_drift"
    CONNECTED_AGENT_LOOP = "connected_agent_loop"

class AgentBreakerTripError(Exception):
    def __init__(self, reason: TripReason, detail: str, run_id: str = ""):
        self.reason = reason
        self.detail = detail
        self.run_id = run_id
        super().__init__(f"AzureAgentBreaker tripped: {reason.value} — {detail}")

@dataclass
class BreakerConfig:
    # Tool-call spiral detection
    spiral_window: int = 5              # run steps to look back
    spiral_min_repeats: int = 3         # same function in window to trip
    spiral_similarity: float = 0.80     # min Jaccard similarity to count

    # File search fixation
    search_fixation_window: int = 3     # consecutive near-identical queries
    search_similarity: float = 0.75     # min Jaccard similarity

    # Thread token drift
    thread_budget_tokens: int = 60_000  # total input token cap across runs
    drift_window: int = 3               # runs to measure growth rate over
    drift_max_ratio: float = 1.30       # max allowed input-token growth per run

    # Connected agent re-delegation
    max_subagent_invocations: int = 3   # same sub-agent calls before trip
    subagent_similarity: float = 0.70   # task argument similarity threshold

    # Polling
    poll_interval_seconds: float = 1.0
    max_poll_seconds: float = 300.0

@dataclass
class BreakerRunState:
    tool_call_log: list = field(default_factory=list)   # (fn_name, args_tokens) per step
    search_query_log: list = field(default_factory=list) # str per file search step
    subagent_log: list = field(default_factory=list)     # (agent_name, args_tokens) per step
    seen_step_ids: set = field(default_factory=set)

def _tokenize(text: str) -> set:
    """Produce a token set for Jaccard comparison from serialized JSON/text."""
    import re
    return set(re.findall(r'\w+', text.lower()))

def _jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class AzureAgentBreaker:
    """
    Wraps azure-ai-projects run execution with circuit-breaker detection
    for the four Azure AI Agent Service cost failure modes.
    """

    def __init__(
        self,
        client: AIProjectClient,
        agent_id: str,
        config: BreakerConfig | None = None,
    ):
        self.client = client
        self.agent_id = agent_id
        self.cfg = config or BreakerConfig()
        # Prompt tokens per run. Kept on the instance, so use one breaker per thread
        # if the budget and drift checks should be thread-scoped.
        self._thread_token_history: list[int] = []

    def run(self, thread_id: str, additional_instructions: str = "") -> str:
        """
        Create and drive a run on thread_id, applying breaker detection
        after each completed run step. Returns the final assistant message.
        Raises AgentBreakerTripError if any detection method fires.
        Cancels the active run before raising.
        """
        kwargs = dict(agent_id=self.agent_id, thread_id=thread_id)
        if additional_instructions:
            kwargs["additional_instructions"] = additional_instructions

        run = self.client.agents.create_run(**kwargs)
        state = BreakerRunState()

        try:
            return self._poll_run(thread_id, run.id, state)
        except AgentBreakerTripError:
            self._cancel_run(thread_id, run.id)
            raise

    def _cancel_run(self, thread_id: str, run_id: str) -> None:
        try:
            self.client.agents.cancel_run(thread_id=thread_id, run_id=run_id)
        except Exception:
            pass  # best-effort; don't shadow the trip error

    def _poll_run(self, thread_id: str, run_id: str, state: BreakerRunState) -> str:
        deadline = time.monotonic() + self.cfg.max_poll_seconds

        while time.monotonic() < deadline:
            run = self.client.agents.get_run(thread_id=thread_id, run_id=run_id)

            if run.status == RunStatus.REQUIRES_ACTION:
                tool_outputs = self._handle_tool_calls(run, state, thread_id, run_id)
                self.client.agents.submit_tool_outputs_to_run(
                    thread_id=thread_id,
                    run_id=run_id,
                    tool_outputs=tool_outputs,
                )

            elif run.status in (RunStatus.COMPLETED, RunStatus.FAILED,
                                RunStatus.CANCELLED, RunStatus.EXPIRED,
                                RunStatus.INCOMPLETE):
                self._record_run_tokens(run)
                self._check_thread_drift(thread_id, run_id)
                return self._extract_last_message(thread_id)

            self._check_new_steps(thread_id, run_id, state)
            time.sleep(self.cfg.poll_interval_seconds)

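        # Poll deadline reached. This is a timeout, not a drift signal, so cancel
        # here (run() only cancels on trip errors) and raise TimeoutError.
        self._cancel_run(thread_id, run_id)
        raise TimeoutError(
            f"Run {run_id} exceeded poll deadline of {self.cfg.max_poll_seconds}s"
        )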

    def _handle_tool_calls(self, run, state: BreakerRunState, thread_id: str, run_id: str) -> list:
        required = run.required_action
        if required.type != RequiredActionType.SUBMIT_TOOL_OUTPUTS:
            return []

        outputs = []
        for tc in required.submit_tool_outputs.tool_calls:
            fn_name = tc.function.name
            fn_args = tc.function.arguments or "{}"
            args_tokens = _tokenize(fn_args)

            if fn_name == "connected_agent":
                # Assumption: connected-agent invocations surface here as function
                # tool calls named "connected_agent". If your sub-agents appear under
                # their own registered tool names, match against those names instead.
                self._check_connected_agent(fn_name, args_tokens, state, run_id)
            else:
                self._check_tool_spiral(fn_name, args_tokens, state, run_id)

            # Call the actual tool; for connected agents the SDK handles the sub-run.
            # For regular functions, dispatch to your registered handlers.
            result = self._dispatch_tool(fn_name, fn_args)
            outputs.append({"tool_call_id": tc.id, "output": str(result)})

        return outputs

    def _dispatch_tool(self, fn_name: str, fn_args: str) -> str:
        """
        Override this method to wire up your actual function implementations.
        Default returns a placeholder so the skeleton compiles standalone.
        """
        return f"Tool {fn_name} executed."

    def _check_new_steps(self, thread_id: str, run_id: str, state: BreakerRunState) -> None:
        steps = self.client.agents.list_run_steps(
            thread_id=thread_id,
            run_id=run_id,
            include=["step_details.tool_calls[*].file_search.results[*].content"],
        )
        # Some SDK versions return a pageable wrapper with the items under .data.
        for step in getattr(steps, "data", steps):
            if step.id in state.seen_step_ids:
                continue
            if step.status != RunStepStatus.COMPLETED:
                continue
            state.seen_step_ids.add(step.id)

            if step.type == "tool_calls":
                for tc in step.step_details.tool_calls:
                    if tc.type == ToolCallType.FILE_SEARCH:
                        for q in getattr(tc.file_search, "queries", []):
                            self._check_search_fixation(q, state, run_id)

    def _check_tool_spiral(
        self, fn_name: str, args_tokens: set, state: BreakerRunState, run_id: str
    ) -> None:
        state.tool_call_log.append((fn_name, args_tokens))
        window = state.tool_call_log[-self.cfg.spiral_window:]
        same_fn = [(n, t) for n, t in window if n == fn_name]
        if len(same_fn) < self.cfg.spiral_min_repeats:
            return
        # Check pairwise Jaccard on the most recent repeats
        recent = [t for _, t in same_fn[-self.cfg.spiral_min_repeats:]]
        for i in range(1, len(recent)):
            if _jaccard(recent[i - 1], recent[i]) < self.cfg.spiral_similarity:
                return
        raise AgentBreakerTripError(
            TripReason.TOOL_CALL_SPIRAL,
            (f"Function '{fn_name}' called {self.cfg.spiral_min_repeats}+ times "
             f"with >{self.cfg.spiral_similarity:.0%} argument similarity "
             f"in last {self.cfg.spiral_window} steps."),
            run_id,
        )

    def _check_search_fixation(self, query: str, state: BreakerRunState, run_id: str) -> None:
        state.search_query_log.append(_tokenize(query))
        window = state.search_query_log[-self.cfg.search_fixation_window:]
        if len(window) < self.cfg.search_fixation_window:
            return
        for i in range(1, len(window)):
            if _jaccard(window[i - 1], window[i]) < self.cfg.search_similarity:
                return
        raise AgentBreakerTripError(
            TripReason.FILE_SEARCH_FIXATION,
            (f"File search issued {self.cfg.search_fixation_window} consecutive queries "
             f"with >{self.cfg.search_similarity:.0%} similarity: last query tokens={window[-1]}"),
            run_id,
        )

    def _check_connected_agent(
        self, agent_tool_name: str, args_tokens: set, state: BreakerRunState, run_id: str
    ) -> None:
        state.subagent_log.append((agent_tool_name, args_tokens))
        same = [(n, t) for n, t in state.subagent_log if n == agent_tool_name]
        if len(same) < self.cfg.max_subagent_invocations:
            return
        recent = [t for _, t in same[-self.cfg.max_subagent_invocations:]]
        for i in range(1, len(recent)):
            if _jaccard(recent[i - 1], recent[i]) < self.cfg.subagent_similarity:
                return
        raise AgentBreakerTripError(
            TripReason.CONNECTED_AGENT_LOOP,
            (f"Connected agent '{agent_tool_name}' invoked "
             f"{self.cfg.max_subagent_invocations}+ times with >"
             f"{self.cfg.subagent_similarity:.0%} task similarity."),
            run_id,
        )

    def _record_run_tokens(self, run) -> None:
        if run.usage and run.usage.prompt_tokens:
            self._thread_token_history.append(run.usage.prompt_tokens)

    def _check_thread_drift(self, thread_id: str, run_id: str) -> None:
        history = self._thread_token_history
        total = sum(history)
        if total > self.cfg.thread_budget_tokens:
            raise AgentBreakerTripError(
                TripReason.THREAD_TOKEN_DRIFT,
                (f"Thread cumulative prompt tokens {total:,} exceeded "
                 f"budget {self.cfg.thread_budget_tokens:,}."),
                run_id,
            )
        if len(history) >= self.cfg.drift_window:
            window = history[-self.cfg.drift_window:]
            for i in range(1, len(window)):
                if window[i - 1] > 0:
                    ratio = window[i] / window[i - 1]
                    if ratio > self.cfg.drift_max_ratio:
                        raise AgentBreakerTripError(
                            TripReason.THREAD_TOKEN_DRIFT,
                            (f"Prompt token growth ratio {ratio:.2f}x exceeds "
                             f"threshold {self.cfg.drift_max_ratio}x — thread context growing unsustainably."),
                            run_id,
                        )

    def _extract_last_message(self, thread_id: str) -> str:
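        messages = self.client.agents.list_messages(thread_id=thread_id)
        # Some SDK versions return a pageable wrapper with the items under .data.
        for msg in getattr(messages, "data", messages):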
            if msg.role == "assistant":
                for block in msg.content:
                    if hasattr(block, "text"):
                        return block.text.value
        return ""

Wiring the breaker into a production Azure AI agent

The AzureAgentBreaker wraps your existing run execution. If you're currently polling runs like this:

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

client = AIProjectClient.from_connection_string(
    conn_str="",  # set this to your project connection string
    credential=DefaultAzureCredential(),
)

def handle_user_message(agent_id: str, thread_id: str, message: str) -> str:
    client.agents.create_message(
        thread_id=thread_id, role="user", content=message
    )
    run = client.agents.create_and_process_run(
        thread_id=thread_id, agent_id=agent_id
    )
    messages = client.agents.list_messages(thread_id=thread_id)
    return messages[0].content[0].text.value

The migration to AzureAgentBreaker replaces the create_and_process_run call:

from azure_agent_breaker import AzureAgentBreaker, BreakerConfig, AgentBreakerTripError

config = BreakerConfig(
    thread_budget_tokens=50_000,       # trip at 50K cumulative prompt tokens
    spiral_min_repeats=3,              # trip after 3 similar tool calls
    search_fixation_window=3,          # trip after 3 similar file search queries
    max_subagent_invocations=3,        # trip after 3 similar sub-agent calls
)

# The breaker keeps its prompt-token history on the instance, so cache one
# breaker per thread to keep the thread budget from mixing sessions.
breakers: dict[str, AzureAgentBreaker] = {}

def handle_user_message(agent_id: str, thread_id: str, message: str) -> str:
    breaker = breakers.get(thread_id)
    if breaker is None:
        breaker = breakers[thread_id] = AzureAgentBreaker(
            client=client, agent_id=agent_id, config=config
        )
    client.agents.create_message(
        thread_id=thread_id, role="user", content=message
    )
    try:
        return breaker.run(thread_id)
    except AgentBreakerTripError as e:
        # Log to Azure Monitor, emit an alert, return graceful degradation
        print(f"[AzureAgentBreaker] {e.reason.value}: {e.detail}")
        return "I was unable to complete this request. Please try again."

Handling REQUIRES_ACTION with your tools

The _dispatch_tool method in AzureAgentBreaker is where you wire up your function implementations. Override it with a registry dispatch pattern:

from azure_agent_breaker import AzureAgentBreaker, BreakerConfig
import json

TOOL_REGISTRY = {
    "get_customer_order": lambda args: get_customer_order(**json.loads(args)),
    "search_documents":   lambda args: search_documents(**json.loads(args)),
    "send_email":         lambda args: send_email(**json.loads(args)),
}

class MyBreaker(AzureAgentBreaker):
    def _dispatch_tool(self, fn_name: str, fn_args: str) -> str:
        handler = TOOL_REGISTRY.get(fn_name)
        if not handler:
            return json.dumps({"error": f"Unknown tool: {fn_name}"})
        try:
            result = handler(fn_args)
            return json.dumps(result) if not isinstance(result, str) else result
        except Exception as ex:
            return json.dumps({"error": str(ex)})

Tuning the configuration for your workload

The defaults in BreakerConfig are conservative starting points designed to catch obvious failure modes without false-positiving on legitimate multi-step reasoning. The right values depend on your agent's expected behavior:

Parameter	Type	Default	When to raise	When to lower
spiral_min_repeats	int	3	Agent legitimately retries with confirmation (e.g., write-then-verify pattern where the same read tool is called pre- and post-write)	Question-answering agents where calling the same tool twice signals a stuck state
search_fixation_window	int	3	Agent uses iterative query refinement where the first query is intentionally broad and the second narrows (both may have high token overlap)	Single-shot RAG agents where any repeated retrieval indicates a gap in the vector store rather than a legitimate refinement
thread_budget_tokens	int	60,000	Long customer support sessions with large prior context are expected; complex research threads with many prior messages	Single-turn or short-session agents (FAQ bots, classification agents) where 10K tokens should suffice per session
max_subagent_invocations	int	3	Coordinator uses iterative refinement (draft → review → revise) where two sub-agent calls are expected before the coordinator finalizes	Simple delegation pipelines where one sub-agent call per task is the design intent
drift_max_ratio	float	1.30	Sessions with large document uploads mid-conversation; threads where one verbose turn legitimately expands context significantly	Stateless single-turn agents where any prompt token growth indicates thread misconfiguration

The most reliable calibration approach: run a representative sample of production traces through the breaker in dry-run mode (log trips without raising) for a week, then adjust thresholds to eliminate false positives before switching to enforcement mode.
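
The breaker code above has no built-in dry-run switch, so one possible shape for it is a small wrapper around each detection check. The sketch below is self-contained: TripError stands in for AgentBreakerTripError, and guarded, always_trips, and trip_log are hypothetical names.

```python
from typing import Callable

class TripError(Exception):
    """Stand-in for AgentBreakerTripError so the sketch runs on its own."""

def guarded(check: Callable[[], None], dry_run: bool, trip_log: list) -> bool:
    """Run one detection check and report whether it tripped.

    In dry-run mode the trip is recorded and swallowed; in enforcement
    mode the original exception propagates to the caller.
    """
    try:
        check()
    except TripError as err:
        if not dry_run:
            raise
        trip_log.append(str(err))
        return True
    return False

def always_trips() -> None:
    raise TripError("tool_call_spiral: search_documents repeated 3 times")

trip_log: list = []
tripped = guarded(always_trips, dry_run=True, trip_log=trip_log)
print(tripped, trip_log)
```

Reviewing the collected log entries against known-good traces is what tells you which thresholds are producing false positives.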

Connecting to Azure Monitor

Azure AI Agent Service emits traces to Azure Monitor when you enable diagnostic logging on the AI Foundry project. The AzureAgentBreaker's trip events complement the native traces:

Forward all AgentBreakerTripError events to a custom Azure Monitor metric (e.g., namespace RunGuard/AzureAgent, metric name BreakerTrips) with dimensions for AgentId, TripReason, and ThreadId.
Create an Azure Monitor alert rule that fires when any trip reason exceeds 5 events per hour — this surfaces systematic agent instruction problems producing recurring spirals rather than one-off incidents.
Log the detail string from each AgentBreakerTripError to Application Insights as a custom event — the step-level detail is the most actionable artifact for diagnosing what the agent was attempting when it looped.
Wire the trip event to an Azure Logic App or webhook to page on-call when the THREAD_TOKEN_DRIFT reason fires — unlike the other three modes which indicate instruction bugs, thread drift often indicates a usage pattern problem that requires a thread management design change.
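
As a starting point for the forwarding step, a trip can be flattened into a structured record carrying the dimensions listed above. This sketch only builds and logs the record with the standard library; how it reaches Azure Monitor or Application Insights (exporter, agent, or SDK) is left open, and the function names and example IDs are hypothetical.

```python
import json
import logging

logger = logging.getLogger("agent_breaker")

def trip_event(agent_id: str, thread_id: str, reason: str, detail: str) -> dict:
    """Build one metric-style record for a breaker trip."""
    return {
        "namespace": "RunGuard/AzureAgent",
        "metric": "BreakerTrips",
        "value": 1,
        "dimensions": {
            "AgentId": agent_id,
            "TripReason": reason,
            "ThreadId": thread_id,
        },
        "detail": detail,  # the step-level diagnostic string from the trip error
    }

def log_trip(event: dict) -> None:
    """Emit the record as one JSON line so a log forwarder can parse it."""
    logger.warning(json.dumps(event, sort_keys=True))

event = trip_event(
    agent_id="agent-123",            # hypothetical IDs for illustration
    thread_id="thread-456",
    reason="tool_call_spiral",
    detail="Function 'search_documents' called 3+ times",
)
log_trip(event)
```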

This mirrors the approach described in our AI Agent Cost Engineering Production Guide — the AzureAgentBreaker is the trip sensor; Azure Monitor is the alarm layer; Application Insights is the incident log.

Thread management to prevent drift at the source

Thread token drift is the failure mode most worth preventing at design time rather than detecting at runtime. Four patterns reduce drift structurally:

Session threads vs. persistent threads. Create a new thread per user session (or per task) rather than one thread per user that grows indefinitely. The cost of thread creation is negligible; the cost of a 200-message thread on every run is not.
Explicit truncation strategy. Set truncation_strategy=TruncationObject(type="last_messages", last_messages=10) on each run creation when you know the agent only needs recent context to answer the current question. This is a hard cap: the orchestrator will not include messages older than the last 10, regardless of thread size.
Periodic thread summarization. When a thread reaches N messages, create a new thread with a single synthetic "summary" message containing the AI-generated summary of the prior conversation. This compresses the context while preserving continuity across sessions.
Stateless agents for single-turn tasks. For agents that answer one question and return — classification, extraction, single-document summarization — don't use persistent threads at all. Create a thread, add one message, run, extract the response, delete the thread. Thread deletion is free and the per-invocation pattern eliminates drift by construction.
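The periodic summarization pattern reduces to one decision made before each run. Below is a minimal sketch with the service calls injected as plain callables, so nothing here presumes a particular SDK method: `maybe_roll_over`, `summarize`, and `create_thread_with_message` are hypothetical names, and in practice the last two would wrap a model call and your thread-creation code.

```python
from typing import Callable, List, Tuple


def maybe_roll_over(
    thread_id: str,
    messages: List[str],
    max_messages: int,
    summarize: Callable[[List[str]], str],
    create_thread_with_message: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (thread_id_to_use, rolled_over) for the next run."""
    if len(messages) < max_messages:
        # Thread is still small enough; keep using it.
        return thread_id, False
    summary = summarize(messages)
    # Seed the replacement thread with a single synthetic summary message.
    new_thread_id = create_thread_with_message(
        "Summary of prior conversation:\n" + summary
    )
    return new_thread_id, True
```

Calling this before every run keeps the context bounded by max_messages plus one summary, at the cost of one extra summarization call per rollover.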

FAQ

Does AzureAgentBreaker work with all Azure OpenAI model deployments, or only GPT-4o?

The breaker operates at the Azure AI Agent Service API level, which is model-agnostic from the SDK perspective. The run step structure, tool call schema, and usage metadata format are consistent across gpt-4o, gpt-4o-mini, o1, and o3 deployments. The only model-specific consideration is the o1/o3 series: these models generate hidden reasoning tokens, which are reported as completion tokens and make per-run token usage higher and more variable. Raise the drift_max_ratio threshold slightly for reasoning model deployments to avoid false-positive drift trips.

Does polling run steps add meaningful latency or cost?

The list_run_steps call is a lightweight paginated read against the Azure AI Foundry storage layer. It adds one additional API call per poll interval (default 1 second) — negligible compared to model inference latency (typically 2–30 seconds per step). There is no per-step-list billing; you pay for the model tokens and tool executions regardless. The file search include parameter does expand the response payload, but the data transferred is small relative to a typical run's inference cost.

How do I distinguish legitimate retry patterns from spirals?

Two approaches: raise the spiral_min_repeats threshold for agents with known retry-on-failure patterns, and raise the spiral_similarity threshold so that only near-identical arguments count as a spiral repeat. You can also add a per-function allowlist in _check_tool_spiral: for example, a health-check or status-poll function that is legitimately called multiple times per run should be excluded from spiral detection. The Jaccard similarity check already handles the case where the model is making progress with clearly different arguments: in a sequence such as {"page": 1} → {"page": 2} → {"page": 3}, each consecutive pair has low token overlap and won't trigger a spiral trip.
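To make the pagination case concrete, here is a small sketch of a token-level Jaccard check with an allowlist. The tokenizer (lowercase alphanumeric runs) and the `is_spiral` helper are assumptions for illustration and may differ from what _check_tool_spiral actually does. The parameter names mirror spiral_similarity and spiral_min_repeats.

```python
import re
from typing import FrozenSet, List, Tuple


def arg_tokens(arguments: str) -> FrozenSet[str]:
    """Split a JSON arguments string into lowercase alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9_]+", arguments.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B| (1.0 when both are empty)."""
    ta, tb = arg_tokens(a), arg_tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_spiral(calls: List[Tuple[str, str]], similarity: float = 0.8,
              min_repeats: int = 3,
              allowlist: FrozenSet[str] = frozenset()) -> bool:
    """True if min_repeats consecutive calls to one function look alike."""
    streak = 1
    for (prev_name, prev_args), (name, args) in zip(calls, calls[1:]):
        same_fn = name == prev_name and name not in allowlist
        if same_fn and jaccard(prev_args, args) >= similarity:
            streak += 1
            if streak >= min_repeats:
                return True
        else:
            streak = 1  # a different or dissimilar call resets the streak
    return False
```

Under this tokenizer, {"page": 1} and {"page": 2} share one token out of three, a similarity of about 0.33, so a paging loop stays well under a 0.8 threshold, while three identical calls trip immediately.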

Can I run AzureAgentBreaker in a multi-threaded service without state leakage between sessions?

Yes. The BreakerRunState object is created fresh on each run() call, so all per-run detection state is isolated to that invocation. The one shared instance-level field is _thread_token_history, which intentionally accumulates across runs on the same breaker instance to detect drift. For a high-concurrency service, create one AzureAgentBreaker instance per agent_id, or one per session if you need independent thread histories. Alternatively, subclass and override _record_run_tokens/_check_thread_drift to store history in a dict keyed by thread_id, which gives per-thread drift tracking across concurrent sessions.
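The thread_id-keyed variant might look like the sketch below. It is a standalone class rather than a real subclass, because the signatures of _record_run_tokens and _check_thread_drift are not shown here. `ThreadDriftTracker`, `min_runs`, and the drift definition (latest prompt tokens divided by the thread's first recorded run) are all assumptions.

```python
import threading
from collections import defaultdict
from typing import DefaultDict, List


class ThreadDriftTracker:
    """Hypothetical per-thread prompt-token history, safe to share across workers."""

    def __init__(self, drift_max_ratio: float = 3.0, min_runs: int = 3):
        self._lock = threading.Lock()
        self._history: DefaultDict[str, List[int]] = defaultdict(list)
        self.drift_max_ratio = drift_max_ratio
        self.min_runs = min_runs

    def record(self, thread_id: str, prompt_tokens: int) -> bool:
        """Record one run's prompt tokens; return True if this thread has drifted."""
        with self._lock:
            history = self._history[thread_id]
            history.append(prompt_tokens)
            if len(history) < self.min_runs:
                return False  # not enough runs to judge a trend
            baseline = history[0]
            # Assumed drift rule: latest run vs. the thread's first run.
            return baseline > 0 and prompt_tokens / baseline > self.drift_max_ratio

    def forget(self, thread_id: str) -> None:
        """Drop history when a thread is deleted or rolled over."""
        with self._lock:
            self._history.pop(thread_id, None)
```

A single lock is enough at typical polling rates. Calling forget() when a thread is deleted keeps the dict from growing without bound in a long-lived service.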

Does this integrate with RunGuard's hosted dashboard?

The AzureAgentBreaker pattern shown here is a standalone implementation of RunGuard's circuit-breaker logic. RunGuard's SDK provides the same detection logic as a hosted service: trip events are forwarded to the RunGuard dashboard, which shows trip rate per agent, trip reason distribution, and session cost trend across your entire Azure AI Agent fleet — without requiring you to instrument each agent individually. The Solo plan at $19/mo covers one agent with up to 1M guarded invocations per month.

Stop paying for loops

AzureAgentBreaker is one implementation of RunGuard's circuit-breaker pattern for managed agent platforms. RunGuard monitors tool-call spirals, file search fixation, thread token growth, and connected-agent re-delegation across your Azure AI Agent fleet — alerting before a single looping run hits your Azure OpenAI bill.

See pricing — Solo $19/mo

Also in this series