OpenAI Agents SDK: two loop failure modes you need to guard against before production
OpenAI’s Agents SDK (the successor to Swarm, released in 2025 as a production-ready multi-agent framework) makes it easy to build networks of specialized agents that hand off to each other. That convenience enables the first failure mode: a `triage_agent` hands off to a `billing_agent`, which hands back to `triage_agent`, which hands off to `billing_agent` again. Unless something breaks the cycle, this handoff loop keeps running until a turn limit fires or the process is killed, generating an LLM call on every hop. The second failure mode occurs within a single agent: a tool-call loop in which the agent calls the same function with the same arguments repeatedly because the function’s output never advances the agent’s goal. The Agents SDK has no built-in loop detection or per-run dollar cap. This page shows how to add both.
Failure mode 1: the handoff loop
- How Agents SDK handoffs work. In the Agents SDK, an agent signals a handoff by calling a `transfer_to_<agent_name>` function, which is registered as a tool. The SDK’s runner intercepts the tool call, switches the active agent to the target, and continues the conversation with the new agent. Handoffs are the core multi-agent coordination primitive: a routing agent classifies the user’s intent and hands off to the appropriate specialist.
- The loop pattern. A handoff loop occurs when Agent A hands off to Agent B, and Agent B’s response causes Agent A to be selected again. The most common cause: Agent B receives a task it cannot handle (wrong scope, missing information, or ambiguous intent), so it signals a handoff back to Agent A (or to a triage agent that re-routes to Agent B). The loop continues until the SDK’s `max_turns` limit fires (10 by default) or the process is killed. At GPT-4o pricing, a 10-turn handoff loop between two agents costs roughly $0.10–$0.50 depending on context size: small for a single loop, catastrophic if triggered by many users simultaneously.
- Detection: track handoff signatures. Each handoff is a tool call whose function name starts with `transfer_to_`. The handoff sequence is detectable as a repeating pattern in the run’s tool-call history: `transfer_to_billing` → `transfer_to_triage` → `transfer_to_billing` → … This is a period-2 cycle in the handoff sequence. RunGuard’s `max_cycle_len=4` setting catches period-1 and period-2 cycles and fires on the third repetition.
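The detection idea above can be sketched in a few lines. This is an illustrative sketch only, not RunGuard’s actual algorithm: the function name `find_repeating_cycle` is hypothetical, and its `max_cycle_len` and `repeats` parameters are assumed to mean "longest pattern to look for" and "number of back-to-back copies that counts as a loop".

```python
from typing import Optional, Sequence, Tuple


def find_repeating_cycle(
    history: Sequence[str], max_cycle_len: int = 4, repeats: int = 3
) -> Optional[Tuple[str, ...]]:
    """Return the pattern if the tail of `history` is `repeats` copies of it.

    Hypothetical helper: checks periods 1..max_cycle_len, shortest first.
    """
    for period in range(1, max_cycle_len + 1):
        needed = period * repeats
        if len(history) < needed:
            break  # longer periods would need even more history
        tail = tuple(history[-needed:])
        pattern = tail[:period]
        # The tail is a loop if every element matches the pattern at its offset.
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return pattern
    return None


# Handoff names as they would appear in the run's tool-call history.
handoffs = ["transfer_to_billing", "transfer_to_triage"] * 3
cycle = find_repeating_cycle(handoffs)
```

Here `cycle` is the two-element pattern once the third A→B→A round trip completes; with only five entries in the history the function returns `None`, which matches the "fires on the third repetition" behavior described above.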
Failure mode 2: tool-call loop within an agent
- The pattern. The agent calls a tool, gets a result, determines the result is insufficient, calls the same tool with the same (or very similar) arguments, gets the same insufficient result, and calls the same tool again. The Agents SDK’s `max_turns` limits the number of turns, where a turn is one model invocation plus any tool calls it requests, not the number of tool calls. A single turn can contain many tool calls if the model requests them in parallel, and each sequential retry consumes another full model call. A tool-call loop can therefore run up significant cost before `max_turns` fires.
- The error-string masking trigger. The most common trigger for tool-call loops in Agents SDK applications is a tool that returns an error string rather than raising an exception when it fails. The model sees the error string as a result (not a failure) and calls the tool again to try to get a better result. The fix is to raise a typed exception from the tool on failure, so that the SDK surfaces it as a tool error in the conversation, which the model is more likely to treat as a reason to stop retrying than as a result to act on.
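To detect "same function, same arguments" reliably, each tool call needs a stable signature. The sketch below is one way to build it, under stated assumptions: `call_signature` and `RepeatCounter` are hypothetical names, the arguments are assumed to be JSON-serializable, and only exact consecutive repeats are counted (near-duplicate arguments would need fuzzier matching).

```python
import hashlib
import json
from typing import Any, Dict, Optional


def call_signature(name: str, arguments: Dict[str, Any]) -> str:
    """Stable signature: the same tool name and arguments give the same string."""
    # sort_keys makes the signature independent of argument order.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"), default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{name}:{digest}"


class RepeatCounter:
    """Counts consecutive identical signatures and trips at `repeats`."""

    def __init__(self, repeats: int = 3) -> None:
        self.repeats = repeats
        self._last: Optional[str] = None
        self._count = 0

    def observe(self, signature: str) -> bool:
        """Record one call; return True once the repeat threshold is reached."""
        if signature == self._last:
            self._count += 1
        else:
            self._last, self._count = signature, 1
        return self._count >= self.repeats


counter = RepeatCounter(repeats=3)
first = call_signature("search_orders", {"customer_id": 42, "status": "open"})
# Same arguments in a different order produce the same signature.
second = call_signature("search_orders", {"status": "open", "customer_id": 42})
tripped = [counter.observe(sig) for sig in (first, second, first)]
```

Hashing the canonical JSON rather than storing raw arguments keeps the history small and avoids logging argument values verbatim.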
Adding a loop guard to OpenAI Agents SDK
- The interception point: the model provider. The Agents SDK routes all LLM calls through a model provider (by default `OpenAIProvider`; custom providers are supplied through `RunConfig(model_provider=...)`, which is passed to `Runner.run()` as `run_config`). The correct place to add RunGuard is as a custom provider that wraps the underlying OpenAI provider and adds guard logic before each call goes out.
- Python: custom model provider with RunGuard.

```python
from typing import Optional

from openai import AsyncOpenAI
from agents import Agent, ModelProvider, RunConfig, Runner
from agents.models.openai_responses import OpenAIResponsesModel
from runguard import guard, LoopDetectedError, BudgetExceededError


class GuardedModelProvider(ModelProvider):
    """Agents SDK ModelProvider with RunGuard budget + loop detection."""

    def __init__(self, max_usd: float = 2.0, client: Optional[AsyncOpenAI] = None):
        self._max_usd = max_usd
        self._client = client or AsyncOpenAI()
        self._guard = None  # built lazily on the first model call

    def _build_guard(self):
        underlying = OpenAIResponsesModel(
            model="gpt-4o",
            openai_client=self._client,
        )

        async def _inner(*args, **kwargs):
            # Forward the runner's arguments unchanged: get_response() takes more
            # than (input, model_settings), and its signature varies by SDK version.
            response = await underlying.get_response(*args, **kwargs)

            # Extract USD cost at GPT-4o prices ($2.50 in / $10.00 out per 1M tokens).
            # `usage` is an object, not a dict, so read its fields with getattr.
            usage = getattr(response, "usage", None)
            input_tokens = getattr(usage, "input_tokens", 0) or 0
            output_tokens = getattr(usage, "output_tokens", 0) or 0
            usd = (input_tokens * 2.50 + output_tokens * 10.0) / 1_000_000

            # Extract signature: prefer tool/handoff name over "end_turn"
            sig = "end_turn"
            for item in getattr(response, "output", None) or []:
                if getattr(item, "type", None) == "function_call":
                    sig = item.name
                    break
            return {"response": response, "usd": usd, "sig": sig}

        return guard(
            _inner,
            budget={"max_usd": self._max_usd},
            loop={"repeats": 3, "max_cycle_len": 4},  # catches handoff A↔B and tool repeats
        )

    def get_model(self, model_name: Optional[str]):
        provider = self

        class _GuardedModel:
            # Duck-typed model: only the non-streaming get_response() path is guarded.
            async def get_response(self, *args, **kwargs):
                if provider._guard is None:
                    provider._guard = provider._build_guard()
                # Assumes the guard wrapper forwards its arguments to _inner.
                result = await provider._guard(*args, **kwargs)
                return result["response"]

        return _GuardedModel()


# --- Usage ---
triage_agent = Agent(
    name="Triage",
    instructions="Classify the user request and hand off to the appropriate specialist.",
)
billing_agent = Agent(
    name="Billing",
    instructions="Handle billing questions. Hand back to Triage if the question is off-topic.",
)

# Wire up handoff tools
triage_agent.handoffs = [billing_agent]
billing_agent.handoffs = [triage_agent]

guarded_provider = GuardedModelProvider(max_usd=1.50)


async def handle_request(user_message: str):
    try:
        result = await Runner.run(
            triage_agent,
            input=user_message,
            # The provider is set on RunConfig, not passed to Runner.run() directly.
            run_config=RunConfig(model_provider=guarded_provider),
            max_turns=20,  # outer backstop; guard fires first
        )
        return result.final_output
    except LoopDetectedError as e:
        return f"Routing loop detected (pattern: {e.pattern!r}). Please try again with a clearer request."
    except BudgetExceededError as e:
        return f"Response budget exceeded (${e.spent:.3f}). Task was too complex for this session."
```

- Scoping the guard per user request. The example above uses a single `GuardedModelProvider` instance with a shared `_guard`. For production multi-user systems, create a new provider instance (and therefore a new guard) per user request so that one user’s call history cannot affect another’s budget and loop window. Pass the provider to `Runner.run()` through `RunConfig(model_provider=...)` and treat it as request-scoped, not as a singleton.
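The cost arithmetic and the per-request scoping described above can be illustrated without RunGuard. This is a minimal sketch under stated assumptions: `RequestBudget`, `BudgetExceeded`, and `call_cost_usd` are hypothetical names, and the prices are the GPT-4o figures already used in the provider ($2.50 per 1M input tokens, $10.00 per 1M output tokens), which may not match current pricing.

```python
from dataclasses import dataclass

# Assumed GPT-4o prices in USD per 1M tokens; check current pricing before use.
INPUT_USD_PER_M = 2.50
OUTPUT_USD_PER_M = 10.00


def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call from its token counts."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


class BudgetExceeded(Exception):
    """Hypothetical error raised when a request has used up its cap."""


@dataclass
class RequestBudget:
    """One instance per user request, so spend is never shared across users."""

    max_usd: float
    spent_usd: float = 0.0

    def check(self) -> None:
        """Call before each model call; raises once the cap has been reached."""
        if self.spent_usd >= self.max_usd:
            raise BudgetExceeded(f"spent ${self.spent_usd:.3f} of ${self.max_usd:.2f}")

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Call after each model call with the usage it reported."""
        self.spent_usd += call_cost_usd(input_tokens, output_tokens)


# Two requests get two independent budgets.
alice = RequestBudget(max_usd=0.05)
bob = RequestBudget(max_usd=0.05)
alice.record(input_tokens=10_000, output_tokens=500)  # one hop, about $0.03
alice.record(input_tokens=10_000, output_tokens=500)  # second hop, about $0.06 total
```

At these assumed prices a hop with 10,000 input tokens and 500 output tokens costs about $0.03, so ten such hops come to roughly $0.30. After the two calls above, `alice.check()` raises while `bob.check()` still passes, which is the isolation that per-request scoping is meant to provide.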
Agents SDK built-in limits vs. RunGuard
| Control | Agents SDK built-in | RunGuard |
|---|---|---|
| Max turns | max_turns parameter (default 10) | Not needed (loop detector fires first) |
| Per-run cost cap | Not supported | budget: max_usd — fires before each call |
| Handoff loop detection | Not supported | loop: max_cycle_len=4 catches A↔B cycles |
| Tool-call loop detection | Not supported | loop: repeats=3 — fires on 3rd repeat |
| Slack/PagerDuty alert on trip | Not supported | alerts: slack_webhook or pagerduty_key |
| Graceful partial output | Not supported (raises internally) | BudgetExceededError exposes accumulated context |
Fixing the underlying causes alongside the guard
- Fix 1: clear handoff conditions. Every agent should have an explicit condition for when NOT to hand off. An agent that can say “I cannot help with this request — please contact support directly” is less likely to create a handoff loop than one whose only exit is to hand off to another agent. Add a fallback tool or a terminal instruction to every agent in your network.
- Fix 2: raise exceptions from tools, never return error strings. Every tool in your Agents SDK application should raise a typed exception on failure. By default the SDK converts a tool exception into an error message in the conversation, which tells the model that the call failed rather than returned a result. Returning “Error: no results found” as a string invites a retry loop; raising `ToolExecutionError("no results found for query: ...")` does not. If you need a guaranteed hard stop, the `function_tool` decorator’s `failure_error_function` parameter can be set to `None` so that the exception propagates out of the run instead of being reported to the model.
- Fix 3: add a max-turns guard for each critical path, not just one global limit. `max_turns` is an argument to `Runner.run()` rather than a field on `Agent`, so set a tight value on the runs that start critical-path agents. A billing agent that genuinely needs more than 5 tool calls to complete a task is doing something unexpected. Layer the guard (fires at 3 repeats) with a turn limit (fires at N turns) for defense in depth.
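The difference behind Fix 2 is easiest to see side by side. The sketch below uses plain Python functions rather than SDK-registered tools; `ToolExecutionError` is a hypothetical exception class you would define yourself, and `ORDERS` is stand-in data.

```python
class ToolExecutionError(Exception):
    """Hypothetical typed error for tool failures (define it in your own code)."""


# Stand-in data store for the example.
ORDERS = {"A-100": "shipped"}


def lookup_order_masked(order_id: str) -> str:
    """Anti-pattern: the failure is returned as if it were an ordinary result."""
    status = ORDERS.get(order_id)
    if status is None:
        return f"Error: no results found for order {order_id}"
    return status


def lookup_order(order_id: str) -> str:
    """Preferred: the failure is raised, so callers cannot mistake it for data."""
    status = ORDERS.get(order_id)
    if status is None:
        raise ToolExecutionError(f"no results found for order {order_id}")
    return status


masked = lookup_order_masked("missing")  # a normal-looking string
try:
    lookup_order("missing")
    raised = False
except ToolExecutionError:
    raised = True
```

The masked version hands the model a string it can keep trying to improve on, while the raising version makes the failure explicit. In an SDK application the same function body would sit under the tool decorator, with the error handling configured as described in Fix 2.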