Loop detection for CrewAI multi-agent crews
CrewAI gives every Agent a max_iter and a max_rpm, and every Crew a max_execution_time. They are the right primitives for a single-agent demo. In a three-agent hierarchical crew running paid tools, they catch the loop after the bill, not before. This page covers the runtime breaker we ship and shows how it slots into a CrewAI Tool in eight lines of Python.
Where loops actually happen inside a CrewAI crew
- The manager re-delegates after a worker errors. The most common shape in hierarchical crews: the manager asks the researcher for a fact, the researcher’s tool returns a 4XX, the manager “reasons” about the failure and dispatches the same task to the same agent. Same role, same goal-string, same tool, same upstream, same 4XX. Three rounds in, you have a fixed point at the manager layer — and every round is a manager LLM call and a worker LLM call and a tool call.
- A worker’s tool retries the same paid call. Inside a single agent, the same shape as every other agent framework: the tool hits an upstream that is rate-limited, out of credits, or refusing a malformed payload. The error string lands in the agent’s scratchpad. The next iteration sees the failure, reasons about it, and emits the same call. The worker’s max_iter — 25 by default — runs to the ceiling.
- A retrieval tool that returns the same passages. Two agents in a sequential process share a vector store. The researcher pulls the top-five passages on a query; the writer rejects them as off-topic and asks the researcher to “dig deeper”; the researcher re-asks the same store with a near-identical query and gets the same top-five back. The signatures match; the bill climbs across both agents.
- A two-agent ping-pong. Researcher → writer → researcher → writer with the same intermediate values. Each handoff inflates the shared scratchpad and re-runs both agents’ reasoning over the same observation. Per-agent max_iter never trips because each individual agent only sees a few iterations — the loop lives at the crew level (see the trace sketched after this list).
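To make the crew-level shape concrete, here is a hypothetical trace of what a signature window shared across the crew would see during that ping-pong; the tool names and the tool:args:status convention are illustrative, matching the fingerprint format described below.

```python
# Hypothetical signature trace for the researcher/writer ping-pong.
# No single agent repeats within its own max_iter, but a window shared
# across the crew sees a clean cycle of length 2.
window = [
    'vector_search:"refund policy":200:ok',       # researcher, round 1
    'submit_draft:"refund policy":200:rejected',  # writer, round 1
    'vector_search:"refund policy":200:ok',       # researcher, round 2
    'submit_draft:"refund policy":200:rejected',  # writer, round 2
    'vector_search:"refund policy":200:ok',       # researcher, round 3: trips here
]
```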
Why max_iter, max_rpm, and max_execution_time miss it
The three knobs CrewAI gives you are correct in shape and wrong in granularity. max_iter is the per-agent ceiling on reasoning iterations; the default is 25, the agent runs to that ceiling, and only then raises a stop. By the time round 25 fires inside one worker, you have made twenty-five manager LLM calls plus twenty-five worker LLM calls plus twenty-five tool calls — in a hierarchical process, the manager round-trips on every step. max_rpm throttles request rate per agent; a loop running at the cap is just a slower loop, with the same final bill. max_execution_time is wall-clock at the crew level; on multi-agent runs that legitimately take minutes, it never trips early enough. None of the three look at the content of the calls. A run that legitimately needs 12 distinct steps and a run that fires the same broken call 12 times look identical to the executor.
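A short sketch makes that blindness concrete: to anything that only counts iterations, a productive twelve-step run and a stuck twelve-repeat run are the same number. The signatures below are illustrative.

```python
# What the executor sees vs. what a content fingerprint sees.
productive = [f"vector_search:query_{i}:200:ok" for i in range(12)]  # 12 distinct steps
stuck = ["http_get:tinyurl.com/x:429:RateLimited"] * 12              # one broken call, 12 times

assert len(productive) == len(stuck) == 12  # identical to an iteration counter
assert len(set(productive)) == 12           # a fingerprint sees 12 signatures here...
assert len(set(stuck)) == 1                 # ...and 1 signature here
```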
What a circuit breaker actually has to do
- Fingerprint each call before the model spends another token on it. Tool name plus canonicalized arguments plus — for failures — the response status and error title. http_get:tinyurl.com/x:429:RateLimited is one signature; vector_search:"refund policy":200:ok is another. The signature must be deterministic; whitespace, default arg order, and timestamp fields all need to be normalized out.
- Watch for the same fingerprint repeating in a sliding window. Crew-level loops fire fast. A window of 32 entries spans more rounds than most well-formed crews need; cycles of length 1 (same call), 2 (researcher/writer ping-pong), and up to 8 cover the realistic bad shapes — including a 3-agent hierarchical loop where the manager dispatches to the same worker thrice. A minimal sketch of this check follows the list.
- Trip before the next call goes out, not after. The check is in-process, on a deque-backed buffer. It runs in well under a millisecond per call. When the third repeat lands, the next tool._run() raises a typed error and the crew halts.
- Be a primitive, not a framework opinion. Loops happen on retrieval tools, on HTTP tools, on shell tools, on the custom tool you wrote yesterday. A breaker that wraps any callable is the right shape; a breaker that ships as a CrewAI-only callback or a BaseTool subclass is not. The same wrap should compose with raw requests, with the openai client, with whatever framework lands next quarter.
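Here is a minimal sketch of that fingerprint-and-window check (not runguard’s internals, just the shape the list describes); the fields the canonicalizer drops, timestamp and request_id, are illustrative.

```python
# Sketch: deterministic signatures plus a bounded window that flags a cycle
# of length n repeated `repeats` times at the tail.
import json
from collections import deque

def fingerprint(tool: str, args: dict, status: str = "") -> str:
    # sort_keys + compact separators normalize arg order and whitespace;
    # volatile fields are dropped so retries of the same payload collide.
    stable = {k: v for k, v in args.items() if k not in ("timestamp", "request_id")}
    return f"{tool}:{json.dumps(stable, sort_keys=True, separators=(',', ':'))}:{status}"

class SlidingWindow:
    def __init__(self, size: int = 32, repeats: int = 3, max_cycle_len: int = 8):
        self.buf = deque(maxlen=size)
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len

    def push(self, sig: str) -> bool:
        """Append a signature; return True when the tail is one cycle repeated."""
        self.buf.append(sig)
        tail = list(self.buf)
        for n in range(1, self.max_cycle_len + 1):
            span = n * self.repeats
            if len(tail) < span:
                break
            window_tail = tail[-span:]
            cycle = window_tail[-n:]
            if window_tail == cycle * self.repeats:
                return True
        return False
```

With repeats: 3 and max_cycle_len: 8, this flags a length-1 worker loop on its third push and a length-2 ping-pong on its sixth, matching the tuning numbers used later on this page.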
Wrapping a CrewAI Tool with runguard
```python
# crewai + runguard. The Tool stays a Tool; only its underlying func gets
# wrapped. Crew sees the same interface, the breaker sees every call.
import json

import requests
from crewai.tools import BaseTool

from runguard import guard, LoopDetectedError, BudgetExceededError


def _http_get(payload):
    r = requests.get(payload["url"], timeout=10)
    return {"status": r.status_code, "body": r.text}


guarded_http = guard(
    _http_get,
    signature=lambda i: f"http_get:{i['url']}",
    loop={"repeats": 3, "max_cycle_len": 8},
    budget={"max_usd": 5},
    cost=lambda _i, o: 0 if o["status"] >= 400 else 0.001,
    on_trip=lambda e: print("[runguard]", e["reason"], e.get("signature")),
)


class HttpGet(BaseTool):
    name: str = "http_get"
    description: str = "Fetch a URL. Trips on third identical signature."

    def _run(self, url: str) -> str:
        try:
            return json.dumps(guarded_http({"url": url}))
        except (LoopDetectedError, BudgetExceededError):
            raise  # propagate up, halt the crew
        except Exception as e:
            return f"error: {e}"
```
Defaults match every other surface in the SDK: window_size: 32, min_cycle_len: 1, max_cycle_len: 8, repeats: 3 — snake_case in Python, camelCase in TypeScript, same numbers either way. The wrapped function is plain, non-CrewAI Python — so the same wrap composes with raw requests, with openai.chat.completions.create, with whatever framework you reach for next sprint. The fingerprint-and-window approach is documented at how to detect LLM tool-call loops in production; the TypeScript equivalent for LangChain is here.
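As a sketch of that composability, assuming the same guard() surface shown above (one positional input, a signature derived from it), the identical wrap goes around a plain requests call with no CrewAI import in sight; the search endpoint is a placeholder.

```python
# Same primitive, no framework: guard a bare requests call directly.
import requests
from runguard import guard

def _search(q: str):
    r = requests.get("https://api.example.com/search",  # placeholder endpoint
                     params={"q": q}, timeout=10)
    return {"status": r.status_code, "body": r.text}

guarded_search = guard(
    _search,
    signature=lambda q: f"search:{q}",        # deterministic, built from the input
    loop={"repeats": 3, "max_cycle_len": 8},  # the cross-SDK defaults, spelled out
)
```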
How the breaker behaves inside Crew.kickoff()
- First two identical calls return normally. The detector pushes the signature into the window and returns detected=False. The agent sees its observation and continues. This is critical — legitimate retries against a transient 429 with backoff are common, and a breaker that trips on attempt two is more annoying than the loop it was supposed to catch.
- The third identical call throws before the wrapped requests.get runs. LoopDetectedError is constructed with the cycle length, the repeats count, and the matching pattern. It propagates out of _run, up through CrewAI’s tool-execution shim, into the agent’s reasoning loop, and out of Crew.kickoff() as an unhandled exception. CrewAI does not silently catch arbitrary exceptions out of a tool — the crew halts.
- Your on_trip hook fires before the throw. Page Slack, write a row to a trip log, kill the worker container — whatever you wire. Sync hooks run inline; async hooks are awaited if you use guard_async. An on_trip exception propagates instead of the trip error, by design (the host explicitly opted in to side-effecting on trip).
- Reset is explicit. When the crew is restarted for a fresh run, call guarded_http.reset() to clear the window. The loop counter is per-guarded-fn, not per-process — you can keep one breaker per tool and reset them independently, or share one across a tool family for cross-tool loops. A kickoff-level sketch follows this list.
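Assembled at the kickoff level, the handling looks roughly like this; the agent, task, and paging-hook names are hypothetical, and reading cycle details off the error assumes attributes beyond what this page documents.

```python
# Sketch: let the trip halt kickoff, then reset the window before a fresh run.
from crewai import Crew, Process
from runguard import LoopDetectedError, BudgetExceededError

crew = Crew(agents=[researcher, writer], tasks=[report_task],
            process=Process.sequential)  # hypothetical agents/task
try:
    result = crew.kickoff()
except (LoopDetectedError, BudgetExceededError) as e:
    page_oncall(f"crew halted: {e}")  # hypothetical hook; the error carries the
                                      # cycle length, repeats, and matching pattern
finally:
    guarded_http.reset()  # explicit, per-guarded-fn: clear before the next run
```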
Tuning for CrewAI’s loop shapes
CrewAI’s Agent defaults to max_iter: 25; in a hierarchical crew, the manager and worker each get their own ceiling, so the effective worst-case bill is roughly (manager_iter × worker_iter) LLM calls before anything halts. A breaker tuned to repeats: 3, max_cycle_len: 8 can catch a length-1 worker loop on iteration 3 and a length-2 manager-worker ping-pong on iteration 6 — both well inside any per-agent ceiling. For multi-agent crews where loops live at the crew level, share one guard instance across the agents that participate in the cycle so the detector sees every call in one window; per-agent guards will each see their own slice and miss a cross-agent ping-pong. If your tools genuinely retry idempotent reads, mark them as such by raising retryable errors that the detector skips, or split the tool into a per-attempt callable that the detector does not watch and an outer retry wrapper that it does, so bounded backoff never enters the window. For high-cost runs — a research crew paying $0.50 per LLM step on the manager and worker combined — consider repeats: 2 on tools whose loop signatures are unique enough that a false-positive trip is unlikely and cheap. The cost of a missed loop is the bill; the cost of a false-positive trip is one re-run.
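A sketch of that split, assuming the guard() surface above; the backoff schedule and endpoint shape are illustrative.

```python
# Guard the outer retry wrapper, not each attempt: bounded backoff against a
# transient failure never enters the window, while three identical *logical*
# calls still trip.
import time
import requests
from runguard import guard

def _attempt(url: str):                           # per-attempt callable: not guarded
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return r.json()

def read_with_backoff(url: str, tries: int = 3):  # outer wrapper: guard this one
    for i in range(tries):
        try:
            return _attempt(url)
        except requests.HTTPError:
            if i == tries - 1:
                raise
            time.sleep(2 ** i)                    # illustrative backoff schedule

guarded_read = guard(
    read_with_backoff,
    signature=lambda url: f"read:{url}",
    loop={"repeats": 3, "max_cycle_len": 8},
)
```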
Budget and context guards, on the same wrap
- Budget cap. Pass budget={"max_usd": 5} with a cost function. The tracker accumulates after each successful call. The next call after the cap throws BudgetExceededError. This is the answer to “the crew shouldn’t ever spend more than $5 on this run” — an LLM doesn’t volunteer to stop, the wrapper does. In a multi-agent crew, share one tracker across all guarded tools to enforce a crew-wide cap; instantiate per-tool trackers to enforce per-tool ceilings.
- Rolling-window throttle. Add window_ms: 60_000 for “$5 per minute” instead of cumulative-only. Old spend rolls off the front. Useful in long-running crews that legitimately spend dollars over an hour but should never spend dollars in a minute.
- Context-window guard. Pass context={"max_context_tokens": 200_000, "headroom": 4_000} with a tokens function that projects total tokens (system prompts, accumulated scratchpad, the new call’s payload, reserved-for-output) using your provider’s tokenizer. CrewAI scratchpads grow fast in hierarchical crews because every worker observation lands back in the manager’s history; the breaker trips 4 000 tokens before the model’s ceiling, so the host can summarize, checkpoint the crew, or fork a fresh thread — before the next call would 400.
- One guard(), three reasons. The same wrap watches loop, budget, and context simultaneously. The reason field on the trip event tells you which fired ("loop", "budget", "context"); the typed error tells the calling code which to handle. A combined sketch follows this list.
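Composed on one wrap, the three guards look roughly like this; whether window_ms nests inside budget, the usage_usd field on the output, and the tiktoken encoding choice are all assumptions of the sketch, not a documented surface.

```python
# One guard(), three reasons: loop, rolling budget, and context headroom.
import tiktoken
from runguard import guard

enc = tiktoken.get_encoding("cl100k_base")  # assumed provider-compatible tokenizer

guarded_llm = guard(
    _llm_call,  # your underlying model call, taking one dict input like _http_get
    signature=lambda i: f"llm:{hash(i['prompt'])}",
    loop={"repeats": 3, "max_cycle_len": 8},
    budget={"max_usd": 5, "window_ms": 60_000},  # $5 per rolling minute (assumed nesting)
    cost=lambda _i, o: o["usage_usd"],           # assumed spend field on the output
    context={"max_context_tokens": 200_000, "headroom": 4_000},
    tokens=lambda i: len(enc.encode(i["prompt"])) + i["reserved_output"],
)
```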
The first loop our SDK caught was ours
It wasn’t a CrewAI crew — it was our own launch script firing a six-tweet thread against a shared paid API. The first attempt came back with HTTP 402 CreditsDepleted. Six consecutive sessions later, six identical signatures — post_tweet:402:CreditsDepleted — were sitting in a flat JSON file on disk. The seventh session loaded the six-row history, pushed it into the detector at startup, and exited at signature three with a RunGuardTripped preflight before a single HTTP request went out. It has held the breaker open every session since. Read the dogfood story on the 30-day log; the same pattern slots into a CrewAI hierarchical process when the manager keeps re-delegating to the same worker on the same broken upstream.
What this is not
- Not a CrewAI plugin. RunGuard does not subclass BaseTool, ship a BaseAgent mixin, or hook into the crew’s task router. It wraps the underlying callable. That is the design — the same wrap composes with raw requests, with the OpenAI Python client, with browser-use, with whatever framework lands next quarter.
- Not a replacement for an observability platform. A trace viewer answers “what happened on the run that already finished?”. A runtime breaker answers “should the next call go out?”. The two are complementary — one for forensics, one for prevention. Run both.
- Not a server. No outbound network, no telemetry, no cookies, no daemon. The loop check is pure data flow inside your Python process. The same in-process discipline shows up in the embed-preview widget; the policy is one repo away in llms.txt.
The minimum CrewAI integration
One pip install runguard, one guard() call per tool whose loop you want to catch, one on_trip that pages the channel you actually read. Eight lines of wrap per tool, no callback to register, no agent or crew subclass. The breaker trips on the third repeat of any signature, halts the crew, and leaves a structured event and a typed error behind for the post-mortem you would have written on Sunday anyway. RunGuard ships it as runguard on PyPI and @runguard/sdk on npm — same primitive, both runtimes, in-process, zero deps.