How to detect LLM tool-call loops in production
Most production LLM agents loop the same way: a single tool call comes back with a transient error, the model retries it, gets the same error, retries again, and keeps going until something — a budget, a wall-clock, a tired engineer on Slack — stops the run. This page is the working approach we ship with, plus the TypeScript that runs the check.
What a loop actually looks like on the wire
- Same call, same arguments, same response shape. The model isn't introducing variety; it's stuck on a fixed point. The tool name, the argument blob, and the error reason are byte-equal across attempts.
- Three or more repeats inside one run. One repeat is a retry. Two is a stubborn retry. Three on the same fingerprint is almost always either a model that has stopped reasoning about the failure or a context that's silently truncated the part of the prompt explaining what failed.
- No upstream signal of progress. The token count is climbing, the bill is climbing, but the agent's working state (open tabs, files written, conversation messages) is not.
Why retry counts and per-step timeouts miss it
The naive guards — a per-call retry counter, a per-step timeout, an outer wall-clock — are the right primitives for the wrong granularity. A single tool call retrying twice is fine; a model invoking that same tool eight times across eight separate “steps” (each at attempt 1) is the same loop, and the per-call counter never ticks. Per-step timeouts only fire if the call hangs; a fast 429 from the same upstream finishes in 80 ms and resets the timer cleanly each time. Outer wall-clocks fire eventually, but on agent runs that take minutes by design, the bill is already in four-figure territory by the time they trip.
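To make that failure mode concrete, here is a small simulation (hypothetical names and signature) of a run where every call is attempt 1 of its own step, so a per-call retry counter never fires even though every signature is identical:

```typescript
// Sketch: why a per-call retry counter misses a cross-step loop.
// Each "step" starts a fresh attempt count, so the counter never
// reaches its limit even though the run repeats indefinitely.
const MAX_RETRIES = 2;

type StepResult = { attempt: number; signature: string };

function simulateRun(steps: number): StepResult[] {
  const history: StepResult[] = [];
  for (let i = 0; i < steps; i++) {
    // The model re-plans and re-issues the same failing call as a
    // brand-new step, so every call is attempt 1 of its own step.
    history.push({ attempt: 1, signature: 'post_tweet:429:RateLimited' });
  }
  return history;
}

const run = simulateRun(8);

// The per-call guard never fires: no single step exceeded MAX_RETRIES.
const perCallTripped = run.some((r) => r.attempt > MAX_RETRIES);

// A window over signatures sees eight identical entries immediately.
const identicalTail = run.every((r) => r.signature === run[0].signature);
// perCallTripped is false, identicalTail is true
```

The same run that is invisible to attempt counts is trivially visible to a signature window.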
The fingerprint-and-window approach
- Fingerprint each tool call. Take the tool name, the arguments (canonicalized JSON, sorted keys), and — for failures — the response status and error title. Hash or stringify them into a single signature.
`post_tweet:402:CreditsDepleted` is one of ours.
- Push each signature into a sliding window. A window of 32 entries is plenty for most agent runs; the cycles you care about are short.
- Scan the tail of the window for a repeating cycle. Cycles of length 1 (same call repeated), 2 (alternating pair), and up to 8 cover essentially every real-world loop. If the same cycle of length L has repeated `repeats` times in a row at the tail, the run is in a loop.
- Trip the breaker before the next call. Once detected, you don’t want to add latency on the happy path or wait for a billing signal — refuse the next tool call, raise a structured error, and let the caller decide whether to surface it, fall back to a different tool, or page someone.
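The fingerprint step can be sketched as follows. The `ToolCall` and `ToolResult` shapes and the helper names are assumptions for illustration; this extends the simple name-plus-status signature from the example below with canonicalized arguments, as the first bullet describes:

```typescript
// Sketch of the fingerprint step (illustrative types, not the SDK's).
// Canonicalize arguments by recursively sorting object keys so that
// { b: 1, a: 2 } and { a: 2, b: 1 } yield the same signature.
type ToolCall = { name: string; args: unknown };
type ToolResult = { status: number; errorTitle?: string };

function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

function fingerprint(call: ToolCall, result: ToolResult): string {
  return `${call.name}:${canonicalize(call.args)}:${result.status}:${result.errorTitle ?? 'ok'}`;
}
```

With this helper, two calls whose argument objects differ only in key order collapse onto one signature, which is exactly what the window scan needs.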
The detector, in TypeScript
```typescript
// product/sdk/src/loop-detector.ts — the core primitive in @runguard/sdk.
// Maintain a sliding window; after each new signature, scan the tail for a
// cycle of length L (minCycleLen..maxCycleLen) repeated `repeats` times.
import { LoopDetector, RunGuardTripped } from '@runguard/sdk';

const detector = new LoopDetector({ repeats: 3, maxCycleLen: 8 });

async function runStep(call: ToolCall) {
  const result = await tool.invoke(call);
  const sig = `${call.name}:${result.status}:${result.errorTitle ?? 'ok'}`;
  const match = detector.push(sig);
  if (match.detected) {
    throw new RunGuardTripped({
      reason: 'loop',
      pattern: match.pattern,
      cycleLength: match.cycleLength,
      repeats: match.repeats,
    });
  }
  return result;
}
```
Defaults: windowSize: 32, minCycleLen: 1, maxCycleLen: 8, repeats: 3. The check is in-process and resolves in well under a millisecond per guarded call — no network hop, no remote service. The only operation that takes ~40 ms is the trip itself.
Tuning the thresholds for your agent
- Lower `repeats` for spend-sensitive runs. A long-running research agent burning $0.20 per step can sit on `repeats: 3`; a launch script burning $5 per attempt against a paid API should be `repeats: 2`. The cost of a false-positive trip is one re-run; the cost of a missed loop is the bill.
- Raise `maxCycleLen` for tool-orchestration agents. Browser-use and other multi-tool runners legitimately repeat 4–8 step cycles (open page → read DOM → extract → next page). The default of 8 covers most; bump to 16 if your real cycles are longer.
- Per-call escape hatches beat global thresholds. Mark a call `retryable: true` when the tool genuinely needs more bites at the apple (eventually-consistent reads, slow upstreams that recover within seconds). The detector skips that signature entirely.
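As a rough sketch, the tuning advice above maps onto option presets like these. The option names come from this page's defaults; the `LoopDetectorOptions` type itself is illustrative, not the SDK's declared type:

```typescript
// Illustrative presets for the thresholds discussed above.
type LoopDetectorOptions = {
  windowSize?: number;
  minCycleLen?: number;
  maxCycleLen?: number;
  repeats?: number;
};

// Spend-sensitive launch script: a false trip costs one re-run, a
// missed loop costs the bill, so trip on the second repeat.
const launchScript: LoopDetectorOptions = { repeats: 2, maxCycleLen: 8 };

// Browser-orchestration agent: real 4–8 step cycles are normal, so
// allow longer cycles (and a larger window) before calling it a loop.
const browserAgent: LoopDetectorOptions = {
  repeats: 3,
  maxCycleLen: 16,
  windowSize: 64,
};
```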
The first loop our SDK caught was ours
Our launch flow fires a six-tweet thread via deploy/post-launch-thread.js. The first attempt came back with HTTP 402 CreditsDepleted against a shared upstream account. Six consecutive sessions, six identical signatures — post_tweet:402:CreditsDepleted — logged to a flat JSON file on disk. We wired the detector in on the seventh session. It rehydrated the six-row history, matched repeats: 3 × cycleLength 1 at entry three, exited with RunGuardTripped before a single HTTP request went out, and has held the breaker open every session since. Detail in the 30-day log; the dogfood dataset is the canonical anchor for the day-0 post.
What this is not
- Not a replacement for your trace viewer. Langfuse, LangSmith, and Braintrust are the right tools for forensics on a run that already finished. A runtime detector sits in front of the next call; a trace viewer sits behind the last one. Run both.
- Not a model-quality signal. The detector tells you the agent is repeating itself, not that the model is wrong. A loop on a 502 from an upstream you don’t control isn’t the model’s fault; refusing the next call is still the right action.
- Not a server. No outbound network, no cookies, no analytics beacons — the entire loop check is pure data flow inside your process. You can verify this on the embed preview page, where the same in-process discipline shows up in the cost-estimator widget.
The minimum you need in production
If you’re writing this from scratch, the smallest defensible version is: a function that takes a tool name and arguments, returns a signature string, pushes it into a list, and trips when the last 3 × L entries are L-periodic for any L from 1 to 8. That’s ~40 lines. The reason it isn’t shipped by every agent team already is that writing it on a Sunday after the bill arrives is exhausting, and almost nobody does it twice. RunGuard ships it as @runguard/sdk on npm and runguard on PyPI — one line of install, the same primitive the dogfood story uses.
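A from-scratch sketch of that minimal version might look like this. The names are illustrative, not the @runguard/sdk implementation:

```typescript
// Minimal window-plus-periodicity detector: trip when the last
// repeats × L entries are L-periodic for some L in minCycleLen..maxCycleLen.
type LoopMatch = { detected: boolean; cycleLength?: number; pattern?: string[] };

class MiniLoopDetector {
  private window: string[] = [];

  constructor(
    private windowSize = 32,
    private minCycleLen = 1,
    private maxCycleLen = 8,
    private repeats = 3,
  ) {}

  push(sig: string): LoopMatch {
    this.window.push(sig);
    if (this.window.length > this.windowSize) this.window.shift();

    for (let len = this.minCycleLen; len <= this.maxCycleLen; len++) {
      const need = len * this.repeats;
      if (this.window.length < need) break; // need only grows with len
      const tail = this.window.slice(-need);
      // The tail is len-periodic iff every entry equals the entry one
      // cycle earlier.
      const periodic = tail.every((s, i) => i < len || s === tail[i - len]);
      if (periodic) {
        return { detected: true, cycleLength: len, pattern: tail.slice(0, len) };
      }
    }
    return { detected: false };
  }
}
```

Three pushes of the same signature trip it as a length-1 cycle; an alternating pair trips as a length-2 cycle once the pair has repeated three times.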