Traceloop OpenLLMetry vs RunGuard: loop detection for production LLM agents
Traceloop’s OpenLLMetry is the leading open standard for instrumenting LLM calls with OpenTelemetry. It wraps every major LLM SDK (OpenAI, Anthropic, Cohere, Vertex, Bedrock) and framework (LangChain, LlamaIndex, Haystack) and emits standard OTel spans that route to any backend — your own Jaeger, Grafana Tempo, Datadog, or Traceloop’s hosted dashboard. It is excellent infrastructure. It cannot stop a looping agent. This page covers why — with precision — and how to add RunGuard to an OpenLLMetry-instrumented stack so that traces are short by design, not by luck.
OpenLLMetry’s architecture: spans on the write path
OpenLLMetry works via monkey-patching at the SDK level. When you call Traceloop.init(), it patches the openai, anthropic, and other SDK clients so that every call emits an OTel span. The span records the request inputs (prompt, model, parameters), the response (completion text, usage), and the latency. These spans are sent asynchronously to your configured exporter.
This architecture has three properties that are ideal for observability:
- Zero application code change required. You call Traceloop.init() once at startup; every subsequent LLM call is automatically traced, including calls inside third-party libraries.
- Backend-agnostic. The same instrumentation routes to Jaeger, Datadog, Grafana, Honeycomb, or Traceloop’s own platform via standard OTel exporters.
- No overhead on the critical path. Spans are exported asynchronously in a background batch; the LLM call itself is not blocked waiting for the exporter to acknowledge the span.
Property three is the relevant one here: the exporter is async. The span is emitted after the call completes. There is no mechanism in the OTel model for an exporter to interrupt the call that generated its span — that would be a circular dependency in the data flow.
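The one-way data flow can be illustrated with a small standard-library sketch. This is not OpenLLMetry's code: traced, exporter_loop, span_queue, and fake_completion are hypothetical names, and a plain queue stands in for the OTel batch exporter. It only shows the shape of the write path, where a span is queued after the wrapped call returns and drained later by a background thread.

```python
import functools
import queue
import threading
import time

span_queue = queue.Queue()   # write path: the application only ever puts
exported = []                # what the hypothetical backend has received


def exporter_loop():
    """Background exporter: drains finished spans; None is the shutdown signal."""
    while True:
        span = span_queue.get()
        if span is None:
            break
        exported.append(span)


def traced(fn):
    """Hypothetical stand-in for SDK patching: wrap a call and emit a span after it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            # The span is queued only once the call has finished (or failed).
            span_queue.put({
                "name": fn.__name__,
                "duration_s": time.monotonic() - start,
                "error": error,
            })
    return wrapper


@traced
def fake_completion(prompt):
    # Stand-in for an LLM SDK call.
    return f"echo: {prompt}"


worker = threading.Thread(target=exporter_loop, daemon=True)
worker.start()
fake_completion("hello")
fake_completion("hello")   # an identical repeat is exported, not blocked
span_queue.put(None)       # flush and stop the exporter
worker.join()
```

Nothing in wrapper reads from span_queue or exported before calling fn, so in this sketch the second identical call goes out exactly like the first.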
Why OTel spans can’t detect loops in real time
Loop detection requires two things that OTel spans structurally cannot provide:
- A window of previous call fingerprints, readable before the next call. To know that the current call is the third repeat of the same signature, you need to remember the previous two. OTel spans are write-only from the application’s perspective. The span exporter that receives them may be running in another process, another machine, or another cloud region. There is no standard OTel API for reading recent spans back into the process that emitted them.
- The ability to raise an exception before the call executes. Even if you could read the span history synchronously, the OTel instrumentation layer has already wrapped the underlying SDK. There is no hook in the standard OTel spec for an instrumentation layer to veto an outgoing call based on span history.
These are not implementation gaps in Traceloop or OpenLLMetry — they are structural properties of the observability architecture. Observability is read-after-write. Guards are pre-write. They require different abstractions.
RunGuard’s loop detection: how it actually works
- Wrap at the tool level, not the model level. RunGuard’s guard() wraps the Python or TypeScript function that your agent calls as a “tool”, such as web_search, run_sql, read_file, or call_api. This is the right interception point: tool calls are what loops repeat, not individual model invocations (which may legitimately repeat as the model refines its response).
- Fingerprint = tool name + canonical args + error status. The canonical form normalizes argument order, strips timestamps and request IDs, and extracts error codes from exception messages. fetch_url("https://api.example.com/data", timeout=10) and fetch_url("https://api.example.com/data", timeout=30) produce different fingerprints (timeout differs), but two 429 responses to the same URL produce the same fingerprint regardless of response header differences.
- Deque-backed sliding window, O(1) check. The window is a fixed-size deque (default 32 entries). Each entry is a fingerprint hash. Checking for a repeat is a count against the deque: microseconds, not database queries.
- Trip = typed exception, not a log event. When the threshold is crossed, guard() raises LoopDetectedError synchronously before the underlying function is called. The exception propagates up the agent’s call stack, unwinding the agent loop. Your OpenLLMetry instrumentation records the exception as the outcome of the trace.
Adding RunGuard to an OpenLLMetry-instrumented agent
# Python: both libraries installed in the same environment
# pip install runguard traceloop-sdk
import os

from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task
from runguard import guard, BudgetTracker, LoopDetectedError

# Init OpenLLMetry: patches all LLM SDKs automatically
Traceloop.init(
    app_name="my-research-agent",
    api_key=os.environ["TRACELOOP_API_KEY"],
)

tracker = BudgetTracker(max_usd=2.50)


class AgentStuckError(RuntimeError):
    """Structured error handed to the caller when the loop guard trips."""


# Stack: @task for Traceloop span, @guard for loop protection
@task(name="web_search")
@guard(budget=tracker, loop_window=20, loop_threshold=3)
async def web_search(query: str, max_results: int = 5) -> list:
    # Traceloop creates a span for this call
    # RunGuard checks the deque before calling the underlying API
    # search_api is your own search client, defined elsewhere
    return await search_api.search(query, n=max_results)


# In your agent loop, handle the trip gracefully
# (the parameter is named goal so it does not shadow the imported @task decorator)
async def run_agent(goal: str):
    try:
        # agent_loop is your own agent driver, defined elsewhere
        await agent_loop(goal)
    except LoopDetectedError as e:
        # Traceloop has the full span tree up to the trip
        # Propagate a structured error for your caller to handle
        raise AgentStuckError(
            f"Agent looped on {e.tool_name}: {e.count} repeats"
        ) from e
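The decorator order in the example matters, and a standard-library sketch can show why. The names fake_task and fake_guard below are hypothetical stand-ins, not the Traceloop or RunGuard decorators: fake_task records span events into a list, and fake_guard trips on every call purely to make the ordering visible.

```python
import functools

events = []  # stand-in for the span backend


class LoopDetectedError(Exception):
    pass


def fake_task(fn):
    """Stand-in span decorator: records start, error, and end events."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append(("span_start", fn.__name__))
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            events.append(("span_error", type(exc).__name__))
            raise
        finally:
            events.append(("span_end", fn.__name__))
    return wrapper


def fake_guard(fn):
    """Stand-in guard that always trips before calling the wrapped function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        raise LoopDetectedError(f"{fn.__name__} tripped")
    return wrapper


@fake_task    # outer: the span opens first
@fake_guard   # inner: the trip happens inside the open span
def web_search(query):
    return [query]


@fake_guard   # outer: the trip happens before any span exists
@fake_task
def web_search_swapped(query):
    return [query]


for tool in (web_search, web_search_swapped):
    try:
        tool("otel")
    except LoopDetectedError:
        pass
```

In this sketch only web_search leaves a trace of the trip in events. With the guard on the outside, the exception is raised before the span decorator runs, so the backend would never see it.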
Capability comparison: what each tool covers
| Capability | Traceloop OpenLLMetry | RunGuard |
|---|---|---|
| Auto-instrument LLM SDK calls | Yes — zero code change, all major SDKs | No (use Traceloop for this) |
| OTel-compatible spans to any backend | Yes — Jaeger, Datadog, Grafana, etc. | No |
| Association ID grouping across agent steps | Yes — @workflow decorator | No |
| Real-time loop detection (pre-call) | No | Yes — in-process, synchronous |
| Per-run budget cap | No | Yes — BudgetExceededError before next call |
| Context-window proximity alert | No | Yes — ContextOverflowError |
| Framework-agnostic tool wrapping | Partial — framework instrumentations | Yes — any callable |
| Works with any OTel backend | Yes | N/A (not an OTel component) |
What the combined stack looks like in production
With both tools running:
- OpenLLMetry gives you the full trace. Every LLM call, every tool call (via @task), every framework callback shows up as an OTel span in your backend. You can browse the trace tree, compare latency across runs, and see exactly what the model was given and what it returned at each step.
- RunGuard gives you the safety net. Any tool call that would form a loop is intercepted before it goes out. The LoopDetectedError becomes a span attribute in the Traceloop trace: searchable, filterable, alertable in your OTel backend. Runaway cost is capped at your max_usd ceiling.
- Post-mortem quality improves. Because RunGuard trips the loop at 3 repeats, your Traceloop traces for looping runs are short and precise: 3 spans showing the repeated fingerprint, then an exception. That is far more useful for debugging than a 25-span trace where the pattern is buried in noise.
Add loop detection to your OpenLLMetry stack
RunGuard installs in one command and wraps any function. If you’re already using Traceloop OpenLLMetry for observability, you can add a circuit breaker in five minutes.
Get started with RunGuard — or read about how loop detection works in detail, or compare with Langfuse and Arize Phoenix.