Weights & Biases Weave vs RunGuard: circuit breakers for production AI agents
Weights & Biases built Weave to be what W&B always was for ML training — a tracking and evaluation layer — now applied to LLM calls and agent traces. It integrates deeply with the W&B ecosystem: runs, artifacts, evaluations, and dashboards all live in the same platform you use for model training. What it shares with every other observability platform is the same architectural constraint: it records what happened, it does not interrupt what is happening. RunGuard is the interrupt. This page explains the distinction precisely and shows how the two tools compose.
W&B Weave: tracking-first LLM observability
Weave is W&B’s LLM observability and evaluation product. It instruments LLM calls via a @weave.op() decorator (Python) or a JS wrapper, groups them into call trees, and stores every input/output pair in the W&B cloud. From there you can:
- Browse call trees. Every decorated function call becomes a node in the Weave UI. You can see the full input, output, latency, and token count for each step in an agent run.
- Score calls with evaluators. Weave’s Evaluation class runs a dataset of examples through your application and attaches scorer functions (hallucination rate, task success, custom metrics) to each call. Results land in a W&B run that you can compare across versions.
- Version prompts and models. Weave integrates with W&B Artifacts, so prompt versions and model checkpoints are tracked alongside their call data. A/B comparing two prompt versions is a first-class operation.
- Monitor production traffic. With Weave’s sampling configuration, a fraction of live production calls are logged to W&B. You get a real-world distribution of inputs for your next evaluation run.
This is excellent infrastructure for the engineering workflow around LLM applications. It answers: “Is prompt version B better than A on this dataset?”, “What fraction of my agent’s calls are hallucinating?”, “What was the exact input to the model on the run that produced a bad output?”
It does not answer “How do I stop this run from making 25 identical calls?” because that question requires in-process, synchronous intervention — the opposite architectural direction from a write-side tracer.
Why @weave.op() does not stop loops
When you decorate a function with @weave.op(), Weave wraps it to capture the call inputs, outputs, and timing, then ships those to the W&B backend. The wrapped function still executes normally; the decorator adds a logging side-effect, not a gate.
If your agent calls web_search("AI regulations 2026") three times in a row with the same arguments because a pagination bug never advances it past page 1, Weave records three calls to web_search, all with the same inputs and the same outputs. You will see the pattern clearly in the Weave UI after the run completes. But the decorator has no access to the call history, no view of what the previous two calls returned, and no mechanism to raise an exception on the third occurrence.
The fix would require you to add loop-detection logic explicitly to the decorated function or its caller — which is exactly what RunGuard’s guard() wrapper does, as a pre-built, configurable primitive.
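The difference between a logging side effect and a gate can be shown with a small stand-in. This is a minimal sketch, not Weave's implementation: `trace_only`, `call_log`, and the simulated pagination bug are all hypothetical names invented for illustration.

```python
import functools

call_log = []  # stands in for a tracing backend


def trace_only(fn):
    """Hypothetical logging decorator: records each call, never blocks one."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)  # the wrapped function always runs
        call_log.append((fn.__name__, args, result))  # side effect only
        return result
    return wrapper


executions = 0


@trace_only
def web_search(query):
    global executions
    executions += 1
    # Simulated pagination bug: the result never advances past page 1.
    return {"page": 1, "query": query}


for _ in range(3):
    web_search("AI regulations 2026")

# All three identical calls executed and were recorded; nothing stopped them.
print(executions, len(call_log))
```

The wrapper never consults `call_log` before calling `fn`, so the record shows the repetition only after the fact.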
How RunGuard’s circuit breaker is different
- Window-based fingerprint tracking. guard(fn) maintains a deque of call signatures in memory. Each signature is a hash of (function_name, canonical_args, result_status_if_error). On each call, it checks the deque for repeats before delegating to the underlying function. If the same signature has appeared loop_threshold times in the last loop_window entries, it raises LoopDetectedError instead of calling the function.
- Budget preflight check. BudgetTracker accumulates estimated cost for the current run (using the model name and token counts you report, or RunGuard’s built-in price table). Before each call, it checks whether the accumulated cost plus the estimated cost of the next call would exceed max_usd. If so, it raises BudgetExceededError before any tokens are sent.
- No backend round-trip. Both checks run in-process, synchronously, before the function call. There is no HTTP request to a logging backend on the critical path. The overhead is a deque lookup and a comparison: microseconds, not milliseconds.
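The two checks described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not RunGuard's source: `fingerprint`, `LoopWindow`, `BudgetPreflight`, and both exception classes are hypothetical names, the signature omits the error-status component, and the cost estimate is supplied by the caller rather than looked up in a price table.

```python
import hashlib
import json
from collections import deque


class LoopDetected(Exception):
    """Hypothetical stand-in for a loop-detection error."""


class BudgetExceeded(Exception):
    """Hypothetical stand-in for a budget error."""


def fingerprint(name, args, kwargs):
    # Canonicalize arguments so equal calls hash to the same signature.
    payload = json.dumps([name, args, kwargs], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


class LoopWindow:
    def __init__(self, window=24, threshold=3):
        self.recent = deque(maxlen=window)  # oldest signatures drop off automatically
        self.threshold = threshold

    def check(self, signature):
        # Refuse the call if this signature already appears `threshold` times.
        if self.recent.count(signature) >= self.threshold:
            raise LoopDetected(signature)
        self.recent.append(signature)


class BudgetPreflight:
    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def check(self, estimated_usd):
        # Compare before spending: accumulated cost plus the next call's estimate.
        projected = self.spent_usd + estimated_usd
        if projected > self.max_usd:
            raise BudgetExceeded(f"{projected:.2f} > {self.max_usd:.2f}")

    def record(self, actual_usd):
        self.spent_usd += actual_usd
```

Under this reading of the threshold, three identical calls pass through `LoopWindow.check` and a fourth raises. Both checks touch only in-memory state, which is why no network request sits on the critical path.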
Using Weave and RunGuard together in Python
```python
# Both decorators stack cleanly: apply RunGuard first (innermost), Weave second (outermost)
import weave
from runguard import guard, BudgetTracker, LoopDetectedError, BudgetExceededError

weave.init("my-agent-project")
tracker = BudgetTracker(max_usd=3.0)


@weave.op()
@guard(budget=tracker, loop_window=24, loop_threshold=3)
async def web_search(query: str) -> dict:
    # Weave logs every call to W&B; RunGuard halts on the 3rd repeat
    # search_api is a placeholder for your own search client
    response = await search_api.search(query)
    return response


# In your agent loop:
async def run_agent() -> None:
    try:
        result = await web_search("AI regulations 2026")
    except LoopDetectedError as e:
        # Weave has the call tree up to the trip point in W&B
        # RunGuard stopped the run before call #4
        print(f"Loop detected: {e.signature} repeated {e.count} times")
    except BudgetExceededError as e:
        print(f"Budget exceeded: ${e.accumulated_usd:.2f} of ${e.max_usd:.2f}")
```
Decorator order matters: @guard is innermost (closest to the function) and @weave.op() is outermost, so each call enters Weave’s logging wrapper first and the guard’s fingerprint check runs inside that traced call, before the function body. When the guard raises, the exception propagates out through Weave’s wrapper, which records it as the call result. That is exactly right: you want the loop-detected event in your W&B trace.
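How an exception raised by an inner guard surfaces in an outer tracer can be checked with two stand-in decorators. This is a minimal sketch using synchronous functions: `tracer`, `blocking_guard`, and `events` are hypothetical names, and neither wrapper reproduces Weave's or RunGuard's actual code.

```python
import functools

events = []  # ordered record of what each wrapper did


class LoopDetected(Exception):
    """Hypothetical stand-in for a guard's loop error."""


def tracer(fn):
    """Hypothetical outer tracing wrapper."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append("trace:start")
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            events.append(f"trace:error:{type(exc).__name__}")
            raise  # record the failure, then let it propagate
        events.append("trace:ok")
        return result
    return wrapper


def blocking_guard(fn):
    """Hypothetical inner guard that always trips, to show the error path."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append("guard:check")
        raise LoopDetected("repeated signature")
    return wrapper


@tracer          # outermost: entered first, sees the exception on the way out
@blocking_guard  # innermost: decides whether the function body runs
def web_search(query):
    events.append("function:body")
    return {"query": query}


try:
    web_search("AI regulations 2026")
except LoopDetected:
    pass

print(events)  # ['trace:start', 'guard:check', 'trace:error:LoopDetected']
```

The function body never runs, yet the tracer still holds a record of the attempted call and the error that ended it.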
Side-by-side capability comparison
| Capability | W&B Weave | RunGuard |
|---|---|---|
| LLM call tracing and storage | Yes — @weave.op() decorator | No (use Weave for this) |
| Prompt version tracking | Yes — W&B Artifacts integration | No |
| Dataset-based evaluation | Yes — Weave Evaluations | No |
| Halt on repeated call signature | No | Yes — raises LoopDetectedError |
| Per-run USD budget enforcement | No (cost visibility only) | Yes — raises BudgetExceededError |
| Context-window overflow protection | No | Yes — ContextOverflowError at configurable threshold |
| Real-time Slack alert on circuit trip | Not on trip — W&B alerts are post-run | Yes — webhook fires synchronously on trip |
| Works without backend round-trip | No — logs go to W&B cloud | Yes — in-process only |
| Python support | Yes | Yes |
| TypeScript/JavaScript support | Yes (limited) | Yes — first-class |
The production case for layering both
Teams that use W&B for model training already have strong habits around experiment tracking and evaluation. Adding Weave to the LLM side of their stack is a natural extension — the same mental model, the same platform, the same dashboards their data science team already knows.
RunGuard fills the production-safety gap that Weave (by design) does not address. When a new agent version ships and hits an edge case that turns into a loop, Weave will have the trace data you need to understand and fix it. RunGuard will have stopped the loop at 3 iterations instead of 25, capping the cost and the blast radius.
For teams already deep in the W&B ecosystem, the layering pattern is: @guard innermost for safety, @weave.op() outer for observability. Both decorators see every call. The guard decides whether the call goes through; Weave logs whatever happens.
Add a circuit breaker to your Weave-instrumented agents
RunGuard takes five minutes to add to an existing Python or TypeScript agent. Set a dollar ceiling, set a loop threshold, and deploy. Your W&B traces get more informative (looping runs terminate early with a typed error); your cost surprises stop.
Get started with RunGuard — or compare it to Braintrust, Arize Phoenix, and Langfuse.