Weights & Biases Weave vs RunGuard: circuit breakers for production AI agents

Weights & Biases built Weave to be what W&B always was for ML training — a tracking and evaluation layer — now applied to LLM calls and agent traces. It integrates deeply with the W&B ecosystem: runs, artifacts, evaluations, and dashboards all live in the same platform you use for model training. What it shares with every other observability platform is the same architectural constraint: it records what happened, it does not interrupt what is happening. RunGuard is the interrupt. This page explains the distinction precisely and shows how the two tools compose.

W&B Weave: tracking-first LLM observability

Weave is W&B’s LLM observability and evaluation product. It instruments LLM calls via a @weave.op() decorator (Python) or a JS wrapper, groups them into call trees, and stores every input/output pair in the W&B cloud. From there you can:

Browse call trees. Every decorated function call becomes a node in the Weave UI. You can see the full input, output, latency, and token count for each step in an agent run.
Score calls with evaluators. Weave’s Evaluation class runs a dataset of examples through your application and attaches scorer functions (hallucination rate, task success, custom metrics) to each call. Results land in a W&B run that you can compare across versions.
Version prompts and models. Weave integrates with W&B Artifacts, so prompt versions and model checkpoints are tracked alongside their call data. A/B comparing two prompt versions is a first-class operation.
Monitor production traffic. With Weave’s sampling configuration, a fraction of live production calls are logged to W&B. You get a real-world distribution of inputs for your next evaluation run.

This is excellent infrastructure for the engineering workflow around LLM applications. It answers: “Is prompt version B better than A on this dataset?”, “What fraction of my agent’s calls are hallucinating?”, “What was the exact input to the model on the run that produced a bad output?”

It does not answer “How do I stop this run from making 25 identical calls?” That question requires in-process, synchronous intervention before each call executes, whereas a tracer only writes records of calls that have already run.

Why `@weave.op()` does not stop loops

When you decorate a function with @weave.op(), Weave wraps it to capture the call inputs, outputs, and timing, then ships those to the W&B backend. The wrapped function still executes normally; the decorator adds a logging side-effect, not a gate.

If your agent calls web_search("AI regulations 2026") three times in a row with the same arguments because a pagination bug caused it to never advance past page 1, Weave records three calls to web_search, all with the same inputs, all with the same outputs. You will see the pattern clearly in the Weave UI after the run completes. But each call is logged independently: the decorator does not consult the call history, does not compare the current call with the previous two, and has no mechanism to raise an exception when the pattern repeats.
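The pass-through behaviour described above can be illustrated with a minimal sketch. The trace_only decorator and the in-memory TRACE_LOG are hypothetical stand-ins written for this example; they are not Weave's implementation, and the web_search body is a placeholder rather than a real search call.

```python
import functools
import time

# Hypothetical in-memory stand-in for a tracing backend (not Weave's code).
TRACE_LOG = []

def trace_only(fn):
    """Record inputs, output, and latency of each call, then return the result unchanged."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)  # the wrapped function always runs
        TRACE_LOG.append({
            "name": fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "output": result,
            "latency_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@trace_only
def web_search(query: str, page: int = 1) -> dict:
    # Placeholder for a real search call; the simulated bug is that page never advances.
    return {"query": query, "page": page, "results": ["same result"]}

for _ in range(3):
    web_search("AI regulations 2026")

# Every call went through and was logged; nothing compared it with earlier calls.
identical = all(entry["args"] == TRACE_LOG[0]["args"] for entry in TRACE_LOG)
print(len(TRACE_LOG), identical)
```

The wrapper only appends to the log after the function has run, so by the time the repeat is visible in the data the call has already been made.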

The fix would require you to add loop-detection logic explicitly to the decorated function or its caller — which is exactly what RunGuard’s guard() wrapper does, as a pre-built, configurable primitive.

How RunGuard’s circuit breaker is different

Window-based fingerprint tracking. guard(fn) maintains a deque of call signatures in memory. Each signature is a hash of (function_name, canonical_args, result_status_if_error). On each call, it checks the deque for repeats before delegating to the underlying function. If the same signature has appeared loop_threshold times in the last loop_window entries, it raises LoopDetectedError instead of calling the function.
Budget preflight check. BudgetTracker accumulates estimated cost for the current run (using the model name and token counts you report, or RunGuard’s built-in price table). Before each call, it checks whether the accumulated cost plus the estimated cost of the next call would exceed max_usd. If so, it raises BudgetExceededError before any tokens are sent.
No backend round-trip. Both checks run in-process, synchronously, before the function call. There is no HTTP request to a logging backend on the critical path. The overhead is a deque lookup and a comparison — microseconds, not milliseconds.
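The two checks described above can be sketched in a few dozen lines of plain Python. This is an illustrative approximation, not RunGuard's source: the class and parameter names mirror the ones used on this page, while the signature helper, the preflight and record methods, and the est_usd_per_call parameter are assumptions made for the example. The sketch also fingerprints only the function name and arguments, leaving out the error-status component.

```python
import hashlib
import json
from collections import deque
from functools import wraps

class LoopDetectedError(RuntimeError):
    pass

class BudgetExceededError(RuntimeError):
    pass

class BudgetTracker:
    """Accumulates estimated spend and rejects a call that would exceed max_usd."""
    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.accumulated_usd = 0.0

    def preflight(self, estimated_usd: float) -> None:
        # Hypothetical method: runs before the call, so no tokens are sent on failure.
        if self.accumulated_usd + estimated_usd > self.max_usd:
            raise BudgetExceededError(
                f"{self.accumulated_usd + estimated_usd:.2f} > {self.max_usd:.2f} USD"
            )

    def record(self, actual_usd: float) -> None:
        self.accumulated_usd += actual_usd

def signature(name, args, kwargs) -> str:
    # Canonicalise arguments so equal calls hash equally regardless of kwarg order.
    canonical = json.dumps([name, args, kwargs], sort_keys=True, default=repr)
    return hashlib.sha256(canonical.encode()).hexdigest()

def guard(budget, loop_window=24, loop_threshold=3, est_usd_per_call=0.01):
    def decorate(fn):
        recent = deque(maxlen=loop_window)  # sliding window of recent signatures

        @wraps(fn)
        def wrapper(*args, **kwargs):
            sig = signature(fn.__name__, args, kwargs)
            seen = recent.count(sig)
            if seen >= loop_threshold:
                raise LoopDetectedError(f"{fn.__name__} repeated {seen} times")
            budget.preflight(est_usd_per_call)
            recent.append(sig)
            result = fn(*args, **kwargs)
            budget.record(est_usd_per_call)
            return result
        return wrapper
    return decorate

tracker = BudgetTracker(max_usd=3.0)
executed = []

@guard(budget=tracker, loop_window=24, loop_threshold=3)
def web_search(query):
    executed.append(query)
    return {"query": query}

tripped = False
for _ in range(25):
    try:
        web_search("AI regulations 2026")
    except LoopDetectedError:
        tripped = True
        break

print(len(executed), tripped)
```

With loop_threshold=3, three identical calls execute and the fourth attempt raises before the function body runs. Both checks touch only in-memory state, which is why no network request sits on the critical path.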

Using Weave and RunGuard together in Python

# Both decorators stack cleanly: apply RunGuard first (innermost), Weave second
import asyncio

import weave
from runguard import guard, BudgetTracker, LoopDetectedError, BudgetExceededError

weave.init("my-agent-project")

tracker = BudgetTracker(max_usd=3.0)

@weave.op()
@guard(budget=tracker, loop_window=24, loop_threshold=3)
async def web_search(query: str) -> dict:
    # Weave logs every call to W&B; RunGuard blocks the call once the same
    # signature has already appeared 3 times in the window
    # search_api is your application's own async search client (not defined here)
    response = await search_api.search(query)
    return response

# In your agent loop:
async def run_agent() -> None:
    try:
        result = await web_search("AI regulations 2026")
    except LoopDetectedError as e:
        # Weave has the call tree up to the trip point in W&B
        # RunGuard stopped the run before call #4
        print(f"Loop detected: {e.signature} repeated {e.count} times")
    except BudgetExceededError as e:
        print(f"Budget exceeded: ${e.accumulated_usd:.2f} of ${e.max_usd:.2f}")

asyncio.run(run_agent())

Decorator order matters. @guard is innermost (closest to the function), so Weave’s logging wrapper is entered first and RunGuard’s fingerprint check then runs before the underlying function is called. When the guard raises, the exception propagates out through Weave’s wrapper, which records it as the call result. That is exactly right: you want the loop-detected event in your W&B trace.
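How Python applies stacked decorators can be checked with a stand-alone sketch. The outer_logger and inner_guard decorators below are hypothetical stand-ins, not Weave or RunGuard code; the guard here always trips so that the order of events is easy to see.

```python
import functools

events = []

def outer_logger(fn):
    """Hypothetical stand-in for a tracing decorator applied outermost."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append("logger: call started")
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            events.append(f"logger: recorded {type(exc).__name__}")
            raise
        events.append("logger: recorded result")
        return result
    return wrapper

def inner_guard(fn):
    """Hypothetical stand-in for a guard applied innermost; it always trips."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append("guard: check failed")
        raise RuntimeError("loop detected")
    return wrapper

@outer_logger
@inner_guard
def web_search(query):
    events.append("function body ran")
    return {"query": query}

try:
    web_search("AI regulations 2026")
except RuntimeError:
    pass

print(events)
```

The outer wrapper is entered first, the inner guard raises before the function body runs, and the outer wrapper sees the exception on its way out. That is why the tripped call still appears in the trace.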

Side-by-side capability comparison

Capability	W&B Weave	RunGuard
LLM call tracing and storage	Yes — `@weave.op()` decorator	No (use Weave for this)
Prompt version tracking	Yes — W&B Artifacts integration	No
Dataset-based evaluation	Yes — Weave Evaluations	No
Halt on repeated call signature	No	Yes — raises LoopDetectedError
Per-run USD budget enforcement	No (cost visibility only)	Yes — raises BudgetExceededError
Context-window overflow protection	No	Yes — ContextOverflowError at configurable threshold
Real-time Slack alert on circuit trip	Not on trip — W&B alerts are post-run	Yes — webhook fires synchronously on trip
Works without backend round-trip	No — logs go to W&B cloud	Yes — in-process only
Python support	Yes	Yes
TypeScript/JavaScript support	Yes (limited)	Yes — first-class

The production case for layering both

Teams that use W&B for model training already have strong habits around experiment tracking and evaluation. Adding Weave to the LLM side of their stack is a natural extension — the same mental model, the same platform, the same dashboards their data science team already knows.

RunGuard fills the production-safety gap that Weave (by design) does not address. When a new agent version ships and hits an edge case that turns into a loop, Weave will have the trace data you need to understand and fix it. RunGuard will have stopped the loop at 3 iterations instead of 25, capping the cost and the blast radius.

For teams already deep in the W&B ecosystem, the layering pattern is: @guard innermost for safety, @weave.op() outer for observability. Both decorators see every call. The guard decides whether the call goes through; Weave logs whatever happens.

Add a circuit breaker to your Weave-instrumented agents

RunGuard takes five minutes to add to an existing Python or TypeScript agent. Set a dollar ceiling, set a loop threshold, and deploy. Your W&B traces get more informative (looping runs terminate early with a typed error); your cost surprises stop.

Get started with RunGuard — or compare it to Braintrust, Arize Phoenix, and Langfuse.