Braintrust evaluates your AI agent’s outputs. RunGuard stops the loop before it runs up the bill.

Braintrust is a developer-focused LLM evaluation platform. You instrument your agent with the Braintrust SDK: import braintrust from "braintrust", wrap your LLM calls in traced() or create an experiment via Eval(), and every span, score, prompt version, and token count flows into Braintrust’s project dashboard. You can build offline eval datasets from production traces, run them through LLM-judge or code-based scorers, track quality regressions across prompt versions using Braintrust’s experiment comparison view, and set up online evaluation pipelines that sample live traffic and write scores back to the dataset.

Braintrust’s prompt playground lets you diff prompt versions side by side, replay a production trace with a new system prompt, and compare outputs from different models against the same eval suite. For teams that ship AI features to production and need to know whether quality is regressing, Braintrust is a well-built instrument.

Braintrust is not, however, a runtime guardrail. There is no braintrust.checkBudgetBeforeCall() method that throws before the next client.messages.create goes out, no in-process loop detector that counts repeated tool-call signatures and trips a breaker at the third repetition, and no mechanism that halts the agent mid-run rather than scoring it afterward. Braintrust records and scores what happened; it does not decide whether the next step should happen. That gap matters a great deal when your agent loops.

What Braintrust gives you

The eval-vs-guardrail gap: what Braintrust cannot do for a looping agent

Imagine an AI research agent that calls a web search tool with the query "2026 AI safety summit keynotes", gets a 200-token result, reasons that it needs more detail, and calls the same search tool with the same query again. And again. The Braintrust trace for this run is illuminating: every traced() span is there, the tool call inputs and outputs are captured, and the cost-per-span meter ticks up with each generation. If you watch the Braintrust trace view in real time, you can see the loop forming: the same search query, the same tool, the same 200-token result, repeating.

What Braintrust does not have is a braintrust.assertNoLoop() function that you could call inside a traced() span to raise an exception before the fourth search call goes out, which would also prevent the fifth through tenth. The span lifecycle that Braintrust’s SDK defines is: start span → run wrapped function → log output → end span. There is no step between “start span” and “run wrapped function” where Braintrust reads the preceding span history and says “these three spans form a loop; I will halt the execution instead of running the function.” That pre-call decision point is the gap.

For cost, the gap is the same. Braintrust’s cost accounting is computed from token counts that are recorded after each generation returns. There is no synchronous getCurrentRunCostUsd() function in the Braintrust SDK that you can call before the next client.messages.create to check whether the accumulated cost for this run has already exceeded your per-run cap. The cost data is there, on the server, after the flush; it is not available in process, before the call.

This means that a Braintrust-instrumented agent that loops can still run up a four-figure LLM bill in a weekend. Braintrust will record every dollar of it in precise per-span detail, and you will be able to see the full loop trace in the experiment view afterward. What you will not have had is a mechanism that prevented the loop from continuing past the third repetition.
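The missing pre-call decision point can be illustrated with a minimal Python sketch. It assumes a loop is defined as the same tool-call signature repeating back to back; the names SignatureLoopGuard, LoopTripped, and max_repeats are hypothetical and belong to neither the Braintrust SDK nor the RunGuard SDK.

```python
import hashlib
import json
from collections import deque


class LoopTripped(Exception):
    """Raised before a call goes out when the same signature keeps repeating."""

    def __init__(self, tool: str, repeats: int):
        super().__init__(f"{tool} repeated {repeats} times with identical arguments")
        self.tool = tool
        self.repeats = repeats


class SignatureLoopGuard:
    """Hypothetical in-process gate; an illustration, not a real SDK class."""

    def __init__(self, max_repeats: int = 3, window: int = 20):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # signatures of calls already allowed

    @staticmethod
    def signature(tool: str, args: dict) -> str:
        # Stable hash of the tool name plus its arguments.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def check(self, tool: str, args: dict) -> None:
        """Call this BEFORE the tool or model call; raises instead of recording."""
        sig = self.signature(tool, args)
        streak = 0
        for past in reversed(self.recent):  # count identical calls at the tail
            if past != sig:
                break
            streak += 1
        if streak >= self.max_repeats:
            raise LoopTripped(tool, streak)
        self.recent.append(sig)
```

With max_repeats=3, three identical searches are allowed through and the fourth raises LoopTripped before any request is made, which is the behavior the trace view alone cannot provide.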

What RunGuard adds to a Braintrust-instrumented stack

Side-by-side capability table

| Capability | Braintrust | RunGuard |
| --- | --- | --- |
| Records every LLM call with input/output/tokens | Yes | No (RunGuard is not a tracer) |
| Blocks the next LLM call if loop detected | No | Yes |
| Per-run budget cap enforced before call | No (cost is post-hoc) | Yes |
| Offline eval datasets & LLM-judge scorers | Yes | No |
| Prompt versioning and playground | Yes | No |
| Context-window overflow alert before provider 400s | No | Yes |
| Experiment comparison across prompt versions | Yes | No |
| Works in-process, zero HTTP calls on the guard path | No (spans flushed to server) | Yes |
| Structured error with pattern / spent / reason fields | No | Yes |
| Human feedback & score annotation UI | Yes | No |

The table makes the architecture clear: Braintrust is observability and eval infrastructure; RunGuard is safety infrastructure. Neither product replaces the other. A stack without Braintrust lacks retrospective visibility into quality trends, prompt regressions, and cost patterns across thousands of runs. A stack without RunGuard lacks a pre-call gate that stops a looping agent from going on to run 101 through run 200 at 3 AM while the on-call engineer is asleep.
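Two rows of the table, the per-run budget cap enforced before the call and the structured error with pattern / spent / reason fields, can be sketched in a few lines of Python. This is an illustration under assumptions: the class names RunBudget and GuardTrip, the check_before_call and record methods, and the "budget_exceeded" reason string are hypothetical, and only the three field names come from the table.

```python
from typing import Optional


class GuardTrip(Exception):
    """Structured error carrying the pattern / spent / reason fields."""

    def __init__(self, reason: str, spent: float, pattern: Optional[str] = None):
        super().__init__(f"{reason} (spent=${spent:.2f})")
        self.reason = reason    # why the guard tripped
        self.spent = spent      # USD accumulated in this run so far
        self.pattern = pattern  # loop pattern, if the trip was loop-related


class RunBudget:
    """Hypothetical in-process budget; no HTTP calls on the check path."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def check_before_call(self) -> None:
        # Runs synchronously before the next model call is issued.
        if self.spent_usd >= self.cap_usd:
            raise GuardTrip(reason="budget_exceeded", spent=self.spent_usd)

    def record(self, cost_usd: float) -> None:
        # Runs after a call returns, using whatever cost figure the caller computed.
        self.spent_usd += cost_usd
```

The design point is that the running total lives in the agent's own process, so the check costs a comparison rather than a round trip to a tracing server.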

Integration: guard(traced(fn)) in TypeScript and Python

In both the TypeScript and the Python integration, the guard is the outermost wrapper. braintrust.traced / braintrust.currentSpan() are inside the try/catch, so a trip by RunGuard still lands the score annotation on the last Braintrust span. The Braintrust instrumentation on the callModel / call_model function records every call that the guard allows through; RunGuard stops every call that should be blocked.
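The wrapper ordering can be illustrated with a self-contained Python sketch. It deliberately imports neither SDK: traced, guard, SPANS, and GuardTripped below are stand-ins written for this example, and a simple call cap stands in for real loop and budget checks. Only the names call_model and runguard_trip come from this page.

```python
import functools

SPANS = []  # stand-in for a tracing backend; not Braintrust's span store


def traced(fn):
    """Stand-in tracer: start span, run wrapped function, log output."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {"name": fn.__name__, "input": args, "scores": {}}
        SPANS.append(span)
        span["output"] = fn(*args, **kwargs)
        return span["output"]
    return wrapper


class GuardTripped(Exception):
    """Raised by the stand-in guard instead of letting a call through."""


def guard(fn, max_calls):
    """Stand-in guard: the outermost wrapper, deciding before fn runs."""
    calls = 0

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        nonlocal calls
        if calls >= max_calls:
            if SPANS:
                # Annotate the last recorded span so the trip shows up in the trace.
                SPANS[-1]["scores"]["runguard_trip"] = 1
            raise GuardTripped(f"blocked call {calls + 1}: cap is {max_calls}")
        calls += 1
        return fn(*args, **kwargs)
    return wrapper


def call_model(prompt):
    # Placeholder for a real model call.
    return f"echo: {prompt}"


guarded_call_model = guard(traced(call_model), max_calls=2)
```

Because the guard wraps the tracer, a blocked call never opens a span; the trace contains only the calls that were allowed, plus the trip annotation on the last of them.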

The first loop our SDK caught was also caught by a Braintrust trace — the trace didn’t stop it

We built RunGuard while running a bespoke daily script that posts a six-tweet launch thread via the X API. The script had Braintrust-style logging (structured JSON to disk, the same retrospective tracing pattern). Session one came back HTTP 402 CreditsDepleted. Sessions two through six: same. Every session was fully recorded: timestamp, endpoint, payload, response code, error body. The log was a perfect trace. Looking at that log after session six, the loop was unmistakable: same endpoint, same payload shape, same error, six consecutive entries. What the log did not have was a mechanism that would have read the loop signal before session seven’s call went out.

At session seven we loaded the six-entry history into our LoopDetector on startup. It detected a length-1 cycle with depth 6, opened the breaker before any HTTP call was made, and exited cleanly with exit code 4. The seventh through twentieth sessions have all exited the same way: preflight detects the persisted history, breaker opens, zero new API calls, zero new cost. The Braintrust trace equivalent would have shown exactly the same six-entry loop in the span view. It also would not have stopped session seven. That is the eval-vs-guardrail gap in concrete form. Read the full incident writeup on the 30-day log.
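The preflight in this incident can be approximated with a short Python sketch. It is not the LoopDetector implementation: detect_tail_cycle, preflight, the max_period and min_depth parameters, and the shape of the history entries are assumptions made for illustration. Only the exit code 4 and the "length-1 cycle with depth 6" result are taken from the incident.

```python
from typing import Hashable, Optional, Sequence, Tuple

EXIT_LOOP_DETECTED = 4  # exit code reported in the incident above


def detect_tail_cycle(
    history: Sequence[Hashable], max_period: int = 4, min_depth: int = 3
) -> Optional[Tuple[int, int]]:
    """Return (cycle length, depth) if the history ends in a repeating pattern."""
    for period in range(1, max_period + 1):
        if period > len(history):
            break
        pattern = list(history[-period:])
        depth = 0
        end = len(history)
        # Walk backwards, counting consecutive copies of the tail pattern.
        while end - period >= 0 and list(history[end - period:end]) == pattern:
            depth += 1
            end -= period
        if depth >= min_depth:
            return period, depth
    return None


def preflight(history: Sequence[Hashable]) -> int:
    """Decide from persisted history alone; no network call is made here."""
    return EXIT_LOOP_DETECTED if detect_tail_cycle(history) else 0
```

Fed six identical entries, for example six copies of a hypothetical ("post_thread", 402, "CreditsDepleted") tuple, detect_tail_cycle reports a cycle of length 1 with depth 6 and preflight returns 4.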

What this is not

When to use both, and when each alone is sufficient

Use Braintrust without RunGuard when: your agent makes a bounded number of LLM calls per run (a chatbot that replies to each user message in a single generation, a batch classification job that processes one document per invocation), the cost per run is negligible (sub-cent runs where a loop would triple the cost to three cents), and your primary risk is quality regression rather than runaway spend. Braintrust’s eval and tracing layer is the right instrument for that risk profile.

Use RunGuard without Braintrust when: you are shipping a prototype, cost control matters more than eval quality tracking, and you need the loop and budget guard in place before you have time to instrument a full eval pipeline. RunGuard’s SDK integration is one function wrap and two return fields; you can add it in fifteen minutes without any account setup.

Use both when: your agent makes a variable number of LLM calls per run (a research agent, a coding agent, a browser-use agent), the cost per run varies widely (some runs are $0.10, a loop run could be $40), you care about both quality over time (Braintrust’s eval layer) and catastrophic single-run behavior (RunGuard’s circuit breaker), and you want the runguard_trip score annotation in your Braintrust dataset so you can study which input patterns cause loops. Most teams shipping autonomous agents in production land in the third category.

The Langfuse alternative page and the LangSmith alternative page cover the same eval-vs-guardrail gap for the other two major observability platforms. The loop detection fundamentals page explains the underlying detection algorithm in detail. RunGuard ships as @runguard/sdk on npm and runguard on PyPI; the full API is documented in llms.txt.