Braintrust evaluates your AI agent’s outputs. RunGuard stops the loop before it runs up the bill.
Braintrust is a developer-focused LLM evaluation platform. You instrument your agent with the Braintrust SDK — import braintrust from "braintrust", wrap your LLM calls in traced() or create an experiment via Eval(), and every span, every score, every prompt version, and every token count flows into Braintrust’s project dashboard. You can build offline eval datasets from production traces, run them through LLM-judge or code-based scorers, track quality regressions across prompt versions using Braintrust’s experiment comparison view, and set up online evaluation pipelines that sample live traffic and write scores back to the dataset. Braintrust’s prompt playground lets you diff prompt versions side by side, replay a production trace with a new system prompt, and compare outputs from different models against the same eval suite. For teams that ship AI features to production and need to know whether quality is regressing, Braintrust is a well-built instrument. What Braintrust is not is a runtime guardrail: there is no braintrust.checkBudgetBeforeCall() method that throws before the next client.messages.create goes out, no in-process loop detector that counts repeated tool-call signatures and trips a breaker at the third repetition, and no mechanism that halts the agent mid-run rather than scoring it afterward. Braintrust records and scores what happened; it does not decide whether the next step should happen. The gap matters — a lot — when your agent loops.
What Braintrust gives you
- Experiment tracking with automatic prompt versioning. Braintrust stores every prompt version you evaluate and computes a hash-based identifier for each, so you can compare “does prompt v3 score higher than prompt v2 on my golden dataset?” with a single view. The experiment comparison table shows you per-row score deltas: you see exactly which test cases regressed, not just the aggregate score. When a prompt change causes a quality drop on a subset of inputs (say, multi-tool research queries regress while single-tool queries improve), the row-level diff surfaces that signal. This is genuinely valuable for catching prompt regressions before they reach production users.
- Structured traces with span-level cost and latency. The Braintrust traced() wrapper and wrapOpenAI() / wrapAnthropic() helpers instrument every LLM call in your stack, capturing token counts, model names, latency per span, and the full input/output for each generation. Braintrust aggregates these into a project-level dashboard showing cost per day, cost per trace, and latency percentiles. The span-level detail is useful for debugging which part of a complex multi-step agent is slow or expensive: if your RAG retrieval step accounts for 60% of your per-trace cost because it’s running an embedding call per document rather than per query, that’s visible in the Braintrust trace tree.
- Offline and online eval datasets from production traces. Every trace Braintrust captures can be added to a dataset with one click from the trace view, or in bulk via a filter query. Those dataset rows become the inputs to your eval scorers: Python functions, LLM-judge prompts, or custom TypeScript scorers. You can run the same scorer against a baseline and a new model version side by side in a Braintrust experiment and get a p-value-informed comparison of whether the difference is statistically significant at your dataset size. For teams running A/B tests across model versions or prompt strategies, this eval-dataset feedback loop is the primary instrument for making decisions about which version ships.
- The Braintrust playground for interactive debugging. Braintrust’s playground lets you load a production trace, edit the system prompt or user message, swap the model, and re-run it in the browser to see whether a different configuration produces better output. You can run the edited configuration against your full eval dataset directly from the playground, turning a one-off trace debugging session into an experiment. The playground is genuinely interactive in ways that most eval tools are not: you get the scorer output inline next to the model output, so you can iterate on the prompt while seeing the quality signal in the same view.
- Score logging and human feedback collection. Braintrust exposes a currentSpan().log({ scores: { factuality: 0.9 } }) API for programmatic score writing from inside your traced functions, and a human review UI for manual labeling of production traces. The human feedback and programmatic scores both land in the same dataset, so you can build mixed training/eval datasets that combine LLM-judge scores with human-labeled examples. This is useful for closing the loop between production signal and offline eval: a trace that generated a user complaint is immediately labelable in Braintrust’s review queue, and the labeled example becomes an eval case for the next experiment.
- Braintrust is retrospective by architecture. Every feature above (experiment comparison, cost dashboards, trace-to-dataset conversion, playground replay, score logging) operates on data that already exists: a trace that has already been captured, a span that has already been flushed, an output that has already been generated. The traced() wrapper fires a span-start event before the wrapped function runs and a span-end event after it returns; it cannot intercept the return value and decide “this output looks like a loop, do not proceed.” That interception point does not exist in Braintrust’s architecture because eval platforms are designed to be non-intrusive: you do not want your scorer to add latency or failure modes to the production code path. The consequence is that Braintrust, like all eval platforms, is an excellent instrument for understanding what your agent did and whether it did it well. It does not overlap with a runtime circuit breaker, which determines whether the agent should take the next step at all.
The eval-vs-guardrail gap: what Braintrust cannot do for a looping agent
Imagine an AI research agent that calls a web search tool with the query "2026 AI safety summit keynotes", gets a 200-token result, reasons that it needs more detail, and calls the same search tool with the same query again. And again. The Braintrust trace for this run is illuminating: every traced() span is there, the tool call inputs and outputs are captured, and the cost-per-span meter ticks up with each generation. If you watch the Braintrust trace view in real time, you can see the loop forming — the same search query, the same tool, the same 200-token result, repeating. What the Braintrust trace view does not have is a braintrust.assertNoLoop() function you can call inside a traced() span that would have raised an exception before the fourth search call went out and prevented the fifth through tenth. The span lifecycle that Braintrust’s SDK defines is: start span → run wrapped function → log output → end span. There is no step between “start span” and “run wrapped function” where Braintrust reads the preceding span history and says “these three spans form a loop; I will halt the execution instead of running the function.” That pre-call decision point is the gap. For cost, the gap is the same: Braintrust’s cost accounting is computed from token counts that are recorded after each generation returns. There is no getCurrentRunCostUsd() synchronous function in the Braintrust SDK that you can call before the next client.messages.create to check whether the accumulated cost for this run has already exceeded your per-run cap. The cost data is there, on the server, after the flush — not in process, before the call. This means that a Braintrust-instrumented agent that loops can still run up a four-figure LLM bill in a weekend: Braintrust will record every dollar of it in precise per-span detail, and you will be able to see the full loop trace in the experiment view afterward. 
What you will not have had is a mechanism that prevented the loop from continuing past the third repetition.
What RunGuard adds to a Braintrust-instrumented stack
- A pre-call loop detector that fires before the next span starts. RunGuard’s LoopDetector maintains a sliding window of tool-call signatures (a 64-byte hash of the tool name plus a truncated slice of the tool input). Before each new call, it checks whether the most recent signatures form a cycle of length 1 through 8 that has repeated at least repeats times (default 3). If it finds one, it throws LoopDetectedError before the LLM API call goes out, and before the Braintrust span even starts. The Braintrust trace at the point of the error is complete as of the last successful call; the halted calls leave no spans, no cost, and no noise in the trace tree. The LoopDetectedError carries e.pattern (the exact tool-call sequence), e.repeats (how many times it cycled), and e.reason ("loop"), fields you can log into Braintrust as a score or metadata on the last span.
- A per-run budget cap that fires in-process, before the next generation. RunGuard’s BudgetTracker accumulates the usd values you pass after each generation (computed from response.usage and your per-token rate, the same number you already pass to Braintrust for cost tracking) and throws BudgetExceededError if the next call would push the run’s accumulated spend past maxUsd. This is a synchronous in-process check: no HTTP call, no SDK round-trip, no server-side lookup. It fires before client.messages.create, before the Braintrust span, before the token cost lands on your invoice. Braintrust’s per-run cost dashboard shows you what the run cost up to the cap; RunGuard is the mechanism that enforced the cap.
- Annotate Braintrust spans with RunGuard trip data. When RunGuard’s onTrip callback fires, you have access to the run’s Braintrust span context (currentSpan() is still active at the point of the throw if you wrap the guard in a try/catch inside a traced() function). Call braintrust.currentSpan().log({ scores: { runguard_trip: 0 }, metadata: { reason: e.reason, pattern: e.pattern, spent: e.spent } }) to annotate the Braintrust trace with the trip data. This means every Braintrust trace that ended with a RunGuard trip has a queryable score key: “show me all runs where runguard_trip = 0” returns your full loop and budget-exceeded history. The trace shows the call sequence; the score metadata shows what triggered the stop.
- Zero change to your Braintrust instrumentation. RunGuard wraps the innermost function that calls the LLM API, the same function where your traced() call or wrapOpenAI() wrapper lives. The correct composition is guard(traced(innerFn)): the guard is outermost, so it fires before the tracer creates a span. If the guard trips, no span is created for the blocked call. If the guard passes, traced(innerFn) runs exactly as it would without RunGuard. Your existing Braintrust dataset rows, scorer results, experiment comparisons, and dashboard metrics are unaffected. RunGuard adds two new fields to the return value of the guarded function (usd for the budget tracker and sig for the loop detector) and potentially throws two new typed exceptions. Everything else is identical.
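The cycle check described in the first bullet can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RunGuard’s source: LoopDetectorSketch, signature(), the SHA-256 hash, and the 256-character input slice are hypothetical names and values chosen for the example. Only the cycle lengths 1 through 8, the default of 3 repeats, and the pattern / repeats / reason fields come from the description above.

```python
import hashlib
from collections import deque


class LoopDetectedError(RuntimeError):
    """Carries the fields described above: pattern, repeats, reason."""

    def __init__(self, pattern, repeats):
        super().__init__(f"loop detected: {len(pattern)}-step cycle x{repeats}")
        self.pattern = pattern
        self.repeats = repeats
        self.reason = "loop"


def signature(tool_name, tool_input, input_prefix_len=256):
    """Hash the tool name plus a truncated slice of its input.

    Assumption: SHA-256 (64 hex characters) and a 256-character slice.
    """
    payload = f"{tool_name}\x00{tool_input[:input_prefix_len]}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class LoopDetectorSketch:
    def __init__(self, repeats=3, max_cycle_len=8):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # The window only needs to hold the longest cycle, repeated `repeats` times.
        self.window = deque(maxlen=repeats * max_cycle_len)

    def record(self, sig):
        """Call after each allowed tool call."""
        self.window.append(sig)

    def find_cycle(self):
        """Return the repeating pattern at the tail of the window, or None."""
        history = list(self.window)
        for cycle_len in range(1, self.max_cycle_len + 1):
            needed = cycle_len * self.repeats
            if len(history) < needed:
                break
            tail = history[-needed:]
            pattern = tail[:cycle_len]
            if all(tail[i] == pattern[i % cycle_len] for i in range(needed)):
                return pattern
        return None

    def check(self):
        """Call before each new tool call; raises if a cycle is present."""
        pattern = self.find_cycle()
        if pattern is not None:
            raise LoopDetectedError(pattern, self.repeats)


# Usage: the same search repeated three times trips the check before a fourth call.
detector = LoopDetectorSketch()
try:
    for _ in range(10):
        detector.check()
        detector.record(signature("web_search", "2026 AI safety summit keynotes"))
except LoopDetectedError as err:
    print(err.reason, err.repeats, len(err.pattern))
```

Because check() runs before record(), the detector trips on the attempt that follows the third identical signature, which matches the “before the fourth search call went out” framing earlier on this page.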
Side-by-side capability table
| Capability | Braintrust | RunGuard |
|---|---|---|
| Records every LLM call with input/output/tokens | Yes | No (RunGuard is not a tracer) |
| Blocks the next LLM call if loop detected | No | Yes |
| Per-run budget cap enforced before call | No (cost is post-hoc) | Yes |
| Offline eval datasets & LLM-judge scorers | Yes | No |
| Prompt versioning and playground | Yes | No |
| Context-window overflow alert before provider 400s | No | Yes |
| Experiment comparison across prompt versions | Yes | No |
| Works in-process, zero HTTP calls on the guard path | No (spans flushed to server) | Yes |
| Structured error with pattern / spent / reason fields | No | Yes |
| Human feedback & score annotation UI | Yes | No |
The table makes the architecture clear: Braintrust is observability and eval infrastructure; RunGuard is safety infrastructure. Neither product replaces the other. A stack without Braintrust lacks retrospective visibility into quality trends, prompt regressions, and cost patterns across thousands of runs. A stack without RunGuard lacks a pre-call gate that prevents the loop from starting run 101 through run 200 at 3 AM while the on-call engineer is asleep.
Integration: guard(traced(fn)) in TypeScript and Python
- TypeScript with Braintrust’s wrapOpenAI:

    import { guard } from "@runguard/sdk";
    import * as braintrust from "braintrust";
    import OpenAI from "openai";

    // wrapOpenAI makes every call through this client emit a Braintrust span.
    const bt = braintrust.wrapOpenAI(new OpenAI());

    async function callModel(input: string) {
      const response = await bt.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: input }],
      });
      // Rough cost estimate at a flat $2.50 per million tokens; substitute your real rates.
      const usd = ((response.usage?.total_tokens ?? 0) / 1_000_000) * 2.5;
      // Loop signature: the first tool name, or "end_turn" when no tool was called.
      const sig = response.choices[0].message.tool_calls?.[0]?.function.name ?? "end_turn";
      return { response, usd, sig };
    }

    const guardedFn = guard(callModel, {
      budget: { maxUsd: 5 },
      loop: { repeats: 3, maxCycleLen: 8 },
    });

    // Inside a Braintrust experiment or traced context:
    async function runAgent(userMessage: string) {
      return braintrust.traced(async () => {
        try {
          const { response } = await guardedFn(userMessage);
          // process response
          return response;
        } catch (e: any) {
          // Annotate the run-level span with the trip data, then re-throw.
          braintrust.currentSpan().log({
            scores: { runguard_trip: 0 },
            metadata: { reason: e.reason, pattern: e.pattern, spent: e.spent },
          });
          throw e;
        }
      });
    }

- Python with Braintrust’s traced decorator:

    import anthropic
    import braintrust
    from runguard import guard, LoopDetectedError, BudgetExceededError

    client = anthropic.Anthropic()

    def call_model(messages: list) -> dict:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=messages,
        )
        # Cost at $3 per million input tokens and $15 per million output tokens.
        usd = (response.usage.input_tokens * 3 + response.usage.output_tokens * 15) / 1_000_000
        # Loop signature: the first tool_use block's name (a text block may precede it).
        sig = next((b.name for b in response.content if b.type == "tool_use"), "end_turn")
        return {"response": response, "usd": usd, "sig": sig}

    guarded_call = guard(call_model, budget={"max_usd": 5}, loop={"repeats": 3})

    @braintrust.traced
    def run_agent(task: str):
        try:
            return guarded_call(messages=[{"role": "user", "content": task}])
        except (LoopDetectedError, BudgetExceededError) as e:
            # Not every trip error carries every field, so read them defensively.
            braintrust.current_span().log(
                scores={"runguard_trip": 0},
                metadata={
                    "reason": e.reason,
                    "pattern": getattr(e, "pattern", None),
                    "spent": getattr(e, "spent", None),
                },
            )
            raise
In both examples, the guard wraps the function that makes the LLM call, so a trip happens before that call (and any per-call span) starts. The try/catch sits inside the run-level braintrust.traced context, so braintrust.currentSpan() (braintrust.current_span() in Python) is still active when RunGuard trips, and the score annotation lands on the run’s Braintrust span. The Braintrust instrumentation captures every call that the guard allows through; RunGuard blocks every call that should not go out.
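To make the role of the two return fields concrete, here is a minimal Python sketch of what a guard-style wrapper could do with usd and sig. It is an illustration under assumptions, not the RunGuard implementation: guard_sketch and its error classes are hypothetical stand-ins, the loop check only looks for one signature repeated back to back, and the budget check trips once accumulated spend has reached the cap (the text above does not say how the real SDK estimates the cost of the next call).

```python
class BudgetExceededError(RuntimeError):
    def __init__(self, spent, max_usd):
        super().__init__(f"budget exceeded: ${spent:.4f} spent, cap ${max_usd:.2f}")
        self.spent = spent
        self.max_usd = max_usd
        self.reason = "budget"


class LoopDetectedError(RuntimeError):
    def __init__(self, pattern, repeats, spent):
        super().__init__(f"loop detected: {pattern} x{repeats}")
        self.pattern = pattern
        self.repeats = repeats
        self.spent = spent
        self.reason = "loop"


def guard_sketch(fn, max_usd, repeats=3):
    """Wrap fn so that both checks run in-process before fn is called.

    fn must return a dict containing "usd" (cost of the call) and
    "sig" (tool-call signature), as in the integration examples above.
    """
    state = {"spent": 0.0, "sigs": []}

    def guarded(*args, **kwargs):
        # Pre-call checks: a trip here means fn (and the API call inside it) never runs.
        if state["spent"] >= max_usd:
            raise BudgetExceededError(state["spent"], max_usd)
        tail = state["sigs"][-repeats:]
        if len(tail) == repeats and len(set(tail)) == 1:
            raise LoopDetectedError(tail[:1], repeats, state["spent"])

        result = fn(*args, **kwargs)

        # Post-call bookkeeping from the two fields the wrapped function returns.
        state["spent"] += result["usd"]
        state["sigs"].append(result["sig"])
        return result

    return guarded


# Usage with a stand-in model call that always asks for the same tool.
def fake_call_model(messages):
    return {"response": None, "usd": 0.01, "sig": "web_search"}


guarded = guard_sketch(fake_call_model, max_usd=5.0)
try:
    while True:
        guarded([{"role": "user", "content": "research task"}])
except (LoopDetectedError, BudgetExceededError) as err:
    print("tripped:", err.reason)
```

The point of the ordering is that the checks read only local state, so the decision to stop costs no network round-trip and happens before any span or token spend for the blocked call.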
The first loop our SDK caught was also caught by a Braintrust trace — the trace didn’t stop it
We built RunGuard while running a bespoke daily script that posts a six-tweet launch thread via the X API. The script had Braintrust-style logging (structured JSON to disk, the same retrospective tracing pattern). Session one came back HTTP 402 CreditsDepleted. Sessions two through six: same. Every session was fully recorded — timestamp, endpoint, payload, response code, error body. The log was a perfect trace. Looking at that log after session six, the loop was unmistakably obvious: same endpoint, same payload shape, same error, six consecutive entries. What the log did not have was the mechanism that would have read the loop signal before session seven’s call went out. At session seven we loaded the six-entry history into our LoopDetector on startup. It detected a length-1 cycle with depth 6, opened the breaker before any HTTP call was made, and exited cleanly with exit code 4. The seventh through twentieth sessions have all exited the same way: preflight detects the persisted history, breaker opens, zero new API calls, zero new cost. The Braintrust trace equivalent would have shown exactly the same six-entry loop in the span view. It also would not have stopped session seven. That is the eval-vs-guardrail gap in concrete form. Read the full incident writeup on the 30-day log.
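The startup preflight in that story can be sketched as follows. This is a hedged reconstruction, not the script we ran: the JSON-lines log format, the field names endpoint / status / error, the "POST /2/tweets" value, and the function names are assumptions made for the example. Only the 402 CreditsDepleted response, the six-entry history, and exit code 4 come from the account above.

```python
import json
import tempfile
from pathlib import Path

EXIT_BREAKER_OPEN = 4  # exit code used in the incident described above


def trailing_repeat_depth(signatures):
    """Count how many times the most recent signature repeats at the tail."""
    if not signatures:
        return 0
    depth = 0
    for sig in reversed(signatures):
        if sig != signatures[-1]:
            break
        depth += 1
    return depth


def preflight(log_path, repeats=3):
    """Return an exit code before any HTTP call is made.

    Reads the persisted session log (assumed: one JSON object per line) and
    opens the breaker if the last `repeats` sessions failed identically.
    """
    path = Path(log_path)
    if not path.exists():
        return 0
    signatures = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        # Hypothetical signature: same endpoint, same status, same error.
        signatures.append((entry["endpoint"], entry["status"], entry["error"]))
    if trailing_repeat_depth(signatures) >= repeats:
        return EXIT_BREAKER_OPEN
    return 0


def demo():
    """Write six identical failures to a temporary log and run the preflight."""
    entry = {"endpoint": "POST /2/tweets", "status": 402, "error": "CreditsDepleted"}
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "sessions.jsonl"
        log.write_text("\n".join(json.dumps(entry) for _ in range(6)), encoding="utf-8")
        return preflight(log)
```

In a real script the entry point would call something like sys.exit(preflight("sessions.jsonl")) before constructing any API client, so a tripped breaker means zero new requests for that session.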
What this is not
- Not a Braintrust replacement for eval, scoring, or prompt management. RunGuard has no experiment runner, no eval dataset, no LLM-judge scorer, no prompt playground, no latency dashboard, and no human feedback UI. It is a two-primitive in-process guard: a cumulative budget accumulator and a tool-call signature ring buffer. If you need retrospective visibility, eval workflows, or prompt versioning, use Braintrust. If you need a runtime circuit breaker, add RunGuard. The requirements do not overlap.
- Not a proxy-based cost enforcement layer. RunGuard is not a network proxy that intercepts HTTP requests to your LLM provider and blocks them if a cost threshold is exceeded. It is an SDK you install with npm i @runguard/sdk or pip install runguard. The circuit-breaker logic runs entirely in your process, with no HTTP calls from the guard path itself. This means RunGuard works with any LLM provider, any network configuration, and any deployment environment that can run JavaScript or Python, without requiring you to route your traffic through a third-party proxy.
- Not a replacement for Braintrust’s online evaluation pipelines. Braintrust’s online eval pipelines sample live traffic, run your scorers against sampled runs, and write scores back to your dataset. This is useful for detecting slow quality regressions that only appear at scale: a new model version that scores marginally worse on 5% of queries is hard to detect in a small eval set but visible after 10,000 production samples. RunGuard’s circuit breaker fires on per-run patterns (loop signatures and budget) that are detectable from the first occurrence. The two instruments catch different failure modes: Braintrust’s online eval catches gradual quality drift at fleet scale; RunGuard catches catastrophic runaway behavior at individual run scale.
- Not a cloud service or SaaS dependency in the guard path. RunGuard ships as a zero-dependency TypeScript package and a zero-dependency Python package. The LoopDetector and BudgetTracker are pure data structures. No API key, no network call, no account required to use the guard primitives. You do need a RunGuard account for the dashboard (30 days of trip events per app, Slack and PagerDuty alerts), but the circuit breaker itself works without the dashboard, and a tripped breaker costs you nothing even if the dashboard is unreachable.
When to use both, and when each alone is sufficient
Use Braintrust without RunGuard when: your agent makes a bounded number of LLM calls per run (a chatbot that replies to each user message in a single generation, a batch classification job that processes one document per invocation), the cost per run is negligible (sub-cent runs where a loop would triple the cost to three cents), and your primary risk is quality regression rather than runaway spend. Braintrust’s eval and tracing layer is the right instrument for that risk profile.
Use RunGuard without Braintrust when: you are shipping a prototype, cost control matters more than eval quality tracking, and you need the loop and budget guard in place before you have time to instrument a full eval pipeline. RunGuard’s SDK integration is one function wrap and two return fields; you can add it in fifteen minutes without any account setup.
Use both when: your agent makes a variable number of LLM calls per run (a research agent, a coding agent, a browser-use agent), the cost per run varies widely (some runs are $0.10, a loop run could be $40), you care about both quality over time (Braintrust’s eval layer) and catastrophic single-run behavior (RunGuard’s circuit breaker), and you want the runguard_trip score annotation in your Braintrust dataset so you can study which input patterns cause loops. Most teams shipping autonomous agents in production land in the third category.
The Langfuse alternative page and the LangSmith alternative page cover the same eval-vs-guardrail gap for the other two major observability platforms. The loop detection fundamentals page explains the underlying detection algorithm in detail. RunGuard ships as @runguard/sdk on npm and runguard on PyPI; the full API is documented in llms.txt.