AI agent session replay and cost analytics: what replay tools do and what they don’t

Session replay for AI agents means recording the full sequence of LLM calls, tool invocations, context windows, and outputs from an agent run so that you can reconstruct and examine it later. Tools like Langfuse, Braintrust, Arize Phoenix, and Helicone all offer some form of this capability. It is genuinely useful for debugging, evaluating, and understanding cost patterns in your agents. It has one fundamental limitation: it is entirely retrospective. By the time you’re replaying a session to understand why it cost $47, the $47 has already been spent. This page is about the architectural gap between session replay and real-time cost enforcement — and how to use both effectively.

What session replay actually captures

A session replay for an AI agent captures:

Every LLM call. The exact prompt (system message, conversation history, tool definitions), the model selected, the parameters (temperature, max_tokens), the response, and the latency and token counts for each call.
Every tool invocation. Which tool was called, with what arguments, what it returned, how long it took, and whether it errored.
The full execution trace. The sequence and nesting of calls — which LLM call triggered which tool call, which tool result triggered which next LLM call. In platforms that use OpenTelemetry spans, this is a complete distributed trace.
Derived cost metrics. Total tokens in/out, estimated cost at the current model price, cost per tool call, cost per reasoning step.

From this data, you can answer questions like: “Why did this run cost $2.30 when I expected $0.15?”, “Which tool is responsible for most of our token spend?”, “At what point in the run did the context window start growing faster than expected?”
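As a rough illustration of how the second question can be answered from recorded data, the sketch below sums estimated spend per tool over a list of call records. The record fields, the model name, and the prices are all hypothetical placeholders and do not reflect any particular platform’s schema or any provider’s real pricing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices in USD per million tokens; real prices vary by provider and date.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass(frozen=True)
class LLMCallRecord:
    """One recorded LLM call, reduced to the fields needed for cost analysis."""
    model: str
    input_tokens: int
    output_tokens: int
    triggered_by_tool: Optional[str]  # tool whose result fed this call, if any


def call_cost_usd(rec: LLMCallRecord) -> float:
    """Estimate the cost of one call from its token counts."""
    price = PRICES_PER_MTOK[rec.model]
    total = rec.input_tokens * price["input"] + rec.output_tokens * price["output"]
    return total / 1_000_000


def spend_by_tool(records: list) -> dict:
    """Sum estimated cost per triggering tool across a completed run."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.triggered_by_tool or "(no tool)"] += call_cost_usd(rec)
    return dict(totals)


# A made-up trace: the context grows sharply once read_file results are appended.
trace = [
    LLMCallRecord("example-model", 1_000, 200, None),
    LLMCallRecord("example-model", 50_000, 300, "read_file"),
    LLMCallRecord("example-model", 52_000, 300, "read_file"),
]
totals = spend_by_tool(trace)
```

In this made-up trace nearly all of the spend lands on calls that followed read_file, which is the kind of pattern a replay view surfaces once the run has finished.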

These are the right questions to ask when you’re debugging a cost anomaly or optimizing your agent architecture. They are answered after the fact, by examining completed runs.

The gap replay tools cannot close

Every session replay platform has the same architectural constraint: the tool is a passive observer of execution. It receives events that have already happened — an LLM call that has already been made, a tool call that has already been executed — and records them to a backend. The very nature of logging is that the logged event is in the past.

This means replay tools cannot:

Stop a run that has reached your cost limit. Helicone’s cost dashboard updates as calls come in and can send you an alert when your daily spend crosses a threshold, but the alert arrives after the threshold-crossing call has already been made. The call that put you over budget has already been billed.
Halt a looping agent. If Langfuse is tracing an agent that has called read_file("config.yaml") seventeen times in the same run, Langfuse has an excellent trace showing all seventeen calls. The agent is on its eighteenth call. Langfuse cannot raise an exception that interrupts the eighteenth call, because its instrumentation only records what happened: the wrapper captures each call and hands the data to a background exporter, and nothing in that path evaluates the run’s history before the next call proceeds.
Enforce a per-run budget before it is exceeded. Braintrust can show you that a run cost $4.20 after it completes. It cannot raise BudgetExceededError when the accumulated cost reaches $3.00 during execution, because Braintrust’s SDK instruments calls with decorators that add observability, not gates.

The architectural reason is always the same: replay and observability platforms are designed to be non-intrusive. They are built to add zero risk of incorrect behavior; an instrumentation bug in Langfuse should never cause your agent to fail. The safest way to guarantee this is to make the instrumentation purely additive and non-blocking. But non-blocking instrumentation cannot enforce blocking constraints.
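The difference between additive instrumentation and a blocking constraint can be sketched with two small decorators. This is a minimal illustration, not the code of any platform named on this page: observe, gate, BudgetExceeded, and the fixed per-call cost estimate are all hypothetical names and simplifications.

```python
import functools
import queue


class BudgetExceeded(Exception):
    """Raised by the gate before a call that would exceed the budget."""


trace_queue = queue.Queue()  # stand-in for the buffer of an asynchronous exporter


def observe(fn):
    """Passive observer: records the outcome and never alters control flow."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        # The event is recorded only after the call has already happened.
        trace_queue.put({"fn": fn.__name__, "result": result})
        return result
    return wrapper


def gate(budget_usd, est_cost_usd):
    """Blocking gate: checks accumulated spend before the call is allowed to run."""
    spent = {"usd": 0.0}

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if spent["usd"] + est_cost_usd > budget_usd:
                raise BudgetExceeded(f"{fn.__name__} would exceed ${budget_usd:.2f}")
            spent["usd"] += est_cost_usd
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@observe
def observed_call():
    return "ok"


@gate(budget_usd=1.0, est_cost_usd=0.4)
def gated_call():
    return "ok"
```

The observer can only enqueue an event once the wrapped function has returned, while the gate decides before the function body runs. The price of that decision is that a bug in the gate can fail the call, which is exactly the risk observability tools are designed to avoid.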

Langfuse session replay: tracing, not enforcement

Langfuse provides one of the richest agent tracing UIs in the category. Its trace view shows a nested timeline of all LLM calls and tool calls, with token counts and cost estimates at every level. The session view groups multiple traces from the same user session. The dashboard aggregates cost trends across your entire application fleet.

Langfuse’s Python SDK uses a decorator (@observe) and a context manager that wrap functions with trace capture. The wrapping is transparent: the function executes normally, and the trace data is shipped to the Langfuse backend asynchronously on a background thread. This is the right design for an observability tool, because the trace exporter must not be in the critical path. But it means the SDK provides no mechanism for raising an exception inside your function based on accumulated run cost.

Helicone: proxy-layer cost visibility

Helicone intercepts your LLM API calls at the proxy level: you point your OpenAI client at https://oai.hconeai.com instead of the OpenAI API directly, and Helicone logs every call. This architecture captures cost data extremely reliably (it sees every API call regardless of what SDK or framework you’re using), but it makes the enforcement gap even more explicit: the proxy forwards each call first and logs it afterward. In this default logging mode it does not refuse to forward a call, because a proxy that rejects requests is no longer transparent; it becomes a layer that can break your application.

Helicone does offer a “Custom Rate Limits” feature that can block requests from specific users after they exceed a per-user cost threshold. This is enforced at the proxy layer and is genuinely a gate, not just logging. However, it applies at the user level (for multi-tenant applications) and is configured with static limits — it doesn’t understand your per-run budget or loop patterns at the individual agent-run level.
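To make the granularity point concrete, here is a minimal sketch of a gate keyed by user ID with a static limit. It is not Helicone’s implementation; PerUserLimit, simulate_runaway_run, and the dollar figures are hypothetical and exist only to show what a user-level limit can and cannot see.

```python
from collections import defaultdict


class PerUserLimit:
    """Hypothetical proxy-side gate keyed by user ID, with one static limit."""

    def __init__(self, max_usd_per_user: float) -> None:
        self.max_usd = max_usd_per_user
        self.spent = defaultdict(float)  # user_id -> accumulated USD

    def allow(self, user_id: str, est_cost_usd: float) -> bool:
        """Return True and record the spend if the user stays within the limit."""
        if self.spent[user_id] + est_cost_usd > self.max_usd:
            return False
        self.spent[user_id] += est_cost_usd
        return True


def simulate_runaway_run(limit: PerUserLimit, user_id: str,
                         calls: int, cost_each: float) -> int:
    """Count how many calls from a single agent run the per-user gate lets through."""
    return sum(1 for _ in range(calls) if limit.allow(user_id, cost_each))


limit = PerUserLimit(max_usd_per_user=10.0)
allowed = simulate_runaway_run(limit, "user-1", calls=18, cost_each=0.5)
```

With a $10.00 per-user limit, a single run that makes 18 calls at $0.50 each is allowed in full ($9.00 total), even if that run was only ever meant to spend $3.00. The gate has no run identifier to key on, so it cannot tell one runaway run from eighteen healthy ones.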

Braintrust: eval-focused, post-run cost analysis

Braintrust’s primary use case is evaluation — running your AI application against a benchmark dataset and measuring quality scores. Its logging SDK captures LLM calls during evaluation runs and in production, attaching cost metadata to each logged event. Braintrust’s “Experiments” feature makes it easy to see whether a change to your prompt or model improved quality while also increasing cost.

Like Langfuse, Braintrust’s SDK instruments calls asynchronously. Its Python @traced decorator ships trace data to the Braintrust backend in the background. It has no mechanism for pre-call cost checking, loop detection, or raising exceptions based on accumulated per-run spend.

Using session replay and RunGuard together

Replay tools and RunGuard solve different problems and compose cleanly. The typical pattern is to instrument your tool functions with RunGuard for enforcement, and with a replay tool for observability:

from langfuse.decorators import observe
from runguard import guard, BudgetTracker, LoopDetectedError

tracker = BudgetTracker(max_usd=4.0)

# @guard is applied first (innermost), so its checks run inside the span that @observe opens
# @observe is outermost, so it records the outcome of the guarded call, including exceptions
@observe()
@guard(budget=tracker, loop_window=20, loop_threshold=3)
async def run_code(snippet: str) -> dict:
    return await sandbox.execute(snippet)

# When a loop trips, Langfuse records the LoopDetectedError as the call outcome
# Your trace shows exactly where the agent stopped, which is the most useful debugging data
async def run_step(snippet: str) -> dict:
    # await and return are only valid inside a function, so the handler is wrapped in one
    try:
        return await run_code(snippet)
    except LoopDetectedError as e:
        # RunGuard already sent a Slack alert
        # Langfuse has the full trace up to the trip point
        return {"error": "run halted — loop detected", "details": str(e)}

Decorator order matters: @guard is innermost (applied to the function directly), so its checks run inside the span that the outer @observe wrapper has already opened. If the guard raises, the exception propagates outward through the @observe layer, which captures it as the trace outcome. Your Langfuse dashboard shows a failed span with the exception details, which is exactly the data you need to understand why the circuit tripped.
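The propagation path can be checked with stand-in decorators that need neither library. In this sketch observe_stub, guard_stub, and LoopDetected are hypothetical substitutes for the real @observe, @guard, and LoopDetectedError; the guard stub always trips so that the order of events is visible.

```python
import functools

events = []  # ordered log of what each layer did


class LoopDetected(Exception):
    """Stand-in for the guard's loop exception."""


def observe_stub(fn):
    """Outer layer: opens a span, then records success or failure of the inner call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append("observe: span opened")
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            events.append(f"observe: span failed with {type(exc).__name__}")
            raise  # the exception still reaches the caller
        events.append("observe: span closed")
        return result
    return wrapper


def guard_stub(fn):
    """Inner layer: runs its check before the function body and trips every time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append("guard: check")
        raise LoopDetected("simulated trip")
    return wrapper


@observe_stub   # outermost: applied last, entered first
@guard_stub     # innermost: applied first, entered second
def run_code(snippet):
    events.append("body: executed")
    return {"ok": True}


try:
    run_code("print(1)")
except LoopDetected:
    events.append("caller: handled LoopDetected")
```

The outer wrapper is entered first, the guard check runs inside it, the function body never executes, and the failure is recorded on the way out before the caller sees the exception. Reversing the two decorators would let the guard raise before any span exists, so the trip would not appear in the trace.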

What you get from each layer

Capability	Session replay tools (Langfuse, Braintrust, Helicone)	RunGuard
Full call trace for debugging	Yes — every LLM call, every tool call, nested	No (not its job)
Post-run cost analysis	Yes — per-call, per-session, per-app aggregates	No
Quality scoring and evaluation	Yes (Langfuse + Braintrust)	No
Per-run budget enforcement (pre-call)	No — all logging is post-call	Yes — raises BudgetExceededError pre-call
Loop detection and halt	No — observes loops, cannot stop them	Yes — raises LoopDetectedError on repeat pattern
Real-time Slack/PagerDuty alert on incident	Threshold alerts (with delay)	Yes — synchronous on circuit trip
Zero backend required (in-process)	No — sends data to cloud backend	Yes — all checks run in-process
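The “Loop detection and halt” row can be illustrated with a sliding-window check that runs in-process before each tool call. This is a sketch under assumptions, not RunGuard’s actual algorithm: LoopDetector and LoopDetected are hypothetical names, and the meanings given to window and threshold are guesses at what parameters like loop_window and loop_threshold might control.

```python
import json
from collections import deque


class LoopDetected(Exception):
    """Raised when the same tool call repeats too often within the window."""


class LoopDetector:
    """Tracks recent (tool, arguments) pairs and trips on repeated identical calls."""

    def __init__(self, window: int = 20, threshold: int = 3) -> None:
        self.recent = deque(maxlen=window)  # oldest entries fall off automatically
        self.threshold = threshold

    def check(self, tool_name: str, **kwargs) -> None:
        """Call before executing a tool; raises instead of letting the call proceed."""
        # Serialize the arguments deterministically so identical calls compare equal.
        key = (tool_name, json.dumps(kwargs, sort_keys=True, default=str))
        self.recent.append(key)
        count = sum(1 for seen in self.recent if seen == key)
        if count >= self.threshold:
            raise LoopDetected(
                f"{tool_name} called {count} times with identical arguments "
                f"in the last {len(self.recent)} calls"
            )


detector = LoopDetector(window=20, threshold=3)
detector.check("read_file", path="config.yaml")
detector.check("read_file", path="config.yaml")
detector.check("read_file", path="other.yaml")  # different arguments, not a repeat
```

With these settings, a third read_file call on config.yaml would raise LoopDetected before the tool executes. Because the check is synchronous and keeps its state in memory, it needs no backend, but it also sits in the critical path of every call.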

Why both layers are necessary in production

The argument for using both is simple: session replay tells you what went wrong; RunGuard limits how wrong it can get.

Without session replay, debugging a cost anomaly or quality regression requires you to reconstruct the run from logs, which is slow and often incomplete. With session replay, you have a full reproduction of every decision the agent made, including the exact inputs that caused the problem. That’s invaluable for fixing the underlying bug.

Without RunGuard, a looping run or a run that hits an edge case at 3am can consume your daily cost budget before anyone is awake to respond. Replay will give you a beautiful trace of the disaster after the fact. RunGuard would have stopped it at 3 loop iterations instead of 300, then paged you immediately.

The teams that ship AI agents safely in production use both layers: an observability platform for understanding historical behavior and debugging, and a runtime circuit breaker for limiting the blast radius of incidents that haven’t happened yet.

Add runtime enforcement to your observability stack

If you already have Langfuse, Braintrust, Arize, or Helicone instrumented in your agents, RunGuard adds the enforcement layer that your replay tools are architecturally unable to provide. Five minutes to install. No conflict with your existing instrumentation.

Get started with RunGuard — or compare it directly to Langfuse, Braintrust, and Arize Phoenix.