Arize Phoenix vs RunGuard: real-time guardrails for AI agents

Arize Phoenix is one of the best LLM observability and evaluation platforms in the ecosystem. It captures every span, scores outputs with LLM-as-a-judge, and surfaces regressions across experiments. What it cannot do is reach inside a running agent and stop it before the fifteenth redundant tool call goes out. That gap is what RunGuard fills. This page is a precise account of what each product does, where the line is drawn, and how to run them together in a production agent stack.

What Arize Phoenix actually does

Arize Phoenix is an open-source LLM observability platform built around the OpenTelemetry trace model. It instruments your LLM calls via OpenInference auto-instrumentation, collects spans into a local or cloud-hosted Phoenix server, and lets you browse traces, attach evaluators (hallucination scores, relevance scores, custom LLM-judge prompts), run experiments across prompt versions, and set up dataset-based regression tests.

The core primitive is a span. Every LLM call, retrieval, and tool invocation becomes a span. Phoenix assembles those spans into a trace tree. The tree is the unit of analysis: you can see which tool was called, in what order, with what arguments, and what the model returned at each step.
Evaluation is retrospective. Evaluators run after a trace is complete. You tag a trace with a hallucination score, a toxicity score, or a custom rubric. The score lives on the finished trace, where it can feed a dataset or trigger a webhook to a CI pipeline. It is not a guardrail that fires mid-run.
Experiments are for prompt engineering. Phoenix’s experiment runner lets you replay a dataset of inputs against different prompt versions and compare scores. This is the right tool for offline quality regression. It assumes you have a finished trace to evaluate; a looping agent does not produce one until it hits an iteration cap or crashes.
Alerts are post-hoc. Phoenix’s alerting (via integrations with Slack, PagerDuty, or webhooks) fires on collected traces that cross an evaluation threshold. By the time the alert fires, the run that triggered it has already completed — including any runaway cost it incurred.
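
To make the span-and-trace model concrete, here is a minimal Python sketch of how flat spans could be assembled into a tree and scored only once the run has finished. The Span fields, build_tree, and error_rate are illustrative names chosen for this page, not Phoenix's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (real OTel spans carry far more).
    span_id: str
    parent_id: Optional[str]
    name: str
    status: str = "ok"
    children: list["Span"] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Link each span to its parent and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def error_rate(root: Span) -> float:
    """Post-hoc evaluator: fraction of spans in the finished trace that errored."""
    stack, total, errors = [root], 0, 0
    while stack:
        node = stack.pop()
        total += 1
        errors += node.status == "error"
        stack.extend(node.children)
    return errors / total


spans = [
    Span("1", None, "agent_run"),
    Span("2", "1", "llm_call"),
    Span("3", "1", "tool:web_search", status="error"),
    Span("4", "1", "tool:web_search", status="error"),
]
root = build_tree(spans)
score = error_rate(root)
```

The evaluator only has something to walk once every span has been collected, which is why a score of this kind describes a run rather than interrupting it.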

The observability gap: why traces don't stop loops

The architectural reason traces cannot stop loops is fundamental, not a product limitation of Arize. Tracing is a write-side operation: your application emits spans, the collector receives them. The collector has no callback channel into the running agent. There is no mechanism by which Phoenix can reach into your process, inspect the in-flight call stack, and raise an exception before the next tool.run() executes.

Consider the common agentic loop shape: a tool call fails with a 429, the model observes the failure and emits the same call again, the result is the same 429, the model tries once more. Phoenix will faithfully record all three spans. If you set an alert for “tool error rate > 50% in last 10 minutes”, that alert might fire, but after the tenth retry, not after the second. The trace for a 25-iteration loop (many agent frameworks cap iterations somewhere in this range) will have 25 tool spans, all dutifully stored in Phoenix, and the alert fires when the run finally halts or crashes.
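
The timing difference can be shown with a small simulation. This is a sketch under stated assumptions: flaky_tool, the 25-iteration cap, and both check functions are hypothetical stand-ins, not Phoenix or RunGuard code.

```python
def flaky_tool(_query: str) -> int:
    """Stand-in for a tool that is rate limited on every call."""
    return 429


def run_unguarded(max_iter: int = 25) -> list[int]:
    """The agent retries the same call until the iteration cap is reached."""
    statuses = []
    for _ in range(max_iter):
        statuses.append(flaky_tool("refund policy"))
    return statuses


def post_hoc_alert(statuses: list[int], threshold: float = 0.5) -> bool:
    """Evaluates a finished run; it can report on it but not interrupt it."""
    errors = sum(1 for s in statuses if s >= 400)
    return errors / len(statuses) > threshold


def run_with_inline_check(max_iter: int = 25, max_repeats: int = 3) -> list[int]:
    """Same loop, but a check runs between calls, inside the process."""
    statuses = []
    for _ in range(max_iter):
        tail = statuses[-max_repeats:]
        # Stop once the last max_repeats calls all returned the same error.
        if len(tail) == max_repeats and len(set(tail)) == 1 and tail[-1] >= 400:
            break
        statuses.append(flaky_tool("refund policy"))
    return statuses


unguarded = run_unguarded()
alert_fired = post_hoc_alert(unguarded)
guarded = run_with_inline_check()
```

In this toy run the post-hoc alert does fire, but only over a list that already holds all 25 failed calls, while the inline check ends the loop after three.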

This is not a knock on Arize. Traces are invaluable for understanding what happened. They are the wrong abstraction for preventing what is happening.

What RunGuard actually does

In-process, synchronous interception. RunGuard’s guard() wrapper runs inside your process, on the call thread, before the wrapped function executes. When the circuit trips, it raises a typed exception (LoopDetectedError, BudgetExceededError, or ContextOverflowError) synchronously — no async round-trip to a collector, no polling interval to wait for.
Pattern-based fingerprinting, not score thresholds. Loop detection works by hashing (tool_name, canonical_args, result_status) into a signature and watching for that signature repeating in a sliding window. No LLM judge is involved. The decision is deterministic and sub-millisecond.
Per-run budget accounting. BudgetTracker accumulates estimated cost across every guarded call in a single agent run and raises before the next call if the ceiling is already breached. This is per-run accounting, not fleet-level rate limiting.
Context-window proximity alerts. ContextGuard compares the running token count to the model’s known context limit and raises ContextOverflowError when you hit a configurable threshold (default: 85% of the model’s max tokens). Silent truncation, the bug that degrades output quality without a visible error, is caught before the model receives a truncated prompt.
Framework-agnostic wrapping. Any Python callable or TypeScript async function can be wrapped. There is no CrewAI-specific callback, no LangChain-specific hook. The same guard() primitive works across every framework the agent touches.
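
The fingerprinting idea described above can be sketched in a few lines of Python. This is not RunGuard's implementation: the LoopDetector class, the LoopDetected exception, the choice of SHA-256, and the JSON canonicalization are all assumptions made for illustration.

```python
import hashlib
import json
from collections import deque


class LoopDetected(Exception):
    """Hypothetical exception carrying the repeated signature and its count."""

    def __init__(self, signature: str, count: int):
        super().__init__(f"signature {signature[:8]} seen {count} times")
        self.signature = signature
        self.count = count


class LoopDetector:
    def __init__(self, window: int = 20, threshold: int = 3):
        # Only the most recent `window` signatures are remembered.
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    @staticmethod
    def signature(tool_name: str, args: dict, result_status) -> str:
        # Sorting keys makes the hash independent of argument order.
        canonical = json.dumps(args, sort_keys=True, separators=(",", ":"), default=str)
        payload = f"{tool_name}|{canonical}|{result_status}"
        return hashlib.sha256(payload.encode()).hexdigest()

    def record(self, tool_name: str, args: dict, result_status) -> None:
        """Record one completed call; raise if its signature repeats too often."""
        sig = self.signature(tool_name, args, result_status)
        self.recent.append(sig)
        count = self.recent.count(sig)
        if count >= self.threshold:
            raise LoopDetected(sig, count)


detector = LoopDetector(window=20, threshold=3)
calls_made = 0
tripped = None
try:
    for _ in range(25):
        calls_made += 1  # the tool call itself would happen here
        detector.record("fetch_page", {"query": "refund policy", "token": None}, 200)
except LoopDetected as exc:
    tripped = exc
```

Because the decision is a hash comparison over a bounded window, it needs no model call and gives the same answer every time for the same sequence of calls.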

Layering Arize Phoenix and RunGuard together

The right production setup uses both: RunGuard prevents catastrophic runs from completing; Arize Phoenix makes the prevented (and completed) runs understandable. Here is a TypeScript example of the two working together:

// Install: npm install runguard openai @opentelemetry/instrumentation @arizeai/openinference-instrumentation-openai
import { OpenAI } from "openai";
import { guard, BudgetTracker } from "runguard";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { OpenAIInstrumentation } from "@arizeai/openinference-instrumentation-openai";

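// 1. Register OpenInference instrumentation for OpenAI calls.
//    Spans reach Phoenix only if a tracer provider with an OTLP exporter
//    pointed at the Phoenix collector is also registered (setup not shown here).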
registerInstrumentations({
  instrumentations: [new OpenAIInstrumentation()],
});

// 2. Wrap your tool with RunGuard (trips before the next call, not after)
const tracker = new BudgetTracker({ maxUsd: 2.00 });

async function callWebSearch(query: string): Promise<string> {
  // real search implementation
  return `results for ${query}`;
}

const guardedSearch = guard(callWebSearch, {
  budget: tracker,
  loopWindow: 20,
  loopThreshold: 3,
});

// 3. Use the guarded function in your agent loop
// Phoenix traces every span; RunGuard trips before a repeat goes out
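try {
  const result = await guardedSearch("AI agent loop detection");
  console.log(result);
} catch (err) {
  // err is `unknown` under strict TypeScript, so narrow it before reading fields
  if (err instanceof Error && err.name === "LoopDetectedError") {
    // Phoenix recorded the spans leading up to the loop
    // RunGuard stopped it before the next redundant call
    const { signature, count } = err as Error & { signature?: string; count?: number };
    console.error("Loop tripped:", signature, count);
  } else {
    throw err; // do not swallow unrelated errors
  }
}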

Phoenix gets a complete trace including the spans that led up to the trip. You can replay that trace in the Phoenix UI, attach an evaluator, and understand why the loop formed. RunGuard ensured the trace was short (3 repeats) rather than catastrophic (25 repeats and, as an illustrative figure, something like $40 in tokens).

Side-by-side capability table

Capability	Arize Phoenix	RunGuard
Captures LLM call spans	Yes — full OpenTelemetry trace	No (use Phoenix for this)
Scores outputs with LLM-as-a-judge	Yes — built-in evaluators	No
Prompt experiment tracking	Yes — dataset-based experiments	No
Stops a loop at a configurable repeat count (3 in the examples here)	No	Yes (synchronous, in-process)
Per-run dollar budget enforcement	No (fleet-level cost reporting only)	Yes — raises before the next call
Context-window proximity alert	No	Yes — ContextOverflowError at 85%
Real-time Slack/PagerDuty alert on trip	Post-hoc (evaluation threshold)	Yes — webhook on trip
Framework support	OpenInference instrumentations	Any Python/TS callable
Open source	Yes (Phoenix)	SDK open-source; dashboard SaaS
Self-hostable	Yes (Phoenix server)	SDK self-hostable; cloud dashboard optional
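
The context-window row comes down to a ratio check before each model call. A minimal Python sketch, assuming the caller supplies the token count and the model's limit; check_context and ContextNearLimit are hypothetical names, and the 128,000-token limit is only an example value.

```python
class ContextNearLimit(Exception):
    """Hypothetical exception raised when context usage reaches the threshold."""


def check_context(used_tokens: int, max_tokens: int, threshold: float = 0.85) -> float:
    """Return context usage as a fraction, raising once it reaches the threshold."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    usage = used_tokens / max_tokens
    if usage >= threshold:
        raise ContextNearLimit(
            f"{used_tokens}/{max_tokens} tokens ({usage:.0%}) "
            f"reaches the {threshold:.0%} threshold"
        )
    return usage


# Example values only: a 128,000-token limit is an assumption, not a claim
# about any particular model.
safe_usage = check_context(100_000, 128_000)

overflowed = False
try:
    check_context(110_000, 128_000)
except ContextNearLimit:
    overflowed = True
```

Raising before the request is sent turns a truncation that would otherwise pass unnoticed into an error the agent loop has to handle explicitly.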

When to choose Arize, RunGuard, or both

Use Arize Phoenix if you need to understand what your agents are doing across many runs: trace browsing, quality evaluation, prompt regression, dataset curation. Phoenix is the right tool for the “what happened and why” question.
Use RunGuard if you need to stop bad runs before they finish: loop detection, per-run cost enforcement, context-window protection. RunGuard is the right tool for the “don’t let this get worse” problem.
Use both if you are running agents in production where a single bad run costs real money. Phoenix tells you what the run looked like; RunGuard ensures no run exceeds your acceptable blast radius. Together, they cover both the prevention and the forensics layer of agent operations.

Installing RunGuard alongside Arize Phoenix

# Python: install both in the same environment (shell command)
# pip install runguard openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-http

from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Phoenix endpoint (local or cloud).
# The URL is passed explicitly: OTLPSpanExporter does not read
# PHOENIX_COLLECTOR_ENDPOINT and otherwise defaults to localhost:4318.
PHOENIX_ENDPOINT = "http://localhost:6006/v1/traces"

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=PHOENIX_ENDPOINT)))
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument()

# RunGuard — wrap your tool the usual way
from runguard import guard, BudgetTracker

tracker = BudgetTracker(max_usd=2.0)

@guard(budget=tracker, loop_window=20, loop_threshold=3)
def web_search(query: str) -> dict:
    # Phoenix traces this call; RunGuard guards repetition
    ...
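
For readers wondering what a per-run dollar ceiling amounts to mechanically, here is a rough Python sketch. It is not RunGuard's BudgetTracker: the RunBudget class, its method names, and the $0.75 per-call estimate are assumptions made for illustration.

```python
class BudgetExceeded(Exception):
    """Hypothetical exception raised when the run's ceiling is already breached."""


class RunBudget:
    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        """Call before each guarded call; refuse to start if the ceiling is hit."""
        if self.spent_usd >= self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.max_usd:.2f} for this run"
            )

    def charge(self, cost_usd: float) -> None:
        """Add the estimated cost of a call that has just completed."""
        self.spent_usd += cost_usd


budget = RunBudget(max_usd=2.0)
calls_made = 0
stopped = False
try:
    for _ in range(25):
        budget.check()       # gate before the call goes out
        calls_made += 1      # the model or tool call would happen here
        budget.charge(0.75)  # assumed per-call estimate, illustration only
except BudgetExceeded:
    stopped = True
```

Checking before the call rather than after it means a run can overshoot the ceiling by at most one call's cost, here ending at $2.25 after three calls instead of continuing to twenty-five.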

The incident that makes this concrete

One of the patterns that surfaces repeatedly in agent post-mortems: a research agent hits a paginated API, the pagination token logic has a bug, the agent retrieves page 1 repeatedly. Arize Phoenix records every span faithfully. If you have an evaluation that scores “did the agent make progress?” on each trace, the score will be low for this run — but the evaluation runs after the trace is complete, which is after the 25th redundant API call.

With RunGuard in place, the third occurrence of (fetch_page, {token: null, query: "refund policy"}, 200) trips the breaker. The run exits with a LoopDetectedError. Phoenix gets a 3-span trace instead of a 25-span trace. You see the loop clearly in the Phoenix trace view, score it with a “stuck on page 1” evaluator, and add it to a regression dataset. The cost: three API calls. Without RunGuard: twenty-five, plus twenty-five model calls to “reason about” the same response.

Add RunGuard to your Arize-instrumented stack

RunGuard ships as a zero-config SDK. Add it to any tool your agent calls, set a dollar ceiling, and the circuit breaker is live. Your Phoenix traces get shorter and more informative; your cost surprises stop.
