Arize Phoenix vs RunGuard: real-time guardrails for AI agents

Arize Phoenix is one of the best LLM observability and evaluation platforms in the ecosystem. It captures every span, scores outputs with LLM-as-a-judge, and surfaces regressions across experiments. What it cannot do is reach inside a running agent and stop it before the fifteenth redundant tool call goes out. That gap is what RunGuard fills. This page is a precise account of what each product does, where the line is drawn, and how to run them together in a production agent stack.

What Arize Phoenix actually does

Arize Phoenix is an open-source LLM observability platform built around the OpenTelemetry trace model. It instruments your LLM calls via OpenInference auto-instrumentation and collects spans into a local or cloud-hosted Phoenix server. From there you can browse traces, attach evaluators (hallucination scores, relevance scores, custom LLM-judge prompts), run experiments across prompt versions, and set up dataset-based regression tests.

The observability gap: why traces don't stop loops

The reason traces cannot stop loops is architectural, not a product limitation of Arize. Tracing is a write-side operation: your application emits spans and the collector receives them. The collector has no callback channel into the running agent. There is no mechanism by which Phoenix can reach into your process, inspect the in-flight call stack, and raise an exception before the next tool.run() executes.

Consider the common agentic loop shape: a tool call fails with a 429, the model observes the failure and emits the same call again, the result is the same 429, and the model tries once more. Phoenix will faithfully record all three spans. If you set an alert for “tool error rate > 50% in last 10 minutes”, that alert can fire only once enough failed spans have been exported and aggregated: after the tenth retry, say, not after the second. A loop that runs to a 25-iteration cap (the kind of default max_iter some agent frameworks ship with) leaves 25 tool spans, all dutifully stored in Phoenix, and a trace-level alert fires only when the run finally halts or crashes.

This is not a knock on Arize. Traces are invaluable for understanding what happened. They are the wrong abstraction for preventing what is happening.
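To make the timing concrete, here is a minimal Python simulation of the 429 retry loop described above. It is not Phoenix code: `SpanSink`, `flaky_tool`, and `run_agent` are hypothetical stand-ins, and the sink only models a collector that receives spans one way. The 25-iteration cap is the figure used above.

```python
from dataclasses import dataclass, field


@dataclass
class SpanSink:
    """Hypothetical stand-in for a trace collector: it only receives spans."""
    spans: list = field(default_factory=list)

    def export(self, span: dict) -> None:
        # One-way write: nothing is returned to the caller.
        self.spans.append(span)

    def error_rate(self) -> float:
        # Post-hoc metric, computable only from spans already exported.
        if not self.spans:
            return 0.0
        errors = sum(1 for s in self.spans if s["status"] >= 400)
        return errors / len(self.spans)


def flaky_tool(query: str) -> int:
    """Simulated tool that is always rate limited."""
    return 429


def run_agent(sink: SpanSink, max_iter: int = 25) -> int:
    """Naive agent loop: retries the same call until the iteration cap."""
    calls = 0
    for _ in range(max_iter):
        status = flaky_tool("refund policy")
        calls += 1
        sink.export({"tool": "flaky_tool", "status": status})
    return calls


sink = SpanSink()
total_calls = run_agent(sink)
print(total_calls, len(sink.spans), sink.error_rate())
```

In this sketch the sink ends up with a complete and accurate record of the loop, and the error rate is unambiguous, but nothing in the `export` path can interrupt `run_agent`. Stopping the loop requires a check that runs inside the agent process, before each call.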

What RunGuard actually does

Layering Arize Phoenix and RunGuard together

The right production setup uses both: RunGuard prevents catastrophic runs from completing; Arize Phoenix makes the prevented (and completed) runs understandable. Here is a TypeScript example of the two working together:

// Install: npm install runguard openai @opentelemetry/instrumentation @arizeai/openinference-instrumentation-openai
import { OpenAI } from "openai";
import { guard, BudgetTracker } from "runguard";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { OpenAIInstrumentation } from "@arizeai/openinference-instrumentation-openai";

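// 1. Wire Arize Phoenix OTel instrumentation (spans go to the Phoenix collector,
//    assuming a tracer provider with an OTLP exporter pointed at Phoenix is registered elsewhere)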
registerInstrumentations({
  instrumentations: [new OpenAIInstrumentation()],
});

// 2. Wrap your tool with RunGuard (trips before the next call, not after)
const tracker = new BudgetTracker({ maxUsd: 2.00 });

async function callWebSearch(query: string): Promise<string> {
  // real search implementation
  return `results for ${query}`;
}

const guardedSearch = guard(callWebSearch, {
  budget: tracker,
  loopWindow: 20,
  loopThreshold: 3,
});

// 3. Use the guarded function in your agent loop
// Phoenix traces every span; RunGuard trips before a repeat goes out
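try {
  const result = await guardedSearch("AI agent loop detection");
  console.log(result);
} catch (err) {
  // `err` is `unknown` under strict TypeScript, so narrow it before reading fields
  if (err instanceof Error && err.name === "LoopDetectedError") {
    // Phoenix recorded the spans leading up to the loop
    // RunGuard stopped it before the next redundant call
    const { signature, count } = err as Error & { signature?: string; count?: number };
    console.error("Loop tripped:", signature, count);
  } else {
    throw err; // do not swallow unrelated errors
  }
}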

Phoenix gets a complete trace including the spans that led up to the trip. You can replay that trace in the Phoenix UI, attach an evaluator, and understand why the loop formed. RunGuard ensured the trace was short (3 repeats) rather than catastrophic (25 repeats, $40 in tokens).

Side-by-side capability table

| Capability | Arize Phoenix | RunGuard |
| --- | --- | --- |
| Captures LLM call spans | Yes (full OpenTelemetry trace) | No (use Phoenix for this) |
| Scores outputs with LLM-as-a-judge | Yes (built-in evaluators) | No |
| Prompt experiment tracking | Yes (dataset-based experiments) | No |
| Stops a loop before the 3rd repeat | No | Yes (synchronous, in-process) |
| Per-run dollar budget enforcement | No (fleet-level cost reporting only) | Yes (raises before the next call) |
| Context-window proximity alert | No | Yes (ContextOverflowError at 85%) |
| Real-time Slack/PagerDuty alert on trip | Post-hoc (evaluation threshold) | Yes (webhook on trip) |
| Framework support | OpenInference instrumentations | Any Python/TS callable |
| Open source | Yes (Phoenix) | SDK open-source; dashboard SaaS |
| Self-hostable | Yes (Phoenix server) | SDK self-hostable; cloud dashboard optional |
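
The "raises before the next call" behavior in the budget row can be sketched in a few lines of Python. This is an illustration of the general pre-call check, not RunGuard's implementation: `SimpleBudget`, `BudgetExceeded`, and the 0.75 USD per-call cost are all hypothetical.

```python
class BudgetExceeded(Exception):
    """Illustrative error type; the name is an assumption."""


class SimpleBudget:
    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def check(self, estimated_usd: float) -> None:
        # Runs before the call goes out: refuse if it would exceed the cap.
        projected = self.spent_usd + estimated_usd
        if projected > self.max_usd:
            raise BudgetExceeded(
                f"would spend {projected:.2f} of {self.max_usd:.2f} USD"
            )

    def record(self, actual_usd: float) -> None:
        # Runs after the call returns, with the cost actually incurred.
        self.spent_usd += actual_usd


budget = SimpleBudget(max_usd=2.00)
completed = 0
try:
    for _ in range(25):
        budget.check(estimated_usd=0.75)  # hypothetical per-call estimate
        budget.record(actual_usd=0.75)
        completed += 1
except BudgetExceeded:
    pass

print(completed, budget.spent_usd)
```

The point of the design is ordering: the check happens in-process and ahead of the spend, so the third call in this example is never issued. A cost report built from traces would show the same numbers, but only after the money was spent.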

When to choose Arize, RunGuard, or both

Installing RunGuard alongside Arize Phoenix

# Python: install both in the same environment (run this line in a shell)
# pip install runguard openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-http

# Configure Phoenix endpoint (local or cloud)
import os
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006/v1/traces"

from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
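# Pass the endpoint explicitly: OTLPSpanExporter does not read PHOENIX_COLLECTOR_ENDPOINT on its own
exporter = OTLPSpanExporter(endpoint=os.environ["PHOENIX_COLLECTOR_ENDPOINT"])
provider.add_span_processor(BatchSpanProcessor(exporter))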
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument()

# RunGuard — wrap your tool the usual way
from runguard import guard, BudgetTracker

tracker = BudgetTracker(max_usd=2.0)

@guard(budget=tracker, loop_window=20, loop_threshold=3)
def web_search(query: str) -> dict:
    # Phoenix traces this call; RunGuard guards repetition
    ...

The incident that makes this concrete

One of the patterns that surfaces repeatedly in agent post-mortems: a research agent hits a paginated API, the pagination token logic has a bug, the agent retrieves page 1 repeatedly. Arize Phoenix records every span faithfully. If you have an evaluation that scores “did the agent make progress?” on each trace, the score will be low for this run — but the evaluation runs after the trace is complete, which is after the 25th redundant API call.

With RunGuard in place, the third occurrence of (fetch_page, {token: null, query: "refund policy"}, 200) trips the breaker. The run exits with a LoopDetectedError. Phoenix gets a 3-span trace instead of a 25-span trace. You see the loop clearly in the Phoenix trace view, score it with a “stuck on page 1” evaluator, and add it to a regression dataset. The cost: three API calls. Without RunGuard: twenty-five, plus twenty-five model calls to “reason about” the same response.
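
The trip condition in this incident can be reproduced with a small sliding-window counter over call signatures. The sketch below is a generic illustration of that technique, assuming a signature made of tool name, arguments, and result status as in the tuple above. `LoopDetector`, `LoopDetected`, and the simulated `fetch_page` are hypothetical names and are not RunGuard's actual classes.

```python
import json
from collections import Counter, deque


class LoopDetected(Exception):
    """Illustrative error; not RunGuard's actual LoopDetectedError class."""

    def __init__(self, signature: str, count: int):
        super().__init__(f"{signature} seen {count} times")
        self.signature = signature
        self.count = count


class LoopDetector:
    def __init__(self, window: int = 20, threshold: int = 3):
        self.recent = deque(maxlen=window)  # sliding window of signatures
        self.threshold = threshold

    def observe(self, tool: str, args: dict, status: int) -> None:
        # Canonical signature: tool name, key-sorted JSON args, result status.
        signature = json.dumps([tool, args, status], sort_keys=True)
        self.recent.append(signature)
        count = Counter(self.recent)[signature]
        if count >= self.threshold:
            raise LoopDetected(signature, count)


def fetch_page(token, query):
    """Simulated buggy pagination: always returns page 1 with no next token."""
    return {"status": 200, "next_token": None}


detector = LoopDetector(window=20, threshold=3)
calls = 0
tripped = None
try:
    for _ in range(25):
        response = fetch_page(None, "refund policy")
        calls += 1
        args = {"token": None, "query": "refund policy"}
        detector.observe("fetch_page", args, response["status"])
except LoopDetected as exc:
    tripped = exc

print(calls, tripped.count)
```

With a threshold of 3, the loop exits on the third identical signature, so only three of the 25 possible calls are made. Including the result status in the signature is a deliberate choice in this sketch: repeating the same call is only suspicious when it keeps producing the same outcome.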

Add RunGuard to your Arize-instrumented stack

RunGuard ships as a zero-config SDK. Add it to any tool your agent calls, set a dollar ceiling, and the circuit breaker is live. Your Phoenix traces get shorter and more informative; your cost surprises stop.
