MLflow Tracing and AI Gateway Cost Control: Circuit Breakers for LLM Observability Spend

MLflow has been the standard experiment tracking platform for ML teams for years, and in its 2.x releases it has extended that role into LLM observability through MLflow Tracing and the MLflow AI Gateway. A single call to mlflow.langchain.autolog() or mlflow.openai.autolog() turns on automatic span capture for every LLM call in your agent (inputs, outputs, token counts, latency, model name), all logged to your MLflow tracking server without instrumenting a single callsite. The AI Gateway adds centralized LLM API routing: a single endpoint your agents hit instead of provider APIs directly, with unified auth, per-route rate limits, and provider-agnostic request normalization.

The cost control challenge is the same one that affects every observability-first framework: MLflow records what your agent does, but it does not stop it. Autolog will faithfully record a looping agent executing 3,000 LLM calls over two hours, with full token attribution and latency histograms, while your LLM provider charges you for every call. The AI Gateway will proxy each of those requests for as long as the route's rate limit allows, because Gateway rate limits are designed to protect provider quotas, not to detect pathological loops within a single agent session. The trace dashboard is detailed. The invoice is not kind.

Four structural patterns in MLflow-instrumented systems cause this silent cost amplification:

  • Autolog span accumulation in tight agent loops: mlflow.langchain.autolog() and mlflow.openai.autolog() write a span to the tracking server on every LLM call. In a looping agent, each span write is an I/O round-trip that adds synchronous overhead, masks timing signals used by duration-based guards, and turns the tracking server into a co-bottleneck whose latency scales with loop iteration count.
  • AI Gateway route fan-out: the Gateway enforces per-route, per-API-key rate limits across all callers sharing that key. A single looping agent can exhaust the shared window before other agents or users make a single request, and the Gateway's backpressure mechanism (429 + retry-after) interacts with agent retry logic to create a retry amplification cascade.
  • Evaluation suite fan-out: mlflow.evaluate() with a generative judge or agentic metric fires one LLM call per row × per metric × per attempt. A 200-row eval dataset with three LLM-as-judge metrics and a max-retry of 2 can generate up to 1,800 LLM calls from a single mlflow.evaluate() invocation, all billed at the judge model's rate.
  • Artifact logging storm from tool output accumulation: agents that call mlflow.log_artifact() or mlflow.log_text() on each tool output write one file per call to the artifact store. A looping agent that logs results to S3 or GCS can accumulate thousands of artifact writes per run, each incurring storage and PUT request costs that accrue independently of LLM token costs.

MLflow's tracing architecture and where cost lives

MLflow Tracing follows the OpenTelemetry data model: each LLM call or agent step becomes a span, spans are grouped into a trace, and traces are written to the MLflow tracking server (local SQLite, PostgreSQL, or a managed MLflow service). The autolog integrations intercept the LLM client's API calls at the SDK level — patching openai.OpenAI().chat.completions.create, wrapping LangChain's chain invocation hooks, etc. — and emit spans automatically after each call completes.

This means span writes happen synchronously (or near-synchronously, depending on the integration) after each LLM call, before control returns to the agent loop. The write latency (typically 5–30ms on a local tracking server, 20–100ms on a remote one) is invisible during development, where agents run once and finish, but compounds in production loops: at 30ms/span with 1,000 calls, you've added 30 seconds of pure I/O overhead and created 1,000 round-trips to your tracking server. If the tracking server becomes overloaded under this load, write latency increases, which inflates the loop's iteration time with non-LLM work and makes duration-based circuit breakers unreliable: a threshold tuned to the noisy timings no longer separates a healthy run from a runaway one.
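The overhead arithmetic above can be sketched as a small estimator. The function names are hypothetical, and the latency figures you pass in are assumptions to be replaced with measurements from your own tracking server.

```python
def span_overhead_seconds(n_calls: int, write_latency_ms: float) -> float:
    """Total synchronous span-write time, assuming one span write per LLM call."""
    return n_calls * write_latency_ms / 1000.0


def overhead_fraction(write_latency_ms: float, llm_latency_ms: float) -> float:
    """Share of each iteration's wall-clock time spent on the span write."""
    return write_latency_ms / (write_latency_ms + llm_latency_ms)


# Figures from the text: 1,000 calls at 30 ms per span write.
print(span_overhead_seconds(1_000, 30))

# Assumed 1.5 s LLM latency, purely for illustration.
print(f"{overhead_fraction(30, 1_500):.1%} of each iteration is span I/O")
```

Re-running the estimate with the latency you observe under load shows how quickly a wall-clock threshold drifts away from the LLM work it was meant to bound.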

The MLflow AI Gateway adds a second cost surface. The Gateway maintains a routes config mapping route names to provider endpoints, with per-route limit settings (requests-per-minute, tokens-per-minute). These limits protect the upstream provider quota shared across all callers of that route. They are not designed to isolate one agent session from another — a looping agent on route openai-gpt4 and a legitimate user on the same route compete for the same per-route window. A looping agent that fires 600 requests/minute against a route with a 1,000 req/min limit consumes 60% of the team's shared quota.
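To make the shared-window effect concrete, here is a minimal simulation of one fixed rate-limit window. It is a sketch under stated assumptions, not the Gateway's actual limiter: requests are assumed to arrive evenly spaced, the limiter admits them first-come-first-served, and the caller names and rates are illustrative.

```python
from collections import Counter


def simulate_window(limit: int, demand_per_min: dict[str, int]) -> dict[str, int]:
    """Return how many requests each caller gets admitted in one 60 s window."""
    events = []
    for caller, n in demand_per_min.items():
        for i in range(n):
            # Evenly spaced arrival times across the window (an assumption).
            events.append((i * 60.0 / n, caller))
    events.sort()

    admitted = Counter()
    for _, caller in events[:limit]:  # everything after the limit would get a 429
        admitted[caller] += 1
    return {caller: admitted[caller] for caller in demand_per_min}


result = simulate_window(
    limit=100,
    demand_per_min={"looping_agent": 600, "user": 20},
)
for caller, count in result.items():
    print(f"{caller}: {count} admitted")
```

Under these assumptions the looping agent takes nearly the whole window, and most of the user's requests are rejected even though the user's own demand is far below the limit.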

Failure mode 1: autolog span accumulation in agent loops

The failure mode emerges when an agent uses a looping control flow — a ReAct-style think/act/observe loop, a retry-on-validation-failure pattern, or a recursive sub-task decomposition — with mlflow.langchain.autolog() or mlflow.openai.autolog() enabled. Each iteration fires one or more LLM calls; each call triggers a span write; the span writes accumulate as I/O load on the tracking server proportional to iteration count.

The subtle cost amplification is in the interaction between span write latency and the loop's timing assumptions. Many agent implementations use a wall-clock duration check as a coarse circuit breaker: "if this iteration took longer than N seconds, something is wrong." When the tracking server is under load from earlier iterations, span write latency grows and every iteration looks slower to the duration check, whether or not the LLM work itself has changed. The check starts firing on healthy runs, so its threshold gets raised or the check gets disabled, and after that it never triggers while the actual LLM work (and billing) continues at the same rate. Observability overhead has defeated the timing-based guard.

Python — ReAct agent loop with autolog (the problem)
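import time

import mlflow
import mlflow.openai
from openai import OpenAI

mlflow.openai.autolog()  # spans written synchronously after every call

client = OpenAI()
run_start = time.time()
max_steps = 1_000  # illustrative value; too high to stop a runaway loop in time
history, tool_results = [], []
# build_messages, parse_action, and execute_tool are application-defined helpers

# The only ceiling is max_steps: there is no call or token budget,
# and autolog overhead grows linearly with the step count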
for step in range(max_steps):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_messages(history, tool_results),
    )
    # span write happens here (inside autolog's post-call hook)
    # at step 100: ~3s of accumulated span I/O on a remote tracking server

    action = parse_action(response)
    if action.type == "finish":
        break
    tool_results.append(execute_tool(action))

The guard wraps the loop with explicit call-count and token-budget ceilings that operate independently of span write timing. The span write latency does not affect when the circuit trips:

Python — call-count and token-budget guard for autolog loops
import mlflow
import mlflow.openai
from openai import OpenAI
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentLoopGuard:
    max_calls: int = 50
    max_input_tokens: int = 500_000
    max_output_tokens: int = 100_000
    _calls: int = field(default=0, init=False)
    _input_tokens: int = field(default=0, init=False)
    _output_tokens: int = field(default=0, init=False)

    def record_call(self, response) -> None:
        self._calls += 1
        usage = response.usage
        self._input_tokens += usage.prompt_tokens
        self._output_tokens += usage.completion_tokens

    def check(self) -> Optional[str]:
        if self._calls >= self.max_calls:
            return f"call ceiling hit ({self._calls}/{self.max_calls})"
        if self._input_tokens >= self.max_input_tokens:
            return f"input token budget exceeded ({self._input_tokens:,})"
        if self._output_tokens >= self.max_output_tokens:
            return f"output token budget exceeded ({self._output_tokens:,})"
        return None

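mlflow.openai.autolog()
client = OpenAI()
guard = AgentLoopGuard(max_calls=50, max_input_tokens=500_000)
history, tool_results = [], []  # build_messages, parse_action, execute_tool are application-defined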

with mlflow.start_run():
    for step in range(1000):  # high ceiling — guard stops it, not range
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=build_messages(history, tool_results),
        )
        guard.record_call(response)

        trip_reason = guard.check()
        if trip_reason:
            mlflow.set_tag("runguard.tripped", trip_reason)
            mlflow.log_metric("runguard.calls_at_trip", guard._calls)
            mlflow.log_metric("runguard.tokens_at_trip",
                              guard._input_tokens + guard._output_tokens)
            raise RuntimeError(f"RunGuard tripped: {trip_reason}")

        action = parse_action(response)
        if action.type == "finish":
            break
        tool_results.append(execute_tool(action))

The guard records call count and token usage directly from the response object — counts that autolog also captures, but whose values are available in-process before the span write completes. Logging the trip reason and metrics as MLflow tags and metrics means the trip event appears in the run's metadata, visible in the tracking UI alongside the last span before the circuit opened.
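The same in-process counters can be expressed as a dollar budget instead of raw token ceilings. The sketch below is an assumption-laden variant, not part of MLflow: the class name is hypothetical and the per-1K-token prices are placeholders that you would replace with your provider's current rates.

```python
from dataclasses import dataclass


@dataclass
class DollarBudgetGuard:
    max_usd: float
    usd_per_1k_input: float   # placeholder price, supply your own
    usd_per_1k_output: float  # placeholder price, supply your own
    spent_usd: float = 0.0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Add the cost of one call, using the token counts from response.usage."""
        self.spent_usd += (prompt_tokens / 1000) * self.usd_per_1k_input
        self.spent_usd += (completion_tokens / 1000) * self.usd_per_1k_output

    def tripped(self) -> bool:
        return self.spent_usd >= self.max_usd


guard = DollarBudgetGuard(max_usd=1.0, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
calls = 0
while not guard.tripped():
    guard.record(prompt_tokens=2_000, completion_tokens=500)  # stubbed usage figures
    calls += 1
print(f"budget exhausted after {calls} calls (${guard.spent_usd:.3f})")
```

A dollar ceiling is easier to reason about when input and output tokens are priced differently, at the cost of keeping a price table up to date.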

Failure mode 2: AI Gateway route fan-out from agent retry storms

The MLflow AI Gateway enforces limits at the route level, not the session level. A route config like limit: {calls: 100, renewal_period: minute} means all callers sharing that route's API key get 100 calls/minute combined. This design is correct for quota protection — you don't want one caller to exceed the upstream provider limit — but it has no concept of isolating a single agent session's usage.

The failure mode is a retry amplification cascade. The agent calls the Gateway on route openai-gpt4o. The Gateway enforces the per-route limit and returns 429 with a Retry-After header when the window is saturated. The agent's retry decorator (or the LLM client's built-in retry logic) sees the 429 and retries after the suggested delay. If the same looping agent is responsible for much of the route traffic, the retry delay just moves the request to the next window, where the agent's other in-flight iterations immediately fill the quota again. The agent is effectively in a rate-limited loop: request → 429 → retry → quota full again → 429 → retry.

Python — agent hitting AI Gateway with retry amplification (the problem)
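import time

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("http://localhost:5000")
max_steps = 1_000  # illustrative value
history, tool_results = [], []
# build_messages, parse_action, execute_tool, and extract_retry_after
# are application-defined helpers

# This retry loop has no retry ceiling of its own.
# Each 429 from the Gateway triggers a retry after the suggested delay,
# but if the agent itself is the dominant traffic source,
# the delay just moves the congestion window by seconds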
for step in range(max_steps):
    try:
        response = client.predict(
            endpoint="openai-gpt4o",
            inputs={"messages": build_messages(history, tool_results)},
        )
    except Exception as e:
        if "429" in str(e):
            time.sleep(extract_retry_after(e))  # retry — still in the loop
            continue
        raise

    tool_results.append(execute_tool(parse_action(response)))

The guard adds a session-level request budget that trips independently of the Gateway's per-route limit. Once the session budget is exhausted, the agent stops making new requests regardless of whether the Gateway would allow them:

Python — session-scoped request budget guard for AI Gateway callers
import mlflow.deployments
import mlflow
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GatewaySessionGuard:
    max_requests: int = 40
    max_429s: int = 5        # consecutive 429s = gateway is saturated by us
    max_total_latency_s: float = 120.0
    _requests: int = field(default=0, init=False)
    _consecutive_429s: int = field(default=0, init=False)
    _total_latency: float = field(default=0.0, init=False)

    def record_success(self, latency_s: float) -> None:
        self._requests += 1
        self._consecutive_429s = 0
        self._total_latency += latency_s

    def record_429(self) -> None:
        self._consecutive_429s += 1

    def check(self) -> Optional[str]:
        if self._requests >= self.max_requests:
            return f"session request ceiling ({self._requests}/{self.max_requests})"
        if self._consecutive_429s >= self.max_429s:
            return f"gateway saturation detected ({self._consecutive_429s} consecutive 429s)"
        if self._total_latency >= self.max_total_latency_s:
            return (f"session latency budget exceeded "
                    f"({self._total_latency:.1f}s/{self.max_total_latency_s}s)")
        return None

client = mlflow.deployments.get_deploy_client("http://localhost:5000")
guard = GatewaySessionGuard(max_requests=40, max_429s=5)

with mlflow.start_run() as run:
    for step in range(1000):
        trip_reason = guard.check()
        if trip_reason:
            mlflow.set_tag("runguard.tripped", trip_reason)
            mlflow.log_metric("runguard.gateway_requests", guard._requests)
            mlflow.log_metric("runguard.consecutive_429s", guard._consecutive_429s)
            raise RuntimeError(f"RunGuard tripped: {trip_reason}")

        t0 = time.monotonic()
        try:
            response = client.predict(
                endpoint="openai-gpt4o",
                inputs={"messages": build_messages(history, tool_results)},
            )
            guard.record_success(time.monotonic() - t0)
        except Exception as e:
            if "429" in str(e):
                guard.record_429()
                retry_delay = min(extract_retry_after(e), 10.0)
                time.sleep(retry_delay)
                continue
            raise

        action = parse_action(response["choices"][0]["message"])
        if action.type == "finish":
            break
        tool_results.append(execute_tool(action))

The consecutive-429 counter is the key signal. A looping agent that is itself the source of route saturation will see 429s repeatedly because its own retries fill each new rate-limit window. Five consecutive 429s without a success is a strong indicator that the session is in a retry amplification loop rather than experiencing transient infrastructure noise. At five consecutive 429s the guard trips, the run is tagged, and the Gateway route's window is freed for legitimate callers.
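The listings above call extract_retry_after() without defining it. One possible shape for that helper is sketched below. How the deploy client surfaces the Retry-After value is an assumption here: the sketch first looks for a response object with headers on the exception, then falls back to scanning the message text, and finally returns a default.

```python
import re


def extract_retry_after(exc: Exception, default_s: float = 2.0) -> float:
    """Best-effort Retry-After lookup; returns default_s when nothing usable is found."""
    # Case 1: the exception carries an HTTP response with headers (assumed shape).
    response = getattr(exc, "response", None)
    headers = getattr(response, "headers", None) or {}
    value = headers.get("Retry-After")

    if value is None:
        # Case 2: fall back to scanning the exception message.
        match = re.search(r"retry[-_ ]after\D*(\d+(?:\.\d+)?)", str(exc), re.IGNORECASE)
        value = match.group(1) if match else None

    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Covers a missing value and HTTP-date values, which this sketch does not parse.
        return default_s
```

Capping the returned delay, as the guarded loop does with min(..., 10.0), keeps a large server-suggested delay from stalling the session.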

Failure mode 3: evaluation suite fan-out with LLM-as-judge metrics

MLflow's mlflow.evaluate() API supports custom metrics that call an LLM to score each prediction — the "LLM-as-judge" pattern. A typical agentic eval setup might use three judge metrics: a correctness scorer, a faithfulness scorer, and a relevance scorer, each calling GPT-4o on every row of the evaluation dataset. With a 200-row dataset, that's already 600 LLM calls per mlflow.evaluate() invocation. Add a per-metric max_retry of 2 for transient errors, and the ceiling is 1,800 calls from one evaluation run.
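The judge-call arithmetic is simple enough to check in a few lines. The function name is hypothetical; the formula is the one stated above, rows × metrics × attempts.

```python
def judge_call_ceiling(n_rows: int, n_metrics: int, max_retries: int) -> int:
    """Worst-case number of judge LLM calls for one evaluation run."""
    return n_rows * n_metrics * (1 + max_retries)


print(judge_call_ceiling(200, 3, max_retries=0))  # 600
print(judge_call_ceiling(200, 3, max_retries=2))  # 1800
```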

The amplification is worse when the task being evaluated is itself agentic. If each row of the dataset describes a task the agent must solve before the judge can score the result, the call count per row is agent_steps_per_task + number_of_judge_metrics × (1 + max_retries): the agent runs once per row, and each judge metric then scores its output. A 5-step agent with three judge metrics and one retry comes to 5 + 6 = 11 LLM calls per row in the worst case, or 2,200 calls across 200 rows from a single evaluation run.

Python — LLM-as-judge evaluation with uncapped fan-out (the problem)
import mlflow
from mlflow.metrics.genai import make_genai_metric

correctness = make_genai_metric(
    name="correctness",
    definition="Is the agent's final answer factually correct?",
    grading_prompt="Rate correctness 1-5...",
    model="openai:/gpt-4o",  # judge LLM
    max_workers=4,            # concurrent judge calls — no per-eval ceiling
)
faithfulness = make_genai_metric(
    name="faithfulness",
    definition="Does the answer stay grounded in the retrieved context?",
    grading_prompt="Rate faithfulness 1-5...",
    model="openai:/gpt-4o",
    max_workers=4,
)

# 200-row dataset × 2 judge metrics × gpt-4o = 400 LLM calls minimum
# plus agent calls per row if the eval function is itself agentic
results = mlflow.evaluate(
    model=run_agent,          # agent function — may itself make N LLM calls
    data=eval_dataset,        # 200 rows
    extra_metrics=[correctness, faithfulness],
)

The guard wraps mlflow.evaluate() with dataset size and metric count pre-checks, then applies a sampling strategy for large datasets before the eval fires. A dry-run estimate gives the team a chance to gate-check before committing the budget:

Python — pre-eval budget estimator and sampler guard
import mlflow
import math
from typing import Callable, Any

def guarded_evaluate(
    model: Callable,
    data,
    extra_metrics: list,
    max_eval_llm_calls: int = 500,
    agent_calls_per_row_estimate: int = 5,
    judge_max_retries: int = 1,
    sample_if_over_budget: bool = True,
) -> Any:
    n_rows = len(data) if hasattr(data, "__len__") else None
    if n_rows is None:
        raise ValueError("Dataset must have a __len__ for budget estimation.")

    n_metrics = len(extra_metrics)
    judge_calls = n_rows * n_metrics * (1 + judge_max_retries)
    agent_calls = n_rows * agent_calls_per_row_estimate
    total_estimate = judge_calls + agent_calls

    if total_estimate > max_eval_llm_calls:
        if not sample_if_over_budget:
            raise RuntimeError(
                f"Eval budget exceeded before start: estimated {total_estimate} LLM calls "
                f"({n_rows} rows × {n_metrics} metrics × {1 + judge_max_retries} attempts "
                f"+ {agent_calls} agent calls). Ceiling: {max_eval_llm_calls}."
            )
        # sample down to fit within budget
        calls_per_row = n_metrics * (1 + judge_max_retries) + agent_calls_per_row_estimate
        max_rows = math.floor(max_eval_llm_calls / calls_per_row)
        if max_rows < 10:
            raise RuntimeError(
                f"Budget too small to sample this dataset. "
                f"Would need to reduce to {max_rows} rows, too few for a meaningful eval."
            )
        if hasattr(data, "sample"):
            data = data.sample(n=max_rows, random_state=42)  # fixed seed: reproducible sample
        else:
            data = data[:max_rows]

        # report the post-sampling estimate, not the original one
        total_estimate = max_rows * calls_per_row
        print(f"[RunGuard] Sampled dataset to {max_rows} rows "
              f"(estimated {total_estimate} LLM calls ≤ {max_eval_llm_calls} ceiling).")

    with mlflow.start_run(tags={
        "runguard.eval_rows": str(len(data)),
        "runguard.eval_metrics": str(n_metrics),
        "runguard.eval_llm_estimate": str(total_estimate),
    }):
        return mlflow.evaluate(
            model=model,
            data=data,
            extra_metrics=extra_metrics,
        )

The pre-check runs before the first LLM call. If the estimate exceeds the ceiling, the guard samples the dataset down to fit within budget. The random seed is fixed so the same sample can be re-drawn, and the sampled row count is recorded as a run tag. If the budget only allows a sample of fewer than ten rows, the eval is blocked entirely rather than producing a statistically meaningless result.
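If the team needs to know exactly which rows were evaluated, one option is to draw the sample as an explicit index list and store that list with the run. The sketch below swaps in Python's standard-library sampling in place of the DataFrame.sample() call used in the guard, so the selected rows will differ from the pandas sample. The function name is hypothetical.

```python
import random


def sample_row_indices(n_rows: int, k: int, seed: int = 42) -> list[int]:
    """Pick k distinct row indices reproducibly; return all rows if k >= n_rows."""
    if k >= n_rows:
        return list(range(n_rows))
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return sorted(rng.sample(range(n_rows), k))


indices = sample_row_indices(n_rows=200, k=45)
print(len(indices), indices[:5])
```

The index list can then be used to select rows (for a DataFrame, via .iloc) and saved alongside the run, for example with mlflow.log_dict(), so a later reader can reconstruct the evaluated subset without relying on the seed.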

Failure mode 4: artifact logging storm from tool output accumulation

MLflow's artifact store — S3, GCS, Azure Blob, or local filesystem — is designed for model artifacts: weights files, serialized pipelines, evaluation plots. Agents that use mlflow.log_artifact() or mlflow.log_text() to capture tool outputs (search results, fetched pages, generated files) write one artifact per call. This pattern is natural: it gives you a complete record of what the agent saw and produced, and the artifact browser in the MLflow UI is a useful debugging surface.

The cost amplification occurs when the agent is in a loop. At 100 loop iterations, each logging three tool outputs as artifacts, you have 300 artifact writes per run. On S3, that's 300 PUT requests ($0.000005 each = $0.0015) plus 300 objects in the bucket. Individually trivial. At scale (a batch job running 1,000 agent tasks in parallel) that's 300,000 PUT requests (about $1.50 at the same rate) and 300,000 objects per batch, plus the per-object storage and LIST costs for any downstream pipeline that reads the artifacts. For text-heavy agents whose tool outputs are small but numerous, this cost is driven by object and request counts rather than bytes stored, and it accrues on every batch independently of LLM token costs.
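The write-count arithmetic can be sketched as follows. The per-request price is the figure quoted in the text and should be treated as an assumption to verify against your storage provider's current pricing. The function name is hypothetical.

```python
PUT_PRICE_USD = 0.000005  # per-request figure quoted in the text; verify before relying on it


def artifact_put_cost(
    tasks: int,
    iterations_per_task: int,
    artifacts_per_iteration: int,
    put_price_usd: float = PUT_PRICE_USD,
) -> tuple[int, float]:
    """Return (number of artifact writes, PUT request cost in USD)."""
    writes = tasks * iterations_per_task * artifacts_per_iteration
    return writes, writes * put_price_usd


for tasks in (1, 1_000):
    writes, cost = artifact_put_cost(tasks, iterations_per_task=100, artifacts_per_iteration=3)
    print(f"{tasks:>5} task(s): {writes:,} writes, ${cost:.4f} in PUT requests")
```

This covers only the PUT requests. Storage, LIST operations, and downstream reads are billed separately and are not modeled here.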

Python — agent logging every tool output as an artifact (the problem)
import mlflow

with mlflow.start_run():
    for step in range(max_steps):
        response = llm_call(build_messages(history, tool_results))
        action = parse_action(response)

        if action.type == "tool_call":
            result = execute_tool(action)
            tool_results.append(result)

            # one artifact write per tool call — accumulates with loop depth
            mlflow.log_text(
                str(result),
                artifact_file=f"tool_outputs/step_{step}_{action.tool}.txt"
            )

        elif action.type == "finish":
            break

The guard applies a write budget and a batching strategy: tool outputs accumulate in-process and are flushed as a single consolidated artifact at trip or finish, replacing per-call writes with one write per run:

Python — artifact write budget and consolidation guard
import mlflow
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArtifactLoggingGuard:
    max_artifact_writes: int = 20
    max_artifact_bytes: int = 10 * 1024 * 1024  # 10 MB total
    consolidate_on_trip: bool = True
    _writes: int = field(default=0, init=False)
    _total_bytes: int = field(default=0, init=False)
    _buffer: list = field(default_factory=list, init=False)

    def log_tool_output(
        self,
        step: int,
        tool_name: str,
        result: str,
    ) -> Optional[str]:
        encoded = result.encode("utf-8", errors="replace")
        self._total_bytes += len(encoded)
        self._buffer.append({
            "step": step,
            "tool": tool_name,
            "output": result[:2000],  # truncate individual entries for readability
        })

        if self._writes < self.max_artifact_writes and \
           self._total_bytes < self.max_artifact_bytes:
            self._writes += 1  # counts buffered entries against the write budget
            return None  # within budget: the entry is buffered, nothing is written yet

        return (
            f"artifact ceiling: {self._writes} writes or "
            f"{self._total_bytes / 1024:.1f} KB accumulated"
        )

    def flush(self, suffix: str = "final") -> None:
        if not self._buffer:
            return
        mlflow.log_text(
            json.dumps(self._buffer, indent=2),
            artifact_file=f"tool_outputs_consolidated_{suffix}.json",
        )

artifact_guard = ArtifactLoggingGuard(max_artifact_writes=20)

with mlflow.start_run():
    for step in range(1000):
        response = llm_call(build_messages(history, tool_results))
        action = parse_action(response)

        if action.type == "tool_call":
            result = execute_tool(action)
            tool_results.append(result)

            trip_reason = artifact_guard.log_tool_output(step, action.tool, str(result))
            if trip_reason:
                artifact_guard.flush(suffix="at_trip")
                mlflow.set_tag("runguard.artifact_tripped", trip_reason)
                raise RuntimeError(f"RunGuard artifact ceiling hit: {trip_reason}")

        elif action.type == "finish":
            break

    # flush all buffered outputs as a single consolidated artifact
    artifact_guard.flush(suffix="final")
    mlflow.log_metric("total_tool_calls", len(artifact_guard._buffer))

The consolidated artifact is written once, at trip or at finish, rather than once per tool call. The JSON structure preserves the full sequence of steps and tool names, with each output truncated to 2,000 characters, so most of the debugging value of the artifact log is retained. The artifact write count goes from O(N) in loop depth to O(1) per run.

Connecting MLflow's cost signals to the circuit breaker

MLflow Tracing gives you accurate, post-hoc cost attribution through the token counts and latency recorded in each span's attributes, which you can query once the trace has been written (for example with mlflow.search_traces()). The gap between MLflow's post-hoc accounting and real-time cost enforcement is exactly the gap that circuit breakers fill.

The patterns above implement guards that act before the next billing event, using in-process counters that are updated synchronously with each LLM call. The MLflow trace provides the audit trail after the fact; the in-process guard stops the run before the overspend that would need auditing. The two mechanisms are complementary: guards stop the accumulation, traces explain what happened up to the trip point.

For teams already using MLflow across both classical ML experiments and LLM agents, the logging patterns above have an additional benefit: the trip reason, call counts, and token budgets are logged as MLflow tags and metrics on the run. This means your existing MLflow dashboards and alert rules can track circuit breaker activity across experiments — trip rate over time, calls-at-trip distribution, which agents hit which ceilings — without instrumenting a separate monitoring system.
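As a rough illustration of the kind of roll-up this enables, the sketch below summarizes trip activity from a list of per-run tag dictionaries. Fetching those tags is left out; with mlflow.search_runs() they would arrive as tags.* columns of the returned DataFrame. The function name and the sample data are hypothetical.

```python
from collections import Counter


def summarize_trips(run_tags: list[dict]) -> dict:
    """Compute trip rate and a per-ceiling breakdown from runguard.tripped tags."""
    reasons = [tags["runguard.tripped"] for tags in run_tags if "runguard.tripped" in tags]
    # Group by the text before the first "(" so that
    # "call ceiling hit (50/50)" and "call ceiling hit (51/50)" share a bucket.
    by_kind = Counter(reason.split("(")[0].strip() for reason in reasons)
    return {
        "runs": len(run_tags),
        "tripped": len(reasons),
        "trip_rate": len(reasons) / len(run_tags) if run_tags else 0.0,
        "by_kind": dict(by_kind),
    }


sample_tags = [
    {"runguard.tripped": "call ceiling hit (50/50)"},
    {},  # a run that finished without tripping
    {"runguard.tripped": "call ceiling hit (51/50)"},
    {"runguard.tripped": "input token budget exceeded (500,123)"},
]
print(summarize_trips(sample_tags))
```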

See also: LangChain LCEL cost control, LiteLLM Proxy circuit breakers for unified LLM routing, and the AI agent cost control pattern reference for the full taxonomy of failure modes across frameworks.

Frequently asked questions

Does MLflow Tracing have any built-in cost alerting?

MLflow Tracing captures token counts per span where the integration reports them and shows them in the Traces tab of the MLflow UI. Whether those counts are also converted into a dollar figure depends on your MLflow version and integration, so verify that against the release you run. As of MLflow 2.x, there is no built-in threshold alert that fires mid-run when cost exceeds a limit: the data is post-hoc, available only after the calls have been made. Real-time enforcement requires in-process guards on top of MLflow's observability layer, as shown in the patterns above.

How do the AI Gateway's rate limits interact with agent retry logic?

The Gateway's per-route limits enforce a calls-per-minute or tokens-per-minute ceiling across all callers sharing a route. When a looping agent hits the limit and receives a 429, a naive retry decorator re-enqueues the request at the next window boundary — but if the agent's own in-flight iterations fill that next window, the retry itself gets 429'd. The consecutive-429 guard in failure mode 2 detects this pattern (five consecutive 429s without a success) and treats it as a loop signal rather than a transient error, tripping the breaker before the agent's retries saturate multiple consecutive windows.

Why does autolog overhead matter if the tracking server is fast?

On a local MLflow server with a fast SQLite backend, span write latency is typically a few milliseconds and the overhead is negligible for agents that run tens of steps. The concern arises in two scenarios: (1) remote tracking servers under load from many concurrent agents, where write latency can reach 50–200ms and loops that use wall-clock timing for circuit breaking see inflated step durations; (2) high-iteration loops (1,000+ steps) where even 10ms/span adds 10 seconds of I/O overhead and the tracking server write queue becomes a resource contention point.

Can I use MLflow's autolog with async agents?

MLflow Tracing supports async code: the mlflow.trace decorator can wrap async functions, and the autolog integrations patch async client methods. The span write itself is currently synchronous in most integrations, so high-concurrency async agents that fire many parallel LLM calls can create a write burst to the tracking server when those calls complete at the same time. For parallel fan-out patterns, apply a concurrency ceiling (e.g., asyncio.Semaphore) before autolog writes become a bottleneck.
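A concurrency ceiling of that kind might look like the sketch below. It uses only asyncio. The helper names are hypothetical, and fake_llm_call is a stand-in for a real traced client call.

```python
import asyncio


async def bounded_gather(coro_factories, max_concurrency: int):
    """Run zero-argument coroutine factories with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(factory):
        async with semaphore:  # waits here while the ceiling is reached
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))


async def demo(n_calls: int = 20, max_concurrency: int = 4) -> int:
    """Return the peak number of simultaneous calls observed."""
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for a traced LLM call
        in_flight -= 1

    await bounded_gather([fake_llm_call] * n_calls, max_concurrency)
    return peak


if __name__ == "__main__":
    print("peak concurrency:", asyncio.run(demo()))
```

Passing factories rather than already-created coroutines means no call starts until it holds a semaphore slot, which is what bounds the burst of span writes.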

Does the artifact consolidation guard lose debugging fidelity?

The consolidated artifact preserves the full sequence of steps, tool names, and truncated outputs (first 2,000 characters per tool result) as a single JSON file. For debugging purposes, the consolidated artifact is typically more useful than 300 individual text files because the sequence is in one place and searchable. The only loss is the individual file's byte-for-byte exact content for results over 2,000 characters — the guard truncates per-result. If full-fidelity capture is required for specific tool types, write those directly (within the per-call ceiling) and rely on consolidation only for the remainder.

Stop the loop before the invoice arrives

RunGuard is a runtime SDK that trips a circuit breaker the moment your agent's tool-call pattern shows a loop, context overflow, or budget breach — before the bill lands. One-line install for TypeScript and Python.
