TensorZero Cost Control: Best-of-N Multipliers, Experiment Variant Skew, DICL Overhead, and Feedback-Inference Cycles

TensorZero is an open-source LLM inference gateway written in Rust. You define functions — typed inference endpoints with named inputs and structured output schemas — and variants — the specific model, prompt template, and sampling configuration each function uses. The TensorZero gateway sits between your application code and your LLM provider; your agent calls client.inference(function_name="my_fn", input={...}) and TensorZero handles routing, structured output enforcement, inference logging, and A/B variant assignment.

Beyond basic routing, TensorZero adds three optimization primitives that are central to its value proposition: multi-variant experimentation (route traffic across model variants by weight), best-of-N sampling (generate N candidate responses and score them with a judge function), and Dynamic In-Context Learning (retrieve similar past inferences from ClickHouse and inject them as few-shot examples). Each of these features is designed to improve output quality. None of them enforce a spend ceiling.

The result is a familiar gap: TensorZero routes and logs every call; it does not decide that the session has spent enough. An agent that calls client.inference() in a loop, under a best-of-N variant, against a DICL function, with programmatic quality gating, can run up a bill that is 10–50× the naive estimate before the outer loop terminates.

What this post covers: Four cost amplification patterns specific to TensorZero's architecture, and a runtime circuit breaker guard for each. The guards work alongside TensorZero — they do not replace it. You keep the inference logs and experiment data; you add the spend ceilings.

Pattern 1: Best-of-N Sampling Hidden Call Multiplier

TensorZero's best-of-N sampling is a variant-level configuration. When active, a single call to client.inference() dispatches N parallel requests to the underlying model provider. A separate judge function scores all N candidate responses, and TensorZero returns the highest-scoring one to your application. From the caller's perspective, one inference call produced one response. From the provider's billing system, one inference call produced N + 1 billable calls (N candidate generations plus one judge evaluation).

A minimal TensorZero configuration activating best-of-N on a chat function looks like this in tensorzero.toml:

TOML

[functions.research_agent]
type = "chat"

[functions.research_agent.variants.best_of_5]
type = "chat_completion"
model = "openai::gpt-4o-mini"
weight = 1.0

[functions.research_agent.variants.best_of_5.best_of_n_sampling]
type = "latent_space_optimizer"
n = 5
inner_function = "research_agent_judge"

[functions.research_agent_judge]
type = "chat"

[functions.research_agent_judge.variants.judge_v1]
type = "chat_completion"
model = "openai::gpt-4o"   # judge uses a stronger model
weight = 1.0

Your agent code calls inference once per loop iteration:

Python

import tensorzero

client = tensorzero.TensorZeroGateway("http://localhost:3000")

results = []
for query in research_queries:          # 50 queries
    response = client.inference(
        function_name="research_agent",
        input={"messages": [{"role": "user", "content": query}]},
    )
    results.append(response.content)

The loop counter shows 50 iterations. The provider bill reflects:

Candidate generations: 50 queries × 5 candidates = 250 gpt-4o-mini calls
Judge evaluations: 50 queries × 1 judge call = 50 gpt-4o calls
Call-count multiplier: 6× the provider calls of a direct gpt-4o-mini loop (300 vs. 50)
At 2,000 input + 500 output tokens per call:
250 × (2,000 × $0.15/1M in + 500 × $0.60/1M out) = 250 × $0.0006 ≈ $0.15
50 × (2,000 × $5.00/1M in + 500 × $15.00/1M out) = 50 × $0.0175 ≈ $0.88
Total: ~$1.03 vs. ~$0.03 for a plain loop, roughly 34× the cost, because the judge runs on a more expensive model
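A per-call cost is input tokens times the input rate plus output tokens times the output rate. The sketch below applies that to the 50-query loop so the figures can be recomputed for other token profiles. It is a minimal illustration, not part of TensorZero: `call_cost` and `best_of_n_cost` are hypothetical names, the rates are the per-million-token prices quoted in this section, and it assumes every candidate and judge call has the same token profile (in practice the judge prompt also carries the candidates, so its input would be larger).

```python
def call_cost(in_tokens: int, out_tokens: int,
              in_rate_per_1m: float, out_rate_per_1m: float) -> float:
    """USD cost of one call: each token type is billed at its own rate."""
    return (in_tokens * in_rate_per_1m + out_tokens * out_rate_per_1m) / 1_000_000


def best_of_n_cost(queries: int, n: int,
                   candidate_call: float, judge_call: float) -> dict:
    """Provider calls and USD cost for a loop that runs best-of-N on every query."""
    return {
        "provider_calls": queries * (n + 1),
        "cost_usd": queries * (n * candidate_call + judge_call),
    }


# Rates quoted in this section (USD per 1M tokens); 2,000 in + 500 out per call.
mini = call_cost(2_000, 500, 0.15, 0.60)    # candidate model
judge = call_cost(2_000, 500, 5.00, 15.00)  # judge model

bon = best_of_n_cost(queries=50, n=5, candidate_call=mini, judge_call=judge)
plain_cost = 50 * mini

print(f"best-of-5: {bon['provider_calls']} calls, ${bon['cost_usd']:.2f}")
print(f"plain loop: 50 calls, ${plain_cost:.2f}")
```

Separating the call count from the dollar cost makes it visible that the two multipliers differ whenever the judge model is priced differently from the candidate model.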

The critical property: there is no TensorZero API response field that tells the caller how many underlying calls were made. response.content is the winning candidate's output. The multiplier is invisible in application code and only appears in provider billing dashboards or in TensorZero's ClickHouse inference log.

The guard: BestOfNBudgetGuard

Wrap the inference loop with a session budget tracker that accounts for best-of-N multipliers before each dispatch. The guard needs two inputs: the configured n value for the function and the judge model's cost rate.

Python

from dataclasses import dataclass, field
from typing import Optional
import tensorzero

# Estimated cost per call in USD (illustrative; recompute from your own token
# mix and current provider pricing)
MODEL_BLENDED_COST_PER_CALL = {
    "openai::gpt-4o-mini":   0.00044,
    "openai::gpt-4o":        0.0250,
    "openai::gpt-4.1":       0.0280,
    "openai::gpt-4.1-mini":  0.00105,
    "anthropic::claude-haiku-4-5": 0.00180,
    "anthropic::claude-sonnet-4-6": 0.0165,
}

@dataclass
class BestOfNBudgetGuard:
    candidate_model: str
    judge_model: str
    n: int
    max_session_cost_usd: float
    max_session_calls: int
    _session_cost: float = field(default=0.0, init=False)
    _session_calls: int = field(default=0, init=False)

    def _candidate_cost(self) -> float:
        return MODEL_BLENDED_COST_PER_CALL.get(self.candidate_model, 0.01)

    def _judge_cost(self) -> float:
        return MODEL_BLENDED_COST_PER_CALL.get(self.judge_model, 0.025)

    def preflight(self) -> None:
        """Call BEFORE each client.inference(). Raises BudgetExceededError if over ceiling."""
        projected_cost = self._session_cost + (
            self.n * self._candidate_cost() + self._judge_cost()
        )
        projected_calls = self._session_calls + self.n + 1

        if projected_cost > self.max_session_cost_usd:
            raise BudgetExceededError(
                f"BestOfNBudgetGuard: projected session cost ${projected_cost:.3f} "
                f"exceeds ceiling ${self.max_session_cost_usd:.2f} "
                f"(n={self.n}, candidate={self.candidate_model}, judge={self.judge_model})"
            )
        if projected_calls > self.max_session_calls:
            raise BudgetExceededError(
                f"BestOfNBudgetGuard: projected call count {projected_calls} "
                f"exceeds ceiling {self.max_session_calls}"
            )

    def record(self) -> None:
        """Call AFTER a successful inference() returns."""
        self._session_cost += self.n * self._candidate_cost() + self._judge_cost()
        self._session_calls += self.n + 1

    @property
    def session_cost(self) -> float:
        return self._session_cost

    @property
    def session_calls(self) -> int:
        return self._session_calls


class BudgetExceededError(Exception):
    pass


# Usage
guard = BestOfNBudgetGuard(
    candidate_model="openai::gpt-4o-mini",
    judge_model="openai::gpt-4o",
    n=5,
    max_session_cost_usd=2.00,   # trip before the bill exceeds $2
    max_session_calls=200,
)

results = []
for query in research_queries:
    try:
        guard.preflight()
    except BudgetExceededError as e:
        print(f"Session ceiling reached after {len(results)} queries: {e}")
        break

    response = client.inference(
        function_name="research_agent",
        input={"messages": [{"role": "user", "content": query}]},
    )
    guard.record()
    results.append(response.content)

print(f"Completed {len(results)} queries | "
      f"estimated cost: ${guard.session_cost:.3f} | "
      f"provider calls: {guard.session_calls}")

The guard's cost estimates come from the static per-call table above, so they are only as accurate as those figures; keep the table aligned with your provider pricing and typical token counts. Set max_session_cost_usd to the maximum acceptable cost for the loop before the first iteration starts. Because preflight() runs before each dispatch and projects the full N + 1 calls, the ceiling is enforced before a best-of-N batch is sent, not after the charges land.
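To see which ceiling binds first for a given configuration, the guard's arithmetic can be run ahead of time. This is a minimal sketch that assumes the illustrative per-call estimates used in this section; `max_queries_under_ceilings` is a hypothetical helper, not part of the guard.

```python
def max_queries_under_ceilings(n: int, candidate_cost: float, judge_cost: float,
                               max_cost_usd: float, max_calls: int) -> dict:
    """How many best-of-N queries fit before either ceiling would trip."""
    cost_per_query = n * candidate_cost + judge_cost
    calls_per_query = n + 1
    by_cost = int(max_cost_usd // cost_per_query)
    by_calls = max_calls // calls_per_query
    return {
        "by_cost": by_cost,
        "by_calls": by_calls,
        "allowed": min(by_cost, by_calls),
        "binding": "calls" if by_calls < by_cost else "cost",
    }


limits = max_queries_under_ceilings(
    n=5, candidate_cost=0.00044, judge_cost=0.025,
    max_cost_usd=2.00, max_calls=200,
)
print(limits)
```

With the values from the usage example, the call ceiling allows 33 queries while the cost ceiling would allow 73, so a 50-query loop stops on max_session_calls rather than on cost. Running this check first helps pick ceilings that trip on the dimension you actually care about.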

Pattern 2: Multi-Variant Experiment Cost Skew

TensorZero's variant experimentation is a traffic split: each client.inference() call is independently assigned to a variant by sampling from the configured weight distribution. A typical A/B experiment might route 80% of traffic to a cheap variant and 20% to an expensive one for comparison:

TOML

[functions.support_agent.variants.cheap_v1]
type = "chat_completion"
model = "openai::gpt-4o-mini"
weight = 0.8

[functions.support_agent.variants.premium_v1]
type = "chat_completion"
model = "openai::gpt-4o"
weight = 0.2

The expected blended per-call cost is 0.8 × $0.00044 + 0.2 × $0.0250 = $0.005. An agent running 500 iterations expects to spend 500 × $0.005 = $2.50. However, variant assignment is a random draw per call — not a deterministic round-robin. The actual number of expensive variant hits follows a binomial distribution: Binomial(n=500, p=0.2). The expected count is 100, but the standard deviation is √(500 × 0.2 × 0.8) ≈ 8.9. A one-standard-deviation run has 109 expensive hits; a two-standard-deviation run has 118.

In isolation, a 9-hit overrun on 100 expected expensive calls adds 9 × ($0.0250 − $0.00044) ≈ $0.22 — tolerable. The failure mode emerges when the agent is already expensive:

Long-context agents: premium_v1 with a 16k-token input costs not $0.0250 but $0.16 per call. A 9-hit overrun costs $1.44 extra.
High-frequency loops: an agent that runs 5,000 iterations with a 20% expensive variant has a standard deviation of √(5,000 × 0.2 × 0.8) ≈ 28.3 expensive hits, so a 2-sigma overrun is about 57 extra expensive hits, adding roughly $1.40 at standard pricing.
Experiment misconfiguration: a developer sets weights to [0.5, 0.5] intending to compare two variants but forgets to revert after the experiment ships. Every call now has a 50% chance of hitting the expensive model.
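The spread figures follow from the binomial mean and standard deviation. The sketch below uses only the standard library; `expensive_hit_spread` is a hypothetical helper, and the per-call costs are the illustrative figures used in this section.

```python
import math


def expensive_hit_spread(calls: int, p_expensive: float,
                         cheap_cost: float, expensive_cost: float,
                         sigmas: float = 2.0) -> dict:
    """Mean, standard deviation, and k-sigma cost overrun for expensive-variant hits."""
    mean = calls * p_expensive
    std = math.sqrt(calls * p_expensive * (1 - p_expensive))
    extra_hits = sigmas * std
    # Each extra expensive hit replaces a cheap one, so it costs the difference.
    extra_cost = extra_hits * (expensive_cost - cheap_cost)
    return {"mean": mean, "std": std, "extra_hits": extra_hits, "extra_cost": extra_cost}


for calls in (500, 5_000):
    s = expensive_hit_spread(calls, 0.2, 0.00044, 0.025)
    print(f"{calls} calls: mean={s['mean']:.0f}, std={s['std']:.1f}, "
          f"2-sigma overrun={s['extra_hits']:.0f} hits (${s['extra_cost']:.2f})")
```

The standard deviation grows with the square root of the call count, so ten times as many calls widens the absolute overrun by about 3.2×, not 10×.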

The fundamental issue is that agent code cannot observe which variant it hit on any given call. response from TensorZero includes an inference_id for feedback, but not the variant name or per-call cost. The only way to see the variant breakdown is to query TensorZero's ClickHouse database after the fact.

The guard: VariantSkewDetector

TensorZero's Python client returns an inference_id but not the variant name. The guard uses the ClickHouse inference log to retroactively check the variant assignment and detect skew during long loops:

Python

import clickhouse_connect
from dataclasses import dataclass, field
from typing import Optional
import tensorzero

@dataclass
class VariantSkewDetector:
    """
    Polls TensorZero's ClickHouse log every check_interval calls to
    detect variant cost skew above tolerance_factor × expected cost.
    """
    function_name: str
    expected_weights: dict[str, float]   # {"cheap_v1": 0.8, "premium_v1": 0.2}
    variant_costs: dict[str, float]      # {"cheap_v1": 0.00044, "premium_v1": 0.025}
    check_interval: int = 25             # poll every N inference calls
    tolerance_factor: float = 1.5       # trip when actual cost > 1.5× expected
    ch_client: Optional[object] = None  # clickhouse_connect client
    _call_count: int = field(default=0, init=False)
    _episode_id: Optional[str] = field(default=None, init=False)

    def set_episode(self, episode_id: str) -> None:
        self._episode_id = episode_id

    def after_inference(self) -> None:
        self._call_count += 1
        if self._call_count % self.check_interval != 0:
            return
        if not self.ch_client or not self._episode_id:
            return
        self._check_skew()

    def _check_skew(self) -> None:
        rows = self.ch_client.query(
            """
            SELECT variant_name, count() AS hits
            FROM inference
            WHERE function_name = {function_name:String}
              AND episode_id = {episode_id:String}
            GROUP BY variant_name
            """,
            parameters={
                "function_name": self.function_name,
                "episode_id": self._episode_id,
            }
        ).result_rows

        actual_hits = {row[0]: row[1] for row in rows}
        total = sum(actual_hits.values())
        if total == 0:
            return

        expected_cost = sum(
            self.expected_weights.get(v, 0) * self.variant_costs.get(v, 0)
            for v in self.variant_costs
        )
        actual_cost = sum(
            (actual_hits.get(v, 0) / total) * self.variant_costs.get(v, 0)
            for v in self.variant_costs
        )

        if actual_cost > expected_cost * self.tolerance_factor:
            raise VariantSkewError(
                f"VariantSkewDetector: after {total} calls, actual blended cost "
                f"${actual_cost:.5f}/call is {actual_cost/expected_cost:.1f}× "
                f"expected ${expected_cost:.5f}/call. "
                f"Variant distribution: {actual_hits}"
            )


class VariantSkewError(Exception):
    pass


# Usage
import uuid

ch = clickhouse_connect.get_client(host="localhost", port=8123, database="tensorzero")
detector = VariantSkewDetector(
    function_name="support_agent",
    expected_weights={"cheap_v1": 0.8, "premium_v1": 0.2},
    variant_costs={"cheap_v1": 0.00044, "premium_v1": 0.025},
    check_interval=25,
    tolerance_factor=1.5,
    ch_client=ch,
)

episode_id = str(uuid.uuid4())
detector.set_episode(episode_id)

for ticket in support_tickets:
    response = client.inference(
        function_name="support_agent",
        input={"messages": [{"role": "user", "content": ticket}]},
        episode_id=episode_id,
    )
    try:
        detector.after_inference()
    except VariantSkewError as e:
        print(f"Skew detected, pausing loop: {e}")
        break
    handle_response(response)

The 25-call polling interval means the guard checks for skew without querying ClickHouse on every inference. The tolerance_factor=1.5 threshold gives the binomial distribution room to fluctuate while catching genuine misconfiguration (e.g., weights set to 50/50 instead of 80/20) after 25 calls. Reduce the interval to 10 for tighter enforcement in smaller loops; increase to 50 for very long loops where per-call overhead matters.
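How much room the threshold leaves can be checked numerically. The sketch below assumes the same illustrative costs and an exact 80/20 split. It finds the smallest number of premium hits in the first 25 calls that pushes the blended cost past 1.5× expected, then sums the binomial tail to estimate how often a correctly configured experiment would trip that first check by chance. The function names are hypothetical.

```python
import math


def min_hits_to_trip(window: int, p: float, cheap: float, premium: float,
                     tolerance: float) -> int:
    """Smallest premium-hit count whose blended cost exceeds tolerance x expected."""
    expected = p * premium + (1 - p) * cheap
    for hits in range(window + 1):
        frac = hits / window
        blended = frac * premium + (1 - frac) * cheap
        if blended > expected * tolerance:
            return hits
    return window + 1  # cannot trip within this window


def binomial_tail(window: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(window, p)."""
    return sum(math.comb(window, i) * p**i * (1 - p)**(window - i)
               for i in range(k, window + 1))


k = min_hits_to_trip(window=25, p=0.2, cheap=0.00044, premium=0.025, tolerance=1.5)
print(f"trips at {k}+ premium hits out of 25")
print(f"chance under a true 80/20 split: {binomial_tail(25, 0.2, k):.3f}")
print(f"chance under a 50/50 misconfiguration: {binomial_tail(25, 0.5, k):.3f}")
```

If the chance of a false trip at the first check is higher than you want, a larger tolerance_factor or a longer check_interval narrows it at the cost of slower detection.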

Pattern 3: DICL Token Injection Hidden Overhead

TensorZero's Dynamic In-Context Learning variant type retrieves similar past inferences from TensorZero's ClickHouse database and injects them as few-shot examples into each inference call's prompt. The goal is to improve output quality by showing the model examples of inputs similar to the current one and the high-scoring outputs those inputs received. The mechanism:

The caller submits client.inference(input={...}).
TensorZero vectorizes the current input using a configured embedding model (e.g., openai::text-embedding-3-small).
TensorZero queries ClickHouse for the top-K most similar past inferences with positive feedback scores.
The retrieved examples are injected into the system prompt as few-shot demonstrations.
The augmented prompt is sent to the generation model.

Steps 2–4 happen inside the gateway. The caller's code sees only one inference call. The provider's bill reflects the generation model call with the augmented prompt, which is larger than the unaugmented version by K × avg_example_tokens.

TOML

[functions.code_reviewer.variants.dicl_v1]
type = "experimental_dynamic_in_context_learning"
model = "openai::gpt-4o-mini"
embedding_model = "openai::text-embedding-3-small"
k = 8                   # retrieve top-8 similar past inferences
weight = 1.0

Each retrieved example contributes two message turns (input + output) to the injected few-shot block. If the function processes code review requests and each example averages 600 input tokens plus 400 output tokens, injecting 8 examples adds:

Per-call DICL overhead: 8 examples × (600 + 400) tokens = 8,000 extra input tokens
Cost at gpt-4o-mini rates ($0.15/1M input): 8,000 × $0.15/1M = $0.0012 per call
Plus embedding call: input tokens × $0.02/1M ≈ $0.00001 per call (negligible)
At 1,000 calls/day: $1.20/day in hidden injection overhead
At a larger k=16 with longer examples (2,000 tokens each): $4.80/day
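The overhead scales linearly in both k and example length, which a small helper makes easy to explore. This is a minimal sketch; `dicl_overhead` is a hypothetical name, and the default rate is the gpt-4o-mini input price quoted above.

```python
def dicl_overhead(k: int, tokens_per_example: int, calls_per_day: int,
                  input_rate_per_1m: float = 0.15) -> dict:
    """Extra input tokens and USD cost added by injecting k few-shot examples."""
    extra_tokens = k * tokens_per_example
    per_call = extra_tokens * input_rate_per_1m / 1_000_000
    return {
        "extra_tokens_per_call": extra_tokens,
        "usd_per_call": per_call,
        "usd_per_day": per_call * calls_per_day,
    }


base = dicl_overhead(k=8, tokens_per_example=1_000, calls_per_day=1_000)
large = dicl_overhead(k=16, tokens_per_example=2_000, calls_per_day=1_000)
print(f"k=8,  1,000-token examples: ${base['usd_per_day']:.2f}/day")
print(f"k=16, 2,000-token examples: ${large['usd_per_day']:.2f}/day")
```

Doubling k and doubling example length each double the overhead, so changing both together quadruples it.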

The injection overhead is invisible in the caller's token accounting because the augmented prompt is assembled inside the gateway. If the caller estimates cost from the raw input size before the inference() call, every estimate undershoots by the injection amount. This is particularly acute during:

Cold DICL databases — when few high-feedback examples exist, TensorZero may inject lower-quality examples to fill K slots, adding overhead without the quality benefit
High-k configurations — engineers often increase k to improve quality without realizing the token multiplier grows linearly
Long-context functions — when base prompts are already near the model's context window, DICL injection can push requests over the provider's token limit, causing 400 errors that the agent's retry logic then multiplies

The guard: DICLTokenBudgetGuard

The guard wraps the DICL inference call with a token budget that accounts for expected injection overhead before dispatch:

Python

from dataclasses import dataclass, field

@dataclass
class DICLTokenBudgetGuard:
    """
    Enforces a per-session and per-call token budget that accounts for
    DICL injection overhead. Reads k and avg_example_tokens from config.
    """
    k: int                          # DICL retrieval count from tensorzero.toml
    avg_example_tokens: int         # estimated tokens per retrieved example (input+output)
    model_input_cost_per_1m: float  # cost per 1M input tokens for the generation model
    max_session_cost_usd: float
    max_call_input_tokens: int      # safety ceiling per call (below model context limit)
    _session_injection_cost: float = field(default=0.0, init=False)
    _session_calls: int = field(default=0, init=False)

    def _injection_tokens(self) -> int:
        return self.k * self.avg_example_tokens

    def _injection_cost(self) -> float:
        return self._injection_tokens() * self.model_input_cost_per_1m / 1_000_000

    def preflight(self, base_input_tokens: int) -> None:
        """
        Call BEFORE client.inference(). base_input_tokens = your prompt token count
        BEFORE DICL injection. Guard checks the projected full prompt size.
        """
        projected_input = base_input_tokens + self._injection_tokens()
        if projected_input > self.max_call_input_tokens:
            raise DICLContextError(
                f"DICLTokenBudgetGuard: projected input {projected_input} tokens "
                f"(base={base_input_tokens} + injection={self._injection_tokens()}) "
                f"exceeds per-call ceiling {self.max_call_input_tokens}. "
                f"Reduce k={self.k} or shorten the base prompt."
            )

        projected_session_cost = self._session_injection_cost + self._injection_cost()
        if projected_session_cost > self.max_session_cost_usd:
            raise DICLBudgetError(
                f"DICLTokenBudgetGuard: projected session injection cost "
                f"${projected_session_cost:.3f} exceeds ceiling "
                f"${self.max_session_cost_usd:.2f} "
                f"(k={self.k}, avg_example_tokens={self.avg_example_tokens}, "
                f"calls_so_far={self._session_calls})"
            )

    def record(self) -> None:
        self._session_injection_cost += self._injection_cost()
        self._session_calls += 1

    @property
    def session_injection_cost(self) -> float:
        return self._session_injection_cost


class DICLContextError(Exception):
    pass

class DICLBudgetError(Exception):
    pass


# Usage
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

dicl_guard = DICLTokenBudgetGuard(
    k=8,
    avg_example_tokens=1000,         # 600 input + 400 output per example
    model_input_cost_per_1m=0.15,    # gpt-4o-mini input rate
    max_session_cost_usd=1.50,       # trip when injection alone costs >$1.50
    max_call_input_tokens=100_000,   # stay 28k below gpt-4o-mini's 128k context
)

for item in review_items:
    base_prompt = build_review_prompt(item)
    base_tokens = len(enc.encode(base_prompt))

    try:
        dicl_guard.preflight(base_input_tokens=base_tokens)
    except (DICLContextError, DICLBudgetError) as e:
        print(f"DICL guard tripped: {e}")
        # Fall back to non-DICL variant
        response = client.inference(
            function_name="code_reviewer",
            input={"messages": [{"role": "user", "content": base_prompt}]},
            variant_name="baseline_v1",   # explicit non-DICL fallback
        )
    else:
        response = client.inference(
            function_name="code_reviewer",
            input={"messages": [{"role": "user", "content": base_prompt}]},
        )
        dicl_guard.record()

    process_review(response)

The per-call context ceiling (max_call_input_tokens) is the primary safeguard against context-window overflow: set it to your model's context limit minus the maximum expected output size. The session cost ceiling is the secondary safeguard against accumulated injection overhead across a long loop. When either fires, the guard falls back to a variant_name-explicit call that bypasses DICL entirely — TensorZero supports explicit variant selection via the variant_name parameter.

Pattern 4: Feedback-Inference Cycle

TensorZero's feedback API lets callers post quality signals on past inferences. The intended use is offline optimization: collect feedback during production serving, run the TensorZero optimizer, generate improved prompt variants, deploy them as new variant weights. The feedback API is also callable programmatically in real time from agent code:

Python

# TensorZero feedback API — score a past inference
client.feedback(
    metric_name="quality_score",
    inference_id=response.inference_id,
    value=0.35,   # float metric, 0–1
)

When agents evaluate their own outputs against a quality threshold and re-infer on low scores, a feedback-inference cycle emerges. The pattern looks like this in agent code:

Python

# Naive quality-gating loop — NO convergence ceiling
def generate_with_quality_gate(task: str, threshold: float = 0.8) -> str:
    episode_id = str(uuid.uuid4())
    while True:
        response = client.inference(
            function_name="content_generator",
            input={"messages": [{"role": "user", "content": task}]},
            episode_id=episode_id,
        )
        score = evaluate_quality(response.content)   # local or LLM-as-judge eval

        client.feedback(
            metric_name="quality_score",
            inference_id=response.inference_id,
            value=score,
        )

        if score >= threshold:
            return response.content
        # loop forever if the model never reaches threshold

The cycle has no natural stopping condition. If the generation model's output distribution for the given task never produces a response that scores above threshold — because the judge calibration, the task difficulty, or the model's capability ceiling prevents it — the loop runs until an external kill signal, a context length overflow, or a budget exhaustion. Each iteration posts a negative feedback record, which TensorZero stores and uses for future optimizer runs. A 50-iteration quality-gate failure produces 50 negative feedback records on the same episode — an artifact that may skew the offline optimizer toward over-correction on that prompt variant.

The pattern is invisible in the TensorZero UI because the gateway records each inference independently with its own inference_id. There is no "episode replay" view that would show 50 inferences on the same task as a loop. An operator reviewing the ClickHouse log would see 50 inference records in one episode — nothing anomalous, since episodes are designed to contain multi-turn interactions. The negative feedback accumulation is the only indicator, and it surfaces only in the optimizer dashboard, not in the inference log.

The guard: FeedbackInferenceCycleGuard

The guard enforces a hard ceiling on re-inference attempts per task, a time budget for the quality-gate loop, and a score plateau detector that trips when consecutive attempts show no meaningful improvement:

Python

from dataclasses import dataclass, field
from typing import Optional
import time
import uuid
import tensorzero

@dataclass
class FeedbackInferenceCycleGuard:
    max_attempts: int = 5           # absolute re-inference ceiling per task
    max_elapsed_s: float = 120.0    # wall-clock budget for the quality-gate loop
    plateau_window: int = 3         # consecutive attempts to check for plateau
    plateau_threshold: float = 0.03 # score improvement below this = stuck
    _attempts: int = field(default=0, init=False)
    _start_time: float = field(default_factory=time.monotonic, init=False)
    _score_history: list[float] = field(default_factory=list, init=False)

    def reset(self) -> None:
        self._attempts = 0
        self._start_time = time.monotonic()
        self._score_history = []

    def check(self, latest_score: float) -> tuple[bool, str]:
        """
        Returns (should_retry: bool, reason: str).
        Call AFTER each inference + evaluation; before deciding to re-infer.
        """
        self._attempts += 1
        self._score_history.append(latest_score)
        elapsed = time.monotonic() - self._start_time

        if self._attempts >= self.max_attempts:
            return False, f"max_attempts_reached (attempts={self._attempts}, last_score={latest_score:.3f})"

        if elapsed >= self.max_elapsed_s:
            return False, f"time_budget_exceeded (elapsed={elapsed:.1f}s, max={self.max_elapsed_s}s)"

        if len(self._score_history) >= self.plateau_window:
            recent = self._score_history[-self.plateau_window:]
            improvement = max(recent) - min(recent)
            if improvement < self.plateau_threshold:
                return False, (
                    f"score_plateau (last {self.plateau_window} scores: "
                    f"{[f'{s:.3f}' for s in recent]}, "
                    f"improvement={improvement:.4f} < threshold={self.plateau_threshold})"
                )

        return True, "continue"


def generate_with_quality_gate(
    client: tensorzero.TensorZeroGateway,
    task: str,
    threshold: float = 0.8,
    guard: Optional[FeedbackInferenceCycleGuard] = None,
) -> dict:
    if guard is None:
        guard = FeedbackInferenceCycleGuard()
    guard.reset()

    episode_id = str(uuid.uuid4())
    last_response = None
    last_score = 0.0

    while True:
        response = client.inference(
            function_name="content_generator",
            input={"messages": [{"role": "user", "content": task}]},
            episode_id=episode_id,
        )
        last_response = response
        score = evaluate_quality(response.content)
        last_score = score

        client.feedback(
            metric_name="quality_score",
            inference_id=response.inference_id,
            value=score,
        )

        if score >= threshold:
            # check() has not counted this attempt yet, so add 1 for it.
            return {"content": response.content, "score": score,
                    "attempts": guard._attempts + 1, "reason": "threshold_reached"}

        should_retry, reason = guard.check(latest_score=score)
        if not should_retry:
            return {"content": last_response.content, "score": last_score,
                    "attempts": guard._attempts, "reason": reason}


# Usage
cycle_guard = FeedbackInferenceCycleGuard(
    max_attempts=5,
    max_elapsed_s=90.0,
    plateau_window=3,
    plateau_threshold=0.03,
)

result = generate_with_quality_gate(
    client=client,
    task=user_request,
    threshold=0.80,
    guard=cycle_guard,
)
print(f"Generated in {result['attempts']} attempts | "
      f"score={result['score']:.3f} | reason={result['reason']}")

The plateau detector is the most important component. A model that alternates between scores of 0.62 and 0.64 for three consecutive attempts shows no meaningful progress toward a 0.80 threshold: the spread of 0.02 is below plateau_threshold=0.03, so the guard stops the loop and returns the most recent attempt. Without the plateau detector, the loop continues through all max_attempts iterations even when the trajectory shows the threshold is unreachable. The guard returns the last response with metadata rather than raising an exception, so the caller can decide whether to handle a below-threshold result or escalate.
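The plateau rule is easy to exercise in isolation. The sketch below restates it as a standalone function so the window and threshold can be tuned against recorded score histories before wiring them into the guard. The name `is_plateau` and the score sequences are illustrative, not taken from a real run.

```python
def is_plateau(scores: list[float], window: int = 3, threshold: float = 0.03) -> bool:
    """True when the last `window` scores span less than `threshold`."""
    if len(scores) < window:
        return False  # not enough history to judge
    recent = scores[-window:]
    return max(recent) - min(recent) < threshold


stuck = [0.62, 0.63, 0.62]      # oscillating within 0.01
climbing = [0.50, 0.60, 0.70]   # improving by 0.10 per attempt
falling = [0.70, 0.60, 0.50]    # getting worse by 0.10 per attempt
too_short = [0.62, 0.62]        # fewer than `window` attempts

for name, seq in [("stuck", stuck), ("climbing", climbing),
                  ("falling", falling), ("too_short", too_short)]:
    print(name, is_plateau(seq))
```

Note that the rule measures spread, not direction: a sequence that falls steadily is not treated as a plateau, so the attempt ceiling and time budget remain the backstop for runs that are getting worse.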

What TensorZero and RunGuard Cover Together

Layer	TensorZero responsibility	RunGuard responsibility
Inference routing	Route calls to variants by weight; enforce structured output schemas	—
Best-of-N sampling	Generate N candidates; score with judge function; return winner	Pre-compute N-multiplied cost; enforce session ceiling before dispatch
Variant experimentation	Assign variant per call by configured weight; log assignment to ClickHouse	Poll ClickHouse mid-loop; detect actual vs. expected cost skew; halt on threshold
DICL injection	Embed input; retrieve top-K examples; inject into prompt	Account for injection overhead in per-call token budget; fallback to non-DICL variant on trip
Feedback collection	Store feedback scores; surface in optimizer; update variant weights offline	Enforce per-episode re-inference ceiling, time budget, and plateau detection
Spend ceiling	Record inferences and costs in ClickHouse after the fact	Enforce ceiling before charges accrue

TensorZero's documentation frames the platform as an LLM gateway and optimization engine: routing, logging, and offline learning. It is the right framing: TensorZero records what your agents did and improves model performance over time. The policy layer ("this best-of-N loop must not exceed $2 per session," "this quality-gate cycle must stop after 5 attempts," "this DICL call must not exceed 100k input tokens") belongs in the runtime guard layer that can stop execution before the charges or context overflows land.

Failure Modes This Post Does Not Cover

Four patterns were enough for one post. TensorZero's architecture has others:

Parallel episode fan-out — when agents dispatch many episodes concurrently (e.g., processing a batch of tasks), all episodes compete for the same TensorZero gateway connection pool and the same provider rate limits. A 10-episode parallel batch with best-of-N=5 fires 50+ simultaneous provider calls, likely exhausting a 60-RPM rate limit within seconds and triggering retry cascades.
Shadow variant double-billing — TensorZero supports shadow traffic configurations where 100% of requests are served by the primary variant but also sent to a shadow variant for comparison. Engineers configuring shadow traffic for a new model may not realize the shadow calls are billed at full provider rates despite not being returned to callers.
Offline optimizer LLM calls — TensorZero's optimizer scripts (run manually or on a schedule) make their own LLM calls to generate and evaluate candidate prompt variants. A tight feedback-loop configuration that triggers the optimizer on every 20 new feedback records can result in hourly optimizer runs, each costing $0.50–$5 in LLM calls depending on the variant generation strategy.
ClickHouse write amplification — every inference and feedback record is written to ClickHouse. High-frequency agents (thousands of calls per minute) can generate ClickHouse write pressure that the default single-node deployment does not handle gracefully, causing write timeouts that cascade back into TensorZero's response latency.

Common questions

Can I disable best-of-N for specific calls without changing tensorzero.toml?

Yes. Pass variant_name explicitly in the client.inference() call to bypass the variant weight sampling and force a specific variant. If you have a baseline non-best-of-N variant configured (e.g., research_agent_baseline), you can select it directly: client.inference(function_name="research_agent", variant_name="research_agent_baseline", input={...}). This is the recommended pattern for the DICL fallback guard and for ad-hoc overrides during debugging. For production enforcement, configure a separate function (e.g., research_agent_no_bon) rather than relying on the variant_name override, because the override path bypasses the experiment assignment logging.

Does VariantSkewDetector's ClickHouse polling add latency to the inference loop?

The poll is synchronous and runs on the calling thread every check_interval calls. A ClickHouse query against a local or same-datacenter TensorZero deployment typically completes in 5–20ms. At check_interval=25, the amortized latency overhead per call is 0.2–0.8ms — negligible for most agent loops. If your loop is latency-sensitive, increase check_interval to 50 or 100, or move the poll to a background thread that sets a flag the main loop checks. For very high-frequency loops (>100 calls/second), consider the ClickHouse query's read amplification and ensure the inference table has appropriate indexing on (function_name, episode_id).
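Moving the poll off the calling thread could look like the sketch below. It assumes a zero-argument `check_fn` that raises when skew is found (for example, a bound method that runs the detector's ClickHouse query). The class name `BackgroundSkewMonitor` and its parameters are hypothetical, and the stand-in check function exists only to make the example runnable without a database.

```python
import threading
from typing import Callable, Optional


class BackgroundSkewMonitor:
    """Runs check_fn on a timer thread; the main loop only reads a flag."""

    def __init__(self, check_fn: Callable[[], None], interval_s: float = 5.0):
        self._check_fn = check_fn
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._tripped = threading.Event()
        self.error: Optional[Exception] = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval_s):
            try:
                self._check_fn()
            except Exception as exc:   # any failure from the check trips the flag
                self.error = exc
                self._tripped.set()
                return

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    @property
    def tripped(self) -> bool:
        return self._tripped.is_set()


# Stand-in check that reports skew on its third poll.
polls = {"count": 0}

def fake_check() -> None:
    polls["count"] += 1
    if polls["count"] >= 3:
        raise RuntimeError("skew detected")

monitor = BackgroundSkewMonitor(fake_check, interval_s=0.01)
monitor.start()
monitor._tripped.wait(timeout=2.0)   # a real loop would test monitor.tripped per call
monitor.stop()
print(monitor.tripped, monitor.error)
```

The main loop then checks `monitor.tripped` before each inference call and breaks when it is set, so a slow query delays detection but never the inference path. Because any exception trips the flag, a real integration would probably want to distinguish skew errors from connection failures.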

How do I estimate avg_example_tokens for DICLTokenBudgetGuard?

Query your TensorZero ClickHouse database for the average token count of inferences stored for the function: SELECT avg(input_tokens + output_tokens) FROM inference WHERE function_name = 'my_fn'. If the function is new and has no history, estimate based on your typical prompt structure — a code review function with 300-token inputs and 200-token outputs averages 500 tokens per example. Run 20–50 calls manually, then query ClickHouse for the actual average. Set avg_example_tokens to the 90th percentile, not the mean, to avoid underestimating injection overhead for long examples.
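Computing the 90th percentile from a sample of per-inference token counts takes only a few lines. This is a minimal sketch using the nearest-rank method; `percentile_nearest_rank` is a hypothetical helper and the sample values are made up for illustration.

```python
import math
from statistics import mean


def percentile_nearest_rank(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Made-up token counts (input + output) for 10 stored examples.
sample = [420, 450, 480, 500, 510, 530, 560, 600, 900, 1_400]

print(f"mean: {mean(sample):.0f} tokens")
print(f"p90:  {percentile_nearest_rank(sample, 90)} tokens")
```

A few long examples pull the tail well above the mean, which is why sizing the budget from the mean alone tends to undercount.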

Should I use TensorZero's built-in episode tracking or manage episode IDs in application code?

Manage episode IDs in application code. Generate a UUID per logical task at the start of each task, pass it through all inference calls for that task, and use it as the scope boundary for VariantSkewDetector and FeedbackInferenceCycleGuard. TensorZero treats episode_id as an opaque grouping key; there is no server-side episode lifecycle management or automatic episode termination. Keeping episode_id scoped to one task in application code makes it easy to query ClickHouse for per-task inference counts and to detect the looping pattern (SELECT count() FROM inference WHERE episode_id = ? returning a count of 20 or more is a reliable loop signal).

What is the right max_attempts value for FeedbackInferenceCycleGuard?

Calibrate against your function's typical quality score distribution. Run 30–50 representative tasks manually, record the score achieved on the first attempt for each, and find the 10th percentile first-attempt score, which is the score that 90% of tasks exceed. If 90% of tasks score above 0.65 on the first attempt and your threshold is 0.80, setting max_attempts=3 is sufficient: tasks that don't reach 0.80 by attempt 3 are unlikely to reach it at all. If your model frequently starts below 0.50 and climbs to 0.80 over 2–3 refinement cycles, set max_attempts=5 with the plateau detector configured at plateau_threshold=0.04. The plateau detector is more important than the attempt ceiling: it stops loops that are oscillating around a plateau score much faster than a fixed ceiling would.

Add spend ceilings to your TensorZero-routed agents

RunGuard's SDK gives you BudgetTracker, LoopDetector, and ContextGuard — three primitives that integrate alongside TensorZero's routing and feedback layers. TensorZero records what happened; RunGuard stops what shouldn't happen again.
