June 26, 2026 W&B Weave LLM Observability Cost Control Python

W&B Weave Cost Control and Loop Detection: Circuit Breakers for LLM Tracing Ops

Weights & Biases Weave has become the go-to LLM tracing and evaluation framework for teams already in the W&B ecosystem, and increasingly for teams outside it. The @weave.op() decorator instruments any function it wraps with automatic trace capture: inputs, outputs, token counts, latency, and cost attribution, all without modifying the function's logic. Add one weave.init("project") call and one decorator, and every LLM call in your app flows into the Weave UI with full lineage.

The cost control challenge in Weave is a direct consequence of what makes it useful. Weave traces everything, but it enforces nothing. The framework's job is observation — capturing what your ops do with perfect fidelity — not intervention. This is the right design choice for an observability tool, but it creates a structural gap in production systems: Weave will faithfully record a looping agent running 2,000 LLM calls over six hours, timestamped and attributed to the right op, while your LLM provider charges you for every one of them. The trace log is beautiful. The invoice is not.

Four structural patterns in Weave-instrumented systems cause this silent amplification:

Recursive op explosion — retry or reflection logic inside a @weave.op()-decorated function calls the same op again; Weave traces each recursive invocation as a separate child call, each billed to the LLM, with no depth ceiling.
Evaluation dataset fan-out — weave.Evaluation.evaluate() fires one LLM call per row × per scorer × per retry; a 200-row dataset with three scorers and a max-retry of 3 can generate over 1,800 LLM calls from a single await evaluation.evaluate(model) invocation.
Async parallelism without concurrency caps — Weave's async-first design makes it trivial to run an op across a dataset with asyncio.gather(); without a semaphore, every row fires simultaneously, saturating your API rate limit and burning burst token budget before the first result returns.
Per-op retry accumulation — when the LLM client inside a @weave.op() raises a transient error and a retry decorator kicks in, the full input prompt (including any accumulated context) is re-submitted on each retry; Weave traces each retry as a separate span, each billed at full input cost.

Weave's tracing architecture and where cost lives

Weave's tracing model is call-graph-based. Every @weave.op()-decorated function that executes within the scope of an active Weave call becomes a node in a call tree, with parent-child relationships determined by the Python call stack. This means that nesting ops — calling one @weave.op() function from another — is the natural way to build composable pipelines in Weave, and the trace UI renders the nesting beautifully as an expandable tree.

The cost implication of call-graph tracing is that cost is attributed per-node, not aggregated at the root. If a top-level research_agent op calls search_web, summarize_source (×N), and synthesize_report, each with its own LLM call, the total cost of research_agent is the sum of all child op costs. This is correct accounting — but it also means that a looping child op multiplies the parent's cost linearly with iteration count, and the parent op's trace only shows a final aggregate after all children complete.
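The roll-up described above can be sketched without Weave at all. The `CallNode` class and the per-op dollar figures below are hypothetical stand-ins for illustration, not Weave types or measured prices; the point is only that a parent's total is its own cost plus the sum of its children, so a child op that repeats N times scales the parent linearly.

```python
from dataclasses import dataclass, field


@dataclass
class CallNode:
    """Hypothetical stand-in for one traced call; not a Weave class."""
    op_name: str
    own_cost_usd: float = 0.0  # cost of LLM calls made directly by this op
    children: list["CallNode"] = field(default_factory=list)


def total_cost(node: CallNode) -> float:
    """A node's total cost is its own cost plus the total cost of every child."""
    return node.own_cost_usd + sum(total_cost(child) for child in node.children)


def build_research_tree(n_sources: int) -> CallNode:
    """research_agent -> search_web, summarize_source (x N), synthesize_report."""
    return CallNode(
        "research_agent",
        children=[
            CallNode("search_web", 0.002),  # illustrative per-call costs
            *[CallNode("summarize_source", 0.01) for _ in range(n_sources)],
            CallNode("synthesize_report", 0.05),
        ],
    )


for n in (5, 50):
    # The root total grows by one summarize_source cost per extra iteration.
    print(n, round(total_cost(build_research_tree(n)), 3))
```

Going from 5 to 50 summarize_source calls changes nothing at the root except the sum, which is why a looping child is invisible in the parent's trace until the children have finished.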

Weave captures token usage from the LLM response's usage field automatically for providers it recognizes (OpenAI, Anthropic, Cohere, Google Gemini, Groq, Mistral, and others via the Weave client wrappers). The framework computes per-call cost using built-in model pricing tables and surfaces it in the summary.weave.costs field of each call. This cost attribution is accurate and detailed — but it is post-hoc. Weave records what happened. It does not interrupt what is happening.
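A minimal sketch of reading that field defensively is shown here. It assumes the summary is a nested dict shaped like `{"weave": {"costs": {model_name: {...}}}}`; the per-model key names (`total_cost`, `prompt_tokens_total_cost`, `completion_tokens_total_cost`) are assumptions and should be checked against what your Weave version actually returns. The helper name `extract_call_cost` is hypothetical.

```python
from typing import Optional


def extract_call_cost(summary: Optional[dict]) -> float:
    """Sum cost across every model listed under summary["weave"]["costs"].

    Missing or None levels are treated as "no cost recorded" rather than errors,
    because the field may be absent for calls that made no LLM request.
    """
    costs = ((summary or {}).get("weave") or {}).get("costs") or {}
    total = 0.0
    for entry in costs.values():
        if "total_cost" in entry:
            # Single-total shape (assumed key name).
            total += entry["total_cost"] or 0.0
        else:
            # Split prompt/completion shape (assumed key names).
            total += entry.get("prompt_tokens_total_cost") or 0.0
            total += entry.get("completion_tokens_total_cost") or 0.0
    return total


example = {
    "weave": {
        "costs": {
            "gpt-4o": {
                "prompt_tokens_total_cost": 0.01,
                "completion_tokens_total_cost": 0.02,
            }
        }
    }
}
print(extract_call_cost(example))
print(extract_call_cost(None))
```

Returning 0.0 for a missing field keeps a budget tracker from crashing on non-LLM ops, at the price of silently under-counting if the key names are wrong, so it is worth logging one real summary before trusting the numbers.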

The practical implication: in a Weave-instrumented system, you will always know exactly what each runaway run cost you, down to the op and the token. You will not, without additional guards, know in advance that a run is going to be expensive until the trace is already written and the bill is already accruing.

Failure mode 1: recursive op explosion

The pattern emerges when agents implement reflection or self-correction inside a @weave.op()-decorated function by calling the same function again with enriched context. The intent is reasonable: if the LLM's first output fails a validation check, retry with the error message appended to the prompt. The problem is that when the correction logic itself calls the same @weave.op() function, Weave traces the recursive call as a new child op — which is correct, since it is a new call — and there is no ceiling on recursion depth in the framework.

The failure mode is subtle because the recursion usually terminates in testing, where validation passes quickly. In production, under adversarial or out-of-distribution inputs, the validator may consistently reject the output, and the retry chain runs until Python's recursion limit is hit or the consuming code raises a timeout, whichever comes first. Every level resubmits the full prompt, so a five-level-deep recursion over a 30,000-token context pays for roughly 5 × 30,000 = 150,000 input tokens, even though each retry adds only a short error message to the prompt.
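As a rough worked example of that arithmetic, the sketch below totals input tokens across a self-correction chain. The price constant assumes $2.50 per million input tokens, and the 100-token feedback message is an illustrative guess; the function name is hypothetical.

```python
USD_PER_INPUT_TOKEN = 2.5e-6  # assumption: $2.50 per 1M input tokens


def self_correction_input_cost(
    context_tokens: int,
    levels: int,
    feedback_tokens: int = 100,
) -> tuple[int, float]:
    """Return (total input tokens, input cost in USD) for `levels` chained calls."""
    total_tokens = 0
    for level in range(levels):
        # The first call sends only the context; every retry sends the full
        # context again plus a short validation-failure message.
        total_tokens += context_tokens + (feedback_tokens if level > 0 else 0)
    return total_tokens, total_tokens * USD_PER_INPUT_TOKEN


tokens, usd = self_correction_input_cost(context_tokens=30_000, levels=5)
print(tokens, round(usd, 3))
```

The feedback message is a rounding error next to the context, which is the whole problem: the marginal information per retry is tiny and the marginal cost is the full prompt.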

Python — recursive Weave op with no depth ceiling (the problem)

import weave
import openai

weave.init("my-research-agent")
client = openai.OpenAI()


@weave.op()
def validate_output(text: str) -> bool:
    # Returns True if the output meets quality criteria
    return len(text) > 200 and "conclusion" in text.lower()


@weave.op()
def generate_report(topic: str, context: str = "", attempt: int = 0) -> str:
    messages = [
        {"role": "system", "content": "You are a research assistant. Write a thorough report."},
    ]
    if context:
        messages.append({"role": "user", "content": f"Previous attempt failed: {context}"})
    messages.append({"role": "user", "content": f"Write a report on: {topic}"})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    output = response.choices[0].message.content

    # Self-correction: if output fails validation, retry with error context
    # BUG: no depth ceiling — this can recurse indefinitely on bad inputs
    if not validate_output(output):
        return generate_report(
            topic,
            context=f"Output was too short or lacked a conclusion. Got: {output[:500]}",
            attempt=attempt + 1,
        )

    return output

The guard wraps the retry logic with an explicit depth counter passed through the call, raising a RecursiveOpBudgetExceeded error when the ceiling is hit. The counter is carried as a plain Python argument (a thread-local or context variable would also work), so it needs no Weave API of its own and appears in the trace as an ordinary input.

Python — recursive op depth guard for Weave

import weave
import openai
from dataclasses import dataclass

weave.init("my-research-agent")
client = openai.OpenAI()

MAX_SELF_CORRECTION_DEPTH = 3


class RecursiveOpBudgetExceeded(RuntimeError):
    """Raised when a Weave op exceeds the allowed self-correction depth."""
    pass


@weave.op()
def validate_output(text: str) -> bool:
    return len(text) > 200 and "conclusion" in text.lower()


@weave.op()
def generate_report(topic: str, context: str = "", _depth: int = 0) -> str:
    """
    Generates a report on the given topic, with self-correction up to
    MAX_SELF_CORRECTION_DEPTH retries. Raises RecursiveOpBudgetExceeded
    if the validator consistently rejects the output.
    """
    if _depth > MAX_SELF_CORRECTION_DEPTH:
        raise RecursiveOpBudgetExceeded(
            f"generate_report exceeded {MAX_SELF_CORRECTION_DEPTH} self-correction "
            f"attempts for topic '{topic}'. Last context: {context[:200]}"
        )

    messages = [
        {"role": "system", "content": "You are a research assistant. Write a thorough report."},
    ]
    if context:
        # Truncate context to prevent prompt growth across retries
        messages.append({
            "role": "user",
            "content": f"Previous attempt failed validation: {context[:300]}. Please fix.",
        })
    messages.append({"role": "user", "content": f"Write a report on: {topic}"})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=1500,
    )
    output = response.choices[0].message.content

    if not validate_output(output):
        return generate_report(
            topic,
            context=f"Output lacked required structure. Got: {output[:300]}",
            _depth=_depth + 1,   # explicit depth tracking, surfaced in Weave trace
        )

    return output


# Usage: Weave traces each attempt as a child call with _depth visible in the trace
try:
    report = generate_report("quantum computing advances in 2026")
except RecursiveOpBudgetExceeded as e:
    print(f"[RunGuard] Self-correction loop stopped: {e}")
    # surface the error in the Weave trace rather than silently failing
    raise

Failure mode 2: evaluation dataset fan-out

Weave's evaluation framework is powerful. Define a dataset, write a scorer, call await evaluation.evaluate(model), and Weave runs your model on every row, scores the output with every scorer, and presents a full report with per-example results and aggregate metrics. The ergonomics are excellent for offline evaluation workflows.

The cost amplification is directly proportional to the dataset size and scorer count, and it is easy to underestimate. A 200-row dataset with three scorers (one for correctness, one for faithfulness, one for fluency) fires 200 × 3 = 600 LLM calls from the scorers alone, plus 200 calls for the model itself, for 800 total. If any scorer uses a retry policy (Weave's built-in retry or a custom one), each failed scorer call multiplies further. With a 2,000-token average prompt, 800 calls × 2,000 tokens = 1.6M input tokens, which is $4 in input tokens alone at GPT-4o's $2.50 per million and roughly $4–8 per evaluation run once output tokens are counted. Run this in a CI pipeline that triggers on every PR, and a week of active development generates 50+ evaluation runs totaling $200–400 in LLM spend from evaluation alone, before any production traffic.
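The arithmetic in that paragraph is easy to turn into a small estimator to run before committing to a dataset size. This is a sketch under stated assumptions: the price constant assumes $2.50 per million input tokens, output tokens are ignored, and `estimate_eval_run` is a hypothetical helper rather than anything Weave provides.

```python
USD_PER_INPUT_TOKEN = 2.5e-6  # assumption: $2.50 per 1M input tokens


def estimate_eval_run(
    rows: int,
    scorers: int,
    avg_prompt_tokens: int,
    max_attempts_per_call: int = 1,
) -> dict:
    """Worst-case call count and input-token cost for one evaluation run."""
    model_calls = rows                 # one prediction per row
    scorer_calls = rows * scorers      # every scorer judges every row
    calls = (model_calls + scorer_calls) * max_attempts_per_call
    input_tokens = calls * avg_prompt_tokens
    return {
        "calls": calls,
        "input_tokens": input_tokens,
        "input_usd": input_tokens * USD_PER_INPUT_TOKEN,
    }


one_run = estimate_eval_run(rows=200, scorers=3, avg_prompt_tokens=2_000)
print(one_run)
print("50 CI runs, input only:", round(50 * one_run["input_usd"], 2))
```

Setting `max_attempts_per_call` above 1 gives the ceiling when every call exhausts its retries, which is the number to compare against a budget rather than the happy-path figure.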

Python — Weave evaluation with per-run cost ceiling and row budget

import asyncio
import weave
from weave import Evaluation
from dataclasses import dataclass
from typing import Any

weave.init("my-eval-project")

# Cost model: approximate input cost per model call in USD
# Adjust per your model and average prompt length
COST_PER_MODEL_CALL_USD = 0.005          # ~2k input tokens at $2.50 per 1M (gpt-4o)
COST_PER_SCORER_CALL_USD = 0.0025        # ~1k input tokens per scorer call
MAX_EVAL_BUDGET_USD = 2.00               # hard cap per evaluation run
MAX_ROWS_PER_RUN = 50                    # cap on dataset rows evaluated per run


@dataclass
class EvalBudget:
    max_usd: float
    max_rows: int
    spent_usd: float = 0.0
    rows_run: int = 0

    def check_and_deduct(self, cost: float) -> None:
        if self.spent_usd + cost > self.max_usd:
            raise RuntimeError(
                f"[RunGuard] Evaluation budget exceeded: "
                f"${self.spent_usd:.3f} spent + ${cost:.3f} projected > "
                f"${self.max_usd:.2f} ceiling. Halting evaluation."
            )
        self.spent_usd += cost

    def check_rows(self) -> None:
        if self.rows_run >= self.max_rows:
            raise RuntimeError(
                f"[RunGuard] Evaluation row cap reached: {self.rows_run} rows "
                f"(max {self.max_rows}). Halting to prevent runaway fan-out."
            )


def make_guarded_model(base_model, budget: EvalBudget):
    """Wraps a Weave model's predict method with budget deduction."""

    @weave.op()
    async def guarded_predict(question: str) -> dict:
        budget.check_rows()
        budget.check_and_deduct(COST_PER_MODEL_CALL_USD)
        budget.rows_run += 1
        return await base_model.predict(question)

    return guarded_predict


@weave.op()
def correctness_scorer(question: str, model_output: dict, target: str) -> dict:
    # Real scorer calls an LLM to judge correctness
    # (implementation omitted for brevity)
    return {"correct": True, "score": 1.0}


async def run_budgeted_evaluation(model, dataset_rows: list[dict]) -> dict:
    budget = EvalBudget(max_usd=MAX_EVAL_BUDGET_USD, max_rows=MAX_ROWS_PER_RUN)

    # Pre-check: estimate cost before starting
    n_rows = min(len(dataset_rows), MAX_ROWS_PER_RUN)
    n_scorers = 1  # adjust to number of scorers
    estimated_cost = n_rows * (COST_PER_MODEL_CALL_USD + n_scorers * COST_PER_SCORER_CALL_USD)
    if estimated_cost > MAX_EVAL_BUDGET_USD:
        raise RuntimeError(
            f"[RunGuard] Estimated evaluation cost ${estimated_cost:.2f} exceeds "
            f"budget ${MAX_EVAL_BUDGET_USD:.2f}. Reduce dataset to "
            f"{int(MAX_EVAL_BUDGET_USD / (COST_PER_MODEL_CALL_USD + n_scorers * COST_PER_SCORER_CALL_USD))} rows."
        )

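    # Route every prediction through the budget guard. Passing the raw model
    # would bypass the budget entirely and leave rows_run at 0.
    guarded_model = make_guarded_model(model, budget)

    evaluation = Evaluation(
        dataset=dataset_rows[:n_rows],
        scorers=[correctness_scorer],
    )

    try:
        results = await evaluation.evaluate(guarded_model)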
        print(f"[RunGuard] Evaluation complete. Rows: {budget.rows_run}, "
              f"estimated spend: ${budget.spent_usd:.3f}")
        return results
    except RuntimeError as e:
        # Budget exceeded mid-run — partial results may be available in Weave UI
        print(f"[RunGuard] Evaluation halted: {e}")
        raise

Failure mode 3: async parallelism without concurrency caps

Weave is built async-first, and its evaluation framework uses asyncio to parallelize row processing. This makes evaluation fast: with enough API concurrency, a 100-row dataset completes in roughly the same wall-clock time as a single row. The cost is identical whether rows run in parallel or in series — both fire the same number of LLM calls — but the burst behavior changes everything when it comes to rate limits and unexpected spend spikes.

The failure mode appears in two contexts. First, when developers build batch processing pipelines on top of Weave ops and use asyncio.gather() to run all rows at once. Second, within the evaluation framework itself, when the concurrency limit (max_concurrency here; the name and default vary between Weave versions) is not set explicitly. At GPT-4o's tier-1 rate limit of 500 requests per minute, a burst of 200 simultaneous requests fits inside the limit, so nothing throttles it: all 200 calls are accepted and billed before the first result returns. A larger burst overshoots the limit, and the overflow is governed only by the client's retry behavior. Without explicit concurrency control, a single await asyncio.gather(*tasks) with 1,000 large-prompt tasks can commit a substantial amount of spend before the first rate limit error propagates back.
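The burst behavior can be observed without spending anything. The sketch below replaces the LLM call with a short sleep and measures the peak number of calls in flight, with and without a semaphore. `peak_in_flight` and `fake_llm_call` are hypothetical names for this demonstration.

```python
import asyncio
from typing import Optional


async def peak_in_flight(n_tasks: int, limit: Optional[int]) -> int:
    """Run n_tasks fake calls and return the highest number in flight at once."""
    semaphore = asyncio.Semaphore(limit) if limit else None
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for network and model latency
        in_flight -= 1

    async def run_one() -> None:
        if semaphore is None:
            await fake_llm_call()      # unbounded: starts immediately
        else:
            async with semaphore:      # bounded: waits for a free slot
                await fake_llm_call()

    await asyncio.gather(*(run_one() for _ in range(n_tasks)))
    return peak


if __name__ == "__main__":
    print("unbounded:", asyncio.run(peak_in_flight(200, None)))
    print("semaphore(8):", asyncio.run(peak_in_flight(200, 8)))
```

The total number of calls is identical in both runs; only the peak changes. That peak is what determines how many requests are already committed when the first error or budget signal arrives.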

Python — bounded async concurrency guard for Weave batch ops

import asyncio
import weave
from typing import Any, Callable, Coroutine, TypeVar

weave.init("my-batch-project")

T = TypeVar("T")

MAX_CONCURRENT_OPS = 10          # max simultaneous Weave op calls
MAX_TOTAL_OPS_PER_BATCH = 200    # hard ceiling per batch invocation


class AsyncBatchBudgetExceeded(RuntimeError):
    pass


async def run_ops_with_budget(
    op_fn: Callable[..., Coroutine[Any, Any, T]],
    inputs: list[dict],
    concurrency: int = MAX_CONCURRENT_OPS,
    max_total: int = MAX_TOTAL_OPS_PER_BATCH,
) -> list[T | Exception]:
    """
    Runs a @weave.op()-decorated async function over a list of inputs with
    bounded concurrency and a hard total-calls ceiling.

    Returns a list of results (or exceptions for failed calls).
    """
    if len(inputs) > max_total:
        raise AsyncBatchBudgetExceeded(
            f"[RunGuard] Batch of {len(inputs)} inputs exceeds max_total={max_total}. "
            "Split the batch or raise the ceiling explicitly."
        )

    semaphore = asyncio.Semaphore(concurrency)
    completed = 0
    results: list[T | Exception] = [None] * len(inputs)

    async def run_one(i: int, kwargs: dict) -> None:
        nonlocal completed
        async with semaphore:
            try:
                results[i] = await op_fn(**kwargs)
            except Exception as e:
                results[i] = e
            finally:
                completed += 1
                if completed % 50 == 0:
                    print(f"[RunGuard] Batch progress: {completed}/{len(inputs)} ops complete")

    await asyncio.gather(*[run_one(i, inp) for i, inp in enumerate(inputs)])
    return results


# Example usage with a Weave op
@weave.op()
async def analyze_document(doc_id: str, text: str) -> dict:
    # Real implementation calls an LLM
    return {"doc_id": doc_id, "summary": "..."}


async def batch_analyze(documents: list[dict]) -> list[dict]:
    results = await run_ops_with_budget(
        op_fn=analyze_document,
        inputs=documents,
        concurrency=8,           # never more than 8 simultaneous LLM calls
        max_total=150,           # hard stop before starting if batch is too large
    )
    failures = [r for r in results if isinstance(r, Exception)]
    if failures:
        print(f"[RunGuard] {len(failures)}/{len(results)} ops failed in batch")
    return [r for r in results if not isinstance(r, Exception)]

Failure mode 4: per-op retry accumulation

Retry logic is ubiquitous in production LLM applications. Rate limits, transient network errors, model server timeouts, and content filter rejections all justify retrying the call. The standard approach — an exponential backoff decorator like tenacity's @retry — is correct for infrastructure failures. The cost problem arises when the same retry decorator is applied to a @weave.op()-decorated function that includes accumulated context in the prompt.

The pattern: an agent builds up a research context of 40,000 tokens across multiple tool calls, then calls a @weave.op() synthesis function with the full context as input. The synthesis call hits a transient 429 rate limit. The retry decorator re-submits the same call after 2 seconds — paying 40,000 input tokens again. If the call hits three rate limits before succeeding, the effective input cost for that synthesis step is 4 × 40,000 = 160,000 input tokens. Weave traces each attempt correctly, each as a separate call with its own cost attribution. The aggregate is accurate. The spend is still 4× what it should have been.

Python — retry-aware cost guard for expensive Weave ops

import weave
import openai
import time
from functools import wraps
from typing import Any, Callable, TypeVar

weave.init("my-agent-project")

F = TypeVar("F", bound=Callable[..., Any])

MAX_RETRY_COST_MULTIPLIER = 2.5   # allow up to 2.5x the base call cost via retries
BASE_COST_PER_TOKEN_INPUT = 2.5e-6  # gpt-4o: $2.50 per 1M tokens


class RetryBudgetExceeded(RuntimeError):
    pass


def cost_aware_retry(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_total_cost_multiplier: float = MAX_RETRY_COST_MULTIPLIER,
    estimate_input_tokens: Callable[[tuple, dict], int] | None = None,
):
    """
    Retry decorator for @weave.op() functions that tracks cumulative retry cost
    and refuses to retry if the retry spend would exceed the ceiling.

    estimate_input_tokens: optional callable(args, kwargs) -> int that estimates
    the input token count from the function arguments. Used to compute retry cost.
    If not provided, retries without cost tracking (only max_retries is enforced).
    """
    def decorator(fn: F) -> F:
        @wraps(fn)
        def wrapper(*args, **kwargs):
            base_token_estimate: int | None = None
            cumulative_cost_usd = 0.0

            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except openai.RateLimitError as e:
                    if attempt == max_retries:
                        raise

                    # Estimate cost of this retry
                    if estimate_input_tokens is not None:
                        tokens = estimate_input_tokens(args, kwargs)
                        if base_token_estimate is None:
                            base_token_estimate = tokens
                        retry_cost = tokens * BASE_COST_PER_TOKEN_INPUT
                        cumulative_cost_usd += retry_cost

                        base_cost = (base_token_estimate or tokens) * BASE_COST_PER_TOKEN_INPUT
                        if base_cost > 0 and cumulative_cost_usd > base_cost * max_total_cost_multiplier:
                            raise RetryBudgetExceeded(
                                f"[RunGuard] Retry cost ceiling hit on '{fn.__name__}': "
                                f"cumulative retry spend ${cumulative_cost_usd:.4f} exceeds "
                                f"{max_total_cost_multiplier}x base cost ${base_cost:.4f}. "
                                "Retrying this call is more expensive than its value."
                            ) from e

                    delay = base_delay * (2 ** attempt)
                    print(f"[RunGuard] Rate limit on '{fn.__name__}' (attempt {attempt+1}/"
                          f"{max_retries}). Retrying in {delay:.1f}s.")
                    time.sleep(delay)

        return wrapper  # type: ignore
    return decorator


def estimate_messages_tokens(args: tuple, kwargs: dict) -> int:
    """Estimate input tokens from a messages= kwarg or first positional arg."""
    messages = kwargs.get("messages") or (args[0] if args else [])
    if not isinstance(messages, list):
        return 4096  # safe default estimate
    total = 0
    for m in messages:
        content = m.get("content") or ""
        total += len(content) // 4  # rough 4-chars-per-token approximation
        total += 4  # role overhead
    return max(total, 100)


client = openai.OpenAI()


@weave.op()
@cost_aware_retry(
    max_retries=3,
    base_delay=1.0,
    max_total_cost_multiplier=2.5,
    estimate_input_tokens=estimate_messages_tokens,
)
def synthesize_research(messages: list[dict], model: str = "gpt-4o") -> str:
    """
    Synthesizes a research report from accumulated context.
    With a 40k-token context, each retry costs ~$0.10 in input tokens.
    The guard stops retrying if cumulative retry cost exceeds 2.5x the base call cost.
    """
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=2000,
    )
    return response.choices[0].message.content
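It helps to work out what the multiplier ceiling permits in the worst case. The sketch below mirrors the accounting in cost_aware_retry above: one base cost is added per failed attempt, and the guard trips once the running total exceeds the multiplier times the base cost. It follows this post's premise that every attempt is charged at full input cost and assumes the token estimate is the same on each attempt; both helper names are hypothetical.

```python
import math

USD_PER_INPUT_TOKEN = 2.5e-6  # assumption: $2.50 per 1M input tokens


def attempts_before_stop(max_retries: int, cost_multiplier: float) -> int:
    """Attempts made when every attempt fails with a rate-limit error.

    The cost guard trips on failure number floor(multiplier) + 1; the plain
    retry limit stops after max_retries + 1 attempts. Whichever is lower wins.
    """
    return min(max_retries + 1, math.floor(cost_multiplier) + 1)


def worst_case_input_usd(
    input_tokens: int, max_retries: int, cost_multiplier: float
) -> float:
    """Input spend if every permitted attempt is billed in full."""
    attempts = attempts_before_stop(max_retries, cost_multiplier)
    return attempts * input_tokens * USD_PER_INPUT_TOKEN


print(attempts_before_stop(max_retries=3, cost_multiplier=2.5))
print(round(worst_case_input_usd(40_000, max_retries=3, cost_multiplier=2.5), 2))
```

With max_retries=3 and a 2.5x multiplier the cost guard is the binding limit: it stops after the third failed attempt, one attempt sooner than the retry count alone would. A multiplier that is an exact integer sits on a floating-point boundary in the decorator's comparison, so a fractional value such as 2.5 is the safer choice.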

Summary: what Weave tells you vs. what you need to add

Failure Mode	What Weave Shows	What Weave Does Not Do	Guard
Recursive op explosion `@weave.op()` with self-correction calling itself	Each recursive call as a child trace with full cost attribution	Impose a depth limit or halt recursion	`_depth` parameter + `RecursiveOpBudgetExceeded` at ceiling
Evaluation fan-out `weave.Evaluation` on large datasets	Per-row and per-scorer call traces, aggregate cost in summary	Cap the number of rows or total spend per run	Pre-flight cost estimate + `EvalBudget` per-row deduction
Async burst parallelism `asyncio.gather()` without semaphore	All calls traced with accurate timestamps and latency	Limit concurrency or enforce a total-calls ceiling	`asyncio.Semaphore(N)` + `max_total` pre-check in `run_ops_with_budget()`
Retry accumulation `@retry` on ops with large inputs	Each retry as a separate traced call with its own cost	Refuse to retry when cumulative retry cost exceeds a set multiple of the base call cost	`cost_aware_retry` with cumulative cost multiplier ceiling

Wiring Weave traces to RunGuard for real-time intervention

Weave's post-hoc cost attribution and RunGuard's pre-incident circuit breaking are complementary, not competing. Weave tells you every token spent and by which op; RunGuard trips before the next spend happens. The integration is straightforward: read Weave's per-call cost from summary.weave.costs in a Weave callback and feed it into a running budget, so the circuit breaker works from recorded cost data rather than estimates. In a full integration that budget is RunGuard's budget.track() method; the sketch below keeps its own counter so it stays self-contained.

Python — Weave on_finish_call hook feeding RunGuard's budget tracker

import weave

weave.init("my-production-agent")


class WeaveCostTracker:
    """
    Subscribes to Weave's on_finish_call event to extract per-call costs
    and feed them into a running budget for circuit breaking.
    """

    def __init__(self, session_budget_usd: float = 5.00):
        self.session_budget_usd = session_budget_usd
        self.session_spent_usd = 0.0
        self.call_count = 0

    def on_finish_call(self, call) -> None:
        """Called by Weave after each op finishes."""
        costs = (call.summary or {}).get("weave", {}).get("costs", {})
        call_cost = sum(
            c.get("total_cost", 0.0)
            for c in costs.values()
        )

        self.call_count += 1
        self.session_spent_usd += call_cost

        if self.session_spent_usd > self.session_budget_usd:
            raise RuntimeError(
                f"[RunGuard] Session budget exceeded: "
                f"${self.session_spent_usd:.4f} > ${self.session_budget_usd:.2f} "
                f"after {self.call_count} Weave calls. "
                f"Last op: {call.op_name} cost ${call_cost:.4f}."
            )

        if call_cost > 0.10:
            print(
                f"[RunGuard] High-cost Weave call: {call.op_name} = ${call_cost:.4f} "
                f"(session total: ${self.session_spent_usd:.4f} / ${self.session_budget_usd:.2f})"
            )


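# Attach the tracker to the Weave client.
# Assumption: the client accessor and the hook-registration method are written
# here as weave.client() and add_on_finish_call_hook(). Confirm both names, and
# that cost data is populated at finish time, against your installed Weave version.
tracker = WeaveCostTracker(session_budget_usd=5.00)
weave.client().add_on_finish_call_hook(tracker.on_finish_call)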

The on_finish_call hook fires synchronously after each op completes, giving you accurate post-call cost data while still allowing you to raise an exception that propagates through the call stack to the parent op. Combined with the guards above — recursion depth limits, evaluation row caps, async semaphores, and retry cost multiplier ceilings — this gives you both observability and enforcement in the same framework.

The observability/enforcement gap

Weave is excellent at the observability job. It captures call graphs, token counts, latency distributions, cost attribution, evaluation results, and dataset versions with a single decorator. The traces are detailed, the UI is polished, and the evaluation framework reduces the friction of running systematic LLM quality checks to near zero.

The observability/enforcement gap is not a failing in Weave's design — it is an explicit scope decision. Weave's job is to record what your system does, not to change what your system does. But in production systems with looping agents, large evaluation datasets, and parallel batch jobs, recording what happened does not prevent the expensive thing from completing first.

The guards in this post fill that gap without modifying Weave's core behavior. They sit at the function boundary — where the recursive call would happen, where the batch would fire, where the retry would accumulate — and enforce spend ceilings before the LLM client is invoked. Weave's trace log then shows not a runaway call tree but a clean, bounded call graph with explicit budget-exceeded exceptions at the trip points, which is more useful for debugging than a trace that successfully records three hours of a loop that should have been stopped in the first minute.

Frequently asked questions

Does Weave have a built-in spend cap feature?

As of mid-2026, Weave does not have a runtime spend cap that halts execution. It has cost attribution (visible in summary.weave.costs per call and in the Weave UI), and W&B has budget alerts at the workspace level for overall W&B usage, but these are post-hoc reporting tools, not runtime circuit breakers. The guards in this post are additive to Weave's tracing, not replacements for it.

Will adding depth parameters to my ops break the Weave trace?

No. Weave captures all arguments to @weave.op()-decorated functions as part of the trace, including the _depth counter. This is actually useful for debugging: you can filter the Weave trace by _depth to see exactly how deep recursive correction chains went before the circuit tripped. The convention of prefixing internal-only parameters with _ keeps them visually distinct in the Weave UI.

How do I set the right MAX_CONCURRENT_OPS for my use case?

Start from throughput rather than a fixed number: sustained requests per minute is roughly concurrency × 60 ÷ average call latency in seconds. Against GPT-4o's tier-1 limit of 500 RPM, calls that take about one second put 8 concurrent ops at 480 RPM, just under the limit, while 12 concurrent ops would reach 720 RPM and overshoot it. Slower calls leave more room, so 8–12 concurrent ops is a safe default only when typical latency is above roughly 1.5 seconds. For evaluation workflows where latency matters less than cost predictability, 4–6 concurrent ops is a better default. Adjust up after observing the actual rate-limit hit rate in Weave's trace data.
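One way to size the cap is Little's law: requests in flight equal the request rate multiplied by the time each request takes. The helper below is a sketch of that calculation; the function name and the 80% headroom default are assumptions chosen for illustration.

```python
import math


def max_safe_concurrency(
    rpm_limit: int,
    avg_latency_s: float,
    headroom: float = 0.8,
) -> int:
    """Largest concurrency that keeps sustained RPM under `headroom` x the limit.

    Little's law: in-flight requests = requests per second x seconds per request.
    """
    if rpm_limit <= 0 or avg_latency_s <= 0 or not 0 < headroom <= 1:
        raise ValueError("rpm_limit and avg_latency_s must be > 0, headroom in (0, 1]")
    return max(1, math.floor(rpm_limit * headroom * avg_latency_s / 60))


for latency in (1.0, 2.5, 6.0):
    print(f"{latency:>4}s per call ->", max_safe_concurrency(500, latency))
```

Latency is the sensitive input: the same 500 RPM limit supports a much higher cap when calls are slow, so measure it from real traces rather than guessing.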

The on_finish_call hook fires after the call completes — doesn't that mean the money is already spent?

Yes, for that specific call. The hook's value is preventing the next call from starting: it raises an exception that propagates through any parent op still in progress, halting the call tree. For multi-call pipelines with many sequential ops, the hook will stop the pipeline after the first call that pushes the session over budget. The session-level ceiling therefore functions as a "worst-case overage = one additional call's cost" guarantee, not a zero-overage guarantee. When ops run concurrently, the overage can be as large as the combined cost of the calls already in flight, which is another reason to cap concurrency.

Should I use Weave's evaluation framework for CI/CD, or only for offline analysis?

Weave evaluation works well in CI/CD with a small, representative subset of the full dataset (20–30 rows is a practical CI budget) and a per-run cost cap of $0.50–$1.00. The full dataset evaluation runs less frequently, before major releases or weekly, with a higher budget ceiling and the row cap raised accordingly. The run_budgeted_evaluation pattern above hardcodes MAX_ROWS_PER_RUN and MAX_EVAL_BUDGET_USD; reading both from environment variables instead lets the CI run and the release run share the same code with different limits.
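A minimal sketch of that parameterization follows. The environment variable names simply reuse the constant names from the evaluation example, and `load_eval_limits` is a hypothetical helper; the defaults match the constants used earlier.

```python
import os
from typing import Mapping, Optional


def load_eval_limits(env: Optional[Mapping[str, str]] = None) -> tuple[int, float]:
    """Read the row cap and budget ceiling from the environment, with defaults."""
    env = os.environ if env is None else env
    max_rows = int(env.get("MAX_ROWS_PER_RUN", "50"))
    max_budget_usd = float(env.get("MAX_EVAL_BUDGET_USD", "2.00"))
    if max_rows <= 0 or max_budget_usd <= 0:
        # Fail loudly: a zero or negative limit is almost certainly a typo.
        raise ValueError("MAX_ROWS_PER_RUN and MAX_EVAL_BUDGET_USD must be positive")
    return max_rows, max_budget_usd


# CI job: a small subset with a tight cap.
print(load_eval_limits({"MAX_ROWS_PER_RUN": "25", "MAX_EVAL_BUDGET_USD": "0.75"}))
# No overrides set: fall back to the defaults.
print(load_eval_limits({}))
```

Accepting the mapping as a parameter keeps the function testable without mutating the real environment; production code can call it with no arguments.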

RunGuard: the circuit breaker your Weave ops needed yesterday

RunGuard wraps your @weave.op() functions and weave.Evaluation runs with a runtime circuit breaker that trips on recursion depth, evaluation fan-out, async burst parallelism, and retry accumulation — before the bill lands. One-line SDK install for Python and TypeScript.
