DSPy Cost Control: Loop Detection and Budget Enforcement in Production

DSPy reframes LLM programming as a compilation problem: instead of hand-writing prompts, you define a signature (input fields → output fields), choose a module (dspy.Predict, dspy.ChainOfThought, dspy.ReAct), and run an optimizer (BootstrapFewShot, MIPROv2) that finds the few-shot demonstrations and instruction phrasings that maximize your metric on a dev set. The compiled program then replaces every hand-crafted prompt with a learned, metric-optimized equivalent.

The abstraction is genuinely novel. It also inserts several layers between you and the LLM API where costs can multiply silently. DSPy's most visible safety valve, dspy.Assert(condition, msg, max_backtracks=4), limits the number of times the framework will retry a failing assertion by rerunning the chain with a corrective hint. What it cannot do is detect when multiple assertions cascade, when a ReAct agent iterates without changing its tool call pattern, when a multi-hop retriever generates semantically identical queries across hops, or when a compiled program's accumulated demo context has tripled its per-call token cost. This post covers those four failure modes and shows how to build a DspyBreaker circuit breaker that catches all of them.

Why max_backtracks is not a circuit breaker

DSPy's assertion system works at the module level. When you write dspy.Assert(len(output.answer) > 20, "answer is too short"), DSPy catches the assertion failure and reruns the surrounding Predict or ChainOfThought call with the failure message appended as a hint. This repeats up to max_backtracks times (default: 4) before the framework raises dspy.primitives.assertions.DSPyAssertionError.

A circuit breaker detects a pattern — a sequence of calls that indicates the system is spending without making progress — and trips to prevent further spend. max_backtracks is an iteration count, not a pattern detector. It has no state machine, no recovery phase, and no awareness of what happened in prior assertions in the same program run. Three assertions in a pipeline each with max_backtracks=4 is not a 4-retry limit on the run. It is three independent 4-retry budgets that can each be consumed in sequence, at the cost of running the full preceding pipeline for each retry.

More critically, max_backtracks is scoped to the assertion site. A dspy.ReAct agent's max_iters — the other common limit DSPy users reach for — counts loop iterations without checking whether the agent is taking distinct actions or repeating the same tool call with identical arguments. The counter increments; the bill accumulates; the agent stays stuck.

Failure mode 1: Assertion cascade storm

The backtrack multiplication problem occurs when assertions appear at multiple stages of a pipeline. Consider a three-stage dspy.Module: a retrieval stage, a reasoning stage, and a formatting stage. Each stage has one dspy.Assert call that validates its output before the next stage consumes it. When the reasoning assertion fails, DSPy reruns the reasoning stage with a corrective hint, up to 4 times. If all 4 retries fail, the exception propagates. But if the assertion passes on retry 2, the program continues to the formatting stage. If the formatting assertion then fails, DSPy retries again, up to 4 more times, and whenever a retry re-enters the pipeline above the formatting stage, it re-runs the reasoning call (with retrieval as context) and hands a fresh output to formatting, whose assertion can fail all over again.

The cascade: in the worst case where each assertion trips and exhausts its budget, the program makes stage_1_cost + max_backtracks × (stage_2_cost + max_backtracks × stage_3_cost) LLM calls. With max_backtracks=4 and three equally expensive stages, that is 1 + 4 × (1 + 4 × 1) = 21 LLM calls for what looks like a single program invocation. With 5-stage pipelines at max_backtracks=4, the worst case climbs to 341 calls.
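The worst-case arithmetic above is a geometric sum. A minimal sketch, assuming one LLM call per stage and fully nested retries; worst_case_calls is a hypothetical helper, not part of DSPy:

```python
def worst_case_calls(num_stages: int, max_backtracks: int) -> int:
    """Worst-case LLM calls for a fully nested assertion cascade.

    Assumes one LLM call per stage and that every assertion exhausts its
    retry budget, so stage k (0-indexed) runs max_backtracks ** k times.
    """
    return sum(max_backtracks ** k for k in range(num_stages))


print(worst_case_calls(3, 4))  # 1 + 4 + 16 = 21
print(worst_case_calls(5, 4))  # 1 + 4 + 16 + 64 + 256 = 341
```

The count grows exponentially with pipeline depth, which is why a per-assertion limit alone says little about the cost of a run.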

In practice, the cascade is rarely fully nested — each assertion is more likely to pass eventually, so the real cost is lower. But the failure mode is not the fully-nested worst case; it is the moderate case where each assertion trips 2-3 times, compounding to 4–10× more calls than expected. Teams discover it when a production pipeline that costs $0.008 per run on the happy path costs $0.07 on a noisy input batch. The budget math looked correct for the happy path; nobody modelled the assertion retry compounding.

Detection requires a run-level backtrack counter that counts total assertion retries across all assertions in the program execution, not just per-assertion retries. When the run-level counter hits a threshold, the circuit breaker trips — regardless of which individual assertion triggered it.

Failure mode 2: ReAct tool-signature stagnation

DSPy's dspy.ReAct implements the Reasoning + Acting loop: at each iteration, the agent produces a Thought (reasoning trace) and an Action (tool name + arguments or Finish). max_iters (default: 5) limits the total number of iterations. When the limit is hit, ReAct returns whatever partial state it has accumulated.

The stagnation failure mode occurs when the agent's reasoning produces the same action at consecutive iterations. The pattern: the agent calls a search tool with a query; the tool returns unhelpful results; the agent's Thought concludes it needs to search again; the next query is semantically indistinguishable from the prior one; the tool returns the same unhelpful results. The agent is stuck, but max_iters keeps counting normally because iterations are progressing — they are just not making progress toward a useful answer.

This happens in two common configurations. First, a search tool whose retrieval corpus does not contain the information the agent needs — the agent keeps rephrasing the same query hoping for different coverage that does not exist. Second, a tool that returns stochastic but low-signal results — each call returns slightly different content, but none of it satisfies the agent's reasoning condition, so the pattern repeats. In both cases, max_iters=5 means 5 identical (or near-identical) LLM + tool calls before the agent gives up with no useful result.

Detection requires hashing the (action_name, action_args_normalized) tuple at each iteration and checking for repeats within a sliding window. Argument normalization matters: query="Capital of France?" and query="france capital of" should hash to the same value. A simple normalization is lowercasing, stripping punctuation, and sorting the tokens. This catches differences in case, punctuation, and word order only; treating query="what is the capital of France" and query="capital of France" as the same query additionally requires dropping stopwords, or a similarity measure instead of an exact hash. If the last N hashes contain fewer than K distinct values, the agent is stagnating.
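The hashing and window check can be sketched in isolation. This is an illustrative version under stated assumptions: STOPWORDS is a hypothetical, deliberately tiny list, and normalize, action_hash, and is_stagnating are names invented for this example rather than DSPy APIs.

```python
import hashlib
import re
from collections import deque

# Hypothetical stopword list, only large enough for the example below.
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "of", "what", "which"})


def normalize(text: str, drop_stopwords: bool = False) -> str:
    """Lowercase, strip punctuation, sort tokens; optionally drop stopwords."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(sorted(tokens))


def action_hash(name: str, args: str, drop_stopwords: bool = False) -> str:
    """Short stable hash of the (tool name, normalized arguments) pair."""
    sig = f"{name}:{normalize(args, drop_stopwords)}"
    return hashlib.sha256(sig.encode()).hexdigest()[:16]


def is_stagnating(hashes, window: int = 4, min_distinct: int = 2) -> bool:
    """True when the last `window` hashes hold fewer than `min_distinct` values."""
    recent = list(hashes)[-window:]
    return len(recent) == window and len(set(recent)) < min_distinct


history = deque(maxlen=8)
for q in ["Capital of France?", "france capital of", "capital of france", "Capital, of France"]:
    history.append(action_hash("search", q))
print(is_stagnating(history))  # True: four iterations, one distinct action
```

Without stopword removal, "what is the capital of France" and "capital of France" still normalize to different strings; passing drop_stopwords=True makes them collide, at the price of occasionally merging queries that really are different.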

Failure mode 3: Multi-hop retrieval query fixation

Multi-hop retrieval — chaining multiple dspy.Retrieve calls where each hop's query is generated from the prior hop's retrieved context — is a common DSPy pattern for open-domain QA tasks. The failure mode is query fixation: the ChainOfThought module that generates the next hop's query gets stuck producing semantically identical queries across hops.

The mechanism: the retrieved documents from hop 1 contain partial, ambiguous information. The CoT module, conditioned on this context, generates a follow-up query that is essentially the same as the original question — because the retrieved context didn't shift the model's understanding enough to change the direction of inquiry. Hop 2 retrieves similar documents (same query → similar BM25 or dense vector scores → similar documents). The CoT module generates the same follow-up query. The pattern repeats until max_hops is exhausted.

Unlike ReAct stagnation, the individual calls here each return results — there is no error signal. From the program's perspective, every hop executed successfully. The cost of max_hops=5 in this state is 5 retrieval calls + 5 CoT calls, all of which produced output, none of which made progress. The output quality is equivalent to 1-hop retrieval at 5× the cost.

Detection uses a query similarity check between consecutive hops. Because exact string matching misses paraphrases, use a lightweight token-level similarity measure: if the Jaccard similarity of the token sets of the hop N and hop N-1 queries exceeds a threshold (0.75 works well in practice), the retriever has fixated on the same semantic territory and the next hop will not add information. Trip the breaker and return the best answer accumulated so far.
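A worked example of the token-set Jaccard check, as a standalone sketch. The three queries are made up for illustration, and whitespace tokenization is an assumption:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase whitespace-token sets of two queries."""
    tok_a, tok_b = set(a.lower().split()), set(b.lower().split())
    if not tok_a or not tok_b:
        return 0.0
    return len(tok_a & tok_b) / len(tok_a | tok_b)


hop1 = "who founded the company that makes the iPhone"
hop2 = "who founded the company that makes the iPhone originally"
hop3 = "Steve Jobs early career before Apple"

print(jaccard(hop1, hop2))  # 0.875: 7 shared tokens out of 8, above 0.75
print(jaccard(hop1, hop3))  # 0.0: the query moved to new territory
```

The second hop adds one word and changes nothing about what will be retrieved, which is exactly the case the threshold is meant to flag.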

Failure mode 4: Compiled demo token bloat

After running a DSPy optimizer, the compiled program's Predict modules each carry a demos list — the few-shot examples the optimizer selected from your training set. Every call to that Predict module prepends these demos to the prompt before the current input. The demos are selected once at compile time; they do not change in production unless you recompile.

The bloat failure mode is not a loop, but it is a cost multiplication that DSPy provides no built-in check for. If the optimizer selects verbose training examples as demos — long reasoning traces, multi-paragraph reference texts, detailed chain-of-thought steps — the per-call token cost can be several times higher than the base prompt. A program that costs 800 tokens per call with no demos might cost 3,200 tokens per call after compilation with 4 verbose demos. This is invisible in production monitoring unless you specifically track the ratio of demo tokens to task tokens per call.

The compounding happens when demo-heavy compiled programs are used in batch settings — evaluating 10,000 documents with a program that has 2,400 extra demo tokens per call costs 24 million extra tokens that appear nowhere in your cost estimates. And because the demos are correct examples from your training set, they do improve quality — which makes the cost increase easy to rationalize away rather than audit.

Detection requires a pre-flight check that sums demo token counts across all Predict modules in the compiled program before any batch run. If the total demo overhead per call exceeds a threshold relative to your expected task token count, warn or refuse to start the batch. This check has zero runtime overhead — it runs once before the loop begins.
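The pre-flight idea can be sketched without DSPy at all. The version below assumes demos are plain dicts of field name to text and uses a crude word-count heuristic (1.3 tokens per word); estimate_demo_tokens, preflight_check, and max_overhead_ratio are hypothetical names for this example.

```python
def estimate_demo_tokens(demos: list[dict], tokens_per_word: float = 1.3) -> int:
    """Rough token estimate: whitespace word count times a tokens-per-word factor."""
    words = sum(len(str(value).split()) for demo in demos for value in demo.values())
    return int(words * tokens_per_word)


def preflight_check(
    demos_by_module: dict[str, list[dict]],
    expected_task_tokens: int,
    max_overhead_ratio: float = 2.0,
) -> dict[str, int]:
    """Refuse to start a batch when demo overhead dwarfs the task tokens."""
    per_module = {name: estimate_demo_tokens(d) for name, d in demos_by_module.items()}
    total = sum(per_module.values())
    if total > max_overhead_ratio * expected_task_tokens:
        raise RuntimeError(
            f"demo overhead ~{total} tokens exceeds "
            f"{max_overhead_ratio}x the expected {expected_task_tokens} task tokens"
        )
    return per_module


demos = {
    "generate_answer": [
        {"question": "What is DSPy?", "answer": "A framework for programming LMs."}
    ]
}
print(preflight_check(demos, expected_task_tokens=800))  # {'generate_answer': 10}
```

A real tokenizer would give tighter numbers; the word-count heuristic is only meant to catch order-of-magnitude surprises before a batch starts.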

Building DspyBreaker

The breaker uses four tracking mechanisms: a run-level assertion backtrack counter (failure mode 1), an action hash window for ReAct iterations (failure mode 2), a consecutive-query Jaccard check for multi-hop retrieval (failure mode 3), and a pre-flight demo token audit (failure mode 4). The CLOSED / OPEN / HALF_OPEN state machine provides recovery: after reset_timeout_seconds, one probe is allowed through; if it succeeds, the breaker resets to CLOSED.

from __future__ import annotations

import hashlib
import re
import time
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional

import dspy


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class DspyBreakerConfig:
    max_total_backtracks: int = 8
    max_stagnant_iters: int = 3
    stagnation_window: int = 4
    query_fixation_threshold: float = 0.75
    max_demo_tokens_overhead: int = 4000
    reset_timeout_seconds: float = 60.0
    tokens_per_word_estimate: float = 1.3


@dataclass
class DspyBreakerState:
    state: BreakerState = BreakerState.CLOSED
    total_backtracks: int = 0
    action_hashes: deque = field(default_factory=lambda: deque(maxlen=5))
    last_query: Optional[str] = None
    opened_at: Optional[float] = None
    trip_reason: Optional[str] = None

    def reset(self) -> None:
        self.state = BreakerState.CLOSED
        self.total_backtracks = 0
        self.action_hashes.clear()
        self.last_query = None
        self.opened_at = None
        self.trip_reason = None


class DspyBreaker:
    """Circuit breaker for DSPy programs and ReAct agents."""

    def __init__(self, config: Optional[DspyBreakerConfig] = None):
        self.config = config or DspyBreakerConfig()
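        self._state = DspyBreakerState(
            # Size the hash history to the configured window; the dataclass
            # default of 5 would silently disable detection for larger windows.
            action_hashes=deque(maxlen=max(5, self.config.stagnation_window))
        )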

    # ── state machine ──────────────────────────────────────────────────────

    def _trip(self, reason: str) -> None:
        self._state.state = BreakerState.OPEN
        self._state.opened_at = time.monotonic()
        self._state.trip_reason = reason

    def _check_open(self) -> None:
        if self._state.state == BreakerState.OPEN:
            elapsed = time.monotonic() - (self._state.opened_at or 0)
            if elapsed >= self.config.reset_timeout_seconds:
                self._state.state = BreakerState.HALF_OPEN
            else:
                raise RuntimeError(
                    f"DspyBreaker OPEN: {self._state.trip_reason} "
                    f"(resets in {self.config.reset_timeout_seconds - elapsed:.0f}s)"
                )

    def on_probe_success(self) -> None:
        if self._state.state == BreakerState.HALF_OPEN:
            self._state.reset()

    def reset(self) -> None:
        self._state.reset()

    # ── failure mode 1: assertion cascade tracking ─────────────────────────

    def record_backtrack(self) -> None:
        """Call once each time DSPy retries an assertion."""
        self._state.total_backtracks += 1
        if self._state.total_backtracks >= self.config.max_total_backtracks:
            self._trip(
                f"assertion cascade: {self._state.total_backtracks} total "
                f"backtracks this run (limit={self.config.max_total_backtracks})"
            )

    # ── failure mode 2: ReAct tool-signature stagnation ───────────────────

    def _normalize_args(self, args: Any) -> str:
        text = str(args).lower()
        text = re.sub(r"[^a-z0-9\s]", " ", text)
        tokens = sorted(text.split())
        return " ".join(tokens)

    def record_action(self, action_name: str, action_args: Any) -> None:
        """Call at each ReAct iteration with the chosen action."""
        self._check_open()
        normalized = self._normalize_args(action_args)
        sig = f"{action_name}:{normalized}"
        action_hash = hashlib.sha256(sig.encode()).hexdigest()[:16]
        self._state.action_hashes.append(action_hash)

        # Trip when one action accounts for at least max_stagnant_iters of the
        # last stagnation_window iterations (e.g. 3 of 4 identical).
        window = list(self._state.action_hashes)[-self.config.stagnation_window:]
        if len(window) >= self.config.stagnation_window:
            repeats = max(window.count(h) for h in set(window))
            if repeats >= self.config.max_stagnant_iters:
                self._trip(
                    f"ReAct stagnation: {repeats} identical actions "
                    f"in last {self.config.stagnation_window} iterations"
                )

    # ── failure mode 3: multi-hop query fixation ──────────────────────────

    def _jaccard(self, a: str, b: str) -> float:
        tok_a = set(a.lower().split())
        tok_b = set(b.lower().split())
        if not tok_a or not tok_b:
            return 0.0
        return len(tok_a & tok_b) / len(tok_a | tok_b)

    def record_retrieval_query(self, query: str) -> None:
        """Call before each retrieval hop."""
        self._check_open()
        if self._state.last_query is not None:
            similarity = self._jaccard(query, self._state.last_query)
            if similarity >= self.config.query_fixation_threshold:
                self._trip(
                    f"multi-hop query fixation: Jaccard={similarity:.2f} "
                    f"between consecutive queries (threshold={self.config.query_fixation_threshold})"
                )
        self._state.last_query = query

    # ── failure mode 4: compiled demo token audit ─────────────────────────

    def audit_compiled_demos(self, program: dspy.Module) -> dict[str, int]:
        """
        Pre-flight check: sum demo token estimates across all Predict modules.
        Returns a dict with per-module and total overhead.
        Raises RuntimeError if total exceeds max_demo_tokens_overhead.
        """
        results: dict[str, int] = {}
        total = 0

        for name, module in program.named_sub_modules():
            if not isinstance(module, dspy.Predict):
                continue
            demos = getattr(module, "demos", []) or []
            demo_text = " ".join(
                " ".join(str(v) for v in d.values()) if isinstance(d, dict)
                else str(d)
                for d in demos
            )
            est_tokens = int(len(demo_text.split()) * self.config.tokens_per_word_estimate)
            results[name] = est_tokens
            total += est_tokens

        results["__total__"] = total

        if total >= self.config.max_demo_tokens_overhead:
            raise RuntimeError(
                f"DspyBreaker pre-flight: compiled demos add ~{total} tokens per call "
                f"(limit={self.config.max_demo_tokens_overhead}). "
                f"Recompile with fewer/shorter demos or raise max_demo_tokens_overhead."
            )
        return results
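The recovery cycle is easier to follow in isolation. The sketch below reproduces only the CLOSED / OPEN / HALF_OPEN transitions with an injected clock so that no sleeping is needed; MiniBreaker and its method names are illustrative and are not part of DspyBreaker.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class MiniBreaker:
    """Recovery cycle only; `clock` is injected so the demo can fake time."""

    def __init__(self, reset_timeout: float, clock):
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.opened_at = 0.0

    def trip(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()

    def allow(self) -> bool:
        """Return True if a call may proceed; promote OPEN to HALF_OPEN on timeout."""
        timed_out = self.clock() - self.opened_at >= self.reset_timeout
        if self.state is State.OPEN and timed_out:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def on_success(self) -> None:
        """A successful probe in HALF_OPEN closes the breaker again."""
        if self.state is State.HALF_OPEN:
            self.state = State.CLOSED


now = [0.0]  # fake clock, advanced by hand
breaker = MiniBreaker(reset_timeout=60.0, clock=lambda: now[0])

breaker.trip()
print(breaker.allow(), breaker.state.name)  # False OPEN
now[0] = 61.0
print(breaker.allow(), breaker.state.name)  # True HALF_OPEN
breaker.on_success()
print(breaker.state.name)                   # CLOSED
```

Injecting the clock is a testing convenience; the class above reads time.monotonic() directly.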

Wiring into a DSPy assertion program

To track assertion backtracks, subclass dspy.Module and override forward(). DSPy's assertion system raises dspy.primitives.assertions.DSPyAssertionError when an assertion is retried — intercept that exception, call breaker.record_backtrack(), and re-raise to let DSPy continue its normal retry flow. Wrap the whole forward() call to check the breaker state before each attempt:

import dspy.primitives.assertions as dspy_assert


class GuardedDspyModule(dspy.Module):
    """Subclass your DSPy programs from this to enable assertion cascade tracking."""

    def __init__(self, breaker: DspyBreaker):
        super().__init__()
        self.breaker = breaker

    def guarded_forward(self, *args, **kwargs):
        """Override this instead of forward()."""
        raise NotImplementedError

    def forward(self, *args, **kwargs):
        self.breaker._check_open()
        try:
            result = self.guarded_forward(*args, **kwargs)
            if self.breaker._state.state == BreakerState.HALF_OPEN:
                self.breaker.on_probe_success()
            return result
        except dspy_assert.DSPyAssertionError:
            # Count the failure, then re-raise so DSPy's retry flow continues.
            # A RuntimeError from an OPEN breaker propagates without handling.
            self.breaker.record_backtrack()
            raise

Wiring into a ReAct agent

DSPy's ReAct processes actions in a loop inside forward(). One way to intercept each iteration's action is to subclass ReAct and override its internal tool dispatch, but internal method names (such as _act) vary between DSPy releases, which makes that approach brittle. The more reliable approach is to wrap each tool function in a thin proxy that records the call before delegating:

import functools


class GuardedReAct(dspy.ReAct):
    """ReAct subclass with tool-call stagnation detection."""

    def __init__(self, signature, tools: list, breaker: DspyBreaker, **kwargs):
        super().__init__(signature, tools=self._wrap_tools(tools, breaker), **kwargs)
        self.breaker = breaker
        self._raw_tools = {t.__name__: t for t in tools}

    @staticmethod
    def _wrap_tools(tools: list, breaker: DspyBreaker) -> list:
        def make_wrapper(fn):
            # functools.wraps preserves the name, docstring, annotations, and
            # signature that tool introspection relies on.
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                breaker.record_action(fn.__name__, kwargs or args)
                return fn(*args, **kwargs)
            return wrapper

        return [make_wrapper(tool) for tool in tools]

    def forward(self, **kwargs):
        self.breaker._check_open()
        result = super().forward(**kwargs)
        if self.breaker._state.state == BreakerState.HALF_OPEN:
            self.breaker.on_probe_success()
        return result

Wiring into a multi-hop retrieval program

For multi-hop retrieval, call breaker.record_retrieval_query(query) before each dspy.Retrieve invocation. In a hand-rolled multi-hop module this is straightforward: insert the call in your loop body, as in the module below.

class GuardedMultiHop(GuardedDspyModule):
    def __init__(self, breaker: DspyBreaker, num_passages: int = 3, max_hops: int = 3):
        super().__init__(breaker)
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_query = [dspy.ChainOfThought("context, question -> query")
                               for _ in range(max_hops)]
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.max_hops = max_hops

    def guarded_forward(self, question: str) -> dspy.Prediction:
        context: list[str] = []

        for hop in range(self.max_hops):
            query_pred = self.generate_query[hop](
                context="\n".join(context), question=question
            )
            query = query_pred.query
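            self.breaker.record_retrieval_query(query)  # fixation check
            if self.breaker._state.state == BreakerState.OPEN:
                break  # fixated: stop hopping and answer from the context so far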
            passages = self.retrieve(query).passages
            context.extend(passages)

        answer = self.generate_answer(
            context="\n".join(context), question=question
        )
        return dspy.Prediction(context=context, answer=answer.answer)

Pre-flight demo audit in a batch pipeline

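def run_batch(compiled_program: dspy.Module, inputs: list[dict], breaker: DspyBreaker) -> list:
    # `breaker` must be the same instance the program's guarded modules hold;
    # a fresh DspyBreaker() here would never see their backtracks or actions.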

    # Audit before touching a single input
    overhead = breaker.audit_compiled_demos(compiled_program)
    print(f"Demo token overhead per call: {overhead['__total__']} tokens")
    # Raises RuntimeError if overhead exceeds config.max_demo_tokens_overhead

    results = []
    for item in inputs:
        breaker.reset()  # fresh run-level state per item
        try:
            result = compiled_program(**item)
            if breaker._state.state == BreakerState.HALF_OPEN:
                breaker.on_probe_success()
            results.append(result)
        except RuntimeError as exc:
            results.append({"error": str(exc), "input": item})
    return results

Configuration reference

Parameter	Conservative default	Aggressive default	Notes
Total backtrack limit `max_total_backtracks`	8	5	Set to `num_assertions × max_backtracks` for a flat budget; lower for dense assertion pipelines
Stagnation window `stagnation_window`	4 iterations	3 iterations	How many recent ReAct iterations to inspect; lower = faster detection, higher = more tolerance for incidental repeats
Stagnation threshold `max_stagnant_iters`	3 of 4 identical	2 of 3 identical	Number of identical action hashes in the window that triggers the trip
Query fixation threshold `query_fixation_threshold`	0.75 Jaccard	0.65 Jaccard	Tune down if your retrieval queries are naturally short and share many tokens; tune up for long, verbose queries
Demo token overhead limit `max_demo_tokens_overhead`	4,000 tokens	2,000 tokens	Sum of demo token estimates across all Predict modules; set relative to your task's expected working context size
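The two columns can be kept as presets. A minimal sketch, assuming the values are passed as keyword arguments to DspyBreakerConfig (for example DspyBreakerConfig(**AGGRESSIVE)); the dict names and the check_preset helper are illustrative.

```python
# Values copied from the table above; the dict names are illustrative.
CONSERVATIVE = {
    "max_total_backtracks": 8,
    "stagnation_window": 4,
    "max_stagnant_iters": 3,
    "query_fixation_threshold": 0.75,
    "max_demo_tokens_overhead": 4000,
}

AGGRESSIVE = {
    "max_total_backtracks": 5,
    "stagnation_window": 3,
    "max_stagnant_iters": 2,
    "query_fixation_threshold": 0.65,
    "max_demo_tokens_overhead": 2000,
}


def check_preset(preset: dict) -> None:
    """Sanity-check a preset before handing it to the breaker config."""
    if preset["max_stagnant_iters"] > preset["stagnation_window"]:
        raise ValueError("max_stagnant_iters cannot exceed stagnation_window")
    if not 0.0 < preset["query_fixation_threshold"] <= 1.0:
        raise ValueError("query_fixation_threshold must be in (0, 1]")


for preset in (CONSERVATIVE, AGGRESSIVE):
    check_preset(preset)
```

Checking that the stagnation threshold fits inside the window catches the easiest misconfiguration, a threshold that can never be reached.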

Frequently asked questions

DSPy 2.x introduced dspy.LM with a max_tokens parameter. Does that solve the cost problem?

dspy.LM(model, max_tokens=N) caps the number of output tokens per LLM call. It does not cap the number of LLM calls. A program that makes 40 calls at max_tokens=500, whether from assertion cascades or ReAct stagnation, can cost up to 20,000 output tokens plus whatever input tokens each call carries. max_tokens is useful for preventing runaway generation in a single call; it provides no protection against the failure modes described here, which multiply the number of calls rather than the length of any individual call's output. Use max_tokens and the circuit breaker in combination: max_tokens bounds per-call output cost, the breaker bounds total call count.

How do I track assertion backtracks automatically without modifying every Assert call site?

DSPy's assertion system uses Python exceptions internally: a failing dspy.Assert raises DSPyAssertionError, which is caught by the surrounding Predict or ChainOfThought module's retry logic. The cleanest interception point is the forward() method of your top-level dspy.Module subclass — catch DSPyAssertionError, call breaker.record_backtrack(), and re-raise so DSPy's normal retry flow continues. You do not need to modify individual Assert call sites; the exception propagates through your pipeline naturally. If you use assertion-heavy intermediate modules that you do not control, subclass those modules and add the same exception interception at their forward() boundary.

We use MIPROv2 for optimization, which runs many LLM calls during compilation. Should the circuit breaker be active during the compilation phase?

No — the compilation phase is a deliberate, expected-cost operation. MIPROv2 and other teleprompters run hundreds to thousands of LLM calls as part of their search process; tripping the breaker during compilation would abort the optimization prematurely. The circuit breaker is intended for the production inference phase, where you deploy the compiled program and expect predictable per-call costs. Keep the breaker inactive (or use a separate breaker instance with very high thresholds) during compilation. The post-compilation pre-flight demo audit (failure mode 4) is the right check at the boundary between compilation and deployment: run it once after compile() completes and before the compiled program is deployed to production.

The Jaccard similarity check for query fixation assumes token-level comparison. What about embedding-based similarity for multi-hop retrieval with dense retrievers?

Jaccard similarity on token sets is intentionally cheap: it adds zero LLM calls and near-zero latency to each hop. For dense retrievers (DPR, ColBERT, OpenAI embeddings), embedding-based similarity would be more semantically accurate: two queries can be token-disjoint yet land in the same region of embedding space and retrieve the same documents ("capital of France" vs. "where is Paris located?"). If your multi-hop program uses dense retrieval and you observe false negatives (fixated queries with low Jaccard but high semantic similarity), replace the Jaccard check with a cosine similarity check on cached embeddings. Cache the embedding for each query as it is generated. Most dense retrievers already compute the query embedding internally, so you can often reuse it from the retriever's encoding step rather than making a separate embedding API call. The threshold moves from 0.75 (Jaccard) to around 0.92–0.95 (cosine) for typical sentence embedding models.
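A cosine-based fixation check might look like the sketch below. It assumes you already hold the query embeddings as plain lists of floats; the 0.93 threshold is picked from the range above for illustration, and the toy vectors stand in for real embeddings.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_fixated(prev_vec: list[float], cur_vec: list[float], threshold: float = 0.93) -> bool:
    """True when consecutive query embeddings point in nearly the same direction."""
    return cosine(prev_vec, cur_vec) >= threshold


# Toy 2-d vectors in place of real embeddings.
print(is_fixated([1.0, 0.0], [1.0, 0.1]))  # True: almost the same direction
print(is_fixated([1.0, 0.0], [0.0, 1.0]))  # False: orthogonal queries
```

The right threshold depends on the embedding model, so calibrate it on query pairs you have already labeled as fixated or distinct.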

DSPy programs can be nested — a Module whose forward() calls another Module. Does the run-level backtrack counter correctly aggregate across nested modules?

Yes, if you pass the same DspyBreaker instance through to all nested modules. The backtrack counter is an instance-level integer on the breaker; as long as every GuardedDspyModule in your program hierarchy holds a reference to the same breaker instance, every caught DSPyAssertionError at any nesting level increments the same counter. The pattern is: create one DspyBreaker at the top-level program entry point, pass it as a constructor argument to each GuardedDspyModule subclass you instantiate, and call breaker.reset() between independent program runs (not between nested module calls within the same run). If you need per-module isolation — for example, you want to charge different assertion budgets to a retrieval sub-module versus a reasoning sub-module — use separate breaker instances with separate budgets, and add a parent-level cost tracker that sums their counters for global budget enforcement.
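The per-module isolation variant can be sketched with plain counters. BacktrackBudget stands in for one breaker's backtrack counter and GlobalBudget for the parent-level tracker; both names are hypothetical.

```python
class BacktrackBudget:
    """Per-module retry budget; a stand-in for one breaker's backtrack counter."""

    def __init__(self, name: str, limit: int):
        self.name = name
        self.limit = limit
        self.count = 0

    def record(self) -> bool:
        """Count one backtrack; return True once this module's budget is spent."""
        self.count += 1
        return self.count >= self.limit


class GlobalBudget:
    """Parent tracker that sums the child counters for a run-level limit."""

    def __init__(self, limit: int, children: list[BacktrackBudget]):
        self.limit = limit
        self.children = children

    @property
    def total(self) -> int:
        return sum(child.count for child in self.children)

    def exceeded(self) -> bool:
        return self.total >= self.limit


retrieval = BacktrackBudget("retrieval", limit=4)
reasoning = BacktrackBudget("reasoning", limit=6)
run = GlobalBudget(limit=8, children=[retrieval, reasoning])

tripped = [retrieval.record() for _ in range(3)] + [reasoning.record() for _ in range(5)]
print(any(tripped))               # False: neither module spent its own budget
print(run.total, run.exceeded())  # 8 True: the run as a whole did
```

This is the case per-module budgets miss on their own: every sub-module stays under its limit while the run exceeds the global one.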

Skip the hand-rolling. RunGuard does this in one line.

The implementation above works, but it is ~160 lines of infrastructure that every team building on DSPy writes from scratch. RunGuard wraps it in a single install call — adds the assertion cascade guard, the ReAct stagnation detector, the multi-hop fixation check, the pre-flight demo audit, a Slack alert on trip, and a dashboard showing your last 30 days of incidents by module, failure mode, and estimated cost saved.
