June 13, 2026 LangChain LCEL Cost Control

LangChain LCEL Cost Control: Loop Detection and Budget Enforcement for Expression Language Chains

LangChain Expression Language (LCEL) is the composition layer that powers most LangChain 0.2+ applications. The | pipe operator — prompt | llm | parser — is elegant. What's less obvious is how four of LCEL's core abstractions interact with LLM billing to produce silent cost blowouts that look nothing like the infinite loops most developers watch for.

This post is specifically about LCEL chain patterns: Runnable.with_retry(), ConversationBufferMemory in multi-turn chains, RunnableParallel fan-out, and unbounded streaming accumulation. These are not LangGraph problems — LangGraph is the stateful orchestration layer above LCEL. If your agent uses .compile() and StateGraph, see LangGraph Circuit Breaker and Cost Control. If your system is a chain of runnables composed with | and .invoke() / .astream(), you're in LCEL territory and these failure modes apply directly.

Scope. All code examples target LangChain 0.2+ with langchain-core>=0.2 and langchain-openai or any other provider. The BaseCallbackHandler API and Runnable interface are stable across providers. For async patterns using asyncio that apply across multiple frameworks, see Async Python AI Agent Cost Control. For LangGraph-specific loop detection at the graph node level, see LangGraph Circuit Breaker Cost Control.

Why LCEL cost failure modes differ from agent loop detection

Most loop-detection guides focus on the same core pattern: an agent calls a tool, gets a result, and makes the same tool call again. The failure is observable as duplicate tool invocations. This is the pattern covered in posts on OpenAI Agents SDK, CrewAI, and AutoGen.

LCEL chains often don't have agents in the traditional sense — no tool-calling loop, no agent executor. They're pipelines: data flows in one direction through a sequence of runnables. The cost failures don't look like loops at the application level. They look like a single chain invocation that runs longer or costs more than expected. The four failure modes are:

Retry exponential blowout: .with_retry() silently re-runs the wrapped chain segment, including the LLM call, on every error. No log line says "retrying." Each retry adds one more full-price LLM call: one retry doubles the cost of a direct call, and three retries quadruple it.
Memory explosion in chained calls: ConversationBufferMemory prepends the full conversation history to every LLM call. If an average turn adds N tokens, turn 20 sends roughly 20 × N tokens of history, so per-call cost grows linearly with conversation length and cumulative cost grows quadratically.
Parallel execution fan-out: RunnableParallel fires all branches concurrently. If a chain contains a parallel node with 5 branches that each call the model, every invocation makes 5 LLM calls instead of 1. Parallel nodes inside retry wrappers compound both multipliers.
Streaming accumulation without a ceiling: an .astream() generator keeps yielding for as long as the model keeps generating. Without a token ceiling or timeout, a single streaming call can run until the model's output or context limit is hit, producing a completion that can be 10× longer and 10× more expensive than intended.

Failure mode 1: RunnableRetry exponential blowout

LCEL's .with_retry() combinator is the idiomatic way to handle transient failures:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o")
chain = (
    ChatPromptTemplate.from_template("Summarize: {text}")
    | llm.with_retry(stop_after_attempt=3, wait_exponential_jitter=True)
    | StrOutputParser()
)

The problem isn't with .with_retry() itself — transient 429 rate-limit errors are a legitimate use case. The problem arises in two scenarios:

Parsing validation errors triggering retries. If the chain's output parser raises an OutputParserException and the retry wraps a segment that contains both the LLM and the parser, for example (prompt | llm | parser).with_retry(), the LLM call is re-run on every parse failure. (A retry attached only to the LLM, as in the snippet above, does not see exceptions raised by a downstream parser.) A brittle JSON parser that fails intermittently can trigger up to 3 full LLM calls per invocation, at up to 3× the cost, and the caller sees an error only if every attempt fails.
Retries nested inside outer retry loops. A common pattern wraps the entire chain in .with_retry() at the application level and also configures .with_retry() on the LLM instance. A single persistent failure can produce up to outer_attempts × inner_attempts LLM calls: 3 × 3 = 9 calls where 1 was intended.
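The worst-case arithmetic for stacked retry layers can be sketched as follows. The helper name worst_case_llm_calls and its parameters are illustrative and not part of LangChain. It assumes every attempt at every layer fails until the last one, so it gives an upper bound rather than an expected value.

```python
def worst_case_llm_calls(*attempts_per_layer: int, parallel_branches: int = 1) -> int:
    """Upper bound on LLM calls for one logical invocation.

    Each positional argument is the attempt limit of one retry layer
    (for example the stop_after_attempt value), outermost layer first.
    parallel_branches is the number of LLM-calling branches that the
    retried segment fans out to.
    """
    if parallel_branches < 1 or any(a < 1 for a in attempts_per_layer):
        raise ValueError("attempt limits and branch counts must be >= 1")
    total = parallel_branches
    for attempts in attempts_per_layer:
        total *= attempts  # every layer multiplies the layers beneath it
    return total


print(worst_case_llm_calls(3, 3))                       # 9
print(worst_case_llm_calls(3, parallel_branches=3))     # 9
print(worst_case_llm_calls(3, 3, parallel_branches=5))  # 45
```

Running this over a chain's retry configuration before deployment makes the multiplier explicit and gives a sensible starting value for a per-invocation call budget.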

The fix is a RetryBudget wrapper that enforces a hard ceiling on total LLM calls across all retry layers:

import threading
from langchain_core.runnables import RunnableLambda
from langchain_core.callbacks import BaseCallbackHandler


class RetryBudget:
    """Thread-local call counter that trips a circuit breaker after max_calls."""

    def __init__(self, max_calls: int = 3):
        self._max = max_calls
        self._local = threading.local()

    def _count(self) -> int:
        return getattr(self._local, "count", 0)

    def increment(self) -> None:
        current = self._count()
        if current >= self._max:
            raise RuntimeError(
                f"RetryBudget: exceeded {self._max} LLM calls in this invocation. "
                "Check for nested retry wrappers or parsing loops."
            )
        self._local.count = current + 1

    def reset(self) -> None:
        self._local.count = 0

    def wrap(self, chain):
        """Return a new chain that enforces this budget on every invocation."""
        budget = self

        def guarded_invoke(input, config=None):
            budget.reset()
            return chain.invoke(input, config)

        async def guarded_ainvoke(input, config=None):
            budget.reset()
            return await chain.ainvoke(input, config)

        return RunnableLambda(guarded_invoke, afunc=guarded_ainvoke)


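class RetryBudgetCallback(BaseCallbackHandler):
    """Callback that increments the budget counter before each LLM call."""

    # LangChain logs and swallows exceptions raised inside callback handlers
    # unless raise_error is True; without it the budget never stops a call.
    raise_error = True
    # Run inline on the async path so the handler executes on the thread that
    # called budget.reset(), not on an executor thread with its own counter.
    run_inline = True

    def __init__(self, budget: RetryBudget):
        self._budget = budget

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._budget.increment()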


# Usage
budget = RetryBudget(max_calls=3)
callback = RetryBudgetCallback(budget)

llm = ChatOpenAI(model="gpt-4o", callbacks=[callback])
chain = (
    ChatPromptTemplate.from_template("Summarize: {text}")
    | llm.with_retry(stop_after_attempt=3)
    | StrOutputParser()
)
guarded_chain = budget.wrap(chain)

# Now guarded_chain.invoke(...) raises RuntimeError after 3 total LLM calls,
# regardless of how many retry layers are nested.

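The key design choice: RetryBudget uses thread-local storage so chain invocations running on different threads each get their own counter. The reset() call at the start of each invocation ensures the budget is per-run, not cumulative across all time. Two caveats follow from that choice. Thread-local state is per thread, not per asyncio task, so concurrent ainvoke() calls on one event loop share a counter and can reset each other's counts. And work that LangChain dispatches to worker threads (for example, the branches of a RunnableParallel under a synchronous invoke) may see a different counter from the one reset() touched. For those cases, a context-local counter built on contextvars is the safer foundation.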
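For async-heavy services, a context-local variant of the counter is one option. The sketch below uses only the standard library and assumes the same reset()/increment() contract as RetryBudget. The class name ContextRetryBudget and the simulated invocation are hypothetical, and the callback wiring is left out. Each asyncio task gets its own copy of the context, so interleaved invocations on one event loop keep separate counts.

```python
import asyncio
import contextvars


class ContextRetryBudget:
    """Per-invocation call counter stored in a ContextVar instead of a thread-local."""

    def __init__(self, max_calls: int = 3):
        self._max = max_calls
        # The variable holds a one-element list. Contexts copied from the
        # invocation (child tasks, worker threads started with a copied
        # context) inherit a reference to the same list and share the count.
        self._cell = contextvars.ContextVar("retry_budget_calls", default=None)

    def reset(self) -> None:
        self._cell.set([0])

    def count(self) -> int:
        cell = self._cell.get()
        return 0 if cell is None else cell[0]

    def increment(self) -> None:
        cell = self._cell.get()
        if cell is None:  # increment() called without a prior reset()
            cell = [0]
            self._cell.set(cell)
        if cell[0] >= self._max:
            raise RuntimeError(
                f"ContextRetryBudget: exceeded {self._max} LLM calls in this invocation."
            )
        cell[0] += 1


async def simulated_invocation(budget: ContextRetryBudget, llm_calls: int) -> int:
    """Stand-in for one chain run: reset, then make llm_calls counted calls."""
    budget.reset()
    for _ in range(llm_calls):
        budget.increment()
        await asyncio.sleep(0)  # yield so concurrent invocations interleave
    return budget.count()


async def demo() -> list:
    budget = ContextRetryBudget(max_calls=3)
    # gather() runs each coroutine in its own task, each with its own context.
    return await asyncio.gather(
        simulated_invocation(budget, 2),
        simulated_invocation(budget, 3),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [2, 3]
```

The counts stay independent even though both invocations share one budget object and one thread, which is the case thread-local storage cannot separate.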

Detecting retry storms in production

If RetryBudget raises consistently, the underlying issue is almost always one of:

A JSON output parser rejecting valid LLM output (model format drift, prompt regression)
A 429 rate limit hitting on every call (insufficient rate limit tier for current load)
A network proxy truncating responses, causing consistent parse failures
An outer application-level retry wrapping a chain that already has .with_retry() at the LLM level

Log the RetryBudget increment count on each exception to distinguish which scenario you're in. Three quick increments + error = likely parsing issue. Three slow increments with network delays = likely rate limit.
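The quick-versus-slow distinction can be made mechanical by timestamping each increment. This is a minimal sketch: RetryTimingLog, its method names, and the one-second threshold are assumptions for illustration, and the threshold should be tuned against your model's normal call latency.

```python
import time
from typing import Callable, List


class RetryTimingLog:
    """Records when each LLM call starts and labels the resulting pattern."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._starts: List[float] = []

    def record_call(self) -> None:
        """Call this wherever the budget counter is incremented."""
        self._starts.append(self._clock())

    def classify(self, slow_gap_seconds: float = 1.0) -> str:
        if len(self._starts) < 2:
            return "single call"
        # Gaps between consecutive call starts within one invocation.
        gaps = [later - earlier for earlier, later in zip(self._starts, self._starts[1:])]
        if min(gaps) >= slow_gap_seconds:
            return "slow retries: likely rate limit or network backoff"
        return "fast retries: likely parsing failure"
```

Calling record_call() next to budget.increment() and logging classify() in the exception handler gives each budget error a first-guess diagnosis.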

Failure mode 2: ConversationBufferMemory explosion

This is the most expensive silent failure in LCEL chains. ConversationBufferMemory prepends the full conversation history to every LLM call. (Its bounded variants, ConversationSummaryBufferMemory and ConversationTokenBufferMemory, cap the history at their max_token_limit.) With the unbounded buffer, the tokens sent per call grow linearly with the number of turns, so cumulative spend is quadratic in the number of turns:

Turn	History size (msgs)	Tokens sent to LLM	Cumulative spend (approx. at $2.50/1M input)
1	0	500	$0.00125
5	8	2,000	$0.025
20	38	10,000	$0.25
50	98	28,000	$1.75
100	198	57,000	$7.13
200	398	116,000	$29

For a 200-turn customer support conversation at GPT-4o input prices, the total spend is about $29 for what most developers expect to cost $2–3. The problem compounds across sessions: a deployment that completes 50 such 200-turn conversations a day spends 50 × $29 = $1,450/day on input tokens, most of it re-sent history.
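The quadratic growth is easy to reproduce with a few lines of arithmetic. In this sketch the function name and the linear growth model are assumptions. The figure of 580 tokens added per turn is fitted to the table above (500 tokens at turn 1, about 116,000 at turn 200), and only input tokens are priced.

```python
from typing import Optional


def cumulative_input_cost(
    turns: int,
    first_call_tokens: int = 500,
    growth_per_turn: int = 580,
    usd_per_million: float = 2.50,
    max_tokens_per_call: Optional[int] = None,
) -> float:
    """Total input-token spend for a conversation whose prompt grows every turn."""
    total_tokens = 0
    for turn in range(1, turns + 1):
        tokens = first_call_tokens + growth_per_turn * (turn - 1)
        if max_tokens_per_call is not None:
            # A bounded memory caps what each call can send.
            tokens = min(tokens, max_tokens_per_call)
        total_tokens += tokens
    return total_tokens * usd_per_million / 1_000_000


# Unbounded history over 200 turns: about $29.1
print(f"{cumulative_input_cost(200):.1f}")
# Same conversation with each call capped at 4,500 input tokens: about $2.21
print(f"{cumulative_input_cost(200, max_tokens_per_call=4_500):.2f}")
```

Capping each call at a few thousand tokens brings the 200-turn total back into the $2–3 range that the unbounded buffer silently exceeds.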

The correct mitigation depends on whether your application needs full history or can tolerate a rolling window:

from langchain_core.messages import BaseMessage, HumanMessage, AIMessage
from langchain_core.memory import BaseMemory
from typing import Any


class BoundedBufferMemory(BaseMemory):
    """
    Drop-in replacement for ConversationBufferMemory that enforces
    a hard ceiling on retained messages and token count.
    """

    max_messages: int = 20
    max_tokens: int = 4000
    messages: list[BaseMessage] = []

    @property
    def memory_variables(self) -> list[str]:
        return ["history"]

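    def _estimate_tokens(self, messages: list[BaseMessage]) -> int:
        # Rough heuristic: ~3 characters per token. English text usually
        # averages closer to 4, so this overestimates and trims early.
        # Assumes string content (not multimodal content lists).
        return sum(len(m.content) for m in messages) // 3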

    def load_memory_variables(self, inputs: dict[str, Any]) -> dict[str, Any]:
        return {"history": self.messages}

    def save_context(
        self, inputs: dict[str, Any], outputs: dict[str, str]
    ) -> None:
        self.messages.append(HumanMessage(content=inputs.get("input", "")))
        self.messages.append(AIMessage(content=outputs.get("output", "")))
        self._trim()

    def _trim(self) -> None:
        # Remove oldest message pairs until within both limits
        while len(self.messages) > self.max_messages:
            self.messages = self.messages[2:]  # drop oldest human+ai pair
        while self._estimate_tokens(self.messages) > self.max_tokens:
            if len(self.messages) < 2:
                break
            self.messages = self.messages[2:]

    def clear(self) -> None:
        self.messages = []


# Usage with an LCEL chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough

memory = BoundedBufferMemory(max_messages=20, max_tokens=4000)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = (
    RunnablePassthrough.assign(history=lambda x: memory.load_memory_variables(x)["history"])
    | prompt
    | llm
    | StrOutputParser()
)

def chat(user_input: str) -> str:
    response = chain.invoke({"input": user_input})
    memory.save_context({"input": user_input}, {"output": response})
    return response

If you need to preserve long conversation history for accuracy, use ConversationSummaryBufferMemory with an explicit max_token_limit. This summarizes older turns rather than truncating them, preserving semantic content at a fixed token cost. The tradeoff: summarization itself costs tokens. At very high turn counts, the summarization calls can exceed the savings — set a session token budget that accounts for both the chain calls and the summary calls.

Token counting for LCEL chains

LCEL doesn't expose per-call token usage in a standard place before LangChain 0.3.x. The most reliable approach is a callback that accumulates on_llm_end response metadata:

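import threading


class TokenBudgetCallback(BaseCallbackHandler):
    """Accumulates token usage and raises RuntimeError when the ceiling is exceeded."""

    # Without raise_error = True, LangChain logs the exception raised below
    # and carries on, so the ceiling would never stop anything.
    raise_error = True

    def __init__(self, max_tokens_per_session: int = 50_000):
        self._ceiling = max_tokens_per_session
        self._used = 0
        self._lock = threading.Lock()  # parallel branches may report concurrently

    @property
    def tokens_used(self) -> int:
        return self._used

    def on_llm_end(self, response, **kwargs):
        # llm_output is provider-specific; "token_usage" is the OpenAI layout.
        usage = getattr(response, "llm_output", None) or {}
        token_usage = usage.get("token_usage") or {}
        total = token_usage.get("total_tokens") or (
            token_usage.get("prompt_tokens", 0)
            + token_usage.get("completion_tokens", 0)
        )
        with self._lock:
            self._used += total
            used = self._used
        # This check runs after the call has been billed, so the ceiling can
        # be overshot by up to one call's worth of tokens.
        if used > self._ceiling:
            raise RuntimeError(
                f"TokenBudget: session exceeded {self._ceiling:,} tokens "
                f"({used:,} used). Conversation terminated."
            )

    def reset(self) -> None:
        with self._lock:
            self._used = 0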

Failure mode 3: RunnableParallel cost fan-out

RunnableParallel (also written as a dict of runnables in the pipe) is one of LCEL's most powerful features — it runs multiple chains concurrently and merges their outputs. It's also a silent cost multiplier that developers routinely underestimate:

from langchain_core.runnables import RunnableParallel

# This chain makes 3 concurrent LLM calls on every invocation
analysis_chain = RunnableParallel(
    sentiment=sentiment_chain,
    entities=entity_chain,
    summary=summary_chain,
)

# Composing parallel nodes inside a retry creates up to a 9× multiplier:
# 3 branches × 3 retry attempts = 9 LLM calls in the worst case,
# because a failure in any one branch re-runs all three.
risky_chain = (
    prompt
    | analysis_chain.with_retry(stop_after_attempt=3)
    | merge_outputs
)

The danger compounds when parallel chains contain their own memory or iterative patterns. A RunnableParallel with 4 branches, each using ConversationBufferMemory at turn 30, makes 4 LLM calls each carrying 30 turns of history.

The mitigation is to make the fan-out multiplier visible and bounded:

class ParallelBudgetGuard:
    """
    Wraps RunnableParallel to enforce a maximum branch count at build time
    and, optionally, attach a shared token budget callback to every run.
    """

    def __init__(
        self,
        parallel: RunnableParallel,
        max_branches: int = 5,
        token_callback: TokenBudgetCallback | None = None,
    ):
        branch_count = len(parallel.steps__)
        if branch_count > max_branches:
            raise ValueError(
                f"ParallelBudgetGuard: {branch_count} branches exceeds "
                f"max_branches={max_branches}. Reduce parallel fan-out or "
                "raise the limit explicitly."
            )
        self._parallel = parallel
        self._token_callback = token_callback

    def _with_budget(self, config=None):
        """Return a copy of config with the token callback attached."""
        if not self._token_callback:
            return config
        config = dict(config or {})
        # Build a new list: appending to the caller's list would attach a
        # duplicate callback (and double-count tokens) each time it is reused.
        # Assumes callbacks, if present, is a plain list of handlers.
        callbacks = list(config.get("callbacks") or [])
        if self._token_callback not in callbacks:
            callbacks.append(self._token_callback)
        config["callbacks"] = callbacks
        return config

    def invoke(self, input, config=None):
        return self._parallel.invoke(input, self._with_budget(config))

    async def ainvoke(self, input, config=None):
        return await self._parallel.ainvoke(input, self._with_budget(config))


# Usage
budget_callback = TokenBudgetCallback(max_tokens_per_session=30_000)
guarded_parallel = ParallelBudgetGuard(
    RunnableParallel(
        sentiment=sentiment_chain,
        entities=entity_chain,
        summary=summary_chain,
    ),
    max_branches=3,
    token_callback=budget_callback,
)

Beyond the guard, a more architectural fix is to move from parallel fan-out to conditional routing using RunnableBranch where possible. If not all branches are needed for every input, RunnableBranch routes to exactly one branch and pays for exactly one LLM call:

from langchain_core.runnables import RunnableBranch

# Instead of always running 3 chains, route to one based on a classifier
routing_chain = RunnableBranch(
    (lambda x: x["type"] == "sentiment", sentiment_chain),
    (lambda x: x["type"] == "entities", entity_chain),
    summary_chain,  # default
)

Failure mode 4: Streaming accumulation without a ceiling

LCEL's .astream() returns an async generator that yields chunks as the model produces them. The typical consumption loop is:

async for chunk in chain.astream({"input": user_text}):
    print(chunk, end="", flush=True)

This loop runs until the model stops generating (by emitting an EOS token, matching a stop sequence, or hitting its output-token or context-window limit) or until the application closes the connection. Without a ceiling, a model asked to "write a comprehensive guide to..." can keep generating until it hits that limit. Even at the $2.50/1M input rate used above, a full 128K-token GPT-4o window costs roughly $0.32 per call, and output tokens are typically billed at a higher rate than input tokens. For a customer-facing application where users control the prompt, this is an uncontrolled spend surface.

The fix is a wrapper that counts tokens from streaming chunks and terminates the generator when a ceiling is hit:

import asyncio
from typing import AsyncIterator


class BoundedStreamConsumer:
    """
    Consumes a streaming chain with a hard token ceiling.
    Raises RuntimeError if the estimated output exceeds max_output_tokens.
    Tracks elapsed wall-clock time and raises TimeoutError if it exceeds
    max_wall_seconds (checked each time a chunk arrives).
    """

    def __init__(
        self,
        max_output_tokens: int = 2000,
        max_wall_seconds: float = 30.0,
        chars_per_token: float = 4.0,
    ):
        self._max_tokens = max_output_tokens
        self._max_seconds = max_wall_seconds
        self._cpt = chars_per_token

    async def consume(
        self, stream: AsyncIterator[str]
    ) -> str:
        """Collect all chunks into a string, enforcing token and time ceilings."""
        parts: list[str] = []
        total_chars = 0
        loop = asyncio.get_running_loop()
        start = loop.time()

        try:
            async for chunk in stream:
                elapsed = loop.time() - start
                if elapsed > self._max_seconds:
                    raise TimeoutError(
                        f"BoundedStream: stream exceeded {self._max_seconds}s wall time. "
                        f"Collected {total_chars / self._cpt:.0f} estimated tokens so far."
                    )
                parts.append(chunk)
                total_chars += len(chunk)
                estimated_tokens = total_chars / self._cpt
                if estimated_tokens > self._max_tokens:
                    raise RuntimeError(
                        f"BoundedStream: output exceeded {self._max_tokens} tokens "
                        f"(estimated {estimated_tokens:.0f}). Stream terminated early."
                    )
        finally:
            # Close the underlying generator so upstream work can be cancelled
            # promptly instead of lingering until garbage collection.
            aclose = getattr(stream, "aclose", None)
            if aclose is not None:
                await aclose()

        return "".join(parts)


# Usage
consumer = BoundedStreamConsumer(max_output_tokens=1500, max_wall_seconds=20.0)
result = await consumer.consume(chain.astream({"input": user_text}))
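One limitation of the consumer above is that its elapsed-time check runs only when a chunk arrives, so a stream that stalls mid-response is never timed out. A possible complement, sketched here with the standard library only, puts a deadline on each wait for the next chunk. The name iter_with_deadline is hypothetical, and the wrapper assumes the stream is an async generator or another async iterator.

```python
import asyncio
from typing import AsyncIterator


async def iter_with_deadline(
    stream: AsyncIterator[str], max_wall_seconds: float
) -> AsyncIterator[str]:
    """Yield chunks from stream, raising TimeoutError once the deadline passes,
    even if the stream has stopped producing chunks."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wall_seconds
    iterator = stream.__aiter__()
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError(f"Stream exceeded {max_wall_seconds}s wall time")
        try:
            # Wait for the next chunk, but never past the overall deadline.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=remaining)
        except StopAsyncIteration:
            return  # the stream finished normally
        except asyncio.TimeoutError:
            raise TimeoutError(
                f"Stream stalled past the {max_wall_seconds}s deadline"
            ) from None
        yield chunk
```

Wrapping the stream first, as in consumer.consume(iter_with_deadline(chain.astream(...), 20.0)), keeps the token ceiling in the consumer and moves the wall-clock guarantee into the wrapper.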

For real-time streaming to a UI (where you need to yield chunks to the frontend as they arrive, not buffer them), combine BoundedStreamConsumer's logic with a yield-through pattern:

async def bounded_stream_passthrough(
    stream: AsyncIterator[str],
    max_tokens: int = 2000,
    max_seconds: float = 30.0,
    chars_per_token: float = 4.0,
) -> AsyncIterator[str]:
    """Yield chunks to caller while enforcing ceilings."""
    total_chars = 0
    loop = asyncio.get_running_loop()
    start = loop.time()

    async for chunk in stream:
        # Checked per chunk: a stalled stream is not detected here.
        if loop.time() - start > max_seconds:
            raise TimeoutError(f"Stream timeout after {max_seconds}s")
        total_chars += len(chunk)
        if total_chars / chars_per_token > max_tokens:
            raise RuntimeError(f"Stream exceeded {max_tokens} token estimate")
        yield chunk

# Caller usage:
async for chunk in bounded_stream_passthrough(chain.astream(input), max_tokens=1500):
    await websocket.send(chunk)

Composing all four guards: the LCEL BudgetChain

In production, you rarely face only one of these failure modes. A typical LCEL chain that uses memory, retries, and streaming can hit all four. The cleanest integration point is a single BudgetChain wrapper that composes all guards:

from dataclasses import dataclass
from langchain_core.runnables import Runnable


@dataclass
class LCELBudgetConfig:
    max_llm_calls_per_run: int = 5
    max_memory_messages: int = 20
    max_memory_tokens: int = 4_000
    max_output_tokens: int = 2_000
    max_wall_seconds: float = 30.0
    max_session_tokens: int = 50_000


class BudgetChain:
    """
    Wraps any LCEL Runnable with the call, token, and stream guards.

    - RetryBudget: caps total LLM calls per invocation
    - TokenBudgetCallback: caps total session token spend
    - BoundedStreamConsumer: caps output length per streaming call
    - Memory must be a BoundedBufferMemory, built and passed separately;
      max_memory_messages / max_memory_tokens in the config are not applied here.
    - RunnableParallel nodes inside the chain still need ParallelBudgetGuard.
    """

    def __init__(self, chain: Runnable, config: LCELBudgetConfig | None = None):
        self._config = config or LCELBudgetConfig()
        self._token_callback = TokenBudgetCallback(
            max_tokens_per_session=self._config.max_session_tokens
        )
        self._retry_budget = RetryBudget(
            max_calls=self._config.max_llm_calls_per_run
        )
        self._retry_callback = RetryBudgetCallback(self._retry_budget)
        self._stream_consumer = BoundedStreamConsumer(
            max_output_tokens=self._config.max_output_tokens,
            max_wall_seconds=self._config.max_wall_seconds,
        )
        self._chain = chain

    def _make_config(self, extra_config: dict | None = None) -> dict:
        config = dict(extra_config or {})
        # Copy the caller's callback list rather than extending it in place,
        # so reusing one config dict does not stack duplicate handlers.
        callbacks = list(config.get("callbacks") or [])
        callbacks.extend([self._token_callback, self._retry_callback])
        config["callbacks"] = callbacks
        return config

    def invoke(self, input, config=None) -> str:
        self._retry_budget.reset()
        return self._chain.invoke(input, self._make_config(config))

    async def ainvoke(self, input, config=None) -> str:
        self._retry_budget.reset()
        return await self._chain.ainvoke(input, self._make_config(config))

    async def astream_bounded(self, input, config=None) -> str:
        self._retry_budget.reset()
        stream = self._chain.astream(input, self._make_config(config))
        return await self._stream_consumer.consume(stream)

    def reset_session(self) -> None:
        """Call between user sessions to reset the cumulative token counter."""
        self._token_callback.reset()


# Usage
config = LCELBudgetConfig(
    max_llm_calls_per_run=4,
    max_memory_messages=20,
    max_session_tokens=40_000,
    max_output_tokens=1_500,
)
guarded = BudgetChain(your_lcel_chain, config)

# Single call
result = guarded.invoke({"input": "Summarize this document..."})

# Streaming call with ceiling
result = await guarded.astream_bounded({"input": "Write a comprehensive guide to..."})

LCEL vs LangGraph: which cost controls apply where

Concern	LCEL chains (this post)	LangGraph (see linked post)
retry_loops	RetryBudget on .with_retry() — caps total LLM calls per invocation	Node visit counter on StateGraph — caps visits to any single node
memory_growth	BoundedBufferMemory replacing ConversationBufferMemory	State size guard on messages channel — trims oldest messages from graph state
parallel_cost	ParallelBudgetGuard on RunnableParallel — validates branch count at build time	Fan-out node budget — applied at Send() primitive; separate concern
stream_ceiling	BoundedStreamConsumer on .astream() — token count + wall-clock ceiling	N/A — LangGraph streams state diffs, not LLM tokens directly
session_budget	TokenBudgetCallback via BaseCallbackHandler — accumulates across all nodes in session	Same — callbacks work across LangGraph nodes since LangGraph builds on LCEL internally
tool_loop	N/A — LCEL chains typically don't run agent-style tool loops	Tool call deduplicator — fingerprints (tool_name, args) tuples in state

Note the session_budget row: if your LangGraph graph uses LCEL chains as node implementations (the standard pattern), the LCEL-level guards and LangGraph-level guards compose rather than conflict. The callback handlers work at the LangChain runtime layer and fire regardless of whether the call originates from a direct .invoke() or from within a StateGraph node.

Putting it all together: a production-ready LCEL chain

Here's a complete pattern for a multi-turn customer support chain that incorporates all four guards:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def build_support_chain() -> tuple[BudgetChain, BoundedBufferMemory]:
    memory = BoundedBufferMemory(max_messages=20, max_tokens=4_000)

    llm = ChatOpenAI(
        model="gpt-4o-mini",  # cheaper model for support; upgrade to gpt-4o only for escalations
        max_retries=0,        # disable internal SDK retries; RetryBudget handles this
        temperature=0.2,
    )

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful customer support agent for RunGuard. "
                   "Answer concisely. If you don't know, say so."),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ])

    raw_chain = (
        RunnablePassthrough.assign(
            history=lambda x: memory.load_memory_variables(x)["history"]
        )
        | prompt
        | llm.with_retry(stop_after_attempt=2)  # retry budget caps this at 2 total LLM calls
        | StrOutputParser()
    )

    budget_config = LCELBudgetConfig(
        max_llm_calls_per_run=2,       # matches .with_retry(stop_after_attempt=2)
        max_session_tokens=30_000,     # ~$0.015 ceiling per session at gpt-4o-mini prices
        max_output_tokens=800,         # support answers should be concise
        max_wall_seconds=15.0,
    )
    return BudgetChain(raw_chain, budget_config), memory


# In your request handler
chain, memory = build_support_chain()  # one per session

async def handle_message(user_input: str) -> str:
    try:
        response = await chain.ainvoke({"input": user_input})
        memory.save_context({"input": user_input}, {"output": response})
        return response
    except RuntimeError as e:
        # Budget exceeded — return a graceful fallback
        chain.reset_session()
        memory.clear()
        return "I've reached my response limit for this session. Please start a new conversation."

Frequently asked questions

Does TokenBudgetCallback work with providers other than OpenAI?

Yes, with minor adjustments. The on_llm_end callback receives an LLMResult object whose llm_output dict contains provider-specific keys. For Anthropic (langchain-anthropic), the usage is at llm_output["usage"] with input_tokens and output_tokens keys. For Google Vertex, it's under llm_output["usage_metadata"]. Wrapping the extraction in a try/except with a character-estimate fallback makes the callback portable across providers.
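A portable extraction helper might look like the sketch below. The function name extract_total_tokens is hypothetical. The OpenAI and Anthropic layouts follow the descriptions in this post, while the inner key total_token_count for the Vertex layout is an assumption that should be checked against your provider's actual output.

```python
from typing import Any, Mapping, Optional


def extract_total_tokens(
    llm_output: Optional[Mapping[str, Any]],
    fallback_text: str = "",
    chars_per_token: float = 4.0,
) -> int:
    """Best-effort token total from a provider-specific llm_output dict.

    Falls back to a character-based estimate of fallback_text when no
    recognised usage layout is present or the layout is malformed.
    """
    output = llm_output or {}
    try:
        if "token_usage" in output:  # OpenAI-style layout
            usage = output["token_usage"]
            return int(
                usage.get("total_tokens")
                or usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)
            )
        if "usage" in output:  # Anthropic-style layout, as described above
            usage = output["usage"]
            return int(usage["input_tokens"]) + int(usage["output_tokens"])
        if "usage_metadata" in output:  # Vertex-style; inner key name is assumed
            return int(output["usage_metadata"]["total_token_count"])
    except (KeyError, TypeError, ValueError, AttributeError):
        pass  # unexpected shape: fall through to the estimate
    return int(len(fallback_text) / chars_per_token)
```

Inside on_llm_end, the fallback text would be the generated output, so a provider that reports nothing is still charged against the budget approximately.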

Does .with_retry() on the prompt or parser (not the LLM) also multiply costs?

Only if the retry boundary wraps a node that includes an LLM call. If you put .with_retry() on a pure Python output parser that doesn't call the model, retries are free. The cost risk is when the retry boundary is at or above the LLM runnable in the chain — whether that's llm.with_retry(), chain.with_retry(), or a RunnableSequence wrapping the entire prompt+LLM+parser segment. RetryBudget catches all of these because on_llm_start fires on any LLM call regardless of where in the chain it originates.

How does RunnableWithMessageHistory compare to BoundedBufferMemory?

RunnableWithMessageHistory is LangChain 0.2+'s preferred interface for persisting chat history across requests (backed by an InMemoryChatMessageHistory, Redis, DynamoDB, etc.). It doesn't enforce any token or message ceiling — it stores and retrieves exactly what you put in. BoundedBufferMemory is the guard layer on top: you'd implement a custom BaseChatMessageHistory that enforces the trimming logic and pass it to RunnableWithMessageHistory. The two compose cleanly — RunnableWithMessageHistory for the wiring, bounded history for the ceiling.

Can I use LangSmith or Langfuse to detect these failures rather than adding guards?

Observability platforms like LangSmith and Langfuse are excellent for detecting failures after they happen: they'll show you the retry count, token usage, and memory growth in traces. What they can't do is prevent the cost from accruing in real time. A RetryBudget with max_calls=3 stops the fourth LLM call from being made. An observability dashboard shows you that four calls were made. For production cost control, you need both: guards for prevention, observability for diagnosis.

Is there a RunGuard SDK integration for LangChain LCEL?

Yes. The RunGuard SDK implements the BaseCallbackHandler interface and plugs into any LCEL chain via the callbacks parameter. It handles token budgets, retry detection, and session-level spend ceilings with configurable alert thresholds (email, Slack) — the same patterns in this post, with persistence, alerting, and a dashboard for tracking spend across all your chains and sessions.

Stop silent LCEL cost blowouts with RunGuard

The guards in this post are production-ready — drop them in today. RunGuard wraps them with persistent storage, multi-chain dashboards, and Slack/email alerts when any budget ceiling trips. One-line install, works with any LCEL chain or LangChain callback-compatible framework.
