LangChain has `max_iterations`. That’s not a budget limit — here’s how to add one.

LangChain’s AgentExecutor accepts a max_iterations parameter (default 15) that stops the agent after N tool-calling steps. That’s a step-count backstop, not a budget limit. A 15-step run on GPT-4o with a 20,000-token prompt on every step sends 300,000 input tokens, about $0.75 at $2.50/1M before any output tokens are counted, and nothing stops the prompts from growing larger than that. max_iterations=15 is useful for preventing runaway duration, but it doesn’t prevent runaway cost. LangGraph has a comparable step-count backstop, recursion_limit (default 25), but no built-in budget limit either. This page covers three approaches to adding a genuine per-run budget limit to LangChain-based agents, in order of increasing protection.

Approach 1 — `max_iterations` with a token-size prompt guard (lightweight)

The combination. max_iterations bounds the number of steps; a per-step token limit (max_tokens on the LLM plus a context-trimming hook that caps prompt size) bounds the cost per step. Together they bound the total run cost: max_cost ≈ max_iterations × (max_tokens × output_price_per_token + prompt_tokens × input_price_per_token). For GPT-4o at $10/1M output tokens: 15 steps × 2,000 max_tokens × $0.00001 = $0.30 worst case in output tokens. For input at $2.50/1M, if the prompt is capped at 5,000 tokens: 15 × 5,000 × $0.0000025 = $0.19. Total worst case: ~$0.50 per run at 15 steps. The implementation below uses max_iterations=10, which lowers the same bound to about $0.33.
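The arithmetic above can be written as a small helper, which makes it easier to re-run the estimate for other settings. This is a minimal sketch: the function name and its parameters are illustrative, the default prices are the GPT-4o figures quoted on this page, and the result is only a bound if the prompt really stays at or below prompt_tokens on every step.

```python
def worst_case_run_cost(
    max_iterations: int,
    max_output_tokens: int,
    prompt_tokens: int,
    input_usd_per_m: float = 2.50,    # GPT-4o input price used on this page
    output_usd_per_m: float = 10.00,  # GPT-4o output price used on this page
) -> float:
    """Upper bound on run cost, assuming a flat prompt size on every step."""
    per_step = (
        prompt_tokens * input_usd_per_m + max_output_tokens * output_usd_per_m
    ) / 1_000_000
    return max_iterations * per_step


# The worked example from the text: 15 steps, 2,000 max_tokens, 5,000-token prompt.
bound = worst_case_run_cost(max_iterations=15, max_output_tokens=2_000, prompt_tokens=5_000)
print(f"${bound:.4f}")
```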

Implementation.

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    model="gpt-4o",
    max_tokens=2000,      # bound completion cost per call
    temperature=0,
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the tools provided."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# TOOLS (a list of LangChain tools) and user_task (the task string) are defined elsewhere.
agent = create_tool_calling_agent(llm, tools=TOOLS, prompt=prompt)

executor = AgentExecutor(
    agent=agent,
    tools=TOOLS,
    max_iterations=10,         # bound number of steps
    max_execution_time=30,     # bound wall-clock time (seconds)
    handle_parsing_errors=True,
    verbose=False,
)

result = executor.invoke({"input": user_task})
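The implementation above sets max_tokens and max_iterations but leaves out the context-trimming hook. One possible shape for that hook is sketched below in plain Python. The (role, text) tuples are a simplified stand-in for LangChain message objects, the 4-characters-per-token estimate is an assumption rather than a real tokenizer, and the function names are hypothetical.

```python
from typing import Callable

Message = tuple[str, str]  # (role, text): simplified stand-in for LangChain messages


def rough_token_count(text: str) -> int:
    """Crude estimate of about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def trim_history(
    messages: list[Message],
    max_prompt_tokens: int,
    count: Callable[[str], int] = rough_token_count,
) -> list[Message]:
    """Keep system messages plus the most recent messages that fit the token budget."""
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]
    budget = max_prompt_tokens - sum(count(text) for _, text in system)
    kept: list[Message] = []
    for role, text in reversed(rest):  # walk from newest to oldest
        cost = count(text)
        if cost > budget:
            break                      # everything older than this is dropped
        kept.append((role, text))
        budget -= cost
    return system + list(reversed(kept))


history = [
    ("system", "s" * 40),
    ("human", "a" * 400),
    ("tool", "b" * 400),
    ("ai", "c" * 40),
]
trimmed = trim_history(history, max_prompt_tokens=150)
```

A production hook would also need to keep each tool result paired with the tool call that produced it, since chat APIs generally reject orphaned tool messages. LangChain's own trim_messages helper in langchain_core.messages may be a better starting point than this sketch.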

Limitation. This approach bounds cost only if prompt size stays flat. In practice, each iteration appends the tool results from previous steps to the prompt, so input tokens typically grow by 500–2,000 per iteration unless the history is trimmed. By iteration 10 the prompt may exceed 20,000 tokens, and a flat-prompt estimate then understates the real cost. Use Approach 2 or 3 for accurate per-run dollar caps.
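To see how far prompt growth moves the estimate, the flat-prompt formula can be extended with a per-step growth term. The sketch below assumes linear growth and the GPT-4o prices used on this page; the function name and the example figures (a 5,000-token starting prompt growing by 2,000 tokens per step) are illustrative choices, not measurements.

```python
def run_cost_with_growth(
    iterations: int,
    base_prompt_tokens: int,
    growth_per_step: int,
    max_output_tokens: int,
    input_usd_per_m: float = 2.50,
    output_usd_per_m: float = 10.00,
) -> float:
    """Worst-case cost when the prompt grows by a fixed number of tokens each step."""
    total_input = sum(
        base_prompt_tokens + step * growth_per_step for step in range(iterations)
    )
    total_output = iterations * max_output_tokens
    return (total_input * input_usd_per_m + total_output * output_usd_per_m) / 1_000_000


flat = run_cost_with_growth(10, 5_000, 0, 2_000)         # prompt stays at 5,000 tokens
growing = run_cost_with_growth(10, 5_000, 2_000, 2_000)  # prompt reaches 23,000 tokens
print(f"flat: ${flat:.3f}  growing: ${growing:.3f}")
```

Under these assumptions the growing-prompt run costs roughly 70% more than the flat estimate, which is why the Approach 1 bound should be treated as approximate.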

Approach 2 — Callback-based cost accumulation (accurate but post-call)

LangChain callbacks. LangChain’s callback system fires on_llm_end after each LLM response, passing an LLMResult that includes token usage. You can accumulate cost in the callback and raise an exception once the cap is exceeded. By default LangChain logs and swallows exceptions raised inside a handler, so the handler must set raise_error = True for the exception to propagate up through AgentExecutor and terminate the run.

Full implementation.

from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult
from typing import Any

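class BudgetCallbackHandler(BaseCallbackHandler):
    """Raises RuntimeError if the run's accumulated cost exceeds max_usd."""

    # Without this, LangChain logs handler exceptions and keeps running.
    raise_error = True
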
    PRICING = {
        "gpt-4o":      (2.50, 10.00),   # (input, output) per 1M tokens
        "gpt-4o-mini": (0.15,  0.60),
        "claude-sonnet-4-6": (3.00, 15.00),
    }

    def __init__(self, max_usd: float, model: str = "gpt-4o"):
        super().__init__()
        self.max_usd = max_usd
        self.model = model
        self.spent = 0.0
        # Unknown model names fall back to GPT-4o pricing.
        self.input_px, self.output_px = self.PRICING.get(model, (2.50, 10.00))

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        for gen_list in response.generations:
            for gen in gen_list:
                # Plain Generation objects have no .message, so look it up defensively.
                usage = getattr(getattr(gen, "message", None), "usage_metadata", None)
                if usage:
                    call_cost = (
                        usage.get("input_tokens", 0) * self.input_px +
                        usage.get("output_tokens", 0) * self.output_px
                    ) / 1_000_000
                    self.spent += call_cost
        if self.spent > self.max_usd:
            raise RuntimeError(
                f"Budget exceeded: ${self.spent:.4f} spent (cap ${self.max_usd:.2f})"
            )

# Usage
budget_handler = BudgetCallbackHandler(max_usd=1.50, model="gpt-4o")

result = executor.invoke(
    {"input": user_task},
    config={"callbacks": [budget_handler]},
)
# budget_handler.spent contains total cost after the run

Limitation. The callback fires after each LLM call completes, so the call that crosses the cap has already been paid for. If the budget is $1.50 and the run has spent $1.45, a $0.30 final call pushes total spend to $1.75, about 17% over the cap. For most workloads this overshoot is acceptable. For tighter cost control, use Approach 3, which checks the budget before each call.
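One way to shrink the overshoot while staying with callbacks is to stop once the remaining budget is smaller than a worst-case next call, rather than waiting for the cap to be crossed. The sketch below shows only the check. Both function names are hypothetical, the prices are the GPT-4o figures from this page, and the check is only meaningful if prompt size and max_tokens are themselves capped.

```python
def worst_case_call_usd(
    max_prompt_tokens: int,
    max_output_tokens: int,
    input_usd_per_m: float = 2.50,
    output_usd_per_m: float = 10.00,
) -> float:
    """Cost of a single call at the largest allowed prompt and completion size."""
    return (
        max_prompt_tokens * input_usd_per_m + max_output_tokens * output_usd_per_m
    ) / 1_000_000


def should_stop(spent_usd: float, cap_usd: float, next_call_worst_usd: float) -> bool:
    """True when one more worst-case call could push spend past the cap."""
    return spent_usd + next_call_worst_usd > cap_usd


headroom = worst_case_call_usd(max_prompt_tokens=20_000, max_output_tokens=2_000)
stop_at_145 = should_stop(1.45, cap_usd=1.50, next_call_worst_usd=headroom)
stop_at_140 = should_stop(1.40, cap_usd=1.50, next_call_worst_usd=headroom)
```

The trade-off is that the run may stop with a little budget unspent, in exchange for not exceeding the cap by more than the estimate is wrong.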

Approach 3 — RunGuard-wrapped LLM (pre-call enforcement + loop detection)

How it works. RunGuard’s guard() function wraps the LLM’s callable. Before each LLM call, it checks accumulated cost against the cap and the recent call-signature window against the loop pattern. If either condition fires, the call is blocked before the HTTP request goes out. This prevents the overshoot problem of Approach 2 and also catches tool-call loops that a cost-only cap misses.
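To make the loop check concrete, here is a sketch of one way to detect a repeated pattern in a window of call signatures. It is not RunGuard's actual algorithm, which this page does not show. The function name is hypothetical, and the repeats and max_cycle_len parameters simply mirror the names used in the guard configuration on this page.

```python
from typing import Optional


def find_repeated_cycle(
    signatures: list[str], repeats: int = 3, max_cycle_len: int = 8
) -> Optional[list[str]]:
    """Return the cycle if the tail of `signatures` is one cycle repeated `repeats` times."""
    for cycle_len in range(1, max_cycle_len + 1):
        window = cycle_len * repeats
        if len(signatures) < window:
            break  # not enough history for this or any longer cycle
        tail = signatures[-window:]
        cycle = tail[:cycle_len]
        if all(tail[i] == cycle[i % cycle_len] for i in range(window)):
            return cycle
    return None


looping = find_repeated_cycle(["search", "fetch"] * 3)
stuck = find_repeated_cycle(["plan", "search", "search", "search"])
healthy = find_repeated_cycle(["plan", "search", "fetch", "summarize"])
```

A check like this would run before each call, next to the budget check, so that a run repeating the same tool sequence is stopped even while it is still under its dollar cap.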

LangChain + LangGraph integration.

from langchain_openai import ChatOpenAI
from langchain.schema import BaseMessage
from runguard import guard, BudgetExceededError, LoopDetectedError

# Subclass ChatOpenAI to add RunGuard
class GuardedChatOpenAI(ChatOpenAI):
    """ChatOpenAI subclass with per-run budget cap and loop detection."""

    def __init__(self, *args, max_usd: float = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        object.__setattr__(self, "_max_usd", max_usd)
        object.__setattr__(self, "_run_guard", None)

    def _make_guard(self):
        def _inner(messages, **kw):
            response = super(GuardedChatOpenAI, self)._generate(messages, **kw)
            # Extract cost from usage metadata (GPT-4o prices, USD per 1M tokens)
            usage = getattr(response.generations[0][0].message, "usage_metadata", None) or {}
            usd = (
                usage.get("input_tokens", 0) * 2.50 +
                usage.get("output_tokens", 0) * 10.0
            ) / 1_000_000
            # Extract tool-call signature
            msg = response.generations[0][0].message
            sig = "end_turn"
            if hasattr(msg, "tool_calls") and msg.tool_calls:
                sig = msg.tool_calls[0]["name"]
            return {"response": response, "usd": usd, "sig": sig}
        return guard(
            _inner,
            budget={"max_usd": self._max_usd},
            loop={"repeats": 3, "max_cycle_len": 8},
        )

    def _generate(self, messages: list[BaseMessage], **kwargs):
        if self._run_guard is None:
            object.__setattr__(self, "_run_guard", self._make_guard())
        result = self._run_guard(messages, **kwargs)
        return result["response"]

# Use it in AgentExecutor or LangGraph exactly like ChatOpenAI
llm = GuardedChatOpenAI(
    model="gpt-4o",
    max_usd=2.00,   # $2 per-run hard cap
    temperature=0,
)

# Or for LangGraph
from langgraph.prebuilt import create_react_agent
agent_graph = create_react_agent(llm, tools=TOOLS)

try:
    result = agent_graph.invoke({"messages": [("user", user_task)]})
except BudgetExceededError as e:
    print(f"Budget: ${e.spent:.3f} of $2.00")
except LoopDetectedError as e:
    print(f"Loop: {e.pattern!r} × {e.repeats}")

LangGraph: turn-scoped guard for multi-turn agents. LangGraph supports multi-turn (threaded) agents, where one graph handles many turns of a conversation. For these, scope the guard to the turn, not the thread: create a new guard instance at the start of each user turn and accumulate cost only within that turn. Otherwise cost from earlier turns counts toward the current turn’s budget. Note that GuardedChatOpenAI above creates its guard once per instance, so each turn needs a fresh instance or a reset of _run_guard.
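The difference between turn scope and thread scope can be sketched without any framework. TurnBudget and run_turn below are hypothetical names used only for illustration and are not part of LangGraph or RunGuard. The list of call costs stands in for the LLM calls made while handling one user turn.

```python
class TurnBudget:
    """Hypothetical per-turn spend tracker."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        """Record one call's cost and raise once the turn's cap is exceeded."""
        self.spent += usd
        if self.spent > self.max_usd:
            raise RuntimeError(
                f"Turn budget exceeded: ${self.spent:.2f} of ${self.max_usd:.2f}"
            )


def run_turn(call_costs: list[float], max_usd: float) -> float:
    """Handle one user turn with a fresh budget and return what the turn spent."""
    budget = TurnBudget(max_usd)  # new instance per turn, so nothing carries over
    for usd in call_costs:        # stand-in for the LLM calls made during the turn
        budget.charge(usd)
    return budget.spent


turn_1 = run_turn([0.60, 0.70], max_usd=2.00)
turn_2 = run_turn([0.90, 0.80], max_usd=2.00)
```

Both turns stay under the $2.00 cap here, although the thread as a whole spends about $3.00. A single guard shared across the thread would have stopped the second turn partway through.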

Which approach to use

| Approach | Accuracy | Detects loops? | Works with LangGraph? | Complexity |
| --- | --- | --- | --- | --- |
| max_iterations + max_tokens | Approximate | No | Partial | Low |
| BudgetCallbackHandler | Good (1-call overshoot) | No | Yes | Medium |
| GuardedChatOpenAI (RunGuard) | Exact (pre-call) | Yes | Yes | Medium |
