LangChain has max_iterations. That’s not a budget limit — here’s how to add one.
LangChain’s `AgentExecutor` accepts a `max_iterations` parameter (default 15) that stops the agent after N tool-calling steps. That is a step-count backstop, not a budget limit. A 15-step cap on a GPT-4o agent that sends a 20,000-token prompt on each step can still cost about $0.75 in input tokens alone for a single run (300,000 tokens at $2.50 per 1M), before any output tokens are counted. `max_iterations=15` is useful for preventing runaway duration, but it does not prevent runaway cost. LangGraph is in the same position: its `recursion_limit` config value (default 25) caps the number of graph steps, but there is no built-in budget limit, so a graph runs until it reaches a terminal node, hits that step limit, or the process is killed. This page covers three approaches to add a genuine per-run budget limit to LangChain-based agents, in order of increasing protection.
Approach 1 — max_iterations with a token-size prompt guard (lightweight)
- The combination. `max_iterations` bounds the number of steps; adding a per-step token limit (via `max_tokens` on the LLM and a context-trimming hook) bounds the cost per step. Together they bound the total run cost: `max_cost ≈ max_iterations × max_tokens × price_per_token`. For GPT-4o at $10/1M output tokens: `15 steps × 2,000 max_tokens × $0.00001 = $0.30` worst case in output tokens. Add the input cost: if the average prompt is 5,000 tokens, input adds `15 × 5,000 × $0.0000025 = $0.19`. Total worst case: ~$0.50 per run with these settings.
- Implementation.
```python
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.prompts import ChatPromptTemplate

# TOOLS (a list of tools) and user_task (a string) are assumed to be defined elsewhere.

llm = ChatOpenAI(
    model="gpt-4o",
    max_tokens=2000,  # bound completion cost per call
    temperature=0,
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the tools provided."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools=TOOLS, prompt=prompt)

executor = AgentExecutor(
    agent=agent,
    tools=TOOLS,
    max_iterations=10,      # bound number of steps
    max_execution_time=30,  # bound wall-clock time (seconds)
    handle_parsing_errors=True,
    verbose=False,
)

result = executor.invoke({"input": user_task})
```

- Limitation. This approach bounds cost only if prompt size stays flat. In practice, each iteration appends the tool results from previous steps to the prompt, so input tokens grow by 500–2,000 tokens per iteration. By iteration 10, the prompt may be 20,000+ tokens, so the estimate above is not a reliable upper bound. Use Approach 2 or 3 for accurate per-run dollar caps.
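To see how much prompt growth loosens the Approach 1 estimate, the arithmetic above can be put into a small function. This is a minimal sketch: `worst_case_cost` and its parameters are hypothetical names, the prices are the GPT-4o figures quoted on this page, and the linear `growth_per_step` is an illustrative assumption rather than a measured value.

```python
def worst_case_cost(
    steps: int,
    max_output_tokens: int,
    first_prompt_tokens: int,
    growth_per_step: int = 0,
    input_usd_per_m: float = 2.50,    # GPT-4o input price quoted in this article
    output_usd_per_m: float = 10.00,  # GPT-4o output price quoted in this article
) -> float:
    """Estimate run cost when the prompt grows linearly with each step."""
    # Step i (0-based) sends the initial prompt plus i rounds of appended tool output.
    input_tokens = sum(first_prompt_tokens + i * growth_per_step for i in range(steps))
    # Every step is assumed to use its full completion allowance.
    output_tokens = steps * max_output_tokens
    return (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000


flat = worst_case_cost(steps=15, max_output_tokens=2000, first_prompt_tokens=5000)
growing = worst_case_cost(
    steps=15, max_output_tokens=2000, first_prompt_tokens=5000, growth_per_step=2000
)
print(f"flat prompt:    ${flat:.4f}")
print(f"growing prompt: ${growing:.4f}")
```

With a flat 5,000-token prompt this reproduces the roughly $0.50 figure; with 2,000 tokens of growth per step the same settings come to about $1.01, roughly double the estimate.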
Approach 2 — Callback-based cost accumulation (accurate but post-call)
- LangChain callbacks. LangChain’s callback system fires `on_llm_end` after each LLM response. The callback receives the `LLMResult`, including token usage. You can accumulate cost in the callback and raise an exception if the cap is exceeded. By default LangChain logs exceptions raised inside a handler and carries on; the handler must set `raise_error = True` for the exception to propagate up through `AgentExecutor` and terminate the run.
- Full implementation.
```python
from typing import Any

from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult


class BudgetCallbackHandler(BaseCallbackHandler):
    """Raises RuntimeError if the run exceeds max_usd."""

    # Without this flag, exceptions raised in a handler are logged and swallowed.
    raise_error = True

    PRICING = {
        "gpt-4o": (2.50, 10.00),  # (input, output) USD per 1M tokens
        "gpt-4o-mini": (0.15, 0.60),
        "claude-sonnet-4-6": (3.00, 15.00),
    }

    def __init__(self, max_usd: float, model: str = "gpt-4o"):
        super().__init__()
        self.max_usd = max_usd
        self.model = model
        self.spent = 0.0
        # Unknown models fall back to GPT-4o pricing
        self.input_px, self.output_px = self.PRICING.get(model, (2.50, 10.00))

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        for gen_list in response.generations:
            for gen in gen_list:
                # Only chat generations carry a message with usage_metadata
                message = getattr(gen, "message", None)
                usage = getattr(message, "usage_metadata", None)
                if usage:
                    call_cost = (
                        usage.get("input_tokens", 0) * self.input_px
                        + usage.get("output_tokens", 0) * self.output_px
                    ) / 1_000_000
                    self.spent += call_cost
                    if self.spent > self.max_usd:
                        raise RuntimeError(
                            f"Budget exceeded: ${self.spent:.4f} spent "
                            f"(cap ${self.max_usd:.2f})"
                        )


# Usage
budget_handler = BudgetCallbackHandler(max_usd=1.50, model="gpt-4o")
result = executor.invoke(
    {"input": user_task},
    config={"callbacks": [budget_handler]},
)
# budget_handler.spent contains the total cost after the run
```

- Limitation. The callback fires after each LLM call completes, so the last call can overshoot the budget by its own cost. If the budget is $1.50 and the run has spent $1.45, a $0.30 final call pushes total spend to $1.75, about 17% over the cap. For most workloads this overshoot is acceptable. For tight cost control, use Approach 3, which checks before each call.
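The overshoot in that example is easy to reproduce without calling any API. The sketch below replays a list of per-call costs against a post-call cap, the way the callback handler behaves. It also shows one possible mitigation that is not part of the handler above: refusing a call when spend so far plus an assumed worst-case call cost would exceed the cap. Both function names and the `est_max_call_usd` parameter are hypothetical.

```python
def run_with_post_call_cap(call_costs, max_usd):
    """Bill each call, then stop after the first call that crosses the cap."""
    spent, calls = 0.0, 0
    for cost in call_costs:
        spent += cost  # the call has already completed and been billed
        calls += 1
        if spent > max_usd:
            break
    return spent, calls


def run_with_pre_call_estimate(call_costs, max_usd, est_max_call_usd):
    """Skip a call if spend so far plus a worst-case estimate would cross the cap."""
    spent, calls = 0.0, 0
    for cost in call_costs:
        if spent + est_max_call_usd > max_usd:
            break  # blocked before the request would be sent
        spent += cost
        calls += 1
    return spent, calls


costs = [0.50, 0.50, 0.45, 0.30, 0.30]  # illustrative per-call costs in USD
post_spent, post_calls = run_with_post_call_cap(costs, max_usd=1.50)
pre_spent, pre_calls = run_with_pre_call_estimate(costs, max_usd=1.50, est_max_call_usd=0.30)
print(f"post-call cap:     ${post_spent:.2f} after {post_calls} calls")
print(f"pre-call estimate: ${pre_spent:.2f} after {pre_calls} calls")
```

The post-call version ends at $1.75 after four calls, matching the example above. The estimate-based version stops at $1.45 after three calls, trading the overshoot for some unused budget; how conservative to make the estimate is a design choice.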
Approach 3 — RunGuard wrapped LLM (pre-call enforcement + loop detection)
- How it works. RunGuard’s `guard()` function wraps the LLM’s callable. Before each LLM call, it checks accumulated cost against the cap and the recent call-signature window against the loop pattern. If either condition fires, the call is blocked before the HTTP request goes out. This prevents the overshoot problem of Approach 2 and also catches tool-call loops that a cost-only cap misses.
- LangChain + LangGraph integration.
```python
from langchain_openai import ChatOpenAI
from langchain.schema import BaseMessage
from runguard import guard, BudgetExceededError, LoopDetectedError

# TOOLS (a list of tools) and user_task (a string) are assumed to be defined elsewhere.


# Subclass ChatOpenAI to add RunGuard
class GuardedChatOpenAI(ChatOpenAI):
    """ChatOpenAI subclass with per-run budget cap and loop detection."""

    def __init__(self, *args, max_usd: float = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        # object.__setattr__ sidesteps pydantic's attribute checks on the model
        object.__setattr__(self, "_max_usd", max_usd)
        object.__setattr__(self, "_run_guard", None)

    def _make_guard(self):
        def _inner(messages, **kw):
            response = super(GuardedChatOpenAI, self)._generate(messages, **kw)
            msg = response.generations[0][0].message
            # Extract cost from usage metadata (GPT-4o prices, USD per 1M tokens)
            usage = getattr(msg, "usage_metadata", None) or {}
            usd = (
                usage.get("input_tokens", 0) * 2.50
                + usage.get("output_tokens", 0) * 10.0
            ) / 1_000_000
            # Extract tool-call signature
            sig = "end_turn"
            if hasattr(msg, "tool_calls") and msg.tool_calls:
                sig = msg.tool_calls[0]["name"]
            return {"response": response, "usd": usd, "sig": sig}

        return guard(
            _inner,
            budget={"max_usd": self._max_usd},
            loop={"repeats": 3, "max_cycle_len": 8},
        )

    def _generate(self, messages: list[BaseMessage], **kwargs):
        # Build the guard lazily on first use, then reuse it for later calls
        if self._run_guard is None:
            object.__setattr__(self, "_run_guard", self._make_guard())
        result = self._run_guard(messages, **kwargs)
        return result["response"]


# Use it in AgentExecutor or LangGraph exactly like ChatOpenAI
llm = GuardedChatOpenAI(
    model="gpt-4o",
    max_usd=2.00,  # $2 per-run hard cap
    temperature=0,
)

# Or for LangGraph
from langgraph.prebuilt import create_react_agent

agent_graph = create_react_agent(llm, tools=TOOLS)

try:
    result = agent_graph.invoke({"messages": [("user", user_task)]})
except BudgetExceededError as e:
    print(f"Budget: ${e.spent:.3f} of $2.00")
except LoopDetectedError as e:
    print(f"Loop: {e.pattern!r} × {e.repeats}")
```

- LangGraph: turn-scoped guard for multi-turn agents. LangGraph supports multi-turn (threaded) agents where one graph handles many turns of a conversation. For these, scope the guard to the turn, not the thread: create a new guard instance at the start of each user turn and accumulate cost within that turn. Reset the guard between turns to prevent prior-turn cost from counting toward the current turn’s budget. Because `GuardedChatOpenAI` caches its guard after the first call, resetting it means building a fresh guard (or a fresh LLM instance) for each turn.
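The pre-call checks and the per-turn reset described above can be sketched in plain Python. This is an illustration of the idea only, not RunGuard’s API or its actual algorithm: `TurnGuard`, `find_cycle`, and the two exception classes are hypothetical names, and the cycle test (the tail of the history is one block repeated `repeats` times) is an assumed reading of the `repeats` and `max_cycle_len` settings.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised before a call when the turn's budget is already used up."""


class LoopDetected(Exception):
    """Raised before a call when recent signatures repeat in a cycle."""


def find_cycle(sigs, repeats, max_cycle_len):
    """Return the repeating block if the tail of sigs is one block of
    length <= max_cycle_len repeated `repeats` times, else None."""
    for k in range(1, max_cycle_len + 1):
        window = sigs[-k * repeats:]
        if len(window) < k * repeats:
            break  # not enough history for this or any longer cycle
        block = window[:k]
        if all(window[i] == block[i % k] for i in range(len(window))):
            return block
    return None


@dataclass
class TurnGuard:
    max_usd: float
    repeats: int = 3
    max_cycle_len: int = 8
    spent: float = 0.0
    sigs: list = field(default_factory=list)

    def start_turn(self):
        """Reset state so prior-turn cost does not count toward this turn."""
        self.spent = 0.0
        self.sigs.clear()

    def check(self):
        """Call before each LLM request; raises instead of letting it go out."""
        if self.spent >= self.max_usd:
            raise BudgetExceeded(f"${self.spent:.2f} of ${self.max_usd:.2f} spent")
        block = find_cycle(self.sigs, self.repeats, self.max_cycle_len)
        if block is not None:
            raise LoopDetected(f"{block!r} repeated {self.repeats} times")

    def record(self, usd, sig):
        """Call after each LLM response with its cost and tool-call signature."""
        self.spent += usd
        self.sigs.append(sig)


guard_state = TurnGuard(max_usd=1.00)
guard_state.start_turn()
for _ in range(3):
    guard_state.check()
    guard_state.record(0.01, "search")  # the same tool call three times in a row
try:
    guard_state.check()
except LoopDetected as exc:
    print("blocked:", exc)
```

Calling `start_turn()` at the top of each user turn gives every turn the full budget and a clean signature history. Note that this sketch blocks further calls only once the cap has been reached; a stricter pre-call check would also need an estimate of the next call’s cost.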
Which approach to use
| Approach | Accuracy | Detects loops? | Works with LangGraph? | Complexity |
|---|---|---|---|---|
| max_iterations + max_tokens | Approximate | No | Partial | Low |
| BudgetCallbackHandler | Good (1-call overshoot) | No | Yes | Medium |
| GuardedChatOpenAI (RunGuard) | Exact (pre-call) | Yes | Yes | Medium |