AI agent token budget enforcement in Python: code patterns for hard spending limits
A soft limit is a threshold that triggers an alert when crossed. A hard limit is a threshold that prevents the crossing from happening in the first place. In LLM cost management, the distinction matters enormously: a soft limit on a runaway agent session tells you the session has already consumed $12 when it should have stopped at $1. A hard limit stops the session at $1.02, logs a structured error, and returns a graceful degradation response to the user. Most Python LLM applications start with soft limits — a Slack alert, a dashboard threshold — because they’re easier to implement. They stay with soft limits until the first incident where a session consumes 40× its expected budget before an engineer notices the alert. This page is a practical, code-first guide to implementing hard budget enforcement in Python: pre-call cost estimation and rejection, thread-safe running total accumulation, budget-exceeded callbacks with graceful degradation, and integration patterns for LangChain, CrewAI, AutoGen, and the raw OpenAI SDK.
Why soft limits fail in production
- Alerts have latency; LLM calls do not. A typical cost alerting pipeline has a detection latency of 1–5 minutes from the moment spend exceeds the threshold to the moment an engineer receives a notification. In that window, an agent in a runaway reasoning loop can make dozens of additional API calls. A GPT-4o loop that costs $0.15 per iteration and runs 40 iterations before an alert is acknowledged has already consumed $6, which is 60× the $0.10 per-session budget that triggered the alert. The alert told you the problem existed; it did not stop spend from climbing to 60× the budget while the alert waited for a response. Hard limits eliminate the gap between detection and containment by making them the same event: the limit check that detects the budget exceedance is the same code that blocks the next API call.
- Human response time is incompatible with agent speed. Agents can make 5–20 LLM calls per second in multi-tool orchestration scenarios. Expecting a human to respond to a cost alert fast enough to prevent significant overspend is unrealistic — this is like expecting a human to close a flood gate by hand once the water is already rushing through. Hard limits are the automated equivalent of a pressure relief valve: they activate instantly without requiring human judgment, prevent runaway accumulation, and leave a structured log that the human can review at their leisure to understand what happened.
- Soft limits provide false confidence. Teams that have cost alerts configured often feel that their cost risk is managed — “we have alerts, we’ll catch it.” This false confidence leads to under-investment in hard limits, which means the first time an alert fires and an engineer snoozes it (because they get 20 false-positive cost alerts per week), the actual runaway session is free to run until the next engineer looks at the dashboard. Hard limits are not a replacement for alerts; they’re a safety net that makes the entire alerting system less critical by reducing the blast radius of any individual alert being missed.
- Budget enforcement is a product feature, not just an ops concern. For B2B SaaS products, per-tenant budget enforcement is a contractual obligation: your customer is on a plan that includes a certain amount of LLM usage, and exceeding that usage either costs you money (if you absorb the overage) or requires you to charge them (which requires accurate per-tenant tracking). Neither outcome is acceptable without a hard enforcement layer. Free-tier products have an even stronger need: without hard limits, a single abusive free-tier user can consume a week’s worth of your LLM budget in an afternoon through automated API calls. Hard limits turn budget enforcement from an ops concern into a product capability that can be exposed to customers as a feature (“set your monthly AI usage budget”).
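The alert-latency arithmetic from the first point above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name and the call rate of 8 calls per minute are assumptions chosen so that a 5-minute alert window contains the 40 iterations used in the example.

```python
def overspend_during_alert_window(
    cost_per_call_usd: float,
    calls_per_minute: float,
    alert_latency_minutes: float,
    budget_usd: float,
) -> dict:
    """Spend that accumulates while a soft-limit alert waits for a human."""
    calls_in_window = calls_per_minute * alert_latency_minutes
    spend_in_window = calls_in_window * cost_per_call_usd
    return {
        "calls_in_window": calls_in_window,
        "spend_in_window_usd": spend_in_window,
        # How many times the per-session budget was spent inside the window.
        "multiple_of_budget": spend_in_window / budget_usd,
    }


# $0.15 per iteration, 8 calls/minute (assumed), 5-minute latency, $0.10 budget.
result = overspend_during_alert_window(0.15, 8, 5, 0.10)
print(result)
```

With these inputs the window contains 40 calls and about $6 of spend, roughly 60 times the budget. A hard limit would have blocked the loop at the first call that no longer fit.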
The pre-call budget check pattern
- Estimate cost before calling the API. A pre-call budget check estimates the cost of the pending API call based on the input token count and the expected output token count, compares this estimate against the remaining budget for the current session, and either proceeds or raises a `BudgetExceededException` before any API call is made. Token counting before the call is straightforward using the `tiktoken` library for OpenAI models or the provider's tokenizer for other models. Output token estimation is necessarily approximate. A conservative approach is to use the `max_tokens` parameter you're passing to the API as the worst-case output estimate, scaled by a configurable confidence factor (0.6–0.8 of `max_tokens` works well for most chatbot-style tasks). The pre-call check pattern looks like this:

```python
import tiktoken
from dataclasses import dataclass, field

# USD per 1K tokens. Provider prices change; verify before relying on these.
COST_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.000150, "output": 0.000600},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}


class BudgetExceededException(Exception):
    def __init__(self, session_id: str, budget: float, spent: float, estimated: float):
        self.session_id = session_id
        self.budget = budget
        self.spent = spent
        self.estimated = estimated
        super().__init__(
            f"Session {session_id}: budget ${budget:.4f}, "
            f"spent ${spent:.4f}, estimated call ${estimated:.4f}; would exceed budget"
        )


@dataclass
class BudgetTracker:
    session_id: str
    budget_usd: float
    model: str = "gpt-4o-mini"
    output_estimate_factor: float = 0.7
    _spent_usd: float = field(default=0.0, init=False)

    def estimate_call_cost(self, messages: list[dict], max_tokens: int = 1024) -> float:
        """Estimate cost of an API call before making it."""
        try:
            enc = tiktoken.encoding_for_model(self.model)
        except KeyError:
            # Model name unknown to tiktoken: fall back to a general-purpose encoding.
            enc = tiktoken.get_encoding("cl100k_base")
        input_tokens = sum(
            len(enc.encode(msg.get("content", "") or "")) + 4  # approximate per-message overhead
            for msg in messages
        ) + 3  # reply priming tokens
        estimated_output = int(max_tokens * self.output_estimate_factor)
        # Unknown models are priced at the gpt-4o-mini rate.
        rates = COST_PER_1K.get(self.model, COST_PER_1K["gpt-4o-mini"])
        return (
            input_tokens / 1000 * rates["input"]
            + estimated_output / 1000 * rates["output"]
        )

    def check_budget(self, messages: list[dict], max_tokens: int = 1024) -> float:
        """
        Check if the budget allows this call. Returns estimated cost.
        Raises BudgetExceededException if the call would exceed the budget.
        """
        estimated = self.estimate_call_cost(messages, max_tokens)
        if self._spent_usd + estimated > self.budget_usd:
            raise BudgetExceededException(
                self.session_id, self.budget_usd, self._spent_usd, estimated
            )
        return estimated

    def record_actual_cost(self, input_tokens: int, output_tokens: int) -> None:
        """Record the actual cost after a successful API call."""
        rates = COST_PER_1K.get(self.model, COST_PER_1K["gpt-4o-mini"])
        actual = input_tokens / 1000 * rates["input"] + output_tokens / 1000 * rates["output"]
        self._spent_usd += actual

    @property
    def remaining_budget(self) -> float:
        return max(0.0, self.budget_usd - self._spent_usd)

    @property
    def spent(self) -> float:
        return self._spent_usd
```

This pattern gives you a synchronous pre-call check that raises before any API call is made, a post-call actual cost recorder that corrects for estimation error, and a `remaining_budget` property that other code can inspect to make routing or degradation decisions.
- Handle estimation error gracefully. Pre-call cost estimation is accurate to within 10–15% for most prompts but can be off by more for structured outputs with highly variable length. The correct handling is not to make the estimate more pessimistic (that causes premature budget exhaustion) but to treat the pre-call check as an approximate guard, record actual costs post-call, and allow a small overage buffer. A practical pattern is to set `output_estimate_factor=0.7` for the pre-call check but allow the session to continue until actual spend exceeds `budget * 1.1`: a 10% overage buffer that handles estimation variance without exposing you to unbounded overspend. The `BudgetTracker` above compares against the plain budget, so adding the buffer means changing the comparison in `check_budget`.
- Integrate pre-call checks into your API wrapper. Don't scatter budget check calls throughout your application code. Wrap your LLM client in a `BudgetedLLMClient` class that enforces the pre-call check on every `chat.completions.create` call and automatically records actual costs from the response's usage object. Any code that uses the `BudgetedLLMClient` gets enforcement for free; the budget check becomes a cross-cutting concern at the infrastructure layer rather than an application-layer responsibility.
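The wrapper and the 10% overage buffer described above could be combined roughly as follows. This is a minimal sketch, not a definitive implementation: `BudgetedLLMClient`'s constructor arguments, the `BudgetExceeded` exception, the flat `RATES` table, and the four-characters-per-token estimate are all assumptions made to keep the example self-contained, and the stub client only mimics the shape of an OpenAI-style response.

```python
from types import SimpleNamespace

# Illustrative rates in USD per 1K tokens (assumption; check current pricing).
RATES = {"input": 0.00015, "output": 0.0006}


class BudgetExceeded(Exception):
    """Raised before the API call when the estimate does not fit the budget."""


class BudgetedLLMClient:
    """Hypothetical wrapper: pre-call check, post-call recording, overage buffer."""

    def __init__(self, client, budget_usd, output_factor=0.7, overage=1.1):
        self._client = client
        self.budget_usd = budget_usd
        self.output_factor = output_factor
        self.overage = overage
        self.spent_usd = 0.0

    def _estimate(self, messages, max_tokens):
        # Crude assumption: about 4 characters per token. Use a real tokenizer in practice.
        input_tokens = sum(len(m.get("content") or "") for m in messages) / 4
        output_tokens = max_tokens * self.output_factor
        return (input_tokens * RATES["input"] + output_tokens * RATES["output"]) / 1000

    def create(self, *, model, messages, max_tokens=512, **kwargs):
        estimate = self._estimate(messages, max_tokens)
        ceiling = self.budget_usd * self.overage
        # Block the call if recorded spend plus the estimate would pass the ceiling.
        if self.spent_usd + estimate > ceiling:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.5f} + estimate ${estimate:.5f} > ${ceiling:.5f}"
            )
        response = self._client.chat.completions.create(
            model=model, messages=messages, max_tokens=max_tokens, **kwargs
        )
        usage = response.usage
        # Record the actual cost reported by the provider, not the estimate.
        self.spent_usd += (
            usage.prompt_tokens * RATES["input"]
            + usage.completion_tokens * RATES["output"]
        ) / 1000
        return response


# Stub standing in for a real SDK client; every call reports 1000 + 1000 tokens.
def _fake_create(**kwargs):
    usage = SimpleNamespace(prompt_tokens=1000, completion_tokens=1000)
    return SimpleNamespace(usage=usage, choices=[])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
guarded = BudgetedLLMClient(fake_client, budget_usd=0.0008)
prompt = [{"role": "user", "content": "hi"}]

guarded.create(model="gpt-4o-mini", messages=prompt)  # fits: recorded at actual cost
try:
    guarded.create(model="gpt-4o-mini", messages=prompt)  # estimate no longer fits
except BudgetExceeded as exc:
    print("blocked:", exc)
```

Checking recorded spend plus the estimate against `budget * overage` is one reading of the buffer rule. A stricter variant would check the estimate against the plain budget and use the buffer only as a ceiling on recorded spend.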
Thread-safe running total accumulation
- Single-session accumulation is not enough. In a multi-user production system, you need budget enforcement at multiple scopes simultaneously: per-session (this specific conversation), per-user-per-day (how much has this user spent today across all sessions), and per-plan-tier (free plan users cannot exceed $0.50/month total). A naively implemented budget tracker that only covers the current session will be defeated by a user who opens 10 simultaneous sessions. Thread-safe multi-scope accumulation requires a shared state store (Redis is the standard choice for production) and careful locking semantics. The following pattern implements thread-safe accumulation using Redis atomic operations:
```python
import time

import redis

# BudgetExceededException is the class defined in the pre-call check example.


class MultiScopeBudgetEnforcer:
    """
    Budget enforcement at session, user, and plan scopes.
    Uses a Redis Lua script with INCRBYFLOAT for atomic, distributed accumulation.
    """

    def __init__(self, redis_client: redis.Redis, budgets: dict):
        self.redis = redis_client
        self.budgets = budgets  # {"session": 1.0, "user_daily": 5.0, "plan_monthly": 20.0}

    def _scope_key(self, scope: str, identifier: str) -> str:
        # Use UTC so every worker agrees on the day and month boundaries.
        if scope == "user_daily":
            day = time.strftime("%Y-%m-%d", time.gmtime())
            return f"budget:user_daily:{identifier}:{day}"
        elif scope == "plan_monthly":
            month = time.strftime("%Y-%m", time.gmtime())
            return f"budget:plan_monthly:{identifier}:{month}"
        else:
            return f"budget:session:{identifier}"

    def get_spent(self, scope: str, identifier: str) -> float:
        key = self._scope_key(scope, identifier)
        val = self.redis.get(key)
        return float(val) if val else 0.0

    def check_and_reserve(
        self, session_id: str, user_id: str, plan_id: str, estimated_cost: float
    ) -> None:
        """
        Atomically check all budget scopes and reserve the estimated cost.
        Raises BudgetExceededException if any scope would be exceeded.
        Uses a Lua script for atomicity.
        """
        lua_script = """
        local keys = KEYS
        local estimated = tonumber(ARGV[1])
        local budgets = {}
        for i = 2, #ARGV do
            budgets[i-1] = tonumber(ARGV[i])
        end
        -- Check all scopes
        for i, key in ipairs(keys) do
            local current = tonumber(redis.call('GET', key) or 0)
            if current + estimated > budgets[i] then
                -- Return floats as strings: Redis truncates Lua numbers to integers.
                return {i, tostring(current), tostring(budgets[i])}
            end
        end
        -- All checks passed: increment all scopes
        for i, key in ipairs(keys) do
            redis.call('INCRBYFLOAT', key, estimated)
            redis.call('EXPIRE', key, 2678400)  -- 31-day TTL
        end
        return nil
        """
        keys = [
            self._scope_key("session", session_id),
            self._scope_key("user_daily", user_id),
            self._scope_key("plan_monthly", plan_id),
        ]
        budget_values = [
            self.budgets["session"],
            self.budgets["user_daily"],
            self.budgets["plan_monthly"],
        ]
        result = self.redis.eval(
            lua_script,
            len(keys),
            *keys,
            str(estimated_cost),
            *[str(b) for b in budget_values],
        )
        if result is not None:
            scope_idx = int(result[0]) - 1
            current, budget = float(result[1]), float(result[2])
            scope_names = ["session", "user_daily", "plan_monthly"]
            # The exception has no scope field, so the tripped scope is folded into the id.
            raise BudgetExceededException(
                session_id=f"{session_id} (scope: {scope_names[scope_idx]})",
                budget=budget,
                spent=current,
                estimated=estimated_cost,
            )
```

The Lua script ensures that the check-and-increment is atomic: there is no race condition between two concurrent sessions where both pass the check before either has incremented the counter. This is critical for multi-tenant systems where a user opens parallel sessions.
- Correct reservation errors after actual cost is known. Because pre-call estimates differ from actual costs, you need a correction step after each successful API call: compute the difference between your estimate and the actual cost from the response's usage data, and apply a delta increment (positive or negative) to all scope accumulators. This keeps the running totals accurate over time. A session where every call estimates 10% over actual costs will appear to have exhausted its budget after consuming only about 91% of it, causing premature rejections; the correction step prevents this drift.
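The correction step described above might look like the following sketch. To stay runnable without a Redis server, it uses a hypothetical in-memory `InMemoryScopeStore` guarded by a lock; the `reconcile` helper and the key names are likewise illustrative assumptions. Against Redis, the same delta could be applied with `INCRBYFLOAT`, which accepts negative increments.

```python
import threading


class InMemoryScopeStore:
    """Stand-in for Redis: one float counter per scope key, guarded by a lock."""

    def __init__(self):
        self._totals = {}
        self._lock = threading.Lock()

    def incr_by(self, keys, delta):
        # Apply the same delta to every scope key inside one critical section.
        with self._lock:
            for key in keys:
                self._totals[key] = self._totals.get(key, 0.0) + delta

    def get(self, key):
        with self._lock:
            return self._totals.get(key, 0.0)


def reconcile(store, keys, estimated_usd, actual_usd):
    """Replace a reserved estimate with the actual cost; returns the delta applied."""
    delta = actual_usd - estimated_usd  # negative when the estimate was too high
    store.incr_by(keys, delta)
    return delta


store = InMemoryScopeStore()
keys = ["budget:session:s1", "budget:user_daily:u1:2025-01-01"]

store.incr_by(keys, 0.0110)                      # reservation made before the call
delta = reconcile(store, keys, 0.0110, 0.0100)   # usage data shows a cheaper call
print(delta, store.get("budget:session:s1"))
```

After reconciliation, every scope counter reflects the actual cost rather than the reservation, so over-estimates do not accumulate into premature rejections.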
Budget exceeded callbacks and graceful degradation
- Hard stop vs. graceful degrade: choose based on context. When a budget is exceeded, you have two choices: hard stop (raise an exception that bubbles up to the user as an error) or graceful degrade (return a pre-computed fallback response, route to a cheaper model, or summarize and continue with reduced capability). Hard stop is appropriate for cost-control-critical scenarios (free-tier abuse prevention, per-tenant billing enforcement) where you cannot risk additional spend. Graceful degrade is appropriate for user-facing features where the cost of interrupting the user experience exceeds the cost of the marginal additional spend. Implementing both patterns and choosing based on session context gives you the flexibility to be strict where it matters and permissive where the UX value justifies it. The callback pattern decouples the enforcement decision from the application code:
```python
from typing import Callable, Optional, Any
from enum import Enum

# BudgetTracker and BudgetExceededException come from the pre-call check example.


class DegradationStrategy(Enum):
    HARD_STOP = "hard_stop"
    FALLBACK_RESPONSE = "fallback_response"
    DOWNGRADE_MODEL = "downgrade_model"
    SUMMARIZE_AND_CONTINUE = "summarize_and_continue"


class BudgetedAgent:
    def __init__(
        self,
        tracker: BudgetTracker,
        on_budget_exceeded: Optional[Callable] = None,
        strategy: DegradationStrategy = DegradationStrategy.FALLBACK_RESPONSE,
        fallback_message: str = "I’ve reached the usage limit for this session. Please start a new conversation or upgrade your plan.",
        downgrade_model: Optional[str] = None,
    ):
        self.tracker = tracker
        self.strategy = strategy
        self.fallback_message = fallback_message
        self.downgrade_model = downgrade_model
        self.on_budget_exceeded = on_budget_exceeded or self._default_handler

    def _default_handler(self, exc: BudgetExceededException) -> Any:
        if self.strategy == DegradationStrategy.HARD_STOP:
            raise exc
        elif self.strategy == DegradationStrategy.FALLBACK_RESPONSE:
            return {"role": "assistant", "content": self.fallback_message}
        elif self.strategy == DegradationStrategy.DOWNGRADE_MODEL:
            # Signal only: call() does not retry, so the caller must act on this dict.
            return {"downgrade_to": self.downgrade_model, "retry": True}
        else:
            # SUMMARIZE_AND_CONTINUE is not implemented here; it falls back to the message.
            return {"role": "assistant", "content": self.fallback_message}

    def call(self, client, messages: list[dict], max_tokens: int = 512, **kwargs) -> Any:
        try:
            self.tracker.check_budget(messages, max_tokens)
        except BudgetExceededException as exc:
            # Returns whatever the handler produces (a dict by default).
            return self.on_budget_exceeded(exc)
        response = client.chat.completions.create(
            model=self.tracker.model,
            messages=messages,
            max_tokens=max_tokens,
            **kwargs,
        )
        usage = response.usage
        self.tracker.record_actual_cost(usage.prompt_tokens, usage.completion_tokens)
        return response.choices[0].message
```

This pattern makes the degradation strategy a configuration concern rather than an application-logic concern. Teams can change the strategy per deployment environment (strict in production, lenient in staging) without touching agent code.
- LangChain integration via callbacks. LangChain's callback system provides a clean integration point for budget enforcement. Implement a `BudgetCallbackHandler` that overrides `on_llm_start` to perform the pre-call check and `on_llm_end` to record actual costs. Register this handler globally in your LangChain chain initialization. Any LangChain chain or agent that uses this handler gets enforcement automatically, including tool calls that are made internally by the agent without explicit application-layer code. Two details are worth verifying against the LangChain version you run: exceptions raised inside a callback handler are typically logged and suppressed unless the handler sets `raise_error = True`, and chat models invoke `on_chat_model_start`, falling back to `on_llm_start` only when the former is not implemented.
- AutoGen and CrewAI integration patterns. For multi-agent frameworks like AutoGen and CrewAI, budget enforcement needs to operate at the conversation level, not just the individual LLM call level. In AutoGen, wrap the `ConversableAgent`'s `generate_reply` method to check the session-level budget before each agent turn. In CrewAI, implement a custom `LLM` class that wraps the underlying provider client with budget enforcement, then pass your custom LLM class to each crew member's constructor. Both frameworks are designed for extensibility at the LLM integration point, making this pattern straightforward to implement in 30–50 lines of code.
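The conversation-level check described for multi-agent frameworks can be sketched without depending on any framework. In the example below, `SessionBudget`, `budget_guarded`, and `TurnBudgetExceeded` are hypothetical names, and `generate_reply` is a plain function standing in for an agent's reply method; none of this is AutoGen's or CrewAI's actual API.

```python
import functools


class TurnBudgetExceeded(Exception):
    """Raised when a turn is attempted with too little budget left."""


class SessionBudget:
    """Budget shared by every agent taking part in one conversation."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def remaining(self):
        return max(0.0, self.budget_usd - self.spent_usd)


def budget_guarded(budget, min_turn_cost_usd, fallback=None):
    """Wrap a reply-generating callable so each turn checks the session budget first."""

    def decorator(generate_reply):
        @functools.wraps(generate_reply)
        def wrapper(*args, **kwargs):
            if budget.remaining() < min_turn_cost_usd:
                if fallback is not None:
                    return fallback  # graceful degrade
                raise TurnBudgetExceeded(
                    f"remaining ${budget.remaining():.4f} < ${min_turn_cost_usd:.4f}"
                )
            return generate_reply(*args, **kwargs)

        return wrapper

    return decorator


budget = SessionBudget(budget_usd=0.05)


@budget_guarded(budget, min_turn_cost_usd=0.02, fallback="[budget reached]")
def generate_reply(prompt):
    budget.spent_usd += 0.02  # pretend each turn costs two cents
    return f"reply to {prompt}"


replies = [generate_reply(f"turn {i}") for i in range(4)]
print(replies)
```

Because the budget object is shared, the same guard applied to several agents' reply methods enforces one limit for the whole conversation rather than one per agent.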
RunGuard for Python budget enforcement
- Drop-in budget enforcement without custom infrastructure. Implementing the patterns above correctly requires Redis infrastructure, careful Lua scripting, and ongoing maintenance as your application evolves. RunGuard provides the same multi-scope budget enforcement as a managed service with a Python SDK that replaces the custom implementation with three lines of setup code:
```python
import runguard
from openai import OpenAI

# user_id and messages come from your application.

# Wrap your OpenAI client with RunGuard enforcement
openai_client = OpenAI()
guarded_client = runguard.wrap(
    openai_client,
    session_budget_usd=1.00,
    user_daily_budget_usd=5.00,
    on_exceeded="fallback",  # or "raise", "downgrade"
    fallback_message="Usage limit reached for this session.",
    metadata={"user_id": user_id, "feature": "research_agent"},
)

# Use exactly like the standard OpenAI client; enforcement is transparent
response = guarded_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    max_tokens=512,
)
```

The `runguard.wrap()` call returns a client that is API-compatible with the standard OpenAI client but enforces budgets, records cost attribution to RunGuard's audit log, and handles degradation according to your `on_exceeded` configuration. The `metadata` dict is used for cost attribution in the dashboard and API: a `user_id` and `feature` tag on every call gives you per-user and per-feature cost reports out of the box.
- LangChain and CrewAI native integrations. RunGuard ships first-class integrations for LangChain (as a callback handler), CrewAI (as a custom LLM wrapper), and AutoGen (as a monkey-patch for the message generation loop). These integrations are maintained alongside the frameworks' API changes, so you don't have to update your budget enforcement code every time a framework releases a breaking change. Install with `pip install runguard[langchain]` or `pip install runguard[crewai]` for the respective integration packages.
- Budget status as a first-class session property. RunGuard exposes the current budget status (spent, remaining, budget, percentage used) as a property on the wrapped client, so your application code can query it to make routing decisions independent of enforcement. For example, a research agent that is approaching its budget limit might switch from a "deep dive" mode (many tool calls, long reasoning chains) to a "quick answer" mode (single tool call, short response) when it detects that 80% of the budget is consumed, providing a better user experience than a hard stop at 100%.
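The 80% routing rule mentioned above could be expressed as a small pure function. This sketch deliberately does not use RunGuard's SDK: `BudgetStatus` and `choose_mode` are hypothetical names, and the status values would come from whatever budget tracker your application uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    """Snapshot of a session budget (hypothetical shape)."""

    budget_usd: float
    spent_usd: float

    @property
    def fraction_used(self) -> float:
        if self.budget_usd <= 0:
            return 1.0  # treat a zero or negative budget as fully consumed
        return min(1.0, self.spent_usd / self.budget_usd)


def choose_mode(status: BudgetStatus, quick_threshold: float = 0.8) -> str:
    """Pick an agent mode from budget usage: deep dive, quick answer, or stop."""
    if status.fraction_used >= 1.0:
        return "stop"
    if status.fraction_used >= quick_threshold:
        return "quick_answer"
    return "deep_dive"


for spent in (0.50, 0.85, 1.20):
    print(spent, choose_mode(BudgetStatus(budget_usd=1.00, spent_usd=spent)))
```

Keeping the routing decision separate from enforcement means the hard limit still applies even if the routing logic has a bug or is skipped.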
Ship hard budget limits in your Python agents today
RunGuard’s Python SDK gives you production-grade token budget enforcement — multi-scope accumulators, pre-call checks, graceful degradation callbacks, and LangChain/CrewAI/AutoGen integrations — without building and maintaining the Redis infrastructure yourself. Start your free trial and wrap your first LLM client in under five minutes.