AI agent token budget enforcement in Python: code patterns for hard spending limits

A soft limit is a threshold that triggers an alert when crossed. A hard limit is a threshold that prevents the crossing from happening in the first place. In LLM cost management, the distinction matters enormously: a soft limit on a runaway agent session tells you the session has already consumed $12 when it should have stopped at $1. A hard limit stops the session at $1.02, logs a structured error, and returns a graceful degradation response to the user. Most Python LLM applications start with soft limits — a Slack alert, a dashboard threshold — because they’re easier to implement. They stay with soft limits until the first incident where a session consumes 40× its expected budget before an engineer notices the alert. This page is a practical, code-first guide to implementing hard budget enforcement in Python: pre-call cost estimation and rejection, thread-safe running total accumulation, budget-exceeded callbacks with graceful degradation, and integration patterns for LangChain, CrewAI, AutoGen, and the raw OpenAI SDK.

Why soft limits fail in production

Alerts have latency; LLM calls do not. A typical cost alerting pipeline has a detection latency of 1–5 minutes from the moment spend exceeds threshold to the moment an engineer receives a notification. In that window, an agent in a runaway reasoning loop can make dozens of additional API calls. A GPT-4o loop that costs $0.15 per iteration and runs 40 iterations before an alert is acknowledged has already consumed $6 — 60× the $0.10 per-session budget that triggered the alert. The alert told you the problem existed; it did not prevent the problem from growing by 60× after detection. Hard limits eliminate the gap between detection and containment by making detection and containment the same event: the limit check that detects the budget exceedance is the same code that blocks the next API call.
Human response time is incompatible with agent speed. Agents can make 5–20 LLM calls per second in multi-tool orchestration scenarios. Expecting a human to respond to a cost alert fast enough to prevent significant overspend is unrealistic — this is like expecting a human to close a flood gate by hand once the water is already rushing through. Hard limits are the automated equivalent of a pressure relief valve: they activate instantly without requiring human judgment, prevent runaway accumulation, and leave a structured log that the human can review at their leisure to understand what happened.
Soft limits provide false confidence. Teams that have cost alerts configured often feel that their cost risk is managed — “we have alerts, we’ll catch it.” This false confidence leads to under-investment in hard limits, which means the first time an alert fires and an engineer snoozes it (because they get 20 false-positive cost alerts per week), the actual runaway session is free to run until the next engineer looks at the dashboard. Hard limits are not a replacement for alerts; they’re a safety net that makes the entire alerting system less critical by reducing the blast radius of any individual alert being missed.
Budget enforcement is a product feature, not just an ops concern. For B2B SaaS products, per-tenant budget enforcement is a contractual obligation: your customer is on a plan that includes a certain amount of LLM usage, and exceeding that usage either costs you money (if you absorb the overage) or requires you to charge them (which requires accurate per-tenant tracking). Neither outcome is acceptable without a hard enforcement layer. Free-tier products have an even stronger need: without hard limits, a single abusive free-tier user can consume a week’s worth of your LLM budget in an afternoon through automated API calls. Hard limits turn budget enforcement from an ops concern into a product capability that can be exposed to customers as a feature (“set your monthly AI usage budget”).

The pre-call budget check pattern

Estimate cost before calling the API. A pre-call budget check estimates the cost of the pending API call based on the input token count and the expected output token count, compares this estimate against the remaining budget for the current session, and either proceeds or raises a BudgetExceededException before any API call is made. Token counting before the call is straightforward using the tiktoken library for OpenAI models or the provider’s tokenizer for other models. Output token estimation is necessarily approximate — a conservative approach is to use the max_tokens parameter you’re passing to the API as the worst-case output estimate, with a configurable confidence factor (0.6–0.8 of max_tokens works well for most chatbot-style tasks). The pre-call check pattern looks like this:

import tiktoken
from dataclasses import dataclass, field
from typing import Optional

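# USD per 1,000 tokens. Provider prices change; verify against current pricing before relying on these.
COST_PER_1K = {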
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.000150, "output": 0.000600},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}

class BudgetExceededException(Exception):
    def __init__(self, session_id: str, budget: float, spent: float, estimated: float):
        self.session_id = session_id
        self.budget = budget
        self.spent = spent
        self.estimated = estimated
        super().__init__(
            f"Session {session_id}: budget ${budget:.4f}, "
            f"spent ${spent:.4f}, estimated call ${estimated:.4f} — would exceed budget"
        )

@dataclass
class BudgetTracker:
    session_id: str
    budget_usd: float
    model: str = "gpt-4o-mini"
    output_estimate_factor: float = 0.7
    _spent_usd: float = field(default=0.0, init=False)

    def estimate_call_cost(self, messages: list[dict], max_tokens: int = 1024) -> float:
        """Estimate cost of an API call before making it."""
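        try:
            enc = tiktoken.encoding_for_model(self.model)
        except KeyError:
            # Model name unknown to tiktoken: fall back to a general-purpose encoding
            enc = tiktoken.get_encoding("cl100k_base")
        input_tokens = sum(
            len(enc.encode(msg.get("content", "") or "")) + 4  # approximate per-message framing
            for msg in messages
        ) + 3  # reply priming, added once per request

        estimated_output = int(max_tokens * self.output_estimate_factor)
        # Unknown models fall back to gpt-4o-mini rates, which can underestimate cost
        rates = COST_PER_1K.get(self.model, COST_PER_1K["gpt-4o-mini"])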
        return (
            input_tokens / 1000 * rates["input"] +
            estimated_output / 1000 * rates["output"]
        )

    def check_budget(self, messages: list[dict], max_tokens: int = 1024) -> float:
        """
        Check if the budget allows this call. Returns estimated cost.
        Raises BudgetExceededException if the call would exceed the budget.
        """
        estimated = self.estimate_call_cost(messages, max_tokens)
        if self._spent_usd + estimated > self.budget_usd:
            raise BudgetExceededException(
                self.session_id, self.budget_usd, self._spent_usd, estimated
            )
        return estimated

    def record_actual_cost(self, input_tokens: int, output_tokens: int) -> None:
        """Record the actual cost after a successful API call."""
        rates = COST_PER_1K.get(self.model, COST_PER_1K["gpt-4o-mini"])
        actual = input_tokens / 1000 * rates["input"] + output_tokens / 1000 * rates["output"]
        self._spent_usd += actual

    @property
    def remaining_budget(self) -> float:
        return max(0.0, self.budget_usd - self._spent_usd)

    @property
    def spent(self) -> float:
        return self._spent_usd

This pattern gives you a synchronous pre-call check that raises before any API call is made, a post-call actual cost recorder that corrects for estimation error, and a remaining_budget property that other code can inspect to make routing or degradation decisions.
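
To make the estimate concrete, here is the same arithmetic worked through for a single call. This is an illustrative sketch only: the token counts are made-up inputs, the rates are the gpt-4o-mini figures from COST_PER_1K, and the standalone `estimate` function is a hypothetical stand-in for estimate_call_cost with the tokenizer step removed.

```python
# USD per 1K tokens for gpt-4o-mini, copied from COST_PER_1K above
INPUT_RATE = 0.000150
OUTPUT_RATE = 0.000600


def estimate(input_tokens: int, max_tokens: int, factor: float = 0.7) -> float:
    """Same formula as estimate_call_cost, with the token count passed in directly."""
    estimated_output = int(max_tokens * factor)  # 1024 * 0.7 -> 716 tokens
    return input_tokens / 1000 * INPUT_RATE + estimated_output / 1000 * OUTPUT_RATE


# Hypothetical call: 1,200 prompt tokens, max_tokens=1024
cost = estimate(input_tokens=1200, max_tokens=1024)
print(f"{cost:.7f}")  # 0.00018 for input + 0.0004296 for output
```

At roughly $0.0006 per call, a $1.00 session budget admits on the order of 1,600 such calls before check_budget starts rejecting.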

Handle estimation error gracefully. Pre-call cost estimation is accurate to within 10–15% for most prompts but can be off by more for structured outputs with highly variable length. The correct handling is not to make the estimate more pessimistic (that causes premature budget exhaustion) but to treat the remaining budget check as a soft guard and record actual costs post-call, allowing a small overage buffer. A practical pattern is to set output_estimate_factor=0.7 for the pre-call check but allow the session to continue until actual spend exceeds budget * 1.1 — a 10% overage buffer that handles estimation variance without exposing you to unbounded overspend.
Integrate pre-call checks into your API wrapper. Don’t scatter budget check calls throughout your application code. Wrap your LLM client in a BudgetedLLMClient class that enforces the pre-call check on every chat.completions.create call and automatically records actual costs from the response’s usage object. Any code that uses the BudgetedLLMClient gets enforcement for free; the budget check becomes a cross-cutting concern at the infrastructure layer rather than an application-layer responsibility.
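
A wrapper along these lines might look like the sketch below. BudgetedLLMClient is named in the text but not defined there, so everything here is an assumption: the `Tracker` protocol mirrors the two BudgetTracker methods shown earlier, and the wrapped client is any object that exposes chat.completions.create the way the OpenAI SDK does.

```python
from types import SimpleNamespace
from typing import Any, Protocol


class Tracker(Protocol):
    """The two tracker methods the wrapper relies on (as on BudgetTracker above)."""

    def check_budget(self, messages: list[dict], max_tokens: int = 1024) -> float: ...

    def record_actual_cost(self, input_tokens: int, output_tokens: int) -> None: ...


class BudgetedLLMClient:
    """Hypothetical wrapper that enforces the budget on every create() call."""

    def __init__(self, client: Any, tracker: Tracker):
        self._client = client
        self._tracker = tracker
        # Mirror the SDK's attribute path so existing call sites stay unchanged.
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, *, messages: list[dict], max_tokens: int = 1024, **kwargs: Any) -> Any:
        # Raises (for example BudgetExceededException) before any network call.
        self._tracker.check_budget(messages, max_tokens)
        response = self._client.chat.completions.create(
            messages=messages, max_tokens=max_tokens, **kwargs
        )
        usage = getattr(response, "usage", None)
        if usage is not None:  # some responses (e.g. streamed) may not carry usage
            self._tracker.record_actual_cost(
                usage.prompt_tokens, usage.completion_tokens
            )
        return response
```

Call sites then use `budgeted.chat.completions.create(model=..., messages=..., max_tokens=...)` exactly as they would with the unwrapped client.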

Thread-safe running total accumulation

Single-session accumulation is not enough. In a multi-user production system, you need budget enforcement at multiple scopes simultaneously: per-session (this specific conversation), per-user-per-day (how much has this user spent today across all sessions), and per-plan-tier (free plan users cannot exceed $0.50/month total). A naively implemented budget tracker that only covers the current session will be defeated by a user who opens 10 simultaneous sessions. Thread-safe multi-scope accumulation requires a shared state store (Redis is the standard choice for production) and careful locking semantics. The following pattern implements thread-safe accumulation using Redis atomic operations:

import threading
import time
from contextlib import contextmanager
from typing import Optional
import redis

class MultiScopeBudgetEnforcer:
    """
    Thread-safe budget enforcement at session, user, and plan scopes.
    Uses Redis atomic INCRBYFLOAT for distributed accumulation.
    """

    def __init__(self, redis_client: redis.Redis, budgets: dict):
        self.redis = redis_client
        self.budgets = budgets  # {"session": 1.0, "user_daily": 5.0, "plan_monthly": 20.0}
        self._local = threading.local()

    def _scope_key(self, scope: str, identifier: str) -> str:
        # Use UTC so every host rolls over to a new period at the same instant.
        if scope == "user_daily":
            day = time.strftime("%Y-%m-%d", time.gmtime())
            return f"budget:user_daily:{identifier}:{day}"
        elif scope == "plan_monthly":
            month = time.strftime("%Y-%m", time.gmtime())
            return f"budget:plan_monthly:{identifier}:{month}"
        else:
            return f"budget:session:{identifier}"

    def get_spent(self, scope: str, identifier: str) -> float:
        key = self._scope_key(scope, identifier)
        val = self.redis.get(key)
        return float(val) if val else 0.0

    def check_and_reserve(
        self,
        session_id: str,
        user_id: str,
        plan_id: str,
        estimated_cost: float
    ) -> None:
        """
        Atomically check all budget scopes and reserve the estimated cost.
        Raises BudgetExceededException if any scope would be exceeded.
        Uses a Lua script for atomicity.
        """
        lua_script = """
        local keys = KEYS
        local estimated = tonumber(ARGV[1])
        local budgets = {}
        for i = 2, #ARGV do
            budgets[i-1] = tonumber(ARGV[i])
        end
        -- Check all scopes
        for i, key in ipairs(keys) do
            local current = tonumber(redis.call('GET', key) or 0)
            if current + estimated > budgets[i] then
                -- Return floats as strings: Redis truncates Lua numbers to integers
                return {i, tostring(current), tostring(budgets[i])}
            end
        end
        -- All checks passed: increment all scopes
        for i, key in ipairs(keys) do
            redis.call('INCRBYFLOAT', key, estimated)
            redis.call('EXPIRE', key, 2678400)  -- 31-day TTL
        end
        return nil
        """
        keys = [
            self._scope_key("session", session_id),
            self._scope_key("user_daily", user_id),
            self._scope_key("plan_monthly", plan_id),
        ]
        budget_values = [
            self.budgets["session"],
            self.budgets["user_daily"],
            self.budgets["plan_monthly"],
        ]

        result = self.redis.eval(
            lua_script,
            len(keys),
            *keys,
            str(estimated_cost),
            *[str(b) for b in budget_values],
        )

        if result is not None:
            scope_idx, current, budget = int(result[0]) - 1, float(result[1]), float(result[2])
            scope_names = ["session", "user_daily", "plan_monthly"]
            exc = BudgetExceededException(
                session_id=session_id,
                budget=budget,
                spent=current,
                estimated=estimated_cost,
            )
            exc.scope = scope_names[scope_idx]  # record which scope rejected the call
            raise exc

The Lua script ensures that the check-and-increment is atomic: there is no race condition between two concurrent sessions where both pass the check before either has incremented the counter. This is critical for multi-tenant systems where a user opens parallel sessions.
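
The same check-then-increment atomicity can be illustrated inside a single process with a lock. This sketch is not a substitute for the Redis version, since it does not coordinate across processes or hosts; `LocalBudgetLedger` and `try_reserve` are hypothetical names used only for illustration.

```python
import threading


class LocalBudgetLedger:
    """Hypothetical single-process ledger: check and reserve under one lock."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self._reserved_usd = 0.0
        self._lock = threading.Lock()

    def try_reserve(self, estimated_cost: float) -> bool:
        # Holding the lock across both the comparison and the increment is
        # what stops two threads from both passing the check.
        with self._lock:
            if self._reserved_usd + estimated_cost > self.budget_usd:
                return False
            self._reserved_usd += estimated_cost
            return True

    @property
    def reserved_usd(self) -> float:
        with self._lock:
            return self._reserved_usd


# Ten concurrent "sessions" each try to reserve $0.25 against a $1.00 budget.
ledger = LocalBudgetLedger(budget_usd=1.0)
results: list[bool] = []


def worker() -> None:
    results.append(ledger.try_reserve(0.25))


threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results), ledger.reserved_usd)  # 4 reservations succeed, 1.0 reserved
```

Without the lock, two threads could both read the same running total, both pass the comparison, and together push the total past the budget.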

Correct reservation errors after actual cost is known. Because pre-call estimates differ from actual costs, you need a correction step after each successful API call: compute the difference between the actual cost from the response’s usage data and your estimate, and apply that delta (positive or negative) to all scope accumulators. This keeps the running totals accurate over time. A session where every estimate is 10% above the actual cost will appear to have exhausted its budget after consuming only about 91% of it (1/1.1), causing premature rejections; the correction step prevents this drift.
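
One way to sketch the correction step is shown below. `reconcile_reservation` is a hypothetical helper, not part of MultiScopeBudgetEnforcer above; it assumes only that the store object offers `incrbyfloat(name, amount)`, which redis-py clients and pipelines do.

```python
from typing import Any, Protocol


class SupportsIncrByFloat(Protocol):
    """Minimal interface the helper needs from the counter store."""

    def incrbyfloat(self, name: str, amount: float) -> Any: ...


def reconcile_reservation(
    store: SupportsIncrByFloat,
    scope_keys: list[str],
    estimated_cost: float,
    actual_cost: float,
) -> float:
    """Apply (actual - estimated) to every scope counter and return the delta."""
    delta = actual_cost - estimated_cost
    if delta != 0.0:
        for key in scope_keys:
            # A negative delta releases the unused part of the reservation;
            # a positive delta charges the part the estimate missed.
            store.incrbyfloat(key, delta)
    return delta
```

The scope keys would be the same three keys passed to the reservation script. With a real Redis client, sending the increments through a pipeline keeps the three counters from being observed in a half-updated state.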

Budget exceeded callbacks and graceful degradation

Hard stop vs. graceful degrade: choose based on context. When a budget is exceeded, you have two choices: hard stop (raise an exception that bubbles up to the user as an error) or graceful degrade (return a pre-computed fallback response, route to a cheaper model, or summarize and continue with reduced capability). Hard stop is appropriate for cost-control-critical scenarios (free-tier abuse prevention, per-tenant billing enforcement) where you cannot risk additional spend. Graceful degrade is appropriate for user-facing features where the cost of interrupting the user experience exceeds the cost of the marginal additional spend. Implementing both patterns and choosing based on session context gives you the flexibility to be strict where it matters and permissive where the UX value justifies it. The callback pattern decouples the enforcement decision from the application code:

from typing import Callable, Optional, Any
from enum import Enum

class DegradationStrategy(Enum):
    HARD_STOP = "hard_stop"
    FALLBACK_RESPONSE = "fallback_response"
    DOWNGRADE_MODEL = "downgrade_model"
    SUMMARIZE_AND_CONTINUE = "summarize_and_continue"

class BudgetedAgent:
    def __init__(
        self,
        tracker: BudgetTracker,
        on_budget_exceeded: Optional[Callable] = None,
        strategy: DegradationStrategy = DegradationStrategy.FALLBACK_RESPONSE,
        fallback_message: str = "I’ve reached the usage limit for this session. Please start a new conversation or upgrade your plan.",
        downgrade_model: Optional[str] = None,
    ):
        self.tracker = tracker
        self.strategy = strategy
        self.fallback_message = fallback_message
        self.downgrade_model = downgrade_model
        self.on_budget_exceeded = on_budget_exceeded or self._default_handler

    def _default_handler(self, exc: BudgetExceededException) -> Any:
        if self.strategy == DegradationStrategy.HARD_STOP:
            raise exc
        elif self.strategy == DegradationStrategy.FALLBACK_RESPONSE:
            return {"role": "assistant", "content": self.fallback_message}
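        elif self.strategy == DegradationStrategy.DOWNGRADE_MODEL:
            # Signal only: call() returns this dict as-is, so the caller must retry with the cheaper model.
            return {"downgrade_to": self.downgrade_model, "retry": True}
        else:
            # SUMMARIZE_AND_CONTINUE is not implemented here and falls back to the canned response.
            return {"role": "assistant", "content": self.fallback_message}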

    def call(self, client, messages: list[dict], max_tokens: int = 512, **kwargs) -> Any:
        try:
            self.tracker.check_budget(messages, max_tokens)
        except BudgetExceededException as exc:
            return self.on_budget_exceeded(exc)

        response = client.chat.completions.create(
            model=self.tracker.model,
            messages=messages,
            max_tokens=max_tokens,
            **kwargs
        )
        usage = response.usage
        self.tracker.record_actual_cost(usage.prompt_tokens, usage.completion_tokens)
        return response.choices[0].message

This pattern makes the degradation strategy a configuration concern rather than an application-logic concern. Teams can change the strategy per deployment environment (strict in production, lenient in staging) without touching agent code.
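
A custom on_budget_exceeded callback is also a natural place to emit the structured log entry. The sketch below is one possible shape; `log_and_fallback`, the logger name, and the field names are assumptions, and the exception attributes are read with getattr so the function works with the BudgetExceededException defined earlier or any other exception.

```python
import json
import logging
from typing import Any

logger = logging.getLogger("budget")


def log_and_fallback(exc: Exception) -> dict[str, Any]:
    """Hypothetical callback: log a structured record, then return a fallback message."""
    record = {
        "event": "budget_exceeded",
        "session_id": getattr(exc, "session_id", None),
        "budget_usd": getattr(exc, "budget", None),
        "spent_usd": getattr(exc, "spent", None),
        "estimated_call_usd": getattr(exc, "estimated", None),
    }
    # One JSON object per line is easy to ship to a log aggregator.
    logger.warning(json.dumps(record))
    return {"role": "assistant", "content": "Usage limit reached for this session."}
```

It would be wired in as `BudgetedAgent(tracker, on_budget_exceeded=log_and_fallback)`, which bypasses the default strategy handler.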

LangChain integration via callbacks. LangChain’s callback system provides a clean integration point for budget enforcement. Implement a BudgetCallbackHandler that overrides on_llm_start to perform the pre-call check and on_llm_end to record actual costs. Set raise_error = True on the handler; by default LangChain logs exceptions raised inside callbacks and continues, which would quietly turn the hard limit back into a soft one. Register this handler globally in your LangChain chain initialization. Any LangChain chain or agent that uses this handler gets enforcement automatically, including tool calls that are made internally by the agent without explicit application-layer code.
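
A handler of this kind might be sketched as follows. Several things here are assumptions rather than LangChain guarantees: the token usage is read from llm_output["token_usage"], which is where the OpenAI integration reports it but which other providers may not populate; the rates are passed in by the caller; and for brevity the pre-call check blocks once recorded spend reaches the budget instead of estimating each call. `BudgetExceededError` is a stand-in for the exception class defined earlier.

```python
from typing import Any

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # keeps the sketch importable without LangChain installed
    BaseCallbackHandler = object  # type: ignore[assignment,misc]


class BudgetExceededError(RuntimeError):
    """Stand-in for the BudgetExceededException defined earlier."""


class BudgetCallbackHandler(BaseCallbackHandler):
    # Without this, LangChain logs callback exceptions instead of propagating them.
    raise_error = True

    def __init__(self, budget_usd: float, usd_per_1k_input: float, usd_per_1k_output: float):
        super().__init__()
        self.budget_usd = budget_usd
        self.usd_per_1k_input = usd_per_1k_input
        self.usd_per_1k_output = usd_per_1k_output
        self.spent_usd = 0.0

    def _check(self) -> None:
        if self.spent_usd >= self.budget_usd:
            raise BudgetExceededError(
                f"spent ${self.spent_usd:.4f} of ${self.budget_usd:.4f} budget"
            )

    def on_llm_start(self, serialized: Any, prompts: Any, **kwargs: Any) -> None:
        self._check()

    def on_chat_model_start(self, serialized: Any, messages: Any, **kwargs: Any) -> None:
        self._check()

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        # Assumption: the provider reports usage under llm_output["token_usage"].
        usage = (getattr(response, "llm_output", None) or {}).get("token_usage") or {}
        self.spent_usd += (
            usage.get("prompt_tokens", 0) / 1000 * self.usd_per_1k_input
            + usage.get("completion_tokens", 0) / 1000 * self.usd_per_1k_output
        )
```

The handler can be attached per invocation through the callbacks entry of the run config, for example `chain.invoke(inputs, config={"callbacks": [handler]})`.
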
AutoGen and CrewAI integration patterns. For multi-agent frameworks like AutoGen and CrewAI, budget enforcement needs to operate at the conversation level, not just the individual LLM call level. In AutoGen, wrap the ConversableAgent’s generate_reply method to check the session-level budget before each agent turn. In CrewAI, implement a custom LLM class that wraps the underlying provider client with budget enforcement, then pass your custom LLM class to each crew member’s constructor. Both frameworks are designed for extensibility at the LLM integration point, making this pattern straightforward to implement in 30–50 lines of code.

RunGuard for Python budget enforcement

Drop-in budget enforcement without custom infrastructure. Implementing the patterns above correctly requires Redis infrastructure, careful Lua scripting, and ongoing maintenance as your application evolves. RunGuard provides the same multi-scope budget enforcement as a managed service with a Python SDK that replaces the custom implementation with three lines of setup code:
```
import runguard
from openai import OpenAI

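# Placeholder values for illustration; in a real app these come from your request context
user_id = "user-123"
messages = [{"role": "user", "content": "Summarize the latest findings."}]

# Wrap your OpenAI client with RunGuard enforcement
openai_client = OpenAI()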
guarded_client = runguard.wrap(
    openai_client,
    session_budget_usd=1.00,
    user_daily_budget_usd=5.00,
    on_exceeded="fallback",  # or "raise", "downgrade"
    fallback_message="Usage limit reached for this session.",
    metadata={"user_id": user_id, "feature": "research_agent"}
)

# Use exactly like the standard OpenAI client — enforcement is transparent
response = guarded_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    max_tokens=512
)
```
The runguard.wrap() call returns a client that is API-compatible with the standard OpenAI client but enforces budgets, records cost attribution to RunGuard’s audit log, and handles degradation according to your on_exceeded configuration. The metadata dict is used for cost attribution in the dashboard and API — a user_id and feature tag on every call gives you per-user and per-feature cost reports out of the box.
LangChain, CrewAI, and AutoGen integrations. RunGuard ships first-class integrations for LangChain (as a callback handler), CrewAI (as a custom LLM wrapper), and AutoGen (as a monkey-patch for the message generation loop). These integrations are maintained alongside the frameworks’ API changes, so you don’t have to update your budget enforcement code every time a framework releases a breaking change. Install with pip install "runguard[langchain]" or pip install "runguard[crewai]" for the respective integration packages (the quotes keep shells such as zsh from interpreting the square brackets).
Budget status as a first-class session property. RunGuard exposes the current budget status (spent, remaining, budget, percentage used) as a property on the wrapped client, so your application code can query it to make routing decisions independent of enforcement. For example, a research agent that is approaching its budget limit might switch from a “deep dive” mode (many tool calls, long reasoning chains) to a “quick answer” mode (single tool call, short response) when it detects that 80% of the budget is consumed — providing a better user experience than a hard stop at 100%.
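
The routing decision itself is a few lines of arithmetic. In the sketch below, `choose_mode` and the mode names are hypothetical, and the spent and budget figures are plain numbers passed in by the caller; how they are read from the wrapped client is not shown because the property names are not given here.

```python
def choose_mode(spent_usd: float, budget_usd: float, threshold: float = 0.8) -> str:
    """Return "quick_answer" once the share of budget used reaches the threshold."""
    if budget_usd <= 0:
        # No usable budget: always take the cheapest path.
        return "quick_answer"
    used_fraction = spent_usd / budget_usd
    return "quick_answer" if used_fraction >= threshold else "deep_dive"


print(choose_mode(spent_usd=0.50, budget_usd=1.00))  # deep_dive
print(choose_mode(spent_usd=0.85, budget_usd=1.00))  # quick_answer
```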

Ship hard budget limits in your Python agents today

RunGuard’s Python SDK gives you production-grade token budget enforcement — multi-scope accumulators, pre-call checks, graceful degradation callbacks, and LangChain/CrewAI/AutoGen integrations — without building and maintaining the Redis infrastructure yourself. Start your free trial and wrap your first LLM client in under five minutes.
