AI agent token budget enforcement in Python: code patterns for hard spending limits

A soft limit is a threshold that triggers an alert once it has been crossed. A hard limit is a threshold that prevents the crossing from happening in the first place. In LLM cost management, this is the difference between reporting an overrun and preventing one: a soft limit on a runaway agent session tells you the session has already consumed $12 when it should have stopped at $1. A hard limit stops the session at $1.02, logs a structured error, and returns a graceful degradation response to the user.

Most Python LLM applications start with soft limits, such as a Slack alert or a dashboard threshold, because they’re easier to implement. They stay with soft limits until the first incident in which a session consumes 40× its expected budget before an engineer notices the alert.

This page is a practical, code-first guide to implementing hard budget enforcement in Python: pre-call cost estimation and rejection, thread-safe running total accumulation, budget-exceeded callbacks with graceful degradation, and integration patterns for LangChain, CrewAI, AutoGen, and the raw OpenAI SDK.

Why soft limits fail in production

The pre-call budget check pattern
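
As a minimal sketch of pre-call cost estimation and rejection, the block below estimates the worst-case cost of a call before it is sent and refuses the call if the projected total would exceed the limit. The names `PreCallBudget`, `BudgetExceededError`, and `estimate_cost_usd` are hypothetical, and the per-1K-token prices are caller-supplied placeholders, not real provider rates.

```python
from dataclasses import dataclass


class BudgetExceededError(Exception):
    """Raised before a call is sent when it could push spend past the limit."""


@dataclass
class PreCallBudget:
    """Hypothetical single-session budget; not thread-safe on its own."""
    limit_usd: float
    spent_usd: float = 0.0

    def check(self, estimated_cost_usd: float) -> None:
        # Reject the call up front instead of alerting after the fact.
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.limit_usd:
            raise BudgetExceededError(
                f"projected spend ${projected:.4f} exceeds "
                f"limit ${self.limit_usd:.4f}"
            )

    def record(self, actual_cost_usd: float) -> None:
        # Call this with the real cost once the response reports token usage.
        self.spent_usd += actual_cost_usd


def estimate_cost_usd(
    prompt_tokens: int,
    max_output_tokens: int,
    input_price_per_1k: float,   # assumption: supplied by the caller
    output_price_per_1k: float,  # assumption: supplied by the caller
) -> float:
    # Worst case: assume the model uses its full output allowance.
    return (
        prompt_tokens / 1000 * input_price_per_1k
        + max_output_tokens / 1000 * output_price_per_1k
    )
```

A caller would run `budget.check(estimate_cost_usd(...))` immediately before each request and `budget.record(...)` immediately after it. Estimating with the maximum output length is deliberately pessimistic: it can reject a call that would have fit, but it bounds any overshoot to the gap between the estimate and the actual cost.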

Thread-safe running total accumulation
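
One way to keep a running total consistent across threads is to make the check and the increment a single atomic step under a lock, so that two concurrent calls cannot both pass the check and then jointly overspend. The sketch below is an in-process illustration only: the class name `ThreadSafeBudget` and its methods are hypothetical, and amounts are held as integer micro-dollars (an assumption made here to avoid floating-point drift). A multi-process deployment would need a shared store instead of `threading.Lock`.

```python
import threading


class ThreadSafeBudget:
    """Hypothetical in-process accumulator; amounts are integer micro-USD."""

    def __init__(self, limit_micro_usd: int) -> None:
        self._limit = limit_micro_usd
        self._spent = 0
        self._lock = threading.Lock()

    def try_reserve(self, amount_micro_usd: int) -> bool:
        # Check and increment under one lock so the pair is atomic.
        with self._lock:
            if self._spent + amount_micro_usd > self._limit:
                return False
            self._spent += amount_micro_usd
            return True

    def settle(self, reserved_micro_usd: int, actual_micro_usd: int) -> None:
        # Replace the pessimistic reservation with the actual cost.
        with self._lock:
            self._spent += actual_micro_usd - reserved_micro_usd

    @property
    def spent_micro_usd(self) -> int:
        with self._lock:
            return self._spent


if __name__ == "__main__":
    budget = ThreadSafeBudget(limit_micro_usd=1_000_000)  # $1.00
    results = []

    def worker() -> None:
        # list.append is atomic in CPython, so no extra lock is needed here.
        results.append(budget.try_reserve(100_000))  # $0.10 per call

    threads = [threading.Thread(target=worker) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(results), budget.spent_micro_usd)
```

Reserving the estimated cost before the call and settling to the actual cost afterwards means in-flight requests count against the limit while they are still running.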

Budget exceeded callbacks and graceful degradation
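
The following is a hedged sketch of a budget-exceeded callback with graceful degradation: when the check fails, the wrapper logs a structured (JSON) error, skips the LLM call entirely, and returns whatever the callback produces. The names `guarded_call`, `default_fallback`, and `BudgetExceededError`, as well as the log field names, are all hypothetical and chosen for illustration.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("budget")


class BudgetExceededError(Exception):
    """Carries the details a callback needs to build a fallback response."""

    def __init__(self, scope: str, limit_usd: float,
                 spent_usd: float, requested_usd: float) -> None:
        super().__init__(f"{scope} budget exceeded")
        self.scope = scope
        self.limit_usd = limit_usd
        self.spent_usd = spent_usd
        self.requested_usd = requested_usd


def default_fallback(err: BudgetExceededError) -> str:
    # Graceful degradation: a user-facing message instead of a stack trace.
    return (
        f"This {err.scope} has reached its usage limit. "
        "Please start a new session or contact support."
    )


def guarded_call(
    call: Callable[[], str],
    *,
    spent_usd: float,
    limit_usd: float,
    estimated_cost_usd: float,
    scope: str = "session",
    on_exceeded: Callable[[BudgetExceededError], str] = default_fallback,
) -> str:
    if spent_usd + estimated_cost_usd > limit_usd:
        err = BudgetExceededError(scope, limit_usd, spent_usd,
                                  estimated_cost_usd)
        # Structured log line so the rejection is machine-searchable.
        logger.error(json.dumps({
            "event": "budget_exceeded",
            "scope": scope,
            "limit_usd": limit_usd,
            "spent_usd": spent_usd,
            "requested_usd": estimated_cost_usd,
        }))
        return on_exceeded(err)  # the LLM call is never made
    return call()
```

Passing a different `on_exceeded` callable lets each caller decide how to degrade, for example by returning a cached answer or a shorter summary, without changing the enforcement logic.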

RunGuard for Python budget enforcement

Ship hard budget limits in your Python agents today

RunGuard’s Python SDK gives you production-grade token budget enforcement — multi-scope accumulators, pre-call checks, graceful degradation callbacks, and LangChain/CrewAI/AutoGen integrations — without building and maintaining the Redis infrastructure yourself. Start your free trial and wrap your first LLM client in under five minutes.
