LLM tool call budget exceeded: how to handle it gracefully
Adding a per-run budget cap to an LLM agent is the first line of defense against runaway costs. The question that comes right after: what happens when the cap fires? In most codebases, the answer is “an unhandled exception propagates to the caller, which returns a 500 to the user.” That is correct behavior for a staging environment where you want loud failures. It is catastrophic in production, where the user was mid-task and now sees an error with no context, no partial result, and no guidance on what to do next. This page covers the four graceful-degradation patterns for budget-exceeded handling — summarize-what-we-have, checkpoint-and-resume, degrade-to-cheaper-model, and user-visible soft limit — and shows how RunGuard’s BudgetExceededError carries enough context to implement each one.
What the BudgetExceededError needs to tell you
A bare BudgetExceededError that says “limit exceeded” is almost useless for graceful handling. To implement any of the four patterns below, your error object needs to carry:
- Accumulated cost (e.accumulated_usd): how much was spent before the cap fired. Determines whether the partial result is valuable.
- Budget ceiling (e.max_usd): what the limit was. Needed to communicate the limit to the user in soft-limit mode.
- Tool call history (e.tool_calls): the list of tool calls made up to the trip, with their results. Needed for summarize-what-we-have and checkpoint-and-resume.
- Context snapshot (e.context_snapshot): the messages list at the time of the trip. Needed for checkpoint-and-resume, where you want to restart from the last successful tool result rather than from scratch.
- Which call triggered it (e.triggering_tool): the tool call that would have pushed cost over the limit. Useful for per-tool cost auditing.
RunGuard’s BudgetExceededError carries all five. The examples below use these fields.
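As a rough illustration of the shape described above, the five fields could be modeled as follows. `SketchBudgetExceededError` and `ToolCallRecord` are hypothetical stand-ins written for this page, not RunGuard's actual classes; only the attribute names come from the list above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCallRecord:
    """One tool call made before the budget tripped (hypothetical shape)."""
    tool_name: str
    args: dict[str, Any]
    result: str


class SketchBudgetExceededError(Exception):
    """Illustrative stand-in; attribute names mirror the list above."""

    def __init__(self, accumulated_usd, max_usd, tool_calls,
                 context_snapshot, triggering_tool):
        super().__init__(
            f"budget exceeded: ${accumulated_usd:.2f} spent of ${max_usd:.2f}"
        )
        self.accumulated_usd = accumulated_usd    # spent before the cap fired
        self.max_usd = max_usd                    # the configured ceiling
        self.tool_calls = tool_calls              # list of ToolCallRecord
        self.context_snapshot = context_snapshot  # messages list at trip time
        self.triggering_tool = triggering_tool    # call that would have gone over


# Example values are placeholders, not real measurements.
err = SketchBudgetExceededError(
    accumulated_usd=0.48,
    max_usd=0.50,
    tool_calls=[ToolCallRecord("web_search", {"q": "llm pricing"}, "3 results")],
    context_snapshot=[{"role": "user", "content": "Compare LLM pricing"}],
    triggering_tool="fetch_page",
)
```

A handler that receives an object like `err` has everything it needs to decide between the patterns below without re-querying the agent's state.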
Pattern 1: Summarize what we have
When the budget trips, the agent has already gathered some information. Rather than returning an error, send the accumulated tool results to the LLM with a “summarize what you know so far” prompt. This produces a partial but useful response at the cost of one additional LLM call, and a cheap, tool-free one at that. RunGuard’s budget leaves room for it because the BudgetExceededError fires before the next expensive call, not after it.
from runguard import BudgetTracker, BudgetExceededError


async def run_agent_with_fallback(query: str, llm_client) -> str:
    budget = BudgetTracker(max_usd=0.50)
    try:
        # run_full_agent is your own agent loop, charged against `budget`
        return await run_full_agent(query, budget)
    except BudgetExceededError as e:
        # Build a summarization context from what was collected
        tool_results_summary = "\n".join(
            # str() guards against non-string tool results before slicing
            f"[{tc.tool_name}({tc.args})] → {str(tc.result)[:200]}"
            for tc in e.tool_calls
        )
        # One cheap call to summarize: no tools, no budget enforcement needed
        summary = await llm_client.complete([
            {"role": "system", "content":
                "Summarize the following research results for the user. "
                "Note that research was cut short due to a cost limit."},
            {"role": "user", "content":
                f"Original question: {query}\n\nGathered so far:\n{tool_results_summary}"},
        ], model="gpt-4.1-mini")  # cheap model for the fallback
        return f"[Partial result: research capped at ${e.max_usd:.2f}]\n\n{summary}"
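The "fires before the next expensive call" behavior can be pictured as a pre-flight check: the tracker compares the projected total against the ceiling before the call runs, so the money is never spent. The sketch below is a hypothetical tracker written for illustration (`PreflightTracker`, `reserve`, and `SketchBudgetError` are assumed names), not RunGuard's implementation.

```python
class SketchBudgetError(Exception):
    """Hypothetical error raised before an over-budget call is made."""

    def __init__(self, accumulated_usd: float, max_usd: float, triggering_tool: str):
        super().__init__(f"{triggering_tool} would exceed ${max_usd:.2f}")
        self.accumulated_usd = accumulated_usd
        self.max_usd = max_usd
        self.triggering_tool = triggering_tool


class PreflightTracker:
    """Checks the projected total before a call instead of after it."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.accumulated_usd = 0.0

    def reserve(self, tool_name: str, estimated_usd: float) -> None:
        # Raise first; only record the spend if the call is allowed to run.
        if self.accumulated_usd + estimated_usd > self.max_usd:
            raise SketchBudgetError(self.accumulated_usd, self.max_usd, tool_name)
        self.accumulated_usd += estimated_usd


tracker = PreflightTracker(max_usd=0.50)
tracker.reserve("web_search", 0.20)
tracker.reserve("web_search", 0.20)
try:
    tracker.reserve("deep_crawl", 0.25)  # 0.40 + 0.25 > 0.50, so this raises
except SketchBudgetError as e:
    tripped = e
```

Because the rejected call is never charged, the tracker still shows the pre-trip total, which is what leaves headroom for a small fallback call.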
Pattern 2: Checkpoint and resume
For long-running agents (research tasks, document processing), serialize the context snapshot on budget trip and resume from that checkpoint rather than restarting. This is particularly useful when the task can be split across API calls: the first run gathers N pieces of information, and the second run continues from that state with a fresh per-run budget.
import json

from runguard import BudgetExceededError


# `cache` is assumed to be an application-provided key-value store
# (for example a Redis wrapper) exposing get, set(..., ttl=...), and delete.
async def run_with_checkpoint(query: str, checkpoint_key: str, budget_per_run: float) -> dict:
    # Load prior checkpoint if it exists
    prior = cache.get(checkpoint_key)
    initial_messages = json.loads(prior)["messages"] if prior else None
    try:
        result = await run_agent(query, initial_messages, max_usd=budget_per_run)
        cache.delete(checkpoint_key)  # done, so clear the checkpoint
        return {"status": "complete", "result": result}
    except BudgetExceededError as e:
        # Serialize the context snapshot for the next run
        checkpoint = json.dumps({
            "messages": e.context_snapshot,
            "accumulated_usd": e.accumulated_usd,
            "tool_calls_made": len(e.tool_calls),
        })
        cache.set(checkpoint_key, checkpoint, ttl=3600)
        return {
            "status": "partial",
            "message": f"Research checkpoint saved. Call again to continue (${e.accumulated_usd:.2f} spent).",
            "checkpoint_key": checkpoint_key,
        }
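The checkpoint example relies on a `cache` object with `get`, `set(..., ttl=...)`, and `delete`. In production that would typically be Redis or a similar shared store; for local experiments, a minimal in-memory stand-in might look like the following. `InMemoryTTLCache` is a hypothetical helper for this page, and it does not share state across processes.

```python
import time
from typing import Optional


class InMemoryTTLCache:
    """Tiny key-value store with per-key expiry (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict expired checkpoints
            return None
        return value

    def set(self, key: str, value: str, ttl: float) -> None:
        self._items[key] = (self._clock() + ttl, value)

    def delete(self, key: str) -> None:
        self._items.pop(key, None)  # deleting a missing key is a no-op


cache = InMemoryTTLCache()
```

The one-hour TTL in the example then acts as an upper bound on how long a user has to come back and continue the task before the checkpoint disappears.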
Pattern 3: Degrade to a cheaper model
Some agent tasks can be partially completed with a cheaper model that has lower accuracy but costs 10x less. When the budget trips on the expensive model, restart the remaining steps with the cheaper model using the accumulated context as a compressed starting point.
from runguard import BudgetExceededError, BudgetTracker


async def run_tiered_agent(query: str):
    # Phase 1: expensive model, strict budget
    expensive_budget = BudgetTracker(max_usd=0.30)
    try:
        return await run_agent(query, model="claude-opus-4-7", budget=expensive_budget)
    except BudgetExceededError as e:
        # Phase 2: cheaper model with its own budget, picks up from context snapshot.
        # The dollar cap is smaller, but it buys more calls at the cheaper rate.
        fallback_budget = BudgetTracker(max_usd=0.20)
        # Compress the history to reduce token cost on the cheap model
        # (compress_messages is an application-defined helper)
        compressed_context = await compress_messages(e.context_snapshot)
        return await run_agent(
            query,
            model="claude-haiku-4-5",
            budget=fallback_budget,
            initial_messages=compressed_context,
        )
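`compress_messages` is not defined in the example above. One simple way to approximate it, assuming messages are dicts with `role` and string `content` keys, is to keep the system prompt plus the most recent messages and truncate long contents. The sketch below is a stand-in under those assumptions (the `keep_last` and `max_chars` parameters are invented for illustration); a real implementation might instead summarize older turns with a cheap model.

```python
import asyncio


async def compress_messages(messages, keep_last=6, max_chars=500):
    """Keep system messages plus the last `keep_last` others, truncating long content."""
    system = [m for m in messages if m.get("role") == "system"]
    others = [m for m in messages if m.get("role") != "system"]
    recent = others[-keep_last:] if keep_last > 0 else []

    def shorten(message):
        content = message.get("content") or ""
        if len(content) > max_chars:
            content = content[:max_chars] + " [truncated]"
        return {**message, "content": content}  # copy, so the snapshot is untouched

    return [shorten(m) for m in system + recent]


# Placeholder history: one system prompt and ten long tool-style messages.
history = [{"role": "system", "content": "You are a research agent."}] + [
    {"role": "user", "content": "x" * 1000} for _ in range(10)
]
compressed = asyncio.run(compress_messages(history, keep_last=3, max_chars=100))
```

Dropping messages this way can separate a tool result from the call that produced it, and some provider APIs reject such histories, so a production version would likely need to trim on tool-call boundaries.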
Pattern 4: User-visible soft limit
For interactive agents embedded in a UI, the most user-friendly behavior is a soft limit: warn the user that the task is approaching its cost limit, then let them decide whether to continue (consuming more budget) or stop and take the partial result. RunGuard supports this with a configurable warning threshold that fires before the hard limit.
from runguard import BudgetTracker, BudgetWarning, BudgetExceededError

budget = BudgetTracker(
    max_usd=1.00,
    warn_at_fraction=0.75,  # warn at 75% ($0.75)
    on_warn=lambda e: send_ui_event("budget_warning", {
        "spent": e.accumulated_usd,
        "remaining": e.max_usd - e.accumulated_usd,
        "message": f"Research has used ${e.accumulated_usd:.2f} of your ${e.max_usd:.2f} limit.",
    })
)


# The UI can call /api/agent/continue-budget to raise the limit
# or /api/agent/stop to cancel and take the partial result
async def handle_budget_increase(session_id: str, additional_usd: float):
    budget = session_budgets[session_id]
    budget.extend(additional_usd)  # raise the ceiling for this run
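To make the warn-then-extend flow concrete, here is a minimal, hypothetical tracker that mimics the behavior described above: it calls `on_warn` once when spend crosses the warning fraction and lets the caller raise the ceiling with `extend`. The class name `SoftLimitTracker` and the `record` method are assumptions for this sketch, and re-arming the warning after an extension is a design choice made here; RunGuard's internals may differ.

```python
class SoftLimitTracker:
    """Illustrative soft-limit tracker: warn once, allow the ceiling to grow."""

    def __init__(self, max_usd, warn_at_fraction=0.75, on_warn=None):
        self.max_usd = max_usd
        self.warn_at_fraction = warn_at_fraction
        self.on_warn = on_warn
        self.accumulated_usd = 0.0
        self._warned = False

    def record(self, cost_usd):
        """Add a completed call's cost and fire the warning on first crossing."""
        self.accumulated_usd += cost_usd
        crossed = self.accumulated_usd >= self.max_usd * self.warn_at_fraction
        if crossed and not self._warned and self.on_warn:
            self._warned = True
            self.on_warn(self)

    def extend(self, additional_usd):
        """Raise the ceiling and re-arm the warning for the new threshold."""
        self.max_usd += additional_usd
        self._warned = False


events = []
budget = SoftLimitTracker(
    max_usd=1.00,
    on_warn=lambda b: events.append(round(b.accumulated_usd, 2)),
)
budget.record(0.50)  # below the 75% threshold, no warning
budget.record(0.30)  # 0.80 >= 0.75, warning fires once
budget.record(0.05)  # already warned, no second event
budget.extend(1.00)  # ceiling is now 2.00, so the threshold moves to 1.50
```

Firing the warning only once per threshold keeps the UI from being flooded with events while the user is still deciding whether to continue.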
Handling patterns: tradeoffs summary
| Pattern | Best for | Extra cost on trip | User experience |
|---|---|---|---|
| Summarize what we have | Research + Q&A agents | 1 cheap LLM call | Partial answer with budget note |
| Checkpoint and resume | Long-running document tasks | None (serialized to cache) | Continue in next request |
| Degrade to cheaper model | Quality-vs-cost tradeoff tasks | Cheap model cost for remainder | Slightly lower quality answer |
| User-visible soft limit | Interactive chat agents | None until user continues | User chooses to extend or stop |
Setting the right budget ceiling
The right per-run budget ceiling depends on three factors: your average query complexity (how many tool calls a typical task requires), your LLM pricing (input vs. output token cost at the models you use), and your gross margin targets (what percentage of LLM cost you can absorb at your pricing tier).
A practical starting point: run a set of representative queries through your agent with no budget limit (five is a workable minimum, though more gives a steadier estimate), log the actual cost per run, take the 95th percentile, and set your ceiling at 2× that value. With only five samples, the 95th percentile is effectively the most expensive run. A ceiling set this way should rarely fire on legitimate queries while still catching loops and runaway edge cases. Adjust down as you see the distribution in production via RunGuard’s dashboard, which shows cost-per-run histograms and trip frequency by tool and model.
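The arithmetic is simple enough to script. The sketch below takes logged per-run costs, finds the 95th percentile with the nearest-rank method, and doubles it. The function name `suggest_ceiling` is hypothetical, and the cost values are placeholder numbers for illustration, not measurements.

```python
import math


def suggest_ceiling(costs_usd, percentile=95, multiplier=2.0):
    """Nearest-rank percentile of observed run costs, times a safety multiplier."""
    if not costs_usd:
        raise ValueError("need at least one observed run cost")
    ordered = sorted(costs_usd)
    rank = math.ceil(percentile / 100 * len(ordered))  # 1-based nearest rank
    p_value = ordered[max(rank, 1) - 1]
    return round(p_value * multiplier, 2)


observed = [0.04, 0.06, 0.05, 0.11, 0.07]  # placeholder costs from five test runs
ceiling = suggest_ceiling(observed)        # nearest rank picks 0.11, doubled to 0.22
```

With a sample this small the nearest-rank percentile lands on the maximum, so the result is best treated as a first guess to be revisited once production data accumulates.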
Handle budget-exceeded errors gracefully today
RunGuard’s Python SDK installs with pip install runguard. The BudgetTracker carries full context in BudgetExceededError for all four graceful-degradation patterns. Framework integrations for LangChain, PydanticAI, AutoGen, Phidata, and Haystack are in the SDK docs.