LLM tool call budget exceeded: how to handle it gracefully

Adding a per-run budget cap to an LLM agent is the first line of defense against runaway costs. The question that comes right after: what happens when the cap fires? In most codebases, the answer is “an unhandled exception propagates to the caller, which returns a 500 to the user.” That is correct behavior for a staging environment where you want loud failures. It is catastrophic in production, where the user was mid-task and now sees an error with no context, no partial result, and no guidance on what to do next.

This page covers the four graceful-degradation patterns for budget-exceeded handling (summarize-what-we-have, checkpoint-and-resume, degrade-to-cheaper-model, and user-visible soft limit) and shows how RunGuard’s BudgetExceededError carries enough context to implement each one.

What the BudgetExceededError needs to tell you

A bare BudgetExceededError that says “limit exceeded” is almost useless for graceful handling. To implement any of the four patterns below, your error object needs to carry:

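RunGuard’s BudgetExceededError carries all five. The examples below read accumulated_usd (spend so far), max_usd (the configured ceiling), tool_calls (the calls already made, each with a tool name, arguments, and result), and context_snapshot (the message history at the moment the budget tripped) from it.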
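To make the idea of a context-carrying error concrete, here is a minimal sketch of what such an exception could look like. It is an illustrative stand-in, not RunGuard’s actual class: the names BudgetExceededSketch and ToolCallRecord are hypothetical, and only the four attributes that the examples on this page read are modeled.

```python
from typing import Any


class ToolCallRecord:
    """One completed tool call (hypothetical shape, for illustration only)."""

    def __init__(self, tool_name: str, args: dict[str, Any], result: str) -> None:
        self.tool_name = tool_name
        self.args = args
        self.result = result


class BudgetExceededSketch(Exception):
    """Illustrative stand-in for a budget error that carries run context."""

    def __init__(
        self,
        accumulated_usd: float,
        max_usd: float,
        tool_calls: list[ToolCallRecord],
        context_snapshot: list[dict[str, Any]],
    ) -> None:
        super().__init__(
            f"budget exceeded: spent ${accumulated_usd:.2f} of ${max_usd:.2f}"
        )
        self.accumulated_usd = accumulated_usd    # spend at the moment of the trip
        self.max_usd = max_usd                    # configured ceiling
        self.tool_calls = tool_calls              # what the agent already gathered
        self.context_snapshot = context_snapshot  # message history for resuming
```

A handler that catches an error shaped like this can build a partial answer from tool_calls or persist context_snapshot, instead of only logging a message string.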

Pattern 1: Summarize what we have

When the budget trips, the agent has already gathered some information. Rather than returning an error, send the accumulated tool results to the LLM with a “summarize what you know so far” prompt. This produces a partial but useful response. The cost is one additional LLM call (but a cheap, no-tools call), which RunGuard’s budget allows because the BudgetExceededError fires before the next expensive call, not after it.

from runguard import BudgetTracker, BudgetExceededError

async def run_agent_with_fallback(query: str, llm_client) -> str:
    budget = BudgetTracker(max_usd=0.50)

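    try:
        # run_full_agent is your own agent loop (not shown); it charges `budget` as it runs
        return await run_full_agent(query, budget)

    except BudgetExceededError as e:
        # Build a summarization context from what was collected
        tool_results_summary = "\n".join(
            # str() guards against non-string tool results before truncating
            f"[{tc.tool_name}({tc.args})] → {str(tc.result)[:200]}"
            for tc in e.tool_calls
        )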

        # One cheap call to summarize — no tools, no budget enforcement needed
        summary = await llm_client.complete([
            {"role": "system", "content":
                "Summarize the following research results for the user. "
                "Note that research was cut short due to a cost limit."},
            {"role": "user", "content":
                f"Original question: {query}\n\nGathered so far:\n{tool_results_summary}"},
        ], model="gpt-4.1-mini")  # cheap model for the fallback

        return f"[Partial result — research capped at ${e.max_usd:.2f}]\n\n{summary}"
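The claim that the error fires before the next expensive call, not after it, is what keeps spend at or under the cap when the fallback runs. The sketch below shows one way a pre-call check could work. PreCallBudget, OverBudget, reserve, and record are hypothetical names used for illustration; this is not a description of RunGuard’s internals.

```python
class OverBudget(Exception):
    """Raised when the next call would exceed the ceiling (hypothetical)."""


class PreCallBudget:
    """Hypothetical tracker that rejects a call before it is made."""

    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.accumulated_usd = 0.0

    def reserve(self, estimated_usd: float) -> None:
        """Raise if the next call's estimated cost would pass the ceiling."""
        if self.accumulated_usd + estimated_usd > self.max_usd:
            raise OverBudget(
                f"next call (~${estimated_usd:.2f}) would exceed ${self.max_usd:.2f}"
            )

    def record(self, actual_usd: float) -> None:
        """Add the real cost once a call has completed."""
        self.accumulated_usd += actual_usd
```

Because the check runs before the call, the rejected call is never paid for, and accumulated spend stays at or below the ceiling when the handler takes over.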

Pattern 2: Checkpoint and resume

For long-running agents (research tasks, document processing), serialize the context snapshot on budget trip and resume from that checkpoint rather than restarting. This is particularly useful when the task can be split across API calls: the first run gathers N pieces of information, and the second run continues from that state with a fresh per-run budget.

import json
from runguard import BudgetExceededError

async def run_with_checkpoint(query: str, checkpoint_key: str, budget_per_run: float):
    # `cache` is your own key-value store client (e.g. a Redis wrapper), not shown here
    # Load prior checkpoint if it exists
    prior = cache.get(checkpoint_key)
    initial_messages = json.loads(prior)["messages"] if prior else None

    try:
        result = await run_agent(query, initial_messages, max_usd=budget_per_run)
        cache.delete(checkpoint_key)  # done — clear the checkpoint
        return {"status": "complete", "result": result}

    except BudgetExceededError as e:
        # Serialize the context snapshot for the next run
        checkpoint = json.dumps({
            "messages": e.context_snapshot,
            "accumulated_usd": e.accumulated_usd,
            "tool_calls_made": len(e.tool_calls),
        })
        cache.set(checkpoint_key, checkpoint, ttl=3600)

        return {
            "status": "partial",
            "message": f"Research checkpoint saved. Call again to continue (${e.accumulated_usd:.2f} spent).",
            "checkpoint_key": checkpoint_key,
        }
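The checkpoint example assumes a cache object with get, set (with a ttl), and delete. For local testing, a small in-memory stand-in is enough. The sketch below is hypothetical and single-process only; in production you would likely back this with a shared store so that a later request can find the checkpoint.

```python
import time
from typing import Optional


class InMemoryCache:
    """Hypothetical single-process cache with per-key expiry, for local testing."""

    def __init__(self) -> None:
        # key -> (expiry time on the monotonic clock, stored value)
        self._items: dict[str, tuple[float, str]] = {}

    def set(self, key: str, value: str, ttl: float) -> None:
        self._items[key] = (time.monotonic() + ttl, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Expired: drop the entry so the next run starts fresh
            del self._items[key]
            return None
        return value

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


cache = InMemoryCache()
```

Since the checkpoint is stored with json.dumps, the context snapshot has to contain only JSON-serializable values; if your message history holds custom objects, convert them to plain dicts before saving.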

Pattern 3: Degrade to a cheaper model

Some agent tasks can be finished acceptably by a cheaper model that is less accurate but costs around 10x less. When the budget trips on the expensive model, run the remaining steps on the cheaper model, using a compressed version of the accumulated context as the starting point.

from runguard import BudgetExceededError, BudgetTracker

async def run_tiered_agent(query: str):
    # Phase 1: expensive model, strict budget
    expensive_budget = BudgetTracker(max_usd=0.30)
    try:
        return await run_agent(query, model="claude-opus-4-7", budget=expensive_budget)

    except BudgetExceededError as e:
        # Phase 2: cheaper model with its own, smaller budget; picks up from the context snapshot
        fallback_budget = BudgetTracker(max_usd=0.20)

        # Compress the history to reduce token cost on the cheap model
        # (compress_messages is your own helper, not part of the SDK import above)
        compressed_context = await compress_messages(e.context_snapshot)

        return await run_agent(
            query,
            model="claude-haiku-4-5",
            budget=fallback_budget,
            initial_messages=compressed_context,
        )
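The example above calls a compress_messages helper without defining it. One simple way to write it, sketched here as an assumption rather than a recommended algorithm, is to keep the system messages, keep only the most recent turns, and clip long message bodies. The parameters keep_last and max_chars are hypothetical.

```python
from typing import Any


async def compress_messages(
    messages: list[dict[str, Any]],
    keep_last: int = 6,
    max_chars: int = 500,
) -> list[dict[str, Any]]:
    """Keep system messages plus the last few turns, clipping long contents."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    # Guard keep_last == 0, because rest[-0:] would return the whole list
    recent = rest[-keep_last:] if keep_last > 0 else []

    def clip(message: dict[str, Any]) -> dict[str, Any]:
        content = message.get("content")
        if isinstance(content, str) and len(content) > max_chars:
            # Return a copy so the original snapshot is left untouched
            return {**message, "content": content[:max_chars] + " [truncated]"}
        return message

    return system + [clip(m) for m in recent]
```

Dropping older messages can separate a tool result from the call that produced it, which some chat APIs reject, so check what your provider requires before trimming this aggressively. A summarization call is an alternative, at the price of one more LLM request.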

Pattern 4: User-visible soft limit

For interactive agents embedded in a UI, the most user-friendly behavior is a soft limit: warn the user that the task is approaching its cost limit, then let them decide whether to continue (consuming more budget) or stop and take the partial result. RunGuard supports this with a configurable warning threshold that fires before the hard limit.

from runguard import BudgetTracker, BudgetWarning, BudgetExceededError

budget = BudgetTracker(
    max_usd=1.00,
    warn_at_fraction=0.75,  # warn at 75% ($0.75)
    on_warn=lambda e: send_ui_event("budget_warning", {
        "spent": e.accumulated_usd,
        "remaining": e.max_usd - e.accumulated_usd,
        "message": f"Research has used ${e.accumulated_usd:.2f} of your ${e.max_usd:.2f} limit.",
    })
)

# The UI can call /api/agent/continue-budget to raise the limit
# or /api/agent/stop to cancel and take the partial result

async def handle_budget_increase(session_id: str, additional_usd: float):
    # session_budgets is your own mapping of session id -> BudgetTracker (not shown)
    budget = session_budgets[session_id]
    budget.extend(additional_usd)  # raise the ceiling for this run
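To show how a warning threshold and an extendable ceiling fit together, here is a minimal sketch of the mechanism. SoftLimitBudget and add_cost are hypothetical names, and the re-arming rule in extend is an assumption made for this example; RunGuard’s own behavior may differ.

```python
from typing import Callable, Optional


class SoftLimitBudget:
    """Hypothetical tracker: warns once at a fraction of the ceiling."""

    def __init__(
        self,
        max_usd: float,
        warn_at_fraction: float,
        on_warn: Optional[Callable[["SoftLimitBudget"], None]] = None,
    ) -> None:
        self.max_usd = max_usd
        self.warn_at_fraction = warn_at_fraction
        self.on_warn = on_warn
        self.accumulated_usd = 0.0
        self._warned = False

    def add_cost(self, usd: float) -> None:
        """Record spend and fire the warning the first time the threshold is crossed."""
        self.accumulated_usd += usd
        threshold = self.max_usd * self.warn_at_fraction
        if not self._warned and self.accumulated_usd >= threshold:
            self._warned = True
            if self.on_warn is not None:
                self.on_warn(self)

    def extend(self, additional_usd: float) -> None:
        """Raise the ceiling; re-arm the warning if spend is under the new threshold."""
        self.max_usd += additional_usd
        if self.accumulated_usd < self.max_usd * self.warn_at_fraction:
            self._warned = False
```

Firing the callback only once per threshold keeps the UI from showing a warning on every call after 75%, and re-arming after an extension means the user gets one more warning before the raised ceiling is reached.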

Handling patterns: tradeoffs summary

| Pattern | Best for | Extra cost on trip | User experience |
|---|---|---|---|
| Summarize what we have | Research + Q&A agents | 1 cheap LLM call | Partial answer with budget note |
| Checkpoint and resume | Long-running document tasks | None (serialized to cache) | Continue in next request |
| Degrade to cheaper model | Quality-vs-cost tradeoff tasks | Cheap model cost for remainder | Slightly lower quality answer |
| User-visible soft limit | Interactive chat agents | None until user continues | User chooses to extend or stop |

Setting the right budget ceiling

The right per-run budget ceiling depends on three factors: your average query complexity (how many tool calls a typical task requires), your LLM pricing (input vs. output token cost at the models you use), and your gross margin targets (what percentage of LLM cost you can absorb at your pricing tier).

A practical starting point: instrument five representative queries through your agent with no budget limit, log the actual cost per run, take the 95th percentile (with a sample this small, that is effectively the most expensive run), and set your ceiling at 2× that value. The ceiling should then rarely fire on legitimate queries while still catching loops and runaway edge cases. Adjust down as you see the distribution in production via RunGuard’s dashboard, which shows cost-per-run histograms and trip frequency by tool and model.
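The arithmetic behind that starting point can be sketched in a few lines. The function names and the sample costs below are made up for illustration, and the nearest-rank method is just one of several common percentile definitions.

```python
import math


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Return the nearest-rank percentile (pct in 1..100) of a non-empty list."""
    if not values:
        raise ValueError("need at least one logged cost")
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_ceiling(costs_usd: list[float], pct: int = 95, multiplier: float = 2.0) -> float:
    """Suggest a per-run ceiling as a multiple of the chosen percentile."""
    return nearest_rank_percentile(costs_usd, pct) * multiplier


# Hypothetical logged costs (USD) from five unconstrained runs
logged_costs = [0.08, 0.12, 0.10, 0.31, 0.15]
ceiling = suggest_ceiling(logged_costs)
```

With only five samples the 95th percentile lands on the largest value, so one unusually expensive run sets the ceiling. Logging more runs before settling on a number gives a more stable estimate.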

Handle budget-exceeded errors gracefully today

RunGuard’s Python SDK installs with pip install runguard. The BudgetExceededError raised by BudgetTracker carries the context needed for all four graceful-degradation patterns. Framework integrations for LangChain, PydanticAI, AutoGen, Phidata, and Haystack are in the SDK docs.

Get started with RunGuard — or see the budget tracker in action for runaway cost prevention, infinite loop detection, and LangChain specifically.