A runtime budget alert for OpenAI AgentKit agents
AgentKit ships a tracing dashboard that shows you what every Runner.run() spent after the fact. That is a finance tool. A budget alert is a different tool: it sits in-process, watches the cumulative dollar count of the run that’s happening right now, and stops the next turn before the LLM call goes out when the cap is crossed. max_turns bounds the loop; it does not bound the bill. This page shows the runtime budget breaker we ship and how it slots into an AgentKit Runner.run() in eight lines of Python.
Where the dollars actually accumulate inside an AgentKit run
- The planner LLM call on every turn. Each turn is a Responses API call against the agent’s configured model with the running conversation history, the tool schemas, the system instructions, and the most recent tool outputs. Once the agent has done six or seven turns the input grows past the system instructions plus the original prompt — the input-token count climbs steadily even when the assistant output stays small. On gpt-4o, a mid-run turn easily lands in the 8–20K input-token range, dominated by accumulated tool outputs.
- Parallel tool fan-outs. When the assistant emits four tool calls in one turn, AgentKit dispatches them in parallel and feeds all four outputs back into the next turn. The next turn’s prompt grows by the sum of all four tool outputs. A search tool that returns 8K of HTML, called in parallel with three sibling tools, multiplies the next turn’s input cost.
- Handoff chains. A handoff swaps the active agent and its instructions, but the conversation thread carries forward and the next turn’s planner LLM call still includes the prior turns. A two-handoff run is three planners worth of context-growth on a single thread.
- Tool-call retries on validation errors. When the model emits arguments that don’t match the tool’s pydantic schema, AgentKit feeds the validation error back as a tool result and asks the model to retry. The retry is a fresh planner LLM call. A subtly mistyped tool argument can chew three turns before the model corrects course — or fail to correct at all and chew until max_turns.
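The per-turn arithmetic behind these bullets is just tokens times price. A hedged sketch — the per-million-token prices below are illustrative placeholders, not current OpenAI pricing; check the live price sheet before relying on the numbers:

```python
# Illustrative per-turn planner cost. The prices are placeholders, not
# a quote of OpenAI's current price sheet.
PRICE_PER_MTOK = {"input": 2.50, "output": 10.00}  # USD per million tokens (illustrative)

def turn_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one planner call: tokens times per-token price."""
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# A mid-run turn: 12K input tokens of accumulated history, 400 output tokens.
mid_run = turn_cost_usd(12_000, 400)   # (30_000 + 4_000) / 1e6 = 0.034 USD
# Turn 1 of the same run: 1.5K input tokens, same output.
turn_one = turn_cost_usd(1_500, 400)
```

The point the bullets make falls out of the arithmetic: the same turn count costs several times more late in a run than early, because the input side grows with every tool output carried forward.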
What AgentKit’s existing knobs give you and what they don’t
AgentKit’s primitives are correct in shape and wrong in unit. max_turns on Runner.run() is a count, not a dollar cap; a turn on a 30K-token thread costs five times what a turn on a 6K-token thread costs, and the cap doesn’t know the difference. tool_use_behavior controls when the loop ends after a tool call, not how much each turn cost. Input and output guardrails are content checks — PII detection, jailbreak filters, output schema validation — not dollar checks; a guardrail can refuse to ship a response that mentions a credit card, but it cannot refuse to start a turn that would push the run past $50. Tracing is the dashboard view of what already happened: the trace is uploaded after the run completes, and the developer reads it the next morning. None of these look at cumulative dollars spent so far in this run and none of them stop the next turn before it fires. A run that legitimately needs eight turns to research and summarise and a run that fires the same broken tool call in a loop until max_turns hits both look identical to the executor — they just produce different invoices.
What a runtime budget alert actually has to do
- Track real dollars, not turn counts. A turn that hits a 50K-token thread costs ~10× a turn on a 5K-token thread. A turn count over-credits the early ones and under-bills the late ones. The tracker has to take a USD number from the host after each Runner.run() step — pulled from result.context_wrapper.usage’s total-cost field, or computed from input/output tokens times the published per-token price for the model.
- Trip before the next turn fires, not after. The check is in-process, on a numeric accumulator. It runs in microseconds. When the cap is crossed, the next call into the loop raises a typed error and the run halts — the next planner Responses call never goes out, the next parallel tool fan-out never dispatches, the next handoff never swaps.
- Support a rolling window, not just cumulative. “Don’t spend more than $5 on this run” is one rule. “Don’t spend more than $5 in any rolling minute” is a different rule, and the right one for long-running orchestrations: a research agent may legitimately spend dollars over an hour but should never spend them all in a minute. The tracker has to evict old entries on every query so the second rule is enforceable in microseconds, same as the first.
- Be a primitive, not a framework opinion. The same wrap should compose with AgentKit, with the raw OpenAI SDK, with LangChain tools, with whatever framework lands next quarter. A breaker that ships as an AgentKit guardrail subclass or a custom Runner mixin is brittle; a breaker that wraps any callable is portable.
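The tracker these four requirements describe fits in a screenful. A minimal Python sketch of the same shape — illustrative, not the TypeScript BudgetTracker the SDK actually ships:

```python
import time
from typing import List, Optional, Tuple

class BudgetTracker:
    """Hard USD cap with an optional rolling window (illustrative sketch)."""

    def __init__(self, max_usd: float, window_ms: Optional[int] = None):
        self.max_usd = max_usd
        self.window_ms = window_ms
        self._entries: List[Tuple[float, float]] = []  # (timestamp_ms, usd)

    def add(self, usd: float) -> None:
        if usd == 0:  # zero-cost steps never count toward the cap
            return
        self._entries.append((time.time() * 1000, usd))

    def spent(self) -> float:
        if self.window_ms is not None:
            # Evict entries older than the window on every query, so the
            # rolling rule stays a microsecond check like the cumulative one.
            cutoff = time.time() * 1000 - self.window_ms
            self._entries = [(t, u) for t, u in self._entries if t >= cutoff]
        return sum(u for _, u in self._entries)

    def exceeded(self) -> bool:
        return self.spent() >= self.max_usd

    def reset(self) -> None:
        self._entries.clear()
```

With window_ms unset the tracker enforces the per-run rule; with window_ms set, old spend evicts on every spent() query and only the rolling window counts.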
Wrapping Runner.run() with runguard
# OpenAI AgentKit + runguard. The Agent stays an Agent; we wrap the Runner
# step so the budget tracker sees every paid call and trips before the next.
from agents import Agent, Runner
from runguard import guard, BudgetExceededError, LoopDetectedError

researcher = Agent(name="researcher", instructions="...", tools=[search_web, summarize])

async def _run_step(payload):
    # One Runner.run() call. Returns the final output plus the dollar number
    # the tracker needs to push into the rolling-window ledger.
    result = await Runner.run(payload["agent"], input=payload["input"])
    usage = result.context_wrapper.usage  # tokens + usd from the model client
    last = result.new_items[-1] if result.new_items else None
    return {
        "final_output": result.final_output,
        "usd": usage.total_cost_usd,
        "last_kind": getattr(last, "type", "end"),
    }

guarded_run = guard(
    _run_step,
    signature=lambda i: f"agentkit:{i['agent'].name}:{i['input'][:64]}",
    budget={"max_usd": 5, "window_ms": 60_000},
    loop={"repeats": 3, "max_cycle_len": 8},
    cost=lambda _i, o: o["usd"],
    on_trip=lambda e: print("[runguard]", e["reason"], e.get("spent"), "of", e.get("cap")),
)

try:
    out = await guarded_run({"agent": researcher, "input": "Brief me on Q3 SEC filings for $TICK"})
except (BudgetExceededError, LoopDetectedError) as e:
    print("halted:", e)
The budget primitive is the BudgetTracker shipped at product/sdk/src/budget.ts: maxUsd for the cap, optional windowMs for rolling-window throttles, an add(usd) the host calls post-call, and an exceeded() the wrap reads pre-call — a hard cap with rolling-window option, no daemon, no telemetry. The BudgetTracker file is 84 lines; the LoopDetector at product/sdk/src/loop-detector.ts is 111 lines. Defaults are honest: $5 per run is enough for a research orchestration on a frontier model, low enough that a stuck tool-call loop doesn’t become a six-figure incident. The same wrap watches for loops on the same step signature; the fingerprint-and-window approach is documented at how to detect LLM tool-call loops in production; the LangChain wrap is here; the multi-agent CrewAI wrap is here; the browser-use wrap is here.
How the breaker behaves inside Runner.run()
- Costs accumulate after each run step. The wrap reads output.usd on success and pushes it into the BudgetTracker. Successful runs under the cap pass through transparently — the host gets final_output and continues. Zero-cost runs (a cached run, a trivial assistant-only response) never trip the budget; the tracker explicitly skips zero entries via if (usd === 0) return.
- The first run over the cap throws before its planner Responses call goes out. BudgetExceededError is constructed with the cumulative spend, the cap, and a reason field. It propagates out of the guarded callable into the host’s exception handler. The agent thread state, tool registries, and any open clients can be cleanly torn down by the host — the tracker doesn’t close streams or kill processes; that’s for the host to do.
- Your on_trip hook fires before the throw. Page Slack with the spend curve, write a row to a trip log keyed on the agent name, snapshot the last tool call and its arguments — whatever you wire. Sync hooks run inline; async hooks are awaited. An on_trip exception propagates instead of the trip error, by design (the host explicitly opted in to side-effecting on trip).
- Reset is explicit. When a fresh run starts, call guarded_run.reset() to clear the spend ledger. The tracker is per-guarded-fn, not per-process — you can keep one budget per agent and run several agents independently, or share one guard() across a handoff chain for a chain-wide cap. The same reset() wipes the loop window on the same wrap.
Tuning for AgentKit cost shapes
AgentKit’s default max_turns on Runner.run() is 10. On gpt-4o at a typical research-orchestration prompt size, a mid-run turn lands around $0.05–$0.20 of input tokens before assistant output, climbing as the conversation grows. The default max_usd: 5 on the budget tracker corresponds to roughly 25–100 turns on the small end, 12–50 on the heavy end — an honest research run with two or three handoffs, generous enough that legitimate workflows finish, tight enough that a stuck retry loop trips the breaker before the bill triples. For long-running orchestrations (a continuous monitor agent, a daily digest agent), set window_ms: 60_000 with the same max_usd: 5: the cap rolls; old spend evicts; the cumulative invoice over an hour is unbounded but the per-minute spike is. For high-stakes work where an over-spend is worse than an under-spend (production fan-out, paid lead enrichment), drop to max_usd: 1 — a tighter cap costs you one re-run on legitimate workflows; a looser cap costs you one Friday-night incident. Stack the budget guard with the loop detector on the same wrap: a stuck retry loop usually trips the loop guard first (signature repeats fast on identical tool arguments), but a slow-burn drift on slightly-different-each-time tool inputs trips the budget instead — both stop the run, both leave a typed error, both are cheap to retry.
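The turns-per-cap numbers above are just the cap divided by per-turn cost. A quick sanity check, done in integer cents to dodge floating-point floor surprises, using the document's own illustrative per-turn range:

```python
def turns_under_cap(max_usd: float, usd_per_turn: float) -> int:
    """How many turns of a given average cost fit under the cap.
    Computed in integer cents so floor division doesn't trip on floats."""
    return round(max_usd * 100) // round(usd_per_turn * 100)

light = turns_under_cap(5, 0.05)  # ~100 turns at the light end of the range
heavy = turns_under_cap(5, 0.20)  # ~25 turns at the heavy end
tight = turns_under_cap(1, 0.20)  # max_usd: 1 leaves ~5 heavy turns
```

Note these are input-token estimates only; once assistant output and retries are counted, the real turn budget is smaller, which is why the quoted ranges shade downward.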
Loop detection on the same wrap
- Signature is the run fingerprint. agentkit:<agent_name>:<first_64_chars_of_input> is the coarse default for the cross-run case — the same agent against the same input, three times back-to-back, almost always means an upstream caller is firing the same orchestration in a loop. For per-turn loop detection inside a single run, wrap the underlying step instead of Runner.run() and build the signature from the proposed tool name plus the canonicalised arguments. The detector pushes the signature into a 32-entry sliding window and looks for any cycle of length 1–8 repeating 3+ times. Length 1 catches the stuck-tool loop (same tool with the same args three times). Length 2 catches the validation-error ping-pong (call → error → call → error). Higher lengths cover the multi-step retry shapes the planner falls into when a tool keeps returning a degenerate result and the agent keeps pulling the same lever.
- The trip event tells you which guard fired. reason: "loop" for a cycle hit; reason: "budget" for a cost cap; reason: "context" if you also pass a context-window guard for thread-history bloat. The typed error is one of LoopDetectedError, BudgetExceededError, ContextLimitError — the calling code dispatches on the type, not on string parsing.
- Per-fn or shared. One guarded_run per agent gives you per-agent isolation. One shared guard() across multiple agents in a handoff chain gives you cross-agent loop detection — useful when the researcher hands off to the summariser, the summariser hands back, and the chain re-enters the same upstream tool three times.
- Zero outbound calls. The whole check is pure data flow inside your Python process. No telemetry, no daemon, no SaaS; nothing leaves your VPC. The wrap is the only thing in your process that knows the agent is loop-stuck or over-budget — the loop counter and the dollar accumulator both live in memory, scoped to the guarded function.
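The window-and-cycle check described above is small enough to sketch in full. An illustrative Python analogue of the detector — 32-entry window, cycle lengths 1–8, tripping on 3+ repeats — not the shipped loop-detector.ts:

```python
from collections import deque

class LoopDetector:
    """Sliding-window cycle check over step signatures (illustrative sketch)."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 8, window: int = 32):
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        self.window = deque(maxlen=window)

    def push(self, signature: str) -> bool:
        """Record one step signature. Returns True when a cycle of length
        1..max_cycle_len repeats `repeats`+ times at the tail of the window."""
        self.window.append(signature)
        tail = list(self.window)
        for n in range(1, self.max_cycle_len + 1):
            if len(tail) < n * self.repeats:
                continue
            cycle = tail[-n:]  # candidate cycle: the last n signatures
            # Does the window end with `repeats` back-to-back copies of it?
            if all(tail[-(k + 1) * n : len(tail) - k * n] == cycle
                   for k in range(self.repeats)):
                return True
        return False
```

Length-1 cycles are the stuck-tool case (identical signature three times running); length-2 is the call/validation-error ping-pong; the window cap keeps every push O(window × max_cycle_len) on plain list comparisons.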
The first loop our SDK caught was ours
It wasn’t an AgentKit run — it was our own launch script firing a six-tweet thread against a paid X API. The first attempt came back with HTTP 402 CreditsDepleted. Six consecutive sessions later, six identical signatures — post_tweet:402:CreditsDepleted — were sitting in a flat JSON file on disk. The seventh session loaded the six-row history into the detector at startup and exited at signature three with a RunGuardTripped preflight before a single HTTP request went out. It has held the breaker open every session since. Read the dogfood story on the 30-day log; the same pattern slots into an AgentKit run when a planner replans the same stuck tool against the same arguments three times in a row.
What this is not
- Not an AgentKit guardrail subclass. RunGuard does not subclass InputGuardrail, ship a Runner mixin, or hook into the tool registry. It wraps the underlying callable. That is the design — the same wrap composes with the raw OpenAI SDK, with the LangChain agent executor, with whatever framework lands next quarter. The SDK at product/sdk/src/budget.ts is 84 lines; the loop detector at product/sdk/src/loop-detector.ts is 111 lines; both are in-process primitives.
- Not a replacement for AgentKit tracing. Tracing answers “what did the agent do yesterday and how much did it cost?” A runtime budget alert answers “should the next paid turn go out?” The two are complementary — one for finance, one for prevention. Run both. The trace is your morning-after audit; the breaker is your tonight-before-bed insurance.
- Not a server. No outbound network, no telemetry, no cookies, no daemon, no SaaS. The budget check is pure data flow inside your Python process. The same in-process discipline shows up in the embed-preview widget; the policy is one repo away in llms.txt.
The minimum AgentKit integration
One pip install runguard, one guard() wrap around a thin _run_step that calls Runner.run() and pulls the dollar number from result.context_wrapper.usage, and one on_trip that pages the channel you actually read. Eight lines of wrap, no guardrail subclass to register, no Runner override, no agent mixin. The breaker trips on the dollar cap or the third repeat of any agent-input signature, halts the orchestration, and leaves a structured event and a typed error behind for the post-mortem. RunGuard ships it as runguard on PyPI and @runguard/sdk on npm — same primitive, both runtimes, in-process, zero deps.