A runtime per-request budget limit for the OpenAI SDK
The OpenAI SDK ships max_tokens on every chat.completions.create() call — and on the newer client.responses.create() as max_output_tokens — and people read it as a budget knob. It is not. It is a token cap on the model’s output, not on the request’s bill. The input side is uncapped: a 100K-token system prompt with max_tokens=10 still bills 100K input tokens; a fifty-shot eval that ships fifty exemplars in the system message bills all fifty exemplars whether the model emits one token or one thousand. The SDK has no per-request dollar cap, no per-run cumulative budget, no rolling-window throttle, no on_budget_exceeded hook. client.with_options(timeout=30) is a wall-clock cap; client.with_options(max_retries=3) is a retry-count knob; the dashboard’s monthly limit is an org-wide soft cap that fires after the bill has landed. None of them look at cumulative dollars spent so far in this run, and none of them stop the next request before it fires. This page covers the runtime per-request budget limit we ship and how it slots around a bare OpenAI SDK call in eight lines.
Where the dollars actually accumulate inside an OpenAI SDK call
- Input tokens are uncapped on every call. max_tokens caps completion only. A retrieval-augmented prompt that stuffs forty 2K-token chunks into the system message ships 80K input tokens on every request — on gpt-4o at $2.50 per million input tokens, that’s $0.20 of input bill before the model emits a single token. Set max_tokens=50 and the floor is still $0.20; set max_tokens=4000 and you cap the upside, not the floor. A retry on parse failure that re-sends the same RAG prompt with the schema appended is another full $0.20 of input plus output.
- The n parameter scales output cost linearly. chat.completions.create(model="gpt-4o", messages=…, n=5) generates five completions on one request. The prompt is processed once; the output billing is five times. Self-consistency voters that ask for five samples to majority-vote pay 5× output. Self-consistency voters implemented as five separate calls with different temperatures pay 5× both input and output. Both look identical from the dashboard the morning after; only one looks identical to a per-request guard.
- The Responses API carries server-side state forward. client.responses.create(previous_response_id=…) threads the prior turn’s state into the next request as part of the input. Your code looks like a one-line conversation; your bill looks like a transcript. The previous_response_id chain doesn’t show in the local messages list because it’s server-side, but the input-token count on each subsequent responses.create() includes the prior turns. A four-turn refinement chain on a 4K-token base is a 4K, 8K, 12K, 16K input progression — quadratic in cumulative cost, linear in lines of code.
- Streaming abort doesn’t refund the input. stream=True on either Chat Completions or the Responses API server-streams the output. A user clicking cancel at the 200-token mark of a 4K-token response stops your client from receiving the rest, but the server has already billed the full input plus the tokens emitted up to the cut. The response.usage on the final chunk reflects what was billed; mid-stream aborts never see that chunk and never know what they paid.
- Built-in retries are full-cost. client.with_options(max_retries=3) retries on transient 429s and 5xx errors. Each retry is a complete paid request — same input tokens, fresh output tokens, fresh per-call surcharge if you’re on a per-request priced tier. A noisy upstream that rate-limits intermittently triples your bill on every hot path until you notice in the dashboard a week later.
- Function-calling validation retries are billed retries. When the model emits arguments that fail your Pydantic schema or your hand-rolled parser, the canonical pattern is to feed the validation error back as a tool result and ask for a corrected call. That’s a fresh chat-completions request, full input plus output. A subtly mistyped tool argument can chew three or four paid retries before the model corrects course; a model that simply cannot produce the schema (because the schema is wrong) chews retries until your loop counter saves you.
- Hand-rolled agent loops on chat.completions have no framework safety net. The fifty-line “agent in raw OpenAI” that wraps a while not done loop around a tool-dispatch step has all the loop hazards of a LangChain executor or a CrewAI crew — same retry-on-tool-error pattern, same accumulating message history, same chance of a stuck conditional — with none of the framework’s recursion-limit backstop. The bare SDK has no recursion_limit equivalent. The loop is bounded only by your while condition and your patience. The cost arithmetic behind every item in this list is the same response.usage read, sketched below.
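To make that arithmetic concrete, here is a minimal sketch of pricing one call from its response.usage block. The usage fields are what the SDK actually returns on a non-streaming Chat Completions response; the rate constants and the price_call helper are illustrative, not part of the SDK.

# Sketch: turn response.usage into dollars. RATE_* and price_call() are
# illustrative helpers, not OpenAI SDK surface.
from openai import OpenAI

client = OpenAI()

RATE_IN = 2.50 / 1_000_000    # gpt-4o, $ per input token
RATE_OUT = 10.00 / 1_000_000  # gpt-4o, $ per output token

def price_call(resp):
    """Dollar cost of one chat.completions response, read off its usage block."""
    u = resp.usage
    return u.prompt_tokens * RATE_IN + u.completion_tokens * RATE_OUT

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "One-line summary of the attached context."}],
    max_tokens=50,  # caps the output side only; the prompt_tokens part of the bill is untouched
    n=1,            # n=5 would multiply the completion_tokens bill roughly five times
)
print(f"${price_call(resp):.4f}")  # prompt_tokens * RATE_IN is the floor max_tokens never lowers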
What the OpenAI SDK’s knobs give you and what they don’t
The OpenAI SDK’s knobs are correct in shape and wrong in unit. max_tokens (Chat Completions) and max_output_tokens (Responses API) are token caps on the model’s output, not on the request’s bill, and not on the conversation’s cumulative spend — setting them to 50 saves you the upside on one call and leaves the input floor untouched. The n parameter is an output multiplier, not a budget rule; it makes the per-call bill bigger, not smaller. stream=True is a delivery mechanism, not a cap; aborting mid-stream cancels delivery, not billing. client.with_options(timeout=30) is a wall-clock cap in seconds — a 30-second timeout on gpt-4o at roughly 100 tokens per second of output is on the order of 3K output tokens of bill on a cancel, far above what most callers expect when they read “30-second timeout.” client.with_options(max_retries=3) is a transient-failure retry count and every retry is a full-cost request. Org-tier RPM and TPM rate limits are throttles, not dollars; you can stay well under your TPM and still write a four-figure invoice. response_format={"type": "json_schema", …} controls output shape, not output length and not output cost — a structured-outputs request is the same per-token bill as an unstructured one. The dashboard’s monthly usage limit is an org-wide soft cap; it stops new requests org-wide once tripped, but it doesn’t stop the in-flight request that crossed the line, and it has no per-job awareness — one runaway script can lock out every other workload sharing the org. None of these look at cumulative dollars spent so far in this run, none of them stop the next request before it fires, and none of them tell you which call inside your loop is the one that finally crossed the cap.
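For reference, a sketch of those knobs in the units they actually operate in. The values are placeholders; the point is that none of the parameters below is denominated in dollars.

# Every knob here is real SDK surface; none of them is a dollar cap.
from openai import OpenAI

client = OpenAI()

resp = client.with_options(
    timeout=30.0,   # seconds of wall clock, not dollars
    max_retries=3,  # retry count on 429/5xx; every retry is a full-cost request
).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise the doc."}],
    max_tokens=512,                           # output-token cap; input stays uncapped
    response_format={"type": "json_object"},  # output shape, not output length or cost
)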
What a per-request budget guard actually has to do
- Detect the cycle on a fingerprint, not a call count. The same prompt fired three times in a row is a stuck retry — same model, same messages, same arguments, same parse-failure handler. The same prompt fired three times with three different temperatures is self-consistency sampling and is doing real work. A call-count guard can’t tell them apart; a signature guard can. The detector takes a per-call signature — the model name plus the canonicalised tail of the messages list, optionally a hash of the active tool schemas, optionally the function name that wrapped the call — and looks for any cycle of length 1–8 repeating 3+ times in the most-recent 32 calls. Cycle of length 1 catches the stuck-retry-on-the-same-prompt case. Length 2 catches the agent/tool ping-pong on a hand-rolled raw-SDK loop. Higher lengths cover multi-step refinement pipelines that fall into a draft-critique-revise-critique cycle.
- Track real dollars, not call count. A call to gpt-4o at $2.50 per million input plus $10.00 per million output costs roughly seventeen times what a call to gpt-4o-mini at $0.15 per million input plus $0.60 per million output costs; an o1 call at $15.00 per million input plus $60.00 per million output costs another six times more. The tracker reads response.usage.prompt_tokens and response.usage.completion_tokens on Chat Completions, or response.usage.input_tokens and response.usage.output_tokens on the Responses API, multiplies by the published per-token rate for the chosen model, and adds the result to a rolling-window or cumulative ledger.
- Trip before the next request fires, not after. The check is in-process, on a numeric accumulator and a small ring buffer. It runs in microseconds. When the cap is crossed or the cycle threshold is hit, the next call into the wrapped function raises a typed error and the host halts — the next client.chat.completions.create() never executes, the next retry never schedules, the next refinement turn never builds.
- Be a primitive, not a framework opinion. The same wrap should compose with client.chat.completions.create(), with client.responses.create(), with the client.beta.realtime session opener, with the Azure OpenAI flavour, with whatever the SDK adds next quarter. A breaker that ships as an OpenAI SDK monkey-patch or a bespoke OpenAIWithGuard client class is brittle; a breaker that wraps any callable is portable. A sketch of the fingerprint behind the cycle check follows this list.
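A minimal sketch of such a fingerprint, assuming the wrap sees the model name, the messages list, and optionally the active tool schemas on each call. The call_signature name, the 32-character content truncation, and the 8-character tool hash are arbitrary choices for illustration, not a fixed API.

import hashlib
import json

def call_signature(model, messages, tools=None):
    """Fingerprint one paid request: model name, canonicalised tail of the
    messages list, and an optional hash of the active tool schemas. Two
    identical retries collide; two genuinely different prompts do not."""
    tail = "|".join(
        f"{m.get('role', '')}:{str(m.get('content') or '')[:32]}"
        for m in messages[-2:]
    )
    tool_hash = "-"
    if tools:
        tool_hash = hashlib.sha1(
            json.dumps(tools, sort_keys=True, default=str).encode()
        ).hexdigest()[:8]
    return f"openai:{model}:{tool_hash}:{tail}"

The resulting string is what the 32-entry window scans for cycles of length 1–8.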
Wrapping a chat.completions.create() call with runguard
# openai sdk + runguard. Wrap the bare client call so the loop detector
# and budget tracker see every paid request before the next.
from openai import OpenAI
from runguard import guard, BudgetExceededError, LoopDetectedError

client = OpenAI()

RATE_IN = 2.5e-6  # gpt-4o, $ per input token
RATE_OUT = 10e-6  # gpt-4o, $ per output token

def _ask(messages, model="gpt-4o", **kw):
    resp = client.chat.completions.create(model=model, messages=messages, **kw)
    u = resp.usage
    usd = u.prompt_tokens * RATE_IN + u.completion_tokens * RATE_OUT
    tail = "|".join(f"{m['role']}:{(m.get('content') or '')[:32]}" for m in messages[-2:])
    sig = f"openai:{model}:{tail}"
    return {"resp": resp, "usd": usd, "sig": sig}

guarded_ask = guard(
    _ask,
    signature=lambda _args, out: out["sig"],
    budget={"max_usd": 5, "window_ms": 60_000},
    loop={"repeats": 3, "max_cycle_len": 8},
    cost=lambda _args, out: out["usd"],
    on_trip=lambda e: print("[runguard]", e["reason"], e.get("spent"), "of", e.get("cap")),
)

try:
    out = guarded_ask([{"role": "user", "content": "Brief me on Q3 SEC filings for $TICK"}])
    print(out["resp"].choices[0].message.content)
except (BudgetExceededError, LoopDetectedError) as e:
    print("halted:", e)
The loop primitive is the LoopDetector shipped at product/sdk/src/loop-detector.ts: defaults windowSize: 32, minCycleLen: 1, maxCycleLen: 8, repeats: 3 — a push(signature) the wrap calls after each request, a scan() that returns a typed match, a reset() for fresh runs, and constructor-time validation that rejects repeats < 2 and windowSize < maxCycleLen * repeats. The budget primitive is the BudgetTracker at product/sdk/src/budget.ts: maxUsd for the cap, optional windowMs for rolling-window throttles, an add(usd) the host calls post-call (which silently no-ops on zero, if (usd === 0) return), and an exceeded() the wrap reads pre-call. The BudgetTracker file is 84 lines; the LoopDetector is 111 lines — both are pure in-process primitives, no daemon, no telemetry. The fingerprint-and-window approach is documented at how to detect LLM tool-call loops in production; the LangChain AgentExecutor wrap is here; the multi-agent CrewAI wrap is here; the browser-use wrap is here; the OpenAI AgentKit wrap (one layer up from this page’s bare-SDK case) is here; the LangGraph StateGraph wrap is here.
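The shipped primitives are TypeScript; as a mental model, here is a Python sketch of the same shape. The class and method names mirror the files above, but the bodies are illustrative, not the published source.

import time
from collections import deque

class BudgetTracker:
    """Cumulative or rolling-window dollar ledger; mirrors the shape of budget.ts."""

    def __init__(self, max_usd, window_ms=None):
        self.max_usd = max_usd
        self.window_ms = window_ms
        self._entries = deque()  # (timestamp_ms, usd) pairs

    def add(self, usd):
        if usd == 0:  # zero-cost calls never count toward the cap
            return
        self._entries.append((time.time() * 1000, usd))

    def exceeded(self):
        now = time.time() * 1000
        if self.window_ms is not None:  # evict spend that has rolled out of the window
            while self._entries and now - self._entries[0][0] > self.window_ms:
                self._entries.popleft()
        return sum(usd for _, usd in self._entries) >= self.max_usd

class LoopDetector:
    """Sliding-window cycle scan over call signatures; mirrors the shape of loop-detector.ts."""

    def __init__(self, window_size=32, min_cycle_len=1, max_cycle_len=8, repeats=3):
        if repeats < 2 or window_size < max_cycle_len * repeats:
            raise ValueError("window too small for the requested cycle search")
        self.min_cycle_len = min_cycle_len
        self.max_cycle_len = max_cycle_len
        self.repeats = repeats
        self._window = deque(maxlen=window_size)

    def push(self, signature):
        self._window.append(signature)

    def scan(self):
        """Return (cycle_len, pattern, repeats) if the window's tail repeats, else None."""
        tail = list(self._window)
        for n in range(self.min_cycle_len, self.max_cycle_len + 1):
            if len(tail) < n * self.repeats:
                break
            pattern = tail[-n:]
            blocks = (tail[len(tail) - (i + 1) * n : len(tail) - i * n] for i in range(self.repeats))
            if all(block == pattern for block in blocks):
                return n, pattern, self.repeats
        return None

    def reset(self):
        self._window.clear()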
How the breaker behaves around client.chat.completions.create()
- Cost accumulates after each request returns. The wrap reads the usd field on the inner function’s output dict and pushes it into the BudgetTracker. Successful requests under the cap pass through transparently — the host sees the response, parses it, dispatches the next step. Zero-cost calls (a cached completion, a request that errored before billing on a stub) never trip the budget; the tracker explicitly skips zero entries via if (usd === 0) return.
- The first request over the cap throws before its API call goes out. BudgetExceededError is constructed with the cumulative spend, the cap, and a reason field. It propagates out of the wrap before the next client.chat.completions.create() fires — no in-flight HTTP request, no bandwidth out, no tokens billed. The previous response is preserved on the host’s side; the cap-crossing call simply never executes.
- The loop detector trips on the third repeat of any signature cycle. The wrap pushes the signature into a 32-entry sliding window after each call and scans for a length-1 to length-8 cycle that’s repeated three or more times in a row at the tail. LoopDetectedError carries the cycle length, the pattern itself, and the repeat count — the calling code dispatches on the type. A length-1 trip is the canonical raw-SDK retry-on-the-same-prompt loop (the parse-failure handler that keeps re-sending the same messages). A length-2 trip is the hand-rolled agent/tool ping-pong (your code sends prompt P, gets tool-call T, dispatches T, gets error E, feeds (P, T, E) back, gets the same T again). A length-3 trip is the draft-critique-revise refinement cycle that stops actually revising.
- Your on_trip hook fires before the throw. Page Slack with the spend curve and the offending cycle pattern, write a row to a trip log keyed on the wrap name plus the run id, persist the cumulative usage so the next process can reload it — whatever you wire. Sync hooks run inline; async hooks are awaited. An on_trip exception propagates instead of the trip error, by design (the host explicitly opted in to side-effecting on trip).
- Reset is explicit. When a fresh job starts, call guarded_ask.reset() to clear both the spend ledger and the loop window. The tracker is per-guarded-fn, not per-process — you can wrap one _ask for fast cheap calls and another _ask_o1 for expensive reasoning calls with independent budgets, or share one guard() across both for a global per-job cap. Pair the wrap with client.with_options(max_retries=0) if you want the breaker to be the only retry policy in the stack — otherwise SDK-level retries fire inside one logical call and the budget guard sees them after the fact. A sketch of that pairing follows this list.
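A sketch of the reset-plus-single-retry-policy pairing, assuming the guarded_ask and _ask from the example above; the run_job wrapper and its prompt list are illustrative.

# Hypothetical job runner around the guarded_ask defined earlier. The only real
# SDK surface here is the max_retries constructor argument.
from openai import OpenAI

client = OpenAI(max_retries=0)  # the breaker is the only retry policy in the stack

def run_job(prompts):
    guarded_ask.reset()  # fresh spend ledger and loop window for this job
    answers = []
    for p in prompts:
        out = guarded_ask([{"role": "user", "content": p}])
        answers.append(out["resp"].choices[0].message.content)
    return answers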
Tuning for the OpenAI SDK cost shape
A typical chat.completions.create() with a 4K-token RAG prompt and a 1K-token response on gpt-4o lands around $0.02 of bill ($0.01 input + $0.01 output). On gpt-4o-mini the same call is around $0.0012; on o1 around $0.12. The default max_usd: 5 on the budget tracker corresponds to roughly 250 calls on gpt-4o, 4,000 on gpt-4o-mini, and 40 on o1 — a normal job finishes well inside the cap; a stuck retry loop trips the breaker before the bill triples. For interactive applications behind a per-user request handler (a chat UI, a search-with-LLM-rerank endpoint, a one-call summariser), set window_ms: 60_000 with the same max_usd: 5: the cap rolls; old spend evicts; the cumulative invoice over an hour is unbounded but the per-minute spike is bounded. For batch pipelines that run unattended overnight (a documentation re-indexer, a nightly news summariser, a knowledge-base re-embedder), drop window_ms entirely — you want a hard cumulative cap on the whole job. For high-stakes high-cost work where an over-spend is worse than an under-spend (production data enrichment hitting o1, paid-content moderation passes, large-context legal extraction), drop max_usd to 1 — a tighter cap costs you one re-run on legitimate workflows; a looser cap costs you one Friday-night incident. Stack the budget guard with the loop detector on the same wrap: a stuck parse-retry usually trips the loop guard first (same model plus same trailing messages hash to the same signature on each retry), but a slow-burn drift on slightly-different-each-time prompts trips the budget instead — both stop the run, both leave a typed error, both are cheap to retry.
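The three shapes above, expressed as guard() budgets. The numbers are the ones from this section; _ask is the wrapper from the example above and _ask_o1 is the hypothetical o1-priced twin mentioned earlier.

# Interactive, per-user handler: bound the per-minute spike and let the hour roll.
interactive_ask = guard(
    _ask,
    signature=lambda _args, out: out["sig"],
    cost=lambda _args, out: out["usd"],
    budget={"max_usd": 5, "window_ms": 60_000},
    loop={"repeats": 3, "max_cycle_len": 8},
)

# Unattended batch job: hard cumulative cap on the whole run, no rolling window.
batch_ask = guard(
    _ask,
    signature=lambda _args, out: out["sig"],
    cost=lambda _args, out: out["usd"],
    budget={"max_usd": 5},
    loop={"repeats": 3, "max_cycle_len": 8},
)

# High-stakes o1 work: a tight cap, where an over-spend is worse than a re-run.
o1_ask = guard(
    _ask_o1,  # hypothetical o1-priced twin of _ask
    signature=lambda _args, out: out["sig"],
    cost=lambda _args, out: out["usd"],
    budget={"max_usd": 1},
    loop={"repeats": 3, "max_cycle_len": 8},
)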
The retry, n-best, and CoT shapes on the same wrap
- Signature is the run fingerprint. The default openai:<model>:<role:content[:32]> for the last 2 messages covers the canonical raw-SDK loop — the host code feeds the same prompt three times after a parse failure, the breaker halts before the fourth attempt. For self-consistency sampling that legitimately re-sends the same prompt with different temperatures, signature on (model, messages, temperature) — three temperatures are three different signatures and the loop detector doesn’t trip; only the budget guard does, which is what you want. For hand-rolled agent loops that dispatch tool calls in code, signature on (model, last_tool_call_name, last_tool_call_args[:64]) — the canonical agent/tool ping-pong becomes a length-2 cycle and the breaker halts on the third repeat. The detector pushes the signature into a 32-entry sliding window and looks for any cycle of length 1–8 repeating 3+ times. Both signature variants are sketched after this list.
- Trip event tells you which fired. reason: "loop" for a cycle hit; reason: "budget" for a cost cap; reason: "context" if you also pass a context-window guard for input-token bloat (e.g. when a Responses API previous_response_id chain is silently growing the per-call input). The typed error is one of LoopDetectedError, BudgetExceededError, ContextLimitError — the calling code dispatches on the type, not on string parsing.
- Per-wrap or shared. One guard() per logical call type gives you per-type isolation — a cheap classifier wrap has one budget, an expensive reasoning wrap has another. One shared guard() across the whole job (every wrapped callable references the same guard() instance) gives you cross-call loop detection — useful when a draft-critique-revise pipeline keeps falling into the same three-call cycle even though no single wrap is repeating in isolation.
- Plays nicely with the Responses API state. The wrap is in-process and stateless across runs — it doesn’t persist its loop window or spend ledger across processes. That’s deliberate: a follow-up call with a previous_response_id that resumes a prior server-side conversation is a fresh run from the breaker’s point of view, which is what you want for the legitimate-resume case. For the breaker-state-must-survive-restart case (a long-running daemon, a worker pool that recycles processes), persist the trip event yourself in your on_trip hook and refuse to start the next request if the trip event is still open.
- Zero outbound calls. The whole check is pure data flow inside your Python (or Node) process. No telemetry, no daemon, no SaaS, nothing leaves your VPC. The wrap is the only thing in your process that knows the request is loop-stuck or over-budget; the only place it surfaces is the typed error, the on_trip hook you wrote, and a structured event in the trip log.
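Both variants as a sketch. They plug into the same guard() shape as the main example; _ask_sampled and _ask_agent are illustrative names layered on the _ask wrapper above, not shipped helpers.

# Variant 1: self-consistency sampling. Fold the temperature into the signature so
# five deliberate re-sends at different temperatures are five distinct signatures;
# only the budget guard can stop them.
def _ask_sampled(messages, model="gpt-4o", temperature=1.0, **kw):
    out = _ask(messages, model=model, temperature=temperature, **kw)
    out["sig"] = f"{out['sig']}:t={temperature}"
    return out

# Variant 2: hand-rolled agent loop. Fingerprint on the tool call the model just
# emitted, so the agent/tool ping-pong shows up as a length-2 cycle.
def _ask_agent(messages, model="gpt-4o", **kw):
    out = _ask(messages, model=model, **kw)
    calls = out["resp"].choices[0].message.tool_calls or []
    if calls:
        fn = calls[0].function
        out["sig"] = f"openai:{model}:{fn.name}:{fn.arguments[:64]}"
    return out

guarded_sampled = guard(
    _ask_sampled,
    signature=lambda _args, out: out["sig"],
    cost=lambda _args, out: out["usd"],
    budget={"max_usd": 5},
    loop={"repeats": 3, "max_cycle_len": 8},
)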
The first loop our SDK caught was ours
It wasn’t a chat.completions.create() — it was our own launch script firing a six-tweet thread against a paid X API. The first attempt came back with HTTP 402 CreditsDepleted. Six consecutive sessions later, six identical signatures — post_tweet:402:CreditsDepleted — were sitting in a flat JSON file on disk. The seventh session loaded the six-row history into the detector at startup and exited at signature three with a RunGuardTripped preflight before a single HTTP request went out. It has held the breaker open every session since. Read the dogfood story on the 30-day log; the same pattern slots into a raw OpenAI SDK call when the host code keeps re-sending the same prompt against the same parse-failure handler three calls in a row, or when a previous_response_id chain quietly grows the input-token count past the model’s context window and every retry fails on the same boundary.
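The same preload-and-preflight pattern, sketched against the Python LoopDetector stand-in from earlier. The history file path and its schema (a flat JSON list of signature strings) are assumptions drawn from the story above, not a documented format.

import json
from pathlib import Path

HISTORY = Path("runguard-history.json")  # illustrative path for the flat JSON signature log

detector = LoopDetector()  # the Python stand-in sketched earlier; the shipped primitive is TypeScript
history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
for sig in history:  # replay prior sessions' signatures at startup
    detector.push(sig)

if detector.scan() is not None:
    # The breaker is already open from prior sessions: refuse to fire the first
    # paid request until a human clears the history.
    raise RuntimeError("runguard preflight: prior sessions are loop-stuck")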
What this is not
- Not a replacement for max_tokens. Keep max_tokens set on every chat.completions.create() call — it’s an upper bound on the per-call output, complementary to the per-run dollar cap. The two are different units: max_tokens bounds one call’s output length; guard() bounds the run’s cumulative dollars. Set max_tokens tightly for predictable per-call ceilings; set max_usd for the cumulative-across-calls ceiling that max_tokens can’t see. The SDK at product/sdk/src/budget.ts is 84 lines; the loop detector at product/sdk/src/loop-detector.ts is 111 lines; both are in-process primitives.
- Not a monkey-patch on the OpenAI client. RunGuard does not subclass OpenAI, ship an OpenAIWithGuard drop-in, or hook into the SDK’s internals. It wraps the underlying callable that calls the SDK. That is the design — the same wrap composes with chat.completions.create(), with responses.create(), with the Realtime API, with the Azure flavour, with whatever the SDK adds next quarter. A breaker that depends on the SDK’s shape is a maintenance liability the first time the SDK pivots; a breaker that wraps any callable is portable.
- Not Helicone, Langfuse, or the OpenAI usage dashboard. Those answer “what did the run do yesterday and how much did it cost?”. A runtime per-request budget guard answers “should the next paid request fire?”. The two are complementary — one for finance, one for prevention. Run both. The trace is your morning-after audit; the breaker is your tonight-before-bed insurance. The dashboard’s monthly soft cap is an org-wide blast radius; the per-job max_usd is the per-job blast radius. Both are useful; only one stops the runaway script before it locks out every other workload.
- Not a server, not a proxy. No outbound network, no telemetry, no cookies, no daemon, no LLM-proxy gateway in front of your SDK calls. The check is pure data flow inside your Python or Node process. The same in-process discipline shows up in the embed-preview widget; the policy is one repo away in llms.txt. If your security review asked “does this guard ship our prompts off-prem?”, the answer is that the wrap reads response.usage off your SDK’s response object and that’s the entire data flow.
The minimum OpenAI SDK integration
One pip install runguard (or npm i @runguard/sdk), one guard() wrap around a thin _ask that calls client.chat.completions.create() and returns {resp, usd, sig}, and one on_trip that pages the channel you actually read. Eight lines of wrap, no OpenAI subclass to register, no SDK monkey-patch, no proxy gateway in front of every call. The breaker trips on the dollar cap or the third repeat of any request signature, halts the run, and leaves a structured event and a typed error behind for the post-mortem — long before max_tokens would have bounded the next call’s upside, long before the dashboard’s monthly soft cap fires, and long before the bill arrives. RunGuard ships it as runguard on PyPI and @runguard/sdk on npm — same primitive, both runtimes, in-process, zero deps. Same wrap composes with the bare SDK on this page, the AgentKit Runner one layer up, and the LangChain or LangGraph or CrewAI or browser-use stacks one layer further up; pick whichever level your code lives at and the breaker reads the same response.usage in the end.