A runtime per-request budget limit for the OpenAI SDK

The OpenAI SDK ships max_tokens on every chat.completions.create() call — and on the newer client.responses.create() as max_output_tokens — and people read it as a budget knob. It is not. It is a token cap on the model’s output, not on the request’s bill. The input side is uncapped: a 100K-token system prompt with max_tokens=10 still bills 100K input tokens; a fifty-shot eval that ships fifty exemplars in the system message bills all fifty whether the model emits one token or one thousand. The SDK has no per-request dollar cap, no per-run cumulative budget, no rolling-window throttle, no on_budget_exceeded hook. client.with_options(timeout=30) is a wall-clock cap; client.with_options(max_retries=3) is a retry-count knob; the dashboard’s monthly limit is an org-wide soft cap that fires after the bill has landed. None of them look at cumulative dollars spent so far in this run, and none of them stop the next request before it fires. This page covers the runtime per-request budget limit we ship and how it slots around a bare OpenAI SDK call in eight lines.
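
A minimal sketch of that unit mismatch, using illustrative gpt-4o per-token rates (the same numbers as the wrap further down, not fetched from the API): the only place the request’s bill shows up is response.usage, and max_tokens only bounds the second term.

# Sketch: the bill lives in response.usage, not in max_tokens.
# Rates are illustrative gpt-4o per-token prices; the padded string stands in
# for a ~100K-token retrieval context.
from openai import OpenAI

client = OpenAI()
RATE_IN, RATE_OUT = 2.5e-6, 10e-6            # USD per input / output token

big_context = "filing excerpt " * 30_000     # stand-in for a huge RAG prompt

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": big_context},
        {"role": "user", "content": "Summarise the filing in one line."},
    ],
    max_tokens=10,                           # caps output tokens only
)
u = resp.usage
print(u.prompt_tokens, u.completion_tokens)  # ~100K vs <= 10
print(f"${u.prompt_tokens * RATE_IN + u.completion_tokens * RATE_OUT:.4f}")  # the actual bill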

Where the dollars actually accumulate inside an OpenAI SDK call

What the OpenAI SDK’s knobs give you and what they don’t

The OpenAI SDK’s knobs are correct in shape and wrong in unit. max_tokens (Chat Completions) and max_output_tokens (Responses API) are token caps on the model’s output, not on the request’s bill, and not on the conversation’s cumulative spend — setting them to 50 caps the output upside on one call and leaves the input floor untouched. The n parameter is an output multiplier, not a budget rule; it makes the per-call bill bigger, not smaller. stream=True is a delivery mechanism, not a cap; aborting mid-stream cancels delivery, not billing. client.with_options(timeout=30) is a wall-clock cap in seconds — a 30-second timeout on gpt-4o at roughly 100 tokens per second of output is on the order of 3K output tokens of bill on a cancel, far above what most callers expect when they read “30-second timeout.” client.with_options(max_retries=3) is a transient-failure retry count, and every retry is a full-cost request. Org-tier RPM and TPM rate limits are throttles, not dollars; you can stay well under your TPM and still write a four-figure invoice. response_format={"type": "json_schema", …} controls output shape, not output length and not output cost — a structured-outputs request is the same per-token bill as an unstructured one. The dashboard’s monthly usage limit is an org-wide soft cap; it stops new requests org-wide once tripped, but it doesn’t stop the in-flight request that crossed the line, and it has no per-job awareness — one runaway script can lock out every other workload sharing the org. None of these look at cumulative dollars spent so far in this run, none of them stop the next request before it fires, and none of them tell you which call inside your loop is the one that finally crossed the cap.
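
The same knobs in code, annotated with their actual units. This is a sketch, assuming the usual Chat Completions behaviour that n > 1 bills the prompt once and every sampled completion on top; the specific values are illustrative.

# Sketch: the SDK's knobs with their real units. None of them is a dollar cap.
from openai import OpenAI

client = OpenAI()

# seconds and retry attempts; every retry is billed as a full request
bounded = client.with_options(timeout=30, max_retries=3)

resp = bounded.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Three taglines for a budget guard."}],
    max_tokens=200,   # output tokens per choice, not dollars
    n=3,              # output multiplier: three sampled completions on one prompt
)
# completion_tokens covers all three choices; the prompt is billed once.
print(resp.usage.prompt_tokens, resp.usage.completion_tokens)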

What a per-request budget guard actually has to do

Wrapping a chat.completions.create() call with runguard

# OpenAI SDK + runguard: wrap the bare client call so the loop detector
# and budget tracker see every paid request before the next one fires.
from openai import OpenAI
from runguard import guard, BudgetExceededError, LoopDetectedError

client = OpenAI()
RATE_IN  = 2.5e-6   # gpt-4o per input token
RATE_OUT = 10e-6    # gpt-4o per output token

def _ask(messages, model="gpt-4o", **kw):
    resp = client.chat.completions.create(model=model, messages=messages, **kw)
    u = resp.usage
    usd = u.prompt_tokens * RATE_IN + u.completion_tokens * RATE_OUT
    tail = "|".join(f"{m['role']}:{(m.get('content') or '')[:32]}" for m in messages[-2:])
    sig = f"openai:{model}:{tail}"
    return {"resp": resp, "usd": usd, "sig": sig}

guarded_ask = guard(
    _ask,
    signature=lambda _args, out: out["sig"],
    budget={"max_usd": 5, "window_ms": 60_000},
    loop={"repeats": 3, "max_cycle_len": 8},
    cost=lambda _args, out: out["usd"],
    on_trip=lambda e: print("[runguard]", e["reason"], e.get("spent"), "of", e.get("cap")),
)

try:
    out = guarded_ask([{"role": "user", "content": "Brief me on Q3 SEC filings for $TICK"}])
    print(out["resp"].choices[0].message.content)
except (BudgetExceededError, LoopDetectedError) as e:
    print("halted:", e)

The loop primitive is the LoopDetector shipped at product/sdk/src/loop-detector.ts, with defaults windowSize: 32, minCycleLen: 1, maxCycleLen: 8, repeats: 3 — a push(signature) the wrap calls once per request, a scan() that returns a typed match, a reset() for fresh runs, and constructor-time validation that rejects repeats < 2 and windowSize < maxCycleLen * repeats. The budget primitive is the BudgetTracker at product/sdk/src/budget.ts: maxUsd for the cap, optional windowMs for rolling-window throttles, an add(usd) the host calls post-call (which silently no-ops on zero, if (usd === 0) return), and an exceeded() the wrap reads pre-call. The BudgetTracker file is 84 lines; the LoopDetector is 111 lines — both are pure in-process primitives, no daemon, no telemetry. The fingerprint-and-window approach is documented at how to detect LLM tool-call loops in production; the LangChain AgentExecutor wrap is here; the multi-agent CrewAI wrap is here; the browser-use wrap is here; the OpenAI AgentKit wrap (one layer up from this page’s bare-SDK case) is here; the LangGraph StateGraph wrap is here.
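
For readers who want the two contracts in the page’s own language, here is a compressed Python paraphrase of the primitives described above. The shipped sources are the TypeScript files; the names and internals below are a sketch of the described behaviour, not a copy.

# Sketch: Python paraphrase of the LoopDetector and BudgetTracker contracts.
import time
from collections import deque

class LoopDetector:
    def __init__(self, window_size=32, min_cycle_len=1, max_cycle_len=8, repeats=3):
        # mirrors the described constructor-time validation
        if repeats < 2 or window_size < max_cycle_len * repeats:
            raise ValueError("need repeats >= 2 and window_size >= max_cycle_len * repeats")
        self.window = deque(maxlen=window_size)
        self.min_cycle_len, self.max_cycle_len, self.repeats = min_cycle_len, max_cycle_len, repeats

    def push(self, signature):
        # called once per paid request by the wrap
        self.window.append(signature)

    def scan(self):
        # match if the tail of the window is `repeats` consecutive copies of one cycle
        w = list(self.window)
        for n in range(self.min_cycle_len, self.max_cycle_len + 1):
            if len(w) < n * self.repeats:
                continue
            cycle = w[-n:]
            if w[-n * self.repeats:] == cycle * self.repeats:
                return {"cycle": cycle, "length": n, "repeats": self.repeats}
        return None

    def reset(self):
        self.window.clear()

class BudgetTracker:
    def __init__(self, max_usd, window_ms=None):
        self.max_usd, self.window_ms, self.entries = max_usd, window_ms, []

    def add(self, usd):
        # host calls post-call; zero-cost calls are silently ignored
        if usd == 0:
            return
        self.entries.append((time.time() * 1000, usd))

    def exceeded(self):
        # wrap reads this pre-call; with a window, old spend evicts
        now = time.time() * 1000
        if self.window_ms is not None:
            self.entries = [(t, u) for t, u in self.entries if now - t <= self.window_ms]
        return sum(u for _, u in self.entries) >= self.max_usd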

How the breaker behaves around client.chat.completions.create()

Tuning for the OpenAI SDK cost shape

A typical chat.completions.create() with a 4K-token RAG prompt and a 1K-token response on gpt-4o lands around $0.02 of bill ($0.01 input + $0.01 output). On gpt-4o-mini the same call is around $0.0012; on o1 around $0.12. The default max_usd: 5 on the budget tracker corresponds to roughly 250 calls on gpt-4o, 4,000 on gpt-4o-mini, and 40 on o1 — a normal job finishes well inside the cap; a stuck retry loop trips the breaker before the bill triples. For interactive applications behind a per-user request handler (a chat UI, a search-with-LLM-rerank endpoint, a one-call summariser), set window_ms: 60_000 with the same max_usd: 5: the cap rolls; old spend evicts; the cumulative invoice over an hour is unbounded but the per-minute spike is bounded. For batch pipelines that run unattended overnight (a documentation re-indexer, a nightly news summariser, a knowledge-base re-embedder), drop window_ms entirely — you want a hard cumulative cap on the whole job. For high-stakes, high-cost work where an over-spend is worse than an under-spend (production data enrichment hitting o1, paid-content moderation passes, large-context legal extraction), tighten max_usd to 1 — a tighter cap costs you one re-run on legitimate workflows; a looser cap costs you one Friday-night incident. Stack the budget guard with the loop detector on the same wrap: a stuck parse-retry usually trips the loop guard first (same model plus same trailing messages hash to the same signature on each retry), but a slow-burn drift on slightly-different-each-time prompts trips the budget instead — both stop the run, both leave a typed error, both are cheap to retry.
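
The three profiles as configurations, reusing _ask and guard from the block above; make_guard is just a local helper for this sketch, and the dollar figures are the ones discussed here, not universal defaults.

# Sketch: the three tuning profiles, built on the same _ask wrap as before.
def make_guard(budget):
    return guard(
        _ask,
        signature=lambda _args, out: out["sig"],
        cost=lambda _args, out: out["usd"],
        budget=budget,
        loop={"repeats": 3, "max_cycle_len": 8},
    )

# Interactive, per-user handler: rolling window bounds the per-minute spike;
# cumulative spend over an hour stays unbounded.
interactive_ask = make_guard({"max_usd": 5, "window_ms": 60_000})

# Unattended batch job: no window, so the cap is cumulative over the whole run.
batch_ask = make_guard({"max_usd": 5})

# High-stakes, high-cost work (o1-class calls): tighter cumulative cap,
# at worst one re-run on a legitimate workflow.
strict_ask = make_guard({"max_usd": 1})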

The retry, n-best, and CoT shapes on the same wrap
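
A sketch of the retry and n-best shapes on the same guarded_ask from the block above. ask_for_json is a hypothetical caller; the behaviour in the comments follows from the signature and cost lambdas already shown, where each parse retry re-sends the same trailing messages and so hashes to the same signature.

# Sketch: a parse-retry loop and an n-best fan-out routed through guarded_ask.
import json

def ask_for_json(question, attempts=5):
    messages = [{"role": "user", "content": question + " Reply with JSON only."}]
    for _ in range(attempts):
        out = guarded_ask(messages, response_format={"type": "json_object"})
        try:
            return json.loads(out["resp"].choices[0].message.content)
        except json.JSONDecodeError:
            # identical re-send: same model, same trailing messages, same signature,
            # so the third repeat raises LoopDetectedError before attempt four bills
            continue
    raise RuntimeError("model never produced valid JSON")

# n-best is one guarded call, not five: usage.completion_tokens sums the samples,
# so the wrap books the whole fan-out as a single usd entry against the cap.
candidates = guarded_ask(
    [{"role": "user", "content": "Draft five subject lines for the Q3 brief."}], n=5
)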

The first loop our SDK caught was ours

It wasn’t a chat.completions.create() — it was our own launch script firing a six-tweet thread against a paid X API. The first attempt came back with HTTP 402 CreditsDepleted. Six consecutive sessions later, six identical signatures — post_tweet:402:CreditsDepleted — were sitting in a flat JSON file on disk. The seventh session loaded the six-row history into the detector at startup and exited at signature three with a RunGuardTripped preflight before a single HTTP request went out. It has held the breaker open every session since. Read the dogfood story on the 30-day log; the same pattern slots into a raw OpenAI SDK call when the host code keeps re-sending the same prompt against the same parse-failure handler three calls in a row, or when a previous_response_id chain quietly grows the input-token count past the model’s context window and every retry fails on the same boundary.

What this is not

The minimum OpenAI SDK integration

One pip install runguard (or npm i @runguard/sdk), one guard() wrap around a thin _ask that calls client.chat.completions.create() and returns {resp, usd, sig}, and one on_trip that pages the channel you actually read. Eight lines of wrap, no OpenAI subclass to register, no SDK monkey-patch, no proxy gateway in front of every call. The breaker trips on the dollar cap or the third repeat of any request signature, halts the run, and leaves a structured event and a typed error behind for the post-mortem — long before max_tokens would have bounded the next call’s upside, long before the dashboard’s monthly soft cap fires, and long before the bill arrives. RunGuard ships it as runguard on PyPI and @runguard/sdk on npm — same primitive, both runtimes, in-process, zero deps. Same wrap composes with the bare SDK on this page, the AgentKit Runner one layer up, and the LangChain or LangGraph or CrewAI or browser-use stacks one layer further up; pick whichever level your code lives at and the breaker reads the same response.usage in the end.
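
If the channel you actually read is a Slack-style incoming webhook, the on_trip hook is where the page goes. A sketch, assuming the event carries the same reason/spent/cap fields printed in the earlier block; the webhook URL is a placeholder.

# Sketch: an on_trip that pages a channel instead of printing. URL is a placeholder.
import json, urllib.request

WEBHOOK = "https://hooks.example.com/runguard-alerts"

def page_on_trip(event):
    text = (f"runguard tripped: {event['reason']} "
            f"(spent {event.get('spent')} of {event.get('cap')})")
    req = urllib.request.Request(
        WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

guarded_ask = guard(
    _ask,
    signature=lambda _args, out: out["sig"],
    budget={"max_usd": 5},
    loop={"repeats": 3, "max_cycle_len": 8},
    cost=lambda _args, out: out["usd"],
    on_trip=page_on_trip,
)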