OpenAI Assistants API budget control: cap spend per run, thread, and user
The OpenAI Assistants API has a cost structure that makes budget management harder than with the standard Chat Completions API. With Chat Completions, you control the full message history you send on each call and can measure tokens precisely. With the Assistants API, OpenAI manages the thread state: each Run sends the accumulated thread messages plus any tool results to the model, and the thread grows longer with every turn. Unless the thread is truncated, input cost per Run therefore rises with every turn: a conversation that costs $0.01 on turn 1 may cost $0.30 on turn 30 because the full thread is re-sent each time.
The second cost problem is tool-call loops. When an assistant calls a tool that returns an error or an unexpected result, it may retry the same tool call several times within one Run (through the requires_action cycle in the Runs API) or across Runs. Each iteration re-sends the growing thread context, so the token bill is multiplied by the number of iterations.
This guide covers the token cost math for thread accumulation, how to use max_prompt_tokens and truncation_strategy to prevent thread blow-out, per-run cost tracking with the usage field, per-thread and per-user budget caps, and RunGuard loop detection for assistant tool-call loops.
Thread token accumulation: the hidden cost multiplier
- Why thread cost grows with every message. Each Run re-sends the thread history to the model. If your thread has 30 messages averaging 200 tokens each, the input for turn 30 is at least 6,000 tokens of history, plus the system prompt and the current user message. At GPT-4o pricing of $2.50 per million input tokens, that is $0.015 per turn in context cost before any output is generated. At the same message size, turn 100 carries about 20,000 tokens of history ($0.05 per turn), and the input billed over the whole 100-turn conversation adds up to roughly 1 million tokens, or about $2.50. Longer messages, code blocks, or tool outputs can push the per-turn input cost past $0.10 and the conversation total to several dollars.
- Token accumulation math for GPT-4o (2026 pricing).
- GPT-4o input: $2.50/M tokens; output: $10.00/M tokens
- Turn 1 (system 500 + user 100 tokens): 600 input + ~200 output = $0.00350
- Turn 10 (10×300 avg history + 100 tokens): 3,100 input + ~200 output = $0.00975
- Turn 30 (30×300 history + 100 tokens): 9,100 input + ~200 output = $0.02475
- 30-turn conversation total (rough): ~$0.42 for moderate-length messages (about $0.36 input plus $0.06 output)
- Python: use max_prompt_tokens to cap thread context

      from openai import OpenAI
      from openai.types.beta.threads import Run

      client = OpenAI()


      def create_run_with_token_cap(
          thread_id: str,
          assistant_id: str,
          max_prompt_tokens: int = 8_000,  # cap thread context
          max_completion_tokens: int = 1_024,
      ) -> Run:
          """
          Create an Assistants Run with explicit token caps.

          max_prompt_tokens: total input token budget for this run (thread + system + tools)
          max_completion_tokens: output token cap for this run

          With the 'auto' truncation strategy, the API drops earlier thread
          messages as needed to stay within max_prompt_tokens.
          """
          return client.beta.threads.runs.create(
              thread_id=thread_id,
              assistant_id=assistant_id,
              max_prompt_tokens=max_prompt_tokens,
              max_completion_tokens=max_completion_tokens,
              truncation_strategy={
                  "type": "auto",
                  # Alternatively: "last_messages" with last_messages=N to keep
                  # only the N most recent messages regardless of token count
              },
          )


      def estimate_run_cost_usd(run: Run) -> float:
          """Calculate USD cost from run.usage field."""
          if not run.usage:
              return 0.0
          GPT4O_IN_PER_TOK = 2.50 / 1_000_000
          GPT4O_OUT_PER_TOK = 10.00 / 1_000_000
          return (run.usage.prompt_tokens * GPT4O_IN_PER_TOK
                  + run.usage.completion_tokens * GPT4O_OUT_PER_TOK)

- Choosing max_prompt_tokens values. Set max_prompt_tokens to 2-3× your typical input context per turn. For a customer service assistant with short messages, 4,000 tokens is appropriate. For a coding assistant with long code blocks, 16,000–32,000 is reasonable. The “auto” truncation strategy drops older messages as needed while keeping the assistant's instructions and the most recent context, which suits most conversational agents. Use “last_messages” if you need to control exactly how many recent messages are retained. A Run that still exhausts its token budget ends with status incomplete, so treat that status as terminal when polling.
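The accumulation arithmetic above can be checked with a short local simulation. This is a rough cost model only, with no API calls: the function name, the default token counts (taken from the worked turns above), and the assumption that a cap simply clamps the input tokens billed per turn are illustrative simplifications, not a description of how OpenAI applies truncation.

```python
from typing import List, Optional

GPT4O_IN_PER_TOK = 2.50 / 1_000_000    # USD per input token (rate quoted in this guide)
GPT4O_OUT_PER_TOK = 10.00 / 1_000_000  # USD per output token (rate quoted in this guide)


def simulate_thread_cost(
    turns: int,
    history_tokens_per_turn: int = 300,  # assumed average history added per turn
    user_tokens: int = 100,              # assumed size of the current user message
    output_tokens: int = 200,            # assumed completion size per turn
    prompt_cap: Optional[int] = None,    # hypothetical per-turn input clamp
) -> List[float]:
    """Return the estimated USD cost of each turn in a growing thread."""
    costs = []
    for turn in range(1, turns + 1):
        input_tokens = turn * history_tokens_per_turn + user_tokens
        if prompt_cap is not None:
            # Simplification: a cap clamps the billed input for the turn
            input_tokens = min(input_tokens, prompt_cap)
        costs.append(
            input_tokens * GPT4O_IN_PER_TOK + output_tokens * GPT4O_OUT_PER_TOK
        )
    return costs


uncapped = simulate_thread_cost(30)
capped = simulate_thread_cost(30, prompt_cap=4_000)

print(f"turn 10: ${uncapped[9]:.5f}")
print(f"turn 30: ${uncapped[29]:.5f}")
print(f"30-turn total, uncapped: ${sum(uncapped):.2f}")
print(f"30-turn total, 4,000-token cap: ${sum(capped):.2f}")
```

Under these assumptions the cap starts to bind around turn 14, and every later turn costs the same. That flat tail is the point of setting max_prompt_tokens: per-turn spend stops tracking thread length.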
Per-thread and per-user budget caps
- Python: per-user budget tracker across Runs

      import sqlite3
      import time
      from datetime import datetime, timezone

      from openai import OpenAI

      client = OpenAI()

      GPT4O_IN = 2.50 / 1_000_000
      GPT4O_OUT = 10.00 / 1_000_000


      def init_budget_db(db_path: str) -> None:
          with sqlite3.connect(db_path) as conn:
              conn.execute("""
                  CREATE TABLE IF NOT EXISTS run_costs (
                      run_id TEXT PRIMARY KEY,
                      thread_id TEXT NOT NULL,
                      user_id TEXT NOT NULL,
                      prompt_tokens INTEGER,
                      completion_tokens INTEGER,
                      cost_usd REAL,
                      created_at TEXT
                  )
              """)


      def record_run_cost(
          db_path: str,
          run,  # openai.types.beta.threads.Run
          user_id: str,
      ) -> float:
          """Record a finished run's cost and return the cost in USD."""
          if not run.usage:
              return 0.0
          cost = (run.usage.prompt_tokens * GPT4O_IN
                  + run.usage.completion_tokens * GPT4O_OUT)
          with sqlite3.connect(db_path) as conn:
              conn.execute(
                  "INSERT OR REPLACE INTO run_costs "
                  "(run_id, thread_id, user_id, prompt_tokens, completion_tokens, cost_usd, created_at) "
                  "VALUES (?,?,?,?,?,?,?)",
                  (run.id, run.thread_id, user_id,
                   run.usage.prompt_tokens, run.usage.completion_tokens,
                   cost, datetime.now(timezone.utc).isoformat()),
              )
          return cost


      def get_user_spend_today(db_path: str, user_id: str) -> float:
          """Return total USD spent by user_id today (UTC)."""
          today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
          with sqlite3.connect(db_path) as conn:
              row = conn.execute(
                  "SELECT COALESCE(SUM(cost_usd),0) FROM run_costs "
                  "WHERE user_id=? AND created_at LIKE ?",
                  (user_id, f"{today}%"),
              ).fetchone()
          return row[0] if row else 0.0


      def run_with_budget_check(
          thread_id: str,
          assistant_id: str,
          user_id: str,
          db_path: str,
          daily_cap_usd: float = 1.00,
      ) -> str:
          """
          Run the assistant only if the user is within their daily budget cap.
          Returns the assistant's text response or a budget-exceeded message.
          """
          # Pre-check: is user within daily budget?
          spent = get_user_spend_today(db_path, user_id)
          if spent >= daily_cap_usd:
              return (
                  f"[BUDGET] Daily limit ${daily_cap_usd:.2f} reached "
                  f"(used ${spent:.4f}). Try again tomorrow."
              )

          # create_run_with_token_cap is defined in the max_prompt_tokens example
          run = create_run_with_token_cap(thread_id, assistant_id)

          # Poll until the run leaves its in-flight states, pausing between
          # polls so the loop does not hammer the API
          while run.status in ("queued", "in_progress", "cancelling"):
              time.sleep(0.5)
              run = client.beta.threads.runs.retrieve(
                  thread_id=thread_id, run_id=run.id
              )

          if run.status == "requires_action":
              # This sketch does not execute tools. Polling alone never clears
              # requires_action; tool outputs must be submitted first (see the
              # requires_action loop in the next section).
              return "[ERROR] Run is waiting for tool outputs (requires_action)."

          # Record usage for any terminal state that reports it
          cost = record_run_cost(db_path, run, user_id)

          if run.status != "completed":
              return f"[ERROR] Run ended with status: {run.status}"

          # Post-check: did this run push user over cap?
          if spent + cost > daily_cap_usd:
              return (
                  f"[BUDGET] Daily limit ${daily_cap_usd:.2f} reached after this run "
                  f"(used ${spent + cost:.4f}). Usage recorded."
              )

          # messages.list returns newest first by default
          messages = client.beta.threads.messages.list(thread_id=thread_id)
          return messages.data[0].content[0].text.value
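The run_costs table above also stores thread_id, so a per-thread cap can reuse the same data. The following is a minimal sketch, assuming that schema. The names get_thread_spend and thread_within_cap, the $0.25 cap, and the sample rows are hypothetical, chosen only to make the example runnable on its own.

```python
import os
import sqlite3
import tempfile


def get_thread_spend(db_path: str, thread_id: str) -> float:
    """Return total USD recorded for one thread across all of its runs."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT COALESCE(SUM(cost_usd), 0) FROM run_costs WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
    finally:
        conn.close()
    return float(row[0])


def thread_within_cap(
    db_path: str, thread_id: str, thread_cap_usd: float = 0.25
) -> bool:
    """True if the thread may start another run under its lifetime cap."""
    return get_thread_spend(db_path, thread_id) < thread_cap_usd


# Demo with made-up sample rows in a throwaway database
db_path = os.path.join(tempfile.mkdtemp(), "budget.db")
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE run_costs (run_id TEXT PRIMARY KEY, thread_id TEXT NOT NULL, "
    "user_id TEXT NOT NULL, prompt_tokens INTEGER, completion_tokens INTEGER, "
    "cost_usd REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO run_costs VALUES (?,?,?,?,?,?,?)",
    [
        ("run_1", "thread_a", "user_1", 3_100, 200, 0.00975, "2026-01-01T00:00:00+00:00"),
        ("run_2", "thread_a", "user_1", 9_100, 200, 0.02475, "2026-01-01T00:05:00+00:00"),
        ("run_3", "thread_b", "user_1", 100_000, 200, 0.252, "2026-01-01T00:10:00+00:00"),
    ],
)
conn.commit()
conn.close()

print(thread_within_cap(db_path, "thread_a"))  # thread_a has spent about $0.0345
print(thread_within_cap(db_path, "thread_b"))  # thread_b has spent about $0.252
```

One possible design is to run this check next to the per-user pre-check and, when a thread hits its cap, continue in a fresh thread seeded with a short summary instead of refusing the user outright.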
RunGuard loop detection for Assistants tool-call loops
- How tool-call loops manifest in the Assistants API. In the Assistants API, tool calls happen inside the requires_action polling loop: the Run status becomes requires_action, you submit tool outputs, and the Run continues. If the tool output causes the model to call the same tool again with the same arguments, the Run enters another requires_action state, and you submit outputs again. This can repeat for as long as the Run stays active. Each cycle re-bills the accumulated thread tokens plus the new tool call tokens.
- Python: RunGuard loop detection in the requires_action loop

      import json
      import time

      from openai import OpenAI
      from runguard import LoopDetector

      client = OpenAI()


      def handle_run_with_loop_detection(
          thread_id: str,
          run_id: str,
      ) -> str:
          """
          Poll a Run to completion, using RunGuard to detect tool-call loops.
          Returns the final assistant message or a loop-detected error.
          """
          # One detector per Run, so signatures from other Runs cannot
          # contribute to a false loop match
          detector = LoopDetector(repeats=3, max_cycle_len=3)

          while True:
              run = client.beta.threads.runs.retrieve(
                  thread_id=thread_id, run_id=run_id
              )

              if run.status == "completed":
                  messages = client.beta.threads.messages.list(thread_id=thread_id)
                  return messages.data[0].content[0].text.value

              if run.status == "requires_action":
                  tool_calls = run.required_action.submit_tool_outputs.tool_calls
                  tool_outputs = []
                  for tc in tool_calls:
                      # Build a signature from the tool name and its full
                      # arguments, so calls that differ only late in a long
                      # payload are not mistaken for repeats
                      sig = f"{tc.function.name}:{tc.function.arguments}"
                      match = detector.record(sig)
                      if match:
                          # Cancel the run and return a safe error
                          client.beta.threads.runs.cancel(
                              thread_id=thread_id, run_id=run_id
                          )
                          return (
                              f"[LOOP] Tool '{tc.function.name}' called in a repeating "
                              f"pattern ({match.repeats}x). Run cancelled to prevent cost overrun."
                          )
                      # Execute tool and collect output
                      output = dispatch_tool(tc.function.name, tc.function.arguments)
                      tool_outputs.append({"tool_call_id": tc.id, "output": output})

                  # Submit all tool outputs to continue the run
                  client.beta.threads.runs.submit_tool_outputs(
                      thread_id=thread_id,
                      run_id=run_id,
                      tool_outputs=tool_outputs,
                  )

              elif run.status in ("failed", "cancelled", "expired", "incomplete"):
                  # "incomplete" covers runs stopped by a token cap
                  return f"[ERROR] Run ended with status: {run.status}"

              # Brief polling delay
              time.sleep(0.5)


      def dispatch_tool(name: str, arguments_json: str) -> str:
          """Replace with real tool dispatch."""
          args = json.loads(arguments_json)
          return f"Tool {name} result: {args}"
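To make the detection step concrete without depending on any particular library, the sketch below shows one way a repeating-cycle check over tool-call signatures could work. It is a simplified stand-in written for illustration, not RunGuard's implementation: SimpleLoopDetector and its return value are hypothetical, and only the repeats and max_cycle_len parameter names are borrowed from the example above.

```python
from typing import List, Optional


class SimpleLoopDetector:
    """Flags a loop when the most recent signatures form a short cycle
    that has repeated back-to-back `repeats` times."""

    def __init__(self, repeats: int = 3, max_cycle_len: int = 3) -> None:
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        self.history: List[str] = []

    def record(self, signature: str) -> Optional[int]:
        """Append a signature. Return the detected cycle length, or None."""
        self.history.append(signature)
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * self.repeats
            if len(self.history) < window:
                break  # longer cycles need even more history
            tail = self.history[-window:]
            cycle = tail[:cycle_len]
            # The tail is a loop if it equals the cycle repeated end to end
            if all(tail[i] == cycle[i % cycle_len] for i in range(window)):
                return cycle_len
        return None


# A two-step cycle (search, then fetch) repeated three times
detector = SimpleLoopDetector(repeats=3, max_cycle_len=3)
calls = ['search:{"q": "a"}', 'fetch:{"id": 1}'] * 3
results = [detector.record(sig) for sig in calls]
print(results)
```

Checking cycles of more than one call matters because an assistant can alternate between two tools (search, fetch, search, fetch) without ever issuing the same call twice in a row.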
Assistants API cost control comparison
| Control mechanism | What it prevents | Limitation |
|---|---|---|
| max_prompt_tokens | Thread context blow-out on long conversations | Only caps input tokens per run; does not prevent multiple expensive runs |
| max_completion_tokens | Verbose output generation | Only caps output tokens per run; input cost grows unbounded as thread accumulates |
| Per-run cost tracking (run.usage) | Post-hoc visibility into run cost | Reactive: the cost is already incurred before you see it |
| Per-user daily budget cap | A single user exhausting monthly budget | Requires persistent storage across sessions; does not stop in-flight run |
| RunGuard loop detection | Tool-call loops inside the requires_action cycle | Cancels the run on detection — partial work lost; resumption requires caller logic |
For the broader cost control patterns that Assistants-based agents need, see autonomous agent cost control best practices. For loop detection in non-Assistants OpenAI agents, see OpenAI Agents SDK loop guard. For the retry storm pattern that tool loops amplify, see AI agent retry storm prevention.
Add budget control to your OpenAI Assistants integration
RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. For Assistants API agents, the highest-impact step is adding max_prompt_tokens to every Run create call (a one-parameter change that prevents thread blow-out) and then wrapping your requires_action polling loop with a LoopDetector (prevents tool-call loops that would otherwise run until budget exhaustion). Both take under 10 minutes to add to an existing Assistants integration.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
Start your 14-day free trial — or explore related: OpenAI Agents SDK loop guard, autonomous agent cost control best practices, set max cost per LLM request, retry storm prevention, and prevent runaway cost in real time.