OpenAI Assistants API budget control: cap spend per run, thread, and user

The OpenAI Assistants API has a different cost structure from the standard Chat Completions API, and the difference makes budget management harder. With Chat Completions, you control the full message history you send on each call and can measure tokens precisely. With the Assistants API, OpenAI manages the thread state: each Run sends the accumulated thread messages plus any tool results to the model, and the thread grows longer with every turn. Unless the context is truncated, your cost per Run therefore rises as the thread accumulates messages. A conversation that costs $0.01 on turn 1 may cost $0.30 on turn 30 because the full thread is re-sent each time.

The second cost problem is tool-call loops. When an assistant calls a tool that returns an error or an unexpected result, it may retry the same tool call several times within one Run (through the requires_action cycle in the Runs API) or across Runs. Each iteration re-sends the growing thread context, so the token cost is paid again on every retry.

This guide covers the token cost math for thread accumulation, how to use max_prompt_tokens and truncation_strategy to prevent thread blow-out, per-run cost tracking with the usage field, per-thread and per-user budget caps, and RunGuard loop detection for assistant tool-call loops.

Thread token accumulation: the hidden cost multiplier
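The accumulation effect described in the introduction can be put into numbers with a small sketch. It assumes every turn adds the same number of tokens to the thread, that nothing is truncated, and that only input tokens are billed. The per-token price is a placeholder chosen to reproduce the $0.01 and $0.30 figures above, not a real OpenAI rate. Under those assumptions, run k costs k times as much as run 1, so the total over n runs grows with n(n+1)/2 rather than with n.

```python
def cumulative_thread_cost(turns, tokens_per_turn, price_per_1k_input):
    """Return (per_run_costs, total_cost) when every run re-sends the whole thread.

    Simplifying assumptions: each turn adds a fixed number of tokens, there is
    no truncation, and output tokens and tool results are ignored.
    """
    per_run = []
    for turn in range(1, turns + 1):
        prompt_tokens = turn * tokens_per_turn  # thread grows linearly per turn
        per_run.append(prompt_tokens / 1000 * price_per_1k_input)
    return per_run, sum(per_run)


# Placeholder numbers: 1,000 tokens added per turn at $0.01 per 1k input tokens.
costs, total = cumulative_thread_cost(
    turns=30, tokens_per_turn=1000, price_per_1k_input=0.01
)
print(f"turn 1: ${costs[0]:.2f}, turn 30: ${costs[-1]:.2f}, total: ${total:.2f}")
# prints "turn 1: $0.01, turn 30: $0.30, total: $4.65"
```

The last run alone looks cheap at $0.30, but the 30-turn conversation as a whole costs $4.65 under these assumptions, which is why a per-run view understates thread cost.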

Per-thread and per-user budget caps
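One way to enforce per-thread and per-user caps is a small ledger that is checked before each run is created and updated from the run's usage field after it finishes. The sketch below is a minimal illustration, not RunGuard's implementation: the class name, method names, and prices are all hypothetical, and the in-memory dictionaries stand in for the persistent storage a real per-user daily cap would need.

```python
import datetime
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a thread or user has no budget left for another run."""


class BudgetLedger:
    """Hypothetical in-memory ledger; production use would need a database."""

    def __init__(self, thread_cap_usd, user_daily_cap_usd,
                 input_price_per_1k, output_price_per_1k):
        self.thread_cap_usd = thread_cap_usd
        self.user_daily_cap_usd = user_daily_cap_usd
        self.input_price_per_1k = input_price_per_1k    # placeholder rate
        self.output_price_per_1k = output_price_per_1k  # placeholder rate
        self.thread_spend = defaultdict(float)  # keyed by thread_id
        self.user_spend = defaultdict(float)    # keyed by (user_id, day)

    @staticmethod
    def _day(day):
        return day or datetime.date.today().isoformat()

    def check(self, user_id, thread_id, day=None):
        """Call before creating a run; raises if either cap is already spent."""
        if self.thread_spend[thread_id] >= self.thread_cap_usd:
            raise BudgetExceeded(f"thread {thread_id} reached its cap")
        if self.user_spend[(user_id, self._day(day))] >= self.user_daily_cap_usd:
            raise BudgetExceeded(f"user {user_id} reached the daily cap")

    def record(self, user_id, thread_id, prompt_tokens, completion_tokens, day=None):
        """Call after a run finishes, with the token counts from run.usage."""
        cost = (prompt_tokens / 1000 * self.input_price_per_1k
                + completion_tokens / 1000 * self.output_price_per_1k)
        self.thread_spend[thread_id] += cost
        self.user_spend[(user_id, self._day(day))] += cost
        return cost


ledger = BudgetLedger(thread_cap_usd=0.05, user_daily_cap_usd=1.00,
                      input_price_per_1k=0.01, output_price_per_1k=0.03)
ledger.check("user-1", "thread-1", day="2024-01-01")  # passes: nothing spent yet
cost = ledger.record("user-1", "thread-1", prompt_tokens=4000,
                     completion_tokens=500, day="2024-01-01")
try:
    ledger.check("user-1", "thread-1", day="2024-01-01")
    blocked = False
except BudgetExceeded:
    blocked = True
print(f"run cost ${cost:.3f}, next run blocked: {blocked}")
```

Because the check happens before a run and the cost is only known afterwards, a single run can overshoot the cap. Pairing the ledger with per-run token limits keeps that overshoot bounded.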

RunGuard loop detection for Assistants tool-call loops
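The general idea behind loop detection in the requires_action cycle can be illustrated without any SDK: count how often the same tool is requested with the same arguments during one run, and stop once a threshold is passed. The class below is a hypothetical stand-in written for this explanation. It is not RunGuard's LoopDetector and makes no claim about how RunGuard detects loops.

```python
import json
from collections import Counter


class RepeatedToolCallDetector:
    """Flags a run when the same tool call (name + arguments) repeats too often."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def observe(self, tool_name, arguments_json):
        """Record one tool call; return True once it has exceeded max_repeats."""
        try:
            # Normalize so key order and whitespace do not hide a repeat.
            normalized = json.dumps(json.loads(arguments_json), sort_keys=True)
        except (TypeError, ValueError):
            # Arguments that are not valid JSON are compared as plain text.
            normalized = str(arguments_json)
        key = (tool_name, normalized)
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats


detector = RepeatedToolCallDetector(max_repeats=2)
calls = [
    ("get_weather", '{"city": "Paris"}'),
    ("get_weather", '{ "city":"Paris" }'),  # same call, different whitespace
    ("get_weather", '{"city": "Paris"}'),
]
flags = [detector.observe(name, args) for name, args in calls]
print(flags)  # prints [False, False, True]
```

In an Assistants polling loop, the inputs would come from each entry in run.required_action.submit_tool_outputs.tool_calls (its function.name and function.arguments). When observe returns True, the caller could cancel the run instead of submitting another tool output. Create one detector per run so counts do not leak between runs.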

Assistants API cost control comparison

| Control mechanism | What it prevents | Limitation |
| --- | --- | --- |
| max_prompt_tokens | Thread context blow-out on long conversations | Only caps input tokens per run; does not prevent multiple expensive runs |
| max_completion_tokens | Verbose output generation | Only caps output tokens per run; input cost grows unbounded as the thread accumulates |
| Per-run cost tracking (run.usage) | Post-hoc visibility into run cost | Reactive: the cost is already incurred before you see it |
| Per-user daily budget cap | A single user exhausting the monthly budget | Requires persistent storage across sessions; does not stop an in-flight run |
| RunGuard loop detection | Tool-call loops inside the requires_action cycle | Cancels the run on detection, so partial work is lost; resumption requires caller logic |

For the broader cost control patterns that Assistants-based agents need, see autonomous agent cost control best practices. For loop detection in non-Assistants OpenAI agents, see OpenAI Agents SDK loop guard. For the retry storm pattern that tool loops amplify, see AI agent retry storm prevention.

Add budget control to your OpenAI Assistants integration

RunGuard installs with one command per language: pip install runguard for Python, or npm install @runguard/sdk for TypeScript. For Assistants API agents, the highest-impact first step is adding max_prompt_tokens to every Run create call, which prevents thread blow-out without any other application code change. The second step is wrapping your requires_action polling loop with a LoopDetector, which prevents tool-call loops that would otherwise run until budget exhaustion. Both steps take under 10 minutes to add to an existing Assistants integration.
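As a sketch of the first step, the per-run limits can be collected in one place and passed to every Run create call. The helper name and the default values below are hypothetical. The keys max_prompt_tokens, max_completion_tokens, and truncation_strategy are parameters of the Assistants API run creation endpoint, and the right limits depend on your model and workload.

```python
def run_limits(max_prompt_tokens=8000, max_completion_tokens=1000, last_messages=10):
    """Build keyword arguments that cap token use for a single run.

    The defaults are illustrative placeholders, not recommended values.
    """
    return {
        # Upper bound on input tokens used across the whole run.
        "max_prompt_tokens": max_prompt_tokens,
        # Upper bound on output tokens generated across the whole run.
        "max_completion_tokens": max_completion_tokens,
        # Only the most recent messages of the thread are sent as context.
        "truncation_strategy": {"type": "last_messages", "last_messages": last_messages},
    }


limits = run_limits()
print(limits)

# With the official openai Python client, the limits would be passed like this
# (thread and assistant are objects you have already created):
#   client.beta.threads.runs.create(
#       thread_id=thread.id, assistant_id=assistant.id, **limits
#   )
```

A run that hits one of these limits should end with status incomplete rather than completed, so the calling code needs to handle that status explicitly.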

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
