OpenAI Assistants API budget control: cap spend per run, thread, and user

The OpenAI Assistants API has a cost structure that differs from the standard Chat Completions API in a way that makes budget management harder. With Chat Completions, you control the full message history you send on each call and can measure tokens precisely. With the Assistants API, OpenAI manages the thread state: each Run sends the accumulated thread messages plus any tool results to the model, and the thread grows longer with every turn. Unless you cap or truncate the context, your cost per Run therefore rises as the thread accumulates messages. In the worked example later in this guide, a turn that costs about $0.0035 on turn 1 costs about $0.025 on turn 30 because the thread history is re-sent each time.

The second cost problem is tool-call loops: when an assistant calls a tool that returns an error or unexpected result, it may attempt the same tool call multiple times in the same Run (through the “requires_action” cycle in the Runs API) or across Runs. Each iteration re-sends the growing thread context, so repeated calls multiply the input token cost.

This guide covers: the token cost math for thread accumulation, how to use max_prompt_tokens and truncation_strategy to prevent thread blow-out, per-run cost tracking with the usage field, per-thread and per-user budget caps, and RunGuard loop detection for assistant tool-call loops.

Thread token accumulation: the hidden cost multiplier

Why thread cost grows with every message. Each Run re-sends the thread history to the model. If your thread has 30 messages and each message averages 200 tokens, the input token count for turn 30 is at least 6,000 tokens just for history, plus the system prompt, plus the current user message. At GPT-4o pricing of $2.50 per million input tokens, that’s $0.015 per turn in context cost alone before any output is generated. Per-turn cost grows linearly with thread length, so cumulative cost grows roughly quadratically: at the same 200 tokens per message, turn 100 carries about 20,000 history tokens ($0.05 of input), and the 100 turns together re-send about 1 million input tokens, or roughly $2.50. Longer messages, tool outputs, and file content push these figures up proportionally.
Token accumulation math for GPT-4o (2026 pricing).
- GPT-4o input: $2.50/M tokens; output: $10.00/M tokens
- Turn 1 (system 500 + user 100 tokens): 600 input + ~200 output = $0.00350
- Turn 10 (10×300 avg history + 100 tokens): 3,100 input + ~200 output = $0.00975
- Turn 30 (30×300 history + 100 tokens): 9,100 input + ~200 output = $0.02475
- 30-turn conversation total (rough): ~$0.42 for moderate-length messages (about $0.36 input plus $0.06 output)
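
The per-turn figures above follow from a simple linear model, which can be sketched in a few lines. This is a minimal illustration, assuming the same numbers as the list (300 tokens of history per turn, a 100-token user message, about 200 output tokens per turn). The function names are hypothetical, and the model leaves out the 500-token system prompt counted in the turn 1 line.

```python
# Illustrative cost model for thread accumulation; prices match the list above.
GPT4O_IN_PER_TOK = 2.50 / 1_000_000
GPT4O_OUT_PER_TOK = 10.00 / 1_000_000


def turn_cost_usd(turn: int, history_per_turn: int = 300,
                  user_tokens: int = 100, output_tokens: int = 200) -> float:
    """Cost of one turn when the accumulated history is re-sent as input."""
    input_tokens = turn * history_per_turn + user_tokens
    return input_tokens * GPT4O_IN_PER_TOK + output_tokens * GPT4O_OUT_PER_TOK


def conversation_cost_usd(turns: int) -> float:
    """Sum of per-turn costs for turns 1..turns."""
    return sum(turn_cost_usd(t) for t in range(1, turns + 1))


if __name__ == "__main__":
    for t in (10, 30):
        print(f"turn {t}: ${turn_cost_usd(t):.5f}")
    print(f"30 turns total: ${conversation_cost_usd(30):.2f}")
```

Because each turn's input grows linearly, the conversation total grows roughly with the square of the turn count, which is why long threads dominate spend.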

Python: use max_prompt_tokens to cap thread context

from openai import OpenAI
from openai.types.beta.threads import Run

client = OpenAI()

def create_run_with_token_cap(
    thread_id: str,
    assistant_id: str,
    max_prompt_tokens: int = 8_000,  # cap thread context
    max_completion_tokens: int = 1_024,
) -> Run:
    """
    Create an Assistants Run with explicit token caps.
    max_prompt_tokens: input token budget across the whole run (thread + instructions + tools)
    max_completion_tokens: output token budget across the whole run
    The API makes a best effort to stay within these budgets; a run that
    exceeds one ends with status 'incomplete'. The 'auto' truncation strategy
    drops messages from the thread as needed to fit max_prompt_tokens.
    """
    return client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
        max_prompt_tokens=max_prompt_tokens,
        max_completion_tokens=max_completion_tokens,
        truncation_strategy={
            "type": "auto",
            # Alternatively: "last_messages" with last_messages=N to keep
            # only the N most recent messages regardless of token count
        },
    )

def estimate_run_cost_usd(run: Run) -> float:
    """Calculate USD cost from run.usage field."""
    if not run.usage:
        return 0.0
    GPT4O_IN_PER_TOK  = 2.50  / 1_000_000
    GPT4O_OUT_PER_TOK = 10.00 / 1_000_000
    return (run.usage.prompt_tokens * GPT4O_IN_PER_TOK +
            run.usage.completion_tokens * GPT4O_OUT_PER_TOK)

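Choosing max_prompt_tokens values. Set max_prompt_tokens to 2-3× your typical input context per turn. For a customer service assistant with short messages, 4,000 tokens is appropriate. For a coding assistant with long code blocks, 16,000–32,000 is reasonable. Keep in mind that the budget applies across the whole Run, so a Run that makes several tool calls spends it over multiple model calls, and a Run that exceeds it ends with status “incomplete” rather than “completed”. With the “auto” truncation strategy, the API drops messages to fit within max_prompt_tokens and the model's context length; the API reference describes this as dropping messages from the middle of the thread, which suits most conversational agents. Use “last_messages” if you need to control exactly how many recent messages are retained.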
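
The sizing rule can be turned into a small helper. The sketch below derives a cap from observed per-turn prompt sizes and builds the keyword arguments for a Run create call. The helper names, the 2.5× default multiplier, and the rounding to the next 1,000 tokens are assumptions for illustration; only the max_prompt_tokens, max_completion_tokens, and truncation_strategy parameters come from the code above.

```python
import math
from statistics import median
from typing import Any, Optional


def suggest_max_prompt_tokens(observed_prompt_tokens: list[int],
                              multiplier: float = 2.5) -> int:
    """Return multiplier x the median observed prompt size, rounded up to 1,000."""
    if not observed_prompt_tokens:
        raise ValueError("need at least one observation")
    target = median(observed_prompt_tokens) * multiplier
    return math.ceil(target / 1000) * 1000


def build_run_kwargs(max_prompt_tokens: int,
                     max_completion_tokens: int = 1_024,
                     keep_last_messages: Optional[int] = None) -> dict[str, Any]:
    """Build token-cap arguments; pass keep_last_messages to pin the history length."""
    strategy: dict[str, Any] = {"type": "auto"}
    if keep_last_messages is not None:
        strategy = {"type": "last_messages", "last_messages": keep_last_messages}
    return {
        "max_prompt_tokens": max_prompt_tokens,
        "max_completion_tokens": max_completion_tokens,
        "truncation_strategy": strategy,
    }


if __name__ == "__main__":
    cap = suggest_max_prompt_tokens([1200, 1500, 1800])
    print(build_run_kwargs(cap, keep_last_messages=10))
    # Hypothetical usage with the client from the example above:
    # client.beta.threads.runs.create(
    #     thread_id=thread_id, assistant_id=assistant_id, **build_run_kwargs(cap))
```

The prompt-size samples would typically come from run.usage.prompt_tokens on past Runs, so the cap tracks real traffic rather than a guess.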

Per-thread and per-user budget caps

Python: per-user budget tracker across Runs

import sqlite3
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

GPT4O_IN  = 2.50  / 1_000_000
GPT4O_OUT = 10.00 / 1_000_000

def init_budget_db(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS run_costs (
                run_id TEXT PRIMARY KEY,
                thread_id TEXT NOT NULL,
                user_id TEXT NOT NULL,
                prompt_tokens INTEGER,
                completion_tokens INTEGER,
                cost_usd REAL,
                created_at TEXT
            )
        """)

def record_run_cost(
    db_path: str,
    run,  # openai.types.beta.threads.Run
    user_id: str,
) -> float:
    """Record a completed run's cost and return the cost in USD."""
    if not run.usage:
        return 0.0
    cost = (run.usage.prompt_tokens * GPT4O_IN +
            run.usage.completion_tokens * GPT4O_OUT)
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "INSERT OR REPLACE INTO run_costs "
            "(run_id, thread_id, user_id, prompt_tokens, completion_tokens, cost_usd, created_at) "
            "VALUES (?,?,?,?,?,?,?)",
            (run.id, run.thread_id, user_id,
             run.usage.prompt_tokens, run.usage.completion_tokens,
             cost, datetime.now(timezone.utc).isoformat()),
        )
    return cost

def get_user_spend_today(db_path: str, user_id: str) -> float:
    """Return total USD spent by user_id today (UTC)."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT COALESCE(SUM(cost_usd),0) FROM run_costs "
            "WHERE user_id=? AND created_at LIKE ?",
            (user_id, f"{today}%"),
        ).fetchone()
        return row[0] if row else 0.0

def run_with_budget_check(
    thread_id: str,
    assistant_id: str,
    user_id: str,
    db_path: str,
    daily_cap_usd: float = 1.00,
) -> str:
    """
    Run the assistant only if the user is within their daily budget cap.
    Returns the assistant's text response or a budget-exceeded message.
    """
    # Pre-check: is user within daily budget?
    spent = get_user_spend_today(db_path, user_id)
    if spent >= daily_cap_usd:
        return (
            f"[BUDGET] Daily limit ${daily_cap_usd:.2f} reached "
            f"(used ${spent:.4f}). Try again tomorrow."
        )

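    run = create_run_with_token_cap(thread_id, assistant_id)
    # Poll until the run leaves its active states, pausing between polls
    # so the loop does not hammer the API.
    import time  # local import keeps this snippet self-contained
    while run.status in ("queued", "in_progress", "cancelling"):
        time.sleep(0.5)
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run.id
        )
    if run.status == "requires_action":
        # This helper does not execute tools. Cancel instead of waiting for
        # the run to expire; use a tool-aware loop for assistants with tools.
        client.beta.threads.runs.cancel(thread_id=thread_id, run_id=run.id)
        return "[ERROR] Run requested tool outputs, which this helper does not handle."
    cost = record_run_cost(db_path, run, user_id)
    if run.status != "completed":
        # failed, cancelled, expired, or incomplete (for example, a token cap was hit)
        return f"[ERROR] Run ended with status: {run.status}"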

    # Post-check: did this run push user over cap?
    if spent + cost > daily_cap_usd:
        return (
            f"[BUDGET] Daily limit ${daily_cap_usd:.2f} reached after this run "
            f"(used ${spent + cost:.4f}). Usage recorded."
        )

    messages = client.beta.threads.messages.list(thread_id=thread_id)
    return messages.data[0].content[0].text.value
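
The tracker above stores thread_id on every row but only enforces a per-user daily cap. A per-thread cap can be sketched against the same run_costs schema. The function names, the $0.50 default, and the demo rows are hypothetical; the demo uses an in-memory database so the example runs on its own.

```python
import sqlite3


def get_thread_spend(conn: sqlite3.Connection, thread_id: str) -> float:
    """Return the total USD recorded for one thread across all of its runs."""
    row = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM run_costs WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    return float(row[0])


def thread_within_cap(conn: sqlite3.Connection, thread_id: str,
                      thread_cap_usd: float = 0.50) -> bool:
    """True while the thread's recorded spend is still below its cap."""
    return get_thread_spend(conn, thread_id) < thread_cap_usd


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory run_costs table with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE run_costs (run_id TEXT PRIMARY KEY, thread_id TEXT NOT NULL, "
        "user_id TEXT NOT NULL, prompt_tokens INTEGER, completion_tokens INTEGER, "
        "cost_usd REAL, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO run_costs VALUES (?,?,?,?,?,?,?)",
        [
            ("run_1", "thread_a", "user_1", 9000, 200, 0.30, "2026-01-01T00:00:00+00:00"),
            ("run_2", "thread_a", "user_1", 5000, 200, 0.15, "2026-01-01T00:05:00+00:00"),
            ("run_3", "thread_b", "user_2", 7000, 200, 0.20, "2026-01-01T00:10:00+00:00"),
        ],
    )
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(get_thread_spend(db, "thread_a"), thread_within_cap(db, "thread_a"))
```

When a thread reaches its cap, one option is to start a fresh thread seeded with a short summary, which resets the accumulated context as well as the budget.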

RunGuard loop detection for Assistants tool-call loops

How tool-call loops manifest in the Assistants API. In the Assistants API, tool calls happen inside the “requires_action” polling loop: the Run status becomes requires_action, you submit tool outputs, and the Run continues. If the tool output causes the model to call the same tool again with the same arguments, the Run enters another requires_action state, and you submit outputs again. Left unchecked, this cycle can repeat until the Run expires or a token cap ends it. Each cycle re-bills the accumulated thread tokens plus the new tool call tokens.
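
To make the failure mode concrete, here is a standard-library sketch of the simplest form of detection: flag a tool call once the same name and arguments appear several times in a row. This is not RunGuard's implementation, and the class and function names are hypothetical. It only catches consecutive repeats, not longer cycles such as A, B, A, B.

```python
import hashlib
import json
from typing import Optional


def tool_call_signature(name: str, arguments_json: str) -> str:
    """Stable signature: tool name plus a hash of the canonicalized arguments."""
    try:
        # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
        canonical = json.dumps(json.loads(arguments_json), sort_keys=True)
    except json.JSONDecodeError:
        canonical = arguments_json  # fall back to the raw string
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{name}:{digest}"


class ConsecutiveRepeatDetector:
    """Counts how many times in a row the same signature has been recorded."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._last: Optional[str] = None
        self._count = 0

    def record(self, signature: str) -> bool:
        """Return True once a signature has been seen max_repeats times in a row."""
        if signature == self._last:
            self._count += 1
        else:
            self._last, self._count = signature, 1
        return self._count >= self.max_repeats


if __name__ == "__main__":
    detector = ConsecutiveRepeatDetector(max_repeats=3)
    for _ in range(3):
        looping = detector.record(tool_call_signature("get_weather", '{"city": "Oslo"}'))
    print("loop detected:", looping)
```

Hashing the full argument string avoids treating two calls as identical when they differ only after a fixed-length prefix.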

Python: RunGuard loop detection in the requires_action loop

from openai import OpenAI
from runguard import LoopDetector, LoopDetectedError

client = OpenAI()

def handle_run_with_loop_detection(
    thread_id: str,
    run_id: str,
) -> str:
    """
    Poll a Run to completion, using RunGuard to detect tool-call loops.
    Returns the final assistant message or a loop-detected error.
    """
    # One detector per run, so signatures from other runs or users
    # cannot trigger a false positive.
    detector = LoopDetector(repeats=3, max_cycle_len=3)
    while True:
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run_id
        )
        if run.status == "completed":
            messages = client.beta.threads.messages.list(thread_id=thread_id)
            return messages.data[0].content[0].text.value

        if run.status == "requires_action":
            tool_calls = run.required_action.submit_tool_outputs.tool_calls
            tool_outputs = []

            for tc in tool_calls:
                # Build a canonical signature for loop detection
                sig = f"{tc.function.name}:{tc.function.arguments[:100]}"
                match = detector.record(sig)
                if match:
                    # Cancel the run and return a safe error
                    client.beta.threads.runs.cancel(
                        thread_id=thread_id, run_id=run_id
                    )
                    return (
                        f"[LOOP] Tool '{tc.function.name}' called in a repeating "
                        f"pattern ({match.repeats}x). Run cancelled to prevent cost overrun."
                    )

                # Execute tool and collect output
                output = dispatch_tool(tc.function.name, tc.function.arguments)
                tool_outputs.append({"tool_call_id": tc.id, "output": output})

            # Submit all tool outputs to continue the run
            client.beta.threads.runs.submit_tool_outputs(
                thread_id=thread_id,
                run_id=run_id,
                tool_outputs=tool_outputs,
            )

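        # "incomplete" is also terminal (for example, a token cap was hit);
        # without it this loop would never exit for such runs.
        elif run.status in ("failed", "cancelled", "expired", "incomplete"):
            return f"[ERROR] Run ended with status: {run.status}"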

        # Brief polling delay
        import time; time.sleep(0.5)

def dispatch_tool(name: str, arguments_json: str) -> str:
    """Replace with real tool dispatch."""
    import json
    args = json.loads(arguments_json)
    return f"Tool {name} result: {args}"

Assistants API cost control comparison

Control mechanism	What it prevents	Limitation
`max_prompt_tokens`	Thread context blow-out on long conversations	Only caps input tokens per run; does not prevent multiple expensive runs
`max_completion_tokens`	Verbose output generation	Only caps output tokens per run; input cost grows unbounded as thread accumulates
Per-run cost tracking (`run.usage`)	Post-hoc visibility into run cost	Reactive — the cost is already incurred before you see it
Per-user daily budget cap	A single user exhausting monthly budget	Requires persistent storage across sessions; does not stop in-flight run
RunGuard loop detection	Tool-call loops inside the requires_action cycle	Cancels the run on detection — partial work lost; resumption requires caller logic
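
The table notes that run.usage is reactive and that a daily cap does not stop an in-flight run. One way to narrow that gap is to reserve a worst-case cost before creating the Run, using the token caps as upper bounds. The sketch below is a rough illustration with hypothetical function names. It assumes the Run stays within max_prompt_tokens and max_completion_tokens, which the API treats as best-effort budgets, so the result is an estimate rather than a guarantee.

```python
# Prices match the GPT-4o figures used throughout this guide.
GPT4O_IN_PER_TOK = 2.50 / 1_000_000
GPT4O_OUT_PER_TOK = 10.00 / 1_000_000


def worst_case_run_cost_usd(max_prompt_tokens: int,
                            max_completion_tokens: int) -> float:
    """Upper-bound estimate: every capped token is used and billed."""
    return (max_prompt_tokens * GPT4O_IN_PER_TOK
            + max_completion_tokens * GPT4O_OUT_PER_TOK)


def can_afford_run(spent_today_usd: float, daily_cap_usd: float,
                   max_prompt_tokens: int = 8_000,
                   max_completion_tokens: int = 1_024) -> bool:
    """True if the worst-case run still fits under the daily cap."""
    reserve = worst_case_run_cost_usd(max_prompt_tokens, max_completion_tokens)
    return spent_today_usd + reserve <= daily_cap_usd


if __name__ == "__main__":
    print(f"reserve per run: ${worst_case_run_cost_usd(8_000, 1_024):.5f}")
    print("run allowed:", can_afford_run(spent_today_usd=0.98, daily_cap_usd=1.00))
```

Checking the reserve before the Run, instead of only the spend so far, keeps a user who is a few cents under the cap from starting a Run that pushes them over it.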

For the broader cost control patterns that Assistants-based agents need, see autonomous agent cost control best practices. For loop detection in non-Assistants OpenAI agents, see OpenAI Agents SDK loop guard. For the retry storm pattern that tool loops amplify, see AI agent retry storm prevention.

Add budget control to your OpenAI Assistants integration

RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. For Assistants API agents, the highest-impact step is adding max_prompt_tokens to every Run create call (a one-parameter change that limits thread blow-out, provided your code handles Runs that end as incomplete) and then wrapping your requires_action polling loop with a LoopDetector (prevents tool-call loops that would otherwise run until budget exhaustion). Both take under 10 minutes to add to an existing Assistants integration.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
