OpenAI Assistants API Cost Control: Thread Accumulation, Run Polling Loops, and Tool Call Spirals
The OpenAI Assistants API is architecturally different from the standard Chat Completions API in ways that matter for cost. Chat Completions is stateless: you send a message list, get a response, pay for that exchange. The Assistants API is stateful: threads persist indefinitely on OpenAI's servers, accumulating every message ever sent. Each new run on an existing thread sends the full conversation history as context — a thread that's been active for three months can cost 60× more per run than a fresh one, invisibly.
This is distinct from the failure modes in the OpenAI Agents SDK cost control guide, which covers the newer Python-first SDK with its own handoff and guardrail abstractions. The Assistants API (also called the legacy Assistants API in newer OpenAI docs) uses a different object model: Assistants, Threads, Messages, Runs, and Run Steps. Each object has a distinct failure mode.
Four failure modes to understand:
- Thread history accumulation: every message on a thread is sent as context on every new run. Old threads inflate input token costs monotonically.
- Run polling retry spiral: asynchronous run status polling with naive retry logic re-submits failed runs to the same thread, compounding accumulated context with the failed run's partial output.
- Tool call step explosion inside a single run: the Assistants API allows unlimited tool call rounds within one run. An assistant that generates tool_calls requiring large outputs, then generates more tool_calls based on those outputs, can accumulate hundreds of thousands of tokens across run steps with no per-step ceiling.
- Repeated file vector store embedding: attaching files at the message level (rather than at the thread or assistant level) triggers a new vector store embedding operation for every message. Large file sets re-embedded on each message generate substantial costs independent of LLM calls.
Key distinction: The Assistants API bills for thread-level context, not just the current message. A 100-turn thread running gpt-4o at average 1,000 tokens per message accumulates 100,000 input tokens of context per run by turn 100 — roughly $0.25 per run at current pricing, before the model even sees your new question. At 100 runs per day that's $25/day on context alone.
Failure Mode 1: Thread History Accumulation
The Assistants API persists thread messages server-side. When you call client.beta.threads.runs.create(thread_id=thread_id, assistant_id=assistant_id), OpenAI automatically includes all prior thread messages as context. There is no explicit message list to trim the way you would in a Chat Completions call.
The naive usage pattern creates one thread per user session and keeps it indefinitely:
import openai

client = openai.OpenAI()

# Naive: create once, reuse forever
def chat(thread_id: str, user_message: str) -> str:
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=user_message
    )
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread_id,
        assistant_id=ASSISTANT_ID
    )
    messages = client.beta.threads.messages.list(thread_id=thread_id)
    return messages.data[0].content[0].text.value
After 50 turns this thread's context is 50 messages. After 200 turns it's 200 messages. The cost of each run grows linearly with conversation length because the model re-reads the entire history every time.
Cost measurement for a typical support-bot conversation at gpt-4o pricing ($2.50/1M input, $10/1M output):
| Thread age (turns) | Avg context tokens / run | Input cost / run | Cost ratio vs fresh |
|---|---|---|---|
| 1 (fresh) | 800 | $0.002 | 1× |
| 20 turns | 17,600 | $0.044 | 22× |
| 50 turns | 43,200 | $0.108 | 54× |
| 100 turns | 86,000 | $0.215 | 107× |
| 200 turns | 171,500 | $0.429 | 214× |
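The input-cost column follows directly from the per-token price. A minimal sketch of that arithmetic, assuming a flat average number of tokens added per turn; `input_cost_usd` and `cumulative_input_tokens` are hypothetical helper names, and the 860 tokens/turn figure is an illustrative assumption close to the table's averages:

```python
# Hypothetical helpers for estimating Assistants API input cost.
GPT_4O_INPUT_USD_PER_MILLION = 2.50  # price quoted in the text; verify against current pricing


def input_cost_usd(context_tokens: int,
                   usd_per_million: float = GPT_4O_INPUT_USD_PER_MILLION) -> float:
    """Input cost of one run that re-reads `context_tokens` tokens."""
    return context_tokens * usd_per_million / 1_000_000


def cumulative_input_tokens(turns: int, avg_tokens_per_turn: int) -> int:
    """Total input tokens billed across `turns` runs on one thread.

    Run k re-reads roughly k turns of history, so the total is
    avg * (1 + 2 + ... + turns) = avg * turns * (turns + 1) / 2.
    """
    return avg_tokens_per_turn * turns * (turns + 1) // 2


if __name__ == "__main__":
    table_rows = [(1, 800), (20, 17_600), (50, 43_200), (100, 86_000), (200, 171_500)]
    for turns, context in table_rows:
        print(f"{turns:>3} turns: ${input_cost_usd(context):.3f} input per run")
    total = cumulative_input_tokens(100, 860)
    print(f"100-turn thread, all runs combined: {total:,} input tokens")
```

Per-run cost grows linearly with thread length, but the total spent over the life of a thread grows quadratically, which is why a small number of old threads can dominate the bill.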
The fix is a thread rotation strategy: cap threads at a maximum turn count, then create a new thread with a trimmed summary of the prior context injected as its first message. Thread messages accept only the user and assistant roles, so the summary goes in as a user message. The RunGuard ThreadGuard class handles this transparently:
import openai
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ThreadGuard:
    client: openai.OpenAI
    assistant_id: str
    max_turns: int = 30                # rotate thread after this many turns
    max_tokens_per_run: int = 40_000   # hard ceiling on context tokens
    session_tokens_used: int = field(default=0, init=False)
    run_count: int = field(default=0, init=False)
    _thread_id: Optional[str] = field(default=None, init=False)
    _turn_count: int = field(default=0, init=False)

    def _get_or_create_thread(self, context_summary: Optional[str] = None) -> str:
        if self._thread_id is None:
            thread = self.client.beta.threads.create()
            self._thread_id = thread.id
            self._turn_count = 0
            if context_summary:
                self.client.beta.threads.messages.create(
                    thread_id=self._thread_id,
                    role="user",
                    content=f"[Context from prior conversation]: {context_summary}"
                )
        return self._thread_id

    def _rotate_if_needed(self, last_response: Optional[str] = None) -> None:
        if self._turn_count >= self.max_turns:
            summary = f"Previous conversation summary (last response): {last_response[:500]}" \
                if last_response else "Conversation context rotated due to length."
            old_thread_id = self._thread_id
            self._thread_id = None
            self._get_or_create_thread(context_summary=summary)
            # optional: delete old thread to avoid storage costs
            try:
                self.client.beta.threads.delete(old_thread_id)
            except Exception:
                pass

    def chat(self, user_message: str) -> str:
        thread_id = self._get_or_create_thread()
        # Check token ceiling before submitting
        # Estimate: average 800 tokens/turn * current turns
        estimated_context = self._turn_count * 800
        if estimated_context > self.max_tokens_per_run:
            raise RuntimeError(
                f"ThreadGuard: estimated context {estimated_context} tokens "
                f"exceeds ceiling {self.max_tokens_per_run}. Rotate thread first."
            )
        self.client.beta.threads.messages.create(
            thread_id=thread_id,
            role="user",
            content=user_message
        )
        run = self.client.beta.threads.runs.create_and_poll(
            thread_id=thread_id,
            assistant_id=self.assistant_id,
            timeout=120  # per-request HTTP timeout in seconds, not a deadline for the run
        )
        if run.status != "completed":
            raise RuntimeError(f"Run ended with status: {run.status}")
        usage = run.usage
        if usage:
            self.session_tokens_used += usage.total_tokens
        # messages.list returns newest first, so data[0] is the latest reply
        messages = self.client.beta.threads.messages.list(thread_id=thread_id)
        response_text = messages.data[0].content[0].text.value
        self._turn_count += 1
        self.run_count += 1
        self._rotate_if_needed(last_response=response_text)
        return response_text

    def summary(self) -> dict:
        return {
            "run_count": self.run_count,
            "current_thread_id": self._thread_id,
            "current_turn_count": self._turn_count,
            "session_tokens_used": self.session_tokens_used,
        }
Set max_turns=30 for general chat, lower for code-review or research tasks where each turn carries large tool outputs. The summary injection preserves continuity without carrying the full 200-message history forward.
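ThreadGuard estimates context at a flat 800 tokens per turn. One possible refinement is to rotate on the measured size of the last run instead. The sketch below is an illustration under stated assumptions, not part of RunGuard: `RotationPolicy` and `should_rotate` are hypothetical names, and `last_prompt_tokens` is assumed to be read from `run.usage.prompt_tokens` on the most recent completed run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RotationPolicy:
    """Hypothetical policy: rotate on turn count OR measured context size."""
    max_turns: int = 30
    max_prompt_tokens: int = 40_000


def should_rotate(policy: RotationPolicy,
                  turn_count: int,
                  last_prompt_tokens: Optional[int]) -> bool:
    """Return True when the next run should start on a fresh thread.

    `last_prompt_tokens` is assumed to come from run.usage.prompt_tokens;
    None means usage was not reported for the last run.
    """
    if turn_count >= policy.max_turns:
        return True
    if last_prompt_tokens is not None and last_prompt_tokens >= policy.max_prompt_tokens:
        return True
    return False


if __name__ == "__main__":
    policy = RotationPolicy()
    print(should_rotate(policy, turn_count=12, last_prompt_tokens=46_500))
```

For runs with several tool call rounds the reported prompt tokens may be summed across steps, so this trigger tends to fire early rather than late, which is the safer direction for a cost guard.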
Failure Mode 2: Run Polling Retry Spiral
Runs in the Assistants API are asynchronous. You create a run and poll for completed, failed, expired, or requires_action status. The built-in create_and_poll helper abstracts this, but many production implementations write their own polling loop and add retry logic around run failures — creating a spiral.
The dangerous pattern: a run ends as expired (runs expire roughly ten minutes after creation, for example when tool outputs are never submitted) or as failed with a last_error code of rate_limit_exceeded, and the application immediately creates a new run on the same thread. This is correct in principle but catastrophic in practice when the failure cause is still present. A model stuck on a malformed tool call schema will fail every run identically, generating a new run charge plus the full thread context cost on each attempt:
import time

# DANGEROUS: naive retry on run failure
def run_with_retry(client, thread_id, assistant_id, max_attempts=5):
    for attempt in range(max_attempts):
        run = client.beta.threads.runs.create(
            thread_id=thread_id,
            assistant_id=assistant_id
        )
        while run.status in ("queued", "in_progress"):
            time.sleep(1)
            run = client.beta.threads.runs.retrieve(
                thread_id=thread_id,
                run_id=run.id
            )
        if run.status == "completed":
            return run
        # BUG: creates a new run immediately on the same thread, with the same broken tool call
        print(f"Run {run.id} failed with {run.status}, retrying...")
    raise RuntimeError("Max attempts reached")
Each retry on a 50-turn thread pays full context cost again. At 5 retries on a 50-turn thread at gpt-4o: 5 × 43,200 tokens × $2.50/1M = $0.54 per failed interaction — before any successful output is produced.
The correct pattern introduces failure classification before deciding to retry. Only transient errors (rate limits, server errors) warrant immediate retry. Structural failures (tool schema errors, content policy, expired with malformed tool output) require inspection before retry — and the thread context may need cleaning:
from enum import Enum
import time

class RunFailureClass(Enum):
    TRANSIENT = "transient"    # safe to retry after backoff
    STRUCTURAL = "structural"  # inspect tool calls before retry
    FATAL = "fatal"            # do not retry

# Rate limits and server errors are not run statuses. They surface as
# status "failed" with the cause in run.last_error.code.
TRANSIENT_ERROR_CODES = {"rate_limit_exceeded", "server_error"}
STRUCTURAL_STATUSES = {"expired", "requires_action"}

def classify_run_failure(run) -> RunFailureClass:
    error_code = getattr(getattr(run, "last_error", None), "code", None)
    if run.status == "failed" and error_code in TRANSIENT_ERROR_CODES:
        return RunFailureClass.TRANSIENT
    if run.status in STRUCTURAL_STATUSES:
        return RunFailureClass.STRUCTURAL
    # Everything else (other failures, cancelled, incomplete) is not retried
    return RunFailureClass.FATAL

def safe_run_with_retry(
    client,
    thread_id: str,
    assistant_id: str,
    max_transient_retries: int = 3,
    poll_interval: float = 2.0,
    run_timeout: int = 120,
) -> object:
    transient_attempts = 0
    while True:
        run = client.beta.threads.runs.create(
            thread_id=thread_id,
            assistant_id=assistant_id,
            timeout=run_timeout  # HTTP request timeout; the polling deadline is below
        )
        # Poll for terminal status
        deadline = time.time() + run_timeout
        while run.status in ("queued", "in_progress") and time.time() < deadline:
            time.sleep(poll_interval)
            run = client.beta.threads.runs.retrieve(
                thread_id=thread_id, run_id=run.id
            )
        if run.status == "completed":
            return run
        failure_class = classify_run_failure(run)
        if failure_class == RunFailureClass.TRANSIENT and transient_attempts < max_transient_retries:
            transient_attempts += 1
            backoff = 2 ** transient_attempts
            time.sleep(backoff)
            continue
        if failure_class == RunFailureClass.STRUCTURAL:
            # Inspect the run steps for function tool calls before deciding
            steps = client.beta.threads.runs.steps.list(
                thread_id=thread_id, run_id=run.id
            )
            tool_errors = [
                s for s in steps.data
                if s.type == "tool_calls" and
                any(tc.type == "function" for tc in (s.step_details.tool_calls or []))
            ]
            raise RuntimeError(
                f"Run {run.id} ended with structural failure '{run.status}'. "
                f"Tool call steps: {len(tool_errors)}. Inspect thread before retrying."
            )
        raise RuntimeError(
            f"Run {run.id} ended with unrecoverable status '{run.status}': "
            f"{getattr(run, 'last_error', None)}"
        )
The key change is the STRUCTURAL classification: when a run expires or gets stuck in requires_action, the caller must inspect the run steps to understand the tool call state before creating a new run. Blind retry on structural failures multiplies context cost without diagnostic value.
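For the transient class, the retry delay itself is worth bounding. The sketch below swaps in a common variation on the fixed 2 ** attempt delay used above: capped exponential backoff with full jitter, so that many sessions hitting the same rate limit do not retry in lockstep. `backoff_delay` and its defaults are hypothetical and are not taken from RunGuard or the OpenAI SDK.

```python
import random
from typing import Optional


def backoff_delay(attempt: int,
                  base: float = 2.0,
                  cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before transient retry number `attempt` (1-based).

    Capped exponential backoff with full jitter: a uniform draw from
    [0, min(cap, base ** attempt)].
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    source = rng if rng is not None else random
    ceiling = min(cap, base ** attempt)
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    seeded = random.Random(42)  # seeded only to make the demo repeatable
    for attempt in range(1, 6):
        print(attempt, round(backoff_delay(attempt, rng=seeded), 2))
```

The cap matters as much as the jitter: without it, a retry budget raised from 3 to 8 quietly turns into multi-minute sleeps inside a request handler.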
Failure Mode 3: Tool Call Step Explosion Inside a Single Run
Within a single run, the Assistants API executes a run step loop: the model generates a response that may include tool_calls; the application submits tool outputs; the model generates another response; and so on until the model produces a message step with no tool calls. There is no default ceiling on the number of steps within a run.
An assistant configured with multiple tools and instructed to "be thorough" will chain tool calls aggressively. Each tool call generates a tool_calls step (input tokens = full context + function schema + tool call JSON), then a tool_output step (additional context = tool response), then another model inference cycle reading the accumulated context. A run with 10 tool call steps on a 30-turn thread can consume 10× the expected input tokens:
# Tool call step cost model for a 30-turn thread
base_context_tokens = 26_000  # 30 turns × 866 avg tokens/turn
tool_schema_tokens = 1_200    # 4 tools × 300 tokens each
tool_output_tokens = 2_000    # avg per tool call
# Each inference re-reads the thread, the tool schemas, and every
# tool output produced so far, so input grows with each round:
step_1_input = base_context_tokens + tool_schema_tokens + tool_output_tokens * 0  # 27,200 tokens
step_2_input = base_context_tokens + tool_schema_tokens + tool_output_tokens * 1  # 29,200 tokens
step_3_input = base_context_tokens + tool_schema_tokens + tool_output_tokens * 2  # 31,200 tokens
step_4_input = base_context_tokens + tool_schema_tokens + tool_output_tokens * 3  # 33,200 tokens
step_5_input = base_context_tokens + tool_schema_tokens + tool_output_tokens * 4  # 35,200 tokens
# Total input at 5 tool call rounds: 156,000 tokens = $0.39 input alone
# At 10 tool call rounds: 362,000 tokens = $0.91 input alone
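The same model generalizes to any number of rounds. A minimal sketch, assuming every round re-reads the thread, the tool schemas, and all earlier tool outputs at the average sizes used above; `run_input_tokens` and `run_input_cost_usd` are hypothetical helper names:

```python
def run_input_tokens(rounds: int,
                     base_context: int = 26_000,
                     schema: int = 1_200,
                     output_per_call: int = 2_000) -> int:
    """Total input tokens across `rounds` model inferences in one run.

    Inference k (0-based) re-reads the thread, the tool schemas, and the
    k tool outputs produced so far. Closed form:
    rounds * (base + schema) + output_per_call * rounds * (rounds - 1) / 2
    """
    return sum(base_context + schema + output_per_call * k for k in range(rounds))


def run_input_cost_usd(tokens: int, usd_per_million: float = 2.50) -> float:
    """Input cost at the gpt-4o price quoted in the text."""
    return tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    for rounds in (1, 3, 5, 8):
        tokens = run_input_tokens(rounds)
        print(f"{rounds} rounds: {tokens:,} tokens, ${run_input_cost_usd(tokens):.2f}")
```

The quadratic term is small at 5 rounds but grows quickly, so the gap between "expected" and "runaway" runs widens faster than the round count suggests.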
The Assistants API does not expose a max_tool_call_rounds parameter. You enforce it by monitoring run steps during polling and cancelling the run if the step count exceeds your threshold:
import openai
import time

def run_with_step_ceiling(
    client: openai.OpenAI,
    thread_id: str,
    assistant_id: str,
    max_tool_call_steps: int = 5,
    poll_interval: float = 2.0,
    run_timeout: int = 180,
) -> object:
    run = client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id
    )
    tool_call_step_count = 0
    deadline = time.time() + run_timeout
    while run.status in ("queued", "in_progress", "requires_action") and time.time() < deadline:
        time.sleep(poll_interval)
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run.id
        )
        if run.status == "requires_action":
            # Count tool call steps accumulated so far
            steps = client.beta.threads.runs.steps.list(
                thread_id=thread_id, run_id=run.id
            )
            tool_call_step_count = sum(
                1 for s in steps.data if s.type == "tool_calls"
            )
            if tool_call_step_count >= max_tool_call_steps:
                # Cancel the run: this is the circuit breaker
                client.beta.threads.runs.cancel(
                    thread_id=thread_id, run_id=run.id
                )
                raise RuntimeError(
                    f"RunGuard: tool call step ceiling hit ({tool_call_step_count} steps). "
                    f"Run {run.id} cancelled. Inspect the assistant's tool selection policy."
                )
            # Process legitimate tool calls and submit outputs
            tool_outputs = handle_tool_calls(
                run.required_action.submit_tool_outputs.tool_calls
            )
            run = client.beta.threads.runs.submit_tool_outputs_and_poll(
                thread_id=thread_id,
                run_id=run.id,
                tool_outputs=tool_outputs
            )
    if run.status != "completed":
        raise RuntimeError(f"Run ended with status: {run.status}")
    return run

def handle_tool_calls(tool_calls) -> list:
    outputs = []
    for tc in tool_calls:
        # dispatch_tool is your application's tool router (not defined here);
        # tc.function.arguments is a JSON-encoded string
        result = dispatch_tool(tc.function.name, tc.function.arguments)
        outputs.append({"tool_call_id": tc.id, "output": str(result)})
    return outputs
Set max_tool_call_steps=5 for general research assistants. Reduce to 3 for customer support bots where tool chaining beyond 3 rounds usually indicates a confused model state, not legitimate multi-step reasoning. Increase to 8 only for explicitly designed multi-step research agents where you've measured the typical step count and budgeted accordingly.
Cancel semantics: Cancelling a run mid-execution leaves the thread in a recoverable state. The thread messages up to the point of cancellation are preserved; the cancelled run's partial tool call steps are recorded but not appended to the visible conversation. You can create a new run after cancellation — but be aware the cancelled run's token consumption is still billed.
Failure Mode 4: Repeated File Vector Store Embedding
The Assistants API's file search tool uses vector stores: files are chunked, embedded, and stored for semantic retrieval during runs. The embedding operation is priced separately ($0.10/1M tokens at the time of writing). There are three levels at which files can be attached: the assistant level (attached once, shared across all threads), the thread level (attached when the thread is created, persists for all runs on that thread), and the message level (attached to a specific message).
The dangerous pattern attaches the same files at the message level on every turn:
# EXPENSIVE: re-attaches files on every message
def chat_with_files(client, thread_id, user_message, file_ids: list[str]):
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=user_message,
        attachments=[
            {"file_id": fid, "tools": [{"type": "file_search"}]}
            for fid in file_ids
        ]
    )
    # Each message attachment creates a new vector store embedding operation
    # for each file_id listed, even if the file was already embedded last turn
If file_ids contains 10 PDF documents averaging 50 pages each (roughly 25,000 tokens/file), and you call this on every turn for 20 turns, you pay for 200 embedding operations on the same files: 10 files × 25,000 tokens × 20 turns × $0.10/1M = $0.50 in embedding costs, before any LLM inference. That sounds small until you have 100 concurrent users doing the same thing.
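The arithmetic above can be packaged as a small estimator. This is a sketch that takes the text's premise at face value (each message-level attachment triggers a fresh embedding pass at the quoted per-token price); `embedding_cost_usd` is a hypothetical helper, and the price constant should be checked against current pricing:

```python
EMBEDDING_USD_PER_MILLION = 0.10  # figure quoted in the text; an assumption here


def embedding_cost_usd(files: int,
                       tokens_per_file: int,
                       embed_operations: int,
                       usd_per_million: float = EMBEDDING_USD_PER_MILLION) -> float:
    """Cost of embedding `files` documents `embed_operations` times each."""
    total_tokens = files * tokens_per_file * embed_operations
    return total_tokens * usd_per_million / 1_000_000


# Per-message attachment: every one of 20 turns re-embeds all 10 files.
per_message = embedding_cost_usd(10, 25_000, embed_operations=20)
# Thread-level attachment: the same files are embedded once.
per_thread = embedding_cost_usd(10, 25_000, embed_operations=1)

if __name__ == "__main__":
    print(f"per-message: ${per_message:.3f}  per-thread: ${per_thread:.3f}")
    print(f"100 users, per-message: ${per_message * 100:.2f}")
```

Under these assumptions the per-message pattern costs as many times more as there are turns in the conversation, regardless of file size.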
The fix is to attach files at the broadest appropriate level (thread or assistant rather than message) and reuse vector store IDs:
import time
from functools import lru_cache

import openai

@lru_cache(maxsize=128)
def get_or_create_vector_store(client: openai.OpenAI, file_ids_key: str) -> str:
    """Create a vector store for a comma-joined string of sorted file IDs and cache the store ID."""
    file_ids = file_ids_key.split(",")
    vector_store = client.beta.vector_stores.create(
        name=f"session-{hash(file_ids_key) % 10000}",
        file_ids=file_ids
    )
    # Poll until ready
    while vector_store.status == "in_progress":
        time.sleep(1)
        vector_store = client.beta.vector_stores.retrieve(vector_store.id)
    if vector_store.status != "completed":
        raise RuntimeError(f"Vector store creation failed: {vector_store.status}")
    return vector_store.id

def create_thread_with_files(
    client: openai.OpenAI,
    file_ids: list[str]
) -> str:
    """Attach files once at thread level, with no per-message embedding cost."""
    file_ids_key = ",".join(sorted(file_ids))
    vector_store_id = get_or_create_vector_store(client, file_ids_key)
    thread = client.beta.threads.create(
        tool_resources={
            "file_search": {
                "vector_store_ids": [vector_store_id]
            }
        }
    )
    return thread.id

# Usage: create the thread once with files, then chat without re-attaching
thread_id = create_thread_with_files(client, file_ids=["file-abc", "file-def"])
for turn in range(20):
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content="What does section 3.2 say about data retention?"
        # No 'attachments' key: files are already on the thread
    )
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread_id,
        assistant_id=ASSISTANT_ID
    )
The vector store is created once, cached under a key built from the sorted, comma-joined file IDs, and attached at thread creation. All 20 turns share the same vector store with zero additional embedding cost. If the same set of files is used across multiple sessions (e.g., a company knowledge base), attach the vector store at the assistant level instead of the thread level. It is then embedded exactly once per assistant configuration change.
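One caveat on the cache key: Python salts the built-in hash() of strings per process, so a name derived from it changes between restarts, and an in-process lru_cache is lost on restart as well. A minimal sketch of a stable, order-independent key that could name the vector store or index a persistent cache; `vector_store_key` and the `vs-` prefix are hypothetical, not an OpenAI convention:

```python
import hashlib
from typing import Iterable


def vector_store_key(file_ids: Iterable[str]) -> str:
    """Deterministic, order-independent key for a set of file IDs."""
    unique_sorted = sorted(set(file_ids))
    digest = hashlib.sha256(",".join(unique_sorted).encode("utf-8")).hexdigest()
    # 12 hex characters keep the name short while making collisions unlikely
    return f"vs-{digest[:12]}"


if __name__ == "__main__":
    a = vector_store_key(["file-def", "file-abc"])
    b = vector_store_key(["file-abc", "file-def", "file-abc"])
    print(a, a == b)
```

Storing the mapping from this key to the real vector store ID in a database or key-value store lets separate workers and restarted processes find an existing store instead of creating a duplicate.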
Composite Guard: AssistantGuard
All four failure modes compound each other in production. A long-running thread (mode 1) that hits a run polling spiral (mode 2) on a tool-heavy assistant (mode 3) with per-message file attachments (mode 4) will produce a bill that is 100–500× what the interaction should cost. The AssistantGuard class wraps all four protections:
import openai
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AssistantGuard:
    client: openai.OpenAI
    assistant_id: str
    max_turns: int = 30
    max_tool_call_steps: int = 5
    max_transient_retries: int = 3
    session_token_ceiling: int = 500_000
    poll_interval: float = 2.0
    _thread_id: Optional[str] = field(default=None, init=False)
    _turn_count: int = field(default=0, init=False)
    _vector_store_id: Optional[str] = field(default=None, init=False)
    session_tokens_used: int = field(default=0, init=False)
    run_count: int = field(default=0, init=False)
    cancelled_runs: int = field(default=0, init=False)

    def attach_files(self, file_ids: list[str]) -> None:
        """Set up vector store once, reused for all threads in this session."""
        file_ids_key = ",".join(sorted(file_ids))
        vs = self.client.beta.vector_stores.create(
            name=f"guard-{hash(file_ids_key) % 9999}",
            file_ids=file_ids
        )
        while vs.status == "in_progress":
            time.sleep(1)
            vs = self.client.beta.vector_stores.retrieve(vs.id)
        self._vector_store_id = vs.id

    def _ensure_thread(self, context_summary: Optional[str] = None) -> str:
        if self._thread_id is None:
            kwargs = {}
            if self._vector_store_id:
                kwargs["tool_resources"] = {
                    "file_search": {"vector_store_ids": [self._vector_store_id]}
                }
            thread = self.client.beta.threads.create(**kwargs)
            self._thread_id = thread.id
            self._turn_count = 0
            if context_summary:
                self.client.beta.threads.messages.create(
                    thread_id=self._thread_id,
                    role="user",
                    content=f"[Prior context]: {context_summary}"
                )
        return self._thread_id

    def _rotate_thread(self, last_response: Optional[str] = None) -> None:
        old_id = self._thread_id
        self._thread_id = None
        summary = f"Last assistant response: {last_response[:500]}" if last_response else "Thread rotated."
        self._ensure_thread(context_summary=summary)
        try:
            self.client.beta.threads.delete(old_id)
        except Exception:
            pass

    def _run_with_guards(self, thread_id: str) -> object:
        transient_attempts = 0
        while True:
            run = self.client.beta.threads.runs.create(
                thread_id=thread_id,
                assistant_id=self.assistant_id
            )
            tool_call_step_count = 0
            deadline = time.time() + 180
            while run.status in ("queued", "in_progress", "requires_action") \
                    and time.time() < deadline:
                time.sleep(self.poll_interval)
                run = self.client.beta.threads.runs.retrieve(
                    thread_id=thread_id, run_id=run.id
                )
                if run.status == "requires_action":
                    steps = self.client.beta.threads.runs.steps.list(
                        thread_id=thread_id, run_id=run.id
                    )
                    tool_call_step_count = sum(
                        1 for s in steps.data if s.type == "tool_calls"
                    )
                    if tool_call_step_count >= self.max_tool_call_steps:
                        self.client.beta.threads.runs.cancel(
                            thread_id=thread_id, run_id=run.id
                        )
                        self.cancelled_runs += 1
                        raise RuntimeError(
                            f"AssistantGuard: tool call step ceiling "
                            f"({self.max_tool_call_steps}) hit. Run cancelled."
                        )
                    tool_outputs = self._dispatch_tools(
                        run.required_action.submit_tool_outputs.tool_calls
                    )
                    run = self.client.beta.threads.runs.submit_tool_outputs_and_poll(
                        thread_id=thread_id, run_id=run.id,
                        tool_outputs=tool_outputs
                    )
            if run.status == "completed":
                if run.usage:
                    self.session_tokens_used += run.usage.total_tokens
                if self.session_tokens_used > self.session_token_ceiling:
                    raise RuntimeError(
                        f"AssistantGuard: session token ceiling "
                        f"({self.session_token_ceiling}) exceeded."
                    )
                return run
            # Classify failure: rate limits and server errors arrive as
            # status "failed" with the cause in run.last_error.code
            error_code = getattr(run.last_error, "code", None)
            if run.status == "failed" \
                    and error_code in ("rate_limit_exceeded", "server_error") \
                    and transient_attempts < self.max_transient_retries:
                transient_attempts += 1
                time.sleep(2 ** transient_attempts)
                continue
            raise RuntimeError(
                f"AssistantGuard: run ended with '{run.status}' "
                f"after {transient_attempts} transient retries."
            )

    def _dispatch_tools(self, tool_calls) -> list:
        # Placeholder: returns an empty result for every tool call
        return [
            {"tool_call_id": tc.id, "output": "[]"}
            for tc in tool_calls
        ]

    def chat(self, user_message: str) -> str:
        thread_id = self._ensure_thread()
        self.client.beta.threads.messages.create(
            thread_id=thread_id,
            role="user",
            content=user_message
        )
        run = self._run_with_guards(thread_id)
        messages = self.client.beta.threads.messages.list(thread_id=thread_id)
        response_text = messages.data[0].content[0].text.value
        self._turn_count += 1
        self.run_count += 1
        if self._turn_count >= self.max_turns:
            self._rotate_thread(last_response=response_text)
        return response_text

    def summary(self) -> dict:
        return {
            "run_count": self.run_count,
            "cancelled_runs": self.cancelled_runs,
            "current_turn_count": self._turn_count,
            "session_tokens_used": self.session_tokens_used,
            "vector_store_id": self._vector_store_id,
        }
Override _dispatch_tools with your application's actual tool dispatch logic. The summary() method feeds directly into RunGuard SDK telemetry for per-session cost tracking.
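As an illustration of what such an override might look like, here is a sketch of a registry-based dispatcher. It is one possible design, not RunGuard's implementation: `dispatch_tools`, `ToolRegistry`, and the `{"ok": ...}` payload shape are hypothetical, and the tool call objects are assumed to expose `.id`, `.function.name`, and `.function.arguments` (a JSON string), as in the code above.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

ToolRegistry = Dict[str, Callable[..., Any]]


def dispatch_tools(tool_calls, registry: ToolRegistry) -> List[dict]:
    """Run each requested function tool and format outputs for submission."""
    outputs = []
    for tc in tool_calls:
        try:
            handler = registry.get(tc.function.name)
            if handler is None:
                raise ValueError(f"unknown tool: {tc.function.name}")
            kwargs = json.loads(tc.function.arguments or "{}")
            payload = json.dumps({"ok": True, "result": handler(**kwargs)})
        except Exception as exc:
            # Report the error to the model instead of crashing the run loop
            payload = json.dumps({"ok": False, "error": str(exc)})
        outputs.append({"tool_call_id": tc.id, "output": payload})
    return outputs


# Stand-in objects with the assumed shape of SDK tool calls, for a dry run.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="add", arguments='{"a": 2, "b": 3}'),
)
unknown_call = SimpleNamespace(
    id="call_2",
    function=SimpleNamespace(name="missing", arguments="{}"),
)
outputs = dispatch_tools([fake_call, unknown_call], {"add": lambda a, b: a + b})

if __name__ == "__main__":
    for item in outputs:
        print(item)
```

Returning a structured error rather than raising keeps a single bad tool call from aborting the run, while the step ceiling still bounds how many times the model can retry it.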
Comparison: Assistants API vs Chat Completions API Cost Profile
| Dimension | Chat Completions | Assistants API (unguarded) | Assistants API (AssistantGuard) |
|---|---|---|---|
| Context management | Caller controls message list | Full thread history always included | Thread rotated at max_turns |
| Cost scaling | Linear with context length you send | Linear with total thread history | Capped at max_turns × avg_tokens |
| Tool call ceiling | Caller controls via function_call logic | No ceiling — unlimited steps per run | max_tool_call_steps enforced mid-run |
| Run retry safety | N/A (synchronous) | Blind retry on failure — re-pays context | Failure classified; structural faults raise |
| File embedding | N/A (handled in your RAG pipeline) | Per-message attachment re-embeds files | Vector store attached at thread level once |
| Observability | usage field on every response | usage field only once a run reaches a terminal state | Aggregated session_tokens_used + cancelled_runs |
Production Checklist
- Thread rotation at ≤30 turns: audit your thread creation code; look for code paths that create one thread per user and reuse it indefinitely. The sign is a thread_id stored in a database without an expiry field.
- Never retry a run without classifying the failure: add run.status and run.last_error to every error log. If you see expired or requires_action in failures, check your tool submission loop before adding more retries.
- Set max_tool_call_steps based on your measured p99: instrument a week of production runs to measure actual step count distribution. Set the ceiling at p99 + 2. A well-designed assistant should have a narrow step count distribution.
- Attach files at the assistant level for shared knowledge bases: anything that doesn't change per user (product documentation, legal contracts, code bases) belongs on the assistant, not the thread or message. File vector stores attached to assistants are embedded once and billed once per file update.
- Delete unused threads and their vector stores: vector store storage is billed by size for as long as the store exists (per GB per day on OpenAI's pricing page at the time of writing). Threads with file search attachments continue to accumulate storage costs if the vector store isn't deleted when the session ends.
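The "p99 + 2" rule from the checklist can be computed from logged step counts. A minimal sketch using the nearest-rank percentile method; `step_ceiling` is a hypothetical helper, and the sample data is invented purely to exercise it:

```python
import math
from typing import Sequence


def step_ceiling(step_counts: Sequence[int],
                 percentile: float = 99.0,
                 headroom: int = 2) -> int:
    """Nearest-rank percentile of observed tool call steps, plus headroom."""
    if not step_counts:
        raise ValueError("need at least one observed run")
    ordered = sorted(step_counts)
    # Nearest-rank: the smallest value with at least `percentile` % of runs at or below it
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] + headroom


if __name__ == "__main__":
    # Invented sample: most runs use 1 or 2 tool call steps, one outlier uses 6.
    observed = [1] * 90 + [2] * 9 + [6]
    print(step_ceiling(observed))
```

With a narrow distribution the ceiling lands just above normal behavior, so a single outlier run does not drag the limit upward.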
OpenAI Assistants API vs OpenAI Agents SDK: If you're starting a new project in 2026, evaluate the OpenAI Agents SDK (the Python-first framework with handoffs and guardrails built in) rather than the Assistants API REST interface. The Agents SDK gives you more explicit control over context management and doesn't have the implicit thread accumulation problem. The Assistants API remains in widespread use in existing integrations and is the only option for the file search vector store feature via a managed API.
FAQ
What's the actual token limit for a thread? Will the run just fail if it gets too long?
The Assistants API automatically truncates thread context when it exceeds the model's context window. It drops older messages to fit, which means earlier conversation history and tool outputs can be silently lost. The model doesn't error; it just answers with less context, producing degraded or inconsistent output. The cost isn't avoided either: you are still billed for every token that is sent, which at that point is close to a full context window on every run. Thread rotation prevents both the silent degradation and the accumulating cost.
How do I measure how many tokens a thread is consuming per run?
The run.usage field on a completed run contains prompt_tokens, completion_tokens, and total_tokens. Log prompt_tokens per run and plot it over turn count — you'll see the linear growth clearly. For runs with tool calls, log usage on each intermediate step from runs.steps.list(); the sum across steps shows the true run cost. Alert when prompt_tokens in a single run exceeds your per-run ceiling.
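A sketch of the logging-and-alerting step, assuming you already collect `run.usage.prompt_tokens` for each run on a thread in order. `first_ceiling_breach` and `avg_growth_per_run` are hypothetical helper names, and the sample numbers are invented:

```python
from typing import Optional, Sequence


def first_ceiling_breach(prompt_tokens_by_run: Sequence[int], ceiling: int) -> Optional[int]:
    """Return the 1-based run number whose prompt_tokens first exceeds `ceiling`."""
    for run_number, tokens in enumerate(prompt_tokens_by_run, start=1):
        if tokens > ceiling:
            return run_number
    return None


def avg_growth_per_run(prompt_tokens_by_run: Sequence[int]) -> float:
    """Average increase in prompt_tokens between consecutive runs."""
    if len(prompt_tokens_by_run) < 2:
        return 0.0
    span = prompt_tokens_by_run[-1] - prompt_tokens_by_run[0]
    return span / (len(prompt_tokens_by_run) - 1)


if __name__ == "__main__":
    samples = [800, 1_700, 2_500, 3_400]  # invented prompt_tokens for four runs
    print(first_ceiling_breach(samples, ceiling=2_000))
    print(round(avg_growth_per_run(samples)))
```

The growth figure doubles as a sanity check on the tokens-per-turn assumption used to size max_turns.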
Can I trim a thread's message history without rotating to a new thread?
Partly. In Assistants API v2, a run accepts a truncation_strategy (for example, type last_messages with a message count) and a max_prompt_tokens budget, both of which bound how much thread history is sent as input, and individual messages can be removed with client.beta.threads.messages.delete. None of these summarizes what is dropped, so thread rotation (create a new thread, inject a summary, optionally delete the old thread) remains the way to bound context cost while keeping continuity. Some teams add an instruction telling the model to compress older context when summarizing. This reduces output length but doesn't reduce input token cost, since the history is still sent.
What happens to billing when a run is cancelled mid-execution?
Cancelled runs are billed for all tokens processed up to the point of cancellation — both the input tokens (thread context + tool schemas) and any output tokens generated before the cancel request was processed. The run.usage field on a cancelled run reflects the actual tokens consumed. This means early cancellation of a tool-call-heavy run on a long thread still carries a meaningful cost. Use cancellation to prevent further damage, not as a free abort.
Is there a way to share a vector store across multiple assistants without re-embedding?
Yes. Vector stores are first-class objects in the Assistants API — you create one with client.beta.vector_stores.create(file_ids=[...]) and attach the resulting vector_store_id to any number of assistants or threads. The embedding is paid once per vector store creation (and per file update). Multiple assistants pointing at the same vector store do not incur additional embedding costs per assistant — they share the stored embeddings. This is the right pattern for a shared knowledge base used by multiple agent personas.