AI agent stateless vs. stateful cost comparison: which architecture costs less and at what conversation length
The stateless vs. stateful choice for an AI agent is fundamentally a cost and complexity tradeoff. A stateful agent maintains conversation history in memory and sends the growing message list to the model on every turn. This is simple to implement — just append to a list — but the input token cost grows with conversation length. At turn 20 of a multi-turn support conversation where each turn adds 200 tokens of history, the model receives 4,000 tokens of accumulated history as input, regardless of how much of that history is relevant to the current turn. A stateless agent stores conversation history externally (in a vector store or a database), retrieves only the relevant prior context for each call, and sends a fixed-size context window to the model. This controls input costs but adds retrieval latency and infrastructure complexity. The cost crossover point depends on average conversation length, retrieval quality (irrelevant retrieved context is waste), and the token pricing of your chosen model. This guide quantifies the cost difference at realistic conversation lengths and shows how to use RunGuard to enforce session budgets in both architectures.
Cost model: stateful history accumulation vs. stateless context retrieval
- Stateful: cumulative input cost grows O(N²) with conversation length, because the per-turn input grows linearly. If each turn adds an average of T tokens to the history, turn N sends N×T input tokens of history. The total across a conversation of N turns is T × (1 + 2 + 3 + ... + N) = T × N(N+1)/2. For T=300 tokens/turn and N=20 turns at $3/MTok (Sonnet input pricing), that is 300 × (20×21/2) = 300 × 210 = 63,000 input tokens = $0.189. The output cost is independent of architecture choice (the model generates the same output tokens either way). At N=50 turns the same calculation gives 300 × (50×51/2) = 382,500 input tokens = $1.15, for input alone, before output.
- Stateless: per-call input cost is flat, so total cost grows only linearly, provided retrieval returns a consistent amount of context. Each call retrieves K chunks of relevant history from storage and constructs a context window of approximately K × chunk_size tokens. If K=3 chunks of 400 tokens each, the context window is ~1,200 tokens of retrieved history regardless of whether this is turn 2 or turn 50. Total input cost across N turns is N × (system_prompt + retrieved_context + current_message) = N × (800 + 1,200 + 150) = N × 2,150. At N=20 turns: 43,000 input tokens = $0.129. At N=50 turns: 107,500 input tokens = $0.323.
- The crossover: with the parameters above, stateless becomes cheaper at about 14 turns. Setting 150 × N(N+1) equal to 2,150 × N gives N ≈ 13.3. For short conversations (under 10 turns), stateful is cheaper because every stateless call carries fixed overhead (the retrieved chunks plus system prompt tokens describing the retrieval context structure). For conversations of roughly 10–15 turns, costs are close: at 10 turns stateful totals 16,500 input tokens against 21,500 for stateless, and at 15 turns it is 36,000 against 32,250. Beyond that, stateless becomes progressively cheaper as the stateful architecture’s accumulated input tokens grow quadratically. At 50 turns, stateful costs about 3.6x as much in input tokens as stateless at the parameters above.
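The arithmetic in the cost model above can be reproduced with a short script. This is a minimal sketch using the same assumed parameters (300 history tokens per turn, a 2,150-token stateless call, $3/MTok input pricing); the function names are illustrative and do not come from any library.

```python
from typing import Optional

PRICE_PER_TOKEN = 3.0 / 1_000_000  # $3 per million input tokens, as assumed above


def stateful_input_tokens(turns: int, tokens_per_turn: int = 300) -> int:
    """Cumulative history tokens: T * (1 + 2 + ... + N) = T * N(N+1)/2."""
    return tokens_per_turn * turns * (turns + 1) // 2


def stateless_input_tokens(turns: int, system_prompt: int = 800,
                           retrieved: int = 1_200, message: int = 150) -> int:
    """Cumulative tokens when every call sends the same fixed-size context."""
    return turns * (system_prompt + retrieved + message)


def crossover_turn(max_turns: int = 1_000) -> Optional[int]:
    """First conversation length at which stateless is strictly cheaper."""
    for n in range(1, max_turns + 1):
        if stateless_input_tokens(n) < stateful_input_tokens(n):
            return n
    return None


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        sf = stateful_input_tokens(n)
        sl = stateless_input_tokens(n)
        print(f"{n:>3} turns | stateful {sf:>7,} tok ${sf * PRICE_PER_TOKEN:.4f}"
              f" | stateless {sl:>7,} tok ${sl * PRICE_PER_TOKEN:.4f}")
    print("stateless first cheaper at turn:", crossover_turn())
```

Changing the defaults (for example a smaller system prompt or fewer retrieved chunks) moves the crossover, so it is worth rerunning with your own measured token counts rather than relying on these illustrative values.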
Python: budget tracking for both architectures with RunGuard
Python: stateful agent with rolling window and budget cap
```python
import anthropic
from runguard import guard, BudgetExceededError
from collections import deque

client = anthropic.Anthropic()

SONNET_IN = 3.0 / 1_000_000    # USD per input token
SONNET_OUT = 15.0 / 1_000_000  # USD per output token


class StatefulAgent:
    """
    Stateful agent with a rolling context window.
    Keeps the last `max_history` turns to bound input token growth.
    Enforces a per-session budget cap via RunGuard.
    """

    def __init__(self, budget_usd: float = 1.0, max_history: int = 10):
        self.history: deque = deque(maxlen=max_history * 2)  # pairs of user/assistant
        self.spent_usd: float = 0.0
        self.budget_usd = budget_usd
        self.turn_count = 0

    def _check_budget(self, estimated_cost: float) -> None:
        if self.spent_usd + estimated_cost > self.budget_usd:
            raise BudgetExceededError(
                f"Session budget ${self.budget_usd} exceeded: "
                f"spent ${self.spent_usd:.4f} + estimated ${estimated_cost:.4f}"
            )

    def chat(self, user_message: str) -> str:
        # Estimate input: history + current message (~200 token estimate)
        estimated_input_tokens = len(self.history) * 200 + len(user_message) // 4 + 50
        estimated_cost = estimated_input_tokens * SONNET_IN + 300 * SONNET_OUT
        self._check_budget(estimated_cost)

        messages = list(self.history) + [{"role": "user", "content": user_message}]
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=messages,
        )
        response_text = resp.content[0].text

        # Record actual cost
        actual_cost = resp.usage.input_tokens * SONNET_IN + resp.usage.output_tokens * SONNET_OUT
        self.spent_usd += actual_cost
        self.turn_count += 1

        # Append to rolling window
        self.history.append({"role": "user", "content": user_message})
        self.history.append({"role": "assistant", "content": response_text})

        print(f" [turn {self.turn_count}] cost: ${actual_cost:.5f} | total: ${self.spent_usd:.4f} | history: {len(self.history)//2} turns")
        return response_text


class StatelessAgent:
    """
    Stateless agent that retrieves relevant context from storage per call.
    Budget cap works the same: cost per call is bounded by the retrieval window.
    """

    def __init__(self, budget_usd: float = 1.0):
        self.spent_usd: float = 0.0
        self.budget_usd = budget_usd
        self.turn_count = 0
        # In production: replace with a vector store client
        self._storage: list[dict] = []

    def _retrieve_context(self, query: str, k: int = 3) -> list[dict]:
        """Simple recency-based retrieval (replace with semantic search)."""
        return self._storage[-k*2:] if len(self._storage) > k * 2 else self._storage[:]

    def _store(self, role: str, content: str) -> None:
        self._storage.append({"role": role, "content": content})

    def _check_budget(self, estimated_cost: float) -> None:
        if self.spent_usd + estimated_cost > self.budget_usd:
            raise BudgetExceededError(
                f"Session budget ${self.budget_usd} exceeded: "
                f"spent ${self.spent_usd:.4f} + estimated ${estimated_cost:.4f}"
            )

    def chat(self, user_message: str) -> str:
        retrieved = self._retrieve_context(user_message, k=3)

        # Context is bounded regardless of total history length
        estimated_input_tokens = len(retrieved) * 200 + len(user_message) // 4 + 800
        estimated_cost = estimated_input_tokens * SONNET_IN + 300 * SONNET_OUT
        self._check_budget(estimated_cost)

        messages = retrieved + [{"role": "user", "content": user_message}]
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=messages,
        )
        response_text = resp.content[0].text

        actual_cost = resp.usage.input_tokens * SONNET_IN + resp.usage.output_tokens * SONNET_OUT
        self.spent_usd += actual_cost
        self.turn_count += 1

        self._store("user", user_message)
        self._store("assistant", response_text)

        print(f" [turn {self.turn_count}] cost: ${actual_cost:.5f} | total: ${self.spent_usd:.4f} | retrieved: {len(retrieved)//2} turns")
        return response_text
```
The rolling window in `StatefulAgent` is a cost control mechanism, not just a memory management trick. By setting `max_history=10`, the agent never sends more than 20 messages (10 turns) as history, bounding input at approximately 10×200 = 2,000 history tokens per call (assuming roughly 200 tokens per turn) regardless of how long the conversation has been running. The tradeoff is that the model loses access to context older than 10 turns. For support conversations, this is typically acceptable: if the user is still talking about the same issue 10 turns later, the recent turns contain sufficient context. For tasks that require access to early-conversation decisions (project planning, code review), use the stateless architecture with semantic retrieval instead.
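To make the bound concrete, the sketch below compares history tokens with and without a window cap. It assumes a flat 200 tokens per turn, which is only a rough estimate, and the helper names are illustrative rather than part of the agent code.

```python
from typing import Optional


def history_tokens_at_turn(turn: int, tokens_per_turn: int = 200,
                           window: Optional[int] = None) -> int:
    """History tokens sent on a given turn; `window` caps the number of turns kept."""
    turns_sent = turn if window is None else min(turn, window)
    return turns_sent * tokens_per_turn


def cumulative_history_tokens(turns: int, tokens_per_turn: int = 200,
                              window: Optional[int] = None) -> int:
    """Total history tokens sent across a conversation of `turns` turns."""
    return sum(
        history_tokens_at_turn(n, tokens_per_turn, window)
        for n in range(1, turns + 1)
    )


if __name__ == "__main__":
    for n in (10, 20, 50):
        full = cumulative_history_tokens(n)
        capped = cumulative_history_tokens(n, window=10)
        print(f"{n:>3} turns | full history {full:>7,} tok | window=10 {capped:>7,} tok")
```

Once the window is full, each extra turn adds a constant number of tokens, so cumulative cost grows linearly instead of quadratically.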
Stateful vs. stateless cost comparison at different conversation lengths
| Turns | Stateful input tokens (full history) | Stateful input cost (Sonnet, $3/MTok) | Stateless input tokens (3 chunks, 400 tokens each) | Stateless input cost (Sonnet, $3/MTok) |
|---|---|---|---|---|
| 5 turns | 4,500 total (avg 900/call) | $0.014 | 10,750 total (avg 2,150/call) | $0.032 |
| 10 turns | 16,500 total (avg 1,650/call) | $0.050 | 21,500 total (avg 2,150/call) | $0.065 |
| 20 turns | 63,000 total (avg 3,150/call) | $0.189 | 43,000 total (avg 2,150/call) | $0.129 |
| 50 turns | 382,500 total (avg 7,650/call) | $1.148 | 107,500 total (avg 2,150/call) | $0.323 |
For context window management, see LLM context window exceeded agent recovery. For session-level cost tracking, see AI agent cost per user session.
Enforce session budgets for both stateless and stateful agents
RunGuard’s BudgetTracker — accessible via guard(budget_usd=...) or directly — works identically for both architectures. The only difference is what contributes to the per-call input cost estimate: stateful agents should project cost based on current history length, while stateless agents project based on retrieval window size. Either way, BudgetExceededError fires when the session cap is reached, giving you a clean exit point before the model makes another call.
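The projection logic described above can be sketched without depending on any particular library. In the example below, `SessionBudget` and `SessionBudgetExceeded` are hypothetical stand-ins written for illustration, not RunGuard's actual classes, and the token counts are the rough estimates used earlier in this guide.

```python
from dataclasses import dataclass

INPUT_PRICE = 3.0 / 1_000_000    # assumed USD per input token
OUTPUT_PRICE = 15.0 / 1_000_000  # assumed USD per output token


class SessionBudgetExceeded(Exception):
    """Hypothetical stand-in for a budget error raised before a call is made."""


@dataclass
class SessionBudget:
    """Hypothetical per-session tracker: check before a call, record after it."""
    budget_usd: float
    spent_usd: float = 0.0

    def check(self, projected_input_tokens: int, projected_output_tokens: int = 300) -> float:
        projected = (projected_input_tokens * INPUT_PRICE
                     + projected_output_tokens * OUTPUT_PRICE)
        if self.spent_usd + projected > self.budget_usd:
            raise SessionBudgetExceeded(
                f"spent ${self.spent_usd:.4f} + projected ${projected:.4f} "
                f"exceeds ${self.budget_usd:.2f}"
            )
        return projected

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.spent_usd += input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


def project_stateful_input(history_messages: int, tokens_per_message: int = 200,
                           message_tokens: int = 150) -> int:
    """Stateful projection: scales with the current history length."""
    return history_messages * tokens_per_message + message_tokens


def project_stateless_input(chunks: int = 3, chunk_tokens: int = 400,
                            system_tokens: int = 800, message_tokens: int = 150) -> int:
    """Stateless projection: fixed by the retrieval window size."""
    return system_tokens + chunks * chunk_tokens + message_tokens


if __name__ == "__main__":
    budget = SessionBudget(budget_usd=0.05)
    turn = 0
    try:
        while True:
            tokens = project_stateful_input(history_messages=turn * 2)
            budget.check(tokens)
            budget.record(tokens, 300)  # simulate a call that matches the projection
            turn += 1
    except SessionBudgetExceeded as exc:
        print(f"stopped before turn {turn + 1}: {exc}")
```

Checking before the call and recording actual usage afterwards keeps the two concerns separate: the projection can stay crude as long as the recorded spend uses the real token counts returned by the API.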
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.