Multi-turn conversation cost optimization: flattening the O(n²) token growth curve in LLM agents

Every turn in a multi-turn LLM conversation re-sends the full conversation history as input tokens. If each turn adds about 1,000 tokens, turn 1 sends 1,000 tokens, turn 5 sends 5,000, and turn 50 sends 50,000. The per-turn cost grows linearly with the turn number, so the cumulative cost of the conversation grows quadratically with its length. At GPT-4o’s $2.50/MTok input rate, turn 1 costs $0.0025 and turn 50 costs $0.125, a 50-fold increase for a single exchange, and the 50 turns together send 1.275M input tokens, about $3.19. For an AI agent handling 100 such conversations per day, that is roughly $319/day in input tokens, from a problem that is entirely solvable in software. This page explains the mechanics of that O(n²) growth, ranks the four strategies that flatten it, identifies the turn-count thresholds where each strategy crosses over in effectiveness, and provides complete Python and TypeScript implementations with RunGuard per-turn cost tracking.

The O(n²) cost growth explained with real numbers

The quadratic growth pattern emerges from a simple mechanic: LLMs are stateless, so your application re-sends the entire conversation history on every API call. At GPT-4o pricing ($2.50/MTok input, $10.00/MTok output), the input token cost per turn scales with the accumulated message count.
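The arithmetic behind that growth can be checked in a few lines. The sketch below assumes each turn adds a fixed 1,000 tokens to the history and that only input tokens are priced; `full_history_costs` is an illustrative helper name, not part of any SDK.

```python
def full_history_costs(turns: int, tokens_per_turn: int = 1_000,
                       usd_per_mtok: float = 2.50) -> tuple[list[float], float]:
    """Return (per-turn input costs, cumulative input cost) when every call
    re-sends the whole history. Assumes a constant number of tokens per turn."""
    per_turn = [
        n * tokens_per_turn * usd_per_mtok / 1_000_000
        for n in range(1, turns + 1)
    ]
    return per_turn, sum(per_turn)


per_turn, total = full_history_costs(50)
print(f"turn 1:  ${per_turn[0]:.4f}")
print(f"turn 50: ${per_turn[-1]:.4f}")
print(f"all 50:  ${total:.2f}")
```

The per-turn cost is linear in the turn number n, while the running total follows n(n+1)/2, which is where the quadratic term comes from.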

Four strategies ranked by effectiveness and the crossover points where each wins

There is no single best strategy for all agents. The right choice depends on conversation length, semantic requirements, and implementation complexity budget.
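Of the strategies compared on this page, retrieval-augmented history is the one not covered by the implementations further down, so a rough sketch may help. It assumes a toy bag-of-words cosine similarity as a stand-in for a real embedding model, and the names `retrieve_history`, `k`, and `keep_recent` are hypothetical.

```python
import math
import re
from collections import Counter


def _bag_of_words(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_history(history: list[str], query: str,
                     k: int = 5, keep_recent: int = 2) -> list[str]:
    """Return up to k older turns most similar to the query, followed by the
    last keep_recent turns, all in their original order."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    q = _bag_of_words(query)
    ranked = sorted(range(len(older)),
                    key=lambda i: _cosine(_bag_of_words(older[i]), q),
                    reverse=True)
    chosen = sorted(ranked[:k])  # restore chronological order
    return [older[i] for i in chosen] + recent


history = [
    "The database is PostgreSQL 15 on port 5432.",
    "My favorite color is blue.",
    "We deployed on Ubuntu 22.04.",
    "Lunch was great today.",
    "OK, thanks.",
]
selected = retrieve_history(history, "Which port does PostgreSQL use?",
                            k=1, keep_recent=1)
print(selected)
```

Because at most `k + keep_recent` turns are sent regardless of conversation length, the per-call context stays bounded; the trade-off is the extra embedding and lookup work on every turn.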

Per-provider conversation caching and the prefix stability requirement

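Both Anthropic and OpenAI cache conversation prefixes, which adds a fifth lever to conversation cost control on top of the four strategies above. A cache hit requires the start of the request to match a previous request exactly, so any strategy that rewrites or drops early messages changes how much of the request can be served from cache.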
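One way to reason about prefix stability is to measure how many leading messages two consecutive requests share. The sketch below is illustrative only: `shared_prefix_len` is a hypothetical helper, and real providers match on the tokenized request rather than on message dictionaries.

```python
def shared_prefix_len(prev: list[dict], curr: list[dict]) -> int:
    """Count the leading messages that are identical in both requests."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(6)
]

# Append-only history: the previous request is a strict prefix of the next.
append_prev, append_next = history[:4], history[:6]

# Sliding window of 4 messages: the oldest messages fall off,
# so the very first message differs between the two requests.
window_prev, window_next = history[0:4], history[2:6]

print(shared_prefix_len(append_prev, append_next))  # 4
print(shared_prefix_len(window_prev, window_next))  # 0
```

Under this model, a window that slides on every turn leaves only the content ahead of the messages (such as the system prompt) eligible for caching, which is one argument for trimming or summarizing in batches rather than one message at a time.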

Python implementation: RollingConversation with summarization and RunGuard cost tracking

This implementation provides a RollingConversation class with configurable window size, automatic batch summarization using a cheap model, and RunGuard’s ConversationCostTracker for per-turn cost reporting.

import anthropic
import runguard
from dataclasses import dataclass, field

# Module-level client reuse (connection pooling — see cold start guide)
_client = anthropic.Anthropic()

# RunGuard per-turn cost tracker
def _log_turn_cost(ctx) -> None:
    """Print token counts and cost for the turn that was just recorded."""
    print(
        f"[Turn {ctx.turn_number}] input={ctx.input_tokens} output={ctx.output_tokens} "
        f"cost=${ctx.turn_cost_usd:.6f} cumulative=${ctx.cumulative_cost_usd:.4f}"
    )


def _abort_on_budget(ctx) -> None:
    """Raise so the caller stops the conversation once the budget is crossed."""
    raise runguard.BudgetExceededError(
        f"Conversation budget exceeded at turn {ctx.turn_number}: "
        f"${ctx.cumulative_cost_usd:.4f}"
    )


cost_tracker = runguard.ConversationCostTracker(
    on_turn_cost=_log_turn_cost,
    on_budget_exceeded=_abort_on_budget,
    session_budget_usd=2.00,  # Abort conversation if cumulative cost exceeds $2
)


@dataclass
class Message:
    role: str  # "user" or "assistant"
    content: str


@dataclass
class RollingConversation:
    """
    Multi-turn conversation manager with rolling summary and RunGuard cost tracking.

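    Strategy:
    - Keep at most max_window_turns of the most recent messages verbatim in context
    - When the history reaches summarize_threshold messages, summarize the oldest batch
    - Replace the summarized batch with a compact summary message
    - Use a cheap model (Haiku) for summarization to minimize summary cost

    Note: all three limits count individual messages (a user message and an
    assistant message each count as one), not user/assistant pairs.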
    """
    max_window_turns: int = 20
    summarize_threshold: int = 15
    summarize_batch_size: int = 10
    summary_model: str = "claude-3-5-haiku-latest"
    main_model: str = "claude-sonnet-4-5"
    system_prompt: str = ""
    messages: list[Message] = field(default_factory=list)
    _turn_number: int = field(default=0, init=False)

    def _should_summarize(self) -> bool:
        return len(self.messages) >= self.summarize_threshold

    def _summarize_oldest_batch(self) -> None:
        """Summarize the oldest summarize_batch_size messages and replace them."""
        batch = self.messages[:self.summarize_batch_size]
        remaining = self.messages[self.summarize_batch_size:]

        # Format batch for summarization
        batch_text = "\n".join(
            f"{m.role.upper()}: {m.content}" for m in batch
        )

        summary_response = _client.messages.create(
            model=self.summary_model,
            max_tokens=400,
            messages=[
                {
                    "role": "user",
                    "content": (
                        "Summarize the following conversation excerpt into a concise paragraph "
                        "preserving all key decisions, facts, and context needed for future turns. "
                        "Be specific — include names, values, and conclusions reached.\n\n"
                        f"{batch_text}"
                    ),
                }
            ],
        )

        summary_text = summary_response.content[0].text
        # Claude 3.5 Haiku list pricing: $0.80/MTok input, $4.00/MTok output
        summary_cost = (
            summary_response.usage.input_tokens * 0.80 / 1_000_000 +
            summary_response.usage.output_tokens * 4.00 / 1_000_000
        )
        print(f"[Summary] Compressed {len(batch)} messages to {len(summary_text)} chars "
              f"(cost: ${summary_cost:.6f})")

        # Replace the batch with a single summary message
        summary_message = Message(
            role="user",
            content=f"[CONVERSATION SUMMARY — earlier turns]: {summary_text}",
        )
        self.messages = [summary_message] + remaining

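    def _build_api_messages(self) -> list[dict]:
        """Build the message list for the API call, applying window truncation."""
        messages = self.messages
        if len(messages) > self.max_window_turns:
            # Truncation also drops the summary message if it falls outside the window
            messages = messages[-self.max_window_turns:]
        # The Messages API expects the first message to have the "user" role
        while messages and messages[0].role != "user":
            messages = messages[1:]
        return [{"role": m.role, "content": m.content} for m in messages]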

    def chat(self, user_message: str) -> str:
        """Send a user message and return the assistant response."""
        self._turn_number += 1
        self.messages.append(Message(role="user", content=user_message))

        # Summarize if threshold reached (before building API messages)
        if self._should_summarize():
            self._summarize_oldest_batch()

        api_messages = self._build_api_messages()

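        # Mark the system prompt as cacheable. Prompts shorter than the model's
        # minimum cacheable length (1,024 tokens on Sonnet) are processed uncached.
        system = [
            {
                "type": "text",
                "text": self.system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache system prompt
            }
        ] if self.system_prompt else []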

        response = _client.messages.create(
            model=self.main_model,
            max_tokens=1024,
            system=system,
            messages=api_messages,
        )

        assistant_text = response.content[0].text
        self.messages.append(Message(role="assistant", content=assistant_text))

        # Report per-turn cost to RunGuard.
        # Anthropic's usage.input_tokens already excludes cached tokens; cache
        # reads and writes are reported in separate fields that may be None.
        cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
        cache_write = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
        standard_input = response.usage.input_tokens

        input_cost = (
            standard_input * 3.00 / 1_000_000 +
            cache_write * 3.75 / 1_000_000 +
            cache_read * 0.30 / 1_000_000
        )
        output_cost = response.usage.output_tokens * 15.00 / 1_000_000

        cost_tracker.record_turn(
            turn_number=self._turn_number,
            # Total prompt size = uncached + cache-read + cache-write tokens
            input_tokens=standard_input + cache_read + cache_write,
            output_tokens=response.usage.output_tokens,
            turn_cost_usd=input_cost + output_cost,
        )

        return assistant_text


# Usage example
def run_support_conversation():
    conv = RollingConversation(
        system_prompt=(
            "You are a helpful technical support agent. "
            "Resolve issues efficiently and ask clarifying questions when needed."
        ),
        max_window_turns=20,
        summarize_threshold=15,
        summarize_batch_size=10,
    )

    turns = [
        "I’m having trouble connecting to the database.",
        "The error message is: connection refused on port 5432.",
        "I’m using PostgreSQL 15 on Ubuntu 22.04.",
        "The service is running — I can see it with systemctl status.",
        # ... more turns
    ]

    for turn in turns:
        response = conv.chat(turn)
        print(f"User: {turn}")
        print(f"Assistant: {response}\n")

The critical design decision in this implementation is that summarization happens before _build_api_messages(), not after, so the summary is available to the current call rather than deferred to the next one. The cost_tracker.record_turn() call uses the actual usage fields from the API response, including cache read and write tokens, for accurate per-turn cost reporting. Note that the cost of the summarization call is only printed, not passed to record_turn(), so the session budget covers main-model spend only. RunGuard’s on_budget_exceeded callback fires when the cumulative conversation cost crosses the session budget, before the agent can accumulate runaway spend. For fleet-level cost analysis across many concurrent conversations, see agent workflow orchestration cost analysis.
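To see how the default thresholds interact, the trigger logic in chat() can be replayed without any API calls. The helper below is a hypothetical simulation (`simulate_message_counts` is not part of the class); it assumes one user message and one assistant reply per turn and mirrors the "append, then summarize if at threshold" order used above.

```python
def simulate_message_counts(turns: int, threshold: int = 15,
                            batch: int = 10) -> list[int]:
    """Return the number of messages sent to the model on each turn."""
    count = 0
    sizes = []
    for _ in range(turns):
        count += 1                      # user message appended
        if count >= threshold:
            count = count - batch + 1   # oldest batch replaced by one summary
        sizes.append(count)             # messages in this API call
        count += 1                      # assistant reply appended
    return sizes


sizes = simulate_message_counts(50)
print(sizes[:10])   # [1, 3, 5, 7, 9, 11, 13, 6, 8, 10]
print(max(sizes))   # 14
```

With the defaults, the history sent to the model never exceeds 14 messages, so the max_window_turns=20 truncation acts only as a safety net and never triggers on its own. If you raise summarize_threshold above max_window_turns, truncation starts dropping messages (including the summary) before they are summarized.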

TypeScript implementation: ConversationManager with window and rolling summary

The TypeScript implementation uses a class-based design compatible with Express, Next.js, and serverless environments, with OpenAI as the provider and RunGuard’s TypeScript SDK for per-turn tracking.

import OpenAI from "openai";
import RunGuard from "@runguard/sdk";

const client = new OpenAI({ maxRetries: 0 });

const costTracker = new RunGuard.ConversationCostTracker({
  sessionBudgetUsd: 2.0,
  onTurnCost: (ctx) => {
    console.log(
      `[Turn ${ctx.turnNumber}] ` +
      `input=${ctx.inputTokens} output=${ctx.outputTokens} ` +
      `cost=$${ctx.turnCostUsd.toFixed(6)} cumulative=$${ctx.cumulativeCostUsd.toFixed(4)}`
    );
  },
  onBudgetExceeded: (ctx) => {
    throw new Error(
      `Conversation budget exceeded at turn ${ctx.turnNumber}: $${ctx.cumulativeCostUsd.toFixed(4)}`
    );
  },
});

interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

interface ConversationManagerOptions {
  systemPrompt?: string;
  maxWindowTurns?: number;
  summarizeThreshold?: number;
  summarizeBatchSize?: number;
  mainModel?: string;
  summaryModel?: string;
}

class ConversationManager {
  private messages: Message[] = [];
  private turnNumber = 0;
  private readonly systemPrompt: string;
  private readonly maxWindowTurns: number;
  private readonly summarizeThreshold: number;
  private readonly summarizeBatchSize: number;
  private readonly mainModel: string;
  private readonly summaryModel: string;

  constructor(options: ConversationManagerOptions = {}) {
    this.systemPrompt = options.systemPrompt ?? "";
    this.maxWindowTurns = options.maxWindowTurns ?? 20;
    this.summarizeThreshold = options.summarizeThreshold ?? 15;
    this.summarizeBatchSize = options.summarizeBatchSize ?? 10;
    this.mainModel = options.mainModel ?? "gpt-4o";
    this.summaryModel = options.summaryModel ?? "gpt-4o-mini";
  }

  private shouldSummarize(): boolean {
    return this.messages.length >= this.summarizeThreshold;
  }

  private async summarizeOldestBatch(): Promise<void> {
    const batch = this.messages.slice(0, this.summarizeBatchSize);
    const remaining = this.messages.slice(this.summarizeBatchSize);

    const batchText = batch
      .map((m) => `${m.role.toUpperCase()}: ${m.content}`)
      .join("\n");

    const summaryResponse = await client.chat.completions.create({
      model: this.summaryModel,
      max_tokens: 400,
      messages: [
        {
          role: "user",
          content:
            "Summarize the following conversation excerpt into a concise paragraph " +
            "preserving all key decisions, facts, and context needed for future turns. " +
            "Be specific — include names, values, and conclusions reached.\n\n" +
            batchText,
        },
      ],
    });

    const summaryText = summaryResponse.choices[0].message.content ?? "";
    const summaryInputCost = (summaryResponse.usage?.prompt_tokens ?? 0) * 0.15 / 1_000_000;
    const summaryOutputCost = (summaryResponse.usage?.completion_tokens ?? 0) * 0.60 / 1_000_000;
    console.log(
      `[Summary] Compressed ${batch.length} messages, ` +
      `cost: $${(summaryInputCost + summaryOutputCost).toFixed(6)}`
    );

    const summaryMessage: Message = {
      role: "user",
      content: `[CONVERSATION SUMMARY — earlier turns]: ${summaryText}`,
    };
    this.messages = [summaryMessage, ...remaining];
  }

  private buildApiMessages(): OpenAI.ChatCompletionMessageParam[] {
    let messages = this.messages;
    if (messages.length > this.maxWindowTurns) {
      messages = messages.slice(-this.maxWindowTurns);
    }

    const apiMessages: OpenAI.ChatCompletionMessageParam[] = [];
    if (this.systemPrompt) {
      apiMessages.push({ role: "system", content: this.systemPrompt });
    }
    apiMessages.push(...messages.map((m) => ({ role: m.role, content: m.content })));
    return apiMessages;
  }

  async chat(userMessage: string): Promise<string> {
    this.turnNumber++;
    this.messages.push({ role: "user", content: userMessage });

    if (this.shouldSummarize()) {
      await this.summarizeOldestBatch();
    }

    const apiMessages = this.buildApiMessages();

    const response = await client.chat.completions.create({
      model: this.mainModel,
      max_tokens: 1024,
      messages: apiMessages,
    });

    const assistantText = response.choices[0].message.content ?? "";
    this.messages.push({ role: "assistant", content: assistantText });

    // Extract cost data from response
    const inputTokens = response.usage?.prompt_tokens ?? 0;
    const outputTokens = response.usage?.completion_tokens ?? 0;
    const cachedTokens = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;
    const uncachedInput = inputTokens - cachedTokens;

    // GPT-4o list pricing: $2.50/MTok input, $1.25/MTok cached input (50% discount)
    const inputCost = (uncachedInput * 2.50 + cachedTokens * 1.25) / 1_000_000;
    const outputCost = outputTokens * 10.00 / 1_000_000;

    costTracker.recordTurn({
      turnNumber: this.turnNumber,
      inputTokens,
      outputTokens,
      turnCostUsd: inputCost + outputCost,
    });

    return assistantText;
  }

  getStats(): {
    turnCount: number;
    activeMessages: number;
    estimatedContextTokens: number;
  } {
    return {
      turnCount: this.turnNumber,
      activeMessages: this.messages.length,
      estimatedContextTokens: Math.ceil(
        this.messages.reduce((sum, m) => sum + m.content.length, 0) / 4
      ),
    };
  }
}

// Usage example
async function runAgentConversation(): Promise<void> {
  const manager = new ConversationManager({
    systemPrompt:
      "You are a helpful technical support agent. " +
      "Resolve issues efficiently and ask clarifying questions when needed.",
    maxWindowTurns: 20,
    summarizeThreshold: 15,
    summarizeBatchSize: 10,
    mainModel: "gpt-4o",
    summaryModel: "gpt-4o-mini",
  });

  const turns = [
    "I’m having trouble connecting to the database.",
    "The error is: connection refused on port 5432.",
    "PostgreSQL 15 on Ubuntu 22.04.",
    "The service is running according to systemctl.",
  ];

  for (const turn of turns) {
    const response = await manager.chat(turn);
    console.log(`User: ${turn}`);
    console.log(`Assistant: ${response}`);
    console.log("Stats:", manager.getStats());
  }
}

The getStats() method is a lightweight observability helper: it returns the turn count, the active message count, and an estimated context size (characters divided by 4, a rough heuristic rather than a tokenizer count) without making an API call. Wire this to your monitoring dashboard to track how conversation state grows in production. When estimatedContextTokens climbs unexpectedly between deployments, it typically signals that the summarization threshold has been misconfigured or that the summarization model is generating unusually verbose summaries. Combined with RunGuard’s per-turn cost callbacks, you have full visibility into both token volume and dollar spend per conversation turn. For additional context on managing costs in complex multi-agent pipelines, see multi-agent orchestration cost control.
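For the Python implementation, an equivalent estimate and a simple growth check might look like the sketch below. The function names and the 1.5× ratio are assumptions chosen for illustration; the 4-characters-per-token figure is the same rough heuristic used in getStats().

```python
import math


def estimate_context_tokens(messages: list[dict]) -> int:
    """Rough token estimate: total characters divided by 4, rounded up."""
    return math.ceil(sum(len(m["content"]) for m in messages) / 4)


def context_jump(previous_tokens: int, current_tokens: int,
                 max_ratio: float = 1.5) -> bool:
    """True when the estimate grew by more than max_ratio between two samples."""
    return previous_tokens > 0 and current_tokens > previous_tokens * max_ratio


sample = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 401},
]
print(estimate_context_tokens(sample))  # 201
print(context_jump(1000, 1600))         # True
```

A check like this is cheap enough to run on every turn; for billing-grade numbers, rely on the usage fields returned by the API instead of the estimate.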

Crossover analysis: when to switch strategies as conversations grow

No single strategy dominates across all conversation lengths. Choosing the wrong strategy for your workload’s typical turn count is a common source of both unnecessary cost and degraded agent performance.

Multi-turn conversation cost strategies: input growth, cost per 50-turn conversation, semantic retention, and complexity

| Strategy | Input token growth | Cost (50-turn, 1k tok/turn, GPT-4o) | Semantic retention | Best at turn count | Implementation complexity |
| --- | --- | --- | --- | --- | --- |
| No management (full history) | O(n²), unbounded | $3.19 input alone | 100% | 1–10 turns only | None |
| Message window (last 15 turns) | O(1) after window fills | ~$0.84 (15k tok/call × 50) | Recency only; early context lost | 10–30 turns | Very low |
| Rolling summary (batch every 10) | O(log n), slowly growing summary | ~$0.38 + ~$0.05 summary cost | High; semantic history preserved | 30–60 turns | Medium |
| Retrieval-augmented history (K=5) | O(1), fixed retrieval budget | ~$0.20 + embedding + retrieval cost | Very high; relevant turns retrieved | 60+ turns | High |
| Semantic deduplication only | O(n²) minus ~15% duplicates | ~$2.71 input (15% reduction) | Near 100% | Best as add-on | Medium |
| Window + caching (Anthropic) | O(1) after window + 90% cache discount | ~$0.12 (cache-adjusted input) | Recency only | 10–30 turns, <5 min between turns | Low to medium |
| Rolling summary + RunGuard cost tracker | O(log n) + budget guardrail | ~$0.43 with full observability | High + budget cap protection | 30–60 turns (recommended for production) | Medium |
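The turn-count ranges in the table can be turned into a simple default selector. This is a minimal sketch, not a recommendation engine: `pick_strategy` is a hypothetical helper, it looks only at expected conversation length, and where the table's ranges touch (10, 30, 60 turns) it arbitrarily picks the simpler strategy.

```python
def pick_strategy(expected_turns: int) -> str:
    """Map an expected conversation length to the table's base strategy."""
    if expected_turns < 1:
        raise ValueError("expected_turns must be at least 1")
    if expected_turns <= 10:
        return "full history"
    if expected_turns <= 30:
        return "message window"
    if expected_turns <= 60:
        return "rolling summary"
    return "retrieval-augmented history"


for turns in (5, 25, 45, 120):
    print(turns, "->", pick_strategy(turns))
```

Semantic requirements and implementation budget, the other two factors named earlier, would still need to be weighed by hand; caching and deduplication are add-ons that apply on top of whichever base strategy is chosen.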

Related: AI agent memory consolidation cost optimization · LLM caching cost savings calculation · agent task decomposition cost efficiency

Flatten the O(n²) curve before it shows up in your billing dashboard

Multi-turn conversation cost optimization is not premature optimization: it is the difference between a support agent that costs $2,437/month in input tokens and one that costs $412/month, with no change in quality. The strategies are well understood: use message windowing for task-focused agents under 30 turns, add rolling summarization for longer conversations where semantic history matters, and layer in retrieval-augmented history for sessions that run past 60 turns. Combine any of these with provider-side prompt caching to discount the stable portion of your context window (50% on cached GPT-4o input, 90% on Anthropic cache reads). RunGuard adds the per-turn cost tracking that makes all of this visible in real time, and the session budget cap that stops a runaway conversation once it crosses your cost threshold.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
