AI agent cold start cost reduction: cutting the token overhead every invocation pays before real work begins

Every time an AI agent starts a fresh session, it pays for a block of tokens that have nothing to do with the actual task: the system prompt that defines the agent’s behavior, the tool definitions that describe every available function, and any conversation history loaded from a database. For a typical production agent with a 1,500-token system prompt and 20 tools at 100 tokens each, the cold start overhead is 3,500 tokens. At $3.00/MTok for Claude Sonnet input tokens and 10,000 agent runs per day, that’s $105 per day in cold start overhead alone — before a single task-relevant token is processed. Prompt caching drops that to roughly $10.50/day. This page explains exactly where cold start cost comes from, how to eliminate most of it with prompt caching, what pitfalls break the cache, and how to use RunGuard to detect when system prompt bloat has crept back in.
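The arithmetic above can be reproduced with a short Python sketch. The function name and parameters are illustrative; the prices are the figures quoted in this section, and the cached figure assumes every call is a cache read, ignoring the occasional cache write.

```python
def daily_cold_start_cost(
    system_prompt_tokens: int,
    tool_count: int,
    tokens_per_tool: int,
    runs_per_day: int,
    price_per_mtok: float,
) -> float:
    """Daily dollar cost of the tokens every invocation pays before real work begins."""
    cold_start_tokens = system_prompt_tokens + tool_count * tokens_per_tool
    return cold_start_tokens * runs_per_day * price_per_mtok / 1_000_000


# 1,500-token system prompt, 20 tools at 100 tokens each, 10,000 runs per day
uncached = daily_cold_start_cost(1_500, 20, 100, 10_000, price_per_mtok=3.00)
cached = daily_cold_start_cost(1_500, 20, 100, 10_000, price_per_mtok=0.30)

print(f"uncached: ${uncached:.2f}/day")
print(f"cached:   ${cached:.2f}/day")
```

Changing any single input (tool count, run volume, price) shows how quickly the overhead scales, which is useful when estimating the effect of adding a tool.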

Where cold start tokens come from and what they cost at scale

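Cold start overhead has three distinct components: the system prompt, the tool definitions, and any conversation history loaded at session start. Every invocation pays for all three, so the cost grows linearly with call volume.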

Prompt caching mechanics: Anthropic vs OpenAI

Both providers support prompt caching, but with different implementations, discount levels, and requirements for cache stability.
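Whatever the provider, a cache hit depends on the leading part of the request staying identical between calls. A minimal sketch of one way to check that, assuming a hypothetical prefix_fingerprint helper (not part of either SDK) that hashes the static system prompt and tool definitions so two builds or two requests can be compared:

```python
import hashlib
import json


def prefix_fingerprint(system_prompt: str, tools: list) -> str:
    """Hash the static request prefix; hypothetical helper, not an SDK call."""
    # No sort_keys here: this conservatively treats any change in the
    # serialized form, including key order, as a change to the prefix.
    canonical = json.dumps({"system": system_prompt, "tools": tools})
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


tools = [{"name": "search_documentation", "description": "Search the docs."}]
stable = prefix_fingerprint("You are a research assistant.", tools)
same = prefix_fingerprint("You are a research assistant.", tools)
# Injecting a timestamp into the system prompt changes the prefix on every call
dynamic = prefix_fingerprint("You are a research assistant. Now: 2025-01-01T00:00", tools)

print(stable == same)
print(stable == dynamic)
```

Logging the fingerprint alongside each request makes it easy to see whether a drop in cache hits lines up with a prefix change.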

Connection pooling and TLS handshake elimination

Cold start cost is not purely about tokens. Each fresh HTTP connection to the LLM provider API requires a TLS handshake that adds 50–200 milliseconds of latency. At 10,000 calls/day, this is 500–2,000 seconds of cumulative latency. More importantly, if your agent infrastructure spins up a new process for each invocation (serverless functions, container-per-request patterns), you pay the TLS cost on every single call.
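To make the latency arithmetic concrete, here is a small sketch. The helper name and the calls_per_connection parameter are assumptions for illustration; the 50–200 ms range is the one quoted above.

```python
def handshake_seconds_per_day(
    calls_per_day: int,
    handshake_ms: float,
    calls_per_connection: int = 1,
) -> float:
    """Cumulative TLS handshake latency per day, in seconds.

    calls_per_connection=1 models a fresh connection per call; larger values
    model a reused client whose pooled connection serves many calls.
    """
    # Ceiling division: a partially used connection still costs one handshake
    handshakes = -(-calls_per_day // calls_per_connection)
    return handshakes * handshake_ms / 1000


fresh_low = handshake_seconds_per_day(10_000, 50)
fresh_high = handshake_seconds_per_day(10_000, 200)
pooled_high = handshake_seconds_per_day(10_000, 200, calls_per_connection=500)

print(fresh_low, fresh_high, pooled_high)
```

The pooled case assumes each connection serves 500 calls before it is replaced, which is an arbitrary illustrative figure; the point is that the handshake cost scales with connections opened, not with calls made.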

Python implementation: Anthropic client with prompt caching and RunGuard cold start tracking

This implementation demonstrates the Anthropic SDK pattern for cache_control, module-level client reuse for connection pooling, and RunGuard’s ColdStartTracker, which alerts when system prompt bloat pushes cold start tokens above a threshold.

import anthropic
import runguard
from functools import lru_cache

# Module-level client: initialized once, connection pool reused across all calls
_anthropic_client = anthropic.Anthropic()

# RunGuard cold start tracker: fires an alert if cold start tokens exceed threshold
cold_start_guard = runguard.ColdStartTracker(
    max_cold_start_tokens=4000,
    on_exceeded=lambda ctx: print(
        f"[RunGuard] Cold start bloat detected: {ctx.cold_start_tokens} tokens "
        f"(limit: {ctx.limit}). Check system prompt and tool definitions."
    ),
)

# Stable system prompt — NO dynamic content here
# All timestamps, user IDs, session data go in the user message instead
SYSTEM_PROMPT = """You are a helpful research assistant specializing in technical documentation.

Your responsibilities:
- Answer questions accurately using the provided tools
- Cite sources when available
- Acknowledge uncertainty rather than speculating
- Format responses in Markdown unless the user requests plain text

When using tools, prefer targeted queries over broad ones to minimize token usage.
If a tool call returns an error, explain the error to the user and suggest alternatives.
Do not retry the same tool call with identical parameters more than once."""

# Tool definitions: static, no dynamic injection
TOOLS = [
    {
        "name": "search_documentation",
        "description": "Search the product documentation index for relevant pages.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "max_results": {"type": "integer", "description": "Maximum results (1-10)", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "name": "fetch_page",
        "description": "Retrieve the full content of a documentation page by URL slug.",
        "input_schema": {
            "type": "object",
            "properties": {
                "slug": {"type": "string", "description": "URL slug of the documentation page"},
            },
            "required": ["slug"],
        },
    },
    # ... additional tools defined similarly, no dynamic content in descriptions
]

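@lru_cache(maxsize=1)
def _count_cold_start_tokens() -> int:
    """Estimate cold start token overhead from system prompt + tool defs.

    cl100k_base is an OpenAI encoding, so for Claude this is only an
    approximation; it is close enough for a bloat threshold.
    """
    import json
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    system_tokens = len(enc.encode(SYSTEM_PROMPT))
    # Serialize tools as JSON (what goes over the wire), not as Python repr
    tool_tokens = sum(len(enc.encode(json.dumps(t))) for t in TOOLS)
    return system_tokens + tool_tokens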


def call_agent(user_message: str, conversation_history: list[dict] | None = None) -> str:
    """
    Call the agent with prompt caching on system prompt and tools.
    conversation_history: list of prior message dicts [{"role": ..., "content": ...}]
    """
    history = conversation_history or []

    # Report cold start overhead to RunGuard on every call;
    # the count itself is computed once per process and cached
    cold_start_tokens = _count_cold_start_tokens()
    cold_start_guard.record(cold_start_tokens=cold_start_tokens)

    messages = history + [{"role": "user", "content": user_message}]

    response = _anthropic_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # cache_control marks the end of the cacheable prefix.
                # Everything before this marker is cached after the first call.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        tools=TOOLS,
        messages=messages,
    )

    # Log cache performance for monitoring
    usage = response.usage
    # These fields can be missing or None depending on SDK version and request
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    # Anthropic reports input_tokens exclusive of cache reads and writes,
    # so it is already the uncached portion of the prompt
    standard_input = usage.input_tokens

    if cache_read > 0:
        saved_cost = cache_read * (3.00 - 0.30) / 1_000_000
        print(f"[Cache] Read {cache_read} cached tokens, saved ${saved_cost:.6f}")
    elif cache_write > 0:
        print(f"[Cache] Wrote {cache_write} tokens to cache (first call or cache miss)")
    else:
        print(
            f"[Cache] No caching detected ({standard_input} uncached input tokens): "
            "check prefix stability and the model's minimum cacheable prompt length"
        )

    # Extract text content from response
    text_blocks = [b.text for b in response.content if hasattr(b, "text")]
    return "\n".join(text_blocks)


# Example: warm agent startup pattern for serverless environments
class AgentPool:
    """
    Pre-warm agent connections to avoid cold start latency on first real request.
    In Lambda: call warm() in the module-level initialization code (outside the handler).
    """

    def __init__(self):
        self._warmed = False

    def warm(self) -> None:
        """Send a minimal call to open the connection and prime the prompt cache."""
        if self._warmed:
            return
        _anthropic_client.messages.create(
            # Same model and same tools + system prefix as call_agent():
            # cache entries are not shared across models or differing prefixes,
            # so a ping to a cheaper model would warm only the connection.
            model="claude-sonnet-4-5",
            max_tokens=1,
            system=[
                {
                    "type": "text",
                    "text": SYSTEM_PROMPT,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            tools=TOOLS,
            messages=[{"role": "user", "content": "ping"}],
        )
        self._warmed = True
        print("[AgentPool] Connection warm, cache primed")

# Module-level pool — warm() runs once per process lifecycle
_pool = AgentPool()
_pool.warm()

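The @lru_cache(maxsize=1) on _count_cold_start_tokens ensures the token count is computed once per process, not on every call. RunGuard’s cold_start_guard.record() accepts the pre-computed token count and fires the alert callback if the threshold is exceeded, so if a developer accidentally adds a 2,000-token example to the system prompt, the alert fires on the first call after the next deployment. AgentPool.warm() establishes the connection before the first real request arrives. It primes the prompt cache only when the warm call uses the same model, tool definitions, and system prompt as the real calls, because Anthropic matches cache entries on the exact prefix and does not share them across models; a ping to a cheaper model warms the connection but leaves the first real request paying for a cache write. Ephemeral cache entries also expire after about five minutes without a hit, so the warm call helps only if real traffic follows soon after.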
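The comments in the listing say that timestamps, user IDs, and session data belong in the user message rather than the system prompt. A minimal sketch of that split, assuming a hypothetical build_user_message helper and an illustrative context format (neither comes from the Anthropic SDK or RunGuard):

```python
from datetime import datetime, timezone


def build_user_message(user_text: str, user_id: str, now: datetime) -> dict:
    """Put per-request context in the user turn so the cached prefix stays static."""
    context = f"[context] user_id={user_id} time={now.isoformat()}"
    return {"role": "user", "content": f"{context}\n\n{user_text}"}


msg = build_user_message(
    "How do I rotate an API key?",
    user_id="u_123",
    now=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(msg["content"])
```

The resulting content string can be passed as user_message to a function like call_agent, leaving SYSTEM_PROMPT and TOOLS untouched between requests.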

TypeScript implementation: OpenAI with connection pooling and cold start monitoring

The TypeScript equivalent uses a globalThis singleton for the OpenAI client (useful where module code can be re-evaluated inside a live process, such as hot reload during development) and integrates RunGuard’s cold start tracker via the TypeScript SDK.

import OpenAI from "openai";
import RunGuard from "@runguard/sdk";

// globalThis singleton: survives hot-reload in Next.js and some serverless runtimes
declare global {
  // eslint-disable-next-line no-var
  var __openaiClient: OpenAI | undefined;
  var __runguardColdStartGuard: RunGuard.ColdStartTracker | undefined;
}

function getOpenAIClient(): OpenAI {
  if (!globalThis.__openaiClient) {
    globalThis.__openaiClient = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
      // maxRetries: 0 — we handle retries ourselves for cost control
      maxRetries: 0,
    });
  }
  return globalThis.__openaiClient;
}

function getColdStartGuard(): RunGuard.ColdStartTracker {
  if (!globalThis.__runguardColdStartGuard) {
    globalThis.__runguardColdStartGuard = new RunGuard.ColdStartTracker({
      maxColdStartTokens: 4000,
      onExceeded: (ctx) => {
        console.warn(
          `[RunGuard] Cold start bloat: ${ctx.coldStartTokens} tokens exceeds limit ${ctx.limit}. ` +
          `Check system prompt and tool definitions for dynamic content injection.`
        );
      },
    });
  }
  return globalThis.__runguardColdStartGuard;
}

// Stable system prompt — no dynamic content allowed here
const SYSTEM_PROMPT =
  "You are a helpful research assistant specializing in technical documentation. " +
  "Answer questions accurately using the provided tools. Cite sources when available. " +
  "Acknowledge uncertainty rather than speculating. Format responses in Markdown " +
  "unless the user requests plain text. When using tools, prefer targeted queries " +
  "over broad ones to minimize token usage. Do not retry the same tool call with " +
  "identical parameters more than once.";

// Tool definitions: completely static, no dynamic injection
const TOOLS: OpenAI.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "search_documentation",
      description: "Search the product documentation index for relevant pages.",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "Search query" },
          max_results: { type: "integer", description: "Maximum results (1-10)", default: 5 },
        },
        required: ["query"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "fetch_page",
      description: "Retrieve the full content of a documentation page by URL slug.",
      parameters: {
        type: "object",
        properties: {
          slug: { type: "string", description: "URL slug of the documentation page" },
        },
        required: ["slug"],
      },
    },
  },
];

// Approximate cold start token count (system prompt + tools serialized)
let _coldStartTokensCached: number | null = null;

function estimateColdStartTokens(): number {
  if (_coldStartTokensCached !== null) return _coldStartTokensCached;
  // Rough approximation: 1 token ≈ 4 chars for English text
  const systemChars = SYSTEM_PROMPT.length;
  const toolChars = JSON.stringify(TOOLS).length;
  _coldStartTokensCached = Math.ceil((systemChars + toolChars) / 4);
  return _coldStartTokensCached;
}

interface CallAgentOptions {
  userMessage: string;
  conversationHistory?: OpenAI.ChatCompletionMessageParam[];
}

async function callAgent({ userMessage, conversationHistory = [] }: CallAgentOptions): Promise<string> {
  const client = getOpenAIClient();
  const coldStartGuard = getColdStartGuard();

  // Report cold start estimate to RunGuard
  const coldStartTokens = estimateColdStartTokens();
  coldStartGuard.record({ coldStartTokens });

  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "system", content: SYSTEM_PROMPT },
    ...conversationHistory,
    { role: "user", content: userMessage },
  ];

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    max_tokens: 1024,
    messages,
    tools: TOOLS,
  });

  // Check for OpenAI automatic cache hits
  const usage = response.usage;
  const cachedTokens = usage?.prompt_tokens_details?.cached_tokens ?? 0;
  const totalInputTokens = usage?.prompt_tokens ?? 0;

  if (cachedTokens > 0) {
    // gpt-4o cached input: $1.25/MTok vs $2.50/MTok standard
    // (50% discount at list price; verify against current pricing)
    const savedCost = cachedTokens * (2.50 - 1.25) / 1_000_000;
    console.log(`[Cache] ${cachedTokens}/${totalInputTokens} input tokens cached, saved $${savedCost.toFixed(6)}`);
  } else if (totalInputTokens >= 1024) {
    // Eligible for caching but missed — prefix may have changed
    console.warn(`[Cache] ${totalInputTokens} tokens but 0 cached — check prefix stability`);
  }

  const choice = response.choices[0];
  return choice.message.content ?? "";
}

export { callAgent, estimateColdStartTokens };

The globalThis singleton pattern matters in environments where module code can be evaluated more than once within a single process, such as Next.js hot reload in development and some bundled serverless runtimes; there, plain module-level variables may be re-initialized while globalThis persists. It keeps the OpenAI client and its underlying connection pool alive for the warm lifetime of the worker process, but it cannot survive a true cold start, which creates a new process. The cache performance logging, which checks cached_tokens in prompt_tokens_details, should be wired to your observability platform. A cache hit rate below 80% on a stable agent is a signal that the system prompt prefix is changing between calls. OpenAI only caches prompts of 1,024 tokens or more, so shorter prompts always report zero cached tokens. For session-level cost analysis across multiple turns, see AI agent cost per user session.
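The 80% guideline can be tracked with a small rolling monitor. The sketch below is in Python to match the first listing, and it is an assumption-laden illustration: CacheHitRateMonitor is a hypothetical class (not a RunGuard or OpenAI API), and it defines hit rate as the share of cache-eligible calls that reported any cached tokens.

```python
from collections import deque


class CacheHitRateMonitor:
    """Rolling call-level cache hit rate over the last `window` eligible calls."""

    def __init__(self, window: int = 100, alert_below: float = 0.80,
                 min_cacheable_tokens: int = 1024):
        self._hits = deque(maxlen=window)
        self.alert_below = alert_below
        self.min_cacheable_tokens = min_cacheable_tokens

    def record(self, cached_tokens: int, prompt_tokens: int) -> None:
        if prompt_tokens < self.min_cacheable_tokens:
            return  # too short to be cached, so not counted as a miss
        self._hits.append(cached_tokens > 0)

    def hit_rate(self) -> float:
        return sum(self._hits) / len(self._hits) if self._hits else 0.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise during warm-up
        return len(self._hits) == self._hits.maxlen and self.hit_rate() < self.alert_below


monitor = CacheHitRateMonitor(window=10)
monitor.record(cached_tokens=0, prompt_tokens=300)  # ignored: below the minimum
for i in range(10):
    monitor.record(cached_tokens=3456 if i < 7 else 0, prompt_tokens=3700)

print(monitor.hit_rate(), monitor.should_alert())
```

In a real service the record call would take the cached_tokens and prompt_tokens values read from each response's usage object.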

RunGuard cold start budget: catching system prompt bloat in production

Prompt caching is powerful, but it only helps if you know your cold start token count in the first place. Without explicit tracking, system prompt bloat is invisible: a developer adds a two-paragraph example to the system prompt, the prompt grows from 1,500 to 2,800 tokens (an 87% increase), and the cost of every call rises with it, without any change in agent behavior or output quality. With 2,000 tokens of tool definitions on top, total cold start overhead climbs from 3,500 to 4,800 tokens, about 37% more.
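A budget check of this kind can also run before deployment. The following is a minimal sketch, not RunGuard’s API: percent_increase and check_cold_start_budget are hypothetical helpers, the 4,000-token budget mirrors the max_cold_start_tokens value in the Python listing, and the tool figure is the 20 tools at 100 tokens each from the opening example.

```python
def percent_increase(before: int, after: int) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100


def check_cold_start_budget(system_tokens: int, tool_tokens: int, budget: int) -> None:
    """Raise if the static prefix exceeds the budget; intended for a CI step."""
    total = system_tokens + tool_tokens
    if total > budget:
        raise ValueError(f"cold start is {total} tokens, over the {budget}-token budget")


TOOL_TOKENS = 20 * 100

prompt_growth = percent_increase(1_500, 2_800)
total_growth = percent_increase(1_500 + TOOL_TOKENS, 2_800 + TOOL_TOKENS)
print(f"system prompt: +{prompt_growth:.0f}%, total cold start: +{total_growth:.0f}%")

check_cold_start_budget(1_500, TOOL_TOKENS, budget=4_000)  # within budget
try:
    check_cold_start_budget(2_800, TOOL_TOKENS, budget=4_000)  # bloated prompt
    over_budget = False
except ValueError:
    over_budget = True
print(over_budget)
```

Failing the build on a budget breach surfaces the bloat at review time, while a runtime tracker catches anything that slips through.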

Cold start reduction approaches: token overhead, cost per 10k calls, and complexity

| Approach | Token overhead per call | Cost per 10k calls (3,500-token cold start) | Cache hit rate | Implementation complexity |
|---|---|---|---|---|
| No caching, new client per call | 3,500 (full cold start every time) | $105.00 | 0% | Low: no setup required |
| No caching, client singleton | 3,500 (TLS handshake saved) | $105.00 (saves latency, not cost) | 0% | Low: one-line singleton |
| OpenAI automatic caching | ~1,750 effective (50% discount on cached gpt-4o input) | ~$43.75 at gpt-4o’s $2.50/MTok (vs. $87.50 uncached) | 70–85% on stable prefixes | Low: automatic, no code changes |
| Anthropic explicit cache_control | ~350 effective (90% discount on cache read) | ~$10.50 | 85–97% on stable prefixes | Medium: requires cache_control blocks |
| Anthropic caching + keep-alive warm ping | ~350 effective | ~$10.50 | 95–99% | Medium: warm ping on module load |
| Anthropic caching + RunGuard cold start tracker | ~350 effective + bloat alerts | ~$10.50 with bloat guardrails | 95–99% | Medium: adds observability layer |
| System prompt minimization + caching | ~100–175 effective (700-token prompt + caching) | ~$3.00–$5.25 | 95–99% | High: requires prompt engineering discipline |

Costs are at $3.00/MTok input unless noted, and the cached figures assume every call is a cache hit.


Stop paying $105/day for agent overhead before the task even starts

AI agent cold start cost reduction is one of the highest-leverage optimizations available to teams running agents at scale. The math is straightforward: a 3,500-token cold start at $3.00/MTok costs $105/day at 10,000 invocations. Anthropic prompt caching at $0.30/MTok for cache reads reduces that to $10.50/day. The technical requirements are minimal — a stable system prompt with no dynamic content, explicit cache_control blocks, and a module-level client singleton for connection reuse. RunGuard’s cold start tracker adds the observability layer that catches system prompt bloat before it silently doubles your monthly bill. Together these changes typically cut cold start overhead by 85–90% within one deployment cycle.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.

Start your 14-day free trial — or explore related: LLM caching cost savings calculation, autonomous agent cost control best practices, and agent task decomposition cost efficiency.