AI agent cold start cost reduction: cutting the token overhead every invocation pays before real work begins

Every time an AI agent starts a fresh session, it pays for a block of tokens that have nothing to do with the actual task: the system prompt that defines the agent’s behavior, the tool definitions that describe every available function, and any conversation history loaded from a database. For a typical production agent with a 1,500-token system prompt and 20 tools at 100 tokens each, the cold start overhead is 3,500 tokens. At $3.00/MTok for Claude Sonnet input tokens and 10,000 agent runs per day, that’s $105 per day in cold start overhead alone — before a single task-relevant token is processed. Prompt caching drops that to roughly $10.50/day. This page explains exactly where cold start cost comes from, how to eliminate most of it with prompt caching, what pitfalls break the cache, and how to use RunGuard to detect when system prompt bloat has crept back in.

Where cold start tokens come from and what they cost at scale

Cold start overhead has three distinct components. They add up within each call, and the total scales linearly with the number of invocations.

System prompt tokens. Most non-trivial agents have system prompts ranging from 200 tokens (minimal personality plus role definition) to 2,000+ tokens (detailed instructions, examples, output format specifications, guardrails). A 1,500-token system prompt at $3.00/MTok costs $0.0045 per call. At 10,000 calls/day that is $45/day. At 100,000 calls/day it is $450/day. System prompts are a prime target for cold start cost reduction because they are entirely stable across calls, which is exactly what prompt caching is designed for.
Tool definition tokens. Every tool you define — function name, parameter names, types, descriptions — is serialized and included in the input token count. A well-documented tool with three to five parameters costs roughly 80–150 tokens. An agent with 20 tools has 1,600–3,000 tokens of tool definitions per call. At $3.00/MTok with 10,000 runs/day, 2,000 tool tokens cost $60/day. Like system prompts, tool definitions are static and fully cache-eligible.
Conversation history prefill. Agents that resume prior sessions by injecting stored conversation turns pay for every token in that history on every call. A 10-turn conversation at ~200 tokens per turn adds 2,000 prefill tokens. Without windowing or summarization, the history grows linearly with each turn, so the cumulative input cost of a conversation grows quadratically with its length. For a fresh-start invocation with no prior history, this cost is zero; for session-resuming agents it can dwarf system prompt cost. See multi-agent orchestration cost control for session-level strategies.
The compounded cold start number. System prompt: 1,500 tokens. Tool definitions (20 × 100 tokens): 2,000 tokens. Total cold start overhead: 3,500 tokens per call. At $3.00/MTok: $0.0105/call. At 10,000 calls/day: $105/day. At 30 days: $3,150/month. With Anthropic cache reads at $0.30/MTok: $0.00105/call, $10.50/day, $315/month. That is a $2,835/month difference from one configuration change — with no change to agent behavior whatsoever.
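
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch using only the figures quoted in this section; the function and class names are illustrative, and it deliberately ignores the one-time cache write premium.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdStartCost:
    per_call: float
    per_day: float
    per_month: float


def cold_start_cost(tokens_per_call: int, price_per_mtok: float,
                    calls_per_day: int, days: int = 30) -> ColdStartCost:
    """Cost of the static prefix alone at a given input price (USD per million tokens)."""
    per_call = tokens_per_call * price_per_mtok / 1_000_000
    per_day = per_call * calls_per_day
    return ColdStartCost(per_call, per_day, per_day * days)


# Figures from the text: 1,500-token system prompt plus 20 tools at 100 tokens each
overhead_tokens = 1_500 + 20 * 100

uncached = cold_start_cost(overhead_tokens, 3.00, 10_000)  # standard input rate
cached = cold_start_cost(overhead_tokens, 0.30, 10_000)    # cache read rate

print(f"uncached:     ${uncached.per_day:.2f}/day, ${uncached.per_month:.2f}/month")
print(f"cache reads:  ${cached.per_day:.2f}/day, ${cached.per_month:.2f}/month")
print(f"monthly diff: ${uncached.per_month - cached.per_month:.2f}")
```

Swapping in your own token counts, call volume, and provider prices gives a quick estimate of how much of the bill is prefix overhead.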

Prompt caching mechanics: Anthropic vs OpenAI

Both providers support prompt caching, but with different implementations, discount levels, and requirements for cache stability.

Anthropic cache_control. Anthropic’s caching uses explicit cache_control markers attached to content blocks (tool definitions, system prompt blocks, or message content). You mark the end of the cacheable prefix with "cache_control": {"type": "ephemeral"}. Cache writes cost $3.75/MTok (25% premium over standard input). Cache reads cost $0.30/MTok, a 90% discount from $3.00/MTok standard input. The default cache TTL is 5 minutes and is refreshed each time the entry is read: if the same prefix is not reused within 5 minutes, the entry expires and the next call pays the write rate again. For agents processing calls every few seconds, the TTL is easily maintained. Prefixes below the model's minimum cacheable length (1,024 tokens for Sonnet-class models at the time of writing) are not cached at all. The breakeven point is low: one write plus n reads costs 1.25 + 0.1n units against 1 + n uncached, so caching is net positive once each write is followed by about 0.28 reads on average, which in practice means a single reuse within the TTL.
OpenAI automatic caching. OpenAI caches automatically: any input prefix that has been seen recently and is at least 1,024 tokens long is eligible. Cache reads for GPT-4o cost $1.25/MTok vs $2.50/MTok uncached, a 50% discount; other model families publish different cached-input rates, so check the current pricing page for the model you use. You do not mark cache boundaries explicitly; OpenAI caches the longest matching prefix. This is simpler to implement but harder to debug: always inspect usage.prompt_tokens_details.cached_tokens in the response to verify cache hits are occurring. See OpenAI Assistants API budget control for full pricing context.
The cache-breaking pitfall: dynamic content in static positions. The most common mistake that destroys cache hit rates is injecting dynamic content — timestamps, request IDs, user identifiers, session tokens — into the system prompt or tool descriptions. If your system prompt includes "Current UTC time: 2026-06-02T14:23:01Z", the cache prefix changes every second and no cache hits occur. Move all dynamic content to the first user message or to a separate message after the system prompt. The system prompt and tool definitions should be 100% identical across all calls from the same agent version. Any content that varies between calls must live in the message body, never the static prefix.
Measuring cache effectiveness. Anthropic returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens in every response, alongside usage.input_tokens, which counts only the uncached portion of the input. Log all three fields in your telemetry. A healthy caching setup should show cache reads covering 80–95% of cold start tokens after the first few calls of a session. If cache reads are near zero, your prefix is changing between calls or is shorter than the minimum cacheable length. If cache creation tokens are consistently high, you may have a TTL problem (calls spaced more than 5 minutes apart) or a dynamic prefix problem.
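
One way to guard against the dynamic-content pitfall is to fingerprint the static prefix and confirm it does not change between calls. The sketch below is illustrative only: prefix_fingerprint and build_request are hypothetical helpers, and the request dict mirrors the general shape of a chat request rather than any specific SDK call.

```python
import hashlib
import json

STATIC_SYSTEM_PROMPT = "You are a helpful research assistant."
STATIC_TOOLS = [
    {"name": "search_documentation", "description": "Search the docs index."},
]


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Short hash of the cacheable prefix; sort_keys keeps serialization deterministic."""
    payload = json.dumps({"tools": tools, "system": system_prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def build_request(question: str, now_utc: str, request_id: str) -> dict:
    """Keep the prefix static and carry per-call values in the user message."""
    user_content = f"[time={now_utc} request_id={request_id}]\n{question}"
    return {
        "system": STATIC_SYSTEM_PROMPT,
        "tools": STATIC_TOOLS,
        "messages": [{"role": "user", "content": user_content}],
    }


first = build_request("How do I rotate keys?", "2026-06-02T14:23:01Z", "req-1")
second = build_request("How do I rotate keys?", "2026-06-02T14:23:02Z", "req-2")
prefix_is_stable = (
    prefix_fingerprint(first["system"], first["tools"])
    == prefix_fingerprint(second["system"], second["tools"])
)

# Anti-pattern for contrast: a timestamp inside the system prompt changes the hash
bad_a = prefix_fingerprint(
    STATIC_SYSTEM_PROMPT + " Current UTC time: 2026-06-02T14:23:01Z", STATIC_TOOLS)
bad_b = prefix_fingerprint(
    STATIC_SYSTEM_PROMPT + " Current UTC time: 2026-06-02T14:23:02Z", STATIC_TOOLS)

print("stable prefix:", prefix_is_stable)
print("timestamped prefix stable:", bad_a == bad_b)
```

Logging the fingerprint next to the cache usage fields makes it easy to see whether a drop in cache reads coincides with a prefix change.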

Connection pooling and TLS handshake elimination

Cold start cost is not purely about tokens. Each fresh HTTP connection to the LLM provider API requires a TLS handshake that adds 50–200 milliseconds of latency. At 10,000 calls/day, this is 500–2,000 seconds of cumulative latency. More importantly, if your agent infrastructure spins up a new process for each invocation (serverless functions, container-per-request patterns), you pay the TLS cost on every single call.

HTTP keep-alive and connection pooling. The Anthropic Python SDK uses httpx internally. By default, httpx maintains a connection pool and reuses connections for subsequent requests. If you instantiate a new Anthropic() client per request, you destroy the connection pool on every call. Instantiate the client once at module load time or in an application-level singleton, and reuse it across all calls within the process lifetime.
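HTTP/2 multiplexing. HTTP/2 allows multiple concurrent requests over a single connection. For agents that fan out several LLM calls in parallel, one multiplexed connection avoids opening a separate connection, and paying a separate TLS handshake, for each concurrent request. Sequential calls (plan, execute step 1, execute step 2, synthesize result) already avoid repeat handshakes through keep-alive on a reused client, with or without HTTP/2. The Anthropic SDK supports HTTP/2 when passed a custom httpx.Client created with http2=True, which requires installing httpx with its optional http2 extra.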
Serverless cold starts. If you deploy agents as AWS Lambda functions or Google Cloud Run instances, the Python/Node runtime itself has a cold start cost (100–500ms) before the first LLM call is even dispatched. Provisioned concurrency (Lambda) or minimum instances (Cloud Run) eliminate this by keeping at least one process warm. For agents with consistent traffic, the cost of provisioned concurrency is usually lower than the combined latency and cache-miss cost of fully cold invocations.
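
The client-reuse advice can be illustrated without any network access. The following sketch uses a stand-in class, StubLLMClient, which is hypothetical and only counts how often it is constructed; in a real service the factory would build the SDK client (optionally with a custom httpx.Client as described above) instead.

```python
from functools import lru_cache


class StubLLMClient:
    """Stand-in for an SDK client that owns a connection pool (hypothetical)."""

    instances_created = 0

    def __init__(self) -> None:
        # A real client would open its connection pool lazily from here on
        StubLLMClient.instances_created += 1

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


@lru_cache(maxsize=1)
def get_client() -> StubLLMClient:
    """Build the client on first use and return the same instance afterwards."""
    return StubLLMClient()


def handle_request(prompt: str) -> str:
    # Every request goes through the shared client, so its pool is reused
    return get_client().complete(prompt)


replies = [handle_request(p) for p in ("plan", "step 1", "step 2", "synthesize")]
print(replies)
print("clients created:", StubLLMClient.instances_created)
```

A lazily built singleton like this defers construction until the first request. Plain module-level instantiation, as recommended above, achieves the same reuse and avoids the possibility of two threads racing to build the client on first use.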

Python implementation: Anthropic client with prompt caching and RunGuard cold start tracking

This implementation demonstrates the Anthropic SDK pattern for cache_control, module-level client reuse for connection pooling, and RunGuard’s ColdStartTracker, which alerts when system prompt bloat has crept above a threshold.

import anthropic
import runguard
from functools import lru_cache

# Module-level client: initialized once, connection pool reused across all calls
_anthropic_client = anthropic.Anthropic()

# RunGuard cold start tracker: fires an alert if cold start tokens exceed threshold
cold_start_guard = runguard.ColdStartTracker(
    max_cold_start_tokens=4000,
    on_exceeded=lambda ctx: print(
        f"[RunGuard] Cold start bloat detected: {ctx.cold_start_tokens} tokens "
        f"(limit: {ctx.limit}). Check system prompt and tool definitions."
    ),
)

# Stable system prompt — NO dynamic content here
# All timestamps, user IDs, session data go in the user message instead
SYSTEM_PROMPT = """You are a helpful research assistant specializing in technical documentation.

Your responsibilities:
- Answer questions accurately using the provided tools
- Cite sources when available
- Acknowledge uncertainty rather than speculating
- Format responses in Markdown unless the user requests plain text

When using tools, prefer targeted queries over broad ones to minimize token usage.
If a tool call returns an error, explain the error to the user and suggest alternatives.
Do not retry the same tool call with identical parameters more than once."""

# Tool definitions: static, no dynamic injection
TOOLS = [
    {
        "name": "search_documentation",
        "description": "Search the product documentation index for relevant pages.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "max_results": {"type": "integer", "description": "Maximum results (1-10)", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "name": "fetch_page",
        "description": "Retrieve the full content of a documentation page by URL slug.",
        "input_schema": {
            "type": "object",
            "properties": {
                "slug": {"type": "string", "description": "URL slug of the documentation page"},
            },
            "required": ["slug"],
        },
    },
    # ... additional tools defined similarly, no dynamic content in descriptions
]

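@lru_cache(maxsize=1)
def _count_cold_start_tokens() -> int:
    """Estimate cold start token overhead from system prompt + tool defs.

    cl100k_base is an OpenAI encoding, not Claude's tokenizer, so treat the
    result as an approximation that is good enough for bloat alerts.
    """
    import json
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    system_tokens = len(enc.encode(SYSTEM_PROMPT))
    # Serialize tools as JSON (close to what is sent), not as a Python repr
    tool_tokens = sum(len(enc.encode(json.dumps(t))) for t in TOOLS)
    return system_tokens + tool_tokens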


def call_agent(user_message: str, conversation_history: list[dict] | None = None) -> str:
    """
    Call the agent with prompt caching on system prompt and tools.
    conversation_history: list of prior message dicts [{"role": ..., "content": ...}]
    """
    history = conversation_history or []

    # Report cold start overhead to RunGuard on every call; the count itself
    # is computed once and memoized, so this is cheap
    cold_start_tokens = _count_cold_start_tokens()
    cold_start_guard.record(cold_start_tokens=cold_start_tokens)

    messages = history + [{"role": "user", "content": user_message}]

    response = _anthropic_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # cache_control marks the end of the cacheable prefix.
                # Tools are placed before the system prompt in that prefix,
                # so this single breakpoint covers both after the first call.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        tools=TOOLS,
        messages=messages,
    )

    # Log cache performance for monitoring
    usage = response.usage
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    # input_tokens counts only uncached input; cached tokens are reported
    # separately, so the total is the sum of all three fields
    standard_input = usage.input_tokens
    total_input = standard_input + cache_write + cache_read

    if cache_read > 0:
        saved_cost = cache_read * (3.00 - 0.30) / 1_000_000
        print(f"[Cache] Read {cache_read}/{total_input} input tokens from cache, saved ${saved_cost:.6f}")
    elif cache_write > 0:
        print(f"[Cache] Wrote {cache_write} tokens to cache (first call or cache miss)")
    else:
        print("[Cache] No caching detected: check prefix stability and minimum prefix length")

    # Extract text content from response
    text_blocks = [b.text for b in response.content if hasattr(b, "text")]
    return "\n".join(text_blocks)


# Example: warm agent startup pattern for serverless environments
class AgentPool:
    """
    Pre-warm agent connections to avoid cold start latency on first real request.
    In Lambda: call warm() in the module-level initialization code (outside the handler).
    """

    def __init__(self):
        self._warmed = False

    def warm(self) -> None:
        """Send a minimal call to establish the connection and prime the cache."""
        if self._warmed:
            return
        # Cache entries are keyed on the exact prefix and are not shared across
        # models, so the warm ping must use the same model, tools, and system
        # prompt as call_agent(); a cheaper model would warm a different cache.
        _anthropic_client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1,
            system=[
                {
                    "type": "text",
                    "text": SYSTEM_PROMPT,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            tools=TOOLS,
            messages=[{"role": "user", "content": "ping"}],
        )
        self._warmed = True
        print("[AgentPool] Connection warm, cache primed")

# Module-level pool: warm() runs once per process lifecycle
_pool = AgentPool()
_pool.warm()

The @lru_cache(maxsize=1) on _count_cold_start_tokens ensures the token count is computed exactly once per process, not on every call. RunGuard’s cold_start_guard.record() accepts the pre-computed token count and fires the alert callback if the threshold is exceeded. This means if a developer accidentally adds a 2,000-token example to the system prompt, the alert fires on the first call after the next deployment. The AgentPool.warm() pattern primes both the connection and the Anthropic cache in a single call capped at max_tokens=1, so the first real user request sees a cache hit instead of a write. The warm call uses the same model, tools, and system prompt as call_agent() because cache entries are keyed on the exact prefix and are not shared across models; a ping sent to a cheaper model, or without the tool definitions, would warm a cache that real requests never read.

TypeScript implementation: OpenAI with connection pooling and cold start monitoring

The TypeScript equivalent uses a globalThis singleton for the OpenAI client (useful wherever a module may be evaluated more than once in the same process, such as Next.js hot reload) and integrates RunGuard’s cold start tracker via the TypeScript SDK.

import OpenAI from "openai";
import RunGuard from "@runguard/sdk";

// globalThis singleton: survives hot-reload in Next.js and some serverless runtimes
declare global {
  // eslint-disable-next-line no-var
  var __openaiClient: OpenAI | undefined;
  var __runguardColdStartGuard: RunGuard.ColdStartTracker | undefined;
}

function getOpenAIClient(): OpenAI {
  if (!globalThis.__openaiClient) {
    globalThis.__openaiClient = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
      // maxRetries: 0 — we handle retries ourselves for cost control
      maxRetries: 0,
    });
  }
  return globalThis.__openaiClient;
}

function getColdStartGuard(): RunGuard.ColdStartTracker {
  if (!globalThis.__runguardColdStartGuard) {
    globalThis.__runguardColdStartGuard = new RunGuard.ColdStartTracker({
      maxColdStartTokens: 4000,
      onExceeded: (ctx) => {
        console.warn(
          `[RunGuard] Cold start bloat: ${ctx.coldStartTokens} tokens exceeds limit ${ctx.limit}. ` +
          `Check system prompt and tool definitions for dynamic content injection.`
        );
      },
    });
  }
  return globalThis.__runguardColdStartGuard;
}

// Stable system prompt — no dynamic content allowed here
const SYSTEM_PROMPT =
  "You are a helpful research assistant specializing in technical documentation. " +
  "Answer questions accurately using the provided tools. Cite sources when available. " +
  "Acknowledge uncertainty rather than speculating. Format responses in Markdown " +
  "unless the user requests plain text. When using tools, prefer targeted queries " +
  "over broad ones to minimize token usage. Do not retry the same tool call with " +
  "identical parameters more than once.";

// Tool definitions: completely static, no dynamic injection
const TOOLS: OpenAI.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "search_documentation",
      description: "Search the product documentation index for relevant pages.",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "Search query" },
          max_results: { type: "integer", description: "Maximum results (1-10)", default: 5 },
        },
        required: ["query"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "fetch_page",
      description: "Retrieve the full content of a documentation page by URL slug.",
      parameters: {
        type: "object",
        properties: {
          slug: { type: "string", description: "URL slug of the documentation page" },
        },
        required: ["slug"],
      },
    },
  },
];

// Approximate cold start token count (system prompt + tools serialized)
let _coldStartTokensCached: number | null = null;

function estimateColdStartTokens(): number {
  if (_coldStartTokensCached !== null) return _coldStartTokensCached;
  // Rough approximation: 1 token ≈ 4 chars for English text
  const systemChars = SYSTEM_PROMPT.length;
  const toolChars = JSON.stringify(TOOLS).length;
  _coldStartTokensCached = Math.ceil((systemChars + toolChars) / 4);
  return _coldStartTokensCached;
}

interface CallAgentOptions {
  userMessage: string;
  conversationHistory?: OpenAI.ChatCompletionMessageParam[];
}

async function callAgent({ userMessage, conversationHistory = [] }: CallAgentOptions): Promise<string> {
  const client = getOpenAIClient();
  const coldStartGuard = getColdStartGuard();

  // Report cold start estimate to RunGuard
  const coldStartTokens = estimateColdStartTokens();
  coldStartGuard.record({ coldStartTokens });

  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "system", content: SYSTEM_PROMPT },
    ...conversationHistory,
    { role: "user", content: userMessage },
  ];

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    max_tokens: 1024,
    messages,
    tools: TOOLS,
  });

  // Check for OpenAI automatic cache hits
  const usage = response.usage;
  const cachedTokens = usage?.prompt_tokens_details?.cached_tokens ?? 0;
  const totalInputTokens = usage?.prompt_tokens ?? 0;

  if (cachedTokens > 0) {
    // GPT-4o list prices: $1.25/MTok cached vs $2.50/MTok standard input
    const savedCost = cachedTokens * (2.50 - 1.25) / 1_000_000;
    console.log(`[Cache] ${cachedTokens}/${totalInputTokens} input tokens cached, saved $${savedCost.toFixed(6)}`);
  } else if (totalInputTokens >= 1024) {
    // Eligible for caching but missed — prefix may have changed
    console.warn(`[Cache] ${totalInputTokens} tokens but 0 cached — check prefix stability`);
  }

  const choice = response.choices[0];
  return choice.message.content ?? "";
}

export { callAgent, estimateColdStartTokens };

The globalThis singleton pattern matters in environments where a module can be evaluated more than once within a single process, such as Next.js development hot reload or builds that bundle the same module into several routes. In those cases a module-level variable is re-created on each evaluation, while a globalThis property persists, so the OpenAI client and its underlying connection pool survive for the warm lifetime of the worker process. Neither approach survives a true cold start, when a new process or isolate is created. The cache performance logging, which checks cached_tokens in prompt_tokens_details, should be wired to your observability platform. A cache hit rate below 80% on a stable agent is a signal that the system prompt prefix is changing between calls. For session-level cost analysis across multiple turns, see AI agent cost per user session.
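
The 80% signal is easier to act on when it is computed over a rolling window rather than read off individual log lines. The sketch below is in Python and is a hypothetical helper, not part of any SDK; it counts a call as a hit when the provider reports any cached tokens, which is one reasonable definition among several.

```python
from collections import deque


class CacheHitMonitor:
    """Rolling per-call cache hit rate (hypothetical helper)."""

    def __init__(self, window: int = 100, min_hit_rate: float = 0.80) -> None:
        # deque(maxlen=...) silently drops the oldest sample once full
        self._hits = deque(maxlen=window)
        self.min_hit_rate = min_hit_rate

    def record(self, cached_tokens: int) -> None:
        """Record one call; any cached tokens at all counts as a hit."""
        self._hits.append(cached_tokens > 0)

    def hit_rate(self) -> float:
        if not self._hits:
            return 0.0
        return sum(self._hits) / len(self._hits)

    def is_healthy(self) -> bool:
        return self.hit_rate() >= self.min_hit_rate


monitor = CacheHitMonitor(window=100, min_hit_rate=0.80)
# cached_tokens values as they might appear across ten consecutive responses
for cached in (1152, 1152, 0, 1152, 0, 1152, 1152, 0, 1152, 1152):
    monitor.record(cached)

print(f"hit rate: {monitor.hit_rate():.0%}, healthy: {monitor.is_healthy()}")
```

Feeding the monitor from the cached_tokens field on each response gives a single number to alert on.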

RunGuard cold start budget: catching system prompt bloat in production

Prompt caching is powerful, but it only helps if you know your cold start token count in the first place. Without explicit tracking, system prompt bloat is invisible: a developer adds a two-paragraph example to the system prompt, the token count climbs from 1,500 to 2,800, and the system prompt's cost rises by 87% (about 37% for the full 3,500-token cold start) without any change in agent behavior or output quality.
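
A small budget check run in a deployment pipeline is one way to make this kind of growth visible. The sketch below is an assumption-laden illustration: it uses the rough heuristic of about 4 characters per token rather than a real tokenizer, and check_cold_start_budget is a hypothetical helper rather than a RunGuard API.

```python
import json


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return -(-len(text) // 4)  # ceiling division


def check_cold_start_budget(system_prompt: str, tools: list[dict],
                            max_tokens: int) -> tuple[bool, int]:
    """Return (within_budget, estimated_tokens) for the static prefix."""
    estimated = estimate_tokens(system_prompt) + estimate_tokens(json.dumps(tools))
    return estimated <= max_tokens, estimated


# Synthetic prompts sized to match the example in the text
baseline_prompt = "x" * 6_000    # about 1,500 tokens under the heuristic
bloated_prompt = "x" * 11_200    # about 2,800 tokens under the heuristic
no_tools: list[dict] = []

baseline_ok, baseline_tokens = check_cold_start_budget(baseline_prompt, no_tools, 2_000)
bloated_ok, bloated_tokens = check_cold_start_budget(bloated_prompt, no_tools, 2_000)

# A pipeline step would exit non-zero when the budget is exceeded
exit_code = 0 if bloated_ok else 1
print(f"baseline: {baseline_tokens} tokens, ok={baseline_ok}")
print(f"bloated:  {bloated_tokens} tokens, ok={bloated_ok}, exit_code={exit_code}")
```

Because the estimate is approximate, the budget should leave some headroom above the current prefix size.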

What RunGuard’s cold start tracker measures. The ColdStartTracker component accepts the token count of your system prompt plus tool definitions on each agent initialization. It maintains a rolling baseline and fires the on_exceeded callback when a call’s cold start overhead exceeds the configured threshold. In CI/CD, this becomes a lightweight guardrail: run a test invocation in your deployment pipeline, check the cold start token count, and fail the build if it exceeds your budget.
Distinguishing cold start tokens from task tokens. Without explicit tracking, all input tokens look identical in your billing dashboard. RunGuard separates the cold start overhead (system prompt + tools) from task-specific input (user messages, tool responses) in its per-session cost breakdown. This lets you answer the question: “How much of our LLM spend is pure overhead vs productive task work?” For well-optimized agents with caching, overhead should be under 5% of total token spend. Overhead above 20% is a signal that caching is broken or the system prompt has grown disproportionately.
A cache write costs more than an uncached call by design. Anthropic charges $3.75/MTok for cache writes vs $3.00/MTok for standard input, a 25% premium. This means a cold start that misses the cache and rewrites it is more expensive than a fully non-cached call. If your agent has intermittent traffic (calls spaced >5 minutes apart), you pay the write premium without benefit. In that case, either disable caching for that workload or consider a keep-alive ping (a minimal call that reads the cached prefix and refreshes its TTL). Pings pay off only while the reads they add between real calls cost less than the write they avoid.
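
The trade-off between rewriting the cache, pinging it, and not caching at all can be sketched numerically. The example below works in units of one uncached prefix (1.0), uses only the write and read multipliers quoted above, and ignores the small per-ping cost of the user message and output token; the function names are illustrative.

```python
WRITE_MULT = 1.25  # cache write relative to standard input
READ_MULT = 0.10   # cache read relative to standard input


def breakeven_reads_per_write() -> float:
    """Solve WRITE + READ * n = 1 + n for n, the reads needed to repay one write."""
    return (WRITE_MULT - 1.0) / (1.0 - READ_MULT)


def prefix_cost_per_real_call(pings_between_calls: int) -> dict[str, float]:
    """Relative prefix cost of one real call under each strategy for sparse traffic."""
    return {
        "no caching": 1.0,
        "cache, rewrite each call": WRITE_MULT,
        # each ping is one cache read, and the real call is one more read
        "cache + keep-alive pings": READ_MULT * (pings_between_calls + 1),
    }


def cheapest_strategy(pings_between_calls: int) -> str:
    costs = prefix_cost_per_real_call(pings_between_calls)
    return min(costs, key=costs.get)


print(f"breakeven reads per write: {breakeven_reads_per_write():.2f}")
for pings in (3, 8, 20):
    print(f"{pings:>2} pings between calls -> {cheapest_strategy(pings)}")
```

Under these assumptions a handful of pings between real calls is cheaper than either alternative, while long idle gaps that would need many pings favor turning caching off for that workload.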

Cold start reduction approaches: token overhead, cost per 10k calls, and complexity

Approach	Token overhead per call	Cost per 10k calls (3,500 token cold start)	Cache hit rate	Implementation complexity
No caching, new client per call	3,500 (full cold start every time)	$105.00	0%	Low (no setup required)
No caching, client singleton	3,500 (tokens) + TLS saved	$105.00 (saves latency, not cost)	0%	Low (one-line singleton)
OpenAI automatic caching	~1,750 effective (50% discount on 3,500 for GPT-4o)	~$43.75 (GPT-4o: $2.50/MTok input, $1.25/MTok cached)	70–85% on stable prefixes	Low (automatic, no code changes)
Anthropic explicit cache_control	~350 effective (90% discount on cache read)	~$10.50	85–97% on stable prefixes	Medium (requires cache_control blocks)
Anthropic caching + keep-alive warm ping	~350 effective	~$10.50	95–99%	Medium (warm ping on module load)
Anthropic caching + RunGuard cold start tracker	~350 effective + bloat alerts	~$10.50 with bloat guardrails	95–99%	Medium (adds observability layer)
System prompt minimization + caching	~100–175 effective (e.g. 700-token prompt plus trimmed tools, 1,000–1,750 tokens total, cached)	~$3.00–$5.25	95–99%	High (requires prompt engineering discipline)

Stop paying $105/day for agent overhead before the task even starts

AI agent cold start cost reduction is one of the highest-leverage optimizations available to teams running agents at scale. The math is straightforward: a 3,500-token cold start at $3.00/MTok costs $105/day at 10,000 invocations. Anthropic prompt caching at $0.30/MTok for cache reads reduces that to $10.50/day. The technical requirements are minimal — a stable system prompt with no dynamic content, explicit cache_control blocks, and a module-level client singleton for connection reuse. RunGuard’s cold start tracker adds the observability layer that catches system prompt bloat before it silently doubles your monthly bill. Together these changes typically cut cold start overhead by 85–90% within one deployment cycle.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
