LLM Cost Spike Root Cause Analysis: 8 Common Causes and How to Diagnose Them
You opened your LLM provider dashboard and saw a cost spike. Now what? This guide walks through the 8 most common root causes of sudden LLM cost increases — from agent loops to deployment accidents — with diagnostic queries, Python investigation scripts, and the circuit breaker patterns that prevent each one from recurring.
First: The Triage Flow
Before diving into root causes, run this 5-question triage to narrow down the category:
- When did the spike start? Check your provider dashboard for exact UTC time. Cross-reference with deployment history, cron job schedules, and incident logs.
- Which app or API key is responsible? Most providers break down cost by API key. If you use a single key, check your own per-app tagging in logs.
- Is it more calls or more tokens per call? Call volume increase → traffic spike or loop. Token increase per call → context accumulation, model change, or prompt change.
- Did anything deploy in the prior 24 hours? Model version changes, new tools, prompt edits, and batch job additions all correlate with spikes.
- Is the spike still ongoing or historical? If ongoing, stop it first (rate limit, circuit breaker, feature flag), then diagnose. Don't diagnose a running fire.
The following Python script pulls the diagnostic data from the RunGuard API. If you rely on custom logs instead, swap the endpoints for equivalent queries against your own store:
```python
import os
from datetime import datetime, timedelta, timezone

import requests

RUNGUARD_API_KEY = os.environ["RUNGUARD_API_KEY"]
BASE_URL = "https://api.runguard.dev/v1"


def get_cost_summary(app_id: str, hours_back: int = 48) -> dict:
    """Fetch hourly cost buckets for the last `hours_back` hours."""
    since = (datetime.now(timezone.utc) - timedelta(hours=hours_back)).isoformat()
    r = requests.get(
        f"{BASE_URL}/apps/{app_id}/cost-summary",
        headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
        params={"since": since, "granularity": "hour"},
        timeout=30,
    )
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json()


def find_spike_hour(summary: dict) -> list:
    """Print and return the hourly buckets costing more than 3x the 7-day hourly average."""
    avg = summary["seven_day_hourly_avg_usd"]
    if avg <= 0:
        return []  # no baseline to compare against
    spikes = [b for b in summary["hourly_buckets"] if b["cost_usd"] > avg * 3]
    for bucket in spikes:
        print(f"SPIKE at {bucket['hour']}: "
              f"${bucket['cost_usd']:.2f} vs avg ${avg:.2f} "
              f"({bucket['cost_usd'] / avg:.1f}×)")
    return spikes


def analyze_spike_hour(app_id: str, hour: str) -> list:
    """Fetch the 20 most expensive sessions in `hour` and print the top 5."""
    r = requests.get(
        f"{BASE_URL}/apps/{app_id}/sessions",
        headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
        params={"hour": hour, "sort": "cost_usd_desc", "limit": 20},
        timeout=30,
    )
    r.raise_for_status()
    sessions = r.json()["sessions"]
    for s in sessions[:5]:
        print(f"Session {s['id']}: ${s['cost_usd']:.4f}, "
              f"{s['tool_calls']} tool calls, "
              f"{s['input_tokens']} in / {s['output_tokens']} out tokens")
    return sessions
```
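The third triage question (more calls, or more tokens per call?) can be answered mechanically once you have hourly aggregates. The sketch below is one possible way to do it. The `HourStats` shape and the 1.5× threshold are assumptions made for illustration, not part of any provider or RunGuard API.

```python
from dataclasses import dataclass


@dataclass
class HourStats:
    """Aggregate usage for one hour (hypothetical shape for this example)."""
    calls: int
    total_tokens: int


def classify_spike(baseline: HourStats, spike: HourStats, threshold: float = 1.5) -> str:
    """Report which factor grew past `threshold`: call volume, tokens per call, or both."""
    if baseline.calls == 0 or spike.calls == 0 or baseline.total_tokens == 0:
        return "insufficient data"
    call_ratio = spike.calls / baseline.calls
    tokens_per_call_ratio = (
        (spike.total_tokens / spike.calls) / (baseline.total_tokens / baseline.calls)
    )
    volume_up = call_ratio >= threshold
    tokens_up = tokens_per_call_ratio >= threshold
    if volume_up and tokens_up:
        return "both"
    if volume_up:
        return "call volume"  # points toward a traffic spike, loop, or retry storm
    if tokens_up:
        return "tokens per call"  # points toward context growth or a prompt change
    return "neither"
```

A "neither" result with a higher bill is itself informative: it suggests the price per token changed, which is the model-upgrade case covered under Cause 3.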
Cause 1: Agent Tool-Call Loop
Signature: Sharp spike in call count, cost-per-session 10–100× normal, sessions with identical or near-identical tool calls repeating.
Diagnostic:
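```python
from collections import Counter

# Reuses requests, BASE_URL, RUNGUARD_API_KEY and analyze_spike_hour from the triage script.


def detect_loop_sessions(app_id: str, hour: str) -> None:
    """Flag expensive sessions in the spike hour whose tool calls repeat."""
    sessions = analyze_spike_hour(app_id, hour)
    for session in sessions:
        if session["tool_calls"] <= 20:  # threshold for loop suspicion
            continue
        # Fetch the tool call sequence for this session
        r = requests.get(
            f"{BASE_URL}/sessions/{session['id']}/tool-calls",
            headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
            timeout=30,
        )
        r.raise_for_status()
        calls = r.json()["tool_calls"]
        # A tool name used more than 3 times is a loop signal
        # (it compares names only, not arguments, so expect some false positives)
        name_counts = Counter(c["name"] for c in calls)
        repeated = {name: n for name, n in name_counts.items() if n > 3}
        if repeated:
            print(f"Loop detected in {session['id']}: {repeated}")
```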
Root causes: Missing loop-break condition in agent prompt, tool that always returns a result requiring further action, state not persisted between tool calls causing the agent to "forget" it already did the action.
Fix: Add a max-iterations guard and a loop-detection circuit breaker. RunGuard detects repeated tool call patterns automatically and trips the breaker before the 10th repetition.
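As an illustration of the max-iterations guard and loop detector described above, here is a minimal sketch you could call before each tool execution in your own agent loop. `LoopGuard`, `LoopBreakerTripped`, and the default limits are hypothetical names and values chosen for this example; this is not RunGuard's implementation.

```python
import json
from collections import Counter


class LoopBreakerTripped(RuntimeError):
    """Raised when the agent exceeds its iteration or repetition limit."""


class LoopGuard:
    def __init__(self, max_iterations: int = 25, max_repeats: int = 3):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.iterations = 0
        self.seen = Counter()

    def check(self, tool_name: str, arguments: dict) -> None:
        """Call once before executing each tool call; raises to stop the run."""
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise LoopBreakerTripped(f"exceeded {self.max_iterations} iterations")
        # Fingerprint name plus arguments so that different calls to the
        # same tool are not mistaken for a loop.
        fingerprint = (tool_name, json.dumps(arguments, sort_keys=True, default=str))
        self.seen[fingerprint] += 1
        if self.seen[fingerprint] > self.max_repeats:
            raise LoopBreakerTripped(
                f"{tool_name} called {self.seen[fingerprint]} times with identical arguments"
            )
```

Create one guard per session and catch `LoopBreakerTripped` at the top of the agent loop so the session ends with a clear error instead of continuing to spend.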
Cause 2: Context Window Accumulation
Signature: Gradually increasing tokens-per-call over a session, cost rising monotonically within a session, final calls in a long session costing 5–10× the first call.
Diagnostic:
```python
# Reuses requests, BASE_URL and RUNGUARD_API_KEY from the triage script.


def detect_context_accumulation(session_id: str) -> None:
    """Print per-turn input token growth and warn if the context never shrinks."""
    r = requests.get(
        f"{BASE_URL}/sessions/{session_id}/messages",
        headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
        timeout=30,
    )
    r.raise_for_status()
    messages = r.json()["messages"]

    prev_tokens = 0
    for i, msg in enumerate(messages):
        total_tokens = msg["input_tokens"]
        growth = total_tokens - prev_tokens
        print(f"Turn {i}: {total_tokens} input tokens ({growth:+d})")
        prev_tokens = total_tokens

    # If input tokens rise on every single turn, the context is not being pruned
    growths = [b["input_tokens"] - a["input_tokens"]
               for a, b in zip(messages, messages[1:])]
    if growths and all(g > 0 for g in growths):
        print("WARNING: Context growing on every turn; no pruning in effect")
```
Root causes: Full conversation history appended to each call without summarization or sliding window, large tool call results stored verbatim in context, no max-context enforcement.
Fix: Implement context pruning — summarize old turns, drop tool call results after they've been acted on, enforce a max context size. See AI agent context pruning strategies.
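A sliding window with a hard token ceiling is the simplest of these pruning strategies. The sketch below assumes chat-style messages with `role` and `content` string fields, and it uses a rough four-characters-per-token estimate as a stand-in for a real tokenizer. Both are assumptions for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def prune_context(messages: list, max_tokens: int) -> list:
    """Keep system messages plus the most recent turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Note that dropping messages purely by position can separate a tool call from its result, which some provider APIs reject. A production version would prune in whole call/result pairs or replace the dropped span with a summary.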
Cause 3: Model Version Upgrade
Signature: Spike aligned exactly with a deployment timestamp, call count unchanged, tokens per call unchanged, but cost per token several times higher.
Diagnostic: Check your deployment history against the spike timestamp. Query your logs for the model parameter value before and after the spike hour. Moving up a model tier, for example from gpt-4o-mini to gpt-4o or from claude-haiku-4-5-20251001 to claude-sonnet-4-6, multiplies per-token cost. The factor ranges from a few times to well over 10× depending on the pair, so confirm it against your provider's current price list.
Fix: Pin model versions explicitly. Never use aliases like gpt-4-latest in production — provider alias updates silently change your cost profile. Review model routing logic to ensure upgrades are intentional.
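Both halves of this advice can be automated. The sketch below assumes your request logs are available as dictionaries with a `model` field, and that you maintain an allowlist of approved model IDs. The `PINNED_MODELS` set and both function names are hypothetical, chosen for this example.

```python
from collections import Counter

# Hypothetical allowlist: replace with the exact model IDs you have approved.
PINNED_MODELS = {"gpt-4o-mini-2024-07-18", "claude-haiku-4-5-20251001"}


def assert_pinned(model: str) -> str:
    """Refuse to call the provider with a model that is not explicitly approved."""
    if model not in PINNED_MODELS:
        raise ValueError(f"model {model!r} is not in the pinned allowlist")
    return model


def new_models_after(before: list, after: list) -> dict:
    """Return models (with call counts) that appear after the spike but not before it."""
    seen_before = {record["model"] for record in before}
    counts = Counter(record["model"] for record in after)
    return {model: n for model, n in counts.items() if model not in seen_before}
```

Running `new_models_after` on the log records from the hour before and the hour after the spike gives a quick yes/no answer to whether a model change is involved. Calling `assert_pinned` at client construction time turns an accidental upgrade into a startup failure rather than a billing surprise.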
Cause 4: Retry Storm
Signature: Sharp spike in call count, majority of sessions have the same input hash (same content, different session IDs), spike often follows an upstream service incident.
Diagnostic:
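```python
import hashlib
from collections import Counter

# Reuses requests, BASE_URL, RUNGUARD_API_KEY and analyze_spike_hour from the triage script.


def detect_retry_storm(app_id: str, hour: str) -> None:
    """Flag inputs that recur across many sessions within the spike hour."""
    # Note: analyze_spike_hour returns only the 20 most expensive sessions,
    # so raise its limit if you need to cover the whole hour.
    sessions = analyze_spike_hour(app_id, hour)
    input_hashes = Counter()
    for s in sessions:
        r = requests.get(
            f"{BASE_URL}/sessions/{s['id']}/messages",
            headers={"Authorization": f"Bearer {RUNGUARD_API_KEY}"},
            timeout=30,
        )
        r.raise_for_status()
        messages = r.json()["messages"]
        if not messages:
            continue
        # Hash the first 200 characters of the first message of each session
        # (MD5 is used here only as a cheap fingerprint, not for security)
        first_msg = messages[0]["content"][:200]
        h = hashlib.md5(first_msg.encode()).hexdigest()[:8]
        input_hashes[h] += 1

    # The same input arriving under many session IDs indicates a retry storm
    for h, count in input_hashes.most_common(5):
        if count > 5:
            print(f"Retry storm: input hash {h} appeared {count} times in 1 hour")
```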
Fix: Add idempotency keys to all agent invocations, implement deduplication at the receiver layer, add exponential backoff to retry logic so storms don't sustain. See webhook cost control for the full pattern.
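The three pieces of this fix fit together as sketched below: a key derived from the request payload, a cache that returns the stored result for a repeated key, and a jittered backoff delay for the caller. The in-memory `DedupCache` and all function names are assumptions for this example. A real deployment would need a shared store such as a database or cache service so that deduplication works across processes.

```python
import hashlib
import json
import random
import time


def idempotency_key(payload: dict) -> str:
    """Derive a stable key from the payload, independent of dict key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupCache:
    """In-memory TTL cache mapping idempotency keys to results (single process only)."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._results = {}  # key -> (expires_at, result)

    def get_or_run(self, payload: dict, run):
        """Return the cached result for this payload, or call run(payload) once."""
        key = idempotency_key(payload)
        now = self.clock()
        hit = self._results.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # duplicate delivery: no LLM call is made
        result = run(payload)
        self._results[key] = (now + self.ttl, result)
        return result


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: a random delay up to min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

This sketch does not handle two identical requests that arrive while the first is still running. Covering that case requires recording an "in progress" marker before calling `run`.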
Cause 5: New High-Volume User or Integration
Signature: Gradual cost increase (not sudden spike), call volume increase tracks with a specific user ID or API key, cost per session normal but total sessions much higher.
Diagnostic: Group sessions by user ID or tenant. Look for a Pareto distribution where one or two users account for 80%+ of new volume. Check whether the increase started when a specific customer onboarded or a new integration went live.
Fix: Implement per-user rate limits and monthly budget caps. Power users are often a sign of product-market fit — the fix isn't to block them, but to price them correctly (usage-based tier) or apply per-user budgets that require upgrade to increase.
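To check for the Pareto pattern described in the diagnostic, you can rank users by their share of total cost. The sketch below assumes each session record carries a `user_id` field alongside `cost_usd`. The `user_id` field name is an assumption, since the session fields shown earlier in this guide do not include one.

```python
from collections import defaultdict


def cost_share_by_user(sessions: list) -> list:
    """Return (user_id, cost_usd, share_of_total) tuples, highest cost first."""
    totals = defaultdict(float)
    for s in sessions:
        totals[s["user_id"]] += s["cost_usd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(user, cost, cost / grand_total) for user, cost in ranked]


def over_budget(month_to_date_usd: float, cap_usd: float) -> bool:
    """True once a user's month-to-date spend has reached their cap."""
    return month_to_date_usd >= cap_usd
```

If the first one or two rows account for most of the total, the increase is concentrated rather than organic, and a per-user cap checked with something like `over_budget` before each session is the matching control.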
Cause 6: Prompt Injection / Malicious Input
Signature: Very high token counts in sessions from specific users, output containing unexpected tool calls (file reads, web fetches, code execution), sessions with unusually high output token counts.
Diagnostic: Fetch the raw user input from high-cost sessions. Look for patterns like "ignore previous instructions", "repeat the above 100 times", "search the web for X and then Y and then Z" chains designed to maximize agent work.
Fix: Input length limits, output token caps, tool call allowlists, and per-session budget caps that automatically terminate expensive sessions regardless of cause. RunGuard's budget enforcement is cause-agnostic — it trips the breaker whether the cost came from a legitimate multi-step task or an injection attack.
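The controls listed above can be combined into one per-session object that your agent loop consults at each step. This is a minimal sketch with hypothetical names (`SessionGuard`, `SessionGuardError`) and illustrative default limits. It is not RunGuard's SDK, and it leaves out the output token cap, which is normally set through the provider's own request parameter.

```python
class SessionGuardError(RuntimeError):
    """Raised when a session breaks one of its limits and should be terminated."""


class SessionGuard:
    def __init__(self, budget_usd: float, allowed_tools, max_input_chars: int = 8000):
        self.budget_usd = budget_usd
        self.allowed_tools = set(allowed_tools)
        self.max_input_chars = max_input_chars
        self.spent_usd = 0.0

    def check_input(self, text: str) -> None:
        """Reject oversized user input before it reaches the model."""
        if len(text) > self.max_input_chars:
            raise SessionGuardError(f"input exceeds {self.max_input_chars} characters")

    def check_tool(self, name: str) -> None:
        """Reject any tool the session was not explicitly granted."""
        if name not in self.allowed_tools:
            raise SessionGuardError(f"tool {name!r} is not on the allowlist")

    def record_cost(self, cost_usd: float) -> None:
        """Add the cost of one model call; raise once the session budget is exceeded."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.budget_usd:
            raise SessionGuardError(
                f"session spent ${self.spent_usd:.2f} of ${self.budget_usd:.2f} budget"
            )
```

Because `record_cost` only looks at accumulated spend, it stops an expensive session without needing to know whether the cause was an injection or a legitimate long task.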
Cause 7: Batch Job Misconfiguration
Signature: Spike at a regular time (hourly, daily), much higher-than-expected volume, spike repeats on the same schedule.
Diagnostic: Check cron job schedules against the spike times. Look for batch jobs that process more records than expected, for example a job meant to process today's 100 new records that instead queries all 100,000 historical records. Check the date filter and the LIMIT clause in your data fetch queries.
Fix: Add an expected-record-count assertion to the batch job. If the record count exceeds 2× the historical max, abort before calling the LLM layer. Add a separate per-run budget cap that causes the job to halt gracefully rather than silently over-processing.
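The two guards described above might look like the following sketch. It assumes your job can supply a `process` callable that handles one record and returns its cost in USD. That callable, the `BatchAborted` exception, and the function names are hypothetical and introduced only for this example.

```python
class BatchAborted(RuntimeError):
    """Raised before any LLM call when the batch is implausibly large."""


def check_batch_size(record_count: int, historical_max: int, factor: float = 2.0) -> None:
    """Abort if the batch exceeds `factor` times the largest batch seen historically."""
    limit = historical_max * factor
    if record_count > limit:
        raise BatchAborted(
            f"{record_count} records exceeds limit of {limit:.0f} "
            f"({factor}x historical max {historical_max})"
        )


def run_batch(records: list, process, historical_max: int, run_budget_usd: float):
    """Process records until done or until the per-run budget is reached."""
    check_batch_size(len(records), historical_max)
    spent = 0.0
    processed = 0
    for record in records:
        if spent >= run_budget_usd:
            break  # halt gracefully; remaining records wait for the next run
        spent += process(record)  # process() is assumed to return the cost in USD
        processed += 1
    return processed, spent
```

Returning the `(processed, spent)` pair lets the scheduler log a partial run and alert on it, rather than the job failing silently or running to completion at any cost.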
Cause 8: Parallel Agent Fan-Out
Signature: Cost spike proportional to a configuration change (number of workers, number of parallel agents), individual session cost normal but total session count 10–100× higher than expected.
Diagnostic: Check concurrency configuration against the spike timestamp. Fan-out is often set as an environment variable (MAX_WORKERS, AGENT_CONCURRENCY) that was accidentally increased or defaulted to a high value in a new environment.
Fix: Set explicit concurrency limits with circuit breaker semantics: if the number of active sessions exceeds a threshold, queue new requests rather than spawning new agents. Monitor total concurrent cost (active sessions × estimated cost per session) as a real-time metric.
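For agents written with asyncio, a semaphore gives the queue-instead-of-spawn behavior with very little code. The sketch below is one way to do it, assuming each agent run is exposed as a zero-argument coroutine function. `run_with_limit` is a hypothetical helper name, and the returned peak count is included only so the limit can be observed.

```python
import asyncio


async def run_with_limit(factories: list, max_concurrency: int):
    """Run coroutine factories with at most max_concurrency active at once.

    Returns (results, peak_active). Work beyond the limit waits on the
    semaphore instead of starting immediately.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    active = 0
    peak = 0

    async def guarded(factory):
        nonlocal active, peak
        async with semaphore:  # blocks here while the limit is reached
            active += 1
            peak = max(peak, active)
            try:
                return await factory()
            finally:
                active -= 1

    results = await asyncio.gather(*(guarded(f) for f in factories))
    return results, peak
```

Hard-coding `max_concurrency` as an explicit argument, rather than reading an unbounded default from the environment, means a missing variable in a new environment fails safe. Multiplying the live `active` count by an estimated cost per session gives the concurrent-cost metric mentioned above.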
Prevention: Circuit Breakers for Each Cause
After root cause analysis, the question is: how do you prevent recurrence? Each cause maps to a specific circuit breaker pattern:
| Root cause | Prevention control |
|---|---|
| Tool-call loop | Max iterations + loop pattern detector |
| Context accumulation | Max context tokens + sliding window pruning |
| Model version upgrade | Pinned model IDs + cost-per-token alerting |
| Retry storm | Idempotency keys + deduplication cache |
| High-volume user | Per-user monthly budget cap |
| Prompt injection | Input length limit + per-session budget |
| Batch misconfiguration | Expected-record assertion + per-run budget |
| Parallel fan-out | Concurrency limit + total-active-cost monitor |
RunGuard implements all eight controls. The wrap() decorator enforces per-session budgets, detects tool-call loops, and surfaces alerts when any threshold is breached — regardless of which root cause triggered the breach.
For incident response, see the AI agent on-call cost incident runbook. For proactive detection before a spike becomes an incident, see AI agent cost anomaly detection. To build a full prevention stack, start with RunGuard's circuit breaker SDK.