AI agent on-call cost incident runbook: how to respond when your LLM cost alert fires at 2am
Your phone buzzes. PagerDuty. “LLM spend rate 4.7× above P95 baseline for the last 15 minutes.” It’s 2:14am. Unlike a service outage, where users are actively complaining and the symptoms are obvious, an LLM cost spike is invisible to your users — they’re getting responses, the system appears healthy, and you have no idea whether you’re burning $500 or $50,000 before the sun comes up. Without a documented runbook, the first five minutes of a cost incident are spent arguing about who should look at what. With a runbook, those five minutes are a structured triage that either clears the alert as a false positive or initiates containment before significant damage accumulates. This page is that runbook: a step-by-step guide through triage, containment, root cause investigation, mitigation, and the post-mortem process for LLM cost incidents in production AI agent systems.
Why cost incidents need a dedicated runbook
- Cost incidents are asymmetric and time-sensitive. A service latency incident causes user pain proportional to its duration; when you fix it, the pain stops. A cost incident causes financial damage proportional to its duration, but the money is already spent: you cannot un-spend the $3,000 in API calls that accrued while your on-call engineer was figuring out which Slack channel to post in. Every minute of an uncontained cost incident has a dollar value that directly compresses your margin. A typical uncontrolled LLM cost spike running for 45 minutes at 5× normal spend rate represents 3.75 hours of normal spend compressed into 45 minutes. At a $200/hour normal run rate, that’s $750 in 45 minutes versus $150 in the same period under normal operation, or $600 of excess spend. A runbook that gets your MTTC (mean time to contain) from 30 minutes to 8 minutes removes 22 minutes of spending at $800/hour above baseline, which saves about $293 in excess spend per incident at this scale.
- Alert fatigue is real but so are false negatives. Poorly calibrated cost alerts produce false positives that train on-call engineers to dismiss or snooze them — until the alert that mattered gets snoozed too. A runbook that starts with a structured false-positive check (step one of triage) gives engineers a defensible reason to clear an alert they believe is a misconfiguration, without just ignoring it. It also prevents the opposite failure: the engineer who stares at an alert for 20 minutes wondering if it’s real before taking any action. The runbook removes ambiguity from the first decision point.
- Cost incident root causes are non-obvious. Service incidents usually have obvious symptoms (error rate spike, latency spike, health check failure). Cost incidents can be caused by a dozen different mechanisms: a prompt regression that increased average output length by 40%, a tool API change that made results 8× larger, a retry bug that sent every request 6 times, a routing failure that sent cheap-model traffic to an expensive model, a malicious user prompt injection designed to maximize token consumption, or a legitimate traffic surge that simply wasn’t anticipated. Without a structured investigation sequence, engineers chase the wrong hypothesis and waste 20 minutes before finding the actual cause. A runbook encodes your team’s accumulated knowledge about which causes are most common and which signals to check first.
- Post-mortems without runbooks are less useful. If every engineer handles cost incidents differently, your post-mortem data is inconsistent: different teams capture different signals, apply different containment steps, and reach different conclusions about root cause. A runbook creates a consistent paper trail for every incident. Within six months of running structured runbooks, you will have enough pattern data to identify which root causes recur most often, which containment steps are most effective, and where your alerting thresholds need adjustment — insights that are simply not available from ad-hoc incident handling.
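The cost-of-delay arithmetic from the first point above can be checked in a few lines. This is a minimal sketch that assumes the spend multiple stays constant for the whole incident; the function names `excess_spend` and `containment_savings` are hypothetical, and the $200/hour, 5×, 30-minute and 8-minute figures are the example values used above.

```python
def excess_spend(baseline_per_hour: float, multiple: float, minutes: float) -> float:
    """Spend above baseline accrued while running at `multiple` x baseline."""
    return baseline_per_hour * (multiple - 1) * minutes / 60


def containment_savings(baseline_per_hour: float, multiple: float,
                        mttc_before_min: float, mttc_after_min: float) -> float:
    """Excess spend avoided by containing the incident sooner."""
    return (excess_spend(baseline_per_hour, multiple, mttc_before_min)
            - excess_spend(baseline_per_hour, multiple, mttc_after_min))


if __name__ == "__main__":
    # Example figures from the text: $200/hour baseline, 5x spike.
    print(f"45-minute incident, excess: ${excess_spend(200, 5, 45):.2f}")
    print(f"MTTC 30 -> 8 min saves:     ${containment_savings(200, 5, 30, 8):.2f}")
```

Plugging in your own baseline and a realistic spike multiple gives you a dollar figure for each minute of containment delay, which is usually the most persuasive argument for investing in the runbook.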
Incident severity classification: P0, P1, P2
- P0 — existential spend rate. A P0 cost incident is one where the current spend rate, if sustained for 60 minutes, would represent more than 20% of your monthly LLM budget. At this severity, the correct first action is immediate containment (circuit breaker or emergency rate limit), not investigation. You investigate after you’ve stopped the bleeding. P0 triggers include: spend rate exceeding 10× baseline for more than 5 minutes, a single session exceeding $50 in total cost, aggregate hourly spend projected to exceed $1,000 (calibrate this threshold to your business). P0 incidents require the on-call engineer to page the engineering lead within 10 minutes of confirmation, regardless of time of day. Recovery from a P0 without a circuit breaker in place is not acceptable — the post-mortem action item should always include “add circuit breaker for this failure mode.”
- P1 — material overspend requiring same-day resolution. A P1 cost incident is one where the spend rate is elevated 3–10× above baseline, sustained for more than 10 minutes, and projected to cause meaningful overspend if allowed to continue until business hours. Unlike P0, investigation can proceed in parallel with containment rather than being deferred. P1 incidents should be resolved within 2 hours of detection; if the engineer on call cannot identify the root cause within 45 minutes, they escalate to the engineering lead. Common P1 patterns include: a prompt regression affecting 15–30% of sessions, a caching layer that stopped working and is sending all previously-cached requests to the API, or a power user whose session has entered a loop but is below the automatic circuit breaker threshold.
- P2 — elevated spend warranting investigation during business hours. A P2 cost incident is one where spend is elevated 1.5–3× above baseline but not trending toward material overspend. The correct action is to acknowledge the alert, document the symptoms in the incident tracker, and schedule investigation for next business day. P2 incidents should not interrupt sleep. They are important — they often represent the early warning of a pattern that will become P1 if left unaddressed — but they don’t warrant immediate containment. Examples: a new user cohort that turns out to be heavier users than your existing cohort, a feature that costs 40% more than estimated in production due to longer-than-expected user inputs, or a third-party tool API that started returning more verbose responses.
- Classification must be automated, not manual. Engineers woken at 2am should not have to calculate whether an alert is P0, P1, or P2. Your alerting system should classify severity automatically based on the projected 60-minute spend, the duration of the anomaly, and the trend (accelerating vs. stable vs. decelerating). The PagerDuty/OpsGenie notification should include the severity classification, the current spend rate, the projected hourly cost, and a link to the pre-populated investigation dashboard — not just “cost alert fired.”
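The thresholds above translate fairly directly into a classifier that can run inside the alerting pipeline. The sketch below is illustrative only: `SpendSnapshot` and `classify_severity` are hypothetical names, the per-session $50 trigger and the trend signal are left out for brevity, and treating any not-yet-sustained spike as P2 is an assumption you should adjust to your own paging policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpendSnapshot:
    multiple_of_baseline: float   # current spend rate / baseline spend rate
    sustained_minutes: float      # how long the anomaly has lasted
    projected_60min_usd: float    # spend over the next hour at the current rate
    monthly_budget_usd: float


def classify_severity(s: SpendSnapshot) -> Optional[str]:
    """Return "P0", "P1", "P2", or None when nothing should fire."""
    # P0: an hour at this rate would consume more than 20% of the monthly
    # budget, or spend has exceeded 10x baseline for more than 5 minutes.
    if s.projected_60min_usd > 0.20 * s.monthly_budget_usd:
        return "P0"
    if s.multiple_of_baseline > 10 and s.sustained_minutes > 5:
        return "P0"
    # P1: at least 3x baseline, sustained for more than 10 minutes.
    if s.multiple_of_baseline >= 3 and s.sustained_minutes > 10:
        return "P1"
    # P2 (assumption): elevated spend that has not met a paging condition yet.
    if s.multiple_of_baseline >= 1.5:
        return "P2"
    return None


if __name__ == "__main__":
    snapshot = SpendSnapshot(4.7, 15, 940.0, 50_000.0)
    print(classify_severity(snapshot))
```

Keeping the rules in one pure function makes them easy to unit-test and to review in the post-mortem when a threshold turns out to be miscalibrated.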
The triage and containment playbook
- Step 1: confirm the alert is real (2 minutes). Before doing anything else, verify the alert is not a misconfiguration or monitoring artifact. Check: (a) is the alert based on cost data from the last 5 minutes or from a delayed batch job that just processed 2 hours of data at once? (b) did you just deploy a new feature that legitimately costs more? (c) is the alert threshold miscalibrated for a seasonal traffic pattern (e.g., Monday morning surge)? If any of these explain the alert, acknowledge it with a note, adjust the threshold if appropriate, and go back to sleep. If none of these explain it, proceed to step 2. This step should take no more than 2 minutes — if you can’t clear the alert in 2 minutes, treat it as real.
- Step 2: apply emergency rate limiting (3 minutes for P0, optional for P1). For P0 incidents, enable your emergency rate limit before investigating. This is counterintuitive — most engineers want to understand before acting — but the asymmetry of cost incidents justifies containment-first. Your emergency rate limit should be a global request rate cap set to 150% of your normal P95 request rate: enough to allow legitimate traffic to continue at near-normal levels while capping the tail. For P1 incidents, apply a circuit breaker specifically to the session, agent, or user identified by the initial alert if one is clearly implicated; otherwise apply the global rate limit as a precaution. Document the containment action and timestamp in the incident tracker before proceeding to investigation.
- Step 3: identify scope (5 minutes). Is this incident affecting one session, one user, one agent type, one geographic region, or all traffic? Pull the spend breakdown by user, session, agent type, and model from your cost tracking system for the incident window. A single session consuming 60% of the elevated spend is a very different incident from spend elevated uniformly across all sessions. Scope determines investigation path: single-session anomalies point to input-specific causes (prompt injection, unusual user input, edge case in agent logic); cross-session anomalies point to systemic causes (prompt regression, tool API change, routing failure). Document your scope finding in the incident tracker.
- Step 4: implement targeted containment based on scope. If the anomaly is session-scoped: terminate the specific session(s) contributing disproportionate spend, block the user or input pattern if malicious, and monitor for recurrence. If the anomaly is systemic: consider rolling back the most recent deployment if it correlates with the incident start time. A deployment rollback is aggressive but appropriate if: the incident started within 15 minutes of a deployment, the anomaly affects all sessions rather than specific ones, and the rollback can be completed in under 10 minutes. If rollback is not feasible, implement graceful degradation: route to a cheaper model tier, disable the most expensive agent tools, or increase the circuit breaker aggressiveness until you can deploy a fix.
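Steps 3 and 4 reduce to two small decisions: is the excess spend concentrated in one session, and do the three rollback conditions hold? The following sketch shows one way to encode both. The function names and the 50% concentration threshold are assumptions made for illustration (the text above gives 60% only as an example of a concentrated incident), and the input is assumed to be a list of `(session_id, excess_cost_usd)` pairs exported from whatever cost tracking system you use.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def classify_scope(records, share_threshold=0.5):
    """records: iterable of (session_id, excess_cost_usd) pairs.

    Returns ("session", session_id) when one session accounts for at least
    `share_threshold` of the excess spend, ("systemic", None) when the excess
    is spread out, and ("unknown", None) when there is no excess to analyse.
    """
    per_session = defaultdict(float)
    for session_id, cost in records:
        per_session[session_id] += cost
    total = sum(per_session.values())
    if total <= 0:
        return ("unknown", None)
    top_id, top_cost = max(per_session.items(), key=lambda kv: kv[1])
    if top_cost / total >= share_threshold:
        return ("session", top_id)
    return ("systemic", None)


def should_roll_back(incident_start, last_deploy, scope, rollback_minutes):
    """Apply the three rollback conditions from step 4."""
    if last_deploy is None or scope != "systemic":
        return False
    gap = incident_start - last_deploy
    started_soon_after = timedelta(0) <= gap <= timedelta(minutes=15)
    return started_soon_after and rollback_minutes < 10


if __name__ == "__main__":
    scope, session_id = classify_scope([("a", 60.0), ("b", 25.0), ("c", 15.0)])
    print(scope, session_id)
    print(should_roll_back(datetime(2024, 5, 14, 1, 52),
                           datetime(2024, 5, 14, 1, 47), "systemic", 6))
```

A session-scoped result sends you toward terminating and inspecting that session; a systemic result with a recent deployment sends you toward rollback or graceful degradation.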
Root cause investigation steps
- Correlate incident start time with deployment and traffic events. The most productive first hypothesis is almost always “something changed just before the incident started.” Pull your deployment log, feature flag change log, and traffic pattern for the 30 minutes before the alert fired. A deployment that went out at 1:47am and an alert that fired at 1:52am is almost certainly causal. A traffic surge at 1:50am on a product that has significant international users (who are in their daytime hours) may explain a P1 that looks anomalous in your local timezone but is simply a legitimate traffic pattern you haven’t seen before. Always check the deployment log before assuming a bug — and always check the traffic pattern before assuming a deployment is the cause.
- Analyze the token distribution of the incident window. Pull the per-call token breakdown (input tokens, output tokens, tool result tokens) for the incident window and compare to your baseline distribution. A spike in input tokens points to either larger user inputs or larger tool results being injected into context. A spike in output tokens points to a prompt regression that changed model behavior (e.g., a system prompt change that caused the model to produce much longer responses), a change in temperature or sampling parameters, or a prompt injection attack designed to maximize output generation. A spike in tool call count without a proportional token spike points to a loop or retry storm. These three patterns have different fixes and different investigation paths.
- Query cost records by session to find the outliers. Sort sessions by total cost descending and examine the top 10. For each high-cost session: look at the message history, the tool calls made, the token counts per call, and the user input that initiated the session. In the vast majority of cost incidents, 80% of the excess spend is attributable to fewer than 5% of sessions. Finding those sessions and understanding what they have in common (a specific user, a specific input pattern, a specific agent workflow triggered) tells you the root cause faster than any aggregate analysis. Pay particular attention to the first call in each high-cost session: if the first call was already anomalously expensive, the cause is in the initial input or system prompt; if costs escalated over multiple calls, the cause is in the agent’s iteration logic.
- Post-mortem and prevention. Within 48 hours of incident resolution, conduct a blameless post-mortem that answers: What was the root cause? What was the total excess spend? How long did it take to detect and to contain the incident (the data points behind your MTTD, mean time to detect, and MTTC, mean time to contain)? Which runbook step was most useful? What was missing from the runbook? What prevention measures would eliminate this class of incident? The prevention measures should be concrete engineering tasks: a new circuit breaker rule, an updated alert threshold, a code fix, a prompt change, or a new anomaly detection baseline. Track these tasks to completion. A post-mortem that produces no engineering tasks is a wasted post-mortem.
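The first investigation step, correlating the alert with recent changes, is easy to script once your deployment and feature flag logs are available as timestamped events. The sketch below assumes a simple `(timestamp, kind, description)` tuple format and a hypothetical helper name `changes_before`; the sample events are made up to mirror the 1:47am deployment and 1:52am alert described above.

```python
from datetime import datetime, timedelta


def changes_before(alert_time, events, lookback_minutes=30):
    """Return change events in the lookback window before the alert, newest first.

    events: iterable of (timestamp, kind, description) tuples, for example
    merged from a deployment log and a feature flag change log.
    """
    window_start = alert_time - timedelta(minutes=lookback_minutes)
    hits = [e for e in events if window_start <= e[0] <= alert_time]
    return sorted(hits, key=lambda e: e[0], reverse=True)


if __name__ == "__main__":
    alert = datetime(2024, 5, 14, 1, 52)
    log = [  # hypothetical sample data
        (datetime(2024, 5, 13, 23, 10), "deploy", "api release"),
        (datetime(2024, 5, 14, 1, 47), "deploy", "prompt template update"),
        (datetime(2024, 5, 14, 1, 30), "flag", "enable long-context tool"),
    ]
    for ts, kind, desc in changes_before(alert, log):
        minutes = (alert - ts).total_seconds() / 60
        print(f"{ts:%H:%M} {kind:<6} {desc} ({minutes:.0f} min before alert)")
```

Listing the newest change first puts the most likely culprit at the top, but a match here is a hypothesis to test against the traffic pattern, not a conclusion.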
RunGuard for cost incident response
- Programmatic cost record queries during investigation. RunGuard’s API lets you pull structured cost records during an incident investigation without logging into a dashboard. The following Python script performs the core triage query: it pulls all sessions from the incident window, sorts by total cost, and prints the top anomalies with their token breakdown for immediate root cause analysis.
    import runguard
    from datetime import datetime, timedelta, timezone

    client = runguard.Client()  # uses RUNGUARD_API_KEY env var


    def investigate_incident(lookback_minutes: int = 30, top_n: int = 10):
        """Pull top-cost sessions from the last N minutes for incident triage."""
        now = datetime.now(timezone.utc)
        window_start = now - timedelta(minutes=lookback_minutes)

        # Fetch all sessions in the incident window
        sessions = client.sessions.list(
            started_after=window_start.isoformat(),
            limit=1000,
            sort="cost_desc"
        )
        baseline = client.baselines.get(
            metric="session_cost_usd",
            percentile=95
        )
        p95_cost = baseline.value

        print(f"Incident window: {window_start.strftime('%H:%M')} - {now.strftime('%H:%M')} UTC")
        print(f"Sessions analysed: {len(sessions.items)}")
        print(f"P95 baseline cost: ${p95_cost:.4f}")
        print(f"{'Session ID':<36} {'Cost':>8} {'Multiple':>8} {'Input':>8} {'Output':>8} {'Tools':>6}")
        print("-" * 90)

        # Excess is summed over the top_n sessions only
        total_excess = 0.0
        for session in sessions.items[:top_n]:
            multiple = session.cost_usd / p95_cost if p95_cost > 0 else 0
            excess = max(0, session.cost_usd - p95_cost)
            total_excess += excess
            flag = " <-- ANOMALY" if multiple > 3 else ""
            print(
                f"{session.session_id:<36} "
                f"${session.cost_usd:>7.4f} "
                f"{multiple:>7.1f}x "
                f"{session.input_tokens:>8,} "
                f"{session.output_tokens:>8,} "
                f"{session.tool_calls:>6}{flag}"
            )

        print(f"\nEstimated excess spend in window: ${total_excess:.2f}")
        print(f"Projected hourly excess: ${total_excess * (60 / lookback_minutes):.2f}")

        # Identify the dominant anomaly type from the most expensive session
        top = sessions.items[0] if sessions.items else None
        if top:
            if top.output_tokens > 5 * baseline.output_tokens_p95:
                print("\nDominant pattern: OUTPUT TOKEN SPIKE - check for prompt regression or injection")
            elif top.input_tokens > 5 * baseline.input_tokens_p95:
                print("\nDominant pattern: INPUT TOKEN SPIKE - check for large tool results or user inputs")
            elif top.tool_calls > 5 * baseline.tool_calls_p95:
                print("\nDominant pattern: TOOL CALL SPIKE - check for agent loop or retry storm")


    if __name__ == "__main__":
        investigate_incident(lookback_minutes=30, top_n=10)

Run this script as the first step of your investigation. The output gives you the top-cost sessions, their token breakdown, the excess spend in the window, and a heuristic classification of the dominant anomaly pattern, all in under 10 seconds and without leaving the terminal.
- Emergency circuit breaker via API. RunGuard’s circuit breaker can be triggered programmatically during a P0 incident when you need containment faster than UI navigation allows. A single API call sets a global emergency rate cap (the value 150 here is illustrative; per step 2 of the playbook, set it to 150% of your normal P95 request rate):

    client.circuit_breakers.create(scope="global", max_requests_per_minute=150, reason="P0 cost incident - auto-containment", expires_in_minutes=60)

The circuit breaker applies within 200ms of the API call and returns 429 responses to any request that would exceed the cap, with a Retry-After header so well-behaved clients back off gracefully. When the incident is resolved, delete the circuit breaker explicitly rather than waiting for expiry, so that normal operation resumes immediately.
- Automated severity classification and escalation. Configure RunGuard’s alert rules to pre-classify incidents before they wake your engineer. The rule configuration maps spend projections to PagerDuty urgency levels: P0 rules use urgency: critical with a 5-minute sustained threshold; P1 rules use urgency: high with a 10-minute sustained threshold; P2 rules use urgency: low with email-only notification. Each alert payload includes the projected hourly cost, the top contributing session IDs, and a direct link to the incident dashboard pre-filtered to the alert window, so the engineer sees the investigation starting point in the notification itself, not after logging in.
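On the client side, "backing off gracefully" means honouring the Retry-After header on a 429 response. The header can carry either a number of seconds or an HTTP-date, so a well-behaved client needs to handle both. The helper below is a generic standard-library sketch, not part of RunGuard; the name `retry_after_seconds` and the default and cap values are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None,
                        default: float = 1.0,
                        cap: float = 300.0) -> float:
    """Convert a Retry-After header into a wait time in seconds.

    The header may be a number of seconds or an HTTP-date. Missing or
    unparseable values fall back to `default`; results are clamped to [0, cap].
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdecimal():
        wait = float(value)
    else:
        try:
            target = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if target.tzinfo is None:
            # Treat a date without zone information as UTC
            target = target.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        wait = (target - now).total_seconds()
    return max(0.0, min(wait, cap))


if __name__ == "__main__":
    print(retry_after_seconds("30"))
    print(retry_after_seconds(None))
```

Sleeping for the returned number of seconds before retrying keeps agent workers from hammering the cap while the emergency limit is in force.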
Stop investigating cost incidents in the dark
RunGuard gives your on-call team the tools to triage, contain, and resolve LLM cost incidents in minutes instead of hours: real-time spend tracking, programmable circuit breakers, pre-classified severity alerts, and structured cost record queries. Start your free trial today and have your first runbook-ready alert configured before your next incident.