AI agent on-call cost incident runbook: how to respond when your LLM cost alert fires at 2am

Your phone buzzes. PagerDuty. “LLM spend rate 4.7× above P95 baseline for the last 15 minutes.” It’s 2:14am. Unlike a service outage, where users are actively complaining and the symptoms are obvious, an LLM cost spike is invisible to your users: they’re getting responses, the system appears healthy, and you have no idea whether you’ll burn $500 or $50,000 before the sun comes up.

Without a documented runbook, the first five minutes of a cost incident are spent arguing about who should look at what. With a runbook, those five minutes become a structured triage that either clears the alert as a false positive or starts containment before significant damage accumulates. This page is that runbook: a step-by-step guide through triage, containment, root cause investigation, mitigation, and the post-mortem process for LLM cost incidents in production AI agent systems.
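To make the alert in the opening scenario concrete, here is a minimal sketch of how a "spend rate N× above P95 baseline" figure could be computed, along with a rough projection of what the spike costs if nobody responds. The function names, the USD-per-hour unit, and the sample data are illustrative assumptions, not part of RunGuard or of any particular alerting product.

```python
from statistics import quantiles


def p95(samples):
    """95th percentile of historical spend-rate samples (assumed USD per hour)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100, method="inclusive")[94]


def spend_multiple(current_rate, baseline_samples):
    """How many times the current spend rate is relative to the P95 baseline."""
    baseline = p95(baseline_samples)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current_rate / baseline


def projected_spend(current_rate, hours_unattended):
    """Spend accumulated if the current rate holds for the given number of hours."""
    return current_rate * hours_unattended


if __name__ == "__main__":
    # Hypothetical hourly spend rates (USD/hour) from a quiet baseline period.
    history = [8.0, 9.5, 10.0, 11.0, 12.5, 9.0, 10.5, 13.0, 11.5, 12.0]
    current = 60.0  # hypothetical rate observed over the last 15 minutes

    print(f"multiple of P95 baseline: {spend_multiple(current, history):.1f}x")
    print(f"projected spend over 5 hours: ${projected_spend(current, 5):.2f}")
```

The projection assumes the rate stays flat, which is the optimistic case for a runaway agent loop. It is still enough to answer the first triage question: whether the alert is worth getting out of bed for.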

Why cost incidents need a dedicated runbook

Incident severity classification: P0, P1, P2

The triage and containment playbook

Root cause investigation steps

RunGuard for cost incident response

Stop investigating cost incidents in the dark

RunGuard gives your on-call team the tools to triage, contain, and resolve LLM cost incidents in minutes instead of hours: real-time spend tracking, programmable circuit breakers, pre-classified severity alerts, and structured cost record queries. Start your free trial today and have your first runbook-ready alert configured before your next incident.
