AI Agent Production Readiness Cost Checklist: 40 Controls Before You Ship
Most AI agent cost incidents are preventable. They happen because agents shipped to production without the cost controls that make runaway behavior physically impossible, not just unlikely. This checklist covers the 40 controls your agent needs before it handles real users and real money.
Work through it section by section. Mark each item when the control is in place and tested — not just planned. An "in progress" item isn't a checklist item; it's an open risk.
Section 1: Budget Caps (8 items)
Budget caps are the foundation. Every other control is defense-in-depth; budget caps are the absolute ceiling that makes worst-case predictable.
-
☐ 1.1 Per-session hard budget set
Every agent session has a maximum dollar or token amount that, when reached, terminates the session and returns a graceful error. Not a soft warning — a hard stop. Verified by test: trigger budget exhaustion in staging and confirm the session terminates cleanly. -
☐ 1.2 Per-user monthly budget cap
Each user or tenant has a monthly spending ceiling that blocks new sessions when reached. Free/trial users have lower caps than paid users. The UI shows users their current usage vs. cap before they hit it. -
☐ 1.3 Daily total budget cap
Your system has a daily spending ceiling across all users. When hit, new sessions return a clear maintenance message rather than silently queuing. You have an on-call playbook for when this fires. -
☐ 1.4 Per-tool-call budget estimate
Before making a tool call that fetches large amounts of data (web search, file read, database query), the agent estimates how many tokens the result will consume. If the estimate exceeds a threshold, the agent uses a more selective query or truncates. -
☐ 1.5 Budget state persisted across restarts
If the agent process crashes and restarts mid-session, the budget consumed in the first half of the session is remembered. The restarted session starts from the remaining budget, not the full budget. -
☐ 1.6 Concurrent session budget accounting
If a user can run multiple sessions simultaneously (multi-tab, API), the per-user budget is shared across all concurrent sessions — not per-session independently. Race conditions in budget checks are handled atomically (e.g., Redis atomic increment). -
☐ 1.7 Budget exceeded response tested
The error message shown to users when budget is exceeded is clear, non-technical, and tells them what to do next (upgrade plan, wait until reset, contact support). Not a stack trace. Not a silent failure. -
☐ 1.8 Budget metrics visible in ops dashboard
You can see, at a glance: current-day spend vs. budget, per-user spend rank, and recent budget-exceeded events. You don't need to query logs manually to answer "how much have we spent today?"
Section 2: Loop and Runaway Detection (7 items)
Loops are the single biggest cause of surprise LLM bills. An agent stuck in a loop costs the same per iteration whether it's making progress or not.
-
☐ 2.1 Maximum iterations enforced
Every agent has a hard maximum on total iterations (full ReAct loops, plan-and-execute cycles, or tool call rounds). Exceeding the limit causes a graceful abort, not an error. The limit is set to 3–5× the expected maximum for legitimate tasks. -
☐ 2.2 Repeated tool call pattern detection
The agent detects when the same tool is called with the same or similar arguments more than N times in a single session. On detection, the agent either breaks the loop by returning a partial result or escalates with a "stuck" signal rather than continuing indefinitely. -
☐ 2.3 No-progress detection
Beyond repeated calls, the agent detects sessions where tool calls are returning results but the agent's plan isn't advancing (e.g., the "current step" hasn't changed in the last 5 iterations). This catches slow-motion loops that don't repeat exact calls. -
☐ 2.4 Retry backoff with jitter
Any retry logic in your agent or its tools uses exponential backoff with jitter. Fixed-interval retries at high frequency can create retry storms that multiply costs. Maximum retry count is bounded. -
☐ 2.5 Tool call depth limit
If your agent can invoke sub-agents or tools that themselves invoke LLMs, the total call depth (agent → sub-agent → sub-sub-agent) is bounded. Recursive agent invocations without depth limits are a common source of exponential cost growth. -
☐ 2.6 Loop detection tested with adversarial input
You have at least one test case that triggers a genuine loop scenario (a task that causes the agent to repeat the same tool call 10+ times). The circuit breaker trips correctly, and the test passes in CI. -
☐ 2.7 Loop trip event emits notification
When a loop is detected and the circuit breaker trips, a notification is emitted to Slack or PagerDuty with the session ID, tool call pattern, and cost accumulated before the trip. This enables post-incident root cause analysis.
Section 3: Context Window Management (6 items)
Context window costs scale quadratically in some architectures — more context means more expensive calls, and more expensive calls are more likely to exceed budget, triggering retries that add more context.
-
☐ 3.1 Context size measured before each call
Before every LLM call, the total token count of the messages array is computed and logged. You have dashboards showing p50/p95/p99 context sizes per app and per session-step. Outliers are visible without querying raw logs. -
☐ 3.2 Sliding window or summarization in place
For agents that have multi-turn conversations longer than ~10 turns, there is an explicit context management strategy: either a sliding window (drop oldest N messages), a summarization step (compress old turns into a summary), or a hybrid. It is not "keep everything." -
☐ 3.3 Tool call results truncated before storing
Tool results (web search snippets, code execution output, database query results) are truncated to a maximum token count before being added to the context. The truncation limit is set based on what the agent actually needs, not on what the tool returns. -
☐ 3.4 System prompt audited for verbosity
The system prompt has been reviewed for redundant instructions, repetitive examples, and outdated content. It is the minimum necessary to produce correct behavior, not a maximalist specification that covers every edge case. -
☐ 3.5 Max output tokens set per call
Every LLM call specifies an explicitmax_tokensappropriate for the expected output. You are not relying on the model's default maximum, which may be 4,096–8,192 tokens for tasks that only need 200. -
☐ 3.6 Context spike test passing
You have a test that feeds the agent a pathologically verbose input (e.g., a 50,000-character document) and verifies that the context stays within budget due to truncation. It does not silently overflow into a $5 single call.
Section 4: Input Controls (6 items)
User-supplied input is an external attack surface for cost. Prompt injection, oversized inputs, and adversarial task chaining can all inflate costs without the agent "misbehaving" from its own perspective.
-
☐ 4.1 Input length validation
User messages have a maximum character or token count enforced before the agent processes them. The error message for oversized input is user-friendly. The limit is set 2–3× higher than the longest legitimate input in your data, not at some arbitrary round number. -
☐ 4.2 File upload size and type limits
If the agent accepts file uploads (PDFs, images, code files), each file has a maximum size. File types are allowlisted. The agent does not accept ZIP files or recursively expanding archives. -
☐ 4.3 URL fetch limits
If the agent fetches external URLs, fetched content is truncated to a maximum size before being added to context. You are not fetching and embedding arbitrary external documents in full. -
☐ 4.4 Tool allowlist in system prompt
The agent's system prompt specifies explicitly which tools the agent may use. Prompt injection attempts to use unlisted tools (send email, execute code, access filesystem) fail because the unlisted tools are not available in the agent's tool definitions. -
☐ 4.5 Rate limiting per user at the API layer
Independent of LLM budget caps, there is a request rate limit per user at the HTTP layer (requests per minute). This prevents a single user from exhausting your system's concurrency capacity even before LLM calls are made. -
☐ 4.6 Adversarial prompt test suite passing
Your test suite includes at least 5 adversarial prompt patterns: cost amplification (ask agent to repeat a task 100 times), tool chaining abuse (chain 10 tool calls in one message), context stuffing (fill context with irrelevant text), role override attempts, and budget bypass attempts. All 5 are handled gracefully.
Section 5: Alerting and Observability (7 items)
Controls that aren't observable don't get maintained. Each circuit breaker needs a corresponding observable event to verify it's working and to diagnose failures.
-
☐ 5.1 Real-time spend alert at 50% of daily budget
An alert fires when daily spend reaches 50% of budget — early enough to investigate and respond before hitting the cap. It goes to the on-call channel, not just email. -
☐ 5.2 Real-time spend alert at 80% of daily budget
A second alert fires at 80%. This is the "act now" signal. At this point, on-call has time to throttle or disable non-critical workloads before the hard cap hits. -
☐ 5.3 Anomaly alert on 3× normal hourly spend
An alert fires when any 1-hour period costs 3× the 7-day rolling average for that same hour of day. This catches spikes that don't hit absolute thresholds (e.g., a spike from $0.50/hr to $1.50/hr that's still below the daily cap). -
☐ 5.4 Per-session cost logged with correlation ID
Every completed session emits a cost log line with: session ID, user ID, total cost, token breakdown (input/output), model used, and wall-clock duration. These logs are queryable. You can answer "what did session X cost?" in under 60 seconds. -
☐ 5.5 Circuit breaker trips logged
When any circuit breaker trips (budget, loop, context, rate limit), the event is logged with enough context to reproduce the scenario in staging: the triggering session ID, the breach value, and a stack trace or tool call trace. -
☐ 5.6 Cost per feature tracked
You track LLM cost broken down by feature or agent type (e.g., "code review agent", "email draft agent", "research agent"). This data drives pricing decisions and identifies which features are economically unviable at current model pricing. -
☐ 5.7 Monthly cost projection updated weekly
Based on the current week's run rate, you have an automated projection of end-of-month LLM spend. This is reviewed in your weekly operations meeting — not just at end-of-month when the bill arrives.
Section 6: Incident Response (6 items)
Even with all controls in place, incidents happen. The difference between a $50 incident and a $5,000 incident is usually whether you detected it in 5 minutes or 5 hours.
-
☐ 6.1 Kill switch available without deployment
There is a way to immediately stop all agent processing that does not require a code deployment. Acceptable options: a feature flag in your feature flag system, an environment variable in your secrets manager that the agent reads on each call, or a maintenance mode toggle in your admin dashboard. -
☐ 6.2 Kill switch tested in staging
The kill switch has been tested. You've verified that it stops agent calls within one request cycle and that the user-facing error message is appropriate. -
☐ 6.3 On-call runbook documented
There is a written runbook for "LLM cost spike" that covers: how to identify the cause (triage questions), who to notify, what commands to run to stop the bleeding, and how to communicate with users. It lives somewhere the on-call engineer can find it in a panic at 2am. -
☐ 6.4 Provider cost dashboard bookmark shared
Every engineer on the team has access to and has bookmarked the LLM provider's cost dashboard. This seems obvious; in practice, half the team doesn't know the URL when the first incident happens. -
☐ 6.5 Post-mortem template ready
After any incident that causes >$10 unexpected spend, you run a 5-Whys post-mortem using a standard template. The output is a specific action item that closes the root cause, not a vague "we should be more careful." -
☐ 6.6 Cost incident simulation run
Before going to production, you've run a cost incident simulation in staging: deliberately triggered a loop, observed the circuit breaker trip, confirmed the alert fired, and verified the kill switch worked. If you've never exercised the system, you don't know if it works.
Scoring Your Agent
Count your checked items:
| Score | Assessment | Action |
|---|---|---|
| 35–40 | Production ready for cost control | Ship it. Review monthly. |
| 28–34 | Acceptable for low-traffic launch | Ship with 50% of daily budget as cap. Fix gaps in first 30 days. |
| 20–27 | High risk for cost incidents | Block on Section 1 (budget caps) and Section 2 (loop detection) before launch. |
| <20 | Not production ready | Do not ship with real user traffic. Complete Sections 1–3 first. |
If your score is <28, the fastest path to production readiness is implementing RunGuard. The wrap() decorator covers 15 of the 40 items (all of Section 1, most of Section 2, and the circuit breaker tests in Sections 2 and 6) in a single SDK integration:
from runguard import RunGuard
import os
rg = RunGuard(api_key=os.environ["RUNGUARD_API_KEY"])
async def my_agent(user_input: str):
async with rg.wrap(
app_id="my-agent",
env={
"RUNGUARD_BUDGET_USD": "0.50", # 1.1 per-session budget
"RUNGUARD_MAX_ITERATIONS": "25", # 2.1 max iterations
"RUNGUARD_LOOP_DETECT": "true", # 2.2 loop detection
"RUNGUARD_CONTEXT_LIMIT_TOKENS": "16000", # 3.1 context limit
"RUNGUARD_ALERT_SLACK": os.environ["SLACK_WEBHOOK_URL"], # 5.5 alerts
}
) as guard:
# Your agent code here
result = await run_agent(user_input)
return result
For the remaining items — per-user monthly budgets, input validation, adversarial test suites, and incident runbooks — see the linked guides throughout this checklist.
Related: production LLM agent reliability checklist (covering latency, error handling, and availability alongside cost), enterprise AI agent cost governance, and the AI agent on-call cost incident runbook.