Production LLM agent reliability checklist: 25 checks before you ship an autonomous agent
Most teams discover their LLM agent’s reliability gaps in production, not in testing. Integration tests pass because they run happy-path inputs against a mock API. Then the agent loops forever when the API provider changes a tool response format. The bill runs to $400 because a weekend cron job ran the agent against a thousand records with no per-call cap. The context window fills after 40 turns, the model starts hallucinating, and neither the agent nor the user notices for three more turns.
This checklist is the pre-deployment gate that forces you to answer the reliability questions before your users run into them. It is organized into five categories: budget controls, loop detection, context management, error handling, and observability. Each item has a pass/fail criterion, a minimal code example where relevant, and a note on what breaks if you skip it. The checklist assumes you are using RunGuard for the guard layer, but the questions apply regardless of how you implement the underlying controls.
Category 1: budget controls (7 checks)
- 1. Session budget cap is set and tested. Every agent run must have a maximum spend ceiling. Pass criterion: `guard(fn, budget={"max_usd": X})` is configured and you have a test that verifies the agent raises `BudgetExceededError` before exceeding `X`. Skip cost: unbounded spend on any session that loops or processes an unexpectedly large input.

  ```python
  import pytest
  from runguard import guard, BudgetExceededError

  # Pass: guard is configured with an explicit cap.
  # call_llm is your own LLM call function.
  guarded = guard(call_llm, budget={"max_usd": 5.0})

  # Test: verify the cap fires.
  # run_agent_until_budget_exceeded and expensive_input come from your test suite.
  def test_budget_cap_fires():
      with pytest.raises(BudgetExceededError):
          run_agent_until_budget_exceeded(guarded, expensive_input)
  ```

- 2. Per-call ceiling is set. A single LLM call with an enormous context can exceed your session cap in one call, before the session-level guard fires. Pass criterion: `per_call_budget={"max_usd": Y}` is set, where `Y` is 10–20% of your session cap. Skip cost: a single call with a 200K-token context burns a significant fraction of your session budget before any session-level check fires.
- 3. Cost per model is calculated for your specific input distribution. P50 and P99 input and output token counts from your staging data, multiplied by your model’s cost rates, must fit within the session cap with margin. Pass criterion: `P99_input_tokens × input_cost_rate + P99_output_tokens × output_cost_rate < 0.5 × session_cap`. Skip cost: your “conservative” cap fires on median inputs and degrades normal user sessions.
- 4. Budget cap error is handled explicitly in the caller. `BudgetExceededError` must be caught and handled in the agent loop. Pass criterion: the agent returns a usable response (a partial result, or an error message with next steps) rather than an unhandled exception traceback. See graceful degradation patterns for implementation options. Skip cost: users see a 500 error instead of a helpful “request was too large for my budget” message.
- 5. Budget cap is set differently for batch vs. interactive agents. Batch agents can have tighter caps because they run in bulk and partial failures are expected. Interactive agents need headroom for tail-case inputs. Pass criterion: you have documented separate cap values for each agent type, with rationale. Skip cost: interactive agents hit caps on legitimate complex queries; batch agents have excessive caps that allow runaway costs on malformed inputs.
- 6. Monthly spend projection is calculated. `avg_cost_per_session × sessions_per_day × 30 days` must be within budget. Pass criterion: the projection is calculated and your LLM API plan covers the expected spend with 3× headroom. Skip cost: you discover in month 2 that your agent costs 4× what you priced into your SaaS plan.
- 7. Spend alert is configured. Your LLM provider’s spend alert (OpenAI, Anthropic, AWS) is set to fire at 50% and 80% of your monthly budget. Pass criterion: you have received at least one test alert email. Skip cost: you discover overspend at invoice time, not when you can still intervene.
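Checks 2, 3, and 6 are plain arithmetic, so they can be scripted against your staging numbers. The following is a minimal sketch and not part of RunGuard: the `ModelRates` type and every function name are illustrative assumptions, and the rates you pass in should come from your provider’s published pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRates:
    """USD per token. Hypothetical container; fill in your provider's real rates."""
    input_usd_per_token: float
    output_usd_per_token: float


def p99_call_cost(p99_input_tokens: int, p99_output_tokens: int, rates: ModelRates) -> float:
    """Cost of one call at P99 input and output sizes."""
    return (p99_input_tokens * rates.input_usd_per_token
            + p99_output_tokens * rates.output_usd_per_token)


def cap_has_margin(p99_input_tokens: int, p99_output_tokens: int,
                   rates: ModelRates, session_cap_usd: float) -> bool:
    """Check 3: P99 cost must stay below half of the session cap."""
    return p99_call_cost(p99_input_tokens, p99_output_tokens, rates) < 0.5 * session_cap_usd


def per_call_cap_range(session_cap_usd: float) -> tuple[float, float]:
    """Check 2: per-call ceiling should fall between 10% and 20% of the session cap."""
    return (0.10 * session_cap_usd, 0.20 * session_cap_usd)


def monthly_projection(avg_cost_per_session: float, sessions_per_day: float, days: int = 30) -> float:
    """Check 6: projected monthly spend."""
    return avg_cost_per_session * sessions_per_day * days


def plan_covers_spend(projection_usd: float, monthly_budget_usd: float, headroom: float = 3.0) -> bool:
    """Check 6: the budget must cover the projection with the given headroom multiple."""
    return projection_usd * headroom <= monthly_budget_usd
```

As a worked example under made-up rates of $3 per million input tokens and $15 per million output tokens, a P99 call of 20,000 input and 2,000 output tokens costs about $0.09, which passes check 3 against a $5.00 session cap and fails it against a $0.15 cap.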
Category 2: loop detection (5 checks)
- 8. Loop detection is enabled with parameters tuned to your tool set. The default `repeats=3, max_cycle_len=5` is a conservative starting point. Your agent’s specific tool call patterns may need tighter or looser configuration. Pass criterion: you have profiled 20+ successful agent runs to determine the maximum legitimate repetition count for any tool call pattern in your workflow, and your `repeats` value is 1 higher than that maximum. Skip cost: either false positives (normal behavior flagged as loops) or missed loops (a loop runs longer than it should before the guard fires).
- 9. `LoopDetectedError` is caught and handled separately from `BudgetExceededError`. Loops indicate a structural problem (a tool returning unexpected output, a prompt design flaw, an injection attempt), while an exceeded budget is a scaling limit. Pass criterion: your error handler logs loop trips to a separate log stream or metric and alerts on-call if the loop rate exceeds 1% of sessions. Skip cost: loop events are silently counted as generic errors and you miss the signal that your agent has a systematic reliability problem.
- 10. Custom `sig_fn` is implemented for your specific tool set. The default signature function groups by tool name. If your agent calls the same tool with meaningfully different argument patterns (e.g., `read_file` with different paths is not a loop), you need a custom `sig_fn`. Pass criterion: you have audited the default signature function’s behavior against your tool call logs and confirmed it correctly distinguishes loops from legitimate repetition. See AI agent sandbox escape prevention for custom `sig_fn` examples.
- 11. Loop detection is tested with a synthetic loop injection. Pass criterion: you have a test that manually crafts a message sequence that should trigger loop detection, and the guard raises `LoopDetectedError` at the expected turn count. Skip cost: you discover in production that your loop guard configuration is too loose when a real runaway occurs.
- 12. Loop event rate is tracked in your observability dashboard. Pass criterion: you can answer “what percentage of sessions triggered loop detection this week?” in under 30 seconds using your monitoring tools. See agent observability cost dashboard for the SQL queries. Skip cost: a spike in loop rate caused by a prompt regression or a new injection attack goes undetected for days.
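The calibration in check 8 and the argument-aware signature in check 10 can be prototyped offline against logged runs before touching the guard configuration. The sketch below is standalone and makes several assumptions: the tool call shape (a dict with `name` and `arguments`), the exact signature RunGuard expects for `sig_fn`, and both helper names are hypothetical. It also only measures back-to-back repeats of a single call, so multi-step cycles (the `max_cycle_len` side of the configuration) need separate profiling.

```python
import hashlib
import json
from itertools import groupby
from typing import Any


def sig_fn(tool_call: dict[str, Any]) -> str:
    """Signature = tool name plus a short hash of the canonicalized arguments.

    Two read_file calls with different paths get different signatures,
    while the same arguments in a different key order get the same one.
    """
    args = json.dumps(tool_call.get("arguments", {}), sort_keys=True, default=str)
    digest = hashlib.sha256(args.encode("utf-8")).hexdigest()[:12]
    return f"{tool_call['name']}:{digest}"


def max_consecutive_repeats(calls: list[dict[str, Any]]) -> int:
    """Longest run of identical consecutive signatures within one agent run."""
    run_lengths = (sum(1 for _ in group) for _, group in groupby(map(sig_fn, calls)))
    return max(run_lengths, default=0)


def calibrated_repeats(successful_runs: list[list[dict[str, Any]]]) -> int:
    """Check 8: one more than the highest legitimate repetition seen in good runs."""
    observed = max((max_consecutive_repeats(run) for run in successful_runs), default=0)
    return observed + 1
```

The same helpers give a cheap synthetic loop for check 11: a run made of one call repeated `calibrated_repeats(...)` times is the shortest sequence the tuned guard should reject.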
Category 3: context management (5 checks)
- 13. Context window limit is tested at P99 conversation length. Pass criterion: you have run your agent on P99 conversation length inputs (from staging or synthetic data) and confirmed it does not hit the context window limit under normal use. Skip cost: P99 users hit a cryptic context window error that your error handling does not catch gracefully.
- 14. Tool result size is bounded. Tool results that return unbounded text (web page content, file contents, API responses) can fill the context window in one call. Pass criterion: every tool that returns text has a size limit enforced at the tool wrapper layer. Skip cost: one large document retrieval fills the remaining context budget and all subsequent turns fail with context window errors.
- 15. Context compaction strategy is implemented for long-running agents. If your agent may run for more than 20 turns, you need a compaction strategy. Pass criterion: either your agent is architecturally bounded to fewer than 15 turns, or you have implemented sliding window summarization as shown in AI agent graceful degradation patterns. Skip cost: long-running agents drift into hallucination as the context fills, producing silently wrong outputs rather than errors.
- 16. Retrieval chunk size is calibrated for your model’s context budget. Pass criterion: `max_retrieved_chunks × avg_chunk_size_tokens < 20% of model context window`. Skip cost: retrieval alone consumes most of the context window, leaving insufficient space for the agent’s own reasoning and tool call history.
- 17. Context window error is caught and handled. Context window exceeded errors from the LLM API should be caught and trigger context compaction, not a generic 500. Pass criterion: you have tested the agent with an input designed to exceed the context window and confirmed it handles the error with a useful response.
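Checks 14 and 16 can both be enforced with a few lines at the tool wrapper layer. This is a rough sketch under stated assumptions: the `bounded` decorator and `retrieval_fits` helper are hypothetical names, and the four-characters-per-token estimate is a crude heuristic that you would replace with your model’s real tokenizer.

```python
import functools
from typing import Callable

CHARS_PER_TOKEN = 4  # Assumption: rough average for English text, not a tokenizer.


def estimate_tokens(text: str) -> int:
    """Approximate token count by character length, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def bounded(max_tokens: int) -> Callable:
    """Check 14: cap the size of any text a tool returns to the agent."""
    def decorator(tool: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(tool)
        def wrapper(*args, **kwargs) -> str:
            result = tool(*args, **kwargs)
            limit = max_tokens * CHARS_PER_TOKEN
            if len(result) <= limit:
                return result
            omitted = len(result) - limit
            # Tell the model the result was cut so it does not treat it as complete.
            return result[:limit] + f"\n[truncated: {omitted} characters omitted]"
        return wrapper
    return decorator


def retrieval_fits(max_retrieved_chunks: int, avg_chunk_size_tokens: int,
                   context_window_tokens: int) -> bool:
    """Check 16: retrieval must use less than 20% of the context window."""
    return max_retrieved_chunks * avg_chunk_size_tokens < 0.2 * context_window_tokens
```

Appending an explicit truncation marker is a deliberate choice here: it gives the model a reason to continue with partial data rather than retrying the same call.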
Category 4: error handling and tool robustness (4 checks)
- 18. All tool calls have explicit error handling with structured error returns. Tools that throw exceptions propagate to the agent’s LLM context as unhandled errors, which confuses the model and wastes tokens on error recovery. Pass criterion: every tool function catches exceptions and returns a structured error object that the agent can process (e.g., `{"error": "rate_limit", "retry_after": 5}`) rather than raising. Skip cost: tool exceptions cause the agent to loop on error recovery, consuming tokens and often tripping loop detection.
- 19. Rate limit handling is implemented in all tool wrappers. External APIs rate-limit your agent based on request frequency, not LLM call frequency. Pass criterion: tools that call external APIs use exponential backoff with jitter and a maximum retry count. See LLM agent rate limit backoff strategy for the recommended implementation. Skip cost: a rate-limited API turns into an agent that retries in a tight loop, which both fails and accumulates LLM cost on the retry reasoning.
- 20. Maximum turn count is enforced independently of loop detection. Loop detection catches repeated patterns; max turn count catches agents that make progress but too slowly. Pass criterion: your agent loop has an absolute maximum turn count (e.g., 25 turns) that fires before the LLM context window fills. Skip cost: agents that make slow forward progress eventually hit context window limits, not a clean max-turn exit, producing confusing errors.
- 21. Partial tool results are handled gracefully. Tools that return partial data (truncated search results, rate-limited API pages) should be handled as valid inputs, not errors. Pass criterion: you have tested your agent with truncated tool results from each external dependency and confirmed it continues rather than looping on “result appears incomplete” reasoning.
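Checks 18 through 20 fit together in one small wrapper layer. The sketch below is illustrative only: `RateLimited` is a hypothetical exception standing in for whatever your HTTP client raises on a 429, the return shape with an `ok` flag is an assumption, and `run_agent` takes a stand-in `step` callable rather than a real agent loop.

```python
import random
import time
from typing import Any, Callable, Optional


class RateLimited(Exception):
    """Hypothetical: raised by a tool's client when the upstream API rate-limits."""


def call_with_backoff(fn: Callable[[], Any], *, max_retries: int = 4,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep: Callable[[float], Any] = time.sleep) -> dict:
    """Checks 18 and 19: retry with jittered exponential backoff, never raise."""
    for attempt in range(max_retries + 1):
        try:
            return {"ok": True, "result": fn()}
        except RateLimited:
            if attempt == max_retries:
                # Structured error the agent can reason about instead of a traceback.
                return {"ok": False, "error": "rate_limit", "retries": max_retries}
            # Full jitter: wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
        except Exception as exc:
            return {"ok": False, "error": type(exc).__name__, "detail": str(exc)}


def run_agent(step: Callable[[int], Optional[Any]], max_turns: int = 25) -> dict:
    """Check 20: hard turn limit, independent of loop detection.

    `step` returns a final answer, or None to request another turn.
    """
    for turn in range(1, max_turns + 1):
        answer = step(turn)
        if answer is not None:
            return {"status": "done", "turns": turn, "answer": answer}
    return {"status": "max_turns_exceeded", "turns": max_turns}
```

Injecting `sleep` as a parameter keeps the backoff testable without real delays, and returning a status dict on turn exhaustion gives the caller a clean exit to report instead of a context window error.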
Category 5: observability and incident response (4 checks)
- 22. Cost events are logged with session ID, turn number, input/output tokens, and USD. Pass criterion: you can query average cost per session, P99 cost per session, and total daily spend in under 60 seconds from your logs. See agent observability cost dashboard for the SQLite implementation.
- 23. Loop and budget trip events are logged separately with enough context to diagnose the cause. Pass criterion: each LoopDetectedError log entry includes the tool call sequence that triggered the loop, the session ID, and the user query (or a hash of it for privacy). Skip cost: you cannot determine whether loop events are caused by a prompt regression, a tool API change, or an injection attack.
- 24. On-call alert fires when loop rate exceeds threshold. Pass criterion: you have a monitoring rule that alerts when `loop_trip_count / total_sessions > 0.02` in any rolling 1-hour window. Skip cost: a regression that causes 30% of sessions to loop is discovered by a user complaint, not by a monitoring alert.
- 25. Runbook exists for the two most common incidents: runaway cost and agent loop. Pass criterion: you can describe in 3 sentences what you would do right now if your monitoring showed a 10× cost spike or a 20% loop rate. Skip cost: incident response is ad hoc, which extends the incident duration and compounds the blast radius.
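The logging in checks 22 and 23 and the alert rule in check 24 can be prototyped with the standard library’s `sqlite3` module. This is a sketch under assumptions, not the dashboard implementation referenced above: the `events` table schema, the `kind` values, and all function names are invented for illustration.

```python
import sqlite3
import time

# Hypothetical schema: one row per cost event or guard trip.
# kind is one of 'cost', 'loop_trip', 'budget_trip'.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    ts REAL NOT NULL,
    session_id TEXT NOT NULL,
    turn INTEGER NOT NULL,
    kind TEXT NOT NULL,
    input_tokens INTEGER DEFAULT 0,
    output_tokens INTEGER DEFAULT 0,
    usd REAL DEFAULT 0.0,
    detail TEXT DEFAULT ''
);
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_event(conn, session_id, turn, kind, *, input_tokens=0, output_tokens=0,
              usd=0.0, detail="", ts=None) -> None:
    """Checks 22 and 23: record cost events and guard trips with diagnostic detail."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (time.time() if ts is None else ts, session_id, turn, kind,
         input_tokens, output_tokens, usd, detail),
    )
    conn.commit()


def avg_cost_per_session(conn) -> float:
    row = conn.execute(
        "SELECT COALESCE(AVG(total), 0.0) FROM "
        "(SELECT SUM(usd) AS total FROM events WHERE kind = 'cost' GROUP BY session_id)"
    ).fetchone()
    return row[0]


def loop_rate(conn, window_seconds: float = 3600.0, now=None) -> float:
    """loop_trip_count / total_sessions over the trailing window."""
    now = time.time() if now is None else now
    trips, sessions = conn.execute(
        "SELECT COUNT(CASE WHEN kind = 'loop_trip' THEN 1 END), "
        "COUNT(DISTINCT session_id) FROM events WHERE ts >= ?",
        (now - window_seconds,),
    ).fetchone()
    return trips / sessions if sessions else 0.0


def should_page(conn, threshold: float = 0.02, **kwargs) -> bool:
    """Check 24: alert when the loop rate exceeds the threshold."""
    return loop_rate(conn, **kwargs) > threshold
```

In a real deployment the `detail` column would carry the tool call sequence and a hash of the user query, and `should_page` would run on a schedule and notify whatever on-call system you use.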
Pre-production vs. production discovery of reliability gaps
| Gap type | Discovered pre-production (via checklist) | Discovered in production |
|---|---|---|
| No session budget cap | 15-minute fix: add guard() wrapper | Weekend bill, manual LLM provider intervention, user data potentially affected |
| Loop detection not configured | 30-minute fix: add loop config and test with synthetic loop input | Agent loops on edge-case tool response, burns session budget, user gets nothing |
| Context window not tested at P99 | 1-hour fix: identify heavy inputs, add truncation at tool layer | P99 users get cryptic API errors, support ticket volume spikes |
| No cost observability | 2-hour fix: add SQLite logger around guard | Cost inflation discovered at month-end invoice; no data to diagnose cause |
| No on-call alert for loop rate | 30-minute fix: add monitoring rule | Systematic reliability regression runs for 24+ hours before user reports surface it |
For the full cost control implementation that this checklist assumes is in place, see autonomous agent cost control best practices. For memory leak patterns that can cause context window fills without obvious tool call loops, see AI agent memory leak detection.
RunGuard handles the guard layer of this checklist
RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. Items 1–12 on this checklist (budget controls and loop detection) are handled by the guard() wrapper with budget, per_call_budget, and loop configuration. Items 13–17 (context management) require application-level changes. Items 18–25 (error handling and observability) are application patterns that RunGuard’s typed exceptions make straightforward to implement. The full checklist takes a typical team 4–8 hours to work through on a first agent; subsequent agents take under 2 hours because the patterns are reusable.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.