LLM API cost optimization audit: a 47-point checklist for reducing your AI spend by 40–70%
Most teams that overpay for LLM API access aren’t doing anything obviously wrong — they’re just leaving optimization opportunities untouched because those opportunities aren’t visible without a structured audit. The 40–70% cost reduction figure is not aspirational: it represents the median result for engineering teams that systematically work through model tier selection, context window efficiency, semantic caching, and batching opportunities for the first time. Teams with no prior optimization effort typically find that 60–80% of their total token spend falls into one of three categories: sending unnecessary tokens in every request (fixable with prompt compression and context pruning), using a model tier that is more capable than the task requires (fixable with intelligent routing), or re-computing results that could be served from a cache (fixable with response memoization). This checklist is structured as a working document you can run through with your engineering team in a half-day session, with each item designed to produce either a confirmed optimization or a confirmed non-issue — no vague “consider whether” items.
Why you need a cost audit before optimizing
- Optimization without measurement produces local minima. The most common cost optimization mistake is picking the most obvious-looking opportunity and optimizing it in isolation, only to find that your bill moved by 3% when you expected 30%. This happens because LLM cost is multiplicative: a 20% reduction in prompt length is only a 20% reduction in input costs, which may be only 40% of total costs, producing an 8% bill reduction — but if the same session also used a model tier that could be downgraded for this task, the downgrade alone would have produced a 70% reduction. Auditing first lets you rank opportunities by impact before spending engineering effort. Every item in this checklist has an associated “potential savings” category (High/Medium/Low) so you can prioritize appropriately.
- Baselines expire if you don’t establish them. The best time to establish a cost baseline was the day you launched your first LLM feature. The second best time is today, before you start optimizing, so you can prove the impact of each change. Run the audit against your last 30 days of cost data. Record your current cost per session by feature, by model, by user segment, and by time of day. When you finish the checklist and implement the top 5 items, you want to be able to attribute each percentage point of savings to a specific change — both to validate the change and to build institutional knowledge about what actually moves the needle in your specific product. Without a pre-audit baseline, you’re flying blind.
- Some “optimizations” harm quality and need measurement to detect. Prompt compression, context pruning, and model downgrading can all reduce cost significantly while also reducing the quality of agent outputs. Without a baseline and a quality metric tracked alongside cost, you may achieve 40% cost savings while also achieving a 25% reduction in task success rate — a net negative business outcome despite the cost win. The audit process should therefore establish not just cost baselines but quality baselines: task completion rate, user satisfaction score, or error rate, depending on what your product measures. Any optimization that saves more than 20% of cost should be validated against these quality metrics before full rollout.
- Audit results compound when shared across teams. If you’re a platform team running this audit, share the results with every product team that uses your LLM infrastructure. A prompt efficiency pattern found by the search team may apply directly to the summarization team’s prompts. A caching strategy that works for the classification feature may generalize to the extraction feature. Cost audits produce institutional knowledge that has a multiplier effect when distributed — a half-day audit workshop where five product teams review results together routinely produces 3–5 additional optimization ideas that no single team would have identified alone.
Model tier and routing audit (10 checks)
- Checks 1–3: task-to-tier alignment. For each LLM-powered feature in your product, answer three questions: (1) What is the actual capability required? Simple classification, extraction, and single-turn Q&A rarely require frontier-model reasoning; they can typically be handled by a model that costs 10–20× less. (2) What is the quality floor? A customer-facing response requires higher quality than an internal routing decision. (3) Have you benchmarked a smaller model on this task? Assumption-based tier selection is the most expensive mistake in LLM cost optimization. Run 500 representative examples from production through a smaller model and measure quality against your existing outputs. For 60–70% of tasks, you will find that the cheaper model produces indistinguishable results, because the task did not require the capabilities you were paying for.
- Checks 4–6: routing logic and fallback behavior. Intelligent routing sends each request to the cheapest model tier that can handle it, with automatic fallback to a higher tier if quality fails. Check: (4) Do you have a routing layer at all, or does all traffic go to one model? (5) If you have routing, is it based on task-type classification (good) or just request length (insufficient)? Request length is a poor routing signal; a short but complex reasoning task needs a frontier model, while a long but simple extraction task does not. (6) What is your fallback behavior when the lower-tier model fails to meet quality requirements? Blind escalation to the top tier on every failure is expensive; a smarter pattern is to retry with a more capable model only when the output fails a lightweight quality check (e.g., a regex validation or a second-pass classification).
- Checks 7–8: prompt caching for large system prompts. If you have a system prompt longer than 1,024 tokens that is identical across sessions (or changes rarely), you should be using prompt caching. Many providers charge 50–90% less for cached input tokens. Check: (7) What is your average system prompt length? (8) What fraction of your input tokens are system prompt tokens vs. user-turn tokens? If your system prompt represents more than 30% of your input tokens and you’re not caching it, prompt caching alone may represent a 15–25% total cost reduction. This is one of the highest-ROI, lowest-effort optimizations available and is almost always the first item to implement.
- Checks 9–10: model version and deprecation hygiene. Check: (9) Are any of your model calls specifying an exact model version that has been superseded by a cheaper or more efficient version of the same capability tier? Providers routinely release updated model versions that are faster and cheaper than their predecessors; pinning to an old version out of stability concerns is costing you money. Establish a quarterly model version review process. (10) Are you paying for model capabilities that are bundled in the tier you use but never exercised? Some model tiers include extended context windows, image understanding, or specialized capabilities that are priced into the per-token cost. If your use case doesn’t use these capabilities, switching to a specialized text-only tier or a different provider may reduce costs by 20–40%.
Context window and prompt efficiency audit (12 checks)
- Checks 11–15: system prompt efficiency. Your system prompt is sent with every single API call, so inefficiency here compounds across your entire request volume. Work through these five checks: (11) Remove all instructional prose that can be expressed as a shorter directive. “You are a helpful assistant that should always try to answer user questions as helpfully as possible while being polite and professional” is 22 tokens that communicate almost nothing the model doesn’t already do by default. Delete it. (12) Remove negative instructions that mirror default behavior (“do not make up information” — the model already doesn’t by default). (13) Convert numbered lists of instructions into brief directives. (14) Move rarely-triggered edge-case handling out of the system prompt and into conditional injection: only include the “if the user asks about X, handle it by Y” section when the user has actually asked about X. (15) Measure the token count of your system prompt before and after each optimization; your target is the minimum number of tokens that maintains your quality floor.
- Checks 16–19: conversation history management. Multi-turn agents that pass the full conversation history on every call are the most common source of runaway context costs. Check: (16) Do you pass the complete conversation history on every call, or do you prune it? (17) If you prune, is the pruning strategy recency-only (keep last N turns) or semantic (keep turns relevant to the current task)? Recency-only pruning is better than nothing but semantic pruning, which summarizes distant history and retains only the most relevant exchanges, typically achieves 40–60% better token efficiency for long-running sessions. (18) Do you summarize conversation history at checkpoint intervals? Generating a 150-token summary of the last 20 exchanges and replacing those exchanges with the summary reduces context size by 85% for that segment. (19) Are tool results being passed in full, or are they summarized before insertion into context? Raw tool results from APIs and databases are often extremely verbose; a transformation layer that extracts only the relevant fields can reduce tool-result tokens by 70–90%.
- Checks 20–22: output length control. Most LLM applications generate more output tokens than strictly necessary because they don’t explicitly constrain output format and length. Check: (20) Do your prompts specify output format and target length? A prompt that says “respond in JSON with these three fields” generates far fewer tokens than an open-ended prompt that implicitly allows the model to add context, caveats, and explanations. (21) For structured extraction tasks, do you use JSON mode or a structured output format to eliminate prose preamble? The tokens “Sure, here is the information you requested in JSON format:” are pure waste. (22) Have you measured the distribution of output token counts and identified any task types that consistently over-generate? Long-tail output generators — the 5% of requests that produce 40% of output tokens — usually have a specific prompt pattern that can be tightened.
Caching and batching audit (8 checks)
- Checks 23–27: semantic and exact-match caching opportunities. Caching is the highest-leverage cost optimization available to most teams because its savings are proportional to your re-request rate rather than to a fixed percentage of token count. Work through these five checks: (23) What fraction of your LLM calls are semantically identical to a previous call in the last 24 hours? Measure this by taking a random sample of 1,000 requests and computing cosine similarity between each request’s prompt embedding and all requests in a rolling 24-hour window. A similarity threshold of 0.95 typically identifies exact-match candidates; 0.85–0.95 identifies semantic-match candidates. (24) For any request type where more than 15% of calls are semantically similar to a previous call, implement caching and measure cache hit rate in production. (25) Are FAQ-style queries (where users ask the same questions in slightly different phrasings) currently hitting the LLM API on every call? These are ideal candidates for semantic cache lookup. (26) Is your cache key including context that makes logically identical requests appear different (e.g., a request ID or timestamp in the message)? Strip non-semantic fields before cache lookup. (27) What is your current cache TTL strategy? A 24-hour TTL on fact-based queries and a 7-day TTL on classification tasks are reasonable defaults; responses that are stateless with respect to time can use indefinite TTLs.
- Checks 28–30: batch processing opportunities. Not all LLM work needs to happen in real time. Check: (28) Which of your LLM tasks are triggered by user actions (must be synchronous) versus triggered by data events, scheduled jobs, or background workflows (can be deferred)? Deferred tasks are candidates for batch API endpoints that typically cost 50% less than synchronous endpoints at most providers. (29) For synchronous tasks, are you making LLM calls sequentially when they could be parallelized? Sequential calls where each one is independent of the previous one are costing you latency but not enabling any savings opportunity; parallelizing them reduces latency but doesn’t change cost. However, if parallelizing allows you to merge multiple short requests into fewer calls with shared context, you may see cost savings. (30) Do you have any analytics, reporting, or enrichment pipelines that use LLM calls? These are almost universally batchable, and the 50% batch discount makes them ideal first candidates for batch migration.
RunGuard for LLM cost optimization
- Continuous audit data in production. The audit checklist above is a one-time exercise, but cost efficiency degrades over time as features are added, prompts are modified, and traffic patterns shift. RunGuard provides continuous per-call cost attribution so you can see optimization regressions as they happen rather than in a quarterly audit. The following Python snippet uses RunGuard’s analytics API to generate an ongoing efficiency report that you can run daily or wire into a Slack digest:
import runguard from datetime import datetime, timedelta, timezone from collections import defaultdict client = runguard.Client() def generate_efficiency_report(days: int = 7): """Weekly cost efficiency report for audit tracking.""" since = datetime.now(timezone.utc) - timedelta(days=days) calls = client.calls.list( started_after=since.isoformat(), limit=10000, include_metadata=True ) # Group by feature tag by_feature = defaultdict(lambda: { "calls": 0, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0, "cache_hits": 0 }) for call in calls.items: feature = call.metadata.get("feature", "untagged") stats = by_feature[feature] stats["calls"] += 1 stats["input_tokens"] += call.input_tokens stats["output_tokens"] += call.output_tokens stats["cost_usd"] += call.cost_usd if call.cache_hit: stats["cache_hits"] += 1 print(f"\nLLM Cost Efficiency Report ({days}d)") print(f"{'Feature':<30} {'Calls':>7} {'Avg Input':>10} {'Avg Output':>11} {'Cache%':>7} {'Cost':>8}") print("-" * 85) total_cost = 0.0 for feature, stats in sorted(by_feature.items(), key=lambda x: -x[1]["cost_usd"]): n = stats["calls"] if n == 0: continue avg_in = stats["input_tokens"] / n avg_out = stats["output_tokens"] / n cache_pct = 100.0 * stats["cache_hits"] / n total_cost += stats["cost_usd"] print( f"{feature:<30} {n:>7,} {avg_in:>10,.0f} {avg_out:>11,.0f} " f"{cache_pct:>6.1f}% ${stats['cost_usd']:>7.2f}" ) # Flag high-average-input features as context efficiency candidates if avg_in > 4000: print(f" ** AUDIT FLAG: avg input {avg_in:,.0f} tokens — review context pruning") # Flag low-cache-rate features with repeated patterns if cache_pct < 5 and n > 500: print(f" ** AUDIT FLAG: cache hit rate {cache_pct:.1f}% on {n} calls — review caching") print(f"\nTotal cost ({days}d): ${total_cost:.2f}") if __name__ == "__main__": generate_efficiency_report(days=7)Run this script weekly and track the “AUDIT FLAG” items over time. Each flag represents a checklist item from the audit that has regressed or was never addressed. The script serves as both a recurring efficiency report and a living checklist that surfaces new optimization opportunities as your product evolves. - Model routing intelligence built in. RunGuard’s routing layer classifies each incoming request by task complexity and routes it to the cheapest model tier that meets your configured quality threshold for that task type. Rather than manually implementing and maintaining your own routing logic, you configure quality thresholds per feature and RunGuard handles the tier selection, with detailed per-call routing decisions available in the audit log for verification. Teams using RunGuard routing report 30–55% cost reduction on features that were previously pinned to a single model tier.
- Audit trail for cost attribution across the checklist. Every optimization you implement changes your cost profile in ways that can be hard to attribute without a clean audit trail. RunGuard records model version, prompt hash, cache hit/miss status, and context token count for every call, tagged with your feature and session metadata. When you complete a checklist item — say, implementing context pruning for the research agent — you can pull a before/after comparison from RunGuard that shows the exact change in average input tokens, cache hit rate, and cost per session, with statistical significance testing built into the comparison API.
Turn your cost audit into continuous optimization
A one-time audit is a starting point, not a solution. RunGuard provides the continuous cost attribution, routing intelligence, and anomaly detection that turns a quarterly audit exercise into a live optimization loop — catching regressions, surfacing new opportunities, and attributing every dollar of spend to the feature and session that generated it. Start your free trial and run your first efficiency report against 30 days of production data today.
Start free trial →