AI agent cost engineering: 6 patterns that cut our LLM spend by 71%
We shipped our first production AI agent in January 2026. By March, our monthly LLM bill was $4,200 for an agent serving fewer than 400 daily active users. We were losing money on every user, so growth would only have deepened the loss. This is that post-mortem.
The diagnosis took two weeks. The fixes took three days. We went from $4,200/month to $1,218/month for the same user load — a 71% reduction — without changing models, without degrading quality, and without reducing features. Every fix was an engineering change, not a product concession. This post documents the six patterns that got us there, with real numbers and code you can use today.
The anatomy of our $4,200/month problem
Before the diagnosis, we had no idea where the money was going. We knew our Anthropic bill was high. We did not know why. The first thing we did was instrument every LLM call with its full token breakdown: system prompt tokens, conversation history tokens, tool definition tokens, tool result tokens, and output tokens. Here is what we found for a representative session:
- System prompt: 2,800 tokens (stable, potentially cacheable)
- Tool definitions: 3,100 tokens (stable, potentially cacheable)
- Conversation history: 11,200 tokens (growing with each turn)
- Tool results: 18,400 tokens (the surprise)
- Output: 2,100 tokens
- Total per session: 37,600 tokens
At Anthropic Sonnet pricing, that session cost $0.15. Our estimated session cost in development had been $0.04. The delta was in two places: conversation history growth (we had not modeled the O(n²) accumulation correctly) and tool results (we had no idea our search tool was returning 4,000-token results on every call).
Tool results alone were 49% of the session's tokens and 52% of its input tokens. We had never looked at this number before. It was the most expensive line item in our bill, and it had been invisible to us.
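A minimal sketch of the per-component accounting described above. The `TokenBreakdown` shape and the helper names are our own illustration, not a library API, and the rates are the Sonnet list prices quoted later in this post. The cost estimate only covers what is in the breakdown, so expect it to land near a billed figure rather than match it exactly.

```typescript
// Per-session token counts by component (numbers from the session above).
interface TokenBreakdown {
  systemPrompt: number;
  toolDefinitions: number;
  history: number;
  toolResults: number;
  output: number;
}

// Assumed list prices in USD per million tokens.
const RATES = { inputPerMTok: 3, outputPerMTok: 15 };

function inputTokens(b: TokenBreakdown): number {
  return b.systemPrompt + b.toolDefinitions + b.history + b.toolResults;
}

function estimateCostUsd(b: TokenBreakdown): number {
  const inputCost = inputTokens(b) * RATES.inputPerMTok;
  const outputCost = b.output * RATES.outputPerMTok;
  return (inputCost + outputCost) / 1_000_000;
}

// Fraction of all session tokens (input + output) spent on tool results.
function toolResultShare(b: TokenBreakdown): number {
  return b.toolResults / (inputTokens(b) + b.output);
}

const session: TokenBreakdown = {
  systemPrompt: 2800,
  toolDefinitions: 3100,
  history: 11200,
  toolResults: 18400,
  output: 2100,
};

console.log(inputTokens(session)); // 35500
console.log(toolResultShare(session).toFixed(2)); // "0.49"
```

Logging one such record per LLM call, then summing per session, is enough to surface a dominant component like the tool results here.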
Pattern 1: cap tool result sizes at the source
Our search tool was implemented as a thin wrapper around a web search API that returned full HTML content. The API response included navigation, ads, footers, comment sections, and the actual article content, typically in a 1:9 useful-to-noise ratio. We were paying Sonnet's $3/MTok input rate to feed the model navigation menus.
The fix: strip HTML to text, extract the main content block using a lightweight parser (Readability.js for Node, newspaper3k for Python), and truncate to 1,500 tokens. The model got the same useful information; the noise was gone.
Result: search tool average result size went from 4,200 tokens to 800 tokens. For our research agent, which called search an average of 5 times per session, this reduced per-session input tokens by 17,000 (5 × 3,400), about 45% of the 37,600-token session total.
```typescript
// Before: raw HTML passthrough
async function search(query: string) {
  const response = await searchApi.fetch(query);
  return response.html; // 4,000-8,000 tokens of noise
}
```

```typescript
// After: extract + truncate
async function search(query: string) {
  const response = await searchApi.fetch(query);
  const article = extractMainContent(response.html); // Readability
  return truncateToTokens(article.textContent, 1500);
}
```
This is the most common missed optimization in AI agent systems. Almost every team we talk to has at least one tool that returns more data than the model needs. Audit every tool result and ask: what is the maximum size this result should ever be? Then enforce it in the tool, not the prompt. Prompt-based size instructions are probabilistic; code-enforced truncation is deterministic. For more on tool result optimization, see AI agent tool selection cost optimization.
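The `truncateToTokens` helper in the snippet above is not shown in full, so here is one way it could look. This is a sketch under a loud assumption: it estimates tokens at roughly 4 characters each, which is only a rule of thumb for English text. A real implementation would count with the tokenizer that matches your model.

```typescript
// Assumption: about 4 characters per token for English prose.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hard cap on the size of a tool result, enforced in code.
function truncateToTokens(text: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;

  const cut = text.slice(0, maxChars);
  const lastSpace = cut.lastIndexOf(" ");
  // Prefer ending on a word boundary when the cut contains one.
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut).trimEnd();
}

const longArticle = "word ".repeat(5000);
const capped = truncateToTokens(longArticle, 1500);
console.log(estimateTokens(capped) <= 1500); // true
```

Because the cap lives in the tool, it holds no matter what the prompt says or how the model behaves.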
Pattern 2: flatten the conversation history growth curve
Our agent was a multi-turn conversational assistant. Turn 1 sent 6,000 tokens of context. Turn 2 sent 6,000 + the first exchange (800 tokens) = 6,800. Turn 3 sent 7,600. By turn 10, at roughly 800 tokens per exchange, we were sending more than 13,000 tokens per call, and we were paying for the model to read a full conversation history that was mostly irrelevant to the current step.
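To see why this adds up so quickly, here is a small worked model. It assumes every exchange adds a fixed 800 tokens, which is a simplification (real exchanges vary in size), and the function names are ours. Per-call input grows linearly with the turn number, so the total billed over a session grows quadratically.

```typescript
const BASE_TOKENS = 6000; // context sent on turn 1
const TOKENS_PER_EXCHANGE = 800; // assumption: every exchange is the same size

// Input tokens sent on turn n (1-indexed) when the full history is re-sent.
function tokensAtTurn(n: number): number {
  return BASE_TOKENS + TOKENS_PER_EXCHANGE * (n - 1);
}

// Total input tokens billed across the first n turns of a session.
function cumulativeTokens(n: number): number {
  let total = 0;
  for (let turn = 1; turn <= n; turn++) {
    total += tokensAtTurn(turn);
  }
  return total;
}

console.log(tokensAtTurn(2)); // 6800
console.log(cumulativeTokens(10)); // 96000
console.log(cumulativeTokens(20)); // 272000
```

Doubling the session length from 10 to 20 turns nearly triples the input tokens billed in this model.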
The fix had two parts. First, we implemented a sliding window: keep only the last 5 turns of conversation history plus a 200-token summary of earlier turns. Second, we moved to explicit working memory: instead of asking the model to re-infer the task state from conversation history, we maintained a structured JSON object with the current task state, key facts extracted so far, and completed steps. This state object was included in the system prompt (cacheable) rather than the conversation history (uncacheable).
```typescript
interface AgentState {
  taskGoal: string;
  completedSteps: string[];
  keyFacts: Record<string, string>;
  currentStep: string;
}

// System prompt includes stable state (cached after first call)
// Conversation history kept to last 5 turns only
function buildContext(state: AgentState, history: Message[]): Context {
  return {
    system: renderStateAsSystemPrompt(state), // cacheable, stable
    messages: history.slice(-5), // last 5 entries; assumes one Message per turn
  };
}
```
The combined effect: per-call input tokens dropped from averaging 12,000 to averaging 5,200 — a 57% reduction in conversation history token cost. Our sessions are now nearly flat in input cost per turn instead of growing linearly. See multi-turn conversation cost optimization for a deeper treatment of history management strategies.
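The snippet above keeps the recent window but leaves out the summary of earlier turns. A sketch of that part follows, with several assumptions: a turn is one user message plus one assistant message, and `summarize` is a hypothetical callback. In practice the summary would likely come from a cheap model call and be capped at around 200 tokens.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

interface WindowedHistory {
  earlierSummary: string | null; // null when nothing was dropped
  recent: Message[];
}

function windowHistory(
  history: Message[],
  keepTurns: number,
  summarize: (dropped: Message[]) => string,
): WindowedHistory {
  // Assumption: each turn is one user message plus one assistant message.
  const keepMessages = keepTurns * 2;
  if (history.length <= keepMessages) {
    return { earlierSummary: null, recent: history };
  }
  const cutoff = history.length - keepMessages;
  return {
    earlierSummary: summarize(history.slice(0, cutoff)),
    recent: history.slice(cutoff),
  };
}

// Build a 7-turn history and keep only the last 5 turns.
const demoHistory: Message[] = [];
for (let turn = 1; turn <= 7; turn++) {
  demoHistory.push({ role: "user", content: `question ${turn}` });
  demoHistory.push({ role: "assistant", content: `answer ${turn}` });
}
const windowed = windowHistory(demoHistory, 5, (d) => `${d.length} earlier messages`);
console.log(windowed.recent.length); // 10
console.log(windowed.earlierSummary); // "4 earlier messages"
```

Returning the summary separately leaves the caller free to decide where it goes, for example alongside the structured state object.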
Pattern 3: route cheap tasks to cheap models
Our agent used Anthropic Sonnet ($3/$15 per MTok input/output) for every single operation, including tasks that did not need a frontier model. Tool call parsing, format conversion, simple classification, and structured data extraction were all running on Sonnet. In our testing, Claude Haiku ($0.25/$1.25) handled these tasks with equivalent quality.
We classified our agent’s operations into three tiers:
- Haiku tier: structured output extraction, format conversion, simple yes/no classification, tool call result parsing (45% of total calls)
- Sonnet tier: multi-step reasoning, code generation, research synthesis, ambiguous instruction interpretation (50% of total calls)
- Sonnet Extended Thinking tier: complex debugging, novel problem decomposition (5% of total calls)
With this routing in place, our effective blended rate dropped from $3/$15 (all Sonnet) to approximately $1.58/$8.30 per MTok. On our actual token mix, total model cost dropped 47%. The routing logic was 40 lines of TypeScript: a task classifier that runs on Haiku (cheap!) to determine which model should handle the actual task.
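A sketch of the routing table and the blended-rate arithmetic. Two simplifications to flag: a static lookup stands in for the Haiku-based classifier described above, and the task names and tier labels are illustrative, not real model IDs. We also assume the extended-thinking tier bills input at the plain Sonnet rate.

```typescript
type Tier = "haiku" | "sonnet" | "sonnet-extended";

type TaskKind =
  | "extraction"
  | "format-conversion"
  | "yes-no-classification"
  | "tool-result-parsing"
  | "multi-step-reasoning"
  | "code-generation"
  | "research-synthesis"
  | "ambiguous-instruction"
  | "complex-debugging"
  | "novel-decomposition";

const TIER_BY_TASK: Record<TaskKind, Tier> = {
  "extraction": "haiku",
  "format-conversion": "haiku",
  "yes-no-classification": "haiku",
  "tool-result-parsing": "haiku",
  "multi-step-reasoning": "sonnet",
  "code-generation": "sonnet",
  "research-synthesis": "sonnet",
  "ambiguous-instruction": "sonnet",
  "complex-debugging": "sonnet-extended",
  "novel-decomposition": "sonnet-extended",
};

function routeTask(kind: TaskKind): Tier {
  return TIER_BY_TASK[kind];
}

// USD per million input tokens, per tier (assumed rates, see lead-in).
const INPUT_RATE: Record<Tier, number> = {
  "haiku": 0.25,
  "sonnet": 3,
  "sonnet-extended": 3,
};

// Weighted average input rate for a given share of tokens per tier.
function blendedInputRate(tokenShare: Record<Tier, number>): number {
  let rate = 0;
  for (const tier of Object.keys(INPUT_RATE) as Tier[]) {
    rate += tokenShare[tier] * INPUT_RATE[tier];
  }
  return rate;
}

console.log(routeTask("tool-result-parsing")); // "haiku"
```

The blended rate depends on each tier's share of tokens, not its share of calls, so plugging in the call percentages above will not reproduce the $1.58 figure, which reflects our actual token mix.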
Pattern 4: fix the retry loops
Our agent had retry logic for structured output failures. If the model returned JSON that failed schema validation, we corrected the prompt and retried. In development, this happened rarely. In production, our correction-loop rate was 12% of sessions — meaning 1 in 8 sessions spent 2–3 extra LLM calls on output correction.
Worse, some sessions entered cascade correction loops: the correction prompt caused a different schema violation, which triggered another correction, and so on. We had 6 sessions in our first month that accumulated 15+ LLM calls in correction loops, each costing $1.50–$3.00 instead of the expected $0.05.
The fix was two-part. First, we switched to native structured outputs (OpenAI json_schema, Anthropic forced tool use), which removed nearly all validation failures for well-formed schemas. Second, we added a loop detection circuit breaker that halted any session making more than 3 consecutive calls of the same type without advancing task state. RunGuard’s LoopDetector caught the cascade cases automatically; we just had to wire it in.
Result: correction loop sessions dropped from 12% to 1.3%. The catastrophic cascade sessions (6 in our first month) dropped to zero over the 6 weeks after deployment. Per-session output cost dropped 14% from eliminating the correction overhead.
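If you want to roll your own breaker, the rule above fits in a few lines. This is a hand-written sketch, not RunGuard's LoopDetector. The `LoopBreaker` class and the idea of a numeric `stateVersion` that the agent bumps whenever task state advances are assumptions made for illustration.

```typescript
class LoopBreaker {
  private readonly maxConsecutive: number;
  private lastCallType: string | null = null;
  private lastStateVersion: number | null = null;
  private streak = 0;

  constructor(maxConsecutive: number) {
    this.maxConsecutive = maxConsecutive;
  }

  // Record one LLM call. Returns true when the session should be halted.
  record(callType: string, stateVersion: number): boolean {
    const repeatedWithoutProgress =
      callType === this.lastCallType && stateVersion === this.lastStateVersion;
    this.streak = repeatedWithoutProgress ? this.streak + 1 : 1;
    this.lastCallType = callType;
    this.lastStateVersion = stateVersion;
    return this.streak > this.maxConsecutive;
  }
}

const breaker = new LoopBreaker(3);
const decisions = [1, 2, 3, 4].map(() => breaker.record("fix-json", 7));
console.log(decisions); // [ false, false, false, true ]
```

Any change in call type or any advance in state resets the streak, so normal multi-step work never trips the breaker.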
Pattern 5: cache your stable prompt components
Our system prompt was 2,800 tokens. Our tool definitions were 3,100 tokens. These 5,900 tokens were being sent and charged on every single LLM call at full input rates. With Anthropic prompt caching enabled, any API call that reuses the same prefix gets cache-read tokens at $0.30/MTok instead of $3/MTok — a 90% discount.
The catch: caching requires the cached prefix to be bit-for-bit identical between calls. We were dynamically inserting the current timestamp into our system prompt (“Today’s date is {date}”). This invalidated the cache on every call. Removing the dynamic date insertion and moving any session-specific data to the user message (outside the cached prefix) was a 10-minute fix.
With proper cache-aware prompt structure (system prompt + tool definitions as the stable prefix, dynamic context in the user message), our cache hit rate went from 0% to 78% on the stable prefix. On the 5,900-token prefix, that is 4,602 tokens per call billed at the cache-read rate, a $2.70/MTok differential, or about $0.012 per call, on 95% of our calls. At 8,000 calls per day, that works out to roughly $0.012 × 8,000 × 0.95 ≈ $94/day, or about $2,800/month at that call volume. We had been burning that money on uncacheable prompts because of a timestamp in the system prompt.
The lesson: audit every dynamic element in your system prompt. Each dynamic element that changes between calls breaks the cache prefix for everything that follows it. Move dynamic data as late in the prompt as possible, ideally into the user message. Static system prompt + static tool definitions + dynamic user message is the pattern that maximizes cache hit rate.
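A sketch of what that structure can look like as a request body. It follows the shape of Anthropic's Messages API as we understand it, where a `cache_control` marker on the last system block caches everything before it, tool definitions included. Treat the field layout as an assumption to check against the current API docs. `buildRequest` and its parameters are our own names, and the model ID is a placeholder.

```typescript
interface ToolDef {
  name: string;
  description: string;
  input_schema: object;
}

function buildRequest(
  model: string,
  systemPrompt: string,
  tools: ToolDef[],
  userInput: string,
  today: string,
) {
  return {
    model,
    max_tokens: 1024,
    // Stable prefix: identical on every call, so it can be cached.
    tools,
    system: [
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
    ],
    // Dynamic data goes last, outside the cached prefix.
    messages: [
      { role: "user", content: `Today's date is ${today}.\n\n${userInput}` },
    ],
  };
}

const demoTools: ToolDef[] = [
  { name: "search", description: "Web search", input_schema: { type: "object" } },
];
const monday = buildRequest("placeholder-model", "You are a research agent.", demoTools, "Find X", "2026-03-02");
const tuesday = buildRequest("placeholder-model", "You are a research agent.", demoTools, "Find X", "2026-03-03");
console.log(JSON.stringify(monday.system) === JSON.stringify(tuesday.system)); // true
```

A cheap regression test is to serialize the prefix for two different days or users and assert the strings match, as the last line does.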
Pattern 6: enforce budget ceilings on tail-risk sessions
Even after implementing patterns 1–5, we had a tail of expensive sessions. These are the ones that drive a disproportionate fraction of the bill: the user who asks the agent to “research everything about X” and the agent complies by calling search 25 times, or the session where a subtle prompt injection causes the model to generate a 10,000-token response, or the edge case input that triggers an unexpected agent behavior.
Before patterns 1–5, our P95 session cost was $0.42. After them, it was $0.18. But our P99 was still $0.85, 17x the mean of $0.05. Those top-1% sessions were 14% of our total cost.
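For anyone reproducing these tail statistics from their own per-session cost logs, here is a minimal sketch. It uses the nearest-rank percentile definition, which is one of several common conventions, and the function names are ours. The demo data is synthetic, not our production numbers.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of data at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Fraction of total cost contributed by the most expensive topPercent of sessions.
function topShare(values: number[], topPercent: number): number {
  const sorted = [...values].sort((a, b) => b - a);
  const count = Math.max(1, Math.ceil((topPercent * sorted.length) / 100));
  const total = sorted.reduce((sum, v) => sum + v, 0);
  const top = sorted.slice(0, count).reduce((sum, v) => sum + v, 0);
  return top / total;
}

// Synthetic demo: 100 sessions costing 1, 2, ..., 100 units.
const demoCosts = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(percentile(demoCosts, 95)); // 95
console.log(percentile(demoCosts, 99)); // 99
```

Tracking `topShare(costs, 1)` over time shows whether a handful of sessions is driving the bill.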
The fix was a per-session cost ceiling enforced by RunGuard’s BudgetTracker. We set a ceiling of $0.30 per session (6x the mean). Any session that hit $0.30 in cumulative LLM spend received a graceful halt: the agent returned a summary of what it had completed so far and asked the user to refine their request. The user got a useful response; we avoided the tail of the distribution.
```typescript
const guard = new RunGuard({
  budget: { maxCostUsd: 0.30 },
  loop: { maxToolCallsPerType: 8, windowTurns: 5 },
  context: { maxContextTokens: 160000, headroom: 4000 }
});

const result = await guard.run(async (ctx) => {
  return await agent.run(userInput, { runguardContext: ctx });
});
```
After deploying the ceiling, P99 session cost dropped from $0.85 to $0.31 (just above the ceiling, accounting for the final call that pushed us over before the breaker fired). Top-1% sessions went from 14% of total cost to 6% of total cost. Combined with the mean cost reduction from patterns 1–5, total monthly spend dropped from $4,200 to $1,218.
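The small overshoot past the ceiling falls out of how any post-call budget check works. The sketch below is a hand-rolled illustration, not RunGuard's BudgetTracker: `SessionBudget` is a hypothetical class, and the default rates are the Sonnet prices quoted earlier.

```typescript
class SessionBudget {
  private readonly ceilingUsd: number;
  private spentUsd = 0;

  constructor(ceilingUsd: number) {
    this.ceilingUsd = ceilingUsd;
  }

  // Add the cost of a completed call. Rates are USD per million tokens.
  recordCall(inputTokens: number, outputTokens: number, inputRate = 3, outputRate = 15): void {
    this.spentUsd += (inputTokens * inputRate + outputTokens * outputRate) / 1_000_000;
  }

  get spent(): number {
    return this.spentUsd;
  }

  get exceeded(): boolean {
    return this.spentUsd >= this.ceilingUsd;
  }
}

// Simulate a runaway session: each call uses 10,000 input and 1,000 output tokens.
const budget = new SessionBudget(0.30);
let callsMade = 0;
while (!budget.exceeded) {
  budget.recordCall(10_000, 1_000); // cost is only known after the call returns
  callsMade++;
}
console.log(callsMade); // 7
console.log(budget.spent > 0.30); // true
```

The cost of a call is only known once it returns, so the breaker fires one call late and the final spend lands slightly above the ceiling. Estimating the next call's cost before making it would tighten this, at the price of occasionally halting early.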
The full picture
Here is the cost breakdown before and after all six patterns:
| Component | Before | After | Reduction |
|---|---|---|---|
| Tool result tokens/session | 18,400 | 4,000 | −78% |
| History tokens/session | 11,200 | 4,800 | −57% |
| Effective blended rate | $3.00/$15 | $1.58/$8.30 | −47% |
| Correction-loop session rate | 12% | 1.3% | −89% |
| Cache hit rate (stable prefix) | 0% | 78% | — |
| P99 session cost | $0.85 | $0.31 | −64% |
| Monthly bill | $4,200 | $1,218 | −71% |
None of these changes required switching models, reducing agent capability, or changing the user-facing product. They were purely engineering changes: better tool implementations, smarter context management, model routing, loop detection, prompt caching, and cost ceilings.
What to do first
If you are facing a similar situation, start with instrumentation. You cannot optimize what you cannot measure. Add token-level logging to every LLM call: break it down by component (system prompt, history, tool defs, tool results, output). Then look at the numbers. In our experience, most agent teams discover one of two patterns:
- Tool results dominate: your search, fetch, or database tools are returning far more data than the model needs. Fix these first; the ROI is highest.
- History growth dominates: your sessions run for many turns and you are re-sending the entire history on every call. Add a sliding window and a structured state object.
After fixing the dominant cost driver, add prompt caching by auditing your system prompt for dynamic elements. This is a quick win with no quality impact. Then add model routing for your cheap tasks. Then add circuit breakers for loops and tail-risk sessions.
The order matters because the first two (tool result sizing, history management) have the highest impact and cost nothing beyond engineering time. Prompt caching is similarly cheap to adopt. Model routing requires some engineering but pays for itself quickly. Circuit breakers (via RunGuard or a custom implementation) prevent the tail-risk sessions that make your P99 cost scary.
For more on any of these patterns, see: AI agent cost anomaly detection, LLM agent production cost estimation, and autonomous agent cost control best practices.