LLM model routing cost optimization: send the right task to the right model
Most AI agent stacks pick a single model and use it for everything. That model is typically the best available — GPT-4o, Claude Sonnet, Gemini 1.5 Pro — because teams default to capability when reliability matters. The cost consequence is severe: you pay frontier-model prices for tasks that a model costing 10–20x less would handle equally well. Intent classification, text extraction, simple summarization, format conversion, and tool-call routing do not require a 200-billion-parameter model. Model routing — the practice of matching each task to the cheapest model that can handle it reliably — routinely reduces LLM spend by 60–80% in production agent stacks without degrading user-facing quality. This page covers the routing architectures, classification approaches, fallback patterns, and how RunGuard’s BudgetTracker enforces per-task cost discipline so routing decisions are auditable and correctable.
The cost gap between model tiers (2026 pricing)
- The 10–50x price spread. At 2026 pricing, the difference between a frontier reasoning model and a capable small model spans one to two orders of magnitude. Claude Sonnet 4.6 at $3/MTok input vs Claude Haiku 4.5 at $0.25/MTok input is a 12x spread. GPT-4o at $2.50/MTok vs GPT-4o mini at $0.15/MTok is a 17x spread. Gemini 1.5 Pro at $3.50/MTok (over 128k) vs Gemini 1.5 Flash at $0.35/MTok is a 10x spread. If 60% of your agent’s tasks could be handled by the small model, routing those tasks reduces total LLM spend by approximately 50–60% with no change in the quality of the remaining 40% that require the frontier model.
- Which tasks actually require frontier models. The tasks that genuinely require frontier-model capability are a smaller set than most teams assume: multi-step reasoning with ambiguous constraints, code generation in complex or novel domains, tasks requiring broad world knowledge and synthesis across disparate sources, nuanced judgment with significant consequences. Everything else — classification, extraction, formatting, slot-filling, simple Q&A from a provided document, function call parsing with a known schema — is well within the capability of models that cost 10–20x less.
- The quality validation gap. The reason teams default to frontier models is not that they’ve tested small models and found them inadequate; it’s that they haven’t tested small models at all. Before implementing routing, a one-week A/B test on a representative sample of your task distribution typically reveals that 50–70% of tasks pass quality checks at the same rate on small vs large models. This empirical baseline is the foundation for a routing policy.
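The savings arithmetic behind the price spreads above can be checked with a short sketch. It assumes token volume is spread evenly across tasks and counts only input-token prices; `estimatedSavings` and its parameters are illustrative names, not part of any library.

```javascript
// Hypothetical helper: fraction of spend saved when `cheapShare` of token volume
// moves from a frontier model to one that is `priceRatio` times cheaper.
// Assumes equal token volume per task and ignores output-token pricing.
function estimatedSavings(cheapShare, priceRatio) {
  if (cheapShare < 0 || cheapShare > 1) throw new RangeError('cheapShare must be in [0, 1]');
  if (priceRatio < 1) throw new RangeError('priceRatio must be >= 1');
  return cheapShare * (1 - 1 / priceRatio);
}

// Spreads quoted above (10x, 12x, 17x) with 60% of tasks routed to the small model.
for (const ratio of [10, 12, 17]) {
  const pct = (estimatedSavings(0.6, ratio) * 100).toFixed(1);
  console.log(`${ratio}x spread: ~${pct}% saved`);
}
```

For all three spreads the result falls between 54% and 57%, which is consistent with the 50–60% estimate. Most of the saving comes from the share of tasks routed, not from the exact price ratio.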
Routing architecture 1: rule-based task type routing
- How it works. You define a taxonomy of task types (classification, extraction, generation, reasoning, code) and map each type to a model tier. Your agent framework tags each task with its type before the LLM call, and the router selects the model based on the tag. This is the simplest routing architecture, requires no additional LLM calls, and is deterministic and auditable.
- Task type taxonomy example.
- Tier 1 (cheap model): intent classification, entity extraction, slot filling, format conversion, simple yes/no, structured data extraction from a provided document
- Tier 2 (mid-range model): multi-document summarization, code explanation, function call routing with complex schemas, multi-step instruction following
- Tier 3 (frontier model): complex code generation, multi-step reasoning with ambiguous constraints, synthesis across multiple disparate sources, novel problem-solving
- Implementation. In LangChain, this is a model selector in a RunnableBranch. In a custom agent loop, it’s a getModel(taskType) function called before each LLM invocation. The task type is either inferred from the name of the tool being called (all extract_ tools use the cheap model) or set explicitly in the tool definition.
- Limitation: task type is not always predictable. Some tasks that appear to be “extraction” actually require reasoning to resolve ambiguity. Rule-based routing must include an explicit fallback: if the cheap model returns a low-confidence result or an explicit “I cannot determine this” response, the task escalates to the next tier. The escalation adds one extra LLM call but is still cheaper than routing everything to the frontier model by default.
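A minimal sketch of the custom-loop variant, under stated assumptions: the tier names are placeholders rather than real model IDs, the `extract_`/`classify_` tool-name prefixes are an assumed naming convention, and `callModel` is a caller-supplied function assumed to resolve to `{ text, cannotDetermine }`. Only `getModel(taskType)` comes from the text above; the other helpers are hypothetical.

```javascript
// Tier names are placeholders; map them to real model IDs in your own config.
const TIERS = ['cheap', 'mid', 'frontier'];

// Task-type taxonomy mapped to a tier index (types taken from the tier list above).
const TASK_TIER = new Map([
  ['classification', 0], ['extraction', 0], ['formatting', 0],
  ['summarization', 1], ['code_explanation', 1],
  ['reasoning', 2], ['code_generation', 2],
]);

// Infer a task type from the tool name (assumed prefix convention).
function inferTaskType(toolName) {
  if (toolName.startsWith('extract_')) return 'extraction';
  if (toolName.startsWith('classify_')) return 'classification';
  return null; // unknown: getModel falls back to the frontier tier
}

// Deterministic lookup; unknown task types default to the most capable tier.
function getModel(taskType) {
  return TIERS[TASK_TIER.get(taskType) ?? TIERS.length - 1];
}

// Next tier up, or null when already at the top.
function nextTier(model) {
  const idx = TIERS.indexOf(model);
  return idx >= 0 && idx < TIERS.length - 1 ? TIERS[idx + 1] : null;
}

// Explicit fallback: escalate while the model reports it cannot determine an answer.
async function runTask(taskType, prompt, callModel) {
  let model = getModel(taskType);
  for (;;) {
    const result = await callModel(model, prompt);
    const up = nextTier(model);
    if (!result.cannotDetermine || up === null) return { model, ...result };
    model = up;
  }
}
```

Defaulting unknown task types to the frontier tier is a conservative choice: an untagged task costs more but does not silently lose quality.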
Routing architecture 2: cascading router with confidence-based escalation
- How it works. You attempt every task with the cheapest model first and escalate to the next tier only when confidence is below a threshold. Confidence can be measured by: (a) the model’s explicit confidence score if the output format includes one, (b) a self-consistency check (sample 3 times at high temperature; if the results diverge significantly, escalate), or (c) a lightweight classifier trained on your historical escalation decisions.
- The cascade pattern in code.
    // callModel, CONFIDENCE_THRESHOLD, and HIGH_CONFIDENCE_THRESHOLD are defined by your application.
    async function routedLlmCall(prompt, options = {}) {
      // Tier 1: try the cheapest model first.
      const cheap = await callModel('haiku', prompt, options);
      if (cheap.confidence >= CONFIDENCE_THRESHOLD) return cheap;

      // Tier 2: escalate when tier 1 confidence is too low.
      const mid = await callModel('sonnet', prompt, options);
      if (mid.confidence >= HIGH_CONFIDENCE_THRESHOLD) return mid;

      // Tier 3: the frontier model's answer is returned as-is.
      return await callModel('opus', prompt, options);
    }
- Cost modeling for cascading. If 70% of tasks resolve at tier 1, 20% at tier 2, and 10% at tier 3, tier prices are $0.25/$1.00/$3.00 per MTok, and each attempt consumes roughly the same number of tokens, the average cost is:
    0.70 × $0.25 + 0.20 × ($0.25 + $1.00) + 0.10 × ($0.25 + $1.00 + $3.00) = $0.175 + $0.250 + $0.425 = $0.85/MTok
  versus $3.00/MTok if you route everything to tier 3. That is a 72% cost reduction. The cascade adds latency for escalated tasks, but those tasks (30% of the total) were the ones that warranted the extra computation anyway.
- Latency tradeoff. Cascading adds one extra round-trip per escalation. If P90 latency tolerance is tight, configure cascade only for background/async agent tasks and use direct routing for interactive tasks. See AI agent A/B testing cost tradeoffs for methodology to measure the latency-cost tradeoff at your specific task mix.
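To rerun the cost model with your own resolution rates, the same arithmetic can be written as a small function. This is a sketch with illustrative names; it keeps the simplifying assumption that every attempt consumes the same number of tokens, so a task resolved at tier i has paid the price of every tier up to and including i.

```javascript
// Expected price per MTok for a cascade.
// resolveShare[i]: fraction of tasks that resolve at tier i (must sum to 1).
// tierPrice[i]:    price per MTok of tier i, ordered cheapest first.
function expectedCascadeCost(resolveShare, tierPrice) {
  if (resolveShare.length !== tierPrice.length) throw new Error('length mismatch');
  const total = resolveShare.reduce((a, b) => a + b, 0);
  if (Math.abs(total - 1) > 1e-9) throw new Error('shares must sum to 1');

  let cumulativePrice = 0; // price paid by a task that reaches tier i
  let expected = 0;
  for (let i = 0; i < tierPrice.length; i++) {
    cumulativePrice += tierPrice[i];
    expected += resolveShare[i] * cumulativePrice;
  }
  return expected;
}

const cost = expectedCascadeCost([0.7, 0.2, 0.1], [0.25, 1.0, 3.0]);
const reduction = 1 - cost / 3.0; // versus sending everything to tier 3
console.log(cost.toFixed(2), `${Math.round(reduction * 100)}%`); // prints "0.85 72%"
```

Plugging in a 40% tier 1 resolution rate instead of 70% shows how quickly the saving shrinks, which is why the escalation rate is worth monitoring.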
Routing architecture 3: classifier-based routing
- Why a separate classifier. Rule-based routing requires explicit task-type tagging in your code, which creates maintenance burden. Cascading routing adds latency. Classifier-based routing uses a lightweight model (or a fine-tuned embedding classifier) to predict the appropriate model tier from the raw prompt, before the task is sent to any LLM. This allows routing decisions to be made in milliseconds without an additional LLM call and without requiring explicit task-type metadata.
- Building the training set. The training set comes from your production logs. For each historical LLM call, label it with the model tier that produced a satisfactory result. The cheapest model that passed your quality gate is the label. After a few thousand examples, a logistic regression or lightweight BERT-style classifier can predict the routing decision with 85–90% accuracy on your task distribution.
- Using an LLM as the classifier (meta-routing). An alternative: use a tiny, cheap model specifically for routing decisions. The “routing call” costs 100–200 input tokens and produces a single-token output (tier 1/2/3). At Haiku pricing ($0.25/MTok input, as quoted above), 1,000 routing decisions cost approximately $0.025–$0.05, which is negligible compared to the savings from correct routing. The disadvantage is added latency (one extra round-trip) compared with the embedding classifier approach.
- OpenRouter and LiteLLM for multi-provider routing. For teams running a multi-provider stack (Anthropic + OpenAI + Google), tools like OpenRouter and LiteLLM implement model routing at the API proxy layer. You define routing rules in configuration; the proxy selects the provider and model per call. This decouples routing logic from application code. RunGuard integrates with LiteLLM’s callback hooks to apply BudgetTracker limits across the routed call graph (see the BudgetTracker section below).
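The labeling rule for the classifier training set ("the cheapest model that passed your quality gate is the label") can be sketched as follows. The log record shape, the tier names, and the choice to label never-passing prompts with the most expensive tier are all assumptions made for illustration.

```javascript
// Tier order from cheapest to most expensive (placeholder names).
const TIER_ORDER = ['tier1', 'tier2', 'tier3'];

// `attempts` is a hypothetical log shape, one entry per (prompt, tier) evaluation:
//   { promptId: 'a', tier: 'tier1', passedQualityGate: false }
// Returns Map<promptId, tier>: the cheapest tier that passed the quality gate.
// Assumption: prompts that never passed are labeled with the most expensive tier.
function labelCheapestPassingTier(attempts) {
  const labels = new Map();
  const seen = new Set();
  for (const { promptId, tier, passedQualityGate } of attempts) {
    seen.add(promptId);
    if (!passedQualityGate) continue;
    const rank = TIER_ORDER.indexOf(tier);
    if (rank < 0) continue; // ignore tiers outside the known order
    const current = labels.get(promptId);
    if (current === undefined || rank < TIER_ORDER.indexOf(current)) {
      labels.set(promptId, tier);
    }
  }
  for (const id of seen) {
    if (!labels.has(id)) labels.set(id, TIER_ORDER[TIER_ORDER.length - 1]);
  }
  return labels;
}
```

The resulting (prompt, label) pairs are what a logistic regression or small encoder classifier would be trained on. Producing them requires that at least a sample of production prompts was evaluated on more than one tier.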
Avoiding common routing mistakes
- Routing on capability without measuring quality. The most common mistake is assuming the cheap model will be inadequate without testing it. Test first: run 100 examples of your intended cheap-tier tasks through both the cheap and frontier model; compare output quality on your criteria. In the majority of cases, you will find the cheap model is adequate for more task types than you expected.
- Not logging routing decisions for auditability. When a routing decision causes a quality issue (the cheap model produced an incorrect result that propagated downstream), you need to debug it. Every LLM call should log which model was selected, why (rule match, cascade confidence, classifier score), and the task type. Without this, routing bugs are opaque.
- Using routing to compensate for poor prompts. If a task requires the frontier model because the prompt is poorly specified, routing is hiding a prompt engineering problem. Before implementing routing, ensure your prompts are well-structured for each task type. A well-engineered prompt for a classification task will work on the cheap model; a vague prompt may require the frontier model to interpret. Fix the prompt; then implement routing.
- Missing the cost of escalation at scale. If your cascade escalation rate is higher than expected (40% of tasks escalate rather than the anticipated 20%), your routing architecture has miscalibrated thresholds. Audit a sample of escalated tasks: are they genuinely hard tasks, or is the cheap model failing on tasks it should handle? Miscalibrated escalation can eliminate the cost savings of routing entirely.
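The logging and escalation-rate checks described above could look like the sketch below. The per-task record shape (`taskType`, `tiersAttempted`) and the tolerance value are assumptions; the 20% expected rate is the example figure from the text.

```javascript
// Hypothetical per-task routing record: one entry per completed task.
//   { taskType: 'extraction', finalModel: 'mid', reason: 'cascade_confidence', tiersAttempted: 2 }
// A task counts as escalated when more than one tier was attempted.
function escalationRates(records) {
  const stats = new Map();
  for (const { taskType, tiersAttempted } of records) {
    const s = stats.get(taskType) ?? { total: 0, escalated: 0 };
    s.total += 1;
    if (tiersAttempted > 1) s.escalated += 1;
    stats.set(taskType, s);
  }
  const rates = {};
  for (const [type, s] of stats) rates[type] = s.escalated / s.total;
  return rates;
}

// Task types whose escalation rate exceeds the expected rate by more than `tolerance`.
function flagMiscalibrated(rates, expectedRate = 0.2, tolerance = 0.1) {
  return Object.keys(rates).filter((type) => rates[type] > expectedRate + tolerance);
}
```

Breaking the rate down by task type narrows the audit: a single task type escalating at 40% points at a threshold or prompt problem for that type, not at the cascade as a whole.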
RunGuard BudgetTracker for multi-model cost enforcement
- Why enforcement is necessary even with routing. Routing reduces expected costs; it does not eliminate cost spikes. An edge-case task that escalates through all tiers multiple times due to repeated tool failures can still produce a $5–$20 session when the normal average is $0.10. BudgetTracker provides the hard cap: when cumulative session spend crosses the per-session limit, the next LLM call throws a BudgetExceededError and the session is halted gracefully.
- Configuring BudgetTracker with model-specific cost rates.
    const tracker = new BudgetTracker({ capUsd: 0.50 });
    // Prices are in USD per million tokens for the model that served this call.
    const cost = (inputTokens * modelInputPrice + outputTokens * modelOutputPrice) / 1_000_000;
    tracker.record(cost);
  Call tracker.check() before each model call to trip early if you can project that the call will bust the budget. For cascading routing, check the budget before attempting tier 2 or tier 3 escalation: if the budget is almost exhausted, skip the escalation and return the tier 1 result even if confidence is low.
- Separate budgets for routing tiers. Advanced configuration: maintain separate BudgetTracker instances for tier 1 and tier 2+. If tier 2+ spend in a session exceeds a threshold (suggesting the routing classifier is miscalibrated for this session), trigger an alert or a circuit break that prevents further escalations. This gives you observability on the routing decision quality in addition to pure cost control.
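The "skip the escalation when the budget is almost exhausted" rule can be sketched as below. To keep the example self-contained it uses a local stand-in class, not RunGuard's implementation: only `capUsd` and `record()` mirror the snippet above, while `wouldExceed()`, the tier objects, `projectedTokens`, and the injected `callModel` (assumed to resolve to `{ text, confidence, inputTokens, outputTokens }`) are hypothetical.

```javascript
// Local stand-in, NOT RunGuard's BudgetTracker. `wouldExceed` is a hypothetical helper.
class SketchBudgetTracker {
  constructor({ capUsd }) { this.capUsd = capUsd; this.spentUsd = 0; }
  record(costUsd) { this.spentUsd += costUsd; }
  wouldExceed(projectedUsd) { return this.spentUsd + projectedUsd > this.capUsd; }
}

// Cost in USD from token counts and prices in USD per million tokens.
function callCostUsd(inputTokens, outputTokens, price) {
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Cascade that stops escalating when the projected cost would exceed the cap.
// `tiers` is ordered cheapest first: [{ name, price: { input, output }, minConfidence }].
// The first tier always runs in this sketch; a hard stop on the first call is
// left to the real tracker.
async function budgetedCascade(prompt, tiers, tracker, callModel, projectedTokens) {
  let best = null;
  for (const tier of tiers) {
    const projected = callCostUsd(projectedTokens.input, projectedTokens.output, tier.price);
    if (best !== null && tracker.wouldExceed(projected)) break; // keep the lower-tier result
    const result = await callModel(tier.name, prompt);
    tracker.record(callCostUsd(result.inputTokens, result.outputTokens, tier.price));
    best = { tier: tier.name, ...result };
    if (result.confidence >= tier.minConfidence) break; // confident enough, stop here
  }
  return best;
}
```

Returning the lower-tier result when the budget blocks escalation trades answer quality for a bounded bill, so callers may want to surface the low confidence value to downstream code.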
Enforce model routing budgets automatically
RunGuard’s BudgetTracker gives you a hard cap across your entire multi-model call graph — including escalated cascade calls. No more surprise bills from routing miscalibration.