AI agent token efficiency optimization: reduce LLM spend without touching the model

When teams calculate LLM costs, they typically focus on input tokens: the prompt, history, and tool results they send to the model. Output tokens receive less attention, yet for many agent workloads they are the primary cost driver. A model generating a verbose 800-token response when a 150-token JSON object would serve equally well wastes 650 output tokens per call. At Claude Sonnet pricing ($15/MTok output), that waste costs $0.0098 per call. That is negligible in isolation but meaningful at scale: 10,000 agent sessions per day with 10 LLM calls each means 100,000 wasted 650-token responses, adding about $975/day to the bill for no added value. Token efficiency optimization addresses both input and output sides through disciplined prompt engineering, output format constraints, schema-driven generation, and max_tokens discipline. Combined with RunGuard’s per-call and per-session budget enforcement, these techniques can produce 30–50% cost reductions with no change to model selection or infrastructure.

Output token waste: the underappreciated cost driver

Output tokens cost more than input tokens. At the major LLM providers, output tokens typically cost 3–5x as much per token as input tokens. Claude Sonnet: $3/MTok input, $15/MTok output (5x). GPT-4o: $2.50/MTok input, $10/MTok output (4x). Gemini 1.5 Pro: $3.50/MTok input, $10.50/MTok output (3x). Output optimization therefore has greater marginal impact per token saved than input optimization. Teams that focus exclusively on compressing inputs are optimizing the cheaper dimension.
Verbose by default. Without explicit output constraints, LLMs trend toward verbose, discursive responses. This is a training artifact: human feedback raters during RLHF training often prefer longer, more thorough answers. The resulting tendency toward verbosity is appropriate for user-facing chat applications, where thoroughness is valued, but counterproductive for programmatic agent use cases where you need a specific structured value, not an explanation.
Reasoning tokens in extended thinking models. Models with extended thinking (Claude Sonnet’s thinking tokens, OpenAI o-series reasoning tokens) generate “thinking” that is billed at the output rate. These tokens can be 5–20x the length of the final answer for simple tasks. Only enable extended thinking for tasks that genuinely require deep reasoning; disabling it for classification, extraction, and simple generation tasks eliminates the thinking overhead entirely.
The preamble tax. LLMs often begin responses with preambles: “Sure, I can help with that. Here’s the requested information:” — 15–20 tokens that convey no information. Instructions like “Begin directly with the result. No preamble.” in the system prompt eliminate these tokens. Over 100,000 LLM calls, eliminating a 15-token preamble saves 1.5M tokens — $22.50 at Sonnet output pricing, from a single instruction line.
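The arithmetic behind these figures is simple enough to keep in a helper. The following is a minimal TypeScript sketch, assuming the $15/MTok Sonnet output rate quoted above; the function and constant names are illustrative and not part of any SDK.

```typescript
// Cost in USD of a number of output tokens at a given price per million tokens.
function outputCostUsd(tokens: number, usdPerMTok: number): number {
  return (tokens * usdPerMTok) / 1000000;
}

// Rate quoted in this article; check current pricing before relying on it.
const SONNET_OUTPUT_USD_PER_MTOK = 15;

// 650 wasted output tokens on a single call.
const wastePerCall = outputCostUsd(650, SONNET_OUTPUT_USD_PER_MTOK);

// A 15-token preamble repeated over 100,000 calls (1.5M tokens in total).
const preambleTotal = outputCostUsd(15 * 100000, SONNET_OUTPUT_USD_PER_MTOK);

console.log(wastePerCall.toFixed(5));  // 0.00975
console.log(preambleTotal.toFixed(2)); // 22.50
```

Plugging in your own call volume and average wasted tokens gives a quick estimate of what a single prompt change is worth before you invest in testing it.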

Output format discipline: the highest-ROI optimization

JSON instead of prose for structured outputs. When your agent needs a structured result — a classification label, an extracted entity, a confidence score, a list of items — request JSON explicitly: “Respond with only a JSON object: {"label": <string>, "confidence": <float 0-1>}. No other text.” A prose response to a classification task might read “Based on my analysis, I believe this text falls into the category of [X] with moderate-to-high confidence, approximately 0.82.” (35 tokens). The JSON equivalent is {"label": "X", "confidence": 0.82} (11 tokens). A 3x output reduction from format discipline alone.
Structured outputs (constrained decoding). OpenAI’s response_format: {type: "json_schema"} in strict mode uses constrained decoding to guarantee valid JSON conforming to a schema, and Anthropic’s forced tool use (a tool_choice that names a specific tool) makes the model answer with that tool’s input object. Beyond correctness, these modes reduce output tokens because the model does not need to generate explanatory prose around the JSON; it generates only the schema-specified fields. Field names count toward output tokens, so use short field names in schemas where possible: c instead of confidence_score saves a few tokens per field per call.
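As an illustration of schema-driven output, the sketch below builds a response_format value for an OpenAI Chat Completions request. The schema name and the short field names l and c are made up for this example. Strict mode expects every property to be listed in required and additionalProperties to be false.

```typescript
// response_format payload for OpenAI Chat Completions structured outputs.
// The schema name and fields are illustrative, not taken from the article.
const classificationFormat = {
  type: "json_schema",
  json_schema: {
    name: "classification",
    strict: true,
    schema: {
      type: "object",
      properties: {
        l: { type: "string", description: "label" },
        c: { type: "number", description: "confidence, 0 to 1" },
      },
      required: ["l", "c"],
      additionalProperties: false,
    },
  },
};

// A conforming response body is just the two fields, with no surrounding prose.
const exampleBody = JSON.stringify({ l: "billing", c: 0.82 });
console.log(exampleBody); // {"l":"billing","c":0.82}
```

Short field names trade readability for tokens, so the description fields carry the meaning. Descriptions are billed as input tokens, not output.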
Explicit response length instructions. For summarization, analysis, and reasoning tasks: specify the desired output length explicitly. “Summarize in 2–3 sentences.” produces 80–120-token responses. Without the constraint, the same prompt might generate 300–500 tokens. The model respects length instructions reliably when they are specific (sentence count or word count) rather than vague (“briefly” is inconsistent; “in 3 sentences” is reliable).
Setting max_tokens appropriately. LLM APIs let you cap output length with a parameter such as max_tokens; Anthropic’s Messages API requires it on every call. Many applications set it to a large catch-all value (often 4,096 or more), which means the model can generate far more than the task needs. Set max_tokens to 1.5x your expected output length for the task type. A realistic limit prevents the model from generating 10x the expected output when it misinterprets the task. Treat a forced truncation at max_tokens as a bug signal: if the model regularly hits the limit, your prompt or task decomposition needs review.
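One way to apply the 1.5x rule is to keep expected output lengths per task type in one table and derive max_tokens from it. The task names and expected lengths below are placeholder assumptions. The stop_reason value "max_tokens" is what Anthropic’s Messages API reports when a response is cut off at the cap.

```typescript
// Expected output lengths per task type (placeholder numbers; measure your own).
const expectedOutputTokens: Record<string, number> = {
  classify: 20,
  extract: 150,
  summarize: 120,
};

// max_tokens = 1.5x the expected output, rounded up.
function maxTokensFor(task: string): number {
  const expected = expectedOutputTokens[task];
  if (expected === undefined) throw new Error(`unknown task type: ${task}`);
  return Math.ceil(expected * 1.5);
}

// Anthropic's Messages API sets stop_reason to "max_tokens" on truncation.
// Treat this as a signal to review the prompt, not as a normal outcome.
function wasTruncated(stopReason: string | null): boolean {
  return stopReason === "max_tokens";
}

console.log(maxTokensFor("extract")); // 225
```

Counting how often wasTruncated returns true per task type gives a direct view of which prompts are hitting their cap.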

Input token efficiency: eliminating redundancy from prompts

System prompt audit. Run tiktoken (for OpenAI models) or Anthropic’s token counter on your system prompt. A system prompt exceeding 2,000 tokens is almost always carrying redundant instructions. Common bloat sources: repeated instructions (“always respond in JSON” stated three times in three different sections), over-specified personas (500-word backstory for a classification task that needs 20 words of context), example outputs that are longer than necessary (5 examples where 2 demonstrate the pattern).
Few-shot example calibration. Few-shot examples are powerful for guiding model behavior but expensive: a 3-example few-shot block adds 600–2,000 tokens to every call. Test whether your task requires 3 examples or whether 1 example achieves the same calibration. For many tasks, especially with well-aligned models, zero-shot with a precise output format specification achieves equivalent quality to 3-shot at 600+ tokens less per call. Run the comparison empirically before assuming few-shot is necessary.
Instruction compression. Write instructions at the minimum verbosity required for the model to follow them correctly. “You are a helpful assistant specialized in customer support for RunGuard. Your job is to help customers understand how to use RunGuard to protect their AI agents from infinite loops, budget overruns, and context overflow errors. Always be concise and technical in your responses.” (55 tokens) can often be reduced to “You are a RunGuard support agent. Be concise and technical.” (14 tokens) without loss of performance for most support queries. Empirically test the compressed version before deploying.
Dynamic context injection vs static bloat. Many system prompts include context that is only relevant for some tasks: documentation sections, tool descriptions, background information. Instead of including all possible context in every call, inject context dynamically based on the current task: if the task is a billing question, inject the billing FAQ; if it’s a technical question, inject the technical documentation. RAG (retrieval-augmented generation) applied to your own system prompt can reduce average prompt length by 30–60% for diverse task distributions.
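A minimal sketch of dynamic context injection follows, assuming a hypothetical two-category router. The section text, category names, and keyword rule are placeholders; a production system would more likely use a classifier or a retrieval step to pick sections.

```typescript
type Category = "billing" | "technical";

// Context sections that would otherwise sit in one static system prompt.
const sections: Record<Category, string> = {
  billing: "BILLING FAQ: (billing FAQ text goes here)",
  technical: "TECHNICAL DOCS: (technical documentation goes here)",
};

const basePrompt = "You are a RunGuard support agent. Be concise and technical.";

// Naive keyword routing, for illustration only.
function categorize(question: string): Category {
  return /invoice|refund|charge|billing/i.test(question) ? "billing" : "technical";
}

// Only the section relevant to this question is appended to the base prompt.
function buildSystemPrompt(question: string): string {
  return `${basePrompt}\n\n${sections[categorize(question)]}`;
}

console.log(buildSystemPrompt("Why was I charged twice?"));
```

Each call then carries the base prompt plus one section instead of every section, which is where the average prompt length drops.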

Tool call token overhead

Tool definition token cost. Every tool definition in the tools array adds tokens to the input. A comprehensive tool definition with a detailed description, parameter descriptions, and examples can be 200–500 tokens. If you have 20 tools defined and only 2 are relevant to the current task, you are paying for 18 unused tool definitions on every call. Dynamic tool injection — only including the tools that are relevant to the current step — is a direct input-token optimization.
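Dynamic tool injection can be as small as a filter over the tool list before each call. The sketch below uses the tool shape from Anthropic’s Messages API (name, description, input_schema). The tool names and the per-step allow-list are invented for illustration.

```typescript
interface ToolDef {
  name: string;
  description: string;
  input_schema: { type: "object"; properties: Record<string, unknown>; required?: string[] };
}

const allTools: ToolDef[] = [
  { name: "web_search", description: "Search the web for recent information",
    input_schema: { type: "object", properties: { query: { type: "string" } }, required: ["query"] } },
  { name: "get_invoice", description: "Fetch an invoice by ID",
    input_schema: { type: "object", properties: { id: { type: "string" } }, required: ["id"] } },
  { name: "run_sql", description: "Run a read-only SQL query",
    input_schema: { type: "object", properties: { sql: { type: "string" } }, required: ["sql"] } },
];

// Hypothetical mapping from agent step to the tools that step can actually use.
const toolsByStep: Record<string, string[]> = {
  research: ["web_search"],
  billing: ["get_invoice"],
};

// Return only the definitions relevant to the current step; pass the result as `tools`.
function toolsFor(step: string): ToolDef[] {
  const allowed = new Set(toolsByStep[step] ?? []);
  return allTools.filter((t) => allowed.has(t.name));
}

console.log(toolsFor("billing").map((t) => t.name)); // [ 'get_invoice' ]
```

An unknown step yields an empty tool list here. Whether that is the right default, or whether it should fall back to the full list, depends on how your agent handles steps it has not seen before.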
Tool description verbosity. Tool descriptions exist to help the model select the correct tool. They do not need to be exhaustive documentation. A one-sentence description plus the parameter names and types is typically sufficient. “Search the web for recent information” (6 tokens) is equivalent in routing quality to “This tool allows you to search the internet for current, up-to-date information about any topic. Use it when the user asks about recent events or when your training data may be out of date.” (35 tokens) for most routing decisions.
Tool result token management. Tool calls enter the conversation as assistant-role messages, and tool results come back as user-role messages (Anthropic) or tool-role messages (OpenAI). Every token in a tool result is resent as history on each subsequent call. See “AI agent prompt compression cost savings” for the truncation strategies that keep tool results from compounding into context bloat.

Measuring token efficiency in production

Token efficiency ratio. Define token efficiency as: information_delivered / tokens_consumed. The numerator requires a quality metric for your specific task (correct classification rate, answer accuracy, user satisfaction score). Track this ratio over time; a falling ratio means you are paying more tokens for the same quality, indicating drift toward verbosity. A rising ratio means your prompt engineering improvements are increasing quality per token.
Output-to-input ratio. A simple proxy: track the ratio of output tokens to input tokens per call. For extraction and classification tasks, this ratio should be very low (<0.05: you send 1,000 tokens, you receive 50). For generation and reasoning tasks, the ratio can legitimately be higher (0.2–0.5). If your extraction tasks have a high output-to-input ratio, you are not constraining the model’s output adequately.
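A sketch of the output-to-input check follows, using the usage fields that Anthropic’s Messages API returns (input_tokens, output_tokens). The 0.05 and 0.5 thresholds come from the ranges above; the task kinds and function names are illustrative.

```typescript
interface Usage { input_tokens: number; output_tokens: number }
type TaskKind = "extraction" | "generation";

// Upper bounds taken from the ranges discussed above.
const maxRatio: Record<TaskKind, number> = { extraction: 0.05, generation: 0.5 };

function outputToInputRatio(u: Usage): number {
  return u.input_tokens === 0 ? 0 : u.output_tokens / u.input_tokens;
}

// True when a call produced more output than its task kind should need.
function isOverlyVerbose(kind: TaskKind, u: Usage): boolean {
  return outputToInputRatio(u) > maxRatio[kind];
}

console.log(isOverlyVerbose("extraction", { input_tokens: 1000, output_tokens: 40 }));  // false
console.log(isOverlyVerbose("extraction", { input_tokens: 1000, output_tokens: 400 })); // true
```

Logging this flag per call, grouped by task kind, shows which prompts need tighter output constraints.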
Preamble and suffix token waste. Measure how many tokens in each response are “preamble” (pre-JSON prose) and “suffix” (post-JSON explanations). A properly constrained prompt should produce zero preamble and zero suffix for structured outputs. Log the token positions of the first and last JSON brace characters in responses; everything outside those positions is waste.
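The brace-position measurement can be sketched as below. It counts characters, not tokens, an approximation chosen to keep the example dependency-free; swap in a tokenizer for exact counts. The function and type names are illustrative.

```typescript
interface Waste { preambleChars: number; suffixChars: number }

// Measures text before the first "{" and after the last "}" in a response.
// Returns null when the response contains no JSON object at all.
function measureWaste(response: string): Waste | null {
  const start = response.indexOf("{");
  const end = response.lastIndexOf("}");
  if (start === -1 || end < start) return null;
  return {
    preambleChars: response.slice(0, start).trim().length,
    suffixChars: response.slice(end + 1).trim().length,
  };
}

const noisy = 'Sure! Here is the result: {"label": "X", "confidence": 0.82} Let me know if you need more.';
console.log(measureWaste(noisy)); // { preambleChars: 25, suffixChars: 29 }
console.log(measureWaste('{"label": "X", "confidence": 0.82}')); // { preambleChars: 0, suffixChars: 0 }
```

A null result is worth logging separately, since it means the model returned no JSON object at all.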

RunGuard BudgetTracker and ContextGuard for token efficiency enforcement

Per-call context-size enforcement via ContextGuard. RunGuard’s ContextGuard monitors the projected context size before each LLM call. Configure it with a maxContextTokens limit appropriate for your task type, and have your agent loop catch ContextOverflowError to trigger compression before the call is made. This prevents the context accumulation that defeats even the best output format discipline over long multi-turn sessions.
Per-session dollar cap via BudgetTracker. After implementing token efficiency improvements, set BudgetTracker’s capUsd to your target per-session cost. BudgetTracker trips when cumulative spend crosses the cap, halting the session before it exceeds budget. This is the enforcement layer that catches the sessions where everything goes right individually but the aggregate still runs over — typically because a task takes more steps than anticipated rather than any single step being inefficient.

Integration example:

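// guard and tracker (a BudgetTracker instance) come from RunGuard; their imports
// and setup are not shown here. estimateCost and estimateInputTokens are your own
// helpers for turning usage into dollars and messages into a token estimate.
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const g = guard(
  async (input) => {
    const resp = await anthropic.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 256,  // always set per-call
      messages: input.messages,
    });
    // Record the actual spend for this call against the session budget.
    tracker.record(estimateCost(resp.usage));
    return resp;
  },
  {
    // Projected context size is checked against this limit before each call.
    context: { maxContextTokens: 16000, headroom: 2000 },
    tokens: (input) => estimateInputTokens(input),
    // Per-session dollar cap.
    budget: { capUsd: 0.25 }
  }
);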

Enforce token budgets automatically

RunGuard wraps your LLM calls with hard context and dollar caps. Token efficiency improvements reduce expected costs; RunGuard catches the outliers that escape them.
