AI agent tool selection cost optimization: choose tools that don’t blow your LLM budget
When AI agent teams audit their LLM costs, they typically focus on model selection, prompt length, and output verbosity. They rarely audit the tools themselves. Yet tool design decisions (which tools are available to the agent, how verbose their result schemas are, how large their outputs are allowed to be, and when they execute) often account for 30–60% of total input token cost in tool-heavy agents. A research agent that calls a web_search tool returning 5,000-token unstructured HTML results spends roughly 6× as much on input tokens per search as one calling a structured-result search tool that returns 800-token formatted summaries. In agents like this, the tool’s result schema, not the model or the prompt, is the dominant cost driver. Optimizing tool selection and tool result design is one of the highest-leverage, lowest-effort cost reductions available, and it requires no model changes, no prompt refactoring, and no infrastructure modifications: just schema and size discipline on the tool side.
How tool selection drives LLM costs
- Tool definitions as fixed input cost per call. Every LLM call that includes tool definitions pays input tokens for those definitions. If your agent has 8 tools with verbose JSON schema definitions (name, description, parameters, examples) averaging 250 tokens each, you spend 2,000 tokens on tool definitions before a single character of conversation. At Sonnet input pricing ($3/MTok), that is $0.006 per call just for tool definitions — $60/day at 10,000 calls/day. An agent that has 15 tools and only regularly uses 4 of them is paying for 11 unnecessary tool definitions on every call.
- Tool result size as the dominant variable cost. Tool definitions are a fixed overhead. Tool results are variable and agent-controlled. An agent that calls read_file on a 10,000-line file and receives the full content pays 25,000–40,000 input tokens for that single result. An agent that calls read_file_lines(path, start=100, end=150) pays 200–400 tokens. The difference is 100x. Tool result size budgeting (enforcing maximum result sizes in the tool implementation, not in the prompt) is the single highest-impact tool cost optimization.
- Tool call frequency multiplier. In a loop, the cost of an expensive tool call is multiplied by the loop iteration count. An agent that calls a 5,000-token tool in a 10-iteration reasoning loop pays 50,000 input tokens for tool results alone. RunGuard’s LoopDetector catches repetitive tool call patterns that indicate reasoning loops; catching the loop at iteration 3 instead of iteration 10 saves 35,000 input tokens per session.
- Parallel tool calls and fan-out. Modern LLM APIs support parallel tool calls — the model returns multiple tool calls in a single response, all of which must be executed and their results returned as a batch in the next call. If the agent fans out to 5 parallel tool calls each returning 2,000 tokens, the next input call starts with 10,000 tokens of tool results before conversation context. See AI agent parallel tool call budget control for fan-out cost management patterns.
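The arithmetic behind these figures is easy to reproduce for your own agent. The snippet below is a minimal sketch that plugs in the numbers used in this section (8 tools at 250 tokens each, $3/MTok input pricing, 10,000 calls/day, a 5,000-token result in a loop). The function names are illustrative and not part of any library.

```javascript
// Illustrative cost arithmetic; all figures are the ones quoted in this article.
const INPUT_PRICE_PER_MTOK = 3.0; // USD per million input tokens (assumed rate)

// Cost in USD of sending `tokens` input tokens once.
function inputCost(tokens, pricePerMTok = INPUT_PRICE_PER_MTOK) {
  return (tokens / 1e6) * pricePerMTok;
}

// Fixed overhead: every call re-sends every tool definition.
function definitionOverhead(toolCount, avgTokensPerTool, callsPerDay) {
  const tokensPerCall = toolCount * avgTokensPerTool;
  const perCall = inputCost(tokensPerCall);
  return { tokensPerCall, perCall, perDay: perCall * callsPerDay };
}

// Variable cost: result tokens added by one tool across loop iterations.
function loopResultTokens(tokensPerResult, iterations) {
  return tokensPerResult * iterations;
}

const overhead = definitionOverhead(8, 250, 10000);
console.log(overhead.tokensPerCall);       // 2000
console.log(overhead.perCall.toFixed(3));  // 0.006
console.log(overhead.perDay.toFixed(2));   // 60.00

// Tokens saved by stopping a loop at iteration 3 instead of iteration 10.
console.log(loopResultTokens(5000, 10) - loopResultTokens(5000, 3)); // 35000
```

Swapping in your own tool count, definition size, and call volume gives a quick estimate of how much of your bill is fixed tool overhead.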
Tool definition optimization
- Trim description verbosity. Tool descriptions are often written for clarity, with examples, edge cases, and usage notes embedded in the description field. Each word in a tool description costs input tokens on every call. Audit each tool description for tokens vs value: remove examples (they can live in the system prompt or in a separate tool examples context), shorten verbose parameter descriptions to concise types and constraints, and remove redundant phrases (“This tool allows you to” → “Reads”). Most teams can reduce tool definition token counts by 40–60% with no impact on model tool-use quality.
- Dynamic tool sets. Instead of providing all available tools on every call, provide only tools relevant to the current agent state. A multi-phase agent (research phase, analysis phase, reporting phase) should expose only the tools appropriate for the current phase. Use a tool selector that returns a filtered tool list based on current state: research phase gets search/fetch/extract tools; analysis phase gets data/compute tools; reporting phase gets write/format tools. This is the tool-level equivalent of model routing and can reduce tool definition overhead by 50–70%.
- Schema minimalism. Avoid optional parameters with long descriptions for paths that are rarely taken. If a tool has 12 parameters but 95% of calls use only 3 of them, restructure the tool into a primary call signature (3 required params, no optionals) and a secondary extended version for the rare cases. The primary version costs 3× less in tool definition tokens and is less likely to confuse the model into over-using optional parameters.
- Caching tool definitions. Tool definitions are stable across calls in a session. Use provider-level prefix caching to cache them: place tool definitions at the end of the system prompt (before the conversation history) so the system prompt + tools block is the stable cache prefix. At Anthropic’s 10% cache read rate, this converts tool definition tokens from $3/MTok to $0.30/MTok on every cached call. For agents with large tool sets, this alone can reduce call costs by 20–30%.
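The dynamic tool set idea above can be sketched as a small phase-based selector. This is a hedged illustration, not a prescribed design: the registry shape, the phases field, and the write_report tool are assumptions made for the example.

```javascript
// Hypothetical tool registry. The `phases` field is routing metadata that is
// used only for selection and is never sent to the model.
const TOOL_REGISTRY = [
  { name: "web_search", phases: ["research"], description: "Searches the web." },
  { name: "fetch_url", phases: ["research"], description: "Fetches a URL." },
  { name: "run_query", phases: ["analysis"], description: "Runs a read-only query." },
  { name: "read_file", phases: ["research", "analysis"], description: "Reads a line range from a file." },
  { name: "write_report", phases: ["reporting"], description: "Writes the final report." },
];

// Returns only the definitions relevant to the current phase,
// with the routing metadata stripped out.
function toolsForPhase(phase, registry = TOOL_REGISTRY) {
  return registry
    .filter((tool) => tool.phases.includes(phase))
    .map(({ phases, ...definition }) => definition);
}

console.log(toolsForPhase("research").map((t) => t.name));
// [ 'web_search', 'fetch_url', 'read_file' ]
```

One trade-off to keep in mind: changing the tool list between calls changes the prompt prefix, so phase switches may invalidate a provider-side cache of the tool definitions. Keeping the set stable within a phase limits that effect.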
Tool result size budgeting
- Enforce result size limits in the tool, not the prompt. Telling the model “limit your search results to 500 words” in the system prompt does not limit search result sizes — it may cause the model to summarize results in its response, but the full search result is still in the context. Size limits must be enforced in the tool implementation: truncate HTML content before returning, paginate large file reads, limit search result count. Tool-enforced limits are deterministic; prompt-enforced limits are probabilistic.
- Structured result schemas vs raw text. A web scraping tool that returns raw HTML (10,000–50,000 tokens) is 10–50x more expensive than one that returns a structured schema of {title, author, date, main_content_paragraphs: []}. The structured schema strips navigation, ads, footers, and markup — noise tokens that contribute nothing to the task. Design tool result schemas to return only the fields the agent actually uses downstream. If the agent never uses the HTML <head> content, don’t return it.
- Pagination as cost control. For tools that return variable-length results (database queries, file reads, search results), implement pagination with a default page size of 20–50 items. The agent can explicitly request more pages if needed, but defaults to the minimum viable context. This prevents the common failure mode where an agent asks for “all records” and receives 10,000 rows of data it will summarize to 5 insights. The agent should work iteratively on pages rather than loading all data at once.
- Result summarization in the tool layer. For tools that retrieve inherently large content (document readers, code repositories, conversation histories), build summarization into the tool implementation itself. The tool runs a lightweight summarization pass (using a small, cheap model) before returning results to the main agent. A 15,000-token document that gets summarized to 800 tokens by a $0.25/MTok Haiku call costs $0.004 instead of adding $0.045 in input tokens on every subsequent call in the session.
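Enforcing limits inside the tool might look like the sketch below, which combines a hard truncation cap with default pagination. The 4-characters-per-token estimate is a rough assumption for illustration; a real implementation would use the tokenizer that matches your model. All function names are hypothetical.

```javascript
// Rough heuristic (assumption): about 4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Deterministically caps a text result at `maxTokens` estimated tokens.
function truncateToTokens(text, maxTokens) {
  const originalTokens = estimateTokens(text);
  if (originalTokens <= maxTokens) {
    return { text, truncated: false, originalTokens };
  }
  return {
    text: text.slice(0, maxTokens * CHARS_PER_TOKEN),
    truncated: true,
    originalTokens,
  };
}

// Returns one page of results plus enough metadata for the agent
// to decide whether it needs to request another page.
function paginate(items, page = 1, pageSize = 20) {
  const start = (page - 1) * pageSize;
  return {
    items: items.slice(start, start + pageSize),
    page,
    totalItems: items.length,
    hasMore: start + pageSize < items.length,
  };
}

const rows = Array.from({ length: 10000 }, (_, i) => ({ id: i + 1 }));
const firstPage = paginate(rows);
console.log(firstPage.items.length, firstPage.hasMore); // 20 true

const bigResult = "x".repeat(40000); // about 10,000 estimated tokens
const limited = truncateToTokens(bigResult, 1500);
console.log(limited.truncated, estimateTokens(limited.text)); // true 1500
```

Returning truncated, originalTokens, and hasMore alongside the payload lets the model know that more data exists without paying for it up front.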
Lazy tool execution patterns
- Defer expensive tool calls as long as possible. The model often requests tool calls as a reflex — “let me search for that” — before actually needing the information for a decision. Implement a tool call pre-check in your agent loop: before executing an expensive tool (search, fetch, large file read), check whether the information is already in the conversation history. Cache the results of expensive tool calls within a session and return cached results on repeated calls. This is cheap at implementation time and can eliminate 20–40% of redundant tool executions.
- Tool call batching. If the agent makes multiple related tool calls in sequence (read file A, then read file B, then search for C), consider providing a batch_read or multi_search tool that handles multiple requests in a single LLM tool call. The agent pays one output token cost (for the batch tool call request) instead of three separate ones. Tool call batching reduces total output tokens and simplifies the conversation history by replacing three tool-use turns with one.
- Speculative tool prefetching. For predictable tool call sequences (agents that always search before reading before summarizing), pre-fetch the likely next tool result while the current tool is executing. Prefetching is outside the LLM’s awareness: you pre-execute what the model will almost certainly ask for next. When the model requests the tool, return the cached prefetch result instantly. This does not reduce token cost but reduces latency, which reduces time in the agentic loop, which reduces the number of turns needed to complete the task (some tasks require fewer turns when executed quickly because tool results arrive before the model’s planning context decays).
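The session-level caching described in the first item of this list can be sketched as a thin wrapper around whatever function actually executes your tools. This is a minimal illustration under stated assumptions: createCachedExecutor and cacheKey are hypothetical names, arguments are assumed to be flat JSON-serializable objects, and the cache is assumed to live for a single session.

```javascript
// Builds a stable cache key. Top-level argument keys are sorted so that
// { a, b } and { b, a } map to the same entry.
function cacheKey(toolName, args) {
  const sorted = Object.keys(args).sort().map((k) => [k, args[k]]);
  return toolName + ":" + JSON.stringify(sorted);
}

// Wraps a tool executor with a per-session result cache.
function createCachedExecutor(executeTool) {
  const cache = new Map();
  const stats = { hits: 0, misses: 0 };

  function run(toolName, args = {}) {
    const key = cacheKey(toolName, args);
    if (cache.has(key)) {
      stats.hits += 1;
      return cache.get(key);
    }
    stats.misses += 1;
    const result = executeTool(toolName, args); // a value or a Promise
    cache.set(key, result);
    return result;
  }

  return { run, stats };
}

// Usage with a stand-in executor that counts real executions.
let executions = 0;
const executor = createCachedExecutor((name, args) => {
  executions += 1;
  return `${name} result for ${args.query}`;
});

executor.run("web_search", { query: "llm pricing", limit: 5 });
executor.run("web_search", { limit: 5, query: "llm pricing" }); // served from cache
console.log(executions, executor.stats); // 1 { hits: 1, misses: 1 }
```

If the executor returns a Promise, the Promise itself is cached, so a failed call would also be replayed. A production version would likely evict rejected results and skip caching for tools whose output changes over time.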
RunGuard integration for tool cost enforcement
- Per-tool cost tracking. Instrument RunGuard’s BudgetTracker with per-tool cost annotations. Before executing each tool call, record the tool name and the agent’s current context size. After execution, record the tool result size. This gives you a per-tool cost contribution breakdown: “search tool: 45% of input tokens; file_read tool: 30%; code_execute tool: 15%; other: 10%.” This breakdown reveals which tools are driving costs and should be prioritized for result size optimization.
- Tool result size circuit breaker. Configure a per-tool result size limit in RunGuard. If a tool returns a result larger than the configured limit (e.g., file_read returning more than 5,000 tokens), RunGuard intercepts the result before it is added to context, triggers automatic truncation, and logs the event. This prevents individual large tool results from spiking a session’s input token cost regardless of whether the tool implementation enforces its own limits. It is a defense-in-depth measure for tools you don’t control (third-party APIs, plugin-sourced tools).

```javascript
const guard = new RunGuard({
  tools: {
    resultSizeLimits: {
      web_search: 1500, // tokens
      read_file: 4000,
      fetch_url: 3000,
      run_query: 2000
    },
    onResultTruncated: (toolName, originalSize, limit) => {
      console.warn(`${toolName} result truncated: ${originalSize} → ${limit} tokens`);
    }
  }
});
```

- Loop detection on repetitive tool calls. RunGuard’s LoopDetector tracks the sequence of tool calls in a session. If the same tool is called with the same or near-identical parameters more than twice in a row, LoopDetector raises a LoopDetectedError, halting the agent before it accumulates additional tool result tokens for a query it already has the answer to. See how to set max cost per LLM request for configuring RunGuard enforcement thresholds.
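To make the loop detection rule concrete, here is a standalone sketch of the "same call more than twice in a row" check. It is not RunGuard’s LoopDetector and makes no claim about how that class is implemented: RepeatCallDetector and RepeatedToolCallError are hypothetical names, and this version matches only exactly identical arguments, not near-identical ones.

```javascript
// Hypothetical error type for this sketch.
class RepeatedToolCallError extends Error {}

// Throws once the same tool is called with identical arguments
// more than `maxConsecutive` times in a row.
class RepeatCallDetector {
  constructor(maxConsecutive = 2) {
    this.maxConsecutive = maxConsecutive;
    this.lastKey = null;
    this.count = 0;
  }

  // Call this before executing each tool call.
  record(toolName, args = {}) {
    const key = toolName + ":" + JSON.stringify(args);
    this.count = key === this.lastKey ? this.count + 1 : 1;
    this.lastKey = key;
    if (this.count > this.maxConsecutive) {
      throw new RepeatedToolCallError(
        `${toolName} called ${this.count} times in a row with identical arguments`
      );
    }
  }
}

const detector = new RepeatCallDetector();
detector.record("web_search", { query: "q" });
detector.record("web_search", { query: "q" });
try {
  detector.record("web_search", { query: "q" }); // third identical call
} catch (err) {
  console.log(err.message);
  // web_search called 3 times in a row with identical arguments
}
```

Any different call in between resets the counter, so legitimate retries separated by other work are not flagged.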
The cheapest tool call is the one you don’t make twice.
Tool verbosity and result size are the hidden cost drivers in AI agent systems. Optimize your tool schemas and enforce size limits with RunGuard to cut input token costs 30–50% without touching your model or your prompts.