LLM agent SDK cost comparison: LangChain, AutoGen, CrewAI, LlamaIndex, PydanticAI

Picking an LLM agent framework is partly a cost decision. Each SDK makes different default choices about how many LLM calls to make per task, how aggressively to retry on errors, how much context to carry across turns, and whether to ever stop a runaway loop. These defaults can change per-task cost by 2× to 20× on identical workloads. This page breaks down the cost characteristics of each major SDK and explains why a runtime circuit breaker like RunGuard is necessary regardless of which framework you choose.

Why SDK choice affects cost dramatically

When you pick an agent framework, you’re choosing an implicit execution model. That execution model determines:

How many LLM calls are made per step. Some frameworks call the model once to plan and once to execute. Others make a planning call, a reflection call, a verification call, and a synthesis call — four calls for every step the naive single-call approach does in one.
What context is carried between calls. Frameworks that resend the full conversation history on every call see cumulative input cost grow quadratically with run length. Frameworks that summarize or trim the context keep that growth linear.
How retries are handled. A framework that retries failed tool calls 3 times with exponential backoff and a fresh context each time multiplies costs at exactly the moments when something is already going wrong — which is when you can least afford it.
Whether loops are detected. Most frameworks have no built-in loop detection. If your agent gets into a state where it keeps calling the same tool with the same arguments, the framework will dutifully call the LLM to decide what to do next, receive the same answer, and call the tool again. Indefinitely.

These are not hypothetical concerns. They are among the most common causes of four-figure LLM bills in production AI applications.
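The context-growth point can be made concrete with a few lines of arithmetic. The sketch below is a simplified model, not a measurement of any framework: it assumes a fixed system prompt of 2,000 tokens and 500 tokens added per turn (both illustrative numbers), and compares a full-history run with one trimmed to the last five turns.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int,
    tokens_per_turn: int,
    system_tokens: int = 0,
    window: Optional[int] = None,
) -> int:
    """Total prompt tokens sent over a whole run.

    On call k the prompt holds the system prompt plus k turns of history;
    with a window, at most `window` turns are carried.
    """
    total = 0
    for k in range(1, turns + 1):
        carried = k if window is None else min(k, window)
        total += system_tokens + carried * tokens_per_turn
    return total


# Illustrative assumptions: 2,000-token system prompt, 500 tokens per turn.
full_5 = cumulative_input_tokens(5, 500, system_tokens=2000)
full_30 = cumulative_input_tokens(30, 500, system_tokens=2000)
trimmed_30 = cumulative_input_tokens(30, 500, system_tokens=2000, window=5)

print(full_5, full_30, trimmed_30)
```

Under these assumptions the 30-turn full-history run sends 292,500 input tokens against 17,500 for the 5-turn run, roughly 17× more, while a five-turn window brings the 30-turn total down to 130,000.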

LangChain Agents: flexible, verbose by default

LangChain’s agent abstraction supports multiple execution strategies: ReAct, OpenAI Functions, Tool Calling, and Structured Chat. The ReAct loop, still widely used, sends the full chat history on every iteration, which means that by turn 10, each call includes turns 1 through 9 as context. On GPT-4o, a 30-turn ReAct run can cost roughly 15× more in input tokens than a 5-turn run even if the useful work done is the same, simply because the model is processing an ever-growing transcript on every call.

LangChain’s max_iterations parameter (default: 15) and max_execution_time (unset by default, so opt-in at construction time) are the primary safety valves. Neither detects loops; they only cap the iteration count and the wall-clock time. An agent stuck in a loop will still execute 15 iterations before stopping. At $0.005 per call, 15 iterations cost about $0.075 in LLM calls alone, but if each iteration involves multiple tool calls that themselves trigger LLM calls (e.g., an agent using another LLM as a tool), the multiplier compounds quickly.
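The compounding effect is easy to estimate. This is a back-of-the-envelope sketch using the $0.005 per call and 15-iteration figures from above; the function name and the count of nested LLM calls per iteration are assumptions chosen for illustration.

```python
def capped_run_cost(
    max_iterations: int,
    usd_per_call: float,
    nested_llm_calls_per_iteration: int = 0,
) -> float:
    """Worst-case LLM spend for a run that hits its iteration cap.

    Each iteration makes one agent call plus any LLM calls that its
    tools trigger internally.
    """
    calls = max_iterations * (1 + nested_llm_calls_per_iteration)
    return calls * usd_per_call


plain = capped_run_cost(15, 0.005)  # agent calls only, about $0.075
nested = capped_run_cost(15, 0.005, nested_llm_calls_per_iteration=3)  # about $0.30

print(f"plain: ${plain:.3f}, with LLM-backed tools: ${nested:.3f}")
```

The cap bounds the damage, but the bound scales with everything that happens inside an iteration, which is why a loop that is stopped early is much cheaper than one that runs to the cap.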

LangChain also has trim_messages utilities and callbacks you can use to control context growth, but these require explicit configuration. The defaults favor completeness over cost efficiency.

AutoGen: multi-agent orchestration with high call volume

AutoGen’s core abstraction is a conversation between agents. A task that a single-agent framework would finish in one LLM call might take a two-agent AutoGen setup (AssistantAgent + UserProxyAgent) three to five: the user proxy sends a message, the assistant responds, the user proxy evaluates the response and possibly asks for clarification, the assistant refines, and the user proxy executes. Every assistant reply is a separate LLM call, and so is every proxy reply when the proxy is configured with its own LLM.

AutoGen’s GroupChat compounds this. In a GroupChat with four specialized agents, selecting which agent should speak next is itself an LLM call against a “manager” model. For every step in the task, you pay: one call to select the speaker, one call from the selected agent to respond, and potentially calls from other agents who decide to chime in. A 10-step task in a 4-agent GroupChat can easily require 30–50 LLM calls.
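The 30–50 figure follows from simple counting. The sketch below is a rough model rather than AutoGen code: it assumes one speaker-selection call and one reply from the selected agent per step, plus a hypothetical number of extra replies from other agents.

```python
def group_chat_calls(steps: int, extra_replies_per_step: int) -> int:
    """Estimated LLM calls for a GroupChat-style run.

    Per step: 1 speaker-selection call + 1 reply from the selected agent
    + any additional replies from other agents.
    """
    return steps * (1 + 1 + extra_replies_per_step)


low = group_chat_calls(steps=10, extra_replies_per_step=1)
high = group_chat_calls(steps=10, extra_replies_per_step=3)

print(low, high)
```

With one to three extra replies per step, a 10-step task lands at 30 to 50 calls.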

AutoGen has a max_consecutive_auto_reply setting, but it applies per agent pair, not globally across the group, and it does not detect repeated call patterns. Loop detection is not built in.

CrewAI: structured task graphs, predictable but not cheap

CrewAI organizes agents into crews with defined roles and tasks. The execution model is more predictable than AutoGen’s dynamic conversations: each agent gets a specific task, processes it, and passes output to the next agent. This structural predictability reduces the risk of unbounded loops in the task graph itself, but it introduces different cost risks.

CrewAI’s Process.sequential mode calls each agent once in order, which is economical. Process.hierarchical adds a manager agent that reviews each step’s output and decides whether to continue — doubling the LLM calls for complex tasks. The “memory” feature (long-term, short-term, entity memory) adds retrieval calls to each agent turn, and verbose output from tools is often included in full in the next agent’s context window.

Tool retry behavior in CrewAI defaults to retrying failed tool calls up to 2 times. If your tool is flaky (network errors, rate limits), CrewAI will attempt recovery — three tool calls instead of one, with LLM overhead on each retry to re-plan.
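To see how retries multiply calls, here is a minimal, framework-independent sketch. It is not CrewAI’s implementation: run_with_retries, flaky_tool, and replan are hypothetical names, and replan stands in for the LLM call a framework makes to decide what to do after a failure.

```python
from typing import Callable

counts = {"tool": 0, "llm": 0}


def run_with_retries(
    tool: Callable[[], str],
    replan: Callable[[Exception], None],
    max_retries: int = 2,
) -> str:
    """Call `tool`; on failure, re-plan (an LLM call) and try again."""
    attempt = 0
    while True:
        try:
            return tool()
        except Exception as exc:
            if attempt >= max_retries:
                raise  # retries exhausted, surface the error
            attempt += 1
            replan(exc)


def flaky_tool() -> str:
    """Simulated tool that fails twice, then succeeds."""
    counts["tool"] += 1
    if counts["tool"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"


def replan(exc: Exception) -> None:
    """Stand-in for the LLM call made after each failure."""
    counts["llm"] += 1


result = run_with_retries(flaky_tool, replan)
print(result, counts)
```

One logical tool call turns into three tool invocations and two extra LLM calls before it succeeds.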

LlamaIndex Agents: strong RAG integration, context-heavy

LlamaIndex’s agent framework is deeply integrated with its query engine and retrieval pipeline. This means an agent that needs to answer a question will often chain a retrieval call (which may itself invoke an LLM for query rewriting) with an answer synthesis call and a citation check call. Each step in a research task can involve 3–5 LLM calls where a retrieval-agnostic framework would use one.

The ReActAgent in LlamaIndex carries the full chat history by default. The FunctionCallingAgent is somewhat more efficient because function calling responses are more token-efficient than free-text ReAct traces. LlamaIndex does provide a TokenCountingHandler callback that tracks cumulative token usage, which is useful for analysis but not for real-time enforcement: you can see that you used 50,000 tokens after the fact, but nothing in the framework halts the run when you hit your limit.
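The difference between counting and enforcing can be sketched in a few lines. This is an illustrative stand-in, not a LlamaIndex or RunGuard API: TokenBudget and TokenBudgetExceeded are hypothetical names, and the per-call token counts are made up.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when cumulative usage passes the configured limit."""


class TokenBudget:
    """Tracks cumulative tokens and halts the run once the limit is passed."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record usage for one call; raise if the run is now over budget."""
        self.used += prompt_tokens + completion_tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"used {self.used} tokens, limit is {self.max_tokens}"
            )


budget = TokenBudget(max_tokens=50_000)
calls_made = 0
halted = False
try:
    for _ in range(10):
        calls_made += 1  # an LLM call would happen here
        budget.charge(prompt_tokens=10_000, completion_tokens=2_000)
except TokenBudgetExceeded:
    halted = True

print(calls_made, budget.used, halted)
```

A pure counter would let all ten calls run and report 120,000 tokens afterward; here the run stops after the fifth call, the first one that pushes usage past the limit.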

PydanticAI: type-safe, lean call model

PydanticAI is the newest entry in this comparison and takes a notably different approach. It uses structured outputs enforced by Pydantic models, which means the model is guided toward valid structured responses rather than free-text reasoning. This significantly reduces the number of calls needed for validation and retry: instead of three attempts to parse a JSON response the model generated as free text, a single structured-output call returns a valid Pydantic object or raises a ValidationError that you handle explicitly.

PydanticAI’s default agent runs a single sequential loop (exposed through an async run method and a synchronous run_sync wrapper), with explicit tool registration. The call model is: one LLM call to select tools and arguments, tools execute, results are fed back, one LLM call to synthesize. This two-call pattern per step is the most economical of the major frameworks. The tradeoff is that complex multi-step reasoning that benefits from ReAct-style reflection is harder to express.

PydanticAI has no built-in loop detection, but its explicit, typed control flow makes loops easier to detect at the application level than in more dynamic frameworks.
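An application-level loop check can be as small as a sliding window over recent tool calls. The sketch below is one possible approach, not part of PydanticAI and not RunGuard’s implementation: LoopDetector and LoopDetected are hypothetical names, and the window and threshold values are illustrative.

```python
import json
from collections import deque


class LoopDetected(RuntimeError):
    """Raised when the same tool call repeats too often in the window."""


class LoopDetector:
    """Flags a tool call repeated `threshold` times within the last `window` calls."""

    def __init__(self, window: int = 20, threshold: int = 3) -> None:
        self.recent = deque(maxlen=window)  # oldest entries fall off automatically
        self.threshold = threshold

    def check(self, tool_name: str, **kwargs) -> None:
        """Record one call and raise if it has become a repeat offender."""
        # Canonical key: same tool + same arguments => same string.
        key = json.dumps([tool_name, kwargs], sort_keys=True, default=str)
        self.recent.append(key)
        if self.recent.count(key) >= self.threshold:
            raise LoopDetected(f"{tool_name} repeated with identical arguments")


detector = LoopDetector(window=20, threshold=3)
detector.check("web_search", query="pricing page")
detector.check("web_search", query="pricing page")
detector.check("web_search", query="changelog")  # different arguments, no problem

tripped = False
try:
    detector.check("web_search", query="pricing page")  # third identical call
except LoopDetected:
    tripped = True

print(tripped)
```

Calling check at the top of each tool function is enough to stop a run on the third identical call rather than letting it continue to the framework’s iteration cap.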

Semantic Kernel: enterprise-grade, plugin overhead

Microsoft’s Semantic Kernel targets enterprise integration scenarios. Its plugin architecture means that calling a native function (C# or Python) alongside LLM prompts is a first-class operation, but the planner components (particularly the Handlebars and Stepwise planners) have significant overhead: planning itself requires an LLM call that generates an execution plan before any work starts. For short tasks, the planning call can cost as much as the execution.

Semantic Kernel’s FunctionChoiceBehavior.Auto mode uses the model’s own function-calling capability, which is more efficient than the older planner-based approach. But the framework still sends metadata about every registered plugin with each request, and that payload grows with your plugin count. An agent with 20 registered plugins sends the description of all 20 plugins in every call, even if only one is relevant.
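A quick estimate shows how this overhead adds up. The numbers are assumptions for illustration only: 150 tokens per plugin description and a 10-call run; real descriptions and run lengths vary.

```python
def plugin_overhead_tokens(plugins_sent: int, tokens_per_plugin: int, calls: int) -> int:
    """Input tokens spent on plugin descriptions across a whole run."""
    return plugins_sent * tokens_per_plugin * calls


# Assumed: 150 tokens per plugin description, 10 LLM calls in the run.
all_plugins = plugin_overhead_tokens(plugins_sent=20, tokens_per_plugin=150, calls=10)
relevant_only = plugin_overhead_tokens(plugins_sent=1, tokens_per_plugin=150, calls=10)

print(all_plugins, relevant_only)
```

Under these assumptions, advertising all 20 plugins costs 30,000 input tokens over the run, against 1,500 if only the relevant plugin were sent.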

Cost comparison: key behavioral differences

Framework	LLM calls per step (typical)	Context growth model	Built-in loop detection	Default retry on tool error
LangChain ReAct	1–2	Full history (quadratic)	No (max_iterations cap only)	No (manual)
AutoGen GroupChat	3–5+ (speaker selection overhead)	Per-agent conversation history	No	None (max_consecutive_auto_reply only caps replies per pair)
CrewAI sequential	1 per agent task	Task output passed forward	No	2 retries on tool error
CrewAI hierarchical	2 (agent + manager)	Task output + manager context	No	2 retries on tool error
LlamaIndex ReAct	2–3 (retrieval + synthesis)	Full chat history	No	Configurable
PydanticAI	1–2 (select + synthesize)	Explicit message history	No	ValidationError → explicit
Semantic Kernel	2 (plan + execute)	Plugin metadata always in prompt	No	Configurable

The critical observation: none of these frameworks include built-in loop detection. They all rely on the developer to add safeguards. This is the gap RunGuard fills.

Adding RunGuard to any agent SDK

RunGuard’s guard() wrapper is framework-agnostic. It wraps individual tool functions regardless of which framework is orchestrating the calls. The pattern is the same whether you’re using LangChain tools, AutoGen function calls, CrewAI tools, or LlamaIndex query tools:

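from runguard import guard, BudgetTracker, LoopDetectedError, BudgetExceededError

tracker = BudgetTracker(max_usd=5.0)  # $5 per-run ceiling

# Wrap any tool function; works with any SDK.
# search_client is your own async search client, defined elsewhere.
@guard(budget=tracker, loop_window=20, loop_threshold=3)
async def web_search(query: str) -> str:
    """Search the web and return the results as text."""
    return await search_client.search(query)

# Then pass the wrapped function as a tool to your framework of choice.
# Import paths and registration calls vary by framework version; check yours.

# LangChain (async functions are passed as `coroutine`):
from langchain.tools import StructuredTool
langchain_tool = StructuredTool.from_function(
    coroutine=web_search,
    name="web_search",
    description="Search the web and return the results as text.",
)

# AutoGen (0.2-style API; assumes `assistant` and `user_proxy` agents already exist):
from autogen import register_function
register_function(
    web_search,
    caller=assistant,      # agent that proposes the call
    executor=user_proxy,   # agent that executes it
    name="web_search",
    description="Search the web and return the results as text.",
)

# CrewAI (BaseTool subclasses must declare name and description):
from crewai.tools import BaseTool
class SearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web and return the results as text."

    # Older CrewAI versions expect a synchronous _run; adapt if yours does.
    async def _run(self, query: str) -> str:
        return await web_search(query)

# LlamaIndex (0.10+ import path; async functions are passed as `async_fn`):
from llama_index.core.tools import FunctionTool
llamaindex_tool = FunctionTool.from_defaults(async_fn=web_search)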

The guard wraps the function, not the framework. It doesn’t care whether LangChain, AutoGen, or CrewAI is calling the function — it sees the call signature, checks the loop window, checks the budget, and either lets the call through or raises before the expensive operation starts. When a loop is detected at iteration 3 instead of iteration 15, you save 80% of the call volume that a framework’s max_iterations cap alone would allow.

SDK selection vs. runtime protection: two different decisions

Choosing the right SDK for your use case is primarily about developer ergonomics, ecosystem fit, and the execution model that matches your task structure. LangChain’s flexibility is valuable when your workflows are diverse. AutoGen’s multi-agent model is the right choice for tasks that genuinely benefit from specialized agent roles collaborating. PydanticAI’s type safety reduces bugs and simplifies debugging.

What SDK selection does not address is the fundamental execution risk: any agent, in any framework, can get into a state where it makes the same call repeatedly, exceeds your cost budget, or continues running after it should have stopped. These risks exist in every framework because they are properties of the agent’s runtime behavior, not of the framework’s design.

Runtime protection and SDK choice are independent dimensions. You pick the SDK that fits your architecture, then add RunGuard to the tools that touch external APIs, LLM endpoints, or any other resource with a cost-per-call. The two layers compose cleanly, and neither constrains the other.

The teams that skip runtime protection are betting that their agents will never loop in production. Given that loop bugs are routinely discovered in production for the first time — because the exact combination of inputs that triggers them never appeared in testing — that bet is rarely worth taking.

Add circuit breaking to your agent SDK today

RunGuard is a five-minute install on top of whatever framework you’re already using. Set a per-run budget, set a loop threshold, wrap your tool functions, and deploy. The SDK handles the rest — including Slack alerts when a circuit trips.

Get started with RunGuard — or read more about token budget enforcement in Python, LLM cost per feature tracking, and implementing a circuit breaker for AI agents.