AI agent cost per user session: the unit economics metric every AI product team needs to track

When you ship an AI agent to real users, cost-per-session becomes as important as cost-per-acquisition. A free-tier user who runs 10 agent sessions per day at $0.30 per session costs your company $3/day — more than most SaaS products charge for free-tier users in a month. At scale, untracked per-user LLM spend can make a growing AI product less profitable at higher user counts, not more. This page explains how to measure cost per user session accurately, how to set per-user caps that protect margins without degrading experience, and how to identify the power users and edge cases that drive 80% of your LLM bill.

Why per-user cost tracking is hard

Shared API keys aggregate everything. Most teams start with a single OpenAI API key. The billing dashboard shows total spend but provides no per-user, per-session, or per-feature attribution. By the time you notice a spike, you don’t know which users, features, or input patterns caused it. Retroactively attributing spend requires log correlation that most teams never set up.
The heavy-user skew. LLM cost distributions are heavily right-skewed. In a typical AI product, 10% of users generate 60–70% of token spend. These power users either run many sessions, use features that trigger multi-step agent loops, or enter inputs that produce long chains of tool calls and model reasoning. Without per-user tracking, you can’t identify these users, model the cost at different price points, or apply differentiated limits between tiers.
System prompt inflation is invisible. Your system prompt is included in every call. A 4k-token system prompt costs $0.01 per call at GPT-4o input pricing. For a user who triggers 50 calls per session (common for tool-heavy agents), the system prompt alone costs $0.50 per session. As your system prompt grows with new instructions and few-shot examples, per-session cost increases without any change in user behavior — and without per-user tracking, this inflation is invisible.
Sessions are not requests. A single user session may consist of 1–50 LLM requests depending on what the user asks. Cost per API request is a poor proxy for cost per session because it doesn’t account for the session’s total interaction depth. A user who asks one complex task that triggers a 20-step agent loop costs the same as a user who has 20 simple one-step requests — but their experience, value delivered, and billing implications are completely different.

How to track cost per user session

The session-scoped guard pattern. The cleanest approach is to create a new RunGuard instance per user session, tagged with the user and session ID. The guard accumulates cost for all LLM calls within the session and exposes the total on session end:

from runguard import guard, BudgetExceededError
import uuid

class AgentSession:
    def __init__(self, user_id: str, plan: str = "free"):
        self.user_id = user_id
        self.session_id = str(uuid.uuid4())
        self.plan = plan

        # Plan-based budget caps
        per_session_limits = {
            "free": 0.15,    # 15 cents — about 5–10 agent steps
            "starter": 0.50,  # 50 cents — ~20–30 steps
            "pro": 2.00,     # $2 — ~100 steps
        }
        self._budget = per_session_limits.get(plan, 0.15)
        self._spent = 0.0
        self._calls = 0

    def create_guard(self, llm_fn):
        """Wrap an LLM call function with this session's budget guard."""
        async def tracked_call(*args, **kwargs):
            response = await llm_fn(*args, **kwargs)
            usage = response.usage
            usd = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000
            sig = extract_tool_sig(response)
            return {"response": response, "usd": usd, "sig": sig}

        return guard(
            tracked_call,
            budget={"max_usd": self._budget},
            loop={"repeats": 3, "window": 6},
            on_budget_exceeded="raise",
        )

    def log_session_end(self, spent_usd: float, error: str = None):
        import json, datetime
        record = {
            "user_id": self.user_id,
            "session_id": self.session_id,
            "plan": self.plan,
            "spent_usd": round(spent_usd, 6),
            "budget_usd": self._budget,
            "budget_used_pct": round(spent_usd / self._budget * 100, 1),
            "error": error,
            "ts": datetime.datetime.utcnow().isoformat() + "Z",
        }
        # Write to your analytics store
        print(json.dumps(record))  # replace with your logging backend

Pulling cost from usage metadata. The OpenAI API returns token counts in response.usage. To convert to USD cost: (prompt_tokens * input_price_per_token) + (completion_tokens * output_price_per_token). For GPT-4o (as of early 2026): input $2.50/M, output $10.00/M, cached input $1.25/M. Update these constants when model pricing changes — they’re not returned by the API and must be maintained by your code.
Attributing tool call costs. Tool calls themselves don’t have LLM cost, but the model call that processes tool results does — and it carries the full tool result in the context, which adds to input tokens. The correct attribution model: the cost of processing a tool result belongs to the step that called the tool, not the step that received the result. RunGuard attributes all cost at the model-call level, which is the most accurate granularity available from the API.

Setting per-user budget caps by plan tier

The 3x rule for cap sizing. Your per-session cap should be set at approximately 3x the 90th-percentile session cost for users on that plan. This gives heavy-but-legitimate users enough headroom while capping the edge cases (runaway loops, adversarial inputs, unusually complex tasks) at a point where they don’t break your economics. Tighter caps frustrate legitimate power users; looser caps expose you to cost attacks.
Free tier: tight cap + graceful degradation. Free-tier users should hit a hard cap quickly and receive a clear, helpful error message that explains the limit and shows how to upgrade. A well-designed cap experience converts free-tier heavy users to paid — they hit the limit doing something valuable, which is exactly when upsell messaging works. A badly designed cap (generic error, no upgrade path) just creates churn.
Pro tier: soft cap + alert, hard cap later. For paid users, a soft cap at 75% of the limit fires an alert (log the event, optionally notify the user) and a hard cap at 150% of the nominal limit. Paid users who consistently hit the 75% soft cap are candidates for a higher-tier plan — flag them for sales or in-app upsell. Users who occasionally spike to 150% are doing something unusual but legitimately expensive — let it through but log it for anomaly review.
Enterprise: per-contract limits, not fixed caps. Enterprise customers negotiate custom contracts with usage commitments. Apply RunGuard’s per-session guard at 5–10x the contracted per-session rate as a safety ceiling against runaway loops, but don’t enforce the contract rate in real-time — that creates reliability issues for legitimate heavy workloads. Bill against the contract monthly, not per-session.

Reducing cost per session without degrading quality

Use a cascade model strategy. Route simple queries (short input, no tool use, conversational) to a cheap model (gpt-4o-mini at $0.15/M). Route complex queries (tool use, long context, multi-step reasoning) to the full model. A cascade strategy reduces median per-session cost by 40–60% for most agents without affecting the output quality that users actually notice — because most user queries are simpler than your worst-case design scenarios.
Trim system prompts aggressively. Audit your system prompt for instructions that are never followed, few-shot examples that don’t improve output quality, and formatting instructions that the model ignores. Every 1k tokens removed from your system prompt saves $0.0025 per call at GPT-4o pricing — for an agent that processes 100k sessions/month with 10 calls per session, that’s $2,500/month per 1k system prompt tokens eliminated.
Cache identical tool results. Many agent tool calls are idempotent for the same input — fetching the same URL, running the same database query. A simple LRU cache on tool results with a 5-minute TTL can eliminate 20–40% of tool result processing calls for typical workloads. Combine with OpenAI’s prompt caching feature (automatic for 1024+ token identical prefixes) to cache the portion of context that doesn’t change between calls.
Detect and short-circuit loops before they compound. The biggest reduction in cost per session comes from detecting runaway loops early. A loop that runs for 3 extra iterations at $0.10/iteration costs $0.30 — the same as an entire legitimate session for many users. RunGuard’s loop detection at 3 repeats prevents the loop tax on your per-session cost, keeping the 95th-percentile cost close to the median.

Per-user cost management approaches

Approach	Granularity	Real-time cap?	Implementation effort
OpenAI usage dashboard	Account-level only	No (weekly alerts)	None — but useless for attribution
Log + attribute post-hoc	Per request	No	Medium — needs log pipeline
Session-scoped RunGuard	Per session + per user	Yes — fires before limit exceeded	Low — one guard per session
Database usage counters	Per user + per day	Soft (check before each session)	Medium — DB read/write per call
Middleware rate limiter	Per request (not per session)	Yes	Low — but doesn’t account for per-call cost variability