AI agent observability cost dashboard: the four cost dimensions standard APM tools miss
Datadog, New Relic, and every standard APM tool measure latency, error rate, and throughput: the metrics that matter for traditional web services. For LLM agents, these three dimensions are necessary but not sufficient. A request that returns HTTP 200 in 4 seconds may have cost $2.80 in tokens and looped 14 times before producing a usable output. A request that returns in 400ms may have hit a cached prompt and cost $0.002. None of this is visible in a latency histogram. Cost per session, token consumption per turn, budget utilization rate, and loop detection event frequency are the four observability dimensions that determine whether your agent is economically viable at scale, and standard APM dashboards do not show any of them out of the box.

Building a cost dashboard for an LLM agent does not require a separate observability platform, a new data pipeline, or an LLM-specific SaaS layer. It requires instrumenting the one function that touches every LLM call: the call function you wrap with RunGuard. Every call that passes through RunGuard emits structured event data that you can log, aggregate, and visualize with whatever infrastructure you already have.
The four cost dimensions your dashboard must track
- Token consumption per turn and per session. Input tokens and output tokens have different unit costs and different cost drivers. Input tokens grow with context accumulation and retrieval; output tokens grow with verbosity and reasoning depth. Tracking both separately lets you diagnose whether cost increases are driven by context bloat (input growth) or by the model reasoning more (output growth). Per-turn tracking reveals which step in your agent pipeline is most expensive; per-session tracking tells you whether your average session cost fits your business model. For example, at a $0.30 average session cost, a customer on a $19/month plan becomes unprofitable after about 63 sessions in a month ($19 / $0.30), regardless of how many invocations the plan nominally includes.
- Model spend per session in USD. Token counts are a proxy metric. USD spend is the business metric. Different models have dramatically different cost-to-capability ratios: GPT-4o at $2.50/$10 per M tokens vs. Claude Haiku at $0.25/$1.25 vs. Claude Sonnet 4.6 at $3/$15. If your agent auto-selects models based on task complexity, USD spend per session tracks the economic outcome of those routing decisions in a single comparable number. Tracking spend-per-session over time reveals cost inflation trends (growing context → growing cost even with no feature changes) before they compound into a budget crisis.
- Budget utilization rate. If you set a $5 session budget cap, the average session spending $0.40 means 8% budget utilization — which is fine. If the 99th percentile session is spending $4.90, you are one tail event away from a cap trip becoming your primary user-facing error. Budget utilization rate (mean and P99 of spend divided by cap) tells you how much headroom you have before your circuit breaker becomes a user experience problem, not just a cost safety net.
- Loop detection event frequency. Every LoopDetectedError that RunGuard raises is a data point about your agent’s reliability. If loop detection fires on 0.1% of sessions, that is background noise from edge-case inputs. If it fires on 5% of sessions, your agent has a structural reliability problem — possibly a prompt design issue, a tool that returns inconsistent outputs, or an injection attack that is successfully reaching your production agents. Tracking loop event rate by agent version, by tool called at trip time, and by user cohort turns an interrupt into a diagnostic signal.
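
The budget utilization figures described above need nothing more than a list of per-session spend values and the cap. The following is a minimal sketch, not part of RunGuard: the function names `percentile` and `budget_utilization` are hypothetical, and the nearest-rank percentile used here is only one of several reasonable definitions.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def budget_utilization(session_spend_usd: Sequence[float], cap_usd: float) -> Dict[str, float]:
    """Summarize spend / cap ratios across sessions (hypothetical helper)."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    ratios = [spend / cap_usd for spend in session_spend_usd]
    return {
        "mean": sum(ratios) / len(ratios),          # average headroom consumed
        "p99": percentile(ratios, 99),              # tail sessions closest to the cap
        "at_cap_share": sum(r >= 1.0 for r in ratios) / len(ratios),  # share that hit the cap
    }
```

Fed with the `cost_usd` column of the run log and a $5 cap, a session that spent $0.40 contributes a ratio of roughly 0.08, which is the 8% figure used in the example above.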
Python: emitting cost events from RunGuard callbacks
RunGuard’s guard function wraps your LLM call function. In the example below, the wrapped function returns structured data on every call (token counts and USD spend), and loop or budget trips surface as LoopDetectedError and BudgetExceededError. You can add a callback wrapper around the guard to log this data to any sink: SQLite, InfluxDB, CloudWatch, stdout in JSON for a log aggregator, or a simple append-only file.
Python: SQLite cost event logger
```python
import sqlite3
import time
import uuid

import anthropic
from runguard import guard, LoopDetectedError, BudgetExceededError

DB_PATH = "./agent-costs.db"


def init_db():
    """Open the cost database, creating both tables on first use."""
    con = sqlite3.connect(DB_PATH)
    con.execute("""
        CREATE TABLE IF NOT EXISTS agent_runs (
            run_id        TEXT PRIMARY KEY,
            started_at    REAL,
            ended_at      REAL,
            agent_name    TEXT,
            turns         INTEGER,
            input_tokens  INTEGER,
            output_tokens INTEGER,
            cost_usd      REAL,
            loop_trips    INTEGER,
            budget_trips  INTEGER,
            outcome       TEXT
        )
    """)
    con.execute("""
        CREATE TABLE IF NOT EXISTS call_events (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            run_id     TEXT,
            turn       INTEGER,
            ts         REAL,
            input_tok  INTEGER,
            output_tok INTEGER,
            cost_usd   REAL,
            sig        TEXT,
            event_type TEXT  -- 'call', 'loop_trip', 'budget_trip'
        )
    """)
    con.commit()
    return con


client = anthropic.Anthropic()


def call_claude(messages: list) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=messages,
    )
    # $3 per million input tokens, $15 per million output tokens
    usd = (response.usage.input_tokens * 3.0
           + response.usage.output_tokens * 15.0) / 1_000_000
    return {
        "response": response,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "usd": usd,
    }


base_guard = guard(
    call_claude,
    budget={"max_usd": 5.0},
    loop={"repeats": 3, "max_cycle_len": 5},
    per_call_budget={"max_usd": 0.50},
)


def instrumented_agent(user_query: str, agent_name: str = "default") -> str:
    con = init_db()
    run_id = str(uuid.uuid4())
    started = time.time()
    messages = [{"role": "user", "content": user_query}]
    total_input_tok = 0
    total_output_tok = 0
    total_usd = 0.0
    loop_trips = 0
    budget_trips = 0
    outcome = "completed"
    turn = 0
    try:
        for turn in range(20):
            result = base_guard(messages)
            response = result["response"]

            # Log this call
            call_usd = result.get("usd", 0.0)
            in_tok = result.get("input_tokens", 0)
            out_tok = result.get("output_tokens", 0)
            total_input_tok += in_tok
            total_output_tok += out_tok
            total_usd += call_usd
            con.execute(
                "INSERT INTO call_events VALUES (NULL,?,?,?,?,?,?,?,?)",
                (run_id, turn, time.time(), in_tok, out_tok, call_usd, None, "call"),
            )
            con.commit()

            # Check for tool use; no tool calls means this is the final answer
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            if not tool_calls:
                return "\n".join(b.text for b in response.content if hasattr(b, "text"))

            messages.append({"role": "assistant", "content": response.content})
            # Placeholder tool results; replace "done" with real tool output
            tool_results = [
                {"type": "tool_result", "tool_use_id": tc.id, "content": "done"}
                for tc in tool_calls
            ]
            messages.append({"role": "user", "content": tool_results})

        # Loop exhausted without a final answer
        outcome = "max_turns"
        return "[max turns reached]"
    except LoopDetectedError as e:
        loop_trips += 1
        con.execute(
            "INSERT INTO call_events VALUES (NULL,?,?,?,?,?,?,?,?)",
            (run_id, turn, time.time(), 0, 0, 0.0, str(e), "loop_trip"),
        )
        outcome = "loop_halted"
        return f"[HALTED: loop detected] {e}"
    except BudgetExceededError as e:
        budget_trips += 1
        con.execute(
            "INSERT INTO call_events VALUES (NULL,?,?,?,?,?,?,?,?)",
            (run_id, turn, time.time(), 0, 0, 0.0, str(e), "budget_trip"),
        )
        outcome = "budget_halted"
        return f"[HALTED: budget exceeded] {e}"
    except Exception:
        # Record unexpected failures so the dashboard can filter them out
        outcome = "error"
        raise
    finally:
        # Always write the run summary, whatever the outcome
        ended = time.time()
        con.execute(
            "INSERT OR REPLACE INTO agent_runs VALUES (?,?,?,?,?,?,?,?,?,?,?)",
            (run_id, started, ended, agent_name, turn + 1,
             total_input_tok, total_output_tok, total_usd,
             loop_trips, budget_trips, outcome),
        )
        con.commit()
        con.close()
```
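
The logger above writes to SQLite inline. If you would rather keep the sink swappable, the same idea can be expressed as a thin wrapper around any call function. This is a minimal sketch under stated assumptions: `with_cost_logging` is a hypothetical helper, not a RunGuard API, and it assumes the wrapped function returns a dict with `usd`, `input_tokens`, and `output_tokens` keys like `call_claude` does.

```python
import time
from typing import Any, Callable, Dict

CostEvent = Dict[str, Any]


def with_cost_logging(
    call_fn: Callable[..., Dict[str, Any]],
    sink: Callable[[CostEvent], None],
    run_id: str,
) -> Callable[..., Dict[str, Any]]:
    """Wrap call_fn so every successful call sends one cost event to sink."""
    turn = 0
    total_usd = 0.0

    def wrapped(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        nonlocal turn, total_usd
        result = call_fn(*args, **kwargs)  # exceptions propagate to the caller unchanged
        call_usd = float(result.get("usd", 0.0))
        total_usd += call_usd
        sink({
            "run_id": run_id,
            "turn": turn,
            "ts": time.time(),
            "input_tokens": result.get("input_tokens", 0),
            "output_tokens": result.get("output_tokens", 0),
            "cost_usd": call_usd,
            "total_cost_usd": total_usd,
            "event_type": "call",
        })
        turn += 1
        return result

    return wrapped
```

The sink can be anything callable: `events.append` in a test, a function that prints `json.dumps(event)` for a log aggregator, or one that runs the `INSERT INTO call_events` statement shown earlier.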
Query: cost metrics for the last 7 days from the SQLite dashboard
```sql
-- Average cost per session, average token counts, and loop/budget trip rates: last 7 days.
-- unixepoch() requires SQLite 3.38 or later.
SELECT
  agent_name,
  COUNT(*)                                        AS sessions,
  ROUND(AVG(cost_usd), 4)                         AS avg_cost_usd,
  ROUND(AVG(input_tokens), 0)                     AS avg_input_tok,
  ROUND(AVG(output_tokens), 0)                    AS avg_output_tok,
  ROUND(100.0 * SUM(loop_trips) / COUNT(*), 2)    AS loop_trip_pct,
  ROUND(100.0 * SUM(budget_trips) / COUNT(*), 2)  AS budget_trip_pct,
  ROUND(SUM(cost_usd), 4)                         AS total_spend_usd
FROM agent_runs
WHERE started_at > unixepoch('now', '-7 days')
  AND outcome != 'error'
GROUP BY agent_name
ORDER BY total_spend_usd DESC;

-- Per-turn cost profile: which turn is most expensive?
-- Averages cover 'call' rows only, so zero-cost trip rows do not dilute them.
SELECT
  turn,
  SUM(CASE WHEN event_type = 'call' THEN 1 ELSE 0 END)         AS calls,
  ROUND(AVG(CASE WHEN event_type = 'call' THEN cost_usd END), 5)   AS avg_cost_usd,
  ROUND(AVG(CASE WHEN event_type = 'call' THEN input_tok END), 0)  AS avg_input_tok,
  ROUND(AVG(CASE WHEN event_type = 'call' THEN output_tok END), 0) AS avg_output_tok,
  SUM(CASE WHEN event_type = 'loop_trip' THEN 1 ELSE 0 END)    AS loop_trips,
  SUM(CASE WHEN event_type = 'budget_trip' THEN 1 ELSE 0 END)  AS budget_trips
FROM call_events
WHERE ts > unixepoch('now', '-7 days')
GROUP BY turn
ORDER BY turn;
```
The second query is the most operationally useful: it tells you which turn number in your agent’s loop has the highest average cost. If turn 8 costs 4× turn 1, your agent is accumulating context faster than expected and will hit the context window limit sooner than your integration tests showed. This is the kind of cost inflation that is invisible at the API boundary but immediately visible in a per-turn cost profile.
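
The "turn 8 costs 4× turn 1" check can be automated rather than read off a result set. Below is a minimal sketch, assuming the `call_events` schema from the logger above; `turn_cost_ratios` and `flag_inflated_turns` are hypothetical helper names, and the 4× threshold is simply the figure from the paragraph above, not a recommended default.

```python
import sqlite3
from typing import List, Tuple


def turn_cost_ratios(con: sqlite3.Connection) -> List[Tuple[int, float, float]]:
    """Return (turn, avg_cost_usd, ratio_to_first_turn) for every turn with logged calls."""
    rows = con.execute(
        "SELECT turn, AVG(cost_usd) FROM call_events "
        "WHERE event_type = 'call' GROUP BY turn ORDER BY turn"
    ).fetchall()
    # Without a non-zero baseline there is nothing meaningful to compare against
    if not rows or not rows[0][1]:
        return []
    baseline = rows[0][1]  # average cost of the earliest turn present
    return [(turn, avg, avg / baseline) for turn, avg in rows]


def flag_inflated_turns(con: sqlite3.Connection, threshold: float = 4.0) -> List[int]:
    """List turns whose average cost is at least `threshold` times the first turn's."""
    return [turn for turn, _avg, ratio in turn_cost_ratios(con) if ratio >= threshold]
```

Running `flag_inflated_turns(sqlite3.connect("./agent-costs.db"))` on a schedule gives an early warning when context accumulation starts to dominate session cost.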
TypeScript: structured cost logging with JSON output
TypeScript: JSON cost event emitter for log aggregators
```typescript
import { guard, LoopDetectedError, BudgetExceededError } from "@runguard/sdk";
import Anthropic from "@anthropic-ai/sdk";
import * as crypto from "crypto";

const client = new Anthropic();

async function callClaude(messages: Anthropic.MessageParam[]) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages,
  });
  // $3 per million input tokens, $15 per million output tokens
  const usd =
    (response.usage.input_tokens * 3.0 + response.usage.output_tokens * 15.0) / 1_000_000;
  return {
    response,
    inputTokens: response.usage.input_tokens,
    outputTokens: response.usage.output_tokens,
    usd,
  };
}

const baseGuard = guard(callClaude, {
  budget: { maxUsd: 5.0 },
  loop: { repeats: 3, maxCycleLen: 5 },
  perCallBudget: { maxUsd: 0.50 },
});

interface CostEvent {
  run_id: string;
  ts: number;
  event: "call" | "loop_trip" | "budget_trip" | "session_end";
  turn?: number;
  input_tokens?: number;
  output_tokens?: number;
  cost_usd?: number;
  total_cost_usd?: number;
  outcome?: string;
  detail?: string;
}

function emit(event: CostEvent): void {
  // Replace with: write to CloudWatch, InfluxDB, Datadog custom metric, etc.
  process.stdout.write(JSON.stringify(event) + "\n");
}

export async function instrumentedAgent(query: string, agentName = "default"): Promise<string> {
  const runId = crypto.randomUUID();
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: query }];
  let totalCost = 0;
  let turn = 0;

  try {
    for (turn = 0; turn < 20; turn++) {
      const result = await baseGuard(messages);
      totalCost += result.usd ?? 0;
      emit({
        run_id: runId,
        ts: Date.now(),
        event: "call",
        turn,
        input_tokens: result.inputTokens,
        output_tokens: result.outputTokens,
        cost_usd: result.usd,
      });

      const { response } = result;
      const toolCalls = response.content.filter(b => b.type === "tool_use");

      // No tool calls means this is the final answer
      if (!toolCalls.length) {
        const text = response.content
          .filter(b => b.type === "text")
          .map(b => (b as Anthropic.TextBlock).text)
          .join("\n");
        emit({ run_id: runId, ts: Date.now(), event: "session_end", total_cost_usd: totalCost, outcome: "completed" });
        return text;
      }

      messages.push({ role: "assistant", content: response.content });
      // Placeholder tool results; replace "done" with real tool output
      messages.push({
        role: "user",
        content: toolCalls.map(tc => ({
          type: "tool_result" as const,
          tool_use_id: (tc as Anthropic.ToolUseBlock).id,
          content: "done",
        })),
      });
    }
  } catch (e) {
    if (e instanceof LoopDetectedError) {
      emit({ run_id: runId, ts: Date.now(), event: "loop_trip", turn, detail: e.message, total_cost_usd: totalCost, outcome: "loop_halted" });
      return `[HALTED: loop] ${e.message}`;
    }
    if (e instanceof BudgetExceededError) {
      emit({ run_id: runId, ts: Date.now(), event: "budget_trip", turn, detail: e.message, total_cost_usd: totalCost, outcome: "budget_halted" });
      return `[HALTED: budget] ${e.message}`;
    }
    throw e;
  }

  emit({ run_id: runId, ts: Date.now(), event: "session_end", total_cost_usd: totalCost, outcome: "max_turns" });
  return "[max turns reached]";
}
```
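
On the consuming side, the JSON lines written by `emit` can be rolled up per run with a few lines of code. The following is a minimal sketch in Python, assuming one JSON object per line with the `run_id`, `event`, `cost_usd`, and `outcome` fields from the `CostEvent` interface above; `summarize_runs` is a hypothetical helper name, and a real log aggregator would normally do this with its own query language.

```python
import json
from typing import Any, Dict, Iterable


def summarize_runs(lines: Iterable[str]) -> Dict[str, Dict[str, Any]]:
    """Aggregate JSON-lines cost events into one summary per run_id."""
    runs: Dict[str, Dict[str, Any]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip log lines that are not JSON events
        if not isinstance(event, dict) or "run_id" not in event:
            continue
        run = runs.setdefault(
            event["run_id"], {"calls": 0, "cost_usd": 0.0, "outcome": None}
        )
        kind = event.get("event")
        if kind == "call":
            run["calls"] += 1
            run["cost_usd"] += event.get("cost_usd") or 0.0
        elif kind in ("loop_trip", "budget_trip", "session_end"):
            # Terminal events carry the outcome of the run
            run["outcome"] = event.get("outcome")
    return runs
```

Passing an open log file (for example `summarize_runs(open("agent.log"))`, where the file name is illustrative) yields a dict keyed by run ID, from which loop trip rate and average session cost follow directly.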
Agent cost observability: approach comparison
| Approach | Token-level cost tracking | Per-session USD spend | Loop event detection | Budget utilization rate |
|---|---|---|---|---|
| Standard APM (Datadog, New Relic) | Not natively — requires custom metrics | Not natively — requires manual tagging | No — no concept of LLM loop events | No |
| LLM observability SaaS (Langfuse, LangSmith) | Yes — trace token counts per span | Yes — aggregated per trace | Partial — post-hoc trace analysis, not real-time halt | No — no budget cap concept, no utilization metric |
| RunGuard + SQLite callback | Yes — per-call input/output token counts | Yes — aggregated across all calls in session | Yes — real-time; LoopDetectedError fired and logged at trip time | Yes — spend / cap ratio queryable per session |
For real-time prevention of costs that would require dashboard intervention after the fact, see prevent AI agent runaway cost in real time. For the specific cost patterns that appear in multi-agent systems and require per-agent tracking, see multi-agent orchestration cost control.
Add cost observability to your AI agent today
RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. Wrap your LLM call function with guard(), add a callback layer that logs the structured result to SQLite or your log aggregator, and run the dashboard queries above to see per-session cost, per-turn cost profile, and loop event frequency within one session of instrumentation. No separate data pipeline required, no new infrastructure, no data leaving your environment.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds shared dashboards, Slack and PagerDuty webhook alerts, and multi-user audit log. Both plans include a 14-day free trial — no credit card required.
Start your 14-day free trial — or explore related patterns: prevent AI agent runaway cost in real time, autonomous agent cost control best practices, set max cost per LLM request, multi-agent orchestration cost control, and AI agent cost per user session.