AI agent observability cost dashboard: the four cost dimensions standard APM tools miss

Datadog, New Relic, and other standard APM tools measure latency, error rate, and throughput: the metrics that matter for traditional web services. For LLM agents, these three dimensions are necessary but not sufficient. A request that returns HTTP 200 in 4 seconds may have cost $2.80 in tokens and looped 14 times before producing a usable output. A request that returns in 400ms may have hit a cached prompt and cost $0.002. None of this is visible in a latency histogram. Cost per session, token consumption per turn, budget utilization rate, and loop detection event frequency are the four observability dimensions that determine whether your agent is economically viable at scale, and standard APM dashboards show none of them out of the box.

Building a cost dashboard for an LLM agent does not require a separate observability platform, a new data pipeline, or an LLM-specific SaaS layer. It requires instrumenting the one function that touches every LLM call: the call function you wrap with RunGuard. Every call that passes through RunGuard emits structured event data that you can log, aggregate, and visualize with whatever infrastructure you already have.

The four cost dimensions your dashboard must track

Token consumption per turn and per session. Input tokens and output tokens have different unit costs and different cost drivers. Input tokens grow with context accumulation and retrieval; output tokens grow with verbosity and reasoning depth. Tracking both separately lets you diagnose whether cost increases are driven by context bloat (input growth) or by the model reasoning more (output growth). Per-turn tracking reveals which tool call step in your agent pipeline is most expensive; per-session tracking tells you whether your average session cost is within business model constraints (for example, at a $0.30 average session cost, a $19/month subscription covers only about 63 sessions before LLM spend exceeds the revenue from that subscriber).
Model spend per session in USD. Token counts are a proxy metric. USD spend is the business metric. Different models have dramatically different cost-to-capability ratios: GPT-4o at $2.50/$10 per million input/output tokens, Claude 3 Haiku at $0.25/$1.25, and Claude Sonnet 4.6 at $3/$15. If your agent auto-selects models based on task complexity, USD spend per session tracks the economic outcome of those routing decisions in a single comparable number. Tracking spend-per-session over time reveals cost inflation trends (growing context → growing cost even with no feature changes) before they compound into a budget crisis.
Budget utilization rate. If you set a $5 session budget cap, the average session spending $0.40 means 8% budget utilization — which is fine. If the 99th percentile session is spending $4.90, you are one tail event away from a cap trip becoming your primary user-facing error. Budget utilization rate (mean and P99 of spend divided by cap) tells you how much headroom you have before your circuit breaker becomes a user experience problem, not just a cost safety net.
Loop detection event frequency. Every LoopDetectedError that RunGuard raises is a data point about your agent’s reliability. If loop detection fires on 0.1% of sessions, that is background noise from edge-case inputs. If it fires on 5% of sessions, your agent has a structural reliability problem — possibly a prompt design issue, a tool that returns inconsistent outputs, or an injection attack that is successfully reaching your production agents. Tracking loop event rate by agent version, by tool called at trip time, and by user cohort turns an interrupt into a diagnostic signal.
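
As a rough illustration of the budget utilization metric described above, the sketch below computes the mean and P99 of spend divided by cap from a plain list of per-session costs. The function names and the nearest-rank percentile method are assumptions made for this example; they are not part of RunGuard.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def budget_utilization(session_costs_usd, cap_usd):
    """Return mean and P99 session spend as fractions of the session budget cap."""
    mean_cost = sum(session_costs_usd) / len(session_costs_usd)
    return {
        "mean": mean_cost / cap_usd,
        "p99": percentile(session_costs_usd, 99) / cap_usd,
    }

# 98 typical sessions at $0.40 plus two tail sessions at $4.90, against a $5 cap
costs = [0.40] * 98 + [4.90, 4.90]
util = budget_utilization(costs, cap_usd=5.0)
print(f"mean utilization {util['mean']:.1%}, P99 utilization {util['p99']:.1%}")
```

In this made-up sample the mean sits just under 10% of the cap while the P99 sits at 98%, which is the pattern the paragraph above warns about: comfortable on average, one tail event away from a cap trip.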

Python: emitting cost events from RunGuard callbacks

RunGuard’s guard function wraps your LLM call function. In the examples below, the wrapped call returns token counts and USD spend on every call, and loop or budget trips surface as LoopDetectedError and BudgetExceededError. You can add a callback wrapper around the guard to log this data to any sink: SQLite, InfluxDB, CloudWatch, stdout in JSON for a log aggregator, or a simple append-only file.
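
Before the full SQLite logger, here is a minimal sketch of the callback-wrapper idea using the simplest sink, one JSON line per call. It assumes only that the wrapped function returns a dict with `input_tokens`, `output_tokens`, and `usd` keys, as the call function in this article does. `with_cost_logging` and `fake_call` are hypothetical names for illustration and do not use any RunGuard API.

```python
import io
import json
import sys
import time
from typing import Any, Callable, TextIO

def with_cost_logging(call_fn: Callable[..., dict], sink: TextIO = sys.stdout) -> Callable[..., dict]:
    """Wrap call_fn so each result's token and cost fields are written to sink as one JSON line."""
    def wrapped(*args: Any, **kwargs: Any) -> dict:
        result = call_fn(*args, **kwargs)
        event = {
            "ts": time.time(),
            "input_tokens": result.get("input_tokens", 0),
            "output_tokens": result.get("output_tokens", 0),
            "cost_usd": result.get("usd", 0.0),
        }
        sink.write(json.dumps(event) + "\n")
        return result  # pass the original result through unchanged
    return wrapped

# Usage with a stub standing in for a real LLM call
def fake_call(messages: list) -> dict:
    return {"input_tokens": 120, "output_tokens": 30, "usd": 0.00081}

buffer = io.StringIO()
logged_call = with_cost_logging(fake_call, sink=buffer)
result = logged_call([{"role": "user", "content": "hi"}])
event = json.loads(buffer.getvalue())
```

Swapping `buffer` for `sys.stdout` or an open file handle changes the sink without touching the wrapper.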

Python: SQLite cost event logger

import sqlite3
import time
import uuid
from runguard import guard, LoopDetectedError, BudgetExceededError
import anthropic

DB_PATH = "./agent-costs.db"

def init_db():
    con = sqlite3.connect(DB_PATH)
    con.execute("""
        CREATE TABLE IF NOT EXISTS agent_runs (
            run_id      TEXT PRIMARY KEY,
            started_at  REAL,
            ended_at    REAL,
            agent_name  TEXT,
            turns       INTEGER,
            input_tokens  INTEGER,
            output_tokens INTEGER,
            cost_usd    REAL,
            loop_trips  INTEGER,
            budget_trips INTEGER,
            outcome     TEXT
        )
    """)
    con.execute("""
        CREATE TABLE IF NOT EXISTS call_events (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            run_id      TEXT,
            turn        INTEGER,
            ts          REAL,
            input_tok   INTEGER,
            output_tok  INTEGER,
            cost_usd    REAL,
            sig         TEXT,
            event_type  TEXT   -- 'call', 'loop_trip', 'budget_trip'
        )
    """)
    con.commit()
    return con

client = anthropic.Anthropic()

def call_claude(messages: list) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=messages,
    )
    usd = (response.usage.input_tokens * 3.0 + response.usage.output_tokens * 15.0) / 1_000_000
    return {
        "response":      response,
        "input_tokens":  response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "usd":           usd,
    }

base_guard = guard(
    call_claude,
    budget={"max_usd": 5.0},
    loop={"repeats": 3, "max_cycle_len": 5},
    per_call_budget={"max_usd": 0.50},
)

def instrumented_agent(user_query: str, agent_name: str = "default") -> str:
    con = init_db()
    run_id = str(uuid.uuid4())
    started = time.time()
    messages = [{"role": "user", "content": user_query}]

    total_input_tok = 0
    total_output_tok = 0
    total_usd = 0.0
    loop_trips = 0
    budget_trips = 0
    outcome = "completed"
    turn = 0

    try:
        for turn in range(20):
            result = base_guard(messages)
            response = result["response"]

            # Log this call
            call_usd = result.get("usd", 0.0)
            in_tok   = result.get("input_tokens", 0)
            out_tok  = result.get("output_tokens", 0)
            total_input_tok  += in_tok
            total_output_tok += out_tok
            total_usd        += call_usd

            con.execute(
                "INSERT INTO call_events VALUES (NULL,?,?,?,?,?,?,?,?)",
                (run_id, turn, time.time(), in_tok, out_tok, call_usd, None, "call"),
            )
            con.commit()

            # Check for tool use
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            if not tool_calls:
                text = "\n".join(b.text for b in response.content if hasattr(b, "text"))
                return text

            messages.append({"role": "assistant", "content": response.content})
            # Placeholder tool results; a real agent would execute each tool here
            tool_results = [{"type": "tool_result", "tool_use_id": tc.id, "content": "done"} for tc in tool_calls]
            messages.append({"role": "user", "content": tool_results})

        # Reached only when all 20 turns ran without a final text answer
        outcome = "max_turns"
        return "[max turns reached]"

    except LoopDetectedError as e:
        loop_trips += 1
        con.execute(
            "INSERT INTO call_events VALUES (NULL,?,?,?,?,?,?,?,?)",
            (run_id, turn, time.time(), 0, 0, 0.0, str(e), "loop_trip"),
        )
        outcome = "loop_halted"
        return f"[HALTED: loop detected] {e}"

    except BudgetExceededError as e:
        budget_trips += 1
        con.execute(
            "INSERT INTO call_events VALUES (NULL,?,?,?,?,?,?,?,?)",
            (run_id, turn, time.time(), 0, 0, 0.0, str(e), "budget_trip"),
        )
        outcome = "budget_halted"
        return f"[HALTED: budget exceeded] {e}"

    except Exception:
        # Record unexpected failures so the dashboard query can exclude them
        outcome = "error"
        raise

    finally:
        # Runs on every exit path, including the early returns above
        ended = time.time()
        con.execute(
            "INSERT OR REPLACE INTO agent_runs VALUES (?,?,?,?,?,?,?,?,?,?,?)",
            (run_id, started, ended, agent_name, turn + 1,
             total_input_tok, total_output_tok, total_usd,
             loop_trips, budget_trips, outcome),
        )
        con.commit()
        con.close()
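
The call function above hard-codes Sonnet pricing. If an agent routes between models, one option is a small price table keyed by model, so that USD spend per session stays comparable across routing decisions. The sketch below uses the per-million-token prices quoted earlier in this article. Treat them as a snapshot to verify against current provider pricing. The dictionary keys are illustrative labels rather than guaranteed API model identifiers, and `call_cost_usd` is a hypothetical helper.

```python
# USD per million tokens as (input, output), taken from the figures quoted in this article.
# Keys are illustrative labels; check real model identifiers and current prices before use.
PRICES_PER_MTOK = {
    "gpt-4o": (2.50, 10.00),
    "claude-3-haiku": (0.25, 1.25),
    "claude-sonnet-4-6": (3.00, 15.00),
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call; raises KeyError for a model missing from the table."""
    in_price, out_price = PRICES_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# The same 1,000-input / 500-output call priced on each model
for model in PRICES_PER_MTOK:
    print(model, round(call_cost_usd(model, 1_000, 500), 6))
```

Failing loudly on an unknown model is a deliberate choice here: a silent default price would hide exactly the kind of cost drift the dashboard is meant to surface.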

Query: daily cost metrics from the SQLite dashboard

-- Average cost per session, loop and budget trip rates, total spend: last 7 days
-- Note: unixepoch() requires SQLite 3.38 or later
SELECT
    agent_name,
    COUNT(*)                                         AS sessions,
    ROUND(AVG(cost_usd), 4)                         AS avg_cost_usd,
    ROUND(AVG(input_tokens), 0)                     AS avg_input_tok,
    ROUND(AVG(output_tokens), 0)                    AS avg_output_tok,
    ROUND(100.0 * SUM(loop_trips) / COUNT(*), 2)    AS loop_trip_pct,
    ROUND(100.0 * SUM(budget_trips) / COUNT(*), 2)  AS budget_trip_pct,
    ROUND(SUM(cost_usd), 4)                         AS total_spend_usd
FROM agent_runs
WHERE started_at > unixepoch('now', '-7 days')
  AND outcome != 'error'
GROUP BY agent_name
ORDER BY total_spend_usd DESC;

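-- Per-turn cost profile: which turn is most expensive?
-- Averages cover 'call' rows only; trip rows are logged with zero tokens and cost and would skew them
SELECT
    turn,
    SUM(CASE WHEN event_type = 'call' THEN 1 ELSE 0 END)            AS calls,
    ROUND(AVG(CASE WHEN event_type = 'call' THEN cost_usd END), 5)   AS avg_cost_usd,
    ROUND(AVG(CASE WHEN event_type = 'call' THEN input_tok END), 0)  AS avg_input_tok,
    ROUND(AVG(CASE WHEN event_type = 'call' THEN output_tok END), 0) AS avg_output_tok,
    SUM(CASE WHEN event_type = 'loop_trip'   THEN 1 ELSE 0 END) AS loop_trips,
    SUM(CASE WHEN event_type = 'budget_trip' THEN 1 ELSE 0 END) AS budget_trips
FROM call_events
WHERE ts > unixepoch('now', '-7 days')
GROUP BY turn
ORDER BY turn;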
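
SQLite has no percentile aggregate in its default build, so tail metrics such as P95 session cost are easiest to compute client-side. The sketch below reads `agent_runs` and applies a nearest-rank P95 per agent. The function name is hypothetical, and the demo table holds only the three columns the function reads, not the full schema above.

```python
import sqlite3

def p95_cost_by_agent(con: sqlite3.Connection) -> dict:
    """Nearest-rank P95 of cost_usd per agent_name, computed in Python."""
    rows = con.execute(
        "SELECT agent_name, cost_usd FROM agent_runs "
        "WHERE outcome != 'error' ORDER BY agent_name, cost_usd"
    )
    costs_by_agent: dict = {}
    for agent_name, cost_usd in rows:
        costs_by_agent.setdefault(agent_name, []).append(cost_usd)
    result = {}
    for agent_name, costs in costs_by_agent.items():
        rank = (95 * len(costs) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
        result[agent_name] = costs[rank - 1]  # costs arrive sorted ascending from SQL
    return result

# Demo against an in-memory table with made-up costs
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE agent_runs (agent_name TEXT, cost_usd REAL, outcome TEXT)")
demo_costs = [0.30] * 18 + [1.20, 4.90]
con.executemany(
    "INSERT INTO agent_runs VALUES ('default', ?, 'completed')",
    [(cost,) for cost in demo_costs],
)
p95 = p95_cost_by_agent(con)
```

With 20 demo sessions the nearest-rank P95 is the 19th-smallest cost, which is $1.20 here even though the average is well under $1.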

The second query is the most operationally useful: it tells you which turn number in your agent’s loop has the highest average cost. If turn 8 costs 4× turn 1, your agent is accumulating context faster than expected and will hit the context window limit sooner than your integration tests showed. This is the kind of cost inflation that is invisible at the API boundary but immediately visible in a per-turn cost profile.
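
One way to turn the per-turn profile into an alert is to express each turn's average cost as a multiple of the first turn's. The sketch below assumes the query output has been loaded into a dict mapping turn number to average cost, with turn numbering starting at 0 as in the logger above. The function names and the 4× threshold are illustrative choices, not RunGuard features.

```python
def turn_cost_ratios(avg_cost_by_turn: dict, baseline_turn: int = 0) -> dict:
    """Express each turn's average cost as a multiple of the baseline turn's average cost."""
    baseline = avg_cost_by_turn[baseline_turn]
    if baseline <= 0:
        raise ValueError("baseline turn has no recorded cost")
    return {turn: cost / baseline for turn, cost in sorted(avg_cost_by_turn.items())}

def inflated_turns(avg_cost_by_turn: dict, threshold: float = 4.0, baseline_turn: int = 0) -> list:
    """Return the turns whose average cost is at least threshold times the baseline."""
    ratios = turn_cost_ratios(avg_cost_by_turn, baseline_turn)
    return [turn for turn, ratio in ratios.items() if ratio >= threshold]

# Made-up profile: average cost climbs as context accumulates
profile = {0: 0.010, 1: 0.012, 4: 0.025, 8: 0.041}
ratios = turn_cost_ratios(profile)
flagged = inflated_turns(profile)
```

In this sample only turn 8 crosses the 4× threshold, so it is the point where context trimming or summarization would be worth investigating first.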

TypeScript: structured cost logging with JSON output

TypeScript: JSON cost event emitter for log aggregators

import { guard, LoopDetectedError, BudgetExceededError } from "@runguard/sdk";
import Anthropic from "@anthropic-ai/sdk";
import * as crypto from "crypto";

const client = new Anthropic();

async function callClaude(messages: Anthropic.MessageParam[]) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages,
  });
  const usd = (response.usage.input_tokens * 3.0 + response.usage.output_tokens * 15.0) / 1_000_000;
  return { response, inputTokens: response.usage.input_tokens, outputTokens: response.usage.output_tokens, usd };
}

const baseGuard = guard(callClaude, {
  budget: { maxUsd: 5.0 },
  loop: { repeats: 3, maxCycleLen: 5 },
  perCallBudget: { maxUsd: 0.50 },
});

interface CostEvent {
  run_id: string;
  ts: number;
  event: "call" | "loop_trip" | "budget_trip" | "session_end";
  turn?: number;
  input_tokens?: number;
  output_tokens?: number;
  cost_usd?: number;
  total_cost_usd?: number;
  outcome?: string;
  detail?: string;
}

function emit(event: CostEvent): void {
  // Replace with: write to CloudWatch, InfluxDB, Datadog custom metric, etc.
  process.stdout.write(JSON.stringify(event) + "\n");
}

export async function instrumentedAgent(query: string, agentName = "default"): Promise<string> {
  const runId = crypto.randomUUID();
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: query }];
  let totalCost = 0;
  let turn = 0;

  try {
    for (turn = 0; turn < 20; turn++) {
      const result = await baseGuard(messages);
      totalCost += result.usd ?? 0;

      emit({
        run_id: runId, ts: Date.now(), event: "call", turn,
        input_tokens: result.inputTokens, output_tokens: result.outputTokens,
        cost_usd: result.usd,
      });

      const { response } = result;
      const toolCalls = response.content.filter(b => b.type === "tool_use");
      if (!toolCalls.length) {
        const text = response.content.filter(b => b.type === "text").map(b => (b as Anthropic.TextBlock).text).join("\n");
        emit({ run_id: runId, ts: Date.now(), event: "session_end", total_cost_usd: totalCost, outcome: "completed" });
        return text;
      }
      messages.push({ role: "assistant", content: response.content });
      messages.push({ role: "user", content: toolCalls.map(tc => ({ type: "tool_result" as const, tool_use_id: (tc as Anthropic.ToolUseBlock).id, content: "done" })) });
    }
  } catch (e) {
    if (e instanceof LoopDetectedError) {
      emit({ run_id: runId, ts: Date.now(), event: "loop_trip", turn, detail: e.message, total_cost_usd: totalCost, outcome: "loop_halted" });
      return `[HALTED: loop] ${e.message}`;
    }
    if (e instanceof BudgetExceededError) {
      emit({ run_id: runId, ts: Date.now(), event: "budget_trip", turn, detail: e.message, total_cost_usd: totalCost, outcome: "budget_halted" });
      return `[HALTED: budget] ${e.message}`;
    }
    throw e;
  }
  emit({ run_id: runId, ts: Date.now(), event: "session_end", total_cost_usd: totalCost, outcome: "max_turns" });
  return "[max turns reached]";
}

Agent cost observability: approach comparison

Approach	Token-level cost tracking	Per-session USD spend	Loop event detection	Budget utilization rate
Standard APM (Datadog, New Relic)	Not natively — requires custom metrics	Not natively — requires manual tagging	No — no concept of LLM loop events	No
LLM observability SaaS (Langfuse, LangSmith)	Yes — trace token counts per span	Yes — aggregated per trace	Partial — post-hoc trace analysis, not real-time halt	No — no budget cap concept, no utilization metric
RunGuard + SQLite callback	Yes — per-call input/output token counts	Yes — aggregated across all calls in session	Yes — real-time; LoopDetectedError fired and logged at trip time	Yes — spend / cap ratio queryable per session

To stop runaway costs as they happen rather than discovering them on a dashboard afterwards, see prevent AI agent runaway cost in real time. For the specific cost patterns that appear in multi-agent systems and require per-agent tracking, see multi-agent orchestration cost control.

Add cost observability to your AI agent today

RunGuard installs in one command: pip install runguard for Python, npm install @runguard/sdk for TypeScript. Wrap your LLM call function with guard(), add a callback layer that logs the structured result to SQLite or your log aggregator, and run the dashboard queries above to see per-session cost, per-turn cost profile, and loop event frequency within one session of instrumentation. No separate data pipeline required, no new infrastructure, no data leaving your environment.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds shared dashboards, Slack and PagerDuty webhook alerts, and multi-user audit log. Both plans include a 14-day free trial — no credit card required.

Start your 14-day free trial — or explore related patterns: prevent AI agent runaway cost in real time, autonomous agent cost control best practices, set max cost per LLM request, multi-agent orchestration cost control, and AI agent cost per user session.