LLM cost per tenant in multi-tenant SaaS: architecture patterns for isolation, attribution, and billing
LLM API costs behave unlike earlier variable costs in SaaS. They are proportional to usage (like bandwidth), driven by user behavior within a session (like compute cycles), and extremely uneven across tenants (unlike almost any other infrastructure cost). A traditional multi-tenant SaaS might find that its top 10% of tenants generate 30% of infrastructure costs, a manageable skew. In LLM-powered multi-tenant SaaS, the top 5% of tenants routinely generate 60–80% of LLM API costs, because their users ask longer questions, run more complex agent workflows, and trigger more tool calls per session. Without tenant-level cost attribution, you can’t identify which tenants are unprofitable on their current plan, enforce fair-use policies, generate the per-tenant cost reports your enterprise customers expect, or protect your other tenants from the noisy-neighbor effect. This page covers the complete architecture: context propagation, budget enforcement, billing reconciliation, and fair-use policy design for LLM costs in multi-tenant SaaS.
Why tenant-aware cost tracking is essential
- Plan profitability cannot be assessed without per-tenant cost data. If you offer three pricing tiers — Starter at $99/month, Professional at $299/month, and Enterprise at $999/month — you might assume that higher plan tiers are proportionally more profitable. This assumption fails if higher-plan tenants are also proportionally heavier LLM users. An Enterprise tenant paying $999/month who generates $1,200/month in LLM API costs is unprofitable. Without per-tenant cost attribution, you don’t know this customer exists until you notice aggregate margins shrinking at your quarterly business review. With per-tenant cost attribution, you can identify this customer within 30 days of them joining, model their profitability on the current plan vs. a renegotiated plan with usage-based LLM overage pricing, and have a business conversation with their account manager before the unprofitable pattern accumulates over multiple billing periods.
- Enterprise customers increasingly require usage transparency. Your enterprise customers’ procurement teams are asking about AI costs in ways they never asked about SaaS costs before. They want to know: how much of our monthly bill is attributable to AI API usage? Which of our teams or projects are the heaviest AI consumers? What is our cost trend, and how will it change as we roll out AI features to more users? These questions require per-tenant, per-feature, and per-user cost attribution that you need to provide both in your product UI (a “cost and usage” dashboard) and in API-accessible reports for their FinOps tooling. Tenants that cannot get this visibility are increasingly choosing not to renew — AI cost transparency has become a procurement requirement, not just a nice-to-have.
- Noisy-neighbor effects are uniquely severe for LLM workloads. In a traditional multi-tenant SaaS, noisy neighbors slow down other tenants’ API calls or database queries. In an LLM-powered SaaS where you have shared API key infrastructure (many tenants sharing the same provider API key and rate limit), a single tenant’s burst of heavy usage can consume your shared rate limit headroom and cause 429 errors for other tenants. One enterprise tenant running a batch enrichment job that generates 50,000 API calls in an hour can make your other tenants’ interactive sessions start failing. Per-tenant rate limiting, with each tenant’s burst allowance calculated to protect the shared rate limit pool, is the architectural solution. This requires per-tenant tracking infrastructure to implement.
- Regulatory and audit requirements are emerging. As AI governance regulations evolve, the ability to provide per-tenant AI usage records on demand is becoming a compliance requirement in regulated industries (financial services, healthcare, legal). Your enterprise customers in these industries need to be able to demonstrate to their auditors that AI usage is tracked, attributable, and subject to budget controls. Building per-tenant cost tracking as a product feature now positions you for these compliance conversations; building it reactively when a customer demands it during a contract renewal is expensive and slow.
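The plan-profitability check from the first bullet above reduces to simple arithmetic once per-tenant costs exist. The following is a minimal sketch, assuming a hypothetical `TenantMonth` record whose field names are illustrative and not tied to any billing system. It considers LLM cost only and reuses the example prices from the text.

```python
from dataclasses import dataclass


@dataclass
class TenantMonth:
    """Hypothetical per-tenant monthly summary (field names are assumptions)."""
    tenant_id: str
    plan_revenue_usd: float
    llm_cost_usd: float

    @property
    def margin_usd(self) -> float:
        # Revenue minus LLM cost only; other infrastructure costs are ignored.
        return self.plan_revenue_usd - self.llm_cost_usd


def unprofitable(tenants: list[TenantMonth]) -> list[TenantMonth]:
    """Return tenants whose LLM cost exceeds plan revenue, worst margin first."""
    losers = [t for t in tenants if t.margin_usd < 0]
    return sorted(losers, key=lambda t: t.margin_usd)


tenants = [
    TenantMonth("starter-1", 99.0, 12.0),
    TenantMonth("pro-1", 299.0, 80.0),
    TenantMonth("ent-1", 999.0, 1200.0),  # the Enterprise example from the text
]
flagged = unprofitable(tenants)
```

Running a check like this over each tenant's first billing period is one way to surface the unprofitable Enterprise account before the quarterly review does.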
Tenant context propagation architecture
- Propagate tenant ID automatically via middleware, not application code. The most common failure in multi-tenant cost attribution is inconsistent tagging: some LLM calls have the tenant ID attached, others don’t, because the tagging responsibility was left to individual application developers who sometimes forgot. The correct architecture propagates the tenant ID automatically at the infrastructure layer using a middleware pattern. In a web framework, set the tenant ID in a request-scoped context at the authentication middleware layer, before any application code executes. In any code path that makes an LLM API call, the LLM client wrapper reads the tenant ID from the request context automatically. No application developer needs to remember to pass the tenant ID; it’s always present because the wrapper enforces it. If the tenant ID is absent from the context, the LLM client raises a hard error rather than making an untagged call:
```python
import contextvars
from typing import Any, Optional

from openai import OpenAI

# Request-scoped context variable for tenant isolation
_tenant_ctx: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "tenant_id", default=None
)


class TenantContextError(RuntimeError):
    pass


def set_tenant_context(tenant_id: str) -> contextvars.Token:
    """Call this in your auth middleware before handling any request."""
    return _tenant_ctx.set(tenant_id)


def get_tenant_id() -> str:
    """Retrieve current tenant ID, raising if not set."""
    tid = _tenant_ctx.get()
    if not tid:
        raise TenantContextError(
            "No tenant context set. Ensure set_tenant_context() is called "
            "in auth middleware before any LLM call."
        )
    return tid


class TenantAwareLLMClient:
    """
    OpenAI client wrapper that automatically attaches tenant context
    to every call and records cost attribution.
    """

    def __init__(self, openai_client: OpenAI, cost_recorder=None):
        self._client = openai_client
        self._cost_recorder = cost_recorder  # e.g. RunGuard client

    def chat_complete(self, messages: list[dict], model: str, **kwargs) -> Any:
        tenant_id = get_tenant_id()  # raises if not set
        response = self._client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
        # Record cost attribution with tenant context
        if self._cost_recorder and response.usage:
            self._cost_recorder.record(
                tenant_id=tenant_id,
                model=model,
                input_tokens=response.usage.prompt_tokens,
                output_tokens=response.usage.completion_tokens,
            )
        return response
```

This pattern gives 100% tagging coverage for every call routed through the wrapper, because the middleware injects the tenant ID before any application code can make an LLM call and the wrapper enforces its presence. An integration test that calls `chat_complete` without setting tenant context will fail loudly, catching untagged call paths before they reach production.
- Propagate tenant context through thread pools and background workers. Within asyncio, tasks created with `asyncio.create_task()` run in a copy of the current context, so the tenant ID follows them automatically. Thread pools and background task queues (Celery, RQ, Dramatiq) do not carry context variables unless you handle them explicitly. When handing work to a thread pool, for example through `loop.run_in_executor()`, capture the caller's context with `contextvars.copy_context()` and invoke the callable through its `run()` method. For background workers, serialize the tenant ID as part of the task payload and restore it using `set_tenant_context()` at the beginning of the task handler. Never rely on the context variable surviving across process boundaries; always treat background task context as explicitly passed data.
- Handle cross-tenant administrative operations explicitly. Some LLM operations are genuinely cross-tenant: platform-level analytics, tenant health reports generated by your own infrastructure, model fine-tuning that uses data from multiple tenants. These operations should be tagged with a reserved `tenant_id` value like `__platform__` rather than being left untagged or incorrectly attributed to a real tenant. This ensures that platform-level LLM costs appear in your aggregate cost reports and are separated from tenant-attributable costs when calculating per-tenant profitability.
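The two propagation techniques described above (copying the context into a worker thread, and passing the tenant ID as explicit payload data to a background worker) can be sketched with the standard library alone. This is an illustration under stated assumptions: `tenant_ctx`, `submit_with_context`, `enqueue_payload`, and `handle_task` are hypothetical names, and a real Celery, RQ, or Dramatiq handler would replace the direct function call.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Stand-in for the request-scoped variable read by the LLM client wrapper.
tenant_ctx: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "tenant_id", default=None
)


def current_tenant() -> Optional[str]:
    return tenant_ctx.get()


def submit_with_context(pool: ThreadPoolExecutor, fn, *args):
    """Run fn in a worker thread inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


def enqueue_payload(job_args: dict) -> dict:
    """Build a queue payload that carries the tenant ID as explicit data."""
    return {"tenant_id": tenant_ctx.get(), "args": job_args}


def handle_task(payload: dict) -> Optional[str]:
    """Worker-side handler: restore tenant context, then reset it afterwards."""
    token = tenant_ctx.set(payload["tenant_id"])
    try:
        return current_tenant()  # real code would call the LLM wrapper here
    finally:
        tenant_ctx.reset(token)


# Simulate a request: set context, fan out to a thread, enqueue a job.
token = tenant_ctx.set("tenant_abc123")
with ThreadPoolExecutor(max_workers=1) as pool:
    carried = submit_with_context(pool, current_tenant).result()
payload = enqueue_payload({"doc_id": 42})
tenant_ctx.reset(token)

# Simulate the worker, which starts with no ambient tenant context.
restored = handle_task(payload)
```

Resetting the token in a `finally` block matters in long-lived workers: without it, one task's tenant ID could leak into the next task handled by the same worker.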
Per-tenant budget enforcement patterns
- Three-tier enforcement: daily cap, burst allowance, and monthly ceiling. Per-tenant budgets should operate at three time granularities. The monthly ceiling is the ultimate limit beyond which the tenant’s plan does not allow further LLM usage without entering an overage arrangement; it is the contractually agreed usage cap for their plan tier. The daily cap prevents a single catastrophic day of usage from consuming the entire monthly ceiling in 24 hours (set this to 20–25% of the monthly ceiling to allow sustained high usage while preventing single-day spikes). The burst allowance is a short-window (5–15 minute) rate limit that prevents the tenant from spiking at a rate that would cause noisy-neighbor effects for other tenants: even if they have monthly budget remaining, they cannot consume it all in a burst that saturates your shared API rate limit. All three must be enforced simultaneously, with the burst allowance checked most frequently (on every request), the daily cap checked on session start, and the monthly ceiling checked on session start and by a low-frequency background job.
- Different enforcement responses for different tenant tiers. A Starter plan tenant hitting their monthly ceiling should receive a hard stop and an upgrade prompt. An Enterprise tenant hitting their monthly ceiling should trigger an automated notification to their account manager and a 48-hour grace period before enforcement tightens, because enterprise contracts typically include an explicit overage negotiation process. A tenant hitting their burst allowance should receive a 429 with a `Retry-After` header. This is rate limiting, not budget exhaustion, and the request should succeed once the client retries after the indicated delay (for example, 60 seconds later). Encoding these different responses into the enforcement layer requires knowing the tenant’s plan tier at enforcement time, not just their budget status.
- Proactive alerting to tenants as they approach ceilings. Budget enforcement that activates only at 100% of ceiling is a bad product experience. Configure proactive outreach at 75% and 90% of monthly ceiling, surfaced both as in-product notifications and as webhook events that trigger automated emails or CRM updates. Enterprise tenants should have a dedicated budget dashboard in your admin portal that shows real-time spend against ceiling, cost trend for the current month, projected end-of-month spend, and the ability to set their own internal cost alerts below your platform’s ceiling. This self-service visibility reduces support tickets and increases the tenant’s sense of control over their AI costs.
- Tenant-level circuit breakers for runaway sessions. Even within their budget ceiling, a tenant can have a runaway agent session that consumes 50% of their monthly budget in a single session due to a bug or an edge-case input. Per-tenant circuit breakers that cap individual session costs (e.g., no single session can consume more than 10% of the tenant’s monthly budget) protect both the tenant (who doesn’t want a bug to exhaust their budget) and your infrastructure (preventing a single session from causing noisy-neighbor effects). Expose these circuit breaker thresholds as configurable tenant settings, allowing enterprise tenants to set their own per-session caps for their specific use cases.
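The enforcement layers in this section (burst allowance, per-session circuit breaker, daily cap, monthly ceiling, and 75%/90% alerts) can be combined in one small in-memory sketch. Everything here is an assumption made for illustration: the `TenantBudget` and `Decision` names, the sliding-window burst counter, and the `burst_limit` default are hypothetical. For brevity it evaluates every tier on each request, whereas the text checks the daily and monthly tiers at session start. A production version would keep its counters in shared storage, and mapping each decision to a tier-specific response is left to the caller.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    THROTTLE = "throttle"            # burst window exceeded: 429 + Retry-After
    BLOCK_SESSION = "block_session"  # per-session circuit breaker tripped
    BLOCK_DAILY = "block_daily"
    BLOCK_MONTHLY = "block_monthly"


@dataclass
class TenantBudget:
    monthly_ceiling_usd: float
    daily_cap_pct: float = 0.20      # 20% of the monthly ceiling, per the text
    session_cap_pct: float = 0.10    # 10% of the monthly ceiling, per the text
    burst_limit: int = 100           # assumed: max requests per burst window
    burst_window_s: float = 300.0    # 5-minute window
    alert_pcts: tuple = (75, 90)
    month_spend: float = 0.0
    day_spend: float = 0.0
    recent: deque = field(default_factory=deque)  # admitted request timestamps

    def check(self, now: float, session_spend: float) -> Decision:
        # Drop timestamps that have aged out of the burst window.
        while self.recent and now - self.recent[0] >= self.burst_window_s:
            self.recent.popleft()
        if len(self.recent) >= self.burst_limit:
            return Decision.THROTTLE
        if session_spend >= self.session_cap_pct * self.monthly_ceiling_usd:
            return Decision.BLOCK_SESSION
        if self.day_spend >= self.daily_cap_pct * self.monthly_ceiling_usd:
            return Decision.BLOCK_DAILY
        if self.month_spend >= self.monthly_ceiling_usd:
            return Decision.BLOCK_MONTHLY
        self.recent.append(now)
        return Decision.ALLOW

    def record(self, cost_usd: float) -> list[int]:
        """Add spend; return the alert thresholds (percent) this call crossed."""
        before = self.month_spend / self.monthly_ceiling_usd * 100
        self.month_spend += cost_usd
        self.day_spend += cost_usd
        after = self.month_spend / self.monthly_ceiling_usd * 100
        return [p for p in self.alert_pcts if before < p <= after]


budget = TenantBudget(monthly_ceiling_usd=500.0, burst_limit=2)
first = budget.check(now=0.0, session_spend=0.0)
second = budget.check(now=1.0, session_spend=0.0)
third = budget.check(now=2.0, session_spend=0.0)    # third request inside window
later = budget.check(now=400.0, session_spend=0.0)  # window has expired
alerts = budget.record(380.0)                       # crosses the 75% threshold
```

Returning the crossed thresholds from `record` lets the caller fire each alert once, at the moment of crossing, rather than on every request above the threshold.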
Billing reconciliation and fair-use policy design
- Generate per-tenant cost reports that reconcile with your provider bill. Your LLM provider bills you for total API usage across all tenants. You need to be able to allocate that bill accurately to each tenant for internal profitability accounting and for customer-facing usage reports. The reconciliation process requires: (1) per-call cost records tagged with tenant ID, with input and output token counts that match the provider’s billing granularity; (2) a monthly reconciliation job that aggregates per-tenant costs and compares the sum against the provider invoice total; (3) a variance tolerance of less than 1% between your tracked aggregate and the provider invoice, with automated alerting if variance exceeds this threshold. A variance above 1% indicates either a tagging gap (calls going out untagged) or a cost calculation error (your per-token pricing doesn’t match the provider’s current pricing). Both need to be caught and corrected before they compound.
```python
import runguard
from calendar import monthrange
from datetime import date

client = runguard.Client()


def pct(part: float, whole: float) -> float:
    """Percentage of whole, returning 0.0 when the denominator is zero or missing."""
    return part / whole * 100 if whole else 0.0


def monthly_tenant_cost_report(year: int, month: int) -> dict:
    """
    Generate a per-tenant cost breakdown for billing reconciliation.
    Returns a dict suitable for export to billing or BI systems.
    """
    _, last_day = monthrange(year, month)
    period_start = date(year, month, 1).isoformat()
    period_end = date(year, month, last_day).isoformat()

    report = client.reports.tenant_cost_breakdown(
        period_start=period_start,
        period_end=period_end,
        include_model_breakdown=True,
        include_feature_breakdown=True,
        include_daily_trend=True,
    )

    output = {
        "period": f"{year}-{month:02d}",
        "total_cost_usd": report.total_cost_usd,
        "total_input_tokens": report.total_input_tokens,
        "total_output_tokens": report.total_output_tokens,
        "unattributed_pct": pct(report.unattributed_cost_usd, report.total_cost_usd),
        "tenants": [],
    }

    # Highest-cost tenants first
    for tenant in sorted(report.tenants, key=lambda t: -t.cost_usd):
        tenant_record = {
            "tenant_id": tenant.tenant_id,
            "cost_usd": tenant.cost_usd,
            "input_tokens": tenant.input_tokens,
            "output_tokens": tenant.output_tokens,
            "session_count": tenant.session_count,
            "cost_per_session": (
                tenant.cost_usd / tenant.session_count if tenant.session_count else 0
            ),
            "plan_ceiling_usd": tenant.plan_ceiling_usd,
            "ceiling_utilization_pct": pct(tenant.cost_usd, tenant.plan_ceiling_usd),
            "model_breakdown": {m.model: m.cost_usd for m in tenant.model_costs},
            "flags": [],  # a list, so one flag never overwrites another
        }
        # Flag tenants above 90% of plan ceiling for account manager review
        if tenant_record["ceiling_utilization_pct"] > 90:
            tenant_record["flags"].append("APPROACHING_CEILING")
        # Flag tenants whose cost exceeds plan revenue (unprofitable)
        if tenant.cost_usd > tenant.plan_revenue_usd:
            tenant_record["flags"].append("UNPROFITABLE")
        output["tenants"].append(tenant_record)

    return output
```

Run this report on the first business day of each month and feed results into your CRM, billing system, and BI tooling. The `unattributed_pct` field is your reconciliation quality metric: keep it below 1% by investigating any period where it rises.
- Fair-use policy design for unlimited plans. “Unlimited” AI features are a marketing statement, not an engineering reality. Every “unlimited” plan has an implicit ceiling based on what the plan tier can sustain profitably, and exceeding that ceiling requires either a fair-use policy or explicit overage pricing. A well-designed fair-use policy specifies: the threshold above which usage is considered excessive (typically 3× the P90 of usage for that plan tier), the consequence for exceeding it (soft notification, account review, overage billing, or service throttling), and the review process for tenants who routinely approach the threshold (proactive account management conversation about plan upgrades or custom pricing). Fair-use policies should be surfaced transparently in your terms of service and summarized in plain language in your pricing page, not buried in legalese that only appears when a tenant has already exceeded the threshold.
- Usage-based overage billing as an alternative to hard limits. For enterprise tenants, a hard ceiling that stops AI functionality is often contractually unacceptable and operationally damaging to their business. An alternative is usage-based overage billing: the tenant’s plan includes a base allocation of LLM cost (e.g., $200/month of AI usage included), and usage above the base allocation is billed at a per-token or per-session rate. This model eliminates the service disruption of hard limits while ensuring that heavy usage is profitable. Implement it by tracking cumulative monthly cost per tenant, detecting when a tenant crosses the base allocation threshold, automatically switching their billing mode to overage, and generating a real-time overage estimate that the tenant’s admin users can view in the product. A tenant who can see their overage accumulating in real time makes informed decisions about usage and is much less likely to be surprised by an overage invoice.
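Two calculations from this section are small enough to show directly: the variance between tracked per-tenant cost and the provider invoice, and the billable overage above an included allocation. This is a rough sketch under stated assumptions. The function names are hypothetical, and the `markup` multiplier is an illustrative simplification of the per-token or per-session overage rate the text describes.

```python
def reconciliation_variance(tracked_by_tenant: dict[str, float],
                            invoice_total_usd: float) -> float:
    """Relative gap between tracked per-tenant cost and the provider invoice."""
    if invoice_total_usd <= 0:
        raise ValueError("invoice total must be positive")
    tracked_total = sum(tracked_by_tenant.values())
    return abs(invoice_total_usd - tracked_total) / invoice_total_usd


def overage_usd(month_cost_usd: float, included_usd: float,
                markup: float = 1.0) -> float:
    """Billable overage: cost above the included allocation, times a markup.

    The markup multiplier is an assumption for illustration only.
    """
    return max(0.0, month_cost_usd - included_usd) * markup


# Tracked costs include the reserved platform bucket alongside real tenants.
tracked = {"tenant_a": 600.0, "tenant_b": 250.0, "__platform__": 130.0}
variance = reconciliation_variance(tracked, invoice_total_usd=1000.0)
needs_investigation = variance > 0.01  # the 1% tolerance from the text

# A tenant with $200 included who has accrued $260 of LLM cost this month.
charge = overage_usd(260.0, included_usd=200.0, markup=1.5)
```

In this example the tracked total falls $20 short of the invoice, a 2% variance, so the alert would fire and the gap would be traced to either untagged calls or stale per-token pricing.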
RunGuard for multi-tenant SaaS
- First-class tenant isolation in the RunGuard data model. RunGuard’s data model treats tenant ID as a first-class attribute on every cost record. Every API call wrapped by the RunGuard client is automatically tagged with the tenant context, stored in a tenant-partitioned data store, and queryable via the RunGuard API with tenant-scoped filters. The tenant isolation is enforced at the query layer: a tenant-scoped API key can only access cost records for its own tenant, enabling you to expose RunGuard’s API directly to your enterprise customers for their own FinOps tooling without risking cross-tenant data exposure. This eliminates the need to build a custom cost reporting API in front of your own cost records database.
```python
import runguard

# Initialize with tenant context propagation enabled
client = runguard.Client(
    auto_propagate_tenant=True,       # reads tenant_id from contextvars automatically
    tenant_context_var="tenant_id",   # name of your contextvars.ContextVar
)

# Configure per-tenant budgets inline or via the RunGuard dashboard
client.tenants.configure_budget(
    tenant_id="tenant_abc123",
    monthly_ceiling_usd=500.00,
    daily_cap_usd=100.00,
    per_session_max_usd=10.00,
    on_ceiling_exceeded="soft_block",  # notify + 48h grace for enterprise
    on_burst_exceeded="throttle",      # 429 with Retry-After
    alert_at_pct=[75, 90],
    alert_webhook="https://your-app.com/webhooks/runguard/budget-alert",
)
```

The `auto_propagate_tenant` configuration eliminates the need to pass tenant IDs explicitly in application code: RunGuard reads the context variable set by your auth middleware automatically, ensuring 100% tagging coverage without requiring developer discipline.
- Automated billing reconciliation reports. RunGuard generates a monthly reconciliation report that breaks down your total provider invoice by tenant, flags the unattributed percentage, identifies unprofitable tenants, and provides a CSV export in the format expected by Stripe, Chargebee, and other billing platforms for automated invoice generation. The reconciliation report is available via the RunGuard API by the second business day of each month, after provider invoices are received and token prices are confirmed. Teams using RunGuard reconciliation reports eliminate the 3–5 day monthly manual allocation process that was previously required to generate per-tenant cost data from raw API logs.
- Noisy-neighbor protection via shared rate limit pooling. RunGuard’s rate limit management layer maintains a real-time view of your remaining provider rate limit headroom and distributes it across tenants using a configurable weighting algorithm (equal share, plan-tier weighted, or historical-usage weighted). When a tenant’s burst activity threatens to exhaust the shared pool, RunGuard applies progressive throttling to that tenant specifically while allowing other tenants to continue at normal rates. This noisy-neighbor protection operates transparently — the affected tenant sees slightly increased latency or occasional retries, not errors — and is logged for the post-event cost attribution report so you can document the protection event to customers who notice the throttling.
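To make the plan-tier weighting idea concrete, here is a generic sketch of splitting a shared requests-per-minute budget in proportion to tenant weights. It is not RunGuard's implementation, which the text does not describe. The function name, the weight values, and the 1,400 requests-per-minute figure are all assumptions chosen for illustration.

```python
def allocate_headroom(total_rpm: int, weights: dict[str, float]) -> dict[str, int]:
    """Split a shared requests-per-minute budget in proportion to tenant weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one tenant needs a positive weight")
    return {
        # Round down so the shares never add up to more than the shared pool.
        tenant: int(total_rpm * weight / total_weight)
        for tenant, weight in weights.items()
    }


# Assumed weights per plan tier; real values would come from plan configuration.
PLAN_WEIGHTS = {"starter": 1.0, "professional": 3.0, "enterprise": 10.0}
tenant_plans = {"t1": "starter", "t2": "professional", "t3": "enterprise"}

shares = allocate_headroom(
    1400, {tid: PLAN_WEIGHTS[plan] for tid, plan in tenant_plans.items()}
)
```

An equal-share policy is the same call with every weight set to 1.0, and a historical-usage policy would pass each tenant's trailing usage as its weight.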
Bring tenant-level visibility to your LLM costs today
RunGuard makes per-tenant LLM cost tracking, budget enforcement, and billing reconciliation a product feature rather than a custom engineering project. Start your free trial and connect your first tenant in under 20 minutes — your first monthly reconciliation report will be ready at the start of next month.