vLLM Cost Control for Agent Workloads: KV Cache Thrashing, Queue Saturation, and Runaway Request Loops
vLLM has become the backbone of self-hosted LLM inference for teams running open-weight models at scale. With 65k+ GitHub stars and an OpenAI-compatible API that drops into any agent stack with a single endpoint swap, it's the default choice when you want to run Llama, Mistral, Qwen, or DeepSeek on your own hardware. The promise is compelling: eliminate per-token API fees, run models at marginal GPU electricity cost, and get sub-second latencies that managed APIs can't match for batch workloads.
The cost model, however, is fundamentally different from managed APIs — and that difference creates failure modes that teams often don't encounter until an agent runs overnight and the GPU rental invoice arrives. With OpenAI or Anthropic, a runaway agent loop burns money in direct proportion to the tokens generated; the cost is visible in real-time dashboard spend tracking and rises linearly. With self-hosted vLLM, you're paying for GPU hours regardless of whether those hours produce useful work. A looping agent that sends one request per second for eight hours doesn't just waste its own compute — it saturates the inference server, starves concurrent agents, and keeps all your GPUs hot at full cost while producing output that gets discarded on the next loop iteration.
Four structural failure modes amplify GPU cost for agent workloads on vLLM:
- KV cache thrashing: looping agents send slightly different prompts each iteration (tool outputs change, timestamps update, context accumulates), defeating vLLM's prefix caching; every request pays the full prefill cost instead of getting a cache hit.
- Continuous batching queue saturation: a single runaway agent flooding the server with requests fills the pending-request queue and starves legitimate concurrent agents; by default, vLLM's scheduler serves requests in arrival order and cannot prioritize other agents' requests over those of the agent monopolizing slots.
- Per-session context growth memory pressure: an agent that appends tool outputs to its context window on each step enlarges its KV cache footprint with every step; past a threshold, vLLM begins preempting other requests to make room, reducing effective batch size server-wide.
- Speculative decoding waste on agentic token patterns: vLLM's speculative decoding accelerates generation by predicting multiple tokens ahead with a small draft model; agents generating structured JSON tool calls produce token sequences the draft model cannot predict, leading to high rejection rates and wasted draft computation.
vLLM's cost model for agent workloads
Understanding why these failure modes are expensive requires understanding how the cost of running vLLM accrues in practice. Whether you're renting GPU instances from AWS, Lambda Labs, Vast.ai, or CoreWeave, or running on-premises, the billing unit is compute time, typically GPU-hours for an A100 or H100 instance. A single A100 80GB costs roughly $2–4/hour on spot instances and $4–8/hour on demand in mid-2026. vLLM maximizes GPU utilization through two mechanisms that make it efficient for batch workloads but create specific cost pathologies for agent workloads.
PagedAttention with prefix caching is vLLM's foundational optimization. Rather than reserving one contiguous KV cache region for each sequence, vLLM allocates fixed-size physical memory blocks on demand and maps them to logical positions in the sequence. When multiple requests share a common prefix (the same system prompt, for example), vLLM detects the shared prefix by comparing block hashes and reuses the same physical KV blocks for all requests that share it. The result: an agent sending ten requests with the same 2,000-token system prompt pays prefill computation only once; subsequent requests skip prefill for the shared prefix entirely. Cache hit rates above 80% are common in production for single-agent workloads with stable system prompts. The failure mode arises when the agent loop mutates content near the start of the prompt: injecting tool outputs, timestamps, or step counts ahead of otherwise stable content changes the hash of every block that follows, so each request misses the prefix cache.
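To make the prefix-matching behavior concrete, here is a minimal sketch of block-level prefix hashing. It is a simplified model and not vLLM's actual implementation: the 16-token block size, the helper names `block_hashes` and `cached_prefix_tokens`, and the made-up integer token IDs are all assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # assumed block size, for illustration only


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block chained with the hash of the blocks before it."""
    hashes: list[str] = []
    parent = ""
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cache: set[str], token_ids: list[int]) -> int:
    """Count leading tokens whose blocks are already present in the cache."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break  # the first miss ends the reusable prefix
        hits += BLOCK_SIZE
    return hits


system_prompt = list(range(1000, 1064))       # 64 stable tokens (4 blocks)
first_request = system_prompt + [1, 2, 3, 4] * 4  # 16 more tokens (1 block)
cache = set(block_hashes(first_request))

appended = first_request + [9] * 16  # new content added at the end
mutated = [7] + first_request        # one dynamic token injected at the front

print(cached_prefix_tokens(cache, appended))  # 80: the whole earlier prompt is reused
print(cached_prefix_tokens(cache, mutated))   # 0: every block hash changed
```

Because each block's hash depends on everything before it, a single changed token at the front invalidates the entire prompt, while appending at the end leaves the earlier blocks reusable.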
Continuous batching allows vLLM to interleave token generation across multiple concurrent requests rather than waiting for one request to complete before starting another. This maximizes GPU utilization when requests have different lengths — short requests don't block long ones from making progress. The failure mode for agent workloads is queue saturation: when a single agent sends requests faster than the server can complete them, the pending queue grows, and the continuous batching scheduler cannot easily preempt the in-flight requests that are already consuming KV cache memory. The result is head-of-line blocking for every other agent trying to use the server.
The practical implication of this cost model: a looping agent on vLLM is not just expensive in absolute GPU cost — it is also a negative externality on every other agent sharing the same server. For teams running multi-tenant agent infrastructure on shared vLLM instances, a single runaway agent degrades the quality of service for every other agent until the loop is stopped.
Failure mode 1: KV cache thrashing from looping agents
The prefix cache hit rate is the single most important cost metric for agent workloads on vLLM. A cache hit eliminates prefill computation for the shared prefix, which for agentic workloads with long system prompts can be 60–80% of total input tokens. A cache miss pays the full prefill cost, which grows at least linearly with prompt length and includes an attention term that grows quadratically with sequence length. As a rough illustration, for agents running Llama 3 70B or Qwen 2.5 72B on vLLM, a 4,000-token system prompt that hits the cache costs about the same in wall-clock prefill time as a 400-token prompt that misses it, or less.
The looping agent cache thrash pattern works as follows. On step 1, the agent sends a request with prefix [system prompt] + [user query]. The prefix is cached. On step 2, the agent calls a tool, gets a result, and sends [system prompt] + [user query] + [tool call] + [tool result]. The cached prefix still covers the system prompt and query, so only the new tool call and result need prefill. That reuse holds only while everything sent earlier is reproduced token for token. If the agent re-renders earlier content on each iteration (a timestamp or step counter in the system prompt, a re-fetched tool result containing a value from an external API, a reordered or re-summarized history), the prompt diverges from the cached blocks at the first changed token, and everything after that point is recomputed. On step 3, the agent loops back for the same task, sending [system prompt] + [user query] + [tool call 1] + [result 1] + [tool call 2] + [result 2], which is longer still. After five loop iterations of this, every request is a near-full cache miss, paying full prefill cost for a context that is growing at the rate of tool-output accumulation.
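A rough way to see the cost gap is to count how many prompt tokens need prefill across a loop, with and without prefix reuse. The sketch below is back-of-the-envelope arithmetic only: the function name `prefill_tokens` and the example numbers (a 4,000-token starting prompt growing by 500 tokens per step) are assumptions, and it counts tokens rather than measuring GPU time.

```python
def prefill_tokens(base_tokens: int, growth_per_step: int, steps: int, cache_reused: bool) -> int:
    """Total prompt tokens that need prefill across `steps` requests."""
    total = 0
    for step in range(steps):
        prompt_len = base_tokens + growth_per_step * step
        if cache_reused and step > 0:
            # Only the newly appended tokens miss the cache.
            total += growth_per_step
        else:
            # The whole prompt is recomputed.
            total += prompt_len
    return total


reused = prefill_tokens(4_000, 500, 5, cache_reused=True)
missed = prefill_tokens(4_000, 500, 5, cache_reused=False)
print(reused)   # 6000
print(missed)   # 25000
```

Under these assumptions the thrashing loop prefills roughly four times as many tokens over five steps, and the gap widens with every additional iteration.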
The guard monitors per-session cache hit rate using vLLM's Prometheus metrics endpoint and trips a circuit when the session's hit rate drops below a threshold — the signal that the agent has entered a loop that is defeating the cache.
import time
import threading
import httpx
from dataclasses import dataclass, field
from collections import defaultdict
from runguard import BudgetTracker, BudgetExceededError


@dataclass
class SessionCacheStats:
    total_requests: int = 0
    cache_hit_tokens: int = 0
    total_prefix_tokens: int = 0
    first_request_ts: float = field(default_factory=time.time)
    last_request_ts: float = field(default_factory=time.time)

    @property
    def cache_hit_rate(self) -> float:
        if self.total_prefix_tokens == 0:
            return 1.0
        return self.cache_hit_tokens / self.total_prefix_tokens

    @property
    def session_duration_seconds(self) -> float:
        return self.last_request_ts - self.first_request_ts


class VLLMCacheGuard:
    """
    Monitors per-session KV cache hit rates via vLLM's Prometheus /metrics endpoint.
    Trips a circuit breaker when a session's cache hit rate drops below the threshold,
    indicating the agent has entered a loop that is defeating prefix caching.
    """

    def __init__(
        self,
        vllm_base_url: str = "http://localhost:8000",
        min_cache_hit_rate: float = 0.5,  # trip below 50% cache hit rate
        min_requests_before_trip: int = 5,  # don't trip on first few requests
        circuit_reset_seconds: int = 120,
        poll_interval_seconds: float = 10.0,
        gpu_cost_per_hour: float = 3.0,  # USD/hour for your GPU instance
    ):
        self.vllm_url = vllm_base_url.rstrip("/")
        self.min_hit_rate = min_cache_hit_rate
        self.min_requests = min_requests_before_trip
        self.circuit_reset_seconds = circuit_reset_seconds
        self.gpu_cost_per_hour = gpu_cost_per_hour
        self._stats: dict[str, SessionCacheStats] = defaultdict(SessionCacheStats)
        self._open_circuits: dict[str, float] = {}  # session_id -> opened_at
        self._lock = threading.Lock()
        self._poll_interval = poll_interval_seconds
        self._server_cache_hit_tokens = 0.0
        self._server_total_tokens = 0.0
        self._last_metrics_ts = 0.0

    def _fetch_server_metrics(self) -> dict[str, float]:
        """Fetches vLLM Prometheus metrics and returns parsed key-value pairs."""
        try:
            resp = httpx.get(f"{self.vllm_url}/metrics", timeout=5.0)
            resp.raise_for_status()
            metrics = {}
            for line in resp.text.splitlines():
                if line.startswith("#") or not line.strip():
                    continue
                parts = line.rsplit(" ", 1)
                if len(parts) == 2:
                    try:
                        # Drop the {label="..."} part so keys are bare metric names
                        metrics[parts[0].split("{")[0]] = float(parts[1])
                    except ValueError:
                        continue
            return metrics
        except Exception:
            return {}

    def _get_server_cache_hit_rate(self) -> float | None:
        """
        Returns the server-wide prefix cache hit rate from vLLM metrics,
        or None if the counters are not available.
        The counter names below are the ones recent vLLM releases expose for
        prefix caching; metric names have changed between versions, so confirm
        them against your server's /metrics output.
        """
        now = time.time()
        if now - self._last_metrics_ts < self._poll_interval:
            if self._server_total_tokens > 0:
                return self._server_cache_hit_tokens / self._server_total_tokens
            return None
        metrics = self._fetch_server_metrics()
        self._last_metrics_ts = now
        # Tokens looked up in the prefix cache vs. tokens served from it
        queried_tokens = metrics.get("vllm:prefix_cache_queries_total", 0)
        cache_hit_tokens = metrics.get("vllm:prefix_cache_hits_total", 0)
        if queried_tokens > 0:
            self._server_cache_hit_tokens = cache_hit_tokens
            self._server_total_tokens = queried_tokens
            return cache_hit_tokens / queried_tokens
        return None

    def record_request(
        self,
        session_id: str,
        prompt_token_count: int,
        cache_hit_token_count: int,
    ) -> None:
        """
        Records a completed request's cache performance for a session.
        Call this after each vLLM response using the usage data from the response.
        """
        with self._lock:
            stats = self._stats[session_id]
            stats.total_requests += 1
            stats.total_prefix_tokens += prompt_token_count
            stats.cache_hit_tokens += cache_hit_token_count
            stats.last_request_ts = time.time()

    def check_session(self, session_id: str) -> None:
        """
        Raises RuntimeError if the session's circuit is open.
        Call this before issuing each vLLM request for the session.
        """
        with self._lock:
            # Check circuit reset
            if session_id in self._open_circuits:
                opened_at = self._open_circuits[session_id]
                if time.time() - opened_at > self.circuit_reset_seconds:
                    del self._open_circuits[session_id]
                    self._stats[session_id] = SessionCacheStats()
                else:
                    elapsed = int(time.time() - opened_at)
                    raise RuntimeError(
                        f"[RunGuard] vLLM session '{session_id}' circuit open: KV cache hit rate "
                        f"dropped below {self.min_hit_rate:.0%} threshold. "
                        f"Agent appears to be looping: context is mutating every step, "
                        f"defeating prefix caching. Circuit resets in "
                        f"{self.circuit_reset_seconds - elapsed}s. "
                        "Check agent for tool-output accumulation or timestamp injection in system prompt."
                    )
            # Trip circuit if hit rate too low after enough samples
            stats = self._stats[session_id]
            if (
                stats.total_requests >= self.min_requests
                and stats.cache_hit_rate < self.min_hit_rate
            ):
                self._open_circuits[session_id] = time.time()
                gpu_cost_estimate = (
                    stats.session_duration_seconds / 3600 * self.gpu_cost_per_hour
                )
                raise RuntimeError(
                    f"[RunGuard] vLLM session '{session_id}' circuit opened: "
                    f"cache hit rate {stats.cache_hit_rate:.1%} "
                    f"(threshold: {self.min_hit_rate:.0%}) "
                    f"after {stats.total_requests} requests. "
                    f"GPU cost burned in loop: ~${gpu_cost_estimate:.3f}. "
                    "Context is changing every step; prefix cache cannot be reused."
                )

    def session_stats(self, session_id: str) -> dict:
        with self._lock:
            s = self._stats[session_id]
            return {
                "session_id": session_id,
                "total_requests": s.total_requests,
                "cache_hit_rate": f"{s.cache_hit_rate:.1%}",
                "circuit_open": session_id in self._open_circuits,
                "session_duration_seconds": s.session_duration_seconds,
            }

The guard operates at two levels. At the server level, it polls vLLM's Prometheus /metrics endpoint to get aggregate cache hit rate data — this tells you whether the server as a whole is experiencing cache thrashing, which happens when many sessions are looping simultaneously. At the session level, it tracks the cache performance of individual agent sessions by recording the prompt_tokens and prompt_tokens_details.cached_tokens fields from each vLLM response (these are available in the OpenAI-compatible response format as usage fields). Session-level tracking lets you identify which specific agent session is thrashing the cache, rather than reacting only when the aggregate server-wide rate degrades.
The cache hit data per response is available in the usage field of the completion response. In OpenAI-compatible mode, vLLM returns usage.prompt_tokens_details.cached_tokens from the API response — the same field that OpenAI uses for their prompt caching feature. Record this alongside usage.prompt_tokens to get per-call cache performance without polling the metrics endpoint.
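As a hedged sketch of that recording step, the helper below pulls the two counts out of an OpenAI-style usage object. The name `cache_counts` is hypothetical, and the `SimpleNamespace` objects are hand-built stand-ins for `response.usage`; depending on vLLM version and server configuration the details object may be absent, which the helper treats as zero cached tokens.

```python
from types import SimpleNamespace


def cache_counts(usage) -> tuple[int, int]:
    """Return (prompt_tokens, cached_tokens) from an OpenAI-style usage object."""
    if usage is None:
        return 0, 0
    prompt_tokens = getattr(usage, "prompt_tokens", None) or 0
    details = getattr(usage, "prompt_tokens_details", None)
    cached_tokens = 0
    if details is not None:
        cached_tokens = getattr(details, "cached_tokens", None) or 0
    # Never report more cached tokens than prompt tokens.
    return prompt_tokens, min(cached_tokens, prompt_tokens)


# Stand-ins for response.usage, built by hand for the example.
with_details = SimpleNamespace(
    prompt_tokens=4200,
    prompt_tokens_details=SimpleNamespace(cached_tokens=4096),
)
without_details = SimpleNamespace(prompt_tokens=4200, prompt_tokens_details=None)

print(cache_counts(with_details))     # (4200, 4096)
print(cache_counts(without_details))  # (4200, 0)
```

With the guard defined earlier, the result can be passed straight through, for example `guard.record_request(session_id, *cache_counts(response.usage))`.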
Failure mode 2: continuous batching queue saturation
vLLM's continuous batching scheduler treats all pending requests equally by default. When an agent enters a loop and sends requests at a high rate (say, one request per second against a model with 5-second average latency), it fills the pending request queue faster than the server can drain it. Other agents making infrequent, high-value requests see their latency grow from sub-second to tens of seconds as they wait behind the looping agent's queued requests. At 50 queued requests × 5 seconds per request, served one at a time, a legitimate agent's request starts only after about 250 seconds, a little over four minutes; batching several requests concurrently shortens that wait proportionally but does not remove it.
The saturation compounds over time. vLLM's scheduler uses token budget limits to control the maximum number of tokens inflight across all concurrent sequences — this is how it prevents OOM crashes. When the queue contains many large-context requests from a looping agent, the scheduler may refuse new small requests until the in-flight token budget clears, even if the GPU has capacity for those small requests. The looping agent's large context window is effectively blocking smaller, faster requests from even starting.
The guard monitors queue depth and per-session request rate via the metrics endpoint and enforces rate limits per session to prevent queue monopolization.
import time
import asyncio
import threading
from collections import defaultdict, deque
from dataclasses import dataclass, field
import httpx
from runguard import LoopDetector


@dataclass
class SessionRateWindow:
    """Tracks request timestamps for a sliding window rate limit."""
    timestamps: deque = field(default_factory=lambda: deque(maxlen=200))

    def record(self) -> None:
        self.timestamps.append(time.monotonic())

    def rate_in_window(self, window_seconds: float) -> float:
        now = time.monotonic()
        cutoff = now - window_seconds
        recent = [ts for ts in self.timestamps if ts >= cutoff]
        return len(recent) / window_seconds if window_seconds > 0 else 0.0

    def requests_in_window(self, window_seconds: float) -> int:
        now = time.monotonic()
        cutoff = now - window_seconds
        return sum(1 for ts in self.timestamps if ts >= cutoff)


class VLLMQueueGuard:
    """
    Enforces per-session request rate limits to prevent a single agent session
    from saturating vLLM's continuous batching queue.
    Also monitors server-wide queue depth and pauses high-rate sessions
    when the server is under heavy load.
    """

    def __init__(
        self,
        vllm_base_url: str = "http://localhost:8000",
        max_requests_per_minute_per_session: int = 12,
        max_requests_per_10s_per_session: int = 4,
        max_queue_depth_before_throttle: int = 20,
        max_queue_depth_before_block: int = 50,
        loop_detector_window: int = 30,  # trip if same tool call seen 30× in a session
    ):
        self.vllm_url = vllm_base_url.rstrip("/")
        self.max_rpm = max_requests_per_minute_per_session
        self.max_r10s = max_requests_per_10s_per_session
        self.throttle_queue_depth = max_queue_depth_before_throttle
        self.block_queue_depth = max_queue_depth_before_block
        # Stored because _get_or_create_loop_detector and the error message read it
        self.loop_detector_window = loop_detector_window
        self._sessions: dict[str, SessionRateWindow] = defaultdict(SessionRateWindow)
        self._loop_detectors: dict[str, LoopDetector] = {}
        self._lock = threading.Lock()
        self._server_queue_depth = 0
        self._last_queue_poll = 0.0

    def _get_queue_depth(self) -> int:
        """Returns vLLM's current pending request queue depth from Prometheus metrics."""
        now = time.time()
        if now - self._last_queue_poll < 5.0:
            return self._server_queue_depth
        try:
            resp = httpx.get(f"{self.vllm_url}/metrics", timeout=3.0)
            for line in resp.text.splitlines():
                # Match the sample line with or without a {label="..."} block
                if line.startswith(("vllm:num_requests_waiting{", "vllm:num_requests_waiting ")):
                    self._server_queue_depth = int(float(line.split()[-1]))
                    self._last_queue_poll = now
                    return self._server_queue_depth
        except Exception:
            pass
        return self._server_queue_depth

    def _get_or_create_loop_detector(self, session_id: str) -> LoopDetector:
        if session_id not in self._loop_detectors:
            self._loop_detectors[session_id] = LoopDetector(
                window=self.loop_detector_window,
                min_repeats=3,
            )
        return self._loop_detectors[session_id]

    def before_request(
        self,
        session_id: str,
        tool_call_signature: str | None = None,
    ) -> None:
        """
        Call before issuing each request to vLLM for this session.
        Raises RuntimeError if the session is rate-limited or the server queue is full.
        Optionally accepts a tool_call_signature to detect repeated tool calls (loop detection).
        """
        with self._lock:
            session = self._sessions[session_id]
            # Per-session rate check: short window (10s burst)
            r10s = session.requests_in_window(10.0)
            if r10s >= self.max_r10s:
                raise RuntimeError(
                    f"[RunGuard] Session '{session_id}' rate limited: "
                    f"{r10s} requests in 10s (max: {self.max_r10s}). "
                    "Agent may be in a tight loop. Back off before retrying."
                )
            # Per-session rate check: 1-minute window
            rpm = session.requests_in_window(60.0)
            if rpm >= self.max_rpm:
                raise RuntimeError(
                    f"[RunGuard] Session '{session_id}' rate limited: "
                    f"{rpm} requests/min (max: {self.max_rpm}). "
                    f"At this rate the session will monopolize the inference server. "
                    "Implement exponential backoff or reduce agent step frequency."
                )
            # Server queue depth check
            queue_depth = self._get_queue_depth()
            if queue_depth >= self.block_queue_depth:
                raise RuntimeError(
                    f"[RunGuard] vLLM queue saturated: {queue_depth} requests pending "
                    f"(blocking threshold: {self.block_queue_depth}). "
                    "Refusing new request to protect other sessions. "
                    "Wait for queue to drain before retrying."
                )
            if queue_depth >= self.throttle_queue_depth:
                # Throttle: reject requests from high-rate sessions
                session_rpm = session.requests_in_window(60.0)
                if session_rpm > self.max_rpm * 0.5:
                    # This session is a significant queue contributor, so throttle harder
                    raise RuntimeError(
                        f"[RunGuard] vLLM queue at {queue_depth} pending requests "
                        f"and session '{session_id}' is sending at {session_rpm:.0f} req/min. "
                        "Throttling this session to protect other agents. Slow down request rate."
                    )
            # Optional loop detection via tool call signature repetition
            if tool_call_signature is not None:
                loop_detector = self._get_or_create_loop_detector(session_id)
                if loop_detector.step(tool_call_signature):
                    raise RuntimeError(
                        f"[RunGuard] Loop detected in session '{session_id}': "
                        f"tool call signature repeated {self.loop_detector_window}+ times. "
                        f"Signature: '{tool_call_signature[:100]}'. "
                        "Agent is calling the same tool with the same arguments repeatedly."
                    )
            session.record()

    def session_report(self, session_id: str) -> dict:
        with self._lock:
            session = self._sessions[session_id]
            return {
                "session_id": session_id,
                "requests_last_60s": session.requests_in_window(60.0),
                "requests_last_10s": session.requests_in_window(10.0),
                "server_queue_depth": self._server_queue_depth,
            }

The guard's before_request() check fires synchronously before each API call. The vllm:num_requests_waiting Prometheus gauge is the most reliable signal for queue saturation: it reports the number of requests that have been accepted by the server but have not yet started generation (i.e., they are waiting in the scheduler queue for a free token-budget slot). When this metric exceeds the blocking threshold, the guard refuses new requests rather than adding them to an already-saturated queue. The key insight: adding one more request to a queue of 50 does not help the situation; it makes the wait longer for every request that arrives after it.
The tool_call_signature parameter bridges the queue guard with loop detection. When an agent calls a tool, pass the tool name and a hash of its arguments as the signature. A loop detector sees the same signature 30 times in a session and trips: the agent is calling the same tool with the same inputs repeatedly, which is the hallmark of a tool call that is not advancing the task. This is the complementary detection to the cache hit rate approach: the cache guard catches loops early (cache miss rate rises), while the tool-call signature guard catches loops that happen to send different prompts (context accumulated differently) but are making the same tool calls.
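One way to build such a signature is to hash a canonical JSON encoding of the arguments, so that key order and whitespace do not produce different signatures for the same call. This is a minimal sketch: the function name and the 16-character digest truncation are arbitrary choices, and the example tool name is made up.

```python
import hashlib
import json


def make_tool_call_signature(tool_name: str, arguments: dict) -> str:
    """Build a stable signature from the tool name and its arguments."""
    # sort_keys and compact separators give one canonical encoding per argument set;
    # default=str keeps non-JSON values (dates, paths) from raising.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"), default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{tool_name}:{digest}"


a = make_tool_call_signature("search_web", {"query": "gpu prices", "limit": 5})
b = make_tool_call_signature("search_web", {"limit": 5, "query": "gpu prices"})
c = make_tool_call_signature("search_web", {"query": "gpu prices", "limit": 6})

print(a == b)  # True: same call, different key order
print(a == c)  # False: an argument changed
```

The resulting string can be passed as `tool_call_signature` to `before_request()`.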
Failure mode 3: per-session context growth and memory pressure
vLLM manages KV cache memory as a shared resource across all concurrent sessions. Each token position in each active sequence occupies physical GPU memory proportional to num_layers × num_kv_heads × head_dim × 2 × dtype_bytes, where the factor of 2 covers keys and values. For Llama 3 70B at fp16 (80 layers, 8 key/value heads under grouped-query attention, head dimension 128), one token position requires roughly 320 KB of KV cache memory across all layers and heads; a model of the same shape without grouped-query attention (64 key/value heads) would need about 2.5 MB. A single agent session with a 32,000-token context therefore occupies roughly 10 GB of KV cache, a substantial share of the GPU memory left over after model weights, for the duration of that sequence's active generation.
When an agent grows its context across steps by accumulating tool outputs, the KV cache for that session grows correspondingly. At some point, as free KV cache blocks run out, vLLM begins preempting running sequences to make room for the growing one (under the default first-come-first-served policy, the most recently admitted sequences are preempted first). Preemption means vLLM either swaps the preempted sequence's KV cache to CPU memory (if swap space is configured) or drops it and requeues the request for recomputation. With recomputation, the preempted request has to redo its prefill from scratch when it eventually restarts, so the prefill cost that prefix caching was supposed to eliminate is paid again; with swapping, it pays the transfer time in both directions instead.
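The per-token footprint is easy to compute directly. The sketch below assumes the published Llama 3 70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and fp16 storage; the function name is hypothetical. It also shows the figure for 64 key/value heads, which is what the same shape would cost without grouped-query attention.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token position; the factor 2 covers keys and values."""
    return num_layers * num_kv_heads * head_dim * 2 * dtype_bytes


gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
mha = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)

print(gqa)                    # 327680 bytes, i.e. 320 KiB per token
print(mha / 2**20)            # 2.5 MiB per token without grouped-query attention
print(32_000 * gqa / 2**30)   # roughly 9.8 GiB for a 32,000-token context
print(2**30 // gqa)           # 3276 tokens of context per GiB of KV cache
```

The distinction matters for capacity planning: counting attention heads instead of key/value heads overstates the footprint of a grouped-query model by a factor of eight.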
The guard tracks per-session context length over time and trips when a session is growing its context faster than expected for legitimate progressive task completion.
import time
import threading
import httpx
from dataclasses import dataclass, field
from collections import defaultdict
from runguard import BudgetTracker, BudgetExceededError


@dataclass
class ContextGrowthRecord:
    steps: list[tuple[float, int]] = field(default_factory=list)  # (timestamp, context_tokens)

    def add_step(self, token_count: int) -> None:
        self.steps.append((time.time(), token_count))

    @property
    def latest_tokens(self) -> int:
        return self.steps[-1][1] if self.steps else 0

    @property
    def step_count(self) -> int:
        return len(self.steps)

    def tokens_per_step(self) -> float:
        if len(self.steps) < 2:
            return 0.0
        total_tokens = self.steps[-1][1] - self.steps[0][1]
        return total_tokens / (len(self.steps) - 1)

    def growth_rate_last_n(self, n: int) -> float:
        """Average tokens added per step over the last N steps."""
        if len(self.steps) < 2:
            return 0.0
        window = self.steps[-n:]
        if len(window) < 2:
            return 0.0
        return (window[-1][1] - window[0][1]) / (len(window) - 1)


class VLLMContextGrowthGuard:
    """
    Monitors per-session context growth and vLLM's GPU KV cache utilization.
    Trips a circuit when a session's context is growing unsustainably fast,
    risking OOM preemption and server-wide performance degradation.
    """

    def __init__(
        self,
        vllm_base_url: str = "http://localhost:8000",
        max_context_tokens: int = 24_000,
        max_tokens_per_step: int = 2_000,  # if avg growth > this, likely looping
        high_gpu_cache_threshold: float = 0.75,  # trip earlier when server memory is tight
        tokens_per_gb_kv_cache: int = 3_200,  # approximate for Llama 3 70B fp16 (8 KV heads)
    ):
        self.vllm_url = vllm_base_url.rstrip("/")
        self.max_context = max_context_tokens
        self.max_growth_per_step = max_tokens_per_step
        self.high_cache_threshold = high_gpu_cache_threshold
        self.tokens_per_gb = tokens_per_gb_kv_cache
        self._sessions: dict[str, ContextGrowthRecord] = defaultdict(ContextGrowthRecord)
        self._open_circuits: set[str] = set()
        self._lock = threading.Lock()
        self._gpu_cache_usage: float = 0.0
        self._last_cache_poll: float = 0.0

    def _get_gpu_cache_usage(self) -> float:
        """Returns vLLM's GPU KV cache utilization (0.0–1.0)."""
        now = time.time()
        if now - self._last_cache_poll < 10.0:
            return self._gpu_cache_usage
        try:
            resp = httpx.get(f"{self.vllm_url}/metrics", timeout=3.0)
            for line in resp.text.splitlines():
                # Match the sample line with or without a {label="..."} block.
                # Metric names have changed between vLLM releases; confirm this
                # one against your server's /metrics output.
                if line.startswith(("vllm:gpu_cache_usage_perc{", "vllm:gpu_cache_usage_perc ")):
                    self._gpu_cache_usage = float(line.split()[-1])
                    self._last_cache_poll = now
                    return self._gpu_cache_usage
        except Exception:
            pass
        return self._gpu_cache_usage

    def record_step(self, session_id: str, context_token_count: int) -> None:
        """
        Records the context size after each agent step.
        Call this after receiving a response and counting the total context tokens
        (prompt_tokens + completion_tokens from the response usage field).
        """
        with self._lock:
            self._sessions[session_id].add_step(context_token_count)

    def check_before_step(self, session_id: str) -> None:
        """
        Checks whether the session's context growth is within acceptable bounds.
        Raises RuntimeError if the session is growing too fast or exceeds absolute limits.
        """
        with self._lock:
            if session_id in self._open_circuits:
                raise RuntimeError(
                    f"[RunGuard] Session '{session_id}' circuit open: context growth "
                    "exceeded safe bounds. Reset context window or start a new session."
                )
            record = self._sessions[session_id]
            if not record.steps:
                return
            # Absolute context size limit
            if record.latest_tokens > self.max_context:
                self._open_circuits.add(session_id)
                raise RuntimeError(
                    f"[RunGuard] Session '{session_id}' context exceeded {self.max_context:,} tokens "
                    f"(current: {record.latest_tokens:,}). "
                    "This session is occupying excessive KV cache memory and risks preempting "
                    "other sessions. Summarize context or start a fresh session."
                )
            # Growth rate check (requires at least 4 steps)
            if record.step_count >= 4:
                recent_growth = record.growth_rate_last_n(4)
                gpu_usage = self._get_gpu_cache_usage()
                # Apply tighter threshold when server memory is already strained.
                # The threshold is chosen before comparing, so the halved limit
                # actually takes effect under memory pressure.
                effective_max_growth = (
                    self.max_growth_per_step * 0.5
                    if gpu_usage > self.high_cache_threshold
                    else self.max_growth_per_step
                )
                if recent_growth > effective_max_growth:
                    self._open_circuits.add(session_id)
                    raise RuntimeError(
                        f"[RunGuard] Session '{session_id}' context growing too fast: "
                        f"{recent_growth:.0f} tokens/step over last 4 steps "
                        f"(max: {effective_max_growth:.0f} tokens/step). "
                        f"GPU KV cache at {gpu_usage:.1%}. "
                        "Agent is accumulating tool output without summarizing; "
                        "consider compressing context between steps."
                    )

    def growth_report(self, session_id: str) -> dict:
        with self._lock:
            record = self._sessions[session_id]
            return {
                "session_id": session_id,
                "step_count": record.step_count,
                "current_tokens": record.latest_tokens,
                "avg_tokens_per_step": record.tokens_per_step(),
                "circuit_open": session_id in self._open_circuits,
                "gpu_cache_utilization": f"{self._gpu_cache_usage:.1%}",
            }

The vllm:gpu_cache_usage_perc metric is the server's real-time report of what fraction of total KV cache capacity is currently in use across all active sequences. When this metric is above 75%, the server is memory-constrained — any new sequence that grows its context significantly will trigger preemptions. The guard applies a tighter per-step growth limit when the server is memory-constrained, because the cost of preemption scales with how many other sessions are active. A 2,000-token context growth step at 40% GPU cache usage is tolerable; the same step at 80% usage forces preemption of other sessions that then pay re-prefill cost.
Failure mode 4: speculative decoding waste on agentic token patterns
Speculative decoding is one of vLLM's most effective optimizations for text-heavy workloads. A small, fast draft model generates a batch of candidate tokens speculatively; the large target model verifies them all in a single forward pass, accepting any prefix of the speculation that matches what the target model would have generated and discarding the rest. When speculation acceptance rates are high — typical for natural language continuations — speculative decoding reduces wall-clock time by 2–3× while generating identical output to non-speculative decoding.
Agentic workloads break this assumption. Agents generate structured JSON tool calls, function argument objects, and schema-constrained outputs. These token sequences are highly unpredictable at the character level: field names, string values, and nested structure depend entirely on the tool schema and the agent's decision, not on any statistical continuation pattern the draft model can learn. A draft model trained on natural language will speculate continuations that look like prose — not {"tool_name": "search_web", "arguments": {"query": "current price of. The result is high speculation rejection rates: the target model accepts 0–2 speculative tokens per draft step instead of the 5–8 typical for prose generation. Each rejected speculation still costs a full draft model forward pass plus a target model verification pass. At high rejection rates, speculative decoding adds overhead rather than reducing it.
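The break-even point can be estimated with the standard expected-tokens formula for speculative decoding. The sketch below rests on simplifying assumptions: each draft token is accepted independently with probability `alpha`, verifying a batch of draft tokens costs about one target forward pass, and one draft forward pass costs `draft_cost_ratio` of a target pass. The function names and the 0.1 cost ratio are illustrative, not measured values.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k draft tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding; values below 1.0 mean speculation is a net loss."""
    cost_per_step = 1 + k * draft_cost_ratio  # one target pass plus k draft passes
    return expected_tokens_per_step(alpha, k) / cost_per_step


prose = estimated_speedup(alpha=0.8, k=5, draft_cost_ratio=0.1)
tool_calls = estimated_speedup(alpha=0.2, k=5, draft_cost_ratio=0.1)

print(round(prose, 2))       # comfortably above 2x under these assumptions
print(round(tool_calls, 2))  # below 1.0: slower than not speculating at all
```

Under these assumptions a low acceptance probability turns speculation into a net slowdown, which is the condition the guard tries to detect.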
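The guard monitors speculation acceptance rates for agent sessions and turns speculative decoding off for a session when its acceptance rate indicates structured output generation. The code passes the toggle as a per-request field; speculative decoding in vLLM is normally configured when the server starts, so confirm that your deployment honors a per-request setting before relying on it.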
import time
import threading
import statistics
from collections import defaultdict, deque
import httpx
from openai import OpenAI


class SpeculativeDecodingGuard:
    """
    Monitors per-session speculative decoding acceptance rates.
    Disables speculation for sessions that consistently reject speculative tokens
    (the pattern for agents generating structured JSON tool calls).
    Re-enables when a session switches to prose generation (higher acceptance expected).
    """

    def __init__(
        self,
        vllm_base_url: str = "http://localhost:8000",
        api_key: str = "not-needed",
        min_acceptance_rate_to_enable: float = 0.55,  # enable spec if above this
        max_acceptance_rate_to_disable: float = 0.25,  # disable if below this
        window_size: int = 10,  # samples to average over
        spec_tokens: int = 5,  # speculative tokens per step
    ):
        self.vllm_url = vllm_base_url.rstrip("/")
        # Use the normalized URL so a trailing slash does not produce "//v1"
        self.client = OpenAI(base_url=f"{self.vllm_url}/v1", api_key=api_key)
        self.min_enable_rate = min_acceptance_rate_to_enable
        self.max_disable_rate = max_acceptance_rate_to_disable
        self.window = window_size
        self.spec_tokens = spec_tokens
        self._session_acceptance_history: dict[str, deque] = defaultdict(
            lambda: deque(maxlen=window_size)
        )
        self._speculation_enabled: dict[str, bool] = defaultdict(lambda: True)
        self._lock = threading.Lock()

    def _estimate_acceptance_rate(
        self,
        prompt_tokens: int,
        completion_tokens: int,
        wall_clock_seconds: float,
        speculation_enabled: bool,
    ) -> float | None:
        """
        Estimates speculative acceptance rate from generation throughput.
        With spec decoding ON: higher acceptance = more tokens/second
        With spec decoding OFF: baseline tokens/second
        This is an approximation; vLLM does not expose per-request acceptance rates
        in the OpenAI-compatible response format (only in Prometheus metrics).
        """
        if wall_clock_seconds <= 0 or completion_tokens <= 0:
            return None
        tokens_per_second = completion_tokens / wall_clock_seconds
        # Rough baseline: 70B model at fp16 does ~25-35 tok/s without speculation
        # With speculation at 50% acceptance: ~50-60 tok/s
        # We estimate acceptance from deviation from baseline
        baseline_tps = 30.0  # calibrate for your specific GPU + model
        if speculation_enabled:
            # Acceptance rate estimate from throughput uplift
            uplift = tokens_per_second / baseline_tps
            # Each accepted spec token saves one target model forward pass on that token
            # uplift ≈ 1 + acceptance_rate * spec_tokens / (1 + spec_tokens)
            # Solving for acceptance_rate:
            estimated_rate = max(0.0, min(1.0, (uplift - 1) * (1 + self.spec_tokens) / self.spec_tokens))
            return estimated_rate
        return None

    def complete(
        self,
        session_id: str,
        messages: list[dict],
        model: str,
        max_tokens: int = 1024,
        **kwargs,
    ):
        """
        Issues a chat completion request to vLLM with automatic speculative
        decoding management based on the session's observed acceptance rate.
        """
        with self._lock:
            spec_enabled = self._speculation_enabled[session_id]

        # Build extra_body for vLLM-specific parameters.
        # Assumption: the server honors a per-request speculative_config. Verify this
        # against your vLLM version, since speculation is normally set at engine startup.
        extra_body: dict = {}
        if spec_enabled:
            extra_body["speculative_config"] = {
                "num_speculative_tokens": self.spec_tokens,
            }
        else:
            # Explicitly disable speculation for this request
            extra_body["speculative_config"] = None

        start_ts = time.monotonic()
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            extra_body=extra_body,  # always populated above
            **kwargs,
        )
        # wall_clock includes queueing and prefill time, so the derived
        # tokens/second is a lower bound on pure decode throughput.
        wall_clock = time.monotonic() - start_ts

        # Estimate acceptance rate from throughput
        completion_tokens = response.usage.completion_tokens if response.usage else 0
        prompt_tokens = response.usage.prompt_tokens if response.usage else 0
        acceptance_rate = self._estimate_acceptance_rate(
            prompt_tokens, completion_tokens, wall_clock, spec_enabled
        )
        if acceptance_rate is not None:
            with self._lock:
                history = self._session_acceptance_history[session_id]
                history.append(acceptance_rate)
                # Switch modes based on rolling average
                if len(history) >= 3:
                    avg_rate = statistics.mean(history)
                    if spec_enabled and avg_rate < self.max_disable_rate:
                        self._speculation_enabled[session_id] = False
                    # NOTE: no samples are recorded while speculation is off (the
                    # estimator returns None), so this branch only fires if a probe
                    # request with speculation on is issued for the session.
                    elif not spec_enabled and avg_rate > self.min_enable_rate:
                        self._speculation_enabled[session_id] = True
        return response

    def speculation_report(self, session_id: str) -> dict:
        # Read both pieces of per-session state under the lock for a consistent snapshot
        with self._lock:
            history = list(self._session_acceptance_history[session_id])
            enabled = self._speculation_enabled[session_id]
        return {
            "session_id": session_id,
            "speculation_enabled": enabled,
            "samples": len(history),
            "avg_acceptance_rate": f"{statistics.mean(history):.1%}" if history else "n/a",
            "recent_acceptance_rate": f"{history[-1]:.1%}" if history else "n/a",
        }
The per-request acceptance rate is not directly exposed in vLLM's OpenAI-compatible response format — it's available only in Prometheus metrics as an aggregate counter (vllm:spec_decode_draft_acceptance_rate). The guard approximates per-session acceptance rate from throughput measurements: with speculation enabled and high acceptance, tokens-per-second significantly exceeds the non-speculative baseline for the model; with low acceptance, throughput may actually be slightly below the non-speculative baseline due to draft model overhead. While this estimation is noisy at the individual request level, it converges over a 10-request window to reliably identify sessions in structured-output generation mode. The guard then passes speculative_config: null in the request's extra_body to disable speculation for that session, reducing per-request latency and freeing draft model computation for sessions that benefit from it.
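To make the throughput-to-acceptance mapping concrete, the following minimal sketch isolates the inversion used inside the guard. The function name acceptance_from_throughput is hypothetical, and the 30 tok/s baseline and 5 speculative tokens are the same illustrative defaults as in the class above, not measured values.

```python
def acceptance_from_throughput(
    tokens_per_second: float,
    baseline_tps: float = 30.0,  # assumed non-speculative baseline; calibrate per GPU + model
    spec_tokens: int = 5,        # assumed number of draft tokens per step
) -> float:
    """Invert uplift ~= 1 + rate * k / (1 + k) and clamp the result to [0, 1]."""
    uplift = tokens_per_second / baseline_tps
    rate = (uplift - 1.0) * (1 + spec_tokens) / spec_tokens
    return max(0.0, min(1.0, rate))


if __name__ == "__main__":
    # Sweep a few observed throughputs to see where the estimate clamps.
    for tps in (24.0, 30.0, 36.0, 45.0, 60.0):
        print(f"{tps:5.1f} tok/s -> estimated acceptance {acceptance_from_throughput(tps):.2f}")
```

With these defaults, 45 tok/s maps to an estimated acceptance of 0.6. Two properties of the model are worth noting: any throughput at or below the baseline clamps to 0, so the estimator cannot tell "speculation is useless" apart from "the server is busy", and readings above roughly 55 tok/s saturate at 1.0. Both are reasons to average over a window rather than act on a single request.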
Combining guards in a production vLLM deployment
For teams running vLLM as shared agent infrastructure, these four guards operate at different granularities and should be layered accordingly:
- The VLLMCacheGuard operates at the session level with server-level aggregation. Deploy it as a sidecar that polls /metrics and maintains per-session hit rate state. Trip the circuit for individual sessions without affecting others.
- The VLLMQueueGuard operates at the request level with global queue depth awareness. Run it as a proxy or middleware in front of vLLM, so that every request passes through the rate limiter regardless of which session sent it. This is the guard that protects the server from multi-tenant degradation.
- The VLLMContextGrowthGuard operates at the session level with per-step monitoring. Integrate it into the agent loop itself — after each step, record the context size and check before issuing the next step. This guard catches slow-growing loops that the cache and rate guards may miss.
- The SpeculativeDecodingGuard operates at the request level per session. Wrap your vLLM client with this guard when your agent pipeline mixes prose generation (high spec benefit) with structured tool-call generation (low spec benefit). If your pipeline generates only structured outputs, simply disable speculative decoding globally in vLLM configuration.
A key difference from managed API cost control: on vLLM, the guards protect not just your own cost but shared server resources. The rate limiter and queue guard have an important multi-tenancy dimension — one agent's loop is another agent's latency regression. This makes enforcement more urgent than on managed APIs where loops only affect the caller's own bill.
vLLM metrics endpoint note: vLLM exposes Prometheus-format metrics at GET /metrics on the same port as the inference API. Key metrics for agent cost control: vllm:num_requests_waiting (queue depth), vllm:gpu_cache_usage_perc (KV cache pressure), vllm:prompt_tokens_total and vllm:cache_hit_tokens_total (server-wide cache hit rate), and vllm:spec_decode_draft_acceptance_rate (speculative decoding efficiency). These metrics update every 5 seconds and are the authoritative source for all four guards. Scrape them into Prometheus and alert when num_requests_waiting exceeds 10 or gpu_cache_usage_perc exceeds 85% — those are the two leading indicators of a degraded serving environment for agent workloads.
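As a rough illustration of the two alert conditions, the sketch below parses Prometheus text-format output and checks the thresholds named above. The helper names parse_gauges and degraded_signals and the sample payload are made up for this example, and it assumes gpu_cache_usage_perc is reported as a fraction between 0 and 1 (so 85% is 0.85); confirm that against the output of your own /metrics endpoint.

```python
WATCHED = {"vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc"}

# Hypothetical sample of what GET /metrics might return (not captured from a real server).
SAMPLE_METRICS = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama"} 14.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama"} 0.62
"""


def parse_gauges(metrics_text: str, names: set[str]) -> dict[str, float]:
    """Extract the watched metrics; keeps the max if a metric has several label sets."""
    values: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "name{labels} value" with no trailing timestamp.
        head, _, raw_value = line.rpartition(" ")
        name = head.split("{", 1)[0]
        if name in names:
            values[name] = max(values.get(name, float("-inf")), float(raw_value))
    return values


def degraded_signals(
    values: dict[str, float],
    max_waiting: float = 10,
    max_cache_usage: float = 0.85,  # assumption: metric is a 0-1 fraction
) -> list[str]:
    """Return a human-readable alert for each threshold that is exceeded."""
    alerts = []
    waiting = values.get("vllm:num_requests_waiting", 0.0)
    usage = values.get("vllm:gpu_cache_usage_perc", 0.0)
    if waiting > max_waiting:
        alerts.append(f"queue depth {waiting:.0f} exceeds {max_waiting}")
    if usage > max_cache_usage:
        alerts.append(f"KV cache usage {usage:.0%} exceeds {max_cache_usage:.0%}")
    return alerts


if __name__ == "__main__":
    print(degraded_signals(parse_gauges(SAMPLE_METRICS, WATCHED)))
```

In a real deployment the text would come from an HTTP GET against the metrics endpoint, and the alerting would more likely live in Prometheus rules than in application code; the sketch is only meant to show which values are compared against which thresholds.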
Summary: vLLM cost amplification patterns for agent workloads
| Pattern | Cost impact | Guard |
|---|---|---|
| KV cache thrashing: looping agent with mutating context prefix | Full prefill cost every request instead of cache hit; 3–8× per-request compute cost | Per-session hit rate monitor; circuit trip below 50% hit rate after 5 requests |
| Continuous batching queue saturation: runaway agent floods request queue | Multi-minute latency for all other sessions; head-of-line blocking degrades server-wide throughput | Per-session rate limiter (max 12 req/min); block all sessions when queue > 50 |
| Context growth memory pressure: agent accumulating tool output without summarizing | KV cache preemption forces re-prefill for other sessions; OOM risk above 80% cache usage | Per-step context size tracking; circuit trip at 24k tokens or 2k tokens/step growth rate |
| Speculative decoding waste: structured JSON tool-call generation | Draft model overhead adds 10–20% latency when acceptance rate < 25% | Per-session throughput-estimated acceptance rate; auto-disable spec below 25% rate |
Frequently asked questions
Does vLLM have built-in request rate limiting or per-session circuit breakers?
vLLM does not include per-session rate limiting or circuit breakers. It accepts requests up to a configurable maximum concurrent sequences limit (--max-num-seqs) and a maximum number of batched tokens (--max-num-batched-tokens), but these are server-wide capacity limits rather than per-session safety bounds. Requests beyond the concurrent sequence limit queue indefinitely. There is no built-in mechanism to identify a runaway session, apply back-pressure to it specifically, or trip a circuit for sessions with degraded cache performance. All of that must be implemented at the client or proxy layer — which is what the guards above provide.
How does vLLM's prefix caching interact with agent frameworks like LangChain or LangGraph?
Most agent frameworks build a conversation history by appending messages to a list and sending the full history to the model on each step. vLLM's prefix cache works on the tokenized representation of the request — it hashes the token IDs for the prefix and checks for a matching cached KV block. For prefix caching to work, the prefix must be token-for-token identical across requests. LangChain and LangGraph pipelines typically send the full message history on each step with the same system prompt at the top, which means the system prompt prefix is reused across steps. However, if the framework adds per-request metadata (timestamps, request IDs) to the system prompt or user message before the history, the cache key changes on every request and hits nothing. Audit your framework's prompt assembly to ensure the shareable prefix (system prompt + task description) appears before the dynamic content (tool outputs, conversation history) — this ordering is the only one that prefix caching can exploit.
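The ordering effect can be seen with a small, self-contained comparison. The sketch below renders two consecutive agent steps under two layouts and measures the shared character prefix, which stands in for the token-level prefix that vLLM actually hashes. The layout functions, prompt text, and timestamps are all hypothetical.

```python
SYSTEM = "You are a code-search agent. Use the tools provided."
STEP1_HISTORY = "user: find the failing test\n"
STEP2_HISTORY = STEP1_HISTORY + "assistant: call grep\ntool: 3 matches\n"
T1, T2 = "2025-01-01T00:00:01Z", "2025-01-01T00:00:02Z"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that two rendered prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def metadata_first(history: str, timestamp: str) -> str:
    # Cache-hostile layout: per-request metadata sits in front of the system prompt.
    return f"[{timestamp}] {SYSTEM}\n{history}"


def metadata_last(history: str, timestamp: str) -> str:
    # Cache-friendly layout: stable content first, per-request metadata at the end.
    return f"{SYSTEM}\n{history}[{timestamp}]\n"


hostile = shared_prefix_len(metadata_first(STEP1_HISTORY, T1), metadata_first(STEP2_HISTORY, T2))
friendly = shared_prefix_len(metadata_last(STEP1_HISTORY, T1), metadata_last(STEP2_HISTORY, T2))

if __name__ == "__main__":
    print(f"metadata first: {hostile} shared chars")
    print(f"metadata last:  {friendly} shared chars")
```

With metadata first, the two prompts diverge inside the timestamp, so almost nothing is reusable. With metadata last, the whole system prompt plus the step-1 history is shared between the two requests. Real cache behavior is decided on token blocks rather than characters, so treat the numbers as directional only.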
What is the right model size for agent workloads on self-hosted vLLM?
The cost-optimal model size depends on your agent's task complexity and context length requirements. For agents making short-context tool calls (under 4k tokens), a 7B–14B model at fp16 or a quantized 30B model typically matches the quality of larger models on structured tasks (function calling, JSON extraction, classification) at 5–10× lower GPU memory cost. For agents that need deep reasoning over long documents (32k+ tokens), a 70B model is often necessary — and at that size, prefix caching becomes the dominant cost lever because the prefill cost per token is high enough that a cache hit saves significant time. A common architecture for cost efficiency: run a 7B model for tool dispatch decisions and short structured outputs, and reserve the 70B model for the reasoning steps that require broad knowledge or multi-step planning.
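One way to express the two-tier architecture is a small routing function in the agent loop. The sketch below is an assumption-laden illustration: the model names, the step kinds, and the rule that anything at or above 32k tokens goes to the large model are placeholders derived from the guidance above, not a tested policy.

```python
from dataclasses import dataclass

SMALL_MODEL = "small-7b"    # hypothetical name of the served 7B model
LARGE_MODEL = "large-70b"   # hypothetical name of the served 70B model


@dataclass(frozen=True)
class Step:
    kind: str             # "tool_dispatch", "structured_output", or "reasoning"
    context_tokens: int   # prompt size for this step


def pick_model(step: Step, long_context_threshold: int = 32_000) -> str:
    """Send reasoning steps and long-context steps to the large model, the rest to the small one."""
    if step.kind == "reasoning" or step.context_tokens >= long_context_threshold:
        return LARGE_MODEL
    return SMALL_MODEL


if __name__ == "__main__":
    for step in (
        Step("tool_dispatch", 2_500),
        Step("structured_output", 3_800),
        Step("reasoning", 6_000),
        Step("structured_output", 40_000),
    ):
        print(step, "->", pick_model(step))
```

How to classify a step as "reasoning" is left open here; in practice that label would come from the agent framework's own step types or from a per-tool configuration.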
How should I configure vLLM's chunked prefill for agent workloads?
vLLM's chunked prefill (--enable-chunked-prefill) splits long prefill computations into smaller chunks, interleaving them with decode steps for other requests. This is generally beneficial for agent workloads because agents often send long-context requests (large tool outputs appended to history) that would otherwise block all other decode steps for the duration of the prefill. With chunked prefill enabled, a 20k-token prefill is processed in chunks bounded by the per-step token budget (--max-num-batched-tokens), for example 512 tokens each, allowing decode steps for other sessions to proceed between chunks. The cost trade-off is that the prefill for the long-context request takes longer in wall-clock time. For interactive agent workloads where response latency matters, set --max-num-batched-tokens high enough that most requests complete prefill in one or two chunks. For batch agent workloads where throughput matters more than latency, smaller chunk sizes increase fairness across sessions.
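The chunk-count arithmetic behind this trade-off is simple enough to check directly. The sketch below is a simplification that assumes each scheduler step spends its whole token budget on one request's prefill, which overstates chunk size whenever decode tokens share the same budget; the function name prefill_chunks is hypothetical.

```python
import math


def prefill_chunks(prompt_tokens: int, chunk_tokens: int) -> int:
    """Number of scheduler steps needed to prefill a prompt at a given chunk size."""
    if prompt_tokens <= 0 or chunk_tokens <= 0:
        raise ValueError("prompt_tokens and chunk_tokens must be positive")
    return math.ceil(prompt_tokens / chunk_tokens)


if __name__ == "__main__":
    for chunk in (512, 2048, 8192):
        steps = prefill_chunks(20_000, chunk)
        print(f"20k-token prefill at {chunk}-token chunks: {steps} steps")
```

At 512 tokens per chunk, a 20k-token prompt needs 40 steps, each of which leaves room for other sessions to decode. At 8192 tokens per chunk it needs 3, which favors the long request's latency at the expense of everyone else's.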
How does RunGuard integrate with vLLM deployments?
RunGuard's LoopDetector, BudgetTracker, and circuit breaker primitives are backend-agnostic and work with any OpenAI-compatible API including vLLM. The vLLM-specific guards above use RunGuard's primitives for the detection logic while adding vLLM-specific integrations: polling the /metrics Prometheus endpoint, reading usage.prompt_tokens_details.cached_tokens from vLLM responses, and passing speculative_config in the request extra_body. For teams running vLLM as a shared inference cluster, the recommended deployment is to run the queue guard and cache guard as a lightweight proxy (FastAPI or similar) that intercepts all requests before they reach vLLM, enforces rate limits, and injects the speculative decoding configuration based on per-session history. The context growth guard integrates directly into the agent framework loop.