Ollama and Llama.cpp Agent Cost Control: Loop Detection and Resource Enforcement in Production
The appeal of running agents on local models is obvious: no API key, no per-token bill, no rate limits. Pull a Llama 3.1 70B quantized weight, start a server, and your agent can call it ten thousand times without a cent landing on a credit card. The invoice never arrives.
The cost still shows up. It shows up as a VRAM OOM kill at 2 AM that silently restarts the subprocess, triggering the retry loop that should have been the circuit breaker. It shows up as a 16-minute CPU inference job that was supposed to take 30 seconds — and then spawns 10 concurrent copies of itself before the first one finishes. It shows up as the llama.cpp context window filling up, truncating the oldest messages without so much as a warning, causing the model to re-issue the same tool calls it already made — because it no longer remembers making them. It shows up as the Ollama model unloading after five minutes of inactivity and taking 45 seconds to cold-start every time the orchestrator retries a failed task.
Every cloud-hosted AI agent framework has been written about through the lens of billing: how many tokens, how many dollars, which circuit breaker trips the HTTP 429. Local model agents have a different cost model entirely — one measured in GPU hours, CPU saturation, wall-clock time, and hardware opportunity cost — and almost no one has written the guards for it.
This post covers the four failure modes specific to agents running on Ollama (the model management server most commonly used for local development and self-hosted production) and llama.cpp (the underlying inference engine, also accessible directly via llama-server or the Python llama-cpp-python bindings). For each failure mode you get the mechanism, a detection heuristic, and a Python guard implementation.
Scope. This post covers agents using Ollama's HTTP API (/api/chat, /api/generate) or llama-cpp-python bindings directly. If you're using Ollama as a drop-in OpenAI-compatible backend via /v1/chat/completions with the OpenAI SDK, the cloud-side billing guards from OpenAI Agents SDK Cost Control apply at the API layer, but the local resource failure modes here still apply at the infrastructure layer.
Why local model cost control is different
Cloud API agents and local model agents share one structural problem: a tool-use loop with no termination condition can run forever. Everything else about how that "forever" manifests is different.
| Dimension | Cloud API (Anthropic / OpenAI / Bedrock) | Local (Ollama / llama.cpp) |
|---|---|---|
| Cost unit | Per-token billing — detectable from API response | Wall-clock time × hardware cost (GPU/CPU/RAM) |
| Context overflow signal | HTTP 400 / context_length_exceeded error | Silent truncation (llama.cpp) or HTTP 400 (Ollama, model-dependent) |
| Budget tracking | Dollars, cumulative from usage.total_tokens × price | Time in seconds + token count + memory watermark |
| Runaway failure mode | Exponential invoice; rate-limit storm | OOM kill loop; CPU saturation; cold-start cascade |
| Failure observability | Billing dashboard, HTTP 429, CloudWatch spend alerts | Process exit code 137 (OOM kill), wall-clock hang, nvidia-smi |
| Concurrency risk | Rate limit exceeded; TPM/RPM quotas | All CPU cores pinned; VRAM double-allocated; OOM cascade |
The practical implication: circuit breakers for local model agents can't rely on HTTP status codes or dollar thresholds. They have to measure time, token counts at the client, and process health — all of which the agent orchestrator must track itself since no external billing system is doing it.
A minimal Ollama agent loop
A tool-use agent built on Ollama's /api/chat endpoint looks like this:
import httpx
import json

OLLAMA_BASE = "http://localhost:11434"

def ollama_chat(model: str, messages: list[dict], tools: list[dict] | None = None) -> dict:
    payload = {"model": model, "messages": messages, "stream": False}
    if tools:
        payload["tools"] = tools
    resp = httpx.post(f"{OLLAMA_BASE}/api/chat", json=payload, timeout=120.0)
    resp.raise_for_status()
    return resp.json()

messages = [
    {"role": "user", "content": "Summarize today's top three AI papers from arXiv."}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "fetch_arxiv",
            "description": "Fetch arXiv paper metadata by search query.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]
            }
        }
    }
]

while True:
    response = ollama_chat("llama3.1:70b", messages, tools)
    msg = response["message"]
    messages.append(msg)
    if not msg.get("tool_calls"):
        print(msg["content"])
        break
    # Execute each tool call and append results
    for tc in msg["tool_calls"]:
        fn = tc["function"]
        result = execute_tool(fn["name"], fn["arguments"])
        messages.append({
            "role": "tool",
            "content": json.dumps(result)
        })
Three structural facts determine where failures occur:
- You own the messages list. Every round trip appends the assistant response and your tool results. The full list is sent on every subsequent call. Context grows unboundedly.
- There is no step limit. The while True loop runs until the model returns a response with no tool calls. A model in a broken reasoning state, or one working from a truncated context that has forgotten what it already tried, can return tool calls indefinitely.
- Inference is synchronous and local. Every call blocks for wall-clock time proportional to the model size and token count. A slow inference is not a network timeout; it's real CPU/GPU time being consumed on your hardware.
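The simplest answer to the missing step limit is a hard cap around the loop. The sketch below is illustrative only: run_bounded, StepLimitExceeded, and the step callable are hypothetical names rather than part of Ollama or of the guards later in this post. Here step stands in for one model round trip and returns the final answer, or None when the model asked for more tool calls.

```python
from typing import Callable, Optional


class StepLimitExceeded(RuntimeError):
    """Raised when the loop hits its cap without a final answer."""


def run_bounded(step: Callable[[int], Optional[str]], max_steps: int = 12) -> str:
    """Call step(i) until it returns a final answer or max_steps is reached.

    step is a hypothetical callable wrapping one model round trip: it returns
    the final text when the model stops requesting tools, otherwise None.
    """
    for i in range(max_steps):
        answer = step(i)
        if answer is not None:
            return answer
    raise StepLimitExceeded(f"No final answer after {max_steps} steps")


# Simulated model that finishes on its third round trip.
print(run_bounded(lambda i: "done" if i == 2 else None))
```

A step cap is blunt: it bounds the damage but says nothing about why the loop ran long, which is what the failure-mode guards address.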
Failure Mode 1: VRAM exhaustion and the OOM crash-and-retry loop
When you launch a llama.cpp server or load a model through Ollama, the KV (key-value) cache for the attention mechanism is pre-allocated at startup based on the n_ctx (context length) parameter. For a Llama 3.1 70B Q4_K_M model with the default 2048-token context, the KV cache is modest. Bump n_ctx to 32768 for long-document tasks and the math changes dramatically:
KV cache size ≈ n_ctx × n_layers × n_kv_heads × head_dim × 2 (K and V) × bytes_per_element. Llama 3.1 70B uses grouped-query attention with 8 KV heads, so with 80 layers, a head dimension of 128, and an fp16 KV cache: 32768 × 80 × 8 × 128 × 2 × 2 bytes ≈ 10.7 GB for the cache alone, on top of 38–40 GB for the quantized weights. A single 40 GB A100 can't hold both.
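To see how the cache scales with context length, the estimate can be written as a small function. This is a rough sketch: it counts one K and one V vector per layer, per KV head, per token, and ignores allocator overhead and compute buffers. The function name is hypothetical; the architecture numbers (80 layers, 8 KV heads, head dimension 128) are those of Llama 3.1 70B.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Approximate KV cache size: K and V vectors per layer, per KV head, per token."""
    return n_ctx * n_layers * n_kv_heads * head_dim * 2 * bytes_per_element


# Llama 3.1 70B with an fp16 cache (2 bytes per element) at three context sizes.
for n_ctx in (2048, 8192, 32768):
    size_gb = kv_cache_bytes(n_ctx, 80, 8, 128) / 1e9
    print(f"n_ctx={n_ctx:>6}: ~{size_gb:.1f} GB")
```

At 32768 tokens this evaluates to roughly 10.7 GB, which is the figure that pushes the total past a 40 GB card.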
The failure scenario: an agent orchestrator is designed to retry on any subprocess error. The llama-server process is launched with --ctx-size 32768, the VRAM allocation fails during model load, the process exits with code 137 (SIGKILL from the OOM killer) or 1 (allocation failure), and the orchestrator's retry logic restarts it. Same VRAM constraint. Same failure. The orchestrator retries again. You have a crash-and-retry loop that does no useful work while hammering the GPU driver and potentially corrupting VRAM state in adjacent processes. (When the server is a direct child of a Python process, subprocess reports a signal death as a negative return code, so SIGKILL shows up as -9 rather than 137.)
The guard is a subprocess-level circuit breaker that tracks exit codes and trips after consecutive OOM-related failures:
import subprocess
import time
import shlex
from dataclasses import dataclass, field

class OOMCrashLoopError(RuntimeError):
    pass

@dataclass
class SubprocessCircuitBreaker:
    """Circuit breaker for llama-server / Ollama subprocess restarts."""
    command: str
    max_failures_in_window: int = 3
    window_seconds: float = 120.0
    # 137 and 134 are shell-style codes (128 + signal number). Popen reports signal
    # deaths as negative numbers (-9 for SIGKILL, -6 for SIGABRT). 1 is a generic failure.
    oom_exit_codes: frozenset = frozenset({137, 134, 1, -9, -6})
    _failures: list[float] = field(default_factory=list, init=False)
    _proc: subprocess.Popen | None = field(default=None, init=False)

    def start(self) -> subprocess.Popen:
        self._check_breaker()
        # Inherit stdout/stderr: unread PIPEs can fill up and block the server.
        self._proc = subprocess.Popen(shlex.split(self.command))
        return self._proc

    def wait_for_exit(self, timeout: float = 30.0) -> int:
        if self._proc is None:
            raise RuntimeError("No subprocess running")
        try:
            code = self._proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # We killed it ourselves, so reap it and don't count it as an OOM exit.
            self._proc.kill()
            return self._proc.wait()
        if code in self.oom_exit_codes:
            self._record_failure()
        return code

    def _record_failure(self) -> None:
        now = time.monotonic()
        self._failures = [t for t in self._failures if now - t < self.window_seconds]
        self._failures.append(now)
        if len(self._failures) >= self.max_failures_in_window:
            raise OOMCrashLoopError(
                f"llama-server exited with OOM signal {self.max_failures_in_window}x "
                f"in {self.window_seconds}s. Reduce --ctx-size or --n-gpu-layers "
                f"to fit model + KV cache within available VRAM."
            )

    def _check_breaker(self) -> None:
        now = time.monotonic()
        recent = [t for t in self._failures if now - t < self.window_seconds]
        if len(recent) >= self.max_failures_in_window:
            raise OOMCrashLoopError("Circuit breaker open: too many recent OOM exits.")

# Usage: wrap your server launch in SubprocessCircuitBreaker
# breaker = SubprocessCircuitBreaker(
#     command="llama-server --model llama-3.1-70b-q4_k_m.gguf --ctx-size 32768 --n-gpu-layers 80",
#     max_failures_in_window=3,
#     window_seconds=120.0
# )
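A complementary pre-flight check can avoid the first crash entirely by comparing free VRAM with the expected footprint before launch. The sketch below rests on several assumptions: parse_free_vram_mib, query_free_vram_mib, and fits_in_vram are hypothetical helpers, the 10% headroom is an arbitrary choice, and summing free memory across GPUs is a simplification that ignores how layers are actually split. The query uses nvidia-smi's --query-gpu=memory.free with csv,noheader,nounits formatting, which prints one MiB value per GPU.

```python
import subprocess


def parse_free_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one free-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_free_vram_mib() -> list[int]:
    """Ask nvidia-smi for free memory per GPU; raises if the tool is missing or fails."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_vram_mib(out)


def fits_in_vram(free_mib: list[int], weights_gb: float, kv_cache_gb: float,
                 headroom: float = 0.10) -> bool:
    """True if weights + KV cache (plus headroom) fit in the summed free VRAM."""
    needed_mib = (weights_gb + kv_cache_gb) * 1e9 / (1024 ** 2) * (1 + headroom)
    return sum(free_mib) >= needed_mib


# Canned output for a single 40 GB card, so this runs without a GPU.
free = parse_free_vram_mib("40354\n")
print(fits_in_vram(free, weights_gb=39.0, kv_cache_gb=10.7))
```

Running a check like this before the breaker's start() turns a guaranteed crash into an immediate configuration error.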
For Ollama specifically, the OOM scenario is different: Ollama handles VRAM capacity internally and will fall back to partial CPU offloading (the equivalent of lowering --n-gpu-layers) rather than crashing. But this fallback silently drops inference throughput from 40+ tokens/sec to 2–5 tokens/sec. The agent doesn't know inference is now roughly 10x slower, so it issues the same call with the same timeout, times out, and retries, and the retry hits the same CPU-offloaded configuration at the same throughput. Set an explicit throughput guard to detect the fallback condition:
class OllamaThroughputGuard:
    """Detects when Ollama has fallen back to CPU offloading."""
    def __init__(self, min_tokens_per_sec: float = 10.0, sample_window: int = 3):
        self.min_tps = min_tokens_per_sec
        self.window = sample_window
        self._samples: list[float] = []

    def record(self, eval_count: int, eval_duration_ns: int) -> None:
        """Pass Ollama response's eval_count and eval_duration fields."""
        if eval_duration_ns <= 0:
            return
        tps = eval_count / (eval_duration_ns / 1e9)
        self._samples.append(tps)
        if len(self._samples) >= self.window:
            avg = sum(self._samples[-self.window:]) / self.window
            if avg < self.min_tps:
                raise CPUFallbackDetected(
                    f"Ollama inference averaging {avg:.1f} tok/s over last {self.window} calls "
                    f"(threshold: {self.min_tps} tok/s). Model likely fell back to CPU offloading. "
                    "Check VRAM usage with nvidia-smi. Consider reducing num_ctx in the request."
                )

class CPUFallbackDetected(RuntimeError):
    pass
The Ollama API response includes eval_count (output tokens generated) and eval_duration (nanoseconds spent generating) in every non-streaming response. These are the raw ingredients for throughput measurement with zero additional instrumentation.
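As a quick illustration of that calculation, the helper below derives tokens per second from a response dict. The function name is hypothetical and the sample values are made up; only the eval_count and eval_duration field names come from the Ollama response.

```python
from typing import Optional


def tokens_per_second(response: dict) -> Optional[float]:
    """Decode throughput from an Ollama non-streaming response, or None if unavailable."""
    count = response.get("eval_count", 0)
    duration_ns = response.get("eval_duration", 0)
    if count <= 0 or duration_ns <= 0:
        return None
    return count / (duration_ns / 1e9)  # eval_duration is reported in nanoseconds


# Sample values only; real responses carry these fields alongside "message".
sample = {"eval_count": 200, "eval_duration": 5_000_000_000}
print(tokens_per_second(sample))  # 200 tokens in 5 s -> 40.0
```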
Failure Mode 2: Silent context truncation and the tool-call repetition loop
Cloud APIs are explicit about context overflow: OpenAI returns {"error": {"code": "context_length_exceeded"}}; Anthropic returns HTTP 400 with a message that includes the current and maximum token counts. You can catch the exception, trim the history, and retry with a reduced context.
Llama.cpp is silent. When the accumulated messages JSON exceeds the server's n_ctx, llama.cpp does not return an error. It truncates the oldest tokens from the input and proceeds with what fits. The model receives a context that starts mid-conversation — missing the initial task description, the tool calls already made, the intermediate reasoning — and it fills in those gaps based on the remaining content and its prior training distribution.
The result is a specific failure pattern: the model begins issuing tool calls that it already made earlier in the conversation, because the record of those calls has been truncated out of its context window. The tool executor runs them again. The results come back. The model, still not seeing the full history, issues them again. This loop is indistinguishable from a normal multi-turn conversation unless you're tracking tool call history on the client side.
Detecting truncation-induced repetition requires two independent checks: a client-side token estimate to warn before truncation occurs, and a tool call fingerprint deduplicator to catch the repetition pattern if truncation happens anyway:
import hashlib
import json
import time

class TruncationAwareToolLoop:
    """
    Guards a llama.cpp / Ollama agent loop against silent context truncation.
    Two detection layers:
    1. Pre-call: estimate token count and warn before the n_ctx threshold
    2. Post-call: fingerprint tool calls and trip on repetition
    """
    def __init__(
        self,
        model_n_ctx: int,
        warn_fraction: float = 0.80,
        max_tool_call_repeats: int = 2,
        repeat_detection_window: int = 8,
    ):
        self.n_ctx = model_n_ctx
        self.warn_threshold = int(model_n_ctx * warn_fraction)
        self.max_repeats = max_tool_call_repeats
        self.window = repeat_detection_window
        self._tool_fingerprints: list[str] = []

    def estimate_tokens(self, messages: list[dict]) -> int:
        """
        Conservative estimate. English text averages about 3.5 characters per token;
        dividing by 3.0 biases the estimate high (the safe direction) and leaves
        room for JSON overhead.
        """
        raw_chars = sum(len(json.dumps(m, ensure_ascii=False)) for m in messages)
        return int(raw_chars / 3.0)  # 3.0, not 3.5: deliberately conservative

    def check_before_inference(self, messages: list[dict]) -> None:
        estimated = self.estimate_tokens(messages)
        if estimated > self.warn_threshold:
            pct = estimated / self.n_ctx * 100
            raise ContextNearLimitWarning(
                f"Estimated context: ~{estimated:,} tokens "
                f"({pct:.0f}% of n_ctx={self.n_ctx:,}). "
                "llama.cpp will truncate silently above n_ctx; trim oldest turns or "
                "increase n_ctx before the next call."
            )

    def record_tool_call(self, tool_name: str, tool_arguments: dict) -> None:
        """Call after extracting each tool call from the model response."""
        fp = self._fingerprint(tool_name, tool_arguments)
        # Check recent window for this fingerprint
        recent = self._tool_fingerprints[-self.window:]
        repeat_count = recent.count(fp)
        if repeat_count >= self.max_repeats:
            raise TruncationRepetitionError(
                f"Tool '{tool_name}' with identical arguments called {repeat_count + 1}x "
                f"in the last {self.window} tool calls. "
                "This pattern indicates context truncation: the model has forgotten "
                "it already executed this call. Check n_ctx vs estimated context size."
            )
        self._tool_fingerprints.append(fp)

    def _fingerprint(self, tool_name: str, arguments: dict) -> str:
        canonical = f"{tool_name}:{json.dumps(arguments, sort_keys=True)}"
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

    def trim_oldest_turns(self, messages: list[dict], target_fraction: float = 0.70) -> list[dict]:
        """
        Emergency trim: remove the oldest assistant turns (with their tool results)
        until the context estimate drops below target_fraction of n_ctx.
        Preserves the first user message (the original task) and the system prompt.
        """
        target_tokens = int(self.n_ctx * target_fraction)
        if self.estimate_tokens(messages) <= target_tokens:
            return messages
        # Find boundary: keep system prompt + first user message
        keep_head = []
        rest = list(messages)
        for i, m in enumerate(messages):
            if m.get("role") == "system":
                keep_head.append(m)
                rest = messages[i+1:]
            else:
                break
        # Keep first user message separately
        if rest and rest[0].get("role") == "user":
            keep_head.append(rest[0])
            rest = rest[1:]
        # Drop the oldest message plus any tool results that belong to it, so a turn
        # with several tool calls is never left half-removed. Always keep the latest message.
        while self.estimate_tokens(keep_head + rest) > target_tokens and len(rest) > 1:
            rest = rest[1:]
            while len(rest) > 1 and rest[0].get("role") == "tool":
                rest = rest[1:]
        return keep_head + rest

class ContextNearLimitWarning(RuntimeError):
    pass

class TruncationRepetitionError(RuntimeError):
    pass
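Ollama responses also report prompt_eval_count, the number of prompt tokens evaluated for the call. One possible extra signal, sketched here as a heuristic rather than documented behavior, is to treat a prompt count sitting at or just under the context limit as a hint that the prompt was clipped. The helper name and the 2% margin are assumptions, and cached prompts may report a smaller count, so a False result is inconclusive.

```python
def truncation_suspected(response: dict, num_ctx: int, margin: float = 0.02) -> bool:
    """Heuristic: a prompt token count at or near num_ctx suggests the prompt was clipped."""
    prompt_tokens = response.get("prompt_eval_count", 0)
    return prompt_tokens >= int(num_ctx * (1 - margin))


# Made-up response fragments for a server running with num_ctx=4096.
print(truncation_suspected({"prompt_eval_count": 4096}, num_ctx=4096))
print(truncation_suspected({"prompt_eval_count": 1200}, num_ctx=4096))
```

Because it relies on server-side counts instead of a character estimate, this check can also be used to sanity-check the divisor in estimate_tokens against real prompts.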
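One note on n_ctx discovery: recent versions of llama.cpp's server report the configured context length through the /props endpoint, but the most reliable way to populate TruncationAwareToolLoop(model_n_ctx=...) is to read the value you passed at server launch (or the num_ctx set in Ollama's Modelfile or request options). Ollama's /api/show endpoint (a POST) returns any Modelfile num_ctx override in its parameters field. The <arch>.context_length entry in model_info is the model's trained maximum, not the window the server actually allocates, so it is the wrong number for this guard:
def get_ollama_num_ctx(model: str, default: int = 2048) -> int:
    # /api/show is a POST endpoint; older servers expect "name" instead of "model".
    resp = httpx.post(f"{OLLAMA_BASE}/api/show", json={"model": model})
    resp.raise_for_status()
    data = resp.json()
    # A Modelfile override appears in "parameters" as a line such as "num_ctx 8192".
    for line in data.get("parameters", "").splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "num_ctx":
            return int(parts[1])
    # No override found: fall back to an assumed server default (adjust for your
    # Ollama version) instead of the trained maximum in model_info.
    return default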
Failure Mode 3: Ollama cold-start cascade
Ollama manages a model cache in VRAM. When a model hasn't been called for keep_alive seconds (default: 5 minutes), Ollama unloads it to free VRAM for the next model loaded. This is sensible behavior for a local development environment where you're switching between models interactively. It becomes a production failure mode in two scenarios:
- Multi-model orchestrators that switch between a planning model (e.g., llama3.1:70b) and tool-calling models (e.g., mistral:7b), where the models evict each other from the cache on every switch. Each round trip pays a cold-start penalty.
- Retry-on-timeout orchestrators where the agent has a short inference timeout: the 70B model takes longer than expected on a busy system, the orchestrator kills the request while the model is still only partially loaded, Ollama marks it as unloaded, and the next retry pays another full cold-start cost.
Cold-start times for common models (measured on an RTX 3090 24 GB):
| Model | Quantization | VRAM | Cold start (GPU) | Cold start (CPU fallback) |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5.0 GB | 4–8 s | 20–40 s |
| Llama 3.1 70B | Q4_K_M | ~38 GB (split GPU+CPU) | 30–60 s | 3–8 min |
| Mistral 7B | Q4_K_M | ~4.1 GB | 3–7 s | 15–35 s |
| Qwen2.5 72B | Q4_K_M | ~40 GB (split) | 35–70 s | 4–10 min |
| Phi-3.5 Mini | Q4_K_M | ~2.4 GB | 2–4 s | 8–18 s |
An orchestrator making 10 calls to a 70B model with 5-minute idle timeouts between calls, retrying once on failure, can spend 10–20 minutes on cold starts alone — before a single useful token is generated.
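The 10–20 minute figure follows from simple multiplication. The sketch below assumes the worst case in which every attempt, original call or retry, starts cold; the function name is hypothetical and the 30–60 s range is the 70B GPU cold-start range quoted above.

```python
def cold_start_overhead_seconds(calls: int, cold_start_s: float,
                                retries_per_call: int = 0) -> float:
    """Total time spent loading the model if every attempt (call or retry) starts cold."""
    attempts = calls * (1 + retries_per_call)
    return attempts * cold_start_s


# 10 calls, one retry each, 30-60 s per cold start.
low = cold_start_overhead_seconds(10, 30.0, retries_per_call=1)
high = cold_start_overhead_seconds(10, 60.0, retries_per_call=1)
print(f"{low / 60:.0f}-{high / 60:.0f} minutes of pure model loading")
```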
The guard measures per-call latency against a warm-state baseline and counts cold-start events within a task window. It also provides a keep_alive enforcement helper that pins the model in VRAM for the duration of a task:
import time
import httpx

class OllamaSessionManager:
    """
    Manages model warm state across a multi-call agent task.
    Two responsibilities:
    1. Detect cold-start events from anomalous response latency
    2. Prevent model eviction by pinging keep-alive before the 5-minute timeout
    """
    def __init__(
        self,
        model: str,
        base_url: str = "http://localhost:11434",
        warm_tps_floor: float = 15.0,             # tok/s; below this = cold start suspected
        cold_start_alarm: int = 3,                # trip after this many cold starts in a task
        keep_alive_ping_interval: float = 240.0,  # ping every 4 min (before 5-min eviction)
    ):
        self.model = model
        self.base_url = base_url
        self.warm_tps_floor = warm_tps_floor
        self.cold_start_alarm = cold_start_alarm
        self.ping_interval = keep_alive_ping_interval
        self._cold_start_count = 0
        self._last_ping_time: float = 0.0
        self._client = httpx.Client(base_url=base_url, timeout=300.0)

    def check_response(self, response_json: dict) -> None:
        """Call after every Ollama API response to detect cold starts."""
        eval_count = response_json.get("eval_count", 0)
        eval_duration_ns = response_json.get("eval_duration", 1)
        load_duration_ns = response_json.get("load_duration", 0)
        # load_duration > 2 seconds is a cold-start signal
        load_duration_s = load_duration_ns / 1e9
        if load_duration_s > 2.0:
            self._cold_start_count += 1
            if self._cold_start_count >= self.cold_start_alarm:
                raise ColdStartCascadeError(
                    f"Model '{self.model}' cold-started {self._cold_start_count}x "
                    f"in this task (last load: {load_duration_s:.1f}s). "
                    "Set keep_alive=-1 on your Ollama requests or use "
                    "keep_alive_ping() to prevent eviction between calls."
                )
        # Also check throughput for partial-load degradation
        if eval_count > 10 and eval_duration_ns > 0:
            tps = eval_count / (eval_duration_ns / 1e9)
            if tps < self.warm_tps_floor:
                raise ColdStartCascadeError(
                    f"Inference throughput {tps:.1f} tok/s is below warm floor "
                    f"({self.warm_tps_floor} tok/s). Model may be partially CPU-offloaded. "
                    "Check VRAM with nvidia-smi and consider a smaller quantization."
                )

    def keep_alive_ping(self) -> None:
        """Ping Ollama to prevent model eviction. Call periodically between agent steps."""
        now = time.monotonic()
        if now - self._last_ping_time < self.ping_interval:
            return
        # Generate 1 token to reset the keep_alive timer.
        # Ollama caps output via options.num_predict, not a top-level max_tokens.
        self._client.post("/api/generate", json={
            "model": self.model,
            "prompt": " ",
            "options": {"num_predict": 1},
            "stream": False,
            "keep_alive": "10m",  # extend keep_alive to 10 minutes per call
        })
        self._last_ping_time = now

    def pin_model(self, keep_alive: int | str = -1) -> None:
        """
        Load the model and set keep_alive=-1 to prevent eviction until explicit unload.
        Call once at task start. Call unpin_model() when the task completes.
        Pass a number of seconds or a duration string with a unit (e.g. "10m").
        """
        resp = self._client.post("/api/generate", json={
            "model": self.model,
            "prompt": "",
            "stream": False,
            "keep_alive": keep_alive,
        })
        resp.raise_for_status()
        self._last_ping_time = time.monotonic()

    def unpin_model(self) -> None:
        """Explicitly unload the model from VRAM to free memory."""
        self._client.post("/api/generate", json={
            "model": self.model,
            "prompt": "",
            "stream": False,
            "keep_alive": 0,
        })

class ColdStartCascadeError(RuntimeError):
    pass
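The keep_alive field in Ollama's request body controls how long the model stays loaded after the last call. It accepts a number of seconds or a duration string such as "10m"; a negative value such as -1 pins the model indefinitely, and 0 unloads it immediately after the call. The value is applied per request, so a later call that omits it falls back to the server default. The pattern above pins at task start and explicitly unloads at task end, avoiding both the eviction-cascade failure and the wasted-VRAM-after-idle problem.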
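The pin-then-unload pattern fits naturally into a context manager, which guarantees the unload even when the task raises. This is a minimal sketch under stated assumptions: pinned_model is a hypothetical helper, and post is an injected callable (for example lambda path, body: client.post(path, json=body) with an httpx.Client) so the example can run without a live server.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def pinned_model(post: Callable[[str, dict], object], model: str) -> Iterator[None]:
    """Pin `model` in memory for the duration of the block, then unload it.

    `post` is a hypothetical injection point: any callable that sends a JSON
    body to an Ollama path.
    """
    base = {"model": model, "prompt": "", "stream": False}
    post("/api/generate", {**base, "keep_alive": -1})  # load and pin
    try:
        yield
    finally:
        post("/api/generate", {**base, "keep_alive": 0})  # unload, even on error


# Dry run with a recorder instead of a live server.
sent = []
with pinned_model(lambda path, body: sent.append((path, body)), "llama3.1:70b"):
    pass  # agent steps would run here
print([body["keep_alive"] for _, body in sent])
```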
Failure Mode 4: CPU inference runaway
On systems without a compatible GPU — or when VRAM is exhausted and Ollama falls back to CPU offloading — inference throughput drops from 40–120 tokens/sec to roughly 1–8 tokens/sec, depending on the model size and CPU core count. An agent task designed and tested on a MacBook Pro with M2 GPU (60+ tok/s) will behave very differently deployed on a 4-core cloud VM with no GPU (2–5 tok/s).
The math matters for agent loops. A loop with 10 steps, each generating 300 output tokens:
- GPU (60 tok/s): 10 × 300 / 60 = 50 seconds ✓
- CPU (5 tok/s): 10 × 300 / 5 = 600 seconds = 10 minutes ⚠️
- CPU (2 tok/s): 10 × 300 / 2 = 1500 seconds = 25 minutes 🚨
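Turning the same arithmetic around gives the number of steps a time budget can afford at a measured throughput. The helper below is a hypothetical sketch that counts only output-token generation and ignores prompt processing and model load time, so real budgets will be tighter.

```python
import math


def max_steps_within(budget_seconds: float, tokens_per_step: int,
                     tokens_per_sec: float) -> int:
    """How many steps of tokens_per_step output fit in the budget at a given throughput."""
    if tokens_per_sec <= 0:
        return 0
    seconds_per_step = tokens_per_step / tokens_per_sec
    return math.floor(budget_seconds / seconds_per_step)


# A 300-second budget with 300 output tokens per step.
for tps in (60.0, 5.0, 2.0):
    print(f"{tps:>4.0f} tok/s -> {max_steps_within(300.0, 300, tps)} steps")
```

At 60 tok/s the budget covers 60 steps; at 2 tok/s it covers two, which is fewer than most tool-use loops need.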
CPU inference runaway compounds into a concurrency problem. If the orchestrator has a short per-call timeout (say, 30 seconds), it times out on a 5-tok/s call and spawns a retry while the original call is still running in the background and consuming CPU. The retry then competes for the same cores. With 5 retries on a 4-core machine, you have five inference jobs contending for four cores and generating nothing useful: all eventually complete, return stale results, and potentially conflict with each other's tool outputs.
The guard is a token-budget + time-budget tracker that measures actual throughput from the first few inference calls and projects whether the remaining work fits within a time budget before committing to the next step:
import time
import statistics
from dataclasses import dataclass, field

@dataclass
class LocalInferenceBudget:
    """
    Tracks token generation rate and enforces time + token budgets
    for CPU and GPU inference. Detects when hardware throughput makes
    the task infeasible within the allowed time window.
    """
    max_wall_seconds: float = 300.0   # 5 minutes total for the task
    max_total_tokens: int = 15000     # output token cap across all steps
    min_acceptable_tps: float = 2.0   # below this = declare runaway
    warmup_samples: int = 2           # collect this many samples before projecting
    _start_time: float = field(default_factory=time.monotonic, init=False)
    _total_tokens: int = field(default=0, init=False)
    _tps_samples: list[float] = field(default_factory=list, init=False)

    def record_step(self, tokens_generated: int, step_duration_seconds: float) -> None:
        """
        Call after each inference step completes.
        tokens_generated: output token count from response
        step_duration_seconds: wall time for this step
        """
        self._total_tokens += tokens_generated
        if step_duration_seconds > 0 and tokens_generated > 5:
            tps = tokens_generated / step_duration_seconds
            self._tps_samples.append(tps)
        elapsed = time.monotonic() - self._start_time
        # Hard token budget
        if self._total_tokens >= self.max_total_tokens:
            raise TokenBudgetExceededError(
                f"Local inference consumed {self._total_tokens:,} output tokens "
                f"(limit: {self.max_total_tokens:,}) in {elapsed:.1f}s. "
                "Reduce max_tokens per step or tighten your agent loop termination."
            )
        # Hard time budget
        if elapsed >= self.max_wall_seconds:
            raise TimeBudgetExceededError(
                f"Task exceeded {self.max_wall_seconds:.0f}s wall time "
                f"after {self._total_tokens:,} tokens at "
                f"{self._avg_tps():.1f} tok/s average."
            )
        # After warmup: project whether remaining budget is feasible
        if len(self._tps_samples) >= self.warmup_samples:
            avg_tps = self._avg_tps()
            if avg_tps < self.min_acceptable_tps:
                raise CPUInferenceRunaway(
                    f"Inference throughput {avg_tps:.2f} tok/s is below minimum "
                    f"acceptable ({self.min_acceptable_tps} tok/s). "
                    "This task will not complete within the time budget on this hardware. "
                    "Use GPU acceleration or a smaller quantized model (e.g., Q2_K instead of Q8_0)."
                )

    def check_before_step(self, estimated_output_tokens: int = 300) -> None:
        """
        Call before each inference step to pre-check feasibility.
        estimated_output_tokens: expected output for this step.
        """
        elapsed = time.monotonic() - self._start_time
        remaining_seconds = self.max_wall_seconds - elapsed
        if len(self._tps_samples) >= self.warmup_samples:
            avg_tps = self._avg_tps()
            projected_step_seconds = estimated_output_tokens / avg_tps if avg_tps > 0 else float("inf")
            if projected_step_seconds > remaining_seconds:
                raise TimeBudgetExceededError(
                    f"Projected next step: {projected_step_seconds:.0f}s "
                    f"({estimated_output_tokens} tokens at {avg_tps:.1f} tok/s). "
                    f"Only {remaining_seconds:.0f}s remaining in budget. "
                    "Aborting before incurring unrecoverable latency."
                )

    def _avg_tps(self) -> float:
        if not self._tps_samples:
            return float("inf")
        return statistics.mean(self._tps_samples[-5:])  # rolling 5-sample window

    @property
    def summary(self) -> str:
        elapsed = time.monotonic() - self._start_time
        return (
            f"tokens={self._total_tokens}, "
            f"elapsed={elapsed:.1f}s, "
            f"avg_tps={self._avg_tps():.1f}, "
            f"budget_remaining={self.max_wall_seconds - elapsed:.1f}s"
        )

class TokenBudgetExceededError(RuntimeError):
    pass

class TimeBudgetExceededError(RuntimeError):
    pass

class CPUInferenceRunaway(RuntimeError):
    pass
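The budget tracker does not by itself stop retries from stacking on top of a call that is still running. One way to address the retry pile-up described earlier is a single-flight slot that refuses a new call while an earlier one holds it. SingleFlight and InferenceInFlight are hypothetical names, and the sketch only helps orchestrators that run calls in worker threads or tasks that keep the slot until the call truly finishes; whether the server stops generating when a client disconnects is outside its scope.

```python
import threading


class InferenceInFlight(RuntimeError):
    """Raised when a new call is attempted while an earlier one still holds the slot."""


class SingleFlight:
    """Allows at most one local inference call at a time (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def __enter__(self) -> "SingleFlight":
        # Non-blocking acquire: fail fast instead of queueing another CPU-bound call.
        if not self._lock.acquire(blocking=False):
            raise InferenceInFlight("previous inference call has not finished")
        return self

    def __exit__(self, *exc) -> None:
        self._lock.release()


slot = SingleFlight()
with slot:
    pass  # run the blocking Ollama call here
```

Failing fast here converts a silent CPU pile-up into an explicit error the orchestrator can log and back off from.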
Composing all four guards: LocalAgentBreaker
Each guard addresses a distinct failure mode, but a production agent needs all four working together. LocalAgentBreaker is a composing wrapper that wires all four checks into the Ollama chat loop from the introduction:
import httpx
import json
import time

class LocalAgentBreaker:
    """
    Drop-in wrapper for an Ollama chat loop that wires all four local-model guards:
    1. OllamaThroughputGuard: detects CPU fallback from VRAM oversubscription
    2. TruncationAwareToolLoop: prevents and detects context truncation loops
    3. OllamaSessionManager: prevents cold-start cascade; manages keep_alive
    4. LocalInferenceBudget: enforces time and token budgets for CPU inference
    """
    def __init__(
        self,
        model: str,
        model_n_ctx: int = 4096,
        base_url: str = "http://localhost:11434",
        max_wall_seconds: float = 300.0,
        max_total_tokens: int = 15000,
        pin_model: bool = True,
    ):
        self.model = model
        self._pinned = pin_model
        self._client = httpx.Client(base_url=base_url, timeout=max_wall_seconds)
        self._throughput = OllamaThroughputGuard(min_tokens_per_sec=10.0)
        self._ctx = TruncationAwareToolLoop(model_n_ctx=model_n_ctx)
        self._session = OllamaSessionManager(model=model, base_url=base_url)
        self._budget = LocalInferenceBudget(
            max_wall_seconds=max_wall_seconds,
            max_total_tokens=max_total_tokens,
        )
        if pin_model:
            self._session.pin_model(keep_alive=-1)

    def chat(self, messages: list[dict], tools: list[dict] | None = None) -> dict:
        """
        Wraps a single Ollama /api/chat call with all pre- and post-call checks.
        Returns the full Ollama response dict.
        Raises one of the guard exceptions on detected failure.
        """
        # Pre-call checks
        self._ctx.check_before_inference(messages)
        self._budget.check_before_step()
        self._session.keep_alive_ping()
        payload = {"model": self.model, "messages": messages, "stream": False}
        if self._pinned:
            # Repeat the pin on every call: a request without keep_alive
            # falls back to the server default and would undo it.
            payload["keep_alive"] = -1
        if tools:
            payload["tools"] = tools
        step_start = time.monotonic()
        resp = self._client.post("/api/chat", json=payload)
        resp.raise_for_status()
        data = resp.json()
        step_duration = time.monotonic() - step_start
        # Post-call checks
        self._session.check_response(data)
        self._throughput.record(
            eval_count=data.get("eval_count", 0),
            eval_duration_ns=data.get("eval_duration", 1),
        )
        self._budget.record_step(
            tokens_generated=data.get("eval_count", 0),
            step_duration_seconds=step_duration,
        )
        # Record any tool calls for repetition detection
        msg = data.get("message", {})
        for tc in msg.get("tool_calls") or []:
            fn = tc.get("function", {})
            self._ctx.record_tool_call(fn.get("name", ""), fn.get("arguments", {}))
        return data

    def trim_messages(self, messages: list[dict]) -> list[dict]:
        """Emergency context trim. Call when ContextNearLimitWarning is raised."""
        return self._ctx.trim_oldest_turns(messages)

    def close(self) -> None:
        """Unload model and release HTTP connection."""
        self._session.unpin_model()
        self._client.close()

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()

# Full usage example
def run_agent(task: str, model: str = "llama3.1:70b") -> str:
    n_ctx = get_ollama_num_ctx(model)
    with LocalAgentBreaker(model=model, model_n_ctx=n_ctx, max_wall_seconds=180.0) as breaker:
        messages = [{"role": "user", "content": task}]
        tools = []  # ... your tool definitions ...
        while True:
            try:
                response = breaker.chat(messages, tools)
            except ContextNearLimitWarning:
                messages = breaker.trim_messages(messages)
                response = breaker.chat(messages, tools)  # retry with trimmed context
            msg = response["message"]
            messages.append(msg)
            if not msg.get("tool_calls"):
                return msg["content"]
            for tc in msg["tool_calls"]:
                fn = tc["function"]
                result = execute_tool(fn["name"], fn["arguments"])
                messages.append({"role": "tool", "content": json.dumps(result)})
Threshold reference for local models
| Parameter | Recommended default | Rationale |
|---|---|---|
| OOM crash circuit breaker | 3 failures / 120 s | Three consecutive OOM kills indicate a structural VRAM config problem, not a transient error. More than 3 retries wastes GPU driver recovery time. |
| Context warn fraction | 0.80 of n_ctx | Leaves 20% headroom for the next assistant response. Truncation risk begins at 85–90% — the 0.80 warn gives time to trim before the critical threshold. |
| Tool repetition window | 8 calls, 2 repeat max | Legitimate iterative search can call the same tool 2x with the same query (refining vs confirming). Third identical call is a truncation-loop signal. |
| Cold-start alarm threshold | load_duration > 2 s, 3 events/task | A 2-second load_duration distinguishes cold start from cache miss. Three cold starts in one task indicates a persistent keep_alive configuration problem. |
| Minimum acceptable throughput | 10 tok/s (GPU), 2 tok/s (CPU-only) | Below 10 tok/s on a GPU system indicates VRAM offloading. Below 2 tok/s on CPU-only means a quantized 7B model is pinning all cores with little throughput gain. |
| Time budget (per task) | 300 s (GPU), 900 s (CPU) | 5 minutes is an upper bound for responsive agent tasks on GPU hardware. CPU-only tasks with larger models may need 15 minutes, but longer than that should queue for async execution. |
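These defaults can live in one place so every guard reads from the same profile. The dataclass below is a hypothetical container, not something the guards above require; its field names are illustrative and its values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardThresholds:
    """Threshold defaults from the reference table, gathered in one profile."""
    oom_max_failures: int = 3
    oom_window_s: float = 120.0
    context_warn_fraction: float = 0.80
    tool_repeat_window: int = 8
    tool_max_repeats: int = 2
    cold_start_load_s: float = 2.0
    cold_start_alarm: int = 3
    min_tokens_per_sec: float = 10.0
    max_wall_seconds: float = 300.0


# GPU hosts use the defaults; CPU-only hosts relax throughput and time limits.
GPU_DEFAULTS = GuardThresholds()
CPU_DEFAULTS = GuardThresholds(min_tokens_per_sec=2.0, max_wall_seconds=900.0)
print(CPU_DEFAULTS)
```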
Local vs cloud: failure mode comparison
Teams that have built cloud-API agents first and are migrating to local models often expect the same guard patterns to apply directly. Most don't. The mechanism is similar — a loop that won't stop — but the trigger, the signal, and the fix are different for every failure mode:
| Failure mode | Cloud API equivalent | Local model equivalent |
|---|---|---|
| Context overflow | HTTP 400 context_length_exceeded (explicit, catchable) | llama.cpp silent truncation: no error, just degraded behavior |
| Cost runaway | Per-token bill → spend alert → card blocked | CPU/GPU saturation → machine unresponsive → process OOM-killed |
| Rate limit | HTTP 429 → exponential backoff | VRAM full → Ollama queues or rejects → cold-start on queue drain |
| Model unavailability | API downtime → retry with backoff | Model not pulled → Ollama downloads 40 GB mid-agent-run |
| Budget enforcement | Dollar threshold + billing webhook | Wall-clock time + token count (no external billing system) |
| Loop detection | Tool-call fingerprint similarity on API-normalized payloads | Same fingerprinting — plus context-truncation pre-check to catch root cause |
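The "model not pulled" row can be guarded with a pre-flight check against Ollama's `/api/tags` endpoint, which lists locally available models. This is a minimal sketch under stated assumptions: the helper names `normalize_tag`, `model_is_pulled`, and `fetch_tags` are hypothetical, the standard library `urllib` is used only to keep the example self-contained, and the sketch relies on the `models[].name` field of the response.

```python
import json
import urllib.request


def normalize_tag(model: str) -> str:
    """Ollama treats a bare model name as the ':latest' tag."""
    last_segment = model.rsplit("/", 1)[-1]
    return model if ":" in last_segment else f"{model}:latest"


def model_is_pulled(model: str, tags_body: dict) -> bool:
    """Check a parsed /api/tags response for the requested model."""
    wanted = normalize_tag(model)
    return any(
        normalize_tag(entry.get("name", "")) == wanted
        for entry in tags_body.get("models", [])
    )


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """GET /api/tags and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# At service startup, refuse to accept agent tasks until the model is local:
# if not model_is_pulled("llama3.1:70b", fetch_tags()):
#     raise RuntimeError("model not pulled; run `ollama pull` before starting")
```

Failing fast at startup is a design choice: it turns a multi-gigabyte download in the middle of an agent run into a deployment error that surfaces before any task is accepted.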
One configuration issue worth fixing now: model pre-warming
Many production local-model agent setups can trace roughly 80% of their cold-start latency to a single fixable issue: the model isn't loaded when the first request arrives. By default, Ollama loads a model on first request and evicts it after 5 minutes of inactivity. For any service where agent tasks arrive on a schedule or are triggered by webhooks, add a startup pre-warm and a health probe that pings the model every 4 minutes:
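```python
import threading
import time

import httpx


def prewarm_model(model: str, base_url: str = "http://localhost:11434") -> None:
    """
    Load the model into VRAM at service startup and keep it warm.
    Runs keep-alive pings in a background thread.
    """
    client = httpx.Client(base_url=base_url, timeout=300.0)

    # Initial load: an empty prompt loads the model without generating.
    # A negative keep_alive asks Ollama to keep the model loaded indefinitely.
    resp = client.post("/api/generate", json={
        "model": model,
        "prompt": "",
        "stream": False,
        "keep_alive": -1,
    })
    resp.raise_for_status()  # surface a failed load instead of logging success
    print(f"[prewarm] {model} loaded into VRAM")

    def ping_loop() -> None:
        while True:
            time.sleep(240)  # every 4 minutes
            try:
                client.post("/api/generate", json={
                    "model": model,
                    "prompt": " ",
                    "stream": False,
                    "keep_alive": "10m",
                    # Ollama caps output length via options.num_predict
                    "options": {"num_predict": 1},
                })
            except Exception:
                pass  # ping failure is non-fatal; model reloads on next real call

    t = threading.Thread(target=ping_loop, daemon=True)
    t.start()


# Call once at service startup:
# prewarm_model("llama3.1:70b")
```
The keep_alive of -1 on the initial load pins the model indefinitely. Ollama accepts a negative number or a negative duration string such as "-1m"; a bare "-1" string has no duration unit and may be rejected. The background ping thread then re-sends a "10m" keep_alive every 4 minutes, as a belt-and-suspenders approach for cases where the pin doesn't behave as expected across Ollama versions. Because each request carries its own keep_alive, the pings may replace the indefinite pin with a rolling 10-minute timer; either way the model stays resident as long as the pings keep succeeding.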
FAQ
Does llama.cpp ever signal context truncation, or is it always silent?
Always silent in the server HTTP API (as of llama.cpp build b3941). The underlying engine logs a warning to stderr when context is exceeded (llama_decode: input is too long (NNNN tokens, max MMMM)), but this does not propagate to the HTTP response — the API call succeeds, the response body contains the model's output from the truncated context, and eval_count in the response reflects only the tokens generated, not the tokens dropped. The only reliable guard is client-side token estimation before the call. If you're running llama-cpp-python bindings directly (not via HTTP), the Llama class raises ValueError: Prompt is too long when context is exceeded — this exception is catchable.
How do I get accurate token counts for llama.cpp without calling the API?
llama-server exposes a /tokenize endpoint that takes a JSON body with the text in a content field and returns the token IDs produced by the loaded model's tokenizer; the length of that list is the exact token count. This is slower than a character-based estimate but exact. For agents where accuracy matters more than speed, call POST /tokenize with the JSON-serialized messages string before each inference call. The character-based estimate (3.0 chars/token, as in the guard above) is accurate to within ±20% for English/code mixed content, which is conservative enough that the 80% warn threshold catches approaching limits before truncation. For non-Latin scripts or heavily structured JSON outputs, calibrate the chars-per-token ratio against a few /tokenize calls on representative data.
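A minimal sketch of that exact-count path with a fallback to the character estimate, assuming llama-server is listening on `http://localhost:8080` and that `/tokenize` accepts `{"content": ...}` and returns `{"tokens": [...]}`. The function names and the injectable `post` parameter are hypothetical, added so the logic can be exercised without a running server; `urllib` is used only to keep the example self-contained.

```python
import json
import math
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON body and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def estimate_tokens(text: str, chars_per_token: float = 3.0) -> int:
    """Character-based estimate; rounds up so the guard errs on the safe side."""
    return math.ceil(len(text) / chars_per_token)


def count_tokens(text: str,
                 base_url: str = "http://localhost:8080",
                 post: Callable[[str, dict], dict] = post_json) -> int:
    """Exact count via /tokenize, falling back to the estimate on any failure."""
    try:
        body = post(f"{base_url}/tokenize", {"content": text})
        return len(body["tokens"])
    except (OSError, KeyError, ValueError):
        # Server unreachable or unexpected response shape: use the estimate
        return estimate_tokens(text)
```

To calibrate `chars_per_token` for a specific workload, divide `len(text)` by `count_tokens(text)` over a handful of representative prompts and use the smallest ratio observed.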
Will these guards work with the Ollama OpenAI-compatible endpoint (/v1/chat/completions)?
Partially. The /v1/chat/completions endpoint does not return Ollama-specific fields like eval_count, eval_duration, and load_duration — those are Ollama native API fields. The throughput and cold-start detection guards in this post rely on those fields and will not work through the OpenAI-compatible layer. The TruncationAwareToolLoop and LocalInferenceBudget guards work through any endpoint because they measure client-side metrics (message JSON size, elapsed wall time). If you need throughput monitoring through the OpenAI-compatible path, measure total wall time per call and use that as a cold-start proxy instead of load_duration.
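The wall-time proxy can be sketched as a comparison against a rolling baseline of warm calls. This is a rough heuristic under stated assumptions, not an equivalent of load_duration: the class name `WallTimeColdStartProxy` and its parameters are hypothetical, the 2-second margin simply mirrors the load_duration threshold in the defaults table, and calls with unusually long outputs will also exceed the baseline.

```python
import statistics
from collections import deque


class WallTimeColdStartProxy:
    """Flags calls whose wall time exceeds the warm-call median by a margin."""

    def __init__(self, margin_s: float = 2.0, window: int = 20, min_samples: int = 3):
        self.margin_s = margin_s
        self.min_samples = min_samples
        self._warm = deque(maxlen=window)  # wall times of calls judged warm

    def observe(self, wall_s: float) -> bool:
        """Record one call's wall time; return True if it looks like a cold start."""
        if len(self._warm) < self.min_samples:
            # Not enough history to judge; treat as baseline material
            self._warm.append(wall_s)
            return False
        baseline = statistics.median(self._warm)
        suspected = (wall_s - baseline) > self.margin_s
        if not suspected:
            self._warm.append(wall_s)  # only warm calls update the baseline
        return suspected


# Usage with any client: time the call, then feed the elapsed seconds in.
# start = time.monotonic()
# response = client.chat.completions.create(...)
# if proxy.observe(time.monotonic() - start):
#     cold_starts += 1
```

One limitation worth knowing: the first `min_samples` calls are never flagged, so a cold start on the very first request goes undetected. Capping output length on those calls, or pre-warming at startup, keeps the baseline honest.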
What's the right quantization choice to avoid VRAM-related failure modes?
For a given hardware budget, pick the largest model that fits in VRAM with no CPU offloading — partial offloading triggers the throughput degradation that causes cold-start-cascade false positives and makes time budgets unreliable. On a 24 GB RTX 3090: Llama 3.1 70B Q2_K (~25 GB) does not fit cleanly; Mistral 7B Q8_0 (~7.7 GB) and Llama 3.1 8B Q6_K (~6.6 GB) fit comfortably with room for a 4096-token KV cache. On 40 GB (A100): Llama 3.1 70B Q4_K_M (~38 GB) fits with minimal headroom for KV cache — use n_ctx=2048 here unless you need long-context, since a 32768-context KV cache adds 10+ GB. Check model fit before starting production traffic: ollama run <model> with nvidia-smi in another terminal tells you the actual VRAM allocation immediately.
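The KV cache figure can be checked with simple arithmetic before loading anything. The sketch below assumes an fp16 cache and the published Llama 3.1 70B architecture (80 layers, 8 key/value heads under grouped-query attention, head dimension 128); those numbers, the function names, and the 1 GiB `overhead_gib` allowance are assumptions for illustration, and the weight sizes are treated loosely as GiB.

```python
GIB = 1024 ** 3

# Assumed architecture values for Llama 3.1 70B (from the published model config)
LLAMA_3_1_70B = {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128}


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for a full context window.

    The factor of 2 covers the separate key and value tensors per layer;
    bytes_per_elem=2 corresponds to an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


def fits_in_vram(weights_gib: float, kv_bytes: int, vram_gib: float,
                 overhead_gib: float = 1.0) -> bool:
    """Rough fit check: weights + KV cache + a fixed runtime allowance."""
    return weights_gib + kv_bytes / GIB + overhead_gib <= vram_gib


kv_32k = kv_cache_bytes(32768, **LLAMA_3_1_70B)  # 10 GiB
kv_2k = kv_cache_bytes(2048, **LLAMA_3_1_70B)    # 0.625 GiB
```

Under these assumptions a 32768-token cache costs 10 GiB, which pushes a ~38 GB Q4_K_M load well past a 40 GB card, while `n_ctx=2048` adds well under 1 GiB and leaves the model resident on the GPU. The estimate is a pre-check only; `nvidia-smi` remains the source of truth for actual allocation.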
Can I run the LocalAgentBreaker against llama-cpp-python bindings directly instead of Ollama?
Yes, with minor adaptation. Replace the httpx call in LocalAgentBreaker.chat() with a llama.create_chat_completion() call. The TruncationAwareToolLoop, LocalInferenceBudget, and the OOM subprocess guard all work identically. Replace OllamaSessionManager with a simple in-process guard: llama-cpp-python keeps the model loaded in the Llama object as long as the object is alive, so cold-start cascade isn't a concern within a single process — but if your orchestrator creates a new Llama() instance per task, model loading cost reappears and the timing approach from OllamaThroughputGuard applies equally.
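To avoid paying the load cost once per task, the orchestrator can share one Llama object per model instead of constructing a new one each time. A minimal sketch, assuming llama-cpp-python is installed: `get_llama` and the injectable `factory` parameter are hypothetical names, while `Llama(model_path=..., n_ctx=..., n_gpu_layers=-1)` and `create_chat_completion()` are the library's own calls.

```python
import threading
from typing import Any, Callable, Dict, Tuple

_instances: Dict[Tuple[str, int], Any] = {}
_lock = threading.Lock()


def _default_factory(model_path: str, n_ctx: int) -> Any:
    # Imported lazily so this module loads even where llama_cpp is absent
    from llama_cpp import Llama
    # n_gpu_layers=-1 requests full GPU offload of all layers
    return Llama(model_path=model_path, n_ctx=n_ctx, n_gpu_layers=-1)


def get_llama(model_path: str, n_ctx: int = 4096,
              factory: Callable[[str, int], Any] = _default_factory) -> Any:
    """Return a process-wide Llama instance, loading it at most once per key."""
    key = (model_path, n_ctx)
    with _lock:
        if key not in _instances:
            _instances[key] = factory(model_path, n_ctx)
        return _instances[key]


# Inside the agent loop (path is illustrative):
# llm = get_llama("/models/llama-3.1-8b-q6_k.gguf")
# out = llm.create_chat_completion(messages=messages, max_tokens=512)
```

The lock only protects the cache itself. Whether a single Llama object can safely serve concurrent inference calls is a separate question, so serializing calls per instance is the conservative choice.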