Apple MLX / Core ML Agent Cost Control: Model Reload Loops, Metal Shader Compilation Storms, Thermal Throttle Retry Spirals, and KV Cache Overflow
Apple MLX and Core ML bring LLM inference onto Apple Silicon hardware — M-series Macs, iPhone, iPad — using the device's unified memory architecture where CPU and GPU share the same physical RAM pool. Running agents on-device avoids cloud API metering entirely: there is no per-token invoice, no rate limit from a provider, and no data leaving the device. This makes on-device inference attractive for privacy-sensitive workflows, offline-capable agents, and edge deployments where latency or connectivity matter.
The absence of per-token billing creates a false sense of cost-free operation. On-device inference has its own cost surface: wasted compute cycles, latency spikes that trigger agent retries, memory pressure that degrades generation throughput, and thermal conditions that throttle the very hardware you are relying on. An agent that calls into MLX or Core ML without instrumentation can burn through minutes of wall-clock time and gigabytes of unified memory on work that never produces a useful result, while the operator sees only a hung process and no invoice to investigate.
Four failure modes are specific to MLX and Core ML agentic pipelines:
- Model reload loop on every agent invocation: When the model weights are loaded fresh from disk on each call to the agent, a 7B model on MLX takes 8–15 seconds per load on M2 hardware. An agent loop with 10 tool calls that naively instantiates the model per turn wastes 80–150 seconds of serial load time on top of the actual generation work.
- Metal shader compilation blocking inference on cold cache: MLX compiles Metal compute shaders for its graph operations the first time a given operation shape runs. On a cold cache, a single forward pass through a 7B model can trigger 200–400 shader compilations, blocking inference for 20–60 seconds before the first token streams. Agents that spawn fresh Python processes per task (a common pattern in LangChain tool wrappers) hit this cold-cache penalty on every invocation.
- Thermal throttle retry spiral: Sustained inference on Apple Silicon triggers the thermal management subsystem, which reduces CPU/GPU clock speeds to stay within the device's thermal envelope. Agents that set latency-based timeouts (e.g. 30 seconds per generation) will time out during throttle events, retry the request, and drive the device deeper into throttle, producing a spiral where each retry is slower than the last.
- KV cache overflow on unified RAM: MLX keeps the KV cache in unified memory. Multi-turn agent sessions that accumulate long conversation histories grow the KV cache linearly with context length. At the roughly 0.5 MB per token derived under failure mode 4, a 7B model reaches approximately 16 GB of KV cache at a 32K-token context, which exhausts the RAM on a 16 GB M2 MacBook Air before the model weights themselves are counted. The OS then pressure-kills the process or falls back to swap, with generation speeds of 0.5–2 tok/s.
Failure mode 1: Model reload loop on every agent invocation
MLX loads model weights from disk using mlx_lm.load(), which reads the model's safetensors shards, allocates unified memory, and initializes the model graph. For a 7B model, the weight files total approximately 14 GB at 16-bit precision and about 4 GB at 4-bit quantization; on M2 hardware, a cold load from disk takes 8–15 seconds depending on precision and SSD read throughput. A warm load (files cached in the OS page cache from a prior load) takes 2–5 seconds because the data is copied from RAM into the model's unified memory allocation rather than read from NVMe, but it is still not free.
The failure mode is an agent loop that instantiates the model as part of a tool function or as a local variable inside a function that is called repeatedly. In LangChain-based agents, this pattern arises when the MLX inference call is wrapped in a BaseTool that initializes the model in its _run() method. In direct agent loops, it arises when the model is instantiated inside the loop body rather than before it. The per-call overhead grows linearly with the number of agent turns:
| Model | Cold load (NVMe) | Warm load (page cache) | 10-turn loop overhead (warm) |
|---|---|---|---|
| Llama 3.2 3B (Q4) | ~3s | ~0.8s | ~8s wasted |
| Llama 3.2 7B (Q4) | ~8s | ~2.2s | ~22s wasted |
| Mistral 7B (Q4) | ~9s | ~2.4s | ~24s wasted |
| Llama 3.1 8B (BF16) | ~15s | ~4.1s | ~41s wasted |
| Mixtral 8×7B (Q4) | ~38s | ~11s | ~110s wasted |
The fix is to load the model once at process startup and pass the loaded model and tokenizer into the agent loop. mlx_lm.load() returns a (model, tokenizer) tuple; both are ordinary Python objects that can be stored in a module-level singleton, a dependency-injected context, or a process-global cache. The ModelCache guard below implements the singleton pattern with a path-keyed cache, so each distinct model path is loaded once and held as its own entry.
from mlx_lm import load, generate


def run_agent_turn_unsafe(prompt: str, model_path: str) -> str:
    """Loads the model on every call: 2–15 seconds of wasted load time per turn."""
    model, tokenizer = load(model_path)  # BUG: reloads weights on every turn; load once outside the loop
    return generate(model, tokenizer, prompt=prompt, max_tokens=512)
from __future__ import annotations

import threading
import time
from dataclasses import dataclass
from typing import Optional

import mlx.core as mx
from mlx_lm import load, generate


@dataclass
class _CachedEntry:
    model: object
    tokenizer: object
    load_time_s: float


class ModelCache:
    """Thread-safe singleton cache for MLX model/tokenizer pairs.

    Usage:
        cache = ModelCache()
        model, tokenizer = cache.get("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
        # subsequent calls with the same path return the cached pair instantly
    """

    _instance: Optional[ModelCache] = None
    _lock: threading.Lock = threading.Lock()

    def __new__(cls) -> ModelCache:
        # Double-checked locking so concurrent first calls build one instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    inst = super().__new__(cls)
                    inst._cache = {}  # model_path -> _CachedEntry
                    inst._cache_lock = threading.Lock()
                    cls._instance = inst
        return cls._instance

    def get(self, model_path: str) -> tuple:
        with self._cache_lock:
            entry = self._cache.get(model_path)
            if entry is not None:
                print(f"[RunGuard] ModelCache HIT: {model_path} (loaded in {entry.load_time_s:.1f}s)")
                return entry.model, entry.tokenizer

        # Load outside the lock so cache hits for other models are not blocked.
        t0 = time.monotonic()
        print(f"[RunGuard] ModelCache MISS: loading {model_path}")
        model, tokenizer = load(model_path)
        mx.eval(model.parameters())  # force weight materialization before timing stops
        elapsed = time.monotonic() - t0
        print(f"[RunGuard] ModelCache: loaded in {elapsed:.1f}s, caching.")

        with self._cache_lock:
            # If another thread finished loading the same path first, keep its entry.
            entry = self._cache.setdefault(
                model_path, _CachedEntry(model, tokenizer, elapsed)
            )
        return entry.model, entry.tokenizer

    def evict(self, model_path: str) -> bool:
        with self._cache_lock:
            if model_path in self._cache:
                del self._cache[model_path]
                return True
            return False

    @property
    def loaded_models(self) -> list[str]:
        with self._cache_lock:
            return list(self._cache.keys())


# Module-level singleton: import this and call .get() from anywhere
_model_cache = ModelCache()


def guarded_generate(
    model_path: str,
    prompt: str,
    max_tokens: int = 512,
    **kwargs,
) -> str:
    """Drop-in replacement for mlx_lm.generate() with reload-loop prevention."""
    model, tokenizer = _model_cache.get(model_path)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens, **kwargs)
Memory note: Caching the model means its unified memory allocation stays live for the process lifetime. On a 16 GB M2, a 7B Q4 model occupies ~4 GB of unified memory. If your agent orchestrator spawns multiple models — e.g. a router model and a generation model — sum their weights to verify you stay within the device's unified memory budget before adding the KV cache allocation (failure mode 4 below).
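The budget check described in the memory note can be sketched as a small helper. This is a minimal illustration, not part of MLX: the function name `fits_memory_budget`, the 0.75 headroom fraction, and the example model sizes are all assumptions chosen for the example.

```python
def fits_memory_budget(
    weights_gb: list[float],
    total_ram_gb: float,
    kv_cache_gb: float = 0.0,
    max_ram_fraction: float = 0.75,  # assumed headroom policy, not an MLX setting
) -> tuple[bool, float]:
    """Return (fits, remaining_gb) for all resident models plus the KV cache."""
    budget_gb = total_ram_gb * max_ram_fraction
    used_gb = sum(weights_gb) + kv_cache_gb
    return used_gb <= budget_gb, budget_gb - used_gb


# Hypothetical setup: a ~2 GB router model and a ~4 GB 7B Q4 generator on a 16 GB device,
# with 4 GB set aside for the KV cache.
fits, remaining_gb = fits_memory_budget([2.0, 4.0], total_ram_gb=16.0, kv_cache_gb=4.0)
print(fits, remaining_gb)  # True 2.0
```

Running the check at startup, before any model is loaded, turns an out-of-memory surprise mid-session into a configuration error that is visible immediately.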
Failure mode 2: Metal shader compilation blocking inference on cold cache
MLX compiles Metal compute shaders the first time it encounters a given operation shape (matrix dimensions, dtype, layout). These compiled shaders are cached in the MLX shader cache directory (~/.cache/mlx/) and reused on subsequent runs. On a warm cache — typical when the same model runs repeatedly in a long-lived process — shader compilation is invisible. On a cold cache — a fresh process, a new model, a changed quantization scheme, or a macOS version upgrade that invalidates cached shaders — MLX must compile every operation variant it needs before the first token can generate.
For a 7B model at 4-bit quantization, the cold-cache compilation pass spans 200–400 distinct operation shapes. Each shape takes 50–300 ms to compile on M2 hardware. Total cold-cache latency ranges from 20 seconds (fast M3 Pro) to 60+ seconds (M1 with many unique shapes). The compilation does not parallelize: shapes compile sequentially as they are encountered during the forward pass.
The failure mode in agentic systems is any deployment pattern that spawns a new Python process per request. LangChain agents that shell out to an MLX inference script, FastAPI servers that run gunicorn workers with pre-fork model loading disabled, and CI pipeline agents that run each tool invocation in an isolated container all pay the cold-cache penalty on every call. A 10-tool agent loop with subprocess isolation burns 200–600 seconds of compilation time before any useful inference occurs.
import time

import mlx.core as mx
from mlx_lm import load, generate


def warm_shader_cache(model_path: str, tokenizer_obj=None, model_obj=None) -> dict:
    """
    Run a minimal warmup pass to trigger Metal shader compilation.

    Call this once at process startup before accepting requests.
    Returns timing diagnostics so the caller can decide whether to
    reject requests during warmup or simply log the cold-cache latency.
    """
    if model_obj is None or tokenizer_obj is None:
        model_obj, tokenizer_obj = load(model_path)

    t_compile_start = time.monotonic()
    # A short prompt with a small max_tokens triggers all core operation shapes
    # without wasting time on a long generation. 16 tokens is enough to exercise
    # the attention, MLP, and norm layers for all sequence positions up to that length.
    warmup_prompt = "Hello"
    _ = generate(
        model_obj,
        tokenizer_obj,
        prompt=warmup_prompt,
        max_tokens=16,
        verbose=False,
    )
    mx.synchronize()  # flush Metal command queue before timing
    t_compile_end = time.monotonic()

    compile_duration = t_compile_end - t_compile_start
    is_cold = compile_duration > 5.0
    diagnostics = {
        "warmup_duration_s": round(compile_duration, 2),
        "cache_status": "cold" if is_cold else "warm",
        "model_path": model_path,
    }
    print(
        f"[RunGuard] Shader warmup: {compile_duration:.1f}s "
        f"({'COLD CACHE: compilation occurred' if is_cold else 'warm cache: fast'})"
    )
    return diagnostics


class ShaderWarmGuard:
    """Blocks inference requests until the shader cache is warm.

    Usage at startup:
        guard = ShaderWarmGuard(model_path, model, tokenizer)
        guard.warm()  # blocks until warmup completes
        # then proceed to accept requests
    """

    def __init__(self, model_path: str, model, tokenizer, timeout_s: float = 120.0):
        self._model_path = model_path
        self._model = model
        self._tokenizer = tokenizer
        self._timeout_s = timeout_s
        self._warm = False
        self._warmup_stats: dict = {}

    def warm(self) -> dict:
        if self._warm:
            return self._warmup_stats
        stats = warm_shader_cache(
            self._model_path,
            model_obj=self._model,
            tokenizer_obj=self._tokenizer,
        )
        # The timeout is checked after the warmup pass returns; it does not interrupt it.
        if stats["warmup_duration_s"] > self._timeout_s:
            raise RuntimeError(
                f"[RunGuard] Shader warmup exceeded {self._timeout_s}s timeout "
                f"({stats['warmup_duration_s']:.1f}s). "
                f"Possible cause: MLX shader cache invalidated after OS/MLX upgrade. "
                f"Delete ~/.cache/mlx/ and retry to rebuild from scratch."
            )
        self._warm = True
        self._warmup_stats = stats
        return stats

    def assert_warm(self):
        if not self._warm:
            raise RuntimeError(
                "[RunGuard] ShaderWarmGuard.warm() has not completed. "
                "Call warm() at startup before accepting inference requests."
            )
Cache path note: The MLX shader cache lives at ~/.cache/mlx/ on macOS. It is keyed on operation shape, dtype, and MLX version — not on model path. A single warmup pass for one model populates shapes shared by all models of the same architecture and quantization. Two Llama 3 models at 4-bit share most cached shapes; switching from Q4 to BF16 requires a separate warmup pass because the quantized matmul kernels differ from the full-precision kernels.
Failure mode 3: Thermal throttle retry spiral
Apple Silicon's power management reduces CPU and GPU clock speeds when the die temperature exceeds the device's thermal target. Sustained LLM inference keeps the GPU under continuous load (MLX runs its kernels on the GPU through Metal; Core ML may also schedule work on the Neural Engine), so fanless MacBook Air models begin throttling within 2–5 minutes of continuous generation. MacBook Pro models (active cooling) sustain higher throughput but still throttle under multi-hour workloads. The throttle is proportional: a device generating 35 tok/s at nominal clocks might fall to 12–18 tok/s under sustained load.
The failure mode is a latency-based timeout in the agent loop. Agents that set timeout=30 on their inference call will see generation stall as the device throttles: 512 tokens at 35 tok/s completes in about 14.6 seconds, but at 14 tok/s (throttled) it takes about 36.6 seconds, exceeding the timeout. The agent catches the timeout as an error, logs it as a transient failure, and retries. The retry runs another full generation pass on already-hot hardware, driving the temperature higher. The third retry might see 8–10 tok/s. Each retry worsens the condition it is trying to escape.
| Generation attempt | Clock state | Tok/s (7B Q4, M2 Air) | 512-token generation time | 30s timeout result |
|---|---|---|---|---|
| 1 (nominal) | Full speed | ~35 | ~14.6s | Success |
| 2 (warming) | Mild throttle | ~22 | ~23.3s | Success |
| 3 (throttled) | Moderate throttle | ~14 | ~36.6s | TIMEOUT → retry |
| 4 (hot retry) | Heavy throttle | ~9 | ~56.9s | TIMEOUT → retry |
| 5 (spiral) | Maximum throttle | ~6 | ~85.3s | TIMEOUT → retry |
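The arithmetic behind the table can be reproduced with a few lines of Python. This is only a sketch of the calculation: the function name `simulate_retry_spiral` is illustrative, and the tok/s values are the table's own figures rather than fresh measurements.

```python
def simulate_retry_spiral(
    tok_s_by_attempt: list[float],
    max_tokens: int = 512,
    timeout_s: float = 30.0,
) -> list[tuple[int, float, bool]]:
    """Return (attempt, generation_time_s, timed_out) for each attempt."""
    rows = []
    for attempt, tok_s in enumerate(tok_s_by_attempt, start=1):
        gen_time_s = max_tokens / tok_s
        rows.append((attempt, round(gen_time_s, 1), gen_time_s > timeout_s))
    return rows


rows = simulate_retry_spiral([35, 22, 14, 9, 6])
for attempt, gen_time_s, timed_out in rows:
    print(attempt, gen_time_s, "TIMEOUT" if timed_out else "ok")

# Each timed-out attempt burns the full timeout before the agent gives up on it.
wasted_s = sum(30.0 for _, _, timed_out in rows if timed_out)
print(wasted_s)  # 90.0
```

With these figures, attempts 3 to 5 all exceed the 30-second timeout, so the agent spends 90 seconds on hot hardware without receiving a single completed response.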
The fix has two components: a ThermalThrottleGuard that measures per-token throughput and detects when it falls below a configurable floor (signaling throttle onset), and a backoff policy that pauses generation — allowing the device to cool — before retrying. On MacBook Air models, a 90-second pause after detecting throttle typically recovers 60–80% of nominal throughput. Retrying immediately recovers nothing and extends the throttle event.
import threading
import time
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class ThermalThrottleGuard:
    """Detects thermal throttle via tok/s drop and enforces cooling backoff.

    Wrap around any streaming generation that exposes a token iterator.
    If throughput drops below throttle_floor_tok_s, the guard trips and
    the next call to guarded_generate_with_thermal() will pause for
    cool_down_s before invoking the model.
    """

    nominal_tok_s: float = 30.0         # expected throughput on a cool device
    throttle_floor_tok_s: float = 15.0  # below this = throttle detected
    cool_down_s: float = 90.0           # pause duration when throttle detected
    max_retries: int = 2                # retry ceiling during a throttle event
    _throttle_detected: bool = field(default=False, init=False)
    _last_throttle_at: float = field(default=0.0, init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def measure_throughput(
        self,
        token_iterator: Iterator,
        collect: bool = True,
    ) -> tuple[list[str], float]:
        """
        Consume a streaming token iterator, measure tok/s, and record throttle state.

        Returns (tokens_list, tok_per_s). With collect=False the list stays empty
        and only the throughput is measured.
        """
        tokens: list[str] = []
        count = 0
        t0 = time.monotonic()
        for chunk in token_iterator:
            count += 1
            if collect:
                # Recent mlx_lm releases yield response objects with a .text field;
                # older ones yield plain strings. Accept both.
                tokens.append(getattr(chunk, "text", chunk))
        elapsed = max(time.monotonic() - t0, 1e-6)
        tok_s = count / elapsed

        with self._lock:
            if tok_s < self.throttle_floor_tok_s:
                self._throttle_detected = True
                self._last_throttle_at = time.monotonic()
                print(
                    f"[RunGuard] THERMAL THROTTLE detected: {tok_s:.1f} tok/s "
                    f"(floor={self.throttle_floor_tok_s}). "
                    f"Scheduling {self.cool_down_s}s cool-down before next call."
                )
            else:
                self._throttle_detected = False
                print(f"[RunGuard] Throughput OK: {tok_s:.1f} tok/s")
        return tokens, tok_s

    def wait_if_throttled(self):
        """Call before each generation attempt. Blocks for cool_down_s if throttled."""
        with self._lock:
            if not self._throttle_detected:
                return
            remaining = self.cool_down_s - (time.monotonic() - self._last_throttle_at)
        # Sleep outside the lock so other threads can still record throughput.
        if remaining > 0:
            print(f"[RunGuard] Cooling down: waiting {remaining:.0f}s before retry.")
            time.sleep(remaining)
        else:
            print("[RunGuard] Cool-down period elapsed, proceeding.")


def guarded_generate_with_thermal(
    model,
    tokenizer,
    prompt: str,
    max_tokens: int = 512,
    guard: Optional[ThermalThrottleGuard] = None,
    **generate_kwargs,
) -> tuple[str, dict]:
    """
    Generate with thermal throttle detection and cooling backoff.

    Returns (generated_text, diagnostics).
    """
    from mlx_lm import stream_generate

    if guard is None:
        guard = ThermalThrottleGuard()

    for attempt in range(guard.max_retries + 1):
        guard.wait_if_throttled()
        token_stream = stream_generate(
            model,
            tokenizer,
            prompt=prompt,
            max_tokens=max_tokens,
            **generate_kwargs,
        )
        tokens, tok_s = guard.measure_throughput(token_stream)
        text = "".join(tokens)
        diagnostics = {
            "attempt": attempt + 1,
            "tok_s": round(tok_s, 2),
            "tokens_generated": len(tokens),
            "throttle_detected": guard._throttle_detected,
        }
        # Accept the result if the run reached max_tokens or finished without
        # tripping the throttle floor; otherwise cool down and try again.
        if len(tokens) >= max_tokens or not guard._throttle_detected:
            return text, diagnostics

    # Exhausted retries under throttle: return the last result with a warning
    print(
        f"[RunGuard] THROTTLE RETRY CEILING reached after {guard.max_retries + 1} attempts. "
        f"Returning partial result at {tok_s:.1f} tok/s. "
        f"Consider reducing max_tokens or increasing cool_down_s."
    )
    return text, diagnostics
Failure mode 4: KV cache overflow on unified RAM
MLX allocates the KV cache in unified memory alongside the model weights. The KV cache size scales with context length, number of KV heads, head dimension, number of layers, and the precision of the cache entries. For a Llama-style 7B model with 32 attention heads that each keep their own K and V (no grouped-query attention; see the FAQ for the GQA case), a head dimension of 128, 32 layers, at bfloat16 precision:
KV cache per token = 2 (K and V) × 32 (heads) × 128 (head_dim) × 32 (layers) × 2 bytes (bf16) = 524,288 bytes ≈ 0.5 MB per token
A context of 8,192 tokens occupies ~4 GB of KV cache. At 32K tokens (a typical multi-turn agent conversation with tool outputs), the KV cache alone requires 16 GB, which exceeds the total unified memory on a 16 GB M2 MacBook Air before the model weights (4 GB for 7B Q4) are counted. At Llama 3.1's 131K context ceiling, the same per-token figure gives a theoretical KV cache of 64 GB, which fits only on the highest-memory Apple Silicon configurations and is far out of reach for laptops with 8–32 GB.
| Context tokens | KV cache (7B, BF16) | Total w/ 7B Q4 weights (~4 GB) | Fits on 8 GB | Fits on 16 GB | Fits on 32 GB |
|---|---|---|---|---|---|
| 2,048 | ~1 GB | ~5 GB | Yes | Yes | Yes |
| 8,192 | ~4 GB | ~8 GB | Marginal | Yes | Yes |
| 16,384 | ~8 GB | ~12 GB | No | Yes | Yes |
| 32,768 | ~16 GB | ~20 GB | No | No | Yes |
| 65,536 | ~32 GB | ~36 GB | No | No | No (swap) |
When the KV cache exhausts unified memory, macOS invokes memory pressure response: it pages unified memory to swap on the internal SSD. NVMe swap for unified memory is dramatically slower than RAM — generation throughput falls from 30+ tok/s to 0.5–2 tok/s as each attention head lookup misses unified memory and incurs an NVMe read. The agent perceives this as severe latency, may time out and retry (compounding failure mode 3), and the retry starts a new generation with a context that is even longer because the timeout occurred mid-generation.
The fix is a KVCacheGuard that tracks the accumulated context length across agent turns and reports when starting a new generation would push the projected KV cache size past a configurable fraction of available unified memory. The guard itself only signals the condition: when check() returns False, the caller is expected to compress the conversation history (for example, with a summarization pass) and reset the guard's token count before the next generation call.
import re
import subprocess
from dataclasses import dataclass, field
from typing import Optional


def get_unified_memory_gb() -> float:
    """Query total unified memory via system_profiler on macOS."""
    try:
        out = subprocess.check_output(
            ["system_profiler", "SPHardwareDataType"],
            text=True, timeout=5,
        )
        match = re.search(r"Memory:\s+([\d.]+)\s*GB", out)
        if match:
            return float(match.group(1))
    except Exception:
        pass  # non-macOS host, missing binary, or timeout: fall through
    return 16.0  # assumed fallback; override on devices with less memory


def estimate_kv_cache_gb(
    context_tokens: int,
    num_heads: int = 32,
    head_dim: int = 128,
    num_layers: int = 32,
    bytes_per_element: int = 2,  # bfloat16
) -> float:
    """Estimate KV cache size in GB for a given context length."""
    bytes_per_token = 2 * num_heads * head_dim * num_layers * bytes_per_element
    return (context_tokens * bytes_per_token) / (1024 ** 3)


@dataclass
class KVCacheGuard:
    """Tracks accumulated context tokens and enforces a unified RAM budget.

    model_weights_gb: weight of the loaded model in unified memory
    max_ram_fraction: fraction of total unified memory to allow for weights + KV cache
    num_heads: number of KV heads (32 without GQA; 8 for GQA models such as Llama 3.x)
    head_dim: attention head dimension (128 for Llama 3.x)
    num_layers: transformer layers (32 for 7B)
    """

    model_weights_gb: float = 4.0   # 7B Q4
    max_ram_fraction: float = 0.75  # leave 25% headroom for OS + other processes
    num_heads: int = 32
    head_dim: int = 128
    num_layers: int = 32
    _context_tokens: int = field(default=0, init=False)
    _total_ram_gb: float = field(default_factory=get_unified_memory_gb, init=False)

    @property
    def budget_gb(self) -> float:
        return self._total_ram_gb * self.max_ram_fraction - self.model_weights_gb

    @property
    def current_kv_gb(self) -> float:
        return estimate_kv_cache_gb(
            self._context_tokens,
            self.num_heads,
            self.head_dim,
            self.num_layers,
        )

    @property
    def max_safe_tokens(self) -> int:
        # Same per-token formula as estimate_kv_cache_gb, at 2 bytes per element (bf16).
        bytes_per_token = 2 * self.num_heads * self.head_dim * self.num_layers * 2
        # Clamp at zero: the budget is negative when the weights alone exceed it.
        return max(0, int((self.budget_gb * 1024 ** 3) / bytes_per_token))

    def add_tokens(self, token_count: int):
        self._context_tokens += token_count
        print(
            f"[RunGuard] KV context: {self._context_tokens} tokens "
            f"({self.current_kv_gb:.2f} GB / {self.budget_gb:.2f} GB budget)"
        )

    def check(self, new_tokens: int = 0) -> tuple[bool, Optional[str]]:
        """
        Returns (safe_to_proceed, warning_message_or_None).

        Call before each generation with the expected new prompt token count.
        """
        projected = self._context_tokens + new_tokens
        projected_gb = estimate_kv_cache_gb(
            projected, self.num_heads, self.head_dim, self.num_layers
        )
        if projected_gb > self.budget_gb:
            return False, (
                f"[RunGuard] KV CACHE BUDGET EXCEEDED: projected {projected_gb:.2f} GB "
                f"would exceed {self.budget_gb:.2f} GB budget "
                f"(device has {self._total_ram_gb:.0f} GB unified memory, "
                f"{self.max_ram_fraction*100:.0f}% reserved for inference). "
                f"Context must be compressed before the next generation. "
                f"Max safe context: {self.max_safe_tokens} tokens, "
                f"current: {self._context_tokens} tokens."
            )
        return True, None

    def reset(self):
        self._context_tokens = 0
        print("[RunGuard] KV context reset to 0.")


@dataclass
class MLXAgentPolicy:
    """Composite policy applying all four MLX/CoreML guards from one config object."""

    model_path: str
    nominal_tok_s: float = 30.0
    throttle_floor_tok_s: float = 15.0
    cool_down_s: float = 90.0
    max_thermal_retries: int = 2
    model_weights_gb: float = 4.0
    max_ram_fraction: float = 0.75
    kv_num_heads: int = 32
    kv_head_dim: int = 128
    kv_num_layers: int = 32

    def build_cache(self) -> ModelCache:
        return ModelCache()

    def build_shader_guard(self, model, tokenizer) -> ShaderWarmGuard:
        return ShaderWarmGuard(self.model_path, model, tokenizer)

    def build_thermal_guard(self) -> ThermalThrottleGuard:
        return ThermalThrottleGuard(
            nominal_tok_s=self.nominal_tok_s,
            throttle_floor_tok_s=self.throttle_floor_tok_s,
            cool_down_s=self.cool_down_s,
            max_retries=self.max_thermal_retries,
        )

    def build_kv_guard(self) -> KVCacheGuard:
        return KVCacheGuard(
            model_weights_gb=self.model_weights_gb,
            max_ram_fraction=self.max_ram_fraction,
            num_heads=self.kv_num_heads,
            head_dim=self.kv_head_dim,
            num_layers=self.kv_num_layers,
        )
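KVCacheGuard.check() only reports that the budget would be exceeded, so the compression step has to live in the caller. One possible shape for it is sketched here: the newest turns are kept verbatim and older turns are folded into a single summary turn. The names `compress_history`, `count_tokens`, and `summarize` are hypothetical; in a real agent the token counter would be the model's tokenizer and the summarizer would be a model call.

```python
from typing import Callable


def compress_history(
    turns: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the newest turns that fit in max_tokens; fold the rest into one summary turn."""
    if sum(count_tokens(t) for t in turns) <= max_tokens:
        return list(turns)

    kept: list[str] = []
    used = 0
    # Walk backwards so the most recent turns survive verbatim.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()

    older = turns[: len(turns) - len(kept)]
    return ([summarize(older)] if older else []) + kept


# Stand-ins so the sketch runs without a model: whitespace token count, canned summary.
def count_words(text: str) -> int:
    return len(text.split())


def stub_summary(old_turns: list[str]) -> str:
    return f"[summary of {len(old_turns)} earlier turns]"


history = ["a b c d", "e f g", "h i", "j k l m n"]
print(compress_history(history, max_tokens=8, count_tokens=count_words, summarize=stub_summary))
# ['[summary of 2 earlier turns]', 'h i', 'j k l m n']
```

The summary turn itself consumes tokens that this sketch does not budget for, so after compressing, the caller should re-count the new history, call reset() on the guard, and register the new total with add_tokens().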
Core ML note: Core ML's KV cache is managed differently — the compiled .mlpackage model specifies the maximum context window at compile time, and Core ML pre-allocates the full KV cache buffer at model load regardless of actual context used. A Core ML model compiled with context_length=4096 allocates the full 4096-token KV buffer immediately; there is no dynamic growth. The failure mode for Core ML agents is not overflow but over-allocation: compiling a model with a large context_length to handle rare edge cases wastes unified memory on every invocation. Compile multiple Core ML models at different context lengths and route agent turns to the smallest model that fits the current context.
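The routing idea in the note reduces to picking the smallest compiled context length that still holds the current conversation. The sketch below shows only that selection step, under the assumption that the application keeps a list of the context lengths it compiled; `pick_context_bucket` and the bucket sizes are illustrative names and values, and loading the matching .mlpackage is left to the application.

```python
from bisect import bisect_left


def pick_context_bucket(context_tokens: int, buckets: list[int]) -> int:
    """Return the smallest compiled context length that can hold context_tokens."""
    ordered = sorted(buckets)
    idx = bisect_left(ordered, context_tokens)  # first bucket >= context_tokens
    if idx == len(ordered):
        raise ValueError(
            f"context of {context_tokens} tokens exceeds the largest compiled bucket "
            f"({ordered[-1]}); compress the history first"
        )
    return ordered[idx]


compiled_lengths = [1024, 4096, 16384]  # assumed set of compiled context lengths
print(pick_context_bucket(300, compiled_lengths))   # 1024
print(pick_context_bucket(3000, compiled_lengths))  # 4096
```

Raising on overflow rather than silently choosing the largest bucket keeps the over-length case visible, which is where history compression should kick in.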
Frequently asked questions
Does the model cache persist across Python process restarts?
No. The ModelCache singleton lives in process memory and is destroyed when the process exits. The MLX shader cache at ~/.cache/mlx/ does persist across restarts — so a long-lived server process benefits from both caches, while a script that exits after each agent run only benefits from the shader cache on its second and subsequent runs. If your agent is called as a subprocess, consider converting it to a persistent server (FastAPI, gRPC, or even a simple Unix socket wrapper) so the model loads once and the shader cache stays warm.
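A persistent wrapper does not need a web framework. As a rough sketch, the loop below answers one JSON request per line on a pair of text streams, so a parent agent can keep a single worker process alive and write prompts to its stdin. The function name `serve_json_lines`, the request fields, and the stub generator are assumptions for illustration; in a real worker, `generate_fn` would close over a model and tokenizer loaded once at startup.

```python
import io
import json
from typing import Callable, TextIO


def serve_json_lines(
    requests: TextIO,
    responses: TextIO,
    generate_fn: Callable[[str, int], str],
) -> int:
    """Answer one JSON request per input line until the stream closes."""
    handled = 0
    for line in requests:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        request = json.loads(line)
        text = generate_fn(request["prompt"], int(request.get("max_tokens", 512)))
        responses.write(json.dumps({"text": text}) + "\n")
        responses.flush()  # the parent process is waiting on this line
        handled += 1
    return handled


# Stub generator so the sketch runs without MLX installed.
def fake_generate(prompt: str, max_tokens: int) -> str:
    return prompt.upper()[:max_tokens]


incoming = io.StringIO('{"prompt": "hello"}\n{"prompt": "tool result", "max_tokens": 4}\n')
outgoing = io.StringIO()
serve_json_lines(incoming, outgoing, fake_generate)
print(outgoing.getvalue())
```

In deployment the two streams would be sys.stdin and sys.stdout (or a socket file object), and the worker stays resident for the whole agent session.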
Can I use MLX on iOS or iPadOS for on-device agents?
The Python MLX stack described here targets macOS on Apple Silicon; on iOS and iPadOS, the usual paths are Core ML or the MLX Swift bindings. The KV cache overflow concern is more acute on iOS: A-series chips in iPhones have 6–8 GB of RAM, of which iOS reserves 1–2 GB for system use, leaving 4–6 GB for the app. A 3B Q4 model with weights (~1.6 GB) and a 4K context KV cache (~0.5 GB) fits on a modern iPhone; a 7B model does not. Always specify a context_length at Core ML compile time that fits within the device's available memory budget.
What is the right approach when the thermal throttle guard detects throttle but the agent cannot wait 90 seconds?
Reduce max_tokens for the throttled retry — a throttled device generating 8 tok/s can still produce a 128-token response in 16 seconds, well within typical timeout budgets. The ThermalThrottleGuard can be extended to reduce max_tokens proportionally with the measured throughput drop: if throughput falls to 30% of nominal, cap the next generation at 30% of the normal token budget. This trades response length for latency stability — often acceptable in interactive agent loops where the user expects shorter responses under load.
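The proportional cap can be written as a standalone helper. This is a sketch of the scaling rule only; the function name `scaled_max_tokens` and the 32-token floor are assumptions, and wiring it into ThermalThrottleGuard is left to the reader.

```python
def scaled_max_tokens(
    base_max_tokens: int,
    measured_tok_s: float,
    nominal_tok_s: float,
    min_tokens: int = 32,  # assumed floor so a response is never cut to nothing
) -> int:
    """Scale the token budget by the ratio of measured to nominal throughput."""
    ratio = max(0.0, min(1.0, measured_tok_s / nominal_tok_s))
    return max(min_tokens, int(base_max_tokens * ratio))


print(scaled_max_tokens(512, measured_tok_s=9.0, nominal_tok_s=30.0))   # 153
print(scaled_max_tokens(512, measured_tok_s=40.0, nominal_tok_s=30.0))  # 512
```

At 9 tok/s the 153-token budget takes about 17 seconds to generate, which stays inside a 30-second timeout where the full 512 tokens would not.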
Does MLX support LoRA adapters, and do they affect the reload cost?
Yes. MLX supports LoRA fine-tuning and inference via mlx_lm.lora. Loading a LoRA adapter adds a small overhead on top of the base model load — typically 0.1–0.5 seconds depending on adapter size — because the adapter weights are merged into the base model after weight loading. The ModelCache should key on both the base model path and the adapter path to ensure the merged weights are cached correctly: cache_key = f"{model_path}::{adapter_path or ''}". Different adapters of the same base model should be cached as separate entries because their merged weights differ.
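The composite key can be sketched independently of MLX by injecting the loader. Everything named here (`model_cache_key`, `AdapterAwareCache`, the stub loader, the example paths) is hypothetical. With mlx_lm, the loader would wrap the real load call; recent releases accept an `adapter_path` argument on `load()`, but treat that as an assumption to verify against the installed version.

```python
from typing import Callable, Optional


def model_cache_key(model_path: str, adapter_path: Optional[str] = None) -> str:
    """Combine the base model path and adapter path into one cache key."""
    return f"{model_path}::{adapter_path or ''}"


class AdapterAwareCache:
    """Hypothetical cache keyed on (model, adapter); the loader is injected."""

    def __init__(self, loader: Callable[[str, Optional[str]], object]) -> None:
        self._loader = loader
        self._entries: dict[str, object] = {}

    def get(self, model_path: str, adapter_path: Optional[str] = None) -> object:
        key = model_cache_key(model_path, adapter_path)
        if key not in self._entries:
            self._entries[key] = self._loader(model_path, adapter_path)
        return self._entries[key]


# Stub loader standing in for the real model load, so the sketch runs anywhere.
load_calls: list[tuple[str, Optional[str]]] = []


def stub_loader(model_path: str, adapter_path: Optional[str]) -> object:
    load_calls.append((model_path, adapter_path))
    return object()


cache = AdapterAwareCache(stub_loader)
base = cache.get("base-model")
tuned = cache.get("base-model", "adapters/support")
again = cache.get("base-model", "adapters/support")  # cache hit, no reload
print(len(load_calls), again is tuned, base is tuned)  # 2 True False
```

Keeping the base model and each adapter variant as separate entries costs memory per variant, so the eviction and budget checks from the earlier sections matter more once several adapters are in play.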
How does the KV cache estimate change for grouped-query attention (GQA) models like Llama 3?
Llama 3.x uses grouped-query attention with 8 KV heads (not 32) while maintaining 32 query heads. The KV cache size uses the KV head count, not the query head count: 2 × 8 (KV heads) × 128 (head_dim) × 32 (layers) × 2 bytes = 131,072 bytes ≈ 0.125 MB per token, 4× smaller than the naive 32-head calculation. A 32K context in Llama 3 8B occupies ~4 GB of KV cache, not 16 GB. Construct the guard as KVCacheGuard(num_heads=8) for GQA models to avoid over-conservative budget enforcement that would truncate context unnecessarily.
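The two per-token figures can be checked side by side with a few lines of arithmetic. This is a self-contained sketch; `kv_bytes_per_token` is an illustrative helper name, and the shapes are the ones quoted in this article.

```python
def kv_bytes_per_token(
    num_kv_heads: int,
    head_dim: int,
    num_layers: int,
    bytes_per_element: int = 2,  # bfloat16
) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_kv_heads * head_dim * num_layers * bytes_per_element


context_tokens = 32_768
mha = kv_bytes_per_token(num_kv_heads=32, head_dim=128, num_layers=32)
gqa = kv_bytes_per_token(num_kv_heads=8, head_dim=128, num_layers=32)

print(mha, gqa)                            # 524288 131072
print(mha * context_tokens / 1024 ** 3)    # 16.0 (GB with 32 KV heads)
print(gqa * context_tokens / 1024 ** 3)    # 4.0 (GB with 8 KV heads)
```

The only input that changes between the two rows is the KV head count, which is why reading it from the model's configuration matters more than any other parameter in the estimate.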