LLM agent resource cleanup cost patterns: the leaks that silently multiply your bill
Every unclosed HTTP connection, every unbounded in-process cache, every file descriptor pointing at a vector index that finished five minutes ago — these are not just operational annoyances. They are direct cost amplifiers. When an embedding cache grows until the OS kills the process, the agent restarts cold, replays its context, and sends 3,500 tokens it already processed. When a thread pool fills with stalled tool-call futures, the agent retries the whole task from scratch. When a connection pool is exhausted and the agent opens a raw database connection on every call, latency spikes enough to trigger timeout-induced re-runs. This page catalogs every major resource category that accumulates without explicit cleanup, quantifies the cost impact of each, and shows the Python and TypeScript patterns that eliminate them — with RunGuard hooks that catch cleanup failures before they cascade.
Five resource categories that accumulate without cleanup
- HTTP connections to LLM APIs. `httpx.AsyncClient`, `aiohttp.ClientSession`, and the underlying `openai.AsyncOpenAI` client all maintain a connection pool. If you instantiate a new client per agent run (a common pattern when agent objects are created per request in a web framework) and never call `await client.aclose()` (or `await session.close()` for aiohttp), each client holds one or more open TCP sockets until the OS decides to reclaim them. A 5-worker Uvicorn deployment in which each worker handles 50 concurrent agent runs will accumulate up to 250 open sockets pointing at `api.openai.com` or `api.anthropic.com`. On most Linux hosts the ephemeral port range spans 32,768–60,999, which is 28,232 ports. A sufficiently busy machine running stateless short-lived agents can exhaust this range within hours, causing `OSError: [Errno 99] Cannot assign requested address` on new LLM calls. The agent framework typically catches the error and retries, which consumes tokens for the re-plan, before failing permanently. Each forced re-plan costs a full context replay.
- Embedding caches growing without bounds. Retrieval-augmented agents cache embeddings in process memory to avoid re-computing vectors for documents they have seen. The cache is usually a plain Python `dict` keyed by document hash. Without an eviction policy the dict grows for the lifetime of the process. In production, a document-processing agent handling 50,000 unique chunks per day accumulates roughly 50,000 × 1,536 float32 values × 4 bytes ≈ 300 MB of raw vectors per day. After two weeks that is roughly 4 GB. When the container's memory limit triggers an OOM kill, the agent process is terminated mid-task. The orchestration layer restarts it cold. On restart, the agent has no in-memory state: it re-reads the task from the queue, re-fetches context, and sends a cold-start prompt that includes all the history tokens it had already processed. For a mid-task agent with 3,500 accumulated input tokens priced at $3.00 per million, that is 3,500 × $3.00 / 1,000,000 = $0.0105 per restart. At 200 OOM-induced restarts per day the cost is $2.10/day or $766/year in pure replay overhead.
- Vector search indexes opened but not released. Libraries like FAISS, Hnswlib, and Chroma open index files by memory-mapping or holding a file descriptor open for random-access reads. When agent code opens an index to run a similarity search inside a tool function without a corresponding close, each tool call increments the open file descriptor count by one. Most Linux distributions default to a soft limit of 1,024 file descriptors per process (`ulimit -n`). A busy agent that calls a vector search tool 50 times per task and handles 20 concurrent tasks opens up to 1,000 descriptors at once and will saturate the FD limit within seconds, causing `OSError: [Errno 24] Too many open files`. The agent cannot open new files (including log files, config files, or other tool dependencies) and begins failing in unpredictable ways that each cost a re-plan LLM call.
- Thread pool executors for async tool calls. When an async agent runs synchronous tool functions (database queries, file I/O, external subprocess calls) it typically wraps them in `asyncio.get_event_loop().run_in_executor(executor, fn)` using a `ThreadPoolExecutor`. If the executor is created inside the tool function rather than shared across the agent lifecycle, each tool call spawns a new thread pool. More insidiously: if a previous tool call timed out and the future was cancelled, the worker thread is still blocked waiting for the underlying I/O, because cancelling the future does not interrupt the thread. The executor is not garbage-collected because the thread holds a reference. Over time, the process accumulates dozens of blocked threads, each reserving up to 8 MB of stack space (the usual Linux default). Memory pressure builds, the OOM killer fires, and the agent restarts cold. Even before OOM, thread starvation causes new tool calls to queue behind stalled ones, increasing latency and triggering timeout-induced agent task duplication.
- Database connection pool starvation. Agents that query a database during tool calls must share a connection pool. When the agent framework creates a new `asyncpg.Pool` or SQLAlchemy `Engine` per agent instance and never disposes it, each instance holds its `min_size` idle connections open. A pool with `min_size=2` across 100 concurrent agent instances holds 200 idle database connections, twice PostgreSQL's default `max_connections=100`. New connections are refused. The agent catches the connection error and retries the query, but the retry also cannot acquire a connection. The task fails after N retries; the orchestrator re-queues it as a new task; the new task incurs a full cold-start token cost. Each forced full-task re-run duplicates the entire prompt cost for that task.
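One way to observe the leaks above before they exhaust a limit is to sample the process's own file descriptor and thread counts between tasks. The following is a minimal, Linux-oriented sketch rather than part of any agent framework: `ResourceSnapshot`, `open_fd_count`, `snapshot`, and `leaked_since` are hypothetical helper names, and the probe assumes `/proc/self/fd` is readable (it reports -1 where it is not).

```python
import os
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSnapshot:
    open_fds: int   # -1 when /proc/self/fd is unavailable
    threads: int


def open_fd_count() -> int:
    """Count this process's open file descriptors via /proc (Linux only)."""
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return -1
    # listdir briefly opens one descriptor for the directory itself,
    # so every reading carries the same constant offset.
    return len(os.listdir(fd_dir))


def snapshot() -> ResourceSnapshot:
    """Capture the current FD and thread counts."""
    return ResourceSnapshot(open_fds=open_fd_count(), threads=threading.active_count())


def leaked_since(before: ResourceSnapshot, after: ResourceSnapshot) -> dict[str, int]:
    """Return the growth in each counter between two snapshots."""
    if before.open_fds < 0 or after.open_fds < 0:
        fd_growth = 0  # FD counting unsupported on this platform
    else:
        fd_growth = after.open_fds - before.open_fds
    return {"fds": fd_growth, "threads": after.threads - before.threads}


if __name__ == "__main__":
    before = snapshot()
    leaked = open(os.devnull)  # simulate a tool that forgets to close a file
    print(leaked_since(before, snapshot()))
    leaked.close()
```

Taking a snapshot before and after each agent task and logging any non-zero growth is usually enough to identify which tool function is leaking, well before the 1,024-descriptor limit is reached.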
Quantifying the cost impact: from resource leak to dollar figure
- Embedding cache OOM: $766/year per agent process. As calculated above: 200 OOM restarts/day × 3,500 replay tokens × $3.00/MTok = $2.10/day. This assumes a modest mid-task context size. Agents processing long documents or maintaining multi-turn history can easily accumulate 15,000–20,000 input tokens mid-task, pushing the per-restart cost to $0.045–$0.060 and the daily cost to $9–$12 for the same restart rate. For a fleet of 10 agent workers, that is $90–$120/day in pure OOM-replay overhead — roughly $33,000–$44,000/year.
- Thread starvation causing task duplication: up to 2× task cost. When thread starvation causes a tool call to fail with a timeout after the agent has already completed 60% of a task, the orchestrator re-queues the task from scratch. The duplicated task incurs the full cost of the first 60% again. For a task that costs $0.05 in tokens, a 60% duplication adds $0.03, so the task ends up costing 1.6× its nominal price; a failure near the end of the task approaches 2×. At 500 task failures per day from thread starvation, that is $15/day or $5,475/year in duplicated work. The AI agent retry storm prevention article covers the broader pattern of how tool failures cascade into LLM re-plans; resource exhaustion is the less-discussed upstream cause of those same failures.
- Connection pool exhaustion causing latency spikes: 1,000 seconds of wasted latency per day. When the DB connection pool is exhausted, the agent falls back to opening a raw connection per query. Establishing a new TCP connection to PostgreSQL adds approximately 100 ms of latency per call (TCP handshake + TLS + auth + session setup). At 10,000 DB-backed tool calls per day, that is 1,000 seconds of added latency. Beyond the direct cost of wasted wall-clock time, the latency spikes push individual tool calls past their configured timeout, triggering the timeout-retry logic described in LLM API timeout cost impact. Each timeout-triggered retry adds another full LLM planning call.
- File descriptor exhaustion causing cascading tool failures: variable but severe. When the FD limit is hit, every tool function that opens any file (including the agent's own log writer) fails with an uncaught `OSError`. The agent's exception handler calls the LLM to re-plan. The LLM plans to retry the tool. The tool fails again. This is structurally a retry storm driven by resource exhaustion rather than API unavailability. The cost profile is identical to other retry storms: N planning calls × full context replay per call. See prevent AI agent runaway cost in real time for detection patterns that catch this category of spiral.
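The dollar figures in this section all follow the same arithmetic: replayed input tokens × price per million tokens × events per day. The sketch below reproduces that calculation so it can be rerun with your own numbers; `ReplayCost` and `replay_cost` are hypothetical names, and the $3.00/MTok price is the article's working assumption, not a quoted rate for any specific model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayCost:
    per_event: float  # USD for one restart or retry
    per_day: float    # USD per day at the given event rate
    per_year: float   # USD per 365-day year


def replay_cost(replay_tokens: int, usd_per_mtok: float, events_per_day: int) -> ReplayCost:
    """Cost of re-sending `replay_tokens` input tokens after each restart or retry."""
    per_event = replay_tokens * usd_per_mtok / 1_000_000
    per_day = per_event * events_per_day
    return ReplayCost(per_event=per_event, per_day=per_day, per_year=per_day * 365)


if __name__ == "__main__":
    # Baseline from the text: 3,500 tokens, $3.00/MTok, 200 restarts/day.
    base = replay_cost(3_500, 3.00, 200)
    print(f"per event ${base.per_event:.4f}, per day ${base.per_day:.2f}")

    # Long-context case from the text: 20,000 tokens, scaled to 10 workers.
    heavy = replay_cost(20_000, 3.00, 200)
    print(f"per worker ${heavy.per_day:.2f}/day, fleet ${heavy.per_day * 10:.2f}/day")
```

The same function covers the retry-storm case: pass the full context size as `replay_tokens` and the number of planning calls per day as `events_per_day`.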
Python: ResourceScope context manager with RunGuard integration
- Building the ResourceScope context manager. The fundamental fix for all five resource categories is deterministic cleanup: resources are registered on acquisition and released on exit, regardless of whether the agent task succeeded or failed. Python's context manager protocol (`__enter__`/`__exit__` and the async variants) enforces this at the language level. The following `ResourceScope` implementation tracks every registered resource and calls its closer in reverse-registration order on scope exit, mirroring C++ RAII. RunGuard's `resource_guard` wrapper is applied as a decorator that measures the cleanup duration and fires an alert if any individual closer takes more than 500 ms (a reliable signal that a resource is hung):

```python
import inspect
import time
from typing import Awaitable, Callable

import asyncpg
import faiss
import httpx
import runguard

rg = runguard.init(api_key="rg_live_...")

Closer = Callable[[], Awaitable[None] | None]


class ResourceScope:
    """
    Async context manager that tracks acquired resources
    and releases them in LIFO order on __aexit__.
    """

    def __init__(self, scope_name: str):
        self._scope_name = scope_name
        self._closers: list[tuple[str, Closer]] = []

    def register(self, name: str, closer: Closer) -> None:
        """Register a resource closer; closers run in reverse order on exit."""
        self._closers.append((name, closer))

    async def __aenter__(self) -> "ResourceScope":
        return self

    @rg.resource_guard(cleanup_timeout_ms=500)
    async def __aexit__(self, *_) -> None:
        errors = []
        for name, closer in reversed(self._closers):
            t0 = time.monotonic()
            try:
                result = closer()
                # Closers may be sync or async; await only when needed
                if inspect.isawaitable(result):
                    await result
                elapsed_ms = (time.monotonic() - t0) * 1000
                if elapsed_ms > 500:
                    rg.alert(
                        "slow_cleanup",
                        detail=f"{self._scope_name}/{name} took {elapsed_ms:.0f}ms to close",
                        severity="warning",
                    )
            except Exception as exc:
                # Keep going so one failed closer does not strand the rest
                errors.append(f"{name}: {exc}")
        if errors:
            raise RuntimeError(
                f"ResourceScope({self._scope_name}) cleanup errors: {errors}"
            )


async def run_agent_task(task_id: str, task_input: dict) -> dict:
    async with ResourceScope(f"task-{task_id}") as scope:
        # HTTP client: always closed on scope exit
        client = httpx.AsyncClient(timeout=30.0)
        scope.register("http_client", client.aclose)

        # Database pool: always released on scope exit
        pool = await asyncpg.create_pool(
            "postgresql://user:pass@localhost/agentdb",
            min_size=1,
            max_size=3,
        )
        scope.register("db_pool", pool.close)

        # FAISS index: FAISS frees the index when the Python object is
        # garbage-collected, so this closer is a placeholder that only
        # keeps the release order explicit.
        index = faiss.read_index("/data/embeddings.index")
        scope.register("faiss_index", lambda: None)

        # ... agent logic runs here (_execute_agent_steps is defined elsewhere) ...
        result = await _execute_agent_steps(client, pool, index, task_input)
        return result
    # All three resources are released on exit, in reverse order
```

- Using RunGuard's `connection_guard` for connection-count monitoring. In addition to deterministic cleanup, RunGuard can monitor the current open connection count and alert when an agent holds more connections than expected. This catches cases where a resource was registered correctly but the closer itself is broken, for example a pool whose `close()` method returns before all connections are actually released:

```python
@rg.connection_guard(max_open=5, alert_on_exceed=True)
async def run_agent_task(task_id: str, task_input: dict) -> dict:
    """
    RunGuard tracks HTTP + DB connections opened inside this function.
    If the count exceeds max_open=5, it fires a Slack/PagerDuty alert
    and (optionally) raises a BudgetExceededError to abort the task.
    """
    async with ResourceScope(f"task-{task_id}") as scope:
        # ... same setup as above ...
        pass
```
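For teams that prefer not to maintain a custom scope class, the standard library's `contextlib.AsyncExitStack` offers the same reverse-order cleanup guarantee. The sketch below is a stand-in under stated assumptions, not RunGuard's API: `FakeClient` and `FakePool` are hypothetical placeholders for a real HTTP client and database pool, and no cleanup-timing alerts are wired in.

```python
import asyncio
from contextlib import AsyncExitStack


class FakeClient:
    """Hypothetical stand-in for an HTTP client with an async aclose()."""

    def __init__(self, log: list[str]):
        self._log = log

    async def aclose(self) -> None:
        self._log.append("client closed")


class FakePool:
    """Hypothetical stand-in for a database pool with an async close()."""

    def __init__(self, log: list[str]):
        self._log = log

    async def close(self) -> None:
        self._log.append("pool closed")


async def run_task(log: list[str], fail: bool = False) -> str:
    async with AsyncExitStack() as stack:
        client = FakeClient(log)
        stack.push_async_callback(client.aclose)  # registered first, closed last
        pool = FakePool(log)
        stack.push_async_callback(pool.close)     # registered last, closed first
        if fail:
            raise RuntimeError("tool call failed")
        return "ok"


if __name__ == "__main__":
    events: list[str] = []
    print(asyncio.run(run_task(events)), events)
```

One design difference is worth noting: when the task body raises, `AsyncExitStack` runs every callback and then re-raises the original exception, whereas the `ResourceScope` above raises its own `RuntimeError` if any closer fails.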
TypeScript: explicit resource management with using and AsyncResourceManager
- Explicit resource management (`using` keyword). TypeScript 5.2+ supports the TC39 explicit resource management proposal. Any object that implements `Symbol.dispose` (sync) or `Symbol.asyncDispose` (async) is automatically cleaned up when the enclosing block of a `using` or `await using` declaration exits, even on exception. This is the TypeScript equivalent of Python's `async with`:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { Pool } from "pg";
import RunGuard from "@runguard/sdk";

const rg = new RunGuard({ apiKey: process.env.RUNGUARD_API_KEY! });

class ManagedHttpClient {
  readonly client: Anthropic;

  constructor() {
    this.client = new Anthropic();
  }

  async [Symbol.asyncDispose](): Promise<void> {
    // Assumes the installed Anthropic SDK exposes destroy() for connection
    // teardown (stated for v0.28+); optional chaining makes this a no-op otherwise.
    await (this.client as any).destroy?.();
  }
}

class ManagedDbPool {
  readonly pool: Pool;

  constructor(connectionString: string) {
    this.pool = new Pool({ connectionString, max: 3 });
  }

  async [Symbol.asyncDispose](): Promise<void> {
    await this.pool.end();
  }
}

async function runAgentTask(
  taskId: string,
  taskInput: Record<string, unknown>
): Promise<Record<string, unknown>> {
  // Both resources are disposed automatically when the function body exits,
  // whether by normal return or by thrown exception. Disposal runs in reverse
  // declaration order: dbPool.pool.end() first, then the HTTP client's destroy().
  await using httpClient = new ManagedHttpClient();
  await using dbPool = new ManagedDbPool(process.env.DATABASE_URL!);

  // RunGuard wraps the task to track resource duration + cost
  return await rg.withTask(
    { taskId, tags: ["agent", "production"] },
    async (guard) => {
      const conn = await dbPool.pool.connect();
      try {
        // ... agent logic using httpClient.client and conn ...
        // executeAgentSteps is the application's own step runner (not shown)
        const result = await executeAgentSteps(
          httpClient.client,
          conn,
          taskInput,
          guard
        );
        return result;
      } finally {
        conn.release(); // always release the individual connection
      }
    }
  );
}
```

- AsyncResourceManager for environments without `using` support. When the runtime does not define `Symbol.asyncDispose` (older Node.js releases) or the toolchain predates TypeScript 5.2, the `using` approach requires a polyfill or is unavailable. The following `AsyncResourceManager` class provides the same LIFO cleanup guarantee without requiring the `using` keyword, and integrates with RunGuard's cleanup-timing alerts:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { Pool } from "pg";
import RunGuard from "@runguard/sdk";

const rg = new RunGuard({ apiKey: process.env.RUNGUARD_API_KEY! });

type Disposer = () => Promise<void> | void;

class AsyncResourceManager {
  private readonly disposers: Array<{ name: string; fn: Disposer }> = [];

  constructor(private readonly scopeName: string) {}

  register(name: string, disposer: Disposer): void {
    this.disposers.push({ name, fn: disposer });
  }

  async run<T>(fn: (mgr: AsyncResourceManager) => Promise<T>): Promise<T> {
    try {
      return await fn(this);
    } finally {
      await this.disposeAll();
    }
  }

  private async disposeAll(): Promise<void> {
    const errors: string[] = [];
    // Copy before reversing so the registration order is not mutated
    for (const { name, fn } of [...this.disposers].reverse()) {
      const t0 = performance.now();
      try {
        await fn();
        const elapsed = performance.now() - t0;
        if (elapsed > 500) {
          // RunGuard cleanup-timing alert
          rg.alert("slow_cleanup", {
            scope: this.scopeName,
            resource: name,
            elapsedMs: Math.round(elapsed),
            severity: "warning",
          });
        }
      } catch (err) {
        // Keep disposing the remaining resources
        errors.push(`${name}: ${err}`);
      }
    }
    if (errors.length) {
      throw new Error(
        `AsyncResourceManager(${this.scopeName}) cleanup errors: ${errors.join("; ")}`
      );
    }
  }
}

// Usage
async function runAgentTask(taskId: string, input: Record<string, unknown>) {
  const mgr = new AsyncResourceManager(`task-${taskId}`);
  return mgr.run(async (m) => {
    const client = new Anthropic();
    m.register("anthropic_client", async () => (client as any).destroy?.());

    const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 3 });
    m.register("db_pool", () => pool.end());

    // ... agent logic (executeAgentSteps is the application's own step runner) ...
    return executeAgentSteps(client, pool, input);
  });
}
```
Bounded LRU embedding cache with RunGuard eviction monitoring
- Replace unbounded dicts with an LRU cache. The simplest fix for embedding cache OOM is a size-bounded LRU. Python's `functools.lru_cache` works for pure function memoization, but it is bound to the decorated function, exposes only aggregate `cache_info()` counters, and offers no eviction hook. The following `BoundedEmbeddingCache` uses `collections.OrderedDict` for O(1) LRU behavior, exposes hit/miss/eviction counters, and calls a RunGuard callback on eviction so the cache efficiency is tracked in the RunGuard dashboard alongside token spend:

```python
import hashlib
from collections import OrderedDict
from typing import Optional

import numpy as np
import runguard

rg = runguard.init(api_key="rg_live_...")


class BoundedEmbeddingCache:
    """
    LRU cache for document embeddings. Evicts the least-recently-used
    entry when max_size is reached. Reports hit rate to RunGuard so
    cache efficiency appears alongside token-cost metrics.
    """

    def __init__(self, max_size: int = 10_000):
        self._cache: OrderedDict[str, np.ndarray] = OrderedDict()
        self._max_size = max_size
        self._hits = 0
        self._misses = 0
        self._evictions = 0

    def get(self, key: str) -> Optional[np.ndarray]:
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self._hits += 1
            return self._cache[key]
        self._misses += 1
        return None

    def put(self, key: str, embedding: np.ndarray) -> None:
        if key in self._cache:
            self._cache.move_to_end(key)
        else:
            if len(self._cache) >= self._max_size:
                # popitem(last=False) removes the least recently used entry
                evicted_key, _ = self._cache.popitem(last=False)
                self._evictions += 1
                # Notify RunGuard: a high eviction rate means the cache is too
                # small and embeddings are being recomputed (costing API calls)
                rg.metric(
                    "embedding_cache_eviction",
                    value=1,
                    tags={
                        "evicted_key_prefix": evicted_key[:8],
                        "cache_size": self._max_size,
                        "total_evictions": self._evictions,
                    },
                )
        self._cache[key] = embedding

    @property
    def hit_rate(self) -> float:
        total = self._hits + self._misses
        return self._hits / total if total > 0 else 0.0

    def report_to_runguard(self) -> None:
        """Call this periodically (e.g. every 60s) to push cache stats."""
        rg.metric("embedding_cache_hit_rate", value=self.hit_rate)
        rg.metric("embedding_cache_size", value=len(self._cache))


# Shared singleton: 10,000 entries × 1,536 float32 = ~59 MB max
_embedding_cache = BoundedEmbeddingCache(max_size=10_000)


async def get_embedding(text: str, client) -> np.ndarray:
    # A content digest is stable across processes; the built-in hash() can be
    # negative, which makes int.to_bytes() raise OverflowError.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cached = _embedding_cache.get(key)
    if cached is not None:
        return cached
    response = await client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    embedding = np.array(response.data[0].embedding, dtype=np.float32)
    _embedding_cache.put(key, embedding)
    return embedding
```

- Memory footprint math for cache sizing. One `text-embedding-3-small` embedding is 1,536 float32 values = 6,144 bytes = ~6 KB. A cache of 10,000 entries consumes ~59 MB, a safe allocation for any modern container. To cache 100,000 entries the footprint grows to ~590 MB; at that scale, consider an external Redis cache with TTL-based eviction instead of an in-process dict. The RunGuard eviction metric makes cache-size tuning data-driven: if evictions per minute exceed hits per minute, the cache is undersized and you are paying for repeated embedding API calls. See AI agent memory consolidation cost optimization for the broader framework of managing agent memory across turns.
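The sizing arithmetic above can also be run in reverse, starting from a memory budget and solving for `max_size`. The following sketch uses hypothetical helper names (`embedding_bytes`, `cache_footprint_bytes`, `max_entries_for_budget`) and counts only the raw float32 payload; per-entry Python object and dictionary overhead is an unmodeled extra, so treat the results as lower bounds on real memory use.

```python
def embedding_bytes(dimensions: int = 1_536, bytes_per_value: int = 4) -> int:
    """Raw payload of one embedding vector (float32 by default)."""
    return dimensions * bytes_per_value


def cache_footprint_bytes(entries: int, dimensions: int = 1_536, bytes_per_value: int = 4) -> int:
    """Raw payload of a full cache holding `entries` vectors."""
    return entries * embedding_bytes(dimensions, bytes_per_value)


def max_entries_for_budget(budget_bytes: int, dimensions: int = 1_536, bytes_per_value: int = 4) -> int:
    """Largest max_size whose raw payload fits inside `budget_bytes`."""
    return budget_bytes // embedding_bytes(dimensions, bytes_per_value)


if __name__ == "__main__":
    mib = 1024 ** 2
    for entries in (10_000, 100_000):
        size = cache_footprint_bytes(entries)
        print(f"{entries} entries: {size / mib:.1f} MiB")
    print("entries that fit in 256 MiB:", max_entries_for_budget(256 * mib))
```

The 10,000-entry cache works out to 61,440,000 bytes, which is about 58.6 MiB and matches the ~59 MB figure when megabytes are counted in binary units.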
Resource leak types: symptoms, cleanup patterns, RunGuard detection, and cost impact
| Resource type | Leak symptom | Cleanup pattern | RunGuard detection | Cost impact |
|---|---|---|---|---|
| HTTP client / connection pool | OSError: Cannot assign requested address; ephemeral port exhaustion on busy hosts | Instantiate once per process; call aclose() / destroy() in context manager exit | rg.connection_guard(max_open=5) alerts when a single agent task holds more than N connections | Each forced re-plan from connection failure: ~1,000–3,500 input tokens × $3/MTok = $0.003–$0.011 per event |
| In-process embedding cache | OOM kill; container restart; cold-start token replay | Bounded LRU with max_size; shared singleton; external Redis for >100k entries | rg.metric("embedding_cache_eviction") callback; high eviction rate triggers alert | 200 OOM restarts/day × 3,500 replay tokens × $3/MTok = $2.10/day ($766/year) |
| Vector search index (FAISS, Hnswlib) | OSError: Too many open files; FD limit exhaustion | Open index once at startup; share across requests; call close() at process shutdown | rg.resource_guard() cleanup-timing alert on __aexit__ if close takes >500ms | FD exhaustion triggers tool failure → retry storm; cost equals N retry LLM calls × full context |
| ThreadPoolExecutor for sync tool calls | Thread accumulation; memory pressure; eventual OOM or tool-call timeout | One shared executor per process; executor.shutdown(wait=True) on app teardown; cancel stalled futures | rg.alert("slow_cleanup") when executor shutdown exceeds 500ms threshold | Stalled thread → tool timeout → task duplication: 2× full task token cost per event |
| Database connection pool | Connection refused; pool wait timeout; raw connection fallback adding 100ms latency per call | Single pool per process; pool.close() / pool.end() in scope exit; release individual connections in finally | rg.connection_guard() tracks pool saturation; alert on pool wait >200ms | 100ms latency spike × 10k calls/day = 1,000s wasted latency; each timeout triggers retry LLM call |
Related: AI agent health monitoring cost tradeoffs — how to instrument agent processes for memory, FD count, and connection health without adding overhead that itself increases cost. See also production LLM agent reliability checklist for the full set of operational checks that prevent resource-driven failures.
Applying cleanup patterns to the full agent lifecycle
- Startup: initialize shared resources once. HTTP clients, database pools, shared executor pools, and singleton embedding caches should all be initialized at process startup and stored as module-level or application-level singletons. Framework startup hooks are the right place: `@app.on_event("startup")` (or the newer `lifespan` handler) in FastAPI, `lifespan` context managers in Starlette, and module-level initialization in Node.js, paired with `process.on("beforeExit")` for teardown. This eliminates per-task instantiation overhead and the associated leak risk.
- Task scope: register per-task resources in a `ResourceScope`. Resources that are necessarily per-task (individual database connections drawn from the pool, per-task `asyncio.TaskGroup` instances, per-task temp files) should be registered in a `ResourceScope` (Python) or an `AsyncResourceManager` / `await using` block (TypeScript) that guarantees cleanup on task completion or failure.
- Shutdown: flush metrics, then release shared resources. On process shutdown, flush all pending RunGuard metrics (so the final cache hit rate, connection count, and cleanup timings are recorded), then call the shared resource closers. This ordering ensures the last metrics are sent before the HTTP client that delivers them is closed. Because `ResourceScope` runs closers in LIFO order, it handles this automatically if the metric flush is registered after the HTTP client: the closer registered last runs first.
- Monitoring with RunGuard across the lifecycle. RunGuard's connection and resource guards provide continuous visibility into which agent tasks are holding resources longer than expected. In the Team plan, alerts route to Slack and PagerDuty so on-call engineers see a connection-count spike or slow-cleanup alert before the OOM kill happens. This moves resource management from reactive (debug after the OOM) to proactive (fix the leak before it causes a cost-amplifying restart). Combined with the token-budget controls described in autonomous agent cost control best practices, resource cleanup guards form the second layer of a complete agent cost-containment strategy.
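The three lifecycle phases above can be sketched without any web framework, using only `contextlib.asynccontextmanager` and one shared `ThreadPoolExecutor`. This is an illustrative outline under stated assumptions: `process_lifespan`, `blocking_tool`, and `run_task` are hypothetical names, the `events` list stands in for real logging, and the "flush_metrics" step is a placeholder for whatever metrics client the process actually uses.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from contextlib import asynccontextmanager
from typing import AsyncIterator


@asynccontextmanager
async def process_lifespan(events: list[str]) -> AsyncIterator[ThreadPoolExecutor]:
    """Create shared resources once; release them in a fixed order at shutdown."""
    executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="tool")
    events.append("startup")
    try:
        yield executor
    finally:
        events.append("flush_metrics")   # placeholder: flush metrics first
        executor.shutdown(wait=True)      # then release the shared executor
        events.append("executor_shutdown")


def blocking_tool(x: int) -> int:
    """Stands in for a synchronous database query or file read."""
    return x * 2


async def run_task(executor: ThreadPoolExecutor, x: int) -> int:
    loop = asyncio.get_running_loop()
    # The timeout bounds the wait; it does not stop the worker thread itself.
    future = loop.run_in_executor(executor, blocking_tool, x)
    return await asyncio.wait_for(future, timeout=5.0)


async def main(events: list[str]) -> list[int]:
    async with process_lifespan(events) as executor:
        results = await asyncio.gather(*(run_task(executor, i) for i in range(3)))
        events.append("tasks_done")
        return list(results)


if __name__ == "__main__":
    log: list[str] = []
    print(asyncio.run(main(log)), log)
```

Every task reuses the same four worker threads, so the thread count stays flat no matter how many tool calls run, and the `finally` block fixes the shutdown order in one place.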
Stop resource leaks from becoming token-spend multipliers
Every resource leak in an LLM agent eventually manifests as extra token spend — through cold-start replays, retry storms, or task duplication. The patterns on this page — ResourceScope context managers, bounded LRU caches, shared singleton pools, and deterministic cleanup on task exit — eliminate the five most common leak categories. RunGuard’s resource_guard, connection_guard, and eviction metric callbacks provide the runtime visibility to catch cleanup regressions before they accumulate into meaningful cost.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
Start your 14-day free trial — or explore related: AI agent context window truncation alert and AI agent graceful degradation patterns.