E2B Code Interpreter Agent Cost Control: Sandbox CPU-Second Billing, Timeout Retry Loops, Output Accumulation, and Parallel Sandbox Storms

E2B (formerly E2B.dev) provides isolated, cloud-hosted Python sandboxes for AI agents that need to execute code. Unlike LLM API pricing — where cost is a function of tokens in and tokens out — E2B bills primarily by CPU-seconds of sandbox runtime. A sandbox that stays alive for 60 seconds costs the same whether it ran 10 lines of code or sat idle while the agent's LLM decided what to do next. This billing model creates a set of cost failure modes that don't exist in pure LLM pipelines and aren't caught by token-budget monitoring alone.

The E2B Code Interpreter SDK (e2b-code-interpreter) is purpose-built for data analysis, research, and coding agents. It's the sandbox layer behind many Claude Artifacts-style "execute and show me the output" workflows, as well as agentic data science pipelines in LangChain, LlamaIndex, and custom frameworks. The SDK exposes a clean Sandbox.create() / sandbox.run_code() interface, which makes it trivially easy to wire up but also easy to wire up incorrectly from a cost-control perspective.

Four failure modes emerge specifically from E2B's billing model and the way agent frameworks interact with sandbox lifecycle:

  • Sandbox CPU-second accumulation during idle agent think-time — the sandbox stays alive and billed while the LLM is reasoning between code executions; a 30-step agent with 3-second LLM latency per step accumulates 90 seconds of idle sandbox billing on top of actual execution time.
  • Execution timeout retry loops — when code execution times out, agents retry the same computation in a new or existing sandbox; each retry incurs the full execution cost plus sandbox startup time, and retry loops that don't track attempt count compound the waste.
  • Output accumulation inflating LLM context — sandbox stdout, stderr, and rich outputs (DataFrames, plots as base64 strings) are fed back to the LLM; without output truncation, a few `df.head(1000)` calls can push the LLM context into the 50k–100k token range, multiplying LLM costs at every subsequent step.
  • Parallel sandbox storms — agent frameworks that support concurrency (LangGraph with parallel branches, CrewAI multi-agent, custom asyncio.gather patterns) may spawn one sandbox per concurrent task; twenty parallel sub-agents, each with a live sandbox, all bill simultaneously.

Failure Mode 1 — Sandbox CPU-Second Billing During Idle Think-Time

E2B sandbox billing starts when Sandbox.create() returns and ends when sandbox.kill() is called or the sandbox times out. The sandbox does not pause during the intervals between sandbox.run_code() calls. If your agent's code execution pattern is: run code → LLM reasons → run code → LLM reasons → run code, every second of LLM reasoning is a billable second of sandbox time.

The math is straightforward but easy to underestimate. A data analysis agent that calls sandbox.run_code() fifteen times during a session, with 4 seconds of actual execution per call and 3 seconds of LLM latency per reasoning step, accumulates:

  • 15 × 4s = 60 seconds of execution time (expected)
  • 15 × 3s = 45 seconds of idle billing between executions (invisible)
  • Total: 105 seconds billed — 75% more than the code actually ran for

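At scale, the gap widens. With a frontier model that takes 5–10 seconds per call, the same 15-step session accumulates 75–150 seconds of idle time against 60 seconds of execution, so the sandbox bill comes to roughly 2.25–3.5× what the code's own runtime would suggest. Teams that estimate E2B costs from "time the code actually runs" can therefore see bills several times higher than projected.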

The idle billing rule: For every second of LLM think-time between code executions, you pay one CPU-second of sandbox billing. In a pipeline with more LLM calls than code executions — common in reasoning-heavy agents — idle sandbox billing exceeds active execution billing. Measure wall-clock sandbox lifetime, not just execution durations.
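The arithmetic behind this rule can be wrapped in a small estimator. This is a sketch under the simplifying assumption used above: one LLM call per code execution, with the sandbox held open for the whole session. The names `SessionEstimate` and `estimate_session` are illustrative and not part of the E2B SDK.

```python
from dataclasses import dataclass


@dataclass
class SessionEstimate:
    execution_seconds: float   # time the code actually ran
    idle_seconds: float        # sandbox alive while the LLM was thinking
    billed_seconds: float      # wall-clock sandbox lifetime

    @property
    def overhead_ratio(self) -> float:
        """Idle time as a fraction of execution time (0.75 means 75% extra)."""
        return self.idle_seconds / self.execution_seconds


def estimate_session(steps: int, exec_s: float, think_s: float) -> SessionEstimate:
    """Estimate billed seconds for a sandbox held open across the whole session."""
    execution = steps * exec_s
    idle = steps * think_s
    return SessionEstimate(execution, idle, execution + idle)


estimate = estimate_session(steps=15, exec_s=4, think_s=3)
print(estimate.billed_seconds, estimate.overhead_ratio)  # 105 0.75
```

Plugging in measured LLM latencies instead of guesses is the quickest way to see whether idle time or execution time dominates a given pipeline.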

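The correct pattern is to close the sandbox as soon as the agent no longer needs it and open a fresh one only when the next code execution begins. E2B sandboxes start in under 500ms, so the overhead of open-execute-close is negligible compared to the idle billing it eliminates. The trade-off is that a fresh sandbox starts with no variables, files, or installed packages from earlier steps, so each code block must be self-contained (reload its data and redo its imports).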

Python — naive pattern (idle billing accumulates)
from e2b_code_interpreter import Sandbox

sandbox = Sandbox()  # billing starts here

for step in agent_steps:
    llm_response = llm.call(step)          # 3–8 seconds: idle billing
    if llm_response.has_code:
        result = sandbox.run_code(         # billing continues
            llm_response.code_block
        )
        agent_context.add(result)

sandbox.kill()  # billing ends only here
# total billed: sum of all LLM latencies + all execution times

Python — correct pattern (sandbox scoped to each execution)
from e2b_code_interpreter import Sandbox
import time

def run_code_guarded(code: str, timeout_s: int = 30) -> dict:
    """Open sandbox, run code, close immediately. Never accumulate idle time."""
    start = time.monotonic()
    with Sandbox() as sandbox:          # auto-kills on context exit
        execution = sandbox.run_code(code, timeout=timeout_s)
    elapsed = time.monotonic() - start
    return {
        # logs.stdout / logs.stderr are lists of output chunks; join them into strings
        "stdout": "".join(execution.logs.stdout),
        "stderr": "".join(execution.logs.stderr),
        "results": execution.results,
        "sandbox_seconds": elapsed,
    }

# Agent loop: each call opens and closes its own sandbox
for step in agent_steps:
    llm_response = llm.call(step)
    if llm_response.has_code:
        result = run_code_guarded(llm_response.code_block)
        agent_context.add(truncate_sandbox_output(result))  # defined under Failure Mode 3
# idle billing: zero — sandbox is never alive during LLM reasoning

Using Sandbox() as a context manager (with Sandbox() as sandbox:) ensures the sandbox is killed even if run_code() raises an exception. Without the context manager, unhandled exceptions leave sandboxes running until they hit E2B's default 5-minute timeout — each exception contributes up to 5 minutes of unexpected billing.
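The difference between the two exception paths can be demonstrated without touching a real sandbox. The sketch below uses `FakeSandbox`, a hypothetical stand-in that only records whether `kill()` was called; it is not the E2B class and makes no network calls.

```python
class FakeSandbox:
    """Stand-in for a sandbox client; NOT the E2B class. It only records kill()."""

    def __init__(self):
        self.killed = False

    def run_code(self, code: str):
        raise ValueError("simulated failure inside run_code")

    def kill(self):
        self.killed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.kill()      # runs on normal exit and on exceptions
        return False     # do not swallow the exception


def run_without_context_manager() -> FakeSandbox:
    sandbox = FakeSandbox()
    try:
        sandbox.run_code("1 / 0")
        sandbox.kill()   # never reached when run_code raises
    except ValueError:
        pass
    return sandbox


def run_with_context_manager() -> FakeSandbox:
    try:
        with FakeSandbox() as sandbox:
            sandbox.run_code("1 / 0")
    except ValueError:
        pass
    return sandbox


print(run_without_context_manager().killed)  # False: sandbox leaked
print(run_with_context_manager().killed)     # True: killed on the way out
```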

Failure Mode 2 — Execution Timeout Retry Loops

E2B's sandbox.run_code() accepts a timeout parameter (default: 30 seconds). When execution exceeds the timeout, the SDK raises a TimeoutException. The natural agent response is to retry — but retry logic written without awareness of execution cost patterns creates systematic waste.

The first retry problem is the most obvious: retrying an expensive computation that always times out at 30 seconds means each attempt consumes 30 sandbox-seconds before failing. A three-retry default burns 90 CPU-seconds on a task that was never going to complete in the timeout window. Agents that don't track failed execution costs against a session budget can spend more on timed-out retries than on successful executions.

The second retry problem is subtler. Many agent frameworks implement retry logic at the tool-call level: if run_code fails, mark the tool call as failed and ask the LLM to retry with a different code block. The LLM often generates nearly identical code — the same DataFrame operations, the same nested loop, the same inefficient join — because the agent context doesn't communicate "this approach hit a computational limit." Without a circuit breaker that injects the failure reason into context, the LLM retries the same expensive pattern indefinitely.

Timeout retry rule: A timeout doesn't mean "try again." It means "this computation exceeds the time budget for a single sandbox execution." The correct response is to decompose the task, not retry identically. If your retry logic sends the same code, you're burning sandbox CPU-seconds to produce the same timeout.
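One way to enforce this rule mechanically is to fingerprint code that has already timed out and refuse to run it again unchanged. The following is a minimal sketch; `code_fingerprint` and `TimeoutRetryFilter` are hypothetical helpers, and the normalization (dropping comments, blank lines, and trailing whitespace) only catches textual repeats, not code that was reworded but is equally slow.

```python
import hashlib


def code_fingerprint(code: str) -> str:
    """Hash code after dropping blank lines, comments, and trailing whitespace."""
    lines = []
    for line in code.splitlines():
        stripped = line.rstrip()
        if not stripped or stripped.lstrip().startswith("#"):
            continue
        lines.append(stripped)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


class TimeoutRetryFilter:
    """Remember code that timed out and refuse to run it again unchanged."""

    def __init__(self):
        self._timed_out = set()

    def record_timeout(self, code: str) -> None:
        self._timed_out.add(code_fingerprint(code))

    def is_repeat(self, code: str) -> bool:
        return code_fingerprint(code) in self._timed_out


retry_filter = TimeoutRetryFilter()
slow = "for row in df.itertuples():\n    process(row)\n"
retry_filter.record_timeout(slow)

# Same code with an added comment and blank line is still a repeat
print(retry_filter.is_repeat("# try again\n\n" + slow))          # True
# A genuinely different approach passes the filter
print(retry_filter.is_repeat("process_batch(df.head(1000))\n"))  # False
```

Checking `is_repeat()` before opening a sandbox means an identical retry costs zero sandbox-seconds instead of another full timeout.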

Python — execution timeout guard with budget tracking
from e2b_code_interpreter import Sandbox
from e2b_code_interpreter.exceptions import TimeoutException
import time

class SandboxBudgetGuard:
    def __init__(self, max_cpu_seconds: float = 300.0, max_retries: int = 2):
        self.max_cpu_seconds = max_cpu_seconds
        self.consumed_seconds = 0.0
        self.timeout_count = 0
        self.max_retries = max_retries

    def run_code(self, code: str, timeout_s: int = 30) -> dict:
        if self.consumed_seconds >= self.max_cpu_seconds:
            raise RuntimeError(
                f"Sandbox CPU budget exhausted: {self.consumed_seconds:.1f}s "
                f"of {self.max_cpu_seconds}s used"
            )
        if self.timeout_count >= self.max_retries:
            raise RuntimeError(
                f"Too many sandbox timeouts ({self.timeout_count}). "
                "Decompose the task rather than retrying the same code."
            )

        start = time.monotonic()
        try:
            with Sandbox() as sandbox:
                execution = sandbox.run_code(code, timeout=timeout_s)
            elapsed = time.monotonic() - start
            self.consumed_seconds += elapsed
            return {
                # logs are lists of output chunks; join them into strings
                "stdout": "".join(execution.logs.stdout),
                "stderr": "".join(execution.logs.stderr),
                "results": execution.results,
                "cpu_seconds": elapsed,
                "budget_remaining": self.max_cpu_seconds - self.consumed_seconds,
            }
        except TimeoutException:
            elapsed = time.monotonic() - start
            self.consumed_seconds += elapsed  # timed-out execution still consumed CPU
            self.timeout_count += 1
            raise RuntimeError(
                f"Code execution timed out after {timeout_s}s "
                f"(attempt {self.timeout_count}/{self.max_retries}). "
                "Break into smaller steps. "
                f"Total CPU-seconds consumed: {self.consumed_seconds:.1f}."
            )

    def status(self) -> dict:
        return {
            "consumed_seconds": self.consumed_seconds,
            "remaining_seconds": self.max_cpu_seconds - self.consumed_seconds,
            "timeout_count": self.timeout_count,
            "budget_pct_used": self.consumed_seconds / self.max_cpu_seconds * 100,
        }

The key detail is that timed-out executions still consume CPU-seconds and must be tracked. An agent session with a 300-second CPU budget that runs three 30-second timeouts before the first successful execution has already consumed 30% of its budget on failed work. Without tracking timed-out CPU consumption, the budget counter is optimistic — it only counts successful runs — and the agent overshoots its actual budget.

Retry policy for computation-heavy agents

When run_code times out, the right agent response is to instruct the LLM to decompose the computation, not restart it. The error message returned to the LLM should explicitly say what failed and why retrying the same approach won't work:

Python — structured timeout feedback for the LLM
def handle_timeout_for_llm(code: str, timeout_s: int, guard: SandboxBudgetGuard) -> str:
    """Return a structured error message that instructs the LLM how to decompose."""
    status = guard.status()
    return (
        f"SANDBOX TIMEOUT: The code block exceeded {timeout_s}s execution time.\n"
        f"Budget status: {status['consumed_seconds']:.1f}s consumed, "
        f"{status['remaining_seconds']:.1f}s remaining.\n\n"
        "Do NOT retry the same code. Instead:\n"
        "1. Break the computation into smaller chunks (e.g., process 1000 rows at a time instead of all rows at once).\n"
        "2. Sample the data first (df.sample(1000)) to validate the approach before running on the full dataset.\n"
        "3. If the operation is inherently expensive, request a longer timeout budget (state: need_extended_timeout).\n\n"
        f"Failed code:\n```python\n{code}\n```"
    )

Failure Mode 3 — Output Accumulation Inflating LLM Context

E2B sandbox executions return three output channels: logs.stdout, logs.stderr, and results (rich outputs including DataFrames, matplotlib figures as base64 PNG strings, Plotly JSON, and plain text). In a data analysis pipeline, these outputs are typically fed directly into the LLM's context as tool results so the LLM can "see" what the code produced and decide what to do next.

The problem compounds across multiple executions within a single agent session. A ten-step data analysis pipeline that calls df.describe(), df.head(100), prints a full correlation matrix, generates a matplotlib figure, and outputs intermediate results at each step can accumulate 40,000–80,000 tokens of output in the LLM's context window. Each subsequent LLM call processes that entire accumulated context — meaning the LLM cost per step increases monotonically as the session progresses.

Concrete token estimates for common data analysis outputs:

Output type | Typical token cost | Risk level
df.head(10) | 200–400 tokens | Low
df.head(100) | 2,000–4,000 tokens | Medium
df.describe() (10 cols) | 600–1,200 tokens | Low
print(df.to_string()) (1000 rows) | 15,000–40,000 tokens | Critical
Matplotlib PNG (base64) | 8,000–25,000 tokens | Critical
Plotly JSON (complex chart) | 5,000–15,000 tokens | High
Full correlation matrix (20 cols) | 1,500–3,000 tokens | Medium
Stack trace (deep call chain) | 500–2,000 tokens | Medium

A single matplotlib base64 PNG output can cost as much as 3–5 GPT-4 calls in token terms — and that cost is paid again at every subsequent step because the output sits in context. An agent that generates three charts during a 10-step analysis pays chart token cost 7+ times: once when each chart is generated, then again at every step after generation.

Output accumulation amplification rule: An output that costs T tokens injected at step S of an N-step session costs T × (N − S) additional tokens across remaining steps. A 20,000-token chart injected at step 3 of a 10-step session adds 140,000 tokens of LLM context cost across steps 4–10. Track cumulative output tokens, not just per-execution output size.
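The rule translates directly into code. The sketch below assumes the full context is re-sent at every step with no prompt caching and no pruning of earlier tool results; the function names are illustrative.

```python
def amplified_tokens(output_tokens: int, injected_at_step: int, total_steps: int) -> int:
    """Extra context tokens re-read by the LLM in the steps after injection."""
    remaining_steps = max(0, total_steps - injected_at_step)
    return output_tokens * remaining_steps


def session_context_cost(outputs: dict[int, int], total_steps: int) -> int:
    """Sum re-read tokens for several outputs, keyed by the step that produced them."""
    return sum(
        amplified_tokens(tokens, step, total_steps)
        for step, tokens in outputs.items()
    )


print(amplified_tokens(20_000, injected_at_step=3, total_steps=10))  # 140000
print(session_context_cost({2: 3_000, 3: 20_000, 7: 20_000}, 10))    # 224000
```

The second call shows why late outputs are cheaper than early ones: the chart at step 7 is re-read 3 times, the identical chart at step 3 is re-read 7 times.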

Python — output truncation and token estimation
import base64
from typing import Any

# Rough token estimate: ~4 characters per token for text, base64 is ~1.33x original size
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def truncate_sandbox_output(execution_result: dict, max_tokens: int = 2000) -> dict:
    """Truncate sandbox output to stay within per-execution token budget."""
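    stdout = execution_result.get("stdout", "")
    stderr = execution_result.get("stderr", "")
    # E2B logs arrive as lists of chunks; accept either a list or a joined string
    if isinstance(stdout, list):
        stdout = "".join(stdout)
    if isinstance(stderr, list):
        stderr = "".join(stderr)
    results = execution_result.get("results", [])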

    truncated = {}
    budget = max_tokens

    # Truncate stdout
    stdout_tokens = estimate_tokens(stdout)
    if stdout_tokens > budget // 2:
        keep_chars = (budget // 2) * 4
        truncated["stdout"] = stdout[:keep_chars] + f"\n[... truncated {len(stdout) - keep_chars} chars]"
        budget -= budget // 2
    else:
        truncated["stdout"] = stdout
        budget -= stdout_tokens

    # Truncate stderr to at most half of the budget left after stdout
    stderr_tokens = estimate_tokens(stderr)
    if stderr_tokens > budget // 2:
        keep_chars = (budget // 2) * 4
        truncated["stderr"] = stderr[:keep_chars] + f"\n[... truncated {len(stderr) - keep_chars} chars]"
        budget -= budget // 2
    else:
        truncated["stderr"] = stderr
        budget -= stderr_tokens

    # Handle rich results — strip base64 images, summarize DataFrames
    clean_results = []
    for item in results:
        if hasattr(item, "png") and item.png:
            # Replace base64 PNG with a placeholder — saves 8k–25k tokens
            clean_results.append("[chart generated — use display() to request inline viewing]")
        elif hasattr(item, "text") and item.text:
            text_tokens = estimate_tokens(item.text)
            if text_tokens > budget:
                keep_chars = budget * 4
                clean_results.append(item.text[:keep_chars] + f"\n[... truncated]")
                budget = 0
            else:
                clean_results.append(item.text)
                budget -= text_tokens
        if budget <= 0:
            clean_results.append("[remaining results omitted — output budget exhausted]")
            break

    truncated["results"] = clean_results
    truncated["estimated_tokens"] = max_tokens - budget
    return truncated

class OutputBudgetGuard:
    def __init__(self, max_context_tokens: int = 20000, max_per_execution: int = 2000):
        self.max_context_tokens = max_context_tokens
        self.max_per_execution = max_per_execution
        self.accumulated_tokens = 0

    def process_output(self, execution_result: dict) -> dict:
        remaining = self.max_context_tokens - self.accumulated_tokens
        if remaining <= 0:
            raise RuntimeError(
                f"Output context budget exhausted ({self.accumulated_tokens} tokens). "
                "Summarize earlier results before running more code."
            )
        per_exec_limit = min(self.max_per_execution, remaining)
        truncated = truncate_sandbox_output(execution_result, max_tokens=per_exec_limit)
        self.accumulated_tokens += truncated["estimated_tokens"]
        return truncated

Handling base64 image outputs

Matplotlib and Seaborn figures returned by E2B are base64-encoded PNG strings. Passing the full base64 string to a non-vision LLM model accomplishes nothing — the model cannot interpret the image — while consuming thousands of tokens. Even with a vision model, passing raw base64 for every chart generated during a multi-step analysis is expensive. The right approach is to store the base64 data server-side, return a reference (e.g., "chart saved as analysis_step_3.png"), and only pass the actual image data when the LLM explicitly needs to analyze the visual output.

Python — save chart to file, pass reference to LLM
import base64
import os
import uuid

def save_chart_from_execution(execution, output_dir: str = "/tmp/charts") -> list[str]:
    """Save base64 PNG outputs to disk, return file paths instead of base64 strings."""
    os.makedirs(output_dir, exist_ok=True)
    saved_paths = []
    for item in execution.results:
        if hasattr(item, "png") and item.png:
            filename = f"chart_{uuid.uuid4().hex[:8]}.png"
            filepath = os.path.join(output_dir, filename)
            with open(filepath, "wb") as f:
                f.write(base64.b64decode(item.png))
            saved_paths.append(filepath)
    return saved_paths

# In your agent tool:
def run_analysis_code(code: str) -> str:
    with Sandbox() as sandbox:
        execution = sandbox.run_code(code)

    chart_paths = save_chart_from_execution(execution)
    stdout = "".join(execution.logs.stdout)[:2000]  # join log chunks, then hard truncate

    output_parts = [stdout] if stdout else []
    if chart_paths:
        output_parts.append(f"Charts saved: {', '.join(chart_paths)}")
        output_parts.append("To analyze a chart visually, request: analyze_chart('path')")

    return "\n".join(output_parts) or "(no output)"
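To see where the 8,000–25,000 token range for charts comes from, the base64 size can be estimated from the PNG's byte size. This is a rough sketch that reuses the article's 4-characters-per-token heuristic; real tokenizers may split base64 less efficiently, so treat the result as a lower-end estimate. The function names are illustrative.

```python
import base64
import math


def base64_chars(raw_bytes: int) -> int:
    """Length of the base64 encoding (with padding) of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def estimate_image_tokens(raw_bytes: int, chars_per_token: float = 4.0) -> int:
    """Rough token count if the base64 string is pasted into the prompt as text."""
    return math.ceil(base64_chars(raw_bytes) / chars_per_token)


# Sanity check against the standard library encoder
payload = bytes(30_000)
assert len(base64.b64encode(payload)) == base64_chars(30_000)

print(estimate_image_tokens(30_000))  # 10000
print(estimate_image_tokens(75_000))  # 25000
```

Under this heuristic a PNG costs about one token per three bytes, so even a modest 30 KB chart lands in the five-figure token range.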

Failure Mode 4 — Parallel Sandbox Storms

Agent frameworks with parallel execution support — LangGraph with Send nodes, CrewAI with parallel agent crews, custom asyncio.gather() patterns — can create scenarios where many agent sub-tasks run simultaneously, each with its own sandbox. If sandbox lifecycle is not coordinated at the session level, twenty parallel sub-agents each calling Sandbox.create() means twenty simultaneously billable sandboxes.

The cost effect is on spend rate rather than on the total: N parallel sandboxes bill N sandbox-seconds for every second of wall-clock time. A research agent that dispatches 10 parallel sub-agents to analyze 10 different datasets, each taking 30 seconds, incurs 300 CPU-seconds of E2B billing in 30 real-world seconds. The same workload executed sequentially incurs the same 300 CPU-seconds but spreads them over 300 seconds. When every sandbox is scoped tightly to its execution, parallelism buys latency at no extra sandbox cost. The waste appears when parallel sandboxes sit idle (each one waiting on its own LLM call multiplies the idle billing from Failure Mode 1 by N), or when a runaway fan-out burns through the budget 10× faster than anyone can notice and react. If the parallel dispatch was a default concurrency setting that nobody questioned, that accelerated burn rate is an unmanaged risk.

The subtler version of this failure is sandbox leakage from exception paths. If a parallel sub-agent raises an exception before reaching sandbox.kill() — or if the exception handler for the sub-agent swallows the error without killing the sandbox — the sandbox remains alive until E2B's default timeout. In a 20-way parallel dispatch, 3 exception-path leaks leave 3 sandboxes running for up to 5 minutes each, contributing 900 CPU-seconds of unexpected billing that never appears in the application's error logs.

Parallel sandbox billing rule: Peak concurrent sandbox count × maximum sandbox lifetime is the upper bound on sandbox-seconds billed for one round of parallel dispatch. If you're not tracking peak concurrency, you don't know your E2B billing ceiling. Set a concurrency limit before the first production run, not after the first large bill.
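The ceiling and the leak exposure described above are simple products, sketched here with illustrative helper names. The 300-second default mirrors the 5-minute sandbox timeout mentioned earlier; adjust it if your sandboxes are created with a different timeout.

```python
def billing_ceiling(peak_concurrent: int, max_lifetime_s: float) -> float:
    """Upper bound on sandbox-seconds if every sandbox lived for the maximum lifetime."""
    return peak_concurrent * max_lifetime_s


def leak_exposure(leaked: int, sandbox_timeout_s: float = 300.0) -> float:
    """Worst-case extra sandbox-seconds from sandboxes that are never killed."""
    return leaked * sandbox_timeout_s


print(billing_ceiling(peak_concurrent=20, max_lifetime_s=30))  # 600
print(leak_exposure(leaked=3))                                 # 900.0
```

In this example three leaked sandboxes can cost more than the entire 20-way dispatch they came from.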

Python — semaphore-gated sandbox concurrency
import asyncio
from e2b_code_interpreter import AsyncSandbox
import time

class ConcurrentSandboxPool:
    def __init__(self, max_concurrent: int = 3, cpu_budget_seconds: float = 600.0):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.cpu_budget = cpu_budget_seconds
        self.consumed_seconds = 0.0
        self.active_count = 0
        self.peak_concurrent = 0
        self._lock = asyncio.Lock()

    async def run_code(self, code: str, timeout_s: int = 60) -> dict:
        async with self.semaphore:
            # Check the budget after acquiring a slot so queued tasks see
            # the seconds consumed by the tasks that ran before them.
            async with self._lock:
                if self.consumed_seconds >= self.cpu_budget:
                    raise RuntimeError(
                        f"Session CPU budget exhausted: {self.consumed_seconds:.1f}s used "
                        f"of {self.cpu_budget}s. No more sandboxes will be opened."
                    )
                self.active_count += 1
                self.peak_concurrent = max(self.peak_concurrent, self.active_count)

            start = time.monotonic()
            try:
                # The async SDK creates sandboxes through an awaited factory method
                sandbox = await AsyncSandbox.create()
                try:
                    execution = await sandbox.run_code(code, timeout=timeout_s)
                finally:
                    await sandbox.kill()  # always stop billing, even on error
                return {
                    # logs are lists of output chunks; join before truncating
                    "stdout": "".join(execution.logs.stdout)[:4000],
                    "stderr": "".join(execution.logs.stderr)[:1000],
                    "cpu_seconds": time.monotonic() - start,
                }
            finally:
                # Runs on success and failure: failed runs still consumed sandbox time
                async with self._lock:
                    self.consumed_seconds += time.monotonic() - start
                    self.active_count -= 1

    def status(self) -> dict:
        return {
            "consumed_seconds": self.consumed_seconds,
            "remaining_seconds": self.cpu_budget - self.consumed_seconds,
            "active_sandboxes": self.active_count,
            "peak_concurrent": self.peak_concurrent,
            "budget_pct_used": self.consumed_seconds / self.cpu_budget * 100,
        }

# Usage in a parallel research agent:
async def parallel_analysis(datasets: list[str], pool: ConcurrentSandboxPool):
    tasks = [
        pool.run_code(f"import pandas as pd; df = pd.read_csv('{ds}'); print(df.describe())")
        for ds in datasets
    ]
    # asyncio.gather respects the semaphore — max 3 sandboxes at a time regardless of list length
    results = await asyncio.gather(*tasks, return_exceptions=True)
    print(f"Pool status: {pool.status()}")
    return results

The semaphore enforces a hard ceiling on concurrent sandboxes. Without it, asyncio.gather() on a 20-item list opens 20 sandboxes simultaneously. With max_concurrent=3, at most 3 sandboxes are alive at any moment: as each one finishes, the next queued task starts, so only 3 concurrent billing streams run at a time.
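The ceiling can be verified locally before any real sandbox is involved. In this sketch `asyncio.sleep` stands in for sandbox startup and execution, and `measure_peak` is a hypothetical helper that reports the highest number of jobs alive at once.

```python
import asyncio


async def measure_peak(task_count: int, max_concurrent: int) -> int:
    """Run task_count fake sandbox jobs and return the peak number alive at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def fake_sandbox_job() -> None:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for sandbox start + run_code
            active -= 1

    await asyncio.gather(*(fake_sandbox_job() for _ in range(task_count)))
    return peak


print(asyncio.run(measure_peak(task_count=20, max_concurrent=3)))   # 3
print(asyncio.run(measure_peak(task_count=20, max_concurrent=20)))  # 20
```

A test like this is cheap to keep in CI, and it catches the common regression where someone bypasses the pool and calls the sandbox client directly.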

Detecting sandbox leaks in production

E2B provides a management API endpoint to list running sandboxes. Running this at session end and cross-referencing against expected active sandboxes is the most reliable leak detection method:

Python — session-end sandbox leak detection
from typing import Optional

from e2b import Sandbox as E2BSandbox

def detect_leaked_sandboxes(expected_alive: Optional[set[str]] = None) -> list:
    """List all running sandboxes. Any not in expected_alive are leaks."""
    expected_alive = expected_alive or set()
    running = E2BSandbox.list()  # returns list of running sandbox metadata
    leaked = [s for s in running if s.sandbox_id not in expected_alive]
    if leaked:
        print(f"WARNING: {len(leaked)} leaked sandbox(es) detected:")
        for s in leaked:
            print(f"  sandbox_id={s.sandbox_id}, started={s.started_at}, template={s.template_id}")
        # Kill leaked sandboxes to stop billing
        for s in leaked:
            try:
                E2BSandbox.kill(s.sandbox_id)
                print(f"  Killed leaked sandbox {s.sandbox_id}")
            except Exception as e:
                print(f"  Failed to kill {s.sandbox_id}: {e}")
    return leaked

Putting It Together — E2B Cost Budget for a Multi-Step Data Agent

A production E2B data analysis agent needs four independent budget counters: sandbox CPU-seconds, cumulative output tokens, timeout count, and peak concurrent sandbox count. Each counter has a hard ceiling, and the agent aborts the session before any ceiling is breached. Here is a minimal integration that wires the first three into a single synchronous guard; the concurrency ceiling is enforced separately by the semaphore-gated ConcurrentSandboxPool from Failure Mode 4:

Python — integrated E2B cost guard
from e2b_code_interpreter import Sandbox
from e2b_code_interpreter.exceptions import TimeoutException
import time

class E2BCostGuard:
    """Three-axis E2B cost control: CPU budget, output budget, timeout limit.

    Concurrency is capped separately (see ConcurrentSandboxPool).
    """

    def __init__(
        self,
        max_cpu_seconds: float = 300.0,
        max_output_tokens: int = 20000,
        max_timeouts: int = 2,
        timeout_per_exec: int = 30,
    ):
        self.max_cpu_seconds = max_cpu_seconds
        self.max_output_tokens = max_output_tokens
        self.max_timeouts = max_timeouts
        self.timeout_per_exec = timeout_per_exec

        self.consumed_cpu = 0.0
        self.consumed_output_tokens = 0
        self.timeout_count = 0
        self.execution_count = 0

    def _check_budgets(self):
        if self.consumed_cpu >= self.max_cpu_seconds:
            raise RuntimeError(
                f"E2B CPU budget exhausted: {self.consumed_cpu:.1f}s of {self.max_cpu_seconds}s used. "
                "Session terminated to prevent further charges."
            )
        if self.consumed_output_tokens >= self.max_output_tokens:
            raise RuntimeError(
                f"Output context budget exhausted: {self.consumed_output_tokens} tokens accumulated. "
                "Summarize earlier outputs before running more code."
            )
        if self.timeout_count >= self.max_timeouts:
            raise RuntimeError(
                f"Too many execution timeouts ({self.timeout_count}). "
                "Decompose computations into smaller steps."
            )

    def run_code(self, code: str) -> dict:
        self._check_budgets()
        self.execution_count += 1
        start = time.monotonic()
        try:
            with Sandbox() as sandbox:
                execution = sandbox.run_code(code, timeout=self.timeout_per_exec)
            elapsed = time.monotonic() - start
            self.consumed_cpu += elapsed

            # Truncate and count output tokens
            # logs are lists of output chunks; join before truncating by characters
            raw_stdout = "".join(execution.logs.stdout)[:8000]
            raw_stderr = "".join(execution.logs.stderr)[:2000]
            output_tokens = len(raw_stdout) // 4 + len(raw_stderr) // 4
            self.consumed_output_tokens += output_tokens

            return {
                "stdout": raw_stdout,
                "stderr": raw_stderr,
                "cpu_seconds": elapsed,
                "output_tokens_this_call": output_tokens,
                "budget_status": self.status(),
            }

        except TimeoutException:
            elapsed = time.monotonic() - start
            self.consumed_cpu += elapsed
            self.timeout_count += 1
            raise RuntimeError(
                f"Execution timed out after {self.timeout_per_exec}s "
                f"(timeout #{self.timeout_count}/{self.max_timeouts}). "
                f"CPU consumed for this failed execution: {elapsed:.1f}s. "
                "Break the computation into smaller pieces."
            )

    def status(self) -> dict:
        return {
            "cpu_used_s": round(self.consumed_cpu, 2),
            "cpu_remaining_s": round(self.max_cpu_seconds - self.consumed_cpu, 2),
            "output_tokens_used": self.consumed_output_tokens,
            "output_tokens_remaining": self.max_output_tokens - self.consumed_output_tokens,
            "timeouts": self.timeout_count,
            "executions": self.execution_count,
            "cpu_pct": round(self.consumed_cpu / self.max_cpu_seconds * 100, 1),
        }

Feed budget_status back to the LLM as part of each tool call response. When the LLM can see that it has consumed 80% of the CPU budget after 5 executions, it adapts its plan — sampling the data more aggressively, combining operations to reduce execution count, or reporting partial results instead of continuing. Agents that can see their resource envelope are significantly more cost-efficient than agents that can't.
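One possible way to surface the status is to render it as a few plain sentences appended to each tool result. The sketch below assumes the dictionary keys produced by `E2BCostGuard.status()` above; `format_budget_for_llm` and the 80% warning threshold are illustrative choices, not a prescribed format.

```python
def format_budget_for_llm(status: dict, warn_at_pct: float = 80.0) -> str:
    """Render a budget status dict as a short note appended to a tool result."""
    lines = [
        f"CPU budget: {status['cpu_used_s']}s used, {status['cpu_remaining_s']}s remaining "
        f"({status['cpu_pct']}%).",
        f"Output tokens: {status['output_tokens_used']} used, "
        f"{status['output_tokens_remaining']} remaining.",
        f"Timeouts so far: {status['timeouts']}.",
    ]
    if status["cpu_pct"] >= warn_at_pct:
        lines.append(
            "WARNING: CPU budget nearly exhausted. Sample the data, combine steps, "
            "or report partial results."
        )
    return "\n".join(lines)


example_status = {
    "cpu_used_s": 240.0, "cpu_remaining_s": 60.0, "cpu_pct": 80.0,
    "output_tokens_used": 9000, "output_tokens_remaining": 11000,
    "timeouts": 1, "executions": 5,
}
print(format_budget_for_llm(example_status))
```

Keeping the note to a handful of short lines matters: the budget message itself is re-read at every later step, so it should not become a new source of context growth.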

Comparison: E2B Billing vs. LLM Token Billing

Dimension | LLM token billing (GPT-4, Claude) | E2B sandbox billing
Unit | Tokens in / tokens out | CPU-seconds (wall clock, not compute)
Idle cost | Zero: charged only when the API is called | Full rate: charged while sandbox is alive regardless of activity
Retry cost | Proportional to context size at retry time | Full execution cost per retry plus idle time waiting for retry
Output cost | Direct: output tokens are billed | Indirect: large outputs inflate LLM context, multiplying future LLM costs
Parallelism | Parallel calls billed independently but close immediately | Parallel sandboxes bill simultaneously for their full lifetime
Predictability | High: tokens are countable before the call | Low: wall-clock time depends on code behavior, LLM latency, and exception paths

Summary — E2B Cost Control Checklist

  • Scope sandboxes to executions, not sessions. Use with Sandbox() as s: for every run_code call. Never hold a sandbox open across LLM reasoning steps.
  • Track timed-out CPU-seconds. A timeout still consumed compute. Count it against the budget or you'll overshoot.
  • Inject timeout reason into LLM context. Don't let the LLM retry the same code. Tell it the approach hit a resource ceiling and ask it to decompose.
  • Truncate all sandbox outputs before feeding to the LLM. Cap stdout at 2,000–4,000 tokens. Strip base64 images. Return file paths, not binary data.
  • Count accumulated output tokens across the session. A 5,000-token output at step 3 of a 10-step session costs 35,000 additional LLM context tokens across the remaining steps.
  • Set a hard concurrency ceiling before using asyncio.gather. Default Python async patterns will open N sandboxes for N tasks. Use a semaphore.
  • Detect and kill leaked sandboxes at session end. Exception paths that don't reach sandbox.kill() leave sandboxes billing at full rate until E2B's default timeout.
  • Feed budget status back to the LLM at every step. Agents that can see their CPU and output budget adapt their behavior. Agents that can't will exhaust the budget and then ask for more.

The patterns above apply regardless of which agent framework wraps E2B. LangChain's E2BDataAnalysisTool, LlamaIndex's code interpreter tool, and custom implementations all share the same sandbox lifecycle model. The circuit breaker logic sits at the sandbox client layer — below the framework — so it works uniformly across integration points. For agents that use both E2B for code execution and a frontier LLM for reasoning, budget both independently: E2B CPU-seconds for the sandbox layer, token-per-step limits for the LLM layer, and a hard session wall-time ceiling that terminates both when the overall session runs long.
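The session wall-time ceiling mentioned above can be as small as a deadline object checked before every LLM call and every sandbox open. This is a minimal sketch; `SessionDeadline` is a hypothetical helper, and the injectable clock exists only so the example and its tests run deterministically.

```python
import time
from typing import Callable


class SessionDeadline:
    """Hard wall-clock ceiling for a whole agent session (LLM calls + sandboxes)."""

    def __init__(self, max_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._deadline = clock() + max_seconds

    def remaining(self) -> float:
        return max(0.0, self._deadline - self._clock())

    def check(self) -> None:
        """Call before each LLM call and before each sandbox is opened."""
        if self.remaining() <= 0:
            raise TimeoutError("Session wall-time ceiling reached; stopping the agent.")


# Usage with a fake clock so the example is deterministic
now = [0.0]
deadline = SessionDeadline(max_seconds=600, clock=lambda: now[0])
deadline.check()             # fine at t=0
now[0] = 601.0
print(deadline.remaining())  # 0.0
```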

Frequently Asked Questions

Does E2B charge for sandbox idle time between code executions?

Yes. E2B bills by wall-clock CPU-seconds from sandbox creation to sandbox kill, regardless of whether code is actively executing. If your agent's LLM takes 5 seconds to respond between code executions and your session has 15 such gaps, you're billed for 75 extra CPU-seconds of idle time. The fix is to scope sandboxes to each individual code execution using the context manager (with Sandbox() as s:), so the sandbox is killed immediately after the code returns rather than waiting for the next execution.

How should an agent respond when E2B execution times out?

The agent should not retry the same code. A timeout means the computation exceeded the time budget for a single sandbox execution, not that it failed transiently. The correct response is to decompose the operation — process data in chunks, sample before running on the full dataset, or split a complex computation into sequential steps. The error message returned to the LLM should explicitly say "this approach hit a resource limit; decompose it" rather than "an error occurred; try again." Track timed-out executions against the CPU-second budget because the timeout still consumed compute before failing.

Why does E2B cost more than expected when the agent generates matplotlib charts?

Matplotlib figures returned by E2B are base64-encoded PNG strings. A single chart is typically 8,000–25,000 tokens when base64-encoded. If that string is included in the LLM's context as a tool result, every subsequent LLM call in the session processes those tokens again — a 20,000-token chart injected at step 3 of a 10-step session adds roughly 140,000 tokens of LLM context cost across the remaining steps. The fix is to save charts to disk and return a file path reference to the LLM instead of the raw base64 string. Only pass the actual image data when the LLM explicitly needs to analyze the visual content.

What happens when a parallel agent dispatches many E2B tasks simultaneously?

Each concurrent task opens its own sandbox, and all sandboxes bill simultaneously. Twenty parallel tasks running for 30 seconds each incur 600 CPU-seconds of E2B billing in 30 real-world seconds — the same cost as running them sequentially, just compressed into a shorter time window. Parallelism is only worth the billing if the latency reduction has business value. Use an asyncio.Semaphore to cap concurrent sandboxes at a number you've budgeted for. Any exceptions that kill sub-agents without reaching sandbox.kill() will leave sandboxes running until E2B's default 5-minute timeout — use context managers to prevent this.

How do I integrate E2B cost tracking with RunGuard?

RunGuard's circuit breaker watches tool-call patterns, budget metrics, and execution counts at the agent session level. For E2B specifically: instrument run_code calls to emit CPU-second and output-token metrics, then configure RunGuard to trip the circuit when CPU-seconds consumed exceeds your session budget, when cumulative output tokens exceed the LLM context budget, or when the timeout count exceeds your retry limit. The breaker trips before the next sandbox opens — not after the budget is already spent. See the RunGuard pattern reference for the full integration guide.
