Hugging Face's smolagents ships with a max_steps parameter on every agent. Teams set it and consider the cost problem solved. Then a CodeAgent writes a data-parsing script, the local Python interpreter raises a KeyError, the agent reads the traceback, rewrites the script, hits the same KeyError with a slightly different key name, reads that traceback, rewrites again — cycling through code variants that all fail for the same underlying reason until max_steps fires. Or a ToolCallingAgent doing web research calls a search tool with the same query phrasing four times in a row, having forgotten in each step that the previous attempt returned zero results for that exact keyword. Both patterns run to completion at full cost. Neither is caught earlier, because max_steps counts attempts — not whether those attempts are making progress or repeating the same mistake.
smolagents has a design characteristic that makes its failure modes distinct from most other agent frameworks: the CodeAgent. Where ToolCallingAgent produces JSON tool calls like most other frameworks, CodeAgent produces executable Python code as its action. The local interpreter runs it, captures stdout and any raised exception, and feeds the result back as the next step's observation. This changes the cost profile of a looping agent dramatically. A single CodeAgent step that calls web_search("query") ten times in a loop makes ten billable tool calls before a circuit breaker at the step boundary has any chance to intervene. The step counter doesn't know: it sees one step.
This post builds a production circuit breaker for smolagents: CodeAgent code-repair loop detection, tool repetition storm prevention, ManagedAgent delegation depth tracking, and memory history inflation monitoring — all implemented through smolagents' built-in step_callbacks hook and a lightweight MultiStepAgent subclass. At the end you'll see how RunGuard's one-call install replaces this hand-rolled implementation with a managed version that sends Slack alerts on trip and exposes a 30-day incidents dashboard.
What you'll build: A circuit breaker that catches CodeAgent code-repair loops (same exception class raised across N consecutive steps), detects tool repetition storms in ToolCallingAgent (same tool + args hash repeated within a session), tracks ManagedAgent delegation depth via contextvars to catch orchestrator-specialist-orchestrator cycles, and monitors per-step memory size to trip before token inflation makes each LLM call prohibitively expensive.
Why max_steps fails differently in smolagents than other frameworks
Most agent frameworks have one kind of agent: a reasoning loop that calls tools and decides next actions. smolagents has two meaningfully different agent architectures that each expose max_steps but have completely different failure modes beneath it.
ToolCallingAgent works like LangGraph, CrewAI, or AutoGen under the hood: the LLM generates a structured tool call, smolagents executes the corresponding tool, appends the result to the memory as an ActionStep, and the LLM decides what to do next. In this model max_steps correctly bounds the number of tool-call/response cycles. The failure modes are the familiar ones: the same tool called repeatedly, handoff cycles in multi-agent setups, memory inflation from large tool outputs. These parallel what we've documented for LangGraph, OpenAI Agents SDK, and CrewAI.
CodeAgent is genuinely different. The LLM's action is a block of Python code, not a JSON tool descriptor. smolagents executes this code inside a LocalPythonInterpreter (or a sandboxed E2B executor for production environments). The code can call any of the agent's registered tools as Python functions — potentially many times in a single step. This means one step can make arbitrarily many tool calls, accumulate arbitrarily many tokens of tool output, and produce arbitrarily many tokens of stdout, all before max_steps increments. The step-count cap is no longer a reliable cost predictor.
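One way to bound the within-step fan-out described above is to count calls inside the tool functions themselves, since a step-boundary hook only runs once the step is over. The sketch below is a minimal illustration and not a smolagents API: PerStepCallBudget, ToolCallBudgetExceeded, and fake_search are hypothetical names, and how you wrap a real tool (for example around its forward method) is an assumption to verify against your smolagents version.

```python
from typing import Any, Callable


class ToolCallBudgetExceeded(RuntimeError):
    """Raised when wrapped tools are called too often within one step."""


class PerStepCallBudget:
    """Counts calls to wrapped tool functions and caps them per step."""

    def __init__(self, max_calls_per_step: int = 5) -> None:
        self.max_calls_per_step = max_calls_per_step
        self.calls_this_step = 0

    def wrap(self, tool_fn: Callable[..., Any]) -> Callable[..., Any]:
        """Return a version of tool_fn that counts against the shared budget."""
        def guarded(*args: Any, **kwargs: Any) -> Any:
            self.calls_this_step += 1
            if self.calls_this_step > self.max_calls_per_step:
                raise ToolCallBudgetExceeded(
                    f"{self.calls_this_step} tool calls in one step "
                    f"(limit {self.max_calls_per_step})"
                )
            return tool_fn(*args, **kwargs)
        return guarded

    def new_step(self) -> None:
        """Reset the counter; call this at each step boundary."""
        self.calls_this_step = 0


def fake_search(query: str) -> str:
    """Stand-in for a real search tool (hypothetical)."""
    return f"results for {query}"


budget = PerStepCallBudget(max_calls_per_step=3)
search = budget.wrap(fake_search)
```

Calling budget.new_step() from a step callback would reset the counter between steps, so the cap applies to each step rather than to the whole run.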
The failure mode unique to CodeAgent is the code-repair loop: the agent writes code that raises an exception, reads the traceback, writes a revised version that raises the same exception class for a different reason, reads that traceback, and continues revising indefinitely. From the agent's perspective it's "making progress" — each revision is a genuine attempt to fix the error. From a cost perspective it's burning tokens on an unresolvable problem: a missing API credential, a schema mismatch in external data, a broken dependency. max_steps will eventually stop it, but not before running 20 or 30 expensive LLM calls on increasingly large prompts that include every failed attempt and its traceback.
The four failure modes smolagents' built-in controls miss
1. CodeAgent code-repair infinite loop
A CodeAgent tasked with fetching and parsing JSON from an external API writes code that calls requests.get(url).json()["data"]. The API returns a response where the top-level key is "results", not "data". The interpreter raises KeyError: 'data'. The agent reads the traceback, decides to try ["results"] instead — but the structure under "results" has a different shape than expected, and the next line raises KeyError: 'items'. The agent revises again. Now the nested key is present but the data is paginated in a way the agent didn't anticipate — TypeError: 'NoneType' object is not subscriptable. The agent revises again.
Each revision is a full LLM call with the entire prior conversation as context — including every failed code block and its traceback. By step 8 the prompt contains 7 complete Python scripts, 7 tracebacks, and the agent's reasoning about each failure. The per-step cost has grown 4–6× compared to the first step. The problem is unresolvable via code revision — what's needed is a different API endpoint or updated API credentials — but the agent keeps trying.
Detection signal: the same Python exception class appearing in the ActionStep.error field across N consecutive steps. A KeyError on step 3 followed by a KeyError on step 4 could be progress (different key). A KeyError on steps 3, 4, 5, and 6 is not progress — the agent is cycling through key-name guesses against a schema it doesn't understand. Three consecutive steps with the same top-level exception class is a warning. Four is a trip threshold. This is readable from ActionStep.error without any LLM-level access — just check the type name of the raised exception after each step execution.
2. ToolCallingAgent tool repetition storm
smolagents' ToolCallingAgent uses the standard tool-call model: the LLM produces a {"name": "tool_name", "arguments": {...}} call, smolagents executes the matching tool, and the result is stored as an ActionStep. When a tool returns a result the agent considers insufficient — zero search results, an empty database response, a rate-limit error — the agent typically retries. Whether it retries with the same arguments or slightly modified ones depends on the model and context.
The tool repetition storm occurs when the retry produces the same outcome and the agent's reasoning loop doesn't update its belief that the tool can succeed with this input class. A web-search agent querying for a product that doesn't exist in the indexed corpus calls web_search("exact product name"), gets no results, calls web_search("exact product name review"), gets no results, calls web_search("buy exact product name"), gets no results. Each call is a distinct query string, so the agent doesn't perceive it as repetition. It is still paying for three near-identical round trips to the search provider and three LLM reasoning steps on the same empty-results context.
Detection signal: the same tool name appearing in consecutive ActionStep records with argument fingerprints that cluster within a Levenshtein distance or share the same semantic root. For a simpler implementation: track (tool_name, frozenset(args.items())) as an exact key, and separately track (tool_name,) as a coarser key. Three calls to the same tool in five consecutive steps — regardless of argument variation — is a signal that the agent isn't breaking out of a failing strategy. Four exact-match (tool_name, args_fingerprint) pairs in a session is a hard trip.
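The fuzzy variant of this signal can be approximated with the standard library alone. The sketch below treats two queries as near-duplicates when they share most of their tokens or most of their characters; it uses difflib's similarity ratio as a stand-in for a true Levenshtein distance, and the function names and the 0.6 threshold are illustrative assumptions rather than tuned values.

```python
from difflib import SequenceMatcher
from typing import FrozenSet, List


def normalized_tokens(query: str) -> FrozenSet[str]:
    """Lowercased whitespace tokens of a query."""
    return frozenset(query.lower().split())


def is_near_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    """True when two queries share most tokens or most characters."""
    ta, tb = normalized_tokens(a), normalized_tokens(b)
    if ta and tb:
        # Fraction of the shorter query's tokens that also appear in the other.
        overlap = len(ta & tb) / min(len(ta), len(tb))
        if overlap >= threshold:
            return True
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def count_near_duplicates(history: List[str], new_query: str) -> int:
    """Number of earlier queries that the new query effectively repeats."""
    return sum(is_near_duplicate(old, new_query) for old in history)


history = ["exact product name", "exact product name review"]
repeats = count_near_duplicates(history, "buy exact product name")
```

With the three queries from the example above, the third query counts as a repeat of both earlier ones, which an exact-match fingerprint would miss.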
3. ManagedAgent delegation depth: orchestrator-specialist cycles
smolagents' ManagedAgent wraps any MultiStepAgent to make it callable as a tool by another agent. This is the natural pattern for building multi-agent pipelines in smolagents: an orchestrator has tools that are ManagedAgent wrappers around specialist agents. Each specialist handles a domain — one for web research, one for code execution, one for data analysis. The orchestrator decides which specialist to invoke and what task to give it.
The delegation loop failure mode: the research specialist encounters a question that requires code execution to verify the answer. Rather than producing a final answer, it decides the question is "out of scope" for research and calls back to the orchestrator with a clarification request. The orchestrator, reasoning that this is a research question, calls the research specialist again with a slightly rephrased task. Neither agent's max_steps counter advances more than once per round-trip — from each agent's perspective it made one decision and handed off. The cycle can run indefinitely, compounding costs with each delegation: the orchestrator's call costs tokens, passing the full task description; the specialist's clarification costs tokens, including all prior context; the orchestrator's re-dispatch costs tokens again.
Detection signal: nested ManagedAgent.__call__() depth exceeding a threshold. Track it with a contextvars.ContextVar that increments before each managed-agent invocation and decrements after. Counting agents in the call chain, an orchestrator calling one specialist is depth 2, which is expected. A specialist that calls back through the orchestrator to reach another specialist is depth 3, which is suspicious. Depth 4 in a system designed with one orchestrator layer is a cycle. Note that a ContextVar starting at 0 counts delegations rather than agents, so its value is one less than these chain depths; set the threshold with that offset in mind. This state is cross-cutting (no individual agent can see it), so a shared context variable is the right instrument.
4. Memory history inflation across steps
smolagents stores execution history in agent.memory as a list of step objects: SystemPromptStep, TaskStep (the user's task), ActionStep (containing the generated code or tool call, the execution result, and any error), and PlanningStep (for agents using the planning feature). Every subsequent LLM call includes a serialized version of this full history as context. This is the correct behavior for maintaining coherent multi-step reasoning, but it creates a predictable cost escalation pattern when steps produce large outputs.
The inflation failure mode is most severe with CodeAgent: a code step that prints a large dataset as debugging output, a data-processing script that emits megabytes of intermediate results to stdout, or a failing API call that returns a verbose HTML error page instead of JSON. smolagents captures all stdout from the code execution as the step's observation. If a step emits 5,000 tokens of stdout, every subsequent LLM call carries those 5,000 tokens in its context window. A five-step session where each step emits 2,000 tokens of output has a step-6 prompt that's 10,000 tokens larger than the step-1 prompt — before accounting for the agent's own reasoning at each step.
Detection signal: the ratio of the current step's serialized memory size to the median memory size across the first three steps. Serialize the memory list with json.dumps([step.__dict__ for step in agent.memory.steps], default=str) and use character count as a token proxy (divide by 4). A 2.5× growth ratio after three or more steps indicates abnormal inflation. A 4× ratio is a hard trip threshold: the session is producing more context noise than informational signal, and continuing will make each subsequent step 4× more expensive than the baseline for no improvement in output quality.
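The two thresholds above (2.5× as a warning, 4× as a hard trip) can be expressed as a small pure function over the history of serialized memory sizes. This is a sketch under the assumptions stated in the paragraph: the baseline is the median of the first three sizes and characters divided by four approximate tokens. The names classify_growth and estimate_tokens are illustrative.

```python
from statistics import median
from typing import Sequence


def estimate_tokens(char_count: int) -> int:
    """Rough token proxy: about four characters per token."""
    return char_count // 4


def classify_growth(
    sizes: Sequence[int], warn: float = 2.5, trip: float = 4.0
) -> str:
    """Classify the latest memory size against the median of the first three."""
    if len(sizes) < 4:
        return "baseline"  # still collecting the first three samples
    baseline = median(sizes[:3])
    if baseline == 0:
        return "baseline"
    ratio = sizes[-1] / baseline
    if ratio >= trip:
        return "trip"
    if ratio >= warn:
        return "warn"
    return "ok"


# Serialized memory size in characters after each step (made-up numbers).
sizes = [2000, 2100, 2300, 6000]
status = classify_growth(sizes)
```

Here the baseline is 2,100 characters and the latest size is 6,000, a ratio of about 2.9, so the result is a warning rather than a trip.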
Building the circuit breaker
smolagents provides two clean integration points. step_callbacks is a list of callables that smolagents calls after each step, receiving the completed ActionStep as an argument. For per-step checks — tool repetition, code-repair loop detection, memory inflation — step_callbacks is the right hook: it requires zero subclassing and works identically for both CodeAgent and ToolCallingAgent. For pre-invocation checks on ManagedAgent calls, subclassing is necessary to intercept before the delegation happens.
The breaker itself is a shared dataclass that tracks state across callbacks and exposes a gate() method. Callbacks call gate(); if the breaker is OPEN, they raise AgentBudgetExceededError to abort the current run. The HALF_OPEN probe logic allows one test call after a configurable recovery timeout — useful when the trip was caused by a transient tool failure rather than a structural problem in the agent's task.
from __future__ import annotations

import contextvars
import hashlib
import json
import time
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Deque, Dict, List, Optional, Tuple

from smolagents import MultiStepAgent, ManagedAgent, CodeAgent, ToolCallingAgent
from smolagents.memory import ActionStep


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# Module-level context variable, shared across all nested agent calls.
_delegation_depth: contextvars.ContextVar[int] = contextvars.ContextVar(
    "smolagents_delegation_depth", default=0
)
@dataclass
class SmolagentsBreaker:
    """Circuit breaker for smolagents MultiStepAgent runs.

    Wire into an agent via:
        agent.step_callbacks.append(breaker.step_callback)
    """

    # Thresholds
    max_consecutive_same_exception: int = 4
    max_tool_repeats_session: int = 4
    max_consecutive_tool_calls: int = 3
    max_memory_growth_ratio: float = 4.0
    max_delegation_depth: int = 4
    half_open_timeout_s: float = 30.0

    # Runtime state
    state: BreakerState = field(default=BreakerState.CLOSED, init=False)
    _trip_reason: Optional[str] = field(default=None, init=False)
    _opened_at: Optional[float] = field(default=None, init=False)

    # Exception-class tracking (CodeAgent code-repair loop)
    _exception_history: Deque[Optional[str]] = field(
        default_factory=lambda: deque(maxlen=6), init=False
    )

    # Tool call tracking (ToolCallingAgent storm)
    _tool_calls_session: Dict[str, int] = field(default_factory=dict, init=False)
    _recent_tool_calls: Deque[Tuple[str, str]] = field(
        default_factory=lambda: deque(maxlen=5), init=False
    )

    # Memory inflation tracking
    _baseline_memory_sizes: List[int] = field(default_factory=list, init=False)

    def reset(self) -> None:
        self.state = BreakerState.CLOSED
        self._trip_reason = None
        self._opened_at = None
        self._exception_history.clear()
        self._tool_calls_session.clear()
        self._recent_tool_calls.clear()
        self._baseline_memory_sizes.clear()

    def gate(self) -> None:
        """Raise if the breaker is OPEN and the recovery timeout has not elapsed."""
        if self.state == BreakerState.CLOSED:
            return
        if self.state == BreakerState.HALF_OPEN:
            return  # Allow the probe step through
        # OPEN: move to HALF_OPEN once the recovery timeout has elapsed
        if (
            self._opened_at is not None
            and time.monotonic() - self._opened_at > self.half_open_timeout_s
        ):
            self.state = BreakerState.HALF_OPEN
            return
        raise AgentBudgetExceededError(
            f"smolagents circuit breaker OPEN: {self._trip_reason}"
        )

    def _trip(self, reason: str) -> None:
        self.state = BreakerState.OPEN
        self._trip_reason = reason
        self._opened_at = time.monotonic()
        raise AgentBudgetExceededError(f"smolagents circuit breaker tripped: {reason}")

    # ------------------------------------------------------------------ #
    # step_callback: attach to agent.step_callbacks                      #
    # ------------------------------------------------------------------ #
    def step_callback(self, step: ActionStep, agent: MultiStepAgent) -> None:
        """Called by smolagents after each step completes."""
        self.gate()
        self._check_code_repair_loop(step)
        self._check_tool_repetition(step)
        self._check_memory_inflation(agent)
        # A HALF_OPEN probe step that passes every check closes the breaker.
        if self.state == BreakerState.HALF_OPEN:
            self.state = BreakerState.CLOSED

    def _check_code_repair_loop(self, step: ActionStep) -> None:
        """Detect CodeAgent stuck in the same exception class across steps."""
        exc_name = None
        if step.error is not None:
            # Caveat: depending on the smolagents version, step.error may be a
            # framework wrapper (such as AgentExecutionError) rather than the
            # original Python exception. In that case this name is the wrapper's
            # and the underlying class has to be read from the error message.
            exc_name = type(step.error).__name__
        self._exception_history.append(exc_name)

        # Count consecutive identical non-None exceptions at the tail
        if exc_name is None:
            return
        consecutive = 0
        for entry in reversed(self._exception_history):
            if entry == exc_name:
                consecutive += 1
            else:
                break
        if consecutive >= self.max_consecutive_same_exception:
            self._trip(
                f"CodeAgent code-repair loop: {exc_name} raised {consecutive} "
                f"consecutive times — agent is not resolving the underlying error"
            )

    def _check_tool_repetition(self, step: ActionStep) -> None:
        """Detect ToolCallingAgent calling the same tool repeatedly."""
        if not step.tool_calls:
            return
        for call in step.tool_calls:
            tool_name = getattr(call, "name", None) or getattr(call, "tool_name", None)
            if not tool_name:
                continue
            # CodeAgent typically records its whole code block as one pseudo-tool
            # call (commonly named "python_interpreter"). Skip it, otherwise every
            # CodeAgent step counts as the same tool. Adjust the name if needed.
            if tool_name == "python_interpreter":
                continue
            args = getattr(call, "arguments", {}) or {}
            args_fp = hashlib.md5(
                json.dumps(args, sort_keys=True, default=str).encode()
            ).hexdigest()[:8]
            exact_key = f"{tool_name}:{args_fp}"
            self._tool_calls_session[exact_key] = (
                self._tool_calls_session.get(exact_key, 0) + 1
            )
            self._recent_tool_calls.append((tool_name, args_fp))

            # Hard trip: same (tool, args) pair too many times in the session
            if self._tool_calls_session[exact_key] >= self.max_tool_repeats_session:
                self._trip(
                    f"Tool repetition storm: {tool_name}({args_fp}) called "
                    f"{self._tool_calls_session[exact_key]} times in this session"
                )

            # Soft trip: same tool N consecutive times (any args)
            recent_names = [n for n, _ in self._recent_tool_calls]
            if len(recent_names) >= self.max_consecutive_tool_calls:
                if all(
                    n == tool_name
                    for n in recent_names[-self.max_consecutive_tool_calls:]
                ):
                    self._trip(
                        f"Tool storm: {tool_name} called {self.max_consecutive_tool_calls} "
                        f"consecutive times — agent is not breaking out of failing strategy"
                    )

    def _check_memory_inflation(self, agent: MultiStepAgent) -> None:
        """Detect abnormal memory growth from large step outputs."""
        try:
            steps = agent.memory.steps
        except AttributeError:
            return
        # Character count of the serialized history, used as a token proxy.
        current_size = len(
            json.dumps([getattr(s, "__dict__", {}) for s in steps], default=str)
        )
        # The first three observations establish the baseline.
        if len(self._baseline_memory_sizes) < 3:
            self._baseline_memory_sizes.append(current_size)
            return
        # Median of the baseline samples
        baseline = sorted(self._baseline_memory_sizes)[len(self._baseline_memory_sizes) // 2]
        if baseline == 0:
            return
        ratio = current_size / baseline
        if ratio >= self.max_memory_growth_ratio:
            self._trip(
                f"Memory inflation: step memory is {ratio:.1f}x the baseline "
                f"({current_size:,} vs {baseline:,} chars) — large outputs are "
                f"compounding prompt costs on every subsequent step"
            )
class AgentBudgetExceededError(Exception):
    """Raised when the circuit breaker trips an agent run."""


# ------------------------------------------------------------------ #
# ManagedAgent subclass: add delegation depth tracking               #
# ------------------------------------------------------------------ #
class GuardedManagedAgent(ManagedAgent):
    """ManagedAgent that enforces delegation depth via a ContextVar."""

    def __init__(self, *args, breaker: Optional[SmolagentsBreaker] = None, **kwargs):
        super().__init__(*args, **kwargs)
        self._breaker = breaker

    def __call__(self, request: str, **kwargs) -> str:
        # Number of managed-agent calls already active in this context.
        depth = _delegation_depth.get()
        if self._breaker and depth >= self._breaker.max_delegation_depth:
            raise AgentBudgetExceededError(
                f"Delegation depth {depth} exceeded limit "
                f"{self._breaker.max_delegation_depth} — delegation cycle detected"
            )
        token = _delegation_depth.set(depth + 1)
        try:
            return super().__call__(request, **kwargs)
        finally:
            # Restore the previous depth even if the delegated run raises.
            _delegation_depth.reset(token)
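One caveat on the code-repair check: if your smolagents version stores a framework wrapper in ActionStep.error instead of the original Python exception, type(step.error).__name__ returns the wrapper's name on every failing step. The helper below is a best-effort sketch for that case, assuming the wrapper's message embeds the original class name (for example "... due to: KeyError: 'data'"). The wrapper names in _WRAPPERS and the message format are assumptions to check against your installed version, and underlying_exception_name is a hypothetical helper.

```python
import re
from typing import Optional

# Matches names such as KeyError, TypeError, or TimeoutException.
_EXC_NAME = re.compile(r"\b([A-Z][A-Za-z0-9_]*(?:Error|Exception))\b")

# Wrapper classes to look through; this set is an assumption, extend as needed.
_WRAPPERS = {"AgentError", "AgentExecutionError", "InterpreterError"}


def underlying_exception_name(error: BaseException) -> Optional[str]:
    """Best-effort name of the exception that actually failed in user code."""
    own_name = type(error).__name__
    if own_name not in _WRAPPERS:
        return own_name
    # Look for the first exception-like name mentioned in the message.
    for candidate in _EXC_NAME.findall(str(error)):
        if candidate not in _WRAPPERS:
            return candidate
    return own_name


# Local stand-in for a framework wrapper, defined here only for illustration.
class AgentExecutionError(Exception):
    pass


wrapped = AgentExecutionError(
    "Code execution failed at line 'x = d[\"data\"]' due to: KeyError: 'data'"
)
name = underlying_exception_name(wrapped)
```

Swapping this helper in for the plain type-name lookup keeps the consecutive-class logic unchanged while comparing the class that actually failed.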
Wiring the breaker into your agents
For a standalone CodeAgent or ToolCallingAgent, attach the breaker via step_callbacks before running the agent:
from smolagents import CodeAgent, HfApiModel, DuckDuckGoSearchTool

model = HfApiModel("Qwen/Qwen2.5-72B-Instruct")
search_tool = DuckDuckGoSearchTool()

agent = CodeAgent(tools=[search_tool], model=model, max_steps=20)

breaker = SmolagentsBreaker(
    max_consecutive_same_exception=4,
    max_tool_repeats_session=4,
    max_memory_growth_ratio=4.0,
)
agent.step_callbacks.append(breaker.step_callback)

# Reset breaker state before each new run
breaker.reset()
try:
    result = agent.run("Parse the product catalog from https://example.com/api/products")
except AgentBudgetExceededError as e:
    print(f"Run aborted: {e}")
    # Recover: return best partial result, log incident, alert on-call
    result = None
For a multi-agent pipeline with an orchestrator and one or more specialists, wrap the specialists in GuardedManagedAgent and share the breaker instance across all agents:
from smolagents import ToolCallingAgent, HfApiModel, DuckDuckGoSearchTool, PythonInterpreterTool

model = HfApiModel("Qwen/Qwen2.5-72B-Instruct")
shared_breaker = SmolagentsBreaker(max_delegation_depth=3)

# Specialist agent
research_agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool()],
    model=model,
    max_steps=10,
    name="research_agent",
    description="Searches the web for factual information.",
)
research_agent.step_callbacks.append(shared_breaker.step_callback)

# Wrap the specialist as a ManagedAgent tool, using GuardedManagedAgent
managed_researcher = GuardedManagedAgent(
    agent=research_agent,
    name="research_agent",
    description="Searches the web for factual information.",
    breaker=shared_breaker,
)

# Orchestrator
orchestrator = ToolCallingAgent(
    tools=[managed_researcher, PythonInterpreterTool()],
    model=model,
    max_steps=15,
    name="orchestrator",
)
orchestrator.step_callbacks.append(shared_breaker.step_callback)

shared_breaker.reset()
try:
    result = orchestrator.run(
        "Research and summarize the key differences between RAG and "
        "fine-tuning for production use cases."
    )
except AgentBudgetExceededError as e:
    print(f"Pipeline aborted: {e}")
    result = None
Failure mode traces and what the breaker catches
Here is what each breaker trip looks like at runtime, mapped to the four failure modes:
CodeAgent repair loop trip — step 4 raises KeyError: 'items'. Breaker sees _exception_history = [KeyError, KeyError, KeyError, KeyError] at the tail. Raises AgentBudgetExceededError: "CodeAgent code-repair loop: KeyError raised 4 consecutive times — agent is not resolving the underlying error". Saved: 16 additional LLM calls at escalating prompt sizes.
Tool repetition trip — web_search("competitive analysis fintech 2026") called 4 times with identical args across the session. Exact key web_search:a3f2c1b4 hits count 4. Raises AgentBudgetExceededError: "Tool repetition storm: web_search(a3f2c1b4) called 4 times in this session". Saved: continuation of a search strategy that returned no results in 4 prior attempts.
Delegation depth trip: with max_delegation_depth=3, the orchestrator calls the research specialist (first delegation), the specialist calls back to the orchestrator for clarification (second), and the orchestrator re-dispatches to the research specialist (third). When the specialist tries to call back a fourth time, GuardedManagedAgent.__call__ reads a depth of 3 from the ContextVar, which counts delegations already in flight, finds that it has reached the limit of 3, and raises AgentBudgetExceededError with a message beginning "Delegation depth 3 exceeded limit 3". Neither agent's max_steps would have caught this, because each saw only one delegation in its own step history.
Memory inflation trip — steps 1–3 establish a median memory size of 2,100 chars. Step 5 runs code that prints a 40KB DataFrame to stdout. Serialized memory size reaches 43,000 chars. Ratio 43,000 / 2,100 = 20.5×, well above the 4.0× threshold. Raises AgentBudgetExceededError: "Memory inflation: step memory is 20.5x the baseline (43,000 vs 2,100 chars)". Saved: all subsequent steps running against a 43,000-char context floor.
Detection threshold guidance
| Failure mode (parameter) | Conservative threshold | Aggressive threshold | When to tighten |
|---|---|---|---|
| Code-repair loop (max_consecutive_same_exception) | 5 consecutive | 3 consecutive | Agent tasks have well-defined data schemas (fewer legitimate retries needed) |
| Tool repetition, session (max_tool_repeats_session) | 5 exact repeats | 3 exact repeats | Tools are idempotent and deterministic (retry adds no value) |
| Tool storm, consecutive (max_consecutive_tool_calls) | 4 consecutive | 3 consecutive | Tasks that legitimately require sequential multi-step tool chains are rare |
| Memory inflation (max_memory_growth_ratio) | 5× | 3× | Agent tasks have predictable output sizes (data parsing pipelines) |
| Delegation depth (max_delegation_depth) | 4 | 3 | Single orchestrator + single specialist layer with no expected back-delegation |
Frequently asked questions
The CodeAgent sometimes legitimately raises the same exception class multiple times while solving a complex problem. Won't the code-repair trip fire too eagerly?
Yes, if the consecutive threshold is set to 3 or 4 and the task genuinely requires multiple rounds of exception-driven debugging. The key distinction is between consecutive identical exceptions and non-consecutive ones. A KeyError on step 2 followed by a TypeError on step 3 followed by another KeyError on step 4 is a sign of real progress — the agent solved one error class and encountered a new one. The breaker only trips on consecutive occurrences of the same exception class, not on total occurrences. If your agents legitimately handle 4+ rounds of the same exception type, raise the threshold to 6 or 7 and add a secondary check: if the exception message text is more than 80% similar across consecutive steps (indicating identical failure), trip at 3 regardless of threshold.
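The secondary check described above can be sketched with difflib's SequenceMatcher as the similarity measure. That choice is an assumption: any string-similarity metric would do, and the 80% figure is applied here to adjacent messages. The function names and the (exception_name, message) history format are illustrative.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def messages_nearly_identical(messages: Sequence[str], threshold: float = 0.8) -> bool:
    """True when every adjacent pair of messages is at least `threshold` similar."""
    if len(messages) < 2:
        return False
    return all(
        SequenceMatcher(None, a, b).ratio() >= threshold
        for a, b in zip(messages, messages[1:])
    )


def should_trip(
    history: Sequence[Tuple[str, str]],
    class_threshold: int = 6,
    similar_threshold: int = 3,
) -> bool:
    """history holds (exception_name, message) per failing step, oldest first."""
    if not history:
        return False
    last_name = history[-1][0]
    # Collect messages from the unbroken run of the same class at the tail.
    tail: List[str] = []
    for name, message in reversed(history):
        if name != last_name:
            break
        tail.append(message)
    tail.reverse()
    if len(tail) >= class_threshold:
        return True  # same class too many times, whatever the messages say
    recent = tail[-similar_threshold:]
    return len(recent) == similar_threshold and messages_nearly_identical(recent)


identical = [("KeyError", "KeyError: 'data'")] * 3
varied = [
    ("KeyError", "KeyError: 'data'"),
    ("KeyError", "KeyError: 'pagination_cursor_token'"),
    ("KeyError", "KeyError: 'x'"),
]
```

Three identical KeyError messages trip early, while three KeyErrors on clearly different keys are allowed to continue until the higher class threshold is reached.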
Our CodeAgent's step_callbacks receive the ActionStep after it completes. By then the expensive code has already run. How is this different from just letting max_steps fire?
The value of the step_callback breaker is not in stopping the current step. It is in stopping all subsequent steps before they start. max_steps=20 on an agent that's been in a code-repair loop since step 4 means it will run 16 more expensive steps. The breaker's gate() is called at the top of step_callback, which smolagents calls before beginning the next step's LLM call. When the breaker trips at step 4, it raises before steps 5 through 20 execute. The practical savings are significant: if the first additional LLM call in a repair loop costs $0.04 and each later call costs 20% more than the one before it because of memory accumulation, the 16 additional steps cost approximately $3.50 (0.04 × (1.2^16 − 1) / 0.2) per run, per agent instance. Multiply by your daily run volume.
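The savings estimate is a geometric sum and is easy to check directly. The snippet below assumes compounding growth: the first avoided step costs $0.04 and each later step costs 20% more than the one before it. The function name repair_loop_cost is illustrative.

```python
def repair_loop_cost(first_step_cost: float, growth: float, steps: int) -> float:
    """Total cost of `steps` calls where each call costs `growth` more than the last."""
    return sum(first_step_cost * (1 + growth) ** k for k in range(steps))


total = repair_loop_cost(first_step_cost=0.04, growth=0.20, steps=16)
print(f"${total:.2f}")  # prints $3.50
```

Plug in your own per-call cost and growth rate; the total is sensitive to the growth figure because it compounds across every avoided step.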
Does this work with smolagents' streaming output via stream_to_gradio or run_stream?
step_callbacks fire synchronously after each step regardless of whether you're using streaming output. smolagents' streaming functions yield tokens during generation but still call step callbacks after each complete step, including the ActionStep with all fields populated. The breaker works identically in streaming and non-streaming modes. The one difference: if you're consuming a run_stream() generator, the AgentBudgetExceededError will propagate out of the generator at the next next() call after the trip, not at the point where the tripped step's tokens finish streaming. Wrap your stream consumer in a try/except for AgentBudgetExceededError at the generator level, not inside the token-handling loop.
How does the tool repetition check interact with tools that are designed to be called multiple times — pagination tools, polling tools, or map-reduce over a list?
The exact-key check (tool_name + args_fingerprint) handles pagination cleanly: page 1, page 2, and page 3 all have different page arguments, so they produce distinct fingerprints and don't accumulate toward the session repeat threshold. The consecutive-calls check still counts paginated calls, so raise max_consecutive_tool_calls for agents that page through results. Where you need more caution is with polling tools that accept a job ID as the only argument: check_job_status("job-123") will match identically across every poll. The breaker class above has no allowlist, so this is an extension you would add yourself: give the breaker a set such as tool_storm_allowlist = {"check_job_status", "wait_for_result"} and skip the exact-key check for allowlisted tool names. The consecutive-calls check (same tool N times in a row) should still apply to allowlisted tools at a higher threshold, preventing infinite polling if the job never completes.
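A minimal sketch of that allowlist behavior, kept separate from the breaker class so it can be read on its own. ToolStormTracker and its parameter names are hypothetical, and the limit of 20 consecutive polls is an arbitrary placeholder rather than a recommendation.

```python
from collections import Counter
from typing import Optional, Set, Tuple


class ToolStormTracker:
    """Exact-repeat counting with an allowlist for polling-style tools."""

    def __init__(
        self,
        max_exact_repeats: int = 4,
        max_allowlisted_consecutive: int = 20,
        allowlist: Optional[Set[str]] = None,
    ) -> None:
        self.max_exact_repeats = max_exact_repeats
        self.max_allowlisted_consecutive = max_allowlisted_consecutive
        self.allowlist: Set[str] = set(allowlist or ())
        self.exact_counts: Counter = Counter()
        self.last_tool: Optional[str] = None
        self.consecutive = 0

    def record(self, tool_name: str, args_fp: str) -> Optional[str]:
        """Return a trip reason, or None if the call is within limits."""
        # Track how many times in a row this tool has been called.
        if tool_name == self.last_tool:
            self.consecutive += 1
        else:
            self.last_tool, self.consecutive = tool_name, 1

        if tool_name in self.allowlist:
            # Allowlisted tools skip the exact-key check but keep a polling cap.
            if self.consecutive >= self.max_allowlisted_consecutive:
                return f"{tool_name} polled {self.consecutive} times in a row"
            return None

        key: Tuple[str, str] = (tool_name, args_fp)
        self.exact_counts[key] += 1
        if self.exact_counts[key] >= self.max_exact_repeats:
            return f"{tool_name}({args_fp}) repeated {self.exact_counts[key]} times"
        return None


tracker = ToolStormTracker(
    max_exact_repeats=4,
    max_allowlisted_consecutive=5,
    allowlist={"check_job_status"},
)
polls = [tracker.record("check_job_status", "job-123") for _ in range(5)]
searches = [tracker.record("web_search", "a3f2c1b4") for _ in range(4)]
```

The first four identical polls pass and only the fifth returns a reason, whereas the non-allowlisted search is flagged on its fourth exact repeat.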
We use E2B's sandboxed executor instead of LocalPythonInterpreter for CodeAgent. Does the circuit breaker still work?
Yes — the circuit breaker operates on the ActionStep object that smolagents populates after code execution completes, regardless of which executor ran the code. ActionStep.error is populated by smolagents from whatever exception the executor raises or returns, and ActionStep.observations contains the captured stdout. The E2B executor's remote execution is transparent to the breaker. The only adjustment needed is the memory inflation check: E2B execution results may include additional metadata that inflates the raw ActionStep.__dict__ size. If you observe false trips on the inflation check with E2B, either raise max_memory_growth_ratio to 6× or serialize only step.observations and step.error for the size estimate rather than the full step dict.
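The narrower size estimate mentioned above could look like the following sketch. It reads only the observations and error attributes from each step via getattr, so it makes no assumption about what else an executor attaches. The function name lean_memory_chars is illustrative, and the SimpleNamespace objects are stand-ins for real step objects.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def lean_memory_chars(steps: Iterable[Any]) -> int:
    """Sum the lengths of observations and error text, ignoring other fields."""
    total = 0
    for step in steps:
        observations = getattr(step, "observations", None)
        error = getattr(step, "error", None)
        if observations:
            total += len(str(observations))
        if error is not None:
            total += len(str(error))
    return total


# Stand-in steps; the metadata field mimics extra executor payload.
fake_steps = [
    SimpleNamespace(observations="abc", error=None, metadata="x" * 1000),
    SimpleNamespace(observations=None, error=ValueError("bad")),
    SimpleNamespace(),  # a step type with neither attribute
]
size = lean_memory_chars(fake_steps)
```

Using this in place of the full-dict serialization keeps the ratio focused on the text that dominates prompt growth, at the cost of ignoring the generated code itself.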