June 21, 2026 Hugging Face Transformers Agents Cost Control Loop Detection

Hugging Face Transformers Agents Cost Control: Iteration Loops, Tool Error Cascades, Model Reload Amplification, and Context Overflow Spirals

Hugging Face Transformers Agents (the ReactAgent and CodeAgent classes and their variants in the transformers.agents module) brought agent-based reasoning to the broader ecosystem of open-weight models hosted on the Hugging Face Hub and to any pipeline-compatible local checkpoint. Before the module evolved into the separate smolagents library, it saw wide adoption among teams that wanted to run capable agents without depending on the OpenAI API. The framework remains in use on Inference Endpoints, local GPU instances, and the HF Inference API's serverless tier.

The cost model is straightforward: every Thought-Action-Observation cycle in the agent's reasoning loop requires one or more forward passes through a language model. On the HF Inference API this means per-token billing. On Inference Endpoints (dedicated compute you provision), it means per-second compute billing regardless of whether the model is actively generating. On local GPU infrastructure it means GPU-hours. All three billing surfaces share a critical property: any loop or retry pattern that extends the reasoning chain, restarts the chain from scratch, or prevents the agent from converging multiplies cost linearly or worse. Four structural failure modes in the Transformers Agents architecture are responsible for the majority of unexpected billing spikes:

Infinite iteration loops: the ReactAgent or CodeAgent keeps running reasoning steps past the intended max_iterations bound, either because the limit is not set or because calling code swallows the limit error with a broad except Exception and retries the run.
Tool error retry cascades: when a tool raises an exception, the agent receives the traceback as an Observation, considers it in the next Thought, and typically generates a new Action that calls the same or a related tool with slight variations. On external API tools, this can run 10–20 retry variants before the agent gives up or the max_iterations limit fires.
Model reload amplification: teams that instantiate the pipeline or AutoModelForCausalLM inside the agent entrypoint function, rather than as a module-level singleton, reload the model weights from disk into GPU memory on every agent invocation. On a 7B-parameter model, a cold load takes 8–15 seconds of GPU time before the first token generates. On Inference Endpoints billed per second, this is pure waste on every call.
Context overflow spirals: Transformers Agents builds the prompt by concatenating the system prompt, the full Thought-Action-Observation history, and the current step input. As the history grows, it eventually exceeds the model's max_length. When truncation is enabled, the prompt is cut from the left, which discards early context and can cause the agent to lose track of the original task or earlier successful tool results. The agent then re-attempts steps it already completed, accumulating more history that overflows again.

Billing model in detail

The HF Inference API serverless tier bills per token of input and output for supported models. The Transformers Agents prompt structure is token-heavy: the system prompt that describes all available tools can consume 500–2,000 tokens depending on how many tools are registered and how verbosely their docstrings document their arguments. Each Thought-Action-Observation step adds the model's reasoning (Thought), the tool call specification (Action), and the tool's return value (Observation) to the prompt for the next step. A 10-step reasoning chain on a well-documented agent with 5 tools can accumulate 8,000–15,000 input tokens by step 10 — each step's input token count includes all prior history.
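
To make the accumulation concrete, the following sketch models input tokens per step under the simplifying assumption that every step adds a fixed number of tokens to the history. The figures used here (a 1,500-token system prompt and 900 tokens per Thought-Action-Observation step) are illustrative assumptions, not measurements, and cumulative_input_tokens is a hypothetical helper name.

```python
def cumulative_input_tokens(system_tokens: int, tokens_per_step: int, steps: int) -> list[int]:
    """Input tokens sent on each step when the full history is re-sent every time.

    Step 1 sees only the system prompt; step k also sees the k-1 earlier steps.
    """
    return [system_tokens + tokens_per_step * prior for prior in range(steps)]


# Assumed figures: 1,500-token system prompt, 900 tokens added per step.
per_step = cumulative_input_tokens(system_tokens=1500, tokens_per_step=900, steps=10)
total_billed = sum(per_step)

print(per_step[-1])   # input tokens on step 10 -> 9600
print(total_billed)   # input tokens billed over the whole run -> 55500
```

Because each step re-sends everything before it, the total billed input grows roughly quadratically with the number of steps, even though the history itself grows linearly.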

Inference Endpoints billed per second make the model reload pattern especially expensive. When you provision a dedicated endpoint on a GPU instance (an A10G at ~$1.10/hour, for example), the endpoint bills continuously whether the model is processing tokens or idle. But model loading also blocks the endpoint: a 7B checkpoint loading from Hub storage takes 20–45 seconds on first instantiation. If your calling code instantiates the model inside a request handler rather than at startup, every request that hits a cold instance pays the full load time before generating a single token — and on Inference Endpoints, that load time is billed at the full GPU rate.

Local GPU infrastructure has the same model reload cost in GPU-hours rather than cloud dollars, but on a shared research cluster with per-GPU-hour billing or quota allocation, burning 15 seconds of A100 time per agent invocation for model loading adds up across a team running dozens of experiments concurrently.

Failure mode 1: Infinite iteration loop

The ReactAgent class has a max_iterations parameter (default: 6 in earlier versions of the module, absent in some builds). When the agent hits the limit, it raises AgentMaxIterationsError. The loop failure occurs in two common patterns:

First, many production wrappers around Transformers Agents use a broad exception handler that catches AgentMaxIterationsError alongside other transient errors and retries the entire agent run. A retry loop that catches AgentMaxIterationsError and re-invokes the agent will run 6 more iterations, hit the limit again, and retry again — indefinitely. The agent never converges; the calling code never exits the retry loop.

Second, custom agent subclasses that override run() to add logging or metrics instrumentation sometimes accidentally bypass the iteration check. When super().run() is called inside a while True wrapper in a monitoring decorator, the limit in the parent class fires but the outer loop catches the exception and re-enters. Neither pattern is obvious from a code review: the outer retry looks like standard resilience, but it transforms a finite iteration bound into an infinite one.
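
One way to keep resilience without creating an unbounded loop is to retry only errors known to be transient and to cap the number of attempts. The sketch below uses two hypothetical exception classes as stand-ins (in a real wrapper the non-convergence class would be the agent's own limit error, such as AgentMaxIterationsError), and run_with_bounded_retry is an illustrative name rather than part of any library.

```python
class TransientToolError(Exception):
    """Hypothetical stand-in for errors worth retrying (timeouts, 503s)."""


class IterationLimitReached(Exception):
    """Hypothetical stand-in for a non-convergence signal from the agent."""


def run_with_bounded_retry(run_fn, task, max_attempts=3):
    """Retry transient failures only, and never more than max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return run_fn(task)
        except IterationLimitReached:
            # Non-convergence is not transient: re-running repeats the same cost.
            raise
        except TransientToolError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The important property is that the limit error is listed before the retryable class and re-raised, so a finite iteration bound stays finite no matter how the surrounding code is layered.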

The signature: agent_step:<task_hash> appearing more than 8 times for the same task hash within a single run. As a working heuristic, an agent that needs more than 8 steps to answer a question on current open-weight models has usually stopped converging: additional steps tend to add confusion rather than improve the answer.

Python

import hashlib
from dataclasses import dataclass, field
from runguard import LoopDetector, LoopDetectedError

@dataclass
class IterationGuard:
    hard_step_limit: int = 8
    _step_counts: dict = field(default_factory=dict)
    _detector: LoopDetector = field(
        default_factory=lambda: LoopDetector(repeats=8, max_cycle_len=1)
    )

    def _task_hash(self, task: str) -> str:
        return hashlib.md5(task.encode()).hexdigest()[:8]

    def on_step(self, task: str, step_index: int) -> None:
        """Call before each agent step. Raises LoopDetectedError if limit exceeded."""
        key = self._task_hash(task)
        count = self._step_counts.get(key, 0) + 1
        self._step_counts[key] = count

        sig = f"agent_step:{key}"
        match = self._detector.push(sig)
        if match or count > self.hard_step_limit:
            raise LoopDetectedError(
                f"Agent iteration limit reached: {count} steps for task "
                f"'{task[:60]}...' — stopping to prevent further inference cost. "
                f"Model failed to converge. Consider simplifying the task or "
                f"increasing tool reliability."
            )

    def reset(self, task: str) -> None:
        key = self._task_hash(task)
        self._step_counts.pop(key, None)
        self._detector.reset()

# Wrap the agent's step execution:
iteration_guard = IterationGuard(hard_step_limit=8)

class GuardedReactAgent:
    def __init__(self, agent, guard: IterationGuard):
        self._agent = agent
        self._guard = guard

    def run(self, task: str, **kwargs):
        self._guard.reset(task)
        original_step = self._agent.step

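        def guarded_step(*args, **kwargs):
            # Forward arguments unchanged so the wrapper does not depend on
            # the exact signature of step() in the installed version.
            step_index = len(self._agent.logs)
            self._guard.on_step(task, step_index)
            return original_step(*args, **kwargs)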

        self._agent.step = guarded_step
        try:
            return self._agent.run(task, **kwargs)
        except LoopDetectedError as e:
            # Surface the guard error explicitly — do NOT retry
            raise RuntimeError(f"Agent stopped by RunGuard: {e}") from e
        finally:
            self._agent.step = original_step

The guard wraps the agent's step() method rather than catching the outer exception. This ensures the limit fires inside the agent's execution context, where it can be attributed to a specific task. It also raises LoopDetectedError rather than AgentMaxIterationsError, so retry handlers that name AgentMaxIterationsError explicitly will not catch it and retry. A broad except Exception will still catch it, so such handlers need to re-raise LoopDetectedError (or the RuntimeError wrapper) explicitly. The RuntimeError wrapper in the outer run() further distinguishes a guard trip from a transient tool failure that should be retried.
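
The guard above depends on LoopDetector from runguard, whose exact semantics are not spelled out in this post. For readers who want to follow the logic without that dependency, here is a minimal stand-in under one assumption: push() returns True once the same signature has been pushed `repeats` times in a row (the max_cycle_len=1 case). SimpleLoopDetector is a hypothetical name and may not match the real class's behavior.

```python
class SimpleLoopDetector:
    """Hypothetical stand-in: flags one signature repeated `repeats` times in a row."""

    def __init__(self, repeats: int):
        if repeats < 1:
            raise ValueError("repeats must be at least 1")
        self.repeats = repeats
        self._last = None
        self._run_length = 0

    def push(self, signature: str) -> bool:
        """Record a signature; return True when the current run reaches `repeats`."""
        if signature == self._last:
            self._run_length += 1
        else:
            # A different signature breaks the run and starts a new one.
            self._last = signature
            self._run_length = 1
        return self._run_length >= self.repeats

    def reset(self) -> None:
        self._last = None
        self._run_length = 0


detector = SimpleLoopDetector(repeats=3)
results = [detector.push("agent_step:ab12cd34") for _ in range(3)]
print(results)  # [False, False, True]
```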

Failure mode 2: Tool error retry cascade

Transformers Agents delivers tool exceptions back to the model as Observations: the tool's traceback string is formatted and appended to the prompt as "Observation: Error — <traceback>". This design is deliberate — the model can reason about errors and try to fix them. But it creates a retry amplification pattern on tools that fail deterministically (network endpoints that are down, authentication tokens that have expired, APIs that require inputs the model consistently provides incorrectly).

The model sees the error, reasons about it in the next Thought ("The API returned a 401 — I should try with a different header"), generates a new Action (the same tool call with a variation), receives another error Observation, and continues. Each iteration consumes the full input token count (system prompt + all prior Thought-Action-Observation history + the growing error log) for the next model call. After 6 iterations of a deterministic error, the input token count for step 7 may be 3–4× the input token count for step 1.

The cascade compounds when tools call downstream services that have their own retry logic. A tool that wraps an HTTP client with 3 retries and a 2-second backoff will take 6+ seconds per agent step. The agent retrying the tool 6 times × 3 HTTP retries per tool call = 18 actual HTTP requests to a down endpoint, plus the inference cost of 6 model forward passes each reading the accumulated error history.
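
The multiplication above can be written out as a small calculation. This is only a sketch using the figures from the paragraph (6 agent-level retries, 3 HTTP retries per tool call, 2-second backoff); cascade_cost is a hypothetical helper, and the blocked time is a lower bound that ignores request latency and model inference time.

```python
def cascade_cost(agent_retries: int, http_retries: int, backoff_seconds: float) -> dict:
    """Count the work generated when agent-level and HTTP-level retries stack."""
    return {
        # One model forward pass per agent step that re-reads the error history.
        "model_forward_passes": agent_retries,
        # Every agent step triggers the tool's full internal retry sequence.
        "http_requests": agent_retries * http_retries,
        # Time spent waiting on backoff alone, before latency or inference.
        "seconds_blocked_on_backoff": agent_retries * http_retries * backoff_seconds,
    }


cost = cascade_cost(agent_retries=6, http_retries=3, backoff_seconds=2.0)
print(cost)
```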

The signature: tool_error:<tool_name> appearing 3 or more times in sequence for the same tool. A genuine transient error produces 1–2 retries and then either succeeds or fails with a different error. Three consecutive identical errors from the same tool indicate a deterministic failure that more retries will not fix.

Python

from dataclasses import dataclass, field
from runguard import LoopDetector, LoopDetectedError
from transformers.agents import ReactAgent, Tool

@dataclass
class ToolErrorCascadeGuard:
    consecutive_error_limit: int = 3
    _detectors: dict = field(default_factory=dict)

    def _get_detector(self, tool_name: str) -> LoopDetector:
        if tool_name not in self._detectors:
            self._detectors[tool_name] = LoopDetector(
                repeats=self.consecutive_error_limit,
                max_cycle_len=1
            )
        return self._detectors[tool_name]

    def on_tool_error(self, tool_name: str, error: Exception) -> str:
        """
        Call when a tool raises an exception.
        Returns the error string for the agent's Observation if not tripping.
        Raises LoopDetectedError if the consecutive error limit is hit.
        """
        sig = f"tool_error:{tool_name}"
        detector = self._get_detector(tool_name)
        match = detector.push(sig)
        if match:
            raise LoopDetectedError(
                f"Tool error cascade: '{tool_name}' has failed {self.consecutive_error_limit} "
                f"consecutive times with {type(error).__name__}. "
                f"This tool is failing deterministically; stopping to prevent "
                f"further inference cost on a non-converging task. "
                f"Fix the tool or its inputs before retrying."
            )
        return f"Error calling {tool_name}: {type(error).__name__}: {error}"

    def on_tool_success(self, tool_name: str) -> None:
        if tool_name in self._detectors:
            self._detectors[tool_name].reset()

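def make_guarded_tool(tool: Tool, guard: "ToolErrorCascadeGuard") -> Tool:
    """Wrap a Tool to report errors through the cascade guard."""
    # Patch forward() rather than __call__: Python resolves tool(...) through
    # type(tool).__call__, so assigning __call__ on the instance has no effect.
    # This assumes Tool.__call__ delegates to self.forward.
    original_forward = tool.forward

    def guarded_forward(*args, **kwargs):
        try:
            result = original_forward(*args, **kwargs)
            guard.on_tool_success(tool.name)
            return result
        except LoopDetectedError:
            raise
        except Exception as e:
            # on_tool_error raises LoopDetectedError once the limit is hit
            observation = guard.on_tool_error(tool.name, e)
            return observation  # Return as Observation string to the model

    tool.forward = guarded_forward
    return tool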

# Apply the guard to all tools before passing them to the agent:
cascade_guard = ToolErrorCascadeGuard(consecutive_error_limit=3)
guarded_tools = [make_guarded_tool(t, cascade_guard) for t in my_tools]

agent = ReactAgent(tools=guarded_tools, llm_engine=my_engine)

The guard intercepts tool exceptions before they reach the agent's default error formatter. On consecutive failures below the limit, it returns the error string as an Observation, which is normal agent behavior. When the limit fires, it raises LoopDetectedError, which is intended to propagate out of the agent's run() call rather than become another Observation. This relies on the agent not catching the exception itself: if the agent's tool-execution path wraps tool errors in a broad except Exception, LoopDetectedError must be re-raised there (or derive from BaseException) to escape. Once it does propagate, the model cannot generate another Action against a tool that has already proved it will fail: the cascade stops, and the error is surfaced to the operator with context about which tool failed and how many times.

Failure mode 3: Model reload amplification

Transformers Agents wraps any callable that accepts a list of messages and returns a string as its llm_engine. Teams commonly implement this engine as a function that creates a transformers.pipeline object on every call. The pipeline object encapsulates model loading: when pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct") is called, it downloads (or reads from cache) and loads the model weights into GPU memory. The first call is unavoidably slow; subsequent calls on the same loaded instance are fast because the weights stay in VRAM.

The reload pattern appears in three common shapes: a lambda passed as the engine that recreates the pipeline on every invocation; a web request handler that instantiates the pipeline inside the request scope (so each HTTP request to an endpoint triggers a full load); and a CLI script that passes --model as a flag and creates the pipeline at the start of main() called by a wrapper script that invokes it once per task. In all three cases, the model is loaded, used for one agent run (6–10 model calls), and then unloaded when the process exits or the object goes out of scope.

Taking an illustrative A10G rate of $0.90/hour ($0.00025/second), a 7B model that takes 12 seconds to load costs ~$0.003 per load. At 100 agent invocations per day with 1 load each, that is $0.30/day or ~$110/year in pure load-time waste, before any inference tokens are counted. On A100 or H100 instances, the per-second rate is 3–5× higher and the waste scales accordingly.
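
The same arithmetic as a reusable calculation, so other hourly rates or load times can be plugged in. This is a sketch using the figures quoted above; load_waste is a hypothetical helper name, and the rate is an input you would replace with your provider's actual price.

```python
def load_waste(hourly_rate_usd: float, load_seconds: float, invocations_per_day: int):
    """Return (cost per load, cost per day, cost per year) of reloading the model."""
    per_second = hourly_rate_usd / 3600
    per_load = per_second * load_seconds
    per_day = per_load * invocations_per_day
    return per_load, per_day, per_day * 365


per_load, per_day, per_year = load_waste(
    hourly_rate_usd=0.90, load_seconds=12, invocations_per_day=100
)
print(f"{per_load:.4f} {per_day:.2f} {per_year:.2f}")  # 0.0030 0.30 109.50
```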

The signature: GPU memory allocation events at the start of each agent invocation — visible in nvidia-smi as VRAM jumping from near-zero to the model footprint at the start of each run, then dropping back after. A correctly warmed agent holds the model in VRAM continuously and shows stable VRAM allocation across invocations.

Python

from transformers import pipeline
from transformers.agents import ReactAgent, HfApiEngine
from runguard import BudgetGuard

# WRONG: reloads the model on every call
def bad_engine(messages, stop_sequences=None):
    pipe = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct",
                    device_map="auto")  # <-- loads from disk every call
    # With chat-style input, generated_text is the full message list;
    # the last entry is the new assistant turn.
    return pipe(messages)[0]["generated_text"][-1]["content"]

# RIGHT — load once at module level, reuse across all agent invocations:
_PIPELINE = None

def get_pipeline():
    global _PIPELINE
    if _PIPELINE is None:
        _PIPELINE = pipeline(
            "text-generation",
            model="Qwen/Qwen2.5-7B-Instruct",
            device_map="auto",
            torch_dtype="auto",
        )
    return _PIPELINE

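def warmed_engine(messages, stop_sequences=None):
    pipe = get_pipeline()
    generate_kwargs = {"max_new_tokens": 1024}
    if stop_sequences:
        # generate() expects stop_strings plus the tokenizer, not stop_sequences.
        # Verify this against the transformers version you have installed.
        generate_kwargs.update(stop_strings=stop_sequences, tokenizer=pipe.tokenizer)
    result = pipe(messages, **generate_kwargs)
    # With chat-style input, generated_text is the full message list;
    # the last entry is the new assistant turn.
    return result[0]["generated_text"][-1]["content"]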

# Add a token budget guard to the engine so per-invocation cost is bounded:
budget_guard = BudgetGuard(max_tokens_per_run=4096)

def budgeted_engine(messages, stop_sequences=None):
    budget_guard.check_and_record(messages)  # raises BudgetExceededError if over
    return warmed_engine(messages, stop_sequences)

# Use the warmed, budgeted engine for all agents in the process:
agent = ReactAgent(
    tools=my_tools,
    llm_engine=budgeted_engine,
    max_iterations=8,
)

The module-level singleton pattern (_PIPELINE starts as None and is populated on the first call) keeps the model resident in VRAM across all agent invocations in the same process. For deployments using Gunicorn or uvicorn with multiple worker processes, each worker maintains its own singleton; with 4 workers on a 4-GPU machine, each GPU hosts one warm model instance, provided each worker is pinned to its own device (for example through CUDA_VISIBLE_DEVICES), since device_map="auto" otherwise places the model on whatever GPUs the process can see. The BudgetGuard adds a per-invocation token ceiling: even if the agent's iteration guard and tool error guard both fail to trip, the engine itself refuses to process a prompt that would exceed the token budget for the run, surfacing the error before billing accumulates further.

Failure mode 4: Context overflow spiral

The Transformers Agents prompt is built by concatenating the system prompt (tool descriptions + instructions), all prior Thought-Action-Observation steps, and the current step prefix. As the reasoning chain grows, this concatenated prompt approaches the model's max_length — typically 2,048–8,192 tokens depending on the checkpoint. When the prompt exceeds the limit, one of two things happens depending on the pipeline configuration:

If truncation=True is set (often treated as the safe option because it avoids a hard failure), tokens are removed from the left of the prompt, so the earliest parts of the conversation history are discarded. This can silently remove the original task description, earlier successful tool results that later steps depend on, or the system prompt sections that describe how to format tool calls. A model that loses its tool-call format instructions may start generating free-form text that the agent framework cannot parse as an Action, producing an exception that adds another error Observation to the growing prompt. The result is a spiral in which overflow causes parse failures, which add tokens, which cause more overflow.

If truncation=False (the default in some configurations), the pipeline raises a ValueError when the input exceeds the model's maximum sequence length. The agent framework catches this, the entire run fails, and calling code that retries the agent starts a fresh run — rebuilding the full Thought-Action-Observation history from scratch, paying the full inference cost of all prior steps again before the context overflows again at the same point in the reasoning chain.

The signature: prompt token count growing linearly across steps, with the per-step growth exceeding (model max_length / max_iterations). If a 5-step agent on a 4,096-token model is already at 3,000 tokens after step 3, it is growing at roughly 1,000 tokens per step and will overflow before completing. The spiral indicator is a ValueError whose message begins "sequence length is longer than" appearing in logs, followed immediately by a fresh agent run starting from step 1 on the same task.
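
The growth-rate check can be sketched as a linear projection over the token counts seen so far. The function name projected_overflow_step and the per-step counts [1000, 2000, 3000] are assumptions chosen to match the 5-step, 4,096-token example; real histories rarely grow perfectly linearly, so treat the result as an early warning rather than a prediction.

```python
from typing import Optional


def projected_overflow_step(
    token_counts: list[int], max_length: int, max_iterations: int
) -> Optional[int]:
    """Return the first future step (1-indexed) projected to exceed max_length.

    Returns None if the run is projected to finish within the limit or if
    there are too few observations to estimate growth.
    """
    if len(token_counts) < 2:
        return None
    # Average tokens added per step across the observed history.
    growth = (token_counts[-1] - token_counts[0]) / (len(token_counts) - 1)
    current_step = len(token_counts)
    for step in range(current_step + 1, max_iterations + 1):
        projected = token_counts[-1] + growth * (step - current_step)
        if projected > max_length:
            return step
    return None


# Assumed history: 1,000 tokens added per step, 3,000 tokens after step 3.
print(projected_overflow_step([1000, 2000, 3000], max_length=4096, max_iterations=5))  # 5
```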

Python

from dataclasses import dataclass, field
from transformers import AutoTokenizer
from runguard import LoopDetector, LoopDetectedError

@dataclass
class ContextOverflowGuard:
    model_name: str
    max_context_tokens: int
    safety_margin: float = 0.85  # Trip at 85% of max to leave room for output
    _tokenizer: object = field(default=None, init=False)
    _step_token_counts: list = field(default_factory=list)
    _detector: LoopDetector = field(
        default_factory=lambda: LoopDetector(repeats=3, max_cycle_len=1)
    )

    def __post_init__(self):
        self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)

    @property
    def _trip_threshold(self) -> int:
        return int(self.max_context_tokens * self.safety_margin)

    def check_prompt(self, prompt: str, step_index: int) -> None:
        """
        Call with the full assembled prompt before each model forward pass.
        Raises LoopDetectedError if approaching context overflow.
        """
        token_count = len(self._tokenizer.encode(prompt))
        self._step_token_counts.append(token_count)

        if token_count >= self._trip_threshold:
            # At or above the soft threshold: count how many steps have been this close
            sig = "context_overflow:approaching_limit"
            match = self._detector.push(sig)
            if match or token_count >= self.max_context_tokens:
                raise LoopDetectedError(
                    f"Context overflow spiral: prompt at step {step_index} has "
                    f"{token_count} tokens (limit: {self.max_context_tokens}, "
                    f"trip threshold: {self._trip_threshold}). "
                    f"Token growth across steps: {self._step_token_counts}. "
                    f"Stopping before truncation discards task context. "
                    f"Consider summarizing intermediate results or splitting the task."
                )

    def reset(self) -> None:
        self._step_token_counts.clear()
        self._detector.reset()

# Wire the guard into the llm_engine wrapper:
overflow_guard = ContextOverflowGuard(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    max_context_tokens=8192,
    safety_margin=0.80
)

def overflow_protected_engine(messages, stop_sequences=None):
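    # Assemble the prompt string to check token count.
    # Use get_pipeline() so the tokenizer is available even before the first model call
    # (_PIPELINE is None until then).
    prompt = get_pipeline().tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )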
    step_index = len(messages) // 2  # Approximate step index from message count
    overflow_guard.check_prompt(prompt, step_index)
    return warmed_engine(messages, stop_sequences)

agent = ReactAgent(
    tools=my_tools,
    llm_engine=overflow_protected_engine,
    max_iterations=8,
)

The guard checks the token count before each model call, using the same tokenizer as the model so the count closely matches what the model will see. The safety margin (80% of context length in this configuration; the class default is 85%) leaves room for the model's output tokens and prevents the agent from entering the truncation regime where context loss causes cascading parse failures. Three steps at or above the threshold trip the LoopDetector, catching the spiral even when each individual prompt is still below the absolute limit, because a prompt that grows by 800 tokens per step will hit the limit in 1–2 more steps regardless.

When the guard trips, the recommended recovery strategy is task decomposition: split the original task into smaller subtasks that each complete in fewer steps with shorter history, then synthesize the results in a final step. This is more effective than increasing max_length because the root cause is task complexity that exceeds what the model can reason about in a single context window — a longer window delays but does not prevent overflow for genuinely complex tasks.
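
A minimal sketch of that decomposition strategy follows. It assumes run_fn is any callable that runs one agent invocation with a fresh history and returns a string (for example a thin wrapper around agent.run); run_decomposed and the prompt layout are hypothetical, and choosing the subtasks is left to the caller.

```python
from typing import Callable, Sequence


def run_decomposed(
    run_fn: Callable[[str], str], subtasks: Sequence[str], synthesis_prompt: str
) -> str:
    """Run each subtask as its own short agent run, then synthesize the results."""
    partial_results = []
    for subtask in subtasks:
        # Each call starts from an empty history, so no single prompt has to
        # carry the Thought-Action-Observation chain of the whole task.
        partial_results.append(run_fn(subtask))

    findings = "\n".join(
        f"- {subtask}: {result}" for subtask, result in zip(subtasks, partial_results)
    )
    # The final run sees only the condensed findings, not the intermediate steps.
    return run_fn(f"{synthesis_prompt}\n\nFindings:\n{findings}")
```

The trade-off is one extra model run for synthesis in exchange for several short histories instead of one long one.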

Combining the guards: Guarded agent factory

In production, all four guards compose cleanly because they operate at different layers: the iteration guard wraps the agent's step loop, the tool error guard wraps individual tool calls, the model reload guard is a deployment pattern rather than runtime code, and the context overflow guard wraps the engine callable. Assembling them into a guarded agent factory gives you a single safe entry point for all agent invocations.

Python

from transformers.agents import ReactAgent
from runguard import LoopDetector, LoopDetectedError, BudgetGuard

def create_guarded_agent(
    tools: list,
    model_name: str = "Qwen/Qwen2.5-7B-Instruct",
    max_iterations: int = 8,
    max_tokens_per_run: int = 4096,
    max_context_tokens: int = 8192,
) -> "GuardedReactAgent":
    # Guard 1: iteration limit (enforced in wrapper, not relying on agent's own limit)
    iteration_guard = IterationGuard(hard_step_limit=max_iterations)

    # Guard 2: tool error cascade
    cascade_guard = ToolErrorCascadeGuard(consecutive_error_limit=3)
    guarded_tools = [make_guarded_tool(t, cascade_guard) for t in tools]

    # Guard 3: model reload — use warmed singleton (defined at module level)
    # Guard 4: context overflow
    overflow_guard = ContextOverflowGuard(
        model_name=model_name,
        max_context_tokens=max_context_tokens,
        safety_margin=0.80,
    )

    # Budget guard: hard token ceiling per run
    budget_guard = BudgetGuard(max_tokens_per_run=max_tokens_per_run)

    def composite_engine(messages, stop_sequences=None):
        prompt = get_pipeline().tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        step_index = len(messages) // 2
        overflow_guard.check_prompt(prompt, step_index)
        budget_guard.check_and_record(messages)
        return warmed_engine(messages, stop_sequences)

    base_agent = ReactAgent(
        tools=guarded_tools,
        llm_engine=composite_engine,
        max_iterations=max_iterations + 2,  # Guard fires first; agent limit is backup
    )

    return GuardedReactAgent(agent=base_agent, guard=iteration_guard)

# Usage:
agent = create_guarded_agent(
    tools=[search_tool, calculator_tool, python_tool],
    model_name="Qwen/Qwen2.5-7B-Instruct",
    max_iterations=8,
    max_tokens_per_run=8192,
    max_context_tokens=8192,
)

result = agent.run("Summarize the top 3 papers on attention mechanisms from arXiv this week")
print(result)

Summary of failure modes and guards

Failure mode	Primary cost driver	Guard	Trip condition
Infinite iteration loop (max_iterations limit swallowed by retry handler)	Unbounded model forward passes on accumulating history	`IterationGuard`	More than 8 steps for the same task hash
Tool error retry cascade (deterministic tool failure repeated across steps)	Full context re-read on each retry; downstream HTTP retries	`ToolErrorCascadeGuard`	3 consecutive errors from the same tool
Model reload amplification (pipeline instantiated per call instead of at module level)	GPU-seconds of load time billed before first token	Module-level singleton + `BudgetGuard`	None (deployment pattern: load once, share across invocations)
Context overflow spiral (history fills context; truncation causes parse failures; full retry)	All prior step inference costs re-paid on restart	`ContextOverflowGuard`	Prompt at or above 80% of model max_length for 3 steps, or any prompt at the full limit

Frequently asked questions

How is this different from the smolagents cost control patterns?

The smolagents post covers the evolved library (now the recommended Hugging Face agent framework), which has a redesigned step execution model, a different tool interface, and the managed agents hierarchy. The transformers.agents module described here is the earlier implementation still in production use at many organizations — it has a different class hierarchy (ReactAgent, CodeAgent vs smolagents' ToolCallingAgent, CodeAgent), different error handling behavior on max_iterations, and a different context management approach. The core guard patterns are similar in concept, but the integration points — which methods to wrap, which exceptions to intercept — differ because the class structures differ.

Does the max_iterations parameter not already prevent infinite loops?

max_iterations signals the limit with a specific exception (AgentMaxIterationsError) that production wrappers commonly catch alongside transient network errors. The guard raises a different exception type (LoopDetectedError) that should not be in existing retry handlers' catch lists, and it fires inside the step execution rather than at the outer run boundary, giving you granular context about which step triggered it. One caveat: GuardedReactAgent.run() as written calls reset(task) at the start of every run, so a caller that restarts the whole run also restarts the guard's per-task counter. To catch retry-restarts as well, keep the counter across runs for the same task and reset it only after a successful result.

The model reload guard is a design pattern, not a runtime guard — how do I detect if I have the problem?

Run nvidia-smi dmon -s m -d 1 during an agent invocation and watch the fb (framebuffer memory used) column. If VRAM usage rises sharply at the start of each invocation and drops afterward, the model is being reloaded. A warmed deployment shows stable, high VRAM throughout. You can also add timing instrumentation to the llm_engine callable: if the first call in a run takes 10–30× longer than subsequent calls, you're loading the model on the first call. The fix is always the same: move the pipeline() or AutoModelForCausalLM.from_pretrained() call out of the per-invocation path and into module scope or application startup.
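
The timing instrumentation can be sketched as a thin wrapper around the engine callable. TimedEngine and looks_like_cold_load are hypothetical names; the wrapper assumes the engine takes (messages, stop_sequences) as in the examples above, and the 10× ratio is the lower end of the range quoted in the answer.

```python
import time


class TimedEngine:
    """Wrap an llm_engine callable and record how long each call takes."""

    def __init__(self, engine, clock=time.perf_counter):
        self._engine = engine
        self._clock = clock  # injectable so the wrapper can be tested without a model
        self.latencies = []

    def __call__(self, messages, stop_sequences=None):
        start = self._clock()
        try:
            return self._engine(messages, stop_sequences)
        finally:
            # Record latency even when the engine raises.
            self.latencies.append(self._clock() - start)

    def looks_like_cold_load(self, ratio: float = 10.0) -> bool:
        """True if the first call was at least `ratio` times the median later call."""
        if len(self.latencies) < 2:
            return False
        later = sorted(self.latencies[1:])
        median_later = later[len(later) // 2]
        return median_later > 0 and self.latencies[0] >= ratio * median_later
```

Create a new TimedEngine per agent run and check looks_like_cold_load() afterward; a positive result on every run, not just the first one in the process, points to per-invocation loading.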

Can the context overflow guard be used with the HF Inference API instead of local models?

Yes. The guard uses the tokenizer from the same model family as the endpoint, which is available locally via AutoTokenizer.from_pretrained(model_name) even when the model itself runs remotely. Tokenization runs on the CPU and is fast relative to a model call. The HF Inference API enforces its own context limits and will return a 422 error when the input exceeds the model's maximum, so the guard also prevents those API errors from being formatted as Observations and inflating the already-overflowing prompt further. Set the guard's max_context_tokens to the model's documented maximum (usually stated on the model card on the Hub).

Do these guards work with CodeAgent and ToolCallingAgent as well as ReactAgent?

The iteration guard, context overflow guard, and model reload pattern apply identically to CodeAgent and ToolCallingAgent, because they operate at the engine and prompt level, below the agent class differentiation. The tool error cascade guard applies to ReactAgent and ToolCallingAgent, which use discrete tool calls. CodeAgent executes Python code, and its cost amplification patterns are different. The equivalent guard for CodeAgent intercepts the code execution sandbox's exception output before it becomes an Observation, and trips on 3 consecutive code execution failures that produce the same exception type, which indicates the model is generating structurally incorrect code rather than code with a fixable runtime error.
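
A minimal sketch of that CodeAgent-style guard is shown below. RepeatedExceptionGuard is a hypothetical class, and how its on_failure and on_success hooks get wired into the code execution path depends on the agent version, so that integration is left out.

```python
class RepeatedExceptionGuard:
    """Trip when code execution fails `limit` times in a row with one exception type."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self._last_type = None
        self._count = 0

    def on_failure(self, exc: BaseException) -> None:
        exc_type = type(exc).__name__
        if exc_type == self._last_type:
            self._count += 1
        else:
            # A different exception type suggests the model changed its approach.
            self._last_type = exc_type
            self._count = 1
        if self._count >= self.limit:
            raise RuntimeError(
                f"Generated code failed {self._count} consecutive times with "
                f"{exc_type}; stopping instead of feeding another traceback to the model."
            )

    def on_success(self) -> None:
        self._last_type = None
        self._count = 0
```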

Stop paying for Transformers Agents loops before they compound

RunGuard wraps your ReactAgent and CodeAgent deployments with guards that trip on iteration loops, tool error cascades, and context spirals before a 6-step bounded run becomes an unbounded billing event. One import, four guards, full coverage across local GPU and HF Inference API deployments.
