AI agent sandbox escape prevention: catching systematic probe patterns before they succeed

When an AI agent has access to file system, network, or subprocess tools, it can attempt to probe or escape its sandbox, not through a single forbidden call but through a sequence of systematic, incrementing tool calls that map the host environment. This is distinct from a jailbreak (prompt injection that changes the model’s behavior) because sandbox escape is behavioral: the agent repeatedly calls tools in patterns that probe boundaries. A sequence like read_file("/etc/passwd"), then read_file("/etc/shadow"), then read_file("/root/.ssh/id_rsa") doesn’t require any single call to be “forbidden”; the pattern of systematic probing is the threat. RunGuard’s LoopDetector treats the call signature stream as a behavioral signal and fires a circuit breaker when a probe pattern emerges, before the sequence completes or causes damage.

The three sandbox escape patterns agents attempt

Why traditional sandboxing isn’t enough

RunGuard behavioral loop detection for sandbox security

RunGuard’s LoopDetector analyzes the stream of tool call signatures for repeating cycles. The key insight for sandbox escape detection is that the sig_fn (the function that maps a tool call to its signature) should encode the category of the argument, not the exact value. Two calls to read_file with paths /etc/passwd and /etc/shadow are different calls — but they share a category (reading files under /etc/) and together form a probe cycle.
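To make the difference between exact-value and category-level signatures concrete, here is a minimal sketch. The helpers exact_sig and category_sig are hypothetical names used only for illustration and are not part of RunGuard; posixpath is used so the result does not depend on the host operating system.

```python
import posixpath

def exact_sig(tool_name: str, args: dict) -> str:
    """Signature that encodes the exact argument value."""
    return f"{tool_name}:{args.get('path', '')}"

def category_sig(tool_name: str, args: dict) -> str:
    """Signature that encodes only the parent directory of the path."""
    parent = posixpath.dirname(str(args.get("path", ""))) or "/"
    return f"{tool_name}:{parent}"

calls = [
    ("read_file", {"path": "/etc/passwd"}),
    ("read_file", {"path": "/etc/shadow"}),
    ("read_file", {"path": "/root/.ssh/id_rsa"}),
]

exact = [exact_sig(name, args) for name, args in calls]
category = [category_sig(name, args) for name, args in calls]

print(exact)     # three distinct signatures, so no repetition is visible
print(category)  # the two /etc reads collapse into one shared signature
```

Note that grouping by parent directory puts the third call in a different category (/root/.ssh). If probes that hop between directories matter for your threat model, a coarser grouping, such as the top-level directory, may be a better fit.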

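A custom sig_fn for sandbox escape detection groups calls by category: file reads by parent directory, HTTP requests by hostname, and shell commands by executable: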

import os
from urllib.parse import urlparse
from runguard import LoopDetector, LoopDetectedError

def security_sig_fn(tool_name: str, args: dict) -> str:
    """Map tool calls to category-level signatures for probe pattern detection."""
    if tool_name in ("read_file", "open_file"):
        path = args.get("path", args.get("filename", ""))
        # Group by parent directory, not exact path
        parent = os.path.dirname(str(path)) or "/"
        return f"read_file:{parent}"

    if tool_name == "list_dir":
        path = args.get("path", args.get("directory", ""))
        return f"list_dir:{str(path)}"

    if tool_name in ("http_get", "fetch_url", "requests_get"):
        url = args.get("url", args.get("uri", ""))
        try:
            hostname = urlparse(str(url)).hostname or "unknown"
        except Exception:
            hostname = "unknown"
        return f"http_get:{hostname}"

    if tool_name in ("run_bash", "shell_exec", "subprocess_run"):
        cmd = str(args.get("command", args.get("cmd", "")))
        # Group by first token (the executable)
        first_token = cmd.strip().split()[0] if cmd.strip() else "shell"
        return f"shell:{first_token}"

    # Default: tool name only, ignore args
    return tool_name


# Security-sensitive: fire once a cycle of up to 2 signatures repeats twice
detector = LoopDetector(
    sig_fn=security_sig_fn,
    repeats=2,
    max_cycle_len=2,
)

With this configuration, two read_file calls under the same parent directory will fire the circuit breaker — even if the exact paths differ. The detector sees the signature read_file:/etc appear twice in a cycle of length 1, which satisfies repeats=2.
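The firing condition described above can be sketched in a few lines. This is an illustrative stand-in, not RunGuard’s actual implementation: find_tail_cycle is a hypothetical function, and it assumes the detector only inspects the most recent signatures in the stream.

```python
from typing import List, Optional, Tuple

def find_tail_cycle(
    sigs: List[str], repeats: int, max_cycle_len: int
) -> Optional[Tuple[str, ...]]:
    """Return the shortest cycle that the end of the stream repeats
    `repeats` times in a row, or None if there is no such cycle."""
    for cycle_len in range(1, max_cycle_len + 1):
        window = cycle_len * repeats
        if len(sigs) < window:
            break  # longer cycles need even more history
        tail = sigs[-window:]
        cycle = tail[:cycle_len]
        if all(tail[i] == cycle[i % cycle_len] for i in range(window)):
            return tuple(cycle)
    return None

# Two reads under /etc: a cycle of length 1 repeated twice
same_dir = ["read_file:/etc", "read_file:/etc"]
print(find_tail_cycle(same_dir, repeats=2, max_cycle_len=2))

# A list-then-read pair repeated twice: a cycle of length 2
alternating = ["list_dir:/etc", "read_file:/etc"] * 2
print(find_tail_cycle(alternating, repeats=2, max_cycle_len=2))

# Two unrelated calls: no cycle
print(find_tail_cycle(["list_dir:/tmp", "read_file:/etc"], repeats=2, max_cycle_len=2))
```

Checking only the tail keeps the cost of each check small and means that a pattern is reported at the moment it completes rather than after the fact.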

Full Python example: file-access agent with behavioral detection

import os
import asyncio
from urllib.parse import urlparse
from runguard import guard_async, LoopDetectedError, BudgetExceededError

# --- Signature function (from above) ---

def security_sig_fn(tool_name: str, args: dict) -> str:
    if tool_name in ("read_file", "open_file"):
        path = args.get("path", args.get("filename", ""))
        parent = os.path.dirname(str(path)) or "/"
        return f"read_file:{parent}"
    if tool_name == "list_dir":
        path = args.get("path", args.get("directory", ""))
        return f"list_dir:{str(path)}"
    if tool_name in ("http_get", "fetch_url"):
        url = args.get("url", args.get("uri", ""))
        try:
            hostname = urlparse(str(url)).hostname or "unknown"
        except Exception:
            hostname = "unknown"
        return f"http_get:{hostname}"
    if tool_name in ("run_bash", "shell_exec"):
        cmd = str(args.get("command", args.get("cmd", "")))
        first_token = cmd.strip().split()[0] if cmd.strip() else "shell"
        return f"shell:{first_token}"
    return tool_name


# --- Tool dispatcher (your actual implementation) ---

async def dispatch_tool(tool_name: str, args: dict) -> str:
    """Execute one tool call and return the result as a string."""
    # Replace with your real tool implementations.
    if tool_name == "read_file":
        path = args.get("path", "")
        with open(path) as f:
            return f.read()
    if tool_name == "list_dir":
        path = args.get("path", ".")
        return "\n".join(os.listdir(path))
    if tool_name == "http_get":
        import urllib.request
        with urllib.request.urlopen(args["url"], timeout=5) as r:
            return r.read().decode()[:4096]
    if tool_name == "run_bash":
        import subprocess
        result = subprocess.run(
            args.get("command", ""),
            shell=True, capture_output=True, text=True, timeout=10
        )
        return result.stdout + result.stderr
    raise ValueError(f"Unknown tool: {tool_name}")


# --- Guarded agent entry point ---

import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def _call_model(messages: list) -> dict:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        # Pass tools=[...] with your tool schemas here; without it the
        # model never returns tool_calls and the loop ends after one turn.
    )
    choice = response.choices[0]
    usage = response.usage
    # Cost estimate using assumed rates: $2.50 per 1M prompt tokens, $10.00 per 1M completion tokens
    usd = ((usage.prompt_tokens * 2.5 + usage.completion_tokens * 10.0) / 1_000_000) if usage else 0.0
    tool_calls = getattr(choice.message, "tool_calls", None) or []
    # Compute a category-level signature for each tool call in this turn
    sigs = [
        security_sig_fn(tc.function.name, json.loads(tc.function.arguments or "{}"))
        for tc in tool_calls
    ]
    # Only the first tool call's signature is reported for this turn
    sig = sigs[0] if sigs else "end_turn"
    return {"choice": choice, "usd": usd, "sig": sig}


guarded_model = guard_async(
    _call_model,
    # Limit blast radius: abort after $1.00 for any single session
    budget={"max_usd": 1.0},
    # Behavioral loop detection: fire once a cycle of up to 3 signatures repeats 3 times
    loop={
        "sig_fn": security_sig_fn,
        "repeats": 3,
        "max_cycle_len": 3,
    },
)


async def run_file_agent(system_prompt: str, user_query: str) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]

    while True:
        try:
            result = await guarded_model(messages)
        except LoopDetectedError as e:
            return (
                f"[SECURITY] Probe pattern detected — circuit breaker fired. "
                f"Pattern: {e.pattern!r}. Session aborted."
            )
        except BudgetExceededError as e:
            return (
                f"[SECURITY] Budget cap reached (${e.spent:.4f} of $1.00). "
                f"Session aborted."
            )

        choice = result["choice"]
        messages.append(choice.message)

        tool_calls = getattr(choice.message, "tool_calls", None) or []
        if not tool_calls:
            return choice.message.content or ""

        import json
        for tc in tool_calls:
            tool_name = tc.function.name
            args = json.loads(tc.function.arguments or "{}")
            try:
                tool_result = await dispatch_tool(tool_name, args)
            except PermissionError as exc:
                tool_result = f"Error: permission denied — {exc}"
            except Exception as exc:
                tool_result = f"Error: {exc}"
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": tool_result,
            })


if __name__ == "__main__":
    result = asyncio.run(run_file_agent(
        system_prompt="You are a helpful file assistant. Only access files in /home/user/project.",
        user_query="Show me the contents of the project README.",
    ))
    print(result)
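
The cost arithmetic in _call_model and the $1.00 cap can be checked in isolation. The sketch below reuses the per-token rates hard-coded in the example above (treat them as assumptions and substitute your provider’s current pricing). SimpleBudget and BudgetExceeded are hypothetical stand-ins written for this illustration; they are not RunGuard’s BudgetTracker or BudgetExceededError.

```python
PROMPT_USD_PER_MTOK = 2.5       # assumed rate, copied from the example above
COMPLETION_USD_PER_MTOK = 10.0  # assumed rate, copied from the example above

def estimate_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one model call in US dollars."""
    return (
        prompt_tokens * PROMPT_USD_PER_MTOK
        + completion_tokens * COMPLETION_USD_PER_MTOK
    ) / 1_000_000

class BudgetExceeded(Exception):
    """Raised when cumulative spend passes the cap (hypothetical)."""

class SimpleBudget:
    """Accumulate spend and raise once it exceeds max_usd (hypothetical)."""

    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        self.spent += usd
        if self.spent > self.max_usd:
            raise BudgetExceeded(f"spent ${self.spent:.4f} of ${self.max_usd:.2f}")

per_call = estimate_usd(prompt_tokens=1000, completion_tokens=500)
budget = SimpleBudget(max_usd=0.02)
calls_made = 0
try:
    while True:
        budget.charge(per_call)
        calls_made += 1
except BudgetExceeded as exc:
    print(f"Stopped after {calls_made} calls: {exc}")
```

Because a probing agent usually resends a growing message history on every turn, per-call cost tends to rise over a session, so a cumulative cap bounds how long a probe can run even if no loop is detected.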

TypeScript example: same pattern with @runguard/sdk

import { guardAsync, LoopDetectedError, BudgetExceededError } from "@runguard/sdk";
import OpenAI from "openai";
import * as path from "path";
import { URL } from "url";

const client = new OpenAI();

// --- Security-focused signature function ---

function securitySigFn(toolName: string, args: Record<string, unknown>): string {
  if (toolName === "read_file" || toolName === "open_file") {
    const filePath = String(args.path ?? args.filename ?? "");
    const parent = path.dirname(filePath) || "/";
    return `read_file:${parent}`;
  }
  if (toolName === "list_dir") {
    return `list_dir:${String(args.path ?? args.directory ?? "")}`;
  }
  if (toolName === "http_get" || toolName === "fetch_url") {
    const rawUrl = String(args.url ?? args.uri ?? "");
    try {
      const { hostname } = new URL(rawUrl);
      return `http_get:${hostname}`;
    } catch {
      return `http_get:unknown`;
    }
  }
  if (toolName === "run_bash" || toolName === "shell_exec") {
    const cmd = String(args.command ?? args.cmd ?? "").trim();
    // "".split(/\s+/) yields [""], so use || (not ??) to fall back on an empty command
    const firstToken = cmd.split(/\s+/)[0] || "shell";
    return `shell:${firstToken}`;
  }
  return toolName;
}


// --- Guarded model call ---

async function callModel(messages: OpenAI.ChatCompletionMessageParam[]) {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages,
    // Pass `tools` with your tool schemas here; without it the model never returns tool_calls.
  });
  const choice = response.choices[0];
  const usage = response.usage;
  const usd = usage
    ? (usage.prompt_tokens * 2.5 + usage.completion_tokens * 10.0) / 1_000_000
    : 0;
  const toolCalls = choice.message.tool_calls ?? [];
  const sig =
    toolCalls.length > 0
      ? securitySigFn(toolCalls[0].function.name, JSON.parse(toolCalls[0].function.arguments || "{}"))
      : "end_turn";
  return { choice, usd, sig };
}

const guardedModel = guardAsync(callModel, {
  budget: { maxUsd: 1.0 },
  loop: {
    sigFn: securitySigFn,
    repeats: 3,
    maxCycleLen: 3,
  },
});


// --- Agent run loop ---

async function runFileAgent(systemPrompt: string, userQuery: string): Promise<string> {
  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: userQuery },
  ];

  while (true) {
    let result: Awaited<ReturnType<typeof guardedModel>>;
    try {
      result = await guardedModel(messages);
    } catch (err) {
      if (err instanceof LoopDetectedError) {
        return `[SECURITY] Probe pattern detected — circuit breaker fired. Pattern: ${err.pattern}. Session aborted.`;
      }
      if (err instanceof BudgetExceededError) {
        return `[SECURITY] Budget cap reached ($${err.spent.toFixed(4)} of $1.00). Session aborted.`;
      }
      throw err;
    }

    const { choice } = result;
    messages.push(choice.message as OpenAI.ChatCompletionMessageParam);

    const toolCalls = choice.message.tool_calls ?? [];
    if (toolCalls.length === 0) {
      return choice.message.content ?? "";
    }

    for (const tc of toolCalls) {
      const toolName = tc.function.name;
      const args = JSON.parse(tc.function.arguments || "{}");
      let toolResult: string;
      try {
        toolResult = await dispatchTool(toolName, args);
      } catch (err) {
        toolResult = `Error: ${(err as Error).message}`;
      }
      messages.push({
        role: "tool",
        tool_call_id: tc.id,
        content: toolResult,
      });
    }
  }
}

async function dispatchTool(toolName: string, args: Record<string, unknown>): Promise<string> {
  // Replace with real implementations
  throw new Error(`Tool not implemented: ${toolName}`);
}

// Entry point
runFileAgent(
  "You are a helpful file assistant. Only access files in /home/user/project.",
  "Show me the contents of the project README."
).then(console.log);

Defense-in-depth: RunGuard as the behavioral layer

RunGuard’s LoopDetector is one layer in a defense stack. No single control is sufficient; each layer covers gaps that the others leave open.
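As an illustration of a complementary layer, the sketch below shows a per-call path allowlist check that could run inside a tool dispatcher before any file is opened. It is a minimal example under stated assumptions: is_within_root is a hypothetical helper, and the root /home/user/project is taken from the example system prompt used in this article.

```python
import os

def is_within_root(path: str, root: str) -> bool:
    """Return True if `path` resolves to a location inside `root`.

    realpath collapses ".." segments and resolves symlinks, so a path
    such as root + "/../../etc/passwd" is judged by where it really points.
    """
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(path)
    try:
        return os.path.commonpath([real_root, real_path]) == real_root
    except ValueError:
        # Raised on Windows when the two paths are on different drives
        return False

ROOT = "/home/user/project"
print(is_within_root("/home/user/project/README.md", ROOT))
print(is_within_root("/home/user/project/../../../etc/passwd", ROOT))
print(is_within_root("/home/user/project-secrets/key", ROOT))
```

A check like this denies individual out-of-bounds calls, but it does not notice that an agent keeps trying. That repeated behavior is what the behavioral layer is meant to catch.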

Tuning the loop detector for different security threat models

The LoopDetector parameters determine how aggressively the circuit breaker fires. The two primary parameters are repeats (how many times the same cycle must repeat before firing) and max_cycle_len (the maximum cycle length to detect). Security contexts call for more aggressive settings than reliability contexts: for example, repeats=2 for security-sensitive agents versus repeats=3 for general use.
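To get a feel for how these two parameters interact, the sketch below replays one signature stream against several settings and reports the call at which a simple tail-based check would trip. The functions tail_repeats and calls_until_trip are hypothetical and only approximate the behavior described here; they are not RunGuard’s implementation.

```python
from typing import List, Optional

def tail_repeats(sigs: List[str], repeats: int, max_cycle_len: int) -> bool:
    """True if the stream ends with some cycle (length <= max_cycle_len)
    repeated `repeats` times in a row."""
    for cycle_len in range(1, max_cycle_len + 1):
        window = cycle_len * repeats
        if len(sigs) >= window:
            tail = sigs[-window:]
            if all(tail[i] == tail[i % cycle_len] for i in range(window)):
                return True
    return False

def calls_until_trip(
    stream: List[str], repeats: int, max_cycle_len: int
) -> Optional[int]:
    """Return the 1-based index of the call that trips the check, or None."""
    seen: List[str] = []
    for count, sig in enumerate(stream, start=1):
        seen.append(sig)
        if tail_repeats(seen, repeats, max_cycle_len):
            return count
    return None

# An agent alternating between listing and reading under /etc
stream = ["list_dir:/etc", "read_file:/etc"] * 4

for repeats, max_cycle_len in [(2, 2), (3, 3), (2, 1)]:
    tripped_at = calls_until_trip(stream, repeats, max_cycle_len)
    print(f"repeats={repeats}, max_cycle_len={max_cycle_len}: {tripped_at}")
```

In this sketch the stricter setting trips two calls earlier than the general-purpose one, while max_cycle_len=1 never trips because the alternating pattern has no back-to-back identical signatures. A max_cycle_len that is too small can therefore miss multi-step probes entirely.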

No detection vs. allowlist-only vs. RunGuard behavioral circuit breaker

| Capability | No detection | Allowlist only | RunGuard behavioral circuit breaker |
| --- | --- | --- | --- |
| Probe pattern detection | None: agent can complete full traversal | Partial: blocks specific blocked paths, not the probing behavior | Yes: detects category-level repeating probe cycles in real time |
| Cost limiting | None: probe session runs until manual intervention | None: cost control is out of scope for allowlists | Yes: BudgetTracker cap_usd=1.0 halts the session automatically |
| False positive rate | Zero (no detection) | Low for path-exact blocks, zero for pattern-level behavior | Tunable: repeats=2 for security, repeats=3 for general use |
| Real-time alert | None | Per-call allow/deny log only | Yes: LoopDetectedError raised before the next probe call is dispatched |
| Audit log | None unless you add it manually | Allow/deny log for each call | Full signature stream + cycle pattern captured in the exception payload |

Add behavioral sandbox escape detection to your agent today

RunGuard’s Python SDK installs with pip install runguard. The TypeScript SDK installs with npm install @runguard/sdk. Wire the LoopDetector with a custom sig_fn, add a BudgetTracker(cap_usd=1.0) for blast-radius limiting, and catch LoopDetectedError in your agent’s run loop. Both SDKs work with any agent framework — LangChain, LangGraph, OpenAI Agents SDK, AutoGen, PydanticAI, CrewAI, and plain tool-loop code.

RunGuard pricing: Solo $19/mo · Team $79/mo · 14-day free trial, no credit card required. Get started at runguard.dev.