June 15, 2026 Modal Labs Serverless Cost Control

Modal Labs Serverless AI Cost Control: Cold Start Storms, Autoscaling Spikes, and Retry Amplification

Modal Labs has become the default serverless GPU platform for AI teams in 2026. The pitch is compelling: decorate a Python function with @app.function(gpu="H100"), call it remotely from anywhere, and Modal handles container provisioning, GPU allocation, and horizontal scaling automatically. You pay per second of actual compute with no idle cost — far cheaper than reserving a dedicated GPU instance for workloads that run in bursts.

That serverless model introduces a distinct class of cost failure modes that don't exist on always-on servers. The same autoscaling machinery that saves money during normal operation can amplify costs dramatically when an agent's retry behavior, tool-calling pattern, or sub-agent spawn rate triggers the wrong scaling feedback loop. Modal's per-container billing unit and cold start mechanics mean that certain agent architectures — ones that look perfectly normal on a traditional server — generate unexpected bills on Modal's infrastructure.

Four failure modes specific to Modal Labs deployments:

Cold start overhead amplification — When a burst of agent sub-calls arrives simultaneously and no warm containers are available, Modal provisions many containers in parallel. Each container pays a cold start penalty (model weight loading, dependency import time) before processing a single token. An agent that spawns 20 parallel sub-calls when all containers are cold pays 20× the cold start overhead, even if each actual inference takes only seconds.
Autoscaling concurrency storm from retry loops — Modal's autoscaler counts queued requests and provisions new containers to drain the queue. A retry loop at the agent level — one that re-enqueues the same request after a failure — can cause the queue to grow faster than containers can drain it, triggering continuous horizontal scaling. The caller believes it's sending one request; Modal sees a queue that never empties and keeps scaling up.
Cross-layer retry multiplication — Modal Functions accept a retries parameter that automatically re-runs failed functions. When the agent framework calling Modal (LangChain, LlamaIndex, a custom loop) also has its own retry policy, the two layers compose multiplicatively. A function with retries=3 called from a LangChain tool with max_retries=3 generates up to 16 invocations — each billed as a separate GPU container run — from a single logical tool call.
Minimum container lifetime billing on high-frequency short calls — Modal bills at a per-second granularity with a minimum billing interval per container. Sub-agent tasks that complete in under one second — fast embedding calls, short classification runs, token-count checks — still incur the minimum per-container charge. An orchestrating agent that fires 100 such calls in rapid succession pays for 100 minimum-interval container runs, not the few seconds of actual compute they collectively represent.

Failure mode 1: Cold start overhead amplification

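Modal containers are provisioned on demand and stay alive for a configurable idle period after their last invocation; the keep_warm parameter separately sets a floor of containers that are always kept running. (Newer Modal releases rename these settings, for example min_containers in place of keep_warm, so check the parameter names against the SDK version you have installed.) When no warm containers exist (after a quiet period, on the first call of the day, or when a burst exceeds the current warm pool), Modal starts new containers from scratch. For AI workloads this means loading model weights into GPU memory, importing large Python dependencies (PyTorch, Transformers, vLLM), and running any startup initialization.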

On a typical H100 container, this cold start takes 15–60 seconds depending on model size. The cost during that window is real: you're paying for a reserved H100 GPU while it boots. For a single sequential request this is annoying but manageable. For an agent that spawns parallel sub-calls — a CrewAI crew with 8 parallel agents, a LangGraph fan-out node, or a batch embedding run — all 8 containers may cold-start simultaneously if the warm pool is empty. The cold start cost multiplies by the fan-out factor.
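
The multiplication is easy to put numbers on. The sketch below is a back-of-the-envelope estimate rather than Modal's billing formula: cold_start_overhead is a hypothetical helper, and it assumes every container outside the warm pool boots cold and bills for its full boot time.

```python
def cold_start_overhead(
    fan_out: int,
    cold_start_seconds: float,
    warm_containers: int = 0,
) -> float:
    """GPU-seconds billed for booting, before any useful work happens.

    Assumes (hypothetically) one container per call and that each call
    beyond the warm pool triggers its own cold start.
    """
    cold_containers = max(fan_out - warm_containers, 0)
    return cold_containers * cold_start_seconds


# 8 parallel sub-calls, empty warm pool, 30 s boot each
print(cold_start_overhead(8, 30.0))                     # 240.0
# Same fan-out with 2 warm containers absorbing 2 of the calls
print(cold_start_overhead(8, 30.0, warm_containers=2))  # 180.0
```

Multiply the result by your per-second GPU price to turn it into a dollar figure.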

The guard strategy is to track concurrent cold starts and throttle new invocations when the pool is clearly all-cold:

Python — cold start tracker with fan-out throttle

import modal
import time
import threading
from collections import deque

class ColdStartTracker:
    """
    Detects simultaneous cold starts across parallel Modal invocations.
    When many containers boot at once, throttle new spawns to avoid
    paying cold start overhead on work that could be queued instead.
    """

    def __init__(
        self,
        cold_start_threshold_seconds: float = 8.0,
        concurrent_cold_limit: int = 6,
        measurement_window: float = 30.0,
    ):
        self.threshold = cold_start_threshold_seconds
        self.max_concurrent_cold = concurrent_cold_limit
        self.window = measurement_window
        self._starts: deque[float] = deque()
        self._lock = threading.Lock()

    def record_invocation_start(self) -> float:
        """Call immediately before .remote() / .spawn(). Returns start time."""
        return time.monotonic()

    def record_invocation_end(self, start_time: float, was_cold: bool = False) -> None:
        """
        Call after .remote() returns. Pass was_cold=True if the response
        included a cold start indicator (e.g. first call, or latency > threshold).
        """
        elapsed = time.monotonic() - start_time
        # infer cold start from latency if caller doesn't know explicitly
        if elapsed >= self.threshold or was_cold:
            with self._lock:
                now = time.monotonic()
                self._starts.append(now)

    def check_fan_out_allowed(self, proposed_count: int) -> None:
        """
        Call before spawning proposed_count parallel Modal calls.
        Raises if the recent cold start rate suggests all-cold conditions.
        """
        with self._lock:
            now = time.monotonic()
            cutoff = now - self.window
            while self._starts and self._starts[0] < cutoff:
                self._starts.popleft()
            recent_cold_starts = len(self._starts)

        if recent_cold_starts >= self.max_concurrent_cold:
            raise RuntimeError(
                f"[ColdStartTracker] {recent_cold_starts} cold starts detected "
                f"in the last {self.window:.0f}s — container pool appears cold. "
                f"Refusing fan-out of {proposed_count} parallel calls. "
                f"Batch sequentially or wait {self.window:.0f}s for pool to warm."
            )


tracker = ColdStartTracker(
    cold_start_threshold_seconds=8.0,
    concurrent_cold_limit=6,
    measurement_window=30.0,
)

def invoke_modal_with_tracking(fn, *args, **kwargs):
    """Wrapper that records invocation timing and infers cold starts."""
    t0 = tracker.record_invocation_start()
    try:
        result = fn.remote(*args, **kwargs)
        # record_invocation_end infers a cold start by comparing latency
        # against tracker.threshold, so no hardcoded cutoff is needed here
        tracker.record_invocation_end(t0)
        return result
    except Exception:
        tracker.record_invocation_end(t0, was_cold=False)
        raise

def parallel_agent_calls(tasks: list, modal_fn) -> list:
    """Fan-out across Modal with cold-start gate."""
    tracker.check_fan_out_allowed(len(tasks))
    # proceed with parallel spawn
    handles = [modal_fn.spawn(*task) for task in tasks]
    return [h.get() for h in handles]

Preventing cold starts in the first place: Use keep_warm=N on your Modal function to maintain a floor of N warm containers. The cost of keeping 2 containers warm 24/7 on an H100 is predictable, but it is not small: the floor bills for 172,800 GPU-seconds per day whether or not traffic arrives, so it pays off only when cold start storms are frequent or when their latency is unacceptable. For agent workloads with bursty patterns, a reasonable starting point is keep_warm=2 and concurrency_limit=20, so the warm pool absorbs normal traffic, autoscaling still handles genuine bursts, and the container count stays capped.
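
Whether a warm floor pays for itself depends on how often storms actually happen. A rough way to check is to compare both sides in GPU-seconds. The helpers below are hypothetical, and the comparison deliberately ignores that warm containers also serve real traffic (which makes the floor look more expensive than it is in practice).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def warm_floor_gpu_seconds(keep_warm: int) -> int:
    """GPU-seconds per day billed by an always-on floor of keep_warm containers."""
    return keep_warm * SECONDS_PER_DAY


def storms_to_break_even(keep_warm: int, fan_out: int, cold_start_seconds: float) -> float:
    """Cold start storms per day at which the warm floor costs the same as the storms.

    Assumes each storm boots fan_out containers for cold_start_seconds each.
    """
    per_storm = fan_out * cold_start_seconds
    return warm_floor_gpu_seconds(keep_warm) / per_storm


print(warm_floor_gpu_seconds(2))           # 172800
print(storms_to_break_even(2, 20, 60.0))   # 144.0
```

Under these assumptions, two warm containers only beat on-demand provisioning on raw cost if 20-way, 60-second cold start storms occur more than 144 times a day. If storms are rarer than that, the case for keep_warm rests on latency, not on the bill.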

Failure mode 2: Autoscaling concurrency storm from retry loops

Modal's autoscaler is queue-depth-driven. When queued requests exceed current container capacity, Modal provisions new containers to drain the backlog. This works correctly when the queue grows because of genuine traffic spikes. It misbehaves when the queue grows because a caller is retrying the same failed request in a tight loop.

The mechanism is subtle. Consider an agent that calls inference_fn.remote(prompt) and retries immediately on any exception. If inference_fn fails with a transient error — a GPU OOM, a timeout, a model loading race condition — the agent re-enqueues the request before the previous container has finished cleaning up. From Modal's autoscaler's perspective, queue depth just increased. It responds by provisioning another container. That container may hit the same OOM. Another retry. Another container provisioned. The autoscaler is doing exactly what it's designed to do; the caller is inadvertently driving it into an amplification loop.
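
To see how much queue pressure a tight loop generates compared with a backed-off one, the following sketch counts the attempts a single caller starts inside a fixed window when every attempt fails. It is a simplified model with hypothetical names: it ignores Modal's own scheduling and assumes each failure takes the same time to surface.

```python
def attempts_in_window(
    window_seconds: float,
    failure_latency: float,
    min_backoff: float = 0.0,
    multiplier: float = 1.0,
    max_backoff: float = float("inf"),
) -> int:
    """Count attempts started within the window if every attempt fails."""
    elapsed = 0.0
    backoff = min_backoff
    attempts = 0
    while elapsed < window_seconds:
        attempts += 1
        # time for the failure to surface, plus the (capped) wait before retrying
        elapsed += failure_latency + min(backoff, max_backoff)
        backoff *= multiplier
    return attempts


# Immediate retry, each failure surfaces after 2 s
print(attempts_in_window(60.0, 2.0))                                   # 30
# Exponential backoff: 2 s, 4 s, 8 s, ... capped at 30 s
print(attempts_in_window(60.0, 2.0, min_backoff=2.0, multiplier=2.0,
                         max_backoff=30.0))                            # 5
```

One caller goes from 30 enqueued requests a minute to 5 with backoff alone, before any cap on total attempts is applied.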

The fix is to enforce minimum retry intervals at the caller and track total retry attempts per logical task:

Python — Modal retry loop breaker

import modal
import time
import asyncio
import threading
from dataclasses import dataclass, field

@dataclass
class ModalRetryPolicy:
    max_attempts: int = 3
    min_backoff_seconds: float = 2.0
    backoff_multiplier: float = 2.0
    max_backoff_seconds: float = 30.0
    # trip when total attempts (first tries included) across all tasks
    # reach this count within the window
    storm_threshold: int = 15
    storm_window_seconds: float = 60.0

class ModalRetryBreaker:
    """
    Wraps Modal .remote() calls with backoff and a storm detector.
    The storm detector trips when total retry volume exceeds a threshold,
    indicating an autoscaling cascade rather than isolated transient errors.
    """

    def __init__(self, policy: ModalRetryPolicy | None = None):
        self.policy = policy or ModalRetryPolicy()
        self._attempt_times: list[float] = []
        # A threading.Lock is safe for both the sync and async paths because
        # the critical section below never blocks or awaits
        self._lock = threading.Lock()

    def _record_attempt(self) -> None:
        now = time.monotonic()
        cutoff = now - self.policy.storm_window_seconds
        with self._lock:
            self._attempt_times = [t for t in self._attempt_times if t >= cutoff]
            self._attempt_times.append(now)
            recent = len(self._attempt_times)
        if recent >= self.policy.storm_threshold:
            raise RuntimeError(
                f"[ModalRetryBreaker] {recent} Modal invocation "
                f"attempts in {self.policy.storm_window_seconds:.0f}s: retry storm "
                f"detected. Autoscaler may be cascading. Halting all retries."
            )

    def call(self, modal_fn, *args, **kwargs):
        """Synchronous wrapper with exponential backoff and storm detection."""
        last_exc: Exception | None = None
        backoff = self.policy.min_backoff_seconds

        for attempt in range(1, self.policy.max_attempts + 1):
            self._record_attempt()
            try:
                return modal_fn.remote(*args, **kwargs)
            except Exception as exc:
                last_exc = exc
                if attempt < self.policy.max_attempts:
                    sleep_time = min(backoff, self.policy.max_backoff_seconds)
                    print(
                        f"[ModalRetryBreaker] attempt {attempt}/{self.policy.max_attempts} "
                        f"failed: {exc}. Retrying in {sleep_time:.1f}s."
                    )
                    time.sleep(sleep_time)
                    backoff *= self.policy.backoff_multiplier

        raise RuntimeError(
            f"[ModalRetryBreaker] all {self.policy.max_attempts} attempts failed. "
            f"Last error: {last_exc}"
        ) from last_exc

    async def call_async(self, modal_fn, *args, **kwargs):
        """Async wrapper for use inside asyncio event loops."""
        last_exc: Exception | None = None
        backoff = self.policy.min_backoff_seconds

        for attempt in range(1, self.policy.max_attempts + 1):
            self._record_attempt()
            try:
                return await modal_fn.remote.aio(*args, **kwargs)
            except Exception as exc:
                last_exc = exc
                if attempt < self.policy.max_attempts:
                    sleep_time = min(backoff, self.policy.max_backoff_seconds)
                    await asyncio.sleep(sleep_time)
                    backoff *= self.policy.backoff_multiplier

        raise RuntimeError(
            f"[ModalRetryBreaker] async: all {self.policy.max_attempts} attempts failed. "
            f"Last error: {last_exc}"
        ) from last_exc


breaker = ModalRetryBreaker(ModalRetryPolicy(
    max_attempts=3,
    min_backoff_seconds=2.0,
    storm_threshold=15,
    storm_window_seconds=60.0,
))

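The storm_threshold is the key parameter. The breaker counts every invocation attempt, first tries included, so at 15 attempts per 60 seconds with max_attempts=3 it takes five tasks each exhausting all three attempts inside the same window to trip it on failures alone. Size the threshold above your normal call volume: a caller that legitimately makes 15 or more Modal calls a minute will trip this default without a single failure. When the autoscaler is cascading, attempt volume climbs well past the normal baseline within seconds, which is exactly the signal you want.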
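
Because the breaker above records every attempt, including successful first tries, the threshold has to sit above normal traffic. One way to pick a value is sketched below. Both helpers are hypothetical, and the sizing rule (baseline volume plus the extra attempts from a tolerated number of fully failing tasks) is an assumption, not a Modal recommendation.

```python
import math


def tasks_needed_to_trip(storm_threshold: int, max_attempts: int) -> int:
    """Fully failing tasks needed in one window to reach the threshold on their own."""
    return math.ceil(storm_threshold / max_attempts)


def suggest_storm_threshold(
    baseline_calls_per_window: int,
    max_attempts: int,
    failing_tasks_tolerated: int = 5,
) -> int:
    """Smallest threshold that normal traffic plus the tolerated failures will not reach.

    Each fully failing task adds (max_attempts - 1) attempts on top of its first try,
    which is already part of the baseline. The breaker trips at >= threshold,
    hence the + 1.
    """
    extra_attempts = failing_tasks_tolerated * (max_attempts - 1)
    return baseline_calls_per_window + extra_attempts + 1


print(tasks_needed_to_trip(15, 3))         # 5
print(suggest_storm_threshold(40, 3, 5))   # 51
```

For a caller that normally makes 40 Modal calls a minute, a threshold of 51 tolerates five tasks failing all three attempts and trips on anything beyond that.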

Failure mode 3: Cross-layer retry multiplication

Modal Functions accept a retries parameter in the @app.function decorator. When set, Modal automatically re-runs the function up to that many times on failure before surfacing an exception to the caller. This is useful for recovering from transient infrastructure errors without changing caller code.

The problem arises when the caller also has a retry policy. AI frameworks almost universally implement retries: LangChain's tool executor retries on ToolException, LlamaIndex's retry mechanisms on query failures, custom agent loops that catch exceptions and re-invoke the same tool. When both Modal's internal retries and the framework's retries are active, they multiply:

Python — cross-layer retry math

import modal
from langchain.tools import tool
from tenacity import retry, stop_after_attempt, wait_exponential

app = modal.App("retry-math-example")  # app name is illustrative

# Modal function defined with retries=3
@app.function(gpu="A10G", retries=3)
def classify_document(text: str) -> dict:
    # ... calls a GPU model
    pass

# LangChain tool calling it with its own retry logic
@tool
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def classify_tool(text: str) -> str:
    """Classify a document with the GPU model running on Modal."""
    return str(classify_document.remote(text))

# When classify_document fails on ALL Modal retries:
# - Modal: attempt 1 (fail) → retry 1 (fail) → retry 2 (fail) → retry 3 (fail) → raises
# - tenacity: sees exception, retries classify_tool
#   → Modal: attempt 1 (fail) → retry 1 (fail) → retry 2 (fail) → retry 3 (fail) → raises
#   → tenacity: retries again
#   → Modal: attempt 1 (fail) → retry 1 (fail) → retry 2 (fail) → retry 3 (fail) → raises
#
# Total invocations: (1 + 3 retries) × (1 + 2 tenacity-retries) = 4 × 3 = 12
# Each invocation is a billable GPU container run.
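
The same arithmetic generalizes to any pair of settings. The hypothetical helper below computes the worst case, and it also shows why the total depends on whether the framework's setting counts attempts or retries: tenacity's stop_after_attempt(3) means three attempts in total, while a setting read as "3 retries after the first try" means four.

```python
def worst_case_invocations(modal_retries: int, framework_attempts: int) -> int:
    """Billable invocations from one logical call when every attempt fails.

    modal_retries: value of retries= on the Modal function (retries after the first run)
    framework_attempts: total attempts the caller makes, first try included
    """
    return (1 + modal_retries) * framework_attempts


# tenacity stop_after_attempt(3): 3 attempts in total
print(worst_case_invocations(3, 3))       # 12
# a framework setting of 3 retries after the first try: 4 attempts in total
print(worst_case_invocations(3, 1 + 3))   # 16
# one layer disabled: no multiplication
print(worst_case_invocations(0, 3))       # 3
```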

The fix is to keep exactly one retry layer active. If the Modal function handles infrastructure retries (connection failures, GPU OOM during load), disable framework-level retries for that tool. If you want the framework to handle high-level retry logic (different prompt strategies, fallback models), set retries=0 on the Modal function and let the framework control the retry decision:

Python — single retry layer enforcement

import modal
from langchain.tools import tool

# Pattern A: Modal retries only (infrastructure errors)
# Use when: Modal errors are transient infra issues you want to auto-heal.
# Framework sees success or final failure, no double retry.
@app.function(gpu="A10G", retries=2)
def classify_document_modal_retries(text: str) -> dict:
    return run_model(text)

@tool
def classify_tool_no_framework_retry(text: str) -> str:
    """Classify a document on Modal; retries are handled by Modal, not here."""
    # No framework retry here: Modal handles it
    return str(classify_document_modal_retries.remote(text))


# Pattern B: Framework retries only (business logic errors)
# Use when: you want to vary the prompt or model on retry.
@app.function(gpu="A10G", retries=0)  # no Modal retries
def classify_document_no_modal_retries(text: str) -> dict:
    return run_model(text)

class ClassifyWithFallback:
    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts

    def run(self, text: str, context: str = "") -> dict:
        prompt = text
        last_exc: Exception | None = None
        # Bounded loop: every attempt, including retries, counts against max_attempts
        for _ in range(self.max_attempts):
            try:
                return classify_document_no_modal_retries.remote(prompt)
            except Exception as exc:
                last_exc = exc
                # Enrich the prompt before the next attempt
                prompt = f"{text}\n\nContext: {context}"
        raise RuntimeError(f"Failed after {self.max_attempts} attempts") from last_exc


class RetryLayerAuditor:
    """
    Audits Modal function definitions at startup to detect functions
    where both Modal retries and known framework retry decorators are active.
    """

    FRAMEWORK_RETRY_MARKERS = [
        "retry",          # tenacity
        "max_retries",    # LangChain / OpenAI SDK
        "retries",        # httpx, requests adapters
        "backoff",        # backoff library
    ]

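    def audit(self, fn) -> list[str]:
        warnings = []
        # NOTE: "_modal_options" is a placeholder, not a documented Modal attribute.
        # Point this at wherever your codebase records each function's retry count.
        options = getattr(fn, "_modal_options", None)
        if isinstance(options, dict):
            modal_retries = options.get("retries", 0)
        else:
            modal_retries = getattr(options, "retries", 0)
        if modal_retries and modal_retries > 0:
            name = getattr(fn, "__name__", repr(fn))
            # Walk the wrapper chain starting with fn itself: decorators such
            # as tenacity attach their attributes to the outermost wrapper
            current = fn
            while current is not None:
                attrs = getattr(current, "__dict__", None) or {}
                for marker in self.FRAMEWORK_RETRY_MARKERS:
                    if marker in attrs:
                        warnings.append(
                            f"[RetryLayerAuditor] {name}: Modal retries={modal_retries} "
                            f"AND framework retry marker '{marker}' detected. "
                            f"Set Modal retries=0 or remove framework retry."
                        )
                current = getattr(current, "__wrapped__", None)
        return warnings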

Deciding which layer owns retries: Modal's built-in retries are best for transient infrastructure failures (GPU OOM on load, container start race conditions) that the application layer should never see. Framework retries are best for business logic failures (bad model output, tool errors that need a different approach). The two should rarely both be active for the same function — if you're uncertain which layer applies, default to framework retries with retries=0 on Modal so your agent's retry logic has full visibility into each attempt.

Failure mode 4: Minimum container lifetime billing on high-frequency short calls

Modal's billing model charges per second of container compute, but with a minimum charge per container invocation. A function that takes 200ms of actual compute still incurs the minimum billing interval. For most workloads — where each function call runs for several seconds — this is negligible. For AI agent architectures that spawn many rapid, lightweight sub-calls, the minimum billing granularity can make a workload cost dramatically more than the actual compute time suggests.

The pattern appears most often in orchestrating agents that use Modal for fast utility calls: embedding a document fragment (50ms), checking token count before sending to an LLM (30ms), classifying a short string (80ms), or running a fast regex extraction with GPU acceleration (40ms). An agent that does 200 of these per run is paying for 200 minimum-interval billing units, not the 200 × 80ms = 16 seconds of actual compute they represent.

The fix is batching. Instead of calling the Modal function once per item, accumulate items and call once per batch. The same single container processes all items in one invocation, paying one minimum billing interval instead of N:
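
The saving can be estimated before touching any code. The sketch below is illustrative only: the 1-second minimum billing interval is an assumed placeholder (check your own invoices for the real figure), both function names are hypothetical, and it assumes a batch takes as long as the sum of its items, which is pessimistic for GPU workloads.

```python
def billed_seconds(call_durations: list[float], min_billing_seconds: float) -> float:
    """Seconds billed when every item is its own invocation."""
    return sum(max(d, min_billing_seconds) for d in call_durations)


def batched_billed_seconds(
    call_durations: list[float],
    batch_size: int,
    min_billing_seconds: float,
) -> float:
    """Seconds billed when items are grouped into invocations of batch_size."""
    total = 0.0
    for start in range(0, len(call_durations), batch_size):
        batch = call_durations[start : start + batch_size]
        total += max(sum(batch), min_billing_seconds)
    return total


durations = [0.08] * 200  # 200 calls of 80 ms each
unbatched = billed_seconds(durations, min_billing_seconds=1.0)
batched = batched_billed_seconds(durations, batch_size=64, min_billing_seconds=1.0)
print(f"unbatched: {unbatched:.1f}s, batched: {batched:.1f}s")
```

With these assumed numbers the unbatched run bills 200 seconds and the batched run about 16.4 seconds, close to the 16 seconds of actual compute.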

Python — call batcher for short Modal functions

import modal
import time
import threading
from typing import TypeVar, Generic, Callable
from dataclasses import dataclass, field

T = TypeVar("T")
R = TypeVar("R")

@dataclass
class BatchConfig:
    max_batch_size: int = 32
    max_wait_seconds: float = 0.5
    # warn if average per-call latency suggests < this seconds of actual compute
    min_expected_compute_seconds: float = 1.0

class ModalBatcher(Generic[T, R]):
    """
    Accumulates calls to a Modal function and dispatches them as a batch.
    Designed for short-running Modal functions where per-invocation overhead
    (minimum billing interval, cold start amortization) dominates actual compute.
    """

    def __init__(
        self,
        modal_batch_fn: "modal.Function",
        config: BatchConfig | None = None,
    ):
        self.fn = modal_batch_fn
        self.config = config or BatchConfig()
        self._queue: list[tuple[T, threading.Event, list]] = []
        self._lock = threading.Lock()
        self._dispatch_thread: threading.Thread | None = None
        self._stats = {"total_items": 0, "total_batches": 0, "total_invocations": 0}

    def call(self, item: T) -> R:
        """Add item to batch, block until result is available."""
        result_holder: list = []
        done = threading.Event()

        with self._lock:
            self._queue.append((item, done, result_holder))
            is_full = len(self._queue) >= self.config.max_batch_size
            is_first = len(self._queue) == 1

        if is_full:
            # batch is full: dispatch immediately
            self._maybe_dispatch()
        elif is_first:
            # first item of a new batch: give other callers up to
            # max_wait_seconds to join, then dispatch whatever has accumulated
            # (skipped if another thread already dispatched this item)
            if not done.wait(timeout=self.config.max_wait_seconds):
                self._maybe_dispatch()

        done.wait(timeout=60.0)
        if not result_holder:
            raise RuntimeError("[ModalBatcher] timed out waiting for batch result")
        if isinstance(result_holder[0], Exception):
            raise result_holder[0]
        return result_holder[0]

    def _maybe_dispatch(self) -> None:
        with self._lock:
            if not self._queue:
                return
            # take up to max_batch_size items off the front of the queue
            batch = self._queue[: self.config.max_batch_size]
            self._queue = self._queue[self.config.max_batch_size :]

        t0 = time.monotonic()
        items = [b[0] for b in batch]
        try:
            results = self.fn.remote(items)  # batch call
            elapsed = time.monotonic() - t0
            per_item = elapsed / len(items) if items else 0
            if per_item < self.config.min_expected_compute_seconds:
                print(
                    f"[ModalBatcher] avg {per_item:.3f}s per item "
                    f"(below {self.config.min_expected_compute_seconds}s threshold). "
                    f"Consider larger batches or consolidating sub-agent calls."
                )
            for (item, done, holder), result in zip(batch, results):
                holder.append(result)
                done.set()
        except Exception as exc:
            for (item, done, holder) in batch:
                holder.append(exc)
                done.set()

        self._stats["total_items"] += len(batch)
        self._stats["total_batches"] += 1
        self._stats["total_invocations"] += 1  # one Modal call for the whole batch


# Example: batch embedding Modal function
@app.function(gpu="T4")
def embed_batch(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts).tolist()

batcher = ModalBatcher(embed_batch, BatchConfig(max_batch_size=64, max_wait_seconds=0.2))

# Caller code unchanged — batcher accumulates and dispatches
def embed_text(text: str) -> list[float]:
    return batcher.call(text)

The key requirement is that the Modal function accepts a list and returns a list. This is the natural shape for embedding, classification, and other embarrassingly parallel AI operations. If your current Modal function is written to process one item at a time, refactoring it to accept list[T] and return list[R] is usually a small change with significant cost impact at high call rates.
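
As a minimal sketch of that refactor, the two hypothetical helpers below split a work list into batches and lift a single-item function into list-in, list-out form. Looping inside the batch function is the simplest possible adapter; for a GPU model you would normally pass the whole list to the model in one call instead, as embed_batch does.

```python
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: list[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def to_batch_fn(single_fn: Callable[[T], R]) -> Callable[[list[T]], list[R]]:
    """Wrap a one-item function so it accepts a list and returns a list."""
    def batch_fn(items: list[T]) -> list[R]:
        return [single_fn(item) for item in items]
    return batch_fn


def token_count(text: str) -> int:
    """Stand-in for a short per-item task (whitespace token count)."""
    return len(text.split())


count_batch = to_batch_fn(token_count)
print(count_batch(["a b", "c"]))                             # [2, 1]
print([len(c) for c in chunked(list(range(200)), 64)])       # [64, 64, 64, 8]
```

The second line of output is the 200-call case from earlier: at a batch size of 64 it becomes four invocations.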

Combining all four guards

In production, all four failure modes can interact. An agent that hits a GPU OOM failure (failure mode 3 trigger) may retry rapidly (failure mode 2), spawning parallel containers that all cold-start (failure mode 1), each of which processes tiny tasks inefficiently (failure mode 4). A composite policy applies each guard in the right order:

Python — ModalCostPolicy composite

from dataclasses import dataclass

@dataclass
class ModalCostPolicy:
    # Failure mode 1: cold start tracking
    cold_start_threshold_seconds: float = 8.0
    concurrent_cold_limit: int = 6
    cold_start_window_seconds: float = 30.0

    # Failure mode 2: retry storm detection
    max_retry_attempts: int = 3
    min_retry_backoff_seconds: float = 2.0
    storm_threshold: int = 15
    storm_window_seconds: float = 60.0

    # Failure mode 3: cross-layer retry audit
    enforce_single_retry_layer: bool = True

    # Failure mode 4: batching
    min_batch_size_for_short_calls: int = 8
    short_call_threshold_seconds: float = 1.0


class ModalCostPolicyEnforcer:
    def __init__(self, policy: ModalCostPolicy | None = None):
        p = policy or ModalCostPolicy()
        self.cold_tracker = ColdStartTracker(
            cold_start_threshold_seconds=p.cold_start_threshold_seconds,
            concurrent_cold_limit=p.concurrent_cold_limit,
            measurement_window=p.cold_start_window_seconds,
        )
        self.retry_breaker = ModalRetryBreaker(ModalRetryPolicy(
            max_attempts=p.max_retry_attempts,
            min_backoff_seconds=p.min_retry_backoff_seconds,
            storm_threshold=p.storm_threshold,
            storm_window_seconds=p.storm_window_seconds,
        ))
        self.auditor = RetryLayerAuditor()
        self.policy = p

    def guarded_call(self, modal_fn, *args, **kwargs):
        """Single entry point for a guarded Modal function call."""
        return self.retry_breaker.call(modal_fn, *args, **kwargs)

    def guarded_fan_out(self, modal_fn, tasks: list, **kwargs) -> list:
        """Fan-out with cold start gating."""
        self.cold_tracker.check_fan_out_allowed(len(tasks))
        handles = []
        for task in tasks:
            t0 = self.cold_tracker.record_invocation_start()
            h = modal_fn.spawn(*task, **kwargs)
            handles.append((h, t0))
        results = []
        for h, t0 in handles:
            result = h.get()
            self.cold_tracker.record_invocation_end(t0)
            results.append(result)
        return results

    def audit_function(self, modal_fn) -> None:
        """Run at startup to flag cross-layer retry configurations."""
        if self.policy.enforce_single_retry_layer:
            warnings = self.auditor.audit(modal_fn)
            for w in warnings:
                print(w)


enforcer = ModalCostPolicyEnforcer(ModalCostPolicy(
    concurrent_cold_limit=4,
    storm_threshold=12,
    enforce_single_retry_layer=True,
))

Unguarded vs guarded: the difference in practice

Failure mode	Unguarded behavior	Guarded behavior
Cold start amplification	20 parallel sub-agent calls → 20 simultaneous cold starts → 20× GPU reservation overhead during boot	Fan-out gated when cold pool detected; excess calls queued behind warm containers
Autoscaling concurrency storm	Retry loop drives queue depth up → autoscaler provisions 10–30+ containers that each hit the same failure	Storm threshold trips after 15 attempts/60s; breaker halts all retries until reset
Cross-layer retry multiplication	`retries=3` + framework `max_retries=3` → up to 16 billed invocations per logical tool call	Startup audit flags mixed configurations; single retry layer eliminates multiplication
Minimum billing granularity	200 × 80ms calls → 200 minimum-interval billing events at full per-invocation overhead	Batcher accumulates into 3–4 batches of 50–64 items → 3–4 billing events total

Frequently asked questions

Does Modal's built-in keep_warm eliminate cold start problems entirely?

keep_warm=N prevents cold starts up to N concurrent requests. If your agent occasionally fans out to more than N parallel calls, containers beyond the warm pool still cold-start. The ColdStartTracker catches the cases where keep_warm is either not set or insufficient for actual burst patterns. A good operational discipline is to monitor actual cold starts per hour from your Modal dashboard, then set keep_warm to the 95th-percentile concurrent call count during peak usage.
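
If you export concurrency samples from your own monitoring, the percentile itself is a few lines. The sketch below uses the nearest-rank method with integer arithmetic; the sample values are made up for illustration and the function name is hypothetical.

```python
def percentile_nearest_rank(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # ceil(pct * n / 100) using integers only, to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Concurrent in-flight Modal calls, sampled during peak hours (illustrative data)
peak_concurrency = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 12, 30]

print(percentile_nearest_rank(peak_concurrency, 95))  # 12
print(percentile_nearest_rank(peak_concurrency, 50))  # 4
```

Note how the single spike to 30 does not drive the answer: sizing to the 95th percentile covers routine peaks and leaves rare outliers to the autoscaler.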

Can I use Modal's @app.function(retries=N) and still apply the ModalRetryBreaker?

You can, but the storm detection accounting becomes complicated — the breaker counts invocations at the caller layer, which means it sees one call per logical request, while Modal's internal retries are invisible to the caller until they're all exhausted. If you want the ModalRetryBreaker's storm detector to be accurate, set retries=0 on the Modal function and let the breaker control all retry behavior. That gives you clean accounting: one breaker-recorded attempt per invocation of the underlying container.

What's the right batch size for the ModalBatcher?

The optimal batch size depends on your model's memory requirements and the minimum billing interval. Start with 32 and watch the per-item latency in your logs — if average latency per item is below 1 second, increase the batch size. If you're hitting GPU OOM errors, reduce it. For embedding models on T4 hardware, 64 items of typical document-fragment length is a common sweet spot. For larger generation models, 4–8 items may be the practical ceiling before memory pressure causes failures.

How do these guards interact with Modal's allow_concurrent_inputs parameter?

allow_concurrent_inputs=N lets a single Modal container handle N requests in parallel, which can drastically reduce cold start costs for I/O-bound workloads. If your function uses allow_concurrent_inputs, the ColdStartTracker's fan-out limit should account for it: with allow_concurrent_inputs=8, a fan-out of 24 calls only requires 3 containers rather than 24, so you can set concurrent_cold_limit to 3 (containers) rather than 24 (calls). The key is tracking container provisioning events rather than raw call counts.
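
The conversion from calls to containers is a ceiling division. The helper below is hypothetical and assumes the autoscaler packs each container up to its concurrent input limit, which real scheduling may not always achieve.

```python
import math


def containers_needed(calls: int, concurrent_inputs: int = 1) -> int:
    """Containers required to serve `calls` parallel inputs at full packing."""
    if concurrent_inputs < 1:
        raise ValueError("concurrent_inputs must be at least 1")
    return math.ceil(calls / concurrent_inputs)


print(containers_needed(24, concurrent_inputs=8))  # 3
print(containers_needed(24))                       # 24
print(containers_needed(25, concurrent_inputs=8))  # 4
```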

Should these guards live in the Modal function itself or in the caller?

All four guards belong in the caller, not the Modal function. The Modal function is the unit of compute — it should do exactly one thing and do it correctly. Cost policy logic (retry limits, cold start detection, batching) is an orchestration concern that lives in the agent or the application layer calling Modal. Keeping them in the caller also means you can update retry thresholds, batch sizes, and storm detection windows without redeploying your Modal functions.

RunGuard catches these loops in production

The patterns above — cold start amplification, retry cascades, cross-layer multiplication — are exactly the failure modes RunGuard's circuit breaker was built to detect at runtime. Add one line to your agent and get automatic loop detection, budget enforcement, and Slack alerts before Modal's bill reflects what just happened.
