Beam.cloud Serverless AI Cost Control: Cold Start Storms, Task Queue Explosions, Retry Container Billing, and Autoscale Thrash
Beam.cloud is a serverless GPU compute platform where AI workloads run inside managed containers that start on demand and bill by the second from boot until shutdown. The per-second billing model makes Beam attractive for bursty AI workloads — you only pay when your agent is actually running. But that same model creates a category of cost failure modes that simply do not exist on always-on servers: billing accumulates during container startup before a single inference token is generated, autoscaling spawns multiple containers simultaneously when load spikes, and a retry loop that looks cheap on a dedicated server becomes an expanding fleet of parallel containers each billing concurrently.
Four structural failure modes account for the majority of unexpected Beam billing in AI agent deployments. All four share the same root cause: an agent's request pattern — parallel task dispatch, recursive subtask spawning, timeout-triggered retry, or burst-gap invocation cadence — interacts with Beam's container lifecycle in a way that multiplies the number of simultaneously running (and simultaneously billing) containers without a proportional increase in useful work.
- Cold start amplification storms: an agent dispatches N subtasks in parallel; each subtask lands on a cold container and pays 30–90 seconds of GPU billing before inference begins. At $0.0006/second, ten parallel tasks on an A10G accumulate $0.18–$0.54 in pure startup overhead before a single model forward pass runs.
- Task queue depth explosions: an agent's tasks recursively enqueue more tasks; queue depth grows faster than containers drain it; Beam's QueueDepthAutoscaler responds by launching all replicas simultaneously, each cold-starting in parallel. The queue never drains because each drained task adds more entries.
- Endpoint retry container storms: the agent's HTTP client retries a Beam endpoint call after a timeout; the original container is still running (processing a long generation), so a second container starts for the retry. Both containers bill in parallel for the same logical task until the original's timeout fires.
- Autoscale thrash: an agent invokes Beam in periodic bursts separated by gaps longer than keep_warm_seconds; containers scale to zero in the gap and must cold-start at every burst. An agent that runs 10 bursts per hour with a 60-second cold start pays 10 × 60 × (burst container count) seconds of cold start billing per hour that could be eliminated with the right keep_warm_seconds setting.
Billing model in detail
Beam bills per second of container runtime at GPU-type-specific rates. An A10G container bills from the moment the container process starts — before your Python code runs, before your model loads, before your agent receives its first task. Cold start time (downloading and decompressing the container image, initializing the Python environment, importing packages, loading model weights) all bill at the same per-second rate as active inference time. For containers that run a 7B-parameter model, package import and model loading alone can consume 45–90 seconds before the container accepts its first task.
The implication: the ratio of cold start time to active processing time determines how much of your Beam bill reflects useful work. An agent that dispatches tasks in 10-task parallel batches with cold containers pays 10 × cold_start_seconds × gpu_rate before any task result returns. A 60-second cold start at $0.0006/second on an A10G costs $0.036 per container — times 10 parallel containers, that's $0.36 in startup overhead for a single batch. If the actual inference per task takes 5 seconds ($0.003/task), the cold start overhead is 12× the inference cost.
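The arithmetic above can be reproduced with a small helper. This is a minimal sketch: the function name is hypothetical, and the $0.0006/second A10G rate is the figure used in this article, so check current pricing before relying on it.

```python
A10G_RATE_PER_SECOND = 0.0006  # USD; the rate assumed throughout this article


def cold_start_overhead(n_containers, cold_start_seconds, inference_seconds,
                        rate=A10G_RATE_PER_SECOND):
    """Return (startup_cost, inference_cost, ratio) for one parallel batch."""
    startup_cost = n_containers * cold_start_seconds * rate
    inference_cost = n_containers * inference_seconds * rate
    return startup_cost, inference_cost, startup_cost / inference_cost


startup, inference, ratio = cold_start_overhead(
    n_containers=10, cold_start_seconds=60, inference_seconds=5
)
print(f"startup ${startup:.2f}, inference ${inference:.2f}, ratio {ratio:.0f}x")
```

Plugging in your own measured cold start and inference times shows quickly whether a workload is dominated by startup billing.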
Beam's QueueDepthAutoscaler is the mechanism that converts queue depth into container count. It monitors the ratio of pending tasks to active containers and scales up when pending tasks per container exceed a configurable threshold (tasks_per_replica). In an agent whose tasks recursively enqueue more tasks, the autoscaler sees ever-growing queue depth and launches containers up to max_replicas. Every launched container starts its cold start billing clock immediately. If max_replicas=8 and each container cold-starts for 60 seconds, a queue explosion that takes 2 minutes to stabilize may generate 8 × 60 = 480 container-seconds of cold start billing across the autoscaling event — $0.29 on an A10G — before a single task in the explosion actually completes.
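To make the scaling behaviour concrete, here is a simplified model of a queue-depth scaling rule. It is an assumption for illustration only, not Beam's actual implementation: it targets one replica per tasks_per_replica pending tasks, capped at max_replicas, and prices the resulting cold starts at the article's $0.0006/second A10G rate.

```python
import math


def desired_replicas(pending_tasks, tasks_per_replica, max_replicas):
    """Simplified rule: enough replicas to cover the backlog, capped at max_replicas."""
    if pending_tasks <= 0:
        return 0
    return min(max_replicas, math.ceil(pending_tasks / tasks_per_replica))


def cold_start_bill(replicas, cold_start_seconds, rate=0.0006):
    """Cost of every replica cold-starting once, in USD."""
    return replicas * cold_start_seconds * rate


replicas = desired_replicas(pending_tasks=20, tasks_per_replica=1, max_replicas=8)
print(replicas, f"${cold_start_bill(replicas, 60):.2f}")
```

Under this model any backlog of eight or more tasks with tasks_per_replica=1 saturates the replica cap immediately, which is why a recursive fan-out reaches max_replicas so quickly.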
Failure mode 1: Cold start amplification storm
The most common source of unexpected Beam billing in agent deployments is the pattern where an agent's orchestration layer dispatches multiple subtasks in parallel to a Beam function. This looks exactly like efficient parallel processing — and on warm containers it is. But when no warm containers are available (at session start, after a period of inactivity, or after scale-to-zero), each parallel dispatch lands on a cold container. All cold starts run simultaneously. All bill per second simultaneously.
The pattern is common in agent architectures where a planning agent breaks work into subtasks and fans them out:
# Agent orchestration — looks efficient, but triggers N simultaneous cold starts
subtasks = planner.decompose(task, n=10)
futures = [beam_inference.put(subtask) for subtask in subtasks] # 10 cold starts in parallel
results = [f.get() for f in futures]
Each call to .put() (or the equivalent endpoint call) lands immediately on a cold container. Beam boots 10 containers in parallel, all billing from T=0. By the time the first result returns, the agent has consumed 600 container-seconds of cold start time (10 containers × 60 seconds each) — $0.36 on A10G — before a single inference token was generated.
The failure compounds when the agent's planner is itself running on Beam. A Beam-hosted planner that fans out to Beam-hosted inference workers creates a two-level cold start storm: the planner cold-starts first, then dispatches subtasks that each trigger their own cold starts.
The signature: N containers appearing in Beam's usage dashboard within a 30-second window at session start, each showing identical boot timestamps. Cold start storms are visible in Beam's container timeline as a horizontal band of simultaneous starts — distinct from the staggered starts of healthy gradual autoscaling.
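The signature can also be checked programmatically. The sketch below assumes you have exported container boot times as seconds since some reference point (the export step and both function names are hypothetical); it counts the largest number of boots that fall inside any 30-second window.

```python
def max_starts_in_window(boot_times, window_seconds=30.0):
    """Largest number of container boots inside any sliding window."""
    times = sorted(boot_times)
    best = 0
    left = 0
    for right, t in enumerate(times):
        # Shrink the window from the left until it spans at most window_seconds
        while t - times[left] > window_seconds:
            left += 1
        best = max(best, right - left + 1)
    return best


def looks_like_cold_start_storm(boot_times, window_seconds=30.0, threshold=5):
    """True when at least `threshold` containers booted within one window."""
    return max_starts_in_window(boot_times, window_seconds) >= threshold


storm = [0.0, 0.4, 1.1, 1.3, 2.0, 2.2, 2.9, 3.0, 3.5, 4.1]
gradual = [0.0, 45.0, 90.0, 135.0, 180.0]
print(looks_like_cold_start_storm(storm), looks_like_cold_start_storm(gradual))
```

The threshold of 5 is an arbitrary starting point; set it relative to the fan-out width your planner normally produces.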
The BeamColdStartGuard limits parallel dispatches to a configurable concurrency ceiling, queuing subsequent dispatches to reuse containers that warm up during the first batch:
import time
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BeamColdStartGuard:
    max_parallel_cold_starts: int = 3
    cold_start_detection_window_seconds: float = 90.0
    daily_cold_start_budget_seconds: float = 600.0
    _active_dispatches: int = field(default=0, init=False)
    _dispatch_lock: threading.Lock = field(default_factory=threading.Lock, init=False)
    _cold_start_log: deque = field(default_factory=deque, init=False)
    _total_cold_start_seconds_today: float = field(default=0.0, init=False)

    def dispatch(self, beam_func, payload: dict) -> Any:
        with self._dispatch_lock:
            if self._active_dispatches >= self.max_parallel_cold_starts:
                # Wait for a slot rather than spawning a new container
                self._wait_for_slot()
            self._active_dispatches += 1
        start = time.monotonic()
        try:
            result = beam_func(payload)
            duration = time.monotonic() - start
            if duration > self.cold_start_detection_window_seconds * 0.33:
                # Heuristic: a call longer than a third of the detection window
                # probably included a cold start
                self._record_cold_start(duration)
            return result
        finally:
            with self._dispatch_lock:
                self._active_dispatches -= 1

    def _wait_for_slot(self):
        # Called with the lock held. Release it while sleeping so other threads can
        # finish, then re-acquire to check; returns with the lock held again.
        self._dispatch_lock.release()
        while True:
            time.sleep(0.5)
            self._dispatch_lock.acquire()
            if self._active_dispatches < self.max_parallel_cold_starts:
                return
            self._dispatch_lock.release()

    def _record_cold_start(self, duration_seconds: float):
        now = time.monotonic()
        # Dispatches run on several threads, so guard the shared counters
        with self._dispatch_lock:
            self._cold_start_log.append((now, duration_seconds))
            self._total_cold_start_seconds_today += duration_seconds
            total = self._total_cold_start_seconds_today
        if total > self.daily_cold_start_budget_seconds:
            raise RuntimeError(
                f"BeamColdStartGuard: daily cold start budget exhausted "
                f"({total:.0f}s > "
                f"{self.daily_cold_start_budget_seconds:.0f}s limit). "
                "Increase keep_warm_seconds or reduce parallel dispatch concurrency."
            )

# Usage
guard = BeamColdStartGuard(max_parallel_cold_starts=3, daily_cold_start_budget_seconds=300)
subtasks = planner.decompose(task, n=10)
# dispatch() blocks until its call returns, so it has to run on worker threads for the
# ceiling to matter. All 10 calls are submitted at once, but the guard lets only 3 run
# concurrently; later calls wait for a slot and can reuse containers warmed by earlier ones.
with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
    results = list(pool.map(lambda s: guard.dispatch(beam_inference, s), subtasks))
Failure mode 2: Task queue depth explosion
Beam's QueueDepthAutoscaler is designed for steady, independent task workloads where queue depth is a reliable proxy for needed compute. It works well when tasks arrive from an external source at a bounded rate. It breaks down when tasks enqueue more tasks — because in that pattern, queue depth is not a proxy for backlog; it is a proxy for how deeply the recursive fan-out has progressed.
The failure mode appears in AI agents that use task queues for parallelism across agent subtasks, where each subtask is itself an agent step that may decompose further:
from beam import task_queue, QueueDepthAutoscaler, Image

@task_queue(
    autoscaler=QueueDepthAutoscaler(max_replicas=8, tasks_per_replica=1),
    image=Image(python_packages=["openai", "runguard"]),
    gpu="A10G",
)
def agent_step(payload: dict) -> dict:
    # Each step may enqueue child steps
    result = run_llm_step(payload["prompt"])
    if result.get("subtasks"):
        for subtask in result["subtasks"]:
            # This enqueues more work into the same queue
            agent_step.put({"prompt": subtask, "depth": payload["depth"] + 1})
    return result
When the LLM generates subtasks, each worker enqueues more work. The autoscaler sees queue depth growing (because the enqueue rate exceeds the drain rate) and launches more containers. Each new container drains tasks, and each drained task enqueues more tasks. The queue depth stays above the autoscaler's scale-up threshold until the task tree converges (the LLM stops generating subtasks). If max_replicas is reached first, the container count stops growing but the queue keeps growing without bound.
With max_replicas=8, 8 A10G containers running simultaneously bill $0.0006/second × 8 = $0.0048/second, or $17.28 per hour. A task tree that takes 30 minutes to converge (or never converges because the LLM is hallucinating subtask decompositions in a loop) generates an $8.64 billing event from a single initial task.
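A short calculation shows how fast a recursive task tree grows and what a saturated replica pool costs. This is a sketch under simplifying assumptions: every task spawns the same number of children (a hypothetical uniform branching factor), and the rate is the article's $0.0006/second A10G figure.

```python
def tasks_in_tree(branching_factor, max_depth):
    """Total tasks when each task spawns `branching_factor` children,
    from the root (level 0) down to level `max_depth`."""
    return sum(branching_factor ** level for level in range(max_depth + 1))


def saturated_cost(replicas, minutes, rate=0.0006):
    """Cost in USD of `replicas` containers billing continuously for `minutes`."""
    return replicas * minutes * 60 * rate


print(tasks_in_tree(branching_factor=3, max_depth=4))
print(f"${saturated_cost(replicas=8, minutes=30):.2f}")
```

Even a modest branching factor of 3 yields over a hundred tasks by depth 4, which is why a depth ceiling tends to be a more effective control than a larger max_replicas.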
The signature: Queue depth in Beam's dashboard increasing monotonically despite containers actively processing tasks, combined with container count at or near max_replicas. Unlike genuine high-volume workloads, the queue grows even when all containers are active — a reliable indicator that tasks are generating more tasks faster than they can be processed.
The BeamTaskQueueDepthGuard enforces a per-task-tree depth ceiling by passing depth metadata through the task payload and blocking enqueue when the ceiling is exceeded:
from beam import task_queue, QueueDepthAutoscaler
from runguard import LoopDetector

MAX_TASK_DEPTH = 4
MAX_TASKS_PER_TREE = 50

class BeamTaskQueueDepthGuard:
    def __init__(self, max_depth: int = MAX_TASK_DEPTH, max_tree_size: int = MAX_TASKS_PER_TREE):
        self.max_depth = max_depth
        self.max_tree_size = max_tree_size
        # In-memory counter: each container keeps its own tally, so with several
        # replicas the tree-size ceiling is enforced per container, not globally
        self._tree_counters: dict[str, int] = {}

    def check_and_enqueue(self, beam_func, payload: dict) -> bool:
        depth = payload.get("_task_depth", 0)
        tree_id = payload.get("_tree_id", "default")
        if depth >= self.max_depth:
            raise RuntimeError(
                f"BeamTaskQueueDepthGuard: task depth {depth} exceeds ceiling "
                f"{self.max_depth} for tree '{tree_id}'. "
                "LLM is likely generating recursive subtask decompositions. "
                "Check the agent's subtask generation prompt for unbounded decomposition."
            )
        tree_count = self._tree_counters.get(tree_id, 0) + 1
        if tree_count > self.max_tree_size:
            raise RuntimeError(
                f"BeamTaskQueueDepthGuard: tree '{tree_id}' has spawned {tree_count} tasks "
                f"(ceiling: {self.max_tree_size}). Blocking further enqueue."
            )
        self._tree_counters[tree_id] = tree_count
        # The child carries its parent's depth + 1 so the ceiling follows the tree
        child_payload = {**payload, "_task_depth": depth + 1, "_tree_id": tree_id}
        beam_func.put(child_payload)
        return True

# In the task handler
depth_guard = BeamTaskQueueDepthGuard(max_depth=4, max_tree_size=50)

@task_queue(
    autoscaler=QueueDepthAutoscaler(max_replicas=8, tasks_per_replica=1),
    gpu="A10G",
)
def agent_step(payload: dict) -> dict:
    result = run_llm_step(payload["prompt"])
    if result.get("subtasks"):
        for subtask in result["subtasks"]:
            try:
                depth_guard.check_and_enqueue(
                    agent_step,
                    {"prompt": subtask, "_task_depth": payload.get("_task_depth", 0),
                     "_tree_id": payload.get("_tree_id", "root")}
                )
            except RuntimeError as e:
                # Log the guard trip; return partial result rather than blocking indefinitely
                result["guard_trip"] = str(e)
                break
    return result
Failure mode 3: Endpoint retry container storm
Beam endpoints (decorated with @endpoint()) expose HTTP interfaces that agents call synchronously. Two timeouts are in play. Beam's container timeout is a hard limit: if the function does not return within timeout_seconds, Beam stops it and the caller receives a 504 or equivalent error. The calling agent's HTTP client has its own timeout, which is often shorter. When the client gives up waiting, it follows standard retry patterns: it treats the timeout as a transient failure and retries after a brief backoff.
The retry creates a second container. The original container is still running — the function did not finish; the timeout ended the HTTP response, not the container process. Beam's billing does not distinguish between a container running useful work and a container that has already served its response but hasn't been recycled yet. Both the original (still running its long inference) and the retry (starting fresh on the same payload) bill concurrently.
The failure is especially costly when the root cause of the timeout is an agent-driven generation loop — the model generating tokens without stopping because the stop sequence wasn't reached. In this case:
- The original container runs until the Beam-level timeout fires (billing the full timeout_seconds at GPU rate)
- Each retry spawns a new container that cold-starts, runs the same runaway generation payload, and times out again
- With 3 retries: 4 containers each timing out after 120 seconds on an A10G = 480 container-seconds × $0.0006 = $0.29 for a single logical task that produced no useful output
The signature: Multiple containers in Beam's dashboard showing identical start times offset by the retry backoff interval, all with runtime equal to timeout_seconds. A container cluster where every container hits exactly the same timeout wall — rather than finishing at varying times — is a retry storm on a runaway payload.
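Both the cost and the signature of a retry storm are easy to express in code. This is a minimal sketch with hypothetical function names: the first function reproduces the arithmetic for K retries on a runaway payload, and the second flags a group of container runtimes that all end at the timeout wall. The runtimes are assumed to come from your own export of the dashboard data.

```python
def retry_storm_cost(retries, timeout_seconds, rate=0.0006):
    """Cost in USD when the original call and every retry each run to the timeout."""
    containers = retries + 1
    return containers * timeout_seconds * rate


def all_hit_timeout(runtimes, timeout_seconds, tolerance=2.0):
    """True when more than one container ran and every runtime sits at the timeout."""
    return len(runtimes) > 1 and all(
        abs(runtime - timeout_seconds) <= tolerance for runtime in runtimes
    )


print(f"${retry_storm_cost(retries=3, timeout_seconds=120):.2f}")
print(all_hit_timeout([120.0, 119.6, 120.3, 120.1], timeout_seconds=120))
print(all_hit_timeout([14.2, 37.9, 120.0, 8.4], timeout_seconds=120))
```

The 2-second tolerance is an arbitrary allowance for scheduling jitter; widen it if your recorded runtimes include teardown time.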
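The BeamEndpointRetryGuard deduplicates calls by payload hash: it blocks a concurrent duplicate while the same payload is in flight, backs off exponentially before each retry, and refuses further attempts once a payload has timed out repeatedly. It does not cancel the original container, so the backoff should be long enough for Beam's own timeout to reclaim it: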
import hashlib
import json
import time
import httpx
from dataclasses import dataclass, field

@dataclass
class BeamEndpointRetryGuard:
    endpoint_url: str
    max_retries: int = 2
    backoff_base_seconds: float = 5.0
    payload_hash_ttl_seconds: float = 300.0
    _in_flight: dict = field(default_factory=dict, init=False)
    _timeout_history: dict = field(default_factory=dict, init=False)

    def _payload_hash(self, payload: dict) -> str:
        canonical = json.dumps(payload, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

    def call(self, payload: dict, timeout_seconds: float = 60.0) -> dict:
        ph = self._payload_hash(payload)
        now = time.monotonic()
        # If this payload has timed out before, it's likely a structural problem
        prior_timeouts = self._timeout_history.get(ph, [])
        prior_timeouts = [t for t in prior_timeouts if now - t < self.payload_hash_ttl_seconds]
        if len(prior_timeouts) >= self.max_retries:
            raise RuntimeError(
                f"BeamEndpointRetryGuard: payload hash {ph!r} has timed out "
                f"{len(prior_timeouts)} times in the last {self.payload_hash_ttl_seconds:.0f}s. "
                "Likely a generation loop in the model response. "
                "Check max_tokens, stop sequences, or reduce prompt complexity."
            )
        # Check if this payload is already in-flight (concurrent duplicate)
        if ph in self._in_flight:
            raise RuntimeError(
                f"BeamEndpointRetryGuard: payload hash {ph!r} already in-flight. "
                "Blocking duplicate dispatch to prevent parallel container billing."
            )
        self._in_flight[ph] = now
        attempt = 0
        last_error = None
        try:
            while attempt <= self.max_retries:
                try:
                    response = httpx.post(
                        self.endpoint_url,
                        json=payload,
                        timeout=timeout_seconds,
                    )
                    response.raise_for_status()
                    return response.json()
                except (httpx.TimeoutException, httpx.HTTPStatusError) as e:
                    if isinstance(e, httpx.HTTPStatusError) and e.response.status_code < 500:
                        # Client errors (4xx) are not timeouts; retrying cannot help
                        raise
                    self._timeout_history.setdefault(ph, []).append(time.monotonic())
                    attempt += 1
                    last_error = e
                    if attempt <= self.max_retries:
                        # Exponential backoff before retry: gives the original container
                        # time to finish and be recycled before a new container starts
                        time.sleep(self.backoff_base_seconds * (2 ** (attempt - 1)))
            raise RuntimeError(
                f"BeamEndpointRetryGuard: {self.max_retries + 1} attempts exhausted "
                f"for payload hash {ph!r}. Last error: {last_error}"
            )
        finally:
            self._in_flight.pop(ph, None)

# Usage: replace direct httpx calls with guarded calls
guard = BeamEndpointRetryGuard(
    endpoint_url="https://app.beam.cloud/endpoint/my-inference-fn",
    max_retries=2,
    backoff_base_seconds=10.0,  # 10s backoff gives original container time to time out
)
result = guard.call({"prompt": user_prompt, "max_tokens": 512})
Failure mode 4: Autoscale thrash
Beam containers have a keep_warm_seconds parameter that keeps a container alive after its last invocation completes. If an agent invokes Beam in periodic bursts separated by gaps, the relationship between the burst interval and keep_warm_seconds determines whether each burst hits warm or cold containers.
When the burst interval exceeds keep_warm_seconds, containers scale to zero between bursts. Every burst becomes a cold start event. An agent that:
- Invokes a Beam function every 3 minutes (e.g., a monitoring agent that polls on a schedule)
- With default keep_warm_seconds=10
- Using an A10G container with a 60-second cold start
…pays 60 seconds of cold start billing for every 3-minute invocation cycle. In an 8-hour session, that agent executes ~160 invocations and pays 160 × 60 × $0.0006 = $5.76 in pure cold start overhead — 33% of the total compute budget if each invocation takes 2 minutes of real work.
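The session arithmetic can be checked with a few lines. This is a sketch with a hypothetical helper name; it assumes every invocation cold-starts and uses the article's $0.0006/second A10G rate.

```python
def thrash_overhead(session_hours, interval_minutes, cold_start_seconds,
                    work_seconds, rate=0.0006):
    """Return (invocations, cold_start_cost, cold_start_share) for one session,
    assuming every invocation lands on a cold container."""
    invocations = int(session_hours * 60 / interval_minutes)
    cold_cost = invocations * cold_start_seconds * rate
    work_cost = invocations * work_seconds * rate
    return invocations, cold_cost, cold_cost / (cold_cost + work_cost)


count, cost, share = thrash_overhead(
    session_hours=8, interval_minutes=3, cold_start_seconds=60, work_seconds=120
)
print(count, f"${cost:.2f}", f"{share:.0%}")
```

The share depends only on the ratio of cold start time to work time, so shortening either the cold start or the gap between invocations changes the picture more than switching GPU types.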
The thrash pattern is distinct from a one-time cold start because it is structural and recurring: fixing it requires matching keep_warm_seconds to the invocation cadence, not adding circuit breakers to individual calls. The guard's role is to detect the pattern and surface the configuration mismatch before it accumulates across a full session.
The signature: Cold start events appearing at regular intervals in Beam's container timeline, spaced exactly at the agent's invocation cadence. Each container appears, runs for a brief period, and disappears — rather than staying alive across invocations. The container count oscillates between 0 and N with each burst.
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class BeamAutoscaleThrashGuard:
    cold_start_threshold_seconds: float = 20.0
    max_cold_start_rate_per_hour: int = 5
    session_window_seconds: float = 3600.0
    _invocation_log: deque = field(default_factory=deque, init=False)
    _cold_start_events: list = field(default_factory=list, init=False)

    def record_invocation(self, duration_seconds: float, was_cold_start: bool):
        now = time.monotonic()
        self._invocation_log.append((now, duration_seconds, was_cold_start))
        if was_cold_start:
            self._cold_start_events.append(now)
        # Trim to session window
        cutoff = now - self.session_window_seconds
        while self._invocation_log and self._invocation_log[0][0] < cutoff:
            self._invocation_log.popleft()
        self._cold_start_events = [t for t in self._cold_start_events if t >= cutoff]
        cold_start_rate = len(self._cold_start_events)
        if cold_start_rate > self.max_cold_start_rate_per_hour:
            self._raise_thrash_alert(cold_start_rate)

    def _raise_thrash_alert(self, cold_start_rate: int):
        if len(self._invocation_log) < 3:
            return
        timestamps = [entry[0] for entry in self._invocation_log]
        gaps = [timestamps[i+1] - timestamps[i] for i in range(len(timestamps) - 1)]
        avg_gap = sum(gaps) / len(gaps)
        recommended_keep_warm = int(avg_gap * 1.5)
        raise RuntimeError(
            f"BeamAutoscaleThrashGuard: {cold_start_rate} cold starts detected in "
            f"the last hour (ceiling: {self.max_cold_start_rate_per_hour}). "
            f"Average invocation gap: {avg_gap:.0f}s. "
            f"Set keep_warm_seconds>={recommended_keep_warm} in your Beam function "
            f"decorator to keep containers warm between invocations. "
            f"Estimated cold start waste: "
            f"{cold_start_rate * self.cold_start_threshold_seconds * 0.0006:.4f} USD/hr on A10G."
        )

    def wrap(self, beam_func):
        def guarded(*args, **kwargs):
            start = time.monotonic()
            result = beam_func(*args, **kwargs)
            duration = time.monotonic() - start
            # Heuristic: if duration significantly exceeds expected inference time,
            # a cold start contributed
            was_cold_start = duration > self.cold_start_threshold_seconds
            self.record_invocation(duration, was_cold_start)
            return result
        return guarded

# Usage: wrap the beam function call at the orchestration layer
thrash_guard = BeamAutoscaleThrashGuard(
    cold_start_threshold_seconds=30.0,
    max_cold_start_rate_per_hour=4,
)
guarded_beam_call = thrash_guard.wrap(beam_inference)

# In your agent loop:
for poll_cycle in range(max_cycles):
    result = guarded_beam_call({"input": collect_inputs()})
    time.sleep(poll_interval_seconds)
When the guard fires, the recommended fix is always in the Beam function decorator, not in the calling code:
from beam import endpoint, Image
# Before: default keep_warm expires in 10 seconds — every burst = cold start
@endpoint(gpu="A10G", image=Image(python_packages=["transformers"]))
def run_inference(payload: dict) -> dict: ...
# After: keep warm for 5 minutes — polling agents at 3-minute intervals stay warm
@endpoint(gpu="A10G", keep_warm_seconds=300, image=Image(python_packages=["transformers"]))
def run_inference(payload: dict) -> dict: ...
Composite guard for Beam AI agents
In production, most Beam AI agent deployments combine multiple risk surfaces: a planning agent that dispatches parallel subtasks, subtasks that use task queues for further decomposition, and a monitoring loop that polls for results. Wrapping each dispatch point with a dedicated guard and sharing configuration at session initialization produces a layered defense:
import time
from concurrent.futures import ThreadPoolExecutor
from runguard import BudgetGuard

class BeamSessionGuard:
    """Composite guard for a Beam AI agent session.
    Wraps cold start throttling, task queue depth, endpoint retry dedup,
    and autoscale thrash detection with a shared session budget ceiling.
    """
    def __init__(
        self,
        endpoint_url: str,
        session_budget_usd: float = 5.0,
        gpu_rate_per_second: float = 0.0006,  # A10G
    ):
        self.cold_start_guard = BeamColdStartGuard(
            max_parallel_cold_starts=3,
            daily_cold_start_budget_seconds=session_budget_usd * 0.3 / gpu_rate_per_second,
        )
        self.queue_guard = BeamTaskQueueDepthGuard(max_depth=4, max_tree_size=50)
        self.retry_guard = BeamEndpointRetryGuard(
            endpoint_url=endpoint_url,
            max_retries=2,
            backoff_base_seconds=10.0,
        )
        self.thrash_guard = BeamAutoscaleThrashGuard(
            cold_start_threshold_seconds=30.0,
            max_cold_start_rate_per_hour=6,
        )
        # Hard budget ceiling across all guards
        self.budget = BudgetGuard(
            max_spend_usd=session_budget_usd,
            gpu_rate_per_second=gpu_rate_per_second,
        )

    def dispatch_parallel(self, beam_func, payloads: list) -> list:
        def guarded(payload):
            self.budget.check()
            return self.cold_start_guard.dispatch(beam_func, payload)
        # dispatch() blocks, so run it on worker threads; the cold start guard
        # then limits how many calls are in flight at once
        with ThreadPoolExecutor(max_workers=max(1, len(payloads))) as pool:
            return list(pool.map(guarded, payloads))

    def enqueue_subtask(self, beam_func, payload: dict) -> bool:
        self.budget.check()
        return self.queue_guard.check_and_enqueue(beam_func, payload)

    def call_endpoint(self, payload: dict) -> dict:
        self.budget.check()
        # Time the call here: the retry guard does not expose per-call timing
        start = time.monotonic()
        result = self.retry_guard.call(payload)
        duration = time.monotonic() - start
        self.thrash_guard.record_invocation(
            duration_seconds=duration,
            was_cold_start=duration > self.thrash_guard.cold_start_threshold_seconds,
        )
        return result

# Initialization: once per session
session_guard = BeamSessionGuard(
    endpoint_url="https://app.beam.cloud/endpoint/agent-inference",
    session_budget_usd=3.0,
)
# Parallel subtask dispatch: cold start throttled
results = session_guard.dispatch_parallel(beam_worker, subtasks)
# Recursive task enqueue: depth and tree-size bounded
session_guard.enqueue_subtask(beam_worker, child_task)
# Endpoint call: retry storm prevented, thrash detected
output = session_guard.call_endpoint({"prompt": user_prompt})
Summary of failure modes and guards
| Failure mode | Primary cost driver | Guard | Trip condition |
|---|---|---|---|
| Cold start amplification storm: N parallel dispatches → N simultaneous cold starts | Cold start seconds × N containers billed before first inference token | BeamColdStartGuard | Active parallel dispatches > max_parallel_cold_starts |
| Task queue depth explosion: recursive subtask spawning outruns drain rate | All max_replicas containers cold-starting simultaneously; queue never drains | BeamTaskQueueDepthGuard | Task tree depth > 4 or tree size > 50 tasks |
| Endpoint retry container storm: timeout retry spawns new container while original still runs | K+1 containers billing concurrently for K retries on same payload | BeamEndpointRetryGuard | Same payload hash timed out ≥ max_retries times |
| Autoscale thrash: burst-gap cadence exceeds keep_warm_seconds | Every burst triggers full cold start; 0 warm container reuse | BeamAutoscaleThrashGuard | Cold start rate > max_cold_start_rate_per_hour |
Frequently asked questions
How is this different from the Modal Labs cost control patterns?
The Modal Labs post covers Modal's billing model, which includes a minimum container lifetime granularity (you're billed for at least N seconds even for sub-second invocations) and Modal's own retry and autoscaling mechanism. Beam's billing starts from container boot with per-second granularity and no minimum lifetime floor — the cost risk is in cold start duration and parallel container count, not in minimum billing per invocation. Beam's QueueDepthAutoscaler responds to queue depth, which creates the recursive spawning pattern specific to Beam task queues. The guard patterns are architecturally similar (cold start throttling, retry dedup, budget ceiling), but the triggers, thresholds, and integration points differ because the two platforms have different container lifecycle models.
Can I detect cold starts from inside a Beam function to measure them directly?
Yes. Beam exposes a @app.on_start() hook that runs once when a container boots (not on warm invocations). Instrumenting this hook with a timestamp write to an in-memory global lets your handler code distinguish cold invocations (time.monotonic() - cold_start_timestamp < 5.0) from warm ones. You can log cold start count and duration to Beam's output or to an external counter (a lightweight Redis INCR or a Beam Volume write). This gives you a per-container cold start count without the heuristic duration threshold approach used in the BeamAutoscaleThrashGuard above — direct measurement is always preferred over heuristics when the instrumentation point is available.
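The in-process pattern can be sketched without any platform-specific API. The function names below are hypothetical, and how you register the startup function with Beam is left as an assumption to check against the SDK you use. Instead of the 5-second window described above, this sketch marks only the first invocation served by a process as cold, which avoids picking a time threshold.

```python
import time

_booted_at = None  # set once per container process by the startup hook
_served = 0        # invocations handled by this process so far


def on_container_start():
    """Hypothetical startup hook: wire it to whatever runs once per container boot."""
    global _booted_at
    _booted_at = time.monotonic()


def handle(payload):
    """Hypothetical handler that tags its result with cold/warm metadata."""
    global _served
    is_cold = _served == 0  # only the first call after boot counts as cold
    _served += 1
    seconds_since_boot = time.monotonic() - _booted_at
    return {"cold": is_cold, "seconds_since_boot": seconds_since_boot, "echo": payload}


on_container_start()
first = handle({"prompt": "a"})
second = handle({"prompt": "b"})
print(first["cold"], second["cold"])
```

Because the state lives in module globals, it resets whenever the container is recycled, which is exactly the event being counted.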
The BeamEndpointRetryGuard uses payload hashing — what if two different tasks legitimately have identical payloads?
This is a valid concern for high-volume agent deployments where many users might submit identical prompts (e.g., a customer service agent with a narrow question domain). The guard's payload_hash_ttl_seconds controls how long a hash remains "blocked" after a timeout. Setting this value to the container's expected maximum timeout (timeout_seconds + 30s buffer) means the block only applies during the original container's possible active window — not across independent user sessions. For fine-grained control, add a request_id field to every payload before hashing; identical prompts from different users will hash differently because their request_id values differ, while retries of the same logical request will hash identically because they share the same request_id.
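The effect of a request_id on the hash is easy to demonstrate. This sketch reuses the guard's hashing scheme; the with_request_id helper and the use of uuid4 for identifiers are assumptions for illustration.

```python
import hashlib
import json
import uuid


def payload_hash(payload):
    """Same scheme as the guard: canonical JSON, then a truncated SHA-256."""
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def with_request_id(payload, request_id=None):
    """Return a copy of the payload tagged with a per-request identifier."""
    return {**payload, "request_id": request_id or uuid.uuid4().hex}


alice = with_request_id({"prompt": "Where is my order?"})
bob = with_request_id({"prompt": "Where is my order?"})
alice_retry = dict(alice)  # a retry reuses the original request_id

print(payload_hash(alice) == payload_hash(bob))          # different users
print(payload_hash(alice) == payload_hash(alice_retry))  # same logical request
```

Generate the request_id once where the logical request originates, not inside the retry loop, or every retry will look like a new request.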
Does setting keep_warm_seconds high enough to prevent thrash always make sense economically?
Not always. The break-even analysis compares two hourly costs: keep_warm_cost = idle_seconds_per_hour × gpu_rate × avg_containers_kept_warm versus cold_start_cost = cold_start_seconds × gpu_rate × cold_start_events_per_hour. If your agent invokes Beam 3 times per hour (low frequency), setting keep_warm_seconds=1200 (20 minutes) keeps a container alive for 20 minutes after each invocation, which is up to 60 idle minutes per hour at full GPU rate. On an A10G at $0.0006/second, that idle time costs up to $2.16/hr. Three 60-second cold starts cost $0.036 × 3 = $0.108/hr, so in this example keeping the container warm costs roughly 20 times more than paying for the cold starts. The guard's recommendation of keep_warm_seconds = avg_gap × 1.5 is a starting point; you should run the break-even calculation for your specific invocation frequency and GPU type before committing to a keep_warm_seconds value. For low-frequency agents, cold starts are often cheaper than continuous warm container billing.
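The comparison can be run for any cadence with a short script. It is a sketch under simplifying assumptions: invocations are evenly spaced, work time is ignored, one container is kept warm, and the rate is the $0.0006/second A10G figure used in this article. Both function names are hypothetical.

```python
def idle_cost_per_hour(invocations_per_hour, keep_warm_seconds, rate=0.0006):
    """Upper bound on hourly idle billing: each invocation is followed by up to
    keep_warm_seconds of idle time, capped at a full hour."""
    idle_seconds = min(3600, invocations_per_hour * keep_warm_seconds)
    return idle_seconds * rate


def cold_start_cost_per_hour(invocations_per_hour, cold_start_seconds, rate=0.0006):
    """Hourly cold start billing when every invocation lands on a cold container."""
    return invocations_per_hour * cold_start_seconds * rate


warm = idle_cost_per_hour(invocations_per_hour=3, keep_warm_seconds=1200)
cold = cold_start_cost_per_hour(invocations_per_hour=3, cold_start_seconds=60)
print(f"keep warm: ${warm:.2f}/hr, cold starts: ${cold:.3f}/hr")
print("keep warm" if warm < cold else "accept cold starts")
```

Keeping a container warm pays off only when the idle time it adds per hour is shorter than the cold start time it removes, which in practice means frequent invocations or long cold starts.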
Does the RunGuard SDK integrate with Beam's native observability?
RunGuard operates at the Python application layer, so it runs wherever your Python code runs — including inside Beam containers and in the orchestration code that calls Beam. Guards that live inside Beam functions (like BeamTaskQueueDepthGuard and the @app.on_start() cold start instrumentation) write their trip events to RunGuard's telemetry endpoint, which is reachable from inside Beam containers via standard outbound HTTP. Guards that live in the orchestration layer (like BeamColdStartGuard and BeamEndpointRetryGuard) run in your own infrastructure and write telemetry from there. Both surface guard trips in RunGuard's circuit breaker dashboard alongside LLM token budget events — giving you a unified view of compute cost control across your Beam functions and the LLM API calls they make.