Temporal AI Workflow Cost Control: History Bloat, Activity Retry Amplification, and ContinueAsNew
Temporal is architecturally different from every other AI agent orchestration framework in this series. Python SDKs like LangChain, CrewAI, and the OpenAI Agents SDK run in a single process — when the process dies, the agent state dies with it. Temporal workflows are durable: every function call, activity result, timer, and signal is persisted as an immutable event in a workflow history stored in the Temporal server. Workers replay this history to reconstruct execution state when they resume. This durability is Temporal's main feature, and it is also the source of all four AI cost failure modes in this guide.
Temporal Cloud pricing is based on actions: each workflow task, activity task, signal, timer, and child workflow start is a billable action. A research agent that calls 200 LLM activities generates at minimum 400 actions (schedule + completion per activity), plus workflow task actions for the orchestration logic between activities. This is fundamentally different from token-based billing — you pay for orchestration density, not just inference cost. High-frequency LLM tools inside a Temporal workflow compound both.
Four failure modes that are specific to Temporal AI workflows:
- Workflow history accumulation: every activity result, including the full LLM response text, is stored as a history event. A research agent processing 500 sources accumulates thousands of events and megabytes of stored LLM output in the workflow history, forcing workers to replay all of it on every resume.
- Activity retry amplification: the default Temporal RetryPolicy has MaximumAttempts: 0 (unlimited). LLM activities hitting API rate limits will retry indefinitely, each retry charged as a new activity action at Temporal Cloud pricing.
- Child workflow fan-out runaway: fan-out patterns using workflow.ExecuteChildWorkflow are seeded by LLM output (e.g., "generate N sub-queries"). LLMs produce variable-length lists; without a ceiling, an agent that generates 200 sub-queries spawns 200 concurrent child workflows, each with its own history and action meter.
- ContinueAsNew neglect: long-running AI agent workflows that never call workflow.ContinueAsNew accumulate history indefinitely. At Temporal Cloud storage pricing, a workflow with 10,000 events storing 40KB average LLM outputs represents 400MB of history. Every time the workflow processes a signal, the worker replays all 400MB to reach the current state.
Temporal Cloud action pricing (as of mid-2026): Actions are priced at approximately $25/million. An AI research agent running 500 LLM activities generates ~1,000 actions minimum — $0.025 for the Temporal overhead on top of LLM inference cost. At 1,000 agent runs/day, Temporal overhead is ~$25/day before any optimization. This is non-trivial when multiplied across multiple agent types.
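The arithmetic above can be sketched as a small estimator. This is a back-of-envelope model, not Temporal's billing logic: the function name is hypothetical, and both the 2-actions-per-activity assumption and the $25/million rate are simply the figures used in this guide, to be replaced with your own pricing.

```python
# Back-of-envelope model of Temporal Cloud action overhead.
# Assumptions (taken from this guide, not from an official price list):
#   - each activity costs 2 actions (schedule + completion)
#   - actions cost $25 per million
ACTIONS_PER_ACTIVITY = 2
USD_PER_MILLION_ACTIONS = 25.0


def action_overhead_usd(activities_per_run: int, runs: int = 1) -> float:
    """Estimated Temporal action cost in USD for `runs` agent runs."""
    actions = activities_per_run * ACTIONS_PER_ACTIVITY * runs
    return actions * USD_PER_MILLION_ACTIONS / 1_000_000


if __name__ == "__main__":
    print(f"one run:        ${action_overhead_usd(500):.3f}")
    print(f"1,000 runs/day: ${action_overhead_usd(500, runs=1000):.2f}")
```

Workflow task actions and retries are deliberately left out, so the result is a floor rather than a forecast.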
Failure Mode 1: Workflow History Accumulation
Every result returned from a Temporal activity is serialized into the workflow history as an ActivityTaskCompleted event. For LLM activities, this includes the full model response — hundreds to thousands of tokens of text, serialized to JSON and stored in the Temporal server's persistence layer. A research workflow that runs 200 LLM calls doesn't just pay for 200 LLM API calls; it also accumulates 200 ActivityTaskCompleted events, each potentially storing 2–8KB of serialized LLM output.
The replay cost is the hidden problem. When a Temporal worker picks up an existing workflow (after a crash, after a long-sleep timer fires, after receiving a signal), it must replay the full history to reconstruct the workflow's Go or Python data structures. A workflow with 5,000 history events that average 4KB each requires deserializing 20MB of data and re-executing the workflow logic up to the current event — before processing the actual new event that woke the workflow.
The naive pattern stores complete LLM responses as activity return values:
Go
// naive: full LLM response stored in history
func ResearchWorkflow(ctx workflow.Context, query string) (string, error) {
results := make([]string, 0, 50)
// 50 search sub-queries, each returning full LLM synthesis text
for i := 0; i < 50; i++ {
var synthesis string
ao := workflow.ActivityOptions{
StartToCloseTimeout: 60 * time.Second,
}
ctx = workflow.WithActivityOptions(ctx, ao)
// BUG: full LLM response (2-8KB) stored as activity result in history
err := workflow.ExecuteActivity(ctx, SynthesizeLLMActivity, query, i).Get(ctx, &synthesis)
if err != nil {
return "", err
}
results = append(results, synthesis)
}
// After 50 iterations: ~200KB of LLM text in history events
// Replay cost: deserialize + re-evaluate all 50 results on every wake
return strings.Join(results, "\n\n"), nil
}
After 50 activity completions with 4KB average responses, the history contains ~200KB of serialized LLM text in ActivityTaskCompleted payloads, plus the workflow task events, timer events, and the growing result accumulation. A workflow that processes 500 sources would carry ~2MB of LLM text in its history — loaded and deserialized on every signal receipt.
History event counts per activity execution:
| Event type | Count per activity | Payload size | Notes |
|---|---|---|---|
| ActivityTaskScheduled | 1 | ~200 bytes | Input args serialized |
| ActivityTaskStarted | 1 | ~100 bytes | Worker identity |
| ActivityTaskCompleted | 1 | 2–8KB | Full LLM output serialized here |
| WorkflowTaskScheduled/Started/Completed | 1–2 | ~300 bytes each | Orchestration overhead between activities |
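To get a feel for how the rows in this table add up, here is a rough footprint calculator. The byte sizes are the approximate values from the table; the assumption of one full workflow task (three events) per activity is mine, and the function name is hypothetical.

```python
# Rough per-activity history footprint, using the approximate sizes
# from the table above. All constants are estimates, not measurements.
SCHEDULED_BYTES = 200
STARTED_BYTES = 100
WORKFLOW_TASK_EVENTS = 3    # Scheduled + Started + Completed, assumed once per activity
WORKFLOW_TASK_BYTES = 300   # per workflow task event


def history_footprint(activities: int, avg_result_bytes: int) -> tuple[int, int]:
    """Return (event_count, total_bytes) for a workflow's history."""
    events_per_activity = 3 + WORKFLOW_TASK_EVENTS
    bytes_per_activity = (
        SCHEDULED_BYTES
        + STARTED_BYTES
        + avg_result_bytes
        + WORKFLOW_TASK_EVENTS * WORKFLOW_TASK_BYTES
    )
    return activities * events_per_activity, activities * bytes_per_activity


if __name__ == "__main__":
    for n in (50, 500):
        events, size = history_footprint(n, avg_result_bytes=4096)
        print(f"{n} activities: {events} events, {size / 1024:.0f} KB")
```

Under these assumptions the activity result dominates the total, which is why shrinking that one payload matters more than trimming any other event.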
The fix is to store large results externally and pass only identifiers through the workflow history. Write LLM outputs to an external store (S3, a database, Redis) and return only a reference key as the activity result. The history event payload shrinks from ~4KB to a few dozen bytes (the key plus a little metadata):
Go
// Fixed: store LLM output externally, pass only the key
type LLMResultRef struct {
	Key        string `json:"key"`         // e.g., "research/run-abc/result-007"
	TokenCount int    `json:"token_count"` // for cost tracking
	Truncated  bool   `json:"truncated"`
}

// Activities holds dependencies that cannot travel as activity arguments
// (arguments are serialized into history). Register the struct on the worker,
// e.g. w.RegisterActivity(&Activities{Store: store}).
type Activities struct {
	Store ResultStore // S3, Redis, etc.
}

func (a *Activities) SynthesizeLLMActivityWithRef(ctx context.Context, query string, index int) (LLMResultRef, error) {
	resp, err := callLLM(ctx, query)
	if err != nil {
		return LLMResultRef{}, err
	}
	// Keying by run ID and index makes a retried write overwrite itself
	runID := activity.GetInfo(ctx).WorkflowExecution.RunID
	key := fmt.Sprintf("research/%s/result-%03d", runID, index)
	if err := a.Store.Put(ctx, key, resp.Text); err != nil {
		return LLMResultRef{}, fmt.Errorf("result store write failed: %w", err)
	}
	return LLMResultRef{Key: key, TokenCount: resp.Usage.TotalTokens}, nil
}

// Workflow history now carries only small refs, not 4KB LLM texts
func ResearchWorkflow(ctx workflow.Context, query string) (string, error) {
	var a *Activities // nil receiver, used only to name the activity methods
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 60 * time.Second,
	})
	refs := make([]LLMResultRef, 0, 50)
	for i := 0; i < 50; i++ {
		var ref LLMResultRef
		err := workflow.ExecuteActivity(ctx, a.SynthesizeLLMActivityWithRef, query, i).Get(ctx, &ref)
		if err != nil {
			return "", err
		}
		refs = append(refs, ref)
	}
	// Final aggregation reads from the external store inside an activity
	// (AggregateFromStoreActivity, not shown): workflow code must not do I/O
	var report string
	err := workflow.ExecuteActivity(ctx, a.AggregateFromStoreActivity, refs).Get(ctx, &report)
	return report, err
}
This pattern reduces history payload from ~200KB to a few KB for a 50-activity workflow (50 small refs), cutting the LLM text that must be deserialized on replay by roughly 98–99%. The event count is unchanged; only the payloads shrink. Apply it to any activity that returns LLM output, search results, or large document content.
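The size difference is easy to check outside Temporal. The sketch below uses a hypothetical in-memory store standing in for S3 or Redis and compares the serialized size of a raw response with that of its reference; the class and function names are illustrative, and the word count is only a crude proxy for a real token count.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in for S3/Redis; a real store would be networked."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]


def store_and_ref(store: InMemoryStore, run_id: str, index: int, text: str) -> dict:
    """Write the full text to the store and return a small reference to it."""
    key = f"research/{run_id}/result-{index:03d}"
    store.put(key, text)
    # Word count is a crude proxy; a real activity would use the API's usage field.
    return {"key": key, "token_count": len(text.split()), "truncated": False}


if __name__ == "__main__":
    store = InMemoryStore()
    llm_text = "lorem ipsum " * 340  # roughly 4KB stand-in for an LLM response
    ref = store_and_ref(store, "run-abc", 7, llm_text)
    print(len(json.dumps(llm_text)), "bytes if returned directly")
    print(len(json.dumps(ref)), "bytes as a reference")
```

The reference grows with the key length, so a UUID run ID makes it larger than in this toy example, but it stays independent of the size of the LLM output.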
Failure Mode 2: Activity Retry Amplification
Temporal's default RetryPolicy is generous by design — it assumes transient infrastructure failures, not rate-limited external APIs. The defaults are:
- MaximumAttempts: 0 (unlimited retries)
- InitialInterval: 1s
- BackoffCoefficient: 2.0
- MaximumInterval: 100s
- NonRetryableErrorTypes: [] (all errors are retryable by default)
An LLM activity that hits a rate limit (429 Too Many Requests) will retry with exponential backoff: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 100s, 100s, 100s… Each retry is a new activity attempt — a new ActivityTaskScheduled event, a new ActivityTaskStarted event, and eventually either a new ActivityTaskCompleted or ActivityTaskFailed event. At Temporal Cloud action pricing, 50 retries on a single rate-limited activity generates 100+ billable actions before the underlying quota window resets.
The rate limit window is the critical insight. OpenAI's rate limits reset every 60 seconds. If an activity starts retrying at second 0 and the limit resets at second 60, a retry policy with a 1s initial interval and a 2.0 coefficient fires retries at roughly 1, 3, 7, 15, and 31 seconds, all of which hit the same exhausted quota; the sixth retry, at 1+2+4+8+16+32 = 63 seconds, is the first that can succeed. Six retries × 2 actions/retry = 12 actions per rate-limited activity call. With 20 concurrent LLM activities all hitting the same rate limit simultaneously, that's 240 extra actions per 60-second window.
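A quick way to check any retry policy against a quota window is to compute when its retries would fire. The helper below is a sketch of plain exponential backoff arithmetic, not a call into the Temporal SDK; the function name and the `window` parameter are assumptions for illustration.

```python
def retry_offsets(initial: float, coefficient: float,
                  max_interval: float, window: float) -> list[float]:
    """Seconds after the first failure at which retries fire, before `window`."""
    offsets: list[float] = []
    elapsed, wait = 0.0, initial
    while elapsed + wait < window:
        elapsed += wait
        offsets.append(elapsed)
        # Each wait grows by the coefficient, capped at max_interval
        wait = min(wait * coefficient, max_interval)
    return offsets


if __name__ == "__main__":
    # Temporal's default backoff parameters against a 60-second quota window
    print("default:", retry_offsets(1, 2.0, 100, window=60))
    # A slower policy: 10s initial interval, 1.5 coefficient, 60s cap
    print("tuned:  ", retry_offsets(10, 1.5, 60, window=60))
```

Every offset returned is a retry that lands inside the exhausted window and is therefore wasted, so a shorter list means fewer actions spent waiting for the quota to reset.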
Go
// EXPENSIVE: default retry policy on an LLM activity
func researchWorkflow(ctx workflow.Context, queries []string) error {
	// Only the required timeout is set; RetryPolicy is left at its default,
	// which means unlimited retries on ALL errors
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 60 * time.Second,
	})
	for _, q := range queries {
		err := workflow.ExecuteActivity(ctx, callLLMActivity, q).Get(ctx, nil)
		// If callLLMActivity hits a 429, it retries indefinitely
		// Each retry = 2+ Temporal Cloud actions
		if err != nil {
			return err
		}
	}
	return nil
}
The correct pattern sets an explicit RetryPolicy that aligns with the LLM provider's rate limit window and caps the number of attempts. A rate limit that outlasts the cap then surfaces to the workflow as an error, and the workflow handles the backoff at the orchestration level rather than the activity level:
Go
import (
	"errors"
	"fmt"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// LLM-tuned retry policy: cap attempts, match the 60s reset window
var llmRetryPolicy = &temporal.RetryPolicy{
	InitialInterval:    10 * time.Second, // start after 10s, not 1s
	BackoffCoefficient: 1.5,              // slower growth than default 2.0
	MaximumInterval:    60 * time.Second, // cap at one rate limit window
	MaximumAttempts:    4,                // 4 attempts = 3 retries: waits of ~10+15+22.5 ≈ 48s total
	NonRetryableErrorTypes: []string{
		"LLMContextLengthExceeded",  // retrying won't help
		"LLMInvalidAPIKey",          // credential error, not transient
		"LLMContentPolicyViolation", // content error, not transient
	},
}

func researchWorkflowGuarded(ctx workflow.Context, queries []string) error {
	// WithActivityOptions returns a new Context; pass it to ExecuteActivity
	actx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 90 * time.Second,
		RetryPolicy:         llmRetryPolicy,
	})
	for _, q := range queries {
		err := workflow.ExecuteActivity(actx, callLLMActivity, q).Get(ctx, nil)
		// Assumes callLLMActivity reports a 429 as an ApplicationError
		// with type "RateLimitExhausted"
		var appErr *temporal.ApplicationError
		if errors.As(err, &appErr) && appErr.Type() == "RateLimitExhausted" {
			// All 4 attempts hit rate limits: back off at workflow level.
			// One timer, however long the wait, instead of more activity attempts
			if sleepErr := workflow.Sleep(ctx, 90*time.Second); sleepErr != nil {
				return sleepErr
			}
			// Retry this specific query after the sleep
			err = workflow.ExecuteActivity(actx, callLLMActivity, q).Get(ctx, nil)
		}
		if err != nil {
			return fmt.Errorf("query %q failed after retries: %w", q, err)
		}
	}
	return nil
}
The key insight is that the cost of workflow.Sleep does not grow with the length of the wait. A workflow sleeping for 90 seconds generates one TimerStarted and one TimerFired event, two actions total regardless of the sleep duration. Four activity attempts generate 8+ actions. For a rate-limited activity that needs to wait for a quota reset, sleeping at the workflow level is significantly cheaper than retrying at the activity level.
Python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError, ApplicationError

LLM_RETRY_POLICY = RetryPolicy(
    initial_interval=timedelta(seconds=10),
    backoff_coefficient=1.5,
    maximum_interval=timedelta(seconds=60),
    maximum_attempts=4,
    non_retryable_error_types=[
        "LLMContextLengthExceeded",
        "LLMInvalidAPIKey",
        "LLMContentPolicyViolation",
    ],
)


@workflow.defn
class ResearchWorkflowGuarded:
    @workflow.run
    async def run(self, queries: list[str]) -> str:
        results = []
        for q in queries:
            try:
                result = await self._call_llm(q)
            except ActivityError as e:
                # The activity's own error arrives as the cause of ActivityError.
                # Assumes the activity raises ApplicationError(type="RateLimitExhausted").
                cause = e.cause
                if not (isinstance(cause, ApplicationError) and cause.type == "RateLimitExhausted"):
                    raise
                # Wait at workflow level: one timer is cheap vs activity retries
                await workflow.sleep(timedelta(seconds=90))
                result = await self._call_llm(q)
            results.append(result)
        return "\n".join(results)

    async def _call_llm(self, q: str) -> str:
        return await workflow.execute_activity(
            call_llm_activity,
            q,
            start_to_close_timeout=timedelta(seconds=90),
            retry_policy=LLM_RETRY_POLICY,
        )
Failure Mode 3: Child Workflow Fan-Out Runaway
Fan-out in Temporal is implemented with workflow.ExecuteChildWorkflow (Go) or workflow.execute_child_workflow (Python). The pattern is common for parallel research: a parent workflow asks an LLM to decompose a query into N sub-queries, then spawns N child workflows to process them in parallel.
The failure mode is that N is determined by LLM output. LLMs asked to "generate all relevant sub-queries" produce variable-length lists, and prompt wording changes significantly affect the count. "Generate sub-queries for a comprehensive literature review on transformer architectures" might return 12 sub-queries in one run and 87 sub-queries in another, depending on the model's verbosity. Without a ceiling, both are launched as concurrent child workflows:
Go
// DANGEROUS: fan-out count determined by LLM, no ceiling
func ResearchOrchestratorWorkflow(ctx workflow.Context, topic string) error {
var subQueries []string
// LLM generates the fan-out width — no ceiling enforced
err := workflow.ExecuteActivity(ctx,
generateSubQueriesActivity, topic,
).Get(ctx, &subQueries)
if err != nil {
return err
}
// subQueries could be 5 or 500 — all launched concurrently
futures := make([]workflow.Future, len(subQueries))
for i, q := range subQueries {
cwo := workflow.ChildWorkflowOptions{
WorkflowID: fmt.Sprintf("research-%s-%d", topic, i),
}
futures[i] = workflow.ExecuteChildWorkflow(
workflow.WithChildOptions(ctx, cwo),
ResearchSubWorkflow, q,
)
}
// Wait for all — if 500 children launched, 500 workflow starts billed
for _, f := range futures {
if err := f.Get(ctx, nil); err != nil {
return err
}
}
return nil
}
At Temporal Cloud action pricing, each child workflow start is 2 actions (WorkflowExecutionStarted + first WorkflowTaskScheduled). Five hundred child workflows = 1,000 actions for the fan-out alone, before any activity work. If each child workflow runs 10 LLM activities (2 actions each, so 20 actions per child), the total for one parent run is 1,000 + 500 × 20 = 11,000 actions, roughly $0.28 at $25/million, purely in Temporal overhead, for a single orchestration run.
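Because the fan-out width comes from the LLM, it helps to see how the action count moves with it. The sketch below applies this guide's own per-child accounting (2 actions per child start, 2 per activity) to the 12-versus-87 sub-query spread mentioned earlier; the function names are hypothetical and the rate is the guide's $25/million figure.

```python
ACTIONS_PER_CHILD_START = 2   # guide's figure for starting one child workflow
ACTIONS_PER_ACTIVITY = 2      # guide's figure: schedule + completion
USD_PER_MILLION_ACTIONS = 25.0


def fan_out_actions(children: int, activities_per_child: int) -> int:
    """Total actions for one parent run with `children` child workflows."""
    start_actions = children * ACTIONS_PER_CHILD_START
    activity_actions = children * activities_per_child * ACTIONS_PER_ACTIVITY
    return start_actions + activity_actions


def fan_out_usd(children: int, activities_per_child: int) -> float:
    """Cost of those actions in USD at the assumed rate."""
    return fan_out_actions(children, activities_per_child) * USD_PER_MILLION_ACTIONS / 1_000_000


if __name__ == "__main__":
    for width in (12, 87):
        actions = fan_out_actions(width, activities_per_child=10)
        print(f"{width} sub-queries: {actions} actions, ${fan_out_usd(width, 10):.4f}")
```

The cost scales linearly with the width the LLM happens to return, which is the argument for clamping that width before any child is launched.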
The fix enforces a hard ceiling on fan-out width and uses a semaphore to limit concurrency within that ceiling. Both must be present: the ceiling prevents unbounded launch cost, and the semaphore prevents simultaneous LLM quota exhaustion across all children:
Go
const (
MaxFanOutWidth = 20 // hard ceiling regardless of LLM output
MaxConcurrentLLMs = 5 // semaphore: max simultaneous child workflows
)
func ResearchOrchestratorWorkflowGuarded(ctx workflow.Context, topic string) error {
var subQueries []string
err := workflow.ExecuteActivity(ctx,
generateSubQueriesActivity, topic,
).Get(ctx, &subQueries)
if err != nil {
return err
}
// Enforce ceiling — truncate LLM output to max width
if len(subQueries) > MaxFanOutWidth {
workflow.GetLogger(ctx).Warn("fan-out ceiling applied",
"llm_count", len(subQueries),
"ceiling", MaxFanOutWidth,
)
subQueries = subQueries[:MaxFanOutWidth]
}
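// Semaphore channel, buffered so completion notices never block:
// an unbuffered channel would leave the last senders waiting forever
sem := workflow.NewBufferedChannel(ctx, MaxFanOutWidth)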
futures := make([]workflow.Future, 0, len(subQueries))
inFlight := 0
for i, q := range subQueries {
// Block if at concurrency ceiling
for inFlight >= MaxConcurrentLLMs {
sem.Receive(ctx, nil)
inFlight--
}
q := q // capture
i := i
cwo := workflow.ChildWorkflowOptions{
WorkflowID: fmt.Sprintf("research-%s-%d", topic, i),
}
f := workflow.ExecuteChildWorkflow(
workflow.WithChildOptions(ctx, cwo),
ResearchSubWorkflow, q,
)
futures = append(futures, f)
inFlight++
// Signal semaphore when child completes
workflow.Go(ctx, func(ctx workflow.Context) {
f.Get(ctx, nil)
sem.Send(ctx, nil)
})
}
// Drain remaining futures
for _, f := range futures {
if err := f.Get(ctx, nil); err != nil {
return err
}
}
return nil
}
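The same two guards, a hard ceiling and a concurrency limit, can be demonstrated in plain asyncio without a Temporal server. This is a sketch of the control flow only: `asyncio.sleep` stands in for a child workflow, and the function name and peak-tracking counter are illustrative additions.

```python
import asyncio

MAX_FAN_OUT_WIDTH = 20  # hard ceiling regardless of LLM output
MAX_CONCURRENT = 5      # at most this many sub-tasks in flight


async def run_bounded(sub_queries: list[str]) -> tuple[list[str], int]:
    """Process at most MAX_FAN_OUT_WIDTH queries, MAX_CONCURRENT at a time.

    Returns the results and the peak number of simultaneous tasks observed.
    """
    clamped = sub_queries[:MAX_FAN_OUT_WIDTH]  # ceiling applied before any launch
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    in_flight = 0
    peak = 0

    async def worker(q: str) -> str:
        nonlocal in_flight, peak
        async with sem:  # blocks while MAX_CONCURRENT workers hold the semaphore
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a child workflow
            in_flight -= 1
            return f"done:{q}"

    results = await asyncio.gather(*(worker(q) for q in clamped))
    return list(results), peak


if __name__ == "__main__":
    queries = [f"q{i}" for i in range(50)]  # pretend the LLM returned 50 items
    results, peak = asyncio.run(run_bounded(queries))
    print(len(results), "processed; peak concurrency", peak)
```

The ceiling bounds total spend and the semaphore bounds instantaneous load, so removing either one reopens one of the two problems described above.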
Alternative to child workflows for bounded fan-out: For fan-out widths under 10, consider running sub-tasks as parallel activities (workflow.Go + workflow.ExecuteActivity) rather than child workflows. Activities share the parent workflow's history instead of creating separate workflow histories — cheaper in storage and action count for small fan-out. Reserve child workflows for tasks that need independent durability (long-running, may outlive the parent's reasonable wait window).
Failure Mode 4: ContinueAsNew Neglect
Temporal workflows are designed to run for days, months, or indefinitely — a chat agent that stays alive across user sessions, a monitoring workflow that checks API status every 5 minutes. The mechanism for long-running workflows to stay healthy is workflow.ContinueAsNew: it terminates the current workflow execution and immediately starts a fresh execution with a new empty history, passing any state you choose as the initial input. The workflow appears continuous from the outside, but internally the history is reset.
Without ContinueAsNew, the workflow history grows forever. Temporal enforces a hard limit of 50,000 history events and a configurable maximum history size (default 50MB in the server configuration). Hitting either limit terminates the workflow with WORKFLOW_MAX_HISTORY_SIZE_LIMIT_EXCEEDED — an unrecoverable error that drops whatever work was in progress.
For AI agent workflows, the practical problem is replay cost, which degrades long before the hard limit. A support bot workflow that processes 100 customer messages, each triggering 3 LLM activities with 3KB average responses, accumulates:
- 100 messages × 3 activities × 3 events/activity = 900 activity events
- ~200 workflow task events for orchestration
- 300 ActivityTaskCompleted payloads × 3KB ≈ 0.9MB of serialized LLM text in history
- Total: ~1,100 events, ~1MB of history data
Each time this workflow processes a new customer message (woken by a signal), the worker must load and replay all 1,100 events and ~1MB of data before doing anything. At 1,000 customer interactions per day across 500 active sessions, this replay overhead accumulates to measurable compute cost at the Temporal worker level, independent of any LLM API cost.
Go
// DANGEROUS: support bot workflow that never calls ContinueAsNew
type SupportBotState struct {
	SessionID   string
	History     []Message
	TotalTokens int
}

func SupportBotWorkflow(ctx workflow.Context, state SupportBotState) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 60 * time.Second,
	})
	ch := workflow.GetSignalChannel(ctx, "user-message")
	for { // infinite loop: workflow runs forever without ContinueAsNew
		var msg string
		ch.Receive(ctx, &msg)
		var response string
		err := workflow.ExecuteActivity(ctx, callLLMActivity, state.History, msg).Get(ctx, &response)
		if err != nil {
			return err
		}
		// Record both sides of the turn so the next call sees the full exchange
		state.History = append(state.History,
			Message{Role: "user", Content: msg},
			Message{Role: "assistant", Content: response},
		)
		// In the 3-activities-per-turn scenario above, 100 turns means 300
		// activities and 1,100+ events, with every LLM response kept in history
		// Replay cost per signal: O(events) deserialization + re-execution
	}
}
The fix calls ContinueAsNew after a configurable number of turns, carrying forward only the minimal state needed for the next session — typically a compressed summary of the conversation, not the full history:
Go
const ContinueAsNewAfterTurns = 20

func SupportBotWorkflowGuarded(ctx workflow.Context, state SupportBotState) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 60 * time.Second,
	})
	ch := workflow.GetSignalChannel(ctx, "user-message")
	turnsThisExecution := 0
	for {
		var msg string
		ch.Receive(ctx, &msg)
		var response string
		err := workflow.ExecuteActivity(ctx, callLLMActivity, state.History, msg).Get(ctx, &response)
		if err != nil {
			return err
		}
		state.History = append(state.History,
			Message{Role: "user", Content: msg},
			Message{Role: "assistant", Content: response},
		)
		turnsThisExecution++
		if turnsThisExecution >= ContinueAsNewAfterTurns {
			// Summarize history before ContinueAsNew to preserve context
			var summary string
			err := workflow.ExecuteActivity(ctx, summarizeHistoryActivity, state.History).Get(ctx, &summary)
			if err != nil {
				return err
			}
			// Start fresh execution with summarized state
			// History resets to zero, so replay cost returns to baseline
			freshState := SupportBotState{
				SessionID: state.SessionID,
				History: []Message{
					{Role: "system", Content: "Prior conversation summary: " + summary},
				},
				TotalTokens: state.TotalTokens,
			}
			// Caveat: signals still buffered on ch are not carried over;
			// drain them (ch.ReceiveAsync) into the state first in production
			return workflow.NewContinueAsNewError(ctx, SupportBotWorkflowGuarded, freshState)
		}
	}
}
The workflow.NewContinueAsNewError return value is a special error type that Temporal interprets as a clean handoff, not a failure. The workflow ID is preserved; the workflow appears uninterrupted to external observers. Worker replay cost resets to near-zero on the new execution.
Python
from datetime import timedelta

from temporalio import workflow

MAX_TURNS_PER_EXECUTION = 20


@workflow.defn
class SupportBotWorkflowGuarded:
    def __init__(self) -> None:
        self._turns = 0
        self._history: list[dict] = []
        self._pending_messages: list[str] = []  # filled by the signal handler

    @workflow.signal
    def user_message(self, msg: str) -> None:
        self._pending_messages.append(msg)

    @workflow.run
    async def run(self, state: dict) -> None:
        self._history = state.get("history", [])
        self._session_id = state["session_id"]
        # Messages carried over from the previous execution go first
        self._pending_messages = state.get("pending", []) + self._pending_messages
        while True:
            # Wait for next user message signal
            await workflow.wait_condition(lambda: len(self._pending_messages) > 0)
            user_msg = self._pending_messages.pop(0)
            response = await workflow.execute_activity(
                call_llm_activity,
                args=[self._history, user_msg],
                start_to_close_timeout=timedelta(seconds=60),
            )
            self._history.append({"role": "user", "content": user_msg})
            self._history.append({"role": "assistant", "content": response})
            self._turns += 1
            if self._turns >= MAX_TURNS_PER_EXECUTION:
                # Summarize and ContinueAsNew
                summary = await workflow.execute_activity(
                    summarize_history_activity,
                    self._history,
                    start_to_close_timeout=timedelta(seconds=30),
                )
                fresh_state = {
                    "session_id": self._session_id,
                    "history": [{"role": "system", "content": f"Prior context: {summary}"}],
                    # Carry unprocessed messages forward so none are dropped
                    "pending": list(self._pending_messages),
                }
                workflow.continue_as_new(fresh_state)
Set MAX_TURNS_PER_EXECUTION based on your average activity payload size. The formula is: target history size ÷ (activities per turn × average payload size). For a workflow with 3 activities per turn at 3KB each, targeting a 5MB history maximum: 5,000,000 ÷ (3 × 3,072) ≈ 540 turns. In practice, include a safety margin and target 60–70% of the theoretical maximum (roughly 325–380 turns here); workflows processing user input should never approach the limit during normal operation.
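The formula is simple enough to keep next to the constant it produces. A minimal sketch follows; the function name and the default 65% safety factor are assumptions chosen from the 60–70% range above.

```python
import math


def max_turns(target_bytes: int, activities_per_turn: int,
              avg_payload_bytes: int, safety: float = 0.65) -> int:
    """Turns per execution that keep payload history under `target_bytes`.

    `safety` scales the theoretical maximum down to leave headroom.
    """
    theoretical = target_bytes / (activities_per_turn * avg_payload_bytes)
    return math.floor(theoretical * safety)


if __name__ == "__main__":
    # 3 activities per turn, 3KB payloads, 5MB target
    print("theoretical:", max_turns(5_000_000, 3, 3072, safety=1.0))
    print("with margin:", max_turns(5_000_000, 3, 3072))
```

This only accounts for activity result payloads; event overhead and activity inputs add to the real history size, which is one more reason to keep the margin.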
Composite Guard: TemporalAgentPolicy
All four failure modes interact. A workflow accumulating history (mode 1) that never calls ContinueAsNew (mode 4) will eventually hit the 50,000-event limit. If it also has unlimited activity retries (mode 2) on LLM calls and spawns unbounded child workflows (mode 3), the costs multiply rather than add: every retry and every extra child writes more events into histories that are never reset. The following policy struct enforces all four ceilings consistently across workflow implementations:
Go
package workflowguard

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// TemporalAgentPolicy centralizes all cost-control thresholds for AI agent workflows.
type TemporalAgentPolicy struct {
	// History
	ContinueAsNewAfterTurns int
	// Activity retries
	LLMRetryPolicy *temporal.RetryPolicy
	// Fan-out
	MaxFanOutWidth      int
	MaxConcurrentFanOut int
	// External store for large payloads (call it from activities only,
	// never from workflow code, which must stay free of I/O)
	ResultStore ResultStore
}

func DefaultPolicy(store ResultStore) *TemporalAgentPolicy {
	return &TemporalAgentPolicy{
		ContinueAsNewAfterTurns: 20,
		LLMRetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    10 * time.Second,
			BackoffCoefficient: 1.5,
			MaximumInterval:    60 * time.Second,
			MaximumAttempts:    4,
			NonRetryableErrorTypes: []string{
				"LLMContextLengthExceeded",
				"LLMInvalidAPIKey",
				"LLMContentPolicyViolation",
			},
		},
		MaxFanOutWidth:      20,
		MaxConcurrentFanOut: 5,
		ResultStore:         store,
	}
}

// ActivityOptions returns configured options for an LLM activity.
func (p *TemporalAgentPolicy) ActivityOptions(timeout time.Duration) workflow.ActivityOptions {
	return workflow.ActivityOptions{
		StartToCloseTimeout: timeout,
		RetryPolicy:         p.LLMRetryPolicy,
	}
}

// ClampFanOut enforces the fan-out ceiling and logs a warning if clamping occurred.
func (p *TemporalAgentPolicy) ClampFanOut(ctx workflow.Context, items []string) []string {
	if len(items) <= p.MaxFanOutWidth {
		return items
	}
	workflow.GetLogger(ctx).Warn(
		"TemporalAgentPolicy: fan-out ceiling applied",
		"requested", len(items),
		"ceiling", p.MaxFanOutWidth,
	)
	return items[:p.MaxFanOutWidth]
}

// ShouldContinueAsNew returns true if the workflow should reset its history.
func (p *TemporalAgentPolicy) ShouldContinueAsNew(turnCount int) bool {
	return turnCount >= p.ContinueAsNewAfterTurns
}

// ResultStore interface: implement with S3, Redis, or a database.
type ResultStore interface {
	Put(ctx context.Context, key, value string) error
	Get(ctx context.Context, key string) (string, error)
}
Temporal vs Other Orchestration Frameworks: Cost Profile Comparison
| Dimension | In-process frameworks (LangChain, CrewAI) | Temporal (unguarded) | Temporal (with policy) |
|---|---|---|---|
| Activity result storage | In memory — no persistence cost | Full result serialized to history event | Ref key only — result in external store |
| Retry cost | No per-retry infrastructure charge | 2+ Temporal actions per retry attempt | 4-attempt cap; workflow-level Sleep for quota |
| Fan-out width | Unconstrained (CPU bound) | Unbounded child workflows, each with own history | Ceiling enforced before child launch |
| Long-running cost | Process restart loses state; not designed for weeks | Growing history → O(n) replay cost per signal | ContinueAsNew resets history on schedule |
| Durability | None (in-process state) | Full history replay after crash | Same durability; replay cost bounded by history size |
| Observability | Custom logging only | Temporal UI shows all events and actions | Same + policy violations in structured workflow logs |
Production Checklist
- Audit every activity return type: search your codebase for activities that return string or large structs. Any return value containing LLM output should be replaced with an external store reference. The history payload size is the most impactful single optimization for Temporal AI workflows.
- Set MaximumAttempts on all LLM activities: the default of 0 (unlimited) is never correct for LLM API calls. Start with 4 and measure; only increase if you have evidence that your LLM provider's rate limit window requires more retries to eventually succeed.
- Add NonRetryableErrorTypes for semantic failures: context length exceeded, invalid API key, and content policy violations are not transient. Every retry on these is guaranteed to fail and costs actions with no benefit. Type your errors in activity implementations and register the error type names in the policy.
- Instrument fan-out width in your Temporal metrics: the Temporal Go and Python SDKs support custom metrics via the MetricsHandler. Record LLM-generated list length before clamping; alert when the raw LLM count exceeds your ceiling by more than 2×, which indicates a prompt engineering issue.
- Test ContinueAsNew paths in development: set ContinueAsNewAfterTurns = 2 in test environments to force the code path on every test run. ContinueAsNew bugs (dropped signals, state serialization errors) are invisible until a long-running workflow reaches the threshold in production.
Temporal Cloud vs self-hosted cost model: Self-hosted Temporal (OSS server on your own infrastructure) doesn't charge per-action — you pay for the infrastructure running the Temporal server. The history bloat (mode 1) and ContinueAsNew (mode 4) failure modes still apply because they affect worker compute cost and database storage. Action-count optimization (modes 2 and 3) matters most for Temporal Cloud billing specifically; for self-hosted deployments, prioritize history size reduction and ContinueAsNew discipline instead.
FAQ
Does Temporal compress workflow history? Can I reduce storage cost without changing my code?
Temporal supports configurable data converters that can compress history payloads using zlib or zstd before persistence. Enabling compression at the data converter level can reduce history storage by 60–80% for text-heavy LLM outputs. However, compression doesn't reduce action count (billing is on actions, not storage size on Temporal Cloud) and doesn't reduce the deserialization work on replay — the worker still has to decompress and process all historical events. Compression is a useful supplement but not a substitute for storing large payloads externally.
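The compression step itself is ordinary zlib. The sketch below shows only that step, with hypothetical helper names; it is not Temporal's codec interface, and wiring it into a data converter is left to the SDK documentation.

```python
import zlib


def compress_payload(text: str) -> bytes:
    """Compress UTF-8 text the way a payload codec might before persistence."""
    return zlib.compress(text.encode("utf-8"), level=6)


def decompress_payload(blob: bytes) -> str:
    """Inverse of compress_payload; this work is repeated on every replay."""
    return zlib.decompress(blob).decode("utf-8")


if __name__ == "__main__":
    # Repetitive sample text compresses far better than real LLM output will.
    sample = "The transformer architecture relies on self-attention. " * 80
    blob = compress_payload(sample)
    print(len(sample.encode("utf-8")), "bytes raw")
    print(len(blob), "bytes compressed")
    assert decompress_payload(blob) == sample
```

Measure the ratio on your own outputs before relying on any particular figure, and note that the decompression in `decompress_payload` is exactly the replay work that compression does not remove.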
What happens to in-flight activities when ContinueAsNew fires?
ContinueAsNew is a blocking operation from the workflow's perspective — you call it after all in-flight activities have completed (or been cancelled). The typical pattern is to check the ShouldContinueAsNew condition only at a clean quiescence point (end of a processing turn, after draining a signal queue batch) rather than mid-loop. If you call ContinueAsNew while activities are in flight, those activities complete but their results are discarded — which can cause lost work. Structure your loop to complete the current batch before evaluating whether to continue-as-new.
How do I track total token consumption across ContinueAsNew boundaries?
Pass cumulative token counts in the state struct that ContinueAsNew carries forward. In the example above, the SupportBotState includes a TotalTokens field that accumulates across executions. Each new execution starts from the prior total. Separately, use activity heartbeating with a HeartbeatDetails payload to checkpoint token usage within long activities — this also lets the worker recover the token count if the activity is retried after a worker crash. For Temporal Cloud, correlate your token counts with Temporal Cloud metrics on workflow action counts to build a cost-per-workflow dashboard.
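The carry-forward step can be sketched independently of Temporal. The dataclass below mirrors the shape of SupportBotState; the helper names `record_turn` and `carry_forward` are hypothetical and only illustrate which fields survive the handoff.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    session_id: str
    history: list[dict] = field(default_factory=list)
    total_tokens: int = 0


def record_turn(state: SessionState, user_msg: str, reply: str, tokens_used: int) -> None:
    """Append one exchange and add its token usage to the running total."""
    state.history.append({"role": "user", "content": user_msg})
    state.history.append({"role": "assistant", "content": reply})
    state.total_tokens += tokens_used


def carry_forward(state: SessionState, summary: str) -> SessionState:
    """Build the input for the next execution: summary in, token total kept."""
    return SessionState(
        session_id=state.session_id,
        history=[{"role": "system", "content": f"Prior context: {summary}"}],
        total_tokens=state.total_tokens,  # cumulative across executions
    )


if __name__ == "__main__":
    s = SessionState("session-1")
    record_turn(s, "hi", "hello", tokens_used=120)
    record_turn(s, "more detail please", "sure", tokens_used=80)
    fresh = carry_forward(s, "user greeted the bot and asked for detail")
    print(fresh.total_tokens, "tokens so far;", len(fresh.history), "history entry")
```

The conversation history is replaced by the summary while the token total passes through unchanged, so the counter keeps meaning "tokens for the whole session" no matter how many times the history is reset.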
Can I query the current event count from inside a workflow to decide when to ContinueAsNew?
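Yes, in recent SDK versions. Although workflows run in a sandboxed environment that limits access to runtime metadata to preserve determinism, history length is derived from the history itself and is safe to read. The Go SDK exposes it through workflow.GetInfo(ctx).GetCurrentHistoryLength(), alongside GetContinueAsNewSuggested(), which reports the server's own recommendation; the Python SDK offers workflow.info().get_current_history_length() and workflow.info().is_continue_as_new_suggested(). Counting turns or activity executions yourself, as shown in the examples, remains a simple fallback when you want thresholds expressed in application terms. As an approximation: 1 activity = ~3 history events + payload; 1 signal received = ~2 events; 1 timer = ~2 events. Whichever source you use, leave some margin rather than triggering at the limit.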
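The rule-of-thumb event counts can be turned into a small estimator for sizing a turn counter. This is a sketch built on the approximations above; the function names, the 50,000-event limit taken from this guide, and the 20% budget fraction are assumptions to adjust.

```python
EVENTS_PER_ACTIVITY = 3
EVENTS_PER_SIGNAL = 2
EVENTS_PER_TIMER = 2


def estimate_events(activities: int, signals: int = 0, timers: int = 0) -> int:
    """Approximate history event count from application-level counters."""
    return (
        activities * EVENTS_PER_ACTIVITY
        + signals * EVENTS_PER_SIGNAL
        + timers * EVENTS_PER_TIMER
    )


def should_continue_as_new(estimated_events: int, limit: int = 50_000,
                           budget_fraction: float = 0.2) -> bool:
    """True once the estimate reaches the chosen fraction of the hard limit."""
    return estimated_events >= limit * budget_fraction


if __name__ == "__main__":
    # 100 messages, each arriving as a signal and triggering 3 activities
    events = estimate_events(activities=300, signals=100)
    print(events, "estimated events; continue-as-new:", should_continue_as_new(events))
```

Workflow task events are not counted here, so treat the result as a lower bound and keep the budget fraction conservative.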
How does this compare to Dapr's durable workflow cost failure modes?
The failure modes are structurally similar — both Temporal and Dapr workflows persist execution history and replay it on resume. The key differences are in pricing model and defaults. Dapr workflows on Azure Container Apps bill by vCPU-second of worker runtime; Temporal Cloud bills per action. Dapr's default retry policy is more conservative (3 attempts) than Temporal's (unlimited). Dapr's ContinueAsNew equivalent (ContinueAsNewAsync in the Dapr workflow authoring SDK) follows the same pattern. For teams choosing between the two for AI agent orchestration, the Dapr AI agent cost control guide covers the Dapr-specific nuances.