Dify Cost Control: Loop Detection and Budget Enforcement in Production
Dify is an open-source LLM application development platform — 100,000+ GitHub stars under the langgenius/dify repository — that lets you build AI workflows and agents through a visual canvas interface. You can chain LLM nodes, Tool nodes, Code nodes, Knowledge Retrieval nodes, Iteration nodes, HTTP Request nodes, and Condition nodes together in a Chatflow; or you can drop into Agent mode, which runs a ReAct or Function Calling agent with configurable Max Iterations and a curated tool suite. Both execution modes share the same production risk: they keep calling LLMs and tools until the model decides it's done, and neither ships with a real-time cost guard.
The "Max Iterations" field in Dify's Agent mode is the framework's only built-in protection. It sets a ceiling on how many times the agent loop can cycle before being forcibly stopped. But it counts iterations, not dollars. It cannot detect that consecutive tool calls are semantically identical, that a Chatflow's conversation history has grown large enough to push the model past its effective context window, that an Iteration node is processing a dynamically-growing array that now has 400 items, or that an HTTP Request node inside an agent is compounding with the agent's own retry logic to hammer a failed external service 15 times per session. Those failure modes all complete well within the Max Iterations ceiling — or they exploit holes the counter was never designed to catch.
This post covers four failure modes specific to Dify's architecture and provides complete Python implementations for each one using Dify's Code node. Dify's Code node sandbox runs Python (or JavaScript): the mapped input variables arrive as arguments to a main() function, and that function returns a dict of output variables. This is a clean execution model for guards. Because the sandbox is stateless between invocations, all state that needs to persist across node executions is stored in /tmp/rg_*.json files, which survive within a Dify worker process lifetime. For multi-worker deployments, the Redis REST alternative is shown alongside the file-based version. The final section explains how to call RunGuard's managed API as an alternative to maintaining the guards yourself. For broader context on the framework-agnostic principles behind these patterns, the AI agent cost engineering guide is worth reading first.
How Dify executes agent loops
Dify exposes two primary application types: Chatflow and Agent. They have distinct execution models and distinct failure modes.
Chatflow mode
A Chatflow is a directed acyclic graph of typed nodes that you assemble on a visual canvas. Execution flows left-to-right through the graph, with branching handled by Condition (IF/ELSE) nodes. The node types relevant to cost control are:
- LLM node — sends a prompt to your configured model (OpenAI, Anthropic, Gemini, or a local Ollama model) and returns the completion. Each LLM node has its own model configuration, system prompt, and context variables.
- Tool node — executes a built-in Dify tool (DuckDuckGo search, Google search, Wikipedia, calculator, code interpreter) or a custom tool you've defined. Tool inputs are mapped from upstream node outputs.
- Code node — runs sandboxed Python or JavaScript with a 15-second execution timeout. The code's main() function receives the mapped input variables as arguments and must return a dict of output variables. This is where guards live.
- HTTP Request node — makes outbound HTTP calls to external APIs. Supports GET, POST, PUT, PATCH, DELETE with configurable headers, body, and retry count.
- Iteration node — processes an input array item-by-item, running a sub-workflow for each element. The iteration count is bounded only by the length of the input array.
- Knowledge Retrieval node — queries a Dify Knowledge Base (vector store) and optionally re-ranks results with an LLM.
Conversation variables store the accumulated message history for a Chatflow session. Each new user message appends to the history, and the full history is available to any node downstream via Dify's variable system. There is no automatic truncation.
Agent mode
Dify's Agent mode runs either a ReAct or Function Calling agent. The execution loop mirrors what you'd find in a LangChain AgentExecutor: the agent receives the system prompt, conversation history, and tool schema definitions; it reasons and selects a tool to call (or outputs a final answer); the tool executes and its result is appended to the context; the loop repeats. The Max Iterations setting (default 3, configurable to 10 or higher in Dify's agent settings) is the only loop counter. Dify also supports configuring a maximum token budget and a maximum time-per-run, but these are optional fields that many deployments leave unset.
Available tools in Agent mode include Dify's built-in tools (DuckDuckGo, Wikipedia, calculator, code interpreter) and any custom tools you've defined as OpenAPI-spec tool providers. Each tool call adds one round-trip to the agent loop: the model picks the tool, Dify executes it, the result comes back, and the model decides what to do next.
The gap: Max Iterations counts agent loop cycles. It cannot detect that four consecutive tool calls are searching for the same thing with rephrased queries, that the Chatflow's conversation history has grown to 90,000 tokens and the model is silently losing context, that an Iteration node is about to process a 500-item array at 1 LLM call per item, or that an HTTP Request node and the agent loop are combining their retry counts multiplicatively. None of these failure modes require the iteration counter to reach a suspicious number before causing a significant bill. Max Iterations is a last-resort backstop, not a cost control system.
Failure mode 1: Agent tool call spiral
The most common Dify production incident. A ReAct or Function Calling agent calls a search tool, receives a result that partially satisfies its reasoning goal, and calls the tool again with a slightly rephrased query. The model treats each new query as distinct progress — the phrasing changed, so something must have happened — but the underlying search space doesn't contain a definitive answer. The agent spirals through its Max Iterations budget calling the same tool with cosmetically different arguments:
- Iteration 1: duckduckgo_search(query="dify production configuration 2026")
- Iteration 2: duckduckgo_search(query="dify deployment best practices production")
- Iteration 3: duckduckgo_search(query="best practices for running dify in production")
- Iteration 4: duckduckgo_search(query="dify production environment setup guide")
- ... (continues to Max Iterations)
Each query is syntactically distinct. Max Iterations counts 4 iterations, sees normal progress, and does nothing. But semantically, all four queries are near-identical — the agent is re-searching the same information space with synonym substitutions. With a GPT-4o-class model at current pricing, 10 iterations of this pattern costs more than just the tool calls: each iteration includes the full conversation history plus tool results from all prior iterations, so per-iteration token cost escalates as the spiral continues.
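To see why per-iteration cost escalates, here is a rough model. It assumes each iteration resends a fixed base prompt of base_prompt tokens plus every prior tool result, each result_tokens long; the numbers below are illustrative placeholders, not measured Dify traffic.

```python
def spiral_input_tokens(iterations: int, base_prompt: int, result_tokens: int) -> list[int]:
    """Input tokens sent on each agent iteration when every prior tool result
    stays in context: iteration k (1-based) carries base_prompt + (k - 1) * result_tokens."""
    return [base_prompt + (k - 1) * result_tokens for k in range(1, iterations + 1)]


def total_input_tokens(iterations: int, base_prompt: int, result_tokens: int) -> int:
    # Closed form of the sum above: n*P + R*n*(n-1)/2, i.e. quadratic in n
    n = iterations
    return n * base_prompt + result_tokens * n * (n - 1) // 2


per_iter = spiral_input_tokens(10, base_prompt=2_000, result_tokens=1_500)
print(per_iter[0], per_iter[-1])                             # 2000 15500
print(sum(per_iter), total_input_tokens(10, 2_000, 1_500))   # 87500 87500
```

Under these assumptions, ten iterations send 87,500 input tokens instead of the 20,000 you would pay if context did not grow, and the tenth iteration alone costs 7.75 times the first.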
The detection strategy is Jaccard similarity on normalized, tokenized tool argument strings, evaluated across a 4-call sliding window. If 3 or more pairs within the window score at or above a similarity threshold of 0.72, the agent has entered a spiral. In Dify, the guard lives in a Code node placed before the tool call, using /tmp/rg_spiral_{session_id}_{tool_name}.json for per-session state.
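It is worth checking what token-level Jaccard actually scores for the four example queries. This sketch reuses the guard's normalization (lowercase, strip punctuation, deduplicate and sort words). It exposes the limit of a purely lexical signal: synonym-heavy rephrasings share few exact tokens, so no pair here reaches 0.72. As configured, the guard catches near-verbatim repeats such as reordered words or an added or dropped word. Catching a spiral like the example above would need a lower threshold or a different technique, such as embedding similarity.

```python
import itertools
import json
import re


def normalize_args(args) -> str:
    # Same normalization as the spiral guard: lowercase, strip punctuation, dedupe + sort
    raw = args if isinstance(args, str) else json.dumps(args, sort_keys=True)
    tokens = re.sub(r"[^a-z0-9\s]", " ", raw.lower()).split()
    return " ".join(sorted(set(tokens)))


def jaccard(a: str, b: str) -> float:
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


queries = [
    "dify production configuration 2026",
    "dify deployment best practices production",
    "best practices for running dify in production",
    "dify production environment setup guide",
]
fps = [normalize_args(q) for q in queries]
scores = {(i, j): round(jaccard(fps[i], fps[j]), 3)
          for i, j in itertools.combinations(range(len(fps)), 2)}
print(scores)
print(max(scores.values()))  # 0.5: no pair reaches the 0.72 threshold

# A near-verbatim repeat (reordered words, punctuation) does score 1.0:
print(jaccard(normalize_args("dify production setup guide"),
              normalize_args("guide: setup dify production")))  # 1.0
```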
File-based state (single-worker deployments)
```python
# Dify Code node: Agent Tool Call Spiral Guard
# Place this node BEFORE each Tool node in your Agent workflow.
# Input variables (map from upstream): tool_name (string), tool_args (string or object),
#   session_id (string; e.g. the conversation ID)
# Output variables: spiral_check (string), tool_name (string), tool_args (any)
#
# State is persisted in /tmp/rg_spiral_{session_id}_{tool_name}.json
# This file survives across Code node invocations within the same Dify worker process.
# For multi-worker deployments, see the Redis REST alternative below.
import json
import os
import re
import time

# --- Configuration ---
JACCARD_THRESHOLD = 0.72
WINDOW_SIZE = 4
MIN_HIGH_SIM_PAIRS = 3       # 3+ near-identical pairs in last 4 calls = spiral
SESSION_TTL_SECONDS = 7200   # 2-hour state expiry


def normalize_args(args):
    """Lowercase, strip punctuation, sort words for stable fingerprinting."""
    raw = args if isinstance(args, str) else json.dumps(args, sort_keys=True)
    tokens = re.sub(r'[^a-z0-9\s]', ' ', raw.lower()).split()
    return ' '.join(sorted(set(tokens)))  # deduplicate + sort for order-invariance


def jaccard(a, b):
    """Jaccard similarity of two space-separated token strings."""
    set_a = set(a.split())
    set_b = set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def main(tool_name: str, tool_args, session_id: str = "default") -> dict:
    # Sanitize session_id and tool_name for use in filename
    safe_session = re.sub(r'[^a-zA-Z0-9_-]', '_', session_id)[:64]
    safe_tool = re.sub(r'[^a-zA-Z0-9_-]', '_', tool_name)[:32]
    state_path = f"/tmp/rg_spiral_{safe_session}_{safe_tool}.json"

    # Load existing state or initialize fresh
    state = {"history": [], "updated_at": time.time()}
    if os.path.exists(state_path):
        try:
            with open(state_path, "r") as f:
                state = json.load(f)
        except (json.JSONDecodeError, OSError):
            pass  # corrupted state: start fresh

    # Expire stale state
    if time.time() - state.get("updated_at", 0) > SESSION_TTL_SECONDS:
        state = {"history": [], "updated_at": time.time()}

    history = state.get("history", [])
    fingerprint = normalize_args(tool_args)

    # Append new call to sliding window
    history.append({"fp": fingerprint, "ts": time.time()})
    if len(history) > WINDOW_SIZE:
        history = history[-WINDOW_SIZE:]
    state["history"] = history
    state["updated_at"] = time.time()

    # Persist updated state
    try:
        with open(state_path, "w") as f:
            json.dump(state, f)
    except OSError:
        pass  # non-fatal: guard degrades gracefully if /tmp is unavailable

    # Compute pairwise Jaccard similarities across window
    if len(history) >= 3:
        similarities = []
        for i in range(len(history) - 1):
            for j in range(i + 1, len(history)):
                similarities.append(jaccard(history[i]["fp"], history[j]["fp"]))
        high_sim_pairs = sum(1 for s in similarities if s >= JACCARD_THRESHOLD)
        max_sim = max(similarities) if similarities else 0.0
        if high_sim_pairs >= MIN_HIGH_SIM_PAIRS:
            raise Exception(
                f"[RunGuard] Tool call spiral detected on '{tool_name}': "
                f"{high_sim_pairs} near-identical calls in last {len(history)} invocations "
                f"(max Jaccard similarity: {max_sim:.3f}, threshold: {JACCARD_THRESHOLD}). "
                f"Session: {session_id}. Stopping agent to prevent runaway cost."
            )

    return {
        "spiral_check": "passed",
        "tool_name": tool_name,
        "tool_args": tool_args,
    }
```
Wire this Code node so its tool_name and tool_args inputs are mapped from whatever upstream node determines the tool call, and its outputs pass into the actual Tool node. When the guard raises an exception, Dify surfaces the error to the agent as a tool failure message. Well-prompted agents (GPT-4o, Claude 3.x, Gemini 1.5+) will generally treat an explicit spiral detection message as terminal — they stop retrying and synthesize a response from the partial results they have.
Redis REST alternative for multi-worker deployments
If your Dify deployment runs multiple worker processes (the recommended production configuration uses at least 2–4 Gunicorn workers), /tmp files are not shared between workers. A session that hits worker A on call 1 and worker B on call 2 will see independent state files and never accumulate the spiral history needed to trip the guard. Use the Upstash Redis REST API instead: it needs no native Redis client, only urllib.request from the Python standard library, provided your sandbox is allowed to make outbound network requests:
```python
# Dify Code node: Redis REST-backed spiral guard (multi-worker safe)
# Reads UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN from os.environ.
# If your sandbox does not expose them there, map them into this node as
# input variables and assign them to UPSTASH_URL / UPSTASH_TOKEN instead.
import json
import os
import re
import urllib.parse
import urllib.request

JACCARD_THRESHOLD = 0.72
WINDOW_SIZE = 4
MIN_HIGH_SIM_PAIRS = 3
TTL_SECONDS = 7200
UPSTASH_URL = os.environ.get("UPSTASH_REDIS_REST_URL", "").rstrip("/")
UPSTASH_TOKEN = os.environ.get("UPSTASH_REDIS_REST_TOKEN", "")


def redis_get(key: str) -> list:
    # GET {url}/get/{key} returns {"result": <stored string or null>}
    url = f"{UPSTASH_URL}/get/{urllib.parse.quote(key, safe='')}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {UPSTASH_TOKEN}"})
    with urllib.request.urlopen(req, timeout=3) as resp:
        data = json.loads(resp.read())
    result = data.get("result")
    return json.loads(result) if result else []


def redis_set(key: str, value: list) -> None:
    # POST the whole command as a JSON array to the base URL, so "EX" is parsed
    # as a command option instead of being stored as part of the value.
    body = json.dumps(["SET", key, json.dumps(value), "EX", str(TTL_SECONDS)]).encode()
    req = urllib.request.Request(
        UPSTASH_URL, data=body, method="POST",
        headers={"Authorization": f"Bearer {UPSTASH_TOKEN}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=3):
        pass


def normalize_args(args) -> str:
    raw = args if isinstance(args, str) else json.dumps(args, sort_keys=True)
    tokens = re.sub(r'[^a-z0-9\s]', ' ', raw.lower()).split()
    return ' '.join(sorted(set(tokens)))


def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def main(tool_name: str, tool_args, session_id: str = "default") -> dict:
    safe_session = re.sub(r'[^a-zA-Z0-9_-]', '_', session_id)[:64]
    safe_tool = re.sub(r'[^a-zA-Z0-9_-]', '_', tool_name)[:32]
    redis_key = f"rg:spiral:{safe_session}:{safe_tool}"

    # Shared sliding window of fingerprints, visible to every worker
    history = redis_get(redis_key)
    history.append(normalize_args(tool_args))
    if len(history) > WINDOW_SIZE:
        history = history[-WINDOW_SIZE:]
    redis_set(redis_key, history)

    if len(history) >= 3:
        similarities = [
            jaccard(history[i], history[j])
            for i in range(len(history) - 1)
            for j in range(i + 1, len(history))
        ]
        high_sim_pairs = sum(1 for s in similarities if s >= JACCARD_THRESHOLD)
        if high_sim_pairs >= MIN_HIGH_SIM_PAIRS:
            raise Exception(
                f"[RunGuard] Tool spiral on '{tool_name}': {high_sim_pairs} near-identical "
                f"calls in window of {len(history)} (max sim: {max(similarities):.3f}). "
                f"Session: {session_id}. Stopping agent."
            )

    return {"spiral_check": "passed", "tool_name": tool_name, "tool_args": tool_args}
```
The same Jaccard fingerprinting approach works in any framework that intercepts tool calls before they execute. The LangGraph circuit breaker guide covers implementing an equivalent guard as a LangGraph node that wraps tool execution in a compiled graph, which is worth reading if you run LangGraph agents alongside Dify.
Failure mode 2: Chatflow LLM context accumulation spiral
Dify's Chatflow stores conversation history in conversation variables that persist across turns. The LLM node in a Chatflow receives this history as part of its context on every invocation. For a workflow that runs multiple LLM nodes per user turn — a research flow that calls an LLM for planning, another for search query generation, another for synthesis — the per-turn token cost is: (conversation history tokens) × (number of LLM nodes that receive history) × (LLM pricing per 1K tokens).
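The per-turn formula can be made concrete with a few lines of arithmetic. This sketch counts only the conversation history each LLM node receives (system prompts and output tokens are excluded), and PRICE_PER_1K is a placeholder rate, not a quoted price; substitute your provider's actual input price.

```python
def per_turn_input_cost(history_tokens: int, llm_nodes_with_history: int,
                        price_per_1k_input: float) -> tuple[int, float]:
    """Return (input tokens, input cost) for one user turn:
    history tokens x LLM nodes that receive history x price per 1K tokens."""
    tokens = history_tokens * llm_nodes_with_history
    return tokens, tokens / 1000 * price_per_1k_input


PRICE_PER_1K = 0.0025  # placeholder input rate; replace with your model's pricing
for history in (5_000, 20_000, 80_000):
    tokens, cost = per_turn_input_cost(history, 3, PRICE_PER_1K)
    print(f"history={history:>6}  input tokens/turn={tokens:>7}  cost/turn=${cost:.3f}")
```

Cost per turn grows linearly with history, but because history itself grows every turn, the cumulative cost of a session grows much faster.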
As the conversation history grows, this multiplier compounds. By the time conversation history reaches 80,000 tokens, a workflow with 3 LLM nodes is consuming 240,000 tokens of input on every user turn, before a single output token is generated. The model's effective context window is typically exceeded silently: older turns get dropped (by the Chatflow's memory window or other application-side truncation, since most provider APIs reject an oversized request outright rather than trimming it), the model loses its earlier conclusions, and it starts re-answering sub-questions it already resolved in earlier turns. This produces a spiral where the model re-executes work it already did, adding more tool calls and LLM calls per turn, growing the history faster, and truncating more aggressively on the next turn.
Dify does not expose the current context token count in Chatflow execution. The guard must estimate it. The words-times-1.3 heuristic (1 word ≈ 1.3 tokens for English prose with tool outputs) is accurate enough for a circuit-breaker threshold — you want to trip well before the exact limit, not at it.
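Working the heuristic backwards shows roughly how much text trips each threshold for a 128,000-token window. This uses the same 1.3 factor and the 70% / 85% fractions as the guard below; the word counts are estimates, not tokenizer output.

```python
WORDS_TO_TOKENS = 1.3
SOFT_WARN_FRACTION = 0.70
HARD_STOP_FRACTION = 0.85


def thresholds(context_window: int) -> dict:
    """Token thresholds for a given window and the approximate word counts that reach them."""
    soft = int(context_window * SOFT_WARN_FRACTION)
    hard = int(context_window * HARD_STOP_FRACTION)
    return {
        "soft_tokens": soft,
        "hard_tokens": hard,
        "soft_words": round(soft / WORDS_TO_TOKENS),
        "hard_words": round(hard / WORDS_TO_TOKENS),
    }


print(thresholds(128_000))
# {'soft_tokens': 89600, 'hard_tokens': 108800, 'soft_words': 68923, 'hard_words': 83692}
```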
The guard runs as a Code node early in the Chatflow, before any LLM node executes. It soft-injects a "context full" signal at 70% of the model's window and signals a hard stop at 85% by returning hard_stop = true rather than raising an exception. A Condition node that checks the Code node's hard_stop output then routes the hard stop to an End node:
```python
# Dify Code node: Chatflow Context Accumulation Guard
# Place early in your Chatflow, BEFORE any LLM node.
# Input variables: conversation_history (array of message objects or string),
#                  model_context_window (integer, e.g. 128000 for GPT-4o),
#                  session_id (string)
# Output variables: estimated_tokens (integer), soft_warn (bool), hard_stop (bool),
#                   context_signal (string), guard (string)
#
# Route this node's output through a Condition node:
#   IF hard_stop == true → End node (returns context_signal as final answer)
#   IF soft_warn == true → inject context_signal into next LLM node's system prompt
#   ELSE                 → continue normal workflow path
import json
import os
import re
import time

SOFT_WARN_FRACTION = 0.70   # Warn at 70% of model window
HARD_STOP_FRACTION = 0.85   # Hard stop at 85%
WORDS_TO_TOKENS = 1.3       # Conservative heuristic: 1 word ~ 1.3 tokens
STATE_TTL_SECONDS = 3600    # 1-hour state file expiry
DEFAULT_CONTEXT_WINDOW = 128000


def count_tokens_from_history(history) -> int:
    """Estimate token count from conversation history (list or JSON string)."""
    if isinstance(history, str):
        try:
            history = json.loads(history)
        except (json.JSONDecodeError, TypeError):
            # Treat as raw text
            return int(len(history.split()) * WORDS_TO_TOKENS)
    if not isinstance(history, list):
        history = [history] if history else []
    total_words = 0
    for msg in history:
        if isinstance(msg, dict):
            content = msg.get("content") or msg.get("text") or msg.get("message") or ""
            if isinstance(content, list):
                # Multi-part message (e.g. tool call results)
                content = " ".join(
                    part.get("text", "") if isinstance(part, dict) else str(part)
                    for part in content
                )
            total_words += len(str(content).split())
        elif isinstance(msg, str):
            total_words += len(msg.split())
    return int(total_words * WORDS_TO_TOKENS)


def main(
    conversation_history,
    model_context_window: int = DEFAULT_CONTEXT_WINDOW,
    session_id: str = "default"
) -> dict:
    if not model_context_window or model_context_window <= 0:
        model_context_window = DEFAULT_CONTEXT_WINDOW  # avoid dividing by zero below
    estimated_tokens = count_tokens_from_history(conversation_history)
    soft_threshold = int(model_context_window * SOFT_WARN_FRACTION)
    hard_threshold = int(model_context_window * HARD_STOP_FRACTION)
    pct_of_window = int(estimated_tokens / model_context_window * 100)

    # Track accumulation per session to detect re-answering spirals
    safe_session = re.sub(r'[^a-zA-Z0-9_-]', '_', session_id)[:64]
    state_path = f"/tmp/rg_ctx_{safe_session}.json"
    state = {"snapshots": [], "updated_at": time.time()}
    if os.path.exists(state_path):
        try:
            with open(state_path, "r") as f:
                state = json.load(f)
        except (json.JSONDecodeError, OSError):
            pass
    if time.time() - state.get("updated_at", 0) > STATE_TTL_SECONDS:
        state = {"snapshots": [], "updated_at": time.time()}

    snapshots = state.get("snapshots", [])
    snapshots.append({"tokens": estimated_tokens, "ts": time.time()})
    snapshots = snapshots[-20:]  # keep the 20 most recent snapshots
    state["snapshots"] = snapshots
    state["updated_at"] = time.time()
    try:
        with open(state_path, "w") as f:
            json.dump(state, f)
    except OSError:
        pass

    # Detect re-answering spiral: above the soft threshold, if the last 3 snapshots
    # are all within 5% of their mean, the history has stopped growing across
    # turns and the workflow is spinning in place.
    spiral_detected = False
    if len(snapshots) >= 3 and estimated_tokens > soft_threshold:
        recent = [s["tokens"] for s in snapshots[-3:]]
        avg = sum(recent) / len(recent)
        max_deviation = max(abs(t - avg) / avg for t in recent) if avg > 0 else 0
        spiral_detected = max_deviation < 0.05

    # --- Hard stop: context too full for productive work ---
    if estimated_tokens >= hard_threshold or spiral_detected:
        reason = (
            f"context re-answering spiral detected at ~{estimated_tokens} tokens"
            if spiral_detected
            else f"context at ~{estimated_tokens} estimated tokens "
                 f"({pct_of_window}% of {model_context_window}-token window)"
        )
        context_signal = (
            f"[RunGuard] Hard stop: {reason}. "
            f"Conversation history has grown beyond productive processing range. "
            f"Synthesize a final answer from the results gathered so far and end the session."
        )
        return {
            "estimated_tokens": estimated_tokens,
            "soft_warn": False,
            "hard_stop": True,
            "context_signal": context_signal,
            "guard": "hard_stop",
        }

    # --- Soft warn: approaching limit, inject conservative instruction ---
    if estimated_tokens >= soft_threshold:
        context_signal = (
            f"[Context note: ~{estimated_tokens} estimated tokens of conversation history "
            f"({pct_of_window}% of window). "
            f"Avoid additional tool calls unless strictly necessary. "
            f"Prioritize synthesizing existing results into a final answer.]"
        )
        return {
            "estimated_tokens": estimated_tokens,
            "soft_warn": True,
            "hard_stop": False,
            "context_signal": context_signal,
            "guard": "soft_warn",
        }

    return {
        "estimated_tokens": estimated_tokens,
        "soft_warn": False,
        "hard_stop": False,
        "context_signal": "",
        "guard": "passed",
    }
```
In your Chatflow, add a Condition node immediately after this Code node. Configure it with two branches: hard_stop == true routes to an End node that outputs {{context_signal}} as the final answer; soft_warn == true routes to the main workflow path but threads context_signal into each downstream LLM node's system prompt as an appended instruction. The else branch (no warnings) continues the workflow unmodified.
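The Condition node's branching can also be expressed as a plain function, which is handy for unit-testing the guard's outputs outside Dify. route() is a hypothetical helper that mirrors the three branches described above; it is not Dify's Condition node implementation.

```python
def route(guard_output: dict, system_prompt: str) -> tuple[str, str]:
    """Mirror the IF/ELSE wiring. Returns (branch, text): for the hard-stop branch,
    text is the final answer; otherwise it is the (possibly amended) system prompt."""
    if guard_output.get("hard_stop"):
        return "end", guard_output["context_signal"]
    if guard_output.get("soft_warn"):
        # Append the context note to the downstream LLM node's system prompt
        return "continue", f"{system_prompt}\n\n{guard_output['context_signal']}"
    return "continue", system_prompt


soft = {"hard_stop": False, "soft_warn": True, "context_signal": "[Context note: ...]"}
print(route(soft, "You are a research assistant."))
```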
The 85% hard stop threshold leaves the model enough remaining context to write a coherent final answer; stopping at 100% would leave nothing for the output. The soft warning at 70% can cut 1–3 unnecessary tool calls per turn by making the model aware of its context state, reducing the per-turn cost before the hard stop becomes necessary. For a comparable implementation in a different no-code platform, the n8n cost control guide covers context accumulation guards for n8n's Window Buffer Memory using the same token-estimation approach.
Failure mode 3: Workflow Iteration node runaway
Dify's Iteration node processes an input array, running a sub-workflow for each element. The node is designed for batch operations: summarize each document in a list, translate each sentence in a paragraph, enrich each row in a dataset. The cost profile is straightforward when the input array has a known, bounded size. The failure mode appears when the input array is generated dynamically — typically by an upstream LLM node.
Consider a research Chatflow where an LLM node generates a list of sub-questions to investigate, and the Iteration node runs a search-and-summarize sub-workflow for each sub-question. When the task is specific, the LLM generates 5–8 sub-questions. When the task is vague ("research everything about X"), the LLM generates 30, 50, or sometimes 100+ sub-questions — each of which triggers an LLM + Tool node execution in the Iteration sub-workflow. The total cost scales linearly with the generated list length, and the list length is controlled entirely by a model making a judgment call about how thorough to be.
A second variant: the Iteration node's termination relies on a content-based condition rather than a hard count — for example, "keep processing until the output contains a satisfying answer." If the LLM inside the sub-workflow keeps outputting answers that are flagged as insufficient by a downstream Condition node, the Iteration node keeps running. In Dify's current implementation, an Iteration node will run until either the array is exhausted or a runtime error stops it. There is no built-in maximum iteration count separate from the array length.
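Whatever the stopping condition, the fix for this second variant is to give the loop a hard ceiling as well. Here is a framework-agnostic sketch in Python; process_item and is_sufficient are hypothetical callables standing in for the sub-workflow and the downstream Condition node, not Dify APIs.

```python
from typing import Callable, Iterable, List, Tuple


def run_bounded(items: Iterable, process_item: Callable, is_sufficient: Callable,
                max_iterations: int = 20) -> Tuple[List, str]:
    """Process items until a result is judged sufficient, the input runs out,
    or max_iterations is reached, whichever happens first."""
    results = []
    for count, item in enumerate(items, start=1):
        result = process_item(item)
        results.append(result)
        if is_sufficient(result):
            return results, "sufficient"
        if count >= max_iterations:
            return results, "max_iterations"
    return results, "exhausted"


# A judge that is never satisfied can no longer walk all 500 items:
results, reason = run_bounded(range(500), lambda x: x * 2, lambda r: False, max_iterations=20)
print(len(results), reason)  # 20 max_iterations
```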
The correct guard is a Code node placed immediately before the Iteration node. It receives the LLM-generated array, truncates it at a configurable MAX_ITEMS limit, logs a warning if truncation occurred, and passes a truncated flag downstream so the final synthesis LLM node can acknowledge that it's working with partial results:
```python
# Dify Code node: Iteration Node Runaway Guard
# Place this node IMMEDIATELY BEFORE the Iteration node.
# Input variables: items (array: the LLM-generated list to iterate over),
#                  session_id (string)
# Output variables: items (truncated array), original_count (int),
#                   truncated (bool), truncation_warning (string)
#
# Wire: items output     → Iteration node's array input
#       truncated output → final LLM node's system prompt (so it can say "partial results")
import json
import os
import re
import time

MAX_ITEMS = 20    # Hard ceiling on iteration count
WARN_ABOVE = 10   # Log a soft warning if array exceeds this before truncation


def main(items, session_id: str = "default") -> dict:
    # Normalize input: LLM output may be a JSON string instead of a parsed list
    if isinstance(items, str):
        try:
            items = json.loads(items)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            # Fall back to splitting on newlines (common LLM list format)
            items = [line.strip() for line in items.split("\n") if line.strip()]
    if not isinstance(items, list):
        items = [items] if items is not None else []

    original_count = len(items)
    truncated = original_count > MAX_ITEMS
    truncation_warning = ""

    if original_count > WARN_ABOVE:
        # Log the oversized array to /tmp for observability
        safe_session = re.sub(r'[^a-zA-Z0-9_-]', '_', session_id)[:64]
        log_path = f"/tmp/rg_iter_warn_{safe_session}.json"
        log_entry = {
            "ts": time.time(),
            "session_id": session_id,
            "original_count": original_count,
            "truncated_to": min(original_count, MAX_ITEMS),
            "first_items_preview": items[:3],
        }
        try:
            history = []
            if os.path.exists(log_path):
                with open(log_path, "r") as f:
                    history = json.load(f)
            history.append(log_entry)
            history = history[-50:]  # keep the 50 most recent entries
            with open(log_path, "w") as f:
                json.dump(history, f)
        except (OSError, ValueError):
            pass  # Non-fatal (unwritable or corrupt log); guard proceeds without logging

    if truncated:
        items = items[:MAX_ITEMS]
        truncation_warning = (
            f"[RunGuard] Iteration array truncated: {original_count} items generated, "
            f"processing first {MAX_ITEMS} only to prevent runaway cost. "
            f"Results below are based on a representative sample. "
            f"If comprehensive coverage is required, refine the task scope or increase MAX_ITEMS."
        )

    return {
        "items": items,
        "original_count": original_count,
        "truncated": truncated,
        "truncation_warning": truncation_warning,
    }
```
After the Iteration node completes, add a Condition node that checks truncated == true. If true, route to a final LLM node that has truncation_warning injected into its system prompt — the model can then preface its synthesis with a note like "Note: this analysis covers 20 of the 47 sub-questions generated; remaining questions were not processed due to cost controls." This gives users an honest response and prevents silent result degradation.
Setting MAX_ITEMS = 20 is a conservative default. For workflows where 20 iterations is genuinely too few (large batch document processing, for example), increase the limit and pair it with a per-session budget cap in your RunGuard configuration so total token spend is bounded even if the iteration count is higher. The Flowise cost control guide covers an equivalent pattern for Flowise's loop nodes, which have the same dynamic-array vulnerability.
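Here is a minimal sketch of what a per-session budget cap means in practice, assuming you can obtain input and output token counts after each LLM call. SessionBudget and BudgetExceeded are illustrative names, not RunGuard's API, and the per-1K prices are placeholders.

```python
class BudgetExceeded(Exception):
    pass


class SessionBudget:
    """Accumulate spend for one session and refuse further work past a dollar cap."""

    def __init__(self, cap_usd: float, price_in_per_1k: float, price_out_per_1k: float):
        self.cap_usd = cap_usd
        self.price_in = price_in_per_1k
        self.price_out = price_out_per_1k
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        """Record one completed LLM call; raise once cumulative spend passes the cap."""
        cost = input_tokens / 1000 * self.price_in + output_tokens / 1000 * self.price_out
        self.spent_usd += cost
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"session spend ${self.spent_usd:.4f} exceeds cap ${self.cap_usd:.2f}"
            )
        return self.cap_usd - self.spent_usd  # remaining budget


budget = SessionBudget(cap_usd=0.53, price_in_per_1k=0.0025, price_out_per_1k=0.01)
processed = 0
try:
    for _ in range(200):  # e.g. MAX_ITEMS raised to 200 for a large batch
        # ... one iteration's LLM call runs here, then its usage is recorded:
        processed += 1
        budget.charge(input_tokens=3_000, output_tokens=500)
except BudgetExceeded as exc:
    print(f"stopped after {processed} items: {exc}")
```

Each item here costs $0.0125, so the loop stops on item 43 instead of running all 200. Because charge() runs after a call has already been paid for, spend can overshoot the cap by at most one item's cost; checking the remaining budget before each call is the stricter alternative.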
Failure mode 4: HTTP Request node retry cascade
Dify's HTTP Request node makes outbound API calls and supports configurable retry behavior. When the node is used directly in a Chatflow, retries are a simple multiplier: if the node is allowed 3 attempts (the original request plus its retries) and the endpoint is down, you get 3 HTTP requests before the node fails. When the HTTP Request node is used as a tool inside an Agent, the multiplier compounds. The Agent's own retry logic (trying different tool arguments, or re-issuing the tool call when it receives an error) stacks on top of the HTTP-level retries:
- Agent invocations of the tool before Max Iterations stops it: up to 5
- HTTP-level attempts per agent invocation: up to 3
- Total HTTP requests to the failed endpoint: up to 15 per session
At 15 requests per session, and with many concurrent users hitting a service that's experiencing an incident, your Dify agents become an unintentional DDoS on the already-struggling external service — or accumulate 15× the expected cost on metered APIs. The compound retry problem is not unique to Dify (the CrewAI cost control guide covers the same dynamic in CrewAI's tool execution layer), but Dify's HTTP Request node configuration makes the HTTP-level retry count invisible to the Agent, which makes the compounding especially easy to miss in code review.
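The compounding is plain multiplication, which is exactly why it is easy to miss in review. This worked sketch gives the worst case, counting the original request plus retries as attempts; the 200 concurrent sessions are a hypothetical load, not a measurement.

```python
def worst_case_requests(agent_invocations: int, http_attempts_per_invocation: int,
                        concurrent_sessions: int = 1) -> int:
    """Upper bound on requests reaching a failing endpoint when agent-level
    retries and HTTP-level attempts stack multiplicatively."""
    return agent_invocations * http_attempts_per_invocation * concurrent_sessions


print(worst_case_requests(5, 3))        # 15 per session, matching the list above
print(worst_case_requests(5, 3, 200))   # 3000 across 200 concurrent sessions
print(worst_case_requests(5, 1, 200))   # 1000 with HTTP-level retries disabled
print(worst_case_requests(2, 1, 200))   # 400 if a per-session breaker opens after 2 failures
```

Setting HTTP retries to 0 removes one factor, and a circuit breaker caps the other. The last line holds only until the breaker's reset window expires.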
The two-part fix: set retry count to 0 on any HTTP Request node used as an Agent tool (let the Agent handle retry decisions, not the transport layer), and add a Code node circuit breaker that tracks per-tool consecutive failures in /tmp/rg_cb_{session_id}_{tool_name}.json, trips after 2 consecutive failures, and raises an error whose message the Agent reads as "tool unavailable — do not retry this session." The circuit auto-resets after 60 seconds to allow recovery from transient failures:
```python
# Dify Code node: HTTP Request Circuit Breaker
# Place this node AFTER the HTTP Request node, BEFORE the result returns to the Agent.
# Input variables: tool_name (string), http_status_code (integer, 0 if network error),
#                  http_error (string, empty if success), session_id (string)
# Output variables: circuit_state (string), allowed (bool), error_message (string),
#                   http_status_code (integer), consecutive_fails (integer)
#
# When the circuit opens, this node raises an exception the Agent reads as tool unavailable.
# Agent should see: "tool unavailable — do not retry this session"
# Add this phrase to your Agent's system prompt as an instruction to respect.
import json
import os
import re
import time

MAX_CONSECUTIVE_FAILURES = 2   # Open circuit after this many consecutive failures
CIRCUIT_RESET_SECONDS = 60     # Auto-reset after 60 seconds (allow service recovery)
STATE_TTL_SECONDS = 3600       # Full state expiry after 1 hour


def fresh_state(now: float) -> dict:
    return {
        "circuit_open": False,
        "consecutive_fails": 0,
        "total_calls": 0,
        "total_failures": 0,
        "last_failure_ts": 0,
        "last_failure_code": None,
        "opened_at": None,
        "updated_at": now,
    }


def save_state(path: str, state: dict) -> None:
    try:
        with open(path, "w") as f:
            json.dump(state, f)
    except OSError:
        pass  # non-fatal: the breaker still acts on this invocation


def main(
    tool_name: str,
    http_status_code: int = 200,
    http_error: str = "",
    session_id: str = "default"
) -> dict:
    safe_session = re.sub(r'[^a-zA-Z0-9_-]', '_', session_id)[:64]
    safe_tool = re.sub(r'[^a-zA-Z0-9_-]', '_', tool_name)[:32]
    state_path = f"/tmp/rg_cb_{safe_session}_{safe_tool}.json"
    now = time.time()

    # Load circuit state
    state = fresh_state(now)
    if os.path.exists(state_path):
        try:
            with open(state_path, "r") as f:
                state = json.load(f)
        except (json.JSONDecodeError, OSError):
            pass

    # Expire stale state
    if now - state.get("updated_at", 0) > STATE_TTL_SECONDS:
        state = fresh_state(now)

    # Auto-reset an open circuit after CIRCUIT_RESET_SECONDS
    if state.get("circuit_open") and state.get("opened_at"):
        if now - state["opened_at"] >= CIRCUIT_RESET_SECONDS:
            state["circuit_open"] = False
            state["consecutive_fails"] = 0
            state["opened_at"] = None

    # If circuit is open, refuse immediately
    if state.get("circuit_open"):
        opened_ago = int(now - (state.get("opened_at") or now))
        resets_in = max(0, CIRCUIT_RESET_SECONDS - opened_ago)
        raise Exception(
            f"[RunGuard] Circuit OPEN for tool '{tool_name}': "
            f"{state['consecutive_fails']} consecutive failures. "
            f"Last error: HTTP {state.get('last_failure_code', 'unknown')}. "
            f"Tool unavailable — do not retry this session. "
            f"Circuit resets in {resets_in}s. Session: {session_id}."
        )

    # Record this call outcome. Status 0 means a network error per the input
    # contract above, so it counts as a failure just like any 4xx/5xx status.
    state["total_calls"] = state.get("total_calls", 0) + 1
    is_failure = bool(http_error) or not http_status_code or http_status_code >= 400
    if is_failure:
        state["total_failures"] = state.get("total_failures", 0) + 1
        state["consecutive_fails"] = state.get("consecutive_fails", 0) + 1
        state["last_failure_ts"] = now
        state["last_failure_code"] = http_status_code if http_status_code else "network_error"
    else:
        state["consecutive_fails"] = 0  # Reset on any success

    # Trip the circuit once the consecutive failure threshold is reached
    if state["consecutive_fails"] >= MAX_CONSECUTIVE_FAILURES:
        state["circuit_open"] = True
        state["opened_at"] = now
        state["updated_at"] = now
        save_state(state_path, state)
        raise Exception(
            f"[RunGuard] Circuit opened for tool '{tool_name}': "
            f"{state['consecutive_fails']} consecutive failures "
            f"(threshold: {MAX_CONSECUTIVE_FAILURES}). "
            f"Last HTTP status: {state['last_failure_code']}. "
            f"Tool unavailable — do not retry this session. "
            f"External service appears to be down. Circuit resets in {CIRCUIT_RESET_SECONDS}s."
        )

    state["updated_at"] = now
    save_state(state_path, state)
    return {
        "circuit_state": "closed",
        "allowed": True,
        "error_message": "",
        "http_status_code": http_status_code,
        "consecutive_fails": state["consecutive_fails"],
    }
```
The error message from the raised exception is what the Dify Agent reads as the tool's response. Put the explicit phrases "Tool unavailable" and "do not retry this session" in that message, and mirror them in your Agent's system prompt as a recognized signal ("if any tool returns 'do not retry this session', stop calling that tool immediately"). Together these form a two-layer stop: the circuit breaker refuses further calls, and the model is primed to respect the refusal. The 60-second auto-reset lets the tool recover from a transient external outage without manual intervention.
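To see the state transitions without a Dify sandbox, here is a minimal in-memory sketch of the same consecutive-failure logic. The CircuitBreaker class and its injectable clock are illustrative assumptions, not part of Dify or RunGuard, and where the Code node raises an exception this sketch simply returns "open".

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    """In-memory sketch of a consecutive-failure circuit breaker (illustrative only)."""

    max_consecutive_failures: int = 2
    reset_seconds: float = 60.0
    clock: Callable[[], float] = time.monotonic  # injectable so tests can fake time
    consecutive_fails: int = 0
    opened_at: Optional[float] = None

    def is_open(self) -> bool:
        # An open circuit closes itself once reset_seconds have elapsed.
        if self.opened_at is not None and self.clock() - self.opened_at >= self.reset_seconds:
            self.opened_at = None
            self.consecutive_fails = 0
        return self.opened_at is not None

    def record(self, http_status_code: int = 200, http_error: str = "") -> str:
        """Record one call outcome and return the circuit state: "closed" or "open"."""
        if self.is_open():
            return "open"  # refused; the Code node version raises an exception here
        failed = bool(http_error) or http_status_code >= 400
        self.consecutive_fails = self.consecutive_fails + 1 if failed else 0
        if self.consecutive_fails >= self.max_consecutive_failures:
            self.opened_at = self.clock()
            return "open"  # tripped by this call
        return "closed"


# Simulate a down service with a fake clock.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
print(breaker.record(http_status_code=503))  # closed (1 consecutive failure)
print(breaker.record(http_error="timeout"))  # open (threshold of 2 reached)
print(breaker.record(http_status_code=200))  # open (refused while open)
now[0] = 61.0
print(breaker.record(http_status_code=200))  # closed (auto-reset after 60 s)
```

A single success anywhere in the streak resets the counter, so an intermittently failing service never trips the breaker; only an unbroken run of failures does.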
For comparison, the OpenAI Agents SDK cost control guide covers circuit-breaking for function-called tools in native OpenAI agent code — the same consecutive-failure pattern applies, but the state management uses Python's threading.local() instead of /tmp files.
Wiring RunGuard into Dify
The four Code node guards above are production-ready and immediately deployable. The trade-off is maintenance: you need to add them to every flow that exposes these failure modes, keep threshold constants synchronized across workflows, account for /tmp state being reset when worker processes restart, and keep the Redis REST credentials current in multi-worker deployments. RunGuard provides all four checks as a managed HTTP endpoint you can call from any Dify flow via a single HTTP Request node.
HTTP Request node integration
Add an HTTP Request node at the start of each guarded execution path in your Chatflow or Agent. Configure it as follows:
# Dify HTTP Request node configuration for RunGuard
# Method: POST
# URL: https://runguard.dev/api/guard
# Headers:
#   Content-Type: application/json
#   X-RunGuard-Key: {{ENV.RUNGUARD_API_KEY}}
#
# Body (use Dify's variable expression syntax for dynamic fields; the short
# placeholders in the headers and bodies stand for variables you insert with
# the node's variable picker):
{
  "session_id": "{{sys.conversation_id}}",
  "app_id": "dify-production",
  "tool_name": "{{tool_name}}",
  "tool_args": "{{tool_args}}",
  "guard_types": ["spiral", "http_failure", "budget", "iteration"]
}
# Response handling:
#   HTTP 200: { "allowed": true, "checks": { "spiral": "passed", "budget": "passed" } }
#     → Condition node: allowed == true → proceed to tool execution
#   HTTP 409: { "allowed": false, "reason": "spiral", "detail": "..." }
#     → Condition node: allowed == false → route to End node with detail as final answer
#
# For the Iteration node guard, send array metadata before the Iteration node.
# The HTTP Request node substitutes variables but does not apply filters such as
# "| length", so compute item_count in an upstream Code node (len(items)):
{
  "session_id": "{{sys.conversation_id}}",
  "app_id": "dify-production",
  "guard_types": ["iteration"],
  "iteration_data": {
    "item_count": "{{item_count}}",
    "max_allowed": 20
  }
}
# For the context accumulation guard, send the token estimate:
{
  "session_id": "{{sys.conversation_id}}",
  "app_id": "dify-production",
  "guard_types": ["context"],
  "context_data": {
    "estimated_tokens": "{{estimated_tokens}}",
    "model_context_window": 128000
  }
}
Set the HTTP Request node's "Error Handling" to "Default Value" so that a 409 response still passes its body downstream instead of failing the run. A downstream Condition node then checks the allowed field (parsed from the JSON body): if it is false, route to an End node that returns the detail field as the final answer. The Agent or Chatflow completes gracefully with an explanation rather than a silent error or an unchecked spiral.
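Dify's HTTP Request node hands its response to later nodes as a status code and a body string, so something has to turn the JSON body into a field a Condition node can test. Below is a minimal Code node sketch, assuming the 200/409 response shapes listed in the configuration above and inputs wired as status_code and body; the input and output names are hypothetical wiring, not a RunGuard-defined contract.

```python
import json


def main(status_code: int, body: str) -> dict:
    """Hypothetical Code node placed right after the RunGuard HTTP Request node."""
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}  # ignore unexpected JSON shapes such as a bare list

    # Fail closed: only an explicit allowed=true on HTTP 200 lets the tool run.
    allowed = status_code == 200 and payload.get("allowed") is True
    detail = str(payload.get("detail") or "")
    if not allowed and not detail:
        detail = f"RunGuard check failed (HTTP {status_code}); stopping this session."

    return {
        "allowed": allowed,  # Condition node: true -> tool path, false -> End node
        "reason": str(payload.get("reason") or ""),
        "detail": detail,    # returned as the final answer when blocked
    }
```

This version fails closed: a malformed body or an unexpected status blocks the tool path. If you would rather keep flows running when the guard endpoint itself is unreachable, invert that default deliberately.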
RunGuard maintains persistent state server-side — no /tmp file management, no Redis credentials to rotate, no threshold drift between flows. Trip events are logged to the RunGuard dashboard with the full tool call history, similarity scores, and session timeline. All flows sharing the same app_id are aggregated together, so you can see cost patterns across your entire Dify deployment rather than debugging each flow in isolation.
System prompt addition: For Agent mode, add this instruction to your agent's system prompt to help models respect circuit-breaker signals: "If a tool returns a message containing 'do not retry this session' or 'tool unavailable', stop calling that tool immediately and synthesize a final response from the information you have gathered so far." This makes the circuit breaker semantically meaningful to the model rather than just a hard stop.
FAQ
Does Dify's Max Iterations setting prevent infinite loops?
Partially. Max Iterations is a hard upper bound on the number of agent loop cycles, so the agent always terminates eventually. But it counts cycles, not cost or semantic repetition. An agent with Max Iterations set to 10 that calls a search tool 10 times with near-identical queries burns 10 LLM calls and 10 tool calls before the counter stops it, and each LLM call costs more than the last because it re-sends the growing conversation history. The default Max Iterations value in Dify is 3, which is low enough to limit accidental runaway but also low enough to cut off legitimate multi-step tasks. Teams that raise it to 8 or 10 to accommodate complex workflows inadvertently give spiraling agents more runway. Dify does provide optional per-agent maximum token and time limits, but these fields are often left unset in practice, and they still do not detect the semantic repetition pattern that defines a true tool call spiral. Max Iterations is a last-resort backstop, not a cost management mechanism.
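The escalating per-call cost is easy to underestimate. As a rough sketch, assume each cycle re-sends the full history and appends a fixed number of tokens; the function name and token counts below are hypothetical, and the model ignores output tokens and any provider-side prompt caching.

```python
def cumulative_input_tokens(iterations: int, base_prompt_tokens: int,
                            tokens_added_per_iteration: int) -> int:
    """Total input tokens billed when every agent cycle re-sends the full history.

    Cycle i (1-indexed) sends base_prompt_tokens + (i - 1) * tokens_added_per_iteration,
    so the total is n * base + added * n * (n - 1) / 2: quadratic in the cycle count.
    """
    return sum(
        base_prompt_tokens + (i - 1) * tokens_added_per_iteration
        for i in range(1, iterations + 1)
    )


# Hypothetical numbers: 2,000 tokens of system prompt plus query, and about
# 1,500 tokens of tool call plus search result appended on every cycle.
for max_iterations in (3, 10):
    print(max_iterations, cumulative_input_tokens(max_iterations, 2_000, 1_500))
# 3 10500
# 10 87500
```

Under these assumptions, raising Max Iterations from 3 to 10 multiplies the cycle count by about 3.3 but the input-token bill by about 8.3, which is why a spiral that stays under the ceiling can still be expensive.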
How do I prevent my Dify Chatflow from exceeding OpenAI context limits?
Dify does not automatically truncate conversation history in Chatflow mode: it accumulates until the LLM provider truncates it silently or returns a context length error. The correct approach is to implement the Code node context guard described in Failure Mode 2, which estimates token usage from the conversation history array and hard-stops the workflow at 85% of the configured model window, before the provider makes the truncation decision. Set model_context_window to match your deployed model: 128,000 for GPT-4o, 200,000 for Claude 3.5 Sonnet, 1,000,000 for Gemini 1.5 Pro. The 1.3 tokens-per-word heuristic is a rough English-prose average rather than a guaranteed overestimate (code and non-English text often produce more tokens per word), so it is the 15% headroom below the window that makes the guard trip before you hit the limit rather than at it. For sessions where a growing conversation history is expected and necessary (long research tasks), route the hard-stop to a "summarize and continue" LLM node rather than a terminal End node: have a high-quality model summarize the conversation so far into 2,000–3,000 tokens, replace the conversation history variable with the summary, and continue. This resets the token counter while preserving session continuity.
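The continue-versus-summarize decision can be sketched as a Code node. This is a minimal illustration, not the Failure Mode 2 guard itself: the messages input (conversation history as a list of strings) and the action output, which a downstream Condition node would route on, are assumptions.

```python
import math


def main(messages: list, model_context_window: int = 128_000) -> dict:
    """Hypothetical Code node: estimate history size and pick a route."""
    tokens_per_word = 1.3  # rough English-prose average, not a real tokenizer
    threshold = 0.85       # trip at 85% of the model's context window

    words = sum(len(str(m).split()) for m in messages)
    estimated_tokens = math.ceil(words * tokens_per_word)
    limit = int(model_context_window * threshold)

    return {
        "estimated_tokens": estimated_tokens,
        "limit": limit,
        # "summarize" routes to an LLM node that compresses the history to
        # roughly 2,000-3,000 tokens; "continue" proceeds to the normal LLM node.
        "action": "summarize" if estimated_tokens >= limit else "continue",
    }
```

Routing on an action string rather than raising an exception keeps the session alive: the summary replaces the history variable, and the next turn starts well under the limit.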
Can I use Dify's built-in retry settings to avoid the HTTP retry multiplication problem?
The built-in retry settings make the problem worse, not better, when HTTP Request nodes are used inside Agent mode. The HTTP Request node's retry count controls how many times the transport layer retries a failed request before reporting a failure to the node. With retries set to 3, every failing agent-level tool invocation is sent up to four times (the original request plus three retries) before the Agent sees a failure. The Agent then decides independently whether to retry the tool call, so the two layers multiply. The correct mitigation is to set HTTP retries to 0 on any HTTP Request node used as an Agent tool and use the Code node circuit breaker described in Failure Mode 4 to govern agent-level retries. That leaves one retry decision point (the agent's own reasoning about whether to retry) with a hard consecutive-failure ceiling, instead of multiplicative retries at two independent layers.
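A worst-case count makes the multiplication concrete. This sketch assumes the external service fails on every attempt and that each failed attempt triggers an HTTP-level retry; both function names are hypothetical.

```python
def upstream_requests(agent_attempts: int, http_retries: int) -> int:
    """Worst-case requests reaching a down service: each agent-level tool call
    is sent once, plus http_retries more times by the HTTP layer."""
    return agent_attempts * (1 + http_retries)


def agent_attempts_with_breaker(max_iterations: int, max_consecutive_failures: int) -> int:
    """Against a service that always fails, the breaker opens after
    max_consecutive_failures calls, so later attempts never reach the service."""
    return min(max_iterations, max_consecutive_failures)


# Two retry layers, no breaker: 10 agent attempts x 4 HTTP sends each.
print(upstream_requests(10, 3))  # 40
# HTTP retries set to 0 and a breaker threshold of 2.
print(upstream_requests(agent_attempts_with_breaker(10, 2), 0))  # 2
```

The breaker caps the product at its threshold no matter how high Max Iterations is set, which is the point of keeping a single retry layer.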
Does the Code node sandbox timeout stop runaway costs?
Dify's Code node has a 15-second sandbox execution timeout that stops individual Code node executions from running indefinitely. This protects against code bugs that create infinite loops within a single Code node execution — for example, a while loop that never terminates. It does not protect against the inter-node loop failures described in this post. A tool call spiral is not caused by runaway code inside a Code node; it's caused by the Agent's reasoning loop making repeated external calls across many node executions, each of which completes well within 15 seconds. Similarly, context accumulation builds across conversation turns over minutes or hours, not within a single 15-second execution window. The Code node timeout is a sandbox safety mechanism, not a cost control mechanism. The guards in this post work precisely because they implement state that persists across Code node invocations — the timeout governs each invocation individually, while the guards reason about patterns across invocations.
What's the difference between a Dify Chatflow loop and a Dify Agent loop?
A Dify Chatflow is a directed acyclic graph — by design, it has no cycles. Execution flows forward through the node graph, and while an Iteration node can run its sub-workflow many times, the overall Chatflow graph does not loop back on itself. "Chatflow loops" in this post refer to patterns that behave like loops even without graph cycles: the context accumulation spiral (the Chatflow runs normally each turn, but the growing history causes the model to repeat work across turns) and the Iteration node runaway (a linear node that runs its sub-workflow hundreds of times). A Dify Agent loop is a true execution loop: the ReAct or Function Calling agent iterates, calling tools and updating its context on each iteration, until it outputs a final answer or hits Max Iterations. The tool call spiral (Failure Mode 1) is the primary Agent loop failure mode. The key practical difference is cost structure: in a Chatflow spiral, per-turn cost escalates as context grows, and the damage accumulates across many user messages over a session. In an Agent spiral, per-session cost escalates within a single user message, and the damage concentrates in one session. Both are expensive in different ways — Agent spirals hit harder and faster; Chatflow spirals are slower but often invisible until a weekly billing summary arrives.