Retool AI Agents & Workflows Cost Control: Subworkflow Recursion, Query Fan-Out, Retry Storms, and DB Change Loops
Retool's billing for Workflows is per workflow run. One trigger, one run, one charge — that model is clean and predictable for internal-tool automations that do a fixed sequence of database reads and writes. The moment you wire an AI component into a Retool Workflow, that predictability breaks in four distinct ways.
Retool Workflows support a "Run Workflow" step that lets one workflow call another. A workflow that loops over a list of items and calls a child workflow for each item creates one billed run per item plus the parent run. That is the intended use. The failure mode emerges when an AI orchestrator workflow calls child workflows based on dynamic output — an LLM returns a list of subtasks and the workflow calls itself or a child for each subtask. A list of 20 subtasks from a single trigger creates 21 billed runs before a second-level fan-out begins.
Four failure modes make Retool particularly tricky for AI workloads:
- Subworkflow recursion amplification — Retool's "Run Workflow" step allows workflows to call child workflows per AI-generated list item; a two-level fan-out with 20 items at each level creates 421 billed runs from a single trigger without any warning in the Retool editor.
- AI query fan-out from component events — AI queries set to auto-run or wired to component
onChangeevents fire independently for each rendered row in a listView or table; a 200-row table with a per-row AI annotation query runs 200 AI calls per page load. - Retry storm from shared AI API rate limits — Retool Workflows retry failed steps up to the configured maximum; when concurrent runs all hit the same rate-limited AI provider simultaneously, all retry at the same interval, doubling the total call volume on the already-saturated endpoint.
- Database change trigger loops — Workflows triggered by Retool DB or Postgres row changes that write AI-processed output back to the same table re-trigger themselves on every write; the loop is invisible from either workflow's editor because each workflow only sees its own trigger source.
Failure Mode 1 — Subworkflow Recursion Amplification
Retool Workflows have a "Run Workflow" step that accepts a workflow ID and a payload. It is designed for modular workflow composition: a parent workflow handles routing and a child workflow handles per-item processing. That pattern is sound when the item count is bounded by a fixed data source — a table with a known row count, a form submission with a fixed field set.
When an AI step generates the list that drives subworkflow invocation, the item count is unbounded. A task decomposition prompt asked to "break this project into subtasks" on a complex request might return 5 subtasks on a simple request and 30 subtasks on a complex one. If the parent workflow calls a child workflow for each subtask, that is 31 billed runs for the complex case (1 parent + 30 children) versus 6 for the simple case. If each child workflow also calls a grandchild workflow to execute each subtask step, and each subtask has 4 steps on average, the complex case becomes 1 + 30 + 30×4 = 151 billed runs from a single trigger.
Retool does not display a run count preview before execution. The Retool editor shows you the workflow graph — the "Run Workflow" step looks like any other step. There is no warning that this step will spawn N child runs based on AI output. The billing impact appears in your usage dashboard after the runs complete, not before.
The subworkflow rule: Billed runs from a Retool Workflow that calls child workflows = 1 (parent) + N (children) + N×M (grandchildren). When N is determined by AI output, measure actual AI output sizes across 20 real inputs before estimating monthly run consumption. The typical underestimate for AI-driven subworkflow invocation is 5–20×.
The guard pattern for subworkflow recursion is a budget-check webhook called at the start of the parent workflow that tracks total spawned child runs in the current session and blocks new invocations when the ceiling is reached. Child workflows call the same endpoint to check their generation depth before calling grandchildren:
import time
import sqlite3
import threading
from flask import Flask, request, jsonify
app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "retool_workflow_guard.db"
def init_db():
with sqlite3.connect(DB_PATH) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS workflow_sessions (
session_id TEXT PRIMARY KEY,
trigger_id TEXT,
parent_session_id TEXT,
depth INTEGER DEFAULT 0,
children_spawned INTEGER DEFAULT 0,
started_at REAL,
status TEXT DEFAULT 'running'
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS workflow_config (
key TEXT PRIMARY KEY,
max_depth INTEGER DEFAULT 2,
max_children_per_session INTEGER DEFAULT 20,
max_total_runs_per_trigger INTEGER DEFAULT 50,
session_ttl_seconds INTEGER DEFAULT 3600
)
""")
conn.execute(
"INSERT OR IGNORE INTO workflow_config (key) VALUES ('default')"
)
class RetoolWorkflowBudget:
"""
Tracks subworkflow depth and total child runs per trigger session.
Blocks new child invocations when depth or child count ceilings are reached.
Call /workflow/start at the beginning of every workflow run.
Call /workflow/spawn before each 'Run Workflow' step to check remaining budget.
Call /workflow/complete when a workflow run finishes.
"""
@staticmethod
def start_session(session_id: str, trigger_id: str,
parent_session_id: str = "") -> dict:
now = time.time()
with db_lock:
with sqlite3.connect(DB_PATH) as conn:
config = conn.execute(
"SELECT max_depth, max_total_runs_per_trigger, session_ttl_seconds "
"FROM workflow_config WHERE key = 'default'"
).fetchone()
max_depth = config[0] if config else 2
max_total = config[1] if config else 50
ttl = config[2] if config else 3600
# Clean up stale sessions
conn.execute(
"DELETE FROM workflow_sessions WHERE started_at < ?",
(now - ttl,)
)
depth = 0
if parent_session_id:
parent = conn.execute(
"SELECT depth FROM workflow_sessions WHERE session_id = ?",
(parent_session_id,)
).fetchone()
if parent:
depth = parent[0] + 1
if depth > max_depth:
return {
"allow": False,
"reason": "max_depth_exceeded",
"depth": depth,
"ceiling": max_depth,
"session_id": session_id,
"message": (
f"Subworkflow depth {depth} exceeds ceiling of {max_depth}. "
"Blocking to prevent recursive run amplification. "
"Review the AI task-decomposition prompt to cap subtask nesting."
),
}
# Count total runs under this trigger
total_runs = conn.execute(
"SELECT COUNT(*) FROM workflow_sessions WHERE trigger_id = ?",
(trigger_id,)
).fetchone()[0]
if total_runs >= max_total:
return {
"allow": False,
"reason": "total_runs_ceiling",
"total_runs": total_runs,
"ceiling": max_total,
"trigger_id": trigger_id,
"message": (
f"Total runs for trigger {trigger_id!r} reached ceiling of "
f"{max_total} (current: {total_runs}). "
"Blocking further child workflow invocations."
),
}
conn.execute(
"INSERT OR REPLACE INTO workflow_sessions "
"(session_id, trigger_id, parent_session_id, depth, started_at) "
"VALUES (?, ?, ?, ?, ?)",
(session_id, trigger_id, parent_session_id, depth, now)
)
return {
"allow": True,
"session_id": session_id,
"depth": depth,
"max_depth": max_depth,
"total_runs": total_runs + 1,
"max_total_runs": max_total,
}
@staticmethod
def complete_session(session_id: str) -> dict:
with db_lock:
with sqlite3.connect(DB_PATH) as conn:
conn.execute(
"UPDATE workflow_sessions SET status = 'complete' "
"WHERE session_id = ?",
(session_id,)
)
return {"ok": True, "session_id": session_id}
@app.route("/workflow/start", methods=["POST"])
def workflow_start():
data = request.get_json(force=True)
result = RetoolWorkflowBudget.start_session(
session_id=data.get("session_id", ""),
trigger_id=data.get("trigger_id", ""),
parent_session_id=data.get("parent_session_id", ""),
)
return jsonify(result), 200 if result["allow"] else 429
@app.route("/workflow/complete", methods=["POST"])
def workflow_complete():
data = request.get_json(force=True)
return jsonify(RetoolWorkflowBudget.complete_session(
session_id=data.get("session_id", "")
))
if __name__ == "__main__":
init_db()
app.run(port=8082)
Add an HTTP step as the first step in every workflow that participates in a subworkflow hierarchy. Pass a session_id (generate from Retool's {{moment()}} + a random suffix), the trigger_id of the root event (passed as a workflow parameter from the parent), and the parent_session_id when the workflow was invoked by a parent (also passed as a parameter). Add a Retool Condition step immediately after the HTTP step to check the allow field — if false, return early with the block message. Add a final HTTP step calling /workflow/complete at the end of each workflow run. Set max_depth to the number of subworkflow levels your design intentionally uses and max_total_runs_per_trigger to your monthly run budget divided by expected trigger frequency.
Failure Mode 2 — AI Query Fan-Out from Component Events
Retool apps support JavaScript queries that call external resources — REST APIs, GraphQL endpoints, and Retool's native AI resource (which proxies to OpenAI, Anthropic, or configured custom LLM endpoints). These queries can be configured to run automatically: on page load, on component mount, or on component onChange events.
The fan-out failure mode emerges from Retool's listView and table components. A listView renders a configurable template once per data row — each instance is a separate React component subtree with its own query execution context. If a listView contains an AI query set to "run on component mount" (or wired to the row's currentSourceRow via onChange), that query executes independently for each rendered row. A listView with 200 rows runs the AI query 200 times when the page loads — and again 200 times if the user refreshes, navigates away and back, or the underlying data refreshes.
This failure mode is invisible from Retool's query editor. The query appears once in the editor's left panel. The per-row execution is a runtime behavior of the listView component's mount cycle. Engineers commonly discover it by noticing an unexplained spike in their LLM provider's API usage dashboard, not in Retool's workflow run counter (because listView queries are app queries, not workflow runs — they fire from the browser using the user's session credentials against the Retool AI resource).
The listView rule: An AI query inside a Retool listView or triggered by a table's row-selection event runs once per rendered row per trigger event. A 200-row listView with one AI query on mount = 200 AI calls per page load. If five team members open this app simultaneously, that is 1,000 AI calls before anyone has clicked anything.
The guard pattern for AI query fan-out is a JavaScript wrapper injected into the Retool query's pre-execution hook that tracks per-session query counts against a shared rate limit endpoint. Retool queries support a "Run before" transformer step in the query's JavaScript tab — use this to check the guard before the actual AI API call fires:
// Retool query JavaScript tab — paste into the query's "Transformer" section
// or as a separate JS query that wraps your AI resource query.
// This runs in the Retool app's browser context.
const GUARD_URL = "https://your-guard-host/ai-query/check";
const SESSION_KEY = "retool_ai_query_session";
const MAX_PER_SESSION = 25; // max AI calls per browser session per app
async function checkAIQueryBudget(queryName, rowIndex) {
// Use sessionStorage to avoid re-initialising session ID on every query run
let sessionId = sessionStorage.getItem(SESSION_KEY);
if (!sessionId) {
sessionId = `${Date.now()}-${Math.random().toString(36).slice(2, 9)}`;
sessionStorage.setItem(SESSION_KEY, sessionId);
}
const response = await fetch(GUARD_URL, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
session_id: sessionId,
query_name: queryName,
row_index: rowIndex ?? null,
app_id: currentUser?.email ?? "unknown", // Retool currentUser global
}),
});
const result = await response.json();
if (!result.allow) {
// Return a cached/empty result instead of calling the AI API
return {
blocked: true,
reason: result.reason,
message: result.message,
// Return a safe default so the listView item renders without crashing
summary: "[AI summary unavailable — per-session query limit reached]",
};
}
return null; // null = proceed with the AI API call
}
// Usage: call at the top of your AI resource query's transformer
const guardResult = await checkAIQueryBudget("row_ai_annotation", currentRow?.index);
if (guardResult?.blocked) {
return guardResult;
}
// ... rest of your AI query logic follows
import time
import sqlite3
import threading
from flask import Flask, request, jsonify
app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "retool_ai_query_guard.db"
def init_db():
with sqlite3.connect(DB_PATH) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS query_events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
session_id TEXT NOT NULL,
query_name TEXT,
row_index INTEGER,
app_id TEXT,
fired_at REAL
)
""")
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_session ON query_events (session_id)"
)
MAX_PER_SESSION = 25
SESSION_WINDOW = 3600 # 1 hour rolling window per session
class RetoolAIQueryGuard:
@staticmethod
def check(session_id: str, query_name: str,
row_index=None, app_id: str = "") -> dict:
now = time.time()
with db_lock:
with sqlite3.connect(DB_PATH) as conn:
count = conn.execute(
"SELECT COUNT(*) FROM query_events "
"WHERE session_id = ? AND fired_at > ?",
(session_id, now - SESSION_WINDOW)
).fetchone()[0]
if count >= MAX_PER_SESSION:
return {
"allow": False,
"reason": "session_ai_query_ceiling",
"session_id": session_id,
"count": count,
"ceiling": MAX_PER_SESSION,
"message": (
f"AI query ceiling reached for session {session_id!r}: "
f"{count} AI queries fired in the last hour "
f"(ceiling: {MAX_PER_SESSION}). "
"Returning cached/default result. "
"Move AI annotation to a batch workflow run instead of "
"per-row listView queries to avoid per-load fan-out."
),
}
conn.execute(
"INSERT INTO query_events "
"(session_id, query_name, row_index, app_id, fired_at) "
"VALUES (?, ?, ?, ?, ?)",
(session_id, query_name, row_index, app_id, now)
)
return {
"allow": True,
"session_id": session_id,
"count": count + 1,
"ceiling": MAX_PER_SESSION,
}
@app.route("/ai-query/check", methods=["POST"])
def ai_query_check():
data = request.get_json(force=True)
result = RetoolAIQueryGuard.check(
session_id=data.get("session_id", ""),
query_name=data.get("query_name", ""),
row_index=data.get("row_index"),
app_id=data.get("app_id", ""),
)
return jsonify(result), 200 if result["allow"] else 429
if __name__ == "__main__":
init_db()
app.run(port=8083)
The structural fix is more effective than the guard alone: move per-row AI annotation out of the listView component and into a Retool Workflow that processes the full batch offline. Trigger the workflow on a schedule or via a button in the app, write results to a Retool DB table, and have the listView read from that table instead of calling the AI API at render time. This converts 200 per-render AI calls into one scheduled batch workflow run — measurably cheaper and faster to load.
Failure Mode 3 — Retry Storm from Shared AI API Rate Limits
Retool Workflows support configurable retry behavior for failed steps. The default for most resource types is no automatic retry, but teams that encounter transient failures (network timeouts, upstream API 503s) commonly enable retries — up to five retries with a configurable delay between attempts. For a single workflow run hitting a temporary issue, retries work exactly as intended.
The storm emerges from concurrency. Retool Workflows execute concurrently by default — multiple triggers in the same time window start multiple workflow runs simultaneously. When those concurrent runs all call the same AI API endpoint (your OpenAI key, your Anthropic API resource, or Retool's proxied AI endpoint) and the endpoint rate-limits all of them at the same moment (because concurrent calls aggregate against the same per-minute token or request quota), all runs receive a 429 error simultaneously. All runs then retry at the configured delay — say, 30 seconds. Thirty seconds later, all runs attempt to call the API again at the same moment. The same rate limit fires. All runs retry again.
This thundering herd pattern means the total number of calls to your AI API is: concurrent_runs × (1 + max_retries) rather than concurrent_runs × 1. Ten concurrent workflow runs with five retries each generate 60 calls to the AI API from a single wave of triggers — consuming 60 API calls worth of rate limit budget and token quota before any run has successfully completed. If the rate limit is tokens-per-minute (as with most OpenAI and Anthropic endpoints), each failed call also consumes the input tokens for the prompt, which means the prompt is billed 60 times while returning zero useful output.
The retry rule: Retool retry + concurrency = multiplicative call amplification. With N concurrent runs and R retries, a single rate-limit event generates N × (R + 1) calls to the AI API. At N=20 and R=5, that is 120 calls and potentially 120× your prompt's input-token cost for a single rate-limit event. Jitter the retry delay and add a circuit breaker before enabling retries on AI steps.
The guard pattern for retry storms is a circuit breaker webhook that tracks consecutive rate-limit responses across all running workflow sessions and opens the circuit when the failure rate indicates a systemic rate limit rather than a transient single-run issue. Workflows check the circuit before calling the AI step:
import time
import sqlite3
import threading
from flask import Flask, request, jsonify
app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "retool_circuit_breaker.db"
def init_db():
with sqlite3.connect(DB_PATH) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS rate_limit_events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
resource TEXT NOT NULL,
workflow_id TEXT,
http_status INTEGER,
recorded_at REAL
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS circuit_state (
resource TEXT PRIMARY KEY,
state TEXT DEFAULT 'closed',
opened_at REAL,
open_duration_seconds INTEGER DEFAULT 120,
consecutive_failures INTEGER DEFAULT 0
)
""")
BREAKER_THRESHOLD = 3 # consecutive 429s before opening
OPEN_DURATION = 120 # seconds the breaker stays open
class RetoolCircuitBreaker:
"""
Opens when 3+ consecutive 429s are received from the same AI resource
across any running workflow. While open, all workflows skip the AI step
rather than retry into the rate-limited endpoint, breaking the thundering herd.
"""
@staticmethod
def check(resource: str) -> dict:
now = time.time()
with db_lock:
with sqlite3.connect(DB_PATH) as conn:
row = conn.execute(
"SELECT state, opened_at, open_duration_seconds "
"FROM circuit_state WHERE resource = ?",
(resource,)
).fetchone()
if row and row[0] == "open":
elapsed = now - row[1]
if elapsed < row[2]:
retry_after = int(row[2] - elapsed)
return {
"allow": False,
"state": "open",
"resource": resource,
"retry_after_seconds": retry_after,
"message": (
f"Circuit open for {resource!r}: rate limit cascade detected. "
f"AI step blocked for {retry_after}s. "
"Workflow will skip this step and return a degraded result. "
"The circuit closes automatically when the block window expires."
),
}
else:
# Half-open: allow one probe
conn.execute(
"UPDATE circuit_state SET state = 'half-open' WHERE resource = ?",
(resource,)
)
return {"allow": True, "state": row[0] if row else "closed", "resource": resource}
@staticmethod
def record_outcome(resource: str, workflow_id: str,
http_status: int) -> dict:
now = time.time()
with db_lock:
with sqlite3.connect(DB_PATH) as conn:
conn.execute(
"INSERT INTO rate_limit_events "
"(resource, workflow_id, http_status, recorded_at) "
"VALUES (?, ?, ?, ?)",
(resource, workflow_id, http_status, now)
)
conn.execute(
"INSERT OR IGNORE INTO circuit_state (resource) VALUES (?)",
(resource,)
)
if http_status == 429:
conn.execute(
"UPDATE circuit_state "
"SET consecutive_failures = consecutive_failures + 1 "
"WHERE resource = ?",
(resource,)
)
failures = conn.execute(
"SELECT consecutive_failures FROM circuit_state WHERE resource = ?",
(resource,)
).fetchone()[0]
if failures >= BREAKER_THRESHOLD:
conn.execute(
"UPDATE circuit_state "
"SET state = 'open', opened_at = ?, open_duration_seconds = ? "
"WHERE resource = ?",
(now, OPEN_DURATION, resource)
)
return {
"action": "breaker_opened",
"resource": resource,
"consecutive_failures": failures,
"open_duration_seconds": OPEN_DURATION,
}
else:
# Success or non-429 error: reset consecutive counter
conn.execute(
"UPDATE circuit_state "
"SET consecutive_failures = 0, state = 'closed' "
"WHERE resource = ?",
(resource,)
)
return {"action": "recorded", "resource": resource, "http_status": http_status}
@app.route("/circuit/check", methods=["POST"])
def circuit_check():
data = request.get_json(force=True)
result = RetoolCircuitBreaker.check(resource=data.get("resource", "default"))
return jsonify(result), 200 if result["allow"] else 503
@app.route("/circuit/outcome", methods=["POST"])
def circuit_outcome():
data = request.get_json(force=True)
return jsonify(RetoolCircuitBreaker.record_outcome(
resource=data.get("resource", "default"),
workflow_id=data.get("workflow_id", ""),
http_status=int(data.get("http_status", 200)),
))
if __name__ == "__main__":
init_db()
app.run(port=8084)
Add two HTTP steps around every AI API call in your Retool Workflow. Before the AI step, call /circuit/check with the resource identifier (your AI resource name in Retool). If the response is not allow: true, use a Retool Condition to return a degraded result immediately — do not proceed to the AI step, do not retry. After the AI step completes (or fails), call /circuit/outcome with the HTTP status code from the AI API response. Set BREAKER_THRESHOLD to 3 consecutive 429s and OPEN_DURATION to 120 seconds (twice a typical token-bucket refill window). Disable Retool's built-in step retries on AI steps when using this guard — the guard provides circuit-level protection that makes per-run retries redundant.
Failure Mode 4 — Database Change Trigger Loops
Retool Workflows can be triggered by database events: a new row inserted into a Retool DB table, a row updated in a connected Postgres database, or a polling query that detects changes since the last run. These triggers are designed for workflows that react to data changes — an order placed, a support ticket created, a record updated by a user.
The loop emerges when the same workflow that is triggered by a data change also writes data back to the same table (or a related table that has its own change trigger pointing to the same workflow or a related one). An AI enrichment workflow triggered by a new support ticket row might process the ticket with an LLM, extract structured fields, and write an enriched record back to the same table. If the enriched write also matches the trigger condition, the trigger fires again on the newly written row. The workflow runs again on the already-processed record, produces an updated enrichment, writes that back, and the cycle continues.
The loop is invisible from the Retool Workflows editor. The trigger configuration shows the source table and the filter condition — typically something like "new row in support_tickets where processed = false." After the first run, the workflow sets processed = true, so the trigger should not fire again on the same row. But if the AI enrichment also writes a new row (a structured extraction to a related ticket_entities table that also has a workflow trigger for AI processing), each write starts a new workflow run. A ticket with 5 extracted entities creates 5 child workflow triggers from the parent run's writes.
The DB trigger rule: Any workflow triggered by a DB change that also writes to the same or a related DB is at risk of triggering itself or another workflow. Map all DB write operations in your workflow against all DB trigger sources across all workflows before enabling DB-change triggers on AI enrichment workflows.
The guard pattern for database change trigger loops is provenance tagging: every record written by a workflow carries metadata fields that identify the workflow that wrote it and the number of processing hops accumulated from the original source event. Downstream workflows check hop count before proceeding:
import time
import sqlite3
import threading
import hashlib
import json
from flask import Flask, request, jsonify
app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "retool_loop_guard.db"
def init_db():
with sqlite3.connect(DB_PATH) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS provenance_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
record_key TEXT NOT NULL,
source_record_key TEXT,
workflow_id TEXT NOT NULL,
hop_count INTEGER DEFAULT 0,
parent_key TEXT,
written_at REAL
)
""")
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_record_key ON provenance_log (record_key)"
)
MAX_HOP_COUNT = 2 # AI output must not re-trigger AI processing more than twice
class RetoolLoopGuard:
"""
Prevents database-change trigger loops by tracking provenance (workflow_id,
hop_count, parent_key) for every record written by a Retool Workflow.
Downstream workflows check hop_count before processing — blocked at MAX_HOP_COUNT.
"""
@staticmethod
def check_and_tag(record_key: str, workflow_id: str,
parent_key: str = "",
source_record_key: str = "") -> dict:
now = time.time()
with db_lock:
with sqlite3.connect(DB_PATH) as conn:
parent_hop = 0
if parent_key:
parent_row = conn.execute(
"SELECT hop_count FROM provenance_log "
"WHERE record_key = ? ORDER BY written_at DESC LIMIT 1",
(parent_key,)
).fetchone()
if parent_row:
parent_hop = parent_row[0]
hop_count = parent_hop + 1 if parent_key else 0
if hop_count >= MAX_HOP_COUNT:
return {
"allow": False,
"reason": "max_hop_count_exceeded",
"record_key": record_key,
"hop_count": hop_count,
"ceiling": MAX_HOP_COUNT,
"workflow_id": workflow_id,
"parent_key": parent_key,
"message": (
f"Record {record_key!r} has accumulated {hop_count} "
f"processing hops (ceiling: {MAX_HOP_COUNT}). "
f"Blocking workflow {workflow_id!r} to prevent trigger loop. "
"Root cause: a DB-change trigger workflow is writing to a table "
"whose change events re-trigger the same or a related workflow. "
"Add a processed flag to the trigger filter condition "
"or map all DB write-trigger chains before enabling this workflow."
),
}
conn.execute(
"INSERT INTO provenance_log "
"(record_key, source_record_key, workflow_id, hop_count, parent_key, written_at) "
"VALUES (?, ?, ?, ?, ?, ?)",
(record_key, source_record_key, workflow_id, hop_count, parent_key, now)
)
return {
"allow": True,
"record_key": record_key,
"hop_count": hop_count,
"ceiling": MAX_HOP_COUNT,
"workflow_id": workflow_id,
}
@staticmethod
def check_existing(record_key: str, workflow_id: str) -> dict:
with db_lock:
with sqlite3.connect(DB_PATH) as conn:
row = conn.execute(
"SELECT hop_count, workflow_id FROM provenance_log "
"WHERE record_key = ? ORDER BY written_at DESC LIMIT 1",
(record_key,)
).fetchone()
if not row:
return {"allow": True, "hop_count": 0, "record_key": record_key}
hop_count = row[0] + 1
if hop_count >= MAX_HOP_COUNT:
return {
"allow": False,
"reason": "record_already_processed",
"record_key": record_key,
"hop_count": hop_count,
"last_processed_by": row[1],
"ceiling": MAX_HOP_COUNT,
"message": (
f"Record {record_key!r} was already processed by "
f"workflow {row[1]!r} (hop_count: {row[0]}). "
f"Blocking {workflow_id!r} to prevent reprocessing loop."
),
}
return {"allow": True, "hop_count": hop_count, "record_key": record_key}
@app.route("/loop/check", methods=["POST"])
def loop_check():
data = request.get_json(force=True)
result = RetoolLoopGuard.check_existing(
record_key=data.get("record_key", ""),
workflow_id=data.get("workflow_id", ""),
)
return jsonify(result), 200 if result["allow"] else 409
@app.route("/loop/tag", methods=["POST"])
def loop_tag():
data = request.get_json(force=True)
result = RetoolLoopGuard.check_and_tag(
record_key=data.get("record_key", ""),
workflow_id=data.get("workflow_id", ""),
parent_key=data.get("parent_key", ""),
source_record_key=data.get("source_record_key", ""),
)
return jsonify(result), 200 if result["allow"] else 409
if __name__ == "__main__":
init_db()
app.run(port=8085)
Add a /loop/check HTTP call as the first step of every DB-change-triggered workflow, passing the triggering record's unique key and the workflow ID. If the response blocks the call, return immediately — do not process the record. Before writing any records to your output table, call /loop/tag with the new record key, the current workflow ID, and the trigger record's key as parent_key. Set MAX_HOP_COUNT to 1 if your workflow design is strictly single-pass (AI output must not re-trigger AI processing) or to 2 if a two-step enrichment chain is intentional (extract → classify). For most AI enrichment patterns, 1 is the correct ceiling.
State Table — Four Retool AI Failure Modes
| Failure mode | Guard class | Ceiling | What to watch |
|---|---|---|---|
| Subworkflow recursion amplificationRun Workflow step called per AI-generated list item creates geometric run multiplication | RetoolWorkflowBudget |
depth: 2, total runs/trigger: 50 | Retool usage dashboard: total workflow runs per day; alert if daily run count > 2× weekly average |
| AI query fan-out from listView/tableAI query set to auto-run fires once per rendered row on every page load or component mount | RetoolAIQueryGuard |
25 AI queries per browser session per hour | LLM provider dashboard: API calls per hour; spike above average × active users = listView fan-out |
| Retry storm from shared AI rate limitsConcurrent runs all hit rate limit simultaneously, retry simultaneously, amplify call volume | RetoolCircuitBreaker |
3 consecutive 429s → open for 120s | AI API provider error rate for 429 responses; thundering herd signature is burst of 429s followed by gap then repeat |
| DB change trigger loopsAI enrichment workflow triggered by DB change writes back to the same table, re-triggering itself | RetoolLoopGuard |
hop_count < 2 per record | Retool workflow run count for DB-triggered workflows; alert if run count exceeds expected trigger event count by >10% |
Checklist Before Enabling AI in Retool Workflows
- Map every Run Workflow step that executes on AI-generated output. For each one, count the maximum plausible items in the AI-generated list (measure actual LLM output sizes on 20 real inputs before estimating). Set
max_total_runs_per_triggerto this maximum × your parent workflow's monthly trigger frequency. - Audit all listView and table components for AI queries set to auto-run. Open the Retool editor, filter queries by resource type (AI resource or external REST API calling an LLM), and check the trigger configuration for each. Convert per-row auto-run queries to batch workflow queries that write results to a Retool DB table loaded at page start.
- Disable Retool's built-in step retries on AI steps. Add a circuit breaker guard instead. Per-run retries and concurrent run concurrency combine to produce thundering herd behavior that the circuit breaker prevents. Configure the circuit breaker's
OPEN_DURATIONto at least 2× your AI provider's rate limit refill window (typically 60 seconds for per-minute limits). - Map all DB write operations against all DB trigger sources before enabling DB-change triggers. Create a spreadsheet with two columns: "workflow triggers on changes to" and "workflow writes to." Flag any row where a workflow writes to a table that is also a trigger source for itself or another AI workflow.
- Add processed flags to every DB-change trigger's filter condition. Triggers should fire on
WHERE processed = falseand the first step of the workflow should setprocessed = truebefore any AI step. This is belt-and-suspenders protection alongside the provenance tag guard — the trigger filter prevents re-triggering on already-processed records at the database level. - Set explicit concurrency limits on high-volume Retool Workflows. Retool Business plan allows configuring maximum concurrent workflow runs. Set this to a value where concurrent runs × max retries stays within your AI API provider's per-minute request quota.
- Monitor LLM provider API usage independently of Retool's run counter. Retool's usage dashboard tracks workflow runs. Per-row listView AI queries run outside of workflow context and are not counted in workflow run totals. The only reliable signal for fan-out from component-level queries is your AI API provider's request volume dashboard.
FAQ
Does Retool's built-in Workflow concurrency setting prevent retry storms?
Not reliably. The concurrency setting limits how many runs execute simultaneously, which reduces the initial simultaneous call volume to your AI API. But if all concurrent runs still hit the same rate limit threshold and all retry at the same configured delay, the thundering herd recurs at the retry wave. The concurrency ceiling controls the peak request rate but not the retry timing distribution. The circuit breaker guard is needed to break the synchronization: once the circuit opens, all subsequent retry attempts check the circuit and return a degraded result instead of calling the AI API, eliminating the synchronized retry waves.
What is the correct way to do per-row AI annotation in a Retool listView without fan-out?
The correct pattern is a pre-computed annotation table. Run a Retool Workflow on a schedule (or triggered by a button in the app) that batches the rows needing annotation, calls the AI API once for the batch (using a prompt that returns a JSON array of annotations), and writes the results to a Retool DB annotation table. The listView loads from the annotation table at page start — one query for the full annotation set, not one AI call per row. This approach is faster to load (a DB read is milliseconds; an AI API call is seconds), cheaper (one batch call costs less than N individual calls due to shared system prompt tokens), and eliminates the fan-out failure mode entirely.
How should I set hop_count ceilings for a legitimate two-step AI enrichment chain?
A two-step chain where Step 1 extracts entities and Step 2 classifies each entity should use MAX_HOP_COUNT = 2. Step 1 writes entity records with hop_count = 0 (they are derived from the original trigger but have not been AI-processed themselves). Step 2 processes each entity record and writes a classification record with hop_count = 1. The guard blocks any downstream processing that would write with hop_count = 2. This allows the intentional chain while blocking the unintentional loop where Step 2's output triggers Step 1 again. Set MAX_HOP_COUNT = 1 for single-pass enrichment workflows where AI output should never feed back into any trigger.
Retool Workflows have an execution log — can I detect loops from there instead of adding guards?
The execution log shows runs after they have already fired. For a loop that completes in seconds (a workflow that processes and re-triggers in under 10 seconds), multiple iterations have already run by the time you open the log. The log is a diagnostic tool, not a prevention mechanism. The provenance guard prevents loop iterations from running — the log is useful for understanding why a loop happened after the guard has stopped it, not for preventing the loop in the first place. Use the log to tune your trigger filter conditions and hop_count ceilings after the guard has caught the first occurrence.
How does RunGuard integrate with Retool specifically?
RunGuard's circuit breaker and budget guard endpoints are standard HTTP APIs callable from Retool Workflows via the HTTP resource type. Add RunGuard's endpoint URL as a REST API resource in your Retool organization settings. Each guard call in your workflow is a standard HTTP step — no Retool plugins or custom components required. RunGuard's dashboard shows per-workflow trip history so you can see which workflows hit which guard and when, without parsing Retool's execution log manually. The Solo plan at $19/month covers one app with up to 1M guarded invocations — sufficient for most Retool organizations running internal-tool AI workflows.