Voiceflow Cost Control: No Match Reprompt Loops, Capture Slot-Filling Spirals, Knowledge Base Re-Query Chains, and API Block Retry Storms

Voiceflow is a collaborative AI agent and chatbot design platform used by thousands of teams to build customer support agents, lead qualification bots, onboarding assistants, and multi-step AI agents across web chat, voice (Alexa, Google), Slack, SMS, and custom API channels. Since 2024, Voiceflow has evolved from a visual conversation designer into a full AI agent platform: AI Response steps generate dynamic answers using GPT-4o and Claude, Knowledge Base blocks enable retrieval-augmented answer generation from your uploaded documents, and Workflow Agents handle multi-turn goal completion with autonomous tool selection. Each AI step — AI Response, KB Answer — is a billable LLM call counted against your plan's monthly AI step allocation or billed as token consumption against your connected API key.

Voiceflow's visual flow model makes it easy to build sophisticated conversational logic. It also makes it easy to introduce loop patterns that multiply LLM calls invisibly. When a Go To block routes a failed step back to the same node that calls an AI Response, when a Capture step re-prompts indefinitely without an attempt ceiling, or when an API Block's retry configuration meets a flaky upstream service, the session cost compounds with every iteration while the user sees only a slightly delayed response. Four structural patterns in Voiceflow's block model are responsible for the majority of AI step overruns:

  • No Match reprompt loops: A Capture or Intent step routes failed recognitions to a No Match path. If that path includes an AI Response step that generates a contextual clarification message, and a Go To block returns the conversation to the original Capture or Intent step, each failed recognition attempt calls the LLM for the reprompt and calls the NLU again for the re-evaluation. Three failed attempts before escalation cost three AI Response calls plus three classification calls: six billed invocations for a single failed exchange. Without an attempt ceiling, the count has no upper bound.
  • Capture slot-filling spirals: Voiceflow's Capture step extracts structured entities (email addresses, order numbers, dates, phone numbers) from user input. When entity extraction fails, because the user provided a description instead of the expected format, a partial value, or an out-of-vocabulary input, the No Match path re-prompts and a Go To routes back to the same Capture step. Without a Set variable block tracking the attempt count and a Condition block enforcing a ceiling, this loop has no natural exit. A slot-filling spiral on an email capture step, where the user writes "john at example dot com" across 15 re-prompts before abandoning, consumes 15 AI Response calls for a conversation that produced no captured value.
  • Knowledge Base re-query chains: Each invocation of Voiceflow's KB Answer block fires two billable operations. The first is an embedding call that vectorizes the query and retrieves the most semantically similar document chunks; the second is an LLM synthesis call that generates an answer from those chunks and returns a confidence score. Flows that check the confidence score with a Condition block and route low-confidence responses back to the same KB Answer block, sometimes with a slightly rephrased system prompt, create a re-query chain. If the KB does not contain the answer, a second query on the same topic rarely produces a materially higher confidence score: the embedding retrieves slightly different chunks, the synthesis produces a similarly uncertain answer, and the condition routes back for a third attempt. Three iterations cost six billed operations per unanswerable question.
  • API Block retry storms: Voiceflow's API Block sends HTTP requests to external endpoints. When the upstream service returns a 5xx error or times out, Voiceflow can be configured to retry the request automatically. In practice, retry configuration is combined with a subsequent AI Response step that synthesizes the API response into a natural-language reply, so the AI Response still fires once the API Block succeeds on retry N. Under a flaky upstream (intermittent 503s, high latency under load), every session that reaches the API Block fires multiple HTTP requests before one succeeds, and the AI Response call fires once per session regardless. The retry storm compounds when many sessions arrive simultaneously: hundreds of concurrent retry attempts hit the already-overloaded upstream, worsening its latency and causing more retries, in a cascade that multiplies both API calls and the session time to resolution.
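
The per-exchange counts quoted in the list above are plain multiplication. The following is a minimal sketch of that arithmetic, assuming (as the list states) that each failed reprompt round costs one AI Response call plus one classification call, and that each KB Answer invocation costs one embedding call plus one synthesis call. The function names are hypothetical and are not part of Voiceflow.

```python
def reprompt_loop_ops(failed_attempts: int, ops_per_round: int = 2) -> int:
    """Billed operations for a No Match loop.

    Each round is assumed to cost one AI Response call plus one
    classification call (ops_per_round = 2).
    """
    return failed_attempts * ops_per_round


def slot_fill_spiral_ops(reprompts: int) -> int:
    """Billed AI Response calls for a Capture spiral: one per re-prompt."""
    return reprompts


def kb_requery_ops(iterations: int, ops_per_query: int = 2) -> int:
    """Billed operations for a KB re-query chain.

    Each iteration is assumed to cost one embedding call plus one
    synthesis call (ops_per_query = 2).
    """
    return iterations * ops_per_query


if __name__ == "__main__":
    print("No Match loop, 3 failed attempts:", reprompt_loop_ops(3))
    print("Slot-filling spiral, 15 re-prompts:", slot_fill_spiral_ops(15))
    print("KB re-query chain, 3 iterations:", kb_requery_ops(3))
```

With the figures from the list, the three calls return 6, 15, and 6 respectively.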

Failure Mode 1 — No Match Reprompt Loops

Voiceflow flows handle unrecognized user input through No Match and No Reply paths attached to Capture and Intent steps. The No Match path fires when the user's utterance does not match any trained intent or does not satisfy the Capture step's entity extraction. The No Reply path fires when the user sends no input within the configured timeout window. Both paths typically include a "clarification message" node — in older Voiceflow flows, a static Speak block; in modern AI agent flows, an AI Response step that generates a contextually sensitive clarification based on the conversation history.

The loop emerges from a natural design decision: after the clarification message, route the user back to try again using a Go To block pointing at the original Capture or Intent step. This is ergonomically correct — users should get another chance. The problem is that each round trip now includes an AI Response call for the dynamic clarification. A Capture step configured to allow three re-prompt attempts before escalation fires three AI Response calls for the three clarification messages, plus three Capture step entity extraction calls, plus the eventual escalation LLM call. A bot handling 50,000 sessions per month where 15% of sessions have at least one No Match event, and where each No Match event averages 2.3 re-prompts, generates 17,250 extra AI Response calls per month from the reprompt loop alone.

The compounding factor is that AI Response steps in No Match paths are typically configured with rich system prompts: they receive the conversation history, the failed utterance, the expected entity type, and instructions to generate a helpful, non-repetitive clarification. A larger prompt means more tokens per call, so each re-prompt call costs more than a standard single-turn AI Response. A re-prompt AI Response call may cost 2–3× the tokens of a standard response call if the system prompt includes the full prior conversation context for coherence.
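
The monthly figure and the token multiplier above can be combined into a rough estimate. This is a sketch only: it treats the 15% of sessions as having exactly one No Match event each (the text says "at least one", so the result is a lower bound), and the function names are hypothetical.

```python
def extra_reprompt_calls(sessions: int, no_match_rate: float,
                         avg_reprompts: float) -> float:
    """Extra AI Response calls per month caused by No Match reprompts.

    Assumes one No Match event per affected session, so the result is a
    lower bound when some sessions have several events.
    """
    return sessions * no_match_rate * avg_reprompts


def standard_call_equivalents(calls: float, token_multiplier: float) -> float:
    """Express reprompt calls as token-equivalent standard response calls."""
    return calls * token_multiplier


if __name__ == "__main__":
    calls = extra_reprompt_calls(50_000, 0.15, 2.3)
    print(f"extra AI Response calls per month: {calls:,.0f}")
    for multiplier in (2.0, 3.0):
        equivalent = standard_call_equivalents(calls, multiplier)
        print(f"at {multiplier:.0f}x tokens: {equivalent:,.0f} standard-call equivalents")
```

For the example bot this gives 17,250 extra calls, which at 2–3× the tokens per call weigh the same as roughly 34,500 to 51,750 standard response calls.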

The reprompt rule: Every No Match path that includes an AI Response step must check a per-session attempt counter before generating the dynamic clarification. When the counter reaches the ceiling, route directly to the escalation node; do not generate one more AI Response as a "sorry, I can't help" message. A static Speak block for the final "I'll connect you with an agent" message costs zero LLM calls, whereas an AI-generated "I'm sorry I wasn't able to understand" costs as much as each clarification attempt that preceded it. Reserve AI Response for clarifications where context-sensitivity matters; use static Speak blocks for the terminal escalation path.

Python — No Match reprompt guard with attempt ceiling and input structural check (Flask)
import re
import time
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "voiceflow_reprompt_guard.db"

MAX_REPROMPTS_PER_SESSION_PER_STEP = 3
SESSION_WINDOW_SECONDS = 3600

# Structural pre-checks: for steps that capture a specific entity type,
# verify the user's input contains the prerequisite signal before re-prompting.
# If the signal is absent, the entity cannot be extracted regardless of reprompts.
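ENTITY_PRE_CHECKS = {
    "capture_email":       re.compile(r"@|at\s+\w+\s+dot|at\s+\w+\.|\w+\.\w{2,}", re.I),
    # "#" sits outside the \b group: there is no word boundary between a space and "#",
    # so r"\b#" would never match an input such as "#12345".
    "capture_order_id":    re.compile(r"(?:\b(?:order|ord)|#)\s*-?\d{5,}|\b\d{6,}\b", re.I),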
    "capture_phone":       re.compile(r"\b\d[\d\s\-().]{7,}\d\b", re.I),
    "capture_date":        re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*\s+\d{1,2}", re.I),
    "capture_zip":         re.compile(r"\b\d{5}(?:[-\s]\d{4})?\b", re.I),
}

def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS reprompt_log (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                session_id TEXT NOT NULL,
                step_id TEXT NOT NULL,
                attempt INTEGER NOT NULL,
                user_utterance TEXT,
                pre_check_result TEXT,
                recorded_at REAL
            )
        """)
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_session_step "
            "ON reprompt_log (session_id, step_id, recorded_at)"
        )

class NoMatchRepromptGuard:
    """
    Call check() in the No Match path before routing to an AI Response step.
    Returns allow=False when the reprompt ceiling is reached or when the
    user's input structurally cannot satisfy the step's entity requirement.
    """

    @staticmethod
    def check(session_id: str, step_id: str, user_utterance: str) -> dict:
        now = time.time()
        window_start = now - SESSION_WINDOW_SECONDS

        # Structural pre-check for entity-capture steps
        pre_check_pattern = ENTITY_PRE_CHECKS.get(step_id)
        if pre_check_pattern and not pre_check_pattern.search(user_utterance):
            return {
                "allow": False,
                "reason": "entity_signal_absent",
                "session_id": session_id,
                "step_id": step_id,
                "user_utterance": user_utterance,
                "message": (
                    f"Step {step_id!r}: the user's utterance does not contain "
                    "the structural signal required for this entity type. "
                    "Re-prompting with an AI Response will not extract the entity — "
                    "the information is not in the input. Route to a structured "
                    "prompt that asks the user to provide the specific format required "
                    "(e.g., 'Please type your email address including the @ symbol')."
                ),
            }

        with db_lock:
            with sqlite3.connect(DB_PATH) as conn:
                prior_attempts = conn.execute(
                    "SELECT COUNT(*) FROM reprompt_log "
                    "WHERE session_id = ? AND step_id = ? AND recorded_at > ?",
                    (session_id, step_id, window_start)
                ).fetchone()[0]

                if prior_attempts >= MAX_REPROMPTS_PER_SESSION_PER_STEP:
                    return {
                        "allow": False,
                        "reason": "reprompt_ceiling_reached",
                        "session_id": session_id,
                        "step_id": step_id,
                        "attempts_in_window": prior_attempts,
                        "ceiling": MAX_REPROMPTS_PER_SESSION_PER_STEP,
                        "message": (
                            f"Step {step_id!r} has re-prompted {prior_attempts} times "
                            f"in this session (ceiling: {MAX_REPROMPTS_PER_SESSION_PER_STEP}). "
                            "Further AI-generated clarifications will not change the outcome. "
                            "Route to the human escalation or static fallback path."
                        ),
                    }

                conn.execute(
                    "INSERT INTO reprompt_log "
                    "(session_id, step_id, attempt, user_utterance, "
                    "pre_check_result, recorded_at) VALUES (?, ?, ?, ?, ?, ?)",
                    (session_id, step_id,
                     prior_attempts + 1,
                     user_utterance[:500],
                     "passed" if pre_check_pattern else "no_pattern_configured",
                     now)
                )
                return {
                    "allow": True,
                    "session_id": session_id,
                    "step_id": step_id,
                    "attempt_number": prior_attempts + 1,
                    "remaining_reprompts": MAX_REPROMPTS_PER_SESSION_PER_STEP - prior_attempts - 1,
                }

@app.route("/guard/no-match-reprompt", methods=["POST"])
def no_match_reprompt_guard():
    data = request.get_json(force=True)
    result = NoMatchRepromptGuard.check(
        session_id=data.get("session_id", ""),
        step_id=data.get("step_id", ""),
        user_utterance=data.get("user_utterance", ""),
    )
    return jsonify(result), 200 if result["allow"] else 429

if __name__ == "__main__":
    init_db()
    app.run(port=8101)

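Wire this guard into your Voiceflow flow using a Function step (or an API step calling the endpoint) placed at the top of your No Match path, before the AI Response step. Pass the Voiceflow session ID from {system.userId} or a custom session variable as session_id, the current step's identifier as step_id, and the last user utterance from {last_utterance} as user_utterance. The structural pre-check only runs when step_id matches a key in ENTITY_PRE_CHECKS (for example capture_email). Set the Function step's output variable (e.g., allow_reprompt) to the response's allow field. Add a Condition block immediately after: if allow_reprompt is true, route to the AI Response step; if it is false, branch on the reason field. For reprompt_ceiling_reached, route to the static escalation Speak block. For entity_signal_absent, route to a static Speak block that states the required format, as the guard's message field suggests. The guard does not log entity_signal_absent rejections, so if that branch loops back to the Capture step, cap it with its own counter in the flow (a Set variable block and a Condition block). The remaining_reprompts field lets downstream logic use progressively more direct clarification prompts as the ceiling approaches: the first reprompt can be gentle ("I didn't quite catch that"), the third can be explicit ("Please type your email address in the format user@domain.com").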
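
One way to use remaining_reprompts is to map it to an instruction that is passed into the AI Response prompt. The sketch below assumes the guard's ceiling of 3, so remaining_reprompts is 2 on the first reprompt and 0 on the third. The function name, the tier boundaries, and the instruction wording are illustrative assumptions, not Voiceflow features.

```python
def clarification_instruction(remaining_reprompts: int, format_example: str) -> str:
    """Pick a prompt instruction that gets more direct as the ceiling approaches.

    remaining_reprompts comes from the guard response; format_example is a
    hypothetical per-step sample such as "user@domain.com".
    """
    if remaining_reprompts >= 2:
        # First reprompt: stay gentle and do not lecture about format yet.
        return "Briefly say you did not catch that and ask the user to try again."
    if remaining_reprompts == 1:
        # Second reprompt: mention the expected format alongside the request.
        return f"Ask again and mention the expected format, e.g. {format_example}."
    # Last reprompt before escalation: be explicit about the format.
    return f"Tell the user to type the value exactly in this format: {format_example}."


if __name__ == "__main__":
    for remaining in (2, 1, 0):
        print(remaining, "->", clarification_instruction(remaining, "user@domain.com"))
```

Keeping the tiers in one function makes the escalation in tone easy to review and keeps the AI Response system prompt itself short.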

Failure Mode 2 — Capture Slot-Filling Spirals

Voiceflow's Capture step extracts structured data from a user's utterance: an email address for a sign-up flow, an order number for a support flow, a date for a scheduling flow. The step sends the utterance to an entity extraction model configured for the entity type, and routes to a success path when the entity is extracted in the expected format or to a No Match path when it is not. The standard Voiceflow pattern for handling capture failures is to add a No Match path with a re-prompt message and a Go To pointing back at the Capture step, giving the user another attempt.

The spiral failure mode is distinct from the No Match reprompt loop: it is specifically about entity format mismatch rather than utterance recognition failure. A user who writes "john at example dot com" is clearly providing an email address; the intent is obvious. The Capture step's entity extractor expects a machine-recognizable email format and fails the extraction. A re-prompt asking "please enter your email" draws "sure, john at example dot com" from the same user, who believes the first response was correct and is now being asked to repeat it. The next four re-prompts yield the same malformed format with increasing user frustration, and the Capture step never extracts the entity. Each re-prompt iteration that includes an AI Response step for context-aware rephrasing ("It looks like you might be trying to type an email address. Could you try the format user@domain.com?") fires one AI step call.

The spiral's root cause also differs from the reprompt loop's: it occurs because the format pre-check is absent, not because the reprompt ceiling is absent. Even a flow with a ceiling of five re-prompts fires five AI Response calls before escalation if none of those calls addresses the underlying format issue. The fix requires two changes: a format pre-check that detects partial or colloquially formatted inputs (catching "at" and "dot" as email format indicators) and a dedicated "format hint" path that sends a static Speak block with an explicit format example instead of an AI-generated clarification.

The slot-filling rule: For every Capture step that extracts a structured entity with a known format, add a pattern-matching branch in the No Match path before the AI Response step. If the user's utterance matches a "colloquial format" pattern (contains "at" and "dot" for email, spelled-out digits like "three two one" for phone numbers, or phrases like "the 15th" or "next Tuesday" for dates), route to a static Speak block that shows the exact expected format, followed by a Go To back to the Capture step, with no AI call needed. Reserve the AI Response for genuinely ambiguous failures where the expected entity type is unclear from the utterance.

Python — Capture slot-filling spiral guard with format pre-check (Flask)
import re
import time
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "voiceflow_capture_guard.db"

MAX_CAPTURE_ATTEMPTS_PER_SESSION = 4
SESSION_WINDOW_SECONDS = 3600

# Colloquial-format patterns: user is providing the right entity type
# but in a non-machine-parseable format. These should route to a
# static format-hint Speak block, not an AI-generated clarification.
COLLOQUIAL_PATTERNS = {
    "email": {
        "machine": re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}", re.I),
        "colloquial": re.compile(r"\w+\s+(?:at|@)\s+\w+\s+(?:dot|\.)\s*\w{2,}", re.I),
        "hint": "Please type your email address in the standard format: user@domain.com",
    },
    "phone": {
        "machine": re.compile(r"\b\d[\d\s\-().]{7,}\d\b"),
        "colloquial": re.compile(r"\b(?:zero|one|two|three|four|five|six|seven|eight|nine)(?:\s+(?:zero|one|two|three|four|five|six|seven|eight|nine)){6,}\b", re.I),
        "hint": "Please type your phone number using digits, e.g. 555-867-5309",
    },
    "date": {
        "machine": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*\s+\d{1,2}", re.I),
        "colloquial": re.compile(r"\bthe\s+\d{1,2}(?:st|nd|rd|th)?\b|\bnext\s+(?:monday|tuesday|wednesday|thursday|friday)\b", re.I),
        "hint": "Please enter the date in MM/DD/YYYY format, e.g. 06/21/2026",
    },
    "order_id": {
        "machine": re.compile(r"(?:\b(?:ORD|order)|#)[-\s]?\d{5,}\b|\b\d{7,}\b", re.I),
        "colloquial": re.compile(r"(?:my order|order number|order #|reference)\s+(?:is\s+)?\w+", re.I),
        "hint": "Please type your order number exactly as shown in your confirmation email, e.g. ORD-12345678",
    },
}

def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS capture_attempts (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                session_id TEXT NOT NULL,
                capture_step TEXT NOT NULL,
                entity_type TEXT NOT NULL,
                attempt INTEGER NOT NULL,
                utterance_snippet TEXT,
                classification TEXT,
                recorded_at REAL
            )
        """)
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_session_capture "
            "ON capture_attempts (session_id, capture_step, recorded_at)"
        )

class CaptureSlotFillGuard:
    """
    Call check() in the Capture step's No Match path.
    Returns allow_ai_response=False when the utterance has a colloquial format
    (route to static format hint instead) or when the attempt ceiling is reached.
    """

    @staticmethod
    def check(session_id: str, capture_step: str,
              entity_type: str, user_utterance: str) -> dict:
        now = time.time()
        window_start = now - SESSION_WINDOW_SECONDS

        # Check for colloquial format — entity type is right, format is wrong
        entity_config = COLLOQUIAL_PATTERNS.get(entity_type)
        classification = "unrecognized"
        format_hint = None

        if entity_config:
            if entity_config["machine"].search(user_utterance):
                # Machine-parseable: extraction should have succeeded — log and allow retry
                classification = "machine_format_extraction_failure"
            elif entity_config["colloquial"].search(user_utterance):
                # Colloquial format detected: AI clarification won't fix this
                classification = "colloquial_format"
                format_hint = entity_config["hint"]
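                with db_lock:
                    with sqlite3.connect(DB_PATH) as conn:
                        # Format-hint rounds count toward the ceiling too; otherwise a
                        # user who keeps repeating a colloquial value never exits the loop.
                        attempts_so_far = conn.execute(
                            "SELECT COUNT(*) FROM capture_attempts "
                            "WHERE session_id = ? AND capture_step = ? AND recorded_at > ?",
                            (session_id, capture_step, window_start)
                        ).fetchone()[0]
                        if attempts_so_far >= MAX_CAPTURE_ATTEMPTS_PER_SESSION:
                            return {
                                "allow_ai_response": False,
                                "reason": "attempt_ceiling_reached",
                                "session_id": session_id,
                                "capture_step": capture_step,
                                "entity_type": entity_type,
                                "attempts_in_window": attempts_so_far,
                                "ceiling": MAX_CAPTURE_ATTEMPTS_PER_SESSION,
                            }
                        conn.execute(
                            "INSERT INTO capture_attempts "
                            "(session_id, capture_step, entity_type, attempt, "
                            "utterance_snippet, classification, recorded_at) "
                            "VALUES (?, ?, ?, 0, ?, ?, ?)",
                            (session_id, capture_step, entity_type,
                             user_utterance[:200], classification, now)
                        )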
                return {
                    "allow_ai_response": False,
                    "reason": "colloquial_format_detected",
                    "entity_type": entity_type,
                    "format_hint": format_hint,
                    "session_id": session_id,
                    "message": (
                        f"The user's utterance appears to contain a {entity_type} "
                        "in colloquial format. An AI Response clarification will not "
                        "resolve the format mismatch — the user believes their input "
                        "is correct. Route to the static format-hint Speak block: "
                        f"'{format_hint}'"
                    ),
                }

        with db_lock:
            with sqlite3.connect(DB_PATH) as conn:
                prior_attempts = conn.execute(
                    "SELECT COUNT(*) FROM capture_attempts "
                    "WHERE session_id = ? AND capture_step = ? AND recorded_at > ?",
                    (session_id, capture_step, window_start)
                ).fetchone()[0]

                if prior_attempts >= MAX_CAPTURE_ATTEMPTS_PER_SESSION:
                    return {
                        "allow_ai_response": False,
                        "reason": "attempt_ceiling_reached",
                        "session_id": session_id,
                        "capture_step": capture_step,
                        "entity_type": entity_type,
                        "attempts_in_window": prior_attempts,
                        "ceiling": MAX_CAPTURE_ATTEMPTS_PER_SESSION,
                        "message": (
                            f"Capture step {capture_step!r} has failed entity extraction "
                            f"{prior_attempts} times in this session "
                            f"(ceiling: {MAX_CAPTURE_ATTEMPTS_PER_SESSION}). "
                            "Route to human escalation. The user cannot provide "
                            f"a machine-parseable {entity_type} in this session context."
                        ),
                    }

                conn.execute(
                    "INSERT INTO capture_attempts "
                    "(session_id, capture_step, entity_type, attempt, "
                    "utterance_snippet, classification, recorded_at) "
                    "VALUES (?, ?, ?, ?, ?, ?, ?)",
                    (session_id, capture_step, entity_type,
                     prior_attempts + 1, user_utterance[:200], classification, now)
                )
                return {
                    "allow_ai_response": True,
                    "session_id": session_id,
                    "capture_step": capture_step,
                    "entity_type": entity_type,
                    "attempt_number": prior_attempts + 1,
                    "remaining_attempts": MAX_CAPTURE_ATTEMPTS_PER_SESSION - prior_attempts - 1,
                    "classification": classification,
                }

@app.route("/guard/capture-slot-fill", methods=["POST"])
def capture_slot_fill_guard():
    data = request.get_json(force=True)
    result = CaptureSlotFillGuard.check(
        session_id=data.get("session_id", ""),
        capture_step=data.get("capture_step", ""),
        entity_type=data.get("entity_type", ""),
        user_utterance=data.get("user_utterance", ""),
    )
    return jsonify(result), 200 if result["allow_ai_response"] else 429

if __name__ == "__main__":
    init_db()
    app.run(port=8102)

Call this endpoint from a Function step or API step in your Capture block's No Match path. Pass session_id, the Voiceflow step block name as capture_step, the entity type configured on the Capture step as entity_type, and the last user utterance as user_utterance. The response routing branches three ways: if reason is colloquial_format_detected, route to a static Speak block displaying the format_hint value, then Go To the Capture step again; if reason is attempt_ceiling_reached, route to the human escalation path; if allow_ai_response is true, route to the AI Response step for context-sensitive clarification. The remaining_attempts field can be used to increase specificity in the AI Response prompt as attempts decrease — fewer remaining attempts should correspond to more explicit and direct format instructions.
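
The three-way branching described above can be written out as a small routing function, for instance inside a Function step or a middleware layer. This is a sketch under stated assumptions: the branch labels returned here are hypothetical names for paths in your flow, and unknown or malformed responses are sent to escalation so that a guard failure never triggers an LLM call.

```python
def route_capture_guard(response: dict) -> str:
    """Map a capture guard response to one of three flow branches."""
    if response.get("allow_ai_response") is True:
        # Genuinely ambiguous failure: a context-sensitive clarification is worthwhile.
        return "ai_response"

    reason = response.get("reason")
    if reason == "colloquial_format_detected":
        # Right entity, wrong format: show format_hint in a static Speak block.
        return "static_format_hint"
    if reason == "attempt_ceiling_reached":
        return "human_escalation"

    # Unknown reason or malformed response: fail safe without spending an AI step.
    return "human_escalation"


if __name__ == "__main__":
    samples = [
        {"allow_ai_response": True, "remaining_attempts": 2},
        {"allow_ai_response": False, "reason": "colloquial_format_detected"},
        {"allow_ai_response": False, "reason": "attempt_ceiling_reached"},
        {},
    ]
    for sample in samples:
        print(sample, "->", route_capture_guard(sample))
```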

Failure Mode 3 — Knowledge Base Re-Query Chains

Voiceflow's Knowledge Base block integrates with your uploaded documentation, PDF files, and web-scraped content to provide RAG-based answers. Each KB block invocation triggers two billable operations: an embedding call that converts the user's query into a vector and retrieves the most semantically similar document chunks from the vector index, and a synthesis call that sends those retrieved chunks to the configured LLM (typically GPT-4o or Claude) to generate a grounded answer. The KB block returns both the answer text and a confidence score indicating how well the retrieved chunks supported the generated answer.

The re-query chain failure mode emerges from a natural but flawed optimization: when the confidence score is below a threshold (say, 0.6), route the conversation back to the KB block with a modified system prompt that asks the model to "try a different approach" or "look for related information." The intention is to handle cases where the first query phrasing missed relevant documents that a rephrased query would retrieve. The reality is that KB confidence scores below 0.5 almost universally indicate that the answer is not in the indexed content, not that the query phrasing was suboptimal. A rephrased embedding query will retrieve slightly different chunks; if those chunks also do not contain the answer, the synthesis call will produce a similarly low-confidence result, and the condition will route back for a third attempt. Each iteration fires two billed operations. A KB re-query chain that runs three iterations before an "I don't know" response costs six billed operations where two (a single invocation) would have sufficed.

The KB re-query rule: A confidence score below 0.5 on the first query means the KB does not contain a satisfying answer to this question. A second query on the same topic retrieves overlapping chunks at best and unrelated chunks at worst. Block the second query for topics where the first confidence score falls below the threshold. Instead of re-querying, either use the first query's best-available retrieved chunks as context for a "based on limited information" response or route the question to human escalation. Reserve re-queries for cases where the first query returned zero retrieved chunks (a retrieval failure, not a content gap) or where the query contains an entity that can be canonicalized (correcting a typo in a product name before re-querying).
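
The rule above reduces to a short decision function applied to the result of the first query. The sketch below is illustrative: the function name and the returned labels are hypothetical, the 0.5 threshold is the one named in the rule, and the canonical query is assumed to come from your own normalization step (for example a product-name spell check).

```python
from typing import Optional

LOW_CONFIDENCE_THRESHOLD = 0.5


def requery_decision(confidence: float, retrieved_chunks: int,
                     query: str, canonical_query: Optional[str] = None) -> str:
    """Decide what to do after the first KB query for a topic."""
    if confidence >= LOW_CONFIDENCE_THRESHOLD:
        return "use_answer"

    if retrieved_chunks == 0:
        # Nothing came back at all: a retrieval failure, not a content gap.
        return "requery"

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    if canonical_query and normalize(canonical_query) != normalize(query):
        # Canonicalization changed the query (e.g. a corrected product name).
        return "requery_canonical"

    # Low confidence with chunks retrieved: treat it as a content gap.
    return "fallback_or_escalate"


if __name__ == "__main__":
    print(requery_decision(0.82, 4, "reset password"))
    print(requery_decision(0.31, 0, "reset password"))
    print(requery_decision(0.31, 3, "acme wigdet warranty", "acme widget warranty"))
    print(requery_decision(0.31, 3, "refund policy for gift cards"))
```

Even when this function returns a re-query label, a per-topic cap on total queries should still apply so that the exception paths cannot loop.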

Python — KB Answer re-query guard with topic hash and confidence tracking (Flask)
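# Lets the `float | None` annotation below work on Python versions before 3.10.
from __future__ import annotations

import hashlib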
import time
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "voiceflow_kb_guard.db"

MAX_KB_QUERIES_PER_TOPIC = 2
LOW_CONFIDENCE_THRESHOLD = 0.50
SESSION_WINDOW_SECONDS = 3600

def topic_hash(query: str) -> str:
    normalized = " ".join(query.lower().split())[:128]
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS kb_query_log (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                session_id TEXT NOT NULL,
                topic_hash TEXT NOT NULL,
                query_text TEXT,
                confidence_score REAL,
                retrieved_chunks INTEGER,
                recorded_at REAL
            )
        """)
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_session_topic "
            "ON kb_query_log (session_id, topic_hash, recorded_at)"
        )

class KBReQueryGuard:
    """
    Call check() before each KB Answer block invocation.
    Returns allow=False when the topic has already been queried and the
    prior confidence score was below the low-confidence threshold.
    Call record() after each successful KB call to log the confidence score.
    """

    @staticmethod
    def check(session_id: str, query: str,
              prior_confidence: float | None = None) -> dict:
        now = time.time()
        window_start = now - SESSION_WINDOW_SECONDS
        t_hash = topic_hash(query)

        with db_lock:
            with sqlite3.connect(DB_PATH) as conn:
                prior_queries = conn.execute(
                    "SELECT confidence_score, retrieved_chunks FROM kb_query_log "
                    "WHERE session_id = ? AND topic_hash = ? AND recorded_at > ? "
                    "ORDER BY recorded_at ASC",
                    (session_id, t_hash, window_start)
                ).fetchall()

                query_count = len(prior_queries)

                if query_count >= MAX_KB_QUERIES_PER_TOPIC:
                    best_confidence = max(
                        (r[0] for r in prior_queries if r[0] is not None),
                        default=0.0
                    )
                    return {
                        "allow": False,
                        "reason": "kb_query_cap_reached",
                        "session_id": session_id,
                        "topic_hash": t_hash,
                        "queries_in_window": query_count,
                        "cap": MAX_KB_QUERIES_PER_TOPIC,
                        "best_confidence_seen": best_confidence,
                        "message": (
                            f"KB has been queried {query_count} times for this topic "
                            f"in this session (cap: {MAX_KB_QUERIES_PER_TOPIC}). "
                            f"Best confidence seen: {best_confidence:.2f}. "
                            + (
                                "Confidence consistently below threshold — this topic "
                                "is not covered in the current KB content. Add documentation "
                                "for this question category or route to human escalation."
                                if best_confidence < LOW_CONFIDENCE_THRESHOLD else
                                "Confidence above threshold on a prior query — use the "
                                "best prior answer rather than re-querying."
                            )
                        ),
                    }

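                # Fall back to the most recently logged score when the caller does
                # not pass prior_confidence (rows are ordered oldest to newest).
                if prior_queries and prior_confidence is None:
                    prior_confidence = prior_queries[-1][0]
                # Zero retrieved chunks indicates a retrieval failure rather than a
                # content gap, so that case stays eligible for a re-query.
                retrieval_failed = bool(prior_queries) and prior_queries[-1][1] == 0
                if prior_queries and prior_confidence is not None and not retrieval_failed:
                    if prior_confidence < LOW_CONFIDENCE_THRESHOLD: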
                        return {
                            "allow": False,
                            "reason": "low_confidence_re_query_blocked",
                            "session_id": session_id,
                            "topic_hash": t_hash,
                            "prior_confidence": prior_confidence,
                            "threshold": LOW_CONFIDENCE_THRESHOLD,
                            "message": (
                                f"Prior KB query returned confidence {prior_confidence:.2f} "
                                f"(threshold: {LOW_CONFIDENCE_THRESHOLD}). "
                                "A re-query with rephrased input will not retrieve "
                                "materially different chunks when the KB does not contain "
                                "an answer for this topic. Route to fallback response or "
                                "human escalation instead."
                            ),
                        }

                return {
                    "allow": True,
                    "session_id": session_id,
                    "topic_hash": t_hash,
                    "query_number": query_count + 1,
                    "remaining_queries": MAX_KB_QUERIES_PER_TOPIC - query_count - 1,
                }

    @staticmethod
    def record(session_id: str, query: str,
               confidence_score: float, retrieved_chunks: int) -> None:
        t_hash = topic_hash(query)
        with db_lock:
            with sqlite3.connect(DB_PATH) as conn:
                conn.execute(
                    "INSERT INTO kb_query_log "
                    "(session_id, topic_hash, query_text, confidence_score, "
                    "retrieved_chunks, recorded_at) VALUES (?, ?, ?, ?, ?, ?)",
                    (session_id, t_hash, query[:300],
                     confidence_score, retrieved_chunks, time.time())
                )

@app.route("/guard/kb-requery/check", methods=["POST"])
def kb_requery_check():
    data = request.get_json(force=True)
    result = KBReQueryGuard.check(
        session_id=data.get("session_id", ""),
        query=data.get("query", ""),
        prior_confidence=data.get("prior_confidence"),
    )
    return jsonify(result), 200 if result["allow"] else 429

@app.route("/guard/kb-requery/record", methods=["POST"])
def kb_requery_record():
    data = request.get_json(force=True)
    KBReQueryGuard.record(
        session_id=data.get("session_id", ""),
        query=data.get("query", ""),
        confidence_score=data.get("confidence_score", 0.0),
        retrieved_chunks=data.get("retrieved_chunks", 0),
    )
    return jsonify({"recorded": True}), 200

if __name__ == "__main__":
    init_db()
    app.run(port=8103)

Wire this guard with two Function/API step calls in your Voiceflow flow. First, call /guard/kb-requery/check before the KB Answer block: pass the current query text and, for re-query attempts, the prior_confidence returned by the previous KB block as a Voiceflow variable. If allow is false in the response (the endpoint also returns HTTP 429 in that case), skip the KB block and route to your fallback path, using the best_confidence_seen value to compose a "based on available information" response from the best prior KB result. After each permitted KB Answer block call, pass the returned confidence score and chunk count to /guard/kb-requery/record; this builds the session-level confidence history that informs the next check. The best_confidence_seen field is present when the reason is kb_query_cap_reached (low_confidence_re_query_blocked responses carry prior_confidence instead) and lets you compose calibrated uncertainty messages: a score above 0.4 suggests you have partial relevant information to share; a score below 0.3 indicates the topic is not in the KB at all.
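The routing on a check response can be sketched as a small pure function. This is a minimal illustration rather than part of the guard service: the function name route_after_kb_check and the route labels are hypothetical, and the 0.4 and 0.3 cut-offs are simply the values quoted in the paragraph above.

```python
from typing import Any, Dict

PARTIAL_INFO_MIN = 0.4   # above this: share the best prior answer with a caveat
NOT_COVERED_MAX = 0.3    # below this: treat the topic as absent from the KB


def route_after_kb_check(check_response: Dict[str, Any]) -> str:
    """Map a /guard/kb-requery/check response body to a (hypothetical) route label."""
    if check_response.get("allow"):
        return "run_kb_answer"

    # Cap-reached responses carry best_confidence_seen; low-confidence
    # blocks carry prior_confidence instead. Fall back to 0.0 if neither is set.
    best = check_response.get("best_confidence_seen")
    if best is None:
        best = check_response.get("prior_confidence") or 0.0

    if best > PARTIAL_INFO_MIN:
        return "fallback_with_best_prior_answer"
    if best < NOT_COVERED_MAX:
        return "escalate_topic_not_in_kb"
    # Scores between the two cut-offs get a neutral fallback.
    return "fallback_generic"
```

In a flow, the returned label would correspond to the branch taken by a Condition step after the check call; the middle band between 0.3 and 0.4 is an assumption, since the text does not say how to treat it.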

Failure Mode 4 — API Block Retry Storms

Voiceflow's API Block sends HTTP requests to external endpoints: order management systems, CRM APIs, inventory databases, calendar services, payment processors. The block supports configurable retry behavior — if the upstream returns a 5xx status or times out, Voiceflow retries the request up to a configured maximum. Retries are appropriate for transient failures. They become a retry storm when the upstream service is experiencing sustained degraded performance rather than a transient spike — a pattern where retries themselves contribute to the degradation.

The compounding problem is specific to AI agent architectures. In a Voiceflow flow where an API Block fetches data that an AI Response step then synthesizes into a natural-language reply, the AI Response step fires after the API Block succeeds regardless of how many retries were required. Under a degraded upstream: Session A's API Block sends 3 retry requests over 9 seconds, finally succeeds, and the AI Response fires once. Session B sends 3 retries concurrently. Session C sends 3 retries concurrently. A hundred concurrent sessions each sending 3 retries = 300 API requests hitting an already-overloaded upstream in a burst, worsening its response time, causing more sessions to time out and retry, and escalating the storm. The AI Response costs are bounded (one per session that eventually succeeds), but the upstream API costs (per-call billing, rate limit exhaustion, and SLA penalties for excess requests) multiply with retry depth.

Retry storms also manifest as a cost attribution problem: Voiceflow's usage analytics show the AI Response calls correctly (one per session), but the actual upstream API call count may be 3–5× higher. Teams that budget for API costs based on session count underestimate by the retry factor. When upstream pricing is per-request (not per-session), the cost overrun is invisible until the API invoice arrives.
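The gap between a session-based budget and the actual invoice is simple arithmetic, sketched here. The function name and the per-call price are hypothetical placeholders; substitute your upstream's real pricing and the average attempts per session observed in your own logs.

```python
def upstream_cost_estimate(sessions: int,
                           avg_attempts_per_session: float,
                           price_per_call: float) -> dict:
    """Compare a one-call-per-session budget with the retry-inflated cost."""
    budgeted = sessions * price_per_call
    actual = sessions * avg_attempts_per_session * price_per_call
    return {
        "budgeted": budgeted,
        "actual": actual,
        "overrun": actual - budgeted,
        "upstream_calls": round(sessions * avg_attempts_per_session),
    }


# Example: 10,000 sessions, 3 attempts each on average, hypothetical $0.002 per call.
estimate = upstream_cost_estimate(10_000, 3.0, 0.002)
```

With these illustrative inputs the upstream sees 30,000 calls rather than the 10,000 a session-based budget assumes.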

The retry storm rule: Before each API Block retry attempt, check a shared circuit breaker for the upstream endpoint. The circuit breaker counts recent failures across all concurrent sessions, not just the current session's retries. When the failure count reaches a threshold (e.g., 10 or more failures in the last 60 seconds for a given endpoint), trip the circuit: block all retry attempts for a recovery window (30–60 seconds) and respond with a static "service temporarily unavailable" message for all new attempts during the window. A per-session retry ceiling (3 retries) bounds per-session cost; the shared circuit breaker bounds the total system cost under sustained upstream degradation.

Python — API Block circuit breaker with shared failure tracking (Flask)
import time
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
db_lock = threading.Lock()
DB_PATH = "voiceflow_api_circuit.db"

# Per-session retry limits
MAX_RETRIES_PER_SESSION = 3

# Circuit breaker configuration
FAILURE_WINDOW_SECONDS = 60       # count failures in this rolling window
CIRCUIT_TRIP_THRESHOLD = 10       # failures across all sessions in the window
CIRCUIT_RECOVERY_SECONDS = 45     # open circuit for this long before re-testing

def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS api_call_log (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                endpoint_key TEXT NOT NULL,
                session_id TEXT NOT NULL,
                attempt INTEGER NOT NULL,
                outcome TEXT NOT NULL,   -- 'success' | 'failure' | 'circuit_open'
                status_code INTEGER,
                latency_ms INTEGER,
                recorded_at REAL
            )
        """)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS circuit_state (
                endpoint_key TEXT PRIMARY KEY,
                state TEXT NOT NULL,     -- 'closed' | 'open'
                tripped_at REAL,
                trip_failure_count INTEGER
            )
        """)
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_endpoint_time "
            "ON api_call_log (endpoint_key, recorded_at)"
        )

class APIBlockCircuitBreaker:
    """
    Call check() before each API Block call (including retries).
    Call record() after each call completes with the outcome.
    The circuit trips when recent cross-session failure count exceeds threshold.
    """

    @staticmethod
    def check(endpoint_key: str, session_id: str) -> dict:
        now = time.time()
        failure_window_start = now - FAILURE_WINDOW_SECONDS

        with db_lock:
            with sqlite3.connect(DB_PATH) as conn:
                # Check circuit state
                circuit = conn.execute(
                    "SELECT state, tripped_at FROM circuit_state "
                    "WHERE endpoint_key = ?",
                    (endpoint_key,)
                ).fetchone()

                if circuit and circuit[0] == "open":
                    tripped_at = circuit[1]
                    time_in_open = now - tripped_at
                    if time_in_open < CIRCUIT_RECOVERY_SECONDS:
                        remaining = int(CIRCUIT_RECOVERY_SECONDS - time_in_open)
                        return {
                            "allow": False,
                            "reason": "circuit_open",
                            "endpoint_key": endpoint_key,
                            "session_id": session_id,
                            "recovery_seconds_remaining": remaining,
                            "message": (
                                f"Circuit breaker for {endpoint_key!r} is open "
                                f"(tripped {int(time_in_open)}s ago, "
                                f"recovery in {remaining}s). "
                                "The upstream service is experiencing sustained failures. "
                                "Respond with a static 'service temporarily unavailable' "
                                "message — do not retry."
                            ),
                        }
                    else:
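                        # Recovery window elapsed: close the circuit and let requests
                        # through again. This is a time-based reset, not a true
                        # half-open state that admits a single probe request.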
                        conn.execute(
                            "UPDATE circuit_state SET state = 'closed' "
                            "WHERE endpoint_key = ?",
                            (endpoint_key,)
                        )

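                # Check per-session attempt count. Every recorded call in the last
                # hour counts toward the ceiling, successes as well as failures.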
                session_attempts = conn.execute(
                    "SELECT COUNT(*) FROM api_call_log "
                    "WHERE endpoint_key = ? AND session_id = ? "
                    "AND recorded_at > ? AND outcome != 'circuit_open'",
                    (endpoint_key, session_id, now - 3600)
                ).fetchone()[0]

                if session_attempts >= MAX_RETRIES_PER_SESSION:
                    return {
                        "allow": False,
                        "reason": "session_retry_ceiling",
                        "endpoint_key": endpoint_key,
                        "session_id": session_id,
                        "session_attempts": session_attempts,
                        "ceiling": MAX_RETRIES_PER_SESSION,
                        "message": (
                            f"Session {session_id!r} has attempted {session_attempts} "
                            f"calls to {endpoint_key!r} (ceiling: {MAX_RETRIES_PER_SESSION}). "
                            "Route to static unavailable response for this session."
                        ),
                    }

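                # Check cross-session failure rate. Ignore failures recorded before
                # the last trip so a recovered circuit does not re-trip immediately
                # on stale failures still inside the rolling window.
                last_trip = circuit[1] if circuit and circuit[1] else 0.0
                recent_failures = conn.execute(
                    "SELECT COUNT(*) FROM api_call_log "
                    "WHERE endpoint_key = ? AND outcome = 'failure' "
                    "AND recorded_at > ?",
                    (endpoint_key, max(failure_window_start, last_trip))
                ).fetchone()[0]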

                if recent_failures >= CIRCUIT_TRIP_THRESHOLD:
                    conn.execute(
                        "INSERT OR REPLACE INTO circuit_state "
                        "(endpoint_key, state, tripped_at, trip_failure_count) "
                        "VALUES (?, 'open', ?, ?)",
                        (endpoint_key, now, recent_failures)
                    )
                    return {
                        "allow": False,
                        "reason": "circuit_just_tripped",
                        "endpoint_key": endpoint_key,
                        "session_id": session_id,
                        "recent_failures": recent_failures,
                        "failure_window_seconds": FAILURE_WINDOW_SECONDS,
                        "recovery_seconds": CIRCUIT_RECOVERY_SECONDS,
                        "message": (
                            f"Circuit breaker tripped for {endpoint_key!r}: "
                            f"{recent_failures} failures in the last "
                            f"{FAILURE_WINDOW_SECONDS}s across all sessions "
                            f"(threshold: {CIRCUIT_TRIP_THRESHOLD}). "
                            f"Circuit open for {CIRCUIT_RECOVERY_SECONDS}s. "
                            "Do not retry — upstream is experiencing sustained degradation."
                        ),
                    }

                return {
                    "allow": True,
                    "endpoint_key": endpoint_key,
                    "session_id": session_id,
                    "session_attempt_number": session_attempts + 1,
                    "recent_failures_in_window": recent_failures,
                }

    @staticmethod
    def record(endpoint_key: str, session_id: str,
               attempt: int, outcome: str,
               status_code: int | None = None,
               latency_ms: int | None = None) -> None:
        with db_lock:
            with sqlite3.connect(DB_PATH) as conn:
                conn.execute(
                    "INSERT INTO api_call_log "
                    "(endpoint_key, session_id, attempt, outcome, "
                    "status_code, latency_ms, recorded_at) "
                    "VALUES (?, ?, ?, ?, ?, ?, ?)",
                    (endpoint_key, session_id, attempt, outcome,
                     status_code, latency_ms, time.time())
                )

@app.route("/guard/api-circuit/check", methods=["POST"])
def api_circuit_check():
    data = request.get_json(force=True)
    result = APIBlockCircuitBreaker.check(
        endpoint_key=data.get("endpoint_key", ""),
        session_id=data.get("session_id", ""),
    )
    return jsonify(result), 200 if result["allow"] else 429

@app.route("/guard/api-circuit/record", methods=["POST"])
def api_circuit_record():
    data = request.get_json(force=True)
    APIBlockCircuitBreaker.record(
        endpoint_key=data.get("endpoint_key", ""),
        session_id=data.get("session_id", ""),
        attempt=data.get("attempt", 1),
        outcome=data.get("outcome", "failure"),
        status_code=data.get("status_code"),
        latency_ms=data.get("latency_ms"),
    )
    return jsonify({"recorded": True}), 200

if __name__ == "__main__":
    init_db()
    app.run(port=8104)

Integrate the circuit breaker into your Voiceflow flow with API step calls wrapping each API Block invocation. Before each API Block (including retry attempts), call /guard/api-circuit/check with the endpoint identifier (e.g., "order-management-api") and the current session ID. If allow is false, route to a static "Service temporarily unavailable — please try again in a few minutes" Speak block instead of proceeding to the API Block. After each API Block completes, call /guard/api-circuit/record with the outcome ("success" or "failure"), the HTTP status code, and the observed latency. The shared failure tracking means the circuit trips on cross-session patterns — when 10 different sessions all fail within 60 seconds, every subsequent session gets the circuit-open response until the upstream recovers. Use the recent_failures_in_window field from successful responses to monitor your upstream's degradation trajectory before the circuit trips, and investigate upstream latency spikes when the count rises above 3–4 in the window.
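The body sent to /guard/api-circuit/record can be assembled by a small helper like the one below. This is a sketch under stated assumptions: the helper names are hypothetical, and treating only timeouts, missing status codes, and 5xx responses as failures (so that 4xx client errors do not count against the upstream) is a design choice not specified in the post.

```python
from typing import Any, Dict, Optional


def classify_outcome(status_code: Optional[int], timed_out: bool = False) -> str:
    """Return 'failure' for timeouts, missing status codes, and 5xx; else 'success'."""
    if timed_out or status_code is None or status_code >= 500:
        return "failure"
    return "success"


def build_record_payload(endpoint_key: str, session_id: str, attempt: int,
                         status_code: Optional[int], latency_ms: Optional[int],
                         timed_out: bool = False) -> Dict[str, Any]:
    """Build the JSON body for /guard/api-circuit/record from an API Block result."""
    return {
        "endpoint_key": endpoint_key,
        "session_id": session_id,
        "attempt": attempt,
        "outcome": classify_outcome(status_code, timed_out),
        "status_code": status_code,
        "latency_ms": latency_ms,
    }
```

The field names match what the record endpoint above reads from the request; only the classification rule is an addition.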

State Table

| Failure mode | Guard class | Ceiling | What to watch |
|---|---|---|---|
| No Match reprompt loop: AI Response step fires on every failed intent recognition in a Go To loop | NoMatchRepromptGuard | 3 reprompts per session per step; entity pre-check blocks structural mismatches immediately | reason: entity_signal_absent frequency; high rate = Capture step entity type too strict for real user input distribution |
| Capture slot-filling spiral: Entity extraction re-prompts indefinitely on format mismatch without attempt ceiling | CaptureSlotFillGuard | 4 attempts per session per capture step; colloquial format routes to static hint instead of AI Response | reason: colloquial_format_detected frequency per entity type; high rate = add format examples to initial Capture step prompt |
| KB Answer re-query chain: Low-confidence KB response triggers repeated re-query, each firing embed + synthesis calls | KBReQueryGuard | 2 queries per session per topic hash; low-confidence re-queries blocked immediately | best_confidence_seen histogram in blocked responses; persistent sub-0.4 topics are KB content gaps to fill |
| API Block retry storm: Concurrent sessions pile retries on a degraded upstream, compounding its failure rate | APIBlockCircuitBreaker | 3 attempts per session; circuit trips at 10 cross-session failures / 60s; 45s recovery window | recent_failures_in_window trend in successful responses; rising counts before circuit trip indicate upstream degradation onset |

Checklist Before Going Live

  1. Audit every No Match path that includes an AI Response step and confirm it has an attempt ceiling. Open each flow in the Voiceflow studio and trace the No Match and No Reply paths from every Capture and Intent step. For each path containing an AI Response step, verify that a guard check (API step or Function step calling /guard/no-match-reprompt) is placed before the AI Response. If the path uses a Set + Condition block to track attempts manually, ensure the ceiling value is ≤ 3 — higher values mean more wasted AI calls before escalation. If the path uses no ceiling at all, the reprompt loop is unbounded and will run until the session times out.
  2. For every Capture step that extracts a structured entity, add a colloquial-format detection branch in the No Match path. The four most common entities with colloquial format patterns are email (written with "at" and "dot"), phone numbers (spoken digit-by-digit), order IDs (described by context rather than value), and dates (referred to as "the fifteenth" or "next Monday"). For each of these entity types, add a Condition step in the No Match path that evaluates whether the utterance matches the colloquial pattern and routes to a static Speak block with the exact format example. This eliminates the AI Response call for the most common capture failure class — format mismatch — without degrading user experience.
  3. Set a maximum of two KB Answer block invocations per topic per session and add a fallback path for sub-threshold confidence responses. In every flow where a KB Answer block appears, confirm that the flow routes low-confidence responses (below 0.5) directly to a fallback rather than back to the KB block. Design the fallback path to use the best-available KB response even at low confidence — prefix the response with "Based on the information I have..." rather than re-querying for a more confident answer. Document which topic categories in your KB consistently produce sub-0.5 confidence scores; these represent content gaps that need documentation additions, not re-query optimizations.
  4. Test every API Block integration with a simulated upstream outage before production launch. Use a mock server that returns 503s for 60 seconds, then recovers, to verify that: (a) the circuit breaker trips after 10 failures and stops adding retry load to the mock server, (b) the circuit closes again once the recovery window elapses and requests succeed after the upstream recovers, and (c) sessions during the open circuit receive a coherent static "service unavailable" response rather than an error state. Verify that the per-session retry ceiling (3 attempts) trips before the cross-session circuit for low-volume scenarios where the upstream is only degraded for one session at a time.
  5. Monitor your Voiceflow AI step usage per flow weekly and correlate spikes with guard log events. Voiceflow's usage dashboard reports AI steps consumed per workspace and per assistant. Set a weekly review cadence where you compare AI step counts against conversation counts — the ratio should stay close to your expected steps-per-conversation for flows without guard issues. When the ratio spikes, pull guard logs for that period and count events by reason code: a spike in reprompt_ceiling_reached events indicates a high-traffic No Match path with a frequently-unsatisfied intent; a spike in kb_query_cap_reached events indicates a KB content gap category receiving high traffic; a spike in circuit_just_tripped events indicates upstream reliability degradation. Each reason code points to a specific configuration change, not just a cost alert.
  6. Test your API Block flows with production-representative concurrent session loads before launch. Most API Block retry storm scenarios do not appear in single-session testing because the circuit breaker's cross-session failure counter only trips when multiple concurrent sessions are hitting the endpoint simultaneously. Use a load testing tool to simulate 50–100 concurrent conversations reaching the API Block step within the same 60-second window, then introduce upstream latency (add 2–5 second delays to the mock server). Verify that the circuit breaker trips within the CIRCUIT_TRIP_THRESHOLD failure count and that all subsequent concurrent sessions receive the circuit-open response without sending additional requests to the mock server.
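The simulated outage in item 4 can be modeled with a small class that any mock HTTP server could delegate to. This is a minimal sketch: the class name FlakyUpstream and its interface are hypothetical, and the injectable clock exists only so the outage can be exercised in tests without waiting 60 real seconds.

```python
import time
from typing import Callable


class FlakyUpstream:
    """Simulated upstream: answers 503 during an initial outage, then 200."""

    def __init__(self, outage_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started_at = clock()
        self._outage_seconds = outage_seconds
        self.request_count = 0  # total requests received, for load assertions

    def handle(self) -> int:
        """Record one incoming request and return the status code to send back."""
        self.request_count += 1
        elapsed = self._clock() - self._started_at
        return 503 if elapsed < self._outage_seconds else 200


# Usage with a fake clock: step time forward manually instead of sleeping.
fake_now = [0.0]
upstream = FlakyUpstream(outage_seconds=60.0, clock=lambda: fake_now[0])
during_outage = upstream.handle()
fake_now[0] = 60.0
after_recovery = upstream.handle()
```

Comparing request_count before and after the circuit trips is one way to confirm that the breaker has stopped adding retry load to the upstream.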

FAQ

How does Voiceflow's AI step billing compare to other conversational AI platforms?

Voiceflow bills AI steps as LLM calls made through your connected AI model account (OpenAI, Anthropic) or against Voiceflow's own managed AI step credits depending on your plan tier. This is structurally similar to Botpress's AI credit model (billed per LLM invocation) and different from Copilot Studio's per-message model (billed per completed conversation turn) or Salesforce Agentforce's per-conversation model (billed per resolved session). The per-invocation model makes Voiceflow's costs directly proportional to how many times AI Response and KB Answer blocks fire per session — loop patterns that multiply block invocations multiply costs linearly. Platforms that bill per conversation turn hide internal retry costs; Voiceflow exposes them, which makes cost control both more necessary and more tractable because the signal (AI steps consumed) is directly tied to the cause (block invocation count).

Can I implement these guards entirely within Voiceflow using Set and Condition blocks without an external endpoint?

Yes, for attempt counting. Voiceflow's Set step can increment a flow variable (e.g., reprompt_count) on each No Match event, and a Condition step can check whether reprompt_count >= 3 to route to escalation. This in-flow counting approach works for per-session ceiling enforcement without any external HTTP calls. The limitation is observability: flow variable counts live only within a session and are not aggregated across sessions or queryable for analytics. The external endpoint approach in these examples builds a cross-session log that lets you query, for example, "which KB topics produced sub-0.5 confidence scores in more than 10% of sessions this week" or "which API endpoints tripped the circuit breaker most frequently in the last 30 days." That cross-session visibility is what turns a runtime cost cap into a product improvement signal. For prototypes and low-volume deployments, in-flow counting is a reasonable starting point; for production deployments where the guard log informs weekly KB content updates and upstream reliability reviews, the external endpoint provides the data needed to act on the signal.
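As an illustration of the cross-session visibility described above, the first example question can be answered with a query against the guard's kb_query_log table. This is a sketch under assumptions: the function name is hypothetical, the column names are taken from the INSERT statement in KBReQueryGuard.record, and "sessions this week" is interpreted as sessions that made at least one KB query in the last seven days.

```python
import sqlite3
import time
from typing import List, Optional, Tuple

WEEK_SECONDS = 7 * 24 * 3600


def low_confidence_topics(conn: sqlite3.Connection,
                          threshold: float = 0.5,
                          min_session_share: float = 0.10,
                          now: Optional[float] = None) -> List[Tuple[str, float]]:
    """Return (topic_hash, share_of_sessions) for topics whose KB queries scored
    below `threshold` in more than `min_session_share` of the past week's sessions."""
    now = time.time() if now is None else now
    since = now - WEEK_SECONDS

    total_sessions = conn.execute(
        "SELECT COUNT(DISTINCT session_id) FROM kb_query_log WHERE recorded_at > ?",
        (since,),
    ).fetchone()[0]
    if total_sessions == 0:
        return []

    rows = conn.execute(
        "SELECT topic_hash, COUNT(DISTINCT session_id) AS n "
        "FROM kb_query_log "
        "WHERE recorded_at > ? AND confidence_score < ? "
        "GROUP BY topic_hash ORDER BY n DESC",
        (since, threshold),
    ).fetchall()

    return [(topic, n / total_sessions) for topic, n in rows
            if n / total_sessions > min_session_share]
```

Because topic_hash is opaque, pairing each result with a sample query_text from the same table makes the output easier to act on when planning KB content updates.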

Our API Block uses a Voiceflow-native integration (e.g., the built-in Google Sheets or Airtable block) rather than a custom API Block. Do the circuit breaker patterns apply?

The same failure mode applies, but the implementation path is different. Voiceflow-native integrations abstract the HTTP layer — you cannot intercept the retry behavior or add a guard check between retry attempts within the integration block itself. The mitigation for native integrations is a pre-flight rate check using a custom API Block placed before the native integration block: call a lightweight status endpoint (e.g., a custom health-check endpoint for your Google Sheet or Airtable base) to verify the service is responding before invoking the native integration. If the health check fails, route to the static unavailable path without hitting the native integration. This adds one lightweight API call per session but prevents the native integration from consuming retries against an unresponsive upstream. For shared upstream services (Airtable bases accessed by multiple concurrent Voiceflow flows), the cross-session circuit breaker pattern applies directly — the health-check endpoint can be the same guard endpoint described in this post, shared across all flows that access the same Airtable base.

How do we calibrate the circuit breaker thresholds (10 failures / 60 seconds, 45-second recovery) for our specific upstream?

The defaults in this post assume a mid-traffic deployment (hundreds of concurrent sessions) with a reasonably reliable upstream (99%+ availability under normal conditions). Calibrate CIRCUIT_TRIP_THRESHOLD based on your peak concurrent session count: for 50 concurrent sessions, 10 failures in 60 seconds represents a 20% cross-session failure rate, which is a clear signal of degradation. For 500 concurrent sessions, 10 failures represents only 2%, so raise the threshold to 30–50 to avoid false positives from random failures. Size FAILURE_WINDOW_SECONDS as a multiple of your API Block's timeout multiplied by your max retries: if your timeout is 5 seconds and you allow 3 retries, one full retry cycle takes up to 15 seconds per session, so a 60-second window spans about four such cycles. Set CIRCUIT_RECOVERY_SECONDS to your upstream's documented restart or recovery time: if your order management API has an SLA of recovering within 2 minutes of a detected incident, 45 seconds is too short; use 120 seconds so the circuit stays open long enough for the upstream to stabilize before requests resume.
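The calibration arithmetic can be captured in a small helper. This is a sketch only: the function name, the target_failure_rate parameter, and the choice of four retry cycles per window are assumptions layered on the worked numbers above, not settings exposed by Voiceflow.

```python
def suggest_circuit_settings(peak_concurrent_sessions: int,
                             timeout_seconds: int,
                             max_retries: int,
                             upstream_recovery_seconds: int,
                             target_failure_rate: float = 0.20,
                             cycles_in_window: int = 4) -> dict:
    """Derive starting values for the three circuit breaker constants."""
    # Failures needed to trip: the chosen share of peak concurrent sessions.
    trip_threshold = max(1, round(peak_concurrent_sessions * target_failure_rate))
    # One retry cycle lasts up to timeout * retries; the window spans several cycles.
    failure_window = timeout_seconds * max_retries * cycles_in_window
    return {
        "CIRCUIT_TRIP_THRESHOLD": trip_threshold,
        "FAILURE_WINDOW_SECONDS": failure_window,
        "CIRCUIT_RECOVERY_SECONDS": upstream_recovery_seconds,
    }


# 50 sessions at a 20% failure share, 5 s timeout, 3 retries, 120 s upstream recovery.
small = suggest_circuit_settings(50, 5, 3, 120)
# 500 sessions with a lower 8% share, landing inside the 30-50 range suggested above.
large = suggest_circuit_settings(500, 5, 3, 120, target_failure_rate=0.08)
```

Treat the output as a starting point and adjust after observing how often the circuit trips under real traffic.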

How do I integrate RunGuard's SDK with Voiceflow to implement these guards?

Deploy RunGuard as a Python or TypeScript service accessible from Voiceflow's outbound HTTP connections (Voiceflow Cloud makes outbound HTTPS calls from a fixed set of IP ranges documented in their developer portal). For No Match reprompt and Capture slot-filling guards, use RunGuard's BudgetTracker with cap=3 per session per step: tracker.check(session_id, step_id) returns the remaining attempt budget, and tracker.record(session_id, step_id) decrements it after each permitted reprompt. For the KB re-query guard, use RunGuard's LoopDetector configured with window_seconds=3600 and max_calls=2: call detector.record(session_id, topic_hash) after each KB call and detector.check(session_id, topic_hash) before each re-query. For the API circuit breaker, RunGuard's built-in CircuitBreaker class implements the cross-session failure tracking and open/closed state transitions described in the code examples above, without requiring you to maintain the SQLite schema manually. Install with pip install runguard for Python or npm install @runguard/sdk for TypeScript, and call from Voiceflow Function steps (TypeScript) or API steps pointing at your RunGuard service endpoint.

Catch these loops before they drain your Voiceflow AI steps

RunGuard is a circuit-breaker SDK for AI agents and automation flows. Wire it once, get loop detection + budget enforcement + alerts when any breaker trips. Works in Python and TypeScript.
