Rasa Cost Control: Form Validation Spirals, LLM Fallback Cascades, RAG Re-Query Loops, and ReminderScheduled Event Storms

Rasa Open Source is one of the most widely deployed self-hosted conversational AI frameworks. Its architecture separates NLU (intent classification and entity extraction via DIET or spaCy) from dialogue management (a policy ensemble of TED, AugmentedMemoization, and Rule policies) and action execution (the action server, a separate Python HTTP server). Teams building production Rasa deployments in 2024 and beyond increasingly augment the action server with direct LLM API calls: GPT-4o or Claude for dynamic slot-filling re-asks, LLM-based intent disambiguation when DIET confidence falls below threshold, retrieval-augmented generation for knowledge base queries, and async callbacks that synthesize external API results into natural-language replies. In this architecture each LLM integration lives entirely inside custom actions: Rasa itself does not call an LLM; your action server does. Every loop, retry, or duplication pattern that multiplies action server invocations therefore translates directly into multiplied LLM API charges on your OpenAI or Anthropic bill.

Rasa's forms, policies, and event system are powerful precisely because they handle complex multi-turn dialogue state. That same power makes it easy to construct action server call patterns where a single user utterance produces four or eight LLM API calls instead of one. Four structural patterns in the Rasa + LLM architecture are responsible for the majority of unexpected LLM cost spikes:

  • Form validation spirals — A form uses a CustomActionAsk<SlotName> action that calls an LLM to generate contextual re-ask prompts when slot extraction fails. The FormValidationAction's validate_<slot> method returns {"<slot>": None} to request another attempt. Without a per-slot attempt counter that caps re-asks and a structural pre-check that blocks LLM re-asks when the user's input cannot possibly satisfy the slot (e.g., no digits in the text for an order-number slot), the form fires one LLM call per re-ask indefinitely. A user providing the wrong format across ten re-prompts consumes ten GPT-4o API calls for zero extracted value.
  • DIET fallback cascades — When DIET's intent confidence falls below the configured threshold, the fallback handling routes the conversation to a custom action that calls an LLM classifier for more robust intent disambiguation. The LLM call does not fix the underlying cause: if the user's phrasing is one that DIET scores at low confidence (distribution-shifted queries, out-of-vocabulary terms, ambiguous phrasing), the same low-confidence → LLM-fallback chain fires on the next user turn. Under a traffic shift where a class of queries consistently falls below the DIET threshold, every conversation turn from users in that class fires an LLM API call instead of using the local ML model, turning a DIET inference with no per-call API charge into a billed LLM call on every turn.
  • RAG action re-query loops — A custom action performs retrieval-augmented generation: embed the query, retrieve the top-K chunks from a vector database, call an LLM to synthesize an answer. If the synthesized answer scores below a confidence threshold (or the user expresses dissatisfaction), the action returns an ActiveLoop event to continue the dialogue form and re-asks the question. The same topic_hash — the normalized query — is re-queried on the next turn. Since vector retrieval is deterministic for a given query, the second embedding retrieves nearly identical chunks; the LLM produces a similarly uncertain answer; the confidence check fails again. Without a per-topic query ceiling, each unsatisfied topic fires one embedding API call + one LLM synthesis call per re-ask indefinitely.
  • ReminderScheduled event storms — Rasa supports ReminderScheduled events dispatched by custom actions to trigger a callback action at a future time (e.g., poll an external API and resume the conversation when the result is ready). If the scheduling action is called multiple times in a session — due to policy disagreement retrying the action, a user re-triggering the same intent, or a form re-running the same action step — multiple overlapping reminders are scheduled. Each fires its callback action independently. If the callback calls an LLM to synthesize the external result into a reply, N scheduled reminders = N concurrent LLM synthesis calls for the same response, only one of which the user will ever see.

Failure Mode 1 — Form Validation Spirals

Rasa's form system routes every user utterance through the active form's FormValidationAction. Inside that class, validate_<slot_name> methods receive the extracted value (or None if DIET or the entity extractor failed to extract the slot), apply business logic, and return either an accepted value or None to signal "keep asking." The form then dispatches a re-ask: either utter_ask_<slot_name> (a static response template) or, in modern deployments, a CustomActionAsk<SlotName>, a fully custom action (registered in Rasa under the name action_ask_<slot_name>) that can call an LLM to generate a context-sensitive re-ask prompt tailored to the conversation history and prior failed attempts.

The LLM re-ask pattern is attractive because it produces natural-sounding prompts that reference the user's prior input: "You mentioned 'john at example dot com' — could you provide that in standard email format (e.g., john@example.com)?" instead of a static "Please enter a valid email address." The problem is that DIET entity extraction failures are often structural, not probabilistic. A user who writes "john at example dot com" will fail extraction again on the second attempt if they write "john at example dot com" again. The LLM re-ask does not change the user's behavior; it costs one GPT-4o call per attempt. A form collecting an email address, phone number, and order ID — three LLM-powered slot-ask actions — where the user provides colloquial formats across all three and requires three re-prompts per slot generates nine LLM API calls for one form submission attempt.

The compounding factor is that CustomActionAsk actions typically include the full conversation context in the system prompt for coherence. A re-ask on attempt three receives the conversation history including the two prior failed attempts plus the user's explanations, producing a longer context window per call. Token cost per re-ask call grows as the form session extends — the third re-ask for a slot may cost 2–3× the tokens of the first because the conversation context has grown with each failed exchange.
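To make that growth concrete, here is a minimal sketch of how prompt size can scale with the attempt number when every failed exchange is appended to the context. The function name and the token counts (400 base tokens, 300 tokens per failed exchange) are illustrative assumptions, not measurements from a real deployment.

```python
def reask_prompt_tokens(attempt: int, base_tokens: int = 400,
                        tokens_per_failed_exchange: int = 300) -> int:
    """Estimated prompt tokens for the Nth re-ask of one slot (attempt starts at 1).

    Assumes the prompt carries a fixed base (system prompt plus slot
    instructions) and every prior failed exchange for this slot.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return base_tokens + (attempt - 1) * tokens_per_failed_exchange


first = reask_prompt_tokens(1)
third = reask_prompt_tokens(3)
ratio = third / first
# Total prompt tokens billed across the first three re-asks of one slot
cumulative = sum(reask_prompt_tokens(n) for n in range(1, 4))
print(first, third, ratio, cumulative)
```

Under these assumed numbers the third re-ask carries 2.5× the prompt tokens of the first, which sits inside the 2–3× range described above.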

The slot-fill rule: Every CustomActionAsk<SlotName> that calls an LLM must first run a structural pre-check on the user's last utterance. If the utterance cannot structurally contain the expected entity — no digit sequence for an order number, no @ sign and domain for an email address, no date-like pattern for a date slot — the LLM call is blocked and a static clarification template fires instead. Structural failures are deterministic: if the signal is absent, an LLM-generated prompt will not make it appear. Reserve LLM re-asks for cases where the signal is present but extraction failed (partial match, format variant, ambiguous value) — these are the cases where context-sensitive prompting improves extraction rates. Add a per-slot per-session attempt counter; at ceiling, route to escalation without a final LLM re-ask.

Python — FormSlotFillGuard: structural pre-check + per-slot attempt ceiling (Flask endpoint for CustomActionAsk)
import re
import time
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
_db_lock = threading.Lock()

def _db():
    conn = sqlite3.connect("/tmp/rasa_guards.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS slot_attempts (
            session_id TEXT NOT NULL,
            slot_name  TEXT NOT NULL,
            attempts   INTEGER NOT NULL DEFAULT 0,
            last_ts    REAL NOT NULL,
            PRIMARY KEY (session_id, slot_name)
        )
    """)
    conn.commit()
    return conn

# Structural pre-check patterns per slot type.
# Returns True if the utterance *could* contain the entity (LLM re-ask warranted).
# Returns False if the signal is structurally absent (static hint required, no LLM call).
STRUCTURAL_CHECKS = {
    "email": lambda text: bool(
        re.search(r'@|\bat\b', text, re.I) and
        re.search(r'\.|\bdot\b', text, re.I)   # \bdot\b: match only the standalone word "dot"
    ),
    "phone": lambda text: len(re.findall(r'\d', text)) >= 7,
    "order_id": lambda text: bool(re.search(r'\d{4,}', text)),
    "date": lambda text: bool(
        re.search(
            r'\d{1,2}[\/\-\.]\d{1,2}|\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*\b'
            r'|\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday|today|tomorrow|yesterday)\b'
            r'|\d{4}',
            text, re.I
        )
    ),
}

MAX_SLOT_ATTEMPTS = 3   # LLM re-asks before routing to escalation
SESSION_TTL = 3600      # seconds; prune stale sessions

@app.route("/guard/slot-fill", methods=["POST"])
def slot_fill_guard():
    """
    Called at the top of every CustomActionAsk before the LLM call.
    Request body: { session_id, slot_name, last_utterance }
    Response:
      { allow_llm_reask: true }                      → proceed with LLM re-ask
      { allow_llm_reask: false, reason, static_hint } → use static hint, skip LLM
    """
    body = request.get_json(force=True)
    session_id   = body["session_id"]
    slot_name    = body["slot_name"]
    last_utt     = body.get("last_utterance", "")

    # 1. Structural pre-check
    check_fn = STRUCTURAL_CHECKS.get(slot_name)
    if check_fn and not check_fn(last_utt):
        static_hints = {
            "email":    "Please type your email address in the format name@example.com",
            "phone":    "Please enter your phone number with at least 7 digits",
            "order_id": "Please provide your order number (e.g., 10284756)",
            "date":     "Please provide a date like 'June 15' or '2026-06-15'",
        }
        return jsonify({
            "allow_llm_reask": False,
            "reason": "signal_absent",
            "static_hint": static_hints.get(slot_name, f"Please provide a valid {slot_name}"),
        })

    # 2. Per-slot attempt ceiling
    now = time.time()
    with _db_lock:
        conn = _db()
        row = conn.execute(
            "SELECT attempts FROM slot_attempts WHERE session_id=? AND slot_name=?",
            (session_id, slot_name)
        ).fetchone()
        attempts = (row[0] if row else 0) + 1
        conn.execute(
            """INSERT INTO slot_attempts (session_id, slot_name, attempts, last_ts)
               VALUES (?, ?, ?, ?)
               ON CONFLICT(session_id, slot_name) DO UPDATE SET
                   attempts=excluded.attempts, last_ts=excluded.last_ts""",
            (session_id, slot_name, attempts, now)
        )
        conn.commit()

        # Prune stale sessions periodically
        if attempts == 1:
            conn.execute("DELETE FROM slot_attempts WHERE last_ts < ?", (now - SESSION_TTL,))
            conn.commit()
        conn.close()

    if attempts > MAX_SLOT_ATTEMPTS:
        return jsonify({
            "allow_llm_reask": False,
            "reason": "ceiling_reached",
            "attempts": attempts,
            "static_hint": f"I'm having trouble capturing your {slot_name}. Let me connect you with a team member who can help.",
        })

    return jsonify({
        "allow_llm_reask": True,
        "attempts": attempts,
        "remaining": MAX_SLOT_ATTEMPTS - attempts,
    })

The guard runs two checks sequentially. First, the structural pre-check applies a slot-specific pattern test to the user's last utterance. If the test determines that the signal is structurally absent — the user's text cannot contain an email address, for example, because there is no @ sign or "at" word — the guard blocks the LLM call immediately and returns a static_hint string for the action to dispatch via a static response template. No LLM API call fires. Second, if the signal is present, the guard checks the per-slot attempt counter against MAX_SLOT_ATTEMPTS. When the ceiling is reached, the guard returns an escalation message without an LLM call, routing the conversation to a human handoff node. The guard's static_hint for ceiling-reached is a transfer message, not a re-ask, so the next action in the Rasa flow should route to escalation rather than another utter_ask_<slot>.
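On the action side, the guard's response maps to one of three outcomes. The following is a sketch of that dispatch as a pure function; the name plan_reask and the returned tuple shape are hypothetical, while the response fields are the ones returned by /guard/slot-fill above.

```python
from typing import Optional, Tuple


def plan_reask(guard_response: dict) -> Tuple[str, Optional[str]]:
    """Map a /guard/slot-fill response to (mode, text).

    mode is "llm" (call the LLM to write the re-ask), "static" (send the
    hint as-is, no LLM call), or "escalate" (send the transfer message
    and hand off to a human).
    """
    if guard_response.get("allow_llm_reask"):
        return ("llm", None)
    hint = guard_response.get("static_hint")
    if guard_response.get("reason") == "ceiling_reached":
        return ("escalate", hint)
    # "signal_absent" or any unexpected reason: fail closed, no LLM call
    return ("static", hint)
```

Failing closed on an unrecognized response is a design choice in this sketch: a malformed guard reply then costs a static message rather than an LLM call.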

Failure Mode 2 — DIET Fallback Cascades

Rasa's DIET classifier assigns a confidence score to every intent prediction. When the top-scoring intent falls below the configured threshold (the threshold parameter of the FallbackClassifier component in the pipeline config, often set to 0.7), the FallbackClassifier overrides the intent with nlu_fallback, and a rule maps that intent to a fallback action. Teams that want to avoid dead-end fallback responses implement an LLM fallback action: instead of a static "I didn't understand that" response, a custom action calls GPT-4o with the user's utterance and the bot's intent schema to classify the intent more robustly than DIET alone.

The failure mode emerges under distribution shift, when a class of user utterances consistently falls below DIET's confidence threshold. This is common in production: users rephrase common intents in ways not represented in the training data, use domain jargon introduced after the last training run, or mix languages in a way that confuses the tokenizer. A user asking "what's the ETA on my parcel" instead of the trained "where is my order" may land below 0.7 confidence every time. Every such turn routes to the LLM fallback classifier: not just the first ambiguous utterance but every later message phrased the same way, because the training data has not been updated. For a customer service bot handling 200,000 monthly sessions where 8% of sessions have at least one DIET confidence miss, and where each affected session (average length 6 turns) triggers 1.8 LLM fallback calls on average, the cascade generates 28,800 unexpected LLM calls per month (200,000 × 0.08 × 1.8).
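The monthly estimate can be reproduced in a few lines, which also makes it easy to substitute your own traffic numbers. The variable names are illustrative; the three input values are the ones quoted in the paragraph above.

```python
monthly_sessions = 200_000
miss_rate = 0.08                            # share of sessions with at least one DIET miss
fallback_calls_per_affected_session = 1.8   # average LLM fallback calls in those sessions

affected_sessions = round(monthly_sessions * miss_rate)
monthly_fallback_calls = round(
    affected_sessions * fallback_calls_per_affected_session
)
print(affected_sessions, monthly_fallback_calls)
```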

The compounding factor is that LLM fallback classifiers are often configured with rich context: the intent schema, example utterances for each intent, the conversation history, and the user's current utterance. This produces large-context API calls — 2,000–5,000 tokens per call on a well-specified prompt — which cost 5–10× more per call than a short slot-filling query. A cascade of 28,800 large-context LLM calls consumes materially more budget than 28,800 short calls to the same model.

The fallback rule: LLM fallback classification should have a per-session call ceiling and a topic-hash deduplication check. If the LLM classified the same normalized utterance in this session and returned a result, do not re-classify it — return the cached result. If the LLM has already been called N times in this session without producing a high-confidence routing decision, stop calling it: the LLM is not going to successfully disambiguate a topic that resists disambiguation after N attempts. Route to escalation or to a menu-based disambiguation flow instead. Log the utterances that triggered LLM fallback calls — these are your next DIET retraining candidates.

Python — DIETFallbackGuard: per-session LLM fallback ceiling + utterance deduplication (Flask endpoint)
import re
import time
import hashlib
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
_db_lock = threading.Lock()

def _db():
    conn = sqlite3.connect("/tmp/rasa_guards.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS diet_fallback (
            session_id    TEXT NOT NULL,
            utt_hash      TEXT NOT NULL,
            llm_result    TEXT,          -- cached classification result (JSON string)
            attempt_count INTEGER NOT NULL DEFAULT 0,
            last_ts       REAL NOT NULL,
            PRIMARY KEY (session_id, utt_hash)
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS diet_session_totals (
            session_id     TEXT PRIMARY KEY,
            total_fallbacks INTEGER NOT NULL DEFAULT 0,
            last_ts        REAL NOT NULL
        )
    """)
    conn.commit()
    return conn

MAX_SESSION_FALLBACKS = 4   # total LLM fallback calls per session before escalation
SESSION_TTL = 3600

def _normalize(utterance: str) -> str:
    return re.sub(r'\s+', ' ', utterance.lower().strip())

def _utt_hash(utterance: str) -> str:
    return hashlib.sha256(_normalize(utterance).encode()).hexdigest()[:32]

@app.route("/guard/diet-fallback/check", methods=["POST"])
def diet_fallback_check():
    """
    Called before the LLM fallback classifier fires.
    Request: { session_id, utterance }
    Response:
      { allow_llm_call: true }                     → proceed with LLM classification
      { allow_llm_call: false, reason, cached_result? } → use cache or escalate
    """
    body = request.get_json(force=True)
    session_id = body["session_id"]
    utterance  = body["utterance"]
    utt_hash   = _utt_hash(utterance)
    now = time.time()

    with _db_lock:
        conn = _db()

        # Check utterance-level cache for this session
        row = conn.execute(
            "SELECT llm_result, attempt_count FROM diet_fallback WHERE session_id=? AND utt_hash=?",
            (session_id, utt_hash)
        ).fetchone()

        if row and row[0]:  # cached result exists
            conn.close()
            return jsonify({
                "allow_llm_call": False,
                "reason": "cached",
                "cached_result": row[0],
            })

        # Check session-level total ceiling
        total_row = conn.execute(
            "SELECT total_fallbacks FROM diet_session_totals WHERE session_id=?",
            (session_id,)
        ).fetchone()
        total = total_row[0] if total_row else 0

        if total >= MAX_SESSION_FALLBACKS:
            conn.close()
            return jsonify({
                "allow_llm_call": False,
                "reason": "session_ceiling_reached",
                "total_fallbacks": total,
            })

        # Increment session total and record this utterance
        conn.execute(
            """INSERT INTO diet_session_totals (session_id, total_fallbacks, last_ts)
               VALUES (?, 1, ?)
               ON CONFLICT(session_id) DO UPDATE SET
                   total_fallbacks=total_fallbacks+1, last_ts=excluded.last_ts""",
            (session_id, now)
        )
        conn.execute(
            """INSERT INTO diet_fallback (session_id, utt_hash, attempt_count, last_ts)
               VALUES (?, ?, 1, ?)
               ON CONFLICT(session_id, utt_hash) DO UPDATE SET
                   attempt_count=attempt_count+1, last_ts=excluded.last_ts""",
            (session_id, utt_hash, now)
        )
        # Prune stale sessions
        conn.execute(
            "DELETE FROM diet_session_totals WHERE last_ts < ?", (now - SESSION_TTL,)
        )
        conn.execute(
            "DELETE FROM diet_fallback WHERE last_ts < ?", (now - SESSION_TTL,)
        )
        conn.commit()
        conn.close()

    return jsonify({
        "allow_llm_call": True,
        "session_fallbacks_used": total + 1,
        "remaining": MAX_SESSION_FALLBACKS - total - 1,
    })

@app.route("/guard/diet-fallback/record", methods=["POST"])
def diet_fallback_record():
    """
    Called after the LLM classifier returns, to cache the result.
    Request: { session_id, utterance, llm_result (JSON string) }
    """
    body = request.get_json(force=True)
    session_id = body["session_id"]
    utt_hash   = _utt_hash(body["utterance"])
    llm_result = body["llm_result"]
    now = time.time()

    with _db_lock:
        conn = _db()
        conn.execute(
            """INSERT INTO diet_fallback (session_id, utt_hash, llm_result, attempt_count, last_ts)
               VALUES (?, ?, ?, 1, ?)
               ON CONFLICT(session_id, utt_hash) DO UPDATE SET
                   llm_result=excluded.llm_result, last_ts=excluded.last_ts""",
            (session_id, utt_hash, llm_result, now)
        )
        conn.commit()
        conn.close()

    return jsonify({"recorded": True})

The guard uses a two-table design. diet_fallback stores per-session, per-utterance-hash records, enabling cache lookup when the same normalized utterance appears multiple times in the same session. diet_session_totals tracks the total number of LLM fallback calls issued this session, enabling the session-level ceiling check. The /guard/diet-fallback/check endpoint runs before the LLM call: it returns allow_llm_call: false with reason: "cached" when a result is already stored for this utterance hash, or with reason: "session_ceiling_reached" when the total LLM fallback calls in the session have hit MAX_SESSION_FALLBACKS. The /guard/diet-fallback/record endpoint is called after a successful LLM classification to write the result to the utterance-hash cache. Utterances that consistently miss the DIET threshold and require LLM fallback are candidates for the next DIET retraining run, so the guard's records are a direct signal for model improvement priorities. Note that the diet_fallback table as defined stores only a hash of each utterance; to export the utterances themselves, persist the normalized text alongside the hash or log it when the check endpoint is called.
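The cache only helps when two utterances normalize to the same string, so it is worth checking what the normalization does and does not collapse. The sketch below repeats the guard's normalization and hashing logic under public names (normalize, utt_hash) so it can run on its own; the sample utterances are illustrative.

```python
import hashlib
import re


def normalize(utterance: str) -> str:
    # Same steps as the guard: lowercase, trim, collapse runs of whitespace.
    return re.sub(r"\s+", " ", utterance.lower().strip())


def utt_hash(utterance: str) -> str:
    return hashlib.sha256(normalize(utterance).encode()).hexdigest()[:32]


a = utt_hash("What's the ETA  on my parcel")
b = utt_hash("  what's the eta on my parcel ")
c = utt_hash("What's the ETA on my parcel?")

same_after_normalizing = (a == b)   # case and spacing differences collapse
punctuation_differs = (a != c)      # a trailing "?" still yields a different hash
```

Because punctuation is kept, a repeated question that differs only by a question mark misses the cache and counts against the session ceiling instead. Stripping punctuation in the normalizer would widen the cache at the cost of occasionally merging distinct utterances.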

Failure Mode 3 — RAG Action Re-Query Loops

Rasa teams building knowledge-base Q&A features implement RAG inside custom actions: the action embeds the user's question via an embedding model (OpenAI text-embedding-3-small, Cohere Embed v3, or a self-hosted alternative), queries a vector database (Weaviate, Qdrant, Pinecone, pgvector), retrieves the top-K most semantically similar chunks, and calls an LLM with those chunks as context to synthesize a natural-language answer. The synthesized answer includes a confidence indicator — either from the LLM's own self-assessment ("I'm not sure, but…") or from a threshold applied to the top-K retrieval scores. When confidence is low, the action pattern is to return an ActiveLoop event, re-ask the user whether the answer was helpful, and if not, trigger the RAG action again with the user's follow-up utterance as a refined query.

The failure mode is structural: vector retrieval is deterministic for a given query. If the knowledge base does not contain a document relevant to the user's question, the top-K retrieval returns the same low-relevance chunks regardless of how many times the query is repeated or how many times the system prompt rephrases the synthesis request. A re-query on the same topic_hash retrieves the same chunks and produces an answer of the same low confidence level. Without a per-topic query ceiling, each unsatisfied exchange fires one embedding API call plus one LLM synthesis call indefinitely — two billed operations per re-ask, multiplied by the number of unsatisfied re-asks before the user abandons. A conversation where three topic-level re-queries each fire three RAG iterations consumes 18 billed API calls (3 topics × 3 iterations × 2 operations) for a conversation that never produced a useful answer.

The re-query rule: A second RAG query on the same normalized topic is rarely warranted when the first query returned low retrieval scores. If the top-K chunks had low relevance scores, the knowledge base does not contain the answer — rephrasing the synthesis call will not change the retrieval results. Block re-queries for topics where the prior retrieval confidence fell below LOW_RETRIEVAL_THRESHOLD. Allow one re-query per topic per session as a safety valve for cases where the user's follow-up genuinely rephrases the question in a way that maps to different chunks (topic drift). On the second re-query for the same topic hash, block unconditionally and route to a human handoff or a curated "not available" response. Log the blocked topic hashes — these are your knowledge base content gaps.

Python — RAGReQueryGuard: per-topic retrieval confidence tracking + re-query ceiling (Flask endpoint)
import re
import time
import hashlib
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
_db_lock = threading.Lock()

def _db():
    conn = sqlite3.connect("/tmp/rasa_guards.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS rag_queries (
            session_id         TEXT NOT NULL,
            topic_hash         TEXT NOT NULL,
            query_count        INTEGER NOT NULL DEFAULT 0,
            best_retrieval_score REAL,
            best_answer_text   TEXT,
            last_ts            REAL NOT NULL,
            PRIMARY KEY (session_id, topic_hash)
        )
    """)
    conn.commit()
    return conn

LOW_RETRIEVAL_THRESHOLD = 0.45  # below this, re-query is very unlikely to help
MAX_QUERIES_PER_TOPIC   = 2     # allow one genuine rephrase; block the rest
SESSION_TTL = 3600

def _topic_hash(query: str) -> str:
    normalized = re.sub(r'\W+', ' ', query.lower()).strip()
    # Use first 80 chars of normalized query as topic fingerprint
    return hashlib.sha256(normalized[:80].encode()).hexdigest()[:32]

@app.route("/guard/rag/check", methods=["POST"])
def rag_check():
    """
    Called before embedding + retrieval fires.
    Request: { session_id, query }
    Response:
      { allow_query: true }
      { allow_query: false, reason, best_retrieval_score?, best_answer_text? }
    """
    body = request.get_json(force=True)
    session_id = body["session_id"]
    query      = body["query"]
    topic_h    = _topic_hash(query)
    now = time.time()

    with _db_lock:
        conn = _db()
        row = conn.execute(
            "SELECT query_count, best_retrieval_score, best_answer_text FROM rag_queries "
            "WHERE session_id=? AND topic_hash=?",
            (session_id, topic_h)
        ).fetchone()
        conn.close()

    if not row:
        return jsonify({"allow_query": True, "query_count": 0})

    query_count, best_score, best_answer = row[0], row[1], row[2]

    # Block if prior retrieval score was below threshold (KB doesn't have the answer)
    if best_score is not None and best_score < LOW_RETRIEVAL_THRESHOLD:
        return jsonify({
            "allow_query": False,
            "reason": "low_retrieval_score",
            "best_retrieval_score": best_score,
            "best_answer_text": best_answer,
            "message": "Our knowledge base may not cover this topic. A team member can help.",
        })

    # Block if topic has already been queried MAX_QUERIES_PER_TOPIC times
    if query_count >= MAX_QUERIES_PER_TOPIC:
        return jsonify({
            "allow_query": False,
            "reason": "topic_ceiling_reached",
            "query_count": query_count,
            "best_retrieval_score": best_score,
            "best_answer_text": best_answer,
        })

    return jsonify({
        "allow_query": True,
        "query_count": query_count,
        "remaining": MAX_QUERIES_PER_TOPIC - query_count,
    })

@app.route("/guard/rag/record", methods=["POST"])
def rag_record():
    """
    Called after retrieval + synthesis complete.
    Request: { session_id, query, retrieval_score (float 0-1), answer_text }
    """
    body = request.get_json(force=True)
    session_id     = body["session_id"]
    query          = body["query"]
    retrieval_score = float(body["retrieval_score"])
    answer_text    = body.get("answer_text", "")
    topic_h        = _topic_hash(query)
    now = time.time()

    with _db_lock:
        conn = _db()
        row = conn.execute(
            "SELECT query_count, best_retrieval_score, best_answer_text FROM rag_queries "
            "WHERE session_id=? AND topic_hash=?",
            (session_id, topic_h)
        ).fetchone()
        prev_count  = row[0] if row else 0
        prev_best   = row[1] if row else None
        prev_answer = row[2] if row else ""
        # Keep the highest score seen so far and the answer that produced it
        is_new_best = prev_best is None or retrieval_score >= prev_best
        new_best_score  = retrieval_score if is_new_best else prev_best
        new_best_answer = answer_text if is_new_best else prev_answer

        conn.execute(
            """INSERT INTO rag_queries
                   (session_id, topic_hash, query_count, best_retrieval_score, best_answer_text, last_ts)
               VALUES (?, ?, ?, ?, ?, ?)
               ON CONFLICT(session_id, topic_hash) DO UPDATE SET
                   query_count=excluded.query_count,
                   best_retrieval_score=excluded.best_retrieval_score,
                   best_answer_text=excluded.best_answer_text,
                   last_ts=excluded.last_ts""",
            (session_id, topic_h, prev_count + 1, new_best_score, new_best_answer, now)
        )
        conn.execute("DELETE FROM rag_queries WHERE last_ts < ?", (now - SESSION_TTL,))
        conn.commit()
        conn.close()

    return jsonify({
        "recorded": True,
        "query_count": prev_count + 1,
        "best_retrieval_score": new_best_score,
    })

The guard records best_retrieval_score — the highest vector similarity score seen across all top-K retrieved chunks for this topic in this session. This score is the primary signal: if the best retrieval score after the first query was 0.31, the knowledge base has no relevant content for this topic, and a second query will not produce a score above 0.31 (the same documents exist in the index). The guard blocks the second query immediately and returns best_answer_text — the text of the best answer found so far, which the action can display to the user alongside a "connecting you with a team member" message. The query_count ceiling is a secondary safety: for topics where the first query produced a moderate score (above LOW_RETRIEVAL_THRESHOLD but unsatisfying), one re-query is allowed for genuine topic refinement, but the third query on the same topic hash is blocked unconditionally.
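The check and record endpoints are meant to bracket the expensive work. As a sketch of that call order, the function below takes the guard calls and the RAG pipeline as injected callables so it runs without a network; answer_with_guard, run_rag, and the returned dict shape are hypothetical names, while the request and response fields follow the endpoints above.

```python
from typing import Callable, Tuple


def answer_with_guard(
    session_id: str,
    query: str,
    check: Callable[[dict], dict],
    run_rag: Callable[[str], Tuple[float, str]],
    record: Callable[[dict], dict],
) -> dict:
    """Run one guarded RAG turn.

    check and record stand in for POSTs to /guard/rag/check and
    /guard/rag/record. run_rag stands in for embed + retrieve +
    synthesize and returns (top_retrieval_score, answer_text).
    """
    verdict = check({"session_id": session_id, "query": query})
    if not verdict.get("allow_query"):
        # Blocked: no embedding or LLM call; reuse the best prior answer, if any.
        return {
            "billed_calls": 0,
            "answer": verdict.get("best_answer_text"),
            "handoff": True,
            "reason": verdict.get("reason"),
        }
    score, answer = run_rag(query)
    record({
        "session_id": session_id,
        "query": query,
        "retrieval_score": score,
        "answer_text": answer,
    })
    # One embedding call plus one synthesis call were billed for this turn.
    return {"billed_calls": 2, "answer": answer, "handoff": False, "reason": None}
```

Injecting the callables keeps the ordering testable with fakes; in an action server they would wrap HTTP requests to the guard and the real retrieval pipeline.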

Failure Mode 4 — ReminderScheduled Event Storms

Rasa's event system includes ReminderScheduled — an event dispatched by a custom action that schedules a callback: at a future time, Rasa calls a specified action with the original conversation context. This is the standard pattern for async operations: a user asks the bot to check on a pending order shipment; the bot dispatches a ReminderScheduled event pointing to action_check_shipment_status with a 30-second delay; when the delay expires, Rasa calls action_check_shipment_status, which queries the logistics API and synthesizes a status update. If the callback action calls an LLM to generate a natural-language status summary, each reminder firing = one LLM API call.

The failure mode emerges from Rasa's policy ensemble. When a custom action dispatches a ReminderScheduled event and then encounters a subsequent policy disagreement (TED and RulePolicy disagree on the next action; the resolver applies priority rules that re-trigger the custom action via a form re-run), the scheduling action may execute a second time before the first reminder fires. This schedules a second, overlapping reminder. If the user re-sends the triggering intent (impatience, thinking the bot didn't respond, network retry), the action fires a third time and schedules a third reminder. When all three reminders fire, each independently calling the logistics API and the LLM synthesis, the user receives three nearly identical notifications and the billing account records three LLM API calls for one user request. At 100,000 async-capable sessions per month where 2% encounter duplicate reminder scheduling, with an average of 2.3 reminders scheduled per affected session (1.3 of them redundant), the reminder storm generates 2,600 excess LLM calls per month from this failure mode alone (100,000 × 0.02 × 1.3).
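The 2,600 figure follows if the 2.3 average counts all reminders scheduled in an affected session, so that 1.3 of them are redundant. The short calculation below works under that reading and also shows the alternative; the variable names are illustrative.

```python
monthly_sessions = 100_000
duplicate_rate = 0.02                   # share of sessions with duplicate scheduling
reminders_per_affected_session = 2.3    # read as total reminders, not extras

affected_sessions = round(monthly_sessions * duplicate_rate)
# The first reminder in each session is legitimate; only the rest are excess.
excess_per_session = reminders_per_affected_session - 1
excess_calls = round(affected_sessions * excess_per_session)

# Alternative reading: 2.3 redundant reminders on top of the legitimate one.
alt_excess_calls = round(affected_sessions * reminders_per_affected_session)
print(affected_sessions, excess_calls, alt_excess_calls)
```

Under the alternative reading the excess would be 4,600 calls per month, so it is worth being explicit about which quantity your own telemetry measures.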

The reminder rule: Every ReminderScheduled dispatch must check a session-scoped reminder registry before scheduling. If a reminder for the same action_name and intent_name pair is already scheduled in this session (i.e., the scheduling action ran once and the reminder has not yet fired), block the second scheduling call and return the existing reminder's metadata. When the callback action fires, record the reminder as consumed so subsequent re-scheduling is allowed if the user genuinely needs another async check. Use a SQLite or Redis key with a TTL equal to the reminder delay plus a buffer — this naturally expires stale reminder locks when the callback has fired or when the conversation times out.

Python — ReminderDeduplicationGuard: session-scoped reminder registry with TTL (Flask endpoint)
import time
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
_db_lock = threading.Lock()

def _db():
    conn = sqlite3.connect("/tmp/rasa_guards.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scheduled_reminders (
            session_id    TEXT NOT NULL,
            reminder_key  TEXT NOT NULL,   -- action_name + ":" + intent_name
            scheduled_at  REAL NOT NULL,
            fire_at       REAL NOT NULL,   -- scheduled_at + delay_seconds
            consumed      INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (session_id, reminder_key)
        )
    """)
    conn.commit()
    return conn

REMINDER_DEDUP_BUFFER = 10  # seconds of extra TTL past the fire time

@app.route("/guard/reminder/check", methods=["POST"])
def reminder_check():
    """
    Called before dispatching ReminderScheduled from a custom action.
    Request: { session_id, action_name, intent_name, delay_seconds }
    Response:
      { allow_schedule: true, fire_at }
      { allow_schedule: false, reason, existing_scheduled_at,
        existing_fire_at, seconds_until_fire }
    """
    body = request.get_json(force=True)
    session_id    = body["session_id"]
    action_name   = body["action_name"]
    intent_name   = body.get("intent_name", "")
    delay_seconds = float(body.get("delay_seconds", 30))
    reminder_key  = f"{action_name}:{intent_name}"
    now = time.time()

    with _db_lock:
        conn = _db()

        # Prune expired entries (fire_at + buffer has passed)
        conn.execute(
            "DELETE FROM scheduled_reminders WHERE fire_at + ? < ?",
            (REMINDER_DEDUP_BUFFER, now)
        )
        conn.commit()

        row = conn.execute(
            "SELECT scheduled_at, fire_at, consumed FROM scheduled_reminders "
            "WHERE session_id=? AND reminder_key=?",
            (session_id, reminder_key)
        ).fetchone()

        if row and not row[2]:  # existing active (non-consumed) reminder
            conn.close()
            return jsonify({
                "allow_schedule": False,
                "reason": "already_scheduled",
                "existing_scheduled_at": row[0],
                "existing_fire_at": row[1],
                "seconds_until_fire": max(0, row[1] - now),
            })

        # No active reminder; register this one
        fire_at = now + delay_seconds
        conn.execute(
            """INSERT INTO scheduled_reminders
                   (session_id, reminder_key, scheduled_at, fire_at, consumed)
               VALUES (?, ?, ?, ?, 0)
               ON CONFLICT(session_id, reminder_key) DO UPDATE SET
                   scheduled_at=excluded.scheduled_at,
                   fire_at=excluded.fire_at,
                   consumed=0""",
            (session_id, reminder_key, now, fire_at)
        )
        conn.commit()
        conn.close()

    return jsonify({
        "allow_schedule": True,
        "fire_at": now + delay_seconds,
    })

@app.route("/guard/reminder/consume", methods=["POST"])
def reminder_consume():
    """
    Called at the start of the callback action (after the reminder fires).
    Marks the reminder as consumed so re-scheduling is allowed.
    Request: { session_id, action_name, intent_name }
    """
    body = request.get_json(force=True)
    session_id  = body["session_id"]
    reminder_key = f"{body['action_name']}:{body.get('intent_name', '')}"

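    with _db_lock:
        conn = _db()
        cur = conn.execute(
            "UPDATE scheduled_reminders SET consumed=1 WHERE session_id=? AND reminder_key=?",
            (session_id, reminder_key)
        )
        conn.commit()
        updated = cur.rowcount  # 0 if the row was already pruned or never registered
        conn.close()

    # Report whether a matching reminder row was actually marked as consumed
    return jsonify({"consumed": updated > 0})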

The guard maintains a scheduled_reminders table with one row per active reminder per session. The reminder_key combines action_name and intent_name to identify what the reminder will do when it fires. This allows different reminder types to be active in the same session simultaneously (one checking shipment status, another checking payment status) while deduplicating repeats of the same type. The consumed flag is set when the callback action fires, marking the reminder as spent. If the callback is never invoked (conversation times out, user closes the channel), the row expires once fire_at + REMINDER_DEDUP_BUFFER has passed. The prune that runs at the start of every reminder_check call removes those expired rows, which keeps the table bounded without a separate background job. In the callback action, calling /guard/reminder/consume at the start ensures that if the user legitimately triggers another async check after the first completes, the new scheduling call will be allowed.
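On the action server side, the two endpoints might be called through a pair of small helpers like the ones below. This is a sketch under stated assumptions: GUARD_BASE_URL, post_json, may_schedule_reminder, and mark_reminder_consumed are hypothetical names, the sidecar address is a placeholder, and the fail-open behavior when the guard is unreachable is a design choice rather than something the guard itself prescribes.

```python
import json
import urllib.request

GUARD_BASE_URL = "http://localhost:8099"  # hypothetical sidecar address


def post_json(url, payload, timeout=2.0):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def may_schedule_reminder(session_id, action_name, intent_name,
                          delay_seconds=30, post=post_json):
    """Return True if the scheduling action should dispatch ReminderScheduled."""
    payload = {
        "session_id": session_id,
        "action_name": action_name,
        "intent_name": intent_name,
        "delay_seconds": delay_seconds,
    }
    try:
        verdict = post(f"{GUARD_BASE_URL}/guard/reminder/check", payload)
    except (OSError, ValueError):
        # Assumption: fail open, so a guard outage never blocks a real reminder.
        return True
    return bool(verdict.get("allow_schedule", True))


def mark_reminder_consumed(session_id, action_name, intent_name, post=post_json):
    """Call at the start of the callback action; False if the guard is unreachable."""
    payload = {
        "session_id": session_id,
        "action_name": action_name,
        "intent_name": intent_name,
    }
    try:
        post(f"{GUARD_BASE_URL}/guard/reminder/consume", payload)
        return True
    except (OSError, ValueError):
        return False
```

Inside a custom action's run method, tracker.sender_id is the natural session_id. The scheduling action would return the ReminderScheduled event only when may_schedule_reminder returns True, and the callback action would call mark_reminder_consumed before querying the logistics API. Failing closed instead (returning False on error) trades missed reminders for a hard cost ceiling.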

Guard Integration: State Table

| Failure Mode | Guard Class | Ceiling / Key Signal | What to Monitor |
|---|---|---|---|
| Form validation spiral: LLM-powered re-ask on every failed slot extraction | FormSlotFillGuard | 3 LLM re-asks per slot per session; structural pre-check blocks signal-absent inputs immediately | signal_absent rate per slot (high rate = training data gap); ceiling_reached rate per slot (high rate = format UX problem) |
| DIET fallback cascade: LLM classifier called on every DIET confidence miss | DIETFallbackGuard | 4 LLM fallback calls per session; utterance-hash cache dedups repeated queries | session_ceiling_reached sessions per day (export utterances as DIET retraining candidates); cache hit rate (high = same user repeating misunderstood intent) |
| RAG re-query loop: embedding + LLM synthesis on every unsatisfied knowledge query | RAGReQueryGuard | 1 genuine rephrase allowed; low-retrieval-score topics blocked immediately; 2 queries per topic per session ceiling | low_retrieval_score topic_hashes (content gaps in knowledge base); best_retrieval_score distribution (below 0.45 = knowledge base not covering query domain) |
| ReminderScheduled storm: duplicate reminder scheduling fires N concurrent LLM callbacks | ReminderDeduplicationGuard | One active reminder per (session, action_name, intent_name) at a time; TTL-based expiry; consumed flag gates re-scheduling | already_scheduled blocks per session (identifies actions where policy disagreement triggers duplicate dispatch); reminder keys with high block rates (audit their dispatching action for idempotency) |

Deployment Checklist

  1. Audit all CustomActionAsk actions — every one that calls an LLM needs to call /guard/slot-fill first. Add structural pre-check patterns for each slot type in your domain. Start with the three most common entity types in your forms.
  2. Log DIET confidence scores per session — add a middleware to your action server that writes intent_confidence to a structured log. Any session where DIET confidence averages below 0.7 across all turns is a retraining candidate, not a candidate for permanent LLM fallback.
  3. Add retrieval score logging to RAG actions — the max(chunk.score for chunk in retrieved_chunks) value is the primary diagnostic. If your production p50 retrieval score is below 0.5, your knowledge base coverage is materially insufficient for the query distribution your users are sending.
  4. Audit all actions that dispatch ReminderScheduled — wrap each scheduling call with a check to /guard/reminder/check before dispatching. Add a /guard/reminder/consume call at the start of each callback action.
  5. Set MAX_SESSION_FALLBACKS conservatively — start at 3 for the DIET fallback guard. Monitor the session_ceiling_reached rate. If it's above 5% of sessions, the LLM fallback is becoming load-bearing rather than exceptional — retrain DIET, don't raise the ceiling.
  6. Export guard block logs weekly — blocked slot fills, DIET fallback ceiling hits, low-retrieval-score topics, and duplicate reminder detections are all training and content signals. Running a weekly export and filing tickets against the training pipeline is what prevents the guards from becoming permanent cost caps on a broken system.
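Checklist items 2 and 3 both reduce to small aggregations over structured logs. The sketch below assumes the logs have already been parsed into plain Python values; the function names and the (session_id, intent_confidence) pair layout are illustrative, not part of Rasa or of the guards.

```python
from collections import defaultdict
from statistics import mean, median


def retraining_candidates(turn_logs, threshold=0.7):
    """Return session IDs whose mean intent confidence is below the threshold.

    turn_logs: iterable of (session_id, intent_confidence) pairs, one per turn.
    """
    by_session = defaultdict(list)
    for session_id, confidence in turn_logs:
        by_session[session_id].append(confidence)
    return sorted(s for s, scores in by_session.items() if mean(scores) < threshold)


def best_retrieval_score(chunk_scores):
    """Highest similarity score among retrieved chunks; 0.0 if nothing came back."""
    return max(chunk_scores, default=0.0)


def retrieval_p50(best_scores):
    """Median of the per-query best retrieval scores."""
    return median(best_scores)


if __name__ == "__main__":
    logs = [("a", 0.9), ("a", 0.8), ("b", 0.5), ("b", 0.8)]
    print(retraining_candidates(logs))
    print(retrieval_p50([0.3, 0.61, 0.7]))
```

A p50 below 0.5 from retrieval_p50 is the coverage warning described in item 3, and the sessions returned by retraining_candidates are the ones item 2 routes to retraining rather than to permanent LLM fallback.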

Frequently Asked Questions

Rasa is an open-source ML framework, not an LLM platform. How does it accumulate LLM API costs?

Rasa's core ML components (DIET Classifier, TED Policy, Entity extractors) run locally and have no per-inference API cost once trained. LLM costs come entirely from your custom action server's outbound calls: GPT-4o or Claude for dynamic slot-filling re-asks in CustomActionAsk actions, LLM classifiers in fallback actions, RAG synthesis in knowledge-base query actions, and LLM summarization in async callback actions. Teams migrating from static response templates to LLM-generated responses inside Rasa's action server can go from zero LLM API cost to hundreds of dollars per month without changing Rasa's configuration at all — the cost surface is entirely in the action server's HTTP calls to OpenAI or Anthropic.

Could the guards be implemented as Rasa custom action middleware rather than separate Flask endpoints?

Yes. Each guard can be refactored as a Python class instantiated once per action server process, with SQLite replaced by an in-process dictionary (for single-instance deployments) or a shared Redis store (for horizontally scaled action servers). The Flask endpoint pattern is shown here because it allows the guards to run as a separate sidecar service, which is useful when you want to deploy, scale, and monitor the guards independently of the action server or share them across several action server processes. The trade-off is one HTTP round trip per guard check on the action's critical path. For single-instance deployments, in-process guards with thread-safe dictionaries have lower latency and no HTTP round-trip overhead. For horizontally scaled deployments (multiple action server pods), a shared Redis store is required because session state must be visible across all pods handling turns from the same conversation.
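As an illustration of the in-process option, the reminder guard might be reduced to a lock-protected dictionary. This is a minimal sketch for a single action server process, not a drop-in replacement: the class name InProcessReminderGuard and the injectable clock parameter are assumptions, and unlike the SQLite version it deletes an entry on consume instead of setting a consumed flag.

```python
import threading
import time


class InProcessReminderGuard:
    """Single-process stand-in for the /guard/reminder endpoints (hypothetical)."""

    def __init__(self, buffer_seconds=10.0, clock=time.time):
        self._lock = threading.Lock()
        self._active = {}  # (session_id, reminder_key) -> expiry timestamp
        self._buffer = buffer_seconds
        self._clock = clock  # injectable so tests can control time

    def check(self, session_id, action_name, intent_name, delay_seconds=30.0):
        """Return True and register the reminder if none is active for this key."""
        key = (session_id, f"{action_name}:{intent_name}")
        now = self._clock()
        with self._lock:
            # Drop entries whose fire time plus buffer has passed.
            for stale in [k for k, expiry in self._active.items() if expiry < now]:
                del self._active[stale]
            if key in self._active:
                return False
            self._active[key] = now + delay_seconds + self._buffer
            return True

    def consume(self, session_id, action_name, intent_name):
        """Release the key when the callback fires; True if an entry existed."""
        key = (session_id, f"{action_name}:{intent_name}")
        with self._lock:
            return self._active.pop(key, None) is not None
```

Instantiate one guard at module level in the action server so every action shares it. State is lost on restart and is invisible to other processes, which is why this only suits single-instance deployments.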

Does RunGuard's SDK integrate with Rasa's action server directly?

The RunGuard SDK integrates at the LLM call boundary inside your custom actions, not at the Rasa framework level. Install with pip install runguard and wrap your LLM calls: from runguard import guard; result = guard(session_id=tracker.sender_id, fn=lambda: openai_client.chat.completions.create(...)). The SDK applies the same ceiling, deduplication, and circuit-breaker patterns as the guards above, with telemetry emitted to the RunGuard dashboard for cross-session cost visibility. The SDK is framework-agnostic: the same guard wrapping pattern works whether your Rasa action is a FormValidationAction, a CustomActionAsk, a RAG action, or a reminder callback action.

How should MAX_SLOT_ATTEMPTS be calibrated for different slot types?

The ceiling should be set based on the maximum number of re-asks that produce a materially different user behavior. For email addresses and phone numbers, empirical data from production Rasa deployments shows that 80%+ of users who will successfully provide the value do so within 2 re-asks; the 3rd re-ask has below 15% success rate on the same session. For order IDs and account numbers, where users must retrieve the value from an email or a physical document, a higher ceiling (4–5) is warranted because retrieval latency is the bottleneck, not user understanding. For date slots, 2 re-asks is usually sufficient — if the user cannot provide a date in any parseable format after 2 attempts, routing to a calendar-based picker UI (a static response with a calendar link) is more effective than a third LLM-generated re-ask. Monitor ceiling_reached events by slot type and compare against slot fill rates after ceiling hits — adjust the ceiling upward only if a meaningful proportion of ceiling-reached sessions would have succeeded with one more attempt.
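One way to encode this calibration is a per-slot-type lookup plus a simple rule for when to raise a ceiling. The values below are one reading of the guidance above (2 for email, phone, and date slots; the low end of the 4–5 range for order IDs and account numbers; 3 as the default). The slot type names, the 20% threshold, and both function names are assumptions to adapt to your own domain and data.

```python
DEFAULT_MAX_SLOT_ATTEMPTS = 3

# Hypothetical slot type names; ceilings follow the calibration guidance above.
MAX_SLOT_ATTEMPTS_BY_TYPE = {
    "email": 2,
    "phone_number": 2,
    "date": 2,
    "order_id": 4,
    "account_number": 4,
}


def max_attempts_for(slot_type):
    """Return the re-ask ceiling for a slot type, falling back to the default."""
    return MAX_SLOT_ATTEMPTS_BY_TYPE.get(slot_type, DEFAULT_MAX_SLOT_ATTEMPTS)


def should_raise_ceiling(ceiling_reached_sessions, filled_after_ceiling,
                         min_share=0.2):
    """Suggest raising the ceiling only if enough capped sessions later filled the slot.

    min_share is an assumed cut-off for "a meaningful proportion".
    """
    if ceiling_reached_sessions == 0:
        return False
    return filled_after_ceiling / ceiling_reached_sessions >= min_share
```

Feed should_raise_ceiling from the ceiling_reached events per slot type and the count of those sessions in which the slot was eventually filled (for example after a human handoff or a picker UI).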

How do the guards interact with Rasa X / Rasa Enterprise's conversation review and retraining pipeline?

The guards generate structured block logs that complement Rasa X's conversation review workflow. Export the diet_fallback table's utterances where llm_result contains a high-confidence classification: these are utterances where the LLM fallback successfully identified the intent but DIET failed, and they are directly usable as new training examples for DIET via Rasa X's "Flag for Review" flow. Export the rag_queries table's topic_hash values where best_retrieval_score < 0.45: these are knowledge base content gaps. Export the slot_attempts table's session_id/slot_name pairs where attempts > MAX_SLOT_ATTEMPTS: these are form UX problems where the slot-filling UI needs redesign, not more re-asks. Together, the guards' block logs feed three distinct improvement pipelines: NLU training data, knowledge base content, and form UX design.
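The knowledge-base export could look like the sketch below. It assumes rag_queries has at least topic_hash and best_retrieval_score columns, as named above; the exact schema, the grouping, and the function name export_content_gaps are assumptions, and the demo builds a throwaway in-memory table rather than reading the real guard database.

```python
import sqlite3

LOW_RETRIEVAL_THRESHOLD = 0.45


def export_content_gaps(conn, threshold=LOW_RETRIEVAL_THRESHOLD):
    """List topic hashes whose best retrieval score fell below the threshold."""
    rows = conn.execute(
        "SELECT topic_hash, COUNT(*) AS hits, MIN(best_retrieval_score) "
        "FROM rag_queries WHERE best_retrieval_score < ? "
        "GROUP BY topic_hash ORDER BY hits DESC, topic_hash",
        (threshold,),
    ).fetchall()
    return [
        {"topic_hash": r[0], "low_score_queries": r[1], "worst_score": r[2]}
        for r in rows
    ]


if __name__ == "__main__":
    # Demo data only; the column layout is an assumption about the guard's table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rag_queries "
        "(session_id TEXT, topic_hash TEXT, best_retrieval_score REAL)"
    )
    conn.executemany(
        "INSERT INTO rag_queries VALUES (?, ?, ?)",
        [("s1", "t-returns", 0.31), ("s2", "t-returns", 0.40), ("s3", "t-billing", 0.82)],
    )
    for gap in export_content_gaps(conn):
        print(gap)
```

Sorting by the number of low-score queries puts the most frequently missed topics at the top of the weekly export, which is a reasonable order for filing knowledge base content tickets.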

Stop paying for loops your users never see

RunGuard's SDK wraps LLM calls inside your Rasa action server with per-session ceilings, structural pre-checks, and cross-session circuit breakers — the same patterns as the guards above, with a dashboard for cost attribution and block-log export for your retraining pipeline.