June 21, 2026 Vapi.ai Voice AI Cost Control Loop Detection

Vapi.ai Cost Control: VAD Silence Loops, Endpointing Interrupt Storms, Webhook Cascades, and Outbound Call Retry Amplification

Vapi.ai has emerged as the dominant infrastructure layer for AI voice agents in 2025–2026. Teams building outbound sales agents, inbound support lines, appointment schedulers, and interactive IVR replacements use Vapi to orchestrate the four components that make a voice call work: Speech-to-Text (Deepgram, AssemblyAI, or Gladia for transcription), a Large Language Model (OpenAI GPT-4o, Anthropic Claude, or Google Gemini for conversation logic), Text-to-Speech (ElevenLabs, PlayHT, Cartesia, or Deepgram TTS for audio synthesis), and the Vapi platform layer itself (WebRTC session management, Voice Activity Detection, turn-taking orchestration, telephony via Twilio or Vonage). You pay for all four simultaneously for every minute a call is active.

This billing model creates a cost profile that text-based agent platforms do not share: every loop, retry, or architectural misfire that extends active call duration raises the cost of all four layers at once. A text agent loop that sends an extra LLM call costs one incremental API call. The equivalent voice agent failure, a VAD silence loop that turns a 60-second conversation into a 5-minute active session, costs roughly 5× the STT audio processing, 5× the TTS synthesis, 5× the platform minute fee, and at least 5× the LLM tokens (more in practice, since the context grows with each empty exchange). The extra cost lands on every layer simultaneously rather than on one. Four structural failure modes in the Vapi architecture are responsible for the majority of unexpected voice AI cost spikes:

VAD silence loops — Vapi's Voice Activity Detection fires on thinking pauses, slow network jitter, or ambient noise, treating the detected audio as a user utterance. STT transcribes it as empty or noise; the LLM interprets the empty input and generates a filler response ("I didn't catch that — could you repeat?"); TTS synthesizes and plays the response; the end of TTS playback creates a new silence gap that re-triggers VAD. Without a consecutive-empty-transcript guard that intercepts before the LLM call, each VAD misfire chains into the next, extending the active session for as long as VAD keeps firing.
Endpointing interrupt storms — Vapi uses STT endpointing (a confidence score from the STT provider indicating the user has stopped speaking) to determine when to route the transcription to the LLM. Aggressive endpointing thresholds or STT provider-side endpointing misfires cut off the user's utterance mid-sentence. The truncated utterance routes to the LLM, which produces a response to a partial question; the user must repeat the full utterance; the LLM consumes its full context window again for the repeated input; TTS synthesizes the response again. Each interruption cycle adds at minimum one full LLM-TTS round-trip plus the time the user spends repeating themselves, both of which directly extend the active call duration and its four-way billing.
serverUrl webhook cascades — Vapi's tool calling works by routing LLM function call decisions to a webhook endpoint (your serverUrl handler). If your handler exceeds Vapi's response timeout (timeoutSeconds, default 20s), Vapi may proceed with a degraded "no tool result" response or, in some configurations, retry the webhook. Slow handlers — those calling a downstream CRM, payment API, or database — routinely exceed the timeout under load. The session remains active (and billing) during the entire wait. Webhook retries cause non-idempotent operations (database inserts, Stripe charges, CRM record creation) to execute multiple times. The combination of extended active session + duplicate side effects is the most costly failure mode per incident.
Outbound call retry amplification — Vapi supports outbound call campaigns via its REST API. The standard pattern: your backend requests a call via POST /call/phone, listens for the end-of-call-report webhook, and re-queues if ended_reason === "no-answer". If your backend also re-queues on a timer and Vapi's webhook delivery is delayed (network retry, webhook queue backlog), the no-answer event arrives after you've already re-queued and a second call is in progress to the same number. N calls in progress to the same number means N simultaneous active sessions, each billing all four cost layers independently, producing N× the expected cost per attempted connection.

Failure Mode 1 — VAD Silence Loops

Vapi's Voice Activity Detection engine continuously monitors the audio stream from the caller's line. When VAD detects audio energy above a configured threshold followed by a period of silence, it signals turn-end: the audio segment is sent to the STT provider for transcription. This detection is designed for human speech patterns — a burst of speech, a natural pause, then silence. It does not discriminate between speech and ambient noise, brief thinking pauses with background noise, or the acoustic artifacts introduced by poor VoIP connections (echo, jitter-induced gaps, DTMF tones).

When VAD misfires on non-speech audio, the resulting STT transcription is empty, a single punctuation mark, a noise description ("mm-hmm" from background audio), or a few characters of ambient sound. The LLM receives this as the user's next turn. An empty or near-empty user utterance after the previous LLM response triggers the LLM's conversational recovery behavior: it generates a clarifying prompt ("I'm sorry, I didn't catch that — could you say that again?", "Still there?", or a context-dependent re-statement of the previous question). Vapi's TTS engine synthesizes and speaks this recovery prompt. When TTS playback ends, the audio stream returns to silence. VAD detects the post-TTS silence and, if ambient noise or line conditions persist, fires again on the next audio fragment. The cycle repeats.

The key architectural issue is that nothing in the Vapi → STT → LLM → TTS pipeline has a built-in circuit breaker to detect that the last N LLM responses were all filler prompts generated in response to empty STT transcriptions. The LLM does not track how many times it has said "I didn't catch that" in a row. The VAD does not know the STT returned empty on the last two cycles. The cost accumulates silently: each cycle charges STT for the transcribed audio seconds, the LLM for re-processing the whole conversation history (which grows with each exchange) plus the generated reply, and TTS for synthesizing the filler message.

A 2-minute VAD silence loop on a GPT-4o voice agent (with a 4,096-token context window, using Deepgram STT and ElevenLabs TTS) costs approximately:

STT: ~$0.006/minute × 12 VAD-fire cycles × 10 seconds each (2 minutes of audio) = ~$0.012
LLM: ~$0.01/1K tokens output × 150 tokens/response × 12 cycles = ~$0.018
TTS: ~$0.18/1K characters × 60 chars/response × 12 cycles = ~$0.13
Vapi platform: ~$0.05/minute × 2 extra minutes = ~$0.10

That is roughly $0.26 per affected call, before counting the LLM input tokens for the growing context.

At scale — 500 concurrent calls where 20% hit VAD silence loops — the daily overcharge from this one failure mode alone runs into hundreds of dollars. The multiplier is not the per-cycle cost; it's the number of concurrent calls experiencing the loop simultaneously.
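
The per-loop estimate can be recomputed from the listed rates with a few lines of Python. This is a minimal sketch, not a pricing tool: the rates are the illustrative figures from the estimate above, the STT rate is assumed to be per minute of audio, and the names LoopRates and silence_loop_cost are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopRates:
    """Illustrative rates from the estimate in the text; not official pricing."""
    stt_per_minute: float = 0.006            # assumption: billed per minute of audio
    llm_per_1k_output_tokens: float = 0.01
    tts_per_1k_chars: float = 0.18
    platform_per_minute: float = 0.05


def silence_loop_cost(cycles: int, seconds_per_cycle: float,
                      tokens_per_response: int, chars_per_response: int,
                      rates: LoopRates = LoopRates()) -> dict:
    """Return the per-layer and total cost (USD) of one VAD silence loop."""
    minutes = cycles * seconds_per_cycle / 60
    parts = {
        "stt": rates.stt_per_minute * minutes,
        "llm": rates.llm_per_1k_output_tokens * tokens_per_response * cycles / 1000,
        "tts": rates.tts_per_1k_chars * chars_per_response * cycles / 1000,
        "platform": rates.platform_per_minute * minutes,
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    # 12 cycles of 10 seconds each, 150 output tokens and 60 characters per filler reply.
    for layer, usd in silence_loop_cost(12, 10, 150, 60).items():
        print(f"{layer:>8}: ${usd:.4f}")
```

Multiplying the total by the number of affected calls per day gives the fleet-level overcharge. LLM input tokens are left out here, so the real figure is somewhat higher.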

The VAD guard rule: Track consecutive empty-or-noise transcriptions per session. If the last two STT results in a row were empty or fell below a minimum meaningful-transcript word count (a floor of 3 words is a reasonable starting point), block the LLM call and play a static pre-synthesized audio clip ("I'm having trouble hearing you. Please check your connection and try again.") or keep the line silent for a longer wait. Only resume after a valid transcript arrives. Do not generate a new LLM response to silence; do not synthesize a new TTS clip for each VAD misfire. Log the consecutive-empty count for monitoring: a session with 5+ consecutive empty transcriptions points to a VAD environment mismatch, not a user behavior pattern.

Python — VAPIVADGuard: consecutive empty-transcript detection with early LLM bypass (serverUrl webhook handler)

import re
import time
import sqlite3
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
_db_lock = threading.Lock()

# Words that count as meaningful content in a transcription.
# Transcriptions shorter than MIN_MEANINGFUL_WORDS skip the LLM call.
MIN_MEANINGFUL_WORDS = 3
# Consecutive empty transcriptions before we hard-block and play static audio.
CONSECUTIVE_EMPTY_CEILING = 2
# How long to hold the session in a "wait for real speech" state before ending the call.
EMPTY_HOLD_SECONDS = 8
# Pre-synthesized audio URL to play instead of calling TTS (avoid TTS cost on silence).
STATIC_SILENCE_AUDIO_URL = "https://storage.runguard.dev/static/vapi-silence-prompt.mp3"

def _db():
    conn = sqlite3.connect("/tmp/vapi_guards.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS vad_state (
            call_id              TEXT PRIMARY KEY,
            consecutive_empty    INTEGER NOT NULL DEFAULT 0,
            last_empty_ts        REAL,
            total_empty_cycles   INTEGER NOT NULL DEFAULT 0,
            started_at           REAL NOT NULL
        )
    """)
    conn.commit()
    return conn

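def _words(text: str) -> int:
    """Count alphabetic tokens of two or more letters.

    Single letters (including "I" and "a") and digits are ignored, so the
    count can be lower than the number of spoken words.
    """
    return len(re.findall(r"[a-zA-Z]{2,}", text or ""))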

@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    body = request.get_json(force=True)
    msg_type = body.get("message", {}).get("type")

    if msg_type == "assistant-request":
        return handle_assistant_request(body)
    if msg_type == "function-call":
        return handle_function_call(body)
    if msg_type == "end-of-call-report":
        cleanup_call(body)

    return jsonify({}), 200

def handle_assistant_request(body: dict) -> tuple:
    """
    Run the VAD guard before the LLM sees the transcript.

    Assumption: this handler is invoked once per user turn. Confirm which
    server message your Vapi configuration actually delivers per turn before
    relying on this hook.
    """
    call_id = body["message"]["call"]["id"]
    # Extract the most recent user transcript from the messages list.
    messages = body["message"].get("artifact", {}).get("messages", [])
    last_user_text = ""
    for m in reversed(messages):
        if m.get("role") == "user":
            last_user_text = m.get("message", "").strip()
            break

    word_count = _words(last_user_text)
    is_empty = word_count < MIN_MEANINGFUL_WORDS

    with _db_lock:
        conn = _db()
        row = conn.execute(
            "SELECT consecutive_empty, total_empty_cycles FROM vad_state WHERE call_id = ?",
            (call_id,)
        ).fetchone()

        now = time.time()
        if row is None:
            # First turn seen for this call: seed both counters from this turn
            # so the stored total matches the value returned below.
            consecutive = 1 if is_empty else 0
            total = consecutive
            conn.execute(
                "INSERT INTO vad_state (call_id, consecutive_empty, total_empty_cycles, last_empty_ts, started_at) VALUES (?, ?, ?, ?, ?)",
                (call_id, consecutive, total, now if is_empty else None, now)
            )
        else:
            consecutive = (row[0] + 1) if is_empty else 0
            total = row[1] + (1 if is_empty else 0)
            # COALESCE keeps the previous timestamp when this turn is not empty.
            conn.execute(
                """UPDATE vad_state
                   SET consecutive_empty = ?, total_empty_cycles = ?,
                       last_empty_ts = COALESCE(?, last_empty_ts)
                   WHERE call_id = ?""",
                (consecutive, total, now if is_empty else None, call_id)
            )
        conn.commit()
        conn.close()

    if is_empty and consecutive >= CONSECUTIVE_EMPTY_CEILING:
        # Block the LLM call so this silence cycle incurs no LLM token or TTS cost.
        # Assumption: the response shape below (the "response" object and the
        # "_runguard_*" diagnostic fields) is illustrative. Check the current Vapi
        # server-message documentation for the fields that are actually honored.
        return jsonify({
            "assistant": {
                "firstMessage": "",
                "voice": {"voiceId": "static"},
            },
            "response": {
                "role": "assistant",
                "content": "[VAD guard: consecutive empty transcriptions detected; holding for valid speech]"
            },
            # Diagnostic fields for your own logging and monitoring.
            "_runguard_vad_blocked": True,
            "_runguard_consecutive_empty": consecutive,
            "_runguard_total_empty": total
        }), 200

    # Valid transcript or first empty cycle: allow LLM call.
    return jsonify({}), 200

def cleanup_call(body: dict):
    call_id = body["message"]["call"]["id"]
    with _db_lock:
        conn = _db()
        row = conn.execute(
            "SELECT consecutive_empty, total_empty_cycles, started_at FROM vad_state WHERE call_id = ?",
            (call_id,)
        ).fetchone()
        if row:
            duration = time.time() - row[2]
            # Export to monitoring: high total_empty_cycles = VAD environment mismatch
            print(f"[VAD export] call={call_id} total_empty={row[1]} duration={duration:.0f}s")
        conn.execute("DELETE FROM vad_state WHERE call_id = ?", (call_id,))
        conn.commit()
        conn.close()

def handle_function_call(body: dict) -> tuple:
    # Handled in failure mode 3 — stub here.
    return jsonify({"result": "ok"}), 200
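
One limitation of a pure word-count floor is that short but valid replies ("yes", "no thanks") contain fewer than three words and would be counted as empty. A possible refinement, sketched here as an assumption and not as part of the handler above, is to treat a small allowlist of confirmations as meaningful speech. The names SHORT_REPLY_ALLOWLIST and is_empty_transcript are hypothetical, and the allowlist would need tuning per language and use case.

```python
import re

MIN_MEANINGFUL_WORDS = 3

# Hypothetical allowlist: short replies that are valid answers despite their length.
SHORT_REPLY_ALLOWLIST = {
    "yes", "no", "yeah", "yep", "nope", "okay", "ok", "sure",
    "correct", "right", "thanks", "stop", "hello",
}


def is_empty_transcript(text: str) -> bool:
    """Return True when a transcript looks like a VAD misfire and not like speech."""
    tokens = re.findall(r"[a-z']+", (text or "").lower())
    if not tokens:
        return True                      # nothing but punctuation or whitespace
    if any(tok in SHORT_REPLY_ALLOWLIST for tok in tokens):
        return False                     # short, but a recognizable answer
    return len(tokens) < MIN_MEANINGFUL_WORDS


if __name__ == "__main__":
    for sample in ["", ".", "mm-hmm", "yes", "no thanks", "I need to reschedule"]:
        print(repr(sample), is_empty_transcript(sample))
```

With this variant a caller answering "yes" to a confirmation question still reaches the LLM, while punctuation-only and filler-noise transcripts keep incrementing the consecutive-empty counter.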

Failure Mode 2 — Endpointing Interrupt Storms

Vapi uses the STT provider's endpointing signal to decide when a user has finished speaking. Endpointing is a confidence score emitted by the STT engine indicating that the user's utterance is complete, based on trailing silence duration, acoustic patterns, and (in some providers) semantic completion signals. Vapi's default endpointing configuration is tuned for conversational latency: it fires quickly to minimize the pause between the user stopping and the assistant responding, because long response latency in voice interactions feels unnatural. The tradeoff is that endpointing can fire before the user has finished a complex utterance.

Consider a user saying: "I need to reschedule my appointment from this Thursday to next Monday, and also confirm that the address is still the same as..." and pausing briefly before finishing the sentence. The STT endpointing fires on that pause, producing the transcript "I need to reschedule my appointment from this Thursday to next Monday, and also confirm that the address is still the same as." The LLM receives a truncated, grammatically incomplete sentence and generates a response to it. Meanwhile the user is about to say "the one on file." The TTS response plays over the user completing their sentence; the user hears a response that doesn't address what they were actually asking and must repeat the full request.

Each interrupt cycle costs:

One full LLM context-window token generation for the truncated input
One TTS synthesis for the partial-input response
The additional call time spent while the user formulates and repeats the complete request
Another LLM + TTS cycle for the repeated full input

A call where the user is interrupted three times on complex requests effectively processes 6 LLM + TTS round-trips instead of 3, and extends call duration proportionally. The per-minute cost multiplier compounds because longer calls produce longer LLM context windows (each turn is appended), making each subsequent LLM call more expensive in tokens than the previous one. A 6-minute call on a context-growing GPT-4o deployment can cost 40–60% more in LLM tokens than a 6-minute call with a stable context, purely because the growing history is re-sent on every interrupted turn.
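
The context-growth effect can be made concrete with a small calculation. The sketch below assumes the full history is re-sent as the prompt on every LLM call and that every user and assistant turn has the same size. The function name cumulative_input_tokens and the token sizes are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(round_trips: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Total prompt tokens billed when the whole history is re-sent on every LLM call."""
    total = 0
    history = system_tokens
    for _ in range(round_trips):
        history += tokens_per_turn   # the user's turn is appended to the history
        total += history             # the whole history is sent as the prompt
        history += tokens_per_turn   # the assistant's reply is appended afterwards
    return total


if __name__ == "__main__":
    clean = cumulative_input_tokens(3, system_tokens=500, tokens_per_turn=50)
    interrupted = cumulative_input_tokens(6, system_tokens=500, tokens_per_turn=50)
    print(clean, interrupted, round(interrupted / clean, 2))
```

With these assumed sizes, three round-trips send 1,950 prompt tokens in total and six send 4,800, about 2.5 times as many for twice the round-trips. The gap widens as turns get longer or the call continues.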

The endpointing guard rule: Track per-session interruption events, defined as turns where the user's message ends with an incomplete clause marker (a trailing preposition, conjunction, or article, optionally followed by a comma or period), or where the turn's word count falls more than 1.5 standard deviations below the rolling mean for this call session. At the ceiling (3 interruption events per session), widen the endpointing window by raising the transcriber's endpointing timeout so that a longer silence is required before the turn ends. Log the interruption count per call for STT provider comparison: a high interruption rate on one STT backend and a low rate on another is direct evidence for switching providers or adjusting the endpointing timeout.

Python — VAPIEndpointingGuard: incomplete-utterance detection and per-session interruption tracking

import re
import time
import sqlite3
import threading
import statistics

_db_lock = threading.Lock()

# Clause-end incompleteness markers — utterances ending with these are likely cut off.
INCOMPLETE_MARKERS = re.compile(
    r"\b(and|or|but|so|because|with|for|to|the|a|an|that|which|who|in|on|at|by|from|as)\s*[,.]?\s*$",
    re.IGNORECASE
)
MAX_INTERRUPTIONS_PER_SESSION = 3

def _db():
    conn = sqlite3.connect("/tmp/vapi_guards.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS endpointing_state (
            call_id            TEXT PRIMARY KEY,
            interruptions      INTEGER NOT NULL DEFAULT 0,
            turn_word_counts   TEXT NOT NULL DEFAULT '[]',
            started_at         REAL NOT NULL
        )
    """)
    conn.commit()
    return conn

def check_interruption(call_id: str, transcript: str) -> dict:
    """
    Returns a dict with:
      - is_interrupted: bool — whether this turn looks like an interrupted utterance
      - interruptions_so_far: int — total interruptions in this session
      - recommend_widen_endpointing: bool — whether to signal Vapi to widen the endpoint window
    """
    import json
    transcript = (transcript or "").strip()
    words = transcript.split()
    word_count = len(words)
    looks_incomplete = bool(INCOMPLETE_MARKERS.search(transcript))

    with _db_lock:
        conn = _db()
        row = conn.execute(
            "SELECT interruptions, turn_word_counts FROM endpointing_state WHERE call_id = ?",
            (call_id,)
        ).fetchone()

        # On the first turn there is no session history, so only the
        # trailing-word heuristic applies.
        well_below_mean = False
        if row is None:
            counts = []
            interruptions = 0
        else:
            counts = json.loads(row[1])
            interruptions = row[0]
            # Check whether this turn is significantly shorter than the session mean.
            if len(counts) > 2:
                mean_wc = statistics.mean(counts)
                stddev_wc = statistics.stdev(counts)
                well_below_mean = stddev_wc > 0 and word_count < (mean_wc - 1.5 * stddev_wc)

        is_interrupted = looks_incomplete or well_below_mean
        if is_interrupted:
            interruptions += 1
        counts.append(word_count)

        if row is None:
            conn.execute(
                "INSERT INTO endpointing_state (call_id, interruptions, turn_word_counts, started_at) VALUES (?, ?, ?, ?)",
                (call_id, interruptions, json.dumps(counts), time.time())
            )
        else:
            conn.execute(
                "UPDATE endpointing_state SET interruptions = ?, turn_word_counts = ? WHERE call_id = ?",
                (interruptions, json.dumps(counts[-20:]), call_id)  # Keep last 20 turns
            )
        conn.commit()
        conn.close()

    recommend_widen = interruptions >= MAX_INTERRUPTIONS_PER_SESSION
    return {
        # Reports both heuristics, matching the value used to count interruptions.
        "is_interrupted": is_interrupted,
        "interruptions_so_far": interruptions,
        "recommend_widen_endpointing": recommend_widen
    }

def build_vapi_assistant_config_with_endpointing(base_config: dict, interruptions: int) -> dict:
    """
    Injects a wider endpointing timeout into a Vapi assistant config
    when the per-session interruption ceiling is reached.
    Each additional interruption event widens the hold by 200ms.
    """
    if interruptions < MAX_INTERRUPTIONS_PER_SESSION:
        return base_config

    extra_ms = (interruptions - MAX_INTERRUPTIONS_PER_SESSION + 1) * 200
    base_timeout_ms = 700  # Vapi default ~500-700ms
    new_timeout_ms = min(base_timeout_ms + extra_ms, 1500)  # Cap at 1500ms

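    # Copy the nested transcriber dict as well, so the caller's base_config
    # is never mutated by the widening.
    config = dict(base_config)
    transcriber = dict(config.get("transcriber") or {})
    transcriber["endpointing"] = new_timeout_ms
    config["transcriber"] = transcriber
    return config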
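
A short standalone sketch can show how the two pieces above behave. It re-declares the trailing-word pattern and the widening formula so that it runs on its own. The helper names looks_incomplete and widened_timeout_ms are hypothetical, and the 700 ms base is the same assumed default used above.

```python
import re

# Same trailing-word pattern as the guard above, re-declared so this runs standalone.
INCOMPLETE_MARKERS = re.compile(
    r"\b(and|or|but|so|because|with|for|to|the|a|an|that|which|who|in|on|at|by|from|as)\s*[,.]?\s*$",
    re.IGNORECASE,
)
MAX_INTERRUPTIONS_PER_SESSION = 3
BASE_TIMEOUT_MS = 700   # assumed default, as in the config builder above
STEP_MS = 200
CAP_MS = 1500


def looks_incomplete(transcript: str) -> bool:
    """True when the transcript ends on a conjunction, preposition, or article."""
    return bool(INCOMPLETE_MARKERS.search((transcript or "").strip()))


def widened_timeout_ms(interruptions: int) -> int:
    """Endpointing timeout for a given interruption count (base value below the ceiling)."""
    if interruptions < MAX_INTERRUPTIONS_PER_SESSION:
        return BASE_TIMEOUT_MS
    extra = (interruptions - MAX_INTERRUPTIONS_PER_SESSION + 1) * STEP_MS
    return min(BASE_TIMEOUT_MS + extra, CAP_MS)


if __name__ == "__main__":
    print(looks_incomplete("and also confirm that the address is still the same as."))
    print(looks_incomplete("I'd like to book for Monday."))
    print(looks_incomplete("That is what I was looking for."))  # complete, but still flagged
    print([widened_timeout_ms(n) for n in range(2, 8)])
```

The third sample is a complete sentence that the pattern still flags, which is why the guard only acts after several events in one session instead of on a single match. The schedule climbs from 900 ms at the third interruption to the 1500 ms cap at the sixth.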

Failure Mode 3 — serverUrl Webhook Cascades

Vapi's tool calling routes LLM function-call decisions to a webhook endpoint you control. When the LLM decides to call a function — for example, check_appointment_availability(date: "2026-06-25", time: "14:00") — Vapi sends a POST request to your serverUrl with the function name and arguments. Your handler is expected to respond within the configured timeoutSeconds (default: 20 seconds for function calls) with the function result. Vapi holds the conversation's LLM context open during the wait. The session is active and billing from the moment the function call fires until your handler responds and the LLM generates its next response.

Handlers that call downstream services (CRM APIs to check customer records, scheduling systems to verify availability, payment processors to complete a charge, or internal databases under lock contention) routinely exceed the 20-second timeout during peak load, slow CRM response windows, or cold-start latency. When a handler exceeds the timeout, Vapi proceeds with a "no tool result" response to the LLM, which generates a degraded answer ("I wasn't able to check availability right now. Let me connect you with an agent."). The webhook request, however, may have already been received and may still be executing in your handler. If Vapi retries the webhook (which it may do in some error-state configurations), or if your backend's load balancer retries on a 502, the handler executes twice, both times calling the downstream service and both times potentially committing a side effect.

The cost implications are layered:

Extended active session cost: Every second of webhook wait time is a second of active Vapi billing. A handler with a 25-second P95 latency exceeds the 20-second timeout on at least 5% of function-call turns, and each of those turns holds the session open for the full 20 seconds across all four billing streams simultaneously.
Duplicate side effects: Non-idempotent operations (inserting an appointment row, charging a card, sending a confirmation SMS) executed twice produce incorrect business state. The cost is not just monetary — a user who gets charged twice or double-booked generates a support ticket and churn risk that far exceeds the dollar cost of the duplicate API call.
LLM context blowout: When Vapi proceeds without a tool result and the LLM generates a "sorry, can't complete that" message, the user typically re-requests the same action. The LLM now processes the full conversation history (including the failed tool call exchange) plus the user's repeated request. Each repeated function call turn grows the context window, increasing the per-LLM-call token cost for all subsequent turns in the session.

The webhook cascade rule: All serverUrl function handlers must be idempotent, with deduplication keyed on (call_id, tool_call_id). Vapi includes a unique tool_call_id in every function-call webhook payload. Storing and checking this ID before executing any side effect prevents duplicate execution when the same request is retried. Separately, every handler that calls a downstream service should have a per-request timeout ceiling tighter than Vapi's own timeout (at most 15s if Vapi's is 20s), with a fast-path fallback for the "downstream unavailable" case that returns a structured degraded result to Vapi immediately instead of holding the active session open for the full timeout duration.

Python — VAPIToolCallGuard: idempotent webhook handler with deduplication and active-session cost ceiling

import time
import sqlite3
import threading
import json
import httpx
from flask import Flask, request, jsonify

app = Flask(__name__)
_db_lock = threading.Lock()

# Vapi's function-call timeout is 20s. Our downstream timeout must be shorter.
DOWNSTREAM_TIMEOUT_SECONDS = 14
# If a tool call has been in-flight longer than this, return degraded immediately.
MAX_ACTIVE_TOOL_CALL_SECONDS = 16

def _db():
    conn = sqlite3.connect("/tmp/vapi_tool_guard.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tool_call_log (
            call_id       TEXT NOT NULL,
            tool_call_id  TEXT NOT NULL,
            function_name TEXT NOT NULL,
            arguments     TEXT NOT NULL,
            result        TEXT,
            status        TEXT NOT NULL DEFAULT 'in_progress',
            started_at    REAL NOT NULL,
            completed_at  REAL,
            PRIMARY KEY (call_id, tool_call_id)
        )
    """)
    conn.commit()
    return conn

@app.route("/vapi/function", methods=["POST"])
def handle_function_call():
    body = request.get_json(force=True)
    msg = body.get("message", {})
    call_id = msg["call"]["id"]
    fn = msg.get("functionCall", {})
    function_name = fn.get("name", "")
    arguments = fn.get("parameters", {})
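    # Fallback key (assumption: toolCallId may be absent in some payloads): derive it
    # from the function name and arguments so a retried request maps to the same row.
    # A timestamp-based fallback would give each retry a new key and defeat deduplication.
    tool_call_id = fn.get("toolCallId") or f"{call_id}-{function_name}-{json.dumps(arguments, sort_keys=True)}"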

    with _db_lock:
        conn = _db()
        existing = conn.execute(
            "SELECT result, status, started_at FROM tool_call_log WHERE call_id = ? AND tool_call_id = ?",
            (call_id, tool_call_id)
        ).fetchone()

        if existing is not None:
            result, status, started_at = existing
            if status == "completed":
                # Idempotency: return the cached result instead of re-executing.
                conn.close()
                return jsonify({"result": json.loads(result)}), 200

            # Any status other than in_progress means an earlier attempt already failed.
            previously_failed = status != "in_progress"
            if not previously_failed and (time.time() - started_at) > MAX_ACTIVE_TOOL_CALL_SECONDS:
                # The earlier attempt has been in flight too long. Mark it timed out.
                conn.execute(
                    "UPDATE tool_call_log SET status = 'timeout', completed_at = ? WHERE call_id = ? AND tool_call_id = ?",
                    (time.time(), call_id, tool_call_id)
                )
                conn.commit()
                previously_failed = True
            conn.close()

            if previously_failed:
                # Return the degraded result right away instead of re-executing.
                return jsonify({
                    "result": {
                        "success": False,
                        "error": "downstream_timeout",
                        "message": "The system couldn't complete that check right now. I can try again or connect you with an agent."
                    }
                }), 200
            # In flight and not yet timed out: this is a concurrent duplicate.
            return jsonify({"result": {"success": False, "error": "already_in_progress"}}), 200

        # Register this tool call before executing to block concurrent duplicates.
        conn.execute(
            "INSERT INTO tool_call_log (call_id, tool_call_id, function_name, arguments, status, started_at) VALUES (?, ?, ?, ?, 'in_progress', ?)",
            (call_id, tool_call_id, function_name, json.dumps(arguments), time.time())
        )
        conn.commit()
        conn.close()

    try:
        result = execute_function(function_name, arguments)
        with _db_lock:
            conn = _db()
            conn.execute(
                "UPDATE tool_call_log SET result = ?, status = 'completed', completed_at = ? WHERE call_id = ? AND tool_call_id = ?",
                (json.dumps(result), time.time(), call_id, tool_call_id)
            )
            conn.commit()
            conn.close()
        return jsonify({"result": result}), 200

    except httpx.TimeoutException:
        with _db_lock:
            conn = _db()
            conn.execute(
                "UPDATE tool_call_log SET status = 'downstream_timeout', completed_at = ? WHERE call_id = ? AND tool_call_id = ?",
                (time.time(), call_id, tool_call_id)
            )
            conn.commit()
            conn.close()
        return jsonify({
            "result": {
                "success": False,
                "error": "downstream_timeout",
                "message": "The system couldn't complete that check right now. I can try again or connect you with an agent."
            }
        }), 200

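def execute_function(name: str, args: dict):
    """Execute the actual downstream function call with a tight timeout.

    Note: httpx applies a single numeric timeout to each phase separately
    (connect, read, write, pool), so total wall-clock time can exceed
    DOWNSTREAM_TIMEOUT_SECONDS on a slow, trickling response.
    """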
    if name == "check_appointment_availability":
        with httpx.Client(timeout=DOWNSTREAM_TIMEOUT_SECONDS) as client:
            resp = client.post(
                "https://your-scheduling-api.example.com/availability",
                json=args,
                headers={"Authorization": "Bearer YOUR_KEY"}
            )
            resp.raise_for_status()
            return resp.json()
    raise ValueError(f"Unknown function: {name}")
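
Because a client-side timeout does not necessarily bound total wall-clock time, one option is to wrap the downstream call in a hard deadline and return the degraded result as soon as it passes. The following is a minimal sketch using only the standard library. call_with_deadline and DEGRADED_RESULT are hypothetical names, and the pool size is an arbitrary assumption.

```python
import concurrent.futures
import time

# Shared worker pool; the size is an arbitrary assumption for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

DEGRADED_RESULT = {"success": False, "error": "downstream_timeout"}


def call_with_deadline(fn, deadline_seconds: float, *args, **kwargs):
    """Run fn in a worker thread and give up waiting after deadline_seconds.

    The worker is NOT cancelled: it keeps running in the background, so any
    side effect it performs must still be protected by the idempotency log.
    Exceptions raised by fn before the deadline propagate to the caller.
    """
    future = _POOL.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        return dict(DEGRADED_RESULT)


if __name__ == "__main__":
    def slow_lookup():
        time.sleep(0.5)          # stands in for a slow CRM or scheduling API
        return {"success": True}

    print(call_with_deadline(slow_lookup, 0.1))   # degraded result after ~0.1s
    print(call_with_deadline(lambda: {"success": True, "slots": 3}, 0.1))
```

In the handler above, this wrapper would go around execute_function with a deadline below Vapi's own timeout. The design choice is to stop billing for the wait even though the abandoned request may still complete later.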

Failure Mode 4 — Outbound Call Retry Amplification

Vapi's outbound call API enables high-volume campaigns: dial a list of numbers, play a voice agent, collect structured data, and webhook the results to your backend. The standard implementation is a dispatch-and-listen loop: your backend calls POST /call/phone for each target, receives a call.id, waits for the end-of-call-report webhook, checks ended_reason, and re-queues on "no-answer", "voicemail", or "busy". This pattern is correct in the common case. It fails when the webhook delivery latency exceeds your retry eligibility window.

Vapi's status webhooks are delivered asynchronously. Under load — campaign peak hours, Vapi webhook queue backlog, your endpoint under high inbound traffic — delivery can be delayed by 15 to 90 seconds past the actual call end time. If your re-queue logic fires on a timer (e.g., "if a call hasn't returned a result within 30 seconds, assume no-answer and retry"), and Vapi's webhook arrives 45 seconds after the call actually ended, your backend queues a second call while (or immediately after) the first one completes. For a list of 1,000 numbers with a 15% no-answer rate and a 45-second webhook delay, you may generate 150 duplicate calls — each an independent active Vapi session billing all four cost streams simultaneously.

The amplification factor compounds if your retry logic is not idempotent at the phone-number level. A backend that tracks retry eligibility by call.id (rather than by phone number) may not detect that a new call was dispatched while the previous call's webhook was in transit, resulting in multiple in-flight calls to the same number. The user receives multiple calls from the same AI agent in rapid succession — a significant UX failure that generates carrier complaints and risks campaign flagging.

The outbound retry rule: Track outbound call state by phone number (E.164), not by call.id. Before dispatching a new call to any number, check whether a call to that number is currently in-progress (status: ringing, in-progress) or was completed within the last N minutes (your campaign's minimum re-contact window). Do not re-queue on a timer; re-queue only after receiving the authoritative end-of-call-report webhook with ended_reason. If your webhook receiver has strict deduplication (keyed on call.id), late-delivered webhooks for the same call are safe to ignore. Set a hard per-number maximum retry count (typically 2 attempts per 24-hour window) to bound the maximum cost amplification per number under webhook backlog conditions.
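
The rule above relies on a webhook receiver that deduplicates on call.id, so that a late or repeated end-of-call report cannot trigger a second re-queue. A minimal sketch of that receiver-side check follows. The table name seen_reports and the helpers open_store, record_end_of_call, and should_requeue are hypothetical; the retryable reasons are the three named earlier in this section.

```python
import sqlite3

RETRYABLE_REASONS = {"no-answer", "voicemail", "busy"}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store of end-of-call reports already processed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS seen_reports (
               call_id      TEXT PRIMARY KEY,
               ended_reason TEXT,
               received_at  REAL
           )"""
    )
    conn.commit()
    return conn


def record_end_of_call(conn: sqlite3.Connection, call_id: str,
                       ended_reason: str, received_at: float) -> bool:
    """Return True the first time a report for call_id is seen, False for repeats."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_reports (call_id, ended_reason, received_at) VALUES (?, ?, ?)",
        (call_id, ended_reason, received_at),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows changed means the call_id was already stored


def should_requeue(conn: sqlite3.Connection, call_id: str,
                   ended_reason: str, received_at: float) -> bool:
    """Re-queue only on the first report for a call, and only for retryable reasons."""
    first_time = record_end_of_call(conn, call_id, ended_reason, received_at)
    return first_time and ended_reason in RETRYABLE_REASONS


if __name__ == "__main__":
    store = open_store()
    print(should_requeue(store, "call-1", "no-answer", 1000.0))   # first report
    print(should_requeue(store, "call-1", "no-answer", 1045.0))   # late duplicate
```

A positive answer here is only one input: the phone-number checks described in the rule still decide whether the call is actually dispatched.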

Python — VAPIOutboundCallGuard: phone-number-keyed deduplication with in-progress call registry

import time
import sqlite3
import threading
import httpx
from flask import Flask, request, jsonify

app = Flask(__name__)
_db_lock = threading.Lock()

# Minimum gap between call attempts to the same number (seconds).
MIN_RETRY_GAP_SECONDS = 300  # 5 minutes
# Maximum attempts per number per 24-hour window.
MAX_ATTEMPTS_PER_DAY = 2
# A call is considered in-progress until Vapi delivers the end-of-call webhook.
# We use a safety window here in case the webhook is genuinely delayed.
MAX_CALL_DURATION_SAFETY_SECONDS = 600  # 10 minutes

VAPI_API_KEY = "your-vapi-api-key"

def _db():
    conn = sqlite3.connect("/tmp/vapi_outbound_guard.db", check_same_thread=False)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS outbound_calls (
            phone_number  TEXT NOT NULL,
            call_id       TEXT NOT NULL,
            status        TEXT NOT NULL DEFAULT 'dispatched',
            attempts_today INTEGER NOT NULL DEFAULT 1,
            dispatched_at  REAL NOT NULL,
            ended_at       REAL,
            ended_reason   TEXT,
            PRIMARY KEY (phone_number, call_id)
        )
    """)
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_phone_dispatched
        ON outbound_calls (phone_number, dispatched_at)
    """)
    conn.commit()
    return conn

def can_dispatch_call(phone_number: str) -> dict:
    """
    Returns {"allowed": bool, "reason": str}.
    Checks:
      1. No in-progress call to this number (within safety window).
      2. Retry gap has elapsed since last attempt.
      3. Max daily attempts not exceeded.
    """
    now = time.time()
    day_start = now - 86400

    with _db_lock:
        conn = _db()
        rows = conn.execute(
            """SELECT call_id, status, dispatched_at, ended_at
               FROM outbound_calls
               WHERE phone_number = ? AND dispatched_at > ?
               ORDER BY dispatched_at DESC""",
            (phone_number, day_start)
        ).fetchall()
        conn.close()

    attempts_today = len(rows)
    if attempts_today >= MAX_ATTEMPTS_PER_DAY:
        return {"allowed": False, "reason": f"max_daily_attempts_reached ({attempts_today}/{MAX_ATTEMPTS_PER_DAY})"}

    for call_id, status, dispatched_at, ended_at in rows:
        # Is there an in-progress call (no ended_at and within the safety window)?
        if ended_at is None and (now - dispatched_at) < MAX_CALL_DURATION_SAFETY_SECONDS:
            return {"allowed": False, "reason": f"call_in_progress (call_id={call_id})"}

        # Has enough time passed since the last call ended?
        last_event_ts = ended_at or dispatched_at
        if (now - last_event_ts) < MIN_RETRY_GAP_SECONDS:
            return {"allowed": False, "reason": f"retry_gap_not_elapsed ({int(now - last_event_ts)}s < {MIN_RETRY_GAP_SECONDS}s)"}

    return {"allowed": True, "reason": "ok"}

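# Serializes check-and-reserve so two threads cannot both pass the eligibility
# check for the same number. This lock is per-process only; with multiple
# workers, move the reservation into a shared store.
_dispatch_lock = threading.Lock()

def dispatch_call(phone_number: str, assistant_id: str) -> str:
    """Dispatch a Vapi outbound call, guarded by phone-number deduplication."""
    placeholder_id = f"pending-{time.time_ns()}"

    # Check eligibility and reserve a row atomically, before the network call.
    with _dispatch_lock:
        check = can_dispatch_call(phone_number)
        if not check["allowed"]:
            raise ValueError(f"Call blocked: {check['reason']}")
        with _db_lock:
            conn = _db()
            conn.execute(
                "INSERT INTO outbound_calls (phone_number, call_id, status, dispatched_at, attempts_today) VALUES (?, ?, 'dispatched', ?, ?)",
                (phone_number, placeholder_id, time.time(), 1)
            )
            conn.commit()
            conn.close()

    try:
        with httpx.Client(timeout=10) as client:
            resp = client.post(
                "https://api.vapi.ai/call/phone",
                headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
                json={
                    "phoneNumberId": "your-phone-number-id",
                    "customer": {"number": phone_number},
                    "assistantId": assistant_id
                }
            )
            resp.raise_for_status()
            call_id = resp.json()["id"]
    except Exception:
        # The dispatch failed: release the reservation so it does not count
        # as an attempt or block the number for the safety window.
        with _db_lock:
            conn = _db()
            conn.execute("DELETE FROM outbound_calls WHERE call_id = ?", (placeholder_id,))
            conn.commit()
            conn.close()
        raise

    # Swap the placeholder for the real call.id so the webhook UPDATE matches.
    with _db_lock:
        conn = _db()
        conn.execute(
            "UPDATE outbound_calls SET call_id = ? WHERE call_id = ?",
            (call_id, placeholder_id)
        )
        conn.commit()
        conn.close()

    return call_id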

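@app.route("/vapi/outbound-webhook", methods=["POST"])
def handle_outbound_webhook():
    # silent=True returns None on a malformed body instead of raising.
    body = request.get_json(force=True, silent=True) or {}
    msg = body.get("message", {})
    msg_type = msg.get("type")

    if msg_type == "end-of-call-report":
        call = msg.get("call", {})
        call_id = call.get("id")
        if not call_id:
            return jsonify({}), 200
        # Read endedReason from the message first and fall back to the call
        # object, since payload shapes can differ.
        ended_reason = msg.get("endedReason") or call.get("endedReason", "unknown")
        phone_number = call.get("customer", {}).get("number", "")

        with _db_lock:
            conn = _db()
            # "ended_at IS NULL" makes a duplicate or late webhook a no-op.
            cur = conn.execute(
                "UPDATE outbound_calls SET status = 'ended', ended_at = ?, ended_reason = ? WHERE call_id = ? AND ended_at IS NULL",
                (time.time(), ended_reason, call_id)
            )
            first_delivery = cur.rowcount == 1
            conn.commit()
            conn.close()

        # Re-queue logic runs ONLY after receiving the authoritative end-of-call
        # webhook, never on a blind timer. Do not call can_dispatch_call() here:
        # ended_at was just set, so it would always report retry_gap_not_elapsed.
        # Enqueue a retry due MIN_RETRY_GAP_SECONDS from now instead, and let
        # dispatch_call() re-check eligibility when that job runs.
        # The reason strings below are placeholders; match them to the
        # endedReason values your Vapi account actually emits.
        if first_delivery and ended_reason in ("no-answer", "busy") and phone_number:
            # Schedule delayed retry (async, not inline); omitted for brevity.
            pass

    return jsonify({}), 200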
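
The handler above leaves retry scheduling out. One possible shape for it is sketched below, assuming a single-process, in-memory queue; `RetryQueue` and `drain` are hypothetical helpers, not part of Vapi or of the guard code. A retry becomes due only once the retry gap has elapsed after the recorded end of the call, and the dispatch callback is expected to re-run the eligibility check.

```python
import heapq
import time
from typing import Callable, List, Optional, Tuple

MIN_RETRY_GAP_SECONDS = 300  # same value as the guard above


class RetryQueue:
    """In-memory delayed-retry queue (hypothetical helper, not part of Vapi)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []  # (due_at, phone_number)

    def schedule(self, phone_number: str, ended_at: float) -> float:
        """Queue a retry that becomes due once the retry gap has elapsed."""
        due_at = ended_at + MIN_RETRY_GAP_SECONDS
        heapq.heappush(self._heap, (due_at, phone_number))
        return due_at

    def pop_due(self, now: Optional[float] = None) -> List[str]:
        """Return every phone number whose retry is due at `now`."""
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


def drain(queue: RetryQueue, dispatch: Callable[[str], None],
          now: Optional[float] = None) -> int:
    """Dispatch each due retry and return how many calls were sent."""
    sent = 0
    for number in queue.pop_due(now):
        try:
            dispatch(number)  # expected to re-run the eligibility check
            sent += 1
        except ValueError:
            # The guard blocked the call (in progress, gap, or daily cap).
            pass
    return sent
```

In the webhook handler, the retry branch would call `queue.schedule(phone_number, time.time())`, and a background worker would periodically call `drain(queue, lambda n: dispatch_call(n, assistant_id))`. A production deployment would likely persist the queue so that pending retries survive a restart.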

Guard State Summary

Failure Mode	Guard Class	Ceiling / Threshold	What to Monitor
VAD Silence Loop: VAD fires on noise/pause; STT returns empty; LLM generates filler; TTS plays; repeat	`VAPIVADGuard`	2 consecutive empty transcripts → static audio, hold for valid speech	`total_empty_cycles` per call; sessions with >5 empties = VAD env mismatch
Endpointing Interrupt Storm: Aggressive endpointing cuts off user mid-utterance; user repeats; extra LLM + TTS cycles	`VAPIEndpointingGuard`	3 interruption events → widen `endpointing` timeout by 200ms per additional event	Interruption rate per call, per STT provider; endpointing mismatch by traffic source
Webhook Cascade: Slow serverUrl handler exceeds timeout; duplicate delivery executes non-idempotent side effects; active session billing during wait	`VAPIToolCallGuard`	14s downstream timeout; dedup on `(call_id, tool_call_id)` before any side effect	P95 handler latency vs Vapi timeout; duplicate execution rate; timeout rate by function name
Outbound Retry Amplification: Webhook delivery delay causes backend to re-queue while first call in progress; N concurrent calls per number	`VAPIOutboundCallGuard`	Re-queue only after authoritative end-of-call webhook; 5-min retry gap; 2 attempts/day per number	Concurrent calls per phone number; webhook delivery latency; duplicate call rate

Deployment Checklist

Audit VAD configuration. Review your Vapi assistant's silenceTimeoutSeconds and responseDelaySeconds settings. Increase silence timeout if your callers need more thinking time (especially for complex scheduling or order details). Deploy VAPIVADGuard in your serverUrl handler to intercept before the LLM call on consecutive empty transcripts.
Baseline endpointing per STT provider. Run 50 test calls and log interruption events (turns ending mid-clause) per STT backend. If Deepgram interruption rate > 15%, compare with Gladia or AssemblyAI on the same test set. Set endpointing timeout in your Vapi assistant config to the P90 utterance completion time for your specific use case (appointment scheduling callers speak longer than yes/no IVR callers).
Make every serverUrl function handler idempotent. Add the tool_call_id dedup table before any handler goes to production. Set downstream timeouts to ≤70% of Vapi's function timeout (14s if Vapi's is 20s). Return structured degraded results on timeout rather than holding the session open.
Switch to webhook-driven retry for outbound campaigns. Remove any timer-based retry logic. Ensure your end-of-call webhook receiver logs call_id + ended_reason with deduplication. Implement phone-number-level in-progress tracking before re-queue eligibility check.
Monitor active call duration distribution weekly. P95 call duration far above the expected conversation length is the earliest signal of VAD loops or endpointing storms. Export Vapi call analytics to a dashboard. Sessions with duration > 3× expected baseline warrant individual review.
Set campaign concurrency limits. Vapi's concurrent call limit defaults to your account tier's maximum. For outbound campaigns, set explicit concurrency limits per campaign run to bound the blast radius if the outbound retry amplification pattern fires. A campaign capped at 50 concurrent calls produces at most 50× the per-call cost at any moment, rather than unbounded amplification.
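
The duration checks in the list above (P90 utterance time, P95 call duration, the 3× baseline review threshold) come down to a percentile and a filter. The sketch below assumes you have already exported per-call durations in seconds; `percentile` and `flag_long_calls` are hypothetical helper names, and the interpolation method is one reasonable choice rather than anything Vapi prescribes.

```python
import statistics
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Return the pct-th percentile (1-99) using inclusive interpolation."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    if len(values) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[pct - 1]


def flag_long_calls(durations: Dict[str, float], baseline_seconds: float,
                    factor: float = 3.0) -> List[str]:
    """Return call IDs whose duration exceeds factor x the expected baseline."""
    limit = baseline_seconds * factor
    return sorted(cid for cid, secs in durations.items() if secs > limit)
```

For example, `flag_long_calls(durations, baseline_seconds=90)` lists every call longer than 270 seconds for individual review, and `percentile(list(durations.values()), 95)` gives the weekly P95 to chart.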

Frequently Asked Questions

How is Vapi's per-minute billing different from token-based billing in text agent platforms?

Text agents built with frameworks such as the OpenAI Agents SDK or LangChain are billed per LLM token consumed: each tool call or model invocation is a discrete billing event. Vapi's billing model has four simultaneous cost streams (STT per audio-second, LLM per token, TTS per character, Vapi platform per active minute), and all four keep accruing for the duration of the active call, not just during LLM inference. A 30-second pause while the user formulates a response still bills for STT audio processing (transcribing the silence) and Vapi platform minutes, plus an LLM and TTS cycle whenever ambient background audio triggers VAD. A loop that extends call duration from 2 minutes to 8 minutes multiplies all four cost streams by roughly 4 at once, a cost impact that text-platform loops with individual discrete billing events cannot match.
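
The arithmetic behind that answer can be sketched as follows. The per-minute rates are hypothetical placeholders, not Vapi or provider pricing, and the model treats every stream as linear in call duration, which understates LLM cost when the context grows with each exchange.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical blended per-minute rates in USD; substitute your own."""
    stt: float = 0.01
    llm: float = 0.02
    tts: float = 0.03
    platform: float = 0.05

    def per_minute(self) -> float:
        """Sum of all four cost streams for one active minute."""
        return self.stt + self.llm + self.tts + self.platform


def call_cost(minutes: float, rates: Rates = Rates()) -> float:
    """Linear estimate: every stream accrues for the whole active call."""
    return minutes * rates.per_minute()


def loop_multiplier(expected_minutes: float, actual_minutes: float) -> float:
    """How many times the expected cost a stretched call ends up costing."""
    return call_cost(actual_minutes) / call_cost(expected_minutes)
```

With these placeholder rates, `loop_multiplier(2, 8)` comes out at 4: the loop does not add one extra billing event, it scales the whole per-minute total.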

Can I prevent VAD silence loops through Vapi's configuration alone, without a custom serverUrl guard?

Partially. Increasing silenceTimeoutSeconds reduces premature VAD firing by requiring longer silence before routing audio to STT. Setting a longer responseDelaySeconds gives callers more time to respond before Vapi's built-in no-response detection triggers. However, these settings address timing thresholds, not the logical pattern: they cannot distinguish between a caller genuinely pausing and a VAD misfire on ambient noise producing an empty STT result. A serverUrl guard that tracks consecutive empty-transcript count and blocks the LLM call path on the second consecutive empty is the only way to break the silence loop pattern before it incurs LLM and TTS cost. The Vapi configuration settings reduce loop frequency; the guard eliminates loop cost once VAD misfires.

The serverUrl handler deduplication relies on tool_call_id — does Vapi always provide this field?

Vapi includes a toolCallId field in function-call webhook payloads for assistant configurations that use OpenAI-compatible tool calling (the standard mode since Vapi v2). For legacy functions-style configurations, the tool call ID may need to be derived from a combination of call.id + function.name + timestamp_bucket. Flooring the timestamp to a 5-second bucket gives a workable dedup key for retries within the timeout window, provided you also check the adjacent bucket, because a retry can land just across a bucket boundary. The trade-off is that two legitimate calls to the same function within one bucket collapse into one. Always log the raw webhook payload for any function-call-related production incident, since the dedup key reconstruction depends on which Vapi API version your assistant config uses. The AI agent cost control pattern reference covers deduplication key design patterns for webhook-based tool calling across multiple platforms.
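
A minimal sketch of the derived key for the legacy case, assuming the payload gives you a call ID, a function name, and a receive timestamp. `bucket_keys` and `is_duplicate` are hypothetical helpers, and the in-memory set stands in for whatever dedup table you use. The timestamp is floored into a bucket and the previous bucket is also checked, so a retry that crosses a bucket boundary is still caught.

```python
import hashlib
from typing import List, Set


def bucket_keys(call_id: str, function_name: str, ts: float,
                bucket_seconds: int = 5) -> List[str]:
    """Return dedup keys for the bucket containing ts and the one before it."""
    bucket = int(ts // bucket_seconds)
    keys = []
    for b in (bucket, bucket - 1):
        raw = f"{call_id}:{function_name}:{b}"
        keys.append(hashlib.sha256(raw.encode()).hexdigest())
    return keys


def is_duplicate(seen: Set[str], call_id: str, function_name: str, ts: float,
                 bucket_seconds: int = 5) -> bool:
    """Record the invocation and report whether it repeats a recent one."""
    current, previous = bucket_keys(call_id, function_name, ts, bucket_seconds)
    if current in seen or previous in seen:
        return True
    seen.add(current)
    return False
```

Because only the first invocation's bucket is recorded, a second call more than one full bucket later is treated as new. Widen `bucket_seconds` if your retries can arrive later than that.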

How do I integrate RunGuard's SDK into a Vapi deployment?

The guard patterns above are standalone — they use SQLite for per-session state and run inside your serverUrl Flask handler with no external dependencies. To integrate RunGuard's managed circuit breaker SDK instead, install it with pip install runguard (Python) or npm install @runguard/sdk (Node.js/TypeScript — the preferred language for Vapi serverUrl handlers), then wrap each guard check point with runguard.check(call_id, guard_name). RunGuard provides the SQLite state management, per-session tracking, and monitoring dashboard automatically; you implement only the application-specific logic (which STT transcripts count as empty, which function names require idempotency, what constitutes an acceptable retry gap for your campaign). The RunGuard dashboard shows per-call guard trip rates, call duration anomalies, and webhook latency distribution across your Vapi deployment in real time.

Does the outbound call guard pattern work for Vapi's inbound call flows as well?

The outbound call deduplication guard (phone-number-keyed in-progress registry, webhook-driven retry instead of timer-based) is specific to outbound campaigns where your backend dispatches calls via the Vapi REST API. Inbound calls (where users call your Vapi phone number) have a different cost failure mode: users who hang up and immediately redial (for example, due to a poor connection) can create two simultaneous inbound sessions. Vapi handles the inbound call lifecycle independently: each inbound call gets its own call.id, and Vapi manages the session. The relevant guards for inbound are the VAD silence loop guard (failure mode 1) and the serverUrl webhook cascade guard (failure mode 3), both of which apply equally to inbound and outbound sessions. The VAPIOutboundCallGuard pattern applies only when your backend is the call originator.

Stop paying for silence, interruptions, and duplicate webhooks

RunGuard's circuit breaker SDK wraps your Vapi serverUrl handler with per-session VAD loop detection, endpointing interrupt tracking, tool call idempotency, and outbound call deduplication — without changing your assistant configuration or Vapi account settings. Install in 5 minutes; trips show up in the dashboard immediately.
