Relevance AI Agent Cost Control: Bulk Run Amplification, Reasoning Loops, Sub-Agent Chaining, and Knowledge Search Spirals
Relevance AI is a no-code and low-code platform for building AI agents and tools. An agent built on Relevance AI works by using an LLM — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro — to reason at each step about which tool to call next, then executing that tool. A "tool" in Relevance AI is itself a workflow: a sequence of steps that may include LLM prompt steps, API calls, code execution, and knowledge base searches. The billing consequence is that a single agent interaction pays for at least two LLM calls — one for the agent's reasoning step (deciding which tool to call) and one for the LLM prompt step inside the tool itself — and that minimum multiplies with every additional step, every delegation to a sub-agent, and every row in a bulk dataset run.
What makes Relevance AI's billing model distinctive relative to lower-level frameworks is that the cost surface is mostly invisible in the builder UI. When you wire a "Talk to Agent" tool to three downstream agents, there is no step in the visual editor showing "3× reasoning cost multiplication." When you click "Bulk run on table," the row count drives the total cost but the UI presents it as a single action. The gap between the visual builder's abstraction and the underlying per-LLM-call billing is where the four structural failure modes live.
The four patterns that account for the majority of unexpected Relevance AI billing:
- Bulk dataset run amplification: running a tool on a table with N rows fires N independent tool executions in parallel; each execution pays for the full LLM prompt step inside the tool plus any API calls the tool makes; a 500-row table with a $0.01/execution tool cost produces a $5.00 single-click charge with no warning before execution begins.
- Agent reasoning loop: an agent that fails to make progress on a task keeps calling its reasoning LLM to decide what to do next; if the chosen tool consistently returns an error or an unhelpful result, the agent retries the same tool invocation repeatedly until hitting max_iterations, paying one reasoning LLM call plus one tool LLM call per loop iteration.
- Sub-agent delegation chaining: a Relevance AI agent can use a built-in "Talk to Agent" tool to delegate work to another agent; each delegation adds a full reasoning LLM call at the delegated agent's level before any tool calls execute; a three-level chain (orchestrator → specialist → tool-runner) pays three reasoning LLM calls plus tool LLM calls at each level for a single user request.
- Knowledge base search reformulation spiral: agents that re-query a knowledge base on low-confidence semantic search results loop over topic-absent queries until max_iterations is exhausted, paying one embedding call plus one reasoning LLM call per reformulation attempt even when the topic does not exist in the knowledge base at all.
Relevance AI's billing model
Relevance AI does not charge a per-step platform fee. Instead, costs flow from the underlying model provider: each LLM prompt step bills at the configured model's token rates (input + output), each embedding call for knowledge base search bills at the embedding model's rate, and each external API call may have its own cost if you are proxying through rate-limited third-party services. Relevance AI charges for its own platform separately (a subscription tier that covers compute, storage, and the visual builder), but LLM API costs are passed through either from your own API keys or from Relevance AI's usage-based credits system.
The key cost multipliers in Relevance AI's execution model:
- Tool execution: one LLM call per LLM step inside the tool; tools can have multiple LLM steps (extract → validate → reformat = three calls per execution).
- Agent reasoning: one LLM call per agent reasoning step, separate from tool LLM calls; the agent's reasoning context grows with each step as the conversation history accumulates.
- Bulk run: tool execution count = table row count; all rows execute in parallel by default so wall-clock time is bounded, but total LLM call count scales linearly with rows.
- Sub-agent invocation: each "Talk to Agent" call starts a new agent reasoning loop in the delegated agent with its own max_iterations budget; a delegated agent that itself loops hits its own max_iterations before returning.
- Knowledge base search: one embedding model call per search query; embedding costs are typically small ($0.0001/1K tokens) but in a tight reformulation loop on a long query they compound against reasoning step costs.
The practical implication: a single user message to a Relevance AI agent can produce dozens of LLM calls depending on how many tools are available, how many steps each tool has, and whether bulk data or sub-agents are involved. None of this is surfaced as a cost estimate before execution.
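The multipliers above can be combined into a rough call-count estimate. The following is a minimal sketch, not a Relevance AI API: the function name and parameters are hypothetical, and it assumes every tool execution runs all of its LLM steps.

```python
def estimate_llm_calls(
    reasoning_steps: int,
    tool_executions: int,
    llm_steps_per_tool: int,
    rows: int = 1,
) -> int:
    """Rough LLM call count for one run, repeated once per bulk-run row."""
    per_run = reasoning_steps + tool_executions * llm_steps_per_tool
    return rows * per_run


# One reasoning step plus one single-step tool: the two-call minimum.
print(estimate_llm_calls(reasoning_steps=1, tool_executions=1, llm_steps_per_tool=1))

# A three-step tool (extract -> validate -> reformat) bulk-run over 500 rows,
# with no agent reasoning involved.
print(estimate_llm_calls(reasoning_steps=0, tool_executions=1, llm_steps_per_tool=3, rows=500))
```

Multiplying the result by a measured per-call cost gives a pre-run estimate that the builder UI does not show.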
Failure mode 1: bulk dataset run amplification
Relevance AI's "bulk run" feature lets you apply a tool to every row in a dataset table. This is one of the platform's most powerful features for data enrichment, classification, and generation tasks — and its most reliable source of unexpected charges. The cost structure is straightforward: total_cost = rows × cost_per_tool_execution. The danger is that both factors can be large and neither is capped by default.
A common pattern: a sales team builds a lead enrichment tool that takes a company domain and returns an ICP score plus a personalized outreach angle. The tool has two LLM steps — one to research the company from a web search and generate context, one to score the lead and draft the angle. At GPT-4o pricing, two LLM steps with a research prompt and 300-token output each costs approximately $0.02 per row. Running this tool on a 2,000-row prospect list produces a $40 charge in a single bulk run trigger.
The amplification risk compounds when bulk runs are triggered automatically. A Relevance AI agent configured to run enrichment on every new CRM record that arrives will trigger a tool execution per record. If a CRM import lands 500 new records overnight, the agent fires 500 tool executions before anyone checks the queue in the morning.
from typing import Optional

import requests
from runguard import BudgetTracker, BudgetExceededError


def relevance_bulk_run_with_guard(
    tool_id: str,
    dataset_rows: list[dict],
    cost_per_row_usd: float,
    session_budget_usd: float = 10.0,
    max_rows: int = 200,
    api_key: str = "",
    project_id: str = "",
    budget: Optional[BudgetTracker] = None,
) -> dict:
    """
    Wraps a Relevance AI bulk run with a row-count ceiling and budget tracker.
    Blocks the run if estimated cost exceeds the session budget.

    Pass the same `budget` tracker to every call in a session so spend
    accumulates across runs; without it, only this run's estimate is checked.
    """
    if budget is None:
        budget = BudgetTracker(cap=session_budget_usd)

    # Limit 1: hard row ceiling, checked before any spend is recorded.
    if len(dataset_rows) > max_rows:
        raise ValueError(
            f"Bulk run blocked: {len(dataset_rows)} rows exceeds "
            f"max_rows={max_rows}. Slice the dataset before calling."
        )

    # Limit 2: pre-run cost estimate against the session budget.
    estimated_cost = len(dataset_rows) * cost_per_row_usd
    try:
        budget.add(estimated_cost)
    except BudgetExceededError as e:
        raise RuntimeError(
            f"Bulk run blocked: estimated cost ${estimated_cost:.2f} "
            f"would exceed session budget ${session_budget_usd:.2f}. "
            f"Current session spend: ${e.spent:.2f}"
        ) from e

    # Proceed with the actual bulk run.
    # The endpoint path and Authorization format below are illustrative;
    # confirm both against the Relevance AI API reference for your project.
    resp = requests.post(
        f"https://api.relevanceai.com/latest/{project_id}/bulk_run",
        headers={
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
        json={
            "tool_id": tool_id,
            "inputs": dataset_rows,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()
The guard enforces two limits before touching the Relevance AI API: a hard row count ceiling and a session budget check against the pre-estimated cost. For CRM-triggered automatic runs, the row count ceiling is the critical control — it converts an unbounded "process every new record" pattern into "process at most N records per trigger, queue the rest." The session budget tracker prevents a series of small bulk runs from accumulating an unexpected session total even when each individual run is within the row ceiling.
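The "process at most N records per trigger, queue the rest" behaviour can be sketched as a small helper that runs before the guard is called. This is an illustration only: `split_for_trigger` is a hypothetical name, and where the queued rows are stored (a table, a queue service) is left to the caller.

```python
def split_for_trigger(rows: list[dict], max_rows: int) -> tuple[list[dict], list[dict]]:
    """Return (rows to process on this trigger, rows to leave queued)."""
    if max_rows < 0:
        raise ValueError("max_rows must be non-negative")
    return rows[:max_rows], rows[max_rows:]


# An overnight CRM import of 500 records with a 200-row ceiling.
incoming = [{"domain": f"company{i}.example"} for i in range(500)]
process_now, queued = split_for_trigger(incoming, max_rows=200)
print(len(process_now), len(queued))
```

Slicing rather than raising keeps an automatic trigger working under load while still capping what each trigger can spend.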
The cost-per-row estimate requires knowing your tool's structure. A tool with one LLM step using GPT-4o and a 500-token prompt costs roughly $0.005–$0.02 per row depending on output length. Measure a sample run of 10 rows and use actual_cost / 10 as the per-row estimate; do not rely on theoretical minimums.
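The sample-run measurement translates directly into a row ceiling. The sketch below shows the arithmetic under assumed figures (a 10-row sample that cost $0.20 and a $10 budget); both function names are hypothetical.

```python
def per_row_cost(sample_total_usd: float, sample_rows: int) -> float:
    """Measured cost per row from a small sample run."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_total_usd / sample_rows


def max_rows_for_budget(budget_usd: float, cost_per_row_usd: float) -> int:
    """Largest row count whose estimated cost stays within the budget."""
    if cost_per_row_usd <= 0:
        raise ValueError("cost_per_row_usd must be positive")
    # The small epsilon absorbs floating-point error before truncating.
    return int(budget_usd / cost_per_row_usd + 1e-9)


cost = per_row_cost(sample_total_usd=0.20, sample_rows=10)
print(f"${cost:.3f} per row, ceiling {max_rows_for_budget(10.0, cost)} rows")
```

Re-running the sample after any prompt or model change keeps the ceiling aligned with what the tool actually costs.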
Failure mode 2: agent reasoning loop
Relevance AI agents decide which tool to call by running an LLM reasoning step against the accumulated conversation history plus the available tool descriptions. When a tool consistently fails or returns a result the agent cannot use to make progress, the agent's next reasoning step often concludes that it should try the same tool again — sometimes with the same arguments, sometimes with minor variations that still fail for the same underlying reason.
A concrete example: an agent with a "Search CRM" tool hits a permissions error because the CRM API key expired. The tool returns a 403 JSON payload. The agent's reasoning LLM sees "CRM search failed, access denied" and — because there is no fallback tool available — concludes it should try again with a different query formulation. The next attempt hits the same 403. The loop continues until max_iterations (default 10 in Relevance AI) is exhausted. At each iteration: one reasoning LLM call (~$0.01 for a 1000-token context) plus one tool LLM call for the failed search step (~$0.005). Ten iterations: $0.15 per user request, all for a permissions error that should have tripped a circuit breaker after the first attempt.
The cost of a reasoning loop scales with context window growth. Each iteration appends the tool result to the agent's conversation history. By iteration 8, the reasoning context might be 4,000 tokens — four times the cost of the first reasoning call. The per-iteration cost increases monotonically within a loop.
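To see how much context growth adds to the flat $0.15 figure, the sketch below sums per-iteration costs. It assumes the rates quoted above ($0.01 per 1,000 reasoning tokens, $0.005 per tool call) and a hypothetical growth of 400 tokens per iteration, which puts iteration 8 at 3,800 tokens.

```python
def loop_cost_usd(
    iterations: int,
    start_tokens: int = 1000,
    growth_tokens: int = 400,          # assumed tokens appended per iteration
    reasoning_usd_per_1k: float = 0.01,
    tool_call_usd: float = 0.005,
) -> float:
    """Total cost of a stuck loop whose reasoning context grows each iteration."""
    total = 0.0
    for i in range(iterations):
        context_tokens = start_tokens + growth_tokens * i
        total += context_tokens / 1000 * reasoning_usd_per_1k + tool_call_usd
    return total


print(f"flat context:    ${loop_cost_usd(10, growth_tokens=0):.2f}")
print(f"growing context: ${loop_cost_usd(10):.2f}")
```

Under these assumptions the ten-iteration loop costs about $0.33 rather than $0.15, so stopping at iteration 3 saves more than the flat estimate suggests.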
from flask import Flask, request, jsonify
from runguard import LoopDetector, LoopDetectedError

app = Flask(__name__)

# Deploy this as a Relevance AI "tool" that the agent calls
# after each failed tool execution to check for a loop pattern.
# The agent's system prompt instructs it to call this guard
# tool whenever it receives an error response from another tool.

# One detector per session. State lives in process memory, so it is lost
# on restart and is not shared between worker processes.
loop_detectors: dict[str, LoopDetector] = {}


def get_detector(session_id: str) -> LoopDetector:
    if session_id not in loop_detectors:
        loop_detectors[session_id] = LoopDetector(repeats=3, max_cycle_len=4)
    return loop_detectors[session_id]


def error_signature(tool_name: str, error_payload: dict) -> str:
    """Stable signature: tool name + error type, not full message."""
    error_type = (
        error_payload.get("error_code")
        or error_payload.get("status_code")
        or error_payload.get("type")
        or "unknown"
    )
    return f"{tool_name}:{error_type}"


@app.route("/loop-check", methods=["POST"])
def loop_check():
    # Tolerate a missing or malformed JSON body instead of raising.
    data = request.get_json(silent=True) or {}
    session_id = data.get("session_id", "default")
    tool_name = data.get("tool_name", "unknown")
    error_payload = data.get("error_payload") or {}

    detector = get_detector(session_id)
    sig = error_signature(tool_name, error_payload)
    try:
        detector.record(sig)
        return jsonify({
            "loop_detected": False,
            "message": "Continue: no loop pattern yet.",
            "signature": sig,
        })
    except LoopDetectedError as e:
        # Clear detector so a fresh task starts clean
        loop_detectors.pop(session_id, None)
        return jsonify({
            "loop_detected": True,
            "message": (
                f"Loop detected: '{sig}' repeated {e.repeats} times. "
                "Stop retrying this tool. Escalate to human or return "
                "a best-effort partial result."
            ),
            "cycle_length": e.cycle_length,
            "repeats": e.repeats,
        }), 200  # 200 so agent reads the payload, not an HTTP error
The guard exposes an HTTP endpoint that the agent calls as a tool after any failed tool execution. The agent's system prompt instructs: "After any tool returns an error, call the loop-check tool with the failing tool's name and the error payload. If loop-check returns loop_detected: true, stop attempting that tool and tell the user you cannot complete the task due to a repeated error." This wires the circuit breaker into the agent's decision loop without modifying Relevance AI's internal agent architecture.
The signature scheme intentionally drops prose error messages. A CRM API that returns "Access denied — your token expired on June 1st" on one call and "Access denied — token not found in credential store" on the next is reporting the same failure mode (crm_search:403), and both responses should advance the same loop counter. Including the full message text in the signature would give each phrasing its own counter and prevent the detector from recognizing the repeated failure.
Failure mode 3: sub-agent delegation chaining
Relevance AI's multi-agent architecture lets any agent invoke another agent as a tool via the built-in "Talk to Agent" capability. The calling agent pauses its own reasoning, the delegated agent runs its full reasoning loop to completion, and the result comes back as a tool response. This is powerful for decomposing complex tasks — an orchestrator agent delegates research to a researcher, data extraction to an extractor, and report writing to a writer — but the billing compounds at every delegation level.
The cost structure for a three-level delegation chain:
| Level | Agent | LLM calls | Notes |
|---|---|---|---|
| Level 1 | Orchestrator | 2 reasoning steps | Step 1: decide to delegate. Step 2: synthesize sub-agent result and respond to user. |
| Level 2 | Specialist | 3 reasoning steps + 2 tool LLM calls | Researcher does 3 reasoning steps to plan + refine, calls 2 tools with LLM steps each. |
| Level 3 | Tool-runner | 2 reasoning steps + 1 tool LLM call | Extractor calls one specialized extraction tool. |
| Total | — | 7 reasoning + 3 tool = 10 LLM calls | For a single user request that looks like one message. |
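The table's total can be reproduced by walking the delegation tree. The sketch below is an illustrative model only: `AgentNode` and `total_llm_calls` are hypothetical names, and the step counts are the ones from the table, not values read from Relevance AI.

```python
from dataclasses import dataclass, field


@dataclass
class AgentNode:
    name: str
    reasoning_steps: int
    tool_llm_calls: int
    delegates: list["AgentNode"] = field(default_factory=list)


def total_llm_calls(node: AgentNode) -> int:
    """LLM calls made by this agent plus every agent it delegates to."""
    own = node.reasoning_steps + node.tool_llm_calls
    return own + sum(total_llm_calls(child) for child in node.delegates)


tool_runner = AgentNode("tool-runner", reasoning_steps=2, tool_llm_calls=1)
specialist = AgentNode("specialist", reasoning_steps=3, tool_llm_calls=2, delegates=[tool_runner])
orchestrator = AgentNode("orchestrator", reasoning_steps=2, tool_llm_calls=0, delegates=[specialist])

print(total_llm_calls(orchestrator))  # 10
```

Modelling the tree this way makes the depth of a chain explicit before the agents are wired together in the visual editor.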
At GPT-4o pricing with typical context sizes, 10 LLM calls per user message cost $0.05–$0.20. A three-level chain that runs 100 times per day in total (for example, 20 requests from each member of a five-person team) costs $5–$20/day from the delegation chain alone, before any data enrichment or generation tool costs. The key risk is that the number of reasoning steps is not visible in the Relevance AI UI when you wire agents together. The visual editor shows "Talk to Agent" as a single connection between two agent nodes, masking the full execution tree depth.
from flask import Flask, request, jsonify
from runguard import BudgetTracker, BudgetExceededError

app = Flask(__name__)

# Deploy this as a pre-delegation check tool that each orchestrator
# agent calls BEFORE invoking "Talk to Agent".
# Pass session_id and _delegation_depth from the agent's input schema
# through every delegation level.

# One tracker per session so estimated spend accumulates across delegations.
# Creating a new tracker on every request would reset spend to zero each time.
budget_trackers: dict[str, BudgetTracker] = {}


@app.route("/delegation-check", methods=["POST"])
def delegation_check():
    # Tolerate a missing or malformed JSON body instead of raising.
    data = request.get_json(silent=True) or {}
    session_id = data.get("session_id", "default")
    current_depth = int(data.get("_delegation_depth", 0))
    max_depth = int(data.get("max_depth", 3))
    session_budget_usd = float(data.get("session_budget_usd", 1.0))
    estimated_sub_agent_cost_usd = float(data.get("estimated_sub_agent_cost_usd", 0.05))

    # Limit 1: depth ceiling.
    if current_depth >= max_depth:
        return jsonify({
            "allowed": False,
            "reason": (
                f"Delegation depth {current_depth} has reached max_depth={max_depth}. "
                "Return a best-effort result using only tools available at this level."
            ),
            "depth": current_depth,
        })

    # Limit 2: session budget. The cap is fixed by the first request in a session.
    if session_id not in budget_trackers:
        budget_trackers[session_id] = BudgetTracker(cap=session_budget_usd)
    budget = budget_trackers[session_id]
    try:
        budget.add(estimated_sub_agent_cost_usd)
    except BudgetExceededError as e:
        return jsonify({
            "allowed": False,
            "reason": (
                f"Sub-agent delegation blocked: estimated cost "
                f"${estimated_sub_agent_cost_usd:.3f} would exceed "
                f"session budget ${session_budget_usd:.3f}. "
                f"Current session spend: ${e.spent:.3f}."
            ),
            "depth": current_depth,
        })

    return jsonify({
        "allowed": True,
        "next_depth": current_depth + 1,
        "depth": current_depth,
    })
The guard enforces two independent limits on delegation: a maximum depth ceiling and a session budget ceiling against an estimated sub-agent cost. Wiring the depth check requires a convention: every agent that can delegate passes _delegation_depth from its input schema as a parameter when calling the delegation-check tool, and includes next_depth from the guard response in the "Talk to Agent" call's input payload. This lets the delegated agent's own system prompt read _delegation_depth from its inputs and know whether it is at level 1, 2, or 3 — enabling the depth ceiling to propagate down the chain without a central coordinator.
The estimated sub-agent cost per delegation is necessarily approximate. A reasonable starting point: multiply the sub-agent's expected max_iterations (default 10) by the cost of one reasoning step at your configured model's rate. For GPT-4o at 2000-token average context size per step, that is 10 × $0.01 = $0.10 as a conservative delegation cost estimate. Use the actual measured cost from representative runs and update the estimate as agent behavior stabilizes.
Failure mode 4: knowledge base search reformulation spiral
Relevance AI provides a built-in knowledge base feature — a hosted vector store where you can upload documents and retrieve relevant chunks via semantic similarity search. Agents access knowledge bases via search tool steps that embed the query and retrieve the top-K most similar chunks. The typical agent loop: search → evaluate whether the results are sufficient → if not, reformulate the query and search again.
The reformulation spiral emerges when a topic is absent from the knowledge base entirely. The agent searches for "Q3 revenue by product line" — gets chunks about general revenue strategy but nothing with actual Q3 data — decides the results are insufficient, reformulates to "quarterly product revenue breakdown" — gets the same general chunks — reformulates again. Each attempt pays one embedding API call plus one agent reasoning step. With max_iterations=10, an agent doing knowledge base search can exhaust its full iteration budget on a topic that will never return useful results, at a cost of 10 reasoning steps plus 10 embedding calls before concluding it cannot find the answer.
The subtlety: the agent often does have general chunks returned from each search — they are semantically similar to the query at the embedding level but do not contain the specific structured data the agent needs. The search does not return zero results; it returns confident-looking results that the agent's LLM then judges as insufficient. This makes it hard for the agent to exit the loop based on search result count alone.
from flask import Flask, request, jsonify
import hashlib, time

app = Flask(__name__)

# Per-session search attempt tracking: topic key → attempt state
search_state: dict[str, dict] = {}


def topic_hash(query: str) -> str:
    """Hash of the query's sorted word set.

    Stable across word order, case, and repeated words. Queries that use
    different words (e.g. "Q3 revenue" vs "quarterly revenue") get different
    hashes, so vocabulary-changing reformulations are NOT grouped.
    """
    words = sorted(set(query.lower().split()))
    return hashlib.md5(str(words).encode()).hexdigest()[:8]


@app.route("/search-guard", methods=["POST"])
def search_guard():
    # Tolerate a missing or malformed JSON body instead of raising.
    data = request.get_json(silent=True) or {}
    session_id = data.get("session_id", "default")
    query = data.get("query", "")
    search_results = data.get("search_results", [])  # part of the payload; not used in the decision
    max_score = float(data.get("max_relevance_score", 0.0))
    max_attempts = int(data.get("max_attempts", 3))
    min_score_threshold = float(data.get("min_score_threshold", 0.60))

    # Key by session + word-set hash so reordered or re-cased variants of
    # the same query share one attempt counter.
    topic_key = f"{session_id}:{topic_hash(query)}"
    state = search_state.setdefault(topic_key, {
        "attempts": 0,
        "best_score": 0.0,
        "first_at": time.time(),
    })
    state["attempts"] += 1
    state["best_score"] = max(state["best_score"], max_score)

    # Structural topic absence: multiple attempts with consistently low scores
    if state["attempts"] >= max_attempts and state["best_score"] < min_score_threshold:
        search_state.pop(topic_key, None)
        return jsonify({
            "allow_retry": False,
            "reason": (
                f"Topic appears absent from knowledge base: {state['attempts']} attempts, "
                f"best relevance score {state['best_score']:.2f} < threshold {min_score_threshold:.2f}. "
                "Return an honest 'not found in knowledge base' response rather than reformulating further."
            ),
            "best_score": state["best_score"],
            "attempts": state["attempts"],
        })

    # Attempt ceiling reached with an acceptable best score
    if state["attempts"] >= max_attempts:
        search_state.pop(topic_key, None)
        return jsonify({
            "allow_retry": False,
            "reason": (
                f"Search attempt ceiling reached ({state['attempts']}/{max_attempts}). "
                "Use best available results or acknowledge limitations."
            ),
            "best_score": state["best_score"],
            "attempts": state["attempts"],
        })

    return jsonify({
        "allow_retry": True,
        "attempts_so_far": state["attempts"],
        "attempts_remaining": max_attempts - state["attempts"],
        "best_score": state["best_score"],
    })
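The guard stops retries once the attempt count reaches max_attempts (default 3) and uses the best relevance score seen so far to choose the instruction it returns. If the best score is still below min_score_threshold (default 0.60), the topic is probably absent, and the agent is told to return an honest "not found in knowledge base" response. If the best score is at or above the threshold, say 0.80 on the third reformulation, the knowledge base probably does contain relevant content, and the agent is told to answer from the best available results. The attempt count alone cannot distinguish these two cases; repeated attempts combined with a persistently low best score is the more reliable signal for structural topic absence.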
The topic key hashes the query's word set rather than the exact query string, so reorderings, case changes, and repeated words map to the same key: "Q3 revenue by product line" and "product line revenue by Q3" share one attempt counter. Reformulations that change vocabulary do not: "quarterly product revenue breakdown" has a different word set and therefore starts a new counter. An agent that introduces new words on every attempt can still slip past the per-topic ceiling, so closing that gap requires grouping by word overlap or embedding similarity, or adding a per-session cap on total searches, rather than relying on an exact hash.
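Because an exact hash of the word set matches only queries built from the same words, grouping looser reformulations needs a similarity test. A minimal sketch, assuming Jaccard word overlap with a hypothetical threshold of 0.25 (a value that would need tuning against real queries); none of these helpers are part of the guard above.

```python
def word_set(query: str) -> frozenset[str]:
    return frozenset(query.lower().split())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Shared words divided by total distinct words."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_topic(query: str, known_topics: list[frozenset[str]], threshold: float = 0.25) -> int:
    """Return the index of the first known topic that overlaps enough,
    registering the query as a new topic when none does."""
    words = word_set(query)
    for index, topic in enumerate(known_topics):
        if jaccard(words, topic) >= threshold:
            return index
    known_topics.append(words)
    return len(known_topics) - 1


topics: list[frozenset[str]] = []
for q in [
    "Q3 revenue by product line",
    "quarterly product revenue breakdown",
    "product line Q3 earnings",
    "team headcount",
]:
    print(find_topic(q, topics), q)
```

The three revenue queries land on topic 0 and "team headcount" on topic 1. A threshold this low will also merge some unrelated queries that share common words, so removing stop words or comparing embeddings are likely refinements.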
Combining the guards
The four failure modes are independent — bulk runs do not interact with reasoning loops, and sub-agent chains do not interact with knowledge search spirals. Each guard operates at a different level of the Relevance AI execution stack and can be deployed independently. The typical integration path:
- Bulk run guard first — it protects against the single largest single-click charge. Wire it into any automation that triggers a Relevance AI bulk run, especially CRM integrations or data pipeline triggers.
- Knowledge search guard second — it protects against the most common iterative loop pattern. Wire it as a tool the agent calls after every knowledge base search before deciding whether to reformulate.
- Reasoning loop guard third — deploy as a standalone endpoint and add it to the agent's available tools via Relevance AI's custom tool builder. Include guidance in the system prompt about when to call it.
- Delegation depth guard last — only required if you are using multi-agent architectures. Wire it as a pre-delegation check in any orchestrator agent's tool set.
Deployment note: All four guard endpoints can be hosted as a single Flask/FastAPI service deployed on any HTTPS endpoint Relevance AI can reach. Relevance AI's custom tool builder accepts any HTTP endpoint with a JSON body and a JSON response — no SDK integration required on the Relevance AI side. The RunGuard Python SDK (pip install runguard) handles the circuit breaker state on your endpoint's server side.
Cost impact reference
| Failure mode | Unguarded cost (example) | Guarded cost | Guard mechanism |
|---|---|---|---|
| Bulk run amplification (2000 rows × $0.02) | $40.00 per trigger | $4.00 (200-row cap) | Row ceiling + pre-run budget check |
| Reasoning loop (10 iters × $0.015) | $0.15 per stuck request | $0.045 (trips at iter 3) | Tool-error signature loop detector |
| Sub-agent chain (3 levels × 10 steps × $0.01) | $0.30 per user message | $0.20 (depth cap 2: 2 levels × 10 steps × $0.01) | Depth ceiling + per-delegation budget |
| Search spiral (10 iters × $0.012) | $0.12 per absent-topic query | $0.036 (3-attempt cap) | Attempt ceiling + best-score threshold |
The numbers in the table use conservative estimates at GPT-4o pricing. Costs will vary with your configured model, the token length of your tool prompts, and the actual iteration patterns your agents exhibit in practice. Measure your baseline before and after deploying guards to confirm the actual reduction in your specific configuration.
Frequently asked questions
Does Relevance AI have built-in cost controls?
Relevance AI provides a max_iterations setting per agent (default 10) which limits total reasoning steps in a single agent run. This is a ceiling on reasoning loop length but does not prevent the loop from running to that ceiling — it does not detect that the same error is repeating and trip early. There is no built-in bulk run row ceiling, no delegation depth limit, and no knowledge search attempt ceiling with a topic-absence exit condition. The guards described in this post implement those controls as custom tools integrated via Relevance AI's HTTP tool interface.
How do I measure my actual per-tool-execution cost in Relevance AI?
Run a sample batch of 10–20 rows using the bulk run feature on a small test table. After the run completes, check your LLM provider's API usage dashboard for the time window of the run. Divide total cost by number of rows to get your measured cost-per-row. If you are using Relevance AI's hosted credits rather than your own API keys, check the credits consumed in the Relevance AI usage dashboard for the same time window. Multiply the per-row cost by your realistic maximum dataset size to set your row ceiling and session budget cap in the bulk run guard.
Can I use the RunGuard Python SDK inside a Relevance AI tool's code step?
Yes. Relevance AI tools support a Python code step where you can install packages via pip at the start of the step. Adding pip install runguard in the code step's setup block lets you use LoopDetector and BudgetTracker directly inside the tool without deploying a separate HTTP endpoint. The limitation is that state does not persist between code step executions by default — you would need to write and read state from an external store (Redis, a Relevance AI dataset, or a cloud KV store) for the loop detector's history to survive across multiple agent reasoning steps.
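The persistence requirement can be illustrated without depending on how LoopDetector serializes its state, which is not covered here. The sketch below is a simplified stand-in that counts consecutive repeats of one signature and keeps its state as JSON in any dict-like store; `record_signature` is a hypothetical helper, and the plain dict stands in for Redis or a cloud KV store.

```python
import json
from typing import MutableMapping


def record_signature(
    store: MutableMapping[str, str],
    session_id: str,
    signature: str,
    repeats: int = 3,
) -> bool:
    """Persist the consecutive-repeat count for a session.

    Returns True once the same signature has been seen `repeats` times in a row.
    """
    raw = store.get(session_id)
    state = json.loads(raw) if raw else {"last": None, "count": 0}
    if state["last"] == signature:
        state["count"] += 1
    else:
        state = {"last": signature, "count": 1}
    store[session_id] = json.dumps(state)
    return state["count"] >= repeats


store: dict[str, str] = {}  # stand-in for an external key-value store
for attempt in range(3):
    tripped = record_signature(store, "session-1", "crm_search:403")
    print(attempt + 1, tripped)
```

Each call reads and writes the full state, so the count survives between code step executions as long as the store does. Unlike a cycle detector, this version catches only back-to-back repeats of a single signature.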
What is the right value for max_depth in the delegation guard?
Two is the practical maximum for most production Relevance AI setups. Level 1 (orchestrator) → Level 2 (specialist) covers the vast majority of useful delegation patterns. Level 3 adds very little additional capability but adds another full reasoning loop, with its own max_iterations budget, to every request that reaches it. Set max_depth=3 initially if you have a genuine need for a tool-runner layer below a specialist, then instrument the depth distribution across real requests. If more than 10% of requests hit depth 3, the third level is doing real work; if fewer than 2% hit it, reduce to max_depth=2 and restructure the Level 3 logic into the Level 2 specialist's tool set.
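The depth-distribution check can be sketched as follows, assuming you log the deepest delegation level each request reached. The function names are hypothetical, and the "inconclusive" outcome for shares between 2% and 10% is an assumption, since that range is not covered by the thresholds above.

```python
from collections import Counter


def share_at_depth(max_depths: list[int], depth: int) -> float:
    """Fraction of requests whose deepest delegation level equals `depth`."""
    if not max_depths:
        return 0.0
    return Counter(max_depths)[depth] / len(max_depths)


def recommend_max_depth(max_depths: list[int]) -> str:
    share = share_at_depth(max_depths, 3)
    if share > 0.10:
        return "keep max_depth=3"
    if share < 0.02:
        return "reduce to max_depth=2"
    return "inconclusive: keep measuring"


# Deepest level reached per request, from 100 logged requests.
observed = [1] * 50 + [2] * 35 + [3] * 15
print(f"{share_at_depth(observed, 3):.0%}", recommend_max_depth(observed))
```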
How does the topic-hash grouping in the search guard handle unrelated topics in the same session?
The topic key includes the session ID as a prefix, so attempt counts are scoped per session and do not bleed across users. Within a session, different topics produce different word sets and therefore different hashes: a session that searches for "revenue data" and "team headcount" tracks those independently. Two queries share a counter only when their word sets are identical or when their truncated hashes happen to collide; merely sharing most of their words is not enough. With an 8-character hex prefix (32 bits), accidental collisions are already rare at typical knowledge base query volumes, and increasing the length from 8 to 16 characters makes them negligible.