Julep AI Agent Cost Control: foreach Fan-Out, Session Accumulation, Subworkflow Amplification, and Search Re-Query Spirals
Julep AI is a platform for building persistent AI agents that execute multi-step workflows. A Julep agent runs tasks — sequences of typed steps including prompt (LLM completion), tool_call, foreach (iterate over a list), parallel (concurrent branches), yield (call another task), and search (vector similarity lookup on agent documents). Each prompt step makes one API call to the underlying model; costs accumulate across all steps that fire in a task execution tree.
What makes Julep's billing model distinctive is that it combines three factors that compound on each other: list iteration (a single foreach turns one task into N LLM calls), session persistence (accumulated conversation history inflates the token count of every future prompt step against that session), and subworkflow composition (yield steps call other tasks that run their own step trees). When these factors interact, for example a foreach inside a yielded subworkflow against a session with 80 accumulated turns, the per-input API call count can reach hundreds of calls for what looks like a single agent invocation.
Four structural failure modes account for the majority of unexpected Julep billing:
- foreach step LLM fan-out: a foreach wrapping a prompt step fires one API call per list item; an unconstrained upstream list (search results, database rows, document chunks) creates an unbounded call count from a single task invocation.
- Session context accumulation: Julep sessions maintain a persistent message history across all task executions; a session used for many tasks over days accumulates thousands of tokens of prior context injected into every new prompt step, making each turn progressively more expensive than the last.
- Subworkflow yield delegation amplification: orchestrator tasks that yield to specialist subworkflows, which themselves yield to tool tasks, create a task execution tree; every node in the tree runs its own prompt steps, and each level multiplies the previous level's call count by the branching factor.
- Document search re-query spiral: agents that re-query on low-confidence search results loop indefinitely when the topic is absent from the document store, paying one embedding operation plus one prompt step per reformulation attempt.
Julep's execution billing model
Julep tasks are defined as YAML (or JSON) workflow definitions that execute as DAGs of steps. When you call client.executions.create(task_id=..., input={...}), Julep spins up an execution that runs each step in sequence (or in parallel for parallel steps). Each prompt step calls the model configured on the agent — GPT-4o, Claude Sonnet, Gemini Pro — at the model provider's standard token rates. Each search step calls the embedding model to vectorize the query and retrieves from Julep's hosted vector store. There is no per-step Julep platform charge beyond the underlying model API calls, but that means your cost model is entirely driven by how many prompt and search steps fire across an execution tree.
The key insight for cost control: Julep steps are composable. A foreach step has a do: block that is itself a list of steps — meaning it can contain prompt steps, tool_call steps, or even yield steps. A parallel step runs multiple step sequences simultaneously. A yield step calls an entirely separate task by ID. The billing implication is that cost grows with the product of list sizes × nesting depth × steps per node — not just the number of top-level task invocations.
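To make the multiplication concrete, the following is a minimal sketch that walks a nested step definition and counts the prompt steps that would fire. The dictionary schema here (`size` for the foreach list length, `branches` for parallel, `steps` for the body of a yielded task) is a hypothetical representation chosen for illustration, not Julep's task schema.

```python
from typing import Any

Step = dict[str, Any]  # hypothetical in-memory step representation


def count_prompt_calls(steps: list[Step]) -> int:
    """Count the LLM calls fired by a list of steps."""
    total = 0
    for step in steps:
        kind = step["kind"]
        if kind == "prompt":
            total += 1
        elif kind == "foreach":
            # The do-block runs once per list item.
            total += step["size"] * count_prompt_calls(step["do"])
        elif kind == "parallel":
            # Every branch runs, so branch costs add up.
            total += sum(count_prompt_calls(b) for b in step["branches"])
        elif kind == "yield":
            # A yielded task bills its own step tree.
            total += count_prompt_calls(step["steps"])
        # Other step kinds make no LLM call in this simplified model.
    return total


specialist = [{"kind": "prompt"}, {"kind": "prompt"}]
task = [
    {"kind": "prompt"},
    {"kind": "foreach", "size": 30, "do": [
        {"kind": "prompt"},
        {"kind": "yield", "steps": specialist},
    ]},
]
print(count_prompt_calls(task))  # 1 + 30 * (1 + 2) = 91
```

Running a counter like this against each task definition with worst-case list sizes gives an upper bound on calls per invocation before anything is deployed.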
Failure mode 1: foreach step LLM fan-out
The most common Julep cost surprise comes from foreach steps that iterate over lists whose size is controlled by upstream data retrieval. A typical pattern: an agent searches for relevant documents, retrieves N results, then uses a foreach step to summarize or analyze each result individually.
# Research agent task: loops over all search results
name: research_and_summarize
steps:
  - kind: tool_call
    tool: web_search
    arguments:
      query: "{{inputs.topic}}"
    # Returns a list: could be 5 items or 50 depending on the search tool
    output: search_results
  - kind: foreach
    over: "{{search_results}}"  # iterates over every item returned
    do:
      - kind: prompt
        messages:
          - role: user
            content: "Summarize this source for: {{inputs.topic}}\n\n{{item.snippet}}"
    output: summaries
If the web search tool returns 30 results, this task fires 30 separate prompt steps — 30 LLM API calls — for a single user query. At GPT-4o pricing (~$5/M input tokens), a prompt that includes the topic query plus a 300-token snippet costs roughly $0.002 per call. Thirty calls: $0.06 per invocation. This looks small until the agent runs 500 times per day across a team, producing a $30/day line item for a step that a 5-result cap would reduce to $5/day.
The structural problem is that the foreach's iteration count is determined by the tool output, which the task definition does not constrain. A search tool that returns variable result counts (5 on a specific query, 50 on a broad one) makes per-invocation cost vary by 10× with no signal to the agent or the task definition.
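The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch that takes the article's approximate $0.002 per-call figure as a default; the function name and parameters are illustrative only.

```python
def daily_foreach_cost(
    items_per_run: int,
    runs_per_day: int,
    cost_per_call_usd: float = 0.002,  # rough per-call figure from the text
) -> float:
    """Daily spend for a foreach that makes one prompt call per list item."""
    return items_per_run * runs_per_day * cost_per_call_usd


uncapped = daily_foreach_cost(items_per_run=30, runs_per_day=500)
capped = daily_foreach_cost(items_per_run=5, runs_per_day=500)
print(f"uncapped: ${uncapped:.2f}/day, capped: ${capped:.2f}/day")

# The same task on a narrow query (5 results) versus a broad one (50 results)
narrow = daily_foreach_cost(5, 1)
broad = daily_foreach_cost(50, 1)
print(f"per-invocation spread: {broad / narrow:.0f}x")
```

Because cost is linear in the list size, any cap on the list is also a cap on the per-invocation spend of that step.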
from flask import Flask, request, jsonify
from runguard import LoopDetector, BudgetTracker, LoopDetectedError, BudgetExceededError

app = Flask(__name__)

# Call this endpoint from a Julep tool_call step before the foreach
@app.route("/foreach/check", methods=["POST"])
def foreach_check():
    data = request.json
    items = data.get("items", [])
    session_id = data.get("session_id", "default")
    max_items = data.get("max_items", 10)
    # Accepted for budget tracking; this endpoint only enforces max_items
    daily_call_budget = data.get("daily_call_budget", 200)
    if len(items) > max_items:
        # Truncate list and report the truncation event
        truncated = items[:max_items]
        return jsonify({
            "allowed": True,
            "items": truncated,
            "truncated": True,
            "original_count": len(items),
            "truncated_count": max_items,
            "message": f"List truncated from {len(items)} to {max_items} items"
        })
    return jsonify({
        "allowed": True,
        "items": items,
        "truncated": False,
        "original_count": len(items)
    })
name: research_and_summarize_guarded
steps:
  - kind: tool_call
    tool: web_search
    arguments:
      query: "{{inputs.topic}}"
    output: search_results
  # Guard step: truncates list before foreach begins
  - kind: tool_call
    tool: http_post
    arguments:
      url: "https://your-guard-service/foreach/check"
      body:
        items: "{{search_results}}"
        session_id: "{{session.id}}"
        max_items: 10
        daily_call_budget: 200
    output: guarded_results
  # foreach now iterates over the capped list
  - kind: foreach
    over: "{{guarded_results.items}}"
    do:
      - kind: prompt
        messages:
          - role: user
            content: "Summarize this source for: {{inputs.topic}}\n\n{{item.snippet}}"
    output: summaries
The guard evaluates in a tool_call step before the foreach begins. It can truncate the list, log the event for budget tracking, and return metadata about the truncation that the agent can use to tell the user "I reviewed the top 10 of 47 results." The key design point: the cap applies at the task definition layer, not in the tool that returns the results — the tool remains unchanged and other callers that want the full list still get it.
Failure mode 2: session context accumulation
Julep sessions are persistent conversation containers. When a task executes against a session, Julep injects the session's accumulated message history into each prompt step. This is the mechanism that gives agents memory across invocations — the agent "remembers" prior conversations. But it also means that every prompt step in every task run against a long-lived session pays proportionally more in input tokens than the same step would pay against a fresh session.
History size grows linearly with turn count, but because every later prompt step re-sends the whole history, cumulative spend over the session's lifetime grows roughly quadratically. If each exchange adds 500 tokens to the session history, a session with 100 prior turns injects 50,000 tokens of history into every new prompt step. At GPT-4o pricing (~$5/M input tokens), that 50K token injection costs $0.25 per prompt step, before the actual task input. A task with three prompt steps against this session pays $0.75 in pure history-injection overhead per invocation, regardless of how short the new query is.
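A short calculation makes the overhead visible. The sketch below assumes the same figures as the text (500 tokens added per turn, about $5 per million input tokens); the function names are illustrative and nothing here calls Julep.

```python
GPT4O_INPUT_USD_PER_MTOK = 5.0  # approximate rate used in the text


def history_overhead_usd(
    prior_turns: int,
    tokens_per_turn: int = 500,
    prompt_steps: int = 1,
    rate: float = GPT4O_INPUT_USD_PER_MTOK,
) -> float:
    """Input cost of re-sending session history for one task invocation."""
    history_tokens = prior_turns * tokens_per_turn
    return prompt_steps * history_tokens / 1_000_000 * rate


def cumulative_overhead_usd(total_turns: int, **kwargs) -> float:
    """Total history overhead paid across invocations 0..total_turns-1."""
    return sum(history_overhead_usd(t, **kwargs) for t in range(total_turns))


print(history_overhead_usd(100))                  # about 0.25
print(history_overhead_usd(100, prompt_steps=3))  # about 0.75
print(cumulative_overhead_usd(100))               # about 12.375
```

The last figure is the point: each individual turn looks cheap, but the first 100 single-prompt turns of such a session have already paid roughly $12 for history alone.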
import julep
from runguard import BudgetTracker, BudgetExceededError

client = julep.Client(api_key="...")
budget = BudgetTracker(cap=1.0)  # $1 cap per task execution

MAX_SESSION_TOKENS = 40_000  # rotate session above this threshold
SUMMARY_TOKENS = 2_000       # target size of the rotation summary

# SUMMARIZE_SESSION_TASK_ID and wait_for_execution() are assumed to be
# defined elsewhere in the service.

def estimate_session_tokens(session_id: str) -> int:
    """Estimate accumulated context tokens in a session."""
    # Rough estimate from history length: ~1.3 tokens per word
    history = client.sessions.history(session_id)
    total_tokens = sum(
        len(msg.content.split()) * 1.3
        for msg in history.items
    )
    return int(total_tokens)

def run_task_with_session_guard(
    task_id: str,
    session_id: str,
    task_input: dict
) -> dict:
    session_tokens = estimate_session_tokens(session_id)
    if session_tokens > MAX_SESSION_TOKENS:
        # Rotate: create a new session with a summary of the old one
        session = client.sessions.get(session_id)  # needed for agent_id
        summary_execution = client.executions.create(
            task_id=SUMMARIZE_SESSION_TASK_ID,
            input={"session_id": session_id, "max_summary_tokens": SUMMARY_TOKENS}
        )
        summary = wait_for_execution(summary_execution.id)
        new_session = client.sessions.create(
            agent_id=session.agent_id,
            situation=f"Continuing from prior session summary:\n\n{summary.output}",
        )
        session_id = new_session.id  # use the new session
        session_tokens = SUMMARY_TOKENS  # new session carries only the summary
    # Record projected cost: history tokens + estimated task input tokens
    estimated_input_tokens = session_tokens + len(str(task_input).split()) * 1.3
    projected_cost = (estimated_input_tokens / 1_000_000) * 5.0  # GPT-4o rate
    budget.add(projected_cost)  # raises BudgetExceededError if over cap
    execution = client.executions.create(
        task_id=task_id,
        session_id=session_id,
        input=task_input
    )
    return wait_for_execution(execution.id)
The session rotation strategy — creating a new session with a summarized context rather than letting history grow unboundedly — trades some continuity for cost control. For agents where the exact verbatim history is not required (most research and support agents), a 2,000-token summary of a 50,000-token history preserves the useful signal at 4% of the original cost. The BudgetTracker provides a per-task execution cost ceiling independent of session rotation, catching cases where even a fresh session's task input is unusually large.
Failure mode 3: subworkflow yield delegation amplification
Julep's yield step calls another task by ID, suspends the current execution, and resumes when the subworkflow completes. This is the mechanism for building multi-agent architectures in Julep: an orchestrator task yields to specialist tasks, which may themselves yield to tool-wrapper tasks. Each level of the hierarchy runs its own step tree, billing its own prompt steps independently.
The cost amplification is multiplicative. An orchestrator that yields to 5 specialist tasks, each of which has 3 prompt steps, produces 15 prompt calls per orchestrator invocation, before counting any prompt steps in the orchestrator itself. If the specialists each yield to 3 tool-wrapper tasks with 1 prompt step apiece, the tree becomes: 1 orchestrator + 5 specialists + 15 tool tasks = 21 task executions, and (5 × 3) + (15 × 1) = 30 prompt steps. From the top level, this looks like a single client.executions.create() call.
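The tree arithmetic generalizes to any uniform hierarchy. The following sketch assumes every task at a given level yields to the same number of children and has the same number of prompt steps; `tree_totals` and its parameters are illustrative names, not part of any SDK.

```python
def tree_totals(
    branching: list[int],
    prompts_per_task: list[int],
) -> tuple[int, int]:
    """
    branching[i]: how many tasks each level-i task yields to.
    prompts_per_task[i]: prompt steps in each level-i task.
    Returns (task_executions, prompt_steps) for one top-level invocation.
    """
    if len(prompts_per_task) != len(branching) + 1:
        raise ValueError("need one prompt count per level, including the leaves")
    tasks_at_level = 1
    total_tasks = 0
    total_prompts = 0
    for level, prompts in enumerate(prompts_per_task):
        total_tasks += tasks_at_level
        total_prompts += tasks_at_level * prompts
        if level < len(branching):
            tasks_at_level *= branching[level]
    return total_tasks, total_prompts


# Orchestrator (0 prompts counted) -> 5 specialists (3 prompts each)
# -> 3 tool-wrapper tasks per specialist (1 prompt each)
print(tree_totals(branching=[5, 3], prompts_per_task=[0, 3, 1]))  # (21, 30)
```

Adding one more level with a branching factor of 3 and one prompt per task would add 45 task executions and 45 prompt steps, which is why a depth ceiling matters more than any single per-task limit.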
from runguard import LoopDetector, LoopDetectedError

# Thread this guard check as the first evaluate step in every task
# that can be called as a subworkflow
MAX_DELEGATION_DEPTH = 3
MAX_TOTAL_YIELDS = 10

def check_delegation_depth(task_input: dict) -> dict:
    """
    Call at the start of any task that accepts yield delegation.
    task_input must include _execution_depth and _yield_count
    threaded from the orchestrator.
    """
    depth = task_input.get("_execution_depth", 0)
    yield_count = task_input.get("_yield_count", 0)
    if depth >= MAX_DELEGATION_DEPTH:
        return {
            "proceed": False,
            "reason": "delegation_depth_exceeded",
            "depth": depth,
            "max_depth": MAX_DELEGATION_DEPTH,
            "partial_result": task_input.get("_best_partial_result")
        }
    if yield_count >= MAX_TOTAL_YIELDS:
        return {
            "proceed": False,
            "reason": "yield_count_exceeded",
            "yield_count": yield_count,
            "max_yields": MAX_TOTAL_YIELDS
        }
    # Pass incremented depth + count to sub-tasks
    return {
        "proceed": True,
        "_execution_depth": depth + 1,
        "_yield_count": yield_count + 1
    }
name: orchestrator_task
steps:
  # Guard: evaluate depth before doing anything expensive
  - kind: evaluate
    result:
      depth_check: >
        {
          "proceed": (inputs._execution_depth or 0) < 3,
          "depth": inputs._execution_depth or 0
        }
  - kind: if_else
    if: "{{depth_check.proceed}}"
    else:
      # Return gracefully rather than erroring
      - kind: return
        output:
          result: null
          stopped_reason: "delegation_depth_exceeded"
          depth: "{{depth_check.depth}}"
  # Normal orchestration: yield to specialist tasks
  - kind: foreach
    over: "{{inputs.subtasks}}"
    do:
      - kind: yield
        workflow: specialist_task
        arguments:
          topic: "{{item.topic}}"
          _execution_depth: "{{depth_check.depth + 1}}"
          # Default to 0 so a top-level call without the field still works
          _yield_count: "{{(inputs._yield_count or 0) + 1}}"
    output: specialist_results
Threading _execution_depth and _yield_count through every yield call's input means each task in the hierarchy sees its position in the tree. The evaluate step at the start of each task checks these values before running any prompt steps — aborting cleanly with a structured result if the limits are exceeded. This design handles cycles (task A yields to B which yields to A) as well as legitimate but unexpectedly deep fan-outs.
The choice of 3 as the max depth and 10 as the max yield count is configurable. For most agent architectures — orchestrator, specialist, tool-wrapper — depth 3 is the maximum meaningful level. Any delegation deeper than that is almost always a bug (infinite recursion) rather than intent. Setting the ceiling at 3 catches the bug at the third level rather than after 30 levels have billed.
Failure mode 4: document search re-query spiral
Julep agents can store documents in Julep's hosted vector store and use search steps to retrieve relevant content by semantic similarity. A common pattern is to gate further processing on retrieval confidence: if the top result's score is below a threshold, re-query with a reformulated question. This works well when documents exist on the topic — a second formulation often retrieves a better result. But when the topic is absent from the document store, no reformulation finds a high-confidence result. The agent loops.
name: knowledge_lookup
steps:
  - kind: evaluate
    result:
      query: "{{inputs.question}}"
      attempt: 0
  # This pattern loops indefinitely if topic is absent
  - kind: search
    vector_store_id: "{{inputs.store_id}}"
    query: "{{query}}"
    limit: 3
    output: search_results
  - kind: if_else
    if: "{{search_results[0].score >= 0.75}}"
    then:
      - kind: prompt
        messages:
          - role: user
            content: "Answer using: {{search_results}}\n\nQuestion: {{inputs.question}}"
        output: answer
    else:
      # Re-query with a reformulated question: no ceiling!
      - kind: prompt
        messages:
          - role: user
            content: "Reformulate this question to better match documentation: {{query}}"
        output: reformulated_query
      - kind: evaluate
        result:
          query: "{{reformulated_query.choices[0].message.content}}"
          attempt: "{{attempt + 1}}"
      # Goes back to the search step... indefinitely
Each loop iteration costs: one search step (embedding call + vector lookup) plus one prompt step for reformulation. For a topic genuinely absent from the store — say, the agent is queried about a product feature not yet in the knowledge base — each reformulation produces a different query that still retrieves nothing useful above threshold. The agent can loop 20+ times before any external timeout fires.
import hashlib
import time
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory store keyed by (execution_id, topic_hash)
# In production: use Redis or SQLite with TTL
_search_state: dict[str, dict] = {}

def _topic_hash(query: str) -> str:
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

@app.route("/search/check", methods=["POST"])
def search_check():
    data = request.json
    execution_id = data["execution_id"]
    original_question = data["original_question"]
    current_query = data["current_query"]
    last_score = data.get("last_score", 0.0)
    max_attempts = data.get("max_attempts", 3)
    min_retrievable_score = data.get("min_retrievable_score", 0.30)
    topic_hash = _topic_hash(original_question)
    state_key = f"{execution_id}:{topic_hash}"
    state = _search_state.get(state_key, {
        "attempts": 0,
        "best_score": 0.0,
        "best_result": None,
        "queries_tried": []
    })
    # Record current attempt
    state["attempts"] += 1
    state["queries_tried"].append(current_query)
    if last_score > state["best_score"]:
        state["best_score"] = last_score
        state["best_result"] = data.get("last_result")
    _search_state[state_key] = state
    # Block: too many attempts
    if state["attempts"] > max_attempts:
        return jsonify({
            "proceed": False,
            "reason": "max_attempts_exceeded",
            "attempts": state["attempts"],
            "best_score": state["best_score"],
            "best_result": state["best_result"],
            "message": f"Search blocked after {state['attempts']} attempts. Best score: {state['best_score']:.2f}"
        })
    # Block: score too low to ever succeed (structural absence)
    if state["attempts"] >= 2 and state["best_score"] < min_retrievable_score:
        return jsonify({
            "proceed": False,
            "reason": "topic_absent_from_store",
            "attempts": state["attempts"],
            "best_score": state["best_score"],
            "best_result": state["best_result"],
            "message": "Topic not in knowledge base: score below minimum retrievable threshold"
        })
    return jsonify({
        "proceed": True,
        "attempts": state["attempts"],
        "best_score": state["best_score"]
    })

@app.route("/search/reset", methods=["POST"])
def search_reset():
    data = request.json
    execution_id = data["execution_id"]
    # Clean up state for this execution
    keys_to_delete = [k for k in _search_state if k.startswith(f"{execution_id}:")]
    for k in keys_to_delete:
        del _search_state[k]
    return jsonify({"reset": True, "cleared": len(keys_to_delete)})
name: knowledge_lookup_guarded
steps:
  - kind: evaluate
    result:
      query: "{{inputs.question}}"
      attempt: 0
      last_score: 0.0
      last_result: null
  - kind: search
    vector_store_id: "{{inputs.store_id}}"
    query: "{{query}}"
    limit: 3
    output: search_results
  # Guard check before deciding to re-query
  - kind: tool_call
    tool: http_post
    arguments:
      url: "https://your-guard-service/search/check"
      body:
        execution_id: "{{execution.id}}"
        original_question: "{{inputs.question}}"
        current_query: "{{query}}"
        last_score: "{{search_results[0].score}}"
        last_result: "{{search_results[0]}}"
        max_attempts: 3
        min_retrievable_score: 0.30
    output: guard_result
  - kind: if_else
    if: "{{search_results[0].score >= 0.75}}"
    then:
      - kind: prompt
        messages:
          - role: user
            content: "Answer using: {{search_results}}\n\nQuestion: {{inputs.question}}"
        output: answer
    else:
      - kind: if_else
        if: "{{guard_result.proceed}}"
        then:
          - kind: prompt
            messages:
              - role: user
                content: "Reformulate this question for better documentation match: {{query}}"
            output: reformulated
          # (loop continues with guard checking each iteration)
        else:
          # Return best available result with caveat
          - kind: return
            output:
              answer: "{{guard_result.best_result}}"
              confidence: "{{guard_result.best_score}}"
              caveat: "{{guard_result.message}}"
The guard distinguishes two block conditions. Max attempts exceeded catches runaway reformulation loops where documents exist but the threshold is set too high — a configuration problem. Topic absent from store catches the more damaging case where no reformulation will ever work because the information simply is not in the vector store. The min_retrievable_score threshold of 0.30 represents "even the best semantic match in this store is below this score on this query family" — a signal that the store coverage gap is structural, not query-phrasing-dependent.
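The two block conditions can also be expressed as an in-process loop, which is sometimes easier to reason about than the HTTP round trip. The sketch below is a simplified local analogue of the guard, not the guard service itself: `search` and `reformulate` are hypothetical callables supplied by the caller, and the loop stops at `max_attempts` searches rather than counting guard checks.

```python
from typing import Callable


def bounded_lookup(
    question: str,
    search: Callable[[str], float],     # returns the top result's score
    reformulate: Callable[[str], str],  # returns a rephrased query
    accept_score: float = 0.75,
    floor_score: float = 0.30,
    max_attempts: int = 3,
) -> dict:
    """Re-query at most max_attempts times, stopping early on structural absence."""
    query = question
    best_score = 0.0
    for attempt in range(1, max_attempts + 1):
        score = search(query)
        best_score = max(best_score, score)
        if score >= accept_score:
            return {"status": "answered", "attempts": attempt, "best_score": best_score}
        # Two tries and still below the floor: the topic is likely not in the store.
        if attempt >= 2 and best_score < floor_score:
            return {"status": "topic_absent_from_store", "attempts": attempt,
                    "best_score": best_score}
        if attempt < max_attempts:
            query = reformulate(query)
    return {"status": "max_attempts_exceeded", "attempts": max_attempts,
            "best_score": best_score}


# A store with nothing relevant: every query scores 0.1
result = bounded_lookup("Does it support SSO?", lambda q: 0.1, lambda q: q + " (rephrased)")
print(result["status"], result["attempts"])  # topic_absent_from_store 2
```

With the floor check in place, an absent topic costs two searches and one reformulation instead of running until the attempt ceiling.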
Guard state summary
| Failure mode | Trigger condition | Guard action | Block criterion |
|---|---|---|---|
| foreach fan-out (JulepForeachGuard) | List size exceeds max_items | Truncate list, return metadata | len(items) > max_items |
| Session accumulation (session rotation) | Session token count exceeds threshold | Create new session with summary context | session_tokens > MAX_SESSION_TOKENS |
| Delegation amplification (_execution_depth) | Depth or yield count exceeds ceiling | Return structured stop result | depth >= 3 or yield_count >= 10 |
| Search re-query spiral (JulepSearchReQueryGuard) | Attempts exceeded or score below floor | Return best result with caveat | attempts > 3 or best_score < 0.30 |
Checklist before deploying a Julep agent to production
- Audit every foreach step. List the data source feeding each foreach. If the list size is determined by a tool output (search results, API responses, database queries), add a cap step before the foreach. Do not rely on the tool to return a bounded result count.
- Set a session token budget. Decide the maximum session age in tokens, not in turns or time. Add a pre-task check that reads session history length and rotates when the threshold is exceeded. Include a summarize subworkflow for the rotation path.
- Thread execution depth through all yield calls. Any task that can be called via yield should accept _execution_depth in its inputs and check it as the first evaluate step. Failing to thread this field means the guard cannot fire at the right level.
- Bound every search re-query loop with an attempt counter and a floor score. The floor score is more important than the attempt counter because it detects structural absence rather than just excessive reformulation. Set it based on the minimum acceptable semantic match in your document store, not an arbitrary low number.
- Run a cost baseline per task before production launch. Execute each task against a representative input set and log the prompt step count per execution. Any task that produces more than 20 prompt steps per invocation on average deserves a ceiling review before it scales.
- Log truncation and guard trip events separately from errors. A guard that truncates a foreach list is not an error; it is the guard working as intended. If guard trips are mixed into the error log, they become noise. A separate guard event stream makes it easy to tune ceilings based on actual production patterns.
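For the last checklist item, one possible approach is a dedicated logger that does not propagate to the root logger. This is a minimal sketch using Python's standard logging module; the logger name `guard_events` and the helper `log_guard_event` are illustrative choices, not part of Julep or RunGuard.

```python
import logging

# Dedicated stream for guard trips, kept out of the application error log.
guard_log = logging.getLogger("guard_events")  # hypothetical logger name
guard_log.setLevel(logging.INFO)
guard_log.propagate = False  # do not forward records to root handlers

_handler = logging.StreamHandler()  # swap for a file or log-shipper handler
_handler.setFormatter(logging.Formatter("%(asctime)s GUARD %(message)s"))
guard_log.addHandler(_handler)


def log_guard_event(guard: str, action: str, **details) -> None:
    """Record a guard trip as an informational event, not an error."""
    guard_log.info("guard=%s action=%s details=%s", guard, action, details)


log_guard_event("foreach", "truncated", original_count=47, truncated_count=10)
```

Keeping the events at INFO level on their own logger means alerting rules on the error log stay quiet while the guard stream can still be counted and graphed.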
Frequently asked questions
Does Julep's built-in max_steps or execution timeout handle these failure modes?
Julep's execution timeout limits how long a single execution runs, not how many LLM API calls it makes. A timeout fires after the wall-clock deadline regardless of step count — but by then, the API calls have already been billed. The guards here fire before the expensive step executes, which eliminates the billing rather than just ending it. The foreach guard, for example, fires before the foreach loop starts, reducing the call count to the capped value from the first iteration rather than aborting mid-loop after N calls have already fired.
Can I implement these guards as Julep tools rather than external HTTP endpoints?
Yes. Julep's tool system supports HTTP tools (external endpoints), system tools (built-in Julep primitives like search and document management), and integration tools (pre-built connectors). Any of the guards above can be registered as an HTTP tool on the agent, which means calling them from a tool_call step looks identical to any other tool invocation. The advantage of HTTP tools for guards is that guard state (attempt counters, truncation logs) can live in a persistent store (Redis, Postgres) rather than in-memory — important for guards that need to survive execution interruptions or span multiple partial executions of the same task.
How does Julep's context_overflow session setting interact with the session token guard?
Julep's context_overflow setting on a session controls what happens when accumulated history exceeds the model's context window: truncate (drop oldest messages), summarize (call LLM to compress), or adaptive (Julep-managed). The session token guard described here fires before any prompt step, catching runaway session growth before it hits the model's context limit. The two mechanisms are complementary: context_overflow handles the case where the session exceeds model limits; the session guard handles the case where accumulated tokens are within model limits but are producing unexpectedly high per-turn costs. Set the guard's threshold well below the model's context window so the guard fires first and session rotation remains a controlled operation rather than an emergency.
The delegation depth guard threads depth through task inputs. What if a task is called by multiple orchestrators at different depths?
Threading depth through inputs means each execution carries its own depth context independently of other concurrent executions. Two orchestrators calling the same specialist task simultaneously each pass their own _execution_depth value — the specialist sees the depth specific to its call chain, not a global counter. This is the correct behavior: two separate user requests that each generate a 3-level delegation tree should each be evaluated independently, not share a counter that causes the second request's third level to fail because the first request already incremented a shared depth register. The only case where shared state is needed is detecting loops within a single execution (task A yields to B which yields to A) — for that, _yield_count serves as the per-execution ceiling.
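The independence of call chains can be illustrated in a few lines. This sketch uses plain dictionaries to stand in for task inputs; `child_input` and `may_proceed` are hypothetical helpers, and the ceiling of 3 mirrors the value used earlier in the article.

```python
MAX_DEPTH = 3


def child_input(parent_input: dict, **payload) -> dict:
    """Build the input for a yielded task, carrying depth forward by one."""
    depth = parent_input.get("_execution_depth", 0)
    return {**payload, "_execution_depth": depth + 1}


def may_proceed(task_input: dict) -> bool:
    """Depth check a task would run before any prompt step."""
    return task_input.get("_execution_depth", 0) < MAX_DEPTH


# Two unrelated call chains reach the same specialist task.
chain_a_parent = {"topic": "pricing"}                         # depth 0
chain_b_parent = {"topic": "billing", "_execution_depth": 2}  # already deep

specialist_a = child_input(chain_a_parent, topic="pricing")   # depth 1
specialist_b = child_input(chain_b_parent, topic="billing")   # depth 3

print(may_proceed(specialist_a), may_proceed(specialist_b))  # True False
```

Each chain is judged on the depth it carries, so blocking the deep chain has no effect on the shallow one.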
How do I install RunGuard's Python SDK to use in guard service implementations?
Install with pip install runguard. The package provides LoopDetector, BudgetTracker, and ContextGuard classes as well as async-compatible guard_async() wrappers. For the guard endpoints above, BudgetTracker is the primary primitive — it tracks cumulative spend against a configurable cap and raises BudgetExceededError when the cap is reached. LoopDetector is useful for detecting repeated tool-call signatures within a single Julep execution. See runguard.dev for API documentation and the SDK source.