BabyAGI and AutoGPT Cost Control: Recursive Task Explosion, Reflection Loops, and Goal Completion Ambiguity
BabyAGI and AutoGPT are the original autonomous agent frameworks — the ones that made "AI that runs itself" feel plausible in 2023. BabyAGI implements the simplest possible architecture: a task list, an execution agent, a task-creation agent, and a prioritization agent, cycling until the task list is empty. AutoGPT adds a richer inner monologue — Thought, Reasoning, Plan, Criticism, Action — and a persistent memory store backed by a vector database. Both frameworks share the same fundamental design: the agent output at step N determines the agent input at step N+1, with no external convergence signal.
That self-directed loop is exactly what makes these frameworks compelling for research and experimentation. It is also the architectural root of every unexpected cost pattern they produce. When an agent's output determines its own next input, any mistake, ambiguity, or unhappy intermediate state propagates forward and is paid for in LLM tokens. A well-scoped task on a narrow topic terminates cleanly. A fuzzy goal on a broad topic does not — it generates subtasks faster than it resolves them, and the task queue grows until an external kill signal or a credit limit fires.
The four structural failure modes that account for the majority of unexpected costs in BabyAGI and AutoGPT deployments:
- Recursive task tree explosion — BabyAGI's task-creation agent generates subtasks from each execution result; ambiguous or broad goals produce task lists that grow faster than they shrink; a single "research X comprehensively" goal can expand to 40–60 queued tasks before the first result is useful.
- Reflection-reformulation spiral — AutoGPT's Criticism phase evaluates whether the current Plan is adequate; when the critic judges a plan insufficient, the agent reformulates and re-enters the Thought-Reasoning-Plan cycle; without a semantic-similarity check, this cycle produces minimally different plan variants at full GPT-4 cost per iteration.
- Goal completion ambiguity — both frameworks rely on the agent to decide when the top-level goal is "done"; open-ended goals (improve the codebase, research the market, make the document better) have no natural termination condition; the agent always finds one more relevant subtask, and without an explicit completion classifier, the loop runs until resources are exhausted.
- Context compression runaway — AutoGPT compresses its in-memory context when the accumulated history approaches the model's context window limit; if the compressed summary is judged insufficient (too lossy), the agent re-compresses from a slightly different memory slice, paying double or triple compression cost; this pattern is particularly expensive because compression calls inject the full current history as input.
BabyAGI's and AutoGPT's cost model
Both frameworks are model-agnostic wrappers, so costs flow directly to whichever LLM provider you configure. The canonical deployment uses GPT-4o or GPT-4 Turbo. At GPT-4o rates ($0.005/1K input tokens, $0.015/1K output tokens), the cost per agent cycle breaks down as:
- BabyAGI, per loop iteration: one execution call (the task itself — 500–2,000 token context + 200–500 token result) + one task-creation call (prior results injected — grows with task count) + one prioritization call (full task list). Three LLM calls per cycle. A 50-task run pays roughly 150 LLM calls before accounting for memory injection growth.
- AutoGPT, per action cycle: one Thought-Reasoning-Plan-Criticism-Action generation call (typically 3,000–8,000 tokens of accumulated context + memory injection) + any tool calls that invoke LLMs (web summarization, code explanation) + any memory store upsert calls. One base call per action step, multiplied by the number of tool-chained LLM operations.
A realistic AutoGPT session that completes a research task in 15 action steps costs roughly $0.30–$0.80 in model calls — reasonable. The same session caught in a reflection-reformulation spiral for 20 additional cycles before converging costs an additional $2–$5. A BabyAGI run that terminates after 30 tasks is fine. One that generates 120 tasks on an ambiguous goal costs $3–$8 in planning calls alone, most for tasks that are never useful. The failure is rarely catastrophic on a single run — it's the accumulation across many sessions, or a single runaway automated session, that produces the $50–$200 unexpected monthly charges teams report.
Failure mode 1: recursive task tree explosion
BabyAGI's task-creation agent receives the result of the just-completed task and the original objective, then generates new tasks to work toward the objective given what was just learned. The implicit assumption is that each execution result should shrink the remaining work. In practice, broad objectives produce the opposite: every research finding reveals more avenues, and the task-creation agent dutifully generates subtasks for each.
A goal like "research the competitive landscape for AI developer tools" yields an initial task: survey the market. The execution agent produces a list of 12 companies. The task-creation agent generates 12 new tasks — one per company. Each company task produces founding date, pricing, and 3 differentiators. The task-creation agent generates 36 more tasks to compare pricing tiers, track recent funding, and identify customer reviews. The task queue depth grows at 3× per generation on an unbounded research topic. By the time any useful output exists, 80 tasks have been queued and 60 LLM calls have fired.
The fix requires three controls: a hard ceiling on total tasks per session, a depth counter to track how many generations removed a task is from the root goal, and a pre-creation check that asks whether the proposed tasks are genuinely new work or reformulations of already-queued tasks.
import hashlib
import anthropic
from runguard import BudgetTracker, BudgetExceededError
class TaskTreeGuard:
"""
Tracks task tree depth and total count; blocks task creation
that would exceed either ceiling.
"""
def __init__(
self,
max_total_tasks: int = 30,
max_depth: int = 4,
session_budget_usd: float = 3.0,
):
self.max_total_tasks = max_total_tasks
self.max_depth = max_depth
self.budget = BudgetTracker(cap=session_budget_usd)
self._task_count = 0
self._task_hashes: set[str] = set()
def _task_hash(self, task: str) -> str:
return hashlib.sha256(task.lower().strip().encode()).hexdigest()[:16]
def can_add_tasks(self, proposed: list[str], current_depth: int) -> list[str]:
"""
Filters proposed task list down to tasks that are:
1. Not already in the task queue (dedup by hash)
2. Within depth ceiling
3. Within total task ceiling
Returns the filtered list; empty list means no tasks can be added.
"""
if current_depth >= self.max_depth:
return []
allowed = []
for task in proposed:
h = self._task_hash(task)
if h in self._task_hashes:
continue # duplicate
if self._task_count + len(allowed) >= self.max_total_tasks:
break # total ceiling
allowed.append(task)
self._task_hashes.add(h)
self._task_count += len(allowed)
return allowed
def record_llm_cost(self, cost_usd: float):
try:
self.budget.add(cost_usd)
except BudgetExceededError as e:
raise RuntimeError(
f"Session budget exhausted (${self.budget.cap:.2f} cap). "
f"Task tree terminated after {self._task_count} tasks."
) from e
def run_babyagi_with_guard(
objective: str,
initial_task: str,
model: str = "claude-sonnet-4-6",
max_total_tasks: int = 30,
max_depth: int = 4,
session_budget_usd: float = 3.0,
) -> list[dict]:
"""
BabyAGI-style loop with task tree guard: hard ceilings on total tasks,
tree depth, and session budget. Returns list of completed task results.
"""
client = anthropic.Anthropic()
guard = TaskTreeGuard(max_total_tasks, max_depth, session_budget_usd)
task_queue: list[dict] = [{"task": initial_task, "depth": 0}]
completed_results: list[dict] = []
while task_queue:
current = task_queue.pop(0)
task_text = current["task"]
depth = current["depth"]
# Execute the task
context_summary = "\n".join(
f"- {r['task']}: {r['result'][:200]}" for r in completed_results[-5:]
)
exec_response = client.messages.create(
model=model,
max_tokens=512,
messages=[{
"role": "user",
"content": (
f"Objective: {objective}\n\n"
f"Recent completed tasks:\n{context_summary or 'None yet'}\n\n"
f"Current task: {task_text}\n\n"
"Complete this task concisely. Return only the result."
),
}],
)
result_text = exec_response.content[0].text.strip()
input_tokens = exec_response.usage.input_tokens
output_tokens = exec_response.usage.output_tokens
exec_cost = (input_tokens * 0.000003) + (output_tokens * 0.000015)
guard.record_llm_cost(exec_cost)
completed_results.append({"task": task_text, "result": result_text, "depth": depth})
# Generate new tasks — only if within depth ceiling
next_depth = depth + 1
if next_depth >= max_depth:
continue
creation_response = client.messages.create(
model=model,
max_tokens=256,
messages=[{
"role": "user",
"content": (
f"Objective: {objective}\n\n"
f"Completed task: {task_text}\n"
f"Result: {result_text}\n\n"
f"Remaining tasks already queued: {len(task_queue)}\n"
f"Total tasks completed so far: {len(completed_results)}\n\n"
"List up to 3 new tasks needed to advance the objective, "
"given this result. Return each task on its own line. "
"If the objective is sufficiently addressed, return DONE."
),
}],
)
creation_text = creation_response.content[0].text.strip()
c_tokens = creation_response.usage.input_tokens + creation_response.usage.output_tokens
guard.record_llm_cost(c_tokens * 0.000006)
if "DONE" in creation_text.upper():
break
new_tasks_raw = [
line.lstrip("0123456789.-) ").strip()
for line in creation_text.splitlines()
if line.strip() and len(line.strip()) > 10
]
allowed_tasks = guard.can_add_tasks(new_tasks_raw, next_depth)
for t in allowed_tasks:
task_queue.append({"task": t, "depth": next_depth})
return completed_results
The guard enforces three independent ceilings: the total task count (max_total_tasks=30) prevents runaway expansion; the depth ceiling (max_depth=4) prevents recursive deepening on a single subtask branch; the budget tracker terminates the session if cumulative LLM costs exceed the session budget regardless of whether the task ceiling has been hit. Task deduplication by content hash prevents the task-creation agent from re-queuing semantically identical tasks under slightly different phrasing — a common pattern when the agent is uncertain whether a task has been completed.
Failure mode 2: reflection-reformulation spiral
AutoGPT's inner monologue has a Criticism phase by design. After generating a Plan, the agent critiques it: identifying weaknesses, missing steps, and potential failure modes. If the criticism is substantive, the next Thought-Reasoning-Plan cycle incorporates the critique and produces a revised plan. This is the intended behavior — self-correction before acting is cheaper than correcting after a failed tool call.
The spiral emerges when the critique is structurally accurate but the plan revision is cosmetically different. The agent reformulates the plan with slightly different phrasing or task ordering. The next criticism identifies the same underlying weaknesses. The plan is revised again. Each iteration looks like productive reasoning from inside the agent's context — it is generating valid text about valid concerns — but successive plan versions are semantically near-identical. The agent is paying GPT-4 rates to rewrite the same plan four times before taking a single action.
Detecting this requires comparing successive plan versions for semantic similarity. If plan version N and plan version N-1 are above a cosine similarity threshold, the loop is stuck and the agent should either act on the current plan or escalate to a human.
import anthropic
from runguard import BudgetTracker, BudgetExceededError, LoopDetector
def get_plan_embedding(plan_text: str, client: anthropic.Anthropic) -> list[float]:
"""
Returns a simple embedding proxy using token overlap as a lightweight
similarity measure. For production, use a real embedding model.
"""
# Lightweight Jaccard similarity proxy: bag-of-words overlap
# Replace with client.embeddings() or openai.embeddings if available
words = set(plan_text.lower().split())
return list(words) # type: ignore # return set for overlap comparison
def jaccard_similarity(a: list, b: list) -> float:
sa, sb = set(a), set(b)
if not sa or not sb:
return 0.0
return len(sa & sb) / len(sa | sb)
class ReflectionLoopDetector:
"""
Tracks successive plan versions in a Thought-Plan-Critique cycle.
Raises LoopDetectedError when consecutive plans are too similar,
indicating the critique cycle is not making progress.
"""
def __init__(
self,
similarity_threshold: float = 0.82,
max_similar_consecutive: int = 2,
):
self.threshold = similarity_threshold
self.max_similar = max_similar_consecutive
self._plan_history: list[list] = []
self._similar_count = 0
def check_plan(self, new_plan: str) -> bool:
"""
Returns True if plan is making progress, False if stuck in a loop.
Call before committing to another reflection cycle.
"""
new_emb = get_plan_embedding(new_plan, None) # type: ignore
if self._plan_history:
sim = jaccard_similarity(new_emb, self._plan_history[-1])
if sim >= self.threshold:
self._similar_count += 1
if self._similar_count >= self.max_similar:
return False # loop detected — act or escalate
else:
self._similar_count = 0 # genuine revision, reset counter
self._plan_history.append(new_emb)
return True # making progress
def autogpt_action_cycle(
objective: str,
model: str = "claude-sonnet-4-6",
max_action_steps: int = 20,
max_reflection_cycles: int = 3,
session_budget_usd: float = 2.0,
) -> str:
"""
AutoGPT-style Thought-Reasoning-Plan-Criticism-Action loop.
Reflection loop detector breaks out of stuck plan-critique spirals.
"""
client = anthropic.Anthropic()
budget = BudgetTracker(cap=session_budget_usd)
reflection_guard = ReflectionLoopDetector(
similarity_threshold=0.82,
max_similar_consecutive=2,
)
history: list[dict] = []
action_results: list[str] = []
reflection_cycles = 0
for step in range(max_action_steps):
# Build context: inject last 5 action results to limit token growth
context = "\n".join(action_results[-5:]) if action_results else "No actions taken yet."
# Inner monologue: generate Thought + Reasoning + Plan + Criticism + Action
inner_response = client.messages.create(
model=model,
max_tokens=800,
system=(
"You are an autonomous agent. For each step, output exactly:\n"
"THOUGHT: \n"
"REASONING: <2-3 sentences>\n"
"PLAN: \n"
"CRITICISM: \n"
"ACTION: \n"
"DONE: "
),
messages=[
*history[-6:], # keep last 3 exchanges to bound context growth
{
"role": "user",
"content": (
f"Objective: {objective}\n\n"
f"Recent action results:\n{context}\n\n"
"Produce your inner monologue and next action."
),
},
],
)
response_text = inner_response.content[0].text.strip()
tokens_used = inner_response.usage.input_tokens + inner_response.usage.output_tokens
call_cost = tokens_used * 0.000009 # blended rate
try:
budget.add(call_cost)
except BudgetExceededError as e:
raise RuntimeError(
f"AutoGPT session budget exhausted at step {step}. "
f"Completed {len(action_results)} actions."
) from e
history.extend([
{"role": "user", "content": f"Objective: {objective}"},
{"role": "assistant", "content": response_text},
])
# Extract PLAN for reflection loop detection
plan_text = ""
for line in response_text.splitlines():
if line.startswith("PLAN:") or (plan_text and not line.startswith(
("THOUGHT:", "REASONING:", "CRITICISM:", "ACTION:", "DONE:")
)):
plan_text += line + "\n"
# Check for reflection loop
if not reflection_guard.check_plan(plan_text):
reflection_cycles += 1
if reflection_cycles >= max_reflection_cycles:
# Extract the current ACTION and force execution rather than re-reflecting
action_line = next(
(l for l in response_text.splitlines() if l.startswith("ACTION:")),
"ACTION: Report current findings and stop."
)
action_results.append(
f"[Loop guard forced action after {reflection_cycles} stuck cycles] {action_line}"
)
break
# Force a fresh start on reflection by injecting a directive
history.append({
"role": "user",
"content": (
"Your last two plans are nearly identical. Stop refining the plan. "
"Execute the ACTION from your most recent step immediately."
),
})
continue
# Check DONE
if "DONE: YES" in response_text.upper():
break
# Extract and execute ACTION (stub — replace with real tool dispatch)
action_line = next(
(l for l in response_text.splitlines() if l.startswith("ACTION:")),
None,
)
if action_line:
action_result = f"Executed: {action_line.replace('ACTION:', '').strip()}"
action_results.append(action_result)
return "\n".join(action_results)
The reflection guard tracks successive plan versions using a Jaccard word-overlap similarity — cheap to compute, no embedding API call required. When two consecutive plans score above the 0.82 threshold, the similar-cycle counter increments. After two consecutive stuck cycles, the guard injects a directive forcing the agent to execute rather than re-reflect. This mirrors how a human pair-programmer would intervene: "stop planning, just do it." The agent retains the ability to reflect on genuine plan gaps; it loses the ability to spin on the same gap indefinitely.
Failure mode 3: goal completion ambiguity
BabyAGI's original loop condition is while task_list — the agent runs until the task list is empty. AutoGPT's is while not done, where "done" is determined by the agent's own DONE: YES output. Both termination conditions are correct for well-scoped goals. They fail for goals where "done" is a matter of degree rather than a binary state.
Research goals, writing improvement goals, and optimization goals are all degree-goals. "Research the AI agent market" can always be followed by "research one more category." "Improve the API documentation" can always find one more section to improve. The agent's task-creation logic is optimized to find productive next steps — it is not optimized to recognize that sufficient progress has been made on a fuzzy objective. Without an external evaluator that can assess completeness from the original requester's perspective, the agent treats the goal as perpetually incomplete.
The guard is a completion classifier: a fast model (Claude Haiku) that receives the original goal, the completed task summaries, and the proposed next tasks, and scores how complete the goal is on a 0–100 scale. Above a configurable threshold (typically 75–80), the session terminates regardless of remaining queued tasks.
import anthropic
from runguard import BudgetTracker, BudgetExceededError
class GoalCompletionClassifier:
"""
Uses a fast model to score goal completion before committing to
another batch of task creation. Prevents runaway on fuzzy objectives.
"""
def __init__(
self,
completion_threshold: float = 75.0,
check_interval: int = 5, # check every N completed tasks
classifier_model: str = "claude-haiku-4-5-20251001",
):
self.threshold = completion_threshold
self.check_interval = check_interval
self.model = classifier_model
self._client = anthropic.Anthropic()
self._checks_run = 0
def should_check(self, completed_task_count: int) -> bool:
return completed_task_count > 0 and completed_task_count % self.check_interval == 0
def score_completion(
self,
objective: str,
completed_summaries: list[str],
proposed_next_tasks: list[str],
budget: BudgetTracker,
) -> float:
"""
Returns a completion score 0–100. Higher = more complete.
Raises RuntimeError if budget is exhausted.
"""
completed_text = "\n".join(
f"- {s[:300]}" for s in completed_summaries[-10:]
)
proposed_text = "\n".join(f"- {t}" for t in proposed_next_tasks[:5])
try:
budget.add(0.0003) # Haiku classification call: ~$0.0002–$0.0004
except BudgetExceededError as e:
raise RuntimeError("Budget exhausted during completion check.") from e
response = self._client.messages.create(
model=self.model,
max_tokens=64,
messages=[{
"role": "user",
"content": (
f"Original objective: {objective}\n\n"
f"Work completed so far:\n{completed_text}\n\n"
f"Proposed next tasks:\n{proposed_text}\n\n"
"Score how completely the ORIGINAL OBJECTIVE has been addressed "
"by the completed work (0 = not at all, 100 = fully addressed). "
"Consider: would a reasonable person who set this objective "
"be satisfied with the work done so far? "
"Reply with only a number between 0 and 100."
),
}],
)
self._checks_run += 1
raw = response.content[0].text.strip()
try:
score = float("".join(c for c in raw if c.isdigit() or c == "."))
return min(100.0, max(0.0, score))
except ValueError:
return 50.0 # neutral score if parse fails
def is_complete(self, score: float) -> bool:
return score >= self.threshold
The completion classifier runs every 5 completed tasks (configurable via check_interval). It uses Claude Haiku rather than the primary model: the classification prompt is short and the task — judging whether work is sufficient — doesn't require deep reasoning, so paying $0.0003 per check rather than $0.01+ is justified. The prompt specifically asks whether "a reasonable person who set this objective would be satisfied" — this grounds the score in the original requester's intent rather than the agent's internal sense of completeness, which is systematically biased toward finding more to do.
Failure mode 4: context compression runaway
AutoGPT maintains a memory store — originally Pinecone or a local vector DB — and manages an in-memory conversation buffer. When the accumulated Thought-Action-Observation history approaches the model's context window limit, AutoGPT triggers a compression step: it feeds the current history to the LLM and asks for a summary that preserves the key facts while reducing token count. The summary replaces the raw history in the active context.
The runaway pattern emerges when the compressed summary is itself close to the context limit, or when the summary is judged by the agent to have dropped important facts. A second compression pass is triggered. The second pass injects the already-compressed content as input — paying the same token cost as the original compression — and produces a marginally shorter output. If the result is still above the threshold, a third pass fires. Each pass costs as much as generating a full action cycle because the input is the full accumulated context. Three compression passes on a 6,000-token history cost $0.18–$0.30 for context management alone, before the next actual action call.
import hashlib
import time
import anthropic
from runguard import BudgetTracker, BudgetExceededError
class ContextCompressionGuard:
"""
Prevents repeated compression of the same or near-same content.
Tracks compression calls by content hash; blocks re-compression
if the same hash was compressed within the TTL window.
Also enforces a per-session compression call ceiling.
"""
def __init__(
self,
max_compressions_per_session: int = 5,
dedup_ttl_seconds: int = 120,
budget: BudgetTracker | None = None,
):
self.max_compressions = max_compressions_per_session
self.ttl = dedup_ttl_seconds
self.budget = budget
self._compression_log: dict[str, float] = {} # hash → timestamp
self._session_count = 0
self._client = anthropic.Anthropic()
def _content_hash(self, history_text: str) -> str:
return hashlib.sha256(history_text[:4000].encode()).hexdigest()[:20]
def should_compress(self, history_text: str) -> bool:
"""Returns True if compression is warranted and allowed."""
if self._session_count >= self.max_compressions:
return False
h = self._content_hash(history_text)
last_compressed_at = self._compression_log.get(h)
if last_compressed_at and (time.time() - last_compressed_at) < self.ttl:
return False # same content compressed recently — skip
return True
def compress(
self,
history_text: str,
model: str = "claude-haiku-4-5-20251001",
target_token_ceiling: int = 2000,
) -> str:
"""
Compresses history_text to under target_token_ceiling tokens.
Tracks the compression in the dedup log.
Raises RuntimeError if session ceiling is reached.
"""
if self._session_count >= self.max_compressions:
raise RuntimeError(
f"Compression ceiling reached ({self.max_compressions} per session). "
"Truncate oldest history manually or start a new session."
)
h = self._content_hash(history_text)
est_cost = (len(history_text.split()) / 750) * 0.000001 # rough Haiku cost
if self.budget:
try:
self.budget.add(est_cost)
except BudgetExceededError as e:
raise RuntimeError("Budget exhausted before compression.") from e
response = self._client.messages.create(
model=model,
max_tokens=target_token_ceiling,
messages=[{
"role": "user",
"content": (
f"Summarize the following agent history, preserving all key facts, "
f"decisions, tool results, and open questions. "
f"Target: under {target_token_ceiling} tokens. "
f"History:\n\n{history_text}"
),
}],
)
compressed = response.content[0].text.strip()
self._compression_log[h] = time.time()
self._session_count += 1
return compressed
The compression guard uses a content hash of the first 4,000 characters of the history buffer as the dedup key. If the same hash appears in the log within the 120-second TTL, the compression call is skipped — the previous compressed output is still valid, and re-compressing the same content is pure waste. The session ceiling (max_compressions_per_session=5) provides an absolute backstop: even if the TTL window rolls over, no session can pay for more than 5 compression calls. When the ceiling is hit, the guard surfaces an explicit error rather than silently proceeding — this makes the capacity limit visible rather than masking it behind continued (increasingly expensive) retries.
Composing all four guards
The four failure modes can occur simultaneously in a single session. A BabyAGI run on an ambiguous goal hits task tree explosion and goal completion ambiguity together. An AutoGPT session on a long research task hits reflection-reformulation spirals early and context compression runaway later. The guards are designed to compose: each operates on its own state and can be instantiated independently, then passed to the agent loop functions.
from runguard import BudgetTracker
def build_autonomous_session_guards(
session_budget_usd: float = 5.0,
max_total_tasks: int = 25,
max_tree_depth: int = 4,
completion_threshold: float = 78.0,
check_every_n_tasks: int = 5,
max_compressions: int = 4,
):
"""
Returns a configured set of guards for a single autonomous agent session.
Pass budget into each guard that needs spending authority.
"""
budget = BudgetTracker(cap=session_budget_usd)
task_guard = TaskTreeGuard(
max_total_tasks=max_total_tasks,
max_depth=max_tree_depth,
session_budget_usd=session_budget_usd, # TaskTreeGuard owns its own BudgetTracker
)
reflection_guard = ReflectionLoopDetector(
similarity_threshold=0.82,
max_similar_consecutive=2,
)
completion_classifier = GoalCompletionClassifier(
completion_threshold=completion_threshold,
check_interval=check_every_n_tasks,
)
compression_guard = ContextCompressionGuard(
max_compressions_per_session=max_compressions,
dedup_ttl_seconds=120,
budget=budget,
)
return task_guard, reflection_guard, completion_classifier, compression_guard
With all four guards active, a BabyAGI or AutoGPT session has four independent termination signals: task count ceiling, plan similarity ceiling, goal completion score threshold, and context compression ceiling — plus the per-session budget cap that spans all of them. Each guard raises a distinct exception type or returns a sentinel value, so the calling code can log which guard fired for each terminated session. Over time, the distribution of which guards fire most often is a signal about which failure mode is most active for your specific objective types — useful for tuning thresholds.
Choosing thresholds
The correct threshold values depend on your goal types. A research assistant running on open-ended questions needs higher task ceilings (max_total_tasks=50) and a lower completion threshold (60–70) because the domain is inherently broad. An agent completing a specific, bounded task (write a test for function X, summarize document Y) needs lower task ceilings (max_total_tasks=10) and a higher completion threshold (85–90) because "done" has a clearer meaning.
Start conservative: max_total_tasks=20, completion_threshold=75, max_depth=3. Monitor which guard fires most often across your first 50 sessions. If the task ceiling fires before the goal is meaningfully addressed, raise it. If the completion classifier is terminating sessions too early (users report incomplete results), lower the threshold. If the reflection guard fires rarely but compression runaway is common, lower max_compressions first and raise the reflection similarity threshold. The guards are cheap to tune because the thresholds live in a single configuration object and the historical guard-firing logs provide a clear feedback signal.
Frequently asked questions
Do these guards work with the original BabyAGI and AutoGPT codebases, or only custom implementations?
The guards are standalone Python classes and work with any implementation that calls an LLM in a loop. For the original BabyAGI (task_list while-loop in main.py) and AutoGPT (the agent.py action cycle), you inject the guards at the task-creation point and the plan-generation point respectively. The integration is typically 10–20 lines of wrapping code. Forks and derivatives (AgentGPT, AgentRunner, GPT-Engineer) have the same architectural patterns and the same injection points.
The completion classifier adds another LLM call — doesn't that make costs worse?
A Haiku completion check costs roughly $0.0002–$0.0004. It runs every 5 completed tasks. A 30-task session pays for 6 completion checks at ~$0.002 total — less than 1% of the session's LLM spend. The break-even is saving even one unnecessary 5-task generation batch, which costs $0.15–$0.40. The classifier pays for itself if it catches a single premature extension of a sufficiently-addressed session.
What happens to the task tree guard when a task requires genuine deep exploration?
The depth and count ceilings are configurable. For deep-exploration tasks, set max_depth=6 and max_total_tasks=60. The key guard that prevents runaway in this case is the budget cap — even with generous depth/count ceilings, the session terminates when the dollar ceiling is hit. Set the budget ceiling to match what you're willing to pay for one session of that task type, and the other ceilings become secondary backstops rather than primary constraints.
How does the reflection guard handle legitimate multi-step plan refinement?
The similarity threshold (0.82 default) is calibrated to flag near-identical plans, not genuinely revised ones. A plan that adds a new step, removes a step, or reorders priorities significantly will score well below 0.82 on Jaccard overlap. The guard only fires when consecutive plans are word-for-word similar with surface-level variation — the hallmark of a stuck cycle rather than productive refinement. If your use case involves fine-grained iterative planning, lower the threshold to 0.90 to require higher similarity before flagging.
Is RunGuard's BudgetTracker compatible with these custom guards?
Yes. RunGuard's BudgetTracker is a thread-safe counter with a configurable cap and a BudgetExceededError on breach. The guards above use it directly via budget.add(cost_usd). You can pass a single shared BudgetTracker instance across all guards in a session so that the dollar ceiling is enforced across all four failure modes simultaneously, regardless of which one is active at any given step.