BabyAGI and AutoGPT Cost Control: Recursive Task Explosion, Reflection Loops, and Goal Completion Ambiguity

BabyAGI and AutoGPT are the original autonomous agent frameworks — the ones that made "AI that runs itself" feel plausible in 2023. BabyAGI implements the simplest possible architecture: a task list, an execution agent, a task-creation agent, and a prioritization agent, cycling until the task list is empty. AutoGPT adds a richer inner monologue — Thought, Reasoning, Plan, Criticism, Action — and a persistent memory store backed by a vector database. Both frameworks share the same fundamental design: the agent output at step N determines the agent input at step N+1, with no external convergence signal.

That self-directed loop is exactly what makes these frameworks compelling for research and experimentation. It is also the architectural root of every unexpected cost pattern they produce. When an agent's output determines its own next input, any mistake, ambiguity, or unhappy intermediate state propagates forward and is paid for in LLM tokens. A well-scoped task on a narrow topic terminates cleanly. A fuzzy goal on a broad topic does not — it generates subtasks faster than it resolves them, and the task queue grows until an external kill signal or a credit limit fires.

The four structural failure modes that account for the majority of unexpected costs in BabyAGI and AutoGPT deployments:

  • Recursive task tree explosion — BabyAGI's task-creation agent generates subtasks from each execution result; ambiguous or broad goals produce task lists that grow faster than they shrink; a single "research X comprehensively" goal can expand to 40–60 queued tasks before the first result is useful.
  • Reflection-reformulation spiral — AutoGPT's Criticism phase evaluates whether the current Plan is adequate; when the critic judges a plan insufficient, the agent reformulates and re-enters the Thought-Reasoning-Plan cycle; without a semantic-similarity check, this cycle produces minimally different plan variants at full GPT-4 cost per iteration.
  • Goal completion ambiguity — both frameworks rely on the agent to decide when the top-level goal is "done"; open-ended goals (improve the codebase, research the market, make the document better) have no natural termination condition; the agent always finds one more relevant subtask, and without an explicit completion classifier, the loop runs until resources are exhausted.
  • Context compression runaway — AutoGPT compresses its in-memory context when the accumulated history approaches the model's context window limit; if the compressed summary is judged insufficient (too lossy), the agent re-compresses from a slightly different memory slice, paying double or triple compression cost; this pattern is particularly expensive because compression calls inject the full current history as input.

BabyAGI's and AutoGPT's cost model

Both frameworks are model-agnostic wrappers, so costs flow directly to whichever LLM provider you configure. The canonical deployment uses GPT-4o or GPT-4 Turbo. At GPT-4o rates ($0.005/1K input tokens, $0.015/1K output tokens), the cost per agent cycle breaks down as:

  • BabyAGI, per loop iteration: one execution call (the task itself — 500–2,000 token context + 200–500 token result) + one task-creation call (prior results injected — grows with task count) + one prioritization call (full task list). Three LLM calls per cycle. A 50-task run pays roughly 150 LLM calls before accounting for memory injection growth.
  • AutoGPT, per action cycle: one Thought-Reasoning-Plan-Criticism-Action generation call (typically 3,000–8,000 tokens of accumulated context + memory injection) + any tool calls that invoke LLMs (web summarization, code explanation) + any memory store upsert calls. One base call per action step, multiplied by the number of tool-chained LLM operations.

A realistic AutoGPT session that completes a research task in 15 action steps costs roughly $0.30–$0.80 in model calls — reasonable. The same session caught in a reflection-reformulation spiral for 20 additional cycles before converging costs an additional $2–$5. A BabyAGI run that terminates after 30 tasks is fine. One that generates 120 tasks on an ambiguous goal costs $3–$8 in planning calls alone, most for tasks that are never useful. The failure is rarely catastrophic on a single run — it's the accumulation across many sessions, or a single runaway automated session, that produces the $50–$200 unexpected monthly charges teams report.

Failure mode 1: recursive task tree explosion

BabyAGI's task-creation agent receives the result of the just-completed task and the original objective, then generates new tasks to work toward the objective given what was just learned. The implicit assumption is that each execution result should shrink the remaining work. In practice, broad objectives produce the opposite: every research finding reveals more avenues, and the task-creation agent dutifully generates subtasks for each.

A goal like "research the competitive landscape for AI developer tools" yields an initial task: survey the market. The execution agent produces a list of 12 companies. The task-creation agent generates 12 new tasks — one per company. Each company task produces founding date, pricing, and 3 differentiators. The task-creation agent generates 36 more tasks to compare pricing tiers, track recent funding, and identify customer reviews. The task queue depth grows at 3× per generation on an unbounded research topic. By the time any useful output exists, 80 tasks have been queued and 60 LLM calls have fired.

The fix requires three controls: a hard ceiling on total tasks per session, a depth counter to track how many generations removed a task is from the root goal, and a pre-creation check that asks whether the proposed tasks are genuinely new work or reformulations of already-queued tasks.

Python — task tree guard for BabyAGI-style recursive spawning
import hashlib
import anthropic
from runguard import BudgetTracker, BudgetExceededError

class TaskTreeGuard:
    """
    Tracks task tree depth and total count; blocks task creation
    that would exceed either ceiling.
    """
    def __init__(
        self,
        max_total_tasks: int = 30,
        max_depth: int = 4,
        session_budget_usd: float = 3.0,
    ):
        self.max_total_tasks = max_total_tasks
        self.max_depth = max_depth
        self.budget = BudgetTracker(cap=session_budget_usd)
        self._task_count = 0
        self._task_hashes: set[str] = set()

    def _task_hash(self, task: str) -> str:
        return hashlib.sha256(task.lower().strip().encode()).hexdigest()[:16]

    def can_add_tasks(self, proposed: list[str], current_depth: int) -> list[str]:
        """
        Filters proposed task list down to tasks that are:
        1. Not already in the task queue (dedup by hash)
        2. Within depth ceiling
        3. Within total task ceiling
        Returns the filtered list; empty list means no tasks can be added.
        """
        if current_depth >= self.max_depth:
            return []

        allowed = []
        for task in proposed:
            h = self._task_hash(task)
            if h in self._task_hashes:
                continue  # duplicate
            if self._task_count + len(allowed) >= self.max_total_tasks:
                break  # total ceiling
            allowed.append(task)
            self._task_hashes.add(h)

        self._task_count += len(allowed)
        return allowed

    def record_llm_cost(self, cost_usd: float):
        try:
            self.budget.add(cost_usd)
        except BudgetExceededError as e:
            raise RuntimeError(
                f"Session budget exhausted (${self.budget.cap:.2f} cap). "
                f"Task tree terminated after {self._task_count} tasks."
            ) from e


def run_babyagi_with_guard(
    objective: str,
    initial_task: str,
    model: str = "claude-sonnet-4-6",
    max_total_tasks: int = 30,
    max_depth: int = 4,
    session_budget_usd: float = 3.0,
) -> list[dict]:
    """
    BabyAGI-style loop with task tree guard: hard ceilings on total tasks,
    tree depth, and session budget. Returns list of completed task results.
    """
    client = anthropic.Anthropic()
    guard = TaskTreeGuard(max_total_tasks, max_depth, session_budget_usd)

    task_queue: list[dict] = [{"task": initial_task, "depth": 0}]
    completed_results: list[dict] = []

    while task_queue:
        current = task_queue.pop(0)
        task_text = current["task"]
        depth = current["depth"]

        # Execute the task
        context_summary = "\n".join(
            f"- {r['task']}: {r['result'][:200]}" for r in completed_results[-5:]
        )
        exec_response = client.messages.create(
            model=model,
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": (
                    f"Objective: {objective}\n\n"
                    f"Recent completed tasks:\n{context_summary or 'None yet'}\n\n"
                    f"Current task: {task_text}\n\n"
                    "Complete this task concisely. Return only the result."
                ),
            }],
        )
        result_text = exec_response.content[0].text.strip()
        input_tokens = exec_response.usage.input_tokens
        output_tokens = exec_response.usage.output_tokens
        exec_cost = (input_tokens * 0.000003) + (output_tokens * 0.000015)
        guard.record_llm_cost(exec_cost)

        completed_results.append({"task": task_text, "result": result_text, "depth": depth})

        # Generate new tasks — only if within depth ceiling
        next_depth = depth + 1
        if next_depth >= max_depth:
            continue

        creation_response = client.messages.create(
            model=model,
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": (
                    f"Objective: {objective}\n\n"
                    f"Completed task: {task_text}\n"
                    f"Result: {result_text}\n\n"
                    f"Remaining tasks already queued: {len(task_queue)}\n"
                    f"Total tasks completed so far: {len(completed_results)}\n\n"
                    "List up to 3 new tasks needed to advance the objective, "
                    "given this result. Return each task on its own line. "
                    "If the objective is sufficiently addressed, return DONE."
                ),
            }],
        )
        creation_text = creation_response.content[0].text.strip()
        c_tokens = creation_response.usage.input_tokens + creation_response.usage.output_tokens
        guard.record_llm_cost(c_tokens * 0.000006)

        if "DONE" in creation_text.upper():
            break

        new_tasks_raw = [
            line.lstrip("0123456789.-) ").strip()
            for line in creation_text.splitlines()
            if line.strip() and len(line.strip()) > 10
        ]
        allowed_tasks = guard.can_add_tasks(new_tasks_raw, next_depth)
        for t in allowed_tasks:
            task_queue.append({"task": t, "depth": next_depth})

    return completed_results

The guard enforces three independent ceilings: the total task count (max_total_tasks=30) prevents runaway expansion; the depth ceiling (max_depth=4) prevents recursive deepening on a single subtask branch; the budget tracker terminates the session if cumulative LLM costs exceed the session budget regardless of whether the task ceiling has been hit. Task deduplication by content hash prevents the task-creation agent from re-queuing semantically identical tasks under slightly different phrasing — a common pattern when the agent is uncertain whether a task has been completed.

Failure mode 2: reflection-reformulation spiral

AutoGPT's inner monologue has a Criticism phase by design. After generating a Plan, the agent critiques it: identifying weaknesses, missing steps, and potential failure modes. If the criticism is substantive, the next Thought-Reasoning-Plan cycle incorporates the critique and produces a revised plan. This is the intended behavior — self-correction before acting is cheaper than correcting after a failed tool call.

The spiral emerges when the critique is structurally accurate but the plan revision is cosmetically different. The agent reformulates the plan with slightly different phrasing or task ordering. The next criticism identifies the same underlying weaknesses. The plan is revised again. Each iteration looks like productive reasoning from inside the agent's context — it is generating valid text about valid concerns — but successive plan versions are semantically near-identical. The agent is paying GPT-4 rates to rewrite the same plan four times before taking a single action.

Detecting this requires comparing successive plan versions for semantic similarity. If plan version N and plan version N-1 are above a cosine similarity threshold, the loop is stuck and the agent should either act on the current plan or escalate to a human.

Python — reflection loop detector for AutoGPT-style plan-critique cycles
import anthropic
from runguard import BudgetTracker, BudgetExceededError, LoopDetector

def get_plan_embedding(plan_text: str, client: anthropic.Anthropic) -> list[float]:
    """
    Returns a simple embedding proxy using token overlap as a lightweight
    similarity measure. For production, use a real embedding model.
    """
    # Lightweight Jaccard similarity proxy: bag-of-words overlap
    # Replace with client.embeddings() or openai.embeddings if available
    words = set(plan_text.lower().split())
    return list(words)  # type: ignore  # return set for overlap comparison


def jaccard_similarity(a: list, b: list) -> float:
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


class ReflectionLoopDetector:
    """
    Tracks successive plan versions in a Thought-Plan-Critique cycle.
    Raises LoopDetectedError when consecutive plans are too similar,
    indicating the critique cycle is not making progress.
    """
    def __init__(
        self,
        similarity_threshold: float = 0.82,
        max_similar_consecutive: int = 2,
    ):
        self.threshold = similarity_threshold
        self.max_similar = max_similar_consecutive
        self._plan_history: list[list] = []
        self._similar_count = 0

    def check_plan(self, new_plan: str) -> bool:
        """
        Returns True if plan is making progress, False if stuck in a loop.
        Call before committing to another reflection cycle.
        """
        new_emb = get_plan_embedding(new_plan, None)  # type: ignore

        if self._plan_history:
            sim = jaccard_similarity(new_emb, self._plan_history[-1])
            if sim >= self.threshold:
                self._similar_count += 1
                if self._similar_count >= self.max_similar:
                    return False  # loop detected — act or escalate
            else:
                self._similar_count = 0  # genuine revision, reset counter

        self._plan_history.append(new_emb)
        return True  # making progress


def autogpt_action_cycle(
    objective: str,
    model: str = "claude-sonnet-4-6",
    max_action_steps: int = 20,
    max_reflection_cycles: int = 3,
    session_budget_usd: float = 2.0,
) -> str:
    """
    AutoGPT-style Thought-Reasoning-Plan-Criticism-Action loop.
    Reflection loop detector breaks out of stuck plan-critique spirals.
    """
    client = anthropic.Anthropic()
    budget = BudgetTracker(cap=session_budget_usd)
    reflection_guard = ReflectionLoopDetector(
        similarity_threshold=0.82,
        max_similar_consecutive=2,
    )

    history: list[dict] = []
    action_results: list[str] = []
    reflection_cycles = 0

    for step in range(max_action_steps):
        # Build context: inject last 5 action results to limit token growth
        context = "\n".join(action_results[-5:]) if action_results else "No actions taken yet."

        # Inner monologue: generate Thought + Reasoning + Plan + Criticism + Action
        inner_response = client.messages.create(
            model=model,
            max_tokens=800,
            system=(
                "You are an autonomous agent. For each step, output exactly:\n"
                "THOUGHT: \n"
                "REASONING: <2-3 sentences>\n"
                "PLAN: \n"
                "CRITICISM: \n"
                "ACTION: \n"
                "DONE: "
            ),
            messages=[
                *history[-6:],  # keep last 3 exchanges to bound context growth
                {
                    "role": "user",
                    "content": (
                        f"Objective: {objective}\n\n"
                        f"Recent action results:\n{context}\n\n"
                        "Produce your inner monologue and next action."
                    ),
                },
            ],
        )

        response_text = inner_response.content[0].text.strip()
        tokens_used = inner_response.usage.input_tokens + inner_response.usage.output_tokens
        call_cost = tokens_used * 0.000009  # blended rate
        try:
            budget.add(call_cost)
        except BudgetExceededError as e:
            raise RuntimeError(
                f"AutoGPT session budget exhausted at step {step}. "
                f"Completed {len(action_results)} actions."
            ) from e

        history.extend([
            {"role": "user", "content": f"Objective: {objective}"},
            {"role": "assistant", "content": response_text},
        ])

        # Extract PLAN for reflection loop detection
        plan_text = ""
        for line in response_text.splitlines():
            if line.startswith("PLAN:") or (plan_text and not line.startswith(
                ("THOUGHT:", "REASONING:", "CRITICISM:", "ACTION:", "DONE:")
            )):
                plan_text += line + "\n"

        # Check for reflection loop
        if not reflection_guard.check_plan(plan_text):
            reflection_cycles += 1
            if reflection_cycles >= max_reflection_cycles:
                # Extract the current ACTION and force execution rather than re-reflecting
                action_line = next(
                    (l for l in response_text.splitlines() if l.startswith("ACTION:")),
                    "ACTION: Report current findings and stop."
                )
                action_results.append(
                    f"[Loop guard forced action after {reflection_cycles} stuck cycles] {action_line}"
                )
                break
            # Force a fresh start on reflection by injecting a directive
            history.append({
                "role": "user",
                "content": (
                    "Your last two plans are nearly identical. Stop refining the plan. "
                    "Execute the ACTION from your most recent step immediately."
                ),
            })
            continue

        # Check DONE
        if "DONE: YES" in response_text.upper():
            break

        # Extract and execute ACTION (stub — replace with real tool dispatch)
        action_line = next(
            (l for l in response_text.splitlines() if l.startswith("ACTION:")),
            None,
        )
        if action_line:
            action_result = f"Executed: {action_line.replace('ACTION:', '').strip()}"
            action_results.append(action_result)

    return "\n".join(action_results)

The reflection guard tracks successive plan versions using a Jaccard word-overlap similarity — cheap to compute, no embedding API call required. When two consecutive plans score above the 0.82 threshold, the similar-cycle counter increments. After two consecutive stuck cycles, the guard injects a directive forcing the agent to execute rather than re-reflect. This mirrors how a human pair-programmer would intervene: "stop planning, just do it." The agent retains the ability to reflect on genuine plan gaps; it loses the ability to spin on the same gap indefinitely.

Failure mode 3: goal completion ambiguity

BabyAGI's original loop condition is while task_list — the agent runs until the task list is empty. AutoGPT's is while not done, where "done" is determined by the agent's own DONE: YES output. Both termination conditions are correct for well-scoped goals. They fail for goals where "done" is a matter of degree rather than a binary state.

Research goals, writing improvement goals, and optimization goals are all degree-goals. "Research the AI agent market" can always be followed by "research one more category." "Improve the API documentation" can always find one more section to improve. The agent's task-creation logic is optimized to find productive next steps — it is not optimized to recognize that sufficient progress has been made on a fuzzy objective. Without an external evaluator that can assess completeness from the original requester's perspective, the agent treats the goal as perpetually incomplete.

The guard is a completion classifier: a fast model (Claude Haiku) that receives the original goal, the completed task summaries, and the proposed next tasks, and scores how complete the goal is on a 0–100 scale. Above a configurable threshold (typically 75–80), the session terminates regardless of remaining queued tasks.

Python — goal completion classifier to break ambiguous infinite loops
import anthropic
from runguard import BudgetTracker, BudgetExceededError

class GoalCompletionClassifier:
    """
    Uses a fast model to score goal completion before committing to
    another batch of task creation. Prevents runaway on fuzzy objectives.
    """
    def __init__(
        self,
        completion_threshold: float = 75.0,
        check_interval: int = 5,  # check every N completed tasks
        classifier_model: str = "claude-haiku-4-5-20251001",
    ):
        self.threshold = completion_threshold
        self.check_interval = check_interval
        self.model = classifier_model
        self._client = anthropic.Anthropic()
        self._checks_run = 0

    def should_check(self, completed_task_count: int) -> bool:
        return completed_task_count > 0 and completed_task_count % self.check_interval == 0

    def score_completion(
        self,
        objective: str,
        completed_summaries: list[str],
        proposed_next_tasks: list[str],
        budget: BudgetTracker,
    ) -> float:
        """
        Returns a completion score 0–100. Higher = more complete.
        Raises RuntimeError if budget is exhausted.
        """
        completed_text = "\n".join(
            f"- {s[:300]}" for s in completed_summaries[-10:]
        )
        proposed_text = "\n".join(f"- {t}" for t in proposed_next_tasks[:5])

        try:
            budget.add(0.0003)  # Haiku classification call: ~$0.0002–$0.0004
        except BudgetExceededError as e:
            raise RuntimeError("Budget exhausted during completion check.") from e

        response = self._client.messages.create(
            model=self.model,
            max_tokens=64,
            messages=[{
                "role": "user",
                "content": (
                    f"Original objective: {objective}\n\n"
                    f"Work completed so far:\n{completed_text}\n\n"
                    f"Proposed next tasks:\n{proposed_text}\n\n"
                    "Score how completely the ORIGINAL OBJECTIVE has been addressed "
                    "by the completed work (0 = not at all, 100 = fully addressed). "
                    "Consider: would a reasonable person who set this objective "
                    "be satisfied with the work done so far? "
                    "Reply with only a number between 0 and 100."
                ),
            }],
        )

        self._checks_run += 1
        raw = response.content[0].text.strip()
        try:
            score = float("".join(c for c in raw if c.isdigit() or c == "."))
            return min(100.0, max(0.0, score))
        except ValueError:
            return 50.0  # neutral score if parse fails

    def is_complete(self, score: float) -> bool:
        return score >= self.threshold

The completion classifier runs every 5 completed tasks (configurable via check_interval). It uses Claude Haiku rather than the primary model: the classification prompt is short and the task — judging whether work is sufficient — doesn't require deep reasoning, so paying $0.0003 per check rather than $0.01+ is justified. The prompt specifically asks whether "a reasonable person who set this objective would be satisfied" — this grounds the score in the original requester's intent rather than the agent's internal sense of completeness, which is systematically biased toward finding more to do.

Failure mode 4: context compression runaway

AutoGPT maintains a memory store — originally Pinecone or a local vector DB — and manages an in-memory conversation buffer. When the accumulated Thought-Action-Observation history approaches the model's context window limit, AutoGPT triggers a compression step: it feeds the current history to the LLM and asks for a summary that preserves the key facts while reducing token count. The summary replaces the raw history in the active context.

The runaway pattern emerges when the compressed summary is itself close to the context limit, or when the summary is judged by the agent to have dropped important facts. A second compression pass is triggered. The second pass injects the already-compressed content as input — paying the same token cost as the original compression — and produces a marginally shorter output. If the result is still above the threshold, a third pass fires. Each pass costs as much as generating a full action cycle because the input is the full accumulated context. Three compression passes on a 6,000-token history cost $0.18–$0.30 for context management alone, before the next actual action call.

Python — context compression guard with hash-based deduplication
import hashlib
import time
import anthropic
from runguard import BudgetTracker, BudgetExceededError

class ContextCompressionGuard:
    """
    Prevents repeated compression of the same or near-same content.
    Tracks compression calls by content hash; blocks re-compression
    if the same hash was compressed within the TTL window.
    Also enforces a per-session compression call ceiling.
    """
    def __init__(
        self,
        max_compressions_per_session: int = 5,
        dedup_ttl_seconds: int = 120,
        budget: BudgetTracker | None = None,
    ):
        self.max_compressions = max_compressions_per_session
        self.ttl = dedup_ttl_seconds
        self.budget = budget
        self._compression_log: dict[str, float] = {}  # hash → timestamp
        self._session_count = 0
        self._client = anthropic.Anthropic()

    def _content_hash(self, history_text: str) -> str:
        return hashlib.sha256(history_text[:4000].encode()).hexdigest()[:20]

    def should_compress(self, history_text: str) -> bool:
        """Returns True if compression is warranted and allowed."""
        if self._session_count >= self.max_compressions:
            return False

        h = self._content_hash(history_text)
        last_compressed_at = self._compression_log.get(h)
        if last_compressed_at and (time.time() - last_compressed_at) < self.ttl:
            return False  # same content compressed recently — skip

        return True

    def compress(
        self,
        history_text: str,
        model: str = "claude-haiku-4-5-20251001",
        target_token_ceiling: int = 2000,
    ) -> str:
        """
        Compresses history_text to under target_token_ceiling tokens.
        Tracks the compression in the dedup log.
        Raises RuntimeError if session ceiling is reached.
        """
        if self._session_count >= self.max_compressions:
            raise RuntimeError(
                f"Compression ceiling reached ({self.max_compressions} per session). "
                "Truncate oldest history manually or start a new session."
            )

        h = self._content_hash(history_text)
        est_cost = (len(history_text.split()) / 750) * 0.000001  # rough Haiku cost
        if self.budget:
            try:
                self.budget.add(est_cost)
            except BudgetExceededError as e:
                raise RuntimeError("Budget exhausted before compression.") from e

        response = self._client.messages.create(
            model=model,
            max_tokens=target_token_ceiling,
            messages=[{
                "role": "user",
                "content": (
                    f"Summarize the following agent history, preserving all key facts, "
                    f"decisions, tool results, and open questions. "
                    f"Target: under {target_token_ceiling} tokens. "
                    f"History:\n\n{history_text}"
                ),
            }],
        )

        compressed = response.content[0].text.strip()
        self._compression_log[h] = time.time()
        self._session_count += 1
        return compressed

The compression guard uses a content hash of the first 4,000 characters of the history buffer as the dedup key. If the same hash appears in the log within the 120-second TTL, the compression call is skipped — the previous compressed output is still valid, and re-compressing the same content is pure waste. The session ceiling (max_compressions_per_session=5) provides an absolute backstop: even if the TTL window rolls over, no session can pay for more than 5 compression calls. When the ceiling is hit, the guard surfaces an explicit error rather than silently proceeding — this makes the capacity limit visible rather than masking it behind continued (increasingly expensive) retries.

Composing all four guards

The four failure modes can occur simultaneously in a single session. A BabyAGI run on an ambiguous goal hits task tree explosion and goal completion ambiguity together. An AutoGPT session on a long research task hits reflection-reformulation spirals early and context compression runaway later. The guards are designed to compose: each operates on its own state and can be instantiated independently, then passed to the agent loop functions.

Python — composed guard setup for a full autonomous session
from runguard import BudgetTracker

def build_autonomous_session_guards(
    session_budget_usd: float = 5.0,
    max_total_tasks: int = 25,
    max_tree_depth: int = 4,
    completion_threshold: float = 78.0,
    check_every_n_tasks: int = 5,
    max_compressions: int = 4,
):
    """
    Returns a configured set of guards for a single autonomous agent session.
    Pass budget into each guard that needs spending authority.
    """
    budget = BudgetTracker(cap=session_budget_usd)

    task_guard = TaskTreeGuard(
        max_total_tasks=max_total_tasks,
        max_depth=max_tree_depth,
        session_budget_usd=session_budget_usd,  # TaskTreeGuard owns its own BudgetTracker
    )
    reflection_guard = ReflectionLoopDetector(
        similarity_threshold=0.82,
        max_similar_consecutive=2,
    )
    completion_classifier = GoalCompletionClassifier(
        completion_threshold=completion_threshold,
        check_interval=check_every_n_tasks,
    )
    compression_guard = ContextCompressionGuard(
        max_compressions_per_session=max_compressions,
        dedup_ttl_seconds=120,
        budget=budget,
    )

    return task_guard, reflection_guard, completion_classifier, compression_guard

With all four guards active, a BabyAGI or AutoGPT session has four independent termination signals: task count ceiling, plan similarity ceiling, goal completion score threshold, and context compression ceiling — plus the per-session budget cap that spans all of them. Each guard raises a distinct exception type or returns a sentinel value, so the calling code can log which guard fired for each terminated session. Over time, the distribution of which guards fire most often is a signal about which failure mode is most active for your specific objective types — useful for tuning thresholds.

Choosing thresholds

The correct threshold values depend on your goal types. A research assistant running on open-ended questions needs higher task ceilings (max_total_tasks=50) and a lower completion threshold (60–70) because the domain is inherently broad. An agent completing a specific, bounded task (write a test for function X, summarize document Y) needs lower task ceilings (max_total_tasks=10) and a higher completion threshold (85–90) because "done" has a clearer meaning.

Start conservative: max_total_tasks=20, completion_threshold=75, max_depth=3. Monitor which guard fires most often across your first 50 sessions. If the task ceiling fires before the goal is meaningfully addressed, raise it. If the completion classifier is terminating sessions too early (users report incomplete results), lower the threshold. If the reflection guard fires rarely but compression runaway is common, lower max_compressions first and raise the reflection similarity threshold. The guards are cheap to tune because the thresholds live in a single configuration object and the historical guard-firing logs provide a clear feedback signal.

Frequently asked questions

Do these guards work with the original BabyAGI and AutoGPT codebases, or only custom implementations?

The guards are standalone Python classes and work with any implementation that calls an LLM in a loop. For the original BabyAGI (task_list while-loop in main.py) and AutoGPT (the agent.py action cycle), you inject the guards at the task-creation point and the plan-generation point respectively. The integration is typically 10–20 lines of wrapping code. Forks and derivatives (AgentGPT, AgentRunner, GPT-Engineer) have the same architectural patterns and the same injection points.

The completion classifier adds another LLM call — doesn't that make costs worse?

A Haiku completion check costs roughly $0.0002–$0.0004. It runs every 5 completed tasks. A 30-task session pays for 6 completion checks at ~$0.002 total — less than 1% of the session's LLM spend. The break-even is saving even one unnecessary 5-task generation batch, which costs $0.15–$0.40. The classifier pays for itself if it catches a single premature extension of a sufficiently-addressed session.

What happens to the task tree guard when a task requires genuine deep exploration?

The depth and count ceilings are configurable. For deep-exploration tasks, set max_depth=6 and max_total_tasks=60. The key guard that prevents runaway in this case is the budget cap — even with generous depth/count ceilings, the session terminates when the dollar ceiling is hit. Set the budget ceiling to match what you're willing to pay for one session of that task type, and the other ceilings become secondary backstops rather than primary constraints.

How does the reflection guard handle legitimate multi-step plan refinement?

The similarity threshold (0.82 default) is calibrated to flag near-identical plans, not genuinely revised ones. A plan that adds a new step, removes a step, or reorders priorities significantly will score well below 0.82 on Jaccard overlap. The guard only fires when consecutive plans are word-for-word similar with surface-level variation — the hallmark of a stuck cycle rather than productive refinement. If your use case involves fine-grained iterative planning, lower the threshold to 0.90 to require higher similarity before flagging.

Is RunGuard's BudgetTracker compatible with these custom guards?

Yes. RunGuard's BudgetTracker is a thread-safe counter with a configurable cap and a BudgetExceededError on breach. The guards above use it directly via budget.add(cost_usd). You can pass a single shared BudgetTracker instance across all guards in a session so that the dollar ceiling is enforced across all four failure modes simultaneously, regardless of which one is active at any given step.

Stop paying for stuck loops

RunGuard's SDK ships with BudgetTracker, LoopDetector, and per-framework guard primitives for BabyAGI, AutoGPT, LangChain, CrewAI, and 40+ other frameworks. Drop-in circuit breakers that fire before your monthly invoice does.

Join the waitlist — free during beta