CrewAI Crews-of-Crews Cost Control: Manager LLM Cascade, Async Spawn Amplification, and Hierarchical Delegation Retry

This guide covers crews-of-crews — CrewAI's hierarchical orchestration pattern where one crew spawns, coordinates, and aggregates results from multiple sub-crews. It is architecturally different from the single-crew delegation model covered in CrewAI single-crew cost control and from the event-driven @listen/@router cycles covered in CrewAI Flows cost control. Crews-of-crews introduces a distinct set of cost failure modes that emerge specifically from hierarchical crew composition, shared memory pools, and asynchronous parallel spawning.

When you compose crews hierarchically, each crew in the hierarchy operates independently: it has its own task queue and agent pool and, if it runs with Process.hierarchical, its own manager_llm for delegation. The orchestrator crew invokes sub-crews via kickoff() or kickoff_async(), and each sub-crew's output becomes an input to the orchestrator crew's next task. Every hierarchical crew runs the same delegation loop: its manager LLM receives the crew context (agent descriptions, task descriptions, tool manifests), decides which agent executes which task in which order, and delegates. That overhead is paid by every hierarchical crew, at every level of the hierarchy, on top of the agents' own task calls.

Four failure modes that are specific to crews-of-crews architectures:

  1. Manager LLM cascade: each hierarchical crew pays manager LLM cost on top of its agents' work. A three-level hierarchy with branching factor two contains seven crews (1 orchestrator + 2 at level 2 + 4 at level 3), so it triggers at least seven manager LLM calls per invocation. Seven calls averaging 3,000 input tokens each come to about 21,000 input tokens, roughly $0.05 at gpt-4o's $2.50-per-million input rate, before output tokens and before any actual work.
  2. Cross-crew shared memory accumulation: crews in a hierarchy often share a CrewAI memory backend (Long Term Memory, Short Term Memory, Entity Memory). Every task output written by any sub-crew lands in the shared pool, and later tasks in every crew retrieve from that same pool. By the time the orchestrator crew aggregates results, its tasks draw memory context from the intermediate outputs of all sub-crews.
  3. Async spawn amplification: calling kickoff_async() returns an awaitable immediately, which makes fan-out easy. When the orchestrator crew uses an LLM to generate a sub-task list and spawns one sub-crew per item, the number of concurrent crews is bounded only by the LLM's output. A research orchestrator that generates 25 sub-queries runs 25 parallel sub-crews, each paying its own manager LLM cost, each accumulating memory, each running agents.
  4. Hierarchical delegation retry multiplication: task failures propagate up and down the crew hierarchy without a shared retry budget. An agent that fails inside a sub-crew keeps iterating up to max_iter, the orchestrator may then re-kick off the whole sub-crew, and the orchestrator's own manager may re-delegate the orchestration task if it fails. With three attempts at each of three independent levels, one failure can cost up to 3 × 3 × 3 = 27 times a single attempt.

Hierarchy math: a three-tier crews-of-crews setup with one orchestrator, four coordinator crews, and sixteen worker crews (branching factor four) makes at least 21 manager LLM calls per full run, one per crew. If each worker crew holds three agents that each run one task with two LLM calls, the worker tier alone adds 96 task LLM calls (16 × 3 × 2). Without caps on shared memory retrieval size and async spawn count, the actual call count scales with the LLM-generated output at the orchestrator level.
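The hierarchy math above, and the seven-call example in failure mode 1, can be reproduced with a small calculator. It is a sketch under stated assumptions: every crew is hierarchical and pays exactly one manager call (a lower bound), only the leaf tier runs agent tasks, each agent runs one task, and the $2.50-per-million input price is a placeholder to replace with your provider's current rate.

```python
from dataclasses import dataclass


@dataclass
class HierarchyEstimate:
    crews: int                # total crews in the tree
    manager_calls: int        # lower bound: one manager call per hierarchical crew
    task_calls: int           # agent LLM calls in the worker (leaf) tier only
    planning_cost_usd: float  # input-token cost of the manager calls only


def estimate_hierarchy(
    depth: int,
    branching: int,
    agents_per_crew: int,
    calls_per_task: int,
    planning_tokens_per_call: int = 3_000,
    usd_per_million_input: float = 2.50,  # placeholder price; check your provider
) -> HierarchyEstimate:
    # Crews per level: 1, b, b^2, ..., b^(depth-1)
    per_level = [branching ** level for level in range(depth)]
    crews = sum(per_level)
    worker_crews = per_level[-1]
    # Assumes every crew is hierarchical and each agent runs exactly one task
    manager_calls = crews
    task_calls = worker_crews * agents_per_crew * calls_per_task
    planning_tokens = manager_calls * planning_tokens_per_call
    cost = planning_tokens * usd_per_million_input / 1_000_000
    return HierarchyEstimate(crews, manager_calls, task_calls, cost)


# Three-tier tree: 1 orchestrator, 4 coordinators, 16 workers
big = estimate_hierarchy(depth=3, branching=4, agents_per_crew=3, calls_per_task=2)
print(big.manager_calls, big.task_calls)

# Branching factor 2, three levels, 3,000-token manager prompts
small = estimate_hierarchy(depth=3, branching=2, agents_per_crew=3, calls_per_task=2)
print(small.manager_calls, round(small.planning_cost_usd, 4))
```

Raising the branching factor from four to five at the same depth grows the tree from 21 to 31 crews, which is why an uncapped, LLM-generated fan-out dominates the bill.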

Failure Mode 1: Manager LLM Cascade

CrewAI's Process.hierarchical mode gives each crew a manager (built from manager_llm, or supplied as manager_agent) that delegates the crew's tasks across its agents. In a single-crew setup, that is one manager. In crews-of-crews, every hierarchical crew has its own manager, and every execution of that crew pays for it: at least one manager LLM call per run, and usually more, because the manager handles delegation for each task in the crew.

The planning prompt includes the full crew context: descriptions of all agents, all task descriptions, and all tool manifests available to those agents. For a crew with four specialized agents each having five tools, the planning context easily reaches 4,000–6,000 tokens of input before any task content. A three-level hierarchy multiplies this by the number of crews at each level.
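To see how quickly the manager's context grows, you can estimate it from the text the manager sees. This sketch uses the rough ~4-characters-per-token heuristic rather than a real tokenizer, and AgentSpec is a hypothetical stand-in for your agent definitions, not a CrewAI type. The sample sizes (persona length, tool description length) are illustrative assumptions.

```python
from dataclasses import dataclass, field

CHARS_PER_TOKEN = 4  # rough heuristic; use a real tokenizer for billing-grade numbers


@dataclass
class AgentSpec:  # hypothetical stand-in, not a CrewAI class
    role: str
    goal: str
    backstory: str
    tool_descriptions: list[str] = field(default_factory=list)


def estimate_manager_context_tokens(agents: list[AgentSpec], task_descriptions: list[str]) -> int:
    """Approximate input tokens for the agent, tool, and task text a manager prompt carries."""
    chars = 0
    for agent in agents:
        chars += len(agent.role) + len(agent.goal) + len(agent.backstory)
        chars += sum(len(d) for d in agent.tool_descriptions)
    chars += sum(len(t) for t in task_descriptions)
    return chars // CHARS_PER_TOKEN


# Four agents with 800-character personas and five 600-character tool descriptions each
agents = [
    AgentSpec(role="r" * 40, goal="g" * 360, backstory="b" * 400,
              tool_descriptions=["t" * 600] * 5)
    for _ in range(4)
]
tasks = ["d" * 1_200] * 4  # four 1,200-character task descriptions
print(estimate_manager_context_tokens(agents, tasks))
```

With these sizes the estimate lands at 5,000 tokens, inside the 4,000–6,000 range above, and tool descriptions account for most of it.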

The naive pattern builds a hierarchy without accounting for manager LLM cost at each level:

from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

gpt4o = ChatOpenAI(model="gpt-4o")

# Level 3: worker crews — each pays its own manager_llm planning call
research_crew = Crew(
    agents=[researcher_1, researcher_2, researcher_3],
    tasks=[research_task_1, research_task_2, research_task_3],
    process=Process.hierarchical,
    manager_llm=gpt4o,   # planning call: ~4,000 tokens input
    verbose=True
)

analysis_crew = Crew(
    agents=[analyst_1, analyst_2],
    tasks=[analysis_task_1, analysis_task_2],
    process=Process.hierarchical,
    manager_llm=gpt4o,   # another planning call: ~3,500 tokens input
)

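# Level 2: coordinator crew, which also pays manager LLM cost.
# Note: a task description alone does not run other crews; the coordinator
# agent needs a tool (or your Python code) that calls research_crew.kickoff()
# and analysis_crew.kickoff().
coordinator_crew = Crew(
    agents=[coordinator],
    tasks=[
        Task(description="Run research and analysis crews, synthesize output",
             expected_output="A synthesized summary of research and analysis",
             agent=coordinator)
    ],
    process=Process.hierarchical,
    manager_llm=gpt4o,   # yet another planning call
)

# Level 1: an orchestrator crew (not shown) would run coordinator_crew,
# which runs research_crew + analysis_crew.
# Minimum manager calls for the three crews shown: 3 (4 with the orchestrator),
# before counting any agent task calls.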

The fix is to apply hierarchical planning only where agent heterogeneity requires it, and use a cheaper model for planning at levels where the decision is deterministic:

from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

# use a fast, cheap model for manager planning where task routing is predictable
gpt4o_mini = ChatOpenAI(model="gpt-4o-mini")  # ~15× cheaper per input token
gpt4o = ChatOpenAI(model="gpt-4o")            # reserve for agents' task execution: Agent(llm=gpt4o, ...)

class ManagedCrew:
    """Wraps a Crew and caps how often it can be kicked off.

    Each kickoff of a hierarchical crew pays at least one manager LLM call,
    so the kickoff count is a cheap proxy for planning calls.
    """
    def __init__(self, crew: Crew, name: str, max_kickoffs: int = 5):
        self.crew = crew
        self.name = name
        self.max_kickoffs = max_kickoffs
        self._kickoffs = 0

    def kickoff(self, inputs: dict | None = None):
        self._kickoffs += 1
        if self._kickoffs > self.max_kickoffs:
            raise RuntimeError(
                f"Crew {self.name} exceeded {self.max_kickoffs} kickoffs, "
                "possible retry spiral"
            )
        return self.crew.kickoff(inputs=inputs)

# worker crew: heterogeneous agents, so keep hierarchical routing but plan with the cheap model
research_crew = Crew(
    agents=[researcher_1, researcher_2],
    tasks=[research_task_1, research_task_2],
    process=Process.hierarchical,
    manager_llm=gpt4o_mini,   # ~$0.0003 vs ~$0.005 for a 2,000-token planning prompt
)

# orchestrator: one agent and a fixed task order, so it needs no manager at all
orchestrator_crew = Crew(
    agents=[orchestrator_agent],
    tasks=[orchestration_task],
    process=Process.sequential,
)

managed_research = ManagedCrew(research_crew, "research")
managed_orchestrator = ManagedCrew(orchestrator_crew, "orchestrator")

Failure Mode 2: Cross-Crew Shared Memory Accumulation

When multiple crews share a CrewAI memory backend, every task output written by any crew accumulates in the shared pool. This is intentional: the point of shared memory in crews-of-crews is that the research crew's findings reach the analysis crew without explicit passing. The cost problem is that the pool only grows during a run. Each task's memory retrieval searches the entire accumulated pool and appends the best-scoring entries to the task prompt, so until the pool exceeds the retrieval limit, every completed task can enlarge the memory context of every later task. Once the pool is larger than the limit, each prompt carries roughly the limit times the average entry size, which is still large when entries are full task outputs.

In a ten-crew hierarchy where each crew runs five tasks, by the time the final aggregation crew runs its tasks, the shared memory pool contains up to 45 prior task outputs. Each task in the final crew retrieves a subset of those 45 entries — typically embedding-scored top-k — but the retrieved content is appended to the task prompt. If each memory entry averages 800 tokens, ten retrieved entries add 8,000 tokens to every aggregation task prompt.
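The pool arithmetic can be modeled directly. This sketch assumes one memory entry per completed task, a fixed 800-token entry size, and top-k retrieval with k=10, and it ignores entries written by earlier tasks inside the same crew. CrewAI's actual retrieval limits and scoring thresholds depend on version and configuration, so treat k as a parameter to measure, not a known default.

```python
def pool_size_before_crew(crew_index: int, tasks_per_crew: int) -> int:
    """Entries already in the shared pool when crew number `crew_index` (0-based) starts."""
    return crew_index * tasks_per_crew


def retrieved_tokens_per_task(pool_entries: int, top_k: int, tokens_per_entry: int) -> int:
    """Top-k retrieval: context grows with the pool until it saturates at k entries."""
    return min(pool_entries, top_k) * tokens_per_entry


crews, tasks_per_crew = 10, 5
for i in range(crews):
    pool = pool_size_before_crew(i, tasks_per_crew)
    added = retrieved_tokens_per_task(pool, top_k=10, tokens_per_entry=800)
    print(f"crew {i}: pool={pool:2d} entries, +{added} tokens per task prompt")
```

Under these assumptions the final crew starts with 45 entries in the pool and each of its five tasks carries an extra 8,000 tokens, so that crew alone pays about 40,000 extra input tokens for memory context.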

from crewai import Crew


class MemoryBudgetExceeded(RuntimeError):
    pass


class BoundedMemoryCrew:
    """Crew wrapper that tracks how much output a crew has written to memory.

    The check runs before each kickoff, so it blocks a re-run once the budget
    is exceeded; it cannot stop a single run already in progress. Budgets are
    per wrapper, not shared across crews. Tokens are estimated at ~4 chars each.
    """

    def __init__(self, crew: Crew, max_memory_tokens: int = 4000):
        self.crew = crew
        self.max_memory_tokens = max_memory_tokens
        self._accumulated_memory_tokens = 0

    def kickoff(self, inputs: dict | None = None):
        # refuse to run again once this crew's writes exceed the budget
        if self._accumulated_memory_tokens > self.max_memory_tokens:
            raise MemoryBudgetExceeded(
                f"Accumulated memory tokens ({self._accumulated_memory_tokens}) "
                f"exceed budget ({self.max_memory_tokens}). "
                "Call reset_memory_pool() before re-running."
            )

        result = self.crew.kickoff(inputs=inputs)

        # estimate tokens added this run (rough: 4 chars per token)
        if result and result.raw:
            self._accumulated_memory_tokens += len(result.raw) // 4

        return result

    def reset_memory_pool(self, command_type: str = "short"):
        """Clear crew memory between phases ('short', 'long', 'entity', or 'all').

        If crews share a memory backend, this clears it for every crew using it.
        """
        if getattr(self.crew, "memory", False):
            self.crew.reset_memories(command_type=command_type)
        self._accumulated_memory_tokens = 0


# usage: reset short-term memory between major orchestration phases
research_crew_managed = BoundedMemoryCrew(
    crew=research_crew,
    max_memory_tokens=8000   # ~10 task outputs at 800 tokens each
)

analysis_crew_managed = BoundedMemoryCrew(
    crew=analysis_crew,
    max_memory_tokens=6000
)

# run research phase
research_result = research_crew_managed.kickoff(inputs={"topic": "AI cost control"})

# reset short-term memory before the analysis phase so the analysis crew
# doesn't retrieve all the research intermediates
analysis_crew_managed.reset_memory_pool(command_type="short")
analysis_result = analysis_crew_managed.kickoff(
    # hand over at most ~2,000 characters (~500 tokens) of research output;
    # a summarizer step would keep more signal than a raw slice
    inputs={"research_summary": research_result.raw[:2000]}
)

Failure Mode 3: Async Spawn Amplification

kickoff_async() returns a coroutine immediately, allowing an orchestrator to fan out to multiple sub-crews in parallel. This is the intended pattern for parallelizing independent research tracks. The cost failure mode occurs when the orchestrator uses an LLM to generate the list of sub-tasks to parallelize — and the LLM output is variable-length.

A research orchestrator with the prompt "generate a comprehensive list of sub-queries for this topic" might return 8 queries on a narrow topic and 35 on a broad one. If the orchestrator spawns one async crew per query without a ceiling, the broad-topic run spawns 35 parallel crews. Each crew pays for its manager LLM, runs its agents, and writes to shared memory. The cost scales linearly with the number of queries the LLM generates, and nothing bounds that number unless you add a guard.

import asyncio
from typing import Callable

from crewai import Crew


class SpawnLimitExceeded(RuntimeError):
    pass


class SubCrewFailed(RuntimeError):
    pass


class AsyncSpawnGuard:
    """Caps concurrent async crew kickoffs and the total spawned over the guard's lifetime."""

    def __init__(self, max_concurrent: int = 8, max_total: int = 20):
        self.max_concurrent = max_concurrent
        self.max_total = max_total
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._total_spawned = 0

    async def kickoff_many(
        self,
        crew_factory: Callable[[dict], Crew],
        inputs_list: list[dict]
    ) -> list:
        """
        Spawn one crew per input dict, capped by max_concurrent and max_total.
        crew_factory: callable that accepts an inputs dict and returns a Crew.
        """
        # check the whole batch up front, counting crews spawned by earlier calls
        if self._total_spawned + len(inputs_list) > self.max_total:
            raise SpawnLimitExceeded(
                f"Requested {len(inputs_list)} sub-crews with {self._total_spawned} "
                f"already spawned, but max_total is {self.max_total}. "
                "Reduce sub-task count or raise the limit explicitly."
            )
        self._total_spawned += len(inputs_list)

        async def run_one(inputs: dict):
            async with self._semaphore:   # at most max_concurrent crews run at once
                crew = crew_factory(inputs)
                return await crew.kickoff_async(inputs=inputs)

        tasks = [run_one(inp) for inp in inputs_list]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # surface any exceptions rather than silently dropping them
        errors = [r for r in results if isinstance(r, Exception)]
        if errors:
            raise SubCrewFailed(f"{len(errors)} sub-crews failed: {errors[:3]}")

        return results


# orchestrator parses LLM output into a sub-task list, then guards the spawn
guard = AsyncSpawnGuard(max_concurrent=6, max_total=15)

async def orchestrated_research(topic: str) -> list:
    # await the async kickoff so the orchestrator doesn't block the event loop
    orchestration = await orchestrator_crew.kickoff_async(inputs={"topic": topic})
    # parse_queries_from_llm and build_research_crew are your own helpers
    sub_queries = parse_queries_from_llm(
        llm_output=orchestration.raw,
        max_queries=15   # truncate at parse time, before the spawn guard
    )
    return await guard.kickoff_many(
        crew_factory=lambda inp: build_research_crew(inp["query"]),
        inputs_list=[{"query": q} for q in sub_queries]
    )
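The guide does not define parse_queries_from_llm, so here is one possible shape for it, assuming the orchestrator prompt asks for one sub-query per line. The function name matches the call above, but its body, the bullet-stripping regex, and the case-insensitive de-duplication are illustrative choices, not CrewAI behavior.

```python
import re

# Leading "-", "*", "•", "1." or "1)" markers followed by whitespace
_BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")


def parse_queries_from_llm(llm_output: str, max_queries: int) -> list[str]:
    """Hypothetical parser: one sub-query per line, bullets and numbering stripped,
    case-insensitive duplicates dropped, result truncated to max_queries."""
    seen: set[str] = set()
    queries: list[str] = []
    for line in llm_output.splitlines():
        query = _BULLET.sub("", line).strip()
        key = query.lower()
        if not query or key in seen:
            continue
        seen.add(key)
        queries.append(query)
        if len(queries) == max_queries:
            break  # hard cap at parse time: extra queries never reach the spawner
    return queries


sample = "1. cost of agents\n2. Cost of agents\n- memory growth\n\n* retry budgets\n3) async fan-out"
print(parse_queries_from_llm(sample, max_queries=3))
```

Capping here, before the spawn guard, means an overly long LLM answer is trimmed rather than rejected; the guard's max_total remains a backstop for code paths that skip the parser.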

Failure Mode 4: Hierarchical Delegation Retry Multiplication

CrewAI agents have a max_iter parameter that limits the number of reasoning iterations an agent attempts before giving up on a task. In a single-crew setup, max_iter is your primary cost guardrail. In crews-of-crews, there are three independent retry layers: the agent's own max_iter, the sub-crew's re-kickoff if the orchestrator retries the task that calls that sub-crew, and the orchestrator crew's own manager LLM re-delegation if the orchestrator-level task fails.

The worst case: an agent with max_iter=3 fails a task. The sub-crew's manager re-assigns the task to a second agent (one re-delegation), and that agent also exhausts its three iterations. The sub-crew returns a failure, so the orchestrator task that called it fails too, and the orchestrator's manager re-delegates the orchestration task, which kicks off the sub-crew again. If each of the three levels allows three attempts, a single deterministic tool failure can produce 3 × 3 × 3 = 27 LLM calls, 26 of which only repeat a failure that was already known after the first.

import time
from dataclasses import dataclass, field

from crewai import Crew


class HierarchyBudgetExhausted(RuntimeError):
    pass


class CrewBudgetExhausted(RuntimeError):
    pass


class SubCrewExhausted(RuntimeError):
    pass


@dataclass
class HierarchyRetryBudget:
    """Shared retry budget across an entire crews-of-crews run."""
    max_total_retries: int = 12
    max_retries_per_crew: int = 4
    _total_retries: int = field(default=0, init=False)
    _crew_retries: dict = field(default_factory=dict, init=False)

    def consume(self, crew_name: str, count: int = 1):
        self._total_retries += count
        self._crew_retries[crew_name] = self._crew_retries.get(crew_name, 0) + count

        if self._total_retries > self.max_total_retries:
            raise HierarchyBudgetExhausted(
                f"Total retry budget ({self.max_total_retries}) exhausted across hierarchy. "
                f"Per-crew breakdown: {self._crew_retries}"
            )
        if self._crew_retries[crew_name] > self.max_retries_per_crew:
            raise CrewBudgetExhausted(
                f"Crew '{crew_name}' exceeded per-crew retry budget "
                f"({self.max_retries_per_crew})"
            )


def guarded_kickoff(
    crew: Crew,
    crew_name: str,
    budget: HierarchyRetryBudget,
    inputs: dict | None = None,
    max_attempts: int = 2,
) -> str:
    """
    Kick off a crew with a shared retry budget.
    Raises on budget exhaustion rather than silently retrying.
    """
    last_exc = None
    for attempt in range(max_attempts):
        try:
            result = crew.kickoff(inputs=inputs)
        except (HierarchyBudgetExhausted, CrewBudgetExhausted):
            # budget exhaustion raised by a nested guarded_kickoff must not be retried
            raise
        except Exception as exc:
            last_exc = exc
            budget.consume(crew_name)   # raises once the shared budget is spent
            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt)   # brief backoff
            continue

        if result and result.raw:
            return result.raw
        # empty result treated as soft failure
        budget.consume(crew_name)

    raise SubCrewExhausted(
        f"Crew '{crew_name}' failed after {max_attempts} attempts"
    ) from last_exc


# orchestrator uses the shared budget for all sub-crew calls
budget = HierarchyRetryBudget(max_total_retries=12, max_retries_per_crew=3)

research_output = guarded_kickoff(
    crew=research_crew,
    crew_name="research",
    budget=budget,
    inputs={"topic": "LLM cost optimization"},
    max_attempts=2
)

analysis_output = guarded_kickoff(
    crew=analysis_crew,
    crew_name="analysis",
    budget=budget,
    inputs={"data": research_output[:3000]},
    max_attempts=2
)

Composite CrewsOfCrewsPolicy

Combining all four guards into a single policy class gives you a single attachment point for the full hierarchy. Every sub-crew invocation goes through the policy, which enforces planning cost caps, memory budgets, spawn limits, and shared retry budgets simultaneously.

import asyncio
import time
from dataclasses import dataclass, field
from typing import Callable

from crewai import Crew


class PolicyViolation(RuntimeError):
    """A hierarchy-wide limit was hit. The policy never retries these."""


@dataclass
class CrewsOfCrewsPolicy:
    """
    Unified cost guard for hierarchical CrewAI orchestration.
    Route every crew invocation in the hierarchy through one shared instance.
    """
    max_manager_calls: int = 10          # crew runs, as a proxy for manager LLM calls
    max_memory_tokens: int = 12000       # estimated tokens written to shared memory
    max_concurrent_spawns: int = 8       # async fan-out concurrency ceiling
    max_total_spawns: int = 20           # hard ceiling on async sub-crews per policy
    max_hierarchy_retries: int = 15      # total retries across all crews
    max_retries_per_crew: int = 4        # per-crew retry ceiling

    _manager_calls: int = field(default=0, init=False)
    _memory_tokens: int = field(default=0, init=False)
    _total_spawns: int = field(default=0, init=False)
    _total_retries: int = field(default=0, init=False)
    _crew_retries: dict = field(default_factory=dict, init=False)
    _semaphore: asyncio.Semaphore | None = field(default=None, init=False)

    def _get_semaphore(self) -> asyncio.Semaphore:
        # created lazily so it is first used inside the running event loop
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self.max_concurrent_spawns)
        return self._semaphore

    def record_manager_call(self, crew_name: str):
        # each run of a hierarchical crew pays at least one manager LLM call
        self._manager_calls += 1
        if self._manager_calls > self.max_manager_calls:
            raise PolicyViolation(
                f"Manager LLM call budget ({self.max_manager_calls}) exhausted. "
                f"Last call from crew '{crew_name}'. Reduce hierarchy depth or "
                "switch coordinator crews to Process.sequential."
            )

    def record_memory_write(self, tokens_written: int):
        self._memory_tokens += tokens_written
        if self._memory_tokens > self.max_memory_tokens:
            raise PolicyViolation(
                f"Shared memory pool exceeded {self.max_memory_tokens} tokens. "
                "Call reset_shared_memory() between orchestration phases."
            )

    def reset_shared_memory(self, crews: list[Crew], command_type: str = "short"):
        """Clear CrewAI memory for the given crews and reset the token estimate."""
        for crew in crews:
            if getattr(crew, "memory", False):
                crew.reset_memories(command_type=command_type)
        self._memory_tokens = 0

    def kickoff(
        self,
        crew: Crew,
        crew_name: str,
        inputs: dict | None = None,
        max_attempts: int = 2,
    ) -> str:
        """Synchronous crew kickoff with policy enforcement."""
        last_exc = None
        for attempt in range(max_attempts):
            self.record_manager_call(crew_name)   # every attempt re-runs the manager
            try:
                result = crew.kickoff(inputs=inputs)
            except PolicyViolation:
                raise   # a nested crew hit a hierarchy limit: stop, don't retry
            except Exception as exc:
                last_exc = exc
                self._consume_retry(crew_name)
                if attempt < max_attempts - 1:
                    time.sleep(2 ** attempt)
                continue

            if result and result.raw:
                self.record_memory_write(len(result.raw) // 4)
                return result.raw
            self._consume_retry(crew_name)   # empty output counts as a soft failure

        raise RuntimeError(
            f"Crew '{crew_name}' failed after {max_attempts} attempts"
        ) from last_exc

    async def kickoff_many_async(
        self,
        crew_factory: Callable[[dict], Crew],
        crew_name_prefix: str,
        inputs_list: list[dict],
    ) -> list[str]:
        """Async fan-out with concurrency and total spawn caps."""
        # count spawns across all calls on this policy, not just this batch
        if self._total_spawns + len(inputs_list) > self.max_total_spawns:
            raise PolicyViolation(
                f"Requested {len(inputs_list)} async sub-crews with {self._total_spawns} "
                f"already spawned, but max_total_spawns is {self.max_total_spawns}. "
                "Truncate the input list before spawning."
            )
        self._total_spawns += len(inputs_list)

        sem = self._get_semaphore()

        async def run_one(idx: int, inputs: dict) -> str:
            crew_name = f"{crew_name_prefix}-{idx}"
            async with sem:
                self.record_manager_call(crew_name)
                crew = crew_factory(inputs)
                result = await crew.kickoff_async(inputs=inputs)
                if result and result.raw:
                    self.record_memory_write(len(result.raw) // 4)
                    return result.raw
                return ""

        tasks = [run_one(i, inp) for i, inp in enumerate(inputs_list)]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        errors = [r for r in results if isinstance(r, Exception)]
        violations = [e for e in errors if isinstance(e, PolicyViolation)]
        if violations:
            raise violations[0]   # surface hierarchy limits as-is
        if errors:
            raise RuntimeError(f"{len(errors)} async sub-crews failed: {errors[:2]}")

        return list(results)

    def _consume_retry(self, crew_name: str):
        self._total_retries += 1
        self._crew_retries[crew_name] = self._crew_retries.get(crew_name, 0) + 1
        if self._total_retries > self.max_hierarchy_retries:
            raise PolicyViolation(
                f"Hierarchy retry budget ({self.max_hierarchy_retries}) exhausted. "
                f"Breakdown: {self._crew_retries}"
            )
        if self._crew_retries[crew_name] > self.max_retries_per_crew:
            raise PolicyViolation(
                f"Crew '{crew_name}' exceeded per-crew retry budget "
                f"({self.max_retries_per_crew})"
            )

    def summary(self) -> dict:
        return {
            "manager_calls": self._manager_calls,
            "memory_tokens": self._memory_tokens,
            "total_spawns": self._total_spawns,
            "total_retries": self._total_retries,
            "crew_retries": dict(self._crew_retries),
        }


# usage: orchestrator_crew, synthesis_crew, build_research_crew, and
# parse_and_cap_queries are your own definitions

async def run_research_pipeline(topic: str) -> dict:
    policy = CrewsOfCrewsPolicy(
        max_manager_calls=20,   # 1 orchestrator + up to 16 research crews + 1 synthesis, plus retry headroom
        max_memory_tokens=15000,
        max_concurrent_spawns=6,
        max_total_spawns=16,
        max_hierarchy_retries=12,
        max_retries_per_crew=3,
    )

    # orchestrator decides how many research tracks to open
    orchestration_output = policy.kickoff(
        crew=orchestrator_crew,
        crew_name="orchestrator",
        inputs={"topic": topic},
        max_attempts=2,
    )

    # parse sub-queries from orchestrator output, capped before spawning
    sub_queries = parse_and_cap_queries(orchestration_output, max_queries=16)

    # fan out research crews with concurrency guard
    research_results = await policy.kickoff_many_async(
        crew_factory=lambda inp: build_research_crew(inp["query"]),
        crew_name_prefix="research",
        inputs_list=[{"query": q} for q in sub_queries],
    )

    # synthesize in a final crew; cap each track at ~2,000 characters (~500 tokens)
    # so 16 tracks stay under ~8,000 tokens of synthesis input
    synthesis = policy.kickoff(
        crew=synthesis_crew,
        crew_name="synthesis",
        inputs={"research": "\n\n".join(r[:2000] for r in research_results)},
        max_attempts=2,
    )

    print("Policy summary:", policy.summary())
    return {"result": synthesis, "policy": policy.summary()}

RunGuard SDK note: RunGuard attaches these guards automatically to CrewAI crew execution hooks. The CrewsOfCrewsPolicy equivalent ships as a single decorator — @runguard.crews_of_crews(max_manager_calls=12, max_spawns=16) — that wraps any Crew subclass and emits per-run telemetry without changes to your crew definitions.

Monitoring Hierarchy Cost in Production

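The policy.summary() output gives you the per-run metrics you need to right-size the policy limits. Run a set of representative workloads, collect the summaries, and set each limit at 1.5× the observed 95th-percentile value. Five runs is a bare minimum: with so few samples, the 95th percentile is effectively the maximum, so collect more runs when you can. Limits derived this way avoid both runaway cost (limits too high) and false circuit trips (limits too low).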

Metric | What it tracks | Action if above threshold
--- | --- | ---
manager_calls | Total manager LLM planning invocations across the hierarchy | Switch coordinator crews to Process.sequential; use a cheap model for planning
memory_tokens | Estimated tokens written to the shared memory pool this run | Reset memory between phases; pass summaries instead of full outputs
total_spawns | Async sub-crew kickoffs in this run | Reduce max_queries in the orchestrator task prompt; add an explicit count cap
total_retries | Total retry events across all crews | Reduce per-crew max_iter; add tool-level circuit breakers in failing agents
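The right-sizing rule can be applied mechanically to collected summaries. This sketch uses the nearest-rank percentile (a choice, not the only definition) and rounds limits up; the five summaries below are made-up numbers for illustration, not measurements.

```python
import math


def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_limits(summaries: list[dict], headroom: float = 1.5) -> dict[str, int]:
    """Limit = ceil(headroom x p95) for each counter in policy.summary()."""
    metrics = ("manager_calls", "memory_tokens", "total_spawns", "total_retries")
    return {
        m: math.ceil(headroom * nearest_rank_percentile([s[m] for s in summaries], 95))
        for m in metrics
    }


# Illustrative summaries from five runs (made-up numbers)
runs = [
    {"manager_calls": 8, "memory_tokens": 9_000, "total_spawns": 10, "total_retries": 1},
    {"manager_calls": 9, "memory_tokens": 10_500, "total_spawns": 12, "total_retries": 0},
    {"manager_calls": 10, "memory_tokens": 8_200, "total_spawns": 9, "total_retries": 2},
    {"manager_calls": 9, "memory_tokens": 11_000, "total_spawns": 11, "total_retries": 1},
    {"manager_calls": 11, "memory_tokens": 9_800, "total_spawns": 12, "total_retries": 3},
]
print(suggest_limits(runs))
```

With five runs the nearest-rank p95 is simply the largest observation, so each suggested limit is 1.5× the worst run seen; more runs let the percentile ignore a single outlier.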

Difference from Single-Crew and Flows Guards

If you already have single-crew cost guards from the single-crew guide or Flows guards from the Flows guide, they do not cover a crews-of-crews architecture. Single-crew guards operate per crew and see only that crew's agents and tasks; they have no visibility into manager LLM calls at other levels, shared memory growth across crews, or async spawn fan-out. Flows guards are event-driven and apply to @listen/@router cycles; they don't intercept kickoff() and kickoff_async() calls between hierarchically composed crews. Of the three guard types, only CrewsOfCrewsPolicy operates at the hierarchy level.

Guard type | Covers | Misses
--- | --- | ---
Single-crew guard | Per-crew agent iteration, tool call spirals, max_iter | Manager LLM cost across levels, shared memory pool, async spawn count, cross-crew retries
Flows guard | @listen/@router event cycles, FlowState accumulation | Anything that uses Crew.kickoff() directly; hierarchical composition
CrewsOfCrewsPolicy (this guide) | Manager LLM cascade, cross-crew memory, async fan-out, hierarchy retry multiplication | Per-agent tool call spirals (delegate to single-crew guards inside each sub-crew)

Combine all three: single-crew guards inside each sub-crew's agents, a Flows guard if you wrap the orchestration in a Flow, and CrewsOfCrewsPolicy at the hierarchy coordination layer.

Frequently asked questions

Does Process.sequential avoid the manager LLM cascade?

Yes, for the crews that use it. Process.sequential skips the manager LLM planning step entirely — tasks are executed in the order they are defined, and the crew does not use a manager LLM. Switching coordinator crews (those that just call sub-crews in a fixed order) from Process.hierarchical to Process.sequential eliminates their manager LLM cost while keeping the hierarchical structure. Use Process.hierarchical only for crews whose task routing is genuinely dynamic across heterogeneous agents.
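The saving from switching fixed-order crews to Process.sequential can be counted directly. This sketch uses the same lower bound as the hierarchy math earlier in this guide (one manager call per hierarchical crew run; real counts are higher when the manager handles several tasks), and CrewPlan is a hypothetical description type, not a CrewAI class.

```python
from dataclasses import dataclass


@dataclass
class CrewPlan:  # hypothetical, not a CrewAI class
    name: str
    process: str  # "sequential" or "hierarchical"


def min_manager_calls(crews: list[CrewPlan]) -> int:
    """Lower bound: hierarchical crews pay at least one manager call per run,
    sequential crews pay none."""
    return sum(1 for c in crews if c.process == "hierarchical")


def tree(coordinator_process: str) -> list[CrewPlan]:
    # 1 orchestrator + 4 coordinators (fixed call order) + 16 heterogeneous workers
    return (
        [CrewPlan("orchestrator", coordinator_process)]
        + [CrewPlan(f"coordinator-{i}", coordinator_process) for i in range(4)]
        + [CrewPlan(f"worker-{i}", "hierarchical") for i in range(16)]
    )


print(min_manager_calls(tree("hierarchical")))
print(min_manager_calls(tree("sequential")))
```

In the 21-crew tree, making the orchestrator and coordinators sequential removes five of the 21 guaranteed manager calls, and the removed calls are the ones whose prompts describe entire sub-hierarchies.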

How should I pass data between crews to avoid shared memory accumulation?

Pass explicit summaries rather than raw task outputs. After each sub-crew completes, run a truncation pass on the output (or use a summarizer agent) and pass the summary as an inputs key to the next crew. This replaces unbounded memory accumulation with bounded, explicit hand-offs. If you use CrewAI's built-in memory, call crew.reset_memories(command_type="short") between major phases: short-term memory entries from a completed phase should not pollute the retrieval context of subsequent phases.
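A truncation pass for hand-offs can be as simple as cutting to a token budget and backing up to the last sentence or paragraph break. This is a sketch, assuming the same ~4-characters-per-token estimate used elsewhere in this guide; summarize_for_handoff is a hypothetical helper name, and it does not summarize in the LLM sense (a summarizer agent would be an extra LLM call).

```python
def summarize_for_handoff(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Cut text to roughly max_tokens, preferring to end at a sentence or paragraph break."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text
    cut = text[:budget]
    boundary = max(cut.rfind("\n\n"), cut.rfind(". "))
    if boundary > budget // 2:  # only back up if we keep most of the budget
        cut = cut[: boundary + 1]
    return cut.rstrip() + " [truncated]"


report = "First sentence. " * 100
handoff = summarize_for_handoff(report, max_tokens=50)
print(len(report), len(handoff), handoff[-30:])
```

The explicit " [truncated]" marker tells the receiving crew that the input is partial, which helps it avoid treating a cut-off list as complete.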

What's a safe default for max_total_spawns?

For most production hierarchies, 10–15 is a safe default for async sub-crew fan-out. This covers most research patterns (topic decomposition into 10–12 sub-queries) while preventing runaway spawning on broad topics. If your use case genuinely requires more parallelism, increase the ceiling explicitly and ensure your LLM API rate limits can handle the concurrent load — async spawn amplification often hits rate limits before budget limits at high spawn counts.

Can I use a cheaper model for some agents in the hierarchy without changing manager_llm?

Yes. manager_llm affects only the planning/delegation model for that crew. Each agent uses its own llm parameter, which is set separately. You can give a coordinator crew manager_llm="gpt-4o-mini" for planning and agents=[Agent(..., llm="gpt-4o")] for execution. This combination cuts manager planning cost by roughly 15× while keeping execution quality high, a practical optimization for any crew in the hierarchy where the planning decision is predictable.

What happens to in-flight async crews if one fails and raises?

With asyncio.gather(..., return_exceptions=True), as used in kickoff_many_async, the other coroutines are not cancelled when one raises: they run to completion, and each result or exception is returned in the results list. If you want strict fail-fast behaviour that cancels the remaining crews when any one fails, replace asyncio.gather with an asyncio.TaskGroup (Python 3.11+): the first exception cancels every other task in the group, including crews still queued behind the semaphore. Fail-fast reduces wasted spend when an early crew failure already invalidates the run; consider it for orchestration patterns where sub-crew outputs are mutually dependent.

Enforce crews-of-crews limits automatically

RunGuard attaches hierarchy-level cost guards to CrewAI orchestration without changes to your crew definitions. Manager LLM call counting, shared memory budgets, async spawn caps, and cross-crew retry budgets — all configurable per pipeline, with per-run telemetry surfaced in the dashboard.
