Cursor AI Cost Control: Composer Agent Loops, Codebase Context Amplification, Multi-File Session Drift, and Background Agent Fan-out

Cursor is an AI-first code editor built on VS Code. It ships four distinct AI-powered surfaces: Composer agent mode with terminal auto-run, @Codebase semantic retrieval for context injection, long multi-file Composer sessions that accumulate state across edits, and cloud-hosted Background Agents that operate asynchronously without the developer's IDE. Each surface has different triggering semantics, different context assembly logic, and different token consumption behavior.

The per-request cost model ("I send a prompt, Cursor replies, I pay for N tokens") holds for simple chat usage. It breaks down on all four of these surfaces. In agent mode, Cursor re-reads files and terminal output before every action. @Codebase retrieves top-K semantically relevant code chunks on every query and injects all of them into the context window. Multi-file Composer sessions carry accumulated diffs and re-reads across dozens of files. Background Agents each independently upload project context and iterate with no cost signal visible in any dashboard until billing.

Cursor either uses your own API keys in a "bring your own key" (BYOK) configuration or routes requests through its hosted API for subscription plan users. Either way, agentic usage on a large codebase or in a loop produces token volumes that dwarf the simple chat estimates most developers use when choosing between Cursor plans or estimating API spend. The Cursor interface shows neither a cumulative session token count nor a spend ceiling; it surfaces the model name and nothing else.

What this post covers: Four cost amplification patterns specific to Cursor's Composer agent mode, @Codebase context retrieval, long multi-file Composer sessions, and concurrent Background Agents — and a runtime circuit breaker guard for each. The guards operate at the orchestration and context assembly layer, giving you spend ceilings without changing Cursor's behavior for requests that fit within budget.

Pattern 1: Composer Agent Iteration Loops

Cursor's Composer in Agent mode — enabled via the Agent toggle in the Composer panel — executes a ReAct loop: generate action, run tool, observe output, repeat. When "auto-run" (often called "yolo mode") is enabled, Cursor skips all approval dialogs and automatically applies file edits and runs terminal commands. The agent re-prompts itself on every action, and each prompt includes the full prior context: original task, all prior actions taken, all file content read, all terminal output observed.

Context growth is the core problem. At step 1, the agent sees the task description plus the initial file (typically 2,000–5,000 tokens). At step 5, the agent sees all of that plus four prior actions, the read content of every file it touched, and the terminal output of every command it ran. A stuck agent debugging a failing test that re-reads three source files (300 lines each) and accumulates five error outputs (200 tokens each) grows from ~3,500 tokens at step 1 to ~22,000 tokens at step 10 — a 6.3× input cost multiplier with no matching output value increase.

Step 1: 3,500 input tokens × $0.003/K = $0.0105
Step 10: 22,000 input tokens × $0.003/K = $0.066
10-step stuck agent: ~$0.42 total vs. ~$0.03 for a clean 3-step success
50 stuck agent runs/day (automated test repair pipeline): 50 × $0.42 × 30 days = $630/month, about $585 above the clean-run baseline
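The figures above can be approximated with a short calculation. The sketch below assumes the context grows linearly between the first and last step, which is a simplification; the function name `loop_input_cost` and the linear model are illustrative assumptions, not Cursor behavior. Under that assumption the 10-step run comes out slightly below the ~$0.42 quoted above.

```python
def loop_input_cost(first_step_tokens: int, last_step_tokens: int,
                    steps: int, price_per_1k: float = 0.003) -> float:
    """Total input cost of an agent loop whose context grows linearly.

    Assumption: each step's input size is interpolated evenly between
    the first and last step; real growth is lumpier.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    if steps == 1:
        return first_step_tokens * price_per_1k / 1000
    growth = (last_step_tokens - first_step_tokens) / (steps - 1)
    total_tokens = sum(first_step_tokens + growth * i for i in range(steps))
    return total_tokens * price_per_1k / 1000


stuck = loop_input_cost(3_500, 22_000, steps=10)
print(f"10-step stuck run: ${stuck:.4f}")
print(f"50 runs/day for 30 days: ${stuck * 50 * 30:,.2f}")
```

Every step pays for the whole accumulated context again, so total cost grows roughly with the square of the step count rather than linearly.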

This failure mode becomes critical in two scenarios: CI-triggered Composer agents (where the developer isn't watching the IDE and no one is approving steps) and batch task pipelines where Cursor is invoked programmatically. In both cases, a stuck agent accumulates iterations without any alerting surface until the API bill arrives.

Python — ComposerAgentGuard

from dataclasses import dataclass, field

@dataclass
class ComposerAgentGuard:
    max_iterations: int = 10
    max_input_tokens: int = 40_000
    consecutive_failure_limit: int = 3
    _iteration: int = field(default=0, init=False)
    _total_input_tokens: int = field(default=0, init=False)
    _consecutive_failures: int = field(default=0, init=False)
    _last_terminal_output: str = field(default="", init=False)

    def before_action(self, context_tokens: int, last_terminal_ok: bool, terminal_output: str) -> None:
        self._iteration += 1
        self._total_input_tokens += context_tokens

        if last_terminal_ok:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            # Duplicate output fingerprint — same error, no progress
            if terminal_output and terminal_output == self._last_terminal_output:
                raise RuntimeError(
                    f"ComposerAgentGuard: identical terminal output on consecutive steps — "
                    f"agent is stuck. Iteration {self._iteration}. "
                    "Halting before further token accumulation."
                )
        self._last_terminal_output = terminal_output

        if self._consecutive_failures >= self.consecutive_failure_limit:
            raise RuntimeError(
                f"ComposerAgentGuard: {self._consecutive_failures} consecutive tool failures. "
                f"Iteration {self._iteration}. Escalating to human review."
            )
        if self._iteration > self.max_iterations:
            raise RuntimeError(
                f"ComposerAgentGuard: iteration ceiling {self.max_iterations} reached. "
                f"Total input tokens so far: {self._total_input_tokens:,}."
            )
        # Ceiling on input tokens summed across all steps, not on a single step's context
        if self._total_input_tokens > self.max_input_tokens:
            raise RuntimeError(
                f"ComposerAgentGuard: cumulative input-token ceiling {self.max_input_tokens:,} "
                f"exceeded at iteration {self._iteration}. Halting."
            )

# Usage in a Cursor automation wrapper.
# cursor_agent_steps() is a placeholder for however your wrapper yields agent steps.
guard = ComposerAgentGuard(
    max_iterations=10,
    max_input_tokens=35_000,
    consecutive_failure_limit=3
)

for step in cursor_agent_steps():
    guard.before_action(
        context_tokens=step.input_token_count,
        last_terminal_ok=step.last_exit_code == 0,
        terminal_output=step.terminal_output
    )
    step.execute()

The guard trips on whichever threshold is hit first: max iterations (prevents long-tail runaway), cumulative input tokens across all steps (caps total spend as the context grows), consecutive failures (catches repeated errors before the iteration ceiling is reached), and identical terminal output (catches an agent stuck in a retry loop that produces the same error on consecutive failing steps).
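Exact string comparison misses a stuck agent when consecutive errors differ only in timestamps, durations, or memory addresses. One possible refinement, sketched below, is to normalize the output before comparing. The helper name `fingerprint` and the specific regular expressions are illustrative assumptions and would need tuning for a real toolchain.

```python
import hashlib
import re

# Patterns that commonly vary between otherwise identical error outputs.
# These are assumptions for illustration; tune them for your toolchain.
_HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_WHITESPACE = re.compile(r"\s+")


def fingerprint(terminal_output: str) -> str:
    """Return a stable hash of terminal output with volatile parts masked."""
    text = _HEX_ADDR.sub("<addr>", terminal_output)
    text = _NUMBER.sub("<n>", text)
    text = _WHITESPACE.sub(" ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = "FAILED test_login (0.42s) at 12:01:07 - AssertionError at 0x7f3a10"
second = "FAILED test_login (0.39s) at 12:01:55 - AssertionError at 0x7f3b98"
other = "FAILED test_logout (0.40s) at 12:02:10 - KeyError at 0x7f3c00"

print(fingerprint(first) == fingerprint(second))
print(fingerprint(first) != fingerprint(other))
```

Masking every number also hides line numbers, so two distinct errors in the same test can collapse into one fingerprint. Whether that trade-off is acceptable depends on how noisy the toolchain's output is.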

Pattern 2: @Codebase Retrieval Token Amplification

Cursor's @Codebase context feature embeds your entire codebase into a proprietary vector index and retrieves the top-K most semantically relevant code chunks for each query. The retrieval ceiling is configurable in Cursor's settings — the default retrieves between 10 and 20 chunks per query depending on the configured context mode ("large context" vs. "small context"). Each chunk contains a code excerpt typically ranging from 150 to 400 lines of code, which at ~4 tokens per line means each retrieved chunk contributes 600–1,600 tokens of injected context.

The retrieval-per-query token cost is invisible in the Cursor UI. The developer sees "asking about auth flow" — they do not see "injecting 18 code chunks = 14,400 tokens into this prompt before sending to the API." On a small project (20K lines), the default retrieval stays within reason. On a large project (200K+ lines), the indexer retrieves from a denser embedding space, and the top-K chunks are often only marginally relevant — but all are injected at full token cost.

Large codebase @Codebase query (claude-sonnet-4-6):
15 chunks × 1,200 tokens/chunk = 18,000 retrieval tokens
+ 500 token question + 300 token system prompt = 18,800 input tokens
Cost per query: 18,800 × $0.003/K = $0.0564
vs. small codebase (5 chunks): 6,800 tokens = $0.0204
Team of 8 developers, 50 @Codebase queries/day each:
Large codebase: $0.0564 × 400 queries = $22.56/day = $677/month
Small codebase: $0.0204 × 400 queries = $8.16/day = $245/month
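The arithmetic above is easy to re-run with your own numbers. The sketch below reproduces it; the function names are illustrative, and the defaults (1,200 tokens per chunk, a 500-token question, a 300-token system prompt, $0.003 per 1K input tokens, 30 days per month) are the assumptions used in this example, not measured Cursor values.

```python
def codebase_query_cost(chunks: int, tokens_per_chunk: int = 1_200,
                        question_tokens: int = 500, system_tokens: int = 300,
                        price_per_1k: float = 0.003) -> float:
    """Input cost in USD of one retrieval-augmented query."""
    input_tokens = chunks * tokens_per_chunk + question_tokens + system_tokens
    return input_tokens * price_per_1k / 1000


def monthly_team_cost(cost_per_query: float, developers: int = 8,
                      queries_per_dev_per_day: int = 50, days: int = 30) -> float:
    """Scale a per-query cost to a team-month."""
    return cost_per_query * developers * queries_per_dev_per_day * days


large = codebase_query_cost(chunks=15)
small = codebase_query_cost(chunks=5)
print(f"large: ${large:.4f}/query, ${monthly_team_cost(large):,.2f}/month")
print(f"small: ${small:.4f}/query, ${monthly_team_cost(small):,.2f}/month")
```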

The amplification also depends on context mode: Cursor's "large context" setting can retrieve up to 25–30 chunks per query. Teams that enable this mode for complex codebases can see two to three times the query cost of the default setting without a proportional increase in answer quality. Retrieval recall improves, but many retrieved chunks are tangential, and the model's attention diffuses across a larger injected context.

Python — CodebaseRetrievalGuard

from dataclasses import dataclass

@dataclass
class CodebaseRetrievalGuard:
    max_retrieved_chunks: int = 12
    max_tokens_per_chunk: int = 1_400
    max_total_retrieval_tokens: int = 15_000
    token_budget_usd_per_query: float = 0.05

    def validate_retrieval(
        self,
        chunks: list[dict],  # each: {"content": str, "tokens": int, "score": float}
        model_input_price_per_1k: float = 0.003
    ) -> list[dict]:
        # Hard chunk count ceiling
        if len(chunks) > self.max_retrieved_chunks:
            chunks = sorted(chunks, key=lambda c: c["score"], reverse=True)
            chunks = chunks[:self.max_retrieved_chunks]

        # Per-chunk token ceiling — truncate oversized chunks
        pruned = []
        for chunk in chunks:
            if chunk["tokens"] > self.max_tokens_per_chunk:
                # Approximate truncation: assumes ~4 characters per token
                max_chars = self.max_tokens_per_chunk * 4
                chunk = {
                    **chunk,
                    "content": chunk["content"][:max_chars],
                    "tokens": self.max_tokens_per_chunk,
                }
            pruned.append(chunk)

        total_retrieval_tokens = sum(c["tokens"] for c in pruned)

        # Total retrieval token budget
        if total_retrieval_tokens > self.max_total_retrieval_tokens:
            # Keep highest-scoring chunks within budget
            budget_chunks = []
            running_total = 0
            for chunk in sorted(pruned, key=lambda c: c["score"], reverse=True):
                if running_total + chunk["tokens"] <= self.max_total_retrieval_tokens:
                    budget_chunks.append(chunk)
                    running_total += chunk["tokens"]
            pruned = budget_chunks

        # USD guard — abort retrieval if projected cost exceeds per-query ceiling
        projected_cost = sum(c["tokens"] for c in pruned) * model_input_price_per_1k / 1000
        if projected_cost > self.token_budget_usd_per_query:
            raise RuntimeError(
                f"CodebaseRetrievalGuard: projected retrieval cost ${projected_cost:.4f} "
                f"exceeds per-query ceiling ${self.token_budget_usd_per_query}. "
                f"Reduce max_retrieved_chunks or max_tokens_per_chunk."
            )

        return pruned

The guard operates at the chunk-pruning layer before the assembled context is sent to the API. It enforces a chunk count ceiling, a per-chunk token ceiling, and a total retrieval token budget — trimming the lowest-scoring chunks first to preserve answer quality. The USD guard converts projected tokens to an estimated cost at the model's current per-1K-token input rate and aborts if the query would exceed the per-query ceiling.
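The guard expects each chunk to arrive with a token count. When the retrieval layer does not supply one, a rough estimate is one option. The sketch below uses a characters-per-token heuristic; the helper names `estimate_tokens` and `to_chunk` and the 4-characters-per-token ratio are assumptions, and a real tokenizer for the target model would be more accurate.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; an assumption, not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token) if text else 0


def to_chunk(content: str, score: float) -> dict:
    """Build the {"content", "tokens", "score"} dict the guard consumes."""
    return {"content": content, "tokens": estimate_tokens(content), "score": score}


# Hypothetical retrieval results: (code excerpt, relevance score)
raw_results = [
    ("def login(user, password):\n    return check(user, password)\n", 0.91),
    ("class Session:\n    pass\n", 0.47),
]
chunks = [to_chunk(content, score) for content, score in raw_results]
for chunk in chunks:
    print(chunk["tokens"], chunk["score"])
```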

Pattern 3: Multi-File Composer Session Token Accumulation

Cursor's Composer — distinct from simple chat — maintains a session context across an entire editing task. When you ask Composer to refactor an authentication module that spans 12 files, it reads each file, generates edits, applies them, and then re-reads the updated content of each file in subsequent steps to stay coherent with its prior changes. By the end of a 12-file refactor, the Composer session context contains: the original task, all 12 original file contents, all 12 updated file contents, and the accumulated diff discussion between steps.

The re-read on each step is the structural source of accumulation. Composer does not maintain a compact diff of what changed — it re-assembles the full context including all previously read file contents before each new generation step. A refactor that touches 20 files, averaging 200 lines (800 tokens) each, accumulates: at step 1, 16,000 tokens of files; at step 10, the original 16,000 + 9 sets of re-reads = 16,000 + (9 × 16,000) = 160,000 tokens of file context alone, before factoring in instructions, diffs, and model outputs from prior steps.

20-file refactor at claude-sonnet-4-6 ($0.003/K input):
Files: 20 × 800 tokens = 16,000 tokens
Cumulative input through step 5: 16,000 × 5 re-reads = 80,000 input tokens = $0.24
Cumulative input through step 10: 16,000 × 10 re-reads + accumulated diffs ≈ 175,000 input tokens = $0.525
5 such sessions/day (automated code migration pipeline): $2.625/day = $78.75/month
vs. expected from single-step estimate: 16,000 × $0.003/K = $0.048 per session = $7.20/month
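The sketch below reproduces the file-token figures above, assuming every step re-reads every file at full size and ignoring diffs, instructions, and model outputs. The function names are illustrative.

```python
def cumulative_reread_tokens(files: int, tokens_per_file: int, steps: int) -> int:
    """Cumulative file tokens sent when every step re-reads every file."""
    return files * tokens_per_file * steps


def usd(tokens: int, price_per_1k: float = 0.003) -> float:
    """Convert an input token count to USD at a per-1K-token rate."""
    return tokens * price_per_1k / 1000


for step in (1, 5, 10):
    tokens = cumulative_reread_tokens(files=20, tokens_per_file=800, steps=step)
    print(f"through step {step:>2}: {tokens:>7,} tokens, ${usd(tokens):.3f}")
```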

An overrun of roughly 10× is typical for large automated Composer sessions. It becomes critical in two scenarios: programmatic Composer use via Cursor's API layer for bulk migrations, and agent-orchestrated Composer runs where an outer agent triggers Composer on each of N repositories as a batch operation. In the latter case, the roughly 10× multiplier applies to each repository independently, so a batch of N repositories costs about N × 10× the single-step estimate for one repository.

Python — ComposerSessionGuard

from dataclasses import dataclass, field

@dataclass
class ComposerSessionGuard:
    max_files_in_session: int = 15
    max_tokens_per_file: int = 2_000
    max_session_input_tokens: int = 80_000
    max_session_steps: int = 8
    _files_read: set = field(default_factory=set, init=False)
    _session_input_tokens: int = field(default=0, init=False)
    _step: int = field(default=0, init=False)

    def before_step(self, files_in_context: list[str], context_tokens: int) -> None:
        self._step += 1
        self._session_input_tokens += context_tokens

        # File ceiling — prevent sprawl into unrelated parts of the codebase
        new_files = set(files_in_context) - self._files_read
        self._files_read.update(new_files)
        if len(self._files_read) > self.max_files_in_session:
            raise RuntimeError(
                f"ComposerSessionGuard: session has touched {len(self._files_read)} files, "
                f"exceeding ceiling of {self.max_files_in_session}. "
                "Break this task into smaller, focused Composer sessions."
            )

        # Cumulative session input token ceiling
        if self._session_input_tokens > self.max_session_input_tokens:
            raise RuntimeError(
                f"ComposerSessionGuard: session accumulated {self._session_input_tokens:,} input tokens "
                f"at step {self._step}, exceeding ceiling {self.max_session_input_tokens:,}. "
                "Commit current changes and start a new Composer session."
            )

        # Step ceiling — prevent open-ended refactors from running forever
        if self._step > self.max_session_steps:
            raise RuntimeError(
                f"ComposerSessionGuard: step ceiling {self.max_session_steps} reached. "
                f"Session has modified {len(self._files_read)} files. "
                "Review and commit before continuing."
            )

    def summarize(self) -> dict:
        return {
            "steps": self._step,
            "files_touched": len(self._files_read),
            "total_input_tokens": self._session_input_tokens,
            "estimated_cost_usd": round(self._session_input_tokens * 0.003 / 1000, 4)
        }

The guard tracks three independent trip conditions: the number of distinct files touched in the session (catches sprawling refactors that reach into unrelated modules), the cumulative session input token count (catches the re-read accumulation pattern), and the step count (catches open-ended sessions where the agent keeps generating follow-up edits without converging). Tripping any condition returns a structured error that tells the caller to commit the current partial state and start a new, focused session.
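One way to avoid tripping the guard in the first place is to split the file list into focused sessions up front. The sketch below packs files greedily under a file-count ceiling and a token ceiling; the function name `plan_sessions`, the ceilings, and the file paths are all hypothetical.

```python
def plan_sessions(file_tokens: dict[str, int], max_files: int = 15,
                  max_tokens: int = 16_000) -> list[list[str]]:
    """Greedily pack files into sessions that respect both ceilings.

    file_tokens maps a file path to its estimated token count.
    A file larger than max_tokens gets a session of its own.
    """
    sessions: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    # Sorting by path keeps files from the same directory together.
    for path, tokens in sorted(file_tokens.items()):
        too_many = len(current) >= max_files
        too_big = len(current) > 0 and current_tokens + tokens > max_tokens
        if too_many or too_big:
            sessions.append(current)
            current, current_tokens = [], 0
        current.append(path)
        current_tokens += tokens
    if current:
        sessions.append(current)
    return sessions


files = {f"src/auth/module_{i:02d}.py": 800 for i in range(20)}
for batch in plan_sessions(files, max_files=8, max_tokens=10_000):
    print(len(batch), "files")
```

Each resulting batch can then run as its own session with a fresh guard, with a commit between batches.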

Pattern 4: Background Agent Concurrent Run Fan-out

Cursor's Background Agents — available on Max and Business plans — run in isolated cloud environments. Each Background Agent independently: uploads project context (the current repo state), assembles a working environment, executes an agentic task, and iterates until completion or error. The Background Agent's iterations follow the same ReAct loop as Composer agent mode, with the same context accumulation behavior — but without the developer's IDE open, so there is no real-time cost signal.

The fan-out failure mode emerges when multiple Background Agents are triggered concurrently. Each agent independently uploads the full project context. A 150K-line repository at ~4 tokens per line = 600,000 tokens of base context, plus retrieved semantic chunks, plus the task description. At claude-sonnet-4-6 input rates, the base context upload alone costs $1.80 per agent. Triggering 10 concurrent Background Agents to parallelize a refactor task = $18 in base context uploads before any agent has taken its first action.

Background Agent on 150K-line repo (claude-sonnet-4-6):
Base context: 600,000 tokens × $0.003/K = $1.80
+ 8 iterations × 5,000 tokens/step = 40,000 more input tokens = $0.12
Single agent total: $1.92
10 concurrent Background Agents (automated PR review pipeline): $19.20
100 PRs/week → 100 × 10 agents: $1,920/week = $8,320/month
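The same numbers expressed as a calculation, so the inputs can be swapped for your own repository size and PR volume. The function name and defaults are illustrative and mirror the assumptions in the example above (4 tokens per line, 8 iterations of 5,000 tokens, $0.003 per 1K input tokens).

```python
def agent_cost(repo_lines: int, tokens_per_line: int = 4, iterations: int = 8,
               tokens_per_iteration: int = 5_000, price_per_1k: float = 0.003) -> float:
    """Input cost in USD of one agent run: base context plus iterations."""
    base_tokens = repo_lines * tokens_per_line
    loop_tokens = iterations * tokens_per_iteration
    return (base_tokens + loop_tokens) * price_per_1k / 1000


single = agent_cost(repo_lines=150_000)
per_pr = single * 10          # 10 concurrent agents per PR
weekly = per_pr * 100         # 100 PRs per week
monthly = weekly * 52 / 12    # average number of weeks per month

print(f"single agent: ${single:.2f}")
print(f"per PR:       ${per_pr:.2f}")
print(f"weekly:       ${weekly:,.2f}")
print(f"monthly:      ${monthly:,.2f}")
```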

The concurrent fan-out is particularly dangerous in CI/CD pipelines where Background Agents are triggered on pull request open events. A team that configured "open one Background Agent per PR for automated review" and has 50 open PRs after a big merge day can see close to $100 in Background Agent charges in a single afternoon (50 × $1.92), and roughly ten times that if each PR fans out to 10 agents as in the example above.

Python — BackgroundAgentConcurrencyGuard

import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class BackgroundAgentConcurrencyGuard:
    max_concurrent_agents: int = 3
    max_agents_per_hour: int = 20
    max_repo_context_tokens: int = 200_000
    max_cost_per_agent_usd: float = 2.50
    model_input_price_per_1k: float = 0.003
    _active_agents: int = field(default=0, init=False)
    _hourly_launches: list = field(default_factory=list, init=False)
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock, init=False)

    async def acquire(self, repo_token_count: int, task_description: str) -> str:
        async with self._lock:
            # Prune launches older than 1 hour
            now = time.time()
            self._hourly_launches = [t for t in self._hourly_launches if now - t < 3600]

            if self._active_agents >= self.max_concurrent_agents:
                raise RuntimeError(
                    f"BackgroundAgentConcurrencyGuard: {self._active_agents} agents already running. "
                    f"Ceiling: {self.max_concurrent_agents}. Queue this task or wait for an agent slot."
                )

            if len(self._hourly_launches) >= self.max_agents_per_hour:
                raise RuntimeError(
                    f"BackgroundAgentConcurrencyGuard: {len(self._hourly_launches)} agents launched "
                    f"in the past hour, exceeding hourly ceiling {self.max_agents_per_hour}."
                )

            projected_context_cost = repo_token_count * self.model_input_price_per_1k / 1000
            if projected_context_cost > self.max_cost_per_agent_usd:
                raise RuntimeError(
                    f"BackgroundAgentConcurrencyGuard: projected base context cost "
                    f"${projected_context_cost:.2f} for this repo exceeds per-agent ceiling "
                    f"${self.max_cost_per_agent_usd}. Reduce repo scope or increase ceiling explicitly."
                )

            if repo_token_count > self.max_repo_context_tokens:
                raise RuntimeError(
                    f"BackgroundAgentConcurrencyGuard: repo context {repo_token_count:,} tokens "
                    f"exceeds ceiling {self.max_repo_context_tokens:,}. "
                    "Use .cursorignore to exclude large generated files before launching."
                )

            self._active_agents += 1
            self._hourly_launches.append(now)
            # Nanosecond timestamp: int(now) would collide for launches within the same second
            agent_id = f"bg-agent-{time.time_ns()}"
            return agent_id

    async def release(self, agent_id: str) -> None:
        async with self._lock:
            self._active_agents = max(0, self._active_agents - 1)

# Usage in a CI/CD webhook handler
guard = BackgroundAgentConcurrencyGuard(
    max_concurrent_agents=3,
    max_agents_per_hour=15,
    max_cost_per_agent_usd=2.00
)

async def on_pull_request_opened(pr_event):
    repo_tokens = estimate_repo_tokens(pr_event.repo_path)
    agent_id = await guard.acquire(repo_tokens, pr_event.task)
    try:
        await launch_background_agent(pr_event, agent_id)
    finally:
        await guard.release(agent_id)

The guard enforces four independent limits: concurrent agent count (prevents fan-out), hourly launch rate (prevents burst spikes from CI events), per-agent projected context cost (blocks launches where the repo is so large that base context alone exceeds the ceiling), and raw repo token count (enforces a .cursorignore discipline by failing loudly when the repo is unscoped). The async lock keeps the check-and-increment sequence atomic, so a burst of concurrent PR events cannot slip past the concurrent-agent check. As written, acquire() never awaits inside the locked section, so a single event loop would not interleave it anyway; the lock becomes essential once those checks await something, such as a shared store of launch counts.
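Rejecting a launch is one policy; queueing it until a slot frees up is another. The sketch below shows the queueing variant with `asyncio.Semaphore` from the standard library. The names `run_bounded` and `fake_agent` are hypothetical, and `fake_agent` merely sleeps in place of a real Background Agent launch.

```python
import asyncio


async def run_bounded(tasks, max_concurrent: int = 3):
    """Run coroutine factories with at most max_concurrent in flight.

    `tasks` is a list of zero-argument callables returning coroutines.
    Excess tasks wait for a slot instead of being rejected.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(worker(f) for f in tasks))


async def demo() -> tuple[list[int], int]:
    active = 0
    peak = 0

    async def fake_agent(pr_number: int) -> int:
        # Stand-in for launching a Background Agent (hypothetical).
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return pr_number

    factories = [lambda n=n: fake_agent(n) for n in range(10)]
    results = await run_bounded(factories, max_concurrent=3)
    return results, peak


results, peak = asyncio.run(demo())
print(f"completed {len(results)} agents, peak concurrency {peak}")
```

Queueing bounds concurrency but not total spend, so it is best paired with an hourly launch ceiling or a cost ceiling like the ones in the guard.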

Putting It Together: Cursor Cost Surface Map

The four failure modes cover the four agentic surfaces Cursor exposes. Simple chat and inline suggestions fall within the per-request model and are predictable. The cost complexity lives in the agentic layer:

Surface	Failure Mode	Guard	Trip Condition
Composer Agent mode `auto-run / yolo`	ReAct loop accumulates tool output at each step; stuck agent = 6–10× expected cost	`ComposerAgentGuard`	max_iterations, max_input_tokens, consecutive failures, output fingerprint
@Codebase retrieval `large context mode`	15–25 retrieved chunks per query on large repos; injected at full token cost per query	`CodebaseRetrievalGuard`	chunk count ceiling, per-chunk token ceiling, total retrieval budget, USD ceiling
Multi-file Composer session `bulk refactor`	Full file content re-read at each step; 10-step session = 10× base file tokens	`ComposerSessionGuard`	files-touched ceiling, cumulative session tokens, step count ceiling
Background Agents `concurrent CI runs`	Each agent independently uploads full repo context; fan-out multiplies base cost by agent count	`BackgroundAgentConcurrencyGuard`	concurrent ceiling, hourly launch rate, per-agent cost, repo token ceiling

The common thread across all four is that Cursor's agentic surfaces break the per-request cost model. Each surface has a structural reason why token consumption grows faster than developers expect: iteration × context accumulation in agent mode, retrieval fan-out in @Codebase, file re-reads in multi-file sessions, and concurrent context uploads in Background Agents. A circuit breaker placed at the context assembly or launch coordination layer — before the API call — is the correct insertion point for each.

For teams using Cursor on large codebases or in automated pipelines, a .cursorignore file scoping out node_modules/, build artifacts, and generated files is the first line of defense and costs nothing. Beyond that, the guards above give you programmable ceilings: the ability to say "a single agent run that exceeds $2 is a bug, not a feature" and have the system enforce it at runtime rather than discover it in the monthly invoice.

Common Questions

Does Cursor show token usage or cost during a session?

No. Cursor surfaces the model name in the Composer panel and the status bar but does not display cumulative session token count, per-query token breakdown, or a running cost estimate. For BYOK (bring-your-own-key) configurations, you can check your API provider's dashboard after the fact, but there is no in-editor cost signal. For Cursor-hosted API usage (Max, Business plans), cost is billed as a subscription with usage limits, not per-token — but Background Agent and agent mode can consume request quota faster than simple chat.

What is "yolo mode" in Cursor and does it affect cost?

Yolo mode (the "auto-run" setting in Composer's agent mode) skips all approval dialogs and automatically applies file edits and runs terminal commands. It does not directly change token cost, but it removes the human approval step that would naturally gate iteration count — a developer reviewing each step will catch a stuck agent at step 3; yolo mode lets it run to 20+ iterations before anyone notices. Cost impact is indirect but significant: yolo mode in a CI context with no iteration ceiling is the primary setup in which the ComposerAgentGuard pattern above becomes critical.

How does @Codebase retrieval differ between Cursor and Continue.dev?

Both use vector embedding indexes for semantic retrieval. The key differences are embedding ownership (Cursor's index is proprietary and hosted; Continue.dev's LanceDB index is local and open-source) and retrieval ceiling defaults. Cursor's "large context" mode retrieves up to 25–30 chunks; Continue.dev's default is 10–15 with explicit configuration required to increase it. Cost exposure is structurally similar — both inject retrieved chunks at full token cost — but Cursor's proprietary index makes the chunk count and token count per-chunk less transparent to the developer.

Are Background Agents charged differently from regular Composer on Cursor's Max plan?

Yes. Cursor's Max plan (as of 2026) includes a fast request quota for standard chat and inline completions, and a separate, lower quota for "tool calls" and agentic mode usage. Background Agents consume from the agentic quota, which is smaller, and concurrent Background Agent runs consume quota in parallel — not sequentially. Teams that trigger Background Agents from CI webhooks can exhaust their monthly agentic quota in days if a large PR batch triggers concurrent agents. The concurrency guard above prevents quota exhaustion as well as token cost overrun for BYOK configurations.

What is a .cursorignore file and where should it go?

A .cursorignore file lives in the repository root and uses .gitignore syntax to exclude files and directories from Cursor's codebase indexing and context assembly. Adding node_modules/, dist/, .next/, *.lock, and generated migration files typically reduces repository token count by 60–80% for a typical Node.js or Python web application, directly reducing @Codebase retrieval cost and Background Agent base context upload cost proportionally. It's the single highest-leverage cost reduction action for teams using Cursor on any non-trivial codebase.
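As a starting point, a .cursorignore covering the entries named above might look like the sketch below. The migrations path is a placeholder, not taken from a specific project; adjust every pattern to your repository layout.

```gitignore
# Dependencies and build output
node_modules/
dist/
.next/

# Lock files
*.lock

# Generated migration files (placeholder path; adjust to your project)
migrations/generated/
```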

Catch Cursor runaway costs before billing

RunGuard is a runtime SDK that trips a circuit breaker the moment your AI agent's tool-call pattern shows a loop, context-window blow-through, or budget overrun — before the bill lands. One-line install for TypeScript and Python pipelines that orchestrate Cursor agents or any other agentic surface.
