Continue.dev is an open-source AI code assistant for VS Code and JetBrains. It supports any LLM backend — Anthropic, OpenAI, Azure, Google, Ollama, and a growing list of provider adapters — and it ships four distinct AI-powered surfaces: @codebase semantic context retrieval, tab autocomplete, agent-mode multi-step task execution, and @docs documentation indexing. Each surface has different triggering semantics, different context assembly logic, and different token consumption behavior.

The per-request mental model that most developers use for cost estimation ("a chat message sends N tokens, gets back M tokens") holds for the simplest Continue.dev usage: open a chat, type a question, send. It breaks down for all four surfaces above. Autocomplete fires on every keypress that clears the debounce window. The @codebase provider retrieves ten or more code chunks per query. The agent carries all prior tool output in each new call. The @docs provider crawls entire documentation sites on first use. None of these behaviors is visible in a single-request trace.

Continue.dev does not enforce a per-session, per-day, or per-project spend ceiling. It routes every LLM call through the model you configured in config.json — your API key, your provider, your rates. When retrieval over-fetches, when the autocomplete model is set to something expensive, when an agent gets stuck, or when docs indexing triggers unexpectedly, the bill accumulates without the tool reporting it.

What this post covers: Four cost amplification patterns specific to Continue.dev's @codebase context provider, autocomplete engine, agent mode, and @docs indexer — and a runtime circuit breaker guard for each. The guards operate at the context assembly and API call layer, giving you spend ceilings without changing Continue.dev's behavior for requests that fit within budget.

Pattern 1: @codebase Context Window Expansion

Continue.dev's @codebase context provider builds a local embedding index over your repository using LanceDB. When a user sends a query prefixed with @codebase, Continue performs a vector similarity search and retrieves the top-K most similar code chunks, injects them verbatim into the chat context, and sends the combined context to your configured LLM. The default nRetrievedFiles is typically 10–15, and each retrieved chunk can be 1,500–3,000 tokens — the chunk size depends on your indexing configuration and the average size of files in your repository.

On a small codebase (10K lines, 50 files), the top-15 retrieved chunks might total 8,000 tokens. On a large codebase, such as a Next.js monorepo with API routes, server components, utility modules, and test files spanning 200K+ lines, the same top-15 retrieval returns 25,000–40,000 tokens of injected context before the user's actual question is even added. For a team whose developers each make 50 @codebase queries per day, context injection alone becomes the dominant LLM spend line.

@codebase context injection cost (large codebase, 15 retrieved chunks):
Retrieved chunks: 15 × 2,500 tokens avg = 37,500 retrieval tokens
System prompt + chat history: 5,800 tokens
User message: 300 tokens
Model response: 2,000 tokens
Total input per query: ~43,600 tokens

At claude-sonnet-4-6 ($3.00/$15.00 per 1M tokens):
Input: 43,600 × $3.00 / 1M = $0.131 per query
Output: 2,000 × $15.00 / 1M = $0.030 per query
$0.161 per @codebase query with 15 retrieved chunks

10-person dev team, 50 @codebase queries per developer per day:
500 queries × $0.161 = $80.50/day ($2,415/month)
Same team with retrieval capped at 8,000 tokens (about 5 smaller chunks): 14,100 input tokens → $0.0723/query → $36.15/day → ~$1,085/month
Unguarded retrieval: 2.2× the budget-aware cost
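The per-query arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the helper name query_cost is illustrative, and the token counts and per-million rates are the ones quoted in this section rather than values read from Continue.dev.

```python
def query_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate: float = 3.00,    # USD per 1M input tokens (rate quoted above)
    output_rate: float = 15.00,  # USD per 1M output tokens (rate quoted above)
) -> float:
    """Cost in USD of a single request."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Request profile from the breakdown above.
retrieval_tokens = 15 * 2_500                   # 15 chunks at 2,500 tokens each
input_tokens = retrieval_tokens + 5_800 + 300   # plus history and user message
per_query = query_cost(input_tokens, output_tokens=2_000)

# 10 developers x 50 queries each, using the per-query cost rounded to 3 places.
per_day = 500 * round(per_query, 3)

print(f"input tokens per query: {input_tokens:,}")
print(f"cost per query: ${per_query:.3f}")
print(f"cost per day: ${per_day:.2f}")
```

Swapping in your own chunk count, average chunk size, and provider rates gives a quick estimate of what a retrieval ceiling would save before you wire in a guard.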

The amplification compounds when @codebase is used inside an agent task rather than a single chat message. The agent may invoke @codebase lookup as part of a multi-step plan — each lookup injects a fresh 37,500 tokens into the already-growing context window. By step 4 of a 10-step agent task, the running context can exceed 100,000 tokens from retrieval injections alone.

The guard: CodebaseContextGuard

The guard intercepts the retrieved chunk list before it is assembled into the final context payload, counts the tokens in each chunk, and returns a pruned list that fits within a configurable retrieval token budget. It leaves the chat history, system prompt, and user message space intact, reserving headroom for the model's response:

Python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class CodebaseContextGuard:
    max_retrieval_tokens: int = 12_000  # ceiling on @codebase chunk injection
    system_prompt_tokens: int = 1_000   # estimate: system prompt + prior history
    user_message_tokens: int = 300      # estimate: current user message
    headroom_tokens: int = 4_000        # reserve for model response
    model_context_limit: int = 200_000  # model window (claude-sonnet-4-6 = 200K)
    _retrieval_tokens_used: int = field(default=0, init=False)
    _chunks_dropped: int = field(default=0, init=False)
    _trip_reason: Optional[str] = field(default=None, init=False)

    @property
    def tripped(self) -> bool:
        return self._trip_reason is not None

    def _count_tokens(self, text: str) -> int:
        try:
            import tiktoken
            enc = tiktoken.get_encoding("cl100k_base")
            return len(enc.encode(text))
        except Exception:
            return len(text) // 4

    def evaluate_retrieval(
        self,
        retrieved_chunks: List[str],
        user_message_tokens: Optional[int] = None,
    ) -> List[str]:
        """
        Prune retrieved_chunks to fit within the retrieval token ceiling.
        Returns the approved subset (highest-ranked chunks that fit the budget).
        Sets trip_reason when any chunks are dropped.
        """
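        # Circuit-breaker semantics: once tripped, the guard rejects all further
        # retrieval. Use a fresh guard per query if you only want per-query pruning.
        if self.tripped:
            return []

        # Compare against None so an explicit 0 is not replaced by the default.
        msg_tokens = (
            self.user_message_tokens if user_message_tokens is None else user_message_tokens
        )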
        overhead = self.system_prompt_tokens + msg_tokens + self.headroom_tokens
        available_for_retrieval = min(
            self.max_retrieval_tokens,
            self.model_context_limit - overhead,
        )

        approved: List[str] = []
        running = 0

        for chunk in retrieved_chunks:
            chunk_tokens = self._count_tokens(chunk)
            if running + chunk_tokens > available_for_retrieval:
                self._chunks_dropped = len(retrieved_chunks) - len(approved)
                self._retrieval_tokens_used = running
                self._trip_reason = (
                    f"@codebase retrieval ceiling: accepted {len(approved)}/{len(retrieved_chunks)} "
                    f"chunks ({running} tokens) — {self._chunks_dropped} chunk(s) dropped to stay "
                    f"under {available_for_retrieval}-token retrieval budget"
                )
                break
            approved.append(chunk)
            running += chunk_tokens

        self._retrieval_tokens_used = running
        return approved

    def status(self) -> dict:
        return {
            "retrieval_tokens_used": self._retrieval_tokens_used,
            "chunks_dropped": self._chunks_dropped,
            "tripped": self.tripped,
            "trip_reason": self._trip_reason,
        }


def guarded_codebase_query(
    continue_client,
    user_message: str,
    guard: CodebaseContextGuard,
) -> dict:
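    """
    Retrieves @codebase context, applies the guard, and sends the pruned
    context to the LLM. Callers receive the guard_status alongside the response
    so they can log or alert on trips.

    continue_client is a placeholder for your own integration layer; its
    retrieve_codebase_chunks() and chat() methods are illustrative names.
    """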
    raw_chunks = continue_client.retrieve_codebase_chunks(
        query=user_message,
        n_results=25,  # over-retrieve, then prune to budget
    )

    approved_chunks = guard.evaluate_retrieval(raw_chunks)

    response = continue_client.chat(
        message=user_message,
        context_chunks=approved_chunks,
    )

    return {
        "response": response,
        "guard_status": guard.status(),
        "chunks_used": len(approved_chunks),
        "chunks_available": len(raw_chunks),
    }

The guard over-retrieves intentionally (asking for 25 chunks) and then prunes to budget. This ensures the highest-ranked semantically similar chunks are always included first — the pruning drops the lowest-ranked chunks, not arbitrary ones. When a trip occurs, the caller receives a guard_status with the exact token count and drop reason, which can be logged for tuning the nRetrievedFiles default in config.json.

Pattern 2: Autocomplete Model Amplification

Continue.dev's tab autocomplete fires on every keypress that passes the debounce threshold (default ~300ms of inactivity). Each autocomplete request sends the file prefix (code above the cursor), the file suffix (code below the cursor), and model-specific instructions — typically 2,000–4,000 tokens of input context — to the configured tabAutocompleteModel. The response is a short completion (20–150 tokens). In an hour of active coding, a developer generates 200–600 autocomplete triggers.

The amplification failure mode is a configuration gap in Continue.dev's config.json: if tabAutocompleteModel is not set, Continue falls back to the default model in the models array — typically the same expensive model the developer uses for chat. A developer who configured claude-sonnet-4-6 or gpt-4o as their primary chat model, without explicitly setting a cheaper model for autocomplete, is billing every keypress at chat-model rates. Purpose-built autocomplete models (deepseek-coder-v2-lite, starcoder2-7b, Codestral, or a local Ollama model) run at 10–100× lower cost per request.

Autocomplete cost per developer/day — model comparison:
Request profile: 3,200 input tokens (prefix + suffix + instructions), 80 output tokens
Active coding: 400 autocomplete triggers/day

claude-sonnet-4-6 ($3.00 in / $15.00 out per 1M):
Per trigger: 3,200 × $3/M + 80 × $15/M = $0.0096 + $0.0012 = $0.0108
400 triggers: $4.32/developer/day ($129.60/month)

Codestral ($0.30 in / $0.90 out per 1M):
Per trigger: 3,200 × $0.30/M + 80 × $0.90/M = $0.00096 + $0.000072 ≈ $0.00103
400 triggers: ≈ $0.41/developer/day (≈ $12.38/month)

deepseek-coder-v2-lite via Ollama (local, $0/trigger):
400 triggers: $0/day

15-developer team using Sonnet for autocomplete:
15 × $129.60 = $1,944/month from autocomplete alone
Same team with Codestral: ≈ $186/month, about 10.5× cheaper
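The same comparison can be scripted so that it is easy to rerun with your own trigger counts and rates. The sketch below is illustrative only: ModelRate and daily_autocomplete_cost are hypothetical names, and the rates are the ones quoted in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    name: str
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def daily_autocomplete_cost(
    rate: ModelRate,
    triggers_per_day: int = 400,
    input_tokens: int = 3_200,
    output_tokens: int = 80,
) -> float:
    """USD per developer per day for the request profile quoted above."""
    per_trigger = (
        input_tokens * rate.input_per_million
        + output_tokens * rate.output_per_million
    ) / 1_000_000
    return per_trigger * triggers_per_day


RATES = [
    ModelRate("claude-sonnet-4-6", 3.00, 15.00),
    ModelRate("codestral", 0.30, 0.90),
    ModelRate("local-ollama", 0.0, 0.0),  # local inference, no per-token charge
]

for rate in RATES:
    daily = daily_autocomplete_cost(rate)
    print(f"{rate.name}: ${daily:.2f}/day, ${daily * 30:.2f}/month per developer")
```

Multiplying the monthly figure by team size gives the team-level numbers; the ratio between models stays the same regardless of trigger volume because the request profile is identical.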

The misconfiguration is easy to make and hard to spot. Continue.dev's model selection UI shows only the chat model prominently. The tabAutocompleteModel field in config.json is a separate optional key — its absence is silent. Teams that onboard developers with a shared config template are particularly exposed if the template omits the autocomplete model: every developer in the team silently bills chat-model rates for every keystroke.
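A shared config template can be checked for this gap before it is distributed. The following is a rough sketch, assuming config.json has the shape described in this post (a models array of objects with a model field, and an optional tabAutocompleteModel object) and that the fallback is the first models entry; effective_autocomplete_model is a hypothetical helper, not part of Continue.dev.

```python
import json
from typing import Optional, Tuple


def effective_autocomplete_model(config: dict) -> Tuple[Optional[str], str]:
    """
    Return (model_name, source), where source is "explicit" when
    tabAutocompleteModel is set, "fallback" when the first entry of the
    models array would be used instead, and "none" when neither exists.
    """
    explicit = config.get("tabAutocompleteModel")
    if isinstance(explicit, dict) and explicit.get("model"):
        return explicit["model"], "explicit"

    models = config.get("models") or []
    if models and isinstance(models[0], dict) and models[0].get("model"):
        return models[0]["model"], "fallback"

    return None, "none"


# Sample template that omits tabAutocompleteModel (assumed shape).
template = json.loads("""
{
  "models": [
    {"title": "Sonnet", "provider": "anthropic", "model": "claude-sonnet-4-6"}
  ]
}
""")

model, source = effective_autocomplete_model(template)
if source == "fallback":
    print(f"warning: autocomplete would fall back to chat model '{model}'")
```

Running a check like this in the repository that hosts the shared template turns a silent omission into a visible warning at review time.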

The guard: AutocompleteBudgetGuard

The guard tracks autocomplete token consumption on a rolling hourly and daily window, trips when thresholds are exceeded, and exposes a model-check function that detects when an expensive model is configured as the autocomplete provider. At trip, the caller can disable autocomplete for the current session rather than silently continuing to accumulate cost:

Python
import time
from dataclasses import dataclass, field
from typing import Optional

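# Substring patterns for chat-tier models; check_model() matches them
# case-insensitively, so "gpt-4o" also flags variants such as "gpt-4o-mini".
EXPENSIVE_MODELS = frozenset({
    "claude-sonnet", "claude-opus", "gpt-4o", "gpt-4-turbo",
    "claude-3-5-sonnet", "claude-sonnet-4", "claude-sonnet-4-6",
    "gemini-1.5-pro",
})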

@dataclass
class AutocompleteBudgetGuard:
    max_triggers_per_hour: int = 300
    max_input_tokens_per_day: int = 2_000_000
    _hourly: list = field(default_factory=list, init=False)      # trigger timestamps
    _daily: list = field(default_factory=list, init=False)        # (timestamp, tokens)
    _trip_reason: Optional[str] = field(default=None, init=False)

    @property
    def tripped(self) -> bool:
        return self._trip_reason is not None

    def _prune(self) -> None:
        now = time.time()
        self._hourly = [t for t in self._hourly if now - t < 3600]
        self._daily = [(t, n) for t, n in self._daily if now - t < 86400]

    def check_model(self, model_name: str) -> Optional[str]:
        """
        Returns a warning string if model_name matches an expensive model.
        Call this at Continue.dev config load time to detect misconfiguration.
        """
        lower = model_name.lower()
        for name in EXPENSIVE_MODELS:
            if name in lower:
                return (
                    f"tabAutocompleteModel is set to an expensive model ('{model_name}'). "
                    f"Consider setting a dedicated low-cost autocomplete model: "
                    f"codestral, deepseek-coder-v2-lite, starcoder2-7b, or a local Ollama model. "
                    f"Every keystroke triggers an API call at this model's input rate."
                )
        return None

    def record_trigger(self, input_tokens: int = 3_200) -> None:
        if self.tripped:
            return
        self._prune()
        now = time.time()
        self._hourly.append(now)
        self._daily.append((now, input_tokens))

        hourly_count = len(self._hourly)
        daily_tokens = sum(n for _, n in self._daily)

        if hourly_count >= self.max_triggers_per_hour:
            self._trip_reason = (
                f"autocomplete trigger ceiling: {hourly_count} triggers "
                f"in the last hour >= {self.max_triggers_per_hour} limit — "
                f"autocomplete paused for this session"
            )
        elif daily_tokens >= self.max_input_tokens_per_day:
            self._trip_reason = (
                f"daily autocomplete token budget: {daily_tokens:,} input tokens "
                f">= {self.max_input_tokens_per_day:,} daily limit"
            )

    def status(self) -> dict:
        self._prune()
        return {
            "hourly_triggers": len(self._hourly),
            "daily_input_tokens": sum(n for _, n in self._daily),
            "tripped": self.tripped,
            "trip_reason": self._trip_reason,
        }

The check_model function should be called once at IDE startup when Continue loads its config — it catches the misconfiguration before any tokens are spent. The rolling-window trigger and token counters catch the volumetric case: even with a correctly configured cheap model, high-frequency autocomplete in very large files can exceed daily budgets. At trip, the caller returns null completions (disabling autocomplete for the session) rather than continuing to accumulate cost.

Pattern 3: Agent Tool Failure Re-evaluation Loop

Continue.dev's agent mode (/agent) executes multi-step tasks using a ReAct-style loop: the model plans and reasons, calls a tool (run terminal command, edit a file, create a file, read a file), observes the tool output, then plans the next step. The full conversation history — including every prior reasoning step and tool output — is carried forward in each new LLM call. This is by design: the agent needs its history to make coherent decisions.

The cost amplification failure mode arises when a tool call fails and the agent enters a diagnosis-and-retry cycle. Each retry adds the failure output to the accumulated context. If the underlying problem is not solvable by the agent — a missing system dependency, a permission error, a fundamental misunderstanding of the codebase structure — the agent keeps generating plans and retrying with a context window that grows by 2,000–5,000 tokens per iteration. A 10-step stuck agent accumulates significantly more input tokens than a successful 10-step agent, because failure output is verbose and error messages often include stack traces.

Agent re-evaluation context growth (stuck on tool failure):
Step 1 success — read package.json: context = 3,000 tokens
Step 2 success — read src/auth.js: context = 5,800 tokens
Step 3 fail — npm install passport (EACCES): context = 8,200 tokens
Step 4 fail — retry via npx: context = 10,900 tokens
Step 5 fail — manual package.json edit (syntax error): context = 14,100 tokens
Step 8 fail — still stuck, trying alternative approaches: context = 22,400 tokens
Step 10 fail: context = 28,800 tokens

Sum of input tokens across 10 calls: ~155,000 tokens total input
At claude-sonnet-4-6 $3/M: $0.465 input + output at $15/M: ~$0.075 = $0.54 per stuck run

Expected 3-step success (same task without the failure): ~18,000 total input tokens
$0.054 input + $0.022 output = $0.076 per successful run
Stuck 10-step agent: 7.1× the expected cost
50 stuck agent runs/day across a team: $27/day ($810/month), about $23/day more than the same 50 runs would cost if they succeeded
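The total above comes from summing the context size of every call, not just the final one. The sketch below reproduces that sum; the contexts for steps 6, 7, and 9 are not listed in the breakdown, so they are filled in by linear interpolation, which is an assumption made only for illustration.

```python
from typing import Dict, List

# Context size (tokens) at each step listed in the breakdown above.
KNOWN_CONTEXT: Dict[int, int] = {
    1: 3_000, 2: 5_800, 3: 8_200, 4: 10_900, 5: 14_100, 8: 22_400, 10: 28_800,
}


def fill_linear(known: Dict[int, int]) -> List[float]:
    """Interpolate missing steps linearly between the nearest known steps.

    Assumes the first and last steps of the run are present in `known`.
    """
    steps = sorted(known)
    filled: Dict[int, float] = {}
    for lo, hi in zip(steps, steps[1:]):
        slope = (known[hi] - known[lo]) / (hi - lo)
        for step in range(lo, hi):
            filled[step] = known[lo] + slope * (step - lo)
    filled[steps[-1]] = known[steps[-1]]
    return [filled[step] for step in range(steps[0], steps[-1] + 1)]


contexts = fill_linear(KNOWN_CONTEXT)
total_input = sum(contexts)                 # every call resends the full history
input_cost = total_input * 3.00 / 1_000_000  # input rate quoted above

print(f"calls: {len(contexts)}")
print(f"total input tokens: {total_input:,.0f}")
print(f"input cost: ${input_cost:.3f}")
```

Because each call resends everything before it, total input grows roughly with the square of the step count, which is why a ceiling on iterations or context size pays off quickly.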

The pattern is particularly expensive for tasks that touch system-level operations — package installation, database migration, CI/CD configuration — where the agent cannot resolve the failure without operator intervention but will attempt every variant it can reason about before giving up. Continue.dev's agent currently has no built-in iteration ceiling or context-growth ceiling; it runs until the model stops generating tool calls or the session is killed manually.

The guard: AgentIterationGuard

The guard tracks per-step context token growth, consecutive tool failures, and total iteration count. It trips on whichever ceiling is hit first, returning the partial work product and the trip reason so the developer can inspect what failed rather than watching the agent silently exhaust its context window:

Python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class AgentIterationGuard:
    max_iterations: int = 10
    max_context_tokens: int = 60_000
    max_consecutive_failures: int = 3
    _iterations: int = field(default=0, init=False)
    _context_tokens: int = field(default=0, init=False)
    _consecutive_failures: int = field(default=0, init=False)
    _tool_log: List[dict] = field(default_factory=list, init=False)
    _trip_reason: Optional[str] = field(default=None, init=False)

    @property
    def tripped(self) -> bool:
        return self._trip_reason is not None

    def record_step(
        self,
        tool_name: str,
        tool_output: str,
        context_tokens: int,
        success: bool,
    ) -> None:
        if self.tripped:
            return

        self._iterations += 1
        self._context_tokens = context_tokens
        self._tool_log.append({
            "step": self._iterations,
            "tool": tool_name,
            "success": success,
            "output_preview": tool_output[:150],
        })

        if success:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1

        if self._iterations >= self.max_iterations:
            self._trip_reason = (
                f"agent iteration ceiling: {self._iterations} steps "
                f">= {self.max_iterations} limit "
                f"(context: {context_tokens:,} tokens)"
            )
        elif context_tokens >= self.max_context_tokens:
            self._trip_reason = (
                f"agent context ceiling: {context_tokens:,} tokens "
                f">= {self.max_context_tokens:,} after {self._iterations} steps"
            )
        elif self._consecutive_failures >= self.max_consecutive_failures:
            # Failures are counted across all tools, so name only the latest one.
            last = self._tool_log[-1]
            self._trip_reason = (
                f"consecutive tool failures: {self._consecutive_failures} in a row, "
                f"most recently on '{last['tool']}' (last output: {last['output_preview']!r})"
            )

    def status(self) -> dict:
        return {
            "iterations": self._iterations,
            "context_tokens": self._context_tokens,
            "consecutive_failures": self._consecutive_failures,
            "tool_log": self._tool_log,
            "tripped": self.tripped,
            "trip_reason": self._trip_reason,
        }


def run_agent_with_guard(
    agent_session,
    initial_task: str,
    guard: AgentIterationGuard,
) -> dict:
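    """
    Drives a Continue.dev agent session step-by-step, enforcing the iteration
    guard after each tool call. Returns partial work product on trip.

    agent_session is a placeholder for whatever object drives the agent loop and
    is assumed to have been started with initial_task already. Steps that are
    not tool calls are not counted by the guard.
    """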
    work_product = []

    while not guard.tripped:
        step = agent_session.next_step()

        if step is None or step.get("type") == "done":
            break

        if step["type"] == "tool_call":
            result = agent_session.execute_tool(step["tool"], step["args"])
            success = result.get("exit_code", 0) == 0

            guard.record_step(
                tool_name=step["tool"],
                tool_output=result.get("output", ""),
                context_tokens=agent_session.context_token_count(),
                success=success,
            )
            work_product.append({
                "tool": step["tool"],
                "success": success,
                "output_preview": result.get("output", "")[:200],
            })

    return {
        "work_product": work_product,
        "guard_status": guard.status(),
        "completed": not guard.tripped,
    }

The consecutive-failure detector is the most important ceiling for the re-evaluation pattern: when three tool calls fail in a row, the agent is most likely stuck in a retry loop, and the problem requires developer attention, not more LLM reasoning. Surfacing this trip early (at 3 consecutive failures rather than 10 iterations) returns control to the developer before the context has grown to the point where reading the trip log becomes difficult.
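The AgentIterationGuard shown earlier counts consecutive failures regardless of which tool failed. If you would rather trip only when one specific tool keeps failing, a per-tool streak counter is one option. The class below is a hypothetical sketch of that variant, not part of the guard above.

```python
from typing import Optional


class SameToolFailureStreak:
    """Tracks how many times in a row the same tool has failed."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self._tool: Optional[str] = None
        self._streak = 0

    def record(self, tool_name: str, success: bool) -> bool:
        """Record one tool result; return True once the streak reaches the limit."""
        if success:
            # Any success clears the streak.
            self._tool, self._streak = None, 0
            return False
        if tool_name == self._tool:
            self._streak += 1
        else:
            # A failure on a different tool starts a new streak at 1.
            self._tool, self._streak = tool_name, 1
        return self._streak >= self.limit


streak = SameToolFailureStreak(limit=3)
for tool, ok in [("terminal", False), ("terminal", False), ("terminal", False)]:
    if streak.record(tool, ok):
        print(f"'{tool}' failed {streak.limit} times in a row; handing back control")
```

The trade-off is that an agent alternating between two failing tools would not trip this variant, so it works best alongside the cross-tool counter and the iteration ceiling.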

Pattern 4: @docs Crawl Storm on Cloud Embeddings

Continue.dev's @docs context provider indexes external documentation sites so they can be searched semantically alongside your codebase. When a user adds a new docs entry to config.json (e.g., the React docs, Next.js docs, or Prisma docs), Continue.dev crawls the site on first query: it fetches each page, splits the content into chunks, and embeds them into a local LanceDB index.

By default, Continue.dev uses Transformers.js to run embeddings locally — no API cost for embedding calls. The amplification failure mode activates when teams configure a cloud embeddingsProvider in config.json (e.g., openai with text-embedding-3-large, or voyage, or cohere) to improve retrieval quality. In that configuration, every page crawled from a documentation site generates an embedding API call at cloud rates. A large documentation site (Next.js docs: ~380 pages; React docs: ~120 pages; Django docs: ~450 pages) generates hundreds of embedding calls in a single first-crawl event.

@docs cloud embedding cost per documentation site (first crawl):
Next.js docs: ~380 pages × 3,000 tokens avg = 1,140,000 embedding tokens
At text-embedding-3-large ($0.13 per 1M tokens): $0.148 per site

Full developer stack (Next.js + React + TypeScript + Prisma + Tailwind CSS):
Total pages: ~1,200 pages, ~3,600,000 embedding tokens
$0.47 per full stack initial crawl

The crawl storm scenario: config.json change → Continue detects new docs source → re-crawl triggered
Shared config repo with 20 developers refreshing config daily: 20 × $0.47 = $9.40/day
CI/CD pipeline building a new dev environment per PR (100 PRs/week): 100 × $0.47 = $47/week
Automated workspace setup scripts that force-reindex docs: unlimited re-triggering
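These figures are simple enough to recompute for your own docs list. A minimal sketch, using the page counts, average page size, and embedding rate quoted above; crawl_cost is an illustrative helper name.

```python
from typing import Tuple


def crawl_cost(
    pages: int,
    avg_tokens_per_page: int = 3_000,  # average quoted above
    usd_per_million: float = 0.13,     # text-embedding-3-large rate quoted above
) -> Tuple[int, float]:
    """Return (embedding_tokens, cost_usd) for one full crawl."""
    tokens = pages * avg_tokens_per_page
    return tokens, tokens * usd_per_million / 1_000_000


nextjs_tokens, nextjs_cost = crawl_cost(380)
stack_tokens, stack_cost = crawl_cost(1_200)

print(f"Next.js docs: {nextjs_tokens:,} tokens, ${nextjs_cost:.3f} per crawl")
print(f"Full stack:   {stack_tokens:,} tokens, ${stack_cost:.2f} per crawl")
print(f"20 developers re-crawling daily: ${20 * stack_cost:.2f}/day")
```

The per-crawl cost is small; what matters is the multiplier, so the number worth tracking is how many times per day a crawl is triggered across all environments.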

The crawl can also be triggered by calling continue.reindexDocs programmatically, which some teams do in workspace setup scripts to ensure the docs index is fresh. If that script runs in CI or on every container start, it silently re-crawls and re-embeds every configured docs source on every invocation. The LanceDB index is local to the developer's machine, so there is no shared index — each developer's environment incurs the full crawl cost independently.

The guard: DocsIndexGuard

The guard enforces per-source and total page count ceilings before a crawl begins, and tracks embedding tokens as pages are indexed. It intercepts the pre-crawl decision — blocking the crawl before any API calls are made — rather than trying to abort a crawl already in progress:

Python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DocsIndexGuard:
    max_pages_per_source: int = 150          # max pages crawled from one docs site
    max_total_pages: int = 600               # max across all configured sources
    max_embedding_tokens: int = 6_000_000   # embedding API token ceiling per reindex
    avg_tokens_per_page: int = 2_500        # rough estimate per docs page
    _pages_by_source: dict = field(default_factory=dict, init=False)
    _total_pages: int = field(default=0, init=False)
    _total_embedding_tokens: int = field(default=0, init=False)
    _trip_reason: Optional[str] = field(default=None, init=False)

    @property
    def tripped(self) -> bool:
        return self._trip_reason is not None

    def check_before_crawl(
        self,
        source_url: str,
        estimated_pages: int,
    ) -> dict:
        """
        Call before starting to crawl a docs source.
        Returns {"approved": True} if the crawl fits within budget, or
        {"approved": False, "reason": ...} to block it.
        """
        if self.tripped:
            return {"approved": False, "reason": self._trip_reason}

        projected_source_pages = self._pages_by_source.get(source_url, 0) + estimated_pages
        projected_total_pages = self._total_pages + estimated_pages
        projected_tokens = self._total_embedding_tokens + estimated_pages * self.avg_tokens_per_page

        if projected_source_pages > self.max_pages_per_source:
            reason = (
                f"docs crawl ceiling for '{source_url}': {projected_source_pages} pages projected "
                f"> {self.max_pages_per_source} per-source limit — "
                f"add a URL filter (startUrl + rootPatterns) to reduce crawl scope in config.json"
            )
            self._trip_reason = reason
            return {"approved": False, "reason": reason}

        if projected_total_pages > self.max_total_pages:
            reason = (
                f"total docs page ceiling: {projected_total_pages} pages projected "
                f"> {self.max_total_pages} across all sources"
            )
            self._trip_reason = reason
            return {"approved": False, "reason": reason}

        if projected_tokens > self.max_embedding_tokens:
            reason = (
                f"embedding token ceiling: ~{projected_tokens:,} tokens projected "
                f"> {self.max_embedding_tokens:,} limit — "
                f"reduce docs sources or switch to local Transformers.js embeddings"
            )
            self._trip_reason = reason
            return {"approved": False, "reason": reason}

        return {"approved": True}

    def record_page(self, source_url: str, page_tokens: int = 0) -> None:
        if self.tripped:
            return
        self._pages_by_source[source_url] = self._pages_by_source.get(source_url, 0) + 1
        self._total_pages += 1
        self._total_embedding_tokens += page_tokens or self.avg_tokens_per_page

    def status(self) -> dict:
        return {
            "total_pages": self._total_pages,
            "pages_by_source": dict(self._pages_by_source),
            "total_embedding_tokens": self._total_embedding_tokens,
            "tripped": self.tripped,
            "trip_reason": self._trip_reason,
        }

The guard's check_before_crawl takes an estimated_pages count, which can be estimated from the target docs site's sitemap before the crawl starts. When the trip fires, the guard provides an actionable suggestion: add a startUrl and rootPatterns filter to the docs entry in config.json to scope the crawl to relevant subsections (e.g., only the API reference pages, not the full guide and tutorial content). This is preferable to raising the ceiling, because smaller scoped indexes also improve retrieval precision.
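One way to obtain estimated_pages is to count the URLs listed in a site's sitemap.xml. The sketch below assumes the site publishes a standard sitemap in the sitemaps.org format; estimate_pages and the sample URLs are hypothetical, fetching the file is left out, and nested sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def estimate_pages(sitemap_xml: str, url_prefix: str = "") -> int:
    """Count <loc> entries in a sitemap, optionally limited to a URL prefix."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return sum(1 for loc in locs if loc.startswith(url_prefix))


# Tiny inline sample standing in for a downloaded sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/api/client</loc></url>
  <url><loc>https://example.com/docs/api/server</loc></url>
  <url><loc>https://example.com/docs/guides/intro</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>"""

all_pages = estimate_pages(SAMPLE_SITEMAP)
api_pages = estimate_pages(SAMPLE_SITEMAP, "https://example.com/docs/api/")
print(f"whole site: {all_pages} pages, API reference only: {api_pages} pages")
```

Comparing the whole-site count with the count under a narrower prefix shows how much a scoped crawl would save before any embedding call is made.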

When to apply which guard

The four patterns operate independently and can co-occur within a single Continue.dev session. A developer using @codebase inside an agent task with @docs context and an unguarded autocomplete model can hit all four patterns simultaneously. In that configuration, the guards compose: run CodebaseContextGuard per @codebase retrieval, AutocompleteBudgetGuard per keystroke trigger (with a startup check_model call), AgentIterationGuard per agent session, and DocsIndexGuard per config load when cloud embeddings are configured.

| Pattern | Trigger | Guard | Trip signal |
| --- | --- | --- | --- |
| @codebase context expansion | Large repos, high nRetrievedFiles, complex queries | CodebaseContextGuard | Retrieval token ceiling; drops lowest-ranked chunks |
| Autocomplete model amplification | Expensive model in tabAutocompleteModel or fallback | AutocompleteBudgetGuard | Model warning at config load; hourly trigger or daily token ceiling |
| Agent re-evaluation loop | Tool failures, missing dependencies, permission errors | AgentIterationGuard | Max iterations, context ceiling, or consecutive failures |
| @docs crawl storm | Cloud embeddingsProvider + large docs sites or CI reindex | DocsIndexGuard | Per-source or total page ceiling; embedding token budget |

Frequently asked questions

Does Continue.dev have a built-in spend ceiling?

Continue.dev does not expose a per-session, per-day, or per-project spend ceiling in its configuration. It routes all LLM calls through the provider and model you specify in config.json at whatever rate that provider charges. The maxTokens setting caps the length of a single response and contextLength sets the context window size; neither limits cumulative spend. You are responsible for tracking and capping spend, either at the provider level (OpenAI usage limits, Anthropic spend limits) or via instrumentation at the API call layer as shown in this post.

How do I check what model Continue.dev is using for autocomplete vs chat?

In your ~/.continue/config.json, look for the tabAutocompleteModel key. If it is absent or null, Continue.dev falls back to the first model in the models array — typically your chat model. The AutocompleteBudgetGuard.check_model() function in this post can be run against that model name at IDE startup to detect expensive fallbacks before any tokens are spent. The safest configuration is to explicitly set tabAutocompleteModel to a local Ollama model (zero API cost) or a low-cost dedicated completions model like Codestral.

Does the @codebase guard need access to the embedding index?

No. The CodebaseContextGuard operates on the already-retrieved chunk list — the plain text strings that Continue.dev's retrieval pipeline returns after similarity search. It counts tokens in those strings and prunes the list before the chunks are sent to the LLM. The guard does not interact with the LanceDB index or the embedding model; it operates purely on the text that would otherwise be included in the LLM context payload.

How does the AgentIterationGuard get the context token count?

Most LLM provider APIs report input token counts with each response: Anthropic returns usage.input_tokens in the response body, and OpenAI returns usage.prompt_tokens (in the final usage chunk when streaming with usage reporting enabled). Continue.dev's underlying API client receives this value in the response metadata. The guard takes this per-call input token count, which includes all prior history, as the context_tokens parameter in record_step(). If the provider does not return token counts in real time, you can estimate by counting tokens in the assembled context using tiktoken (as shown in the CodebaseContextGuard) before each call.
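Since providers name the input-token field differently, a small normalizing helper keeps the guard call site uniform. This is a sketch under assumptions: context_tokens_from_usage is a hypothetical name, the usage argument is the provider's usage object already converted to a plain dict, and the four-characters-per-token fallback is only a rough estimate.

```python
from typing import Mapping, Optional


def context_tokens_from_usage(
    usage: Optional[Mapping[str, object]],
    assembled_context: str = "",
) -> int:
    """
    Return the input token count for one LLM call.

    Checks the Anthropic-style key ("input_tokens") and the OpenAI-style key
    ("prompt_tokens"); falls back to a character-based estimate when neither
    is present.
    """
    if usage:
        for key in ("input_tokens", "prompt_tokens"):
            value = usage.get(key)
            if isinstance(value, int) and not isinstance(value, bool):
                return value
    # Rough fallback: about 4 characters per token.
    return len(assembled_context) // 4


print(context_tokens_from_usage({"input_tokens": 1_200, "output_tokens": 50}))
print(context_tokens_from_usage({"prompt_tokens": 900, "completion_tokens": 40}))
print(context_tokens_from_usage(None, assembled_context="x" * 400))
```

The returned value can be passed directly as context_tokens to record_step(), so the guard behaves the same whichever provider is configured.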

Is @docs crawl cost a problem if I use the default local Transformers.js embeddings?

No. The @docs crawl storm pattern only applies when a cloud embeddingsProvider is configured in config.json. With the default local Transformers.js provider, all embedding calls run in-process on the developer's machine with no API cost. The tradeoff is retrieval quality — cloud embedding models typically outperform the bundled local model on semantic similarity tasks. If you upgrade to cloud embeddings for quality reasons, add the DocsIndexGuard at the same time to contain the crawl scope before the first indexing run.

RunGuard adds the ceiling Continue.dev doesn't ship with

The guards in this post implement the same patterns as RunGuard's SDK: context window budget enforcement, trigger-rate ceilings, iteration counting, and consecutive-failure detection. RunGuard wraps your existing LLM API calls — you configure ceilings in one place, and every @codebase query, autocomplete trigger, agent step, and docs embed inherits them automatically.
