Cohere Command R+ Agent Cost Control: Tool Loop Runaway, Chat History Accumulation, Document Injection, and Rerank Amplification

Command R+ occupies a distinct architectural position among large language models: it was built from the start for retrieval-augmented generation and enterprise tool use, not retrofitted from a pure text model. The co.chat() API reflects this with first-class support for tools, documents, chat_history, and the separate co.rerank() endpoint — each a production primitive for building agentic pipelines. For teams building research agents, enterprise search assistants, and multi-step data analysis pipelines, the Cohere stack offers a compelling combination of model quality and RAG-native infrastructure that few competitors match.

The same RAG-native design introduces four cost failure modes that are specific to how Cohere manages conversation state and retrieval. The tool-use loop ends only when the model stops requesting tools and returns a final answer; without an application-layer step ceiling, a broad research query can drive 25–40 sequential tool calls before the model decides it has enough information. Chat history must be passed explicitly on every co.chat() call, and tool results (often hundreds of tokens of structured data) accumulate in that history, so each call's input context grows linearly with the turn count and the total input tokens billed across the run grow quadratically. When agents fetch and inject documents for RAG across multiple tool-call steps, the injected context grows without bound: the fifth step in a research loop sends all documents accumulated from steps one through five, not just the latest batch. The co.rerank() endpoint is metered separately from chat; used naively inside an agentic tool loop, it turns a single user request into dozens of billable rerank operations across the same document set.

Four failure modes specific to Cohere Command R+ agentic pipelines:

  1. Tool-use loop without step ceiling: co.chat(tools=[...]) returns tool calls (response.tool_calls; the v2 API also reports finish_reason="TOOL_CALL") when the model wants to invoke a tool. Application code must loop, execute tools, and call co.chat() again with the results. There is no platform-enforced step limit; a research agent given a broad question will call tools iteratively until the model judges its information sufficient, which can take 25–40 rounds on under-specified tasks.
  2. Chat history quadratic growth: The chat_history parameter injects the full prior conversation into every co.chat() call. Tool calls and their results are stored as CHATBOT and TOOL role messages respectively. A 20-turn agentic session with 10 tool results averaging 400 tokens each accumulates 4,000 tokens of tool-result history, which can double the effective input context of every subsequent call in the session.
  3. Document injection accumulation across RAG steps: The documents= parameter injects retrieved snippets as grounding context for the model's response. In an iterative research loop where each tool call fetches new documents and appends them to the injection list, the sixth step sends all documents from steps one through six. A 10-step agent fetching 5 documents per step passes 50 document chunks (potentially 25,000 tokens) to the final synthesis call, even when most of those chunks are no longer relevant.
  4. Rerank-in-loop billing amplification: co.rerank() is metered separately from chat, and every call is billable. Calling it inside a tool loop to re-score retrieval results on every iteration multiplies rerank costs with step count: 40 documents × 15 tool-call steps = 600 document scorings in a single agent run, most of them re-scoring a pool that has barely changed.

Failure mode 1: Tool-use loop without step ceiling

Cohere's tool-use protocol is a request-response loop. The model either returns a final text response or a set of tool calls in response.tool_calls (the v2 API also signals this with finish_reason="TOOL_CALL"). When it returns tool calls, the application executes the requested tools, formats the results, and calls co.chat() again with the results supplied as tool results, either through the tool_results parameter or as a TOOL entry in chat_history. This continues until the model produces a complete response. The Cohere platform does not limit the number of iterations; that contract is left entirely to the application.
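
The code samples in this article call RESEARCH_TOOLS and execute_tool without defining them. A minimal sketch of what they might look like follows, assuming the v1 Chat API tool schema (name, description, parameter_definitions). The web_search tool, its stub handler, and the TOOL_HANDLERS registry are hypothetical placeholders, not part of the Cohere SDK.

```python
# Hypothetical tool schema in the shape the v1 co.chat(tools=...) parameter expects.
RESEARCH_TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web and return short result snippets.",
        "parameter_definitions": {
            "query": {
                "description": "Search query text.",
                "type": "str",
                "required": True,
            },
        },
    },
]


def _web_search(query: str) -> str:
    # Stub: a real handler would call a search backend here.
    return f"(no results for {query!r}; stub handler)"


# Registry mapping tool names to Python callables (illustrative).
TOOL_HANDLERS = {"web_search": _web_search}


def execute_tool(name: str, parameters: dict) -> str:
    """Dispatch one model-requested tool call; failures come back as error strings."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"error: unknown tool {name!r}"
    try:
        return handler(**parameters)
    except TypeError as exc:  # wrong or missing arguments from the model
        return f"error: bad parameters for {name}: {exc}"
```

Returning errors as strings rather than raising keeps the loop alive: the model sees the failure as a tool result and can correct its next call, while the step ceiling still bounds how many attempts it gets.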

The failure mode emerges with under-specified tasks. A task like "research the competitive landscape for enterprise document processing, including pricing, API availability, recent funding rounds, and customer reviews" gives the model a termination condition that is genuinely difficult to satisfy quickly. Command R+ will call a web search tool, read the results, call it again with a refined query, read again, call a different tool for pricing data, read, call again — accumulating information until the model judges the research sufficiently complete. On broad queries, this judgment can arrive after 25–40 iterations. At $0.003 per 1K input tokens and growing context with every step, the final synthesis call alone can cost more than the entire run was budgeted for.
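
A rough back-of-the-envelope model shows how the input bill builds up. It assumes the $0.003 per 1K input-token rate quoted above, a fixed base prompt, and a constant number of tokens appended to the context per step. The function name and the 1,500 and 1,200 token figures are illustrative assumptions, not measurements, and output tokens are ignored.

```python
def estimate_loop_input_cost(
    steps: int,
    base_tokens: int = 1_500,            # preamble + tool schemas + user query (assumed)
    tokens_added_per_step: int = 1_200,  # tool call + tool result per step (assumed)
    usd_per_1k_input: float = 0.003,     # rate quoted in the text above
) -> tuple[int, float]:
    """Return (total input tokens, total input cost in USD) for a tool loop.

    Call k (1-indexed) re-sends the base prompt plus everything appended by the
    previous k-1 steps, so the total grows with the square of the step count.
    """
    total_tokens = sum(
        base_tokens + tokens_added_per_step * (k - 1) for k in range(1, steps + 1)
    )
    return total_tokens, total_tokens / 1000 * usd_per_1k_input


for n in (12, 40):
    tokens, usd = estimate_loop_input_cost(n)
    print(f"{n} steps: {tokens:,} input tokens, about ${usd:.2f}")
```

Under these assumptions a 40-step run consumes roughly ten times the input tokens of a 12-step run, even though it takes only about three times as many steps.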

Python — no step ceiling (risky)
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def research_agent_unsafe(query: str) -> str:
    chat_history = []
    message = query
    preamble = "You are a research assistant. Use the provided tools to gather comprehensive information before answering."

    while True:
        response = co.chat(
            model="command-r-plus",
            message=message,
            preamble=preamble,
            tools=RESEARCH_TOOLS,
            chat_history=chat_history,
        )

        if response.tool_calls:  # non-empty whenever the model is requesting tools
            # Append user message and model tool calls to history
            chat_history.append({"role": "USER", "message": message})
            chat_history.append({
                "role": "CHATBOT",
                "message": response.text or "",
                "tool_calls": response.tool_calls,
            })
            # Execute tools and collect results
            tool_results = []
            for tool_call in response.tool_calls:
                output = execute_tool(tool_call.name, tool_call.parameters)
                tool_results.append({"call": tool_call, "outputs": [{"result": output}]})

            # Next iteration: empty message, tool results injected
            message = ""
            chat_history.append({"role": "TOOL", "tool_results": tool_results})

        else:
            # Model returned COMPLETE — but after how many steps?
            return response.text

The fix is a step counter that trips a circuit breaker after a configurable maximum and forces the model to synthesize from what it has collected so far. The guard injects a synthetic stop instruction as a tool result, steering the model toward COMPLETE without an abrupt API error. This preserves partial results — the agent produces a less-thorough but coherent answer rather than failing with an exception.

Python — CohereStepGuard
import os
from dataclasses import dataclass, field
from typing import Optional
import cohere

@dataclass
class CohereStepGuard:
    max_steps: int = 12
    alert_at: int = 8
    _step_count: int = field(default=0, init=False)

    def reset(self):
        self._step_count = 0

    def check(self) -> tuple[bool, Optional[str]]:
        """Returns (allow_continue, synthetic_stop_message_or_None)."""
        self._step_count += 1
        if self._step_count >= self.max_steps:
            print(
                f"[RunGuard] CIRCUIT BREAKER — step count {self._step_count} "
                f"reached max_steps {self.max_steps}. Forcing synthesis."
            )
            return False, (
                "Research step limit reached. Synthesize a complete answer from "
                "the information gathered so far. Do not call any more tools."
            )
        if self._step_count >= self.alert_at:
            print(f"[RunGuard] ALERT — step {self._step_count} of max {self.max_steps}")
        return True, None


def research_agent_guarded(query: str) -> str:
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    guard = CohereStepGuard(max_steps=12, alert_at=8)
    chat_history = []
    message = query
    preamble = "You are a research assistant. Use the provided tools to gather information."

    while True:
        response = co.chat(
            model="command-r-plus",
            message=message,
            preamble=preamble,
            tools=RESEARCH_TOOLS,
            chat_history=chat_history,
        )

        if response.tool_calls:  # non-empty whenever the model is requesting tools
            allow, stop_msg = guard.check()
            # Append assistant tool calls to history
            chat_history.append({"role": "USER", "message": message})
            chat_history.append({
                "role": "CHATBOT",
                "message": response.text or "",
                "tool_calls": response.tool_calls,
            })

            if not allow:
                # Inject a synthetic stop result for every requested call, so no
                # tool call is left unanswered before the final synthesis
                chat_history.append({
                    "role": "TOOL",
                    "tool_results": [
                        {"call": tc, "outputs": [{"result": stop_msg}]}
                        for tc in response.tool_calls
                    ],
                })
                message = ""
                # Force a final completion call with no tools offered
                final = co.chat(
                    model="command-r-plus",
                    message=message,
                    preamble=preamble,
                    chat_history=chat_history,
                    # No tools= parameter — model cannot loop further
                )
                return final.text

            tool_results = []
            for tool_call in response.tool_calls:
                output = execute_tool(tool_call.name, tool_call.parameters)
                tool_results.append({"call": tool_call, "outputs": [{"result": output}]})
            message = ""
            chat_history.append({"role": "TOOL", "tool_results": tool_results})

        else:
            guard.reset()
            return response.text

Failure mode 2: Chat history quadratic growth

Cohere's co.chat() API is stateless — the platform does not retain conversation context between calls. Every call must supply the full conversation via the chat_history parameter. This is the correct design for reproducibility and session isolation, but it means the application is responsible for managing the size of that history. In agentic pipelines, this responsibility is easy to overlook: the natural pattern is to append every message and tool result to a list and pass the growing list on every call.

The cost compounds with tool results. A tool call that searches the web might return 800 tokens of structured JSON. A database query result might return a 1,200-token table. Each of these lands in chat_history as a TOOL role message. By step 10 of a research agent, the history includes 10 user turns, 10 chatbot turns (each with tool call metadata), and 10 tool result turns. If those 30 entries average 600 tokens each, that is approximately 18,000 tokens of history overhead injected before the user's current question even appears. Every subsequent step pays the same overhead, and step 11 pays even more.

Step   History entries   Estimated history tokens   Cost multiplier vs. step 1
1      0                 0                          1× (baseline)
3      6                 ~2,400                     ~1.5×
5      12                ~5,200                     ~2.3×
10     27                ~14,400                    ~4.8×
20     57                ~37,200                    ~11.4×
30     87                ~62,400                    ~17.2×

The fix is a BoundedChatHistory class that maintains the history list but trims it before each call — keeping a configurable window of recent messages and a token-based ceiling. For sessions where full history is semantically important, summarize the evicted portion into a single compressed context message rather than discarding it outright.

Python — BoundedChatHistory
from dataclasses import dataclass, field
from typing import Any
import cohere

def estimate_tokens(text: str) -> int:
    """Rough estimate: 1 token ≈ 4 characters."""
    return max(1, len(str(text)) // 4)


def history_token_count(history: list[dict]) -> int:
    total = 0
    for entry in history:
        total += estimate_tokens(entry.get("message", ""))
        for tr in entry.get("tool_results", []):
            for out in tr.get("outputs", []):
                total += estimate_tokens(out.get("result", ""))
        for tc in entry.get("tool_calls", []):
            total += estimate_tokens(str(tc))
    return total


@dataclass
class BoundedChatHistory:
    max_turns: int = 20          # hard turn cap (each turn = 3 history entries)
    max_tokens: int = 12_000     # token ceiling before trimming
    compress_evicted: bool = True

    _history: list[dict] = field(default_factory=list, init=False)

    def append(self, entry: dict):
        self._history.append(entry)

    def get(self, co_client: cohere.Client, model: str) -> list[dict]:
        """Return a trimmed history safe to pass to co.chat()."""
        # Trim to max_turns (each turn = USER + CHATBOT + TOOL = 3 entries)
        max_entries = self.max_turns * 3
        if len(self._history) > max_entries:
            evicted = self._history[:-max_entries]
            self._history = self._history[-max_entries:]
            if self.compress_evicted:
                summary = self._compress(evicted, co_client, model)
                self._history.insert(0, {
                    "role": "CHATBOT",
                    "message": f"[Prior session summary] {summary}",
                })

        # Token ceiling: evict the oldest turn (3 entries) at a time, leaving any
        # existing summary entry in place so it is not re-compressed or lost
        while history_token_count(self._history) > self.max_tokens and len(self._history) > 3:
            has_summary = self._history[0].get("message", "").startswith("[Prior session summary]")
            start = 1 if has_summary else 0
            evicted = self._history[start:start + 3]
            del self._history[start:start + 3]
            if self.compress_evicted:
                summary = self._compress(evicted, co_client, model)
                if has_summary:
                    self._history[0]["message"] += " " + summary
                else:
                    self._history.insert(0, {"role": "CHATBOT", "message": f"[Prior session summary] {summary}"})

        return list(self._history)

    def _compress(self, entries: list[dict], co_client: cohere.Client, model: str) -> str:
        content = " ".join(e.get("message", "") or str(e.get("tool_results", "")) for e in entries)
        if len(content) < 200:
            return content
        resp = co_client.chat(
            model=model,
            message=f"Summarize this agent conversation segment in 2 sentences: {content[:4000]}",
        )
        return resp.text

Failure mode 3: Document injection accumulation across RAG steps

Command R+ was designed for RAG: the documents= parameter accepts a list of retrieved snippets which the model uses as grounding context for its response. The model's training optimizes for synthesizing from injected documents rather than relying on parametric memory, which means it actively uses the content you inject. This is the feature. The cost failure mode is what happens when documents are accumulated across tool-call steps in an agentic loop.

The pattern is common in iterative research agents: on each tool call step, the agent fetches new search results or database rows and adds them to a growing documents list. The intent is that the model should have all the evidence it has gathered available for the final synthesis. The cost consequence is that each subsequent co.chat() call injects the entire accumulated document set — not just the new documents from the current step. A 10-step research agent that fetches 5 documents per step (each document averaging 500 tokens) sends 5 documents on step 1, 10 on step 2, 15 on step 3, and 50 on step 10. The final synthesis call injects 25,000 tokens of document context before a single token of actual query appears.

The quadratic pattern here is worse than history accumulation because documents are typically longer and denser than conversational turns, and Cohere's RAG grounding mechanism uses all injected documents to generate citations — which means the model attends to more of the injected content per token than a standard context injection. The economic penalty scales super-linearly with step count.
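
The arithmetic for the 10-step example above can be checked with a few lines. This sketch follows the counting used in the text (the call at step k carries the documents from steps 1 through k); the function name is illustrative.

```python
def injected_doc_tokens(
    steps: int, docs_per_step: int = 5, tokens_per_doc: int = 500
) -> list[int]:
    """Document tokens sent on each call when every fetched batch is kept."""
    return [k * docs_per_step * tokens_per_doc for k in range(1, steps + 1)]


per_call = injected_doc_tokens(10)
print("final call:", per_call[-1])  # 50 documents * 500 tokens
print("whole run: ", sum(per_call))  # every call in the run added together
```

The final call carries 25,000 document tokens, but the run as a whole is billed for 137,500, because each earlier call also paid for the documents accumulated up to that point.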

Python — document accumulation (risky)
def iterative_research_unsafe(query: str) -> str:
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    all_documents = []   # grows on every tool call step
    chat_history = []
    message = query

    while True:
        response = co.chat(
            model="command-r-plus",
            message=message,
            tools=SEARCH_TOOLS,
            documents=all_documents,   # PROBLEM: grows unboundedly
            chat_history=chat_history,
        )
        if response.tool_calls:  # non-empty whenever the model is requesting tools
            chat_history.append({"role": "USER", "message": message})
            chat_history.append({"role": "CHATBOT", "message": "", "tool_calls": response.tool_calls})
            tool_results = []
            for tc in response.tool_calls:
                docs = fetch_documents(tc.name, tc.parameters)
                all_documents.extend(docs)   # uncapped accumulation
                tool_results.append({"call": tc, "outputs": [{"result": f"Fetched {len(docs)} documents"}]})
            message = ""
            chat_history.append({"role": "TOOL", "tool_results": tool_results})
        else:
            return response.text
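
The fix is a DocumentBudget that sits between retrieval and injection. It drops documents it has already seen, caps the number of retained documents, and enforces a token budget across the whole set, so the documents= list stays bounded no matter how many steps the loop runs.

Python — DocumentBudget guard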
from dataclasses import dataclass, field
from typing import Any
import hashlib

@dataclass
class DocumentBudget:
    max_documents: int = 20        # hard cap on injected document count
    max_doc_tokens: int = 10_000   # token budget for all injected docs
    dedup: bool = True             # drop documents whose first 200 chars (snippet + title) repeat

    _seen_hashes: set[str] = field(default_factory=set, init=False)
    _retained: list[dict] = field(default_factory=list, init=False)

    def _doc_hash(self, doc: dict) -> str:
        text = doc.get("snippet", "") + doc.get("title", "")
        return hashlib.md5(text[:200].encode()).hexdigest()

    def _doc_tokens(self, doc: dict) -> int:
        return estimate_tokens(doc.get("snippet", "") + doc.get("title", ""))

    def add(self, new_docs: list[dict]) -> int:
        """Add documents, applying dedup and cap. Returns count actually added."""
        added = 0
        for doc in new_docs:
            if self.dedup:
                h = self._doc_hash(doc)
                if h in self._seen_hashes:
                    continue
                self._seen_hashes.add(h)

            current_tokens = sum(self._doc_tokens(d) for d in self._retained)
            doc_tok = self._doc_tokens(doc)

            if len(self._retained) >= self.max_documents:
                print(
                    f"[RunGuard] DOCUMENT CAP — {self.max_documents} documents "
                    f"reached. Dropping new document: {doc.get('title', 'untitled')[:40]}"
                )
                break

            if current_tokens + doc_tok > self.max_doc_tokens:
                print(
                    f"[RunGuard] DOCUMENT TOKEN BUDGET — {current_tokens} + {doc_tok} "
                    f"exceeds {self.max_doc_tokens}. Dropping document."
                )
                continue  # skip only this document; a smaller later one may still fit

            self._retained.append(doc)
            added += 1
        return added

    def get(self) -> list[dict]:
        return list(self._retained)


def iterative_research_guarded(query: str) -> str:
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    doc_budget = DocumentBudget(max_documents=20, max_doc_tokens=10_000)
    step_guard = CohereStepGuard(max_steps=12)
    chat_history = []
    message = query

    while True:
        response = co.chat(
            model="command-r-plus",
            message=message,
            tools=SEARCH_TOOLS,
            documents=doc_budget.get(),   # capped, deduped document set
            chat_history=chat_history,
        )
        if response.tool_calls:  # non-empty whenever the model is requesting tools
            allow, stop_msg = step_guard.check()
            chat_history.append({"role": "USER", "message": message})
            chat_history.append({"role": "CHATBOT", "message": "", "tool_calls": response.tool_calls})
            tool_results = []
            for tc in response.tool_calls:
                docs = fetch_documents(tc.name, tc.parameters)
                added = doc_budget.add(docs)
                tool_results.append({
                    "call": tc,
                    "outputs": [{"result": f"Fetched {len(docs)} documents; {added} added to budget"}],
                })
            message = ""
            chat_history.append({"role": "TOOL", "tool_results": tool_results})
            if not allow:
                final = co.chat(
                    model="command-r-plus",
                    message="Synthesize a complete answer from the research so far.",
                    documents=doc_budget.get(),
                    chat_history=chat_history,
                )
                return final.text
        else:
            return response.text

Failure mode 4: Rerank-in-loop billing amplification

Cohere's co.rerank() endpoint is a separate billing meter from the chat model. It accepts a query and a list of documents and returns them scored by semantic relevance. In isolation it is inexpensive — typically fractions of a cent per call. Inside an agentic tool loop, it is an independent multiplier on top of the LLM generation costs and any retrieval costs.

The failure mode follows a logical implementation pattern: on each tool-call step, the agent fetches candidate documents from a vector store, reranks them to select the top-K for injection, and proceeds. This seems sensible, since reranking before injection improves relevance and reduces document injection costs by filtering noise. The billing consequence is that a 15-step research agent that reranks a 40-document pool on every step makes 15 separate billable co.rerank() calls, each scoring 40 documents. That is 600 document scorings from a pool that barely changes between steps: the same 40 candidates are being re-scored 15 times with slightly evolved queries. How this translates into dollars depends on how rerank is metered on your Cohere plan (per search or per document, and how long documents are chunked), so check current pricing; either way, nearly all of that spend is redundant.

There are two distinct guards. First, a rerank call cache: if the query is semantically similar to a recent rerank call (above a Jaccard or cosine threshold), return the cached ranking rather than calling the API again. Second, a call-rate circuit breaker: if co.rerank() has been called more than N times in the current agent run, block further calls and fall back to the existing ranking. The rerank result on step 3 is very likely still valid for step 5 if the document pool has not changed.
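
To get a feel for how a word-overlap threshold behaves, here is a small standalone check. It assumes the 0.75 threshold used in this article; the example queries are made up. Note that Jaccard over word sets measures lexical overlap, not meaning, so an embedding-based cosine score would be the closer match to "semantic" similarity.

```python
def word_jaccard(a: str, b: str) -> float:
    """Word-set overlap between two queries: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


THRESHOLD = 0.75  # same value as the guard's default in this article
previous = "enterprise document processing pricing"

for query in (
    "enterprise document processing pricing 2024",  # shares 4 of 5 distinct words
    "recent funding rounds for OCR vendors",         # shares none
):
    score = word_jaccard(query, previous)
    decision = "reuse cached ranking" if score >= THRESHOLD else "call rerank"
    print(f"{score:.2f} -> {decision}: {query}")
```

A lightly refined query clears the threshold and reuses the cached ranking, while a query on a new topic falls to zero overlap and triggers a fresh rerank.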

Python — RerankGuard
import hashlib
from dataclasses import dataclass, field
from typing import Optional
import cohere

def _query_hash(query: str) -> str:
    return hashlib.md5(query.lower().strip().encode()).hexdigest()

def _jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


@dataclass
class RerankGuard:
    max_calls_per_run: int = 5       # circuit breaker ceiling
    similarity_threshold: float = 0.75  # cache hit if query is similar
    model: str = "rerank-english-v3.0"

    _calls_this_run: int = field(default=0, init=False)
    _cache: dict[str, list] = field(default_factory=dict, init=False)
    _last_query: Optional[str] = field(default=None, init=False)
    _last_result: Optional[list] = field(default=None, init=False)

    def reset(self):
        self._calls_this_run = 0
        self._cache.clear()
        self._last_query = None
        self._last_result = None

    def rerank(
        self,
        co_client: cohere.Client,
        query: str,
        documents: list[dict],
        top_n: int = 5,
    ) -> list[dict]:
        """Rerank documents with caching and circuit breaker."""
        # Check semantic cache
        if self._last_query is not None:
            similarity = _jaccard(query, self._last_query)
            if similarity >= self.similarity_threshold:
                print(
                    f"[RunGuard] RERANK CACHE HIT — query similarity {similarity:.2f} "
                    f">= {self.similarity_threshold}. Reusing prior ranking."
                )
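                # Reuse the prior ranking, keeping only documents still in the current pool
                prior = [d for d in (self._last_result or []) if d in documents]
                return (prior or documents)[:top_n]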

        # Circuit breaker
        if self._calls_this_run >= self.max_calls_per_run:
            print(
                f"[RunGuard] RERANK CIRCUIT BREAKER — {self._calls_this_run} rerank "
                f"calls in this run >= max {self.max_calls_per_run}. "
                f"Returning prior ranking."
            )
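            # Fall back to the prior ranking, keeping only documents still in the current pool
            prior = [d for d in (self._last_result or []) if d in documents]
            return (prior or documents)[:top_n]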

        # Execute rerank
        self._calls_this_run += 1
        doc_texts = [d.get("snippet", d.get("text", str(d))) for d in documents]
        result = co_client.rerank(
            model=self.model,
            query=query,
            documents=doc_texts,
            top_n=top_n,
        )
        # Map back to original document dicts in ranked order
        ranked_docs = [documents[r.index] for r in result.results]
        self._last_query = query
        self._last_result = ranked_docs
        return ranked_docs


# Usage in a tool handler
def search_and_rerank(query: str, rerank_guard: RerankGuard, co_client: cohere.Client) -> list[dict]:
    candidates = vector_store.search(query, top_k=40)  # fetch broad candidate set
    ranked = rerank_guard.rerank(co_client, query, candidates, top_n=5)
    return ranked

Composite: CohereAgentPolicy

The four guards compose naturally because they intercept at different points in the agent loop: step guard on every tool call, history guard before each co.chat() call, document budget on every document fetch, and rerank guard on every rerank call. A single CohereAgentPolicy object coordinates all four from a single configuration.

Python — CohereAgentPolicy composite
import os
from dataclasses import dataclass, field, replace
import cohere

@dataclass
class CohereAgentPolicy:
    step_guard: CohereStepGuard = field(default_factory=lambda: CohereStepGuard(max_steps=12, alert_at=8))
    history: BoundedChatHistory = field(default_factory=lambda: BoundedChatHistory(max_turns=20, max_tokens=12_000))
    doc_budget: DocumentBudget = field(default_factory=lambda: DocumentBudget(max_documents=20, max_doc_tokens=10_000))
    rerank_guard: RerankGuard = field(default_factory=lambda: RerankGuard(max_calls_per_run=5))

    def reset(self):
        # replace() rebuilds each guard with its configured limits and fresh
        # internal state; constructing with defaults would discard the caller's settings
        self.step_guard = replace(self.step_guard)
        self.history = replace(self.history)
        self.doc_budget = replace(self.doc_budget)
        self.rerank_guard.reset()


def run_cohere_agent(query: str, policy: CohereAgentPolicy) -> str:
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    policy.reset()
    message = query

    while True:
        safe_history = policy.history.get(co, "command-r-plus")
        response = co.chat(
            model="command-r-plus",
            message=message,
            tools=RESEARCH_TOOLS,
            documents=policy.doc_budget.get(),
            chat_history=safe_history,
        )

        if response.tool_calls:  # non-empty whenever the model is requesting tools
            allow, stop_msg = policy.step_guard.check()
            policy.history.append({"role": "USER", "message": message})
            policy.history.append({
                "role": "CHATBOT",
                "message": response.text or "",
                "tool_calls": response.tool_calls,
            })

            tool_results = []
            for tc in response.tool_calls:
                if tc.name == "search":
                    candidates = fetch_raw_candidates(tc.parameters)
                    query_text = tc.parameters.get("query", message)
                    ranked = policy.rerank_guard.rerank(co, query_text, candidates, top_n=5)
                    added = policy.doc_budget.add(ranked)
                    tool_results.append({
                        "call": tc,
                        "outputs": [{"result": f"Retrieved and ranked {len(ranked)} documents; {added} added."}],
                    })
                else:
                    output = execute_tool(tc.name, tc.parameters)
                    tool_results.append({"call": tc, "outputs": [{"result": output}]})

            policy.history.append({"role": "TOOL", "tool_results": tool_results})
            message = ""

            if not allow:
                final = co.chat(
                    model="command-r-plus",
                    message="Please synthesize a complete answer from the research gathered so far.",
                    documents=policy.doc_budget.get(),
                    chat_history=policy.history.get(co, "command-r-plus"),
                )
                return final.text
        else:
            return response.text


# Example usage
policy = CohereAgentPolicy(
    step_guard=CohereStepGuard(max_steps=10, alert_at=7),
    history=BoundedChatHistory(max_turns=15, max_tokens=8_000),
    doc_budget=DocumentBudget(max_documents=15, max_doc_tokens=8_000),
    rerank_guard=RerankGuard(max_calls_per_run=4),
)
result = run_cohere_agent("Research the top 5 enterprise document processing vendors", policy)

Cohere-specific nuance: The documents= and chat_history= parameters are additive but billed together as input tokens. When both are large, the total input token count reflects documents + history + system preamble + current message. Profiling each component separately (with response.meta.billed_units) shows which is the dominant cost driver — in most research agents past step 5, document injection overtakes history as the largest input cost contributor.
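
One way to profile the components before a call is to estimate each one separately and compare the total with what the API reports afterwards. The sketch below assumes the same rough 4-characters-per-token heuristic used elsewhere in this article; the function names are illustrative, and the estimate is not a substitute for the billed figures in response.meta.billed_units.

```python
import json


def approx_tokens(value) -> int:
    """Rough size estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    text = value if isinstance(value, str) else json.dumps(value, default=str)
    return len(text) // 4


def input_breakdown(preamble: str, message: str, chat_history: list, documents: list) -> dict:
    """Estimate how many input tokens each request component contributes."""
    parts = {
        "preamble": approx_tokens(preamble),
        "message": approx_tokens(message),
        "chat_history": sum(approx_tokens(entry) for entry in chat_history),
        "documents": sum(approx_tokens(doc) for doc in documents),
    }
    # Name of the largest component, decided before the key itself is added
    parts["dominant"] = max(
        ("preamble", "message", "chat_history", "documents"), key=parts.get
    )
    return parts


breakdown = input_breakdown(
    preamble="You are a research assistant.",
    message="Summarize the findings.",
    chat_history=[{"role": "USER", "message": "x" * 400}],
    documents=[{"title": "Doc", "snippet": "y" * 4000}],
)
print(breakdown)
```

Logging this breakdown on every step makes it visible when the dominant component shifts, which is the signal to tighten the corresponding guard.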

FAQ

Does Cohere's connectors API (built-in web search) have the same document injection problem?

Yes, and it compounds it. When you pass connectors=[{"id": "web-search"}], Cohere fetches and injects search results automatically without exposing the document list to your application — which means DocumentBudget cannot inspect or cap the injected content. For production agentic workloads where cost control matters, use the tools= pattern with your own search tool instead: this gives you explicit control over what gets fetched, how many documents are injected, and whether deduplication applies. The connectors API is convenient for single-shot queries but becomes a cost black box in iterative agent loops.

How does Cohere's preamble parameter affect token costs compared to other providers' system prompts?

The preamble is included in every co.chat() call and billed as input tokens, identical to OpenAI's system message. In agentic pipelines, the per-call overhead of a verbose preamble is easily overlooked because it is not part of the accumulating history — it is a fixed cost per call, not a growing one. A 500-token preamble on a 20-step agent run adds 10,000 tokens of fixed overhead that appear in no history or document count. Audit your preamble length as part of cost profiling: for most research agents, the preamble can be reduced to two or three sentences without measurable quality impact.

The BoundedChatHistory uses a compression call to summarize evicted history. Doesn't that add cost?

Yes — compression adds one cheap chat call per eviction event. For a well-configured guard, eviction happens once or twice per session (when the turn count or token ceiling is hit). The compression call costs a fraction of what it saves: evicting 3,000 tokens of history overhead that would otherwise have been re-injected on every subsequent call is worth a one-time 500-token summarization. The break-even point is reached by the second call after eviction. For agents where the evicted history is genuinely irrelevant to the remaining task, set compress_evicted=False to skip compression entirely and accept the semantic discontinuity in exchange for a marginally simpler implementation.
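
The break-even claim can be sanity-checked with a small calculation. This is a sketch under stated assumptions: costs are expressed in input-token equivalents, the summarizer's input is the evicted text it is shown, and output tokens are weighted five times an input token (an assumed ratio; check current pricing). The function name and example figures are illustrative.

```python
import math
from typing import Optional


def breakeven_calls(
    evicted_tokens: int,           # history removed from every later call
    summary_tokens: int,           # summary that replaces it on every later call
    summarizer_input_tokens: int,  # text sent to the one-time summarization call
    output_weight: float = 5.0,    # assumed output/input price ratio
) -> Optional[int]:
    """Number of later calls needed before compression pays for itself.

    Returns None when the summary is no smaller than what it replaces,
    in which case compression never breaks even.
    """
    one_time_cost = summarizer_input_tokens + summary_tokens * output_weight
    saved_per_call = evicted_tokens - summary_tokens
    if saved_per_call <= 0:
        return None
    return math.ceil(one_time_cost / saved_per_call)


print(breakeven_calls(evicted_tokens=3_000, summary_tokens=100, summarizer_input_tokens=1_000))
print(breakeven_calls(evicted_tokens=600, summary_tokens=500, summarizer_input_tokens=1_000))
```

With a large eviction and a short summary the first subsequent call already recovers the cost; when the summary is nearly as long as what it replaces, break-even moves out to dozens of calls and skipping compression becomes the better choice.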

Can these guards be applied to Cohere's Command A (the enterprise model) or only Command R+?

All four guards are model-agnostic — they instrument the application layer, not the model internals. CohereStepGuard, BoundedChatHistory, DocumentBudget, and RerankGuard work identically with command-a-03-2025, command-r, or any other Cohere model that uses the same co.chat() / co.rerank() API surface. The token cost per unit differs between models — Command A is billed at significantly higher rates than Command R — making the guards proportionally more valuable when using higher-tier models.

How does the rerank cache interact with document changes between steps?

The RerankGuard cache checks query similarity, not document identity. If the document pool changes significantly between steps (new documents fetched) but the query is similar, the cached ranking may reference documents that are no longer in the current pool. A guard should handle this by intersecting the cached ranking with the current document list before returning it. In practice, this edge case is rare: when the document pool changes substantially, the query typically changes too (the agent is exploring a new angle), which will fall below the similarity threshold and trigger a fresh rerank. For agents that frequently rotate the document pool while keeping a stable query, set similarity_threshold above 1.0 (for example 1.01) to disable query-similarity caching and rely solely on the call-count circuit breaker. Jaccard similarity never exceeds 1.0, so such a threshold is never met; a threshold of 0.0 would do the opposite and treat every query after the first as a cache hit.
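
Restricting a cached ranking to the current pool can be sketched in a few lines. This assumes each document dict carries a stable identifier under an "id" key, which is a hypothetical field not used elsewhere in this article; any stable key such as a URL would work the same way.

```python
def restrict_to_pool(
    cached_ranking: list[dict], current_pool: list[dict], key: str = "id"
) -> list[dict]:
    """Keep cached results that still exist in the current pool, preserving rank order."""
    live_ids = {doc[key] for doc in current_pool}
    return [doc for doc in cached_ranking if doc[key] in live_ids]


cached = [{"id": "c"}, {"id": "a"}, {"id": "b"}]  # ranked best-first on an earlier step
pool = [{"id": "a"}, {"id": "c"}, {"id": "d"}]    # "b" was rotated out, "d" is new
print(restrict_to_pool(cached, pool))
```

New documents such as "d" never appear in the filtered result, so if the filtered list comes back much shorter than top_n, that is a reasonable trigger for a fresh rerank instead of a cache hit.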

Stop the next runaway Cohere agent before it bills

RunGuard packages all four guards — step ceiling, history trimming, document budget, rerank rate limiting — as a managed API. No guard logic to maintain across Cohere SDK upgrades, a dashboard that shows every trip across all your apps, configurable thresholds per application. The 14-day free trial requires no credit card.
