Amazon Q Developer is AWS's AI coding assistant, available as an IDE plugin (VS Code, JetBrains, Visual Studio), a CLI extension, and an API. At its core, Q Developer does inline code completion and chat-based code generation — tasks with bounded LLM call counts and predictable token usage. But Q Developer also has four agentic capabilities that operate across multiple files, multiple tool invocations, and multiple LLM round-trips: the /dev agent command, code transformation, workspace context expansion, and security scan with auto-remediation.

Each of these features is designed to save developer time at scale. None of them exposes a per-session spend ceiling to the caller. The result is a familiar gap: Q Developer routes every call through AWS Bedrock models and charges per token, but it never decides that a session has spent enough. An enterprise team running automated Q Developer workflows (code transformation on a 500-file Java codebase, nightly security scans with auto-fix, CI-integrated code review) can accumulate bills that are 10–50× the naive estimate before anyone checks the AWS Cost Explorer dashboard.

This post covers four structural cost amplification patterns specific to Amazon Q Developer's architecture, and a runtime circuit breaker guard for each. The guards instrument Q's API surface without replacing it — you keep the Q Developer features; you add the spend ceilings.

What this post covers: Four cost amplification patterns specific to Amazon Q Developer's agentic capabilities, and a runtime circuit breaker guard for each. Guards work alongside Q Developer via the Q Developer CLI and Bedrock agent APIs; they do not replace the Q features. You keep the workspace intelligence and transformation capabilities; you add the ceilings.

Pattern 1: The /dev Command Plan-Execute-Replan Loop

Amazon Q Developer's /dev command is an agentic task executor. You give it a natural-language task — "add pagination to the user listing endpoint and write unit tests" — and Q Developer decomposes it into a plan: which files to read, which edits to make, which commands to run. It then executes each step sequentially, observes the outcome (test pass/fail, shell command exit code, lint output), and replans if a step fails.

The loop structure is: plan → execute step → observe outcome → replan if failed → execute → observe → replan → ... Each LLM round-trip — one for the initial plan, one per replanning event — bills at full context token rates. The critical detail is what goes into the replanning context: Q includes the original task description, the entire plan so far, all executed steps, and the full stdout/stderr of every failed command. This means each replanning turn is larger than the previous one as error context accumulates.

Cost accumulation per /dev session (estimated):
Turn 1 (initial plan): task description + workspace summary = ~3,000 input tokens
Turn 2 (replan after test failure): turn 1 context + test output = ~8,000 input tokens
Turn 3 (replan after lint failure): turn 2 context + lint output = ~14,000 input tokens
Turn 4 (replan after build failure): turn 3 context + build log = ~22,000 input tokens
Turn 5 (replan after second test failure): turn 4 context + test output = ~31,000 input tokens
Total: ~78,000 input tokens + ~15,000 output tokens across 5 turns (1 initial plan + 4 replans)
At Claude 3.5 Sonnet via Bedrock ($3/$15 per 1M tokens): ~$0.46 per stuck /dev session

A single stuck /dev session costs under a dollar. But teams that automate Q Developer in CI — triggering /dev on every opened PR to scaffold tests or update documentation — can accumulate hundreds of stuck sessions per day on repositories where the code is in a state Q cannot easily fix. At 200 failed sessions per day on a 10-developer team, that is $92/day ($2,760/month) from the /dev command alone, for sessions that produced no merged code.
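The accumulation above can be reproduced with a few lines of arithmetic. The sketch below assumes the per-turn token estimates and the $3/$15 per 1M token prices quoted in this section; the function and variable names are illustrative and are not part of any Q Developer API.

```python
# Illustrative arithmetic only: per-turn figures are the estimates quoted above.
INPUT_PRICE_PER_1M = 3.00    # USD, assumed Claude 3.5 Sonnet input price
OUTPUT_PRICE_PER_1M = 15.00  # USD, assumed Claude 3.5 Sonnet output price

# Input tokens billed on each turn; every replan resends the accumulated context.
TURN_INPUT_TOKENS = [3_000, 8_000, 14_000, 22_000, 31_000]
TOTAL_OUTPUT_TOKENS = 15_000


def session_cost_usd(turn_inputs, total_output):
    """Return the estimated cost of one /dev session in USD."""
    input_cost = sum(turn_inputs) / 1_000_000 * INPUT_PRICE_PER_1M
    output_cost = total_output / 1_000_000 * OUTPUT_PRICE_PER_1M
    return input_cost + output_cost


per_session = session_cost_usd(TURN_INPUT_TOKENS, TOTAL_OUTPUT_TOKENS)
# The text rounds the per-session cost to cents before scaling to 200 sessions/day.
per_day = round(per_session, 2) * 200
per_month = per_day * 30

print(f"input tokens: {sum(TURN_INPUT_TOKENS):,}")
print(f"per session:  ${per_session:.2f}")
print(f"per day:      ${per_day:.2f}")
print(f"per month:    ${per_month:,.2f}")
```

Changing the entries in TURN_INPUT_TOKENS to match your own failed-session logs is probably the quickest way to check whether these estimates apply to your repositories.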

The guard: DevCommandBudgetGuard

The guard wraps Q Developer's agent session, enforces a max-rounds ceiling, and enforces a per-session token budget using Q's usage metadata on each response:

Python
import boto3
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DevCommandBudgetGuard:
    max_rounds: int = 5
    max_input_tokens: int = 40_000
    max_output_tokens: int = 8_000
    _round: int = field(default=0, init=False)
    _total_input: int = field(default=0, init=False)
    _total_output: int = field(default=0, init=False)
    _trip_reason: Optional[str] = field(default=None, init=False)

    @property
    def tripped(self) -> bool:
        return self._trip_reason is not None

    def record_turn(self, usage: dict) -> None:
        self._round += 1
        self._total_input += usage.get("inputTokens", 0)
        self._total_output += usage.get("outputTokens", 0)

        if self._round >= self.max_rounds:
            self._trip_reason = (
                f"max_rounds={self.max_rounds} reached "
                f"({self._total_input} input / {self._total_output} output tokens)"
            )
        elif self._total_input >= self.max_input_tokens:
            self._trip_reason = (
                f"input token budget exhausted: "
                f"{self._total_input} >= {self.max_input_tokens}"
            )
        elif self._total_output >= self.max_output_tokens:
            self._trip_reason = (
                f"output token budget exhausted: "
                f"{self._total_output} >= {self.max_output_tokens}"
            )

    def status(self) -> dict:
        return {
            "round": self._round,
            "input_tokens": self._total_input,
            "output_tokens": self._total_output,
            "tripped": self.tripped,
            "trip_reason": self._trip_reason,
        }


def run_dev_command_with_guard(
    q_client,
    conversation_state: dict,
    task: str,
    guard: DevCommandBudgetGuard,
) -> dict:
    """
    Drives a Q Developer /dev agent session with per-round budget enforcement.
    q_client: boto3 client for 'qbusiness' or the Q Developer SDK wrapper.
    conversation_state: mutable dict holding session context between turns.
    """
    responses = []
    conversation_state.setdefault("messages", [])
    conversation_state["messages"].append({"role": "user", "content": task})

    completed = False
    while not guard.tripped:
        # Build the request without None values; boto3 rejects explicit None parameters.
        request = {
            "applicationId": conversation_state["applicationId"],
            "userId": conversation_state["userId"],
            "conversationId": conversation_state.get("conversationId"),
            "parentMessageId": conversation_state.get("lastMessageId"),
            "clientToken": conversation_state.get("clientToken"),
            "userMessage": task if not responses else None,  # first turn only
        }
        response = q_client.chat_sync(
            **{k: v for k, v in request.items() if v is not None}
        )

        # Assumes token counts are present in the response metadata;
        # missing fields are counted as zero.
        guard.record_turn({
            "inputTokens": response.get("inputTokens", 0),
            "outputTokens": response.get("outputTokens", 0),
        })

        responses.append(response)
        conversation_state["conversationId"] = response.get("conversationId")
        conversation_state["lastMessageId"] = response.get("systemMessageId")

        # Q Developer signals completion when there are no more agent actions
        if response.get("finalResponse") or not response.get("nextAgentAction"):
            completed = True
            break

    return {
        "responses": responses,
        "guard_status": guard.status(),
        # True only if the agent finished on its own, even when the final
        # turn also happened to reach a ceiling.
        "completed": completed,
    }

The guard enforces three independent ceilings — round count, input tokens, and output tokens — so it trips on the first budget exhausted regardless of which one is hit. On trip, the caller receives the partial response list and the trip reason; it can log the trip, alert the developer, and avoid merging any partial code changes from the capped session.
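To see which ceiling trips first, the sketch below replays the turn sizes from the cost table against the guard's default limits. It is a standalone re-implementation of the check order in record_turn, not the class itself, and it assumes the ~15,000 output tokens are spread evenly at 3,000 per turn, which the estimates above do not state.

```python
from typing import List, Optional, Tuple


def first_trip(
    turns: List[Tuple[int, int]],
    max_rounds: int = 5,
    max_input_tokens: int = 40_000,
    max_output_tokens: int = 8_000,
) -> Optional[Tuple[int, str]]:
    """Return (round_number, ceiling_name) for the first ceiling hit, or None.

    Mirrors the check order used by the guard: rounds, then input, then output.
    """
    total_in = total_out = 0
    for round_number, (tokens_in, tokens_out) in enumerate(turns, start=1):
        total_in += tokens_in
        total_out += tokens_out
        if round_number >= max_rounds:
            return round_number, "max_rounds"
        if total_in >= max_input_tokens:
            return round_number, "input"
        if total_out >= max_output_tokens:
            return round_number, "output"
    return None


# Input sizes from the cost table; 3,000 output tokens per turn is an assumption.
stuck_session = [(3_000, 3_000), (8_000, 3_000), (14_000, 3_000),
                 (22_000, 3_000), (31_000, 3_000)]

print(first_trip(stuck_session))                            # (3, 'output')
print(first_trip(stuck_session, max_output_tokens=20_000))  # (4, 'input')
```

Under these assumptions the default output budget is the binding limit, and raising it moves the trip to the input budget one round later. Either way the session stops before the fifth and most expensive turn.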

Pattern 2: Code Transformation File Fan-out

Amazon Q Developer's /transform command (backed by the StartCodeTransformation / GetCodeTransformation API) handles large-scale migrations: Java 8 → Java 17, Spring Boot 2 → Spring Boot 3, legacy test framework upgrades. The feature processes each source file independently via a separate LLM call: read the file, apply transformation rules, generate the updated version.

The fan-out is one LLM call per file, per transformation type, per retry. A Java codebase with 400 source files triggers 400 individual transformation calls. Files that fail transformation (syntax errors in the output, unresolvable import changes, test compilation failures) are retried up to the configured retry limit (default: 3). The worst case is therefore 400 files × 3 retries × (average input tokens + output tokens per file). The estimate below uses a retry multiplier of 1.8 rather than the worst case, since only some files need retries:

Transformation cost formula:
files=400, avg_input_tokens_per_file=2,500 (file content + transformation rules + context),
avg_output_tokens_per_file=2,000, retry_multiplier=1.8 (partial retry rate on real codebases)

Input tokens: 400 × 2,500 × 1.8 = 1,800,000 tokens
Output tokens: 400 × 2,000 × 1.8 = 1,440,000 tokens
At Claude 3.5 Sonnet via Bedrock ($3/$15 per 1M tokens):
Total: $5.40 (input) + $21.60 (output) = $27.00 per transformation run
Three transformation runs on a large codebase during a migration sprint: $81

The critical failure mode: developers run /transform iteratively as they fix compilation errors introduced by the transformation output. Each "fix one more batch of files" is a full re-run on the remaining files, not an incremental patch. Teams migrating large monorepos report running transformation 5–8 times before the codebase compiles cleanly — each run bills at the same per-file rate.
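The effect of repeated runs is easy to underestimate, so here is a minimal sketch of the formula above scaled by run count. It uses only the figures quoted in this section; the function name is hypothetical, and treating every re-run as a full 400-file run is an upper bound, since later runs may cover fewer files.

```python
# Illustrative arithmetic using the figures quoted above; names are hypothetical.
PRICES_PER_1M = {"input": 3.00, "output": 15.00}  # USD, assumed Sonnet pricing


def transform_run_cost(files, input_per_file, output_per_file, retry_multiplier):
    """Estimated USD cost of one full /transform run."""
    input_tokens = files * input_per_file * retry_multiplier
    output_tokens = files * output_per_file * retry_multiplier
    return (
        input_tokens / 1_000_000 * PRICES_PER_1M["input"]
        + output_tokens / 1_000_000 * PRICES_PER_1M["output"]
    )


run_cost = transform_run_cost(
    files=400, input_per_file=2_500, output_per_file=2_000, retry_multiplier=1.8
)

# Each re-run is priced as a full run, so migration cost scales with run count.
for runs in (1, 3, 5, 8):
    print(f"{runs} run(s): ${run_cost * runs:.2f}")
```

At the 5 to 8 runs reported for large monorepos, the same arithmetic puts a migration at roughly $135 to $216 under these assumptions.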

The guard: TransformBudgetGuard

Q Developer's StartCodeTransformation API returns a plan phase before executing: it lists the files it intends to transform. The guard reads the plan, computes a pre-transformation cost estimate, and enforces a budget ceiling before the first LLM call. Its estimate assumes a per-file retry cap of 1 (accepting the first transformation attempt and reporting failures rather than retrying them); the code below uses that cap for estimation only and does not pass it to the service:

Python
import boto3
import time
from dataclasses import dataclass
from typing import Optional

# Approximate token costs per file for Q Developer transformation tasks
# (empirically derived from Q transformation API responses)
TRANSFORM_TOKENS_PER_FILE = {
    "input": 2_500,   # file content + transformation ruleset + context header
    "output": 2_000,  # transformed file content
}

# Bedrock model costs in USD per 1M tokens (update as AWS pricing changes)
BEDROCK_COSTS = {
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-haiku":    {"input": 0.25, "output": 1.25},
}

@dataclass
class TransformBudgetGuard:
    max_cost_usd: float = 20.0
    max_files: int = 200
    max_retries_per_file: int = 1
    model: str = "claude-3-5-sonnet"

    def estimate_cost(self, file_count: int, retry_rate: float = 0.3) -> float:
        effective_files = file_count * (1 + retry_rate * self.max_retries_per_file)
        input_tokens = effective_files * TRANSFORM_TOKENS_PER_FILE["input"]
        output_tokens = effective_files * TRANSFORM_TOKENS_PER_FILE["output"]
        costs = BEDROCK_COSTS.get(self.model, BEDROCK_COSTS["claude-3-5-sonnet"])
        return (
            input_tokens / 1_000_000 * costs["input"]
            + output_tokens / 1_000_000 * costs["output"]
        )

    def check_plan(self, file_count: int) -> dict:
        estimated = self.estimate_cost(file_count)
        return {
            "file_count": file_count,
            "estimated_cost_usd": round(estimated, 2),
            "approved": file_count <= self.max_files and estimated <= self.max_cost_usd,
            "reason": (
                f"file_count={file_count} exceeds max_files={self.max_files}"
                if file_count > self.max_files
                else (
                    f"estimated_cost=${estimated:.2f} exceeds max_cost=${self.max_cost_usd}"
                    if estimated > self.max_cost_usd
                    else "within budget"
                )
            ),
        }


def run_transform_with_guard(
    codewhisperer_client,
    transformation_spec: dict,
    guard: TransformBudgetGuard,
) -> dict:
    """
    Starts a Q Developer code transformation only if the plan passes the budget guard.
    Uses the StartCodeTransformation -> GetCodeTransformation polling pattern.
    """
    # Phase 1: Request a transformation plan (no LLM calls yet)
    plan_response = codewhisperer_client.start_code_transformation(
        workspaceState=transformation_spec["workspaceState"],
        transformationType=transformation_spec["transformationType"],
        # dryRun=True returns file count without executing
    )

    # The plan response includes the number of files to transform
    file_count = plan_response.get("transformationFileCount", 0)
    check = guard.check_plan(file_count)

    if not check["approved"]:
        return {
            "status": "blocked_by_guard",
            "reason": check["reason"],
            "estimated_cost_usd": check["estimated_cost_usd"],
            "file_count": file_count,
        }

    # Phase 2: Execute the transformation
    job_id = plan_response["transformationJobId"]
    codewhisperer_client.resume_code_transformation(
        transformationJobId=job_id,
        userDecision="TRANSFORM_RECOMMENDED",
    )

    # Phase 3: Poll for completion, bounded so the poller cannot wait forever
    max_polls = 240  # 240 polls × 30 s = 2 hours; adjust to your job sizes
    status = "POLL_TIMEOUT"
    for _ in range(max_polls):
        status_response = codewhisperer_client.get_code_transformation(
            transformationJobId=job_id
        )
        job_status = status_response["transformationJob"]["status"]
        if job_status in ("COMPLETED", "FAILED", "STOPPED", "PARTIALLY_COMPLETED"):
            status = job_status
            break
        time.sleep(30)

    return {
        "status": status,
        "job_id": job_id,
        "estimated_cost_usd": check["estimated_cost_usd"],
        "file_count": file_count,
    }

The guard's key insight is intercepting before the first LLM call. Q Developer's transformation API supports a plan-then-execute pattern; the guard reads the plan's file count to estimate cost and either approves or blocks the run. Teams can set max_cost_usd per transformation run and use the blocked response to alert the engineering lead before starting an expensive migration.

Pattern 3: Workspace Context Expansion on Every Automated Turn

Amazon Q Developer's workspace context feature (Q Developer Pro) reads your project's source files and builds a searchable index. When you ask Q a question about your codebase — "where is the payment timeout configured?" — Q queries this index and retrieves relevant file snippets, injecting them into the LLM call's context window. The retrieval happens on every chat turn that touches project-wide context.

In interactive use, a developer asks 5–10 questions per session; workspace context is a net win. The failure mode appears in automated Q Developer integrations: CI workflows that invoke Q on every PR to generate a code review summary, automated documentation updaters that ask Q to describe each changed file, or chatbot integrations that forward user questions to Q with full workspace context. Each automated turn triggers a full workspace context retrieval. In large monorepos, workspace retrieval pulls 20,000–80,000 tokens of context per turn — not because all of it is relevant, but because Q's retrieval ranks many files as potentially relevant for broad questions.

Automated Q Developer integration cost formula:
workspace_context_tokens_per_turn=40,000 (average across broad and narrow questions),
output_tokens_per_turn=800, automated_turns_per_day=150 (10-developer team, 15 turns/dev)

Daily input tokens: 150 × 40,000 = 6,000,000 tokens
Daily output tokens: 150 × 800 = 120,000 tokens
At Claude 3.5 Sonnet via Bedrock ($3/$15 per 1M tokens):
Daily cost: $18.00 (input) + $1.80 (output) = $19.80/day = $594/month
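The same formula can be parameterised to see how much a per-turn context cap would change the bill. This is a minimal sketch using the figures above; the function name and the per_turn_cap parameter are hypothetical, and the 20,000-token cap is just a comparison value.

```python
# Illustrative arithmetic using the figures quoted above; names are hypothetical.
INPUT_PRICE_PER_1M = 3.00    # USD, assumed Claude 3.5 Sonnet input price
OUTPUT_PRICE_PER_1M = 15.00  # USD, assumed Claude 3.5 Sonnet output price


def daily_context_cost(turns_per_day, context_tokens, output_tokens, per_turn_cap=None):
    """Estimated USD per day; per_turn_cap optionally clamps context per turn."""
    if per_turn_cap is not None:
        context_tokens = min(context_tokens, per_turn_cap)
    input_cost = turns_per_day * context_tokens / 1_000_000 * INPUT_PRICE_PER_1M
    output_cost = turns_per_day * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_1M
    return input_cost + output_cost


uncapped = daily_context_cost(150, 40_000, 800)
capped = daily_context_cost(150, 40_000, 800, per_turn_cap=20_000)

print(f"uncapped: ${uncapped:.2f}/day, ${uncapped * 30:.2f}/month")
print(f"capped:   ${capped:.2f}/day, ${capped * 30:.2f}/month")
```

Under these figures the cap halves the input side of the bill and leaves the output cost unchanged, which is why input context is the lever worth controlling.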

The amplifying factor: workspace context has no configurable per-turn ceiling. As a codebase grows, more files compete for retrieval and broad questions pull in more marginally relevant content, pushing per-turn context toward the top of the 20,000–80,000 token range. A team that adds 50,000 lines of code over a quarter can see its automated Q cost rise with no dashboard alert; the change shows only in the next AWS bill.

The guard: WorkspaceContextBudget

The guard enforces a per-turn context token ceiling and a per-session total ceiling, using Q's response metadata to track actual context usage. When a turn would exceed the ceiling, the guard reduces the workspace context scope to the most relevant N files rather than blocking the turn entirely:

Python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkspaceContextBudget:
    max_context_tokens_per_turn: int = 20_000
    max_total_input_tokens_per_session: int = 100_000
    _session_input_tokens: int = field(default=0, init=False)
    _turns: int = field(default=0, init=False)
    _trip_reason: Optional[str] = field(default=None, init=False)

    @property
    def tripped(self) -> bool:
        return self._trip_reason is not None

    def check_before_turn(self, estimated_context_tokens: int) -> dict:
        # Scope reduction caps a turn at the per-turn ceiling, so charge the
        # session budget with the capped figure rather than the raw estimate.
        effective_tokens = min(estimated_context_tokens, self.max_context_tokens_per_turn)

        if self._session_input_tokens + effective_tokens > self.max_total_input_tokens_per_session:
            self._trip_reason = (
                f"session input budget exhausted: "
                f"{self._session_input_tokens} + {effective_tokens} "
                f"> {self.max_total_input_tokens_per_session}"
            )
            return {"approved": False, "reason": self._trip_reason}

        if estimated_context_tokens > self.max_context_tokens_per_turn:
            # Recommend reducing context scope rather than blocking
            return {
                "approved": True,
                "recommend_scope_reduction": True,
                "recommended_max_files": self.max_context_tokens_per_turn // 2_000,
                "reason": (
                    f"per-turn ceiling exceeded: {estimated_context_tokens} "
                    f"> {self.max_context_tokens_per_turn}; scope reduced"
                ),
            }

        return {"approved": True, "recommend_scope_reduction": False}

    def record_turn(self, actual_input_tokens: int) -> None:
        self._session_input_tokens += actual_input_tokens
        self._turns += 1

    def status(self) -> dict:
        return {
            "turns": self._turns,
            "session_input_tokens": self._session_input_tokens,
            "tripped": self.tripped,
            "trip_reason": self._trip_reason,
        }


def q_chat_with_context_budget(
    q_client,
    application_id: str,
    user_id: str,
    message: str,
    budget: WorkspaceContextBudget,
    conversation_id: Optional[str] = None,
) -> dict:
    """
    Sends a message to Q Developer with workspace context budget enforcement.
    Reduces context scope when per-turn ceiling would be exceeded.
    """
    if budget.tripped:
        return {"error": "session budget exhausted", "status": budget.status()}

    # Estimate context tokens before the call using file count × avg tokens/file
    # Q Developer's ListDataSources or workspace index metadata provides file count
    estimated_context = _estimate_workspace_context_tokens(q_client, application_id)
    check = budget.check_before_turn(estimated_context)

    # Stop before building the request if the session budget is exhausted
    if not check["approved"]:
        return {"error": check["reason"], "status": budget.status()}

    chat_kwargs = {
        "applicationId": application_id,
        "userId": user_id,
        "userMessage": message,
    }

    if conversation_id:
        chat_kwargs["conversationId"] = conversation_id

    if check.get("recommend_scope_reduction"):
        # Limit retrieval to top-N most relevant files only.
        # NOTE: this filter shape is an assumption; confirm the supported
        # attributeFilter fields against the API reference before relying on it.
        chat_kwargs["attributeFilter"] = {
            "maxResults": check["recommended_max_files"]
        }

    response = q_client.chat_sync(**chat_kwargs)
    actual_input = response.get("inputTokens", estimated_context)
    budget.record_turn(actual_input)

    return {
        "response": response,
        "context_tokens_this_turn": actual_input,
        "budget_status": budget.status(),
    }


def _estimate_workspace_context_tokens(q_client, application_id: str) -> int:
    # Returns estimated context tokens based on workspace index size.
    # Use Q Developer's GetIndex API for accurate file/chunk count.
    # Fallback: 30,000 tokens as a conservative default for medium codebases.
    try:
        index_info = q_client.get_index(applicationId=application_id)
        document_count = index_info.get("indexStatistics", {}).get("indexedDocumentCount", 100)
        return min(document_count * 400, 80_000)  # ~400 tokens/file avg, cap at 80K
    except Exception:
        return 30_000

The guard's scope reduction path is important: rather than blocking the turn entirely when the per-turn ceiling would be exceeded, it reduces the retrieval scope to the most relevant N files. This preserves Q's usefulness for narrow questions while protecting against broad questions that pull in the entire codebase. Teams running automated Q integrations should set max_context_tokens_per_turn to 15,000–20,000 tokens for general questions, with a higher allowance for sessions explicitly asking about cross-cutting concerns.

Pattern 4: Security Scan + Automated Remediation Cycles

Amazon Q Developer's security scanning uses CodeGuru's vulnerability detection rules combined with Q's LLM-powered fix generation. The automated flow: scan the codebase for vulnerabilities, generate a fix for each finding, apply the fix, re-scan to verify the fix didn't introduce a new vulnerability. This scan-fix-verify loop is designed to run to completion — clearing all findings before reporting success.

The structural failure mode: fixing one vulnerability class can introduce another. A common example: fixing a SQL injection by adding input validation introduces a null-pointer risk if the validation raises on empty input. Q detects the null-pointer risk on the re-scan, generates a null-check fix, applies it, re-scans, finds that the null-check path can create an unhandled exception under concurrent access, generates a synchronization fix, and so on. Each scan→fix→scan cycle costs:

Security remediation cycle cost formula:
findings_per_scan=8, tokens_per_fix_generation=4,000 (vulnerability context + code + fix),
tokens_per_scan_summary=2,000, cycles_before_stable=4 (empirical: some fixes cascade)

Total LLM calls: 4 cycles × (8 fix generations + 1 re-scan summary) = 36 calls
Total tokens: 36 × 4,000 input + 36 × 1,200 output = 187,200 tokens
At Claude 3.5 Sonnet via Bedrock ($3/$15 per 1M tokens):
Per-repository remediation run: ~$0.43 (input) + $0.65 (output) = ~$1.08
Nightly automated remediation across 40 repositories: ~$43.20/night = $1,296/month

The cost-per-run is modest at small scale. The issue is that nightly automated remediation across a large repository fleet accumulates linearly: 40 repositories × $1.08 × 30 nights = $1,296/month, with no AWS alert until the billing cycle closes.
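For readers who want to plug in their own fleet size, the formula can be sketched as below. It prices the 144,000 input tokens and 43,200 output tokens separately, uses only the parameters quoted above, and the function name is hypothetical.

```python
# Illustrative arithmetic using the parameters quoted above; names are hypothetical.
INPUT_PRICE_PER_1M = 3.00    # USD, assumed Claude 3.5 Sonnet input price
OUTPUT_PRICE_PER_1M = 15.00  # USD, assumed Claude 3.5 Sonnet output price


def remediation_run_cost(cycles, findings_per_scan, input_per_call, output_per_call):
    """Return (llm_calls, usd_cost) for one repository's remediation run."""
    # Each cycle makes one fix-generation call per finding plus one re-scan summary.
    calls = cycles * (findings_per_scan + 1)
    input_cost = calls * input_per_call / 1_000_000 * INPUT_PRICE_PER_1M
    output_cost = calls * output_per_call / 1_000_000 * OUTPUT_PRICE_PER_1M
    return calls, input_cost + output_cost


calls, per_repo = remediation_run_cost(
    cycles=4, findings_per_scan=8, input_per_call=4_000, output_per_call=1_200
)
fleet_per_night = per_repo * 40  # 40 repositories scanned nightly

print(f"{calls} calls, ${per_repo:.2f} per repository")
print(f"${fleet_per_night:.2f} per night, ${fleet_per_night * 30:.2f} per month")
```

The cycles argument is the parameter to watch: every extra cascade adds another full round of fix-generation calls across all open findings.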

The guard: SecurityRemediationGuard

The guard caps remediation with a ceiling on scan-fix-scan rounds, a per-session finding limit, and a per-session cost ceiling. It tracks cumulative LLM calls and token counts across the full scan-fix-scan cycle, not just per individual API call:

Python
from dataclasses import dataclass, field
from typing import Optional
import boto3

@dataclass
class SecurityRemediationGuard:
    max_remediation_rounds: int = 3     # max scan-fix-scan cycles per session
    max_findings_per_session: int = 20  # stop after processing this many findings
    max_cost_usd: float = 2.00          # hard spend ceiling per session
    _rounds: int = field(default=0, init=False)
    _findings_processed: int = field(default=0, init=False)
    _estimated_cost: float = field(default=0.0, init=False)
    _trip_reason: Optional[str] = field(default=None, init=False)

    INPUT_COST_PER_1M = 3.00   # Claude 3.5 Sonnet via Bedrock
    OUTPUT_COST_PER_1M = 15.00

    @property
    def tripped(self) -> bool:
        return self._trip_reason is not None

    def record_remediation_call(
        self, input_tokens: int, output_tokens: int, findings_fixed: int = 0
    ) -> None:
        call_cost = (
            input_tokens / 1_000_000 * self.INPUT_COST_PER_1M
            + output_tokens / 1_000_000 * self.OUTPUT_COST_PER_1M
        )
        self._estimated_cost += call_cost
        self._findings_processed += findings_fixed

        if self._estimated_cost >= self.max_cost_usd:
            self._trip_reason = (
                f"cost ceiling reached: "
                f"${self._estimated_cost:.2f} >= ${self.max_cost_usd}"
            )
        elif self._findings_processed >= self.max_findings_per_session:
            self._trip_reason = (
                f"finding limit reached: "
                f"{self._findings_processed} >= {self.max_findings_per_session}"
            )

    def start_round(self) -> bool:
        if self.tripped:
            return False
        self._rounds += 1
        if self._rounds > self.max_remediation_rounds:
            self._trip_reason = (
                f"max_remediation_rounds={self.max_remediation_rounds} reached; "
                f"escalate remaining findings to human review"
            )
            return False
        return True

    def status(self) -> dict:
        return {
            "rounds": self._rounds,
            "findings_processed": self._findings_processed,
            "estimated_cost_usd": round(self._estimated_cost, 4),
            "tripped": self.tripped,
            "trip_reason": self._trip_reason,
        }


def run_security_remediation_with_guard(
    codewhisperer_client,
    repository_path: str,
    guard: SecurityRemediationGuard,
) -> dict:
    """
    Runs a Q Developer security scan + auto-remediation cycle with spend enforcement.
    On guard trip, returns remaining findings as a human-escalation list rather than
    stopping the entire process silently.
    """
    human_escalations = []
    all_fixed = []

    while guard.start_round():
        # Run security scan
        scan_response = codewhisperer_client.create_code_scan(
            artifacts={"sourceCodeType": {"repositoryPath": repository_path}},
            programmingLanguage={"languageName": "python"},  # adapt as needed
        )
        findings = _poll_scan_findings(codewhisperer_client, scan_response["jobId"])

        if not findings:
            break  # clean scan; done

        # Enforce finding limit: take the first N unprocessed findings
        remaining_budget = guard.max_findings_per_session - guard._findings_processed
        batch = findings[:remaining_budget]
        deferred = findings[remaining_budget:]
        human_escalations.extend(deferred)

        for finding in batch:
            if guard.tripped:
                human_escalations.append(finding)
                continue

            # Generate fix for this finding
            fix_response = codewhisperer_client.generate_recommendation(
                fileContext=finding["fileContext"],
                maxResults=1,
            )
            fix_tokens = fix_response.get("inputTokens", 4000)
            fix_output_tokens = fix_response.get("outputTokens", 1200)
            guard.record_remediation_call(fix_tokens, fix_output_tokens, findings_fixed=1)

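            if guard.tripped:
                # This call hit a ceiling: leave the fix unapplied and escalate
                # the finding so it is not silently dropped.
                human_escalations.append(finding)
            else:
                # Apply the fix (implementation-specific to your VCS integration)
                _apply_fix(repository_path, finding, fix_response["recommendations"][0])
                all_fixed.append(finding["findingId"])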

    return {
        "fixed": all_fixed,
        "human_escalations": human_escalations,
        "guard_status": guard.status(),
    }


def _poll_scan_findings(client, job_id: str) -> list:
    import time
    while True:
        r = client.get_code_scan(jobId=job_id)
        if r["status"] in ("Completed", "Failed"):
            return r.get("findings", [])
        time.sleep(15)


def _apply_fix(repo_path: str, finding: dict, recommendation: dict) -> None:
    # Apply the text edit from recommendation to the file at finding["filePath"]
    pass

The guard's human_escalations output is intentional: instead of silently capping the remediation and leaving vulnerabilities unfixed with no record, the guard returns the remaining findings as a list that can be routed to a human reviewer or a ticket system. This gives the team full visibility into what was deferred without paying for additional remediation cycles that would exceed the session budget.

Summary: Amazon Q Developer cost control in one table

| Feature | Failure mode | Guard | Default ceiling |
| --- | --- | --- | --- |
| /dev command (chat_sync agent) | Replanning loop accumulates error context, each round larger than the last; 5 rounds = ~78K input tokens | DevCommandBudgetGuard | max_rounds=5, max_input_tokens=40K |
| Code transformation (StartCodeTransformation) | One LLM call per file × retry_rate; 400-file Java project at 1.8× retry = 1.8M input tokens ($5.40) | TransformBudgetGuard | max_cost_usd=$20, max_files=200 |
| Workspace context (chat_sync with index) | 40K context tokens per automated turn; 150 turns/day = $19.80/day ($594/month) | WorkspaceContextBudget | max_context_per_turn=20K, max_session=100K |
| Security remediation (create_code_scan + fix loop) | Scan-fix-scan cycles cascade when fixes introduce secondary vulnerabilities; 4 cycles × 8 findings = $1.08/run | SecurityRemediationGuard | max_rounds=3, max_findings=20, max_cost=$2 |

Integration with RunGuard

Each guard above tracks spend independently. For teams running all four Q Developer features, a session-level aggregator connects them to a shared monthly budget:

Python
from runguard import SessionBudget, BudgetExceeded

# Shared session budget: $50/month across all Q Developer features
session = SessionBudget(
    monthly_cap_usd=50.0,
    alert_threshold=0.8,  # alert at $40
    on_trip=lambda reason: send_slack_alert(f"Q Developer budget tripped: {reason}"),
)

# Each guard checks in with the shared session before each call
dev_guard = DevCommandBudgetGuard(max_rounds=5, max_input_tokens=40_000)
transform_guard = TransformBudgetGuard(max_cost_usd=20.0, max_files=200)
context_budget = WorkspaceContextBudget(max_context_tokens_per_turn=20_000)
remediation_guard = SecurityRemediationGuard(max_remediation_rounds=3, max_cost_usd=2.0)

# The session budget aggregates cost from all guards and trips any guard
# that would push the monthly total over the cap
session.register(dev_guard, transform_guard, context_budget, remediation_guard)

The SessionBudget aggregator works across all RunGuard-instrumented API clients. When any individual guard trips, the session records the spend. When the monthly cap approaches, the aggregator pre-empts additional calls from all registered guards — the team gets an alert before the AWS bill closes, not after.

Where this fits in the AWS ecosystem

Amazon Q Developer's cost amplification patterns are structurally similar to those in other AWS Bedrock agents (action group fan-out, knowledge base retrieval per turn) and Bedrock Converse API (token-accumulating conversation history). The patterns covered here are specific to Q Developer's agentic capabilities layered on top of Bedrock — the /dev plan-execute loop, the per-file transformation fan-out, the workspace index retrieval, and the remediation cycle. Each requires instrumentation at the Q Developer API boundary, not at the underlying Bedrock model call, because Q Developer abstracts the individual LLM calls behind task-level APIs.

Teams using Q Developer in CI/CD pipelines should treat every automated Q invocation as a potential cost amplifier — not because Q is expensive per call, but because the features that make Q powerful (multi-step task completion, whole-codebase awareness, iterative remediation) all scale superlinearly with codebase size and session depth. The guards above enforce the linear cost model your budget expects.

Frequently asked questions

Does Amazon Q Developer expose token usage in its API responses?

Q Developer Pro's API responses include inputTokens and outputTokens fields in the response metadata for chat_sync and related calls. The plan response from StartCodeTransformation includes a transformationFileCount field useful for pre-transformation cost estimation. The token fields are not always present for all Q Developer features, particularly the IDE plugin's inline suggestions; for those, cost estimation requires empirical benchmarking against your codebase.

Does the /dev command expose its intermediate steps for guard instrumentation?

The /dev command operates as an agentic session via the Q Developer API's chat_sync endpoint with streaming. Intermediate agent actions (file reads, shell command executions, code edits) are visible as streamed events before the final response. The DevCommandBudgetGuard intercepts between turns by wrapping the outer polling loop, not the internal agent steps — which means it enforces ceilings at the replanning boundary rather than at the individual tool-call level.

How does workspace context token count scale with repository size?

Q Developer's retrieval system ranks files by relevance and retrieves a fixed top-N chunk set. Larger codebases do not proportionally increase context tokens per turn — instead, the retrieval becomes less precise as more files compete for the top-N slots. The practical risk is not linear scaling but retrieval noise: broader questions in large repositories pull in marginally relevant files, inflating context with low-signal content while the per-turn token count hits its natural ceiling around 40,000–80,000 tokens. The WorkspaceContextBudget guard's scope reduction mode addresses this by explicitly limiting the retrieval result count.

Can AWS Cost Explorer alert on Q Developer spend before it accumulates?

AWS Cost Explorer and AWS Budgets alerts can be configured to notify when Q Developer spend crosses a monthly threshold. The limitation is latency: Cost Explorer data can lag by up to 24 hours, and budget alerts evaluate accumulated spend against period thresholds rather than per-call or per-session ceilings. A team running a failed 400-file transformation at 9am will not see the AWS alert until the next morning at earliest. The guards described here operate at call time, enforcing ceilings before the tokens are billed rather than after.
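A coarse monthly budget alert is still worth having as a backstop behind the call-time guards. The sketch below builds a request for the AWS Budgets CreateBudget operation via boto3; the budget name, account ID, email address, and 80% threshold are placeholder assumptions. No cost filter is applied, so as written it covers the whole account rather than Q Developer alone.

```python
def build_budget_request(account_id, monthly_limit_usd, alert_email, threshold_pct=80.0):
    """Build keyword arguments for the AWS Budgets CreateBudget call."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": "q-developer-monthly",  # hypothetical name
            "BudgetLimit": {"Amount": f"{monthly_limit_usd:.2f}", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold_pct,       # percent of the monthly limit
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": alert_email}],
            }
        ],
    }


def create_budget(request):
    """Submit the request; needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("budgets").create_budget(**request)


# Placeholder account ID and address; replace with your own before calling create_budget.
request = build_budget_request("123456789012", 50.0, "team@example.com")
print(request["Budget"]["BudgetLimit"])
```

Because the alert only fires after spend has been recorded, it complements the guards rather than replacing them.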

Which Q Developer plan (Individual vs Pro) is affected by these patterns?

Q Developer Individual is free with usage limits. Q Developer Pro ($19/user/month) removes those limits and adds workspace context, extended code transformation, and security remediation features — the three agentic capabilities most likely to produce runaway cost. The patterns in this post primarily apply to Q Developer Pro in enterprise or automated workflow contexts. Individual tier users are protected by Q Developer's own usage limits and are less likely to hit these patterns in interactive use.

Add spend ceilings to your Q Developer workflows

RunGuard's circuit breaker SDK instruments Amazon Q Developer API calls, enforces per-session and per-month token budgets, and alerts your team when automated workflows approach the spend ceiling — before the AWS bill closes.
