AWS Strands Agents Cost Control: Streaming Token Accumulation, Tool Result Injection Loops, Multi-Agent Amplification, and Lambda Billing Drift

AWS Strands Agents is Amazon's open-source Python SDK for building production AI agents on Amazon Bedrock, released in May 2025. It wraps Bedrock's Converse API behind a clean agent loop: you register Python functions as tools with @tool decorators, instantiate an Agent with a model ID and your tool list, and call the agent with a natural-language string. Strands handles the tool call / response cycle, streams tokens from Bedrock, and maintains conversation state across turns. The SDK's simplicity is genuine — the hello-world is under 15 lines of Python.

That simplicity creates a cost surface that's easy to miss. Strands' streaming-first architecture, combined with its full-history context management and native support for multi-agent topologies, means four distinct failure modes can drive AWS bills without producing any errors or obvious warning signs. They fail silently: the agent returns a result, the Lambda function exits successfully, and the bill arrives later.

  • Streaming token accumulation — Strands passes the full conversation history to Bedrock on every turn; in multi-turn sessions, each round-trip sends the complete preceding exchange as input tokens, compounding cost quadratically with conversation length.
  • Tool result injection loops — when a tool returns an ambiguous or empty result, the model frequently calls the same tool again with a slightly modified input; without a per-tool call ceiling, the loop runs until max_iterations is reached.
  • Multi-agent worker amplification — Strands' AgentTool primitive lets a supervisor agent spawn worker agents; each worker maintains its own full context window and its own tool call loop, multiplying costs by the number of workers called in parallel.
  • Lambda billing drift — Strands is designed for Lambda deployment; when Bedrock responses are slow (high-load Claude models, large contexts), a streaming agent session that takes 45 seconds bills the full Lambda execution time including the milliseconds spent waiting on each streaming chunk.

Failure Mode 1 — Streaming Token Accumulation

Strands maintains an internal messages list that grows with every user turn and every tool call round-trip. When you call agent("continue the analysis") in a multi-turn session, Strands sends the entire conversation history — all prior user messages, all assistant messages, all tool call requests, all tool results — as the input to Bedrock's Converse API. Bedrock charges for every input token sent, not just the new message.

The compounding is not linear. A fresh conversation might send 200 input tokens. After five tool calls (each adding a request message and a result message), the history contains roughly 1,500 tokens. After ten tool calls, it contains roughly 3,500 tokens. The eleventh call sends 3,500 tokens as input just to receive a 100-token response. The total input tokens across a 20-turn session are not 20 × average-turn-length — they are the triangular sum of accumulated history at each turn.

On Claude 3.5 Sonnet via Bedrock, input tokens cost $3.00 per million. A research agent that runs 30 tool calls per session across a 10,000-token accumulated history at turn 30 sends 300,000 input tokens in that single turn — $0.90 for one Bedrock call. The same session running 50 parallel instances in a batch job costs $45 for the final call alone. Most teams measure cost per session at turn 1 and extrapolate linearly, systematically underestimating multi-turn agent cost by 3–8×.

The accumulation rule: In a Strands session with N tool calls, total input tokens ≈ (N² / 2) × average_tokens_per_exchange. At 30 tool calls and 500 tokens per exchange, total input tokens ≈ 225,000 — not 30 × 500 = 15,000. Measure actual token counts at turn 15 and 30, not just turn 1.

Strands exposes token usage in the metrics attribute after each call. The guard pattern is a token budget that tracks cumulative input tokens across turns and terminates the session before the accumulation reaches a configured ceiling:

Python — per-session token accumulation budget for AWS Strands Agents
from strands import Agent, tool
from strands.models import BedrockModel
import time

class StrandsTokenBudget:
    """
    Tracks cumulative input + output tokens across all turns in a Strands session.
    Call check() after each agent() invocation to enforce the session ceiling.
    """

    def __init__(
        self,
        max_input_tokens: int = 50_000,
        max_total_tokens: int = 80_000,
        session_id: str = "",
    ):
        self.max_input_tokens = max_input_tokens
        self.max_total_tokens = max_total_tokens
        self.session_id = session_id
        self.cumulative_input = 0
        self.cumulative_output = 0
        self.turn_count = 0
        self.turn_snapshots: list[dict] = []

    def record_turn(self, agent: "Agent") -> None:
        metrics = getattr(agent, "metrics", None)
        if metrics is None:
            return
        usage = getattr(metrics, "accumulated_usage", None)
        if usage is None:
            return
        self.cumulative_input = usage.get("inputTokens", self.cumulative_input)
        self.cumulative_output = usage.get("outputTokens", self.cumulative_output)
        self.turn_count += 1
        self.turn_snapshots.append({
            "turn": self.turn_count,
            "cumulative_input": self.cumulative_input,
            "cumulative_output": self.cumulative_output,
            "total": self.cumulative_input + self.cumulative_output,
        })

    def check(self) -> None:
        total = self.cumulative_input + self.cumulative_output
        if self.cumulative_input > self.max_input_tokens:
            raise RuntimeError(
                f"[StrandsTokenBudget] Session {self.session_id!r} exceeded input token "
                f"ceiling: {self.cumulative_input:,} > {self.max_input_tokens:,}. "
                f"Turn {self.turn_count}, total tokens: {total:,}. "
                "Likely cause: full conversation history compounding across turns. "
                "Consider clearing agent.messages at checkpoint intervals."
            )
        if total > self.max_total_tokens:
            raise RuntimeError(
                f"[StrandsTokenBudget] Session {self.session_id!r} exceeded total token "
                f"ceiling: {total:,} > {self.max_total_tokens:,}."
            )

    def summary(self) -> dict:
        return {
            "session_id": self.session_id,
            "turn_count": self.turn_count,
            "cumulative_input": self.cumulative_input,
            "cumulative_output": self.cumulative_output,
            "input_budget_used_pct": round(self.cumulative_input / self.max_input_tokens * 100, 1),
        }


# Usage pattern
budget = StrandsTokenBudget(
    max_input_tokens=50_000,
    max_total_tokens=80_000,
    session_id="research-session-001",
)

model = BedrockModel(
    model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    region_name="us-east-1",
)
agent = Agent(model=model, tools=[...])

for user_message in conversation_turns:
    agent(user_message)
    budget.record_turn(agent)
    budget.check()  # raises before the next turn if ceiling is breached

# Checkpoint: trim history when input tokens exceed 60% of ceiling
if budget.cumulative_input > budget.max_input_tokens * 0.6:
    # Keep only the system prompt + last 4 messages to reset accumulation
    if len(agent.messages) > 4:
        agent.messages = agent.messages[-4:]

The checkpoint at 60% of the ceiling is important. Clearing agent.messages resets the accumulation counter but loses conversational context. Trimming to the last 4 messages (the most recent 2 user/assistant pairs) preserves enough context for the agent to continue coherently while resetting the compounding effect. Set the trim threshold based on how much context loss is acceptable for your use case — research agents tolerate more trimming than customer support agents that reference earlier parts of the conversation.

Failure Mode 2 — Tool Result Injection Loops

Strands uses Bedrock's native tool use protocol. When the model decides to use a tool, it emits a toolUse block in the response; Strands executes the corresponding Python function, injects the result back into the conversation as a toolResult message, and calls Bedrock again. This cycle continues until the model emits a final text response without a tool call.

The loop fails in a specific pattern: the model calls a tool, receives a result that is ambiguous or doesn't resolve the original question, and calls the same tool again with a slightly modified input trying to get a clearer answer. This is not an infinite loop in the classic sense — each iteration makes progress from the model's perspective. But from a cost perspective, five iterations of a web search tool (each costing 1,500 input tokens for the accumulated history plus the tool result) at turn 25 of a session generates the same cost as running five separate fresh sessions.

The most common trigger is a tool that returns partial results. A database query tool that returns "no results found" when the query was slightly too narrow causes the model to try progressively wider queries — each "no results" response gets injected into the history, growing it further, making subsequent iterations more expensive. A web search that returns a 503 from the search API causes the model to retry with synonyms. A code execution tool that returns an empty output causes the model to modify the code and try again.

The injection loop signature: Same tool called 3+ times in a row, each call's result injected into the growing context. Check agent.messages for consecutive toolUse blocks with the same tool name — that's the diagnostic. The fix is a per-tool call counter that raises after N consecutive calls to the same tool.

Python — per-tool call ceiling guard for AWS Strands tool loops
from strands import tool
from functools import wraps
from collections import defaultdict
import threading

class ToolCallGuard:
    """
    Tracks consecutive calls to the same tool within a session.
    Raises after max_consecutive_same_tool consecutive calls to the same tool.
    Also enforces a total per-tool ceiling across the full session.
    """

    def __init__(
        self,
        max_consecutive_same_tool: int = 3,
        max_total_per_tool: int = 10,
    ):
        self.max_consecutive = max_consecutive_same_tool
        self.max_total_per_tool = max_total_per_tool
        self._lock = threading.Lock()
        self._total_calls: dict[str, int] = defaultdict(int)
        self._last_tool: str | None = None
        self._consecutive_count: int = 0

    def record(self, tool_name: str) -> None:
        with self._lock:
            # Consecutive counter
            if tool_name == self._last_tool:
                self._consecutive_count += 1
            else:
                self._consecutive_count = 1
                self._last_tool = tool_name

            # Total counter
            self._total_calls[tool_name] += 1

            # Ceiling checks
            if self._consecutive_count > self.max_consecutive:
                raise RuntimeError(
                    f"[ToolCallGuard] Tool {tool_name!r} called {self._consecutive_count} "
                    f"times consecutively (ceiling: {self.max_consecutive}). "
                    "The model is looping on an ambiguous tool result. "
                    "Check the tool's return value for empty or partial results "
                    "that cause the model to keep retrying."
                )
            if self._total_calls[tool_name] > self.max_total_per_tool:
                raise RuntimeError(
                    f"[ToolCallGuard] Tool {tool_name!r} reached session total of "
                    f"{self._total_calls[tool_name]} calls (ceiling: {self.max_total_per_tool}). "
                    "Review whether this tool's results are resolving the agent's goal."
                )

    def reset_consecutive(self) -> None:
        with self._lock:
            self._last_tool = None
            self._consecutive_count = 0

    def summary(self) -> dict:
        return {
            "total_calls_by_tool": dict(self._total_calls),
            "last_tool": self._last_tool,
            "consecutive_count": self._consecutive_count,
        }


# Decorator to wrap Strands tools with the guard
def guarded_tool(guard: ToolCallGuard):
    def decorator(fn):
        @tool
        @wraps(fn)
        def wrapper(*args, **kwargs):
            guard.record(fn.__name__)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


# Apply to your Strands tools
guard = ToolCallGuard(max_consecutive_same_tool=3, max_total_per_tool=10)

@guarded_tool(guard)
def web_search(query: str) -> str:
    """Search the web for information."""
    # your implementation
    ...

@guarded_tool(guard)
def database_query(sql: str) -> str:
    """Execute a read-only SQL query."""
    # your implementation
    ...

agent = Agent(
    model=BedrockModel(model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0"),
    tools=[web_search, database_query],
)

The consecutive counter resets when a different tool is called. This handles the common pattern where the model alternates between two tools (search → query → search → query) without hitting the consecutive ceiling, even though both are looping. For that pattern, the total per-tool ceiling of 10 calls is the backstop. Set both ceilings based on your domain: a research agent that legitimately needs many searches should have a higher ceiling than a support agent that should resolve each query in one or two tool calls.

Failure Mode 3 — Multi-Agent Worker Amplification

Strands supports multi-agent topologies through its AgentTool class, which wraps an Agent instance as a callable tool that a supervisor agent can invoke. A supervisor agent receives a complex task, breaks it into subtasks, and dispatches each subtask to a worker agent by calling it as a tool. The worker agent runs its own full tool call loop, maintains its own conversation history, and returns a result to the supervisor. The supervisor synthesizes results and may dispatch additional workers.

The cost structure is multiplicative, not additive. A supervisor that calls three worker agents in parallel to research three aspects of a topic triggers three simultaneous Bedrock sessions. Each worker maintains its own full context — it doesn't share the supervisor's conversation history, but it starts with the supervisor's subtask prompt (which is often a large, detailed instruction derived from the supervisor's own accumulated context). If the supervisor's context has grown to 8,000 tokens by the time it dispatches workers, each worker's first input includes an 8,000-token prompt plus its own growing history. Three workers × 30 tool calls each × 10,000 accumulated tokens per turn = 900,000 input tokens from a single supervisor dispatch.

The amplification compounds when workers are called sequentially based on earlier results. A supervisor that says "if the first search returns X, research X further with a dedicated agent" creates a dynamic depth that's impossible to bound statically. Each recursive level multiplies the token count by the number of workers at that level.

The amplification rule: In a Strands multi-agent tree of depth D with branching factor B, total input tokens scale as O(B^D × session_tokens). A depth-2 tree with 3 workers per level and 15,000 tokens per session generates 135,000 tokens minimum (3² × 15,000). Cap both tree depth and per-session token budgets independently — neither ceiling alone is sufficient.

Python — depth-limited multi-agent orchestration with Strands AgentTool
from strands import Agent, tool
from strands.tools import AgentTool
from strands.models import BedrockModel
import threading

class MultiAgentBudget:
    """
    Tracks agent tree depth and total spawned agents.
    Pass a shared instance to all supervisors and workers in the hierarchy.
    """

    def __init__(
        self,
        max_depth: int = 2,
        max_total_agents: int = 6,
        max_tokens_per_session: int = 40_000,
    ):
        self.max_depth = max_depth
        self.max_total_agents = max_total_agents
        self.max_tokens_per_session = max_tokens_per_session
        self._lock = threading.Lock()
        self._current_depth: int = 0
        self._total_spawned: int = 0

    def enter_level(self, label: str = "") -> None:
        with self._lock:
            self._current_depth += 1
            self._total_spawned += 1
            if self._current_depth > self.max_depth:
                raise RuntimeError(
                    f"[MultiAgentBudget] Agent tree depth {self._current_depth} "
                    f"exceeds ceiling {self.max_depth}. "
                    f"Spawning agent: {label!r}. "
                    "Multi-agent recursion is compounding token costs. "
                    "Cap depth or reduce worker granularity."
                )
            if self._total_spawned > self.max_total_agents:
                raise RuntimeError(
                    f"[MultiAgentBudget] Total spawned agents {self._total_spawned} "
                    f"exceeds ceiling {self.max_total_agents}."
                )

    def exit_level(self) -> None:
        with self._lock:
            self._current_depth = max(0, self._current_depth - 1)

    def summary(self) -> dict:
        return {
            "current_depth": self._current_depth,
            "total_spawned": self._total_spawned,
            "max_depth": self.max_depth,
            "max_total_agents": self.max_total_agents,
        }


def make_bounded_worker(
    task_description: str,
    tools: list,
    budget: MultiAgentBudget,
    depth_label: str = "",
) -> str:
    """
    Spawns a bounded Strands worker agent. Raises if depth or spawn ceiling is exceeded.
    Returns the worker's final text response.
    """
    budget.enter_level(label=depth_label)
    token_budget = StrandsTokenBudget(
        max_input_tokens=budget.max_tokens_per_session,
        session_id=depth_label,
    )
    try:
        model = BedrockModel(
            model_id="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
            region_name="us-east-1",
        )
        worker = Agent(model=model, tools=tools)
        result = worker(task_description)
        token_budget.record_turn(worker)
        token_budget.check()
        return str(result)
    finally:
        budget.exit_level()


# Supervisor tool that uses bounded workers
multi_budget = MultiAgentBudget(max_depth=2, max_total_agents=6)

@tool
def research_subtopic(topic: str) -> str:
    """Research a specific subtopic in depth using a dedicated agent."""
    return make_bounded_worker(
        task_description=f"Research this topic thoroughly: {topic}",
        tools=[web_search, database_query],
        budget=multi_budget,
        depth_label=f"worker-{topic[:20]}",
    )

supervisor = Agent(
    model=BedrockModel(model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0"),
    tools=[research_subtopic],
)

Note the model choice in the example: the supervisor uses Claude Haiku (cheaper, faster, sufficient for routing decisions) while workers use Claude Sonnet (capable, needed for deep research). This is a standard cost optimization for multi-agent Strands deployments — the supervisor's job is orchestration, not generation. Haiku input tokens cost $0.80 per million versus Sonnet's $3.00, so a supervisor that makes 20 routing decisions saves $0.044 per session — meaningful at scale but not worth sacrificing the depth and spawn ceilings that prevent the amplification failure.

Failure Mode 4 — Lambda Billing Drift

Strands is explicitly designed for AWS Lambda deployment. The SDK's documentation examples show Lambda handler patterns, and the streaming architecture maps naturally to Lambda's invocation model. Lambda bills per millisecond of execution time (rounded up to 1ms), which means every millisecond that a Lambda function spends waiting on a Bedrock streaming response is a billable millisecond.

The billing drift failure mode is specific to streaming over long Bedrock calls. A Strands agent running a 30-tool-call session at turn 20 (with ~15,000 accumulated input tokens) sends a request to Bedrock that takes 8–12 seconds to complete on a loaded Claude 3.5 Sonnet endpoint. The Lambda function is alive and billable during every millisecond of that wait. A 128MB Lambda function costs $0.0000021 per 100ms ($0.000021/s). A single 10-second Bedrock call costs $0.00021 in Lambda compute — negligible per call, but a 30-call session that averages 5 seconds per call costs $0.00315 in Lambda compute alone, before Bedrock token charges. At 10,000 sessions per day, that's $31.50/day in Lambda compute from idle wait time.

The drift accelerates with context length. As conversation history grows, Bedrock takes longer to process the input (time-to-first-token increases with input length). Turn 30 of a session with 20,000 accumulated tokens takes 12 seconds where turn 1 took 2 seconds. The Lambda billing for the same agent doubles from turn 1 to turn 30 due to input length alone. This is invisible in Bedrock's cost dashboard, which shows token charges — the Lambda idle-wait cost appears in a different cost category.

The Lambda drift rule: Lambda idle-wait cost is proportional to both session length (more turns = more calls) and context length (larger context = slower Bedrock = longer idle per call). Both grow together in a Strands multi-turn session. Set a hard Lambda timeout well below the 15-minute maximum and pair it with the session token ceiling from failure mode 1.

Python — Lambda timeout guard and per-session wall-clock ceiling for Strands
import time
import signal
from contextlib import contextmanager
from strands import Agent
from strands.models import BedrockModel

class LambdaSessionGuard:
    """
    Enforces a wall-clock ceiling on a Strands session running inside Lambda.
    Raises before the Lambda timeout to allow graceful partial-result return.
    """

    def __init__(
        self,
        max_session_seconds: float = 25.0,
        lambda_timeout_seconds: float = 30.0,
        warn_at_fraction: float = 0.75,
    ):
        self.max_session_seconds = max_session_seconds
        self.lambda_timeout_seconds = lambda_timeout_seconds
        self.warn_at_fraction = warn_at_fraction
        self._session_start: float = 0.0
        self._turn_count: int = 0

    def start(self) -> None:
        self._session_start = time.monotonic()
        self._turn_count = 0

    def check(self, after_turn_label: str = "") -> None:
        elapsed = time.monotonic() - self._session_start
        self._turn_count += 1
        remaining = self.max_session_seconds - elapsed
        if remaining <= 0:
            raise RuntimeError(
                f"[LambdaSessionGuard] Session wall-clock ceiling of "
                f"{self.max_session_seconds}s exceeded (elapsed: {elapsed:.1f}s) "
                f"after turn {self._turn_count} ({after_turn_label!r}). "
                "Lambda billing drift: session is spending more time waiting on "
                "Bedrock responses than executing — likely caused by large context "
                "growing time-to-first-token on each call."
            )
        if elapsed > self.max_session_seconds * self.warn_at_fraction:
            print(
                f"[LambdaSessionGuard] WARNING: {elapsed:.1f}s elapsed of "
                f"{self.max_session_seconds}s ceiling after turn {self._turn_count}. "
                f"Remaining: {remaining:.1f}s. Consider trimming agent.messages."
            )

    @property
    def elapsed_seconds(self) -> float:
        return time.monotonic() - self._session_start


# Lambda handler pattern with Strands + guard
def lambda_handler(event: dict, context) -> dict:
    # Lambda context provides ms_remaining — use it to set dynamic ceiling
    lambda_remaining_ms = getattr(context, "get_remaining_time_in_millis", lambda: 30_000)()
    safe_ceiling_seconds = (lambda_remaining_ms / 1000) - 5.0  # 5-second safety margin

    session_guard = LambdaSessionGuard(
        max_session_seconds=min(safe_ceiling_seconds, 25.0),
        lambda_timeout_seconds=lambda_remaining_ms / 1000,
    )
    token_budget = StrandsTokenBudget(
        max_input_tokens=30_000,
        session_id=event.get("session_id", ""),
    )

    model = BedrockModel(
        model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # Haiku for Lambda: faster TTFT, lower idle-wait cost
        region_name="us-east-1",
        streaming=True,
    )
    agent = Agent(model=model, tools=[...])

    session_guard.start()
    partial_results = []

    try:
        for user_message in event.get("messages", []):
            result = agent(user_message)
            partial_results.append(str(result))
            token_budget.record_turn(agent)
            token_budget.check()
            session_guard.check(after_turn_label=user_message[:50])
    except RuntimeError as e:
        # Return partial results on budget/time ceiling rather than timing out hard
        return {
            "statusCode": 206,
            "partial": True,
            "results": partial_results,
            "guard_message": str(e),
            "elapsed_seconds": session_guard.elapsed_seconds,
        }

    return {
        "statusCode": 200,
        "results": partial_results,
        "elapsed_seconds": session_guard.elapsed_seconds,
        "token_summary": token_budget.summary(),
    }

The key pattern here is using the Lambda context's get_remaining_time_in_millis() to set a dynamic wall-clock ceiling rather than a hardcoded constant. This ensures the guard fires 5 seconds before Lambda's hard timeout regardless of whether the function was configured with 30s or 5 minutes. Returning a 206 partial response rather than letting Lambda time out hard gives the calling service a chance to handle incomplete results gracefully — important for user-facing agents where a partial answer with a clear "ran out of time" signal is better than a 502 gateway timeout.

Combined Guard: Production Strands Agent

The four failure modes interact. A session that's accumulating tokens (failure mode 1) will trigger longer Bedrock response times that exacerbate Lambda billing drift (failure mode 4). A tool injection loop (failure mode 2) will accelerate token accumulation. A multi-agent topology (failure mode 3) amplifies all three others in each worker. The complete production guard wires all four protections together:

Failure Mode Guard Ceiling What to Watch
Token accumulationStrandsTokenBudget Cumulative input token counter 50,000 input tokens / 80,000 total agent.metrics.accumulated_usage after each turn
Tool injection loopToolCallGuard Consecutive + total per-tool counter 3 consecutive / 10 total per tool Consecutive toolUse blocks for same tool in agent.messages
Multi-agent amplificationMultiAgentBudget Tree depth + total spawn counter Depth 2 / 6 total agents Worker spawn count and depth in orchestration logs
Lambda billing driftLambdaSessionGuard Wall-clock ceiling from Lambda context Lambda timeout − 5s CloudWatch Lambda duration metrics, Bedrock TTFT percentiles

Frequently Asked Questions

Does Strands have a built-in max_iterations parameter like LangChain?

Strands does not expose a max_iterations parameter at the Agent constructor level in the same way LangChain's AgentExecutor does. It has a max_parallel_tools parameter for controlling concurrent tool calls and a callback_handler mechanism for observability, but loop termination is driven by the model's decision to stop calling tools. You must implement your own iteration ceiling using a callback or the tool-level guard patterns above.

Can I disable conversation history accumulation in Strands?

Yes — set agent.messages = [] between turns to clear history, or slice it to keep only recent turns: agent.messages = agent.messages[-4:]. Strands also supports passing messages directly to the agent call to control history explicitly. For stateless Lambda deployments, consider reconstructing the agent fresh each invocation rather than maintaining session state, and passing only the relevant context as the initial prompt rather than replaying the full history.

How does Bedrock's cross-region inference affect Strands cost?

Cross-region inference (using inference profiles like us.anthropic.claude-3-5-sonnet-20241022-v2:0 instead of a region-specific model ID) routes requests across AWS regions to improve availability. The token pricing is the same, but latency can vary — a cross-region request that normally completes in 3 seconds might take 8 seconds during high load. This longer TTFT directly increases Lambda billing drift. For cost-sensitive deployments, use single-region model IDs during low-traffic periods and reserve cross-region profiles for burst scenarios where availability matters more than cost.

Does Strands support prompt caching with Bedrock?

Bedrock's Prompt Caching feature (available for Claude models) caches up to 10,000 input tokens of a prompt prefix, and cached tokens cost 90% less than uncached tokens. Strands doesn't automatically use prompt caching, but you can configure it by setting cache points in the system prompt. For Strands agents with long system prompts (tool descriptions, instructions), enabling prompt caching for the system prompt can reduce the effective input token cost of the accumulating history by 30–40% — the system prompt tokens are cached after the first call while only the new conversation history tokens incur full cost.

What's the recommended Lambda memory size for Strands streaming agents?

Strands agents are I/O-bound (waiting on Bedrock), not CPU-bound. Higher Lambda memory doesn't significantly speed up streaming sessions. The cost-optimal configuration is typically 256MB–512MB: enough memory to avoid Lambda throttling from memory pressure, but not so much that you're paying for unused compute. The session wall-clock duration is dominated by Bedrock TTFT, not Lambda CPU allocation. If you're optimizing Lambda cost for Strands deployments, focus on reducing session length (fewer turns, shorter context) rather than tuning memory.

Enforce Strands token and iteration budgets automatically

RunGuard integrates with AWS Strands Agents to enforce per-session token ceilings, tool call limits, and Lambda wall-clock budgets. Connect your Strands agent in minutes — no code changes to your tools or model configuration.

See pricing Learn more