Mastra AI Agent Cost Control: Tool Loop Amplification, Workflow Retry Storms, and Memory Context Bloat

Mastra has become the dominant TypeScript-native agent framework in 2026, and for good reason: it ships first-class TypeScript types for agents, tools, and workflows; integrates directly with the Vercel AI SDK's model providers; and provides built-in memory, RAG, and workflow primitives that feel native to JavaScript runtimes rather than ported from Python. For teams building agent-first products in Node.js, Bun, or Deno, Mastra removes the friction of bridging Python's ML ecosystem with a JavaScript runtime.

That architecture also introduces a distinct set of cost failure modes. Mastra agents run tool-call loops in the same process as your application code, making it easy to miss runaway runs during development — your agent is just calling an async function, after all. Mastra Workflows compose steps that each carry independent retry configurations, so a retry storm can multiply LLM costs across a full workflow graph. Mastra's Memory system retrieves semantically relevant conversation history before every new LLM call, and that retrieved context grows as your conversation history does. Each of these failure modes is TypeScript-specific in its cause and its fix.

Four failure modes that are specific to Mastra AI agents and workflows:

  1. Agent tool loop without maxSteps — Mastra's agent.generate() and agent.stream() enter a tool-call loop when the model returns a tool use response. Without a maxSteps config, the loop runs until the model generates a non-tool response, the context window is exhausted, or the request times out. A research agent given a broad question can call search tools 40–80 times before deciding it has enough information.
  2. Workflow step retry amplification — Mastra Workflow steps accept a retryConfig option with attempts and delay. When a step that wraps an LLM call fails (rate limit, network error, model timeout), each retry re-runs the full LLM call. In the worst case, a 3-step workflow where each step has 3 retry attempts and exhausts all of them makes 3 steps × (1 original + 3 retries) = 12 LLM calls for what should have been 3.
  3. Memory semantic context stuffing — Mastra's Memory class maintains a semantic message store backed by a vector database. On every new user message, Mastra retrieves the topK most semantically similar past messages and injects them into the system prompt. As conversation history grows into the hundreds of messages, retrieved context can fill 30–50% of the LLM's context window before the user's current query even appears — every turn pays this fixed overhead regardless of whether historical context is relevant.
  4. Parallel workflow step fan-out without concurrency cap — Mastra Workflows support parallel step execution via .then() fan-out patterns where multiple steps run concurrently. When the set of parallel steps is determined at runtime from LLM output (e.g., "search these N sources simultaneously"), the concurrency is unbounded. Each parallel branch runs its own agent tool loop, multiplying costs by the number of branches before any result aggregation occurs.

Mastra cost structure (mid-2026): Mastra itself is open-source (MIT). Every cost comes from the model providers you configure — typically via the Vercel AI SDK's provider adapters. For a GPT-4o-backed research agent: each tool call turn costs ~$0.005 in input tokens (accumulated tool results) + ~$0.002 in output tokens (tool selection + reasoning). A 40-step research loop = $0.28 per run before any output synthesis. At 200 runs/day, that's $56/day — for one agent doing one task type — without a maxSteps guard.
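The arithmetic above can be turned into a small estimator so you can plug in your own numbers. This is a minimal sketch: the per-step prices are the rough figures quoted in this post, a flat per-step price is a simplification (input tokens grow as tool results accumulate), and the function names are illustrative rather than part of Mastra.

```typescript
// Rough per-step prices from the example above (USD); adjust for your model.
const INPUT_COST_PER_STEP = 0.005;
const OUTPUT_COST_PER_STEP = 0.002;

// Estimated cost of one run that takes `steps` tool-call turns.
function estimateRunCost(
  steps: number,
  inputCostPerStep: number = INPUT_COST_PER_STEP,
  outputCostPerStep: number = OUTPUT_COST_PER_STEP
): number {
  return steps * (inputCostPerStep + outputCostPerStep);
}

// Estimated daily cost for a given run volume.
function estimateDailyCost(steps: number, runsPerDay: number): number {
  return estimateRunCost(steps) * runsPerDay;
}

const unguardedRun = estimateRunCost(40); // roughly $0.28
const cappedRun = estimateRunCost(12); // roughly $0.084
const unguardedDay = estimateDailyCost(40, 200); // roughly $56
console.log(unguardedRun.toFixed(3), cappedRun.toFixed(3), unguardedDay.toFixed(2));
```

Comparing the 40-step and 12-step figures side by side is a quick way to see what a maxSteps cap is worth before you deploy it.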

Failure Mode 1: Agent Tool Loop Without maxSteps

Mastra agents follow the standard tool-call loop pattern: the model receives a prompt with available tools, returns either a tool call or a final text response, and if a tool call is returned, Mastra executes it and feeds the result back to the model for the next step. This continues until the model decides it's done or a hard limit is hit.

The key Mastra-specific risk is that maxSteps is optional and easy to miss when wiring up an agent for the first time. Unlike Python frameworks that often require explicit step configuration, Mastra's fluent agent API encourages building agents incrementally: you add tools to an existing agent, which can silently extend the number of steps the agent takes before reaching a terminal response.

TypeScript — agent without maxSteps guard

import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { searchTool, scrapePageTool, summarizeTool } from "./tools";

// No maxSteps — this agent will tool-loop until the model decides to stop
const researchAgent = new Agent({
  name: "ResearchAgent",
  instructions: `You are a research assistant. Use your tools to thoroughly
    research the given topic and provide a comprehensive answer.`,
  model: openai("gpt-4o"),
  tools: { searchTool, scrapePageTool, summarizeTool },
});

// This call can run 40-80 tool iterations on a broad research question
const result = await researchAgent.generate(
  "What are the main approaches to AI agent cost control in 2026?"
);

The problem compounds when you wire the agent into a request handler without a timeout. A research question that causes the agent to loop 60 times takes 4–6 minutes to complete. During that time, every intermediate tool call fires against your LLM API, accumulates into the context for the next step, and contributes to a final bill that arrives as a surprise at month-end rather than an error that fires at request time.
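One way to put a wall-clock bound on a run inside a request handler is a small deadline wrapper. The sketch below is framework-agnostic: withTimeout is a hypothetical helper, not a Mastra API. It hands an AbortSignal to the callee, but whether agent.generate() accepts such a signal depends on your Mastra version, so treat that part as an assumption to verify. Without cancellation support, the race only stops your handler from waiting; the underlying run may keep spending tokens until it hits its step cap.

```typescript
// Runs `run` with a deadline. The AbortSignal is passed through so the callee
// can stop work if it supports cancellation; the race alone only stops waiting.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects (and signals abort) once the deadline passes.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`Run exceeded ${timeoutMs} ms`));
    }, timeoutMs);
  });

  try {
    return await Promise.race([run(controller.signal), deadline]);
  } finally {
    // Always clear the timer so a fast run does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A timeout complements a step cap rather than replacing it: the cap bounds cost, the deadline bounds latency.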

Mastra exposes maxSteps directly in generate() and stream(). Set it, and also add a cost tracking callback to catch runs that approach the limit:

TypeScript — agent with maxSteps and usage tracking

import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { searchTool, scrapePageTool, summarizeTool } from "./tools";

const researchAgent = new Agent({
  name: "ResearchAgent",
  instructions: `You are a research assistant. Use your tools to research the
    given topic. Be concise — aim to answer within 8-12 tool calls.`,
  model: openai("gpt-4o"),
  tools: { searchTool, scrapePageTool, summarizeTool },
});

interface RunBudget {
  maxSteps: number;
  alertAtStep?: number;
  onBudgetAlert?: (step: number, totalSteps: number) => void;
}

async function guardedGenerate(
  agent: Agent,
  prompt: string,
  budget: RunBudget = { maxSteps: 12 }
) {
  let stepCount = 0;

  const result = await agent.generate(prompt, {
    maxSteps: budget.maxSteps,
    onStepFinish: ({ stepType, usage }) => {
      if (stepType === "tool-call") {
        stepCount++;
        if (
          budget.alertAtStep &&
          stepCount >= budget.alertAtStep &&
          budget.onBudgetAlert
        ) {
          budget.onBudgetAlert(stepCount, budget.maxSteps);
        }
      }
    },
  });

  return { result, stepCount };
}

// Usage: hard cap at 12 steps, alert at step 8
const { result, stepCount } = await guardedGenerate(
  researchAgent,
  "What are the main approaches to AI agent cost control in 2026?",
  {
    maxSteps: 12,
    alertAtStep: 8,
    onBudgetAlert: (step, max) => {
      console.warn(
        `[RunGuard] Agent at step ${step}/${max} — approaching limit`
      );
    },
  }
);

The onStepFinish callback gives you per-step telemetry: step type (tool-call or text), token usage, and latency. At step 8 of a 12-step budget, you know the agent is still looping and can log, alert, or throttle before the hard cap fires. The maxSteps cap itself returns a finishReason: "max-steps" in the result, which you can handle explicitly to return a partial response rather than an error.
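The guardedGenerate wrapper above counts steps but does not use the usage value it receives. A token-based budget can be layered on with a small tracker, sketched here. The payload shape (usage.totalTokens) is an assumption about what your Mastra and AI SDK versions pass to onStepFinish, and createStepTracker is a hypothetical helper.

```typescript
// Shape of the per-step payload this sketch relies on (an assumption; check
// the fields your Mastra / AI SDK version actually passes to onStepFinish).
interface StepInfo {
  stepType?: string;
  usage?: { totalTokens?: number };
}

interface StepTracker {
  onStepFinish: (step: StepInfo) => void;
  summary: () => { steps: number; totalTokens: number; alerted: boolean };
}

function createStepTracker(
  alertAtTokens: number,
  onAlert: (totalTokens: number) => void
): StepTracker {
  let steps = 0;
  let totalTokens = 0;
  let alerted = false;

  return {
    onStepFinish: ({ usage }) => {
      steps++;
      totalTokens += usage?.totalTokens ?? 0;
      // Fire the alert once, the first time the token budget is crossed.
      if (!alerted && totalTokens >= alertAtTokens) {
        alerted = true;
        onAlert(totalTokens);
      }
    },
    summary: () => ({ steps, totalTokens, alerted }),
  };
}
```

Pass tracker.onStepFinish as the onStepFinish option and read tracker.summary() after the call. Alerting once, rather than on every step past the threshold, keeps logs quiet on long runs.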

Mastra v0.3+ note: Starting in Mastra 0.3, the Agent constructor also accepts a top-level defaultGenerateOptions and defaultStreamOptions config that sets maxSteps as a default for all calls — so you don't have to pass it at every call site. Set it at agent construction time and override per-call only when you have a specific reason to allow more steps.

Failure Mode 2: Workflow Step Retry Amplification

Mastra Workflows compose steps into a directed graph using a chainable .step().then() API. Each step is a typed function that receives context from prior steps and returns output that can be consumed by downstream steps. Steps are the natural unit for wrapping LLM calls in a workflow — you might have a planStep that decomposes a task, an executeStep for each sub-task, and a synthesizeStep that combines results.

The retry problem arises from Mastra's retryConfig option on each step. Retrying is a sensible default for idempotent operations (database writes, HTTP requests to third-party services), but it's dangerous when applied to LLM calls. LLM calls don't fail the same way HTTP requests do: rate limit errors are transient and do retry correctly, but context-length errors, content-filter blocks, and authentication failures retry without any chance of success, burning API credits on every attempt.

TypeScript — workflow with retry amplification

import { Workflow, Step } from "@mastra/core/workflow";
import { z } from "zod";
import { mastra } from "./mastra";

const planStep = new Step({
  id: "plan",
  description: "Decompose the task into subtasks",
  execute: async ({ context }) => {
    const agent = mastra.getAgent("plannerAgent");
    // Each retry re-runs this LLM call — expensive on failure
    return await agent.generate(
      `Decompose this task into 3-5 subtasks: ${context.triggerData.task}`
    );
  },
  retryConfig: {
    attempts: 3,      // 3 retries on top of the original attempt
    delay: 1000,      // 1s delay between attempts
  },
});

const executeStep = new Step({
  id: "execute",
  description: "Execute each subtask with tool calls",
  execute: async ({ context }) => {
    const agent = mastra.getAgent("executorAgent");
    // Each retry re-runs a potentially 10-step tool loop
    return await agent.generate(context.planStep.output.subtasks.join("\n"), {
      maxSteps: 10,
    });
  },
  retryConfig: {
    attempts: 3,
    delay: 2000,
  },
});

const workflow = new Workflow({ name: "task-workflow" })
  .step(planStep)
  .then(executeStep)
  .commit();

The math is stark: if planStep keeps failing and exhausts all 3 retries, you've paid for up to 4 LLM calls for one plan. If executeStep does the same and each attempt runs a 10-step tool loop, you've paid for up to 40 tool-call steps (4 attempts × 10 steps) for what should have been 10. On a workflow with 4 LLM-backed steps each retrying twice, a bad day of transient failures can produce 3× your expected bill.
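The worst-case multiplication can be written down as a short calculation, which is handy for sanity-checking a retryConfig before it ships. A sketch with illustrative names; it assumes every attempt fails until the retry budget is exhausted and that each attempt makes its full number of model calls.

```typescript
interface StepRetryProfile {
  retries: number; // retries allowed on top of the original attempt
  callsPerAttempt: number; // LLM calls one attempt can make (1 for a single call)
}

// Worst case: the original attempt plus every allowed retry.
function worstCaseAttempts(retries: number): number {
  return 1 + retries;
}

// Upper bound on LLM calls across a workflow's steps.
function worstCaseLLMCalls(steps: StepRetryProfile[]): number {
  return steps.reduce(
    (total, step) => total + worstCaseAttempts(step.retries) * step.callsPerAttempt,
    0
  );
}

// Three single-call steps with 3 retries each: up to 12 calls instead of 3.
const threeStepWorkflow = worstCaseLLMCalls([
  { retries: 3, callsPerAttempt: 1 },
  { retries: 3, callsPerAttempt: 1 },
  { retries: 3, callsPerAttempt: 1 },
]);

// One step wrapping a 10-step tool loop with 3 retries: up to 40 calls.
const toolLoopStep = worstCaseLLMCalls([{ retries: 3, callsPerAttempt: 10 }]);
console.log(threeStepWorkflow, toolLoopStep);
```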

The fix has two parts: distinguish retryable errors from non-retryable ones, and cap retry attempts tightly for LLM-backed steps:

TypeScript — retry-aware step wrapper

import { Step } from "@mastra/core/workflow";
import { mastra } from "./mastra";

// Error categories that are safe to retry for LLM calls
const RETRYABLE_LLM_ERRORS = new Set([
  "rate_limit_exceeded",
  "service_unavailable",
  "timeout",
  "overloaded",
]);

function isRetryableLLMError(error: unknown): boolean {
  if (!(error instanceof Error)) return false;
  const msg = error.message.toLowerCase();
  return (
    msg.includes("rate limit") ||
    msg.includes("overloaded") ||
    msg.includes("service unavailable") ||
    msg.includes("timeout") ||
    // Vercel AI SDK error codes
    (error as any).code !== undefined &&
      RETRYABLE_LLM_ERRORS.has((error as any).code)
  );
}

// Non-retryable: context_length_exceeded, invalid_api_key, content_filter
// Retrying these wastes money — they won't succeed

function makeLLMStep(config: {
  id: string;
  description: string;
  execute: (ctx: any) => Promise<unknown>;
  maxRetries?: number;
}) {
  return new Step({
    id: config.id,
    description: config.description,
    execute: async (ctx) => {
      let lastError: unknown;
      const maxAttempts = 1 + (config.maxRetries ?? 1); // default: 1 retry only

      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          return await config.execute(ctx);
        } catch (err) {
          lastError = err;
          if (!isRetryableLLMError(err)) {
            // Non-retryable: fail immediately, don't waste tokens on retry
            throw err;
          }
          if (attempt < maxAttempts) {
            const delay = Math.min(1000 * Math.pow(2, attempt - 1), 8000);
            await new Promise((r) => setTimeout(r, delay));
          }
        }
      }
      throw lastError;
    },
    // No Mastra-level retryConfig — we handle retries ourselves with error discrimination
  });
}

const planStep = makeLLMStep({
  id: "plan",
  description: "Decompose the task into subtasks",
  maxRetries: 1, // only 1 retry for transient errors; fail fast on anything else
  execute: async ({ context }) => {
    const agent = mastra.getAgent("plannerAgent");
    return await agent.generate(
      `Decompose this task into 3-5 subtasks: ${context.triggerData.task}`
    );
  },
});

This pattern lets rate-limit errors retry (which is the right behavior) while immediately failing on context-length errors, content filter blocks, and auth errors (which would never succeed on retry). Limiting to 1 retry rather than 3 also halves your worst-case cost per step (2 attempts instead of 4). That design choice makes sense for most production workloads, where two consecutive transient failures indicate a deeper infrastructure problem worth surfacing rather than silently retrying through.
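The retry behavior can be checked without spending tokens by pulling the loop out into a generic helper and running it against stub functions. This is a sketch under that assumption: retryTransient, flakyStub, and isRateLimit are hypothetical names for illustration, not Mastra APIs, and the error classification is deliberately simpler than isRetryableLLMError.

```typescript
// Retries `fn` only when `isRetryable` says the error is transient.
async function retryTransient<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxRetries: number = 1,
  baseDelayMs: number = 1000
): Promise<T> {
  const maxAttempts = 1 + maxRetries;
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Non-retryable errors fail immediately.
      if (!isRetryable(err)) throw err;
      if (attempt < maxAttempts) {
        // Exponential backoff, capped at 8 seconds.
        const delay = Math.min(baseDelayMs * Math.pow(2, attempt - 1), 8000);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Stub that fails `failures` times with `message`, then succeeds.
function flakyStub(failures: number, message: string) {
  let calls = 0;
  const fn = async (): Promise<string> => {
    calls++;
    if (calls <= failures) throw new Error(message);
    return "ok";
  };
  return { fn, calls: () => calls };
}

const isRateLimit = (err: unknown): boolean =>
  err instanceof Error && err.message.toLowerCase().includes("rate limit");
```

A stub that throws "Rate limit exceeded" once should be called twice and succeed, while a stub that throws a context-length error should be called exactly once and rethrow.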

Failure Mode 3: Memory Semantic Context Stuffing

Mastra's Memory class provides agent memory through two mechanisms: a thread (ordered message history, like a chat buffer) and a semantic store (vector embeddings of messages, searchable by meaning). When you attach a Memory instance to an agent, Mastra automatically retrieves the topK most semantically similar messages from the store before each new LLM call and injects them into the system prompt as additional context.

This semantic retrieval is genuinely useful: it surfaces relevant context from earlier in a long conversation that a simple sliding window would drop. The cost problem is that the injected context grows proportionally with topK and the average message length, not with how much of that historical context is actually relevant to the current query. An agent with 200 stored messages and topK: 10 injects roughly the same context overhead on every single turn regardless of whether the current turn benefits from historical context at all.

TypeScript — Memory configuration with unbounded context growth

import { Memory } from "@mastra/memory";
import { openai } from "@ai-sdk/openai";
import { PgVector } from "@mastra/pg";

// Memory instance shared across all threads for this agent
const memory = new Memory({
  provider: new PgVector({ connectionString: process.env.DATABASE_URL }),
  embedder: openai.embedding("text-embedding-3-small"),
  options: {
    lastMessages: 20,   // include last 20 messages as recent history
    semanticRecall: {
      topK: 10,         // retrieve 10 semantically similar past messages
      messageRange: 2,  // include 2 surrounding messages per match
    },
  },
});

// With 500 stored messages, each new turn injects:
// - Last 20 messages as recent history (fixed overhead)
// - 10 semantically similar messages × (1 + 2×2 surrounding) = 50 extra messages
// That's 70 messages of context injected before the user's question appears.
// At ~200 tokens/message average, that's 14,000 tokens of overhead per turn.

The growth pattern is insidious: early in a conversation, the overhead is manageable. After 50 messages, the semantic store has enough content that every topK hit matches confidently, and the retrieved context stabilizes at its maximum overhead. After 200 messages, the embedding calls to update the store (triggered by every stored message) start adding per-message latency. The total cost per turn = (embedding cost per message) + (retrieval embedding cost) + (injected context tokens × LLM price), all of which scale independently.

Three levers to control Memory cost in Mastra:

TypeScript — bounded Memory configuration

import { Memory } from "@mastra/memory";
import { openai } from "@ai-sdk/openai";
import { PgVector } from "@mastra/pg";

// 1. Reduce topK and messageRange — each retrieved match pulls in surrounding
//    messages via messageRange, so the actual injection is topK × (1 + 2×range)
const boundedMemory = new Memory({
  provider: new PgVector({ connectionString: process.env.DATABASE_URL }),
  embedder: openai.embedding("text-embedding-3-small"),
  options: {
    lastMessages: 10,        // down from 20
    semanticRecall: {
      topK: 4,               // down from 10 — retrieves 4 semantically relevant
      messageRange: 1,       // down from 2 — 1 surrounding message per match
      // Total injected: 4 × (1 + 2×1) = 12 messages (vs. 50 in the naive config)
    },
  },
});

// 2. For tasks where history isn't relevant, disable semantic recall per-call
async function generateWithoutHistory(agent: any, prompt: string) {
  return await agent.generate(prompt, {
    memoryOptions: {
      lastMessages: 5,
      semanticRecall: false, // skip semantic retrieval entirely for this call
    },
  });
}

// 3. Implement a token budget for injected memory context
interface MemoryBudget {
  maxContextTokens: number;
  tokenEstimatorFn?: (text: string) => number;
}

function estimateTokens(text: string): number {
  // ~4 chars per token is a reasonable estimate for English text
  return Math.ceil(text.length / 4);
}

function trimMemoryToTokenBudget(
  messages: Array<{ role: string; content: string }>,
  budget: MemoryBudget
): Array<{ role: string; content: string }> {
  const estimate = budget.tokenEstimatorFn ?? estimateTokens;
  let total = 0;
  const trimmed: Array<{ role: string; content: string }> = [];

  for (const msg of messages) {
    const tokens = estimate(msg.content);
    if (total + tokens > budget.maxContextTokens) break;
    trimmed.push(msg);
    total += tokens;
  }

  return trimmed;
}

// Usage: pre-flight check on retrieved memory before it's injected
const MAX_MEMORY_TOKENS = 4_000; // cap retrieved history at 4K tokens

function applyMemoryBudget(
  retrievedMessages: Array<{ role: string; content: string }>
) {
  return trimMemoryToTokenBudget(retrievedMessages, {
    maxContextTokens: MAX_MEMORY_TOKENS,
  });
}

The most impactful change is reducing messageRange. A messageRange: 2 setting means each semantic hit pulls in 2 messages before and 2 messages after the matched message — so you get 5 messages per hit. With topK: 10, that's 50 messages of injected context. Drop to topK: 4, messageRange: 1 and you get 12 messages — a 76% reduction in injected context tokens with minimal loss of recall quality for most conversation patterns.
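The injection arithmetic is easy to compute for any configuration. The sketch below uses illustrative function names and the ~200 tokens per message average assumed earlier. It treats overlapping neighbours as distinct messages, so the result is an upper bound.

```typescript
interface RecallConfig {
  lastMessages: number;
  topK: number;
  messageRange: number;
}

// Messages pulled in by semantic recall: each hit plus `messageRange`
// neighbours on either side. Overlaps are not deduplicated (upper bound).
function recalledMessages(cfg: RecallConfig): number {
  return cfg.topK * (1 + 2 * cfg.messageRange);
}

// Total injected tokens per turn: recent history plus semantic recall.
function injectedTokensPerTurn(
  cfg: RecallConfig,
  avgTokensPerMessage: number = 200
): number {
  return (cfg.lastMessages + recalledMessages(cfg)) * avgTokensPerMessage;
}

const naiveConfig: RecallConfig = { lastMessages: 20, topK: 10, messageRange: 2 };
const boundedConfig: RecallConfig = { lastMessages: 10, topK: 4, messageRange: 1 };

console.log(injectedTokensPerTurn(naiveConfig), injectedTokensPerTurn(boundedConfig));
```

Counting lastMessages as well, the naive configuration injects 70 messages (about 14,000 tokens) per turn and the bounded one 22 messages (about 4,400 tokens).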

Semantic recall vs. recent history: Mastra's lastMessages and semanticRecall are additive — both are injected into the same system prompt. If you set both high, your agent pays for recent history and semantic retrieval on every turn. For conversational agents where continuity matters most, set lastMessages to a reasonable recent window (8–12 messages) and reduce topK significantly. For task agents where semantic recall is the primary memory access pattern, do the reverse — small lastMessages (3–5), moderate topK (4–6).

Failure Mode 4: Parallel Step Fan-Out Without Concurrency Cap

Mastra Workflows support concurrent step execution by branching the .then() chain into multiple parallel paths. Multiple steps added after a single parent step run concurrently by default. This is the right design for workflows where independent subtasks can be parallelized — it reduces wall-clock latency and makes full use of LLM API concurrency limits.

The cost problem arises when the set of parallel branches is determined at runtime by a prior LLM step. A planner step that returns "research these 12 sources simultaneously" spawns 12 concurrent branches, each running an agent with its own tool-call loop. The planner had no budget constraint on how many parallel sources it could suggest, and the workflow executor has no cap on how many branches it can spawn.

TypeScript — dynamic fan-out without concurrency cap

import { Workflow, Step } from "@mastra/core/workflow";
import { z } from "zod";
import { mastra } from "./mastra";

const plannerStep = new Step({
  id: "planner",
  description: "Identify research sources",
  outputSchema: z.object({
    sources: z.array(z.string()),
  }),
  execute: async ({ context }) => {
    const agent = mastra.getAgent("plannerAgent");
    // LLM freely decides how many sources to research
    // On a broad question, might return 12-15 sources
    const result = await agent.generate(
      `List all relevant sources to research: ${context.triggerData.topic}`,
      { maxSteps: 3 }
    );
    return { sources: JSON.parse(result.text).sources };
  },
});

// One step per source — all execute concurrently
// If planner returns 12 sources, 12 concurrent agent loops fire simultaneously
function buildResearchWorkflow(sources: string[]) {
  let wf = new Workflow({ name: "research" }).step(plannerStep);
  for (const source of sources) {
    wf = wf.then(
      new Step({
        id: `research-${source}`,
        execute: async () => {
          const agent = mastra.getAgent("researchAgent");
          return await agent.generate(`Research: ${source}`, { maxSteps: 8 });
        },
      })
    );
  }
  return wf.commit();
}

The fix is a two-layer guard: cap the planner's output count, and implement a concurrency semaphore on the execution layer so that even if the planner over-generates, only N branches run simultaneously:

TypeScript — fan-out with concurrency cap and planner output guard

import { Step } from "@mastra/core/workflow";
import { z } from "zod";
import { mastra } from "./mastra";

// Layer 1: cap the planner's output in its output schema + system prompt
const guardedPlannerStep = new Step({
  id: "planner",
  description: "Identify at most 5 research sources",
  outputSchema: z.object({
    sources: z.array(z.string()).max(5), // schema-level cap
  }),
  execute: async ({ context }) => {
    const agent = mastra.getAgent("plannerAgent");
    const result = await agent.generate(
      // Prompt-level instruction to constrain LLM output
      `List the 3-5 most relevant sources to research (max 5):
       ${context.triggerData.topic}`,
      { maxSteps: 3 }
    );
    const parsed = JSON.parse(result.text);
    // Runtime enforcement: slice even if model exceeded the cap
    return { sources: parsed.sources.slice(0, 5) };
  },
});

// Layer 2: semaphore for concurrent execution
class Semaphore {
  private available: number;
  private queue: Array<() => void> = [];

  constructor(concurrency: number) {
    this.available = concurrency;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    return new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) {
      next();
    } else {
      this.available++;
    }
  }
}

async function cappedParallelResearch(
  sources: string[],
  maxConcurrency: number = 3
): Promise<Array<{ source: string; result: string }>> {
  const sem = new Semaphore(maxConcurrency);
  const agent = mastra.getAgent("researchAgent");

  const tasks = sources.map(async (source) => {
    await sem.acquire();
    try {
      const result = await agent.generate(`Research: ${source}`, {
        maxSteps: 8,
      });
      return { source, result: result.text };
    } finally {
      sem.release();
    }
  });

  return Promise.all(tasks);
}

// Integration: use in a workflow step instead of dynamic step creation
const researchStep = new Step({
  id: "research",
  execute: async ({ context }) => {
    const { sources } = context.plannerStep;
    // Hard cap: even if planner somehow returns 12, we research at most 5, 3 at a time
    return await cappedParallelResearch(sources.slice(0, 5), 3);
  },
});

With a 5-source cap and 3-concurrent semaphore, the worst-case cost is 5 branches × 8 steps per branch = 40 tool calls — regardless of what the planner model suggests. Without the cap, the same workflow could fan out to 15 branches running 8 steps concurrently, producing 120 tool calls in a single workflow run. The semaphore also prevents rate-limit errors from triggering retry storms (Failure Mode 2) by keeping concurrent API load within your tier's limits.
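The planner steps above call JSON.parse directly on model output, which throws if the model wraps its answer in prose or returns a different shape. A defensive parser keeps the cap enforced even on malformed output. This is a sketch: parseSources is a hypothetical helper, and returning an empty list on bad input is one possible policy (failing the step is another).

```typescript
// Extracts at most `maxSources` unique, non-empty source strings from model
// output. Returns [] instead of throwing when the text is not the expected JSON.
function parseSources(text: string, maxSources: number = 5): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch (err) {
    return [];
  }

  const raw = (parsed as { sources?: unknown } | null)?.sources;
  if (!Array.isArray(raw)) return [];

  // Deduplicate and drop non-string or blank entries before applying the cap.
  const unique = new Set<string>();
  for (const item of raw) {
    if (typeof item === "string" && item.trim() !== "") {
      unique.add(item.trim());
    }
  }
  return Array.from(unique).slice(0, maxSources);
}
```

Deduplicating before slicing means a planner that repeats itself does not use up the branch budget on duplicate sources.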

Composite Mastra Cost Policy

The four guards combine into a single policy object that can be instantiated once and applied across your Mastra agents and workflows:

TypeScript

interface MastraCostPolicy {
  // Agent loop limits
  maxSteps: number;
  alertAtStep: number;

  // Retry limits
  maxLLMRetries: number;

  // Memory limits
  memoryTopK: number;
  memoryMessageRange: number;
  memoryLastMessages: number;
  maxMemoryContextTokens: number;

  // Fan-out limits
  maxParallelBranches: number;
  maxConcurrency: number;

  // Cross-cutting budget
  onBudgetAlert?: (context: { step: number; maxSteps: number }) => void;
}

const defaultPolicy: MastraCostPolicy = {
  maxSteps: 12,
  alertAtStep: 8,
  maxLLMRetries: 1,
  memoryTopK: 4,
  memoryMessageRange: 1,
  memoryLastMessages: 10,
  maxMemoryContextTokens: 4_000,
  maxParallelBranches: 5,
  maxConcurrency: 3,
};

function applyMastraPolicy(
  agentConfig: Record<string, any>,
  policy: MastraCostPolicy = defaultPolicy
): Record<string, any> {
  return {
    ...agentConfig,
    defaultGenerateOptions: {
      maxSteps: policy.maxSteps,
      onStepFinish: ({ stepType, usage }: any) => {
        // hook for telemetry
      },
    },
    defaultStreamOptions: {
      maxSteps: policy.maxSteps,
    },
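    // agentConfig.memory is passed through untouched by the spread above.
    // Spreading a Memory instance would copy only its own properties and drop
    // its methods, so memory limits are applied when the Memory is constructed.
  };
}

// Build the Memory `options` block from the policy
function memoryOptionsFromPolicy(policy: MastraCostPolicy = defaultPolicy) {
  return {
    lastMessages: policy.memoryLastMessages,
    semanticRecall: {
      topK: policy.memoryTopK,
      messageRange: policy.memoryMessageRange,
    },
  };
}

// Usage: wrap every agent definition with the policy
const researchPolicy: MastraCostPolicy = {
  ...defaultPolicy,
  maxSteps: 10,          // tighter cap for research agent
  memoryTopK: 3,         // less semantic recall for task-focused agent
};

const researchAgentConfig = applyMastraPolicy(
  {
    name: "ResearchAgent",
    instructions: "You are a research assistant. Be thorough but concise.",
    model: openai("gpt-4o"),
    tools: { searchTool, scrapePageTool, summarizeTool },
    // pgVector and embedder: the vector store and embedding model from Failure Mode 3
    memory: new Memory({
      provider: pgVector,
      embedder,
      options: memoryOptionsFromPolicy(researchPolicy),
    }),
  },
  researchPolicy
);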

Cost Impact Summary

| Failure mode | Unguarded cost (per run) | Guarded cost (per run) | Reduction |
|---|---|---|---|
| Agent tool loop (40 steps) | ~$0.28 (40 × tool call) | ~$0.084 (12-step cap) | 70% |
| Workflow retry storm (4 steps × 3 retries) | 4× expected cost on failure day | ~1.5× expected (1 retry, discriminated) | 62% |
| Memory context stuffing (200 messages) | ~14,000 injected tokens/turn (70 messages) | ~4,400 injected tokens/turn (lastMessages 10, topK 4, range 1: 22 messages) | 69% |
| Parallel fan-out (15 branches × 8 steps) | 120 tool calls/run | 40 tool calls/run (5 branches × 8 steps) | 67% |

The highest-leverage guard for most Mastra deployments is the Memory context stuffing fix — it applies to every single LLM call in a long-running agent session, not just to failure cases. The memory overhead tax is invisible during development (short sessions, few stored messages) but grows steadily in production as conversation histories accumulate. Addressing it proactively is cheaper than discovering it in your LLM bill at month-end.

Why TypeScript Agents Have a Distinct Cost Risk Profile

Python ML frameworks typically surface cost as a configuration concern at library initialization time — you set max_iterations when constructing a ReActAgent, or configure a budget in the framework's config object. The failure mode is clear: you forgot to set the parameter.

TypeScript agent frameworks like Mastra surface cost risks at runtime API boundaries. The agent loop is just an await agent.generate() call — it looks like any other async function call in your codebase. The retry logic is just a TypeScript config object, easy to copy from documentation without understanding its LLM-call multiplier effect. The memory retrieval happens transparently before each LLM call, invisible in the function signature. These failure modes are structurally hidden in ways that Python frameworks with explicit config objects are not.

This is why TypeScript agent cost control requires wrapping at the call site — function-level guards rather than framework-level config — and why the guards in this post take the form of wrappers and factories rather than configuration overrides. You're instrumenting async call sites, not configuring a class constructor.

Mastra observability integration: Mastra has built-in OpenTelemetry integration via @mastra/core/telemetry. If you're already collecting traces, add span attributes for agent.steps_taken, agent.total_tokens, and memory.injected_token_estimate at your call-site wrappers — this gives you per-run cost visibility in your existing trace dashboard without adding a separate monitoring layer.

FAQ

Does setting maxSteps too low degrade agent quality?

Yes, on complex tasks. The right maxSteps is task-dependent: a simple Q&A agent might need 3–5 steps, a research agent might need 10–15, a coding agent might need 20+ (read file, analyze, write, test, debug, iterate). The approach that works in practice is to set a conservative default (8–12 steps), log the finish reason in production, and increase the limit only for task types where you observe frequent early termination from hitting the cap. Agents that reach maxSteps frequently on a specific task type are telling you that task genuinely needs more steps — raise the limit specifically for that agent config, not globally.
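Deciding which task types deserve a higher limit comes down to measuring how often each one exhausts its budget. A minimal sketch, assuming you log one record per run; the RunRecord shape and capHitRateByTask are illustrative, not part of Mastra.

```typescript
interface RunRecord {
  taskType: string;
  stepsTaken: number;
  maxSteps: number;
}

// Fraction of runs per task type that used their full step budget.
function capHitRateByTask(runs: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { hits: number; count: number }>();
  for (const run of runs) {
    const entry = totals.get(run.taskType) ?? { hits: 0, count: 0 };
    entry.count++;
    if (run.stepsTaken >= run.maxSteps) entry.hits++;
    totals.set(run.taskType, entry);
  }

  const rates = new Map<string, number>();
  totals.forEach((entry, taskType) => {
    rates.set(taskType, entry.hits / entry.count);
  });
  return rates;
}
```

A task type whose rate stays high over many runs is a candidate for a larger per-agent limit. A rate near zero suggests the cap could be tightened.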

How does Mastra Memory's semantic recall compare to LangChain's memory implementations?

They're architecturally similar — both embed messages and retrieve by semantic similarity — but with different integration points. LangChain memory integrates at the chain level (you pass a memory object to the chain constructor, which calls it before each LLM invocation). Mastra memory integrates at the agent level (the Memory instance is part of the agent configuration). The practical cost difference is that Mastra's messageRange parameter pulls in surrounding messages per match, which amplifies injection size faster than LangChain's typical implementation. The remediation is similar: reduce topK, reduce message range, implement a token budget on injected context.

The Semaphore pattern for fan-out — is there a Mastra-native way to do this?

As of Mastra 0.3, there is no built-in concurrency cap on parallel workflow step execution. The Semaphore wrapper is the recommended pattern for production deployments. There is an open issue in the Mastra repository requesting a concurrency option on parallel step groups; if that ships, it will simplify the implementation. Until then, the JavaScript Semaphore class (or a library like p-limit, which provides the same semantics with a cleaner API) is the right tool. p-limit is a small, widely used dependency that does exactly this: import pLimit from "p-limit"; const limit = pLimit(3); const results = await Promise.all(sources.map(s => limit(() => agentCall(s))));

Should I use Mastra's retryConfig at all for LLM-backed steps?

Yes, but only for rate-limit errors. The Mastra-level retryConfig is simpler to configure than a custom wrapper, and it handles the basic case of a single transient rate-limit hit correctly. The risk is that it doesn't discriminate error types: it retries on context-length errors, content policy violations, and auth failures just as aggressively as it retries on rate limits. If you use retryConfig, keep attempts at 1 (a single retry on top of the original call) and add custom error logging in the step body to distinguish error types. If you need full error discrimination, replace retryConfig with the custom makeLLMStep wrapper shown in this post.

What's the fastest single fix to apply to an existing Mastra application?

Add maxSteps: 12 (or whatever is appropriate for your task type) to every agent.generate() and agent.stream() call that doesn't already have one. This is a one-line change per call site that immediately eliminates the worst-case cost scenario for unbounded tool loops. Next, reduce semanticRecall.topK from its default to 4 and messageRange to 1 if you're using Mastra Memory — this reduces per-turn context injection without any visible behavioral change in most agent interactions. Together, these two changes take under 30 minutes and deliver 50–70% cost reduction on the most common failure cases.

Automatic Mastra cost guards

RunGuard wraps your Mastra agents and workflows with production-grade circuit breakers — maxSteps enforcement, retry discrimination, memory context budgets, and parallel fan-out caps. TypeScript SDK, one install, no Mastra fork required.
