Microsoft Copilot Studio Cost Control: Loop Detection and Budget Enforcement in Production

Microsoft Copilot Studio (formerly Power Virtual Agents) ships with four distinct execution layers: topic-based dialog flows, Power Automate action calls, generative AI knowledge grounding, and the autonomous agent AI orchestrator introduced in 2025. Each layer adds real capability — and a distinct failure mode that can spiral into unexpected costs. What none of these layers includes is a runtime circuit breaker. There is no mechanism that detects a topic redirect cycle while it's happening, caps the number of times Power Automate retries a failing flow mid-session, limits how many knowledge search calls the generative AI node fires, or stops an autonomous agent's tool loop from running past a safe threshold.

The cost picture is also more nuanced than on most other agent platforms. Copilot Studio bills on Power Platform capacity: a Message capacity model charges per message across all billed sessions, and generative AI usage burns Azure OpenAI tokens through the Microsoft tenant. A runaway autonomous agent session, one that fires 40 tool calls before hitting an unhandled error, can consume the per-message budget of dozens of normal sessions, and the Azure OpenAI bill attached to those calls arrives separately through Pay-As-You-Go AI Builder credits. Enterprise teams running Copilot Studio at scale have seen single test sessions consume what should be a week's conversation budget.

This post covers four failure modes specific to Copilot Studio's architecture, with complete implementations of guards for each one. The examples mix Power Fx (Copilot Studio's formula language for topic variables), TypeScript (for custom connector middleware), and Power Automate expression syntax — because Copilot Studio is a no-code-first platform where the guard logic must live in whichever layer you can actually intercept. If you'd rather not maintain guard code across multiple execution layers, the final section shows the RunGuard integration pattern that works from a single custom action wired into any topic or autonomous agent.

How Microsoft Copilot Studio's execution model works

Copilot Studio has two distinct execution modes that coexist in the same bot:

Classic dialog flow mode — the bot matches user input to a Topic via trigger phrases or intent recognition. Each Topic is a graph of nodes: Message, Question, Condition, Set Variable, Redirect (go to another topic), and Action (call a Power Automate flow or HTTP connector). The bot walks nodes until it reaches an End Conversation or Redirect node, then follows the redirect. There is no built-in cycle detection in topic redirects.

Autonomous agent mode (GA 2025) — an AI orchestrator (GPT-4o via Azure OpenAI) receives the user's goal, decides which registered Actions (connectors, flows, external APIs) to call, executes them in a ReAct loop, and synthesizes a final answer. This is functionally equivalent to the tool-call loops in LangGraph or Pydantic AI, with the same infinite-loop failure modes. The orchestrator calls up to a platform-imposed iteration cap, but that cap is generous by default and does not enforce semantic loop detection.

Both modes can call Power Automate flows as actions. Power Automate flows have their own retry policies and failure modes that are independent of the Copilot Studio bot's execution state — which creates the compound retry failure described below.

The gap: Copilot Studio's topic engine has no redirect-cycle detection. The autonomous orchestrator has an iteration cap but no semantic loop detection. Power Automate action retries are configured per-flow, not per-session. None of these mechanisms communicate with each other — a session-level budget that spans all three layers doesn't exist on the platform.

Failure mode 1: Topic redirect cycle

Topic redirect cycles are the most common Copilot Studio failure for teams that have built large bot configurations with dozens of topics. A redirect cycle forms when Topic A has a Redirect node pointing to Topic B, and Topic B has a Redirect node, either explicit or on its error-handling path, that eventually points back to Topic A. Left unguarded, the bot keeps executing the cycle until a platform limit stops it, and every pass that sends a response to the user consumes a billed message.

The cycle is rarely intentional. It typically emerges when:

  • The Fallback topic (triggered when no topic matches) redirects to a "Help Menu" topic for guidance, and the Help Menu topic's "I need more help" branch redirects back to Fallback.
  • An Escalate topic fails to reach a live agent (e.g., outside business hours) and redirects to a "Sorry, try again" topic, which redirects to the intent that triggers Escalate again.
  • A disambiguation topic (triggered when multiple topics match) redirects to one of the matched topics, which under certain conditions redirects back to disambiguation.

The guard requires tracking redirect depth inside the conversation session. Copilot Studio topics have access to session-scoped variables via Power Fx. A redirect counter variable incremented at the start of each topic and checked before any Redirect node fires will catch cycles before they run to the session turn limit.

In Copilot Studio's topic designer, add a Set Variable node at the very start of each topic that participates in redirects:

// Power Fx formula in "Set Variable" node at topic entry
// Variable name: Global.TopicRedirectDepth (Global scope persists across topic redirects)

Set(
  Global.TopicRedirectDepth,
  If(
    IsBlank(Global.TopicRedirectDepth),
    0,
    Global.TopicRedirectDepth + 1
  )
)

// Then immediately add a Condition node:
// IF Global.TopicRedirectDepth >= 5
//   → Send message: "I've run into a loop. Let me reset and start over."
//   → Set Variable: Global.TopicRedirectDepth = 0
//   → End Conversation (or redirect to a designated recovery topic)
// ELSE
//   → Continue normal topic flow

This pattern needs two supporting steps. First, reset Global.TopicRedirectDepth to 0 at the very start of any "entry-point" topic — one that the user directly triggers rather than one that is redirected to. Second, reset the counter at every End Conversation node so that subsequent turns in the same session begin fresh. Without both resets, a legitimate multi-topic conversation (user navigates help → billing → account) will falsely trip the guard after five hops.
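The interaction between the counter and the two resets is easier to check outside the topic designer. The following is a minimal TypeScript model of that logic, not Copilot Studio code: `RedirectDepthModel`, `enterTopic`, and `entriesBeforeTrip` are hypothetical names, and the model assumes that an entry-point topic sets the counter to 0 instead of incrementing it.

```typescript
// Hypothetical in-memory model of the Global.TopicRedirectDepth logic.
// It only mirrors the Power Fx counter; nothing here runs inside Copilot Studio.
const MAX_REDIRECT_DEPTH = 5;

class RedirectDepthModel {
  private depth: number | undefined; // undefined plays the role of IsBlank()

  // Called at the start of every topic. Returns false when the guard trips.
  enterTopic(isEntryPoint: boolean): boolean {
    if (isEntryPoint) {
      this.depth = 0; // assumed reset for topics the user triggers directly
    } else {
      this.depth = this.depth === undefined ? 0 : this.depth + 1;
    }
    if (this.depth >= MAX_REDIRECT_DEPTH) {
      this.depth = 0; // same reset the Condition node performs
      return false;
    }
    return true;
  }

  // Called at every End Conversation node.
  endConversation(): void {
    this.depth = 0;
  }
}

// Counts how many topic entries succeed before the guard trips on a cycle
// that starts from a user-triggered topic and then only redirects.
function entriesBeforeTrip(model: RedirectDepthModel, limit = 100): number {
  let entries = 0;
  if (!model.enterTopic(true)) return entries;
  entries++;
  while (entries < limit && model.enterTopic(false)) entries++;
  return entries;
}
```

Under these assumptions a pure redirect cycle is stopped on the sixth topic entry, while a session made of several short user-triggered tasks never reaches the limit because each task starts from a reset.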

For complex bots with 50+ topics, maintaining the counter manually across every topic is error-prone. The more robust approach is to implement the counter inside a dedicated Topic Guard topic that every other topic redirects through as its first action:

// Topic: "_GuardEntry" (system topic, not user-reachable)
// Called via Redirect from every other topic's first node

// Input variable: Topic.CallerTopicName (string, passed by caller)
// Output variable: Topic.GuardPassed (boolean)

// Node 1: Set Variable
Set(
  Global.TopicRedirectDepth,
  If(IsBlank(Global.TopicRedirectDepth), 0, Global.TopicRedirectDepth + 1)
)
Set(
  Global.TopicCallChain,
  Concatenate(
    If(IsBlank(Global.TopicCallChain), "", Concatenate(Global.TopicCallChain, " → ")),
    Topic.CallerTopicName
  )
)

// Node 2: Condition
// IF Global.TopicRedirectDepth >= 5 OR Len(Global.TopicCallChain) > 200
//   Set Topic.GuardPassed = false
//   [Send admin alert via Power Automate flow with Global.TopicCallChain]
// ELSE
//   Set Topic.GuardPassed = true

// Calling topic checks Topic.GuardPassed before continuing its main flow

The Global.TopicCallChain string captures the redirect path — when the guard trips, this string is invaluable for debugging: you can see exactly which topics formed the cycle without needing to reconstruct it from turn-by-turn logs.
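Once the chain string reaches an alert or a log, pulling the cycle out of it can be automated. The sketch below assumes the chain uses the same " → " separator as the Concatenate formula above; `findRedirectCycle` is a hypothetical helper, not part of any platform API.

```typescript
// Hypothetical helper: find the first repeated topic in a call chain string
// built the way Global.TopicCallChain is built (topic names joined by " → ").
function findRedirectCycle(callChain: string): string[] | null {
  const topics = callChain
    .split(" → ")
    .map(t => t.trim())
    .filter(t => t.length > 0);

  const firstSeenAt = new Map<string, number>();
  for (let i = 0; i < topics.length; i++) {
    const previous = firstSeenAt.get(topics[i]);
    if (previous !== undefined) {
      // The cycle runs from the first visit up to and including the repeat.
      return topics.slice(previous, i + 1);
    }
    firstSeenAt.set(topics[i], i);
  }
  return null; // no topic appears twice
}

const cycle = findRedirectCycle("Greeting → Fallback → Help Menu → Fallback → Help Menu");
console.log(cycle ? cycle.join(" → ") : "no cycle");
```

The helper only reports the first repeated topic, which is usually enough to identify the pair of Redirect nodes that need a guard condition.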

Failure mode 2: Power Automate flow retry storm

When a Copilot Studio action node calls a Power Automate flow, the platform's retry behavior operates at two independent layers: the Copilot Studio action retry and the Power Automate flow's own retry policy. When both are active, a single failing action call can generate many more HTTP requests to the downstream service than you expect.

The compound failure works like this: a Copilot Studio Action node is configured with "Retry on failure: 2 times." The Power Automate flow it calls has an HTTP action with the default retry policy (4 retries on 429/503). The downstream API — an internal knowledge system, say — returns 503 due to a temporary outage. The HTTP action retries 4 times (5 calls total). All 5 fail. The Power Automate flow fails. Copilot Studio retries the action. The same 5 calls happen again. Copilot Studio retries once more. 5 calls again. You've made 15 calls to a system that was down, burned 3× the token budget for the full session, and the user still got an error.

If the downstream call triggers expensive operations — like spawning an AI summarization job or querying a paid external API — the cost is 15× what a single failed call would cost. In an autonomous agent scenario where the orchestrator tries the same action after a transient error, the problem compounds further: the orchestrator may call the action 3–5 times, each time triggering the compound retry stack.
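The multiplication behind these numbers can be written down directly. This is a small illustrative calculation, with parameter names chosen for the example and not taken from any platform setting.

```typescript
// Worst-case number of downstream HTTP calls when every attempt fails.
// Parameter names are illustrative, not Copilot Studio or Power Automate settings.
function worstCaseDownstreamCalls(
  studioRetries: number,     // retries configured on the Copilot Studio action
  flowHttpRetries: number,   // retries in the flow's HTTP retry policy
  orchestratorAttempts = 1   // times an autonomous orchestrator re-invokes the action
): number {
  const flowRuns = 1 + studioRetries;          // initial run plus retries
  const callsPerFlowRun = 1 + flowHttpRetries; // initial call plus retries
  return orchestratorAttempts * flowRuns * callsPerFlowRun;
}

console.log(worstCaseDownstreamCalls(2, 4));    // the scenario above
console.log(worstCaseDownstreamCalls(2, 4, 5)); // orchestrator re-invokes 5 times
```

With the orchestrator re-invoking the action 3 to 5 times, the same retry settings produce 45 to 75 calls against a service that is already failing.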

The guard lives in the Power Automate flow. Add an attempt counter using Initialize variable actions and a Do Until loop that enforces a hard cap on attempts, with the HTTP action's own retry policy disabled so that the loop is the only thing that retries:

// Power Automate flow: Guard wrapper around external API calls
// Add these actions BEFORE the HTTP action that calls your external system

// Action 1: Initialize variable (add at flow start, run once)
// Name: var_attempt_count
// Type: Integer
// Value: 0

// Action 2: Initialize variable
// Name: var_max_attempts
// Type: Integer
// Value: 3  ← total attempts across all retries, both layers

// Action 3: Do Until loop
// Condition: @greaterOrEquals(variables('var_attempt_count'), variables('var_max_attempts'))
// Limit: Count = 3, Timeout = PT30S

  // Inside loop — Action 3a: Increment counter
  // Set variable var_attempt_count = @{add(variables('var_attempt_count'), 1)}

  // Inside loop — Action 3b: HTTP action (your actual API call)
  // Method: POST
  // URI: @{triggerBody()?['api_endpoint']}
  // Retry Policy: Type = None  ← IMPORTANT: disable Power Automate's built-in retry
  //                               so the Do Until loop is the single retry controller
  // Run After: configure to run on Success, Failure, Skipped, TimedOut

  // Inside loop — Action 3c: Condition
  // IF HTTP action status code is 2xx
  //   → Set variable var_result = HTTP body
  //   → Terminate (Succeeded)
  // ELSE IF @{variables('var_attempt_count')} >= @{variables('var_max_attempts')}
  //   → Compose error payload: {"error": "max_retries_exceeded", "attempts": @{variables('var_attempt_count')}, "last_status": @{outputs('HTTP_action')?['statusCode']}}
  //   → Terminate (Failed) with error message from Compose
  // ELSE
  //   → Delay: 2^(var_attempt_count) seconds (exponential backoff)
  //   → Continue loop

The critical configuration is setting the Power Automate HTTP action's retry policy to None and letting the Do Until loop be the sole retry controller. This collapses the two-layer retry system into one, making the total attempt count predictable. The var_max_attempts variable can be parameterized via the flow's trigger body, allowing the calling Copilot Studio topic to pass a stricter limit for actions that are known to be expensive.
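The branch logic inside the loop is simple enough to express as a pure function, which makes the attempt count easy to reason about before building the flow. The sketch below mirrors the three branches of the Condition action; `decideNextStep` and `simulate` are hypothetical names, and the backoff of 2^attempt seconds follows the delay described above.

```typescript
type LoopDecision =
  | { action: "succeed" }
  | { action: "fail"; error: "max_retries_exceeded"; attempts: number; lastStatus: number }
  | { action: "retry"; delaySeconds: number };

// Decide what happens after attempt number `attempt` (1-based) returned `statusCode`.
function decideNextStep(attempt: number, maxAttempts: number, statusCode: number): LoopDecision {
  if (statusCode >= 200 && statusCode < 300) {
    return { action: "succeed" };
  }
  if (attempt >= maxAttempts) {
    return { action: "fail", error: "max_retries_exceeded", attempts: attempt, lastStatus: statusCode };
  }
  return { action: "retry", delaySeconds: Math.pow(2, attempt) }; // 2s, 4s, 8s, ...
}

// Replays a fixed sequence of status codes and reports how many calls were made.
function simulate(statuses: number[], maxAttempts: number): { calls: number; outcome: string } {
  for (let attempt = 1; attempt <= statuses.length; attempt++) {
    const decision = decideNextStep(attempt, maxAttempts, statuses[attempt - 1]);
    if (decision.action !== "retry") {
      return { calls: attempt, outcome: decision.action };
    }
  }
  return { calls: statuses.length, outcome: "retry" };
}

console.log(simulate([503, 503, 503, 503, 503], 3));
console.log(simulate([503, 200], 3));
```

Because the loop is the only retry controller, a service that stays down receives exactly `maxAttempts` calls per flow run, no matter how long the outage lasts.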

At the Copilot Studio side, disable the Action node's "Retry on failure" setting entirely once the flow has this guard in place — the flow now owns retries, and double-retrying creates the compound problem you're trying to eliminate.

Failure mode 3: Generative AI knowledge search spiral

Copilot Studio's Knowledge feature (formerly Generative Answers) grounds responses by searching configured data sources — SharePoint sites, Dataverse tables, external URLs via Bing, or custom knowledge bases via the custom data connector. In the autonomous agent mode, the AI orchestrator decides when to call the knowledge search action and with what query. The failure mode emerges when the search returns results that partially satisfy the goal but leave semantic gaps: the orchestrator calls knowledge search again with a refined query, gets more partial results, refines again, and so on.

This spiral is more expensive than it looks. Each knowledge search call in autonomous agent mode goes through Azure OpenAI (to generate the search query from the goal context), executes the actual data source search (which may be a Bing API call or a Dataverse query), and then goes back through Azure OpenAI to determine whether the results are sufficient. That is three API calls per knowledge search step, two of them model completions. A spiral of 8–10 knowledge searches therefore makes 24–30 API calls: 16–20 GPT-4o completions plus 8–10 data source queries, some of which are Bing API calls if web search is enabled.
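Counting the three steps of each search separately gives a quick way to size a spiral. The function below is an illustrative tally that assumes exactly two completions and one data source query per search, as in the breakdown above; real sessions may differ.

```typescript
interface SpiralCallCount {
  completions: number;     // Azure OpenAI calls: query generation plus sufficiency check
  dataSourceCalls: number; // Bing, Dataverse, or custom connector searches
  totalApiCalls: number;
}

// Assumes two completions and one data source query per knowledge search.
function countSpiralCalls(searches: number): SpiralCallCount {
  const completions = searches * 2;
  const dataSourceCalls = searches;
  return { completions, dataSourceCalls, totalApiCalls: completions + dataSourceCalls };
}

console.log(countSpiralCalls(8));
console.log(countSpiralCalls(10));
```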

The guard requires custom connector middleware. Copilot Studio's Knowledge feature doesn't expose a hook for pre-execution logic on individual knowledge search calls, but autonomous agents that use a custom data connector can intercept the search through the connector implementation. Build the spiral guard into your TypeScript connector:

// TypeScript: Custom Knowledge Connector with spiral guard
// Deploy as Azure Function or any HTTPS endpoint registered in Power Platform

import { Request, Response } from 'express';

interface SearchCall {
  queryTokens: Set<string>;
  timestamp: number;
}

const sessionSearchHistory = new Map<string, SearchCall[]>();
const WINDOW_SIZE = 5;
const SIMILARITY_THRESHOLD = 0.68;
const SESSION_TTL_MS = 30 * 60 * 1000; // 30 minutes

function tokenize(query: string): Set<string> {
  const stopWords = new Set(['the','a','an','of','in','for','to','and','or','is','are','what','how','tell','me','find','show','list','get']);
  return new Set(
    query.toLowerCase()
      .replace(/[^\w\s]/g, ' ')
      .split(/\s+/)
      .filter(t => t.length > 2 && !stopWords.has(t))
  );
}

function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 1;
  const intersection = [...a].filter(t => b.has(t));
  return intersection.length / union.size;
}

function checkKnowledgeSpiral(sessionId: string, query: string): void {
  const now = Date.now();

  // Evict expired sessions first, so a stale history is never reused below
  for (const [sid, calls] of sessionSearchHistory.entries()) {
    if (calls.length > 0 && now - calls[calls.length - 1].timestamp > SESSION_TTL_MS) {
      sessionSearchHistory.delete(sid);
    }
  }

  const history = sessionSearchHistory.get(sessionId) ?? [];

  const queryTokens = tokenize(query);
  const recentCalls = history.slice(-WINDOW_SIZE);

  const similarCount = recentCalls.filter(call =>
    jaccardSimilarity(queryTokens, call.queryTokens) >= SIMILARITY_THRESHOLD
  ).length;

  if (similarCount >= 3) {
    throw new Error(
      `[KnowledgeSpiralGuard] Search spiral detected in session ${sessionId}. ` +
      `${similarCount} of last ${recentCalls.length} queries are semantically similar ` +
      `(threshold: ${SIMILARITY_THRESHOLD}). Query: "${query}"`
    );
  }

  history.push({ queryTokens, timestamp: now });
  if (history.length > 20) history.shift(); // cap per-session history
  sessionSearchHistory.set(sessionId, history);
}

// Express route handler — this is what Power Platform calls
export async function knowledgeSearch(req: Request, res: Response): Promise<void> {
  const { session_id, query, data_source, max_results = 5 } = req.body as {
    session_id: string;
    query: string;
    data_source: string;
    max_results?: number;
  };

  if (!session_id || !query) {
    res.status(400).json({ error: 'session_id and query are required' });
    return;
  }

  try {
    // Check for spiral before executing the search
    checkKnowledgeSpiral(session_id, query);

    // Execute the actual knowledge search.
    // executeDataSourceSearch is your own data source client and is not shown here.
    const results = await executeDataSourceSearch(data_source, query, max_results);
    res.json({ results, session_id, query_processed: true });

  } catch (err: unknown) {
    const message = err instanceof Error ? err.message : 'Unknown guard error';
    if (message.startsWith('[KnowledgeSpiralGuard]')) {
      // Return a structured response the Copilot Studio orchestrator can act on
      res.status(409).json({
        error: 'knowledge_spiral_detected',
        message,
        suggestion: 'Synthesize an answer from results already retrieved rather than searching further.'
      });
    } else {
      res.status(500).json({ error: 'search_failed', message });
    }
  }
}

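// You also need a per-session token budget tracker.
// The knowledgeSearch handler above does not call checkBudget; wire it in
// once you have a token estimate for the retrieved results.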
interface SessionBudget {
  searchCount: number;
  estimatedTokensConsumed: number;
}

const sessionBudgets = new Map<string, SessionBudget>();
const MAX_SEARCHES_PER_SESSION = 8;
const MAX_TOKENS_PER_SESSION = 12000;

function checkBudget(sessionId: string, estimatedResultTokens: number): void {
  const budget = sessionBudgets.get(sessionId) ?? { searchCount: 0, estimatedTokensConsumed: 0 };
  budget.searchCount++;
  budget.estimatedTokensConsumed += estimatedResultTokens;
  sessionBudgets.set(sessionId, budget);

  if (budget.searchCount > MAX_SEARCHES_PER_SESSION) {
    throw new Error(
      `[KnowledgeBudgetGuard] Session ${sessionId} has called knowledge search ` +
      `${budget.searchCount} times (max: ${MAX_SEARCHES_PER_SESSION}).`
    );
  }
  if (budget.estimatedTokensConsumed > MAX_TOKENS_PER_SESSION) {
    throw new Error(
      `[KnowledgeBudgetGuard] Session ${sessionId} has consumed ~${budget.estimatedTokensConsumed} ` +
      `tokens in knowledge retrieval (max: ${MAX_TOKENS_PER_SESSION}).`
    );
  }
}

Register this connector in Power Platform's custom connector registry and point your Copilot Studio autonomous agent's knowledge action at it. The orchestrator calls this endpoint for every knowledge search — the spiral guard runs before the actual data source query and returns an HTTP 409 when a spiral is detected. Copilot Studio's autonomous agent orchestrator treats a 409 from an action as a handled failure and attempts to synthesize a final answer from its current context rather than retrying indefinitely.
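To get a feel for where the 0.68 threshold falls, it helps to score a few query pairs by hand. The sketch below repeats the tokenizer and Jaccard logic from the connector in a self-contained form; the three queries are made-up examples, and `querySimilarity` is a hypothetical convenience wrapper.

```typescript
const STOP_WORDS = new Set([
  "the", "a", "an", "of", "in", "for", "to", "and", "or", "is", "are",
  "what", "how", "tell", "me", "find", "show", "list", "get",
]);

function tokenizeQuery(query: string): Set<string> {
  return new Set(
    query.toLowerCase()
      .replace(/[^\w\s]/g, " ")
      .split(/\s+/)
      .filter(t => t.length > 2 && !STOP_WORDS.has(t))
  );
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const unionSize = new Set(Array.from(a).concat(Array.from(b))).size;
  if (unionSize === 0) return 1; // two empty token sets count as identical
  const shared = Array.from(a).filter(t => b.has(t)).length;
  return shared / unionSize;
}

function querySimilarity(q1: string, q2: string): number {
  return jaccard(tokenizeQuery(q1), tokenizeQuery(q2));
}

// Made-up queries for illustration.
const baseQuery = "refund policy for enterprise customers";
const reorderedQuery = "enterprise customers refund policy details";
const extendedQuery = "refund policy exceptions for enterprise customers in Europe";

console.log(querySimilarity(baseQuery, reorderedQuery)); // 4 shared tokens out of 5
console.log(querySimilarity(baseQuery, extendedQuery));  // 4 shared tokens out of 6
```

The reordered query scores 4/5 and counts as similar, while the extended query scores 4/6 and falls just under 0.68. A refinement that adds two new terms already escapes the guard at the default threshold, which is worth keeping in mind when tuning it.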

Failure mode 4: Autonomous agent tool call spiral

The autonomous agent AI orchestrator in Copilot Studio (powered by Azure OpenAI's function-calling capability) runs a ReAct loop across any registered Actions: Power Automate flows, HTTP connectors, and custom connectors. The loop is architecturally identical to the tool-call loops in LangGraph or OpenAI Agents SDK — the orchestrator generates a plan, picks a tool, runs it, observes the result, and iterates. The spiral failure mode is the same: when a tool returns partial data, an error, or an empty result, the orchestrator tries the same or similar tool with varied parameters rather than concluding that the goal cannot be satisfied with available tools.

In a Copilot Studio autonomous agent scenario, a common version of this is a CRM lookup spiral: the orchestrator tries to look up a customer record by name, gets multiple partial matches, calls the lookup tool again with the account ID extracted from the first match, gets a nested record structure, calls a related-records tool to expand one of the linked entities, and so on — each call making the context larger, each call burning another round of Azure OpenAI completions, none of the calls reaching a clean terminal state because the goal ("summarize everything about this customer") is underspecified.

The guard belongs in a shared middleware layer that wraps all custom connector actions. Implement it as a TypeScript class that all your custom connectors instantiate:

// TypeScript: Autonomous agent tool spiral guard for Copilot Studio connectors
// Import and use this in every custom connector action handler

export interface ToolCall {
  toolName: string;
  argsFingerprint: string; // stable fingerprint of the call arguments
  timestamp: number;
}

export class AgentToolGuard {
  private readonly maxCallsPerTool: number;
  private readonly maxTotalCalls: number;
  private readonly windowMs: number;
  private sessions = new Map<string, ToolCall[]>();

  constructor(options: {
    maxCallsPerTool?: number;
    maxTotalCalls?: number;
    windowMs?: number;
  } = {}) {
    this.maxCallsPerTool = options.maxCallsPerTool ?? 5;
    this.maxTotalCalls = options.maxTotalCalls ?? 20;
    this.windowMs = options.windowMs ?? 10 * 60 * 1000; // 10 minutes
  }

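  private fingerprint(args: Record<string, unknown>): string {
    // Stable fingerprint: sort top-level keys, then stringify.
    // The full string is kept: truncating it would make long argument sets that
    // share a prefix look identical and trip the exact-repeat check falsely.
    const sorted = Object.fromEntries(
      Object.entries(args).sort(([a], [b]) => a.localeCompare(b))
    );
    return JSON.stringify(sorted);
  }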

  check(sessionId: string, toolName: string, args: Record<string, unknown>): void {
    const now = Date.now();
    const history = (this.sessions.get(sessionId) ?? [])
      .filter(c => now - c.timestamp < this.windowMs);

    const totalCalls = history.length;
    if (totalCalls >= this.maxTotalCalls) {
      throw new ToolGuardError(
        `[ToolGuard] Session ${sessionId} has made ${totalCalls} total tool calls ` +
        `(max: ${this.maxTotalCalls}). Possible autonomous agent runaway.`,
        'total_calls_exceeded',
        { totalCalls, maxTotalCalls: this.maxTotalCalls }
      );
    }

    const toolCalls = history.filter(c => c.toolName === toolName);
    if (toolCalls.length >= this.maxCallsPerTool) {
      throw new ToolGuardError(
        `[ToolGuard] Tool '${toolName}' called ${toolCalls.length} times in session ${sessionId} ` +
        `(max: ${this.maxCallsPerTool}).`,
        'tool_calls_exceeded',
        { toolName, toolCallCount: toolCalls.length, maxCallsPerTool: this.maxCallsPerTool }
      );
    }

    // Detect exact argument repetition (same tool, same args, called again)
    const fp = this.fingerprint(args);
    const exactRepeat = toolCalls.filter(c => c.argsFingerprint === fp).length;
    if (exactRepeat >= 2) {
      throw new ToolGuardError(
        `[ToolGuard] Tool '${toolName}' called with identical arguments ${exactRepeat + 1} times ` +
        `in session ${sessionId}. Likely an orchestrator loop.`,
        'exact_repeat_detected',
        { toolName, exactRepeat, argsFingerprint: fp }
      );
    }

    history.push({ toolName, argsFingerprint: fp, timestamp: now });
    this.sessions.set(sessionId, history);
  }

  recordResult(sessionId: string, toolName: string, success: boolean): void {
    // Optional: track failure rates for cascade detection
    // Extend here to add per-tool failure rate tracking
    void sessionId; void toolName; void success;
  }

  clearSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}

export class ToolGuardError extends Error {
  constructor(
    message: string,
    public readonly code: string,
    public readonly detail: Record<string, unknown>
  ) {
    super(message);
    this.name = 'ToolGuardError';
  }
}

// Singleton to share across all connector handlers in the same deployment
export const toolGuard = new AgentToolGuard({
  maxCallsPerTool: 5,
  maxTotalCalls: 20,
  windowMs: 10 * 60 * 1000
});

Use the guard in every custom connector action handler:

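// In your connector's Express route handler:
// (crmClient is your own CRM client and is not shown here)
import { Request, Response } from 'express';
import { toolGuard, ToolGuardError } from './agent-tool-guard';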

export async function lookupCustomer(req: Request, res: Response): Promise<void> {
  const { session_id, customer_name, account_id } = req.body;

  try {
    // Guard check before any expensive operation
    toolGuard.check(session_id, 'lookup_customer', { customer_name, account_id });

    const record = await crmClient.findCustomer({ name: customer_name, id: account_id });
    toolGuard.recordResult(session_id, 'lookup_customer', true);

    res.json({ record, found: true });

  } catch (err) {
    if (err instanceof ToolGuardError) {
      toolGuard.recordResult(session_id, 'lookup_customer', false);
      // HTTP 409 tells Copilot Studio's orchestrator to stop and synthesize
      res.status(409).json({
        error: err.code,
        message: err.message,
        detail: err.detail
      });
    } else {
      res.status(500).json({ error: 'lookup_failed' });
    }
  }
}

The exactRepeat check catches the most common autonomous agent failure: the orchestrator calling the same action with the same parameters after getting a result it couldn't parse, effectively retrying rather than changing strategy. The guard tolerates one repeat, since a tool can legitimately be called twice with identical inputs, and rejects the third identical call, which is almost always a sign that the orchestrator is confused.
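The exact-repeat check is only as good as the fingerprint behind it. The guard's fingerprint sorts top-level keys only, so nested objects with reordered keys would look different. One possible alternative, sketched below under the assumption that arguments are plain JSON values, canonicalizes recursively and then applies a small FNV-1a style hash over UTF-16 code units. `canonicalize`, `fnv1a`, and `argsFingerprint` are hypothetical names.

```typescript
// Recursively sort object keys so nested argument order does not matter.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value;
}

// 32-bit FNV-1a style hash over UTF-16 code units, returned as 8 hex characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

function argsFingerprint(args: Record<string, unknown>): string {
  return fnv1a(JSON.stringify(canonicalize(args)));
}

console.log(
  argsFingerprint({ a: 1, b: { d: 1, c: 2 } }) ===
  argsFingerprint({ b: { c: 2, d: 1 }, a: 1 })
);
```

A 32-bit hash can collide, so for a guard that should never produce a false positive it may be safer to store the full canonical JSON string and keep the hash only for logging.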

Guard configuration reference

| Guard | Parameter | Default | When to adjust |
| --- | --- | --- | --- |
| Topic redirect depth | max_redirect_depth (the literal 5 in the Condition node) | 5 | Lower to 3 for simple bots where more than 3 topic hops indicates a cycle. Raise to 8 for complex enterprise bots with legitimate multi-topic workflows (e.g., an account management bot that moves through identity verification → account lookup → permissions check → action confirmation, which is four topics and three redirects without any cycle). |
| Power Automate retry | max_attempts (var_max_attempts in the flow) | 3 | Lower to 2 for calls to paid external APIs where cost-per-call is high. Raise to 5 only for calls to known-flaky internal services that recover within 2–3 retries. Keep Power Automate's HTTP action built-in retry set to None whenever this guard is active. |
| Knowledge spiral | SIMILARITY_THRESHOLD | 0.68 | Raise to 0.78 for bots that legitimately refine queries through multiple near-similar searches (research or due-diligence agents). Lower to 0.60 for bots with a narrow, well-defined knowledge domain where any query variation is a spiral signal. |
| Knowledge spiral | MAX_SEARCHES_PER_SESSION | 8 | Lower to 4 for focused Q&A bots where the expected search count per session is 1–2. Raise to 15 only for document-processing agents where a single user goal may genuinely require searching 10+ knowledge sources in sequence. |
| Tool call spiral | maxCallsPerTool | 5 | Lower to 3 for actions that are expensive or have per-call external costs (Bing search, paid CRM APIs). Raise to 8 for list-browsing actions where the orchestrator legitimately paginates through results before synthesizing. |
| Tool call spiral | maxTotalCalls | 20 | Tune to 3–4× the expected normal-case tool call count for a completed session. If a normal Copilot Studio autonomous session completes in 4–6 tool calls, set the limit to 16–20. A session approaching the limit has already done far more work than expected; the guard fires to prevent the long tail of runaway sessions. |
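For teams that keep these values in code, the defaults and the suggested tuning ranges can live in one typed object with a sanity check. The sketch below is one possible layout: `GuardConfig`, `SUGGESTED_RANGES`, and `validateGuardConfig` are hypothetical names, and the ranges are simply the lower and upper values mentioned in the reference above.

```typescript
interface GuardConfig {
  maxRedirectDepth: number;
  maxAttempts: number;
  similarityThreshold: number;
  maxSearchesPerSession: number;
  maxCallsPerTool: number;
  maxTotalCalls: number;
}

const DEFAULT_GUARD_CONFIG: GuardConfig = {
  maxRedirectDepth: 5,
  maxAttempts: 3,
  similarityThreshold: 0.68,
  maxSearchesPerSession: 8,
  maxCallsPerTool: 5,
  maxTotalCalls: 20,
};

// [min, max] taken from the "when to adjust" guidance; treated here as soft limits.
const SUGGESTED_RANGES: Record<keyof GuardConfig, [number, number]> = {
  maxRedirectDepth: [3, 8],
  maxAttempts: [2, 5],
  similarityThreshold: [0.6, 0.78],
  maxSearchesPerSession: [4, 15],
  maxCallsPerTool: [3, 8],
  maxTotalCalls: [1, Number.MAX_SAFE_INTEGER], // the guidance gives a rule, not a range
};

// Returns a list of problems; an empty list means the config is within the ranges.
function validateGuardConfig(config: GuardConfig): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(SUGGESTED_RANGES) as (keyof GuardConfig)[]) {
    const [min, max] = SUGGESTED_RANGES[key];
    const value = config[key];
    if (value < min || value > max) {
      problems.push(`${key}=${value} is outside ${min}..${max}`);
    }
  }
  if (config.maxTotalCalls < config.maxCallsPerTool) {
    problems.push("maxTotalCalls is below maxCallsPerTool");
  }
  return problems;
}

console.log(validateGuardConfig(DEFAULT_GUARD_CONFIG));
```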

RunGuard integration for Copilot Studio

Maintaining guard logic across Power Fx topic variables, Power Automate flow configurations, and TypeScript connector middleware is operationally expensive — each layer has its own deployment cycle and there's no unified view of which sessions tripped which guard. RunGuard provides all four checks as a single managed HTTP endpoint that any Copilot Studio component can call.

The most practical integration point is a dedicated RunGuard Check custom connector that your autonomous agent or topic registers as its first action before any other tool call in a session:

// TypeScript: RunGuard custom connector for Copilot Studio
// Register this as a custom connector in Power Platform

import axios from 'axios';
import { Request, Response } from 'express';

const RUNGUARD_API_KEY = process.env.RUNGUARD_API_KEY!;
const RUNGUARD_URL = 'https://api.runguard.dev/v1/check';

export interface RunGuardCheckRequest {
  app_id: string;
  session_id: string;
  tool_name: string;
  tool_args: Record<string, unknown>;
  depth?: number;
}

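export async function runguardCheck(req: RunGuardCheckRequest): Promise<void> {
  // A timeout or network error rejects the promise, so catch it here.
  // Otherwise the route handler would report it as a tripped guard.
  const resp = await axios.post(RUNGUARD_URL, req, {
    headers: { 'X-RunGuard-Key': RUNGUARD_API_KEY },
    timeout: 2000,
    validateStatus: () => true // handle status manually
  }).catch((err: unknown) => {
    console.warn('[RunGuard] Unreachable, failing open:', err instanceof Error ? err.message : err);
    return null;
  });
  if (resp === null) return; // fail open (don't block the agent)

  if (resp.status === 409) {
    const { reason, detail } = resp.data;
    const detailText = typeof detail === 'string' ? detail : JSON.stringify(detail);
    throw new Error(`[RunGuard] ${reason}: ${detailText}`);
  }
  if (resp.status >= 500) {
    // RunGuard is down: fail open (don't block the agent)
    console.warn('[RunGuard] Service unavailable, failing open:', resp.status);
  }
}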

export async function runguardRecordResult(
  sessionId: string,
  toolName: string,
  result: string,
  success: boolean
): Promise<string> {
  try {
    const resp = await axios.post(
      RUNGUARD_URL.replace('/check', '/record'),
      { app_id: 'copilot-studio-prod', session_id: sessionId, tool_name: toolName, result, success },
      { headers: { 'X-RunGuard-Key': RUNGUARD_API_KEY }, timeout: 2000 }
    );
    return resp.data?.result ?? result;
  } catch {
    return result; // fail open
  }
}

// Express route handler for the custom connector endpoint
// Register route: POST /runguard/check
export async function checkHandler(req: Request, res: Response): Promise<void> {
  const { session_id, tool_name, tool_args, depth = 0 } = req.body;
  try {
    await runguardCheck({
      app_id: 'copilot-studio-prod',
      session_id,
      tool_name,
      tool_args,
      depth
    });
    res.json({ passed: true });
  } catch (err: unknown) {
    const message = err instanceof Error ? err.message : 'Guard tripped';
    res.status(409).json({ passed: false, reason: message });
  }
}

In the Copilot Studio autonomous agent designer, add the RunGuard Check action as the first action in your agent's action sequence. Pass System.Conversation.Id as the session_id parameter. This is Copilot Studio's built-in conversation identifier and stays constant for the whole session, whereas System.Activity.Id changes with every message and would give each turn its own guard history. When RunGuard returns a 409, Copilot Studio's orchestrator receives a failed action result and stops the tool-call loop, allowing it to compose a final answer from whatever context it has accumulated.

RunGuard aggregates trip events into a 30-day dashboard showing trip frequency by guard type, session ID, tool name, and the complete call history that led to the trip. Slack alerts fire on every trip; PagerDuty integration is available on the Team plan. This unified view replaces the fragmented visibility you'd get from Power Automate run history, Copilot Studio conversation transcripts, and Azure Application Insights logs spread across three separate portals.

FAQ

We use Copilot Studio's classic dialog flow mode, not the autonomous agent. Do we still need loop guards?

Yes, but only the topic redirect cycle guard applies directly to classic dialog flow. The Power Automate retry guard also applies if your topics call flows that hit external APIs. The knowledge spiral and tool call spiral guards are specific to the autonomous AI orchestration mode and don't apply to handcrafted topic flows where every redirect path is designer-defined. For classic mode bots, start with the redirect depth counter — it's the guard with the highest ROI for classic Copilot Studio deployments because topic cycles are genuinely easy to create accidentally and hard to detect in transcripts after the fact.

Copilot Studio already has a "maximum turns" conversation limit. Why isn't that sufficient?

The maximum turns limit (configurable per bot, defaulting to a high value) counts conversation turns, meaning round-trips between the user and the bot. A topic redirect cycle doesn't necessarily consume user turns; topic-to-topic redirects that happen within a single turn's processing count as internal execution steps, not new conversation turns. The platform's turn limit is also a global session cap, not a per-topic or per-action guard, so it fires at the end of a long session rather than at the point where the cycle starts. By the time it fires, the cycle has already consumed its billed messages. A redirect depth counter trips as soon as the depth reaches 5, within the first few passes of the cycle, not after 50 turns.

Does the JavaScript/TypeScript custom connector approach work with Copilot Studio's generative answers (Knowledge) feature when the data source is SharePoint, not a custom connector?

Not directly. When the knowledge source is a built-in data source (SharePoint, Dataverse, public websites), Copilot Studio's generative AI node calls it through the platform's internal search pipeline — there's no interception point for custom middleware on the search itself. For built-in sources, your guard options are: (1) use the topic redirect depth guard to catch runaway orchestration at the conversation level; (2) implement a session call counter in a Power Automate flow that the generative AI node calls before each search via an action; or (3) use RunGuard's Copilot Studio action integration as a pre-check before the Generative Answers node fires. The custom connector approach described in failure mode 3 applies when you've built your own knowledge connector that replaces the built-in search — which is increasingly common for enterprise bots with proprietary data sources that can't go through SharePoint.

The depth guard uses Global.TopicRedirectDepth. Does this variable persist across user sessions, or does it reset when a new conversation starts?

Global-scoped variables in Copilot Studio persist for the lifetime of a conversation session — they reset when the conversation ends (or when the Copilot Studio session timeout triggers). They do not persist across separate user sessions. This is the correct behavior for the redirect depth guard: each new conversation starts with a fresh counter. The risk to watch for is bots configured with very long session timeouts (or persistent sessions with Microsoft 365 Copilot integrations) where a single session covers multiple distinct user tasks. In that case, add an explicit reset node at the start of any "entry" topic — the first topic a user reaches from a greeting or a new task request — so the counter resets at the start of each new task rather than accumulating across the full session duration.

Microsoft Copilot Studio charges per message for the conversation capacity model. How do we calculate the actual cost of a topic redirect loop?

Under the conversation capacity model, a "message" is typically defined as a single bot response sent to the user, not an internal execution step. However, Power Platform's billing does count certain action calls toward session consumption, and the per-session capacity burn depends on your organization's license tier. For the autonomous agent mode, the dominant cost is not the Copilot Studio message capacity but the Azure OpenAI token consumption billed through AI Builder credits. Each orchestrator iteration (tool selection, result synthesis) makes at least one GPT-4o completion call; a 20-iteration runaway session can consume 40–60 GPT-4o completions plus the tool call outputs fed back into context. At GPT-4o pricing (roughly $0.0025 per 1K input tokens and $0.01 per 1K output tokens at list price, depending on your Azure agreement), a single runaway session whose context grows to 50,000 tokens can cost $0.50–$2.00 in AI Builder credits before hitting any platform cap, because the accumulated context is re-sent as input on every completion. That adds up quickly when a bot is handling hundreds of daily sessions.
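A rough per-session estimate can be computed from the number of completions and the average tokens each one sends and receives. The function below is a back-of-the-envelope sketch: `estimateSessionCost` is a hypothetical helper, the per-1K prices are illustrative values from the range quoted above, and the token averages in the example are assumptions, so substitute the rates and averages from your own Azure agreement and telemetry.

```typescript
interface PricePer1K {
  input: number;  // USD per 1K input tokens (assumed value)
  output: number; // USD per 1K output tokens (assumed value)
}

// Estimates the model cost of one session from completion count and average token sizes.
function estimateSessionCost(
  completions: number,
  avgInputTokensPerCompletion: number,
  avgOutputTokensPerCompletion: number,
  price: PricePer1K
): number {
  const inputTokens = completions * avgInputTokensPerCompletion;
  const outputTokens = completions * avgOutputTokensPerCompletion;
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}

// Example: 50 completions, each re-sending about 10,000 tokens of context
// and producing about 200 tokens of output.
const assumedPrice: PricePer1K = { input: 0.0025, output: 0.01 };
const runawayCost = estimateSessionCost(50, 10000, 200, assumedPrice);
console.log(runawayCost.toFixed(2));
```

In this example the input side accounts for $1.25 of the roughly $1.35 total, which is why capping iterations does more for the bill than shortening answers.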

Stop runaway Copilot Studio agents before the bill lands

RunGuard covers all four Copilot Studio failure modes — topic redirect cycle detection, Power Automate retry control, knowledge search spiral detection, and autonomous agent tool call limits — as a single managed HTTP endpoint. One custom connector replaces guard code spread across Power Fx, Power Automate, and TypeScript, with a unified 30-day trip dashboard and Slack alerts on every event.
