Budget alerts and loop detection for Microsoft Semantic Kernel agents
Semantic Kernel is Microsoft’s enterprise AI SDK for orchestrating LLM calls, plugins, memory, and planners in .NET, Python, and Java. Its auto function-calling loop, in which SK invokes your plugins in sequence until the model declares the task complete, is powerful and, without guardrails, dangerous. A badly formed planner, a plugin that always returns a partial result, or a model that never reaches a stopping condition can burn through your Azure OpenAI credits in a single run. This page explains where SK’s cost risk lives and how to add a per-run budget cap and loop detector without changing the logic inside your SK plugins.
Where Semantic Kernel’s cost risk lives
SK’s primary agentic loop is the auto function-calling (AFC) pattern. You register plugins, pass a FunctionChoiceBehavior.Auto() to your kernel invocation, and SK iterates: call the model, extract tool invocations from the response, execute the plugins, append the results, call the model again. This continues until the model returns a response with no tool calls.
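The loop described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not SK's implementation: `run_afc_loop`, `scripted_model`, and the dictionary-shaped reply are hypothetical names chosen for the example, and the "model" is just a callable that looks at the message history.

```python
from typing import Callable

# Hypothetical reply shape: {"tool_calls": [(name, args), ...], "content": str}
ModelReply = dict


def run_afc_loop(
    model: Callable[[list], ModelReply],
    plugins: dict[str, Callable[..., str]],
    prompt: str,
    max_rounds: int = 100,
) -> tuple[str, int]:
    """Return (final_answer, model_calls). Illustrative only, not SK's code."""
    history = [("user", prompt)]
    for round_no in range(1, max_rounds + 1):
        reply = model(history)                  # one billed model call per round
        if not reply["tool_calls"]:             # stopping condition: no tool calls
            return reply["content"], round_no
        for name, args in reply["tool_calls"]:
            result = plugins[name](**args)      # execute the requested plugin
            history.append(("tool", f"{name} -> {result}"))
    return "stopped: round limit reached", max_rounds


def scripted_model(history: list) -> ModelReply:
    # Requests one search, then answers once a tool result is in the history.
    if any(role == "tool" for role, _ in history):
        return {"tool_calls": [], "content": "done"}
    return {"tool_calls": [("search", {"query": "sk"})], "content": ""}


answer, calls = run_afc_loop(scripted_model, {"search": lambda query: "hit"}, "find sk")
```

The only exit from the loop, apart from the round limit, is a reply with no tool calls. A model that keeps requesting tools keeps paying for rounds.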
Four patterns produce runaway loops in this architecture:
- A plugin that always provides partial results. If your SearchPlugin returns “found some results, may want to refine query” on every call, the model reads this as a signal to keep searching. The AFC loop never terminates because the stopping condition (“task is done”) never appears in the model’s context. Each iteration is a model call plus a plugin call; a 50-iteration run on GPT-4.1 can cost $15-25.
- A planner that creates a circular plan. SK’s Handlebars planner or Stepwise planner can produce a plan where step N feeds back into step N-2. The agent faithfully executes the plan. Without a loop detector watching the sequence of plugin calls, the agent will execute the cycle indefinitely.
- A retrieval plugin returning the same context. A RAG-backed plugin returns the same top-K chunks regardless of query reformulation. The model receives the same information, fails to make progress, reformulates the query, and gets the same results again. Pattern-based loop detection catches this — the plugin signature (name + canonical args + output hash) repeats.
- Retry middleware + upstream failure. SK supports Polly-based retry middleware. If an upstream API is down and returns 503, the retry policy re-executes the plugin’s call until its limit is reached. The failure is then reported back to the model, which costs a billed round-trip and often leads the model to propose the same tool call again. The Polly retry limit applies to the plugin call, not to the SK planning loop, so the model can propose the same failing tool call many more times than your retry count.
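To make the retry pattern concrete, here is a small counting sketch. It assumes the retry policy re-issues only the upstream call and that the model is consulted once per proposed tool call; `count_calls` and its parameters are hypothetical names used for illustration.

```python
def count_calls(model_proposals: int, retries_per_call: int) -> dict[str, int]:
    """Count the work done when an upstream API fails on every attempt."""
    upstream_attempts = 0
    model_round_trips = 0
    for _ in range(model_proposals):
        model_round_trips += 1  # the model proposes the tool call (billed)
        # The retry policy makes 1 initial attempt plus `retries_per_call` retries.
        for _ in range(1 + retries_per_call):
            upstream_attempts += 1  # every attempt fails, e.g. with a 503
        # Retries exhausted: the error goes back to the model, which may try again.
    return {
        "model_round_trips": model_round_trips,
        "upstream_attempts": upstream_attempts,
    }


totals = count_calls(model_proposals=10, retries_per_call=3)
```

Under these assumptions the retry limit bounds only the inner loop. The outer count, which is the one that costs model tokens, is bounded by nothing except the planning loop's own limits.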
What SK’s built-in limits miss
SK exposes a few knobs for controlling run length: you can set a maximum number of function invocations via FunctionChoiceBehavior options, and you can configure Polly retry limits on the HTTP client. These are count-based limits on individual operations, not pattern-based loop detectors.
Count-based limits have two failure modes:
- Too high: You set MaximumFunctionCallsPerRequest to 25 to handle complex tasks. A looping agent runs to 25 function calls before stopping. On GPT-4.1, 25 planning rounds for a mid-complexity task cost roughly $8. That’s per run, per bug, per malformed query your users can send.
- Too low: You set it to 5 to cap costs. A legitimately complex orchestration task that needs 8 plugin calls hits the ceiling and produces an incomplete result. Your agent is now unreliable on valid inputs.
Pattern-based detection solves this: let the agent run as many steps as it needs for legitimate work, and trip only when the same signature repeats — which is the definition of being stuck.
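A minimal sketch of pattern-based detection follows. It is not RunGuard's implementation: `call_signature` and `WindowLoopDetector` are hypothetical names, and the window and threshold defaults simply mirror the `loop_window=20, loop_threshold=3` values used later on this page. The signature combines the plugin name, key-sorted arguments, and a hash of the output.

```python
import hashlib
import json
from collections import Counter, deque


def call_signature(name: str, args: dict, output: str) -> str:
    """Signature = plugin name + canonical (key-sorted) args + output hash."""
    canonical_args = json.dumps(args, sort_keys=True, separators=(",", ":"))
    output_hash = hashlib.sha256(output.encode("utf-8")).hexdigest()[:16]
    return f"{name}|{canonical_args}|{output_hash}"


class WindowLoopDetector:
    """Trips when one signature appears `threshold` times in the last `window` calls."""

    def __init__(self, window: int = 20, threshold: int = 3) -> None:
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, name: str, args: dict, output: str) -> bool:
        """Record a call; return True if the agent looks stuck."""
        self.recent.append(call_signature(name, args, output))
        return Counter(self.recent)[self.recent[-1]] >= self.threshold


detector = WindowLoopDetector(window=20, threshold=3)
# Eight distinct calls: legitimate work, never trips.
distinct = [detector.record("search", {"query": f"q{i}"}, f"result {i}") for i in range(8)]
# The same call with the same output three times: trips on the third.
repeated = [detector.record("search", {"query": "stuck"}, "same chunks") for _ in range(3)]
```

Because the arguments are part of the signature, a reformulated query with identical output would not repeat under this exact scheme. One possible variation is to key on plugin name and output hash only when you want to catch that case.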
Wrapping SK plugins with RunGuard in Python
In Python, SK plugins are classes with methods decorated with @kernel_function. You wrap the underlying Python function with guard() before the SK decorator sees it:
    from semantic_kernel import Kernel
    from semantic_kernel.functions import kernel_function
    from runguard import guard, BudgetTracker, LoopDetectedError

    tracker = BudgetTracker(max_usd=3.0)


    class SearchPlugin:
        def __init__(self, search_api):
            # search_api: your own client exposing an async query(str) -> str method
            self._search_api = search_api

        @kernel_function(name="SearchWeb", description="Search the web for information")
        @guard(budget=tracker, loop_window=20, loop_threshold=3)
        async def search_web(self, query: str) -> str:
            # RunGuard checks before this body executes;
            # SK auto function-calling sees the result as usual.
            return await self._search_api.query(query)


    async def main(search_api) -> None:
        # Register the plugin and run normally
        kernel = Kernel()
        kernel.add_plugin(SearchPlugin(search_api), plugin_name="Search")

        # Handle the trip at the invocation level
        try:
            result = await kernel.invoke(
                plugin_name="Search",
                function_name="SearchWeb",
                query="AI regulations 2026",
            )
            print(result)
        except LoopDetectedError as e:
            print(f"SK loop detected on plugin {e.tool_name}: {e.count} repeats")
            # Re-raise so the caller can alert or return a partial result
            raise

    # Run from your entry point with asyncio.run(main(your_search_api))
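Python applies stacked decorators bottom-up, so @guard wraps the method first and @kernel_function then registers the guarded wrapper. RunGuard therefore sees every invocation that SK’s AFC loop generates, regardless of which planner created the plan. This assumes guard() preserves the wrapped function’s signature (for example via functools.wraps), because @kernel_function reads parameter names and type hints from the function it decorates.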
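The decorator ordering can be checked with two toy decorators. Both are hypothetical stand-ins written for this example: `toy_guard` is not RunGuard's `guard`, and `toy_register` only imitates the way a registering decorator inspects the function it receives.

```python
import asyncio
import functools
import inspect


def toy_guard(func):
    """Hypothetical guard: counts calls, then delegates to the wrapped coroutine."""
    @functools.wraps(func)  # keeps __name__, __doc__, and the visible signature
    async def wrapper(*args, **kwargs):
        wrapper.calls += 1  # a real guard would check budget and loop state here
        return await func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


def toy_register(func):
    """Hypothetical registering decorator: records the parameter names it sees."""
    func.registered_params = list(inspect.signature(func).parameters)
    return func


class SearchPlugin:
    @toy_register  # applied second: receives the guarded wrapper
    @toy_guard     # applied first: wraps the raw method
    async def search_web(self, query: str) -> str:
        return f"results for {query}"


plugin = SearchPlugin()
result = asyncio.run(plugin.search_web("sk"))
```

Because the outer decorator receives the wrapper, every call routed through the registered function passes through the guard first. `functools.wraps` is what lets the outer decorator still see `self` and `query` instead of `*args, **kwargs`.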
Wrapping SK plugins with RunGuard in TypeScript
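Microsoft’s SK SDKs target .NET, Python, and Java, so the example below assumes a TypeScript binding that exposes a @kernelFunction decorator under the package name shown. Adjust the import to whatever binding you use; the wrapping pattern is the same.

    // TypeScript SK: plugins are classes with a @kernelFunction decorator
    import { KernelFunction, kernelFunction } from "@microsoft/semantic-kernel";
    import { guard, BudgetTracker, LoopDetectedError } from "runguard";

    const tracker = new BudgetTracker({ maxUsd: 3.0 });

    class SearchPlugin {
      // searchApi: your own client exposing query(string): Promise<string>
      constructor(
        private readonly searchApi: { query(q: string): Promise<string> }
      ) {}

      private readonly guardedSearch = guard(
        async (query: string): Promise<string> => {
          return await this.searchApi.query(query);
        },
        { budget: tracker, loopWindow: 20, loopThreshold: 3 }
      );

      @kernelFunction({
        description: "Search the web for information",
        parameters: [{ name: "query", description: "The search query" }],
      })
      async searchWeb(query: string): Promise<string> {
        // RunGuard's guard is called inside the SK kernel function
        return await this.guardedSearch(query);
      }
    }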
Budget alert: what happens when the ceiling is hit
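When BudgetTracker.max_usd is reached mid-run, RunGuard raises BudgetExceededError before the next plugin call executes. The exception bubbles up through SK’s AFC loop. Depending on the SK version, the kernel may wrap plugin exceptions in its own exception type (for example KernelInvokeException in Python), in which case the original error is available as the exception’s __cause__. If you catch it at the kernel.invoke() level, you can: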
- Return a graceful partial result to the caller with a “budget exceeded, partial result” note
- Send a Slack/PagerDuty alert with the accumulated cost and the task that triggered it
- Store the partial state so the run can be resumed with a higher ceiling if appropriate
- Log the event to your observability backend (Traceloop, W&B Weave, Arize Phoenix) for post-mortem analysis
The key property: the alert fires before the next call, not after the run completes. You get a real-time notification while you can still act on it, before a bad plan or a production edge case turns into a five-figure bill.
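The check-before-call behaviour and the partial-result handling can be sketched as follows. `SimpleBudget`, `BudgetExceeded`, and `run_steps` are hypothetical names for this example and do not reflect RunGuard's internals; the step costs are supplied by the caller rather than measured.

```python
class BudgetExceeded(Exception):
    """Hypothetical error carrying the spend at the moment the cap tripped."""

    def __init__(self, spent_usd: float, max_usd: float) -> None:
        super().__init__(f"spent ${spent_usd:.2f} of ${max_usd:.2f}")
        self.spent_usd = spent_usd
        self.max_usd = max_usd


class SimpleBudget:
    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        """Call before each plugin invocation; raises once the cap is reached."""
        if self.spent_usd >= self.max_usd:
            raise BudgetExceeded(self.spent_usd, self.max_usd)

    def add(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd


def run_steps(budget: SimpleBudget, step_costs: list[float]) -> dict:
    """Run steps until done or the budget trips; always return what completed."""
    completed = []
    try:
        for i, cost in enumerate(step_costs):
            budget.check()  # fires before the call, not after the run
            completed.append(f"step-{i}")
            budget.add(cost)
    except BudgetExceeded as exc:
        return {"status": "budget_exceeded", "completed": completed, "spent_usd": exc.spent_usd}
    return {"status": "ok", "completed": completed, "spent_usd": budget.spent_usd}


outcome = run_steps(SimpleBudget(max_usd=3.0), [1.0, 1.0, 1.0, 1.0, 1.0])
```

The `except` branch is where an alert, a log entry, or a state snapshot would go. The caller still receives the steps that finished before the cap was hit.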
Calibrating the budget for SK agents
For Semantic Kernel agents, a practical calibration heuristic:
- Measure your P99 cost on valid runs from your development or staging environment. Run the agent against 50-100 representative inputs and record the total cost per run.
- Set max_usd at 2-3× your P99. This allows headroom for legitimate complex cases while capping runaway scenarios that typically cost 10-50× the median run.
- Start with a tighter ceiling in staging (1.5× P99) to catch edge cases before production. Loosen to 2-3× once you’ve validated the distribution.
- Log every trip with the task that triggered it. Trips in staging inform prompt and planner improvements; trips in production surface real user inputs your planner cannot handle gracefully.
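The calibration steps above reduce to a short calculation. The sketch below uses a nearest-rank P99 and a made-up sample of 100 run costs; `p99` and `ceilings` are hypothetical helper names, and the multipliers are the 1.5×, 2×, and 3× figures from the list.

```python
import math


def p99(costs: list[float]) -> float:
    """Nearest-rank 99th percentile of per-run costs."""
    if not costs:
        raise ValueError("need at least one recorded run cost")
    ordered = sorted(costs)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def ceilings(costs: list[float]) -> dict[str, float]:
    """Derive staging and production max_usd candidates from measured costs."""
    base = p99(costs)
    return {
        "p99": base,
        "staging_max_usd": 1.5 * base,
        "prod_max_usd_low": 2 * base,
        "prod_max_usd_high": 3 * base,
    }


# Made-up sample: 100 runs costing $0.01 to $1.00
sample = [i / 100 for i in range(1, 101)]
limits = ceilings(sample)
```

With only 50-100 measured runs, the P99 is effectively one of the most expensive runs in the sample, so it is worth re-running the calculation as production data accumulates.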
Add budget guardrails to your Semantic Kernel agents
RunGuard’s Python and TypeScript SDKs wrap any callable. For Semantic Kernel, add @guard to your plugin methods and BudgetTracker to your kernel setup. A five-minute integration replaces a bespoke cost-counting middleware you would otherwise have to build and maintain yourself.
Get started with RunGuard — or see how the same pattern applies to AutoGen, smolagents, and CrewAI.