Spring AI Cost Control: Loop Detection and Budget Enforcement in Production

Spring AI is the official Spring Project for integrating AI capabilities into Java applications. Built on Spring Boot's autoconfiguration model, it provides a unified abstraction layer over a wide range of LLM providers (OpenAI, Anthropic, Vertex AI, Azure OpenAI, Mistral, and others), a portable VectorStore API for RAG, a composable Advisor system for cross-cutting concerns, and a ChatClient fluent builder that makes it possible to build sophisticated agentic workflows with idiomatic Spring code.

That design brings Java's 10+ million developer ecosystem into the AI agent space, and with it production workloads running in enterprise Spring Boot applications where reliability and cost predictability are non-negotiable. Spring AI ships with a handful of built-in protections: maxToolCallsPerRequest limits the number of tool calls in a single chat interaction, and TokenCountBudgetAdvisor (added in 1.0.0) truncates messages when a token budget is approached. What neither of these provides is pattern detection: the ability to recognize that an agent is repeating itself and burning budget without making progress, and to trip early before the full allowance is consumed.

This post covers four Spring AI-specific failure modes and shows how to build a SpringAgentBreaker circuit breaker in Java using the framework's native CallAroundAdvisor API.

Spring AI architecture in brief

Spring AI agents are typically constructed around the ChatClient fluent API, which composes requests through a chain of Advisor objects before sending them to the underlying ChatModel. The advisor chain handles cross-cutting concerns: MessageChatMemoryAdvisor injects and updates conversation history, QuestionAnswerAdvisor performs RAG retrieval and augments the prompt with vector store results, TokenCountBudgetAdvisor truncates messages when token limits approach, and custom advisors implement domain-specific logic.

Tool calling follows the standard agentic loop: the LLM decides to call a function, Spring AI resolves the corresponding FunctionCallback or @Tool-annotated bean, executes it, and returns the result as a ToolResponseMessage for the next model inference. This loop continues until the model produces a final response without a tool call, or until built-in limits are reached.

The key integration points for a circuit breaker are:

CallAroundAdvisor: wraps every ChatClient call; receives the full AdvisedRequest before it reaches the model and the AdvisedResponse after. This is the right place to inspect tool call sequences, measure token growth, and trip the breaker.
AdvisedRequest.adviseContext(): a Map<String, Object> that flows through the advisor chain with each request. Used here to hold breaker state without thread-local or external storage.
ChatResponse.getResults() → AssistantMessage.getToolCalls(): exposes the tool call list from each model response; the input for tool call pattern detection.
Usage (via ChatResponse.getMetadata().getUsage()): prompt and completion token counts per request; the input for token drift detection.

Multi-agent patterns in Spring AI are typically implemented by registering an inner ChatClient invocation as a FunctionCallback — making one agent a callable tool from another agent's perspective. This is clean and composable, but creates a specific loop failure mode covered below.

Why maxToolCallsPerRequest is not a circuit breaker

Spring AI's maxToolCallsPerRequest — set via ToolCallingChatOptions.builder().maxToolCallsPerRequest(N).build() — caps the total number of tool call rounds within a single ChatClient.call() invocation. When the limit is reached, Spring AI stops the tool-calling loop and returns whatever partial result the model has accumulated.

A circuit breaker detects a pattern and trips before the limit is consumed. maxToolCallsPerRequest answers "have we called enough tools?" A pattern detector answers "are these tool calls making progress?" The gap is significant in production:

An agent performing 15 distinct function calls across different data sources should never be halted. An agent calling the same search function 15 times with semantically near-identical arguments should trip after the 3rd or 4th repetition.
With maxToolCallsPerRequest=15, the spiraling agent consumes all 15 rounds and returns a degraded result. A circuit breaker trips at round 3–4, preserves 11–12 rounds for a retry with different parameters, and avoids the full bill.
TokenCountBudgetAdvisor truncates messages when a token threshold is approached — it does not detect the growth rate that signals runaway context inflation. An agent steadily inflating its context by 40% per round will hit the budget advisor's wall at the worst possible moment: after most of the budget is already spent.

The four failure modes below each route around Spring AI's built-in limits while accumulating cost turn by turn.

Failure mode 1: Function callback invocation spiral

The function callback invocation spiral is Spring AI's most common production cost failure. It occurs when the model repeatedly calls the same @Tool-annotated method or FunctionCallback bean with near-identical arguments because each invocation returns a valid result that is informative but not conclusive — the model can't satisfy its current reasoning goal from the returned data, refines the query slightly, and tries again.

A concrete example: a Spring AI research agent uses a @Tool-annotated webSearch(String query) method. The model queries "Spring Boot 3.x benchmark results 2026", receives summaries, determines the detail level is insufficient, and calls webSearch("Spring Boot 3 throughput benchmarks 2026"). Gets similar results. Tries webSearch("Spring Boot 3.3 performance numbers throughput latency"). The function works correctly. The model is not in an infinite loop by any strict definition — each call has a slightly different argument string. But each call incurs a full LLM inference and a function execution while returning decreasing marginal information value.

Spring AI's maxToolCallsPerRequest applies uniformly across all tool types and all calls — it doesn't detect that a single function is being hammered. Detection requires tracking the argument sequence per function name:

import java.util.*;
import java.util.stream.Collectors;

public class ToolCallTracker {

    // Normalize a tool argument string for similarity comparison.
    // Lowercases, strips punctuation, tokenizes, sorts, deduplicates.
    public static String normalize(String args) {
        return Arrays.stream(args.toLowerCase()
                .replaceAll("[^a-z0-9\\s]", " ")
                .trim()
                .split("\\s+"))
            .filter(t -> !t.isEmpty())
            .sorted()
            .distinct()
            .collect(Collectors.joining(" "));
    }

    // Jaccard similarity on token sets.
    public static double jaccard(String a, String b) {
        Set<String> setA = new HashSet<>(Arrays.asList(a.split(" ")));
        Set<String> setB = new HashSet<>(Arrays.asList(b.split(" ")));
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Returns true if, within the last windowSize entries of the call history
    // for toolName, every consecutive pair has Jaccard >= threshold.
    public static boolean isSpiralDetected(
            Map<String, Deque<String>> callHistory,
            String toolName,
            int windowSize,
            double threshold) {
        Deque<String> history = callHistory.getOrDefault(toolName, new ArrayDeque<>());
        if (history.size() < windowSize) return false;
        List<String> window = new ArrayList<>(history).subList(
            history.size() - windowSize, history.size());
        for (int i = 1; i < window.size(); i++) {
            if (jaccard(window.get(i - 1), window.get(i)) < threshold) return false;
        }
        return true;
    }
}

A Jaccard threshold of 0.80 is a reasonable starting point for function argument comparison. It catches reordered or lightly reworded repetition, such as "search Spring Boot benchmark 2026" vs "Spring Boot 2026 benchmark search", while allowing functions that legitimately need similar sequential calls (pagination, faceted search, incremental refinement). The window size of 5 avoids false positives on legitimate re-queries while catching tight spiral patterns quickly enough to preserve most of the remaining budget.
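
To get a feel for how the threshold behaves, the sketch below re-implements the same normalization and token-set Jaccard in a standalone class (JaccardDemo is a name chosen here purely for illustration) and scores two query pairs taken from this section. Reordered wording normalizes to identical token sets and scores 1.0. The first two webSearch queries from the earlier example share only 4 of 9 distinct tokens, so they score about 0.44 and would not trip at 0.80 on lexical overlap alone. It is worth logging scores from real traffic before settling on a threshold.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class JaccardDemo {

    // Same idea as ToolCallTracker.normalize: lowercase, replace
    // punctuation with spaces, split on whitespace, deduplicate.
    static Set<String> tokens(String text) {
        Set<String> out = new TreeSet<>();
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").trim();
        for (String t : cleaned.split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
    static double jaccard(String a, String b) {
        Set<String> setA = tokens(a);
        Set<String> setB = tokens(b);
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Reordered wording: identical token sets.
        System.out.println(jaccard(
            "search Spring Boot benchmark 2026",
            "Spring Boot 2026 benchmark search"));
        // Reworded queries from the webSearch example: 4 shared tokens out of 9.
        System.out.println(jaccard(
            "Spring Boot 3.x benchmark results 2026",
            "Spring Boot 3 throughput benchmarks 2026"));
    }
}
```

If the reworded-query case is the one you care about most, a lower threshold or an embedding-based similarity may be a better fit than token overlap; that trade-off is left open here.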

Failure mode 2: MessageChatMemoryAdvisor token inflation

MessageChatMemoryAdvisor is Spring AI's standard mechanism for injecting conversation history into each request. By default it uses InMemoryChatMemory, which stores every message in a ConcurrentHashMap keyed by conversation ID. Every time the advisor runs, it fetches the full history for the conversation ID and prepends it to the request's message list. Every tool call response is appended to that history.

For short, bounded conversations this is correct and efficient. For agentic workflows — where a single logical task spans many tool call rounds, each producing a ToolResponseMessage that may contain hundreds of tokens of structured data — InMemoryChatMemory grows monotonically. The 8th round of a research agent sends 7 prior tool responses to the model as part of the conversation context. The 15th round sends 14. The cost curve is the sum of an arithmetic series, not a flat per-call constant.
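
The arithmetic-series claim can be made concrete with a small calculation. The sketch below assumes a fixed base prompt and a fixed number of tokens added to the history by each round's tool response; both figures are hypothetical and chosen only to make the numbers easy to follow. Under that assumption the prompt for round k is base + (k - 1) × response, and the total over n rounds is n × base + response × n(n - 1)/2.

```java
class ContextCostSketch {

    // Prompt tokens sent on round k (1-based) when every earlier round
    // left one tool response of toolResponseTokens in the history.
    static long promptTokensAtRound(int k, long baseTokens, long toolResponseTokens) {
        return baseTokens + (long) (k - 1) * toolResponseTokens;
    }

    // Total prompt tokens across rounds 1..n, using the closed form
    // of the arithmetic series.
    static long cumulativePromptTokens(int n, long baseTokens, long toolResponseTokens) {
        return n * baseTokens + toolResponseTokens * n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1,000-token base prompt, 500 tokens per tool response.
        System.out.println(promptTokensAtRound(15, 1000, 500));    // 8000
        System.out.println(cumulativePromptTokens(15, 1000, 500)); // 67500
    }
}
```

With these numbers, 15 rounds cost 67,500 prompt tokens in total, against 15,000 if each round had sent only the base prompt.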

Spring AI 1.0 added TokenCountBudgetAdvisor, which truncates messages when token usage approaches a configured ceiling. But truncation is reactive — it activates when the budget is nearly exhausted, not when growth is first detected. Token drift detection is proactive: it measures the growth rate across consecutive requests and trips if compounding growth is observed early, while most of the budget is still available.

import java.util.List;

import org.springframework.ai.chat.client.advisor.api.AdvisedResponse;
import org.springframework.ai.chat.metadata.Usage;

public class TokenDriftDetector {

    // Returns true if three consecutive prompt token counts
    // each grew by at least growthFactor over the prior one.
    public static boolean isDrifting(
            List<Long> promptTokenHistory,
            double growthFactor) {
        int n = promptTokenHistory.size();
        if (n < 3) return false;
        long t1 = promptTokenHistory.get(n - 3);
        long t2 = promptTokenHistory.get(n - 2);
        long t3 = promptTokenHistory.get(n - 1);
        if (t1 == 0) return false;
        return ((double) t2 / t1 >= growthFactor)
            && ((double) t3 / t2 >= growthFactor);
    }

    // Extracts the prompt token count from an AdvisedResponse.
    // Returns 0 if the response, its metadata, or its usage data is unavailable.
    public static long promptTokens(AdvisedResponse response) {
        if (response == null || response.response() == null
                || response.response().getMetadata() == null) return 0;
        Usage usage = response.response().getMetadata().getUsage();
        return usage != null ? usage.getPromptTokens() : 0;
    }
}

A growthFactor of 1.35 means each of three consecutive requests must show 35%+ prompt token growth over the prior one before the drift detector fires. This is intentionally tight: it catches the geometric growth curve of an agent building a large RAG or tool-response context, while tolerating the natural variation in prompt sizes from turn to turn. The right response on a drift trip is not to abort — it's to flush the memory (chatMemory.clear(conversationId)), inject a compressed summary of what was learned, and reset the token baseline.
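
A quick way to see what the 1.35 factor does and does not catch is to run the check against two synthetic histories. The sketch below mirrors the three-point ratio test in a standalone class (DriftSketch and the token counts are illustrative assumptions, not measurements). Compounding growth of 40% per round trips; a steady +500 tokens per round does not, because the ratio between consecutive prompts shrinks as the history lengthens.

```java
import java.util.List;

class DriftSketch {

    // The last three prompt token counts must each grow by at least
    // growthFactor over the one before.
    static boolean isDrifting(List<Long> history, double growthFactor) {
        int n = history.size();
        if (n < 3) return false;
        long t1 = history.get(n - 3);
        long t2 = history.get(n - 2);
        long t3 = history.get(n - 1);
        if (t1 <= 0 || t2 <= 0) return false;
        return (double) t2 / t1 >= growthFactor
            && (double) t3 / t2 >= growthFactor;
    }

    public static void main(String[] args) {
        // Compounding growth: ratios 1.40 and 1.40.
        System.out.println(isDrifting(List.of(1000L, 1400L, 1960L), 1.35)); // true
        // Linear growth of +500 per round: ratios 1.50 and about 1.33.
        System.out.println(isDrifting(List.of(1000L, 1500L, 2000L), 1.35)); // false
    }
}
```

Linear accumulation of the kind described for InMemoryChatMemory is therefore more likely to be caught by a cumulative session budget than by the ratio test, so the two checks are best used together.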

For JdbcChatMemory-backed deployments (where conversation history is persisted in a relational database), the flush-and-summarize pattern is equally applicable — clear the records for the conversation ID and insert a single summary message.

Failure mode 3: VectorStore RAG query fixation

Spring AI's QuestionAnswerAdvisor automatically augments each request with results from a VectorStore retrieval. The advisor receives the user's message (or a reformulated version of it), calls vectorStore.similaritySearch(), and injects the top-K retrieved documents into the system prompt or user message context. When the retrieved documents don't contain the answer the model needs, the model may reformulate the question — but if the vector store doesn't have better content to return, semantically similar queries will retrieve semantically similar documents, and the agent remains stuck.

This failure mode is particularly common when:

The vector store was indexed against a knowledge base that doesn't cover the specific query domain the agent encounters at runtime.
The agent's task involves finding information that genuinely does not exist in the indexed corpus — but the model isn't confident enough to declare failure and escalate.
The QuestionAnswerAdvisor is combined with a memory advisor that caches prior retrieval attempts, causing the model to "re-research" questions it already attempted without the additional context it needs to resolve them differently.

Detection requires intercepting the query string passed to similaritySearch(). After retrieval, the QuestionAnswerAdvisor stores the retrieved documents in the advisor context under the key QuestionAnswerAdvisor.RETRIEVED_DOCUMENTS, but that entry holds the results, not the query. The query itself can be captured by wrapping the VectorStore:

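import java.util.List;
import java.util.Optional;
import java.util.function.Consumer;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

public class ObservableVectorStore implements VectorStore {

    private final VectorStore delegate;
    // Replaceable at runtime so the breaker can register its own observer.
    // One shared observer is not safe when several sessions run concurrently.
    private volatile Consumer<String> queryObserver;

    public ObservableVectorStore(VectorStore delegate, Consumer<String> queryObserver) {
        this.delegate = delegate;
        this.queryObserver = queryObserver;
    }

    public void setQueryObserver(Consumer<String> queryObserver) {
        this.queryObserver = queryObserver;
    }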

    @Override
    public List<Document> similaritySearch(SearchRequest request) {
        queryObserver.accept(request.getQuery());
        return delegate.similaritySearch(request);
    }

    // Delegate all other VectorStore methods to the wrapped instance.
    @Override
    public void add(List<Document> documents) { delegate.add(documents); }

    @Override
    public Optional<Boolean> delete(List<String> idList) { return delegate.delete(idList); }
}

The observer callback feeds the query string into the breaker's per-session query history. When three or more consecutive queries for the same session each score a Jaccard similarity of 0.75 or higher against the query before them, the RAG fixation detector fires. The right response is to trip the breaker and surface the failure as a specific exception type, RagFixationException, rather than a generic circuit breaker trip, so the calling code can choose to expand the search scope, fall back to a different knowledge source, or escalate to a human.
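
As a rough illustration of that detector in isolation, the sketch below keeps a query history for one session and throws once the last windowSize queries are all near-identical. RagFixationDetector and its nested RagFixationException are hypothetical classes written for this example; neither is part of Spring AI, and the sample queries are invented.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RagFixationDetector {

    // Hypothetical exception type; not provided by Spring AI.
    static class RagFixationException extends RuntimeException {
        RagFixationException(String message) { super(message); }
    }

    private final List<Set<String>> history = new ArrayList<>();
    private final int windowSize;
    private final double threshold;

    RagFixationDetector(int windowSize, double threshold) {
        if (windowSize < 2) throw new IllegalArgumentException("windowSize must be >= 2");
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    // Record one vector store query; throw if every consecutive pair in
    // the last windowSize queries scores at or above the threshold.
    void observe(String query) {
        history.add(tokens(query));
        if (history.size() < windowSize) return;
        List<Set<String>> window = history.subList(history.size() - windowSize, history.size());
        for (int i = 1; i < window.size(); i++) {
            if (jaccard(window.get(i - 1), window.get(i)) < threshold) return;
        }
        throw new RagFixationException(
            "Last " + windowSize + " vector store queries were near-identical");
    }

    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").trim();
        for (String t : cleaned.split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}
```

A caller would create one detector per session, pass its observe method as the query observer, and catch RagFixationException separately from other breaker trips.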

Failure mode 4: Multi-agent task delegation loop

Spring AI multi-agent patterns are typically implemented by exposing an inner ChatClient invocation as a FunctionCallback. The coordinator agent calls the sub-agent as a tool, receives a result, and proceeds. When the sub-agent cannot complete its assigned task — it lacks the necessary tools, its knowledge base doesn't cover the domain, or the task specification is underspecified — it returns a response that looks like progress to the coordinator (it's a well-structured, coherent text response) but is effectively a request for more information or a description of what it couldn't find.

The coordinator, receiving an informative but inconclusive response, reasons that providing more context will help and calls the sub-agent again with an augmented prompt. The sub-agent receives more context but still cannot resolve the underlying ambiguity. The cycle runs until the coordinator's maxToolCallsPerRequest limit is hit or the total session cost becomes visible in the billing dashboard.

The cost multiplication is significant. Each round of the delegation loop incurs one coordinator LLM inference to decide to re-delegate, one full sub-agent invocation (with its own internal tool calls), and one coordinator LLM inference to process the sub-agent's response. Counting each internal sub-agent tool call as one LLM call, a delegation cycle that runs for 5 rounds with a sub-agent that makes 3 tool calls per invocation costs 5 × (1 + 3 + 1) = 25 LLM calls. All of that fits within maxToolCallsPerRequest, because the coordinator counts each round as a single tool call (the sub-agent function call): it sees 5 tool calls, not 25.
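
The same accounting as a tiny calculation, so it can be rerun with your own round counts. DelegationCostSketch is an illustrative name, and the model follows the simplification used in the text: each internal sub-agent tool call is counted as one LLM call.

```java
class DelegationCostSketch {

    // LLM calls across the whole loop: per round, one coordinator call to
    // delegate, the sub-agent's internal calls, and one coordinator call
    // to process the result.
    static int llmCalls(int rounds, int subAgentCallsPerInvocation) {
        return rounds * (1 + subAgentCallsPerInvocation + 1);
    }

    // What a per-request tool-call counter on the coordinator observes:
    // one tool call (the sub-agent function) per round.
    static int coordinatorToolCalls(int rounds) {
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(llmCalls(5, 3));          // 25
        System.out.println(coordinatorToolCalls(5)); // 5
    }
}
```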

Detection at the coordinator level requires tracking per-sub-agent call frequency and the semantic similarity of consecutive responses from each sub-agent function:

import java.util.*;

public class SubAgentCallTracker {

    record SubAgentCall(int callIndex, String normalizedResponse) {}

    private final Map<String, List<SubAgentCall>> perAgentHistory = new HashMap<>();
    private int totalCallCount = 0;

    public void record(String subAgentFunctionName, String responseText) {
        String normalized = ToolCallTracker.normalize(responseText);
        perAgentHistory
            .computeIfAbsent(subAgentFunctionName, k -> new ArrayList<>())
            .add(new SubAgentCall(totalCallCount++, normalized));
    }

    public boolean isDelegationLoopDetected(
            String subAgentFunctionName,
            int maxInvocations,
            double responseSimilarityThreshold) {
        List<SubAgentCall> history = perAgentHistory.getOrDefault(
            subAgentFunctionName, Collections.emptyList());
        // Trip if the invocation cap has been reached for this sub-agent.
        if (history.size() >= maxInvocations) return true;
        // At least two responses are needed for a similarity comparison.
        if (history.size() < 2) return false;
        // Trip if last two responses are semantically near-identical.
        SubAgentCall last = history.get(history.size() - 1);
        SubAgentCall prev = history.get(history.size() - 2);
        return ToolCallTracker.jaccard(prev.normalizedResponse(), last.normalizedResponse())
            >= responseSimilarityThreshold;
    }
}

A responseSimilarityThreshold of 0.65 is appropriate for sub-agent response comparison — responses are longer and more paraphrased than function arguments, so a lower threshold prevents false positives on responses that say the same thing with different phrasing. The maxInvocations cap (default 3) provides a hard backstop for sub-agents whose responses vary enough to stay below the similarity threshold but are still not converging.
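
The two trip conditions are easier to reason about with concrete responses. The following standalone sketch (DelegationLoopSketch, the sub-agent name, and the response strings are all made up for illustration) applies the same rules: a hard cap on invocations, plus a similarity check on the last two responses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DelegationLoopSketch {

    private final Map<String, List<Set<String>>> responsesByAgent = new HashMap<>();

    // Lowercase, replace punctuation with spaces, split into a token set.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").trim();
        for (String t : cleaned.split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    void record(String subAgent, String responseText) {
        responsesByAgent.computeIfAbsent(subAgent, k -> new ArrayList<>())
            .add(tokens(responseText));
    }

    boolean loopDetected(String subAgent, int maxInvocations, double threshold) {
        List<Set<String>> h = responsesByAgent.getOrDefault(subAgent, List.of());
        if (h.size() >= maxInvocations) return true; // hard backstop
        if (h.size() < 2) return false;              // nothing to compare yet
        return jaccard(h.get(h.size() - 2), h.get(h.size() - 1)) >= threshold;
    }

    public static void main(String[] args) {
        DelegationLoopSketch tracker = new DelegationLoopSketch();
        tracker.record("billingAgent",
            "I could not find the invoice for account 42. Please provide the billing period.");
        tracker.record("billingAgent",
            "I could not find the invoice for account 42. Please provide the billing period and region.");
        // 13 shared tokens out of 15: trips on similarity before the cap is reached.
        System.out.println(tracker.loopDetected("billingAgent", 3, 0.65)); // true
    }
}
```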

Full SpringAgentBreaker implementation

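The detectors above compose into a single CallAroundAdvisor that plugs into any ChatClient with one line of configuration. The listing wires in the tool spiral, RAG fixation, and token drift checks plus the session token cap. SubAgentCallTracker is not called in this listing; it has to be invoked wherever the sub-agent's response text is available, for example inside the FunctionCallback that wraps the sub-agent: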

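import org.springframework.ai.chat.client.advisor.api.*;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.messages.AssistantMessage.ToolCall;
import org.springframework.ai.chat.metadata.Usage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.core.Ordered;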

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class SpringAgentBreaker implements CallAroundAdvisor {

    public record BreakerConfig(
        int toolSpiralWindowSize,       // default 5
        double toolSpiralJaccard,       // default 0.80
        int ragQueryWindowSize,         // default 3
        double ragQueryJaccard,         // default 0.75
        int tokenDriftWindowSize,       // default 3
        double tokenDriftFactor,        // default 1.35
        long maxSessionTokens,          // default 100_000
        int maxSubAgentInvocations,     // default 3
        double subAgentResponseJaccard  // default 0.65
    ) {
        public static BreakerConfig defaults() {
            return new BreakerConfig(5, 0.80, 3, 0.75, 3, 1.35, 100_000L, 3, 0.65);
        }
    }

    public enum BreakerState { CLOSED, OPEN, HALF_OPEN }

    // Breaker state stored in AdvisedRequest.adviseContext()
    private static final String CTX_KEY = "spring_agent_breaker_state";

    private final BreakerConfig config;
    private final ObservableVectorStore observableVectorStore; // may be null

    public SpringAgentBreaker(BreakerConfig config, ObservableVectorStore vs) {
        this.config = config;
        this.observableVectorStore = vs;
    }

    @Override
    public String getName() { return "SpringAgentBreaker"; }

    @Override
    public int getOrder() { return Ordered.HIGHEST_PRECEDENCE; }

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        SessionState state = getOrCreateState(request);

        // Check if already OPEN — trip immediately
        if (state.breakerState == BreakerState.OPEN) {
            throw new AgentCircuitBreakerException(state.tripReason);
        }

        // Register RAG query observer for this request
        if (observableVectorStore != null) {
            observableVectorStore.setQueryObserver(query -> {
                String norm = ToolCallTracker.normalize(query);
                state.ragQueryHistory.add(norm);
                if (state.ragQueryHistory.size() >= config.ragQueryWindowSize()) {
                    checkRagFixation(state);
                }
            });
        }

        // Execute the request through the rest of the advisor chain
        AdvisedResponse response = chain.nextAroundCall(request);

        // Post-response: inspect tool calls
        ChatResponse chatResponse = response.response();
        if (chatResponse != null && !chatResponse.getResults().isEmpty()) {
            AssistantMessage assistantMsg = chatResponse.getResult().getOutput();
            if (assistantMsg.hasToolCalls()) {
                for (ToolCall tc : assistantMsg.getToolCalls()) {
                    String toolName = tc.name();
                    String normalizedArgs = ToolCallTracker.normalize(tc.arguments());
                    state.toolCallHistory
                        .computeIfAbsent(toolName, k -> new ArrayDeque<>())
                        .add(normalizedArgs);
                    checkToolSpiral(state, toolName);
                }
            }

            // Token drift detection
            Usage usage = chatResponse.getMetadata().getUsage();
            if (usage != null) {
                state.promptTokenHistory.add(usage.getPromptTokens());
                state.totalTokens += usage.getPromptTokens() + usage.getGenerationTokens();
                if (state.totalTokens >= config.maxSessionTokens()) {
                    tripBreaker(state, "Session token budget exceeded: "
                        + state.totalTokens + " tokens");
                }
                if (state.promptTokenHistory.size() >= config.tokenDriftWindowSize()) {
                    checkTokenDrift(state);
                }
            }
        }

        return response;
    }

    private void checkToolSpiral(SessionState state, String toolName) {
        Deque<String> history = state.toolCallHistory.get(toolName);
        if (history.size() < config.toolSpiralWindowSize()) return;
        List<String> window = new ArrayList<>(history).subList(
            history.size() - config.toolSpiralWindowSize(), history.size());
        boolean spiral = true;
        for (int i = 1; i < window.size(); i++) {
            if (ToolCallTracker.jaccard(window.get(i - 1), window.get(i))
                    < config.toolSpiralJaccard()) {
                spiral = false;
                break;
            }
        }
        if (spiral) tripBreaker(state,
            "Function callback invocation spiral detected on tool: " + toolName);
    }

    private void checkRagFixation(SessionState state) {
        List<String> window = new ArrayList<>(state.ragQueryHistory).subList(
            state.ragQueryHistory.size() - config.ragQueryWindowSize(),
            state.ragQueryHistory.size());
        boolean fixated = true;
        for (int i = 1; i < window.size(); i++) {
            if (ToolCallTracker.jaccard(window.get(i - 1), window.get(i))
                    < config.ragQueryJaccard()) {
                fixated = false;
                break;
            }
        }
        if (fixated) tripBreaker(state, "VectorStore RAG query fixation detected");
    }

    private void checkTokenDrift(SessionState state) {
        List<Long> h = state.promptTokenHistory;
        int w = config.tokenDriftWindowSize();
        if (w < 2 || h.size() < w) return;
        List<Long> window = h.subList(h.size() - w, h.size());
        // Every consecutive pair in the window must grow by at least the
        // drift factor; this also holds for window sizes other than 3.
        for (int i = 1; i < window.size(); i++) {
            long prev = window.get(i - 1);
            if (prev <= 0
                    || (double) window.get(i) / prev < config.tokenDriftFactor()) {
                return;
            }
        }
        tripBreaker(state, "MessageChatMemory token drift detected: "
            + window + " prompt tokens");
    }

    private void tripBreaker(SessionState state, String reason) {
        state.breakerState = BreakerState.OPEN;
        state.tripReason = reason;
        throw new AgentCircuitBreakerException(reason);
    }

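    // Note: if the advise context does not survive across ChatClient calls in
    // your Spring AI version, keep state in a ConcurrentHashMap keyed by
    // conversation ID instead.
    private SessionState getOrCreateState(AdvisedRequest request) {
        return (SessionState) request.adviseContext()
            .computeIfAbsent(CTX_KEY, k -> new SessionState());
    }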

    // Per-session mutable state
    static class SessionState {
        BreakerState breakerState = BreakerState.CLOSED;
        String tripReason;
        Map<String, Deque<String>> toolCallHistory = new HashMap<>();
        List<String> ragQueryHistory = new ArrayList<>();
        List<Long> promptTokenHistory = new ArrayList<>();
        long totalTokens = 0;
    }

    public static class AgentCircuitBreakerException extends RuntimeException {
        public AgentCircuitBreakerException(String msg) { super(msg); }
    }
}
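
The BreakerState enum declares HALF_OPEN, but the listing above only ever moves from CLOSED to OPEN. One possible way to use the third state, sketched here under assumptions of my own (a fixed cooldown, a single probe request, and caller-supplied timestamps), is the conventional circuit breaker cycle: after the cooldown, let one request through, then close on success or reopen on failure. BreakerStateMachine is a hypothetical helper and is not thread-safe as written.

```java
class BreakerStateMachine {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private long openedAtMillis;
    private final long cooldownMillis;

    BreakerStateMachine(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    State state() { return state; }

    // Called when any detector fires.
    void trip(long nowMillis) {
        state = State.OPEN;
        openedAtMillis = nowMillis;
    }

    // Returns true if a request may proceed at nowMillis. After the
    // cooldown, exactly one probe request is allowed through.
    boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= cooldownMillis) {
            state = State.HALF_OPEN;
            return true;
        }
        return state == State.CLOSED;
    }

    // The probe completed without tripping a detector: resume normal operation.
    void recordSuccess() {
        if (state == State.HALF_OPEN) state = State.CLOSED;
    }

    // The probe tripped a detector again: reopen and restart the cooldown.
    void recordFailure(long nowMillis) {
        if (state == State.HALF_OPEN) trip(nowMillis);
    }
}
```

Passing the clock in as a parameter keeps the sketch easy to test; in an advisor you would typically pass System.currentTimeMillis().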

Wiring SpringAgentBreaker into a Spring Boot application

The advisor plugs into any ChatClient through the builder's defaultAdvisors() method. Register it as a Spring Bean with the highest precedence so it wraps the full advisor chain:

@Configuration
public class AgentBreakerConfig {

    @Bean
    public ObservableVectorStore observableVectorStore(VectorStore delegate) {
        return new ObservableVectorStore(delegate, query -> { /* observer set per request */ });
    }

    @Bean
    public SpringAgentBreaker springAgentBreaker(ObservableVectorStore observableVectorStore) {
        return new SpringAgentBreaker(
            SpringAgentBreaker.BreakerConfig.defaults(), observableVectorStore);
    }

    @Bean
    public ChatClient agentChatClient(
            ChatClient.Builder builder,
            SpringAgentBreaker breaker,
            ObservableVectorStore vectorStore,
            ChatMemory chatMemory) {
        return builder
            .defaultAdvisors(
                breaker,   // must be first (highest precedence)
                new MessageChatMemoryAdvisor(chatMemory),
                new QuestionAnswerAdvisor(vectorStore)
            )
            .build();
    }
}

Handling a circuit breaker trip in your agent loop:

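import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Service
public class ResearchAgentService {

    private static final Logger log = LoggerFactory.getLogger(ResearchAgentService.class);

    private final ChatClient chatClient;
    private final ChatMemory chatMemory;

    public ResearchAgentService(ChatClient chatClient, ChatMemory chatMemory) {
        this.chatClient = chatClient;
        this.chatMemory = chatMemory;
    }

    public String research(String conversationId, String userQuery) {
        try {
            return chatClient.prompt()
                .user(userQuery)
                .advisors(a -> a.param(
                    MessageChatMemoryAdvisor.CHAT_MEMORY_CONVERSATION_ID_KEY,
                    conversationId))
                .call()
                .content();
        } catch (SpringAgentBreaker.AgentCircuitBreakerException e) {
            log.warn("Agent breaker tripped: {}", e.getMessage());
            // Flush memory and re-inject a compressed summary
            chatMemory.clear(conversationId);
            chatMemory.add(conversationId, List.of(
                new SystemMessage("Prior context summary: " + summarize(e.getMessage()))
            ));
            // This sketch does not retry by itself; the caller may invoke
            // research() once more from the reset baseline.
            return "Agent reached a reasoning limit. Context was reset; please retry.";
        }
    }

    // Placeholder: swap in a real summarization call over the collected context.
    private String summarize(String context) {
        return context;
    }
}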

BreakerConfig tuning reference

Parameter	Default	What it controls	When to adjust
`toolSpiralWindowSize`	5	Number of consecutive calls to the same tool that must all be similar before spiral trips.	Lower to 3 for tight budget environments; raise to 7 for tools with legitimately repetitive access patterns (pagination, streaming).
`toolSpiralJaccard`	0.80	Argument similarity threshold for spiral detection.	Lower to 0.65 for tools whose inputs tend to be verbose (long prompts). Raise to 0.90 for tools with very short, keyword-style inputs.
`ragQueryWindowSize`	3	Number of consecutive VectorStore queries that must be similar before fixation trips.	Keep at 3 for most use cases. Raise to 5 if your agent legitimately re-queries the same concept from multiple angles.
`ragQueryJaccard`	0.75	Query similarity threshold for RAG fixation detection.	Lower to 0.60 for agents working in highly technical domains where queries share many domain-specific terms without being identical in intent.
`tokenDriftFactor`	1.35	Required per-turn prompt token growth multiplier for three consecutive turns to trip drift detection.	Lower to 1.25 for cost-sensitive workloads. Raise to 1.50 for agents that legitimately process large document payloads per turn.
`maxSessionTokens`	100,000	Hard session token budget cap. Trips as soon as cumulative prompt + completion tokens reach this value.	Set to your actual LLM cost budget per session. For example, at a blended rate of $15 per million tokens (check your provider's current pricing), 100K tokens works out to a $1.50 hard cap per session.
`maxSubAgentInvocations`	3	Maximum times the same sub-agent function tool may be called in one session before delegation loop trips.	Raise to 5 for sub-agents performing legitimate iterative refinement. Keep at 3 for sub-agents designed for single-shot tasks.
`subAgentResponseJaccard`	0.65	Response similarity threshold for back-delegation detection.	Lower to 0.50 for sub-agents that produce verbose, domain-specific responses that vary in phrasing but not in informational content.

Spring Boot Actuator and Micrometer integration

Spring AI's built-in observability emits spans and metrics via Micrometer to any compatible backend (Prometheus, Zipkin, OpenTelemetry). The SpringAgentBreaker should augment this instrumentation rather than replace it. Add a MeterRegistry dependency and emit a counter on each trip:

@Autowired
private MeterRegistry meterRegistry;

private void tripBreaker(SessionState state, String reason) {
    state.breakerState = BreakerState.OPEN;
    state.tripReason = reason;
    // Determine trip category from reason string
    String category = reason.contains("spiral") ? "tool_spiral"
        : reason.contains("RAG") ? "rag_fixation"
        : reason.contains("drift") ? "token_drift"
        : reason.contains("delegation") ? "delegation_loop"
        : "budget_exceeded";
    meterRegistry.counter("spring.agent.breaker.trips",
        "category", category).increment();
    throw new AgentCircuitBreakerException(reason);
}

This produces a spring.agent.breaker.trips counter that Prometheus can scrape and Grafana can alert on. Tag-based grouping lets you see which failure mode dominates across your fleet of Spring Boot agent instances. Combine with Spring AI's built-in spring.ai.chat.client.observations spans to correlate breaker trips with the specific chat requests that preceded them.
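
Deriving the category from substrings of the reason text works, but it can misclassify if a message is reworded or if a tool name happens to contain one of the keywords. One alternative, sketched below with hypothetical names (TripCategory and CategorizedBreakerException are not part of the listing above), is to carry the category on the exception as an enum and use its tag directly as the metric label.

```java
class TripReasonSketch {

    // One constant per failure mode, each with the tag value used in metrics.
    enum TripCategory {
        TOOL_SPIRAL("tool_spiral"),
        RAG_FIXATION("rag_fixation"),
        TOKEN_DRIFT("token_drift"),
        DELEGATION_LOOP("delegation_loop"),
        BUDGET_EXCEEDED("budget_exceeded");

        private final String tag;

        TripCategory(String tag) { this.tag = tag; }

        String tag() { return tag; }
    }

    // Exception that carries the category alongside the human-readable reason.
    static class CategorizedBreakerException extends RuntimeException {
        private final TripCategory category;

        CategorizedBreakerException(TripCategory category, String reason) {
            super(reason);
            this.category = category;
        }

        TripCategory category() { return category; }
    }

    public static void main(String[] args) {
        CategorizedBreakerException e = new CategorizedBreakerException(
            TripCategory.TOKEN_DRIFT, "prompt tokens grew 1000 -> 1400 -> 1960");
        // The tag can be passed straight to a counter as the "category" label.
        System.out.println(e.category().tag());
    }
}
```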

For Spring Boot Actuator health endpoints, expose a custom HealthIndicator that returns DOWN when a session breaker has tripped and the agent is in a known degraded state. This feeds into your orchestration layer's readiness probe without requiring a separate monitoring integration.

Memory flush and recovery pattern

When the breaker trips due to token drift or a tool spiral, the right recovery path is to flush conversation memory rather than abort the entire agent task. The flush-and-summarize pattern works as follows:

1. Catch AgentCircuitBreakerException at the agent service layer.
2. Call chatMemory.clear(conversationId) to remove the inflated history.
3. Generate a compressed summary of what the agent accomplished before the trip. This can be done with a separate, cheap summarization call using a smaller model: ChatClient.builder(chatModel).build().prompt().user("Summarize: " + collectedContext).call().content().
4. Re-inject the summary as a single SystemMessage or UserMessage. This resets the context window to a stable baseline while preserving the useful progress made before the trip.
5. Retry the original task from the reset baseline. Because the memory is now compact, the token drift detector starts from a fresh baseline and the agent has a full budget to work with again.
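The steps above can be sketched in plain Java. This is an illustration under stated assumptions rather than the article's implementation: ConversationMemory stands in for Spring AI's ChatMemory, BreakerTrippedException stands in for AgentCircuitBreakerException, and FlushAndRetry, the summarizer function, and maxRecoveries are hypothetical names. The sketch reads the history before clearing it, so the summarizer still has something to work with.

```java
import java.util.List;
import java.util.function.Function;

// Stand-in for the article's AgentCircuitBreakerException.
final class BreakerTrippedException extends RuntimeException {
    BreakerTrippedException(String reason) { super(reason); }
}

// Hypothetical minimal memory interface; Spring AI's ChatMemory plays this role.
interface ConversationMemory {
    List<String> messages(String conversationId);
    void clear(String conversationId);
    void add(String conversationId, String message);
}

final class FlushAndRetry {
    private final ConversationMemory memory;
    private final Function<String, String> summarizer; // e.g. a call to a cheaper model
    private final int maxRecoveries;                    // cap so recovery cannot loop forever

    FlushAndRetry(ConversationMemory memory,
                  Function<String, String> summarizer,
                  int maxRecoveries) {
        this.memory = memory;
        this.summarizer = summarizer;
        this.maxRecoveries = maxRecoveries;
    }

    // Runs the task; on a breaker trip, compacts memory to a summary and retries.
    String run(String conversationId, Function<String, String> task) {
        int recoveries = 0;
        while (true) {
            try {
                return task.apply(conversationId);
            } catch (BreakerTrippedException trip) {
                if (recoveries++ >= maxRecoveries) {
                    throw trip; // recovery budget exhausted, surface the trip
                }
                // Capture the history first, then flush and re-inject the summary.
                String collected = String.join("\n", memory.messages(conversationId));
                String summary = summarizer.apply(collected);
                memory.clear(conversationId);
                memory.add(conversationId, "Summary of progress so far: " + summary);
            }
        }
    }
}
```

Capping the number of recoveries matters: without it, a task that trips the breaker on every attempt would turn the recovery path into a loop of its own.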

This pattern is especially effective for RAG-heavy agents where the QuestionAnswerAdvisor injects large retrieved documents per turn. After a RAG fixation trip, the flush clears the document-heavy context, the retry can widen the SearchRequest.withTopK() or change the filterExpression, and the agent approaches the same problem with a different vector store strategy.

Frequently asked questions

Does SpringAgentBreaker work with Spring AI's streaming API (stream().content())?

CallAroundAdvisor only intercepts non-streaming call() invocations. For streaming, Spring AI provides a counterpart interface, StreamAroundAdvisor, with an aroundStream() method. Implement both interfaces in SpringAgentBreaker and delegate to the same detection logic. The streaming path receives Flux<AdvisedResponse>, so buffer tool calls via a collectList() operator before running the pattern checks. Because collectList() emits only when the stream completes, the checks run after the last chunk has arrived. The token drift detector must aggregate partial usage metadata from the stream's final chunk, which Spring AI emits as the last item in the flux.
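As a rough illustration of what happens after buffering, the following framework-free sketch merges a list of chunks into one turn. StreamChunk and BufferedTurn are hypothetical types invented for the example; they simplify AdvisedResponse down to the two things the detectors need, and the list argument stands in for the result of collectList().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of one streamed chunk.
final class StreamChunk {
    final List<String> toolCalls;  // tool names requested in this chunk, possibly empty
    final Integer totalTokens;     // usage, or null when the chunk carries none

    StreamChunk(List<String> toolCalls, Integer totalTokens) {
        this.toolCalls = toolCalls;
        this.totalTokens = totalTokens;
    }
}

// The buffered view that the pattern checks would run against.
final class BufferedTurn {
    final List<String> toolCalls = new ArrayList<>();
    int totalTokens = 0;

    // Merges tool calls from every chunk and keeps the last non-null usage
    // value seen, which is the final chunk's usage when only that chunk has it.
    static BufferedTurn from(List<StreamChunk> chunks) {
        BufferedTurn turn = new BufferedTurn();
        for (StreamChunk chunk : chunks) {
            turn.toolCalls.addAll(chunk.toolCalls);
            if (chunk.totalTokens != null) {
                turn.totalTokens = chunk.totalTokens;
            }
        }
        return turn;
    }
}
```

The spiral and drift checks can then run on BufferedTurn exactly as they do on a non-streaming response.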

How does the breaker interact with Spring AI's built-in TokenCountBudgetAdvisor?

The two advisors are complementary. TokenCountBudgetAdvisor truncates messages before sending them to the model to keep prompt size within a static ceiling. SpringAgentBreaker detects growth rate across consecutive requests and trips after observing a compounding inflation pattern. Register the breaker with a higher precedence (lower getOrder() value) so it wraps TokenCountBudgetAdvisor in the chain. If the budget advisor's ceiling is already close to your provider's context window, lower the breaker's tokenDriftFactor to 1.25 so the drift detector trips earlier, giving the budget advisor less work to do on subsequent turns.
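To make the effect of tokenDriftFactor concrete, here is a small sketch of a growth-rate check. It is an assumption about how such a detector could work, not the article's implementation: the class name, the consecutiveLimit parameter, and the rule "trip after N consecutive turns that each grow by more than the factor" are all illustrative.

```java
// Hypothetical growth-rate detector: trips after `consecutiveLimit` turns in a
// row whose prompt token count exceeds the previous turn's by more than `driftFactor`.
final class TokenDriftDetector {
    private final double driftFactor;
    private final int consecutiveLimit;
    private long previousTokens = -1; // -1 means no turn recorded yet
    private int streak = 0;

    TokenDriftDetector(double driftFactor, int consecutiveLimit) {
        this.driftFactor = driftFactor;
        this.consecutiveLimit = consecutiveLimit;
    }

    // Records one turn's prompt token count; returns true when the breaker should trip.
    boolean record(long promptTokens) {
        if (previousTokens > 0 && promptTokens > previousTokens * driftFactor) {
            streak++;      // this turn inflated faster than the allowed factor
        } else {
            streak = 0;    // growth stalled or reversed, start over
        }
        previousTokens = promptTokens;
        return streak >= consecutiveLimit;
    }
}
```

With prompt sizes of 1000, 1300, and 1700 tokens and a limit of two consecutive turns, a factor of 1.25 trips on the third turn, while a factor of 1.5 does not trip at all. That is the sense in which a lower factor makes the detector fire earlier.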

The spiral detector fires false positives on our paginated search tool that legitimately issues similar queries across pages. How do we tune it out?

Two options. First, include the page number or cursor token in the function argument JSON, so that the normalized arguments differ by the page offset even when the query text is identical. Whether that difference is enough to keep Jaccard similarity below the threshold depends on how the arguments are tokenized and how long the query text is, so check it against your configured threshold. Second, mark the tool as exempt and skip it in the spiral check: test !skipSpiralTools.contains(toolName) before running checkToolSpiral(), and inject skipSpiralTools as a Set<String> configuration property. This exempts known pagination tools while leaving all other tool calls monitored.
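Both options can be tried out in a small sketch. The tokenization below (lowercase, split on non-word characters) and the SpiralCheck class are assumptions made for illustration; the article's checkToolSpiral() may normalize arguments differently, which changes the similarity values.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical spiral check with a by-name exemption list.
final class SpiralCheck {
    private final Set<String> skipSpiralTools;
    private final double threshold;

    SpiralCheck(Set<String> skipSpiralTools, double threshold) {
        this.skipSpiralTools = skipSpiralTools;
        this.threshold = threshold;
    }

    // Assumed normalization: lowercase, split on non-word characters, drop empties.
    static Set<String> tokens(String normalizedArgs) {
        Set<String> out = new HashSet<>();
        for (String token : normalizedArgs.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Exempt tools are never flagged; others are flagged at or above the threshold.
    boolean isSuspicious(String toolName, String previousArgs, String currentArgs) {
        if (skipSpiralTools.contains(toolName)) {
            return false;
        }
        return jaccard(tokens(previousArgs), tokens(currentArgs)) >= threshold;
    }
}
```

Under this tokenization, two calls that differ only in the page number still share most of their tokens, so a low threshold can flag them anyway. The exemption list does not depend on the threshold, which makes it the more predictable option for known pagination tools.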

We run Spring Boot agent services in a horizontally scaled pod environment. Is per-request context storage safe for concurrent requests?

Yes. AdvisedRequest.context() is created fresh per ChatClient.call() invocation and is not shared across concurrent requests, so each call gets its own SessionState instance stored in the context map. The SpringAgentBreaker bean itself holds no per-request mutable state, only the config record and the optional ObservableVectorStore reference, so nothing needs to be synchronized between pods either. The one caveat is the ObservableVectorStore: concurrent safety holds only as long as its setQueryObserver method is called within a single call's execution and the observer is not shared across calls. For agents that run several calls concurrently in a single JVM, use a ThreadLocal-backed observer or pass the conversation ID as a lookup key into a ConcurrentHashMap of per-conversation query histories.
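The ConcurrentHashMap variant could look roughly like this. QueryHistoryRegistry and its bounded-history policy are hypothetical, shown only to illustrate keying shared state by conversation ID instead of storing it on the observer.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical registry of recent vector store queries, keyed by conversation ID.
final class QueryHistoryRegistry {
    private final ConcurrentHashMap<String, Deque<String>> histories = new ConcurrentHashMap<>();
    private final int maxPerConversation;

    QueryHistoryRegistry(int maxPerConversation) {
        this.maxPerConversation = maxPerConversation;
    }

    void record(String conversationId, String query) {
        // compute() runs atomically per key, so the append and the trim for one
        // conversation cannot interleave with another writer on the same key.
        histories.compute(conversationId, (id, existing) -> {
            Deque<String> history =
                (existing != null) ? existing : new ConcurrentLinkedDeque<String>();
            history.addLast(query);
            while (history.size() > maxPerConversation) {
                history.pollFirst(); // drop the oldest query
            }
            return history;
        });
    }

    // Returns a snapshot copy so callers never see later mutations.
    List<String> history(String conversationId) {
        Deque<String> history = histories.get(conversationId);
        return (history == null) ? new ArrayList<String>() : new ArrayList<>(history);
    }

    // Call when the conversation ends or after a memory flush.
    void clear(String conversationId) {
        histories.remove(conversationId);
    }
}
```

Entries should be cleared when a conversation ends; otherwise the map grows with every conversation the instance has ever served.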

Does the breaker catch loops that happen inside a single tool call round (e.g., a tool that itself calls an LLM)?

No. SpringAgentBreaker operates at the ChatClient advisor layer — it sees the tool calls emitted by the outer agent's LLM response, not the internals of each tool execution. If a tool internally invokes another LLM (for summarization, classification, or sub-task reasoning), that inner invocation is invisible to the outer breaker unless the inner ChatClient also has a breaker advisor registered. For multi-agent patterns built with Spring AI's function-as-sub-agent idiom, register a SpringAgentBreaker on the inner agent's ChatClient as well. The outer breaker's delegation loop detector then provides a second layer of protection at the coordinator level.

RunGuard: circuit breakers for production AI agents

RunGuard is a runtime SDK that trips a circuit breaker the moment your AI agent's tool-call pattern shows a loop, context-window inflation, or budget blow-through — before the bill lands. One-line install for TypeScript and Python. Works alongside any framework, including Spring AI via the REST API or the JVM SDK wrapper.
