LLM prompt injection detection for agents: why behavioral runtime guards catch what input filters miss

Prompt injection in autonomous LLM agents is not the same problem as prompt injection in a chatbot, and it cannot be solved with the same defences. When an agent browses the web, reads documents from a vector store, calls external APIs, or delegates tasks to sub-agents, every one of those data sources is a potential injection vector. A malicious instruction embedded in a retrieved document, a tool return value, or a sub-agent’s response can redirect the agent’s goals, exfiltrate context, escalate privileges, or trigger an expensive action loop, all without any real-time interaction with a human attacker.

The fundamental problem is that LLMs cannot reliably distinguish between instructions from their principal hierarchy (the system prompt, the developer) and instructions embedded in data they are processing. Input filtering at the API boundary misses injections in tool results and retrieval outputs. Output filtering misses injections that succeed without producing detectable output. The only layer that consistently observes all three injection surfaces (user input, retrieved data, and tool results) is the runtime execution context: the sequence of tool calls the agent actually makes.

The three prompt injection surfaces in autonomous agents

Autonomous agents have a fundamentally larger injection surface than stateless chatbots because they consume data from sources that the developer does not control at deployment time.

Why injection-driven behavior creates tool-call loops

From a runtime perspective, successful prompt injection almost always produces a change in tool-call pattern. This is the key insight that makes behavioral detection viable even when the injection text itself is invisible to the agent’s monitoring layer.

When an agent is hijacked by an indirect injection, the attacker typically wants the agent to perform a specific sequence of actions: exfiltrate data (call a tool that sends content to an external endpoint), escalate privileges (call a tool the agent would not normally call), or sabotage a workflow (cause a tool that writes data to write incorrect data). In each case, the agent starts calling tools it has not called before in this session, or it starts calling familiar tools with anomalous parameters, or it loops on the same sequence of calls repeatedly (re-reading the injected document, re-processing the injected instruction, re-calling the exfiltration tool).
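The first two deviations (a tool outside the session's usual set, or a familiar tool called with an unusual parameter) can be sketched as a simple per-call check. This is an illustrative sketch only, not RunGuard's implementation: CallMonitor, baseline_tools, allowed_hosts, and the flag names are hypothetical, and it assumes outbound tools take a url argument.

```python
from typing import Dict, List, Set
from urllib.parse import urlparse


class CallMonitor:
    """Flags tool calls that deviate from what this agent normally does.

    Hypothetical helper for illustration; the baseline and host allowlist
    are assumed to be supplied by the developer.
    """

    def __init__(self, baseline_tools: Set[str], allowed_hosts: Set[str]) -> None:
        self.baseline_tools = set(baseline_tools)
        self.allowed_hosts = set(allowed_hosts)

    def check(self, tool_name: str, args: Dict[str, object]) -> List[str]:
        """Return a list of flags for one tool call (empty means nothing unusual)."""
        flags: List[str] = []

        # Deviation 1: a tool the agent does not normally call.
        if tool_name not in self.baseline_tools:
            flags.append("novel_tool")

        # Deviation 2: a familiar tool pointed at an unexpected destination.
        url = args.get("url")
        if isinstance(url, str):
            host = urlparse(url).hostname  # None if the URL has no scheme/host
            if host not in self.allowed_hosts:
                flags.append("anomalous_destination")

        return flags
```

A call such as check("http_post", {"url": "https://attacker.example/collect"}) is flagged even though http_post itself is in the baseline, which is the case that tool allowlisting alone does not cover.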

Three specific loop signatures are diagnostic of prompt injection at the behavioral level:

Each of these signatures is exactly what RunGuard’s loop detector tracks. The security_sig_fn parameter lets you define what counts as a “repeated pattern” for your specific tool set, so detection stays precise and normal retry behavior produces few false positives.

Python: behavioral injection detection with RunGuard

The following example shows how to instrument an agent that performs RAG retrieval followed by action execution. The guard uses a custom security_sig_fn that groups tool calls by category (retrieval vs. action) and flags repeated action calls that follow retrieval calls, which is the canonical indirect injection pattern.
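A minimal, self-contained sketch of this pattern follows. It does not import RunGuard or reproduce its API: LoopGuard, LoopDetected, security_sig, and the tool names are hypothetical stand-ins. The reading of repeats (trip once a cycle has occurred repeats + 1 times in a row) and of max_cycle_len (the longest cycle that is checked) is an assumption based on the description in this article.

```python
from collections import deque
from typing import Callable, Deque, Dict


class LoopDetected(Exception):
    """Raised when the trailing tool-call signatures repeat too often."""


# Hypothetical tool names, grouped by category for illustration.
TOOL_CATEGORY: Dict[str, str] = {
    "vector_search": "retrieval",
    "fetch_url": "retrieval",
    "send_email": "action",
    "http_post": "action",
}


def security_sig(tool_name: str, args: dict) -> str:
    """Collapse a tool call into a coarse signature.

    All retrieval calls share one signature, so varied queries still count
    as the same step. Action calls keep the tool name, so different actions
    are not conflated. Arguments are deliberately ignored.
    """
    category = TOOL_CATEGORY.get(tool_name, "unknown")
    if category == "retrieval":
        return "retrieval"
    return f"{category}:{tool_name}"


class LoopGuard:
    """Stand-in loop detector (assumed semantics, not RunGuard's code)."""

    def __init__(self, sig_fn: Callable[[str, dict], str],
                 repeats: int = 2, max_cycle_len: int = 4) -> None:
        self.sig_fn = sig_fn
        self.repeats = repeats
        self.max_cycle_len = max_cycle_len
        # Keep just enough history for repeats + 1 copies of the longest cycle.
        self.history: Deque[str] = deque(maxlen=(repeats + 1) * max_cycle_len)

    def observe(self, tool_name: str, args: dict) -> None:
        """Record one tool call; raise LoopDetected if a cycle repeats too often."""
        self.history.append(self.sig_fn(tool_name, args))
        sigs = list(self.history)
        needed = self.repeats + 1  # repeats=2 trips on the third occurrence
        for cycle_len in range(1, self.max_cycle_len + 1):
            window = cycle_len * needed
            if len(sigs) < window:
                break  # longer cycles need even more history
            tail = sigs[-window:]
            cycle = tail[:cycle_len]
            if all(tail[i] == cycle[i % cycle_len] for i in range(window)):
                raise LoopDetected(
                    f"cycle {cycle} occurred {needed} times in a row"
                )
```

With repeats=2, three consecutive vector_search then send_email cycles raise LoopDetected on the sixth call, and one retrieval followed by three send_email calls raises on the third send. Two such cycles pass without tripping.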

The critical design choice is repeats: 2, which trips on the third repetition. Legitimate RAG agents occasionally need two retrieval-then-action cycles for a multi-step task. Injection-driven exfiltration tends to attempt the same action more than twice, because the injected instruction persists in context and the agent retries when an attempt fails. Setting repeats: 2 therefore catches the injection pattern while keeping false positives on legitimate multi-step behavior low.

Tuning the injection guard for different risk levels

The right repeats and max_cycle_len values depend on the trust level of the data sources your agent reads and the capabilities of the tools it can call.

| Agent type | Recommended repeats | Recommended max_cycle_len | Rationale |
| --- | --- | --- | --- |
| Reads from fully trusted internal KB only; no external HTTP tools | 3 | 5 | Low injection risk; allow more cycles before flagging |
| Reads from external web pages or third-party APIs | 2 | 4 | Medium risk; web content may be adversarially crafted |
| Reads from user-submitted documents; has write/send tools | 1 | 3 | High risk; user-submitted content is untrusted; any action repetition is suspect |
| Multi-agent pipeline receiving outputs from untrusted workers | 1 | 2 | Relay injection risk; treat any repeated action call from worker output as suspect |
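
One way to keep these recommendations in code is a small lookup of profiles. The sketch below is illustrative: GuardProfile, PROFILES, and pick_profile are hypothetical names, the values are copied from the table, and the rule that the strictest applicable profile wins (including the fallback for user documents without write tools) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GuardProfile:
    repeats: int
    max_cycle_len: int


# Values taken from the tuning table; the keys are hypothetical labels.
PROFILES: Dict[str, GuardProfile] = {
    "trusted_internal_kb": GuardProfile(repeats=3, max_cycle_len=5),
    "external_web_or_api": GuardProfile(repeats=2, max_cycle_len=4),
    "user_docs_with_write_tools": GuardProfile(repeats=1, max_cycle_len=3),
    "untrusted_multi_agent": GuardProfile(repeats=1, max_cycle_len=2),
}


def pick_profile(*, reads_external: bool = False,
                 reads_user_docs: bool = False,
                 has_write_tools: bool = False,
                 untrusted_workers: bool = False) -> GuardProfile:
    """Pick the strictest profile that applies (assumed precedence)."""
    if untrusted_workers:
        return PROFILES["untrusted_multi_agent"]
    if reads_user_docs and has_write_tools:
        return PROFILES["user_docs_with_write_tools"]
    # Assumption: user documents without write tools are treated like
    # other external content, since the table has no row for that case.
    if reads_external or reads_user_docs:
        return PROFILES["external_web_or_api"]
    return PROFILES["trusted_internal_kb"]
```

The returned repeats and max_cycle_len values can then be passed to whichever guard configuration the agent uses.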

Prompt injection detection: approach comparison

| Approach | Catches direct injection | Catches indirect (RAG) injection | Catches multi-agent relay injection | Limits blast radius |
| --- | --- | --- | --- | --- |
| Input classifier on user message | Partial: novel phrasings evade classifiers | No: only scans user input, not retrieved data | No | No |
| Output filter on agent response | Partial: catches exfiltration in text; misses tool-call exfiltration | Partial: only catches injections that produce visible output | No | No |
| Tool call allowlisting | No: injection causes agent to call allowed tools with injected parameters | No: injected parameters are inside allowed tool calls | No | Partial: limits which tools can be called, not parameter content |
| RunGuard behavioral loop detection | Yes: injection-driven repetition trips the loop detector | Yes: retrieval → action → action pattern detected | Yes: relay injection causes repeated novel tool calls, which trip repeats guard | Yes: budget cap limits token cost of any injection run |

Behavioral detection is not a replacement for input sanitization and output filtering; it is the layer that catches what those filters miss. For the cost amplification risks from injection-driven loops, see “Prevent AI agent runaway cost in real time”. For sandbox escape patterns that share detection mechanics with privilege-escalation injection, see “AI agent sandbox escape prevention”.

Add runtime injection detection to your LLM agent

RunGuard installs in one command: pip install runguard for Python, or npm install @runguard/sdk for TypeScript. Wrap your agent’s LLM call function with guard(), pass a security_sig_fn that categorizes your specific tool set into retrieval and action groups, set repeats: 2 for external-data agents, and catch LoopDetectedError to halt and report. The guard operates entirely on tool-call metadata. It never reads the content of retrieved documents or user messages, so that content does not need to be routed through a third-party classifier.

RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.

Start your 14-day free trial — or explore related patterns: AI agent sandbox escape prevention, AI agent context poisoning detection, autonomous agent cost control best practices, AI agent retry storm prevention, and prevent AI agent runaway cost in real time.