Agent tool call timeout handling: the three failure modes and how to prevent each one

When an AI agent’s tool call hangs, stalls, or returns slowly, you face three distinct failure modes: the agent blocks indefinitely waiting for a result, the agent retries the same tool call repeatedly (a retry storm), or the agent receives an error and enters a loop trying to recover. All three are expensive. The first costs wallclock time and memory. The second costs API tokens (every retry is an LLM call that processes the retry result). The third costs both. This page explains how to configure per-tool timeouts, how retry storms form, and how RunGuard’s loop detector catches them before they drain your budget.

Failure mode 1: the hanging tool call

Failure mode 2: the retry storm

Failure mode 3: the error-recovery loop

Tool timeout patterns comparison

PatternProblemSolution
No timeout on toolAgent hangs indefinitely on slow networkAdd per-tool timeout (10–30s depending on tool)
Tool returns error string on timeoutTriggers retry storm — model sees error as dataRaise exception instead of returning error string
Tool raises exception on timeoutModel may still retry, but interprets as hard failureRunGuard loop detector catches 3+ repeats
Global max_iter limitCounts iterations not cost — still expensiveRunGuard budget cap fires on dollar spend, not iteration count
Exponential backoff in toolAdds real delays — still costs tokens on each retryCombine backoff with RunGuard retry-count detection

Per-tool timeout recommendations by tool type