Helicone tracks what each call cost. RunGuard stops the next call before it fires.

Helicone is a proxy-based LLM observability platform. You swap your baseURL for https://oai.helicone.ai/v1, add one header, and every subsequent OpenAI (or Anthropic, or Google) call flows through Helicone’s infrastructure before reaching the provider. Helicone logs the request and response, computes cost from token counts and model price tables, surfaces per-call and per-user cost figures in its dashboard, and lets you set rate-limit policies through request headers. All of that is real and valuable.

What Helicone is not is an in-process circuit breaker. It sits between your code and the provider API, not inside your agent’s run loop; it receives each call after your code has already decided to make it; and when a rate limit fires, the response arrives at your code as an HTTP 429: the call went out, hit the proxy, got rejected, and returned. RunGuard works from the other side: it runs inside your process, accumulates per-run state, and throws a typed exception before the LLM call is ever constructed. The gap between those two designs is where agent loops and weekend bill explosions live.

How Helicone actually works

Helicone’s integration is a baseURL swap and a header injection. In TypeScript with the OpenAI SDK:

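import OpenAI from 'openai';

// Placeholder: your application's identifier for the end user making the request.
const userId = 'user-123';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  // Route requests through Helicone's proxy instead of api.openai.com.
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
    'Helicone-User-Id': userId, // lets Helicone attribute cost per user
  },
});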

From this point, every client.chat.completions.create() call is an HTTP request that goes to oai.helicone.ai, not api.openai.com. Helicone’s proxy receives the request, forwards it to OpenAI, receives the response, logs both ends (request body, response body, latency, status code, token counts, computed cost), and returns the OpenAI response to your code. The round-trip latency overhead is typically 20–80 ms depending on Helicone’s region proximity to you and to OpenAI’s endpoint.

The important consequence of this architecture: Helicone only knows about a call as it happens. Your agent’s internal state — the list of tool calls it has made, the pattern those calls form, the accumulated dollar sum of the current run — lives in your process’s memory, not in a variable Helicone can read. Helicone receives individual HTTP requests; it has no model of “agent run as a unit.” A run that makes ten tool calls generates ten separate proxy transactions. Helicone logs each one. If calls 8, 9, and 10 are identical to call 7 (a tool-call loop), Helicone logs four identical entries. It does not interrupt call 9.

Helicone rate limits: what they protect and what they don’t

Helicone ships a rate-limiting feature configured via request headers or the dashboard. You can set limits on requests per minute per user, tokens per minute per user, or cost per day per user (available on paid plans). When a limit is crossed, Helicone returns HTTP 429 to your code. With the OpenAI SDK, your code sees an OpenAI.APIError with status 429 and needs to handle it.
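
As a rough illustration of the header-based configuration, the sketch below builds a policy string for a per-user daily cost cap. The header name Helicone-RateLimit-Policy and the "[quota];w=[window seconds];u=[unit];s=[segment]" format reflect our understanding of Helicone's documentation and should be checked against the current docs before use; buildRateLimitPolicy and the RateLimitPolicy type are hypothetical helpers, not part of any SDK.

```typescript
// Hypothetical helper. The policy format below is an assumption based on
// Helicone's documented rate-limit header; verify it before relying on it.
type RateLimitUnit = 'request' | 'cents';

interface RateLimitPolicy {
  quota: number;         // maximum requests (or cents) allowed per window
  windowSeconds: number; // length of the window, in seconds
  unit?: RateLimitUnit;  // omitted means the proxy's default unit
  segment?: string;      // e.g. 'user' to apply the limit per Helicone-User-Id
}

function buildRateLimitPolicy(policy: RateLimitPolicy): string {
  const parts = [`${policy.quota}`, `w=${policy.windowSeconds}`];
  if (policy.unit) parts.push(`u=${policy.unit}`);
  if (policy.segment) parts.push(`s=${policy.segment}`);
  return parts.join(';');
}

// $10 per user per day, expressed as 1000 cents over an 86,400-second window.
const dailyCostCap = buildRateLimitPolicy({
  quota: 1000,
  windowSeconds: 86_400,
  unit: 'cents',
  segment: 'user',
});

// Would be merged into the client's defaultHeaders next to Helicone-Auth.
const rateLimitHeaders = { 'Helicone-RateLimit-Policy': dailyCostCap };
console.log(rateLimitHeaders);
```

Note that the window and the segment are the only scoping dimensions in this policy; nothing in it identifies an individual agent run.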

These limits are useful for multi-tenant SaaS products where you need to enforce fair-use guardrails across paying customers. They are not useful for the single-agent-run budget problem for two reasons:

They are per-user-per-window, not per-run. A “cost per day per user” limit means your agent can burn 80% of the daily allowance in one run, and the limit does not fire until the full daily cap is exhausted. That may be many runs later, and it may include runs from other processes using the same user ID. There is no Helicone concept of “this specific agent.run() call must not exceed $5 total.”
A 429 is an HTTP error, not a typed circuit-breaker exception. When Helicone returns 429, your code receives an OpenAI.APIError with status 429 (the OpenAI SDK also retries 429 responses automatically, twice by default, before surfacing the error). Unless you have written explicit handling for this specific status, the error propagates up as an uncaught exception or is retried by your agent’s retry logic, which may generate another call that also returns 429, creating a retry loop against the rate-limit wall. This is the same pattern RunGuard is designed to break: a repeated tool call that hits a consistent error and loops until something external kills it.
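
If you stay with proxy-side limits only, the retry loop described in the second point has to be bounded by hand. The following is a minimal sketch of that handling, assuming the thrown error exposes a numeric status property (as HTTP client errors commonly do). The names callWithBounded429Retries, RateLimitWallError, and HttpError are hypothetical and not part of Helicone, the OpenAI SDK, or RunGuard.

```typescript
// Stand-in for whatever error shape your HTTP client or SDK surfaces.
interface HttpError extends Error {
  status?: number;
}

function isRateLimited(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as HttpError).status === 429;
}

// Raised when every allowed attempt came back as a 429.
class RateLimitWallError extends Error {
  readonly attempts: number;
  constructor(attempts: number) {
    super(`Gave up after ${attempts} consecutive 429 responses`);
    this.name = 'RateLimitWallError';
    this.attempts = attempts;
  }
}

async function callWithBounded429Retries<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRateLimited(err)) throw err; // not a 429: do not mask it
      if (attempt === maxAttempts) throw new RateLimitWallError(attempt);
      // Exponential backoff before the next attempt: 200 ms, 400 ms, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error('unreachable: maxAttempts must be at least 1');
}
```

Even with this wrapper, every attempt still leaves the process and reaches the proxy; the bound limits how often that happens but does not prevent the first rejected call.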

Helicone’s rate-limit feature answers the question “how do I prevent one user from monopolising my LLM quota?” It does not answer “how do I stop one agent run from looping past its cost ceiling?”

The proxy-side blind spot: loop detection

Detecting a tool-call loop requires understanding the sequence of calls within a single run. The signature of a loop is not “one expensive call” but “the same call or pattern of calls repeating without progress.” To detect this, you need to:

1. Track every tool call made during this run (not just each individual HTTP request to the LLM provider).
2. Fingerprint each call by a hash of its name + arguments (or a subset of its arguments if they contain dynamic timestamps).
3. Check whether that fingerprint has appeared within a sliding window of recent calls.
4. Throw before the next call is even constructed if the window has hit the loop threshold.
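
The steps above can be sketched in a few dozen lines. This is an illustration of the general technique under stated assumptions, not RunGuard's implementation: LoopWindow, LoopTripped, fingerprint, and the ignoreKeys parameter are hypothetical names, and a sorted-key canonical string stands in for the hash (a real implementation would likely hash it to keep fingerprints short).

```typescript
type ToolArgs = Record<string, unknown>;

// Serialise with sorted keys so argument order does not change the fingerprint.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((item) => canonicalize(item)).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`);
    return `{${fields.join(',')}}`;
  }
  return String(JSON.stringify(value));
}

// Step 2: tool name + arguments, minus keys known to be dynamic (e.g. timestamps).
function fingerprint(name: string, args: ToolArgs, ignoreKeys: string[] = []): string {
  const stable: ToolArgs = {};
  for (const [key, val] of Object.entries(args)) {
    if (!ignoreKeys.includes(key)) stable[key] = val;
  }
  return `${name}:${canonicalize(stable)}`;
}

class LoopTripped extends Error {
  readonly fingerprint: string;
  constructor(fp: string, reps: number) {
    super(`Tool call already seen ${reps} times in window: ${fp}`);
    this.name = 'LoopTripped';
    this.fingerprint = fp;
  }
}

class LoopWindow {
  private readonly recent: string[] = []; // step 1: per-run call history
  private readonly maxReps: number;
  private readonly windowSize: number;

  constructor(maxReps: number, windowSize: number) {
    this.maxReps = maxReps;
    this.windowSize = windowSize;
  }

  // Steps 3 and 4: call this before constructing the next tool call.
  check(name: string, args: ToolArgs, ignoreKeys: string[] = []): void {
    const fp = fingerprint(name, args, ignoreKeys);
    const seen = this.recent.filter((entry) => entry === fp).length;
    if (seen >= this.maxReps) throw new LoopTripped(fp, seen);
    this.recent.push(fp);
    if (this.recent.length > this.windowSize) this.recent.shift();
  }
}

// Call 1 differs; calls 2 to 4 repeat; the check for call 5 throws.
const detector = new LoopWindow(3, 10);
detector.check('search', { q: 'pricing' });
for (let i = 0; i < 3; i++) {
  detector.check('fetch', { url: '/status', ts: i }, ['ts']);
}
let tripped = false;
try {
  detector.check('fetch', { url: '/status', ts: 99 }, ['ts']);
} catch (err) {
  tripped = err instanceof LoopTripped;
}
console.log({ tripped });
```

The history array is the piece a proxy cannot hold: it only exists for the lifetime of one run, inside the process that makes the calls.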

Helicone operates at step 0 of this list — it receives completed HTTP requests. It does not have access to steps 1–4 because those require in-process state between calls. The proxy sees each call individually, logs its content, and forwards it. A loop fingerprinter must sit inside the agent’s execution context, not between the agent and the provider API.

This architectural gap is not a gap Helicone is trying to close — proxy-based observability and in-process runtime guardrails are genuinely different instruments with different design constraints. Helicone is built to be language-agnostic and framework-agnostic (any code that can change a baseURL and add headers works), which requires placing it entirely outside the agent’s process. That universality is why it cannot do in-process fingerprinting.

What RunGuard does instead

RunGuard ships as a TypeScript SDK (Python in public beta). The install is:

npm install @runguard/sdk

The integration wraps your existing agent function at the call site — you do not change the agent’s internal code, you wrap the outermost function that runs the agent loop:

import { guard } from '@runguard/sdk';

const result = await guard(
  () => runMyAgentLoop(input),
  {
    maxLoopReps: 3,      // trip if same tool-call signature repeats 3×
    budgetUsd: 5.00,     // trip if run exceeds $5 cumulative spend
    maxContextTokens: 100_000,  // trip if context window approaches truncation
    onTrip: (event) => {
      notifySlack(`RunGuard tripped: ${event.reason} on run ${event.runId}`);
    },
  }
);

Inside guard(), RunGuard instruments each tool invocation via the SDK’s interceptor layer. With maxLoopReps: 3, if calls 2, 3, and 4 share the same fingerprint, RunGuard throws a RunGuardTripped exception before call 5, carrying reason: 'loop', the fingerprint that triggered it, and the trip ID. Your code catches RunGuardTripped at the await guard() site, or lets it propagate to a top-level error boundary, and the LLM call that would have been call 5 is never constructed, never sent, and never billed.
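
To make the catch site concrete, here is a self-contained sketch of handling a typed trip exception separately from other failures. TrippedError is a local stand-in for RunGuardTripped, defined here only so the example runs on its own; the SDK's actual class and field names may differ, and runGuarded and RunOutcome are hypothetical helpers.

```typescript
type TripReason = 'loop' | 'budget_exhausted';

// Local stand-in for the SDK's typed exception (assumed shape, for illustration).
class TrippedError extends Error {
  readonly reason: TripReason;
  readonly tripId: string;
  constructor(reason: TripReason, tripId: string) {
    super(`Run tripped: ${reason} (${tripId})`);
    this.name = 'TrippedError';
    this.reason = reason;
    this.tripId = tripId;
  }
}

type RunOutcome<T> =
  | { status: 'ok'; value: T }
  | { status: 'tripped'; reason: TripReason; tripId: string };

// Converts a trip into a value the caller can branch on; anything else rethrows.
async function runGuarded<T>(run: () => Promise<T>): Promise<RunOutcome<T>> {
  try {
    return { status: 'ok', value: await run() };
  } catch (err) {
    if (err instanceof TrippedError) {
      return { status: 'tripped', reason: err.reason, tripId: err.tripId };
    }
    throw err; // unrelated failures still reach the top-level error boundary
  }
}
```

The design point is that a trip is distinguishable by type, so retry logic can be written to never retry it, which is harder to guarantee when the only signal is an HTTP status code.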

The budget cap works on the same principle: RunGuard maintains a runBudget accumulator that is updated after each successful LLM response from its usage fields. Before constructing the next request, RunGuard checks whether the accumulator has crossed budgetUsd. If it has, it throws RunGuardTripped with reason: 'budget_exhausted'. The call does not go out. This is the piece that Helicone, like Langfuse, lacks: a synchronous, in-process cost accumulator that gates the next generation before it fires.
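
A minimal sketch of such an accumulator follows, assuming OpenAI-style usage fields (prompt_tokens and completion_tokens). The RunBudget and BudgetExhausted names, the 'example-model' entry, and its prices are all hypothetical placeholders chosen to make the arithmetic easy to follow; they are not RunGuard's internals or real provider prices.

```typescript
// Illustrative prices in USD per million tokens; NOT real provider prices.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

const PRICES: Record<string, ModelPrice> = {
  'example-model': { inputPerMTok: 2.0, outputPerMTok: 8.0 },
};

interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

class BudgetExhausted extends Error {
  readonly spentUsd: number;
  readonly budgetUsd: number;
  constructor(spentUsd: number, budgetUsd: number) {
    super(`Run spent $${spentUsd.toFixed(4)} against a budget of $${budgetUsd.toFixed(2)}`);
    this.name = 'BudgetExhausted';
    this.spentUsd = spentUsd;
    this.budgetUsd = budgetUsd;
  }
}

class RunBudget {
  private spentUsd = 0;
  private readonly budgetUsd: number;

  constructor(budgetUsd: number) {
    this.budgetUsd = budgetUsd;
  }

  // Gate: call before building the next request.
  assertCanSpend(): void {
    if (this.spentUsd >= this.budgetUsd) {
      throw new BudgetExhausted(this.spentUsd, this.budgetUsd);
    }
  }

  // Accumulate: call after each successful response, using its usage fields.
  record(model: string, usage: Usage): number {
    const price = PRICES[model];
    if (!price) throw new Error(`No price entry for model ${model}`);
    const cost =
      (usage.prompt_tokens * price.inputPerMTok +
        usage.completion_tokens * price.outputPerMTok) / 1_000_000;
    this.spentUsd += cost;
    return cost;
  }

  get spent(): number {
    return this.spentUsd;
  }
}

// Each simulated call costs (10,000 * 2 + 2,500 * 8) / 1,000,000 = $0.04.
const budget = new RunBudget(0.05);
let callsMade = 0;
try {
  for (let i = 0; i < 5; i++) {
    budget.assertCanSpend();
    callsMade++; // the real LLM request would be sent here
    budget.record('example-model', { prompt_tokens: 10_000, completion_tokens: 2_500 });
  }
} catch (err) {
  if (!(err instanceof BudgetExhausted)) throw err;
}
console.log({ callsMade, spent: budget.spent });
```

Because cost is only known after a response arrives, a gate of this shape can overshoot the cap by at most one call: here the second call takes the run from $0.04 to $0.08 against a $0.05 budget, and the third is blocked.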

Side-by-side: what each tool does

Capability	Helicone	RunGuard
Log cost per LLM call	Yes — proxy logs token + cost in dashboard	Yes — accumulator tracks spend per run
Cost cap that stops the next call	No — 429 arrives after call goes out	Yes — throws `RunGuardTripped` before call is sent
Per-run budget (not per-user-per-day)	No	Yes — `budgetUsd` per `guard()` invocation
Tool-call loop detection	No	Yes — fingerprint + sliding window, `maxLoopReps`
Context-window truncation alert	No	Yes — `maxContextTokens` threshold
Slack / PagerDuty webhook on trip	No (alert emails on daily budget threshold)	Yes — `onTrip` callback + Team plan webhooks
Integration method	`baseURL` swap + request header	Wrap the agent function with `guard()`
Language coverage	Any language that sends HTTP	TypeScript (Python beta)
Works with custom LLM providers	Needs supported provider proxy	Yes — budget tracked from response `usage` field
Framework adapters	Provider-level (OpenAI, Anthropic, Gemini)	LangChain.js, LangGraph, browser-use, agentkit

When to use both

Helicone and RunGuard are not competing for the same slot in your stack. Helicone is your observability layer: historical cost by user, by model, by prompt version; latency percentiles; eval scoring pipelines; prompt management. If you have multiple users hitting your agent product and you need to understand spend at the org and user level, Helicone is the right tool for that view.

RunGuard is your runtime safety net: the in-process guard that catches the loop or the budget blowout while the run is still happening. It does not replace Helicone’s dashboards. It adds the thing Helicone cannot add: a typed exception that fires before the billable call goes out.

The two integrate cleanly. You keep your Helicone baseURL swap; you add guard() at the top of your agent invocation. Helicone logs every call that successfully exits guard(). RunGuard trips before any call that would extend a loop past maxLoopReps or that would go out after the per-run dollar cap has been crossed. The Helicone dashboard shows you the healthy runs and any calls that preceded a trip; RunGuard’s trip log shows you the prevention events. Each instrument does one thing well.

LangChain and LangGraph users

If you are using LangChain.js or LangGraph, RunGuard ships typed adapters that integrate at the framework layer rather than wrapping the top-level function. For LangChain.js, guard(toolFn) wraps at the DynamicTool level, below the CallbackManager. For LangGraph, breaker.wrap(nodeFunction) wraps at the node level so the graph’s own retry and interruption semantics still function above the guard. See the LangChain circuit breaker guide and the LangGraph infinite loop guard guide for integration walkthroughs.

Helicone’s proxy approach works with LangChain and LangGraph because it operates at the HTTP layer — you swap the openai client’s baseURL and LangChain sends that instrumented client through its callback chain as usual. Both tools coexist; neither interferes with the other.

The canary incident

RunGuard was dogfooded on its own launch flow. The tool that posts the launch thread to X returns HTTP 402 CreditsDepleted when the shared account’s credit balance is exhausted. The same 402 — same endpoint, same status code, same error title — repeated across six consecutive agent sessions. The agent had no guard around the poster; it retried on every session regardless of the upstream blocker.

After session 16, a LoopDetector was added with repeats: 3 on the endpoint + status + error-title signature. The detector would have tripped on the fourth attempt. It now trips the preflight on every session before the HTTP call is made — the call count has been pinned at three attempts since the detector was added, and the six entries in deploy/sdk-trip-state.json are byte-identical session over session. The Helicone dashboard for that account, had we been using it, would have shown six identical 402 entries. It would not have stopped the sixth.

Frequently asked questions

Can I use RunGuard if I’m already using Helicone?
Yes. They operate at different layers. Keep the baseURL swap for Helicone and add guard() around your agent invocation. Helicone logs every call that makes it past guard(); RunGuard prevents the calls that would start a loop or blow the budget. The two are additive.

Does RunGuard work without Helicone?
Yes. RunGuard has no Helicone dependency. It works with any LLM provider or framework that executes tool calls. You get per-run cost accumulation, loop detection, and context-window alerts regardless of whether you use Helicone, Langfuse, LangSmith, or no observability layer at all.

Helicone’s cost-per-day limit fired and my agent is retrying the 429. How do I fix that?
Wrap the tool call (or the whole agent function) with guard() and set budgetUsd to a value below the Helicone daily cap. RunGuard will throw before you hit the cap. If you have already hit the 429 wall, add a pattern like { endpoint: 'oai.helicone.ai', status: 429, errorTitle: 'RateLimitError' } to the loop detector, and RunGuard will break the retry loop on the third repetition. See how to detect an LLM tool-call loop in production for the full fingerprinting walkthrough.

Does RunGuard replace Helicone for multi-tenant cost attribution?
No. RunGuard is scoped to a single guard() invocation and does not have per-user-per-day aggregations, a historical cost dashboard, or a prompt registry. If you need to attribute spend across thousands of users and show them their own usage in your product, Helicone or a similar observability platform is the right tool. RunGuard handles the per-run safety layer inside a single agent execution.
