A runtime cost cap for browser-use agents
Browser-use gives every `Agent` a `max_actions_per_step`, a `max_failures`, and an outer `max_steps` on `Agent.run()`. They are the right primitives for keeping the loop bounded, but they do not put a dollar cap on the run. A planner LLM call plus 4–6 browser actions per step, multiplied across 100 steps with a frontier model and a paid proxy pool, is the kind of bill that arrives the morning after a stuck button traps the agent on a checkout page. This page covers the runtime budget breaker we ship and how it slots into a browser-use step in eight lines of Python.
Where the dollars actually accumulate inside a browser-use run
- The planner LLM call on every step. Browser-use builds an interactive elements DOM snapshot on every step, ships the screenshot plus the element index plus the conversation history into the planner, and asks the model what action to take next. On a frontier model, that prompt sits in the 5–15K input-token range per step before the assistant token even fires. Multiply by 100 steps and you have a six-figure-token planner bill on a single run.
- The browser action itself. Each `click_element`, `input_text`, `scroll`, or `open_tab` is a Playwright call against a real browser. If you’re running on a paid proxy pool, every page navigation incurs per-request fees. If you’re running on a hosted browser-as-a-service, every minute of session time incurs session fees. Both bill independently of the model.
- The retry loop on a flaky locator. A button that’s technically present but covered by a cookie banner that rejects clicks. The agent observes “clicked element 12, page did not change”, replans, clicks element 12 again, replans, clicks element 12 again. `max_failures` defaults to 3 and counts hard exceptions — not no-op clicks. The planner runs every step regardless.
- The infinite-scroll trap. Pagination on a feed where each scroll renders new elements. The planner sees a longer DOM, asks for another scroll, sees a longer DOM, asks for another scroll. Each step is a fresh planner call against a growing screenshot — cost per step increases as the loop runs.
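To make the per-step arithmetic above concrete, here is a back-of-envelope sketch. The per-token prices are illustrative assumptions for this example, not quoted rates for any specific model:

```python
# Back-of-envelope planner cost per step. The per-token prices below are
# illustrative assumptions, not quoted vendor rates.
INPUT_USD_PER_TOKEN = 2.50 / 1_000_000
OUTPUT_USD_PER_TOKEN = 10.00 / 1_000_000

def step_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one planner round trip, from token counts."""
    return (input_tokens * INPUT_USD_PER_TOKEN
            + output_tokens * OUTPUT_USD_PER_TOKEN)

# A 15K-token screenshot prompt plus a 300-token action plan per step,
# multiplied across a 100-step run:
per_step = step_cost_usd(15_000, 300)
run_total = 100 * per_step
```

At these assumed prices a single step is a few cents, and the 100-step run lands in the single-digit dollars — before retries, proxies, or session fees are counted.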
Why max_steps, max_actions_per_step, and max_failures don’t cap the bill
The three knobs browser-use exposes are correct in shape and wrong in unit. `max_steps` on `Agent.run()` is a count, not a dollar cap; a step on a 30K-token screenshot costs ten times what a step on a 3K-token screenshot does, and the cap doesn’t know the difference. `max_actions_per_step` bounds actions inside a step — useful, but the planner LLM call is per-step, not per-action, so the dominant per-step cost is unbounded by it. `max_failures` only counts hard exceptions; a no-op click that the planner replans on is not a failure, it’s a regression. None of the three look at cumulative dollars spent so far. A run that legitimately needs 40 steps to fill a form and a run that fires the same broken click 40 times at $0.20 each both look identical to the executor — they just produce different invoices.
What a runtime cost cap actually has to do
- Track real dollars, not request counts. A step that hit a 50K-token screenshot costs ~10× a step on a 5K-token screenshot. A request count over-credits the small ones and under-bills the big ones. The tracker has to take a USD number from the host after each call — pulled from the model SDK’s cost-usage field or computed from input/output tokens times the published per-token price.
- Trip before the next call goes out, not after. The check is in-process, on a numeric accumulator. It runs in microseconds. When the cap is crossed, the next paid action raises a typed error and the agent halts — the next planner LLM call never fires, the next Playwright navigation never opens.
- Support a rolling window, not just cumulative. “Don’t spend more than $5 on this run” is one rule. “Don’t spend more than $5 in any rolling minute” is a different rule, and the right one for long-running agents that legitimately spend dollars over an hour but should never spend dollars in a minute. The tracker has to evict old entries on every query so the second rule is enforceable in microseconds, same as the first.
- Be a primitive, not a framework opinion. The same wrap should compose with browser-use, with raw Playwright, with the OpenAI Python client, with whatever framework lands next quarter. A breaker that ships as a browser-use callback registry or an `Agent` mixin is brittle; a breaker that wraps any callable is portable.
Wrapping a browser-use step with runguard
```python
# browser-use + runguard. The Agent stays an Agent; we wrap the planner
# step so the budget tracker sees every paid call and trips before the next.
from browser_use import Agent
from langchain_openai import ChatOpenAI
from runguard import guard, BudgetExceededError, LoopDetectedError

llm = ChatOpenAI(model="gpt-4o")

async def _plan_and_act(payload):
    # Wraps the planner+action round trip for one step.
    result = await payload["agent"].step()
    usage = result.usage  # tokens + usd from the model client
    return {"action": result.action, "usd": usage.total_cost_usd}

guarded_step = guard(
    _plan_and_act,
    signature=lambda i: f"step:{i['agent'].state.url}:{i['agent'].state.last_action}",
    budget={"max_usd": 5, "window_ms": 60_000},
    loop={"repeats": 3, "max_cycle_len": 8},
    cost=lambda _i, o: o["usd"],
    on_trip=lambda e: print("[runguard]", e["reason"], e.get("spent"), "of", e.get("cap")),
)

agent = Agent(task="Book the cheapest one-way flight to Lisbon for next Friday", llm=llm)
while not agent.state.done:
    try:
        await guarded_step({"agent": agent})
    except (BudgetExceededError, LoopDetectedError) as e:
        print("halted:", e)
        break
```
The budget primitive is the `BudgetTracker` shipped at `product/sdk/src/budget.ts`: `maxUsd` for the cap, optional `windowMs` for rolling-window throttles, an `add(usd)` the host calls post-call, and an `exceeded()` the wrap reads pre-call — a hard cap with a rolling-window option, no daemon, no telemetry. Defaults are honest: $5 per run is enough for a non-trivial form-fill on a frontier model, and low enough that a stuck button doesn’t become a six-figure incident. The same wrap watches for loops on the same step signature; the fingerprint-and-window approach is documented at how to detect LLM tool-call loops in production; the LangChain-tool wrap is here; the multi-agent CrewAI wrap is here.
How the breaker behaves inside Agent.run()
- Costs accumulate after each step. The wrap reads `output.usd` on success and pushes it into the `BudgetTracker`. Successful steps under the cap pass through transparently — the agent sees its observation and continues planning. Zero-cost steps (a cached step, a free local model) never trip the budget; the tracker explicitly skips zero entries.
- The first step over the cap throws before its planner LLM call goes out. `BudgetExceededError` is constructed with the cumulative spend, the cap, and a reason field. It propagates out of the guarded step into the caller’s `while` loop. The browser session can be cleanly torn down by the host — the tracker doesn’t close pages or terminate processes; that’s for the host to do.
- Your `on_trip` hook fires before the throw. Page Slack with the spend curve, write a row to a trip log keyed on the failing URL, screenshot the last DOM — whatever you wire. Sync hooks run inline; async hooks are awaited. An `on_trip` exception propagates instead of the trip error, by design (the host explicitly opted in to side-effecting on trip).
- Reset is explicit. When a fresh run starts, call `guarded_step.reset()` to clear the spend ledger. The tracker is per-guarded-fn, not per-process — you can keep one budget per agent and run several agents independently, or share one budget across a multi-agent fleet for a fleet-wide cap. The same `reset()` wipes the loop window on the same wrap.
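The dispatch-on-type pattern this list relies on can be sketched as follows. The error classes here are self-contained stand-ins so the snippet runs without the SDK installed; the real exports carry the same names:

```python
# Stand-in error types so the sketch is self-contained. The constructor
# fields (spent, cap) follow the prose; the SDK's exact shape may differ.
class BudgetExceededError(Exception):
    def __init__(self, spent: float, cap: float):
        super().__init__(f"budget exceeded: {spent:.2f} of {cap:.2f}")
        self.spent, self.cap = spent, cap

class LoopDetectedError(Exception):
    pass

def handle_trip(err: Exception) -> str:
    # Dispatch on the exception type, never on message strings.
    if isinstance(err, BudgetExceededError):
        return f"halt: spent ${err.spent:.2f} of ${err.cap:.2f} cap"
    if isinstance(err, LoopDetectedError):
        return "halt: step signature looping"
    raise err  # anything else is a real bug; let it surface
```

Because the errors are typed, the host can route a budget trip to finance alerting and a loop trip to an engineering channel without parsing a single message string.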
Tuning for browser-use cost shapes
Browser-use defaults to `max_steps=100` on `Agent.run()`. On gpt-4o at typical web-element-extraction prompt sizes, a step lands around $0.05–$0.15 of input tokens before assistant output. The default `max_usd: 5` on the budget tracker therefore covers roughly 33–100 steps on input tokens alone, closer to 18–50 once assistant output is counted — an honest one-shot form-fill or research run, generous enough that legitimate workflows finish, tight enough that a stuck button trips the breaker before the bill triples. For long-running agents (an hourly scraper, a continuous monitoring agent), set `window_ms: 60_000` with the same `max_usd: 5`: the cap rolls; old spend evicts; the cumulative invoice over an hour is unbounded but the per-minute spike is capped. For high-stakes work where an over-spend is worse than an under-spend (a production checkout flow, paid lead capture), drop to `max_usd: 1` — a tighter cap costs you one re-run on legitimate workflows; a looser cap costs you one Friday-night incident. Stack the budget guard with the loop detector on the same wrap: a stuck-click loop usually trips the loop guard first (the signature repeats fast), but a slow-burn loop on slightly-different-each-time DOM hashes trips the budget instead — both stop the agent, both leave a typed error, both are cheap to retry.
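The rolling-window behavior can be illustrated with a simulated clock; the timestamps and amounts below are made up for the example:

```python
def window_spend(entries, now_ms, window_ms=60_000):
    """Sum the USD of (timestamp_ms, usd) entries inside the rolling window.
    Illustrative helper, not the SDK's API."""
    return sum(usd for t_ms, usd in entries
               if now_ms - window_ms < t_ms <= now_ms)

# Three $2 steps at t=0s, t=30s, t=70s against a 60s window and a $5 cap:
entries = [(0, 2.0), (30_000, 2.0), (70_000, 2.0)]
assert window_spend(entries, now_ms=30_000) == 4.0  # first two steps in window
assert window_spend(entries, now_ms=70_000) == 4.0  # t=0 entry has rolled out
```

The run spends $6 cumulatively without ever tripping the $5 cap, because no 60-second window holds more than $4 — exactly the shape a legitimate long-running agent produces and a runaway loop does not.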
Loop detection on the same wrap
- Signature is the step fingerprint. `step:<url>:<last_action>` covers most browser-use loops — same URL plus same proposed action means the agent is asking the planner to re-do something it already did. The detector pushes the signature into a 32-entry sliding window and looks for any cycle of length 1–8 repeating 3+ times. Length 1 catches the stuck-click loop (same click on the same URL three times). Length 2 catches the click-then-back-then-click ping-pong. Higher lengths cover the multi-step retry shapes a planner falls into when a form rejects an input and asks for a retype.
- Trip event tells you which fired. `reason: "loop"` for a cycle hit; `reason: "budget"` for a cost cap; `reason: "context"` if you also pass a context-window guard for screenshot-driven prompts that grow unbounded. The typed error is one of `LoopDetectedError`, `BudgetExceededError`, `ContextLimitError` — the calling code dispatches on the type, not on string parsing.
- Per-fn or shared. One `guarded_step` per agent gives you per-agent isolation. One shared `guard()` across multiple agents in a fleet gives you cross-agent loop detection — useful when two agents are scraping the same retry-loop–heavy upstream and you want one breaker to halt both.
- Zero outbound calls. The whole check is pure data flow inside your Python process. No telemetry, no daemon, no SaaS. The wrap is the only thing in your process that knows the agent is loop-stuck or over-budget — the loop counter and the dollar accumulator both live in-memory, scoped to the guarded function.
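The cycle check over the sliding window can be sketched like this; the function is an illustration of the fingerprint-and-window idea described above, not the SDK's actual implementation:

```python
from collections import deque

def detect_cycle(window, max_cycle_len=8, repeats=3):
    """Return the repeating cycle if the tail of the signature window is the
    same length-k sequence repeated `repeats` times, for any k up to
    max_cycle_len; otherwise None."""
    sigs = list(window)
    for k in range(1, max_cycle_len + 1):
        need = k * repeats
        if need > len(sigs):
            break  # window too short for a longer cycle
        tail = sigs[-need:]
        cycle = tail[:k]
        if all(tail[i] == cycle[i % k] for i in range(need)):
            return cycle
    return None

window = deque(maxlen=32)  # the 32-entry sliding window from the prose
for sig in ["step:/checkout:click_12"] * 3:  # the stuck-click shape
    window.append(sig)
assert detect_cycle(window) == ["step:/checkout:click_12"]
```

The same function catches the length-2 ping-pong (`a, b, a, b, a, b` trips as cycle `[a, b]`) because a length-1 check on that tail fails before the length-2 check matches.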
The first loop our SDK caught was ours
It wasn’t a browser-use run — it was our own launch script firing a six-tweet thread against a paid X API. The first attempt came back with HTTP 402 CreditsDepleted. Six consecutive sessions later, six identical signatures — `post_tweet:402:CreditsDepleted` — were sitting in a flat JSON file on disk. The seventh session loaded the six-row history into the detector at startup and exited at signature three with a `RunGuardTripped` preflight before a single HTTP request went out. It has held the breaker open every session since. Read the dogfood story on the 30-day log; the same pattern slots into a browser-use run when a planner replans the same stuck click against the same URL three times.
What this is not
- Not a browser-use plugin. RunGuard does not subclass `Agent`, ship a `Controller` mixin, or hook into the action registry. It wraps the underlying callable. That is the design — the same wrap composes with raw Playwright, with the OpenAI Python client, with LangChain tools, with whatever framework lands next quarter. The SDK at `product/sdk/src/budget.ts` is 84 lines; the loop detector at `product/sdk/src/loop-detector.ts` is 111 lines; both are in-process primitives.
- Not a usage analytics platform. A spend dashboard answers “how much did the agent cost yesterday?”. A runtime cost cap answers “should the next paid step go out?”. The two are complementary — one for finance, one for prevention. Run both.
- Not a server. No outbound network, no telemetry, no cookies, no daemon, no SaaS. The budget check is pure data flow inside your Python process. The same in-process discipline shows up in the embed-preview widget; the policy is one repo away in llms.txt.
The minimum browser-use integration
One `pip install runguard`, one `guard()` wrap around the per-step round trip, one `cost` function that pulls a dollar number from the model SDK’s usage field, and one `on_trip` that pages the channel you actually read. Eight lines of wrap, no callback to register, no controller subclass, no agent mixin. The breaker trips on the dollar cap or the third repeat of any step signature, halts the run, and leaves a structured event and a typed error behind for the post-mortem. RunGuard ships it as `runguard` on PyPI and `@runguard/sdk` on npm — same primitive, both runtimes, in-process, zero deps.