LangSmith is LangChain's hosted observability, evaluation, and prompt management platform. When LANGCHAIN_TRACING_V2=true is set, a LangChainTracer callback handler is attached to every LangChain invocation: every chain, every LLM call, and every tool call is captured as a linked run tree and POSTed to api.smith.langchain.com. LangSmith Hub provides centralized prompt versioning that any LangChain agent can pull at runtime. The langsmith.evaluate() API runs your agent against a dataset and scores results with LLM-as-judge evaluators.
Together these features cover tracing, evals, and prompt management — three of the most common production needs for LangChain teams. What they do not cover is spend control. LangSmith records what your agent did. Nothing in the SDK decides that the agent has spent enough and should stop. A recursive reflection agent with no depth ceiling produces a beautifully detailed trace of every iteration right up to the moment your LLM provider returns a quota error.
There are also four structural patterns specific to LangSmith's architecture that amplify costs beyond the baseline agent loop: the @traceable decorator accumulates RunTree objects in memory and floods a background thread pool under deep recursion; evaluate() fires all dataset examples simultaneously in an unconstrained thread pool by default; hub.pull() makes a network round-trip on every agent iteration when the cache key shifts; and score-gated re-run loops recurse indefinitely when the LLM judge never returns a passing grade. None of these show up as anomalies in your LangSmith traces because they are working as designed — the bills arrive later.
What this post covers: Four cost amplification patterns specific to LangSmith's architecture, and a runtime circuit breaker guard for each. The guards work alongside LangSmith — they do not replace it. You keep the traces; you add the spend ceilings.
Pattern 1: @traceable Recursive Fan-out and Thread Pool Saturation
The @traceable decorator from the langsmith SDK is the non-LangChain equivalent of LangSmith's automatic callback tracing. Applied to a function, it wraps the function body in a RunTree context: create_run() is called on entry (HTTP POST to LangSmith), end() is called on exit (PATCH with the output), and the run is registered as a child of the currently active parent via a ContextVar (_PARENT_RUN_TREE). For a simple request-response flow, this creates one root run and one child per LLM call — a shallow tree with two HTTP calls per step.
For a recursive self-reflection agent (one that calls an @traceable-decorated function from within another @traceable function) the trace tree grows without bound. Each recursive invocation pushes a new RunTree object onto the parent stack and fires a create_run() POST before the LLM call happens. The LangSmith SDK dispatches these HTTP calls to a concurrent.futures.ThreadPoolExecutor (default: 10 workers in the langsmith Python SDK). With a recursion depth of 20 and each invocation firing 2 HTTP calls (create + update), one request produces 40 HTTP calls that contend for those 10 workers. The pool's queue absorbs the overflow, but tasks pile up, and the in-memory RunTree objects for every incomplete run stay live until their PATCH completes.
A recursive agent that reaches 50 levels of depth has 50 RunTree objects in memory simultaneously. Each RunTree carries the full input and output for its level as serialized JSON — at 4 KB per run that is 200 KB of in-process objects. More critically, the 50 pending PATCH calls compete for 10 thread pool slots. The backlog grows faster than it drains. If your agent runs concurrently (multiple requests in flight), the shared thread pool saturates and later requests see multi-second latency on every LangSmith HTTP call. The LLM costs are incurred regardless; the tracing infrastructure just slows down under the load it is producing.
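The arithmetic in the last two paragraphs can be sketched as a small helper. This is a back-of-envelope estimate only: the 10-worker pool, the 2 HTTP calls per run, and the 4 KB per run are the figures quoted above, not measured values, and `trace_backlog` is a hypothetical name.

```python
import math


def trace_backlog(depth: int, calls_per_run: int = 2, workers: int = 10,
                  bytes_per_run: int = 4 * 1024) -> dict:
    """Back-of-envelope tracing load for one recursive request."""
    http_calls = depth * calls_per_run
    return {
        "http_calls": http_calls,
        # Calls that would wait in the executor queue if all were submitted at once.
        "queued_beyond_workers": max(0, http_calls - workers),
        # Minimum number of "rounds" the pool needs to drain them.
        "drain_rounds": math.ceil(http_calls / workers),
        # In-process size of the live run objects, in KB.
        "live_run_kb": depth * bytes_per_run // 1024,
    }


estimate = trace_backlog(depth=50)
print(estimate)
```

Multiply the result by the number of requests in flight to see how quickly a shared pool falls behind.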
The guard: RecursiveTraceGuard
A depth counter in a ContextVar mirrors the SDK's parent-run stack without touching its internals. The guard checks the counter on entry to each traced function and raises before the LLM call fires:
import contextvars
import functools
from typing import Optional

from langsmith import traceable, Client
from langsmith.run_helpers import get_current_run_tree

# Nesting depth of guarded calls in the current context; kept separate from
# the SDK's own parent-run ContextVar.
_trace_depth: contextvars.ContextVar[int] = contextvars.ContextVar(
    "trace_depth", default=0
)

MAX_TRACE_DEPTH = 6   # reflection agents rarely need more than 4
MAX_CALL_BUDGET = 25  # absolute LLM call ceiling across this trace tree


class AgentBudgetError(Exception):
    pass


class RecursiveTraceGuard:
    def __init__(
        self,
        client: Client,
        max_depth: int = MAX_TRACE_DEPTH,
        max_calls: int = MAX_CALL_BUDGET,
    ):
        self._client = client
        self.max_depth = max_depth
        self.max_calls = max_calls
        self._call_count: contextvars.ContextVar[int] = contextvars.ContextVar(
            "trace_call_count", default=0
        )

    def check(self, label: str = "") -> None:
        depth = _trace_depth.get()
        calls = self._call_count.get()
        # depth is 1-based (the wrapper increments it before calling check),
        # so trip only once it goes past max_depth.
        if depth > self.max_depth or calls >= self.max_calls:
            run = get_current_run_tree()
            reason = (
                f"depth_exceeded (depth={depth}, max={self.max_depth})"
                if depth > self.max_depth
                else f"call_budget_exceeded (calls={calls}, max={self.max_calls})"
            )
            if run:
                # Tag the trip on the live LangSmith run
                self._client.update_run(
                    run.id,
                    extra={"runguard_trip": reason, "runguard_label": label},
                    tags=["runguard_trip"],
                )
            raise AgentBudgetError(f"RecursiveTraceGuard tripped: {reason}")
        self._call_count.set(calls + 1)


def guarded_traceable(guard: RecursiveTraceGuard, name: Optional[str] = None):
    """Decorator: wraps @traceable with recursive depth + call-budget enforcement."""
    def decorator(fn):
        @traceable(name=name or fn.__name__)
        @functools.wraps(fn)  # keep fn's name and signature visible to the tracer
        def wrapper(*args, **kwargs):
            token = _trace_depth.set(_trace_depth.get() + 1)
            try:
                guard.check(label=fn.__name__)
                return fn(*args, **kwargs)
            finally:
                _trace_depth.reset(token)
        return wrapper
    return decorator


# Usage (llm_call, needs_refinement, and initial_draft are application helpers)
ls_client = Client()
trace_guard = RecursiveTraceGuard(ls_client, max_depth=6, max_calls=25)


@guarded_traceable(trace_guard, name="reflect")
def reflect(draft: str, iteration: int = 0) -> str:
    """Self-reflection step: the LLM critiques its own draft."""
    critique = llm_call(f"Critique this draft: {draft}")
    if needs_refinement(critique) and iteration < 20:
        return reflect(draft=critique, iteration=iteration + 1)  # recursive
    return critique


@traceable(name="handle_request")
def handle_request(task: str) -> str:
    try:
        return reflect(draft=initial_draft(task))
    except AgentBudgetError as e:
        # Trip is already tagged on the LangSmith run via update_run()
        return f"[guarded] {e}"
The client.update_run() call on trip is load-bearing: it writes the guard's verdict into the LangSmith run before raising, so the trace in the LangSmith UI shows tag:runguard_trip and the specific reason. You can filter the Projects view by this tag to find all guarded traces and see at which depth each one was stopped — the observability layer records the circuit break; the circuit break prevents the runaway.
Note that get_current_run_tree() from langsmith.run_helpers returns the active run in the current context, which is the run opened by the innermost @traceable decorator. The update patches that specific run, not the root. If you want to tag the root run, walk run.parent_run_id up the chain — but tagging the leaf is sufficient for filtering in the UI.
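If you do want the root, the walk is a short loop. The sketch below uses a stand-in `FakeRun` class so it runs without the SDK; it assumes the in-process run object exposes a `parent_run` attribute that points at its parent object. Verify that against your installed langsmith version. If only `parent_run_id` is available, each hop would instead need a lookup through the client.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeRun:
    """Stand-in for a run object; only the attributes used below (assumed names)."""
    id: str
    parent_run: Optional["FakeRun"] = None


def find_root_run(run, max_hops: int = 1000):
    """Follow parent links until a run with no parent is reached."""
    hops = 0
    while getattr(run, "parent_run", None) is not None:
        run = run.parent_run
        hops += 1
        if hops > max_hops:  # defensive stop in case the chain is cyclic
            raise RuntimeError("parent chain too long")
    return run


root = FakeRun("root")
leaf = FakeRun("leaf", parent_run=FakeRun("mid", parent_run=root))
print(find_root_run(leaf).id)
```

The guard could then pass the root's id to update_run() instead of the leaf's.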
Pattern 2: evaluate() Concurrency Burst
LangSmith's evaluate() function (introduced in langsmith>=0.1.17) is the modern API for dataset evaluation. The call signature looks simple:
results = langsmith.evaluate(
    my_agent,                  # target: function(example_input) → output (passed positionally)
    data="my-eval-dataset",    # dataset name or UUID
    evaluators=[correctness_eval, faithfulness_eval],
    num_repetitions=3,
    max_concurrency=None,      # ← the problem
)
With max_concurrency=None (the default), LangSmith runs target function calls in a ThreadPoolExecutor sized to min(32, os.cpu_count() + 4) — the Python standard library's default. On an 8-core machine that is 12 workers. But evaluate() also fans out evaluator calls using the same pool after each target call completes. On a dataset of 200 examples with num_repetitions=3 and two evaluators, the total LLM call count is:
- 200 × 3 = 600 target (agent) calls
- 200 × 3 × 2 = 1,200 evaluator LLM calls
- 1,800 total LLM calls from a single evaluate() invocation
The concurrency burst lands on your LLM provider's rate limiter immediately: 12 threads, each with its own in-flight HTTP request, all start together when evaluate() begins. Modern LLM APIs enforce requests-per-minute (RPM) and tokens-per-minute (TPM) limits. At roughly 6 seconds per call, 12 threads issue about 2 requests per second, so a 60-RPM limit is exhausted within 30 seconds and the provider starts returning 429 errors; the SDK's retry backoff (default: exponential with jitter) then schedules retries that pile into the next rate window. Each retry is a billed call if the LLM responds before the timeout.
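How fast a burst uses up a one-minute request budget depends on per-call latency, which the paragraph above does not pin down. A minimal sketch, assuming steady-state threads that each issue one call after another; `seconds_until_rate_limit` and the 6-second latency are illustrative assumptions:

```python
from typing import Optional


def seconds_until_rate_limit(rpm_limit: int, threads: int,
                             seconds_per_call: float) -> Optional[float]:
    """Seconds until a one-minute request budget is used up, or None if the
    steady request rate stays within the limit."""
    requests_per_second = threads / seconds_per_call
    seconds = rpm_limit / requests_per_second
    return seconds if seconds < 60 else None


# 12 threads at ~6 s per call against a 60-RPM limit
burst = seconds_until_rate_limit(rpm_limit=60, threads=12, seconds_per_call=6.0)
# 4 threads at ~4 s per call against the same limit
capped = seconds_until_rate_limit(rpm_limit=60, threads=4, seconds_per_call=4.0)
print(burst, capped)
```

Faster calls make the burst worse: halving the latency halves the time to the first 429.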
Teams often run evaluations in Jupyter notebooks triggered with a single cell. The first cell to run a 200-example eval with strong judge models (GPT-4.1, Claude Opus) can cost $50–150 before the next cell executes. There is no pre-flight cost estimate built into evaluate() — it starts immediately.
The guard: guarded_evaluate()
from dataclasses import dataclass
from typing import Callable, Any

import langsmith
from langsmith import Client

# USD per 1M tokens (blended input/output); refresh from your provider's price sheet
MODEL_COST_PER_1M_TOKENS = {
    "gpt-4o": 7.50,
    "gpt-4o-mini": 0.30,
    "gpt-4.1": 8.00,
    "gpt-4.1-mini": 0.60,
    "claude-opus-4-7": 75.00,
    "claude-sonnet-4-6": 9.00,
    "claude-haiku-4-5": 1.25,
    "gemini-2.0-flash": 0.30,
}
AVG_TOKENS_PER_CALL = 2000  # blended estimate; calibrate from a pilot run


@dataclass
class EvalSpec:
    dataset_name: str
    num_examples: int
    num_repetitions: int
    num_evaluators: int
    judge_model: str
    target_model: str = "gpt-4o-mini"
    target_calls_per_example: float = 1.0  # measure from a 5-example pilot

    def total_llm_calls(self) -> int:
        target = int(self.num_examples * self.num_repetitions * self.target_calls_per_example)
        judge = self.num_examples * self.num_repetitions * self.num_evaluators
        return target + judge

    def estimated_cost_usd(self) -> float:
        # Unknown models fall back to $10 per 1M tokens
        target_cost = (
            self.num_examples * self.num_repetitions * self.target_calls_per_example
            * AVG_TOKENS_PER_CALL / 1_000_000
            * MODEL_COST_PER_1M_TOKENS.get(self.target_model, 10.0)
        )
        judge_cost = (
            self.num_examples * self.num_repetitions * self.num_evaluators
            * AVG_TOKENS_PER_CALL / 1_000_000
            * MODEL_COST_PER_1M_TOKENS.get(self.judge_model, 10.0)
        )
        return target_cost + judge_cost
class EvalBudgetError(Exception):
    pass


def guarded_evaluate(
    target: Callable,
    spec: EvalSpec,
    evaluators: list,
    *,
    max_calls: int = 400,
    max_cost_usd: float = 25.0,
    safe_concurrency: int = 4,
    sample_if_over: bool = True,
    **kwargs,
) -> Any:
    """
    Pre-flight cost gate for langsmith.evaluate().
    Samples the dataset or blocks if the run would exceed max_calls or max_cost_usd.
    Sets a safe max_concurrency to avoid LLM provider rate limit bursts.
    """
    calls = spec.total_llm_calls()
    cost = spec.estimated_cost_usd()
    print(
        f"[guarded_evaluate] dataset={spec.dataset_name!r} "
        f"examples={spec.num_examples} reps={spec.num_repetitions} "
        f"evaluators={spec.num_evaluators}\n"
        f" estimated calls={calls} estimated cost=${cost:.2f}"
    )
    over_calls = calls > max_calls
    over_cost = cost > max_cost_usd
    if over_calls or over_cost:
        if not sample_if_over:
            raise EvalBudgetError(
                f"Evaluation exceeds limits: calls={calls} (max {max_calls}), "
                f"cost=${cost:.2f} (max ${max_cost_usd:.2f}). "
                "Pass sample_if_over=True to auto-sample."
            )
        # Compute largest sample that fits both ceilings
        calls_per_item = (spec.num_repetitions * spec.target_calls_per_example
                          + spec.num_repetitions * spec.num_evaluators)
        cost_per_item = cost / spec.num_examples
        max_by_calls = int(max_calls / calls_per_item)
        max_by_cost = int(max_cost_usd / cost_per_item)
        # The floor of 10 keeps the sample meaningful, but it can exceed the
        # ceilings when a single example is very expensive.
        sample_n = max(10, min(max_by_calls, max_by_cost, spec.num_examples))
        print(
            f" → sampling {sample_n}/{spec.num_examples} examples "
            f"(calls≈{int(sample_n * calls_per_item)}, "
            f"cost≈${cost_per_item * sample_n:.2f})"
        )
        data = langsmith.Client().list_examples(dataset_name=spec.dataset_name)
        sampled = list(data)[:sample_n]
        dataset_arg = sampled
    else:
        dataset_arg = spec.dataset_name
    return langsmith.evaluate(
        target,
        data=dataset_arg,
        evaluators=evaluators,
        num_repetitions=spec.num_repetitions,
        max_concurrency=safe_concurrency,  # rate-limit-aware concurrency cap
        **kwargs,
    )
# Usage
spec = EvalSpec(
    dataset_name="rag-qa-200",
    num_examples=200,
    num_repetitions=3,
    num_evaluators=2,
    judge_model="gpt-4.1",
    target_model="gpt-4o-mini",
    target_calls_per_example=2.3,  # measured from a 10-item pilot run
)
results = guarded_evaluate(
    target=my_rag_agent,
    spec=spec,
    evaluators=[correctness_eval, faithfulness_eval],
    max_calls=400,
    max_cost_usd=20.0,
    safe_concurrency=4,
    experiment_prefix="guarded-eval",
)
# [guarded_evaluate] dataset='rag-qa-200' examples=200 reps=3 evaluators=2
#  estimated calls=2580 estimated cost=$20.03
#  → sampling 31/200 examples (calls≈399, cost≈$3.10)
The safe_concurrency=4 default is chosen for LLM providers with 60 RPM limits: at about 4 seconds per call each thread issues 15 requests per minute, so 4 threads × 15 requests/min = 60 RPM, which sits at the limit without exceeding it. For providers with higher RPM limits, raise this proportionally. The key is to measure your target function's actual LLM call count with a 5–10 item pilot before committing to a full dataset run: the target_calls_per_example field is the biggest source of budget estimation error, and a factor of 2 difference means your $25 eval costs $50.
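The proportional rule above can be written down directly. This is a sketch under the assumption that each thread issues calls back to back; `safe_concurrency_for` and its `utilization` headroom parameter are hypothetical names, not part of the langsmith SDK.

```python
import math


def safe_concurrency_for(rpm_limit: int, avg_call_seconds: float,
                         utilization: float = 1.0) -> int:
    """Largest thread count whose steady request rate stays within rpm_limit.

    utilization < 1.0 leaves headroom for retries and other traffic.
    """
    per_thread_rpm = 60.0 / avg_call_seconds
    return max(1, math.floor(rpm_limit * utilization / per_thread_rpm))


print(safe_concurrency_for(60, 4.0))        # the 60-RPM case described above
print(safe_concurrency_for(500, 4.0))       # a higher-tier limit
print(safe_concurrency_for(60, 4.0, 0.8))   # same limit with 20% headroom
```

Measure avg_call_seconds from the same pilot run you use for target_calls_per_example, since both feed the estimate.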
Pattern 3: Hub Prompt Polling Without a Stable Cache
LangSmith Hub is the prompt versioning component of the LangSmith platform. Teams author prompts in the Hub UI, version them with commit hashes, and pull them at runtime with from langchain import hub; prompt = hub.pull("org/prompt-name"). The hub.pull() call makes an authenticated HTTP GET to api.smith.langchain.com/commits/{owner}/{repo}/{commit} — or, when pulling by name without a commit hash, to the /latest endpoint.
The LangChain Hub client caches fetched prompts in an in-process dict keyed by (owner, repo, commit_hash, api_url). This cache is hit on subsequent calls — but only if the commit_hash is stable. When you pull by name without a commit (hub.pull("my-prompt")), the client first fetches the latest commit hash from /latest, then fetches the prompt at that hash. That first call to /latest is not cached. It fires on every hub.pull() call regardless of how recently you last pulled.
In a looping agent where the prompt is pulled inside the loop — a common pattern during development, where you want to iterate on prompts and see changes in the next agent run — every iteration fires at least one HTTP GET to api.smith.langchain.com for the /latest check. At 5 iterations per second that is 5 HTTP requests per second sustained. LangSmith Hub's per-IP rate limits kick in, returning 429 responses. The LangChain Hub client's retry logic backs off exponentially, adding seconds of latency to each agent iteration. At scale with concurrent agent sessions sharing the same IP, one active development session can degrade hub fetches for all other agents in the same VPC.
The subtler issue is the development-to-production leak: hub.pull() inside a loop is written during development when iteration speed matters and the rate limit is not yet hit. It gets committed and deployed to production, where the load is 10–100× higher and the same pattern triggers sustained 429 cascades.
The guard: CachedHubPrompt
import logging
import threading
import time
import warnings

from langchain import hub

logger = logging.getLogger("runguard.langsmith")


class CachedHubPrompt:
    """
    Wraps hub.pull() with an explicit TTL cache keyed on the prompt ref.
    Enforces a minimum TTL and warns on high fetch rates.
    Prevents accidental per-iteration hub.pull() in production loops.
    """

    MIN_TTL_SECONDS = 60          # never allow below this in production
    WARN_FETCH_RATE_PER_MIN = 5   # warn if pulling more than 5×/min

    def __init__(self, ref: str, ttl_seconds: int = 300):
        if ttl_seconds < self.MIN_TTL_SECONDS:
            warnings.warn(
                f"CachedHubPrompt: ttl_seconds={ttl_seconds} is below minimum "
                f"{self.MIN_TTL_SECONDS}. Using {self.MIN_TTL_SECONDS}s.",
                stacklevel=2,
            )
            ttl_seconds = self.MIN_TTL_SECONDS
        self._ref = ref
        self._ttl = ttl_seconds
        self._prompt = None
        # -inf guarantees the first get() fetches even when time.monotonic()
        # is still smaller than the TTL (for example shortly after boot).
        self._fetched_at: float = float("-inf")
        self._lock = threading.Lock()
        self._fetch_log: list[float] = []

    def get(self):
        """Return cached prompt, refreshing if TTL has expired."""
        now = time.monotonic()
        with self._lock:
            # Keep only fetch timestamps from the last 60 seconds
            self._fetch_log = [t for t in self._fetch_log if now - t < 60]
            if now - self._fetched_at >= self._ttl:
                self._fetch_log.append(now)
                rate = len(self._fetch_log)
                if rate > self.WARN_FETCH_RATE_PER_MIN:
                    logger.warning(
                        "CachedHubPrompt: %d fetches in last 60s for %r. "
                        "Increase ttl_seconds or cache at module level.",
                        rate, self._ref,
                    )
                self._prompt = hub.pull(self._ref)
                self._fetched_at = now
            return self._prompt

    def force_refresh(self):
        """Invalidate cache and fetch immediately (use sparingly)."""
        with self._lock:
            self._fetched_at = float("-inf")
        return self.get()
from langsmith import traceable

# Module-level singletons: one hub.pull() per process start, then TTL-refreshed
_agent_prompt = CachedHubPrompt("my-org/agent-v3", ttl_seconds=300)
_judge_prompt = CachedHubPrompt("my-org/judge-v2", ttl_seconds=600)


@traceable(name="agent_step")
def agent_step(messages: list[dict]) -> str:
    prompt = _agent_prompt.get()  # cache hit on every call until TTL expires
    response = llm_call(
        system=prompt.format_messages()[0].content,
        messages=messages,
    )
    return response.content


@traceable(name="evaluate_response")
def evaluate_response(output: str, expected: str) -> float:
    judge_prompt = _judge_prompt.get()
    verdict = llm_call(
        system=judge_prompt.format(output=output, expected=expected),
    )
    return parse_score(verdict.content)
Placing CachedHubPrompt instances at module level rather than inside the agent function is the key change. Python module-level objects are initialized once per process. The first call to .get() fetches from Hub; subsequent calls within the TTL window return the cached object immediately without any network I/O. In a process serving 1,000 agent requests per hour, you go from 1,000 hub fetches per hour to 12 (once per 5-minute TTL). The prompt content is still refreshed when you deploy a new version. You just wait up to one TTL period for agents to pick it up, which is acceptable for most production prompt engineering workflows.
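The 1,000-to-12 reduction can be checked without touching Hub. The sketch below is a generic TTL cache with an injectable loader and clock so it can be driven by simulated time; `TTLCache` is a hypothetical stand-in for the idea, not the CachedHubPrompt class itself, and the evenly spaced traffic is an assumption.

```python
import threading
import time
from typing import Any, Callable


class TTLCache:
    """Calls loader() at most once per ttl_seconds; clock is injectable for tests."""

    def __init__(self, loader: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._loaded = False
        self._fetched_at = 0.0
        self._lock = threading.Lock()
        self.fetch_count = 0

    def get(self) -> Any:
        now = self._clock()
        with self._lock:
            if not self._loaded or now - self._fetched_at >= self._ttl:
                self._value = self._loader()
                self._loaded = True
                self._fetched_at = now
                self.fetch_count += 1
            return self._value


# Simulate one hour of traffic: 1,000 evenly spaced requests, 300 s TTL.
fake_now = [0.0]
cache = TTLCache(loader=lambda: "prompt-template", ttl_seconds=300,
                 clock=lambda: fake_now[0])
for i in range(1000):
    fake_now[0] = i * 3.6  # 3,600 s / 1,000 requests
    cache.get()
print(cache.fetch_count)
```

Tracking whether a value has been loaded separately from the timestamp avoids relying on the clock's starting value for the first fetch.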
For development, where you want prompt updates to show up quickly, lower the TTL to the 60-second floor. CachedHubPrompt raises any smaller value (including 0 or 30) to MIN_TTL_SECONDS and emits a warning, so a 30-second TTL requires a dev-only subclass that overrides MIN_TTL_SECONDS. The warning on high fetch rates catches the case where a developer with a tight inner loop is still hitting Hub more than intended, for example through repeated force_refresh() calls.
Pattern 4: Automated Feedback Recursion
LangSmith's feedback API (client.create_feedback()) lets you attach structured scores to runs — a binary correct/incorrect label, a numeric score, or free-text annotation. Teams use this for automated evaluation pipelines: after each agent run completes and posts its trace to LangSmith, a lightweight evaluator queries the run's output and posts feedback via create_feedback(). The feedback drives downstream decisions: if the score is below threshold, the agent re-runs with the feedback in context.
The recursion risk emerges when the re-run decision and the feedback creation live in the same pipeline with no global run counter. The pattern looks like this:
- Agent produces output → LangSmith trace created (run A)
- Evaluator LLM scores output → create_feedback(run_id=A, score=0.3)
- Score < threshold (0.7) → agent re-runs with critique in context → trace created (run B)
- Evaluator scores run B's output → create_feedback(run_id=B, score=0.4)
- Still < 0.7 → agent re-runs → trace C → evaluator → score 0.45 → re-run → ...
Each iteration fires the agent (1–5 LLM calls) plus the evaluator (1 LLM call). On prompts where the LLM judge systematically disagrees with the agent — common when the judge and agent have different capability levels or different reference answers — the loop never terminates naturally. Fifteen iterations × 4 LLM calls average = 60 billed calls from one user request. At $0.01 per call that is $0.60 per request; at 1,000 requests per day, $600 per day from the feedback loop alone.
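The same arithmetic as a helper, so you can plug in your own numbers. The $0.01 per call and the 4-call average are the illustrative figures from the paragraph above, and `feedback_loop_cost` is a hypothetical name.

```python
def feedback_loop_cost(iterations: int, avg_calls_per_iteration: float,
                       usd_per_call: float, requests_per_day: int) -> dict:
    """Daily spend of a score-gated re-run loop that runs `iterations` times."""
    calls = iterations * avg_calls_per_iteration
    usd_per_request = calls * usd_per_call
    return {
        "calls_per_request": calls,
        "usd_per_request": round(usd_per_request, 4),
        "usd_per_day": round(usd_per_request * requests_per_day, 2),
    }


unbounded = feedback_loop_cost(15, 4, 0.01, 1000)  # loop that never passes
capped = feedback_loop_cost(5, 4, 0.01, 1000)      # same loop with a 5-iteration ceiling
print(unbounded)
print(capped)
```

The difference between the two results is what an iteration ceiling is worth per day at that traffic level.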
The problem is structural: the feedback pipeline checks local state (the score from the most recent evaluator call) but not global state (how many re-runs have already happened for this root request). When each iteration starts fresh, there is no mechanism to know it is the fifteenth iteration of the same task.
The guard: FeedbackLoopGuard
import uuid
import time
from dataclasses import dataclass, field

from langsmith import Client


@dataclass
class IterationState:
    root_run_id: str
    iteration: int = 0
    total_llm_calls: int = 0
    scores: list[float] = field(default_factory=list)
    start_time: float = field(default_factory=time.monotonic)


class FeedbackLoopGuard:
    """
    Global run counter for score-gated agent re-runs.
    Guards against feedback loops that never reach the score threshold.
    """

    def __init__(
        self,
        client: Client,
        max_iterations: int = 5,
        max_total_calls: int = 30,
        max_elapsed_seconds: float = 120.0,
        plateau_threshold: float = 0.02,  # stop if scores move < 0.02 across the window
        plateau_window: int = 3,
    ):
        self._client = client
        self.max_iterations = max_iterations
        self.max_total_calls = max_total_calls
        self.max_elapsed = max_elapsed_seconds
        self.plateau_threshold = plateau_threshold
        self.plateau_window = plateau_window

    def should_retry(
        self,
        state: IterationState,
        score: float,
        score_threshold: float,
        llm_calls_this_iteration: int = 1,
    ) -> tuple[bool, str]:
        """
        Returns (should_retry: bool, reason: str).
        Call after each agent run + evaluator scoring step.
        """
        state.scores.append(score)
        state.iteration += 1
        state.total_llm_calls += llm_calls_this_iteration

        # Score passed threshold
        if score >= score_threshold:
            return False, f"score {score:.2f} >= threshold {score_threshold:.2f}"

        # Hard limits
        if state.iteration >= self.max_iterations:
            self._post_trip_feedback(state, f"iteration_limit ({self.max_iterations})")
            return False, f"iteration_limit: {state.iteration} iterations reached"
        if state.total_llm_calls >= self.max_total_calls:
            self._post_trip_feedback(state, f"call_limit ({self.max_total_calls})")
            return False, f"call_limit: {state.total_llm_calls} LLM calls"
        elapsed = time.monotonic() - state.start_time
        if elapsed >= self.max_elapsed:
            self._post_trip_feedback(state, f"time_limit ({self.max_elapsed:.0f}s)")
            return False, f"time_limit: {elapsed:.1f}s elapsed"

        # Plateau detection: no meaningful movement in the last N scores
        if len(state.scores) >= self.plateau_window:
            recent = state.scores[-self.plateau_window:]
            improvement = max(recent) - min(recent)
            if improvement < self.plateau_threshold:
                self._post_trip_feedback(
                    state, f"plateau (Δ={improvement:.3f} over {self.plateau_window} iters)"
                )
                return False, f"plateau: score improvement {improvement:.3f} < {self.plateau_threshold}"

        return True, f"retry (iter={state.iteration}, score={score:.2f})"

    def _post_trip_feedback(self, state: IterationState, reason: str) -> None:
        try:
            self._client.create_feedback(
                run_id=state.root_run_id,
                key="runguard_feedback_loop_trip",
                score=0,
                comment=f"FeedbackLoopGuard tripped: {reason}. "
                        f"iterations={state.iteration}, calls={state.total_llm_calls}, "
                        f"scores={[f'{s:.2f}' for s in state.scores]}",
            )
        except Exception:
            pass  # feedback posting is best-effort; never block the return path
# Usage (my_agent, evaluate_output, and get_critique are application helpers)
import langsmith
from langsmith import traceable

ls_client = Client()
loop_guard = FeedbackLoopGuard(
    ls_client,
    max_iterations=5,
    max_total_calls=20,
    max_elapsed_seconds=90.0,
    plateau_threshold=0.03,
    plateau_window=3,
)


@traceable(name="handle_with_feedback_loop")
def handle_with_feedback_loop(task: str) -> dict:
    root_run_id = str(uuid.uuid4())
    state = IterationState(root_run_id=root_run_id)
    critique = ""
    while True:
        # Agent run: the first attempt reuses root_run_id so trip feedback attaches to it
        with langsmith.trace(
            name="agent_attempt",
            run_id=root_run_id if state.iteration == 0 else None,
        ) as run:
            output = my_agent(task, prior_critique=critique)
            agent_calls = (run.extra or {}).get("llm_call_count", 1)  # instrument in my_agent

        # Evaluator run
        score = evaluate_output(output, task)  # 1 LLM call
        should_retry, reason = loop_guard.should_retry(
            state,
            score=score,
            score_threshold=0.75,
            llm_calls_this_iteration=agent_calls + 1,
        )
        if not should_retry:
            return {"output": output, "score": score, "reason": reason,
                    "iterations": state.iteration}
        critique = f"Previous score: {score:.2f}. Improve by: {get_critique(output, task)}"
Plateau detection is the most practically useful ceiling here. Score-gated loops that do terminate eventually often plateau: the agent reaches its capability ceiling and produces scores of 0.52, 0.54, 0.53, 0.54 across four iterations. Without plateau detection, you wait for max_iterations to expire even though the fifth iteration will not improve on the fourth. With a 3-iteration plateau window and a 0.03 threshold, the guard trips after the third score (the spread is 0.02), so with max_iterations=5 you save the remaining two iterations every time the agent hits its ceiling, which on difficult tasks is most of the time.
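To see where the spread check trips on a given score sequence, the plateau rule can be isolated from the rest of the guard. A minimal sketch of the same max-minus-min test; `plateau_trip_iteration` is a hypothetical helper name.

```python
from typing import Optional, Sequence


def plateau_trip_iteration(scores: Sequence[float], window: int = 3,
                           threshold: float = 0.03) -> Optional[int]:
    """Return the 1-based iteration at which the spread check first trips,
    or None if the scores keep moving by at least `threshold`."""
    for i in range(window, len(scores) + 1):
        recent = scores[i - window:i]
        if max(recent) - min(recent) < threshold:
            return i
    return None


flat = plateau_trip_iteration([0.52, 0.54, 0.53, 0.54])
climbing = plateau_trip_iteration([0.30, 0.40, 0.55, 0.70])
print(flat, climbing)
```

Note that the check measures spread, not direction: a sequence that oscillates inside a narrow band trips it just as a flat one does.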
Putting It Together: A Guarded LangSmith Agent
The four guards address independent resource axes — recursion depth and thread pool pressure, evaluation concurrency, hub fetch rate, and feedback iteration count — and compose without shared state:
import uuid

import langsmith
from langsmith import Client, traceable
from langsmith.run_helpers import get_current_run_tree

ls_client = Client()

# Guard instances (module-level singletons)
trace_guard = RecursiveTraceGuard(ls_client, max_depth=6, max_calls=25)
loop_guard = FeedbackLoopGuard(ls_client, max_iterations=5, max_total_calls=20)
agent_prompt = CachedHubPrompt("my-org/agent-v3", ttl_seconds=300)
judge_prompt = CachedHubPrompt("my-org/judge-v2", ttl_seconds=600)


@guarded_traceable(trace_guard, name="agent_turn")
def agent_turn(messages: list[dict]) -> str:
    prompt = agent_prompt.get()  # cache hit; no hub network I/O
    response = llm_call(
        system=prompt.format_messages()[0].content,
        messages=messages,
    )
    tool_calls = response.tool_calls or []
    if tool_calls:
        results = [run_tool(tc) for tc in tool_calls]
        messages = messages + tool_results_to_messages(tool_calls, results)
        return agent_turn(messages)  # recursive, guarded by RecursiveTraceGuard
    return response.content


@traceable(name="handle_request")
def handle_request(task: str, score_threshold: float = 0.75) -> dict:
    # Use the run opened by @traceable as the root; fall back to a fresh id
    # when tracing is disabled and no run is active.
    run = get_current_run_tree()
    state = IterationState(root_run_id=str(run.id) if run else str(uuid.uuid4()))
    critique = ""
    while True:
        try:
            output = agent_turn([{"role": "user", "content": task + critique}])
        except AgentBudgetError as e:
            return {"output": f"[guarded] {e}", "score": 0.0, "iterations": state.iteration}
        j_prompt = judge_prompt.get()
        score = score_with_llm(j_prompt, output, task)
        should_retry, reason = loop_guard.should_retry(
            state, score=score, score_threshold=score_threshold,
            # Calls made in THIS iteration only: 1 judge call + at least 1 agent
            # call. Instrument agent_turn if you need an exact count.
            llm_calls_this_iteration=2,
        )
        if not should_retry:
            return {"output": output, "score": score,
                    "iterations": state.iteration, "reason": reason}
        critique = f"\n\n[Prior attempt scored {score:.2f}. Improve specifically: {reason}]"


# Guarded evaluation (separate from online serving)
def run_eval(dataset_name: str, num_examples: int) -> None:
    spec = EvalSpec(
        dataset_name=dataset_name,
        num_examples=num_examples,
        num_repetitions=2,
        num_evaluators=2,
        judge_model="gpt-4.1",
        target_model="gpt-4o-mini",
        target_calls_per_example=3.1,  # from pilot run
    )
    results = guarded_evaluate(
        target=lambda inp: agent_turn([{"role": "user", "content": inp["question"]}]),
        spec=spec,
        evaluators=[correctness_eval, faithfulness_eval],
        max_calls=400,
        max_cost_usd=20.0,
        safe_concurrency=4,
        experiment_prefix="guarded-run",
    )
In this composition, agent_turn is guarded against recursive depth by RecursiveTraceGuard. The outer handle_request loop is guarded against feedback iteration by FeedbackLoopGuard. Hub fetches are cached at module level by CachedHubPrompt. Dataset evaluations are pre-flight checked and concurrency-capped by guarded_evaluate(). Each guard is independent and can be disabled or tuned without touching the others.
What RunGuard Adds to a LangSmith-Instrumented Stack
| Layer | LangSmith responsibility | RunGuard responsibility |
|---|---|---|
| Tracing | Capture every run as a linked tree with timing, tokens, cost | — |
| Recursion depth | Display nested run hierarchy in trace viewer | Trip agent when depth exceeds ceiling; tag trip in run metadata |
| Eval concurrency | Run target + evaluators in a thread pool; report results | Pre-compute call estimate; sample dataset; cap concurrency |
| Hub prompts | Serve latest prompt version via REST API | Cache with minimum TTL; warn on high fetch rate |
| Feedback loops | Store feedback scores; surface in trace UI | Enforce iteration + call + time + plateau ceilings |
| Cost ceiling | Report token cost after the fact from provider usage | Enforce ceiling before charges accrue |
LangSmith's documentation describes the platform as an observability and evaluation tool. It is the right framing — LangSmith records what your agents did and how they performed. The policy layer — "this agent must stop before spending $20", "this feedback loop must not run more than 5 times" — belongs in a separate runtime guard that can stop execution before the charges land.
Calibrating thresholds: Run your agent on 10–20 representative requests and read the LangSmith traces to find your actual 90th-percentile recursion depth, LLM call count per request, hub fetch count per minute, and score distribution from your evaluator. Set guard ceilings at 1.5× those baselines — tight enough to catch runaway behavior, loose enough not to trip on legitimate heavy requests. Guards calibrated to the wrong baseline either never trip (useless) or trip on normal traffic (disruptive).
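One way to turn those baseline observations into a ceiling is a nearest-rank percentile followed by the 1.5× headroom. This is a sketch: `ceiling_from_baseline` is a hypothetical helper, and the sample values are made up for illustration rather than taken from real traces.

```python
import math
from typing import Sequence


def ceiling_from_baseline(samples: Sequence[float], percentile: float = 90.0,
                          headroom: float = 1.5) -> int:
    """Nearest-rank percentile of the observations, scaled by headroom and rounded up."""
    if not samples:
        raise ValueError("need at least one observation")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))  # 1-based nearest rank
    return math.ceil(ordered[rank - 1] * headroom)


# Illustrative per-request observations read off 10 traces
observed_depths = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]
observed_llm_calls = [4, 5, 5, 6, 7, 8, 8, 9, 10, 22]

max_depth = ceiling_from_baseline(observed_depths)
max_calls = ceiling_from_baseline(observed_llm_calls)
print(max_depth, max_calls)
```

With only 10 to 20 samples the 90th percentile is coarse, so re-run the calculation as more production traces accumulate.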
Failure Modes This Post Does Not Cover
Four patterns were enough for one post. LangSmith's architecture has others:
- LangChain callback waterfall: when LANGCHAIN_TRACING_V2=true is set globally, every LangChain component fires callbacks on every call, including deeply nested chains. A chain-of-chains with 5 levels and 3 LLM calls per level fires 15 × (on_chain_start + on_llm_start + on_llm_end + on_chain_end) = 60 callback events per request to the tracing backend, which is 60 HTTP calls before your response is returned.
- Structured output retry fan-out: LangChain's .with_structured_output() retries on schema validation failures. In an eval run with a judge model that frequently fails to produce valid JSON, every row fires 2–5 retries. Each retry is billed and traced, multiplying your eval cost by the retry factor.
- Multi-agent shared project namespace: when multiple agents write traces to the same LangSmith project, the project-level rate limits apply to all of them collectively. A runaway agent in one pod can exhaust the trace ingestion quota for all agents in the project.
- Experiment proliferation: each evaluate() call creates a new experiment in LangSmith. Teams iterating rapidly on prompts or agents can accumulate hundreds of experiments, each with hundreds of runs. LangSmith's storage is not free on paid plans; experiment cleanup hygiene is a real cost control lever.
Common questions
Does RecursiveTraceGuard interfere with LangSmith's run tree linkage?
No. The guard uses its own ContextVar to count depth — it does not touch LangSmith's _PARENT_RUN_TREE context variable. The @traceable decorator still runs normally and the child run is registered as a child of the current parent. The guard raises after the run is created but before the LLM call fires, so LangSmith records the run as an incomplete execution with no output tokens. This is the correct trace representation of a tripped request, and it's filterable by the runguard_trip tag posted via update_run().
Can I use guarded_evaluate() with the older client.run_on_dataset() API?
Yes, with minor changes. Replace the langsmith.evaluate() call with client.run_on_dataset() and pass concurrency_level=safe_concurrency instead of max_concurrency. The pre-flight cost estimate and dataset sampling logic are identical — only the inner call changes. The newer langsmith.evaluate() API is recommended for new code since it returns a typed ExperimentResults object and supports more evaluator patterns.
Does hub.pull() cache the compiled prompt template or the raw definition?
hub.pull() returns a LangChain Runnable object — typically a ChatPromptTemplate or PromptTemplate. CachedHubPrompt.get() caches this object. You call prompt.format_messages(variable=value) on the cached object at invocation time; variable substitution happens locally with no network call. The Hub is only consulted when the TTL expires and a refresh is needed.
What plateau_threshold value should I use in FeedbackLoopGuard?
Measure the score variance your evaluator produces on identical inputs. If your LLM judge is non-deterministic (temperature > 0), it will produce scores that vary by ±0.05–0.10 on the same response. Set plateau_threshold to at least that variance level — otherwise the guard will declare a plateau when the scores are actually just noisy. A 0.02–0.05 threshold works well for calibrated LLM judges with temperature 0; raise to 0.08–0.10 for judges with higher temperature or free-form scoring rubrics.
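Measuring that variance takes a few repeated judge calls on one fixed response. In the sketch below, `score_fn` stands in for your real judge call, the cycling fake scores are invented so the example runs offline, and both helper names are hypothetical.

```python
import itertools
import statistics
from typing import Callable


def judge_noise(score_fn: Callable[[str], float], response: str,
                trials: int = 10) -> dict:
    """Score the same response repeatedly and summarize how much the judge wobbles."""
    scores = [score_fn(response) for _ in range(trials)]
    return {
        "spread": max(scores) - min(scores),
        "stdev": statistics.pstdev(scores),
    }


def suggested_plateau_threshold(spread: float, floor: float = 0.02) -> float:
    """Use the observed spread as the threshold, but never go below the floor."""
    return max(floor, spread)


# Fake non-deterministic judge for illustration only
_fake_scores = itertools.cycle([0.50, 0.56, 0.53, 0.58, 0.51])
noise = judge_noise(lambda _response: next(_fake_scores), "same answer")
threshold = suggested_plateau_threshold(noise["spread"])
print(noise, threshold)
```

Each trial is a billed judge call, so a handful of trials on two or three representative responses is usually enough to pick a threshold.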
How does LangSmith's own cost tracking compare to these guards?
LangSmith surfaces token costs per run in the trace viewer and supports cost aggregation across experiments. This is accurate and useful for post-hoc analysis — "last week's eval runs cost $340 in GPT-4 calls". What it does not do is stop a run in progress when costs exceed a threshold. The guards here enforce a ceiling before charges accrue, not after. Both are necessary: LangSmith gives you the audit trail; the guards give you the kill switch.