Hugging Face Inference Endpoints AI Agent Cost Control: Cold Start Cascade, Model Loading Context Overhead, Batch Inference Fan-Out, and Endpoint Autoscaling Overshoot

Hugging Face Dedicated Inference Endpoints let teams self-host any model from the Hub on managed GPU infrastructure (an Instruct-tuned LLM for text generation, a BAAI/bge embedding model for RAG pipelines, a BERT variant for classification, a Whisper checkpoint for transcription) without managing the underlying instances. The billing model is straightforward: you pay per second the endpoint is running, whether or not requests are in flight. A t4-medium GPU endpoint at $0.60/hour costs $0.60 for every hour it runs, whether it handles 10,000 requests in that hour or none.

AI agents interact with Inference Endpoints differently from interactive applications. A human user opens a UI, issues a handful of requests, and closes the tab. An agent processing a research corpus, generating embeddings for a RAG index, running a classification pass over a document set, or using an LLM-as-judge pattern for eval loops may issue hundreds or thousands of requests in rapid succession — and then go quiet for hours between sessions. This stop-and-start pattern, combined with the per-second billing model, creates four distinct cost amplification patterns that a human-in-the-loop workflow would rarely produce.

The four patterns compound each other. An agent that triggers a cold start (Pattern 1) generates health-check polling responses injected into its own context window (Pattern 2) while waiting for the endpoint to become ready — and then, once the endpoint is warm, sends unbatched embedding requests one at a time (Pattern 3) in a burst that triggers autoscaling (Pattern 4), spinning up replica-hours the agent will not use. Each pattern has a circuit breaker guard that enforces budget discipline at the agent layer without modifying the HF Endpoint configuration.

What this post covers: Four cost amplification patterns specific to AI agents using Hugging Face Dedicated Inference Endpoints: cold start cascade, model loading context overhead, batch inference fan-out, and endpoint autoscaling overshoot — and a runtime circuit breaker guard for each. The guards operate at the agent layer around HF API calls, giving you observable cost ceilings without modifying your endpoint configuration or billing tier.

Pattern 1: Cold Start Cascade

HF Dedicated Endpoints support a scale_to_zero option that shuts down all replicas after a configurable idle period — typically 15 minutes — and restarts them on the next incoming request. For teams running intermittent agent workloads, this feature pays for itself instantly: an endpoint that would otherwise idle at $0.60/hour between agent sessions costs $0 during the gaps. The catch is the cold start: bringing a Mistral-7B Instruct model from zero to ready on a T4 GPU takes 90–150 seconds; a LLaMA-3.1-70B on an A100 takes 180–270 seconds. During this window, the endpoint returns a 503 Loading response to every request.

The cascade occurs when an agent's retry logic and the endpoint's warmup window overlap in the wrong way. An agent that implements naive exponential backoff — retry after 5s, 10s, 20s — and sends its first retry at 5 seconds is still inside the 90–150 second warmup window. Every retry during the warmup period generates a second, third, or fourth 503 Loading response, each of which the agent may record as a failure in its own state, log into its context, or use to update its retry counter. The endpoint is not failing; it is warming up. But the agent's retry logic treats the warmup as a recoverable error and fires additional requests before the endpoint is ready.

The worst case is concurrent agent warmup amplification: a team running multiple agent threads (parallel document processing, a fan-out RAG pipeline, concurrent eval workers) where each thread independently detects the 503 and begins its own retry loop. Five concurrent agent threads, each retrying at 5-second intervals over a 120-second cold start, generate 5 × 24 retries = 120 warmup-period requests — all of which receive 503 Loading responses, all of which are logged, and some of which may race to trigger a second endpoint replica before the first is ready, starting a second cold start cycle in parallel.

Single-thread cold start overhead:
Model: Mistral-7B Instruct on T4 GPU, cold start = 120 seconds
Naive retry (5s, 10s, 20s, 40s intervals): 4 warmup-period requests before ready
Context overhead per 503 response: ~180 tokens JSON error × 4 retries = 720 tokens wasted

5-thread concurrent cascade:
5 threads × 24 retries at 5s intervals over 120s = 120 warmup-period requests
Cold start instance-minutes billed: 120s warmup = $0.00 (startup not billed separately)
BUT: second replica triggered by concurrent load at t=60s = up to +1 replica-hour (worst case: it stays provisioned for the rest of the hour)
T4 GPU replica-hour cost: $0.60, roughly 58 minutes of it idle after the warmup resolves
5 cold-start cascades/day × $0.60 phantom replica-hour each: $3.00/day
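
The retry counts above can be reproduced with a short calculation. This is a minimal sketch, not part of any HF SDK: retriesDuringWarmup is a hypothetical helper, and it assumes a retry fired exactly at the end of the warmup window still counts as a warmup-period request.

```typescript
// Counts how many retries land inside the warmup window for a backoff schedule.
// firstDelaySec: wait before the first retry.
// factor: multiplier applied to the delay after each retry (1 = fixed interval, 2 = doubling).
function retriesDuringWarmup(
  warmupSec: number,
  firstDelaySec: number,
  factor: number,
): number {
  if (firstDelaySec <= 0 || factor < 1) {
    throw new RangeError('firstDelaySec must be > 0 and factor must be >= 1');
  }
  let elapsed = 0;
  let delay = firstDelaySec;
  let retries = 0;
  while (elapsed + delay <= warmupSec) {
    elapsed += delay;
    retries++;
    delay *= factor;
  }
  return retries;
}

const TOKENS_PER_503 = 180; // per-response estimate taken from the figures above

const doubling = retriesDuringWarmup(120, 5, 2); // retries at 5s, 15s, 35s, 75s
const fixed = retriesDuringWarmup(120, 5, 1);    // one retry every 5s

console.log(doubling, doubling * TOKENS_PER_503); // 4 retries, 720 tokens
console.log(fixed, fixed * 5);                    // 24 retries per thread, 120 across 5 threads
```

Plugging in a longer first delay shows how quickly the count drops once the agent stops probing inside the warmup window.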

TypeScript — HFColdStartGuard

interface ColdStartEvent {
  triggeredAt: Date;
  warmupDurationMs: number;
  retriesDuringWarmup: number;
  resolvedAt: Date | null;
}

interface ColdStartState {
  sessionColdStarts: number;
  currentlyWarming: boolean;
  warmingStartedAt: Date | null;
  retriesDuringCurrentWarmup: number;
  coldStartHistory: ColdStartEvent[];
  totalWarmupTokensInjected: number;
}

class HFColdStartGuard {
  private state: ColdStartState = {
    sessionColdStarts: 0,
    currentlyWarming: false,
    warmingStartedAt: null,
    retriesDuringCurrentWarmup: 0,
    coldStartHistory: [],
    totalWarmupTokensInjected: 0,
  };

  constructor(
    private readonly maxSessionColdStarts: number = 3,
    private readonly maxRetriesPerWarmup: number = 4,
    private readonly warmupTimeoutMs: number = 300_000,  // 5 min hard stop
    private readonly estimatedTokensPerPoll: number = 180,
  ) {}

  onEndpointResponse(statusCode: number, body: string): void {
    // Match case-insensitively so both "Loading" and "loading" bodies are detected
    if (statusCode === 503 && body.toLowerCase().includes('loading')) {
      if (!this.state.currentlyWarming) {
        // First 503 in a new warmup cycle
        this.state.sessionColdStarts++;
        if (this.state.sessionColdStarts > this.maxSessionColdStarts) {
          throw new Error(
            `HFColdStartGuard: ${this.state.sessionColdStarts} cold starts in this ` +
            `session exceeds ceiling ${this.maxSessionColdStarts}. ` +
            `Endpoint is scaling to zero between agent iterations — set a longer ` +
            `idle timeout on the endpoint or pre-warm before the agent loop begins.`,
          );
        }
        this.state.currentlyWarming = true;
        this.state.warmingStartedAt = new Date();
        this.state.retriesDuringCurrentWarmup = 0;
      }

      this.state.retriesDuringCurrentWarmup++;
      this.state.totalWarmupTokensInjected += this.estimatedTokensPerPoll;

      if (this.state.retriesDuringCurrentWarmup > this.maxRetriesPerWarmup) {
        throw new Error(
          `HFColdStartGuard: ${this.state.retriesDuringCurrentWarmup} retries during ` +
          `current cold start exceeds ceiling ${this.maxRetriesPerWarmup}. ` +
          `Switch to a longer polling interval (≥30s) or poll via endpoint status API ` +
          `instead of inference requests to avoid injecting duplicate 503 bodies into context.`,
        );
      }

      const elapsed = Date.now() - this.state.warmingStartedAt!.getTime();
      if (elapsed > this.warmupTimeoutMs) {
        throw new Error(
          `HFColdStartGuard: cold start warmup exceeded ${this.warmupTimeoutMs / 1000}s ` +
          `without the endpoint becoming ready. Check HF endpoint health in the console — ` +
          `the model may have failed to load due to VRAM limits or a corrupt checkpoint.`,
        );
      }
    } else if (statusCode === 200) {
      if (this.state.currentlyWarming) {
        this.state.coldStartHistory.push({
          triggeredAt: this.state.warmingStartedAt!,
          warmupDurationMs: Date.now() - this.state.warmingStartedAt!.getTime(),
          retriesDuringWarmup: this.state.retriesDuringCurrentWarmup,
          resolvedAt: new Date(),
        });
        this.state.currentlyWarming = false;
        this.state.warmingStartedAt = null;
        this.state.retriesDuringCurrentWarmup = 0;
      }
    }
  }

  getWarmupTokensInjected(): number {
    return this.state.totalWarmupTokensInjected;
  }

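  getStateSummary(): ColdStartState {
    // Copy the history array as well, so callers cannot mutate guard state
    return { ...this.state, coldStartHistory: [...this.state.coldStartHistory] };
  }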
}
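
The guard's error message recommends a longer polling interval (≥30s). One way to get there is a retry schedule that sits out the minimum expected warmup before probing at all. The sketch below is illustrative: warmupAwareDelays is a hypothetical helper, and the 90-second figure is the lower bound of the T4 warmup range quoted earlier, not a value reported by the endpoint.

```typescript
// Builds a list of retry delays (in ms): one long initial wait covering the
// minimum expected warmup, followed by fixed-interval polls.
function warmupAwareDelays(
  minExpectedWarmupMs: number,
  pollIntervalMs: number,
  maxRetries: number,
): number[] {
  if (maxRetries <= 0) return [];
  const delays: number[] = [minExpectedWarmupMs];
  while (delays.length < maxRetries) {
    delays.push(pollIntervalMs);
  }
  return delays;
}

// Retries fire at 90s, 120s, 150s, and 180s after the first 503.
const schedule = warmupAwareDelays(90_000, 30_000, 4);
console.log(schedule);
```

For a warmup in the 90–150 second range, the endpoint is ready by the third retry, so only a few 503 responses are ever recorded.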

Pattern 2: Model Loading Context Overhead

During a cold start, a well-designed agent does not hammer the endpoint with inference requests — it polls the endpoint's status API and waits for the running state. The HF Endpoint status API returns a JSON response including the endpoint name, status, replica count, model repository, framework details, and current loading progress. A typical status response is 280–400 tokens of JSON. An agent polling every 10 seconds over a 150-second cold start generates 15 status poll responses — 15 × 350 tokens = 5,250 tokens of metadata injected into its context window before the first inference call executes.

The compounding factor is session-level accumulation across multiple cold starts. An agent session that encounters three cold starts (the endpoint scales to zero between document processing batches because the agent's idle period between batches exceeds the 15-minute scale-to-zero threshold) injects 3 × 5,250 = 15,750 tokens of loading status JSON into its context. For a session processing a 200-document corpus, that is roughly 8–12% of the context budget (15,750 tokens against a 128K to 200K token window) consumed by metadata about the inference infrastructure, not by the documents being processed.

The subtler problem is duplicate status response injection. If the agent's polling loop uses the same conversation message slot to report waiting status, each poll overwrites the previous one and context usage stays flat. If the agent appends each status update as a new message or logs each poll response as a tool call result, the context accumulates all prior status responses even after the endpoint is ready and the information is no longer relevant. Models used as eval judges or research agents that track their own tool call history are particularly prone to this pattern — their system design requires full tool call histories for audit trails, which means every status poll result is permanent context.
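
The difference between overwriting a status slot and appending every poll can be shown in a few lines. This is a sketch under assumptions: the AgentMessage shape and the slot tag are hypothetical and do not correspond to any particular agent framework.

```typescript
interface AgentMessage {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
  slot?: string; // hypothetical tag marking a message that may be overwritten in place
}

const STATUS_SLOT = 'hf-endpoint-status';

// Replaces the existing status message if one exists, otherwise appends it.
// Returns a new array and leaves the input untouched.
function upsertStatusMessage(
  messages: AgentMessage[],
  endpointState: string,
): AgentMessage[] {
  const next: AgentMessage = {
    role: 'tool',
    content: `HF endpoint status: ${endpointState}`,
    slot: STATUS_SLOT,
  };
  const index = messages.findIndex((m) => m.slot === STATUS_SLOT);
  if (index === -1) return [...messages, next];
  return messages.map((m, i) => (i === index ? next : m));
}

let transcript: AgentMessage[] = [{ role: 'user', content: 'Embed the corpus.' }];
for (const endpointState of ['pending', 'initializing', 'initializing', 'running']) {
  transcript = upsertStatusMessage(transcript, endpointState);
}

// Four polls, but only one status message remains in the transcript.
console.log(transcript.length);
```

An agent that needs an audit trail can still write every poll to an external log while keeping only the latest status in the model-visible context.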

Single cold start — polling overhead:
Status poll interval: 10 seconds; Cold start duration: 150 seconds
Status polls during warmup: 15 polls
Tokens per poll response (JSON with model metadata): ~350 tokens
Total context injected during one cold start: 5,250 tokens
At Sonnet 4.6 input pricing ($3/M tokens): $0.016 per cold start overhead

Agent session with 3 cold starts + appended poll history:
15 + 15 + 15 = 45 status responses permanently in context
Total overhead: 45 × 350 = 15,750 tokens (~$0.047 at Sonnet pricing)
Over 200 agent sessions/month: $9.40/month in warmup-metadata tokens
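
The figures above follow from a simple calculation, sketched here so you can substitute your own poll interval and token price. pollOverhead is a hypothetical helper; like the figures above, it counts each injected token once and does not model the context being re-sent on later turns.

```typescript
interface PollOverhead {
  polls: number;
  tokens: number;
  costUsd: number;
}

// Estimates how many status-poll tokens end up in context when every poll
// response is appended, and what they cost at a given input-token price.
function pollOverhead(
  coldStartSec: number,
  pollIntervalSec: number,
  tokensPerPoll: number,
  coldStarts: number,
  usdPerMillionTokens: number,
): PollOverhead {
  const pollsPerColdStart = Math.floor(coldStartSec / pollIntervalSec);
  const polls = pollsPerColdStart * coldStarts;
  const tokens = polls * tokensPerPoll;
  return { polls, tokens, costUsd: (tokens / 1_000_000) * usdPerMillionTokens };
}

const oneColdStart = pollOverhead(150, 10, 350, 1, 3);    // 15 polls, 5,250 tokens
const threeColdStarts = pollOverhead(150, 10, 350, 3, 3); // 45 polls, 15,750 tokens

console.log(oneColdStart, threeColdStarts);
```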

TypeScript — HFContextOverheadGuard

interface PollEvent {
  polledAt: Date;
  status: string;
  responseTokens: number;
  injectedIntoContext: boolean;
}

interface ContextOverheadState {
  sessionPollCount: number;
  sessionPollTokens: number;
  lastSeenStatus: string | null;
  pollHistory: PollEvent[];
  suppressedPollCount: number;
}

class HFContextOverheadGuard {
  private state: ContextOverheadState = {
    sessionPollCount: 0,
    sessionPollTokens: 0,  // tokens actually injected into context this session
    lastSeenStatus: null,
    pollHistory: [],
    suppressedPollCount: 0,
  };

  // Polls seen since the last onColdStartResolved() call
  private pollsThisColdStart = 0;

  constructor(
    private readonly maxContextPollTokens: number = 3_000,
    private readonly maxPollsPerColdStart: number = 8,
    private readonly suppressDuplicateStatus: boolean = true,
  ) {}

  shouldInjectPollResponse(
    status: string,
    estimatedTokens: number,
  ): { inject: boolean; reason: string } {
    this.state.sessionPollCount++;
    this.pollsThisColdStart++;

    if (this.pollsThisColdStart > this.maxPollsPerColdStart) {
      throw new Error(
        `HFContextOverheadGuard: ${this.pollsThisColdStart} status polls during the ` +
        `current cold start exceeds ceiling ${this.maxPollsPerColdStart}. ` +
        `Lengthen the polling interval before polling again.`,
      );
    }

    if (this.suppressDuplicateStatus && status === this.state.lastSeenStatus) {
      this.state.suppressedPollCount++;
      return {
        inject: false,
        reason: `Endpoint still reporting '${status}', no change since last poll. ` +
                `Suppressing duplicate status injection to avoid context accumulation.`,
      };
    }

    // Only injected responses count against the budget; suppressed polls cost no context
    if (this.state.sessionPollTokens + estimatedTokens > this.maxContextPollTokens) {
      this.state.suppressedPollCount++;
      return {
        inject: false,
        reason: `Poll token budget exhausted (${this.state.sessionPollTokens} of ` +
                `${this.maxContextPollTokens} tokens already injected). This status ` +
                `update is suppressed to avoid context overflow.`,
      };
    }

    this.state.sessionPollTokens += estimatedTokens;
    this.state.lastSeenStatus = status;
    this.state.pollHistory.push({
      polledAt: new Date(),
      status,
      responseTokens: estimatedTokens,
      injectedIntoContext: true,
    });

    return { inject: true, reason: 'status changed or first poll: inject summary' };
  }

  onColdStartResolved(): void {
    // Reset per-cold-start counters; session totals accumulate
    this.state.lastSeenStatus = null;
    this.pollsThisColdStart = 0;
  }

  getSuppressedCount(): number {
    return this.state.suppressedPollCount;
  }
}

Pattern 3: Batch Inference Fan-Out

HF Dedicated Endpoints support batched inference for embedding models and many text classification models: a single API call can submit an array of inputs and receive an array of outputs. The GPU processes them in parallel within the batch, with near-identical latency to a single-item request for small-to-medium batch sizes. A T4 GPU can embed 32 text chunks in the same ~50ms it takes to embed 1 chunk, making the throughput ratio 32× for the same instance-time.

AI agents building RAG indexes, generating embeddings for a document corpus, or running classification over a record set typically process items in a loop: fetch document, call embedding endpoint, store result, fetch next document. This one-at-a-time pattern is natural for synchronous agent code and completely correct in terms of output, but it wastes 96.9% of the available GPU throughput per call (1 of 32 batch slots used). An agent that embeds 10,000 document chunks individually performs 500 seconds (8.3 minutes) of actual GPU work spread across 900 seconds (15 minutes) of wall time, because each individual call consumes 50ms of GPU time plus 40ms of network round-trip and API overhead. The same 10,000 chunks in batches of 32 take 313 calls × 90ms each = 28 seconds of wall time and ~16 seconds of GPU work.

The billing impact depends on whether you are using the Serverless API or a Dedicated Endpoint. For the Serverless API (pay-per-token), each individual call pays the per-call minimum plus token fees; batching reduces calls proportionally. For Dedicated Endpoints (pay-per-instance-second), the difference is purely in how long the endpoint must stay running. An agent that sends 10,000 individual embedding calls keeps a $0.60/hour T4 endpoint busy for 15 minutes: $0.15. The same task in batches of 32 completes in about 28 seconds: under $0.005. The roughly 32× cost difference is entirely attributable to unbatched call dispatch, not to the inference work itself.

Embedding 10,000 chunks — individual calls vs. batched:
Individual: 10,000 calls × 90ms (50ms GPU + 40ms network) = 900 seconds wall time
Batched (size=32): 313 calls × 90ms = 28 seconds wall time
Throughput ratio: 32× faster with batching

Dedicated T4 endpoint ($0.60/hour = $0.000167/second):
Individual: 900s × $0.000167 = $0.15
Batched: 28s × $0.000167 = $0.0047 — 97% reduction

Agent session: 50 document sets × 200 chunks each = 10,000 chunks/session
Individual calls, 5 sessions/day: 5 × $0.15 = $0.75/day = $22.50/month
Batched, 5 sessions/day: 5 × $0.0047 = $0.024/day = $0.71/month
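
The comparison above can be reproduced with a small estimator. This is a sketch under the same assumptions as the figures: calls are sequential, and each call takes about 90ms regardless of batch size. estimateRun is a hypothetical helper, not an HF API.

```typescript
interface RunEstimate {
  calls: number;
  wallSeconds: number;
  costUsd: number;
}

// Estimates call count, wall time, and instance cost for embedding `items`
// inputs in sequential calls of `batchSize` inputs each.
function estimateRun(
  items: number,
  batchSize: number,
  msPerCall: number,
  usdPerSecond: number,
): RunEstimate {
  const calls = Math.ceil(items / batchSize);
  const wallSeconds = (calls * msPerCall) / 1000;
  return { calls, wallSeconds, costUsd: wallSeconds * usdPerSecond };
}

const T4_USD_PER_SECOND = 0.000167; // $0.60/hour, as in the figures above

const individualRun = estimateRun(10_000, 1, 90, T4_USD_PER_SECOND); // 10,000 calls, 900s
const batchedRun = estimateRun(10_000, 32, 90, T4_USD_PER_SECOND);   // 313 calls, ~28s

console.log(individualRun, batchedRun);
```

If your measured per-call latency grows with batch size, pass the measured value for msPerCall instead of reusing the single-item figure.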

TypeScript — HFBatchInferenceGuard

interface BatchInferenceState {
  sessionCallCount: number;
  sessionItemCount: number;
  unbatchedCallCount: number;
  totalBatchSizes: number[];
  estimatedWastedInstanceSeconds: number;
}

class HFBatchInferenceGuard {
  private state: BatchInferenceState = {
    sessionCallCount: 0,
    sessionItemCount: 0,
    unbatchedCallCount: 0,
    totalBatchSizes: [],
    estimatedWastedInstanceSeconds: 0,
  };

  constructor(
    private readonly minBatchSize: number = 8,
    private readonly recommendedBatchSize: number = 32,
    private readonly maxUnbatchedCalls: number = 10,
    private readonly estimatedMsPerCall: number = 90,
    private readonly instanceCostPerSecond: number = 0.000167,  // T4 default
  ) {}

  onInferenceCall(batchSize: number): void {
    this.state.sessionCallCount++;
    this.state.sessionItemCount += batchSize;
    this.state.totalBatchSizes.push(batchSize);

    if (batchSize < this.minBatchSize) {
      this.state.unbatchedCallCount++;

      // Estimate wasted time: this call processes `batchSize` items in ~estimatedMsPerCall ms
      // A full batch of recommendedBatchSize items would take the same time — wasted capacity:
      const capacityUtilization = batchSize / this.recommendedBatchSize;
      const wastedMs = this.estimatedMsPerCall * (1 - capacityUtilization);
      this.state.estimatedWastedInstanceSeconds += wastedMs / 1000;

      if (this.state.unbatchedCallCount > this.maxUnbatchedCalls) {
        const wastedCost = (
          this.state.estimatedWastedInstanceSeconds * this.instanceCostPerSecond
        ).toFixed(4);
        throw new Error(
          `HFBatchInferenceGuard: ${this.state.unbatchedCallCount} unbatched calls ` +
          `(batch_size < ${this.minBatchSize}) in this session. Estimated wasted ` +
          `instance-seconds: ${this.state.estimatedWastedInstanceSeconds.toFixed(1)}s ` +
          `($${wastedCost} at the configured instance rate). Collect items into batches of ` +
          `${this.recommendedBatchSize} before calling the endpoint. ` +
          `Items processed so far: ${this.state.sessionItemCount}; ` +
          `rewrite the agent loop to split items into slices of ${this.recommendedBatchSize} ` +
          `and call the endpoint once per slice.`,
        );
      }
    }
  }

  getAverageBatchSize(): number {
    if (this.state.totalBatchSizes.length === 0) return 0;
    return this.state.sessionItemCount / this.state.sessionCallCount;
  }

  getEfficiencyRatio(): number {
    return this.getAverageBatchSize() / this.recommendedBatchSize;
  }
}
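
JavaScript arrays have no built-in chunking method, so the agent loop needs a small helper to form batches before calling the endpoint. A minimal sketch follows; chunk is a hypothetical helper name.

```typescript
// Splits `items` into consecutive batches of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 100 placeholder inputs become 4 endpoint calls instead of 100.
const inputs = Array.from({ length: 100 }, (_, i) => `chunk-${i}`);
const batches = chunk(inputs, 32);

console.log(batches.map((b) => b.length)); // batch sizes: 32, 32, 32, 4
```

The final batch is usually smaller than the minimum batch size, so the guard counts it as one unbatched call. That is well inside the default ceiling of 10.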

Pattern 4: Endpoint Autoscaling Overshoot

HF Dedicated Endpoints support autoscaling: you configure a minimum replica count (commonly 0 for scale-to-zero or 1 for always-on) and a maximum replica count. When incoming request volume exceeds the capacity of current replicas, the endpoint automatically provisions additional replicas. Each new replica starts a fresh instance with the full model loaded into GPU VRAM — another cold start, another 90–270 second warmup period, and crucially, another full billing unit that starts at the moment the instance is provisioned regardless of when the first request is processed.

AI agents create burst traffic patterns that trigger autoscaling disproportionate to their actual workload duration. An agent that begins a new RAG indexing task by embedding 500 documents simultaneously (using concurrent async requests, a worker pool, or a parallel fan-out architecture) can saturate a single replica and trigger the provisioning of 4 additional replicas in the first 30 seconds of the task. If the task completes in 3 minutes, because the burst was the entire task, all 5 replicas then sit idle for the remainder of their billing period. HF bills in one-second increments with a minimum billing window per replica determined by the instance type's underlying cloud provider (typically 1–10 minutes for GPU instances on AWS and Azure infrastructure). The worked figures below use a more conservative 60-minute window per extra replica as a worst-case ceiling, the same value as the guard's minBillingMinutes default.

The scaling decision is made by HF's infrastructure based on queue depth, not by the agent. The agent has no direct signal that a new replica is being provisioned. By the time the agent finishes its burst task, the extra replicas are already running and billing. The maximum overshoot scenario is a team running 10 agents in parallel (a batch eval pipeline, a nightly data enrichment job, a research agent fleet), each sending a burst of requests to the same endpoint: 10 × 500 concurrent requests drive the endpoint to its configured maximum replica count, and when all 10 agents complete simultaneously, every one of those replicas sits idle for the minimum billing window.

Single agent burst task triggering autoscaling:
Agent sends 500 concurrent embedding requests in first 30 seconds
Single T4 replica throughput: ~100 requests/minute → queue depth triggers scale-up
Replicas provisioned: 1 (initial) + 4 (autoscaled) = 5 active replicas
Task completion time: 3 minutes (burst fully processed)
Idle time per additional replica: 57 minutes (up to 1-hour billing boundary)
T4 cost per replica: $0.60/hour
Overshoot cost: 4 replicas × 57 min idle = 228 replica-minutes = $2.28
Useful work cost: 5 replicas × 3 min active = $0.15
Overhead ratio: 15.2× cost multiplier from autoscaling overshoot

Nightly eval pipeline (10 parallel agents, A100 endpoint at $3.20/hour):
10 agents each triggering 4 extra replicas = 40 extra replica-hours billed (assumes the configured maximum replica count is high enough to allow all 40)
Per nightly run: 40 × $3.20 × (57/60) = $121.60 in idle replica-hours
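
The overshoot arithmetic above can be parameterized so you can test other billing windows. This sketch mirrors the figures: idle cost is counted only for the extra replicas, and useful cost covers all replicas for the active period. overshoot is a hypothetical helper, and the 60-minute window is the worst-case assumption used in the figures, not a confirmed HF billing rule.

```typescript
interface OvershootEstimate {
  idleCostUsd: number;
  usefulCostUsd: number;
  ratio: number;
}

// extraReplicas: replicas added by autoscaling on top of the initial one.
// billedMinutes: how long each extra replica is assumed to stay billed.
function overshoot(
  extraReplicas: number,
  activeMinutes: number,
  billedMinutes: number,
  usdPerHour: number,
): OvershootEstimate {
  const usdPerMinute = usdPerHour / 60;
  const idleMinutes = Math.max(0, billedMinutes - activeMinutes);
  const idleCostUsd = extraReplicas * idleMinutes * usdPerMinute;
  const usefulCostUsd = (extraReplicas + 1) * activeMinutes * usdPerMinute;
  return { idleCostUsd, usefulCostUsd, ratio: idleCostUsd / usefulCostUsd };
}

const t4Burst = overshoot(4, 3, 60, 0.60);     // about $2.28 idle vs $0.15 useful
const a100Nightly = overshoot(40, 3, 60, 3.20); // about $121.60 idle

console.log(t4Burst, a100Nightly.idleCostUsd);
```

Setting billedMinutes to 10 instead of 60 shows how strongly the result depends on how long idle replicas stay provisioned.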

TypeScript — HFAutoscalingGuard

interface ReplicaEvent {
  eventType: 'scale_up' | 'scale_down' | 'request_burst';
  replicaCount: number;
  timestamp: Date;
  estimatedIdleMinutes?: number;
}

interface AutoscalingState {
  currentEstimatedReplicas: number;
  peakReplicasThisSession: number;
  concurrentRequestsInFlight: number;
  maxConcurrentObserved: number;
  replicaEvents: ReplicaEvent[];
  estimatedCommittedCostUsd: number;
}

class HFAutoscalingGuard {
  private state: AutoscalingState = {
    currentEstimatedReplicas: 1,
    peakReplicasThisSession: 1,
    concurrentRequestsInFlight: 0,
    maxConcurrentObserved: 0,
    replicaEvents: [],
    estimatedCommittedCostUsd: 0,
  };

  constructor(
    private readonly maxConcurrentRequests: number = 20,
    private readonly requestsPerReplica: number = 25,
    private readonly instanceCostPerHour: number = 0.60,  // T4 default
    private readonly minBillingMinutes: number = 60,
    private readonly maxCommittedCostUsd: number = 5.00,
  ) {}

  onRequestStart(): void {
    // Evaluate the request before counting it, so a rejected request does not
    // leave the in-flight counter permanently incremented.
    const inFlight = this.state.concurrentRequestsInFlight + 1;

    const estimatedReplicas = Math.max(
      1,
      Math.ceil(inFlight / this.requestsPerReplica),
    );

    const scalesUp = estimatedReplicas > this.state.currentEstimatedReplicas;
    let committedCost = this.state.estimatedCommittedCostUsd;

    if (scalesUp) {
      // A new replica would be triggered: estimate the committed cost
      const newReplicas = estimatedReplicas - this.state.currentEstimatedReplicas;
      committedCost += newReplicas *
        (this.minBillingMinutes / 60) * this.instanceCostPerHour;

      if (committedCost > this.maxCommittedCostUsd) {
        throw new Error(
          `HFAutoscalingGuard: estimated committed instance cost ` +
          `$${committedCost.toFixed(2)} exceeds ceiling ` +
          `$${this.maxCommittedCostUsd}. Current concurrent requests: ` +
          `${inFlight}, estimated replicas: ` +
          `${estimatedReplicas} (each committed for ${this.minBillingMinutes}min ` +
          `at $${this.instanceCostPerHour}/hour). Reduce request concurrency ` +
          `to ≤${this.maxConcurrentRequests} or batch requests to increase ` +
          `per-replica utilization before scaling.`,
        );
      }
    }

    if (inFlight > this.maxConcurrentRequests) {
      throw new Error(
        `HFAutoscalingGuard: ${inFlight} concurrent ` +
        `requests in flight exceeds ceiling ${this.maxConcurrentRequests}. ` +
        `This burst volume will trigger autoscaling to ${estimatedReplicas} replicas, ` +
        `committing ${this.minBillingMinutes}-minute billing windows for each. ` +
        `Use a request queue with a fixed concurrency limit to flatten the burst.`,
      );
    }

    // Both ceilings passed: commit the new state
    this.state.concurrentRequestsInFlight = inFlight;
    this.state.maxConcurrentObserved = Math.max(
      this.state.maxConcurrentObserved,
      inFlight,
    );

    if (scalesUp) {
      this.state.estimatedCommittedCostUsd = committedCost;
      this.state.currentEstimatedReplicas = estimatedReplicas;
      this.state.peakReplicasThisSession = Math.max(
        this.state.peakReplicasThisSession,
        estimatedReplicas,
      );
      this.state.replicaEvents.push({
        eventType: 'scale_up',
        replicaCount: estimatedReplicas,
        timestamp: new Date(),
        estimatedIdleMinutes: this.minBillingMinutes,
      });
    }
  }

  onRequestComplete(): void {
    this.state.concurrentRequestsInFlight = Math.max(
      0,
      this.state.concurrentRequestsInFlight - 1,
    );
  }

  getCommittedCostEstimate(): number {
    return this.state.estimatedCommittedCostUsd;
  }
}

// Usage: wrap all HF endpoint calls with the autoscaling guard.
// HFEndpoint is an application-defined client type (not shown here); embed(batch)
// is assumed to resolve to one embedding vector per input string.
async function runEmbeddingBatch(
  chunks: string[],
  endpoint: HFEndpoint,
  autoscalingGuard: HFAutoscalingGuard,
  batchGuard: HFBatchInferenceGuard,
): Promise<number[][]> {
  const BATCH_SIZE = 32;
  const CONCURRENCY = 4;  // Well below maxConcurrentRequests=20
  const results: number[][] = [];

  for (let i = 0; i < chunks.length; i += BATCH_SIZE * CONCURRENCY) {
    const batchGroup: Promise<number[][]>[] = [];
    for (let j = 0; j < CONCURRENCY && i + j * BATCH_SIZE < chunks.length; j++) {
      const batch = chunks.slice(i + j * BATCH_SIZE, i + (j + 1) * BATCH_SIZE);
      batchGuard.onInferenceCall(batch.length);
      autoscalingGuard.onRequestStart();
      batchGroup.push(
        endpoint.embed(batch).finally(() => autoscalingGuard.onRequestComplete()),
      );
    }
    const batchResults = await Promise.all(batchGroup);
    results.push(...batchResults.flat());
  }

  return results;
}
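
To see how much batching plus a concurrency cap flattens a burst, it helps to compare the request plan before and after. The sketch below is a planning calculation only and makes no endpoint calls; planBurst is a hypothetical helper.

```typescript
interface BurstPlan {
  calls: number;        // total endpoint calls
  waves: number;        // sequential rounds of concurrent calls
  peakInFlight: number; // most calls in flight at any moment
}

function planBurst(
  totalItems: number,
  batchSize: number,
  concurrency: number,
): BurstPlan {
  const calls = Math.ceil(totalItems / batchSize);
  const waves = Math.ceil(calls / concurrency);
  const peakInFlight = Math.min(calls, concurrency);
  return { calls, waves, peakInFlight };
}

const fanOut = planBurst(500, 1, 500);  // 500 calls at once
const flattened = planBurst(500, 32, 4); // 16 calls in 4 waves, 4 in flight at most

console.log(fanOut, flattened);
```

With the batch size of 32 and concurrency of 4 used above, the same 500 documents never put more than 4 requests in flight, which stays under the guard's default ceiling of 20.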

Composing All Four Guards

The four patterns are not mutually exclusive; they occur in sequence. A typical agent session hits a cold start (Pattern 1), polls the status API while waiting (Pattern 2), then once the endpoint is warm, begins processing its corpus with unbatched calls (Pattern 3) in a burst that triggers autoscaling (Pattern 4). Composing the guards into a single callHFEndpointWithGuards() wrapper gives you a single call site where all four ceilings are enforced around each endpoint interaction.

TypeScript — composed inference session

interface HFEndpointSession {
  coldStartGuard: HFColdStartGuard;
  contextOverheadGuard: HFContextOverheadGuard;
  batchGuard: HFBatchInferenceGuard;
  autoscalingGuard: HFAutoscalingGuard;
}

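// Helpers assumed by the wrapper below. estimateTokens is a rough
// ~4-characters-per-token heuristic, not a real tokenizer.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

async function callHFEndpointWithGuards(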
  inputs: string[],
  endpoint: HFEndpoint,
  session: HFEndpointSession,
): Promise<number[][]> {
  // Pattern 3: enforce batching before dispatching
  session.batchGuard.onInferenceCall(inputs.length);

  // Pattern 4: check autoscaling impact before starting request
  session.autoscalingGuard.onRequestStart();

  try {
    let response = await endpoint.infer(inputs);

    // Pattern 1: detect cold start (503 Loading)
    while (response.status === 503 && response.body.includes('loading')) {
      session.coldStartGuard.onEndpointResponse(503, response.body);

      // Pattern 2: decide whether to inject the poll response into agent context
      const { inject, reason } = session.contextOverheadGuard.shouldInjectPollResponse(
        'loading',
        estimateTokens(response.body),
      );
      if (inject) {
        // Let the agent log this status update
        console.log(`[HF Endpoint] Still loading: ${reason}`);
      }

      await sleep(30_000);  // 30s minimum poll interval to avoid cascade
      response = await endpoint.infer(inputs);
    }

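    // Anything other than 200 at this point is a real failure, not a warmup response
    if (response.status !== 200) {
      throw new Error(`HF endpoint returned ${response.status}: ${response.body}`);
    }

    session.coldStartGuard.onEndpointResponse(200, response.body);
    session.contextOverheadGuard.onColdStartResolved();
    return response.embeddings;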
  } finally {
    session.autoscalingGuard.onRequestComplete();
  }
}

function buildDefaultSession(): HFEndpointSession {
  return {
    coldStartGuard: new HFColdStartGuard(
      3,        // maxSessionColdStarts
      4,        // maxRetriesPerWarmup
      300_000,  // warmupTimeoutMs (5 min)
    ),
    contextOverheadGuard: new HFContextOverheadGuard(
      3_000,  // maxContextPollTokens
      8,      // maxPollsPerColdStart
      true,   // suppressDuplicateStatus
    ),
    batchGuard: new HFBatchInferenceGuard(
      8,     // minBatchSize
      32,    // recommendedBatchSize
      10,    // maxUnbatchedCalls before trip
    ),
    autoscalingGuard: new HFAutoscalingGuard(
      20,    // maxConcurrentRequests
      25,    // requestsPerReplica (for scaling estimate)
      0.60,  // instanceCostPerHour (T4)
      60,    // minBillingMinutes
      5.00,  // maxCommittedCostUsd
    ),
  };
}

Frequently Asked Questions

Does the cold start cascade guard work with the HF Serverless Inference API, or only Dedicated Endpoints?

Only Dedicated Endpoints have the scale_to_zero cold start pattern that HFColdStartGuard is designed for. The Serverless Inference API runs on shared infrastructure that is usually already warm. You may see occasional 503 Model Loading responses when HF's shared pool is provisioning a new instance for an unpopular model, but these are rare and short (<30 seconds). For Serverless, the more relevant cost controls are rate limit detection (HTTP 429), per-token budget enforcement, and daily request count limits tracked through the HF API headers. The guard pattern is the same; the trigger condition is different.

What batch size should I use for embedding models on HF Dedicated Endpoints?

For most embedding models (BAAI/bge-large-en-v1.5, sentence-transformers variants, E5 models) on T4 GPU instances, a batch size of 32 hits the GPU saturation point — processing 32 items takes approximately the same time as processing 1 item. Beyond 32, you may see diminishing returns or OOM errors depending on the model's sequence length and the average token count in your documents. For A100 or A10G instances, you can often push to batch sizes of 64–128. Test with your specific model and instance type: run the same 1,000 chunks with batch sizes of 1, 8, 16, 32, 64, and measure wall-clock time — the throughput graph will plateau at your model's optimal batch size.
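
Once you have the wall-clock measurements, picking the plateau can be automated. The sketch below returns the smallest batch size whose wall time is within a tolerance of the best run. pickBatchSize is a hypothetical helper, and the sample timings are made-up illustration values, not benchmark results.

```typescript
interface BatchTiming {
  batchSize: number;
  wallMs: number; // wall-clock time to process the same fixed set of chunks
}

// Returns the smallest batch size whose wall time is within `tolerance`
// (0.05 = 5%) of the fastest measured run.
function pickBatchSize(timings: BatchTiming[], tolerance = 0.05): number {
  if (timings.length === 0) throw new Error('no timings provided');
  const best = Math.min(...timings.map((t) => t.wallMs));
  const candidates = timings
    .filter((t) => t.wallMs <= best * (1 + tolerance))
    .map((t) => t.batchSize);
  return Math.min(...candidates);
}

// Illustrative numbers only; replace with your own measurements.
const sampleTimings: BatchTiming[] = [
  { batchSize: 1, wallMs: 90_000 },
  { batchSize: 8, wallMs: 11_500 },
  { batchSize: 16, wallMs: 5_900 },
  { batchSize: 32, wallMs: 3_000 },
  { batchSize: 64, wallMs: 2_950 },
];

const chosenBatchSize = pickBatchSize(sampleTimings);
console.log(chosenBatchSize);
```

Preferring the smallest batch size on the plateau leaves more VRAM headroom for long inputs, which reduces the OOM risk mentioned above.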

Can I disable autoscaling to prevent overshoot costs?

Yes. Set max_replica: 1 in your endpoint configuration to fix the replica count. This prevents autoscaling overshoot entirely but means your endpoint can only handle the throughput of a single replica. For most agent workloads (a single agent or small agent fleet), one replica is sufficient and the fixed cost predictability is worth the throughput ceiling. The autoscaling overshoot pattern typically only manifests with parallel agent fleets sending burst traffic simultaneously. If you need autoscaling, the HFAutoscalingGuard approach (capping concurrent requests at the level that a single replica can handle smoothly) limits overshoot in a similar way without changing the endpoint config.

How does the context overhead guard decide what counts as a "duplicate" status response?

The guard compares the status field from the endpoint status API response — specifically the state field in the HF Endpoint status JSON, which progresses from pending to initializing to running. Two consecutive polls returning initializing are duplicates — the endpoint has not changed state, and the second poll result carries no new information for the agent. The guard suppresses the second injection. A transition from initializing to running is a state change — that gets injected. You can disable suppressDuplicateStatus if your agent explicitly needs full polling history for audit trails, at the cost of higher context accumulation.

Does RunGuard integrate directly with the HF Inference Endpoints API or do I need to build the guards manually?

The guards in this post are TypeScript class implementations you can drop into your agent code — they are not RunGuard SDK internals. The RunGuard SDK provides the guard() wrapper and LoopDetector, BudgetTracker, and ContextGuard primitives that handle the circuit breaker trip-and-halt behavior. The HF-specific guards above are application-layer implementations that sit on top of those primitives and enforce the four HF-specific cost patterns. You can build these yourself using the open-source guard pattern shown here, or use the RunGuard SDK to get the breaker-open exception, persistent trip state, and FORCE_RETRY bypass semantics out of the box.

Stop paying for agent loops before they bill you

RunGuard is a runtime circuit breaker SDK for TypeScript and Python agents. One guard() call wraps your agent's tool invocations and trips a breaker on the first sign of a loop, context overflow, or budget breach — before the fourth bill-of-the-month lands. Free 14-day trial, no card required.
