LLM batch API cost reduction: 50% off when your workload can wait
Anthropic, OpenAI, and Google Vertex AI all offer the same deal on their batch APIs: cut the per-token price in half in exchange for a completion window of up to 24 hours. For any LLM workload where the user is not waiting for a response in real time — nightly document processing, bulk classification, embedding generation, test suite evaluation, content moderation queues — batch pricing is the single highest-leverage cost reduction available, requiring zero changes to your prompts or models. The challenge is not the API itself (which is straightforward) but the operational question: how do you systematically identify which requests are batch-eligible, route them correctly, and reconcile batch-job costs against your daily budget? This page works through the math, the code, and the RunGuard integration that makes batch routing automatic.
The batch pricing model across all three major providers
- Anthropic Messages Batch API. The Anthropic Message Batches API accepts up to 100,000 message requests (or 256 MB, whichever limit is reached first) in a single batch, submitted as a list of requests in the request body rather than as an uploaded file. Every model on the platform, including Claude 3.5 Sonnet, Claude 3 Haiku, and Claude 3 Opus, is available at 50% of the standard per-token price. Input and output tokens are both discounted. Most batches finish well inside the window, often within an hour or a few hours; a batch that has not finished after 24 hours expires. Batches are submitted via `POST /v1/messages/batches`; results are retrieved by polling the batch status and then reading the results JSONL. Batch jobs that fail partway through return per-request success/error status in the results JSONL, so individual request failures do not require resubmitting the entire batch.
- OpenAI Batch API. OpenAI’s Batch API covers GPT-4o, GPT-4o mini, GPT-4 Turbo, and the text embedding models. The pricing discount is 50% across all eligible models. Batches are submitted as JSONL files uploaded via the Files API; each line contains a `custom_id` field for result matching plus a standard Chat Completions request body. Completion time: up to 24 hours. OpenAI also applies batch pricing to embedding generation, which is useful for the bounded LRU embedding-cache pattern described in LLM agent resource cleanup cost patterns when you have a large initial corpus to embed offline.
- Google Vertex AI Batch. Google’s Vertex AI Batch prediction endpoint covers Gemini 1.5 Pro, Gemini 1.5 Flash, and Gemini 1.0 Pro. The discount is 50% off the standard online-prediction price. Batch jobs are submitted via BigQuery input tables or Cloud Storage JSONL; results write back to BigQuery or Cloud Storage. Completion time is similar to the other providers: typically 2–6 hours for large jobs, with a 24-hour maximum. For teams already running on Google Cloud with BigQuery for analytics, the Vertex batch pipeline integrates directly with existing data infrastructure.
- The universal principle: does the user need a response in under 30 seconds? If the answer is yes, the request must go real-time. If the answer is no — or if there is no end-user waiting at all — the request is a candidate for batch. This single question separates the two pricing tiers. The sections below formalize the use-case matrix and show how to encode batch eligibility as metadata on agent requests so RunGuard can route them automatically.
Real-time vs batch: the complete use-case matrix
- Workloads that always require real-time. User-facing chat interfaces have an obvious latency requirement: the user typed a message and is watching the cursor. Interactive agent workflows — where a human-in-the-loop step follows the LLM response — also require real-time because a 4-hour delay breaks the interaction model. Time-sensitive alerts and anomaly detection similarly require sub-second response: a fraud-detection pipeline that takes 4 hours to flag a transaction provides no value. Finally, any pipeline where downstream steps are blocked waiting for the LLM result is real-time by definition (a batch job that finishes in 4 hours but blocks a real-time user for those 4 hours has just introduced a 4-hour user-facing latency).
- Workloads where batch is appropriate. The list is longer than most teams expect:
  - Nightly document processing pipelines (contract analysis, PDF summarization, report generation on the previous day’s data)
  - Bulk classification and tagging (categorizing a product catalog, intent-labeling a support ticket archive, tagging historical conversation logs)
  - Embedding generation for a corpus (initial embedding of a document library before launching an RAG system, re-embedding after a model upgrade)
  - Test suite evaluation and LLM-as-judge scoring (running evals on a test set of 5,000 prompts nightly to detect regression)
  - Content moderation for uploaded assets (user-uploaded images, documents, or audio transcripts that are queued rather than blocking the upload response)
  - Weekly report generation (summarizing user activity, generating personalized digests, creating analytics narratives)
  - Fine-tuning data preparation (generating synthetic training examples, reformatting existing datasets)
  - Retrospective analytics (running sentiment analysis or topic modeling on last month’s customer feedback)
- The grey zone: background agent tasks. Many production agent architectures have a mix of real-time and background tasks. A user-facing agent may trigger background subtasks (enriching a CRM record, generating a draft report, fetching and summarizing reference material) that do not need to complete before the agent responds to the user. These background subtasks are batch-eligible. The key architectural requirement is that the agent can return a response to the user without waiting for the subtask result. RunGuard’s `BatchRouter` (described below) uses deadline metadata on each request to make this determination automatically.
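The architectural requirement in the grey-zone case (reply first, let the subtask finish later) can be sketched with plain `asyncio`. This is a minimal illustration, not RunGuard's implementation: `background_tasks`, `summarize_reference_material`, and `handle_user_turn` are hypothetical names, and the subtask only simulates slow work instead of calling an LLM.

```python
import asyncio

# Hypothetical registry of in-flight background subtasks, keyed by task name.
background_tasks: dict[str, asyncio.Task] = {}


async def summarize_reference_material(doc_id: str) -> str:
    """Stand-in for a batch-eligible LLM subtask (no real API call is made)."""
    await asyncio.sleep(0.01)  # simulates slow, non-urgent work
    return f"summary-of-{doc_id}"


async def handle_user_turn(user_input: str, doc_id: str) -> str:
    """Reply to the user immediately; the enrichment runs in the background."""
    task = asyncio.create_task(summarize_reference_material(doc_id))
    background_tasks[f"summary:{doc_id}"] = task  # keep a reference for later
    return f"Got it: {user_input}"


async def demo() -> tuple[str, bool, str]:
    reply = await handle_user_turn("update the account notes", "doc-42")
    # The reply is available even though the subtask has not finished yet.
    finished_before_reply = background_tasks["summary:doc-42"].done()
    summary = await background_tasks["summary:doc-42"]
    return reply, finished_before_reply, summary
```

If `handle_user_turn` had to `await` the summary before returning, the subtask would sit on the user's critical path and would no longer be a batch candidate.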
Calculating your batch savings: real numbers
- Example 1: nightly intent classification of support tickets. A customer support platform receives 10,000 tickets per day. Each ticket requires an intent classification call to GPT-4o: 500 input tokens (system prompt + ticket text) + ~50 output tokens (the classification label). Real-time pricing for GPT-4o: $2.50/MTok input, $10.00/MTok output. Real-time daily cost: (500 × 10,000 / 1,000,000 × $2.50) + (50 × 10,000 / 1,000,000 × $10.00) = $12.50 + $5.00 = $17.50/day. Batch pricing (50% off): $8.75/day. Annual savings: $8.75 × 365 = $3,194/year on a single nightly job. The batch job runs overnight and results are ready before the support team arrives in the morning — zero operational impact.
- Example 2: nightly document summarization. A legal tech platform summarizes 10,000 contracts per night. Each summarization call uses 2,000 input tokens + 500 output tokens. Real-time: (2,000 × 10,000 / 1,000,000 × $2.50) + (500 × 10,000 / 1,000,000 × $10.00) = $50.00 + $50.00 = $100.00/night. Batch: $50.00/night. Annual savings: $50.00 × 365 = $18,250/year. At these volumes the savings from batch pricing alone pay for years of RunGuard Team plan ($79/month × 12 = $948/year).
- Example 3: weekly embedding refresh for a 500,000-document corpus. An enterprise search platform re-embeds its full document corpus weekly after content updates. Using OpenAI text-embedding-3-small at $0.02/MTok real-time: average 300 tokens per chunk × 500,000 chunks = 150,000,000 tokens = $3.00/week. Batch (50% off): $1.50/week. Annual savings: $1.50 × 52 = $78/year. That is modest in isolation, but the savings scale linearly with both the per-token price and the corpus size. A corpus 10× larger embedded with text-embedding-3-large at $0.13/MTok costs 1,500 MTok × $0.13 = $195.00/week real-time and $97.50/week in batch, saving $97.50 × 52 = $5,070/year on embeddings alone.
- Example 4: LLM-as-judge eval suite, nightly regression testing. A team runs 5,000 prompt-response pairs through Claude 3.5 Sonnet for quality scoring each night. Each eval call: 1,500 input tokens (rubric + prompt + response) + 200 output tokens (score + reasoning). Claude 3.5 Sonnet real-time: $3.00/MTok input, $15.00/MTok output. Real-time: (1,500 × 5,000 / 1,000,000 × $3.00) + (200 × 5,000 / 1,000,000 × $15.00) = $22.50 + $15.00 = $37.50/night. Batch: $18.75/night. Annual savings: $6,844/year. For a team running eval suites as part of CI, this represents a substantial fraction of the total LLM spend. For related patterns on managing per-session agent cost see AI agent cost per user session.
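The arithmetic in the four examples follows one formula, which can be sketched as a small calculator. This is an illustrative helper, not part of any SDK: the `Workload` dataclass and function names are assumptions, and the prices passed in are the real-time list prices quoted in the examples above.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50  # the 50% batch discount described above


@dataclass(frozen=True)
class Workload:
    requests_per_run: int
    input_tokens: int             # per request
    output_tokens: int            # per request
    input_price_per_mtok: float   # real-time USD per million input tokens
    output_price_per_mtok: float  # real-time USD per million output tokens
    runs_per_year: int = 365


def realtime_cost_per_run(w: Workload) -> float:
    """Real-time cost of one run: tokens / 1M * price, for input and output."""
    input_cost = w.requests_per_run * w.input_tokens / 1_000_000 * w.input_price_per_mtok
    output_cost = w.requests_per_run * w.output_tokens / 1_000_000 * w.output_price_per_mtok
    return input_cost + output_cost


def annual_batch_savings(w: Workload) -> float:
    """Savings per year when every run moves from real-time to batch pricing."""
    return realtime_cost_per_run(w) * BATCH_DISCOUNT * w.runs_per_year


# Example 1: 10,000 tickets/night on GPT-4o at $2.50 in / $10.00 out per MTok
ticket_job = Workload(10_000, 500, 50, 2.50, 10.00)
print(realtime_cost_per_run(ticket_job))           # 17.5
print(round(annual_batch_savings(ticket_job), 2))  # 3193.75
```

Setting `runs_per_year=52` reproduces the weekly embedding case in Example 3.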
Anthropic Messages Batch API: complete Python implementation
- Submitting a batch and handling per-request errors. The Anthropic batch API accepts a list of requests, each with a `custom_id` for result matching and a standard `params` dict matching the Messages API signature. The response includes a `processing_status` field (`"in_progress"`, `"canceling"`, or `"ended"`) and a `results_url` that becomes available when processing ends. Results are a JSONL file where each line maps a `custom_id` to a result of type `succeeded`, `errored`, `canceled`, or `expired`. Individual request errors do not fail the batch:

```python
import time

import anthropic
import runguard
from anthropic.types.messages import MessageBatch

client = anthropic.Anthropic()
rg = runguard.init(api_key="rg_live_...")


def submit_classification_batch(
    tickets: list[dict],
    system_prompt: str,
) -> str:
    """Submit a batch and return the batch ID."""
    requests = [
        {
            "custom_id": f"ticket-{t['id']}",
            "params": {
                "model": "claude-3-5-sonnet-20241022",
                "max_tokens": 100,
                "system": system_prompt,
                "messages": [
                    {"role": "user", "content": t["text"]}
                ],
            },
        }
        for t in tickets
    ]
    batch = client.messages.batches.create(requests=requests)
    rg.metric("batch_submitted", value=len(requests), tags={"batch_id": batch.id})
    print(f"Submitted batch {batch.id} with {len(requests)} requests")
    return batch.id


def wait_for_batch(batch_id: str, poll_interval_s: int = 60) -> MessageBatch:
    """Poll until the batch is complete. Returns the completed batch object."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        print(f"Batch {batch_id} still processing... sleeping {poll_interval_s}s")
        time.sleep(poll_interval_s)


def process_batch_results(batch_id: str) -> dict[str, str]:
    """
    Stream results from the completed batch.
    Returns a dict mapping custom_id to classification label.
    Logs per-request errors to RunGuard for alerting.
    """
    results: dict[str, str] = {}
    errors: list[dict] = []
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            # Extract the text content from the message
            label = result.result.message.content[0].text.strip()
            results[result.custom_id] = label
        elif result.result.type == "errored":
            errors.append({
                "custom_id": result.custom_id,
                "error_type": result.result.error.type,
                "error_message": result.result.error.message,
            })
        else:
            # "canceled" or "expired": no message was produced for this request
            errors.append({
                "custom_id": result.custom_id,
                "error_type": result.result.type,
                "error_message": "",
            })
    if errors:
        rg.alert(
            "batch_request_errors",
            detail=f"{len(errors)} requests failed in batch {batch_id}",
            severity="warning",
            metadata={"sample_errors": errors[:5]},
        )
    rg.metric(
        "batch_completion",
        value=len(results),
        tags={
            "batch_id": batch_id,
            "success_count": len(results),
            "error_count": len(errors),
        },
    )
    return results


# Putting it together for a nightly classification job
def nightly_classification_job(tickets: list[dict]) -> None:
    SYSTEM = (
        "Classify the support ticket into exactly one category: "
        "billing, technical, feature_request, other. "
        "Reply with only the category name."
    )
    with rg.budget(daily_usd=15.00, alert_at=0.80):
        batch_id = submit_classification_batch(tickets, SYSTEM)
        batch = wait_for_batch(batch_id)
        labels = process_batch_results(batch_id)
        # Rough estimate: input tokens only, ~500 per ticket at the $1.50/MTok batch rate
        print(f"Classified {len(labels)} tickets. "
              f"Estimated cost: ${batch.request_counts.succeeded * 500 / 1_000_000 * 1.50:.4f}")
```

- Webhook-style completion instead of polling. For production pipelines, polling every 60 seconds is acceptable, but a push notification is cleaner. The Message Batches API reference describes polling `processing_status` as the way to detect completion, and `client.messages.batches.create` is not documented as taking a `webhook_url` argument, so check the current API reference before relying on provider-sent webhooks. A pattern that works regardless is a small scheduled poller that POSTs the `batch_id` to your own endpoint (for example `https://yourapp.com/batch-webhook`) once the batch reaches `"ended"` status; the handler then calls `client.messages.batches.results(batch_id)` to stream results.
OpenAI Batch API: TypeScript implementation
- Uploading the JSONL file and submitting the batch. OpenAI’s batch workflow has an extra step: the JSONL request file must be uploaded via the Files API first, then referenced by ID when creating the batch. Each line of the JSONL must include `custom_id`, `method`, `url`, and `body`:

```typescript
import OpenAI from "openai";
import RunGuard from "@runguard/sdk";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const rg = new RunGuard({ apiKey: process.env.RUNGUARD_API_KEY! });

interface TicketRecord {
  id: string;
  text: string;
}

interface BatchResult {
  ticketId: string;
  label: string;
}

async function submitOpenAIBatch(
  tickets: TicketRecord[],
  systemPrompt: string
): Promise<string> {
  // Build the JSONL content in memory
  const lines = tickets.map((t) =>
    JSON.stringify({
      custom_id: `ticket-${t.id}`,
      method: "POST",
      url: "/v1/chat/completions",
      body: {
        model: "gpt-4o",
        max_tokens: 100,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: t.text },
        ],
      },
    })
  );
  const jsonlContent = lines.join("\n");

  // Upload the JSONL file
  const file = await openai.files.create({
    file: new File([jsonlContent], "batch_input.jsonl", {
      type: "application/jsonl",
    }),
    purpose: "batch",
  });

  // Submit the batch
  const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  await rg.metric("openai_batch_submitted", {
    value: tickets.length,
    tags: { batchId: batch.id },
  });
  return batch.id;
}

async function waitForOpenAIBatch(
  batchId: string,
  pollIntervalMs = 60_000
): Promise<OpenAI.Batches.Batch> {
  while (true) {
    const batch = await openai.batches.retrieve(batchId);
    // Stop on any terminal status, not only on success
    if (
      batch.status === "completed" ||
      batch.status === "failed" ||
      batch.status === "expired" ||
      batch.status === "cancelled"
    ) {
      return batch;
    }
    console.log(`Batch ${batchId} status: ${batch.status}. Waiting...`);
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
}

async function processOpenAIBatchResults(
  batch: OpenAI.Batches.Batch
): Promise<BatchResult[]> {
  if (batch.status !== "completed" || !batch.output_file_id) {
    throw new Error(`Batch ${batch.id} did not complete successfully: ${batch.status}`);
  }
  const fileContent = await openai.files.content(batch.output_file_id);
  const text = await fileContent.text();

  const results: BatchResult[] = [];
  const errors: string[] = [];
  for (const line of text.split("\n").filter(Boolean)) {
    const record = JSON.parse(line);
    if (record.error) {
      errors.push(`${record.custom_id}: ${record.error.message}`);
      continue;
    }
    const label =
      record.response?.body?.choices?.[0]?.message?.content?.trim() ?? "unknown";
    results.push({
      ticketId: record.custom_id.replace("ticket-", ""),
      label,
    });
  }

  if (errors.length) {
    await rg.alert("openai_batch_errors", {
      detail: `${errors.length} requests failed in batch ${batch.id}`,
      severity: "warning",
    });
  }
  await rg.metric("openai_batch_completed", {
    value: results.length,
    tags: { batchId: batch.id, errorCount: errors.length },
  });
  return results;
}

// Example nightly job
async function nightlyClassificationJob(tickets: TicketRecord[]): Promise<void> {
  const SYSTEM =
    "Classify the support ticket into exactly one category: " +
    "billing, technical, feature_request, other. Reply with only the category name.";

  // RunGuard budget guard prevents the batch submission itself if daily budget
  // would be exceeded by the estimated cost of this batch
  await rg.withBudget({ dailyUsd: 15.0, alertAt: 0.8 }, async () => {
    const batchId = await submitOpenAIBatch(tickets, SYSTEM);
    const batch = await waitForOpenAIBatch(batchId);
    const results = await processOpenAIBatchResults(batch);
    console.log(`Classified ${results.length} tickets via batch API`);
  });
}
```

- Cleaning up the uploaded files after batch completion. Each JSONL file uploaded to the Files API counts against your OpenAI storage quota. Delete both the input file and the output file after processing results. Add `await openai.files.del(batch.input_file_id)` and `await openai.files.del(batch.output_file_id)` after `processOpenAIBatchResults` returns. This is a form of resource cleanup with the same importance as the connection and cache patterns described in LLM agent resource cleanup cost patterns: stored files accumulate silently and keep counting against the quota long after the batch is done.
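Both polling loops on this page (`wait_for_batch` in Python and `waitForOpenAIBatch` in TypeScript) use a fixed 60-second interval and never give up. A hedged alternative is sketched below: a provider-agnostic helper with exponential backoff and an overall timeout. The helper name `poll_until` and its parameters are hypothetical; the caller supplies an `is_done` callable, so no SDK call is assumed inside the helper.

```python
import time
from typing import Callable


def poll_until(
    is_done: Callable[[], bool],
    initial_interval_s: float = 30.0,
    max_interval_s: float = 600.0,
    timeout_s: float = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call is_done() until it returns True; return the number of polls made.

    The wait between polls doubles each time, capped at max_interval_s.
    Raises TimeoutError if the next wait would push the total past timeout_s.
    """
    waited = 0.0
    interval = initial_interval_s
    polls = 0
    while True:
        polls += 1
        if is_done():
            return polls
        if waited + interval > timeout_s:
            raise TimeoutError(f"batch still not finished after {waited:.0f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 2, max_interval_s)
```

With the Anthropic client from the earlier example, the callable could be `lambda: client.messages.batches.retrieve(batch_id).processing_status == "ended"`. The 24-hour default timeout mirrors the completion window, so a batch that never reaches a terminal state surfaces as an exception instead of an endless loop.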
RunGuard BatchRouter: automatic real-time vs batch routing
- The problem BatchRouter solves. In a production agent codebase, LLM calls are scattered across many tool functions, pipeline stages, and background workers. Manually auditing each call site to determine batch eligibility is error-prone and drifts over time as the codebase grows. RunGuard’s `BatchRouter` middleware intercepts every outgoing LLM call and applies a routing policy that encodes the eligibility rules in one place. The policy evaluates three signals: (a) a `batch_eligible` tag on the request, set by the calling code; (b) a `deadline` timestamp, where any request with a deadline more than 30 minutes in the future is routed to batch; and (c) the current hour of day, where requests triggered during off-peak hours (e.g., 22:00–06:00 UTC) are automatically batched if they have no explicit real-time requirement. The router accumulates batch-eligible requests in a queue and flushes them as batches at a configurable interval (e.g., every 15 minutes):

```python
from datetime import datetime, timedelta, timezone

import anthropic
import runguard
from runguard.batch import BatchRouter, BatchPolicy

rg = runguard.init(api_key="rg_live_...")

# Configure routing policy
policy = BatchPolicy(
    # Route to batch if the request is tagged batch_eligible
    tag_rules={"batch_eligible": True},
    # Route to batch if deadline is >30 minutes away
    deadline_threshold_minutes=30,
    # Route to batch if current UTC hour is in the off-peak window.
    # range(22, 6) is empty in Python, so the wrap past midnight is spelled out.
    off_peak_hours=[*range(22, 24), *range(0, 6)],  # 22:00–05:59 UTC
    # Flush accumulated batch requests every 15 minutes
    flush_interval_seconds=900,
    # Use Anthropic batch endpoint (or "openai" for OpenAI)
    provider="anthropic",
)

router = BatchRouter(rg, policy)


# Wrap your LLM client with the router
@router.wrap
async def call_llm(
    messages: list[dict],
    model: str = "claude-3-5-sonnet-20241022",
    batch_eligible: bool = False,
    deadline: datetime | None = None,
    **kwargs,
) -> str:
    """
    This function is intercepted by BatchRouter.
    If the request is batch-eligible, it is queued and a placeholder
    Future is returned. The Future resolves when the batch completes.
    If the request is real-time, it is forwarded immediately.
    """
    client = anthropic.AsyncAnthropic()
    response = await client.messages.create(
        model=model,
        messages=messages,
        max_tokens=kwargs.get("max_tokens", 1024),
    )
    return response.content[0].text


# Usage: real-time (user-facing)
async def handle_user_message(user_input: str) -> str:
    return await call_llm(
        messages=[{"role": "user", "content": user_input}],
        batch_eligible=False,  # explicit: user is waiting
    )


# Usage: background enrichment (batch-eligible)
async def enrich_crm_record(record_id: str, notes: str) -> str:
    return await call_llm(
        messages=[
            {"role": "user", "content": f"Summarize these CRM notes: {notes}"}
        ],
        batch_eligible=True,
        # Timezone-aware "now"; datetime.utcnow() is deprecated since Python 3.12
        deadline=datetime.now(timezone.utc) + timedelta(hours=4),
    )
    # BatchRouter queues this request; result available when batch completes
```

- Handling batch Future resolution in async pipelines. When `BatchRouter` queues a request, it returns a `BatchFuture` that behaves like a standard `asyncio.Future`. Background worker code can `await` the future; it will resolve when the batch completes. For fire-and-forget background tasks (CRM enrichment, report generation), the future can be stored in a task registry and checked later. For pipeline stages where the next step genuinely depends on the batch result, `await batch_future` will block the pipeline stage until the result arrives, which is exactly the behaviour you want for a non-latency-sensitive step that feeds a downstream non-latency-sensitive step.
- Off-peak automatic batching for scheduled jobs. Many agent pipelines include a mix of triggered (real-time) and scheduled (off-peak) work. RunGuard’s `off_peak_hours` setting automatically promotes calls made during the off-peak window to batch unless they carry an explicit real-time requirement, even if the calling code did not explicitly set `batch_eligible=True`. This provides a safety net for jobs scheduled with cron that were written without batch awareness: they pick up the 50% discount automatically after the `BatchRouter` is deployed.
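The three routing signals described above can be made concrete with a small, self-contained decision function. This is a sketch of the decision logic only, under stated assumptions, and not RunGuard's actual implementation: `RoutingPolicy`, `in_off_peak`, and `route` are hypothetical names, an untagged request is modelled as `batch_eligible=None`, and a deadline closer than the threshold is assumed to force real-time even for tagged requests. One detail worth noting is that an off-peak window such as 22:00 to 06:00 wraps past midnight, and Python's `range(22, 6)` is empty, so the wrap needs explicit handling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RoutingPolicy:
    deadline_threshold: timedelta = timedelta(minutes=30)
    off_peak_start_hour: int = 22  # inclusive, in the timezone of `now`
    off_peak_end_hour: int = 6     # exclusive


def in_off_peak(hour: int, start: int, end: int) -> bool:
    """True if hour is in [start, end), where the window may wrap past midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


def route(
    policy: RoutingPolicy,
    now: datetime,
    batch_eligible: Optional[bool] = None,
    deadline: Optional[datetime] = None,
) -> str:
    """Return "batch" or "realtime" for a single request."""
    if batch_eligible is False:
        return "realtime"  # an explicit real-time requirement always wins
    if deadline is not None and deadline - now <= policy.deadline_threshold:
        return "realtime"  # too close to the deadline to wait for a batch
    if batch_eligible is True or deadline is not None:
        return "batch"     # tagged, or a deadline comfortably in the future
    if in_off_peak(now.hour, policy.off_peak_start_hour, policy.off_peak_end_hour):
        return "batch"     # untagged call made during the off-peak window
    return "realtime"


noon = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(route(RoutingPolicy(), noon, deadline=noon + timedelta(hours=4)))  # batch
```

Putting the explicit `False` check first is a deliberate choice in this sketch: it keeps a user-facing call real-time even when it happens to run at 23:00.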
Monitoring batch job costs with RunGuard
- Batch jobs need different monitoring than real-time calls. Real-time LLM monitoring focuses on per-call latency, token-per-second throughput, and per-request cost. Batch monitoring is different: you care about cost-per-batch-job, the ratio of succeeded to errored requests within the batch, and whether batch costs are tracking against the expected daily budget. A batch job that has a 5% per-request error rate is worth alerting on; in real-time mode you’d see individual request failures immediately, but in batch mode all failures arrive together when the job completes.
- RunGuard batch cost reconciliation. Configure RunGuard with a daily batch budget separate from the real-time budget. The `rg.budget` context manager supports a `channel="batch"` parameter that tracks batch spend independently:

```python
import json

import runguard

rg = runguard.init(api_key="rg_live_...")


async def run_nightly_jobs() -> None:
    # The run_*_batch() coroutines are placeholders for your own jobs;
    # each is expected to return the job's cost in USD.
    async with rg.budget(
        daily_usd=50.00,
        channel="batch",
        alert_at=0.80,  # alert at $40 spend
        alert_channels=["slack", "pagerduty"],
    ) as batch_budget:
        # Job 1: ticket classification
        tickets_cost = await run_ticket_classification_batch()
        batch_budget.record_cost(tickets_cost, job="ticket_classification")

        # Job 2: document summarization
        docs_cost = await run_document_summarization_batch()
        batch_budget.record_cost(docs_cost, job="document_summarization")

        # Job 3: embedding refresh
        embeddings_cost = await run_embedding_refresh_batch()
        batch_budget.record_cost(embeddings_cost, job="embedding_refresh")

    # RunGuard writes a daily batch cost report to the dashboard
    # with per-job breakdown and comparison against the budget


class BatchCostReconciler:
    """
    Utility class that parses Anthropic/OpenAI batch result files and
    computes actual cost from token usage fields.
    """

    def compute_anthropic_cost(
        self,
        results_file_path: str,
        input_price_per_mtoken: float,
        output_price_per_mtoken: float,
    ) -> float:
        # Pass the batch (already discounted) per-MTok prices here
        total_cost = 0.0
        with open(results_file_path) as f:
            for line in f:
                record = json.loads(line)
                if record.get("result", {}).get("type") == "succeeded":
                    usage = record["result"]["message"].get("usage", {})
                    input_tokens = usage.get("input_tokens", 0)
                    output_tokens = usage.get("output_tokens", 0)
                    total_cost += (
                        input_tokens / 1_000_000 * input_price_per_mtoken
                        + output_tokens / 1_000_000 * output_price_per_mtoken
                    )
        return total_cost
```

- Alerting on high per-request error rates in batches. A batch with more than 2% per-request errors usually indicates a systematic problem: a prompt formatting error, a model content filter rejection, or a context-window overflow. RunGuard’s batch result processor emits an alert when the error rate exceeds a configurable threshold:

```python
rg.configure_batch_alert(
    error_rate_threshold=0.02,  # alert if >2% of requests fail
    alert_message_template=(
        "Batch {batch_id} has {error_rate:.1%} error rate "
        "({error_count}/{total_count} requests failed). "
        "Check batch results at {results_url}."
    ),
    channels=["slack"],
)
```

The alert fires immediately when RunGuard processes the results file, before your downstream pipeline tries to use the (partially absent) results. This gives you a window to resubmit the failed requests before the downstream job reads results. See multi-agent orchestration cost control for how batch-job error handling fits into larger multi-stage agent pipelines.
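To make the threshold check and the resubmission window concrete, here is a minimal sketch of computing the per-request error rate from Anthropic-style result lines and collecting the `custom_id` values that need resubmitting. The function names are hypothetical and this is not RunGuard's processor; it assumes each JSONL line has a `custom_id` and a `result.type` field, as in the reconciler example above, and treats every non-`succeeded` result as a failure.

```python
import json
from typing import Iterable


def summarize_batch_results(jsonl_lines: Iterable[str]) -> dict:
    """Count outcomes in batch result lines and collect the failed custom_ids."""
    total = 0
    failed_ids: list[str] = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        total += 1
        if record.get("result", {}).get("type") != "succeeded":
            failed_ids.append(record["custom_id"])
    error_rate = len(failed_ids) / total if total else 0.0
    return {"total": total, "failed_ids": failed_ids, "error_rate": error_rate}


def should_alert(summary: dict, threshold: float = 0.02) -> bool:
    """True when the error rate is strictly above the threshold."""
    return summary["error_rate"] > threshold
```

The `failed_ids` list can be used to rebuild a smaller follow-up batch containing only the failed requests, so the downstream job waits for one short resubmission instead of a full rerun.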
Batch cost savings by workload size and request profile
| Workload | Requests / night | Tokens per request | Real-time cost / night | Batch cost / night | Annual savings | Completion time | Batch viable? |
|---|---|---|---|---|---|---|---|
| Intent classification (GPT-4o) | 10,000 | 500 in + 50 out | $17.50 | $8.75 | $3,194 | 1–3 h | Yes — results ready before business hours |
| Document summarization (GPT-4o) | 10,000 | 2,000 in + 500 out | $100.00 | $50.00 | $18,250 | 2–4 h | Yes — nightly pipeline, no user waiting |
| LLM eval suite (Claude 3.5 Sonnet) | 5,000 | 1,500 in + 200 out | $37.50 | $18.75 | $6,844 | 1–2 h | Yes — CI eval run, no SLA on completion |
| Embedding generation (text-embedding-3-small) | 500,000 chunks / week | 300 tokens each | $3.00 / week | $1.50 / week | $78 | 2–6 h | Yes — weekly corpus refresh |
| User-facing chat (GPT-4o) | N/A (real-time) | 1,000 in + 300 out | Real-time required | Not applicable | $0 | < 2 s | No — user is waiting for response |
| Fraud detection alert (GPT-4o) | N/A (real-time) | 500 in + 100 out | Real-time required | Not applicable | $0 | < 1 s | No — time-sensitive; delay defeats purpose |
| CRM enrichment (background, Claude Haiku) | 20,000 | 800 in + 150 out | $4.60 | $2.30 | $840 | 1–3 h | Yes — background task, no user dependency |
| Content moderation (uploaded docs) | 5,000 | 1,200 in + 50 out | $15.25 | $7.63 | $2,782 | 2–4 h | Yes — upload queue, async review acceptable |
Related: Anthropic Claude API cost optimization for prompt-caching strategies that stack with batch pricing — prompt caching (90% input token discount on repeated prefixes) and batch pricing (50% off all tokens) are independent discounts that can be combined. See also OpenAI Assistants API budget control for managing cost across the Assistants (real-time) and Batch APIs on the same project.
Route your batch-eligible workloads and cut your LLM bill in half
The 50% batch discount from Anthropic, OpenAI, and Google is the single easiest cost reduction available for non-latency-sensitive agent workloads. The savings compound across every nightly job you run: $18,250/year on document summarization, $6,844/year on eval suites, $3,194/year on ticket classification. The implementation patterns on this page — JSONL batch submission, per-request error handling, webhook-based completion — cover everything you need to migrate your first batch job. RunGuard’s BatchRouter makes the routing decision automatic, so new callsites pick up batch pricing without requiring individual code changes. Combined with the budget controls in autonomous agent cost control best practices, batch routing gives you both the lower rate and the guardrails to catch runaway spend if a batch job misfires.
RunGuard pricing: Solo plan at $19/month for individual developers. Team plan at $79/month adds Slack and PagerDuty webhook alerts, shared dashboards, and audit log. Both plans include a 14-day free trial — no credit card required.
Start your 14-day free trial — or explore related: LLM caching cost savings calculation and agent task decomposition cost efficiency.