
LLM token cost optimization is the practice of cutting your inference bill without cutting product quality, using five compounding levers: output limits, model-tier routing, prompt caching, batch APIs, and tighter context. Combine them and a $5,000/month spend usually drops to $800 or less. The work is mechanical, not magical, and it pays back inside a billing cycle.
Teams ship a feature on the strongest available model, prove it works, and then never come back to the cost. That is rational on day one and expensive by month three.
By 2026, three things have become true at once. Model tiers got wider (Haiku 4.5 and GPT-4o-mini are genuinely capable for routine work). Provider-side discounts got real (Anthropic offers 90% off prompt-cache hits, OpenAI offers 50% off batch jobs). And output tokens still cost 4 to 5 times more than input tokens, so verbose responses dominate the bill.
The teams who fix this do not run a single optimization. They run a ladder of five, in order, and stop when the bill is low enough.
| Lever | Effort | Typical savings | When to skip |
|---|---|---|---|
| 1. Output token limits + stop sequences | Hours | 15-25% | You already enforce schemas |
| 2. Model tier routing (Haiku / Sonnet / Opus) | Days | 40-60% | All traffic genuinely needs the top tier |
| 3. Prompt caching (Anthropic / OpenAI) | Days | 30-70% on long-context work | Prompts are short and unique |
| 4. Batch API for async work | A day | 50% on the routed share | No async-tolerable workload |
| 5. Context engineering + structured outputs | Weeks | 20-40% | Already running tight context |
Run them in this order. Each lever changes the inputs to the next, so reordering wastes work.
Output tokens are the single most expensive thing on your bill. GPT-4o output is $10 per million tokens against $2.50 per million input. Sonnet 4.6 output is $15 against $3 input. Cutting output length 60% saves more dollars than cutting input by the same percentage.
Two changes, both 20-line PRs:
- Set `max_tokens` on every call. Pick a value that fits the longest legitimate response plus a small margin. If your support reply is 120 tokens at p95, set `max_tokens=150`, not 1024.
- `stop=["\n\nUser:", "</answer>", "\n##"]` kills hallucinated continuation. For function-calling endpoints, stop sequences also prevent the model from explaining its tool calls in prose, which it loves doing.

Streaming the first token to the user, while you keep the full call short, is a UX upgrade that costs nothing. Time-to-first-token (TTFT) drops below 400ms even on Sonnet, and users perceive a 200-token response as instant.
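The two changes can live in one helper so no call ships without them. A minimal sketch; the model id, the 25% headroom factor, and the helper name are illustrative, not from any provider's docs:

```python
# Centralize output caps and stop sequences so every call gets them.
# The 25% headroom turns a 120-token p95 into a 150-token cap.
STOP_SEQUENCES = ["\n\nUser:", "</answer>", "\n##"]

def capped_params(prompt: str, p95_output_tokens: int = 120) -> dict:
    """Build request kwargs with a capped max_tokens and stop sequences."""
    max_tokens = int(p95_output_tokens * 1.25)  # 120 -> 150
    return {
        "model": "claude-haiku-4-5",          # placeholder model id
        "max_tokens": max_tokens,
        "stop_sequences": STOP_SEQUENCES,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Route every call-site through the helper and the cap becomes impossible to forget.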
Model tier routing is the single biggest cost lever in 2026. The shape is the same across providers: a cheap tier for routine work (Haiku 4.5, GPT-4o-mini), a mid tier for reasoning (Sonnet 4.6, GPT-4o), and an expensive top tier reserved for the requests that genuinely need it.
The routing rule that works in practice: classify the request with the cheap tier, then dispatch. A two-line prompt to Haiku 4.5 ("classify this request as simple | reasoning | hard, reply with one word") costs about $0.0001 and saves an Opus call about half the time. If you want a deeper take on which provider to start with, our breakdown of OpenAI, Anthropic, and Google models walks through the trade-offs.
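The classify-then-dispatch shape fits in a few lines. A sketch under assumptions: the model ids are placeholders, and `classify()` here is a toy heuristic standing in for the real one-word Haiku call:

```python
# Classify-then-dispatch routing: a cheap classification decides which
# tier handles the request. Unknown labels fall back to the mid tier.
TIER_MODELS = {
    "simple": "claude-haiku-4-5",
    "reasoning": "claude-sonnet-4-6",
    "hard": "claude-opus-4-7",
}

def classify(request: str) -> str:
    """Toy stand-in for the one-word cheap-tier classification call."""
    if len(request) < 200 and "?" not in request:
        return "simple"
    return "reasoning"

def route(request: str) -> str:
    label = classify(request)
    return TIER_MODELS.get(label, TIER_MODELS["reasoning"])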
A real number from a customer-support workload we benchmarked: 80% of tickets routed to Haiku 4.5, 18% to Sonnet 4.6, 2% escalated to Opus 4.7. Quality dropped 1.2 points on a 100-point internal eval. Cost dropped 78%.
Prompt caching is the change that catches every team off guard. Anthropic charges 25% extra on cache writes and only 10% of the normal rate on cache hits, a 90% discount on the cached portion. OpenAI's automatic prompt caching gives 50% off cache hits with no API change required.
The math gets dramatic when system prompts are long. A 4,000-token system prompt sent 10,000 times a day is 40 million input tokens, or $120 per day at Sonnet 4.6's $3-per-million input rate. With Anthropic prompt caching and a realistic 70% hit rate, the same workload runs about $53 per day. That is roughly $2,000 per month back in your pocket from one feature flag.
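A quick check of that arithmetic, using Anthropic-style multipliers (cache writes at 1.25x the base input rate, cache hits at 0.10x):

```python
# Caching cost model: misses pay the write premium, hits pay 10%.
BASE = 3.00 / 1_000_000            # Sonnet 4.6 input $/token
WRITE, HIT = 1.25 * BASE, 0.10 * BASE

tokens_per_day = 4_000 * 10_000    # 40M cached-prefix tokens/day
hit_rate = 0.70

uncached = tokens_per_day * BASE
cached = tokens_per_day * ((1 - hit_rate) * WRITE + hit_rate * HIT)
monthly_saved = (uncached - cached) * 30

print(round(uncached, 2), round(cached, 2))  # 120.0 53.4
print(round(monthly_saved))                  # 1998
```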
Three rules that get you most of the savings:

- Put the static content (system prompt, tool definitions, few-shot examples) at the front of the prompt and keep it byte-identical across calls; any change invalidates the cached prefix.
- Place the cache breakpoint after the last static block, never in the middle of content that varies per request.
- Keep traffic steady enough that entries stay warm; Anthropic's default cache TTL is five minutes, so low-volume endpoints may never hit.
For agentic workflows specifically, where the system prompt plus tool definitions can hit 10,000 tokens, prompt caching is the difference between a profitable feature and a money pit. Engineers who live in this stack ship cache-aware code by default; teams new to it usually ship the expensive version twice before they notice.
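Wiring this up is mostly request structure. A sketch of an Anthropic-style request with a cache breakpoint on the static prefix; the prompt text, tool definition, and model id are placeholders:

```python
# Mark the static prefix cacheable; only the user message varies per call.
SYSTEM_PROMPT = "You are a support agent for Acme. Follow the policies below..."
TOOLS = [{
    "name": "lookup_order",
    "description": "Fetch an order by id.",
    "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}},
}]

def cached_request(user_msg: str) -> dict:
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 300,
        "tools": TOOLS,
        # cache_control on the final static block caches everything up to it
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_msg}],
    }
```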
If a workload can wait 24 hours, OpenAI's Batch API and Anthropic's Message Batches API both give 50% off. The API surface is almost identical to the synchronous one: submit a JSONL file, poll for results.
Workloads that fit:

- Nightly classification and tagging backfills
- Bulk summarization of archives, logs, or documents
- Eval runs and regression suites
- Report generation and digest emails, anything a user is not actively waiting on
A pricing trick most teams miss: you can route the cheap tier through Batch too. Haiku 4.5 at the batch discount is roughly $0.125 per million input tokens. For a 100-million-token monthly classification job, you pay $12.50 instead of $25.
The implementation cost is genuinely small. One queue, one cron, one results-merger. A mid engineer ships it inside two days.
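The submission half of that work is just building a JSONL file, one self-contained request per line. A sketch in the OpenAI Batch API's input format; the `custom_id` scheme and prompt are illustrative:

```python
# Build a Batch API input file: one JSON object per line, each carrying
# a complete chat-completions request plus a custom_id to match results.
import json

def batch_lines(texts: list[str]) -> str:
    lines = []
    for i, text in enumerate(texts):
        lines.append(json.dumps({
            "custom_id": f"ticket-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "max_tokens": 10,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            },
        }))
    return "\n".join(lines)  # write to a .jsonl file, upload, poll for results
```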
The last lever is the one that sounds boring and usually finds the most slack.
Structured outputs. Both OpenAI and Anthropic now ship guaranteed-JSON modes. Use them. Free-form JSON parsing fails 1-3% of the time even on top-tier models, and the retry cost (full token bill again, plus latency) eats your margin. Structured outputs lock the schema server-side, eliminate retries, and as a bonus tend to be shorter than free-form JSON because the model stops adding explanatory keys. If you are doing this in OpenAI, our function calling guide covers the 2026 changes.
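Locking the schema server-side looks like this in the OpenAI-style `response_format`. The ticket-classifier schema itself is a made-up example:

```python
# Strict JSON-schema response format: the model can only emit objects
# matching this shape, so parse failures and retry loops disappear.
TICKET_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_label",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "bug", "feature", "other"],
                },
                "urgent": {"type": "boolean"},
            },
            "required": ["category", "urgent"],
            "additionalProperties": False,  # no explanatory extra keys
        },
    },
}
```

`additionalProperties: False` is what stops the model from padding responses with keys you never asked for.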
Context engineering. Most prompts have 2-3x more context than they need. The patterns that find the slack:

- Retrieval returning ten chunks when the top three carry the answer
- Conversation history replayed in full instead of summarized past the last few turns
- System prompts that restate instructions the model already follows
- Few-shot examples that stopped earning their tokens after the last model upgrade
Smaller context is faster, cheaper, and usually more accurate. The model has fewer distractions and the same answer to give.
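Two of those trims fit in a few lines. A sketch: the word-overlap scorer is a toy stand-in for your real retriever scores, and the six-turn window is an arbitrary example:

```python
# Context trimming: keep only the top-k retrieved chunks and the most
# recent conversation turns instead of shipping everything every call.

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by word overlap with the query (toy scorer), keep k."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def recent_turns(history: list[dict], keep: int = 6) -> list[dict]:
    """Drop all but the last few turns; summarize the rest offline."""
    return history[-keep:]
```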
Here is the same workload through the ladder. Assume a 10-million-token-per-day user-facing app on Sonnet 4.6, with a 4,000-token system prompt and average 600-token output.
| Stage | Monthly cost | Cumulative savings |
|---|---|---|
| Baseline (no optimization) | $5,000 | 0% |
| + Output cap from 1024 to 200 | $3,800 | 24% |
| + Tier routing (70% Haiku, 25% Sonnet, 5% Opus) | $1,700 | 66% |
| + Anthropic prompt caching (70% hit rate) | $1,050 | 79% |
| + Batch the 30% async share | $880 | 82% |
| + Tighter context (40% fewer input tokens) | $620 | 88% |
The numbers are not perfectly additive because each lever changes the inputs to the next, but the shape holds. Teams who run the full ladder routinely take a Series A LLM bill from "uncomfortable" to "rounding error."
So who should do this work? The honest answer: an engineer who has shipped LLM features before. Token cost optimization is not framework-level work, and there is no plug-in service that does it correctly without context. You need someone who can read your code, run the eval before and after each lever, and resist the impulse to over-optimize.
That is true whether you hire full-time or book contract. If you don't have an in-house option, every engineer on Cadence is AI-native by default, vetted on Cursor and Claude Code fluency, structured-output discipline, and prompt-as-spec rigor before they unlock bookings. Our pool of 12,800 engineers self-selects into tiers (Junior $500/week, Mid $1,000/week, Senior $1,500/week, Lead $2,000/week), and the median time to first commit is 27 hours. For a one-week cost-optimization sprint, a Mid is usually the right call. If you want help deciding whether to build it in-house, ship it with a contractor, or buy a tool like Helicone or Langfuse, our build-vs-buy decision tool walks the trade-off.
If your monthly LLM bill is over $1,500 and you have not run the ladder, do these three things in order:
1. Add `max_tokens` and stop sequences to every call. Ship today.
2. Classify a week of traffic and route the simple share to the cheap tier.
3. Turn on prompt caching for any prompt with a long static prefix.

Those three changes alone usually deliver 60-70% of the total savings. The remaining levers are worth running, but they have a higher engineering tax.
Want a sanity check before you start? Run the Build/Buy/Book decision tool. Two minutes in, you'll know whether to ship this with an in-house engineer, a contractor, or a managed observability tool. The output is honest, not a pitch.
Most production apps that have not been optimized save 70-85% by running the full five-lever ladder. A $5,000/month spend dropping to $800 is typical. The bigger your bill, the larger the absolute savings; very small bills (under $200/month) are rarely worth the engineering time.
Yes, but the implementation differs. Anthropic exposes explicit cache breakpoints with a 90% discount on hits and a 25% premium on writes. OpenAI does it automatically with a 50% discount on hits and no API changes. Google's Gemini context caching is closer to Anthropic's model. Treat the patterns as similar, the API as different.
Self-hosting Llama 3.1 70B or Qwen 2.5 72B beats hosted APIs only above roughly 1 million queries per month, and only if you already have GPU operations skill. Below that volume, the time you spend running inference is more expensive than the API. Run the five-lever ladder first; consider self-host only if your bill is still uncomfortable after that. Our Sonnet vs GPT-4o coding comparison covers when the hosted top-tier still earns its price.
Build an eval set of 50-200 representative requests with golden outputs. Run it before and after each routing change. If quality drops more than 2 points on your scoring rubric, the routing rule is wrong, not the model. Expect to revisit edge cases for two weeks after any tier-routing change.
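That before/after gate is simple enough to automate. A sketch assuming your rubric produces per-request scores on a 100-point scale; the function name and 2-point threshold default come from the text, nothing else is prescribed:

```python
# Regression gate for tier-routing changes: compare mean rubric scores
# from the same golden request set run through both configurations.

def regression(baseline: list[float], candidate: list[float],
               max_drop: float = 2.0) -> bool:
    """True if the routed configuration dropped more than max_drop points."""
    before = sum(baseline) / len(baseline)
    after = sum(candidate) / len(candidate)
    return (before - after) > max_drop
```

Wire it into CI so a routing change that fails the gate never reaches production.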
Cap output tokens. It is one config change, applies to every call, and saves 15-25% on day one with zero risk. Once that ships, move to model-tier routing. Most teams hit a 60% reduction before they ever touch caching.