May 7, 2026 · 9 min read · Cadence Editorial

Token cost optimization for LLM apps

Photo by [Brett Sayles](https://www.pexels.com/@brett-sayles) on [Pexels](https://www.pexels.com/photo/server-racks-on-data-center-5480781/)


LLM token cost optimization is the practice of cutting your inference bill without cutting product quality, using five compounding levers: output limits, model-tier routing, prompt caching, batch APIs, and tighter context. Combine them and a $5,000/month spend usually drops to $800 or less. The work is mechanical, not magical, and it pays back inside a billing cycle.

Why most LLM bills are 5x larger than they need to be

Teams ship a feature on the strongest available model, prove it works, and then never come back to the cost. That is rational on day one and expensive by month three.

By 2026, three things have become true at once. Model tiers got wider (Haiku 4.5 and GPT-4o-mini are genuinely capable for routine work). Provider-side discounts got real (Anthropic offers 90% off prompt-cache hits, OpenAI offers 50% off batch jobs). And output tokens still cost 4 to 5 times more than input tokens, so verbose responses dominate the bill.

The teams who fix this do not run a single optimization. They run a ladder of five, in order, and stop when the bill is low enough.

The 5-lever cost ladder (priority order)

| Lever | Effort | Typical savings | When to skip |
| --- | --- | --- | --- |
| 1. Output token limits + stop sequences | Hours | 15-25% | You already enforce schemas |
| 2. Model tier routing (Haiku / Sonnet / Opus) | Days | 40-60% | All traffic genuinely needs the top tier |
| 3. Prompt caching (Anthropic / OpenAI) | Days | 30-70% on long-context work | Prompts are short and unique |
| 4. Batch API for async work | A day | 50% on the routed share | No async-tolerable workload |
| 5. Context engineering + structured outputs | Weeks | 20-40% | Already running tight context |

Run them in this order. Each lever changes the inputs to the next, so reordering wastes work.

Lever 1: Cap output tokens and add stop sequences

Output tokens are the single most expensive thing on your bill. GPT-4o output is $10 per million tokens against $2.50 per million input. Sonnet 4.6 output is $15 against $3 input. Cutting output length 60% saves more dollars than cutting input by the same percentage.

Two changes, both 20-line PRs:

  1. Set max_tokens on every call. Pick a value that fits the longest legitimate response, plus 10%. If your support reply is 120 tokens at p95, set max_tokens=150, not 1024.
  2. Add stop sequences for chatty endings. stop=["\n\nUser:", "</answer>", "\n##"] kills hallucinated continuation. For function-calling endpoints, stop sequences also prevent the model from explaining its tool calls in prose, which it loves doing.
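
A minimal sketch of both changes using the Anthropic Python SDK; the model name, cap, and stop strings here are illustrative placeholders, not prescriptions:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Cap the response at the p95 reply length plus headroom, and cut off
# chatty continuations with explicit stop sequences.
response = client.messages.create(
    model="claude-haiku-4-5",  # illustrative model name
    max_tokens=150,            # p95 reply is ~120 tokens; don't pay for 1024
    stop_sequences=["\n\nUser:", "</answer>", "\n##"],
    system="You are a support agent. Answer in at most three sentences.",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```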

Streaming the first token to the user, while you keep the full call short, is a UX upgrade that costs nothing. Time-to-first-token (TTFT) drops below 400ms even on Sonnet, and users perceive a 200-token response as instant.

Lever 2: Route to the cheapest model that passes the eval

Model tier routing is the single biggest cost lever in 2026. The shape is the same across providers:

  • Cheap tier (Haiku 4.5, GPT-4o-mini, Gemini Flash): classification, extraction, routing, summarization of short text, JSON normalization. Roughly $0.25 per million input tokens. Good enough for 60-70% of typical app traffic.
  • Default tier (Sonnet 4.6, GPT-4o, Gemini Pro): anything user-facing that requires reasoning, multi-step tool use, or careful tone. Roughly $3 per million input tokens.
  • Hard tier (Opus 4.7, o-series, Gemini Ultra): architectural decisions, multi-file refactors, hard math, ambiguous specs that need a careful first pass. Roughly $15 per million input tokens.

The routing rule that works in practice: classify the request with the cheap tier, then dispatch. A two-line prompt to Haiku 4.5 ("classify this request as simple | reasoning | hard, reply with one word") costs about $0.0001 and saves an Opus call about half the time. If you want a deeper take on which provider to start with, our breakdown of OpenAI, Anthropic, and Google models walks through the trade-offs.
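
One way to wire the classify-then-dispatch rule, sketched with the Anthropic SDK. The tier map and model names are assumptions taken from the tiers above, not a fixed recipe; swap in whatever passes your eval:

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical tier map; adjust to the models that pass your eval.
TIERS = {
    "simple": "claude-haiku-4-5",
    "reasoning": "claude-sonnet-4-6",
    "hard": "claude-opus-4-7",
}

def route(request_text: str) -> str:
    """Classify the request with the cheap tier, then dispatch to the right model."""
    label = client.messages.create(
        model=TIERS["simple"],
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": f"Classify this request as simple | reasoning | hard. "
                       f"Reply with one word.\n\n{request_text}",
        }],
    ).content[0].text.strip().lower()

    model = TIERS.get(label, TIERS["reasoning"])  # default tier on an unexpected label
    reply = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{"role": "user", "content": request_text}],
    )
    return reply.content[0].text
```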

A real number from a customer-support workload we benchmarked: 80% of tickets routed to Haiku 4.5, 18% to Sonnet 4.6, 2% escalated to Opus 4.7. Quality dropped 1.2 points on a 100-point internal eval. Cost dropped 78%.

Lever 3: Prompt caching is the biggest free lunch in 2026

Prompt caching is the change that catches every team off guard. Anthropic charges 25% extra on cache writes and only 10% of the normal rate on cache hits, a 90% discount on the cached portion. OpenAI's automatic prompt caching gives 50% off cache hits with no API change required.

The math gets dramatic when system prompts are long. A 4,000-token system prompt sent 10,000 times a day on Sonnet 4.6 is 40 million input tokens, roughly $120 per day uncached at the $3-per-million input rate. With Anthropic prompt caching and a realistic 70% hit rate, the same tokens run about $53 per day. That is roughly $2,000 per month back in your pocket from one feature flag.
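
The arithmetic, as a quick back-of-the-envelope script. The rates are the Anthropic cache multipliers described above; the traffic numbers are this example's assumptions:

```python
# Back-of-the-envelope cache savings for the example above.
PROMPT_TOKENS = 4_000           # cached system prompt
CALLS_PER_DAY = 10_000
INPUT_RATE = 3.00 / 1_000_000   # Sonnet-class input, dollars per token
WRITE_MULT = 1.25               # cache writes cost 25% extra
HIT_MULT = 0.10                 # cache hits cost 10% of the normal rate
HIT_RATE = 0.70

daily_tokens = PROMPT_TOKENS * CALLS_PER_DAY               # 40M tokens/day
uncached = daily_tokens * INPUT_RATE                       # ~$120/day
cached = daily_tokens * INPUT_RATE * (
    (1 - HIT_RATE) * WRITE_MULT + HIT_RATE * HIT_MULT
)                                                          # ~$53/day
print(f"uncached ${uncached:,.0f}/day, cached ${cached:,.0f}/day, "
      f"saving ~${(uncached - cached) * 30:,.0f}/month")
```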

Three rules that get you most of the savings:

  1. Put your system prompt and any few-shot examples in the cached prefix. Put the user query last.
  2. Keep cache breakpoints stable. If you regenerate the system prompt every request (timestamps, random IDs), nothing caches.
  3. Watch your cache TTL. Anthropic's standard cache lives 5 minutes; the extended option lives an hour at higher cost. Pick the one that matches your traffic shape.
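
A minimal sketch of rule 1 with the Anthropic SDK: the static system prompt and examples sit in the cached prefix, marked with cache_control, and only the user query changes per call. The prompt text and model name are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder prompt; keep this string byte-stable (no timestamps, no random IDs).
STATIC_SYSTEM = (
    "You are a support agent for Acme. Follow the style guide.\n"
    "Example 1: ...\nExample 2: ..."
)

def answer(user_query: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # illustrative model name
        max_tokens=300,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # marks the cacheable prefix
        }],
        messages=[{"role": "user", "content": user_query}],  # volatile part goes last
    )
    return response.content[0].text
```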

For agentic workflows specifically, where the system prompt plus tool definitions can hit 10,000 tokens, prompt caching is the difference between a profitable feature and a money pit. Engineers who live in this stack ship cache-aware code by default; teams new to it usually ship the expensive version twice before they notice.

Lever 4: Batch API for anything async

If a workload can wait 24 hours, OpenAI's Batch API and Anthropic's Message Batches API both give 50% off. The API surface is almost identical to the synchronous one: submit a JSONL file, poll for results.
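
A sketch of the OpenAI flow, assuming a requests.jsonl file where each line carries a custom_id, method, url, and a chat.completions body; the file name and polling interval are placeholders:

```python
import json
import time
from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL file of requests (one chat.completions body per line).
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2. Create the batch job with a 24-hour completion window (the 50%-off tier).
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the job settles, then download and merge the results.
while (batch := client.batches.retrieve(batch.id)).status not in ("completed", "failed", "expired"):
    time.sleep(60)

if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text
    for line in results.splitlines():
        row = json.loads(line)
        print(row["custom_id"], row["response"]["body"]["choices"][0]["message"]["content"])
```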

Workloads that fit:

  • Nightly summarization of yesterday's events
  • Backfilling embeddings or reclassifying old data
  • Generating thumbnails, alt text, or marketing copy variants in bulk
  • Running offline evals against new prompts

A pricing trick most teams miss: you can route the cheap tier through Batch too. Haiku 4.5 at the batch discount is roughly $0.125 per million input tokens. For a 100-million-token monthly classification job, you pay $12.50 instead of $25.

The implementation cost is genuinely small. One queue, one cron, one results-merger. A mid engineer ships it inside two days.

Lever 5: Structured outputs and context engineering

The last lever is the one that sounds boring and usually finds the most slack.

Structured outputs. Both OpenAI and Anthropic now ship guaranteed-JSON modes. Use them. Free-form JSON parsing fails 1-3% of the time even on top-tier models, and the retry cost (full token bill again, plus latency) eats your margin. Structured outputs lock the schema server-side, eliminate retries, and as a bonus tend to be shorter than free-form JSON because the model stops adding explanatory keys. If you are doing this in OpenAI, our function calling guide covers the 2026 changes.
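
A minimal example of OpenAI's JSON-schema mode; the schema, model name, and ticket example are placeholders for whatever your endpoint actually returns:

```python
import json
from openai import OpenAI

client = OpenAI()

# The schema is enforced server-side: no retries, no "helpful" extra keys.
TICKET_SCHEMA = {
    "name": "ticket_triage",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "account", "other"]},
            "priority": {"type": "integer", "minimum": 1, "maximum": 3},
        },
        "required": ["category", "priority"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=50,
    messages=[{"role": "user", "content": "Triage: 'I was charged twice this month.'"}],
    response_format={"type": "json_schema", "json_schema": TICKET_SCHEMA},
)
ticket = json.loads(response.choices[0].message.content)  # matches the schema by construction
```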

Context engineering. Most prompts have 2-3x more context than they need. The patterns that find the slack:

  • Truncate chat history beyond the last 10 turns; summarize the rest into a 200-token recap.
  • For RAG, retrieve 8 chunks, rerank, and pass only the top 3. Chunk size of 400 tokens beats 1,000 in nearly every eval. (More on this in our production RAG architecture writeup.)
  • Strip whitespace, comments, and dead code from any code-context payload. Cursor and Claude Code both do this client-side now, but bespoke pipelines often skip it.
  • Move static reference material (style guide, ontology, schema) into the cached prefix. Don't re-tokenize it on every call.
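
A sketch of the first pattern. The turn limit, recap budget, and summarizer model are assumptions rather than tuned values, and you may need to merge or reorder roles to satisfy your provider's turn rules:

```python
import anthropic

client = anthropic.Anthropic()
MAX_TURNS = 10  # keep only the most recent turns verbatim

def compact_history(history: list[dict]) -> list[dict]:
    """Keep the last MAX_TURNS messages; summarize everything older into a short recap."""
    if len(history) <= MAX_TURNS:
        return history
    older, recent = history[:-MAX_TURNS], history[-MAX_TURNS:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    recap = client.messages.create(
        model="claude-haiku-4-5",   # cheap tier is fine for summarization
        max_tokens=200,             # the ~200-token recap budget from above
        messages=[{"role": "user",
                   "content": f"Summarize this conversation in under 200 tokens:\n{transcript}"}],
    ).content[0].text
    # Prepend the recap as context, then the verbatim recent turns.
    return [{"role": "user", "content": f"Summary of earlier conversation: {recap}"}, *recent]
```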

Smaller context is faster, cheaper, and usually more accurate. The model has fewer distractions and the same answer to give.

Walking the math: $5,000 to $800

Here is the same workload through the ladder. Assume a 10-million-token-per-day user-facing app on Sonnet 4.6, with a 4,000-token system prompt and average 600-token output.

| Stage | Monthly cost | Cumulative savings |
| --- | --- | --- |
| Baseline (no optimization) | $5,000 | 0% |
| + Output cap from 1024 to 200 | $3,800 | 24% |
| + Tier routing (70% Haiku, 25% Sonnet, 5% Opus) | $1,700 | 66% |
| + Anthropic prompt caching (70% hit rate) | $1,050 | 79% |
| + Batch the 30% async share | $880 | 82% |
| + Tighter context (40% fewer input tokens) | $620 | 88% |

The numbers are not perfectly additive because each lever changes the inputs to the next, but the shape holds. Teams who run the full ladder routinely take a Series A LLM bill from "uncomfortable" to "rounding error."

Who actually does this work

The honest answer: an engineer who has shipped LLM features before. Token cost optimization is not framework-level work, and there is no plug-in service that does it correctly without context. You need someone who can read your code, run the eval before and after each lever, and resist the impulse to over-optimize.

That is true whether you hire full-time or book contract. If you don't have an in-house option, every engineer on Cadence is AI-native by default, vetted on Cursor and Claude Code fluency, structured-output discipline, and prompt-as-spec rigor before they unlock bookings. Our pool of 12,800 engineers self-selects into tiers (Junior $500/week, Mid $1,000/week, Senior $1,500/week, Lead $2,000/week), and the median time to first commit is 27 hours. For a one-week cost-optimization sprint, a Mid is usually the right call. If you want help deciding whether to build it in-house, ship it with a contractor, or buy a tool like Helicone or Langfuse, our build-vs-buy decision tool walks the trade-off.

What to do this week

If your monthly LLM bill is over $1,500 and you have not run the ladder, do these three things in order:

  1. Add max_tokens and stop sequences to every call. Ship today.
  2. Pick one classifier endpoint and reroute it to Haiku 4.5 or GPT-4o-mini. Compare the eval before and after. Ship this week.
  3. Add Anthropic prompt caching to your longest system prompt. Watch the cache-hit metric for a day. Ship next week.

Those three changes alone usually deliver 60-70% of the total savings. The remaining levers are worth running, but they have a higher engineering tax.

Want a sanity check before you start? Run the Build/Buy/Book decision tool. Two minutes in, you'll know whether to ship this with an in-house engineer, a contractor, or a managed observability tool. The output is honest, not a pitch.

FAQ

How much can I realistically save on my LLM bill?

Most production apps that have not been optimized save 70-85% by running the full five-lever ladder. A $5,000/month spend dropping to $800 is typical. The bigger your bill, the larger the absolute savings; very small bills (under $200/month) are rarely worth the engineering time.

Does prompt caching work across providers?

Yes, but the implementation differs. Anthropic exposes explicit cache breakpoints with a 90% discount on hits and a 25% premium on writes. OpenAI does it automatically with a 50% discount on hits and no API changes. Google's Gemini context caching is closer to Anthropic's model. Treat the patterns as similar, the API as different.

Should I self-host an open-source model instead?

Self-hosting Llama 3.1 70B or Qwen 2.5 72B beats hosted APIs only above roughly 1 million queries per month, and only if you already have GPU operations experience. Below that volume, the engineering time spent running your own inference costs more than the API bill. Run the five-lever ladder first; consider self-hosting only if your bill is still uncomfortable after that. Our Sonnet vs GPT-4o coding comparison covers when the hosted top tier still earns its price.

How do I avoid quality regressions when I route to cheaper models?

Build an eval set of 50-200 representative requests with golden outputs. Run it before and after each routing change. If quality drops more than 2 points on your scoring rubric, the routing rule is wrong, not the model. Expect to revisit edge cases for two weeks after any tier-routing change.
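
A bare-bones version of that harness, assuming a JSONL eval file with input and golden fields and a score() function you define for your own rubric; all names here are hypothetical:

```python
import json

def score(candidate: str, golden: str) -> float:
    """Placeholder rubric: exact match. Swap in your own grader (LLM judge, regex, etc.)."""
    return 100.0 if candidate.strip() == golden.strip() else 0.0

def run_eval(call_model, eval_path: str = "eval_set.jsonl") -> float:
    """Run every eval case through call_model and return the mean score."""
    cases = [json.loads(line) for line in open(eval_path)]
    scores = [score(call_model(case["input"]), case["golden"]) for case in cases]
    return sum(scores) / len(scores)

# Usage: compare the routed pipeline against the old one before shipping.
# before = run_eval(old_pipeline)
# after = run_eval(routed_pipeline)
# ship = (before - after) <= 2.0   # the 2-point regression budget from above
```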

What's the cheapest way to get started?

Cap output tokens. It is one config change, applies to every call, and saves 15-25% on day one with zero risk. Once that ships, move to model-tier routing. Most teams hit a 60% reduction before they ever touch caching.
