
Use Claude Sonnet 4.6 as your default production model in 2026. At $3 input and $15 output per million tokens, it hits 79.6% on SWE-bench Verified and 72.5% on OSWorld-Verified, which means it is good enough for roughly 80% of real shipping work: code-gen, agent loops, RAG queries, summarization, and tool use. Escalate to Opus 4.7 for the hardest reasoning. Drop to Haiku 4.5 for classification and routing.
That single sentence is the entire model-selection policy you need on day one. The rest of this post is how to actually wire Sonnet 4.6 into a production system: prompt caching, structured outputs, the Batch API, agent loops with computer use and MCP, and the context engineering moves that keep your bill predictable.
Sonnet 4.6 is the first Claude release where mid-tier pricing buys you near-flagship coding performance. The headline numbers explain why teams are standardizing on it:

- **SWE-bench Verified: 79.6%**, near-flagship coding performance at mid-tier pricing
- **OSWorld-Verified: 72.5%**, a large jump in computer-use capability
- **Prompt-injection attack success: 0.51%** with safeguards, down from 49.36% on Sonnet 4.5
Two of those numbers change how you architect a real system. The OSWorld jump means computer-use agents finally clear the bar for internal-ops automation. The injection-resistance jump means you can put a Sonnet-backed agent in front of customer input without a panic attack, provided you keep the standard tool allowlists and approval flows in place.
For coding agents specifically, Anthropic reported that Claude Code users preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time, and over Opus 4.5 about 59% of the time. That second number is the interesting one. A model two tiers cheaper is winning head-to-head on the work most engineers actually do.
Migrating from Sonnet 4.5 is a drop-in change with one caveat: prefilling is removed in 4.6. If your prompts depended on starting the assistant turn with a partial response, you need to refactor before you bump the version string.
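If you were using prefill, the usual refactor is to move the partial assistant turn into an explicit instruction (or, better, a structured-output tool schema). A before/after sketch; the message contents are illustrative, not from any official migration guide:

```python
# Before (4.5): prefilling the assistant turn to force JSON output.
old_messages = [
    {"role": "user", "content": "Extract the order ID as JSON."},
    {"role": "assistant", "content": '{"order_id":'},  # prefill: removed in 4.6
]

# After (4.6): ask for the shape explicitly in the user turn instead.
new_messages = [
    {"role": "user",
     "content": "Extract the order ID. Respond with only a JSON object "
                'of the form {"order_id": "<id>"}.'},
]
```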
The optimal 2026 strategy is not "always Sonnet." It is a three-tier routing policy that you write down once and enforce in code.
| Model | Best for | Cost (in/out per M tokens) | Latency | When to escalate or drop |
|---|---|---|---|---|
| Haiku 4.5 | Classification, routing, structured extraction, light summarization | $1 / $5 | Fastest | If output quality wobbles on multi-step reasoning, move to Sonnet |
| Sonnet 4.6 | Code-gen, agent loops, RAG, tool use, computer use, most production work | $3 / $15 | Fast | Default. Escalate to Opus only when Sonnet fails the eval suite twice |
| Opus 4.7 | Novel architecture, hardest reasoning, ambiguous specs, frontier problems | ~$15 / $75 | Slowest | Use sparingly. Cache aggressively. Often a one-shot for the hard problem, then back to Sonnet |
The senior pattern is simple. Sonnet handles the request. If you have a router or planner, Haiku handles that step (it is fast and 3x cheaper). If the request is in a known-hard bucket (think a multi-system refactor, a novel data-model design, or anything where Sonnet has plateaued in your evals), the planner escalates that single call to Opus, then routes the rest of the workflow back to Sonnet.
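That policy is small enough to write down as code. A minimal sketch in Python; the model IDs, bucket names, and `pick_model` helper are illustrative assumptions, not official identifiers:

```python
# Illustrative three-tier routing policy. Model IDs and the
# KNOWN_HARD bucket list are assumptions, not official names.
HAIKU = "claude-haiku-4-5"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-7"

# Buckets where Sonnet has plateaued in your eval suite.
KNOWN_HARD = {"multi_system_refactor", "novel_data_model"}

def pick_model(task_type: str, is_router_step: bool = False) -> str:
    """Return the model for a single call in the workflow."""
    if is_router_step:
        return HAIKU    # routing/planning: fastest and 3x cheaper on input
    if task_type in KNOWN_HARD:
        return OPUS     # one-shot escalation for the hard problem
    return SONNET       # default for roughly 80% of production work
```

The point is that escalation is per-call, not per-conversation: one Opus call for the hard step, then the planner routes the rest of the workflow back to Sonnet.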
Engineers who have internalized this routing reach for the right tier without thinking about it. We cover the broader breakdown in our Claude Opus, Sonnet, Haiku tier guide, and the head-to-head against OpenAI in our Sonnet vs GPT-4o coding comparison.
The single biggest cost lever on Sonnet 4.6 is prompt caching, and most production apps under-use it. The mechanics:
Put `cache_control` breakpoints on the stable parts of your prompt: the system message, tool definitions, long persona blocks, and retrieved documents you reuse across requests. Reads then cost 10% of the base input rate; writes carry a 25% premium.

The math: a 30,000-token system prompt costs $0.09 per request uncached. With a cache hit, the same prefix costs $0.009. That is a 90% reduction on the largest token line item in most production apps, and real-world cache hit rates run from 30% to 98% depending on traffic patterns.
A practical layout for a chat or agent app:
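One way to sketch that layout with the Messages API request shape: `cache_control` breakpoints go on the stable prefix (system prompt, then the last tool definition), so only the final user turn misses the cache. The persona text and tool schema here are placeholders:

```python
# Request payload with cache breakpoints on the stable prefix.
# The system text and tool schema are placeholders.
system = [
    {
        "type": "text",
        "text": "You are the support agent for Acme...",  # long, stable persona
        "cache_control": {"type": "ephemeral"},           # breakpoint 1
    }
]

tools = [
    {
        "name": "fetch_customer_record",
        "description": "Fetch a customer record by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
        "cache_control": {"type": "ephemeral"},           # breakpoint 2
    }
]

# Only content after the cached prefix is billed at the full rate.
request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": system,
    "tools": tools,
    "messages": [{"role": "user", "content": "Where is order 4521?"}],
}
```

Passed to `client.messages.create(**request)`, the prefix is written once at the 25% premium and read at 10% of the base input rate on every subsequent request that shares it.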
On Sonnet 4.6 specifically, prior thinking blocks are kept by default and remain part of the cached prefix. That makes multi-turn agent work meaningfully cheaper than it was on 4.5. For deeper coverage of the cost side, see our token cost optimization playbook.
If you are still parsing free-text responses with regex in 2026, stop. Sonnet 4.6 supports structured outputs through the tool-use API: you define a JSON schema, the model returns a tool_use block with a typed payload, and you validate it on the receiving end with Pydantic, Zod, or whatever your runtime prefers.
This is prompt-as-spec discipline at the API boundary. The schema is the contract between your code and the model. When you change the schema, the prompt changes with it, and your tests catch the drift.
A few production rules:
Use `enum` constraints for any field with a closed value set (status codes, intent labels, severity).

If your structured output ever needs to express uncertainty, add a `confidence` field and route low-confidence outputs to a human review queue. We cover the broader pattern in handling hallucinations in production LLM apps.
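A sketch of the schema-as-contract pattern: a tool definition with `enum` constraints plus a `confidence` field, and a small stdlib validator on the receiving end. The field names and the 0.7 review threshold are illustrative; in production you would validate with Pydantic or Zod rather than by hand:

```python
# Tool schema acting as the contract for a ticket-triage output.
# Field names and the review threshold are illustrative.
TRIAGE_SCHEMA = {
    "name": "record_triage",
    "description": "Record the triage decision for a support ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "intent": {"type": "string",
                       "enum": ["billing", "bug", "feature", "other"]},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["intent", "severity", "confidence"],
    },
}

def route(payload: dict, review_threshold: float = 0.7) -> str:
    """Validate a tool_use payload; route low-confidence outputs to a human."""
    schema = TRIAGE_SCHEMA["input_schema"]
    for field in schema["required"]:
        if field not in payload:
            raise ValueError(f"missing field: {field}")
    if payload["intent"] not in schema["properties"]["intent"]["enum"]:
        raise ValueError(f"bad intent: {payload['intent']}")
    if payload["confidence"] < review_threshold:
        return "human_review"
    return "auto"
```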
Anything in your system that does not have a latency SLO belongs on the Batch API. You get a 50% discount on both input and output ($1.50 / $7.50 on Sonnet 4.6) and you can stack it with prompt caching for compounded savings.
Workloads that almost always belong in batch: backfills, daily reports, eval runs, embedding generation, and bulk extraction.
Two things to know. Batches can take longer than the default 5-minute cache TTL to complete, so use the 1-hour TTL when you are batching with shared context. And the cache pre-warming pattern (a minimal request that exists only to write the cache) is not supported inside a batch, since the ephemeral cache entry would expire before the follow-up requests run.
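A sketch of a batch submission for a nightly summarization job, stacking the batch discount with a cached shared prefix on the 1-hour TTL. The document IDs, prompt text, and TTL syntax shown are assumptions for illustration:

```python
# Nightly summarization batch: one entry per document, all sharing a
# cached system prefix. Document contents here are placeholders.
docs = {"doc-001": "first transcript text", "doc-002": "second transcript text"}

shared_system = [{
    "type": "text",
    "text": "Summarize the transcript in three bullets.",
    # 1-hour TTL, since batches can outlive the 5-minute default.
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}]

requests = [
    {
        "custom_id": doc_id,   # lets you join results back to source docs
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 512,
            "system": shared_system,
            "messages": [{"role": "user", "content": text}],
        },
    }
    for doc_id, text in docs.items()
]
# Submit with client.messages.batches.create(requests=requests),
# then poll the batch until it has ended and fetch results by custom_id.
```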
If you are spending more than $500 a month on Anthropic and have not split your workload into "real-time" and "batch" lanes, that is the next change to make. The savings usually pay for the engineering time inside a week.
Sonnet 4.6 is the first model where computer-use agents are credible inside a regulated company. OSWorld-Verified at 72.5% means the model can drive a browser, fill out forms, and complete multi-step UI workflows with a meaningful first-pass success rate. Combined with the prompt-injection improvement, it is reasonable to put one in front of internal-ops work that previously needed a human.
The shape of a production agent loop on Sonnet 4.6 is a simple cycle: the model proposes a tool call, your harness checks it against an allowlist and executes it, and the result goes back into the context until the task completes or a turn budget runs out.
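That loop can be sketched as follows. The model call is stubbed out and the tool names are invented for illustration; a real harness would call the Messages API, parse `tool_use` blocks, and add approval flows for irreversible actions:

```python
# Minimal agent loop: the model proposes tool calls, the harness
# executes allowlisted ones, and results feed back until done.
ALLOWLIST = {"search_documents"}

def search_documents(query: str) -> str:
    return f"2 results for {query!r}"   # stub tool for illustration

TOOLS = {"search_documents": search_documents}

def run_agent(model_step, task: str, max_turns: int = 5) -> str:
    """Run the loop with a turn budget. model_step stands in for the API call."""
    history = [("user", task)]
    for _ in range(max_turns):
        action = model_step(history)            # stubbed model call
        if action["type"] == "final":
            return action["text"]
        name = action["tool"]
        if name not in ALLOWLIST:               # enforce the allowlist
            history.append(("tool_result", f"{name}: denied"))
            continue
        result = TOOLS[name](**action["input"])
        history.append(("tool_result", result))
    return "max turns exceeded"
```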
MCP is the part that turns this from a chatbot into a real coworker. It gives the agent typed access to your actual systems through a protocol that handles auth, schemas, and discovery. Most teams start by exposing read-only tools (search documents, query a metrics database, fetch a customer record), then graduate to write tools once the eval suite is solid.
For the higher-level architecture of agentic systems with retrieval, our production RAG architecture guide covers the retrieval side end to end.
Sonnet 4.6 ships with a 200K standard context window and a 1M-token beta. The 1M is not an invitation to dump everything into the prompt. Above 200K input tokens, you pay a long-context premium of $6 input and $22.50 output per million, and quality on long-context recall is still uneven across token positions.
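The pricing cliff is easy to encode as a guard in your cost accounting. A sketch of the math using the rates above, assuming the premium applies once input exceeds 200K tokens, as described here:

```python
# Cost estimate per request, in dollars. Rates from this post:
# standard $3/$15 per M tokens; above 200K input, $6/$22.50.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens > 200_000:
        in_rate, out_rate = 6.00, 22.50   # long-context premium
    else:
        in_rate, out_rate = 3.00, 15.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 150K-token prompt costs $0.45 in input; the same work pushed to
# 400K tokens costs $2.40 before a single output token is generated.
```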
The senior move is to put only what the model needs in the window, and nothing more.
Context engineering is the difference between a $50/month app and a $5,000/month app at the same usage level. Get this right before you negotiate with sales for an enterprise discount.
Every engineer on Cadence is AI-native as a baseline. The platform's voice interview specifically scores Cursor / Claude Code / Copilot fluency, prompt-as-spec discipline, verification habits, and multi-step prompt-ladder thinking. There is no non-AI-native option on the platform. That matters here because Sonnet 4.6 only pays off at scale when the engineer wiring it knows what to cache, what to escalate, and what to never delegate.
A typical week looks like this. A founder books a mid-tier engineer at $1,000/week to wire the LLM stack (Anthropic SDK, prompt-caching layout, structured outputs, batch lanes for the nightly job). They book a senior at $1,500/week one week later to own the escalation policy, eval suite, and cost guardrails. The lead tier ($2,000/week) shows up only when the system needs to scale to multi-region traffic or you are designing a novel agentic architecture from scratch. Junior at $500/week handles cleanup, doc generation, and integrations against well-documented APIs. The 48-hour free trial means you find out in two days whether the engineer actually has the AI-native habits the platform vetted them on.
If you want a quick gut check on which next feature to build with Sonnet 4.6 versus buy off the shelf, the Build/Buy/Book decider tool walks you through the call in about two minutes.
If you are running on Sonnet 4.5 today, three concrete moves:
1. Pin `claude-sonnet-4-6` in a feature branch. Run your eval suite. If anything depends on prefilling, refactor before merging.
2. Add `cache_control` to the system prompt and tool definitions. Watch the cost drop.
3. Split everything without a latency SLO into a Batch API lane for the 50% discount.

If you are starting from scratch this week and want a senior engineer who already knows this stack, Cadence shortlists 4 vetted AI-native engineers in 2 minutes with a 48-hour free trial. No recruiter loop, no notice period, weekly billing.
Yes. SWE-bench Verified at 79.6%, OSWorld-Verified at 72.5%, and a prompt-injection success rate of 0.51% with safeguards (down from 49.36% on Sonnet 4.5). It is the recommended default for most production teams in 2026.
Default to Sonnet for roughly 80% of work. Escalate to Opus 4.7 only for novel architecture, ambiguous specs, or evaluation runs where Sonnet has visibly plateaued. Opus runs about 5x the cost of Sonnet and is noticeably slower, so most teams use it as a one-shot for the hard problem and route the rest back to Sonnet.
Cache reads cost 10% of the base input rate ($0.30 per million) and writes cost a 25% premium ($3.75 per million). A 30,000-token system prompt that costs $0.09 per request uncached costs $0.009 per cache hit. Real apps see hit rates from 30% to 98% depending on traffic patterns.
Yes, for anything without a latency SLO. Backfills, daily reports, eval runs, embeddings, bulk extraction. You get a 50% discount on both input and output, and you can stack it with the 1-hour cache TTL. Splitting your workload into a real-time lane and a batch lane usually pays for the engineering time inside a week.
Yes, with the standard agent guardrails. OSWorld-Verified jumped to 72.5% in 4.6, which makes computer-use agents credible for internal-ops automation against legacy systems without APIs. Wrap any production deployment in approval workflows for irreversible actions, action allowlists, and sandboxing.
It is a drop-in replacement with one caveat: prefilling is removed in 4.6. If you have prompts that depend on starting the assistant turn with a partial response, refactor those before you bump the model string. Otherwise, change the version, run your eval suite, and ship.