May 8, 2026 · 10 min read · Cadence Editorial

How to use Claude Sonnet 4.6 for production work

Photo by [Daniil Komov](https://www.pexels.com/@dkomov) on [Pexels](https://www.pexels.com/photo/close-up-of-computer-screen-with-code-displayed-34803966/)

Use Claude Sonnet 4.6 as your default production model in 2026. At $3 input and $15 output per million tokens, it hits 79.6% on SWE-bench Verified and 72.5% on OSWorld-Verified, which means it is good enough for roughly 80% of real shipping work: code-gen, agent loops, RAG queries, summarization, and tool use. Escalate to Opus 4.7 for the hardest reasoning. Drop to Haiku 4.5 for classification and routing.

That paragraph is the entire model-selection policy you need on day one. The rest of this post is how to actually wire Sonnet 4.6 into a production system: prompt caching, structured outputs, the Batch API, agent loops with computer use and MCP, and the context engineering moves that keep your bill predictable.

When Sonnet 4.6 is the right default

Sonnet 4.6 is the first Claude release where mid-tier pricing buys you near-flagship coding performance. The headline numbers explain why teams are standardizing on it:

  • 79.6% on SWE-bench Verified (up from 77.2% on Sonnet 4.5)
  • 59.1% on Terminal-Bench 2.0 (up from 51.0%)
  • 72.5% on OSWorld-Verified (up from 61.4%)
  • 0.51% prompt-injection success with safeguards (down from 49.36% on 4.5)
  • Hard-benign refusal rate of 0.18% (down from 8.5%, a 47x improvement)

Two of those numbers change how you architect a real system. The OSWorld jump means computer-use agents finally clear the bar for internal-ops automation. The injection-resistance jump means you can put a Sonnet-backed agent in front of customer input without a panic attack, provided you keep the standard tool allowlists and approval flows in place.

For coding agents specifically, Anthropic reported that Claude Code users preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time, and over Opus 4.5 about 59% of the time. That second number is the interesting one. A model a full tier cheaper is winning head-to-head on the work most engineers actually do.

Migrating from Sonnet 4.5 is a drop-in change with one caveat: prefilling is removed in 4.6. If your prompts depended on starting the assistant turn with a partial response, you need to refactor before you bump the version string.
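
If you are not sure whether anything depends on prefilling, this is the pattern to grep for. A before/after sketch (the extraction prompt is illustrative):

```python
# Before (Sonnet 4.5): prefilling the assistant turn to force JSON output.
messages = [
    {"role": "user", "content": "Extract the invoice fields as JSON."},
    {"role": "assistant", "content": "{"},  # partial assistant turn -- rejected by 4.6
]

# After (Sonnet 4.6): drop the prefill and force a JSON-schema tool instead,
# as covered in the structured-outputs section below.
messages = [
    {"role": "user", "content": "Extract the invoice fields as JSON."},
]
```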

The three-tier escalation rule

The optimal 2026 strategy is not "always Sonnet." It is a three-tier routing policy that you write down once and enforce in code.

| Model | Best for | Cost (in/out per M tokens) | Latency | When to escalate or drop |
| --- | --- | --- | --- | --- |
| Haiku 4.5 | Classification, routing, structured extraction, light summarization | $1 / $5 | Fastest | If output quality wobbles on multi-step reasoning, move to Sonnet |
| Sonnet 4.6 | Code-gen, agent loops, RAG, tool use, computer use, most production work | $3 / $15 | Fast | Default. Escalate to Opus only when Sonnet fails the eval suite twice |
| Opus 4.7 | Novel architecture, hardest reasoning, ambiguous specs, frontier problems | ~$15 / $75 | Slowest | Use sparingly. Cache aggressively. Often a one-shot for the hard problem, then back to Sonnet |

The senior pattern is simple. Sonnet handles the request. If you have a router or planner, Haiku handles that step (it is fast and 3x cheaper). If the request is in a known-hard bucket (think a multi-system refactor, a novel data-model design, or anything where Sonnet has plateaued in your evals), the planner escalates that single call to Opus, then routes the rest of the workflow back to Sonnet.
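
In code, the whole policy fits in one function. A minimal sketch: only claude-sonnet-4-6 is a model string confirmed in this post; the Haiku and Opus strings are assumed by analogy, and the task flags are placeholders for your own routing signals:

```python
# Three-tier routing: Haiku for the router/planner step, Sonnet by default,
# Opus one-shot for the known-hard bucket.
SONNET = "claude-sonnet-4-6"
HAIKU = "claude-haiku-4-5"   # assumed model string
OPUS = "claude-opus-4-7"     # assumed model string

def pick_model(step: str, known_hard: bool = False, sonnet_plateaued: bool = False) -> str:
    if step == "route":                  # planner/router: fast and 3x cheaper
        return HAIKU
    if known_hard or sonnet_plateaued:   # e.g. multi-system refactor, novel data model
        return OPUS                      # escalate this single call, then back to Sonnet
    return SONNET
```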

Engineers who have internalized this routing reach for the right tier without thinking about it. We cover the broader breakdown in our Claude Opus, Sonnet, Haiku tier guide, and the head-to-head against OpenAI in our Sonnet vs GPT-4o coding comparison.

Prompt caching: the 90% lever to stop ignoring

The single biggest cost lever on Sonnet 4.6 is prompt caching, and most production apps under-use it. The mechanics:

  • Mark cache_control on the stable parts of your prompt: system message, tool definitions, long persona blocks, retrieved documents that you re-use across requests.
  • Cache reads cost 10% of the standard input rate ($0.30 per million on Sonnet 4.6).
  • Cache writes cost a 25% premium ($3.75 per million).
  • Default TTL is 5 minutes. A 1-hour TTL is available for batch and slow-burn workloads.
  • Minimum cacheable block is 2,048 tokens.

The math: a 30,000-token system prompt costs $0.09 per request uncached. With a cache hit, it costs $0.009 on the prefix. That is a 90% reduction on the largest token line in most production apps, and real-world cache hit rates run from 30% to 98% depending on traffic patterns.
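
To re-run that math against your own prompt sizes, the arithmetic is a few lines:

```python
# Sonnet 4.6 rates from the list above, in dollars per token.
INPUT_RATE = 3.00 / 1e6        # standard input
CACHE_WRITE_RATE = 3.75 / 1e6  # 25% premium on the first write
CACHE_READ_RATE = 0.30 / 1e6   # 10% of the standard input rate

prefix_tokens = 30_000  # stable system prompt
print(f"uncached:    ${prefix_tokens * INPUT_RATE:.4f}")        # $0.0900 per request
print(f"cache hit:   ${prefix_tokens * CACHE_READ_RATE:.4f}")   # $0.0090 per request
print(f"first write: ${prefix_tokens * CACHE_WRITE_RATE:.4f}")  # $0.1125, amortized across hits
```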

A practical layout for a chat or agent app (sketched in code after the list):

  1. Cache the system prompt + tool definitions once at the top of the conversation.
  2. Cache the conversation history up to (but not including) the latest user message.
  3. Pay full price only on the new turn and the model's response.
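
With the Python SDK, that layout looks like the sketch below. The placeholder values are hypothetical; the cache_control placement is the part that matters:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholders for your own app state -- all hypothetical.
LONG_SYSTEM_PROMPT = "..."  # stable persona/policy block, above the 2,048-token minimum
tool_definitions = []       # your tool schemas
history = [                 # prior turns, oldest first
    {"role": "user", "content": [{"type": "text", "text": "..."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "..."}]},
]
new_user_message = "..."

# Step 2: breakpoint on the last block of the existing history.
history[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # step 1: caches tools + system prompt
    }],
    tools=tool_definitions,
    messages=[
        *history,
        {"role": "user", "content": new_user_message},  # step 3: full price on the new turn
    ],
)
```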

On Sonnet 4.6 specifically, prior thinking blocks are kept by default and remain part of the cached prefix. That makes multi-turn agent work meaningfully cheaper than it was on 4.5. For deeper coverage of the cost side, see our token cost optimization playbook.

Structured outputs with JSON schema

If you are still parsing free-text responses with regex in 2026, stop. Sonnet 4.6 supports structured outputs through the tool-use API: you define a JSON schema, the model returns a tool_use block with a typed payload, and you validate it on the receiving end with Pydantic, Zod, or whatever your runtime prefers.

This is prompt-as-spec discipline at the API boundary. The schema is the contract between your code and the model. When you change the schema, the prompt changes with it, and your tests catch the drift.
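
The wiring is one forced tool call plus a validator on your side. A sketch: the tool name, schema, and Pydantic model are illustrative, not an API the post prescribes:

```python
import anthropic
from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):  # your side of the contract
    intent: str
    severity: str
    summary: str

client = anthropic.Anthropic()
raw_ticket_text = "My invoice export 500s every time. We lose billing data daily."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{
        "name": "record_ticket",
        "description": "Record the extracted support-ticket fields.",
        "input_schema": {
            "type": "object",
            "properties": {
                "intent": {"type": "string", "enum": ["bug", "billing", "question"]},
                "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                "summary": {"type": "string"},
            },
            "required": ["intent", "severity", "summary"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_ticket"},  # force the schema
    messages=[{"role": "user", "content": raw_ticket_text}],
)

tool_use = next(b for b in response.content if b.type == "tool_use")
try:
    ticket = Ticket.model_validate(tool_use.input)  # always validate on receipt
except ValidationError:
    ...  # retry, or route to a review queue
```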

A few production rules:

  • Always validate. The model is good, not perfect. A response that validates against the schema can still contain hallucinated values, which is a real failure mode.
  • Use enum constraints for any field with a closed value set (status codes, intent labels, severity).
  • Keep schemas shallow. Deeply nested object trees push the model toward generation errors.
  • For batch extraction tasks, pair structured outputs with the Batch API and prompt caching for double the savings.

If your structured output ever needs to express uncertainty, add a confidence field and route low-confidence outputs to a human review queue. We cover the broader pattern in handling hallucinations in production LLM apps.
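
Extending the Ticket sketch above, that routing is a few lines. The threshold and the queue are hypothetical; tune the number against your eval data:

```python
class ScoredTicket(Ticket):
    confidence: float  # ask the model to self-score in [0, 1] via the schema

REVIEW_THRESHOLD = 0.7  # placeholder value

def route(ticket: ScoredTicket) -> None:
    if ticket.confidence < REVIEW_THRESHOLD:
        human_review_queue.put(ticket)  # hypothetical review queue
    else:
        auto_process(ticket)            # hypothetical downstream handler
```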

Batch API for non-urgent work

Anything in your system that does not have a latency SLO belongs on the Batch API. You get a 50% discount on both input and output ($1.50 / $7.50 on Sonnet 4.6) and you can stack it with prompt caching for compounded savings.

Workloads that almost always belong in batch:

  • Backfills (re-processing a year of historical records)
  • Daily or weekly reports
  • Eval runs (you do not need a sub-second response on a 500-prompt regression suite)
  • Embedding refreshes paired with a summarization step
  • Bulk data extraction from documents

Two things to know. Batches can take longer than 5 minutes to complete, so use the 1-hour cache TTL when you are batching with shared context. And the cache pre-warming pattern (a request with max_tokens: 0) is not supported inside a batch, since the ephemeral cache entry would expire before the follow-up.
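
Submitting a batch with a shared cached prefix looks like this in the Python SDK. A sketch: the `ttl: "1h"` cache_control syntax is assumed from the extended-TTL feature described above, and the records are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SHARED_CONTEXT = "..."  # prompt prefix reused by every request in the batch
records = [("rec-1", "..."), ("rec-2", "...")]  # hypothetical backfill rows

shared_system = [{
    "type": "text",
    "text": SHARED_CONTEXT,
    "cache_control": {"type": "ephemeral", "ttl": "1h"},  # batches outlive the 5-minute TTL
}]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"record-{rec_id}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "system": shared_system,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for rec_id, text in records
    ]
)
print(batch.id, batch.processing_status)  # poll until "ended", then fetch results
```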

If you are spending more than $500 a month on Anthropic and have not split your workload into "real-time" and "batch" lanes, that is the next change to make. The savings usually pay for the engineering time inside a week.

Agent loops, computer use, and MCP

Sonnet 4.6 is the first model where computer-use agents are credible inside a regulated company. OSWorld-Verified at 72.5% means the model can drive a browser, fill out forms, and complete multi-step UI workflows with a meaningful first-pass success rate. Combined with the prompt-injection improvement, it is reasonable to put one in front of internal-ops work that previously needed a human.

The shape of a production agent loop on Sonnet 4.6 (a code sketch follows the list):

  1. System prompt with role, allowed tools, and refusal policy. Cached.
  2. Tool definitions, including any MCP connectors you have wired in (databases, internal APIs, search, S3). Cached.
  3. The user's request, with any retrieved context.
  4. The model decides whether to think, call a tool, or respond.
  5. Your runtime executes the tool, returns the result, and the loop continues.
  6. Approval gates on irreversible actions (sending email, mutating production data, anything destructive).
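
Collapsed into code, the loop is about thirty lines. A sketch: require_human_approval and execute_tool stand in for your own runtime, and the irreversible-tool names are hypothetical:

```python
import anthropic

client = anthropic.Anthropic()

IRREVERSIBLE = {"send_email", "delete_record"}  # hypothetical tools behind approval gates

def run_agent(user_request: str, system_blocks: list, tool_definitions: list) -> str:
    messages = [{"role": "user", "content": user_request}]  # step 3
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=system_blocks,       # steps 1-2: cached system prompt + tool prefix
            tools=tool_definitions,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text          # step 4: model chose to respond
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                if block.name in IRREVERSIBLE:
                    require_human_approval(block)    # step 6: gate irreversible actions
                output = execute_tool(block.name, block.input)  # step 5: your runtime
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": results})  # loop continues
```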

MCP is the part that turns this from a chatbot into a real coworker. It gives the agent typed access to your actual systems through a protocol that handles auth, schemas, and discovery. Most teams start by exposing read-only tools (search documents, query a metrics database, fetch a customer record), then graduate to write tools once the eval suite is solid.

For the higher-level architecture of agentic systems with retrieval, our production RAG architecture guide covers the retrieval side end to end.

Context engineering beats parameter tuning

Sonnet 4.6 ships with a 200K standard context window and a 1M-token beta. The 1M is not an invitation to dump everything into the prompt. Above 200K input tokens, you pay a long-context premium of $6 input and $22.50 output per million, and quality on long-context recall is still uneven across token positions.

The senior move is to put only what the model needs in the window:

  • Use retrieval (BM25 plus dense vectors plus a reranker) to fetch the right 5,000 tokens of context.
  • Use the new context-compaction beta to auto-summarize older conversation turns once they are no longer load-bearing.
  • Keep tool results small. If a tool returns 50,000 tokens, summarize before passing back into the loop (sketched in code after this list).
  • Reserve the 1M context for genuinely long-document tasks (legal review, codebase audits, multi-document synthesis) and pay the premium consciously.
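
The tool-result rule pairs naturally with the Haiku tier. A sketch, assuming a rough four-characters-per-token heuristic and a hypothetical per-result budget (the Haiku model string is assumed by analogy):

```python
import anthropic

client = anthropic.Anthropic()

TOKEN_BUDGET = 2_000   # hypothetical per-result budget
CHARS_PER_TOKEN = 4    # rough heuristic, not a tokenizer

def compact_tool_result(raw: str) -> str:
    """Summarize oversized tool output before it re-enters the agent loop."""
    if len(raw) <= TOKEN_BUDGET * CHARS_PER_TOKEN:
        return raw
    response = client.messages.create(
        model="claude-haiku-4-5",  # assumed model string
        max_tokens=TOKEN_BUDGET,
        messages=[{
            "role": "user",
            "content": "Summarize this tool output. Keep IDs, numbers, and "
                       f"error messages verbatim:\n\n{raw}",
        }],
    )
    return response.content[0].text
```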

Context engineering is the difference between a $50/month app and a $5,000/month app at the same usage level. Get this right before you negotiate with sales for an enterprise discount.

What this looks like on a real Cadence project

Every engineer on Cadence is AI-native as a baseline. The platform's voice interview specifically scores Cursor / Claude Code / Copilot fluency, prompt-as-spec discipline, verification habits, and multi-step prompt-ladder thinking. There is no non-AI-native option on the platform. That matters here because Sonnet 4.6 only pays off at scale when the engineer wiring it knows what to cache, what to escalate, and what never to delegate.

A typical week looks like this. A founder books a mid-tier engineer at $1,000/week to wire the LLM stack (Anthropic SDK, prompt-caching layout, structured outputs, batch lanes for the nightly job). They book a senior at $1,500/week one week later to own the escalation policy, eval suite, and cost guardrails. The lead tier ($2,000/week) shows up only when the system needs to scale to multi-region traffic or you are designing a novel agentic architecture from scratch. Junior at $500/week handles cleanup, doc generation, and integrations against well-documented APIs. The 48-hour free trial means you find out in two days whether the engineer actually has the AI-native habits the platform vetted them on.

If you want a quick gut check on which next feature to build with Sonnet 4.6 versus buy off the shelf, the Build/Buy/Book decider tool walks you through the call in about two minutes.

What to do this week

If you are running on Sonnet 4.5 today, three concrete moves:

  1. Bump the model string to claude-sonnet-4-6 in a feature branch. Run your eval suite. If anything depends on prefilling, refactor before merging.
  2. Audit your largest prompts for cacheable prefixes. Add cache_control to the system prompt and tool definitions. Watch the cost drop.
  3. Identify one workload that does not need a real-time response (a nightly report, a backfill, an eval run) and move it to the Batch API. Stack it with the 1-hour cache TTL.

If you are starting from scratch this week and want a senior engineer who already knows this stack, Cadence shortlists 4 vetted AI-native engineers in 2 minutes with a 48-hour free trial. No recruiter loop, no notice period, weekly billing.

FAQ

Is Claude Sonnet 4.6 ready for production today?

Yes. SWE-bench Verified at 79.6%, OSWorld-Verified at 72.5%, and a prompt-injection success rate of 0.51% with safeguards (down from 49.36% on Sonnet 4.5). It is the recommended default for most production teams in 2026.

When should I use Claude Sonnet 4.6 vs Opus 4.7?

Default to Sonnet for roughly 80% of work. Escalate to Opus 4.7 only for novel architecture, ambiguous specs, or evaluation runs where Sonnet has visibly plateaued. Opus runs about 5x the cost of Sonnet and is noticeably slower, so most teams use it as a one-shot for the hard problem and route the rest back to Sonnet.

How much can prompt caching save with Sonnet 4.6?

Cache reads cost 10% of the base input rate ($0.30 per million) and writes cost a 25% premium ($3.75 per million). A 30,000-token system prompt that costs $0.09 per request uncached costs $0.009 per cache hit. Real apps see hit rates from 30% to 98% depending on traffic patterns.

Should I use the Batch API for production?

Yes, for anything without a latency SLO. Backfills, daily reports, eval runs, embeddings, bulk extraction. You get a 50% discount on both input and output, and you can stack it with the 1-hour cache TTL. Splitting your workload into a real-time lane and a batch lane usually pays for the engineering time inside a week.

Does Sonnet 4.6 support computer use in production?

Yes, with the standard agent guardrails. OSWorld-Verified jumped to 72.5% in 4.6, which makes computer-use agents credible for internal-ops automation against legacy systems without APIs. Wrap any production deployment in approval workflows for irreversible actions, action allowlists, and sandboxing.

How do I migrate from Sonnet 4.5 to Sonnet 4.6?

It is a drop-in replacement with one caveat: prefilling is removed in 4.6. If you have prompts that depend on starting the assistant turn with a partial response, refactor those before you bump the model string. Otherwise, change the version, run your eval suite, and ship.
