
Prompt caching with the Anthropic API lets you mark a stable prefix of your request so Claude reads it from cache on the next call instead of re-encoding it. Cached input tokens cost 10% of base input pricing (a 90% discount), with a 25% premium on the first write. Default TTL is 5 minutes; 1-hour TTL is available at 2x write cost. The mechanism is literal byte-match on the prefix, not semantic similarity.
If you ship anything that calls Claude in a loop (an agent, a chat product, a RAG pipeline), prompt caching is the single biggest cost lever in the API. Done right, it turns a $3.00-per-million-token agent loop into a $0.30 one. Done wrong, you pay the 25% write premium on every request and never see a hit.
This guide is the working playbook: how the four cache breakpoints behave, the real per-million-token math, working TypeScript and Python, the three patterns that compound, the cache hygiene rules that stop you from invalidating yourself by accident, and a literal install-to-measure sequence at the end.
You opt into caching by adding a cache_control: { "type": "ephemeral" } marker on a content block. That marker is a breakpoint. Everything in the request from the start up to and including that block becomes the cache key. On the next call, if the same prefix arrives byte-for-byte, Claude skips encoding it and reads from cache.
There are three places you can drop a breakpoint:
- The tools array, useful for agents with stable tool definitions.
- The system array, the most common cache target.
- messages.content, including assistant turns and tool results.

The order of the prefix is fixed: tools -> system -> messages. Whatever changes invalidates that segment and everything downstream of it. Swap a tool definition and you blow away the system and message caches too. Add a new image partway through messages and only the message segment dies.
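The invalidation cascade follows directly from byte-match semantics: each segment's cache key covers everything before it. A toy sketch of that idea (not the API's actual keying, just an illustration):

```python
import hashlib

def segment_keys(tools: str, system: str, messages: str) -> tuple:
    """Toy model: each segment's key hashes the whole prefix up to it."""
    h = lambda s: hashlib.sha256(s.encode()).hexdigest()[:8]
    return h(tools), h(tools + system), h(tools + system + messages)

base = segment_keys("toolA", "sys", "msg")
msg_edit = segment_keys("toolA", "sys", "msg-v2")  # only messages changed
tool_edit = segment_keys("toolB", "sys", "msg")    # a tool changed

# Editing messages leaves the tools and system segments intact...
print(base[:2] == msg_edit[:2], base[2] == msg_edit[2])  # True False
# ...but editing a tool invalidates every segment downstream of it.
print(base[0] == tool_edit[0], base[1] == tool_edit[1])  # False False
```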
You get up to 4 breakpoints per request. The system also walks backward up to 20 blocks per breakpoint hunting for a prior cache entry, which matters in long conversations: if your messages array grows past 20 blocks since the last write, you lose the hit unless you set a second breakpoint further back.
Minimum cacheable length depends on the model. Claude Sonnet 4.6 caches anything 1,024 tokens or longer. Claude Opus 4.7 and Haiku 4.5 require 4,096 tokens. Below the threshold the request runs normally but nothing gets cached, and cache_creation_input_tokens returns 0. Always check that field on your first request; silent no-cache is the most common reason teams think caching "isn't working."
Caching has two prices: a write premium the first time you create the cache entry, and a read discount every time you hit it. Here are the live numbers on the three current production models.
| Model | Base input | 5-min write | 1-hour write | Cache read | Output |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $6.25 | $10.00 | $0.50 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | $5.00 |
All prices per million tokens. The pattern is constant across models: 5-minute writes cost 1.25x base, 1-hour writes cost 2x base, and reads cost 0.1x base.
The break-even is fast: at the 5-minute tier a single read already recoups the write premium (the extra 0.25x base costs less than the 0.9x a read saves), and the 1-hour tier pays back after two reads. Anything beyond that is pure savings.
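The break-even arithmetic, using the multipliers from the pricing table (the helper function is our sketch, in units of base input price):

```python
import math

def breakeven_reads(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of cache reads after which total spend on the cached
    prefix drops below resending it fresh every time (base price = 1.0x)."""
    premium = write_multiplier - 1.0         # extra cost of the first write
    saving_per_read = 1.0 - read_multiplier  # each read costs 0.1x instead of 1x
    return math.ceil(premium / saving_per_read)

print(breakeven_reads(1.25))  # 5-minute tier: 1 read recoups the premium
print(breakeven_reads(2.0))   # 1-hour tier: 2 reads
```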
Real agent loop math on Sonnet 4.6: a typical agent has a 50,000-token system prompt, 8,000 tokens of tool definitions, and roughly 500 fresh user tokens per turn. Uncached, that's 58,500 input tokens at $3/M, or about $0.176 per turn ($176 per thousand turns). For background guidance on picking the right model for the loop in the first place, see our breakdown of when to use Opus vs Sonnet vs Haiku.
Cached, the first call costs about $0.219 (a one-time write premium on the 58,000-token stable prefix). Every following turn within 5 minutes costs about $0.019 ($19 per thousand turns). That's roughly 9x cheaper per cache hit. Run a chat product with 100 turns per session and you've cut input cost from about $17.55 per session to about $2.10. At a thousand sessions per month that's over $15,000 saved on input alone.
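The arithmetic behind those per-turn numbers, using the Sonnet 4.6 rates from the pricing table (variable names are ours):

```python
BASE, WRITE_5M, READ = 3.00 / 1e6, 3.75 / 1e6, 0.30 / 1e6  # $ per input token

prefix = 50_000 + 8_000  # system prompt + tool definitions, all stable
fresh = 500              # new user tokens each turn

uncached_turn = (prefix + fresh) * BASE          # every turn pays full freight
first_cached = prefix * WRITE_5M + fresh * BASE  # one-time write premium
warm_turn = prefix * READ + fresh * BASE         # every turn after that

print(f"{uncached_turn:.4f} {first_cached:.4f} {warm_turn:.4f}")
```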
For teams running multi-step LLM pipelines, caching also compounds with the disciplines covered in prompt engineering for senior software engineers: tighter prompts mean smaller stable prefixes, which hit cache thresholds faster.
Caching is not free. The 25% write premium is a real loss if you don't reuse the prefix. Use this rule:
Caching pays when:
- the same prefix comes back within the TTL window, at least enough times to recoup the write premium
- the stable portion of the request clears the model's minimum cacheable length
- you run loops: agent turns, chat sessions, repeated questions over the same context

Caching doesn't pay when:
- the prefix is unique per request (per-user data, timestamps, or shuffled JSON baked into the system prompt)
- the workload is single-shot and the cached prefix would never be read
The honest negative case: a single-shot Haiku classification call with a 5,000-token system prompt that runs once per user signup will pay the write premium and never read. Don't cache it. Move that workload to the Batch API or just eat the input cost.
The minimum viable pattern is one breakpoint on the system prompt. Here's TypeScript:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = [
  {
    type: "text" as const,
    text: longStableSystemPrompt, // 5,000+ tokens of guidance, examples, schema
    cache_control: { type: "ephemeral" as const },
  },
];

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: SYSTEM_PROMPT,
  messages: [{ role: "user", content: userMessage }],
});

console.log({
  read: response.usage.cache_read_input_tokens,
  write: response.usage.cache_creation_input_tokens,
  fresh: response.usage.input_tokens,
});
```
Run it twice with the same longStableSystemPrompt and the second call's cache_read_input_tokens will equal the system prompt size; cache_creation_input_tokens will be 0.
For Python with multiple breakpoints (tools + system + RAG context):
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=TOOLS,  # tools array, becomes the first cache segment
    system=[
        {
            "type": "text",
            "text": SYSTEM_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            "text": COMPANY_RAG_CONTEXT,  # rarely-changing knowledge base
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": conversation_so_far,
                    "cache_control": {"type": "ephemeral"},  # 5-min default
                },
                {"type": "text", "text": new_user_message},
            ],
        }
    ],
)

print(response.usage.model_dump_json(indent=2))
```
A few details that bite people. Order matters when mixing TTLs: 1-hour breakpoints must appear before 5-minute breakpoints in the prefix. Tool definition order matters too: if your serializer reorders JSON keys non-deterministically (some Python dict-to-JSON paths do), you'll never hit cache. For broader guidance on building reliable agents around this loop, see building your first AI agent with tool calling.
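The serializer trap is easy to demonstrate with plain `json` (the SDK does its own serialization, so the real fix is to build tool dicts in one canonical place, but the byte-level effect is the same):

```python
import json

a = {"name": "search", "description": "Web search"}
b = {"description": "Web search", "name": "search"}

# Same data, different insertion order -> different bytes -> cache miss.
print(json.dumps(a) == json.dumps(b))  # False

# Canonicalizing key order makes the payload byte-stable.
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # True
```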
Most production gains come from three patterns layered together.
Pattern 1: Cache the system prompt. This is the 80/20. Move every stable instruction, schema, persona, and rule into a single system block with cache_control. Push per-user dynamic data down into the user message. A well-cached system prompt of 10,000 tokens at Sonnet 4.6 costs $0.003 to read versus $0.030 fresh. If you're using Sonnet 4.6 in production for the first time, our Claude Sonnet 4.6 production guide covers the surrounding setup.
Pattern 2: Cache the tools array. Tools sit at the top of the prefix, so caching them protects everything below. If you have 15 tool definitions averaging 400 tokens each (6,000 total), that's about $0.016 saved per call on Sonnet 4.6 for every agent turn after the first (6,000 tokens at $0.30/M instead of $3/M). For agents that loop 5-20 times per task, this is real money.
Pattern 3: Cache static RAG context with a 1-hour TTL. When users grill the same PDF, video transcript, or codebase snapshot for many questions, paste the full text once with ttl: "1h". The first request pays 2x base. Every subsequent question in that hour pays 0.1x. Sonnet 4.6 with a 100,000-token doc: $0.60 first read, $0.03 every reread. Ten reads in an hour and you've spent $0.87 instead of $3.00.
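The doc-Q&A math from Pattern 3, spelled out with the Sonnet 4.6 rates from the pricing table (variable names are ours):

```python
DOC_TOKENS = 100_000
BASE, WRITE_1H, READ = 3.00 / 1e6, 6.00 / 1e6, 0.30 / 1e6  # $ per input token

first_question = DOC_TOKENS * WRITE_1H         # $0.60: 2x base on the 1-hour write
each_reread = DOC_TOKENS * READ                # $0.03 per later question
ten_cached = first_question + 9 * each_reread  # ten questions, cached
ten_uncached = 10 * DOC_TOKENS * BASE          # ten questions, no caching

print(f"${ten_cached:.2f} vs ${ten_uncached:.2f}")
```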
These patterns stack. An agent with cached tools, cached system, and cached RAG context can run hundreds of turns at near-zero input cost.
Every Cadence engineer is AI-native by default, vetted on Cursor and Claude Code fluency, prompt-as-spec discipline, and verification habits before they unlock bookings. Cache hygiene is part of that baseline. The rules that show up in code review:
- Keep timestamps out of the prefix. A `Current time: ...` line in the system prompt rotates the cache key every second and you pay the write premium forever. Push variable context into the user message.
- Pre-warm on deploy. Fire a `max_tokens: 0` request with the cached prefix when the server starts. The first real user request hits a warm cache.
- Track your hit rate. Compute `cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens)` per request and chart it. If it drops below 70% on a stable workload, something just changed your prefix.

These show up in PRs as one-line diffs that quietly cost a team thousands of dollars. The engineers who catch them before merge are the ones who treat the API bill like infrastructure cost, not a vendor fee.
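That hit-rate metric is worth standardizing as a helper. A minimal sketch (the function and the SimpleNamespace stand-in are ours; the field names are the SDK's usage fields):

```python
from types import SimpleNamespace

def cache_hit_rate(usage) -> float:
    """Share of input tokens served from cache for one response.
    Works on any object exposing the Messages API usage fields."""
    read = usage.cache_read_input_tokens or 0
    created = usage.cache_creation_input_tokens or 0
    fresh = usage.input_tokens or 0
    total = read + created + fresh
    return read / total if total else 0.0

# Example with a stand-in for response.usage on a warm agent turn:
warm = SimpleNamespace(cache_read_input_tokens=58_000,
                       cache_creation_input_tokens=0,
                       input_tokens=500)
print(f"{cache_hit_rate(warm):.2%}")  # ~99% of input tokens came from cache
```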
If you're hiring for an AI product team and want this kind of API discipline as a baseline rather than a coaching project, book a senior engineer on Cadence for the 48-hour free trial. Senior tier is $1,500/week and most senior engineers on the platform have shipped production prompt-caching at least once.
A literal install-to-optimize sequence for adding prompt caching to an existing Anthropic SDK app.
1. Install or update the SDK: `npm install @anthropic-ai/sdk` (TypeScript) or `pip install anthropic` (Python). Pin to a recent version that supports the `cache_control` block: SDK >= 0.40 for both languages.
2. Audit your `messages.create` call. Find the longest content that doesn't change per request: usually the system prompt, the tools array, or a static RAG document. Confirm it's above the model minimum (1,024 tokens for Sonnet 4.6, 4,096 for Opus 4.7 and Haiku 4.5).
3. Convert the `system` argument from a string to an array of content blocks, then add `cache_control: { type: "ephemeral" }` on the last stable block. Do not parameterize the cached blocks.
4. Log `response.usage.cache_read_input_tokens`, `cache_creation_input_tokens`, and `input_tokens` on every call. Send the first request to seed the cache, then send a second identical-prefix request and confirm `cache_read_input_tokens` matches the prefix size and `cache_creation_input_tokens` is 0.
5. Pre-warm on deploy with a `max_tokens: 0` request carrying the cached prefix. This eliminates first-request latency for users.
6. Pin the model: change `model: "claude-sonnet-4-6"` to a version-pinned identifier in production. A model auto-bump silently invalidates every cache entry across your fleet.

Trying to decide whether to ship a caching pass yourself, hand it to an engineer, or buy a managed agent platform? Run it through Build, Buy, or Book for a one-screen recommendation grounded in your stack and timeline.
**Does prompt caching work with streaming?** Yes for normal calls. The one exception is the pre-warming pattern with max_tokens: 0, which the API rejects when stream: true. Run pre-warming as a non-streaming call and stream your real responses normally.
**How long does the cache last?** 5 minutes by default, refreshed every time you read the entry. 1 hour is available by passing ttl: "1h" on the cache_control block, billed at 2x base input on the write. Bedrock supports 5-minute only; the Claude API, Vertex AI, and Microsoft Foundry support both.
**Can you cache tool definitions?** Yes. Tools are cacheable and sit at the top of the prefix hierarchy, so caching them protects every system and message cache below them. If your tools change between turns (which is rare), you'll invalidate the entire cache.
**How do you verify caching is working?** Inspect response.usage. After your first request, cache_creation_input_tokens should equal the size of your cached prefix and cache_read_input_tokens should be 0. After your second request, those flip: cache_read_input_tokens should equal the prefix size, cache_creation_input_tokens should be 0. If both are 0 on every call, your prefix is below the model minimum or it's changing between requests.
**Does caching change the model's output?** No. The response is identical to a non-cached call, token-for-token. Caching only changes encoding cost and time-to-first-token. This matters for evals: you do not need to re-baseline quality after enabling caching.
**Which model should you pick for a cached workload?** Sonnet 4.6 is the sweet spot for most production agent loops because the cached read price ($0.30/M) makes long stable system prompts almost free. Use Opus 4.7 when reasoning quality is the bottleneck. Use Haiku 4.5 for high-volume classification where the system prompt is short.