May 14, 2026 · 11 min read · Cadence Editorial

Prompt caching with Anthropic API: complete guide

Prompt caching with the Anthropic API lets you mark a stable prefix of your request so Claude reads it from cache on the next call instead of re-encoding it. Cached input tokens cost 10% of base input pricing (a 90% discount), with a 25% premium on the first write. Default TTL is 5 minutes; 1-hour TTL is available at 2x write cost. The mechanism is literal byte-match on the prefix, not semantic similarity.

If you ship anything that calls Claude in a loop (an agent, a chat product, a RAG pipeline), prompt caching is the single biggest cost lever in the API. Done right, it turns a $3.50-per-million-token agent loop into a $0.30 one. Done wrong, you pay the 25% write premium on every request and never see a hit.

This guide is the working playbook: how the four cache breakpoints behave, the real per-million-token math, working TypeScript and Python, the three patterns that compound, the cache hygiene rules that stop you from invalidating yourself by accident, and a literal install-to-measure sequence at the end.

How prompt caching with the Anthropic API actually works

You opt into caching by adding a cache_control: { "type": "ephemeral" } marker on a content block. That marker is a breakpoint. Everything in the request from the start up to and including that block becomes the cache key. On the next call, if the same prefix arrives byte-for-byte, Claude skips encoding it and reads from cache.

There are four places you can drop a breakpoint:

  1. Tools: the entire tools array, useful for agents with stable tool definitions.
  2. System: any block in the system array, the most common cache target.
  3. Messages: text blocks inside messages.content, including assistant turns and tool results.
  4. Raw text content blocks: long pasted documents, RAG context, few-shot examples.

The order of the prefix is fixed: tools -> system -> messages. Any change invalidates the segment it touches and everything downstream of it. Swap a tool definition and you blow away the system and message caches too. Add a new image partway through messages and only the message segment dies.

You get up to 4 breakpoints per request. The system also walks backward up to 20 blocks per breakpoint hunting for a prior cache entry, which matters in long conversations: if your messages array grows past 20 blocks since the last write, you lose the hit unless you set a second breakpoint further back.
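
In code, that second breakpoint usually means marking the newest message block on every turn. A minimal sketch, assuming a stable SYSTEM_PROMPT block list (with its own breakpoint, as in the examples further down) and a send_turn helper of our own invention:

import anthropic

client = anthropic.Anthropic()

def send_turn(history, new_user_message):
    # history: the prior turns as message dicts. The new user block carries the
    # breakpoint, so the previous turn's cache entry is found during the backward
    # scan and a new, longer entry is written for the next turn.
    messages = history + [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": new_user_message,
            "cache_control": {"type": "ephemeral"},
        }],
    }]
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM_PROMPT,  # assumed: stable system blocks with their own breakpoint
        messages=messages,
    )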

Minimum cacheable length depends on the model. Claude Sonnet 4.6 caches anything 1,024 tokens or longer. Claude Opus 4.7 and Haiku 4.5 require 4,096 tokens. Below the threshold the request runs normally but nothing gets cached, and cache_creation_input_tokens returns 0. Always check that field on your first request; silent no-cache is the most common reason teams think caching "isn't working."
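
A small guard for that case, as a sketch: after your first cached request, log a warning if the usage fields show no cache activity at all.

def warn_if_not_cached(response):
    # First-request sanity check: if nothing was written or read, the prefix is
    # below the model minimum or the cache_control marker never applied.
    usage = response.usage
    wrote = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if wrote == 0 and read == 0:
        print("warning: no cache activity -- prefix too short or breakpoint missing")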

The economics: 90% read discount vs 25% write premium

Caching has two prices: a write premium the first time you create the cache entry, and a read discount every time you hit it. Here are the live numbers on the three current production models.

Model               Base input   5-min write   1-hour write   Cache read   Output
Claude Opus 4.7     $5.00        $6.25         $10.00         $0.50        $25.00
Claude Sonnet 4.6   $3.00        $3.75         $6.00          $0.30        $15.00
Claude Haiku 4.5    $1.00        $1.25         $2.00          $0.10        $5.00

All prices per million tokens. The pattern is constant across models: 5-minute writes cost 1.25x base, 1-hour writes cost 2x base, and reads cost 0.1x base.

The break-even: a cache write pays for itself after a single read at the 5-minute tier (1.25x write + 0.1x read = 1.35x, versus 2x for two uncached passes) and after two reads at the 1-hour tier (2x + 0.2x = 2.2x, versus 3x for three uncached passes). Anything beyond that is pure savings.
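
The arithmetic, as a sketch you can rerun with your own call pattern (multipliers from the table above, normalized to a base input price of 1):

WRITE_5M, WRITE_1H, READ, BASE = 1.25, 2.0, 0.10, 1.0

def cached(write_mult, reads):
    return write_mult + READ * reads       # one write, then n cached reads

def uncached(reads):
    return BASE * (1 + reads)              # same prefix sent fresh every time

print(cached(WRITE_5M, 1), uncached(1))    # 1.35 vs 2.0 -> cheaper after one read
print(cached(WRITE_1H, 2), uncached(2))    # 2.20 vs 3.0 -> cheaper after two reads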

Real agent loop math on Sonnet 4.6: a typical agent has a 50,000-token system prompt, 8,000 tokens of tool definitions, and roughly 500 fresh user tokens per turn. Uncached, that's 58,500 input tokens at $3/M, or about $0.176 per turn ($176 per thousand turns). For background guidance on picking the right model for the loop in the first place, see our breakdown of when to use Opus vs Sonnet vs Haiku.

Cached, the first call costs about $0.219 (a one-time write premium). Every following turn within 5 minutes costs about $0.019 ($19 per thousand turns). That's roughly 9x cheaper per cache hit. Run a chat product with 100 turns per session and you've cut input cost from roughly $17.60 per session to about $2.10. At a thousand sessions per month that's over $15,000 saved on input alone.
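
The same math as a runnable sketch; swap in your own token counts (prices are Sonnet 4.6 from the table above, in dollars per million tokens):

PRICE_IN, PRICE_WRITE, PRICE_READ = 3.00, 3.75, 0.30

prefix = 50_000 + 8_000     # system prompt + tool definitions
fresh = 500                 # new user tokens per turn

uncached_turn = (prefix + fresh) * PRICE_IN / 1e6                    # ~$0.176
first_turn = (prefix * PRICE_WRITE + fresh * PRICE_IN) / 1e6         # ~$0.219
warm_turn = (prefix * PRICE_READ + fresh * PRICE_IN) / 1e6           # ~$0.019

session = first_turn + 99 * warm_turn                                # ~$2.09 for 100 turns
print(uncached_turn, first_turn, warm_turn, session)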

For teams running multi-step LLM pipelines, caching also compounds with the disciplines covered in prompt engineering for senior software engineers: tighter prompts mean smaller stable prefixes, which hit cache thresholds faster.

When prompt caching pays off (and when it doesn't)

Caching is not free. The 25% write premium is a real loss if you don't reuse the prefix. Use this rule:

Caching pays when:

  • Your system prompt is stable across requests (not parameterized per user).
  • You ship few-shot examples with every call (20+ examples beats RAG for some tasks).
  • You define tools that don't change between turns (most agents).
  • You inject static RAG context that's hot for many queries (a single document users grill for 10+ questions).
  • You run multi-turn conversations where message history grows.

Caching doesn't pay when:

  • Inputs are highly varied per call (one-shot extraction, classification with no shared system).
  • Your calls arrive more than 5 minutes apart, so the cache window expires before the next read.
  • The cacheable prefix is below the model minimum (1,024 tokens for Sonnet, 4,096 for Opus and Haiku).
  • You parameterize the system prompt with per-user data (drop user data into messages instead).

The honest negative case: a single-shot Haiku classification call with a 2,000-token system prompt that runs once per user signup will pay the write premium and never read. Don't cache it. Move that workload to the Batch API or just eat the input cost.

Working code: TypeScript and Python

The minimum viable pattern is one breakpoint on the system prompt. Here's TypeScript:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = [
  {
    type: "text" as const,
    text: longStableSystemPrompt, // 5,000+ tokens of guidance, examples, schema
    cache_control: { type: "ephemeral" as const },
  },
];

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: SYSTEM_PROMPT,
  messages: [{ role: "user", content: userMessage }],
});

console.log({
  read: response.usage.cache_read_input_tokens,
  write: response.usage.cache_creation_input_tokens,
  fresh: response.usage.input_tokens,
});

Run it twice with the same longStableSystemPrompt and the second call's cache_read_input_tokens will equal the system prompt size; cache_creation_input_tokens will be 0.

For Python with multiple breakpoints (tools + system + RAG context):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=TOOLS,  # tools array; first in the prefix, covered by the system breakpoint below
    system=[
        {
            "type": "text",
            "text": SYSTEM_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            "text": COMPANY_RAG_CONTEXT,  # rarely-changing knowledge base
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": conversation_so_far,
                 "cache_control": {"type": "ephemeral"}},  # 5-min default
                {"type": "text", "text": new_user_message},
            ],
        }
    ],
)

print(response.usage.model_dump_json(indent=2))

A few details that bite people. Order matters when mixing TTLs: 1-hour breakpoints must appear before 5-minute breakpoints in the prefix. Tool definition order matters too: if your serializer reorders JSON keys non-deterministically (some Python dict-to-JSON paths do), you'll never hit cache. For broader guidance on building reliable agents around this loop, see building your first AI agent with tool calling.
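
If your tools are built from dicts at runtime, one way to make serialization deterministic is to canonicalize the schemas before passing them to the SDK. A sketch, assuming RAW_TOOLS is whatever your code generates today:

import json

def canonical_tools(raw_tools):
    # Round-trip through JSON with sorted keys so the tools array serializes
    # byte-identically across processes and deploys (dicts keep insertion order).
    return json.loads(json.dumps(raw_tools, sort_keys=True))

TOOLS = canonical_tools(RAW_TOOLS)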

Three implementation patterns that compound

Most production gains come from three patterns layered together.

Pattern 1: Cache the system prompt. This is the 80/20. Move every stable instruction, schema, persona, and rule into a single system block with cache_control. Push per-user dynamic data down into the user message. A well-cached system prompt of 10,000 tokens at Sonnet 4.6 costs $0.003 to read versus $0.030 fresh. If you're using Sonnet 4.6 in production for the first time, our Claude Sonnet 4.6 production guide covers the surrounding setup.

Pattern 2: Cache the tools array. Tools sit at the top of the prefix, so caching them protects everything below. If you have 15 tool definitions averaging 400 tokens each (6,000 total), that's about $0.016 saved per call on Sonnet 4.6 for every agent turn after the first ($0.018 fresh versus $0.0018 cached). For agents that loop 5-20 times per task, this is real money.
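
The breakpoint for this pattern goes on the last tool definition, which makes the whole tools array the cached segment. A sketch, assuming the client, TOOLS, and SYSTEM_PROMPT from the Python example above and a user_message string:

TOOLS[-1]["cache_control"] = {"type": "ephemeral"}  # marker on the final tool caches the whole array

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=TOOLS,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_message}],
)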

Pattern 3: Cache static RAG context with a 1-hour TTL. When users grill the same PDF, video transcript, or codebase snapshot for many questions, paste the full text once with ttl: "1h". The first request pays 2x base; every subsequent question in that hour pays 0.1x. On Sonnet 4.6 with a 100,000-token doc that's $0.60 for the first request (the write) and $0.03 per reread. Ten requests in an hour and you've spent $0.87 instead of $3.00.
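
A sketch of that shape, assuming document_text holds the pasted document, user_question changes per request, and SYSTEM_INSTRUCTIONS comes from the Python example above:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{"type": "text", "text": SYSTEM_INSTRUCTIONS,
             "cache_control": {"type": "ephemeral", "ttl": "1h"}}],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": document_text,       # the 100,000-token document
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            {"type": "text", "text": user_question},      # the only part that changes
        ],
    }],
)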

These compound multiplicatively. An agent with cached tools, cached system, and cached RAG context can run hundreds of turns at near-zero input cost.

Cache hygiene: the rules AI-native engineers internalize

Every Cadence engineer is AI-native by default, vetted on Cursor and Claude Code fluency, prompt-as-spec discipline, and verification habits before they unlock bookings. Cache hygiene is part of that baseline. The rules that show up in code review:

  1. No timestamps, request IDs, or per-user data in the cached prefix. A Current time: ... line in the system prompt rotates the cache key every second and you pay the write premium forever. Push variable context into the user message.
  2. Stable JSON key order in tools. If your tool schema serializer is non-deterministic, fix the serializer or hand-write the tool array.
  3. Pre-warm at app boot. Send one max_tokens: 0 request with the cached prefix when the server starts. The first real user request hits a warm cache.
  4. Treat cache hit-rate as a tier-1 metric. Log cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens) per request and chart it (a minimal logging sketch follows this list). If it drops below 70% on a stable workload, something just changed your prefix.
  5. Re-warm before the TTL expires. A 5-minute cache for a low-traffic endpoint is dead by the time the next user shows up. Either schedule a warm ping every 4 minutes or jump to 1-hour TTL.
  6. Watch for silent invalidators. Toggling tool_choice, swapping model versions, adding images, or changing thinking parameters all rotate part or all of your cache keys: tool_choice, image, and thinking changes typically invalidate the message-level breakpoints, while a model change invalidates everything. Pin model versions in production code.
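
Rule 4 in code, as a sketch against the response.usage object the SDK returns:

def cache_hit_rate(usage):
    # Fraction of input tokens served from cache on a single request.
    read = usage.cache_read_input_tokens or 0
    wrote = usage.cache_creation_input_tokens or 0
    fresh = usage.input_tokens or 0
    total = read + wrote + fresh
    return read / total if total else 0.0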

These show up in PRs as one-line diffs that quietly cost a team thousands of dollars. The engineers who catch them before merge are the ones who treat the API bill like infrastructure cost, not a vendor fee.

If you're hiring for an AI product team and want this kind of API discipline as a baseline rather than a coaching project, book a senior engineer on Cadence for the 48-hour free trial. Senior tier is $1,500/week and most senior engineers on the platform have shipped production prompt-caching at least once.

Steps

A literal install-to-optimize sequence for adding prompt caching to an existing Anthropic SDK app.

  1. Install the SDK. Run npm install @anthropic-ai/sdk (TypeScript) or pip install anthropic (Python). Pin to a recent version that supports the cache_control block: SDK >= 0.40 for both languages.
  2. Identify your stable prefix. Open your existing messages.create call. Find the longest content that doesn't change per request: usually the system prompt, the tools array, or a static RAG document. Confirm it's above the model minimum (1,024 tokens for Sonnet 4.6, 4,096 for Opus 4.7 and Haiku 4.5).
  3. Add the cache_control marker. Convert your system argument from a string to an array of content blocks, then add cache_control: { type: "ephemeral" } on the last stable block. Do not parameterize the cached blocks.
  4. Measure the hit rate. Log response.usage.cache_read_input_tokens, cache_creation_input_tokens, and input_tokens on every call. Send the first request to seed the cache, then send a second identical-prefix request and confirm cache_read_input_tokens matches the prefix size and cache_creation_input_tokens is 0 (a sketch of this check follows the list).
  5. Pre-warm at boot. Add a startup hook that sends one max_tokens: 0 request with the cached prefix. This eliminates first-request latency for users.
  6. Optimize breakpoint placement. If you're caching multiple sections that change at different speeds, add up to 4 breakpoints. Put 1-hour TTL blocks (RAG context) first, 5-minute TTL blocks (conversation history) last. Confirm hit rate stays above 70% in production.
  7. Pin model versions. Switch model: "claude-sonnet-4-6" to a version-pinned identifier in production. A model auto-bump silently invalidates every cache entry across your fleet.
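
Step 4's check, as a sketch reusing the client and SYSTEM_PROMPT from the code section above: seed the cache once, then confirm the second call reads instead of writes.

seed = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=16,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "ping"}],
)
check = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=16,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "ping again"}],
)

# Second call with the same prefix should read the cache, not rebuild it.
assert check.usage.cache_read_input_tokens > 0
assert check.usage.cache_creation_input_tokens == 0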

Trying to decide whether to ship a caching pass yourself, hand it to an engineer, or buy a managed agent platform? Run it through Build, Buy, or Book for a one-screen recommendation grounded in your stack and timeline.

FAQ

Does prompt caching work with streaming responses?

Yes for normal calls. The one exception is the pre-warming pattern with max_tokens: 0, which the API rejects when stream: true. Run pre-warming as a non-streaming call and stream your real responses normally.

How long does the cache last?

5 minutes by default, refreshed every time you read the entry. 1 hour is available by passing ttl: "1h" on the cache_control block, billed at 2x base input on the write. Bedrock supports 5-minute only; the Claude API, Vertex AI, and Microsoft Foundry support both.

Can I cache tool definitions?

Yes. Tools are one of the four cacheable types and sit at the top of the prefix hierarchy. Caching them protects every system and message cache below them. If your tools change between turns (which is rare), you'll invalidate the entire cache.

How do I know if my prompt was actually cached?

Inspect response.usage. After your first request, cache_creation_input_tokens should equal the size of your cached prefix and cache_read_input_tokens should be 0. After your second request, those flip: cache_read_input_tokens should equal the prefix size, cache_creation_input_tokens should be 0. If both are 0 on every call, your prefix is below the model minimum or it's changing between requests.

Does prompt caching change the model's output?

No. A cached prefix is processed exactly as if it were sent fresh, so output quality is unchanged; caching only affects input cost and time-to-first-token. This matters for evals: you do not need to re-baseline quality after enabling caching.

What's the right model to pair with prompt caching?

Sonnet 4.6 is the sweet spot for most production agent loops because the cached read price ($0.30/M) makes long stable system prompts almost free. Use Opus 4.7 when reasoning quality is the bottleneck. Use Haiku 4.5 for high-volume classification where the system prompt is short.
