cheat sheet

Claude API

Anthropic prompt caching — `cache_control`, 5-minute TTL, 1-hour beta, cache breakpoints, cost math, multi-turn caching patterns, and recipes for system prompts, tools, and documents.

updated 05-25-2026

Claude API — Prompt Caching

What it is

Prompt caching lets Claude reuse a previously seen prefix of your request (system prompt, tool definitions, large documents) at ~10% of the standard input cost and a fraction of the latency. You mark the end of a cacheable prefix with cache_control: { type: "ephemeral" }; the first request writes the cache (at +25% input cost), every subsequent request that begins with the same bytes within the TTL reads the cache (at 10% input cost). Reach for it whenever the same large context is repeated across multiple turns or users — chat agents over a fixed knowledge base, RAG pipelines, agent loops with stable tool definitions, document-grounded Q&A.

Cost model

Caching changes input pricing only — output tokens are unchanged.

Operation	Input cost (relative to base)
Cache write (5-min TTL)	1.25×
Cache write (1-hour TTL, beta)	2.00×
Cache read	0.10×
Non-cached input	1.00×

Net win: a prefix that is reused once within the TTL already pays for itself (1.25 + 0.10 = 1.35 vs 2× without caching). The more reuse, the bigger the savings. A 10× reuse drops effective input cost to 0.225× — a ~78% reduction.

TTL options

TTL	Header / param	Use case
5 minutes (default)	`cache_control: { type: "ephemeral" }`	Short conversation, single user session
1 hour (beta)	`cache_control: { type: "ephemeral", ttl: "1h" }` + beta header	Multi-user fanout, longer agent runs

The 1-hour TTL is in beta — pass anthropic-beta: extended-cache-ttl-2025-04-11 on the request. Pricing for the longer TTL is 2× write (vs 1.25× for the 5-minute default), so it only pays off above ~6× reuse.

Where you can place cache breakpoints

You may put cache_control on any of these content blocks. A request may have up to 4 breakpoints; each breakpoint defines the end of a cacheable prefix.

Location	Supported
`system` (when an array of blocks)	Yes
`tools[*]` (last tool's definition)	Yes
`messages[].content[]` (text, image, document)	Yes
Top-level string `system`	No — must be array form
Inside `tool_result.content`	Yes (treated as a message content block)

Minimum cacheable size

Caches only kick in above a minimum prefix length — short prompts cannot be cached.

Model	Minimum tokens to cache
`claude-opus-4-7`	1024
`claude-sonnet-4-6`	1024
`claude-haiku-4-5`	2048

If your would-be cached prefix is shorter than this, the API silently bills it as a regular write — no error, but no later read discount either.

Cache scope

Cache hits require an exact byte-for-byte match of the prefix up to the breakpoint. This includes system prompt, every tool definition (in order), and every message before the breakpoint. Changing even a single character before the breakpoint busts the cache.

What's part of the key	Notes
Model	Different model = different cache
Tools	Definitions in order, schema bytes
System prompt	Full text content
Messages before the breakpoint	Including roles and types
Image bytes	Hashed identically
Cache breakpoint position	Two breakpoints at different positions = two caches

Caches are scoped to your organization. They are not shared across orgs or workspaces; another team writing the same prefix does not benefit your reads.

Basic — system prompt

The simplest pattern: a long system prompt or knowledge base in the system field, marked cacheable.

python

import anthropic

client = anthropic.Anthropic()

with open("knowledge_base.md") as f:
    kb = f.read()    # ~30,000 tokens of stable docs

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a docs assistant."},
        {
            "type": "text",
            "text": kb,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What is the retry policy?"}],
)
print(response.usage)

Output (first call — writes cache):

text

Usage(input_tokens=24, output_tokens=98, cache_creation_input_tokens=30000, cache_read_input_tokens=0)

Output (second call within 5 min, different user question):

text

Usage(input_tokens=24, output_tokens=104, cache_creation_input_tokens=0, cache_read_input_tokens=30000)

TypeScript

Same shape — system is an array of blocks, the last cacheable block carries cache_control.

typescript

import Anthropic from "@anthropic-ai/sdk";
import fs from "node:fs/promises";

const client = new Anthropic();
const kb = await fs.readFile("knowledge_base.md", "utf8");

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  system: [
    { type: "text", text: "You are a docs assistant." },
    { type: "text", text: kb, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: "What is the retry policy?" }],
});

console.log(response.usage);

Output:

text

{
  input_tokens: 24,
  output_tokens: 98,
  cache_creation_input_tokens: 30000,
  cache_read_input_tokens: 0
}

Caching tool definitions

When you call the API with the same tool list every turn, mark the last tool's definition with cache_control. The prefix that includes all tools (in their declared order) becomes cacheable.

python

tools = [
    {
        "name": "search_docs",
        "description": "Search documentation by keyword.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "open_file",
        "description": "Open a file at a path.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "cache_control": {"type": "ephemeral"},   # last tool — cache point
    },
]

The prefix being cached: model + all tools above (in this order) + everything before any later breakpoint.

Caching documents

For document-grounded Q&A, place each large document as a cached prefix and let the user question be the volatile suffix.

python

import base64

with open("annual_report.pdf", "rb") as f:
    pdf = base64.standard_b64encode(f.read()).decode("ascii")

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf},
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "What was the gross margin in Q3?"},
        ],
    }],
)

The next user question about the same PDF (within 5 minutes) reads the cached PDF tokens at 10% cost.

Multiple breakpoints — system + tools + doc

Up to 4 breakpoints can layer caches with different update frequencies. The classic three-layer setup:

Most stable — system prompt (changes rarely)
Semi-stable — tools (changes per deployment)
Per-session — user-uploaded document

When the document changes, layers 1 and 2 still cache-hit; only the document portion is re-written.

python

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {"type": "text", "text": "You are an analyst that uses tools."},
        {
            "type": "text",
            "text": LONG_PERSONA_AND_FORMATTING_RULES,
            "cache_control": {"type": "ephemeral"},   # breakpoint 1
        },
    ],
    tools=[
        *TOOLS_PREFIX,
        {**LAST_TOOL, "cache_control": {"type": "ephemeral"}},   # breakpoint 2
    ],
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf},
                "cache_control": {"type": "ephemeral"},                # breakpoint 3
            },
            {"type": "text", "text": user_question},                   # volatile suffix
        ],
    }],
)

Multi-turn conversation caching

In a chat, place the cache breakpoint on the assistant's last response in the history. The whole conversation up to and including that turn becomes cacheable; only the new user message is incremental.

python

messages = [
    {"role": "user", "content": "Hi, can you help me debug this Python?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Sure — paste the code and the error you're seeing."}
    ]},
    {"role": "user", "content": "Here is the code: …\nError: …"},
    {"role": "assistant", "content": [
        {
            "type": "text",
            "text": "I think the issue is the off-by-one in the loop. Try …",
            "cache_control": {"type": "ephemeral"},     # cache up to here
        },
    ]},
    {"role": "user", "content": "That fixed it — now how do I refactor for clarity?"},
]

Subsequent calls within 5 minutes pay for the new user turn only at full price; everything else hits the cache.

Reading cache metrics

response.usage always reports both write and read counts. Use them to compute hit ratio and the effective cost saving.

python

def hit_ratio(usage) -> float:
    total = usage.cache_creation_input_tokens + usage.cache_read_input_tokens + usage.input_tokens
    if total == 0:
        return 0.0
    return usage.cache_read_input_tokens / total

def effective_input_cost(usage, base_in_per_mtok: float) -> float:
    total_dollars = (
        usage.input_tokens * 1.0
        + usage.cache_creation_input_tokens * 1.25
        + usage.cache_read_input_tokens * 0.10
    ) * base_in_per_mtok / 1_000_000
    return total_dollars

print(f"hit ratio: {hit_ratio(response.usage):.1%}")
print(f"input cost: ${effective_input_cost(response.usage, 15.0):.6f}")

Output:

text

hit ratio: 99.9%
input cost: $0.000810

1-hour TTL beta

Pass the beta header to switch the TTL from 5 minutes to 1 hour. The write cost rises from 1.25× to 2.00×, so this only pays off above ~6× reuse — but it eliminates re-warming for low-traffic agents.

python

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[
        {"type": "text", "text": "..."},
        {
            "type": "text",
            "text": LARGE_PROMPT,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    messages=[{"role": "user", "content": "..."}],
)

typescript

const response = await client.messages.create(
  {
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: [
      { type: "text", text: "..." },
      { type: "text", text: LARGE_PROMPT, cache_control: { type: "ephemeral", ttl: "1h" } },
    ],
    messages: [{ role: "user", content: "..." }],
  },
  { headers: { "anthropic-beta": "extended-cache-ttl-2025-04-11" } },
);

Refreshing the TTL

Every cache read resets the TTL countdown to the original value (5 min or 1 h). So a chat that gets at least one message every few minutes keeps its cache hot indefinitely without re-writing. The cache is only ever evicted when the full TTL elapses without a single read.

Cache hit ratio in practice

python

hits = 0
total = 0

for question in QUESTIONS:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=[
            {"type": "text", "text": "You are a docs assistant."},
            {"type": "text", "text": KB, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )
    u = resp.usage
    if u.cache_read_input_tokens > 0:
        hits += 1
    total += 1

print(f"cache hit ratio: {hits / total:.1%}")

Output:

text

cache hit ratio: 96.0%

Batch API + caching

Cache discounts apply inside the Batch API too, and stack with the 50% batch discount. The cache is warmed by the first request in the batch that processes it; the rest read.

python

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"q-{i}",
            "params": {
                "model": "claude-opus-4-7",
                "max_tokens": 256,
                "system": [
                    {"type": "text", "text": "You answer questions about the policy doc."},
                    {"type": "text", "text": POLICY_DOC, "cache_control": {"type": "ephemeral"}},
                ],
                "messages": [{"role": "user", "content": q}],
            },
        }
        for i, q in enumerate(QUESTIONS)
    ],
)

The first request in the batch pays the cache-write price; the remaining ~9,999 read at 10% × 50% = effectively 5% of base input cost.

Beta vs GA

Prompt caching itself is GA (generally available) and stable. What is still in beta:

Feature	Status	How to enable
5-minute ephemeral cache	GA	`cache_control: { type: "ephemeral" }`
1-hour ephemeral cache	Beta	`extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"}` + `ttl: "1h"`
Caching with Files API documents	GA	Default once a file is referenced
Caching encrypted thinking blocks	Beta	Beta header on extended-thinking

Common pitfalls

Pitfall	Symptom	Fix
Prefix < 1024 tokens (Sonnet/Opus)	Always `cache_creation` 0, no read either	Pad with stable content or skip caching
Modifying anything before the breakpoint	Cache miss every call	Keep stable content before breakpoints; volatile after
Cache control on every block	More breakpoints than allowed	Max 4 breakpoints per request — put one per layer
Using `system` as a string	Cannot attach cache_control	Convert to `[{"type": "text", "text": "..."}]` array
Reordering tools randomly	Constant cache misses	Always pass tools in the same order; sort them
Forgetting cache writes cost extra	Bill surprise on first call	Expect 1.25× input on warmup; amortise over reuse
1-hour TTL without beta header	API silently downgrades to 5 min	Always send `anthropic-beta: extended-cache-ttl-2025-04-11`
Two breakpoints at same position	One is wasted	Only place breakpoints where the prefix actually diverges later
Per-user content before breakpoint	Per-user caches, low hit ratio	Move per-user content after the last breakpoint

Common recipes

Conditional caching

Only enable caching when the prefix is large enough to justify the 1.25× write premium.

python

MIN_CACHE_TOKENS = 1024

def maybe_cache(text: str) -> dict:
    block = {"type": "text", "text": text}
    # Rough estimate: ~4 chars per token
    if len(text) >= MIN_CACHE_TOKENS * 4:
        block["cache_control"] = {"type": "ephemeral"}
    return block

Cache warmer

Some agents call once on startup just to seed the cache, then defer real user calls.

python

def warm_cache(system_blocks: list[dict]) -> None:
    client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1,
        system=system_blocks,
        messages=[{"role": "user", "content": "warm"}],
    )

warm_cache(SYSTEM_WITH_CACHE_CONTROL)
print("cache warm — ready for traffic")

Output:

text

cache warm — ready for traffic

Hit-ratio dashboard

python

from collections import Counter

def log_usage(usage, counter: Counter) -> None:
    counter["input"]          += usage.input_tokens
    counter["cache_creation"] += usage.cache_creation_input_tokens
    counter["cache_read"]     += usage.cache_read_input_tokens
    counter["output"]         += usage.output_tokens

def summary(counter: Counter, in_price_per_mtok: float, out_price_per_mtok: float) -> dict:
    input_cost = (
        counter["input"] * 1.0
        + counter["cache_creation"] * 1.25
        + counter["cache_read"] * 0.10
    ) * in_price_per_mtok / 1_000_000
    output_cost = counter["output"] * out_price_per_mtok / 1_000_000
    return {
        "hit_ratio": counter["cache_read"] / max(counter["cache_read"] + counter["input"], 1),
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }

Two-layer cache (persona + doc)

python

def make_request(persona: str, doc: str, question: str):
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {"type": "text", "text": persona, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Document:\n{doc}", "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": f"Question: {question}"},
            ],
        }],
    )

The persona caches across all documents; each document caches across all questions about that document. A new persona busts only layer 1; a new document busts only layer 2.

Claude API — Prompt Caching

What it is

Cost model

TTL options

Where you can place cache breakpoints

Minimum cacheable size

Cache scope

Basic — system prompt

TypeScript

Caching tool definitions

Caching documents

Multiple breakpoints — system + tools + doc

Multi-turn conversation caching

Reading cache metrics

1-hour TTL beta

Refreshing the TTL

Cache hit ratio in practice

Batch API + caching

Beta vs GA

Common pitfalls

Common recipes

Conditional caching

Cache warmer

Hit-ratio dashboard

Two-layer cache (persona + doc)

See also