cheat sheet

Claude API

Anthropic prompt caching — `cache_control`, 5-minute TTL, 1-hour beta, cache breakpoints, cost math, multi-turn caching patterns, and recipes for system prompts, tools, and documents.

Claude API — Prompt Caching

What it is

Prompt caching lets Claude reuse a previously seen prefix of your request (system prompt, tool definitions, large documents) at ~10% of the standard input cost and a fraction of the latency. You mark the end of a cacheable prefix with cache_control: { type: "ephemeral" }; the first request writes the cache (at +25% input cost), every subsequent request that begins with the same bytes within the TTL reads the cache (at 10% input cost). Reach for it whenever the same large context is repeated across multiple turns or users — chat agents over a fixed knowledge base, RAG pipelines, agent loops with stable tool definitions, document-grounded Q&A.

Cost model

Caching changes input pricing only — output tokens are unchanged.

OperationInput cost (relative to base)
Cache write (5-min TTL)1.25×
Cache write (1-hour TTL, beta)2.00×
Cache read0.10×
Non-cached input1.00×

Net win: a prefix that is reused once within the TTL already pays for itself (1.25 + 0.10 = 1.35 vs 2× without caching). The more reuse, the bigger the savings. A 10× reuse drops effective input cost to 0.225× — a ~78% reduction.

TTL options

TTLHeader / paramUse case
5 minutes (default)cache_control: { type: "ephemeral" }Short conversation, single user session
1 hour (beta)cache_control: { type: "ephemeral", ttl: "1h" } + beta headerMulti-user fanout, longer agent runs

The 1-hour TTL is in beta — pass anthropic-beta: extended-cache-ttl-2025-04-11 on the request. Pricing for the longer TTL is 2× write (vs 1.25× for the 5-minute default), so it only pays off above ~6× reuse.

Where you can place cache breakpoints

You may put cache_control on any of these content blocks. A request may have up to 4 breakpoints; each breakpoint defines the end of a cacheable prefix.

LocationSupported
system (when an array of blocks)Yes
tools[*] (last tool's definition)Yes
messages[*].content[*] (text, image, document)Yes
Top-level string systemNo — must be array form
Inside tool_result.contentYes (treated as a message content block)

Minimum cacheable size

Caches only kick in above a minimum prefix length — short prompts cannot be cached.

ModelMinimum tokens to cache
claude-opus-4-71024
claude-sonnet-4-61024
claude-haiku-4-52048

If your would-be cached prefix is shorter than this, the API silently bills it as a regular write — no error, but no later read discount either.

Cache scope

Cache hits require an exact byte-for-byte match of the prefix up to the breakpoint. This includes system prompt, every tool definition (in order), and every message before the breakpoint. Changing even a single character before the breakpoint busts the cache.

What's part of the keyNotes
ModelDifferent model = different cache
ToolsDefinitions in order, schema bytes
System promptFull text content
Messages before the breakpointIncluding roles and types
Image bytesHashed identically
Cache breakpoint positionTwo breakpoints at different positions = two caches

Caches are scoped to your organization. They are not shared across orgs or workspaces; another team writing the same prefix does not benefit your reads.

Basic — system prompt

The simplest pattern: a long system prompt or knowledge base in the system field, marked cacheable.

python
import anthropic

client = anthropic.Anthropic()

with open("knowledge_base.md") as f:
    kb = f.read()    # ~30,000 tokens of stable docs

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a docs assistant."},
        {
            "type": "text",
            "text": kb,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What is the retry policy?"}],
)
print(response.usage)

Output (first call — writes cache):

text
Usage(input_tokens=24, output_tokens=98, cache_creation_input_tokens=30000, cache_read_input_tokens=0)

Output (second call within 5 min, different user question):

text
Usage(input_tokens=24, output_tokens=104, cache_creation_input_tokens=0, cache_read_input_tokens=30000)

TypeScript

Same shape — system is an array of blocks, the last cacheable block carries cache_control.

typescript
import Anthropic from "@anthropic-ai/sdk";
import fs from "node:fs/promises";

const client = new Anthropic();
const kb = await fs.readFile("knowledge_base.md", "utf8");

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  system: [
    { type: "text", text: "You are a docs assistant." },
    { type: "text", text: kb, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: "What is the retry policy?" }],
});

console.log(response.usage);

Output:

text
{
  input_tokens: 24,
  output_tokens: 98,
  cache_creation_input_tokens: 30000,
  cache_read_input_tokens: 0
}

Caching tool definitions

When you call the API with the same tool list every turn, mark the last tool's definition with cache_control. The prefix that includes all tools (in their declared order) becomes cacheable.

python
tools = [
    {
        "name": "search_docs",
        "description": "Search documentation by keyword.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "open_file",
        "description": "Open a file at a path.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "cache_control": {"type": "ephemeral"},   # last tool — cache point
    },
]

The prefix being cached: model + all tools above (in this order) + everything before any later breakpoint.

Caching documents

For document-grounded Q&A, place each large document as a cached prefix and let the user question be the volatile suffix.

python
import base64

with open("annual_report.pdf", "rb") as f:
    pdf = base64.standard_b64encode(f.read()).decode("ascii")

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf},
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "What was the gross margin in Q3?"},
        ],
    }],
)

The next user question about the same PDF (within 5 minutes) reads the cached PDF tokens at 10% cost.

Multiple breakpoints — system + tools + doc

Up to 4 breakpoints can layer caches with different update frequencies. The classic three-layer setup:

  1. Most stable — system prompt (changes rarely)
  2. Semi-stable — tools (changes per deployment)
  3. Per-session — user-uploaded document

When the document changes, layers 1 and 2 still cache-hit; only the document portion is re-written.

python
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {"type": "text", "text": "You are an analyst that uses tools."},
        {
            "type": "text",
            "text": LONG_PERSONA_AND_FORMATTING_RULES,
            "cache_control": {"type": "ephemeral"},   # breakpoint 1
        },
    ],
    tools=[
        *TOOLS_PREFIX,
        {**LAST_TOOL, "cache_control": {"type": "ephemeral"}},   # breakpoint 2
    ],
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf},
                "cache_control": {"type": "ephemeral"},                # breakpoint 3
            },
            {"type": "text", "text": user_question},                   # volatile suffix
        ],
    }],
)

Multi-turn conversation caching

In a chat, place the cache breakpoint on the assistant's last response in the history. The whole conversation up to and including that turn becomes cacheable; only the new user message is incremental.

python
messages = [
    {"role": "user", "content": "Hi, can you help me debug this Python?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Sure — paste the code and the error you're seeing."}
    ]},
    {"role": "user", "content": "Here is the code: …\nError: …"},
    {"role": "assistant", "content": [
        {
            "type": "text",
            "text": "I think the issue is the off-by-one in the loop. Try …",
            "cache_control": {"type": "ephemeral"},     # cache up to here
        },
    ]},
    {"role": "user", "content": "That fixed it — now how do I refactor for clarity?"},
]

Subsequent calls within 5 minutes pay for the new user turn only at full price; everything else hits the cache.

Reading cache metrics

response.usage always reports both write and read counts. Use them to compute hit ratio and the effective cost saving.

python
def hit_ratio(usage) -> float:
    total = usage.cache_creation_input_tokens + usage.cache_read_input_tokens + usage.input_tokens
    if total == 0:
        return 0.0
    return usage.cache_read_input_tokens / total

def effective_input_cost(usage, base_in_per_mtok: float) -> float:
    total_dollars = (
        usage.input_tokens * 1.0
        + usage.cache_creation_input_tokens * 1.25
        + usage.cache_read_input_tokens * 0.10
    ) * base_in_per_mtok / 1_000_000
    return total_dollars

print(f"hit ratio: {hit_ratio(response.usage):.1%}")
print(f"input cost: ${effective_input_cost(response.usage, 15.0):.6f}")

Output:

text
hit ratio: 99.9%
input cost: $0.000810

1-hour TTL beta

Pass the beta header to switch the TTL from 5 minutes to 1 hour. The write cost rises from 1.25× to 2.00×, so this only pays off above ~6× reuse — but it eliminates re-warming for low-traffic agents.

python
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[
        {"type": "text", "text": "..."},
        {
            "type": "text",
            "text": LARGE_PROMPT,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ],
    messages=[{"role": "user", "content": "..."}],
)
typescript
const response = await client.messages.create(
  {
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: [
      { type: "text", text: "..." },
      { type: "text", text: LARGE_PROMPT, cache_control: { type: "ephemeral", ttl: "1h" } },
    ],
    messages: [{ role: "user", content: "..." }],
  },
  { headers: { "anthropic-beta": "extended-cache-ttl-2025-04-11" } },
);

Refreshing the TTL

Every cache read resets the TTL countdown to the original value (5 min or 1 h). So a chat that gets at least one message every few minutes keeps its cache hot indefinitely without re-writing. The cache is only ever evicted when the full TTL elapses without a single read.

Cache hit ratio in practice

python
hits = 0
total = 0

for question in QUESTIONS:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=[
            {"type": "text", "text": "You are a docs assistant."},
            {"type": "text", "text": KB, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )
    u = resp.usage
    if u.cache_read_input_tokens > 0:
        hits += 1
    total += 1

print(f"cache hit ratio: {hits / total:.1%}")

Output:

text
cache hit ratio: 96.0%

Batch API + caching

Cache discounts apply inside the Batch API too, and stack with the 50% batch discount. The cache is warmed by the first request in the batch that processes it; the rest read.

python
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"q-{i}",
            "params": {
                "model": "claude-opus-4-7",
                "max_tokens": 256,
                "system": [
                    {"type": "text", "text": "You answer questions about the policy doc."},
                    {"type": "text", "text": POLICY_DOC, "cache_control": {"type": "ephemeral"}},
                ],
                "messages": [{"role": "user", "content": q}],
            },
        }
        for i, q in enumerate(QUESTIONS)
    ],
)

The first request in the batch pays the cache-write price; the remaining ~9,999 read at 10% × 50% = effectively 5% of base input cost.

Beta vs GA

Prompt caching itself is GA (generally available) and stable. What is still in beta:

FeatureStatusHow to enable
5-minute ephemeral cacheGAcache_control: { type: "ephemeral" }
1-hour ephemeral cacheBetaextra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"} + ttl: "1h"
Caching with Files API documentsGADefault once a file is referenced
Caching encrypted thinking blocksBetaBeta header on extended-thinking

Common pitfalls

PitfallSymptomFix
Prefix < 1024 tokens (Sonnet/Opus)Always cache_creation 0, no read eitherPad with stable content or skip caching
Modifying anything before the breakpointCache miss every callKeep stable content before breakpoints; volatile after
Cache control on every blockMore breakpoints than allowedMax 4 breakpoints per request — put one per layer
Using system as a stringCannot attach cache_controlConvert to [{"type": "text", "text": "..."}] array
Reordering tools randomlyConstant cache missesAlways pass tools in the same order; sort them
Forgetting cache writes cost extraBill surprise on first callExpect 1.25× input on warmup; amortise over reuse
1-hour TTL without beta headerAPI silently downgrades to 5 minAlways send anthropic-beta: extended-cache-ttl-2025-04-11
Two breakpoints at same positionOne is wastedOnly place breakpoints where the prefix actually diverges later
Per-user content before breakpointPer-user caches, low hit ratioMove per-user content after the last breakpoint

Common recipes

Conditional caching

Only enable caching when the prefix is large enough to justify the 1.25× write premium.

python
MIN_CACHE_TOKENS = 1024

def maybe_cache(text: str) -> dict:
    block = {"type": "text", "text": text}
    # Rough estimate: ~4 chars per token
    if len(text) >= MIN_CACHE_TOKENS * 4:
        block["cache_control"] = {"type": "ephemeral"}
    return block

Cache warmer

Some agents call once on startup just to seed the cache, then defer real user calls.

python
def warm_cache(system_blocks: list[dict]) -> None:
    client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1,
        system=system_blocks,
        messages=[{"role": "user", "content": "warm"}],
    )

warm_cache(SYSTEM_WITH_CACHE_CONTROL)
print("cache warm — ready for traffic")

Output:

text
cache warm — ready for traffic

Hit-ratio dashboard

python
from collections import Counter

def log_usage(usage, counter: Counter) -> None:
    counter["input"]          += usage.input_tokens
    counter["cache_creation"] += usage.cache_creation_input_tokens
    counter["cache_read"]     += usage.cache_read_input_tokens
    counter["output"]         += usage.output_tokens

def summary(counter: Counter, in_price_per_mtok: float, out_price_per_mtok: float) -> dict:
    input_cost = (
        counter["input"] * 1.0
        + counter["cache_creation"] * 1.25
        + counter["cache_read"] * 0.10
    ) * in_price_per_mtok / 1_000_000
    output_cost = counter["output"] * out_price_per_mtok / 1_000_000
    return {
        "hit_ratio": counter["cache_read"] / max(counter["cache_read"] + counter["input"], 1),
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }

Two-layer cache (persona + doc)

python
def make_request(persona: str, doc: str, question: str):
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {"type": "text", "text": persona, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Document:\n{doc}", "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": f"Question: {question}"},
            ],
        }],
    )

The persona caches across all documents; each document caches across all questions about that document. A new persona busts only layer 1; a new document busts only layer 2.

See also