cheat sheet
Claude API
Anthropic prompt caching — `cache_control`, 5-minute TTL, 1-hour beta, cache breakpoints, cost math, multi-turn caching patterns, and recipes for system prompts, tools, and documents.
Claude API — Prompt Caching
What it is
Prompt caching lets Claude reuse a previously seen prefix of your request (system prompt, tool definitions, large documents) at ~10% of the standard input cost and a fraction of the latency. You mark the end of a cacheable prefix with cache_control: { type: "ephemeral" }; the first request writes the cache (at +25% input cost), every subsequent request that begins with the same bytes within the TTL reads the cache (at 10% input cost). Reach for it whenever the same large context is repeated across multiple turns or users — chat agents over a fixed knowledge base, RAG pipelines, agent loops with stable tool definitions, document-grounded Q&A.
Cost model
Caching changes input pricing only — output tokens are unchanged.
| Operation | Input cost (relative to base) |
|---|---|
| Cache write (5-min TTL) | 1.25× |
| Cache write (1-hour TTL, beta) | 2.00× |
| Cache read | 0.10× |
| Non-cached input | 1.00× |
Net win: a prefix that is reused once within the TTL already pays for itself (1.25 + 0.10 = 1.35 vs 2× without caching). The more reuse, the bigger the savings. A 10× reuse drops effective input cost to 0.225× — a ~78% reduction.
TTL options
| TTL | Header / param | Use case |
|---|---|---|
| 5 minutes (default) | cache_control: { type: "ephemeral" } | Short conversation, single user session |
| 1 hour (beta) | cache_control: { type: "ephemeral", ttl: "1h" } + beta header | Multi-user fanout, longer agent runs |
The 1-hour TTL is in beta — pass
anthropic-beta: extended-cache-ttl-2025-04-11on the request. Pricing for the longer TTL is 2× write (vs 1.25× for the 5-minute default), so it only pays off above ~6× reuse.
Where you can place cache breakpoints
You may put cache_control on any of these content blocks. A request may have up to 4 breakpoints; each breakpoint defines the end of a cacheable prefix.
| Location | Supported |
|---|---|
system (when an array of blocks) | Yes |
tools[*] (last tool's definition) | Yes |
messages[*].content[*] (text, image, document) | Yes |
Top-level string system | No — must be array form |
Inside tool_result.content | Yes (treated as a message content block) |
Minimum cacheable size
Caches only kick in above a minimum prefix length — short prompts cannot be cached.
| Model | Minimum tokens to cache |
|---|---|
claude-opus-4-7 | 1024 |
claude-sonnet-4-6 | 1024 |
claude-haiku-4-5 | 2048 |
If your would-be cached prefix is shorter than this, the API silently bills it as a regular write — no error, but no later read discount either.
Cache scope
Cache hits require an exact byte-for-byte match of the prefix up to the breakpoint. This includes system prompt, every tool definition (in order), and every message before the breakpoint. Changing even a single character before the breakpoint busts the cache.
| What's part of the key | Notes |
|---|---|
| Model | Different model = different cache |
| Tools | Definitions in order, schema bytes |
| System prompt | Full text content |
| Messages before the breakpoint | Including roles and types |
| Image bytes | Hashed identically |
| Cache breakpoint position | Two breakpoints at different positions = two caches |
Caches are scoped to your organization. They are not shared across orgs or workspaces; another team writing the same prefix does not benefit your reads.
Basic — system prompt
The simplest pattern: a long system prompt or knowledge base in the system field, marked cacheable.
import anthropic
client = anthropic.Anthropic()
with open("knowledge_base.md") as f:
kb = f.read() # ~30,000 tokens of stable docs
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
system=[
{"type": "text", "text": "You are a docs assistant."},
{
"type": "text",
"text": kb,
"cache_control": {"type": "ephemeral"},
},
],
messages=[{"role": "user", "content": "What is the retry policy?"}],
)
print(response.usage)
Output (first call — writes cache):
Usage(input_tokens=24, output_tokens=98, cache_creation_input_tokens=30000, cache_read_input_tokens=0)
Output (second call within 5 min, different user question):
Usage(input_tokens=24, output_tokens=104, cache_creation_input_tokens=0, cache_read_input_tokens=30000)
TypeScript
Same shape — system is an array of blocks, the last cacheable block carries cache_control.
import Anthropic from "@anthropic-ai/sdk";
import fs from "node:fs/promises";
const client = new Anthropic();
const kb = await fs.readFile("knowledge_base.md", "utf8");
const response = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 1024,
system: [
{ type: "text", text: "You are a docs assistant." },
{ type: "text", text: kb, cache_control: { type: "ephemeral" } },
],
messages: [{ role: "user", content: "What is the retry policy?" }],
});
console.log(response.usage);
Output:
{
input_tokens: 24,
output_tokens: 98,
cache_creation_input_tokens: 30000,
cache_read_input_tokens: 0
}
Caching tool definitions
When you call the API with the same tool list every turn, mark the last tool's definition with cache_control. The prefix that includes all tools (in their declared order) becomes cacheable.
tools = [
{
"name": "search_docs",
"description": "Search documentation by keyword.",
"input_schema": {
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"],
},
},
{
"name": "open_file",
"description": "Open a file at a path.",
"input_schema": {
"type": "object",
"properties": {"path": {"type": "string"}},
"required": ["path"],
},
"cache_control": {"type": "ephemeral"}, # last tool — cache point
},
]
The prefix being cached: model + all tools above (in this order) + everything before any later breakpoint.
Caching documents
For document-grounded Q&A, place each large document as a cached prefix and let the user question be the volatile suffix.
import base64
with open("annual_report.pdf", "rb") as f:
pdf = base64.standard_b64encode(f.read()).decode("ascii")
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
messages=[{
"role": "user",
"content": [
{
"type": "document",
"source": {"type": "base64", "media_type": "application/pdf", "data": pdf},
"cache_control": {"type": "ephemeral"},
},
{"type": "text", "text": "What was the gross margin in Q3?"},
],
}],
)
The next user question about the same PDF (within 5 minutes) reads the cached PDF tokens at 10% cost.
Multiple breakpoints — system + tools + doc
Up to 4 breakpoints can layer caches with different update frequencies. The classic three-layer setup:
- Most stable — system prompt (changes rarely)
- Semi-stable — tools (changes per deployment)
- Per-session — user-uploaded document
When the document changes, layers 1 and 2 still cache-hit; only the document portion is re-written.
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
system=[
{"type": "text", "text": "You are an analyst that uses tools."},
{
"type": "text",
"text": LONG_PERSONA_AND_FORMATTING_RULES,
"cache_control": {"type": "ephemeral"}, # breakpoint 1
},
],
tools=[
*TOOLS_PREFIX,
{**LAST_TOOL, "cache_control": {"type": "ephemeral"}}, # breakpoint 2
],
messages=[{
"role": "user",
"content": [
{
"type": "document",
"source": {"type": "base64", "media_type": "application/pdf", "data": pdf},
"cache_control": {"type": "ephemeral"}, # breakpoint 3
},
{"type": "text", "text": user_question}, # volatile suffix
],
}],
)
Multi-turn conversation caching
In a chat, place the cache breakpoint on the assistant's last response in the history. The whole conversation up to and including that turn becomes cacheable; only the new user message is incremental.
messages = [
{"role": "user", "content": "Hi, can you help me debug this Python?"},
{"role": "assistant", "content": [
{"type": "text", "text": "Sure — paste the code and the error you're seeing."}
]},
{"role": "user", "content": "Here is the code: …\nError: …"},
{"role": "assistant", "content": [
{
"type": "text",
"text": "I think the issue is the off-by-one in the loop. Try …",
"cache_control": {"type": "ephemeral"}, # cache up to here
},
]},
{"role": "user", "content": "That fixed it — now how do I refactor for clarity?"},
]
Subsequent calls within 5 minutes pay for the new user turn only at full price; everything else hits the cache.
Reading cache metrics
response.usage always reports both write and read counts. Use them to compute hit ratio and the effective cost saving.
def hit_ratio(usage) -> float:
total = usage.cache_creation_input_tokens + usage.cache_read_input_tokens + usage.input_tokens
if total == 0:
return 0.0
return usage.cache_read_input_tokens / total
def effective_input_cost(usage, base_in_per_mtok: float) -> float:
total_dollars = (
usage.input_tokens * 1.0
+ usage.cache_creation_input_tokens * 1.25
+ usage.cache_read_input_tokens * 0.10
) * base_in_per_mtok / 1_000_000
return total_dollars
print(f"hit ratio: {hit_ratio(response.usage):.1%}")
print(f"input cost: ${effective_input_cost(response.usage, 15.0):.6f}")
Output:
hit ratio: 99.9%
input cost: $0.000810
1-hour TTL beta
Pass the beta header to switch the TTL from 5 minutes to 1 hour. The write cost rises from 1.25× to 2.00×, so this only pays off above ~6× reuse — but it eliminates re-warming for low-traffic agents.
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
system=[
{"type": "text", "text": "..."},
{
"type": "text",
"text": LARGE_PROMPT,
"cache_control": {"type": "ephemeral", "ttl": "1h"},
},
],
messages=[{"role": "user", "content": "..."}],
)
const response = await client.messages.create(
{
model: "claude-opus-4-7",
max_tokens: 1024,
system: [
{ type: "text", text: "..." },
{ type: "text", text: LARGE_PROMPT, cache_control: { type: "ephemeral", ttl: "1h" } },
],
messages: [{ role: "user", content: "..." }],
},
{ headers: { "anthropic-beta": "extended-cache-ttl-2025-04-11" } },
);
Refreshing the TTL
Every cache read resets the TTL countdown to the original value (5 min or 1 h). So a chat that gets at least one message every few minutes keeps its cache hot indefinitely without re-writing. The cache is only ever evicted when the full TTL elapses without a single read.
Cache hit ratio in practice
hits = 0
total = 0
for question in QUESTIONS:
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=512,
system=[
{"type": "text", "text": "You are a docs assistant."},
{"type": "text", "text": KB, "cache_control": {"type": "ephemeral"}},
],
messages=[{"role": "user", "content": question}],
)
u = resp.usage
if u.cache_read_input_tokens > 0:
hits += 1
total += 1
print(f"cache hit ratio: {hits / total:.1%}")
Output:
cache hit ratio: 96.0%
Batch API + caching
Cache discounts apply inside the Batch API too, and stack with the 50% batch discount. The cache is warmed by the first request in the batch that processes it; the rest read.
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"q-{i}",
"params": {
"model": "claude-opus-4-7",
"max_tokens": 256,
"system": [
{"type": "text", "text": "You answer questions about the policy doc."},
{"type": "text", "text": POLICY_DOC, "cache_control": {"type": "ephemeral"}},
],
"messages": [{"role": "user", "content": q}],
},
}
for i, q in enumerate(QUESTIONS)
],
)
The first request in the batch pays the cache-write price; the remaining ~9,999 read at 10% × 50% = effectively 5% of base input cost.
Beta vs GA
Prompt caching itself is GA (generally available) and stable. What is still in beta:
| Feature | Status | How to enable |
|---|---|---|
| 5-minute ephemeral cache | GA | cache_control: { type: "ephemeral" } |
| 1-hour ephemeral cache | Beta | extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"} + ttl: "1h" |
| Caching with Files API documents | GA | Default once a file is referenced |
| Caching encrypted thinking blocks | Beta | Beta header on extended-thinking |
Common pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Prefix < 1024 tokens (Sonnet/Opus) | Always cache_creation 0, no read either | Pad with stable content or skip caching |
| Modifying anything before the breakpoint | Cache miss every call | Keep stable content before breakpoints; volatile after |
| Cache control on every block | More breakpoints than allowed | Max 4 breakpoints per request — put one per layer |
Using system as a string | Cannot attach cache_control | Convert to [{"type": "text", "text": "..."}] array |
| Reordering tools randomly | Constant cache misses | Always pass tools in the same order; sort them |
| Forgetting cache writes cost extra | Bill surprise on first call | Expect 1.25× input on warmup; amortise over reuse |
| 1-hour TTL without beta header | API silently downgrades to 5 min | Always send anthropic-beta: extended-cache-ttl-2025-04-11 |
| Two breakpoints at same position | One is wasted | Only place breakpoints where the prefix actually diverges later |
| Per-user content before breakpoint | Per-user caches, low hit ratio | Move per-user content after the last breakpoint |
Common recipes
Conditional caching
Only enable caching when the prefix is large enough to justify the 1.25× write premium.
MIN_CACHE_TOKENS = 1024
def maybe_cache(text: str) -> dict:
block = {"type": "text", "text": text}
# Rough estimate: ~4 chars per token
if len(text) >= MIN_CACHE_TOKENS * 4:
block["cache_control"] = {"type": "ephemeral"}
return block
Cache warmer
Some agents call once on startup just to seed the cache, then defer real user calls.
def warm_cache(system_blocks: list[dict]) -> None:
client.messages.create(
model="claude-opus-4-7",
max_tokens=1,
system=system_blocks,
messages=[{"role": "user", "content": "warm"}],
)
warm_cache(SYSTEM_WITH_CACHE_CONTROL)
print("cache warm — ready for traffic")
Output:
cache warm — ready for traffic
Hit-ratio dashboard
from collections import Counter
def log_usage(usage, counter: Counter) -> None:
counter["input"] += usage.input_tokens
counter["cache_creation"] += usage.cache_creation_input_tokens
counter["cache_read"] += usage.cache_read_input_tokens
counter["output"] += usage.output_tokens
def summary(counter: Counter, in_price_per_mtok: float, out_price_per_mtok: float) -> dict:
input_cost = (
counter["input"] * 1.0
+ counter["cache_creation"] * 1.25
+ counter["cache_read"] * 0.10
) * in_price_per_mtok / 1_000_000
output_cost = counter["output"] * out_price_per_mtok / 1_000_000
return {
"hit_ratio": counter["cache_read"] / max(counter["cache_read"] + counter["input"], 1),
"input_cost": input_cost,
"output_cost": output_cost,
"total_cost": input_cost + output_cost,
}
Two-layer cache (persona + doc)
def make_request(persona: str, doc: str, question: str):
return client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
system=[
{"type": "text", "text": persona, "cache_control": {"type": "ephemeral"}},
],
messages=[{
"role": "user",
"content": [
{"type": "text", "text": f"Document:\n{doc}", "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": f"Question: {question}"},
],
}],
)
The persona caches across all documents; each document caches across all questions about that document. A new persona busts only layer 1; a new document busts only layer 2.
See also
- Python SDK — base messages API.
- TypeScript SDK — caching in TS.
- Tool use — caching tool definitions.
- Batch API — stack 50% batch + 90% cache.
- Files API — uploaded files cache automatically.
- Streaming — caching works with streamed responses too.