cheat sheet
Few-Shot Prompting
In-context learning techniques — example selection, format design, count tuning, dynamic retrieval of demonstrations, and pitfalls of few-shot prompting.
What it is
Few-shot prompting (also called in-context learning, ICL) is the technique of placing 2–10 worked examples in the prompt so the model can pattern-match its way to the right format, style, or reasoning. Unlike fine-tuning, no weights change — the demonstrations are just text in the context window — yet behavior often shifts dramatically. Reach for few-shot whenever zero-shot instructions alone fail to lock in the format, tone, or edge-case handling you need.
In-context learning fundamentals
The phenomenon: large language models can perform a new task they were never explicitly trained on by inferring the pattern from a handful of (input, output) pairs in the prompt. A good example set has three properties: a consistent format (delimiter, field order, capitalization), a consistent label space (a closed list of categories or a fixed output shape), and enough diversity to span the input distribution the model will see at inference time.
| Shot count | Name | When to use |
|---|---|---|
| 0 | Zero-shot | Task fully described by instructions; format simple |
| 1 | One-shot | Format is unusual but task is intuitive |
| 2–5 | Few-shot (sweet spot) | Format consistency, style transfer, edge-case anchoring |
| 6–20 | Many-shot | Long-tail labels or subtle distinctions; needs longer context |
| 100+ | Many-shot ICL | Replace fine-tuning for low-data domains |
If your zero-shot prompt works on easy inputs but fails on edge cases, don't add more rules; add one example of each failing edge case. Demonstrations often handle unusual inputs better than additional prose instructions.
Basic structure
The minimum viable few-shot prompt: an instruction sentence, then Input → Output pairs separated by blank lines, ending with the new input and a trailing Output: that the model completes.
Convert each sentence to past tense.
Input: She walks to school.
Output: She walked to school.
Input: They are building a house.
Output: They were building a house.
Input: The server processes 1,000 requests per second.
Output: The server processed 1,000 requests per second.
Input: {user_sentence}
Output:
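A minimal sketch of how the structure above might be assembled in code. The name `build_few_shot_prompt` and the `{"input", "output"}` example shape are illustrative assumptions, not an API from any library; the function only joins the instruction, the Input/Output pairs, and a trailing `Output:` for the model to complete.

```python
def build_few_shot_prompt(instruction: str, examples: list[dict], query: str) -> str:
    """Render instruction + Input/Output pairs + the open slot for the new input.

    Hypothetical helper: `examples` is assumed to be a list of
    {"input": str, "output": str} dicts.
    """
    blocks = [instruction.strip()]
    for ex in examples:
        blocks.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    # End on the field the model should fill in: label present, value absent.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


demo = [
    {"input": "She walks to school.", "output": "She walked to school."},
    {"input": "They are building a house.", "output": "They were building a house."},
]
prompt = build_few_shot_prompt("Convert each sentence to past tense.", demo, "He eats lunch.")
print(prompt)
```

Keeping the rendering in one function means every demonstration goes through the same template, which makes format drift between examples harder to introduce by hand.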
Format design
The format defines what the model has to learn. Three principles: pick a delimiter the model will never produce by accident (### Input: is safer than Input: because ### is rare in natural text); keep the field order identical across every example; and end with the field the model should produce, with a colon and no value.
TEMPLATE = """### Task
Classify each support ticket into one of: billing, access, performance, bug, feature, other.
Reply with the category only, in lowercase.
### Example 1
Ticket: "I was charged twice for my June subscription."
Category: billing
### Example 2
Ticket: "Page takes 30 seconds to load when I click 'export'."
Category: performance
### Example 3
Ticket: "Can you add dark mode to the dashboard?"
Category: feature
### Example 4
Ticket: "Forgot my password, the reset email never arrives."
Category: access
### New ticket
Ticket: "{user_ticket}"
Category:"""
Inconsistent capitalization, spacing, or label spelling across your examples teaches the model that those things don't matter — and it will produce inconsistent output. Audit demonstrations for whitespace and casing before shipping.
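The audit described above can be partly automated. The sketch below is one possible check, assuming examples are `{"input", "output"}` dicts and that the task expects lowercase labels from a closed set; the function name and message wording are hypothetical.

```python
def check_format_consistency(examples: list[dict], labels: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks consistent.

    Assumes each example is {"input": str, "output": str} and that outputs
    must be lowercase labels from the closed set `labels`.
    """
    problems = []
    for i, ex in enumerate(examples):
        for field in ("input", "output"):
            value = ex[field]
            if value != value.strip():
                problems.append(f"#{i}: leading/trailing whitespace in {field}")
        label = ex["output"].strip()
        if label not in labels:
            hint = " (casing differs)" if label.lower() in labels else ""
            problems.append(f"#{i}: label {label!r} not in label set{hint}")
    return problems


LABELS = {"billing", "access", "performance", "bug", "feature", "other"}
examples = [
    {"input": "I was charged twice.", "output": "billing"},
    {"input": "Export is slow.", "output": "Performance"},  # wrong casing
    {"input": "Add dark mode? ", "output": "feature"},      # trailing space
]
for problem in check_format_consistency(examples, LABELS):
    print(problem)
```

Running a check like this in CI, before the prompt ships, is cheaper than discovering the inconsistency through noisy model output.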
Example selection
Which examples you pick has more impact than the count. Three selection strategies, in order of complexity: hand-curate (best for ≤5 examples), random sample from a labeled pool (cheap baseline), and similarity retrieval at inference time (dynamic few-shot, see below). Within those, three quality heuristics apply universally — diverse coverage of the input distribution, no duplicates, no near-duplicates that lock the model to a narrow pattern.
☐ One example per major category — covers the label space
☐ Mix easy and hard inputs — hard ones anchor edge cases
☐ Include at least one negative-result example if applicable ("if none apply, say 'other'")
☐ Verify each example matches your latest format exactly
☐ Audit for label noise — a single mislabeled example can flip dozens of test cases
☐ Test the prompt with each example REMOVED — if removing it doesn't change anything, drop it
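The last checklist item can be run as a leave-one-out loop. This is a sketch under assumptions: `score_fn` is a hypothetical callable that builds the prompt from the given examples, runs your eval suite, and returns a single number. The toy scorer below only stands in for a real eval.

```python
from typing import Callable


def leave_one_out(
    examples: list[dict],
    score_fn: Callable[[list[dict]], float],
) -> list[tuple[int, float]]:
    """Return (index, score_drop) pairs, largest drop first.

    A drop near zero suggests the example is not pulling its weight.
    """
    baseline = score_fn(examples)
    drops = []
    for i in range(len(examples)):
        without = examples[:i] + examples[i + 1:]
        drops.append((i, baseline - score_fn(without)))
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a real eval: fraction of the three labels that are covered.
def toy_score(exs: list[dict]) -> float:
    return len({ex["output"] for ex in exs}) / 3


pool = [
    {"input": "charged twice", "output": "billing"},
    {"input": "double charge", "output": "billing"},
    {"input": "page is slow", "output": "performance"},
    {"input": "add dark mode", "output": "feature"},
]
ranking = leave_one_out(pool, toy_score)
print(ranking)
```

In the toy run, removing either of the two billing examples changes nothing, which is the signal to drop one of them.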
Count tuning
More examples are not always better. Returns diminish quickly past 5–6 examples for most tasks; beyond that you're spending tokens for marginal gains. Sweep empirically: run the eval suite at 0, 2, 4, 6, 8, 10, and 12 shots and graph the score curve. The right count is where the curve flattens, not where it peaks.
def sweep_shot_count(dataset, candidate_pool, instruction, max_shots=12, step=2):
    """Find the optimal shot count for this task and dataset.

    build_prompt, run_eval, run_prompt, and count_tokens are assumed to be
    project-specific helpers defined elsewhere.
    """
    results = {}
    for k in range(0, max_shots + 1, step):
        examples = candidate_pool[:k]  # first k examples of the pool
        prompt = build_prompt(instruction, examples)
        score = run_eval(dataset, runner=lambda x: run_prompt(prompt, x))
        results[k] = score["mean_score"]
        print(f"shots={k:2d} mean_score={score['mean_score']:.3f} tokens={count_tokens(prompt)}")
    return results
Output:
shots= 0 mean_score=0.612 tokens=180
shots= 2 mean_score=0.748 tokens=520
shots= 4 mean_score=0.831 tokens=860
shots= 6 mean_score=0.852 tokens=1200
shots= 8 mean_score=0.857 tokens=1540
shots=10 mean_score=0.859 tokens=1880
shots=12 mean_score=0.858 tokens=2220
In the example above the curve flattens at 6 shots: going from 6 to 12 nearly doubles token cost (1,200 to 2,220 tokens) for a gain of 0.006. Stop where the slope flattens, not where the score peaks.
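"Where the slope flattens" can be turned into a simple rule. The sketch below assumes a `min_gain` threshold (0.01 here, chosen arbitrarily) and returns the smallest shot count after which the next step adds less than that; the scores are the ones from the sweep output above.

```python
def pick_shot_count(results: dict[int, float], min_gain: float = 0.01) -> int:
    """Return the smallest shot count after which the next step adds < min_gain.

    `min_gain` is an assumed threshold; pick one that reflects how much a
    point of score is worth relative to the extra tokens.
    """
    counts = sorted(results)
    for current, nxt in zip(counts, counts[1:]):
        if results[nxt] - results[current] < min_gain:
            return current
    return counts[-1]  # curve never flattened within the sweep


sweep = {0: 0.612, 2: 0.748, 4: 0.831, 6: 0.852, 8: 0.857, 10: 0.859, 12: 0.858}
print(pick_shot_count(sweep))  # 6
```

A stricter threshold moves the answer earlier: with `min_gain=0.05` the same sweep returns 4, because the step from 4 to 6 shots adds only 0.021.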
Dynamic few-shot (example retrieval)
Instead of using the same examples for every query, retrieve the most similar examples from a labeled pool at inference time. This is RAG applied to demonstrations: embed your example pool once, embed the incoming query, fetch the top-k most similar examples, and inject them. Dynamic selection often outperforms static selection on heterogeneous tasks (varied input types), at the cost of one extra embedding call per query.
from sentence_transformers import SentenceTransformer
import numpy as np
class ExampleBank:
    def __init__(self, examples: list[dict]):
        """examples: list of {"input": str, "output": str, "tags": list[str]}"""
        self.examples = examples
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        self.embeddings = self.embedder.encode(
            [ex["input"] for ex in examples],
            normalize_embeddings=True,
        )

    def select(self, query: str, k: int = 4) -> list[dict]:
        q_vec = self.embedder.encode([query], normalize_embeddings=True)[0]
        sims = self.embeddings @ q_vec  # cosine sim since normalized
        top_idx = np.argsort(sims)[::-1][:k]
        return [self.examples[i] for i in top_idx]

bank = ExampleBank(labeled_pool)
relevant = bank.select(user_query, k=4)
prompt = build_few_shot_prompt(instruction, relevant, user_query)
The pool needs to be big enough that retrieval finds genuinely relevant examples — start with 30–100 labeled items spread across your input distribution. With fewer than that, hand-curating a static set usually wins.
Diverse retrieval (MMR)
Pure similarity retrieval often returns near-duplicates. Maximal Marginal Relevance (MMR) trades similarity for diversity: at each step pick the example that's similar to the query but dissimilar to already-picked examples. Use MMR when your pool contains many semantic duplicates.
def mmr_select(query_vec, doc_vecs, examples, k=4, lambda_=0.5):
    """Maximal Marginal Relevance: balance similarity and diversity."""
    selected = []
    candidate_idx = list(range(len(examples)))
    sims_to_query = doc_vecs @ query_vec
    sims_doc_doc = doc_vecs @ doc_vecs.T
    while len(selected) < k and candidate_idx:
        scores = []
        for idx in candidate_idx:
            relevance = sims_to_query[idx]
            redundancy = max((sims_doc_doc[idx][j] for j in selected), default=0)
            mmr = lambda_ * relevance - (1 - lambda_) * redundancy
            scores.append((mmr, idx))
        _, best = max(scores)
        selected.append(best)
        candidate_idx.remove(best)
    return [examples[i] for i in selected]
Order effects
The order in which examples appear matters. Models tend to weight the last example most heavily (recency) and, to a lesser extent, the first (primacy); examples in the middle usually have less influence. Two practical rules: put the example most similar to the new input LAST so it most directly informs the completion, and if you group similar examples together rather than alternating, check the result on your eval set, because a run of same-label examples at the end can pull predictions toward that label. Order effects vary by model and task, so treat both rules as starting points.
import numpy as np


def order_for_recency(examples: list[dict], query: str, embedder) -> list[dict]:
    """Sort examples so the MOST similar to the query is LAST."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    doc_vecs = embedder.encode([ex["input"] for ex in examples], normalize_embeddings=True)
    sims = doc_vecs @ q_vec
    order = np.argsort(sims)  # ascending: most similar last
    return [examples[i] for i in order]
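Before investing in ordering logic, it can help to measure how much order matters for your task at all. The sketch below scores the same examples under several random orders and reports the spread. `score_fn` is a hypothetical callable that builds the prompt, runs your eval, and returns one number; the toy scorer is only a stand-in.

```python
import random
import statistics
from typing import Callable


def order_sensitivity(
    examples: list[dict],
    score_fn: Callable[[list[dict]], float],
    n_orders: int = 5,
    seed: int = 0,
) -> dict[str, float]:
    """Score the same examples under several random orders and report the spread."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    scores = []
    for _ in range(n_orders):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    return {
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.pstdev(scores),
    }


# Toy stand-in: pretends quality depends only on which label comes last.
def toy_score(exs: list[dict]) -> float:
    return 0.9 if exs[-1]["output"] == "billing" else 0.7


demo = [
    {"input": "a", "output": "billing"},
    {"input": "b", "output": "access"},
    {"input": "c", "output": "bug"},
]
print(order_sensitivity(demo, toy_score))
```

If the spread is within eval noise, pinning any fixed order is probably enough; a large spread is the case where similarity-based ordering is worth testing.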
Chain-of-thought few-shot
Combine few-shot with chain-of-thought by including the reasoning chain in each example. The model learns both the what (final answer) and the how (decomposition). Few-shot CoT is the most reliable way to elicit careful reasoning on tasks that confuse zero-shot CoT.
Solve the math word problem. Show your work step by step before the final answer.
Q: Alice Dev has 5 apples. She gives 2 to a friend and buys 4 more. How many does she have?
A: She starts with 5. After giving 2 away she has 5 - 2 = 3. After buying 4 more she has 3 + 4 = 7.
Final answer: 7
Q: A train travels 60 mph for 2 hours then 80 mph for 1 hour. Total distance?
A: First leg: 60 * 2 = 120 miles. Second leg: 80 * 1 = 80 miles. Total: 120 + 80 = 200 miles.
Final answer: 200 miles
Q: {new_problem}
A:
See chain-of-thought for deeper coverage.
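Because the demonstrations end every answer with a `Final answer:` line, the completion can be parsed mechanically. A minimal sketch, assuming the model follows that convention; the function name is hypothetical, and the last match is taken in case the reasoning itself contains the phrase.

```python
import re
from typing import Optional

FINAL_ANSWER_RE = re.compile(r"^Final answer:\s*(.+?)\s*$", re.MULTILINE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' line, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


completion = (
    " She starts with 12. After eating 3 she has 12 - 3 = 9.\n"
    "Final answer: 9"
)
print(extract_final_answer(completion))  # 9
```

A `None` result is worth logging: it usually means the model drifted from the demonstrated format.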
Negative examples (contrast)
Showing what NOT to do is sometimes more effective than showing what to do. Pair each desired-style example with a contrasting bad-style example, labeled so the model learns to discriminate. Use sparingly — too many bad examples can teach the bad pattern instead.
Rewrite the customer email to be friendly and professional.
# Bad rewrite (what NOT to do — too casual, slang)
Original: "Your refund request has been denied per policy section 4.2."
Rewrite: "Yo, no refund for u — check policy 4.2 lol."
# Good rewrite (what TO do — warm but professional)
Original: "Your refund request has been denied per policy section 4.2."
Rewrite: "Thanks for reaching out. Unfortunately we're unable to issue a refund in this case. The relevant policy (4.2) is linked below for your reference."
# Good rewrite
Original: "Account locked. Reset via portal."
Rewrite: "I've gone ahead and flagged your account for unlock — you'll receive an email to reset your password shortly. Let me know if it doesn't arrive within 10 minutes!"
# Your turn
Original: "{user_email}"
Rewrite:
Structured few-shot with XML
When examples themselves contain free-form text that could collide with your delimiters, wrap each demonstration in XML tags. The model parses XML reliably and the structure scales to multi-field examples without ambiguity.
Classify the support ticket and extract metadata.
<example>
<ticket>I was charged twice for May.</ticket>
<classification>billing</classification>
<urgency>medium</urgency>
<action>refund_one_charge</action>
</example>
<example>
<ticket>Production server is down — all customers affected.</ticket>
<classification>incident</classification>
<urgency>critical</urgency>
<action>page_oncall</action>
</example>
<example>
<ticket>How do I export my data?</ticket>
<classification>question</classification>
<urgency>low</urgency>
<action>send_docs_link</action>
</example>
<new>
<ticket>{user_ticket}</ticket>
<classification>
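Two details are easy to get wrong with this pattern: ticket text containing `<`, `>`, or `&` can break the structure, and the completion starts in the middle of an open `<classification>` tag. The sketch below covers both. The function names are hypothetical; the field names come from the example above, and `escape` is the standard-library helper from `xml.sax.saxutils`.

```python
import re
from xml.sax.saxutils import escape

FIELDS = ("classification", "urgency", "action")


def render_example(ex: dict) -> str:
    """Render one demonstration, escaping &, <, > in the field values."""
    lines = ["<example>", f"  <ticket>{escape(ex['ticket'])}</ticket>"]
    lines += [f"  <{name}>{escape(ex[name])}</{name}>" for name in FIELDS]
    lines.append("</example>")
    return "\n".join(lines)


def parse_completion(completion: str) -> dict:
    """Parse the model's continuation of the open <classification> tag.

    The prompt ends with '<classification>', so the tag is re-attached here
    before the fields are pulled out with a regex.
    """
    text = "<classification>" + completion
    found = {}
    for name in FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        if match:
            found[name] = match.group(1).strip()
    return found


ex = {
    "ticket": "Error: value <null> in field A & B",
    "classification": "bug",
    "urgency": "medium",
    "action": "open_ticket",
}
print(render_example(ex))
print(parse_completion(
    "billing</classification>\n<urgency>medium</urgency>\n"
    "<action>refund_one_charge</action>\n</new>"
))
```

A regex is used instead of a full XML parser because model output is not guaranteed to be well-formed; missing fields simply do not appear in the returned dict.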
Label balance
If your few-shot examples are skewed toward one class, the model develops a strong prior toward that class. Balance your examples across the label space — even at the cost of using more examples — when label distribution matters. For classification, aim for at least one example per class; for binary tasks, exactly equal counts.
def balanced_select(pool: list[dict], label_key: str = "label", per_class: int = 1) -> list[dict]:
    """Pick `per_class` examples from each label."""
    by_label: dict[str, list[dict]] = {}
    for ex in pool:
        by_label.setdefault(ex[label_key], []).append(ex)
    selected = []
    for label, examples in by_label.items():
        selected.extend(examples[:per_class])
    return selected
Many-shot ICL
Modern models with 200K+ context windows enable many-shot ICL with dozens or hundreds of examples — approaching fine-tuning quality without the operational complexity of training. Used effectively this can replace a fine-tuned model entirely, particularly when your domain has 50–500 labeled examples and a low-traffic deployment makes per-call cost reasonable.
def many_shot_prompt(examples: list[dict], user_input: str, max_examples: int = 100) -> str:
    chosen = examples[:max_examples]
    blocks = [
        f"<ex>\n  <in>{ex['input']}</in>\n  <out>{ex['output']}</out>\n</ex>"
        for ex in chosen
    ]
    return (
        "Pattern-match the input to the demonstrations below and produce the output.\n\n"
        + "\n".join(blocks)
        + f"\n\n<ex>\n  <in>{user_input}</in>\n  <out>"
    )
Many-shot prompts benefit massively from prompt caching because the examples are stable across queries. Mark the example block with cache_control: {"type": "ephemeral"} and a 50K-token prompt is billed at a fraction of the normal input price on each cache hit.
Caching demonstrations
When your example set is stable, cache it. The first call writes the cache; subsequent calls within the 5-minute TTL read it at roughly 10% of the base input price and with lower latency.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# render_examples, static_example_set, and user_input are assumed to be defined elsewhere
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system=[
        {"type": "text", "text": "You are a strict classifier. Reply with the label only."},
        {
            "type": "text",
            "text": render_examples(static_example_set),  # 30K tokens of examples
            "cache_control": {"type": "ephemeral"},       # cache up to here
        },
    ],
    messages=[{"role": "user", "content": user_input}],
)
print(response.usage)
Output:
Usage(cache_creation_input_tokens=30420, cache_read_input_tokens=0, input_tokens=85, output_tokens=4)
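A rough way to see when caching pays off is to express input cost in units of one uncached input token. The sketch below uses the token counts from the usage output above. The 0.10 read multiplier follows the ~10% figure in this section; the 1.25 write multiplier is an assumption about the cache-write surcharge, so check both against current pricing.

```python
def relative_input_cost(
    cached_tokens: int,
    fresh_tokens: int,
    calls: int,
    write_mult: float = 1.25,
    read_mult: float = 0.10,
) -> dict[str, float]:
    """Input cost in 'base input token' units, with and without caching.

    Assumes calls >= 1 and that every call after the first lands inside the
    cache TTL. write_mult is an assumed surcharge; read_mult is the ~10% figure.
    """
    without = calls * (cached_tokens + fresh_tokens)
    first = cached_tokens * write_mult + fresh_tokens
    later = (calls - 1) * (cached_tokens * read_mult + fresh_tokens)
    return {"without_cache": float(without), "with_cache": first + later}


for n in (1, 2, 10):
    print(n, relative_input_cost(cached_tokens=30420, fresh_tokens=85, calls=n))
```

Under these assumptions a single call costs about 25% more with caching than without, the second call already puts caching ahead (about 41K units versus 61K), and at ten calls the cached total is roughly a fifth of the uncached one.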
Few-shot vs fine-tuning
Few-shot and fine-tuning solve overlapping problems. Few-shot is faster to iterate, doesn't require labeled scale, and updates instantly when you edit the prompt. Fine-tuning is cheaper per inference at high volume, supports more examples than fit in context, and can shift style or persona in ways few-shot cannot. The table below summarizes the decision rule.
| Question | Lean few-shot if... | Lean fine-tuning if... |
|---|---|---|
| How many labeled examples? | < 100 | > 1,000 |
| How often does the task change? | Often | Rare |
| Inference volume? | Low / spiky | High and steady |
| Per-call budget? | Token cost OK | Need lowest possible cost |
| Need style transfer? | Sometimes works | Reliably works |
| Need new factual knowledge? | Use RAG instead | Use RAG instead |
| Time to first useful output? | Minutes | Days to weeks |
Common pitfalls
The recurring few-shot bugs and the fix that resolves each one.
| Symptom | Root cause | Fix |
|---|---|---|
| Model copies the example output verbatim | Example too similar to query | Add diversity; check pool for near-duplicates |
| Format breaks on the new input | Examples have inconsistent format | Audit whitespace, casing, label spelling |
| Score plateaus past 4 shots | Returns diminish quickly | Stop; you're spending tokens for nothing |
| All outputs match the most frequent label | Label imbalance in examples | Balance examples per class |
| Edge cases still fail | No edge example in the demo set | Add one example per known edge case |
| Quality varies across runs | Random example shuffling | Pin order; put best example last |
| Examples don't help on hard inputs | Pool too narrow | Expand pool, switch to retrieval |
| Long prompts hit rate limits | Too many shots | Prompt-cache examples; reduce count |
| One mislabeled example confuses model | Noise in labels | Re-audit demonstrations periodically |
| Cache never hits (cache_read_input_tokens stays 0) | Cache breakpoint placed after dynamic content | Move cache_control to the end of the stable block |
Real-world recipes
Concrete few-shot setups for the highest-volume use cases.
Email-tone rewrite
Rewrite the email below to be warm, professional, and action-oriented.
Keep all facts the same. Match the style of the examples.
Original: "We received your invoice. Will pay net 30."
Rewrite: "Thanks for sending the invoice — we have it on file and will process payment on our standard net-30 terms. Reach out if anything's needed in the meantime!"
Original: "Meeting cancelled. Reschedule."
Rewrite: "Apologies for the late notice — I need to cancel today's meeting. Could we look at finding a new time later this week? I'll send a few options shortly."
Original: "Bug fixed in v2.3."
Rewrite: "Good news — the bug you reported has been fixed and is rolling out in v2.3. Let us know if you continue to see the issue!"
Original: "{user_text}"
Rewrite:
Closed-list intent classification
Classify the user message into ONE of: cancel, change, status, help, other.
Reply with the label only.
User: "I want to cancel my subscription."
Intent: cancel
User: "Can I change my plan from monthly to yearly?"
Intent: change
User: "Where's my order?"
Intent: status
User: "How does the trial work?"
Intent: help
User: "Tell me about the weather."
Intent: other
User: "{user_message}"
Intent:
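Even with a closed list in the prompt, the raw completion may carry stray whitespace, casing, or punctuation, so it is worth normalizing before use. A minimal sketch: the function name and the choice of `other` as the fallback are assumptions.

```python
def normalize_intent(raw: str, labels: tuple[str, ...], fallback: str = "other") -> str:
    """Map a raw completion onto the closed label list, or return the fallback."""
    cleaned = raw.strip().strip("\"'.").lower()
    return cleaned if cleaned in labels else fallback


INTENTS = ("cancel", "change", "status", "help", "other")
print(normalize_intent(" Cancel.\n", INTENTS))            # cancel
print(normalize_intent("I think it's status", INTENTS))   # other
```

Counting how often the fallback fires gives a cheap signal that the demonstrations no longer cover the traffic.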
Structured extraction with worked examples
Extract the fields below as JSON. Output JSON only.
Text: "Alice Dev paid $42.50 at Acme Coffee on 2025-04-12 at 9:14am."
JSON: {"amount": 42.50, "currency": "USD", "merchant": "Acme Coffee", "date": "2025-04-12", "time": "09:14"}
Text: "Refund of 1,250.00 EUR to card ending 4242 on 2025-04-14."
JSON: {"amount": 1250.00, "currency": "EUR", "type": "refund", "card_last4": "4242", "date": "2025-04-14"}
Text: "{user_text}"
JSON:
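"Output JSON only" is usually followed, but not always, so the completion should be parsed defensively. The sketch below tries a strict parse first and then falls back to the outermost `{...}` span; the function name is hypothetical and only the standard library is used.

```python
import json
from typing import Optional


def parse_json_output(completion: str) -> Optional[dict]:
    """Parse a JSON object from a completion, tolerating surrounding text."""
    text = completion.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, e.g. when the model adds prose.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return parsed if isinstance(parsed, dict) else None


print(parse_json_output(' {"amount": 42.5, "currency": "USD"}'))
print(parse_json_output('Here you go: {"amount": 7} Hope that helps.'))
print(parse_json_output("no json here"))
```

Note that the two demonstrations above emit different key sets (`merchant` and `time` in one, `type` and `card_last4` in the other), so downstream code should treat every field as optional.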
Dynamic SQL generation
sql_pool = [
{"input": "Show me the top 5 customers by revenue", "output": "SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 5;"},
{"input": "How many orders were placed yesterday?", "output": "SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE - 1;"},
{"input": "List products that have never been ordered", "output": "SELECT p.id, p.name FROM products p LEFT JOIN order_items oi ON oi.product_id = p.id WHERE oi.id IS NULL;"},
# ... 50+ more
]
bank = ExampleBank(sql_pool)
def text_to_sql(question: str) -> str:
    relevant = bank.select(question, k=5)
    examples_text = "\n".join(
        f"Q: {ex['input']}\nSQL: {ex['output']}\n"
        for ex in relevant
    )
    prompt = f"""Generate SQL for the question below. Match the style of the examples.
{examples_text}
Q: {question}
SQL:"""
    # call_claude is an assumed wrapper around your model client
    return call_claude(prompt, max_tokens=300, temperature=0.0)
Few-shot for tool use
Few-shot examples also work inside tool descriptions and inside the conversation history before a tool call. The most reliable pattern is to seed the messages list with one or two complete user → assistant(tool_use) → tool_result → assistant(text) cycles before the real user turn — Claude pattern-matches on the cycle structure as readily as on text demonstrations.
seeded_messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "I'll check the weather."},
        {"type": "tool_use", "id": "demo_1", "name": "get_weather",
         "input": {"location": "Paris, France", "unit": "celsius"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "demo_1",
         "content": '{"temp": 14, "condition": "rainy"}'},
    ]},
    {"role": "assistant", "content": "The weather in Paris is 14°C and rainy."},
    # Now the real user turn
    {"role": "user", "content": user_question},
]
Demonstration tool-use cycles also teach Claude when NOT to call a tool. Include one example where the user asks something off-topic and the assistant answers without a tool call.
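A no-tool demonstration is just a user turn followed by a plain-text assistant turn with no `tool_use` block. The sketch below shows one such cycle and a small hypothetical helper that flattens any number of seed cycles and appends the real question; the example wording is invented for illustration.

```python
# Hypothetical seed cycle: an off-topic question answered directly, no tool call.
no_tool_cycle = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]


def build_messages(seed_cycles: list[list[dict]], user_question: str) -> list[dict]:
    """Flatten the demonstration cycles and append the real user turn last.

    Each cycle is assumed to start with a user turn and end with an
    assistant turn, so roles keep alternating after flattening.
    """
    messages = [msg for cycle in seed_cycles for msg in cycle]
    messages.append({"role": "user", "content": user_question})
    return messages


messages = build_messages([no_tool_cycle], "What's the weather in Oslo?")
for msg in messages:
    print(msg["role"], "->", msg["content"])
```

In practice the tool-using cycle and the no-tool cycle would both go into `seed_cycles`, so the model sees one example of each decision.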
Auditing demonstrations
Few-shot demonstrations rot over time — schemas evolve, label definitions shift, edge cases get added then forgotten. Build a quarterly audit step into your prompt maintenance routine. The check below catches the most common rot patterns.
import jsonschema


def audit_demonstrations(examples: list[dict], schema: dict) -> list[str]:
    """Return a list of human-readable issues found in the demonstration set."""
    issues = []
    seen_inputs = set()
    for i, ex in enumerate(examples):
        # Read fields defensively so a missing key is reported, not raised
        text = (ex.get("input") or "").strip()
        output = ex.get("output")
        # Empty input, then duplicate detection (case- and whitespace-insensitive)
        if not text:
            issues.append(f"#{i}: empty input")
        elif text.lower() in seen_inputs:
            issues.append(f"#{i}: duplicate of an earlier example")
        else:
            seen_inputs.add(text.lower())
        # Empty output: nothing to validate against the schema
        if not output:
            issues.append(f"#{i}: empty output")
            continue
        # Schema conformance
        try:
            jsonschema.validate(output, schema)
        except jsonschema.ValidationError as e:
            issues.append(f"#{i}: output doesn't match current schema: {e.message}")
    return issues
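The audit above catches exact duplicates only. Near-duplicates can be flagged with a character-level similarity from the standard library, as in the sketch below. The function name and the 0.85 threshold are assumptions, and because it compares every pair it is only suitable for small pools; an embedding-based check would be the alternative for larger ones.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(
    examples: list[dict],
    threshold: float = 0.85,
) -> list[tuple[int, int, float]]:
    """Return (i, j, ratio) for input pairs whose similarity ratio >= threshold.

    Uses difflib's character-level ratio; the threshold is an assumed
    starting point to tune on your own pool.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(examples), 2):
        ratio = SequenceMatcher(None, a["input"].lower(), b["input"].lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


pool = [
    {"input": "I was charged twice for May."},
    {"input": "I was charged twice for June."},
    {"input": "How do I export my data?"},
]
print(find_near_duplicates(pool))
```

In the toy pool the two billing tickets are flagged as a pair while the export question is not; flagged pairs are candidates for review, not automatic deletion.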
Quick reference
What to reach for, sorted by symptom.
| Goal | Pattern |
|---|---|
| Lock in an unusual format | 2–3 examples with that exact format |
| Handle a recurring edge case | One example of the edge case |
| Improve classifier on rare labels | Balanced set, one per label |
| Reduce drift across runs | Pin example order, temperature 0.0 |
| Scale to many use cases | Dynamic retrieval from a labeled pool |
| Improve reasoning | Few-shot CoT (include the chain in examples) |
| Cut cost on stable example set | Prompt cache the demonstration block |
| Test if extra shots help | Sweep 0/2/4/6/8/10 and graph the curve |
| Replace fine-tuning | Many-shot ICL (50–500 examples + cache) |
| Teach what NOT to do | One labeled negative example, marked clearly |