cheat sheet

Few-Shot Prompting

In-context learning techniques — example selection, format design, count tuning, dynamic retrieval of demonstrations, and pitfalls of few-shot prompting.

What it is

Few-shot prompting (also called in-context learning, ICL) is the technique of placing 2–10 worked examples in the prompt so the model can pattern-match its way to the right format, style, or reasoning. Unlike fine-tuning, no weights change — the demonstrations are just text in the context window — yet behavior often shifts dramatically. Reach for few-shot whenever zero-shot instructions alone fail to lock in the format, tone, or edge-case handling you need.

In-context learning fundamentals

The phenomenon: large language models can perform a new task they were never explicitly trained on by inferring the pattern from a handful of (input, output) pairs in the prompt. The examples must share three things — a consistent format (delimiter, field order, capitalization), a consistent label space (closed list of categories or shape of output), and diversity that spans the input distribution the model will see at inference time.

| Shot count | Name | When to use |
| --- | --- | --- |
| 0 | Zero-shot | Task fully described by instructions; format simple |
| 1 | One-shot | Format is unusual but task is intuitive |
| 2–5 | Few-shot (sweet spot) | Format consistency, style transfer, edge-case anchoring |
| 6–20 | Many-shot | Long-tail labels or subtle distinctions; needs longer context |
| 100+ | Many-shot ICL | Replace fine-tuning for low-data domains |

If your zero-shot prompt works on easy inputs but fails on edge cases, don't add more rules — add one example of each failing edge case. Demonstrations consistently outperform additional prose instructions for handling unusual inputs.

Basic structure

The minimum viable few-shot prompt: an instruction sentence, then Input → Output pairs separated by blank lines, ending with the new input and a trailing Output: that the model completes.

text
Convert each sentence to past tense.

Input: She walks to school.
Output: She walked to school.

Input: They are building a house.
Output: They were building a house.

Input: The server processes 1,000 requests per second.
Output: The server processed 1,000 requests per second.

Input: {user_sentence}
Output:
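
The structure above can be assembled programmatically. The sketch below is one possible shape for a prompt builder, assuming each example is a dict with "input" and "output" keys; the name build_few_shot_prompt and its parameters are illustrative and not taken from any library.

```python
def build_few_shot_prompt(instruction: str, examples: list[dict], query: str) -> str:
    """Render the instruction, Input/Output pairs, and a trailing 'Output:' to complete."""
    blocks = [instruction.strip()]
    for ex in examples:
        blocks.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    # The final block ends with the bare field name so the model fills in the value.
    blocks.append(f"Input: {query}\nOutput:")
    # Blank lines separate the instruction and every pair.
    return "\n\n".join(blocks)


demo = [
    {"input": "She walks to school.", "output": "She walked to school."},
    {"input": "They are building a house.", "output": "They were building a house."},
]
prompt = build_few_shot_prompt("Convert each sentence to past tense.", demo, "He runs fast.")
print(prompt)
```

Keeping the rendering in one function means every demonstration and the final query pass through the same formatting code, which makes format drift between examples less likely.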

Format design

The format defines what the model has to learn. Three principles: pick a delimiter the model will never produce by accident (### Input: is safer than Input: because ### is rare in natural text); keep the field order identical across every example; and end with the field the model should produce, with a colon and no value.

python
TEMPLATE = """### Task
Classify each support ticket into one of: billing, access, performance, bug, feature, other.
Reply with the category only, in lowercase.

### Example 1
Ticket: "I was charged twice for my June subscription."
Category: billing

### Example 2
Ticket: "Page takes 30 seconds to load when I click 'export'."
Category: performance

### Example 3
Ticket: "Can you add dark mode to the dashboard?"
Category: feature

### Example 4
Ticket: "Forgot my password, the reset email never arrives."
Category: access

### New ticket
Ticket: "{user_ticket}"
Category:"""

Inconsistent capitalization, spacing, or label spelling across your examples teaches the model that those things don't matter — and it will produce inconsistent output. Audit demonstrations for whitespace and casing before shipping.
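
Part of that audit can be automated. The sketch below flags labels that differ only by case or surrounding whitespace; it assumes each demonstration is a dict with a "label" key, and find_label_variants is a hypothetical helper name.

```python
from collections import defaultdict


def find_label_variants(examples: list[dict], label_key: str = "label") -> dict[str, set[str]]:
    """Group label spellings that differ only by case or surrounding whitespace."""
    variants: dict[str, set[str]] = defaultdict(set)
    for ex in examples:
        raw = ex[label_key]
        variants[raw.strip().lower()].add(raw)
    # Keep only canonical labels that appear with more than one spelling.
    return {canon: forms for canon, forms in variants.items() if len(forms) > 1}


examples = [
    {"text": "I was charged twice.", "label": "billing"},
    {"text": "Refund still missing.", "label": "Billing"},
    {"text": "Page loads slowly.", "label": "performance "},
    {"text": "Export takes a minute.", "label": "performance"},
    {"text": "Add dark mode.", "label": "feature"},
]
problems = find_label_variants(examples)
for canon, forms in problems.items():
    print(canon, sorted(forms))
```

An empty result means the labels are spelled consistently; it says nothing about whether they are correct.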

Example selection

Which examples you pick has more impact than the count. Three selection strategies, in order of complexity: hand-curate (best for ≤5 examples), random sample from a labeled pool (cheap baseline), and similarity retrieval at inference time (dynamic few-shot, see below). Within those, three quality heuristics apply universally — diverse coverage of the input distribution, no duplicates, no near-duplicates that lock the model to a narrow pattern.

☐ One example per major category — covers the label space
☐ Mix easy and hard inputs — hard ones anchor edge cases
☐ Include at least one negative-result example if applicable ("if none apply, say 'other'")
☐ Verify each example matches your latest format exactly
☐ Audit for label noise — a single mislabeled example can flip dozens of test cases
☐ Test the prompt with each example REMOVED — if removing it doesn't change anything, drop it
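
The last checklist item is a leave-one-out ablation. A minimal sketch follows, assuming you can supply a score_fn that builds the prompt from a list of examples, runs your eval suite, and returns a mean score; score_fn, the tolerance default, and the toy scorer in the demo are all placeholders.

```python
from typing import Callable


def leave_one_out(examples: list[dict], score_fn: Callable[[list[dict]], float],
                  tolerance: float = 0.005) -> list[int]:
    """Return indices of examples whose removal moves the score by at most `tolerance`."""
    baseline = score_fn(examples)
    droppable = []
    for i in range(len(examples)):
        reduced = examples[:i] + examples[i + 1:]  # every example except the i-th
        if abs(baseline - score_fn(reduced)) <= tolerance:
            droppable.append(i)
    return droppable


# Toy scorer for illustration only: rewards covering three distinct labels.
def toy_score(exs: list[dict]) -> float:
    return len({ex["label"] for ex in exs}) / 3


pool = [{"label": "a"}, {"label": "a"}, {"label": "b"}, {"label": "c"}]
candidates = leave_one_out(pool, toy_score)
print(candidates)
```

In the toy run both "a" examples are individually droppable, but dropping both would lose the label. Remove one candidate at a time and re-run the ablation rather than deleting every flagged index at once.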

Count tuning

More examples are not always better. Returns diminish quickly past 5–6 examples for most tasks; beyond that you're spending tokens for marginal gains. Sweep empirically: run the eval suite at 0, 2, 4, 6, 8, 10, and 12 shots and plot the score curve. The right count is where the curve flattens, not where it reaches its maximum.

python
def sweep_shot_count(dataset, candidate_pool, instruction, max_shots=12, step=2):
    """Find the optimal shot count for this task and dataset."""
    # build_prompt, run_prompt, run_eval, and count_tokens are project helpers (not shown).
    results = {}
    for k in range(0, max_shots + 1, step):
        examples = candidate_pool[:k]  # first k demonstrations; k=0 is zero-shot
        prompt = build_prompt(instruction, examples)
        # Bind prompt as a default argument so the runner keeps this iteration's value.
        score = run_eval(dataset, runner=lambda x, p=prompt: run_prompt(p, x))
        results[k] = score["mean_score"]
        print(f"shots={k:2d}  mean_score={score['mean_score']:.3f}  tokens={count_tokens(prompt)}")
    return results

Output:

text
shots= 0  mean_score=0.612  tokens=180
shots= 2  mean_score=0.748  tokens=520
shots= 4  mean_score=0.831  tokens=860
shots= 6  mean_score=0.852  tokens=1200
shots= 8  mean_score=0.857  tokens=1540
shots=10  mean_score=0.859  tokens=1880
shots=12  mean_score=0.858  tokens=2220

In the example above the curve flattens at 6 shots: going from 6 to 12 nearly doubles token cost (1,200 to 2,220 tokens) for a gain of under 0.01. Stop where the slope flattens, not where the score peaks.
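
The stopping rule can be written down explicitly. The sketch below picks the smallest shot count after which the next step in the sweep gains less than a threshold; the min_gain default of 0.01 is an assumed cutoff that you would tune to your own metric, and pick_shot_count is a hypothetical helper.

```python
def pick_shot_count(results: dict[int, float], min_gain: float = 0.01) -> int:
    """Return the smallest shot count after which the next step gains < min_gain."""
    counts = sorted(results)
    for current, nxt in zip(counts, counts[1:]):
        if results[nxt] - results[current] < min_gain:
            return current
    return counts[-1]  # the curve never flattened within the sweep


# Scores copied from the sweep output above.
results = {0: 0.612, 2: 0.748, 4: 0.831, 6: 0.852, 8: 0.857, 10: 0.859, 12: 0.858}
print(pick_shot_count(results))  # 6
```

If the function returns the largest count in the sweep, the curve has not flattened yet and the sweep range is probably too short.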

Dynamic few-shot (example retrieval)

Instead of using the same examples for every query, retrieve the most similar examples from a labeled pool at inference time. This is RAG applied to demonstrations: embed your example pool once, embed the incoming query, fetch the top-k most similar examples, and inject them. Dynamic selection consistently outperforms static selection on heterogeneous tasks (varied input types) at the cost of one extra embedding call per query.

python
from sentence_transformers import SentenceTransformer
import numpy as np

class ExampleBank:
    def __init__(self, examples: list[dict]):
        """examples: list of {"input": str, "output": str, "tags": list[str]}"""
        self.examples = examples
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        self.embeddings = self.embedder.encode(
            [ex["input"] for ex in examples],
            normalize_embeddings=True,
        )

    def select(self, query: str, k: int = 4) -> list[dict]:
        q_vec = self.embedder.encode([query], normalize_embeddings=True)[0]
        sims = self.embeddings @ q_vec    # cosine sim since normalized
        top_idx = np.argsort(sims)[::-1][:k]
        return [self.examples[i] for i in top_idx]

bank = ExampleBank(labeled_pool)
relevant = bank.select(user_query, k=4)
prompt = build_few_shot_prompt(instruction, relevant, user_query)

The pool needs to be big enough that retrieval finds genuinely relevant examples — start with 30–100 labeled items spread across your input distribution. With fewer than that, hand-curating a static set usually wins.

Diverse retrieval (MMR)

Pure similarity retrieval often returns near-duplicates. Maximal Marginal Relevance (MMR) trades similarity for diversity: at each step pick the example that's similar to the query but dissimilar to already-picked examples. Use MMR when your pool contains many semantic duplicates.

python
def mmr_select(query_vec, doc_vecs, examples, k=4, lambda_=0.5):
    """Maximal Marginal Relevance: balance similarity and diversity."""
    selected = []
    candidate_idx = list(range(len(examples)))

    sims_to_query = doc_vecs @ query_vec
    sims_doc_doc = doc_vecs @ doc_vecs.T

    while len(selected) < k and candidate_idx:
        scores = []
        for idx in candidate_idx:
            relevance = sims_to_query[idx]
            redundancy = max((sims_doc_doc[idx][j] for j in selected), default=0)
            mmr = lambda_ * relevance - (1 - lambda_) * redundancy
            scores.append((mmr, idx))
        _, best = max(scores)
        selected.append(best)
        candidate_idx.remove(best)

    return [examples[i] for i in selected]

Order effects

The order in which examples appear matters, although the size of the effect varies by model and task. Models tend to weight the last example (recency) and the first example (primacy) more heavily than examples in the middle. Two practical rules: put your most prototypical example (or, with retrieval, the one most similar to the query) LAST so it most directly informs the new input, and try grouping similar examples together rather than alternating. Verify the second rule on your eval set, because a run of same-label examples at the end of the prompt can also bias the model toward that label.

python
def order_for_recency(examples: list[dict], query: str, embedder) -> list[dict]:
    """Sort examples so the MOST similar to the query is LAST."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    doc_vecs = embedder.encode([ex["input"] for ex in examples], normalize_embeddings=True)
    sims = doc_vecs @ q_vec
    order = np.argsort(sims)          # ascending: most similar last
    return [examples[i] for i in order]

Chain-of-thought few-shot

Combine few-shot with chain-of-thought by including the reasoning chain in each example. The model learns both the what (final answer) and the how (decomposition). Few-shot CoT is the most reliable way to elicit careful reasoning on tasks that confuse zero-shot CoT.

text
Solve the math word problem. Show your work step by step before the final answer.

Q: Alice Dev has 5 apples. She gives 2 to a friend and buys 4 more. How many does she have?
A: She starts with 5. After giving 2 away she has 5 - 2 = 3. After buying 4 more she has 3 + 4 = 7.
Final answer: 7

Q: A train travels 60 mph for 2 hours then 80 mph for 1 hour. Total distance?
A: First leg: 60 * 2 = 120 miles. Second leg: 80 * 1 = 80 miles. Total: 120 + 80 = 200 miles.
Final answer: 200 miles

Q: {new_problem}
A:

See chain-of-thought for deeper coverage.
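
Because every demonstration ends with a "Final answer:" line, the completion can be parsed mechanically. The sketch below is one way to do that; extract_final_answer is a hypothetical helper, and it assumes the model reproduces the marker exactly as the examples spell it.

```python
import re
from typing import Optional

# Match "Final answer:" at the start of a line and capture the rest of that line.
FINAL_ANSWER_RE = re.compile(r"^Final answer:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' line, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


completion = (
    "First leg: 60 * 2 = 120 miles. Second leg: 80 * 1 = 80 miles. "
    "Total: 120 + 80 = 200 miles.\n"
    "Final answer: 200 miles"
)
print(extract_final_answer(completion))  # 200 miles
```

A None result is a useful signal in itself: it usually means the model drifted from the demonstrated format, so count those cases in your eval.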

Negative examples (contrast)

Showing what NOT to do is sometimes more effective than showing what to do. Pair each desired-style example with a contrasting bad-style example, labeled so the model learns to discriminate. Use sparingly — too many bad examples can teach the bad pattern instead.

text
Rewrite the customer email to be friendly and professional.

# Bad rewrite (what NOT to do — too casual, slang)
Original: "Your refund request has been denied per policy section 4.2."
Rewrite: "Yo, no refund for u — check policy 4.2 lol."

# Good rewrite (what TO do — warm but professional)
Original: "Your refund request has been denied per policy section 4.2."
Rewrite: "Thanks for reaching out. Unfortunately we're unable to issue a refund in this case. The relevant policy (4.2) is linked below for your reference."

# Good rewrite
Original: "Account locked. Reset via portal."
Rewrite: "I've gone ahead and flagged your account for unlock — you'll receive an email to reset your password shortly. Let me know if it doesn't arrive within 10 minutes!"

# Your turn
Original: "{user_email}"
Rewrite:

Structured few-shot with XML

When examples themselves contain free-form text that could collide with your delimiters, wrap each demonstration in XML tags. The model parses XML reliably and the structure scales to multi-field examples without ambiguity.

text
Classify the support ticket and extract metadata.

<example>
  <ticket>I was charged twice for May.</ticket>
  <classification>billing</classification>
  <urgency>medium</urgency>
  <action>refund_one_charge</action>
</example>

<example>
  <ticket>Production server is down — all customers affected.</ticket>
  <classification>incident</classification>
  <urgency>critical</urgency>
  <action>page_oncall</action>
</example>

<example>
  <ticket>How do I export my data?</ticket>
  <classification>question</classification>
  <urgency>low</urgency>
  <action>send_docs_link</action>
</example>

<new>
  <ticket>{user_ticket}</ticket>
  <classification>
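
When tickets can contain characters such as < or &, escape them before they go inside the tags. The sketch below renders demonstrations with Python's standard-library escaper and reads back the first field of the completion; the helper names and the sample "bug" / "file_issue" values are illustrative assumptions.

```python
from xml.sax.saxutils import escape

FIELDS = ("ticket", "classification", "urgency", "action")


def render_example(example: dict) -> str:
    """Render one demonstration, escaping &, < and > inside field values."""
    lines = [f"  <{name}>{escape(str(example[name]))}</{name}>" for name in FIELDS]
    return "<example>\n" + "\n".join(lines) + "\n</example>"


def render_query(ticket: str) -> str:
    """Open the <new> block and leave <classification> for the model to fill."""
    return f"<new>\n  <ticket>{escape(ticket)}</ticket>\n  <classification>"


def first_field(completion: str) -> str:
    """Read the value the model wrote before the closing </classification> tag."""
    return completion.split("</classification>", 1)[0].strip()


sample = {
    "ticket": "Error: expected <div> & got </span>",
    "classification": "bug",
    "urgency": "low",
    "action": "file_issue",
}
rendered = render_example(sample)
print(rendered)
print(render_query("Export fails with <timeout>"))
```

Without escaping, a ticket containing a literal </ticket> would close the field early and blur the boundary between demonstration and data.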

Label balance

If your few-shot examples are skewed toward one class, the model develops a strong prior toward that class. When label distribution matters, balance your examples across the label space, even at the cost of using more examples. For classification, aim for at least one example per class; for binary tasks, use equal counts of each class.

python
def balanced_select(pool: list[dict], label_key: str = "label", per_class: int = 1) -> list[dict]:
    """Pick `per_class` examples from each label."""
    by_label: dict[str, list[dict]] = {}
    for ex in pool:
        by_label.setdefault(ex[label_key], []).append(ex)
    selected = []
    for label, examples in by_label.items():
        selected.extend(examples[:per_class])
    return selected

Many-shot ICL

Models with 200K+ token context windows make many-shot ICL practical: dozens or hundreds of examples fit in a single prompt, which can approach fine-tuning quality without the operational complexity of training. In the right setting this can replace a fine-tuned model entirely, particularly when your domain has 50–500 labeled examples and a low-traffic deployment makes per-call cost reasonable.

python
def many_shot_prompt(examples: list[dict], user_input: str, max_examples: int = 100) -> str:
    chosen = examples[:max_examples]
    blocks = [
        f"<ex>\n  <in>{ex['input']}</in>\n  <out>{ex['output']}</out>\n</ex>"
        for ex in chosen
    ]
    return (
        "Pattern-match the input to the demonstrations below and produce the output.\n\n"
        + "\n".join(blocks)
        + f"\n\n<ex>\n  <in>{user_input}</in>\n  <out>"
    )

Many-shot prompts benefit greatly from prompt caching because the examples are stable across queries. Mark the example block with cache_control: {"type": "ephemeral"} and, on a cache hit, a 50K-token example block is billed at roughly 10% of the normal input price.

Caching demonstrations

When your example set is stable, cache it. The first call writes the cache; subsequent calls within the 5-minute TTL read it at ~10% of the normal input cost and with lower latency.

python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# render_examples and static_example_set are project-specific (not shown).
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system=[
        {"type": "text", "text": "You are a strict classifier. Reply with the label only."},
        {
            "type": "text",
            "text": render_examples(static_example_set),     # 30K tokens of examples
            "cache_control": {"type": "ephemeral"},          # cache up to here
        },
    ],
    messages=[{"role": "user", "content": user_input}],
)
print(response.usage)

Output (first call, which writes the cache):

text
Usage(cache_creation_input_tokens=30420, cache_read_input_tokens=0, input_tokens=85, output_tokens=4)
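
To see what caching saves over a burst of calls, the arithmetic can be sketched as below. The read multiplier of 0.10 follows the ~10% figure above; the write multiplier of 1.25 (a surcharge on the first call) is an assumption to check against current pricing, and the function assumes every call lands within the TTL.

```python
def relative_input_cost(n_calls: int, cached_tokens: int, dynamic_tokens: int,
                        read_multiplier: float = 0.10,
                        write_multiplier: float = 1.25) -> float:
    """Input cost with caching divided by input cost without it (same token price)."""
    uncached = n_calls * (cached_tokens + dynamic_tokens)
    cached = (
        write_multiplier * cached_tokens                    # first call writes the cache
        + (n_calls - 1) * read_multiplier * cached_tokens   # later calls read it
        + n_calls * dynamic_tokens                          # per-query tokens are never cached
    )
    return cached / uncached


# Token counts taken from the usage output above.
one_call = relative_input_cost(1, cached_tokens=30420, dynamic_tokens=85)
hundred_calls = relative_input_cost(100, cached_tokens=30420, dynamic_tokens=85)
print(f"1 call: {one_call:.2f}x   100 calls: {hundred_calls:.2f}x")
```

Under these assumptions a single call costs more with caching than without, while a hundred calls cost a little over a tenth as much, so caching pays off only when the block is reused before it expires.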

Few-shot vs fine-tuning

Few-shot and fine-tuning solve overlapping problems. Few-shot is faster to iterate, doesn't require labeled scale, and updates instantly when you edit the prompt. Fine-tuning is cheaper per inference at high volume, supports more examples than fit in context, and can shift style or persona in ways few-shot cannot. Use the decision rules below.

| Question | Lean few-shot if... | Lean fine-tuning if... |
| --- | --- | --- |
| How many labeled examples? | < 100 | > 1,000 |
| How often does the task change? | Often | Rare |
| Inference volume? | Low / spiky | High and steady |
| Per-call budget? | Token cost OK | Need lowest possible cost |
| Need style transfer? | Sometimes works | Reliably works |
| Need new factual knowledge? | Use RAG instead | Use RAG instead |
| Time to first useful output? | Minutes | Days to weeks |

Common pitfalls

The recurring few-shot bugs and the fix that resolves each one.

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Model copies the example output verbatim | Example too similar to query | Add diversity; check pool for near-duplicates |
| Format breaks on the new input | Examples have inconsistent format | Audit whitespace, casing, label spelling |
| Score plateaus past 4 shots | Returns diminish quickly | Stop; you're spending tokens for nothing |
| All outputs match the most frequent label | Label imbalance in examples | Balance examples per class |
| Edge cases still fail | No edge example in the demo set | Add one example per known edge case |
| Quality varies across runs | Random example shuffling | Pin order; put best example last |
| Examples don't help on hard inputs | Pool too narrow | Expand pool, switch to retrieval |
| Long prompts hit rate limits | Too many shots | Prompt-cache examples; reduce count |
| One mislabeled example confuses model | Noise in labels | Re-audit demonstrations periodically |
| Behavior changes when prompt cached | Cache breakpoint inside dynamic content | Move cache_control to end of stable block |

Real-world recipes

Concrete few-shot setups for the highest-volume use cases.

Email-tone rewrite

text
Rewrite the email below to be warm, professional, and action-oriented.
Keep all facts the same. Match the style of the examples.

Original: "We received your invoice. Will pay net 30."
Rewrite: "Thanks for sending the invoice — we have it on file and will process payment on our standard net-30 terms. Reach out if anything's needed in the meantime!"

Original: "Meeting cancelled. Reschedule."
Rewrite: "Apologies for the late notice — I need to cancel today's meeting. Could we look at finding a new time later this week? I'll send a few options shortly."

Original: "Bug fixed in v2.3."
Rewrite: "Good news — the bug you reported has been fixed and is rolling out in v2.3. Let us know if you continue to see the issue!"

Original: "{user_text}"
Rewrite:

Closed-list intent classification

text
Classify the user message into ONE of: cancel, change, status, help, other.
Reply with the label only.

User: "I want to cancel my subscription."
Intent: cancel

User: "Can I change my plan from monthly to yearly?"
Intent: change

User: "Where's my order?"
Intent: status

User: "How does the trial work?"
Intent: help

User: "Tell me about the weather."
Intent: other

User: "{user_message}"
Intent:
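
Even with a closed list in the prompt, the completion is worth validating before it reaches downstream code. A minimal sketch follows, assuming that anything outside the list should collapse to "other"; normalize_intent is a hypothetical helper.

```python
ALLOWED = ("cancel", "change", "status", "help", "other")


def normalize_intent(raw: str, allowed: tuple[str, ...] = ALLOWED,
                     fallback: str = "other") -> str:
    """Map a raw completion onto the closed label list, falling back when it doesn't fit."""
    # Trim whitespace, then stray quotes or a trailing period, then lowercase.
    cleaned = raw.strip().strip("\"'.").lower()
    return cleaned if cleaned in allowed else fallback


print(normalize_intent(" Cancel.\n"))            # cancel
print(normalize_intent("I think it's status"))   # other
```

Logging every fallback is worthwhile: a rising fallback rate often points to a missing label or a demonstration set that no longer matches real traffic.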

Structured extraction with worked examples

text
Extract the fields below as JSON. Output JSON only.

Text: "Alice Dev paid $42.50 at Acme Coffee on 2025-04-12 at 9:14am."
JSON: {"amount": 42.50, "currency": "USD", "merchant": "Acme Coffee", "date": "2025-04-12", "time": "09:14"}

Text: "Refund of 1,250.00 EUR to card ending 4242 on 2025-04-14."
JSON: {"amount": 1250.00, "currency": "EUR", "type": "refund", "card_last4": "4242", "date": "2025-04-14"}

Text: "{user_text}"
JSON:
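
The completion for this prompt should be a single JSON object, but it may arrive with leading whitespace or trailing text. The sketch below parses the first object and returns None on failure; parse_json_completion is a hypothetical helper built on the standard-library json module.

```python
import json
from typing import Optional


def parse_json_completion(raw: str) -> Optional[dict]:
    """Parse the first JSON object in a completion; return None if there isn't one."""
    start = raw.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the first complete JSON value,
        # so any text after the object is ignored.
        obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


raw = ' {"amount": 12.0, "currency": "USD", "merchant": "Acme Coffee"}\nText: "next item"'
parsed = parse_json_completion(raw)
print(parsed)
```

Note that the two demonstrations above use different key sets, so validate parsed objects against whichever keys your downstream code requires.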

Dynamic SQL generation

python
sql_pool = [
    {"input": "Show me the top 5 customers by revenue", "output": "SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 5;"},
    {"input": "How many orders were placed yesterday?", "output": "SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE - 1;"},
    {"input": "List products that have never been ordered", "output": "SELECT p.id, p.name FROM products p LEFT JOIN order_items oi ON oi.product_id = p.id WHERE oi.id IS NULL;"},
    # ... 50+ more
]

bank = ExampleBank(sql_pool)

def text_to_sql(question: str) -> str:
    relevant = bank.select(question, k=5)
    examples_text = "\n".join(
        f"Q: {ex['input']}\nSQL: {ex['output']}\n"
        for ex in relevant
    )
    prompt = f"""Generate SQL for the question below. Match the style of the examples.

{examples_text}

Q: {question}
SQL:"""
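    # call_claude is assumed to be a project helper that sends the prompt and returns the completion text.
    return call_claude(prompt, max_tokens=300, temperature=0.0)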

Few-shot for tool use

Few-shot examples also work inside tool descriptions and inside the conversation history before a tool call. The most reliable pattern is to seed the messages list with one or two complete user → assistant(tool_use) → tool_result → assistant(text) cycles before the real user turn — Claude pattern-matches on the cycle structure as readily as on text demonstrations.

python
seeded_messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "I'll check the weather."},
        {"type": "tool_use", "id": "demo_1", "name": "get_weather",
         "input": {"location": "Paris, France", "unit": "celsius"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "demo_1",
         "content": '{"temp": 14, "condition": "rainy"}'},
    ]},
    {"role": "assistant", "content": "The weather in Paris is 14°C and rainy."},
    # Now the real user turn
    {"role": "user", "content": user_question},
]

Demonstration tool-use cycles also teach Claude when NOT to call a tool. Include one example where the user asks something off-topic and the assistant answers without a tool call.
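
A no-tool demonstration is simply a user turn followed by a plain-text assistant turn. The sketch below shows one and a small helper that flattens demonstration cycles ahead of the real question; with_demos and the sample wording are illustrative assumptions, and in practice this cycle would sit alongside a tool-using one.

```python
def with_demos(demo_cycles: list[list[dict]], user_question: str) -> list[dict]:
    """Flatten demonstration cycles and append the real user turn last."""
    messages = [msg for cycle in demo_cycles for msg in cycle]
    messages.append({"role": "user", "content": user_question})
    return messages


# A demonstration where the assistant answers directly, with no tool_use block.
no_tool_cycle = [
    {"role": "user", "content": "What does 'celsius' mean?"},
    {"role": "assistant",
     "content": "Celsius is a temperature scale on which water freezes at 0 degrees."},
]

messages = with_demos([no_tool_cycle], "Is it raining in Lyon?")
for msg in messages:
    print(msg["role"], "->", msg["content"])
```

Keeping each cycle as its own list makes it easy to add, remove, or reorder demonstrations without breaking the user/assistant alternation.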

Auditing demonstrations

Few-shot demonstrations rot over time — schemas evolve, label definitions shift, edge cases get added then forgotten. Build a quarterly audit step into your prompt maintenance routine. The check below catches the most common rot patterns.

python
import jsonschema


def audit_demonstrations(examples: list[dict], schema: dict) -> list[str]:
    """Return human-readable issues: empty fields, duplicates, and schema drift."""
    issues = []
    seen_inputs = set()
    for i, ex in enumerate(examples):
        # Empty fields (checked first so missing keys don't raise below)
        text = (ex.get("input") or "").strip()
        output = ex.get("output")
        if not text:
            issues.append(f"#{i}: empty input")
        if not output:
            issues.append(f"#{i}: empty output")

        # Duplicate detection (case- and whitespace-insensitive)
        if text:
            normalized = text.lower()
            if normalized in seen_inputs:
                issues.append(f"#{i}: duplicate of an earlier example")
            seen_inputs.add(normalized)

        # Schema conformance (assumes output is already-parsed JSON, e.g. a dict)
        if output:
            try:
                jsonschema.validate(output, schema)
            except jsonschema.ValidationError as e:
                issues.append(f"#{i}: output doesn't match current schema: {e.message}")

    return issues

Quick reference

What to reach for, by goal.

| Goal | Pattern |
| --- | --- |
| Lock in an unusual format | 2–3 examples with that exact format |
| Handle a recurring edge case | One example of the edge case |
| Improve classifier on rare labels | Balanced set, one per label |
| Reduce drift across runs | Pin example order, temperature 0.0 |
| Scale to many use cases | Dynamic retrieval from a labeled pool |
| Improve reasoning | Few-shot CoT (include the chain in examples) |
| Cut cost on stable example set | Prompt cache the demonstration block |
| Test if extra shots help | Sweep 0/2/4/6/8/10 and graph the curve |
| Replace fine-tuning | Many-shot ICL (50–500 examples + cache) |
| Teach what NOT to do | One labeled negative example, marked clearly |