cheat sheet

Chain-of-Thought Prompting

CoT prompting techniques — zero-shot CoT, few-shot CoT, self-consistency, tree of thoughts, and how reasoning models compare with prompted reasoning.

updated 05-25-2026

Chain-of-Thought Prompting

What it is

Chain-of-thought (CoT) prompting elicits step-by-step reasoning from a language model before it commits to a final answer. The technique exploits the fact that transformers produce one token at a time with no "scratchpad" beyond the visible context — by getting the model to write its work, you give it more compute to think with. CoT consistently improves accuracy on multi-step math, logic puzzles, code debugging, and any task where the right answer requires combining several facts. The two foundational variants are zero-shot CoT (an instruction like "think step by step") and few-shot CoT (worked examples that show the desired reasoning style).

Why CoT works

A model that emits "the answer is 42" in one token had exactly one forward pass to figure out 42. A model that emits 200 tokens of reasoning before "42" had 200 forward passes' worth of compute, each conditioned on the previous tokens — effectively building a working memory through its own output. The reasoning tokens are not just for human inspection; they are part of the computation that produces the answer.

Task type	Zero-shot accuracy	With CoT accuracy	Gain
Arithmetic word problems	Low	High	Large
Multi-step logic puzzles	Low-medium	High	Large
Code debugging	Medium	High	Moderate
Factual lookup	High	High	None — CoT doesn't help
Style/tone tasks	High	High	None or slight regression
Short classification	High	High	None — overhead without benefit

Apply CoT where reasoning matters; skip it on tasks where the answer is already a lookup or pattern-match. Padding short tasks with CoT just burns tokens for no quality gain.

Zero-shot CoT

The simplest variant: append "Let's think step by step" to the question. No examples, no scaffolding. This phrase (and small variants) reliably triggers reasoning behavior in instruction-tuned models. Stronger variants ask the model to produce reasoning in a tagged region, then a clean final answer outside the tag.

text

Question: A jug holds 3 liters. It is filled with a 30% salt solution.
4 liters of pure water are added. What is the new salt concentration?

Let's think step by step.

Tagged CoT (cleaner output parsing)

text

{problem}

Think step by step inside <reasoning> tags.
After </reasoning>, give ONLY the final answer with no explanation.

python

import re
import anthropic

client = anthropic.Anthropic()

def solve_with_cot(problem: str) -> tuple[str, str]:
    """Returns (reasoning, answer)."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"{problem}\n\n"
                "Think step by step inside <reasoning> tags. "
                "After </reasoning>, give ONLY the final answer."
            ),
        }],
    )
    text = resp.content[0].text
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.sub(r"<reasoning>.*?</reasoning>", "", text, flags=re.DOTALL).strip()
    return (reasoning.group(1).strip() if reasoning else "", answer)

reasoning, answer = solve_with_cot(
    "A jug holds 3L of 30% salt solution. Add 4L pure water. New concentration?"
)
print("Answer:", answer)

Output:

text

Answer: ~12.86% (0.9L salt in 7L total)

For production, ask for the answer in a separate tag like <answer>...</answer> to avoid regex cleanup. Prefilling <reasoning> also works to force the structure.

Few-shot CoT

When zero-shot CoT produces reasoning that's wrong or in the wrong style, show the model worked examples. The model copies the format and depth of the reasoning chain along with the answer style. This is the most reliable variant for tasks where the kind of reasoning matters (case analysis vs algebraic vs proof by contradiction).

text

Solve each problem. Show step-by-step reasoning before the final answer.

Q: A train leaves Station A at 9am going 60 mph. Another leaves Station B at 10am
   going 80 mph toward A. Stations are 280 miles apart. When do they meet?
A: Let t = hours after 9am.
   Train 1 position from A: 60 * t
   Train 2 starts at 10am (so t=1 onwards): 80 * (t - 1) from B
   They meet when 60t + 80(t-1) = 280
   60t + 80t - 80 = 280
   140t = 360
   t = 2.57 hours after 9am
   Final answer: ~11:34 am

Q: If 5 workers can build a wall in 8 days, how long for 8 workers (same rate)?
A: Total worker-days = 5 * 8 = 40
   New time = 40 / 8 = 5 days
   Final answer: 5 days

Q: {new_problem}
A:

Demonstrations should match the reasoning depth and style you want — show "think it through carefully" examples, not skipped-step examples. The model imitates whatever level of detail it sees.

Self-consistency

Self-consistency runs the same CoT prompt multiple times at non-zero temperature, collects the final answers, and takes a majority vote. The intuition: many reasoning paths can reach the right answer, and wrong reasoning paths tend to diverge. On hard math problems self-consistency improves accuracy substantially over single-sample CoT at the cost of N times more API calls.

python

import anthropic
from collections import Counter

client = anthropic.Anthropic()

def self_consistency(problem: str, n: int = 5, temperature: float = 0.7) -> dict:
    """Sample n reasoning paths and majority-vote the final answer."""
    answers = []
    for _ in range(n):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1500,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": (
                    f"{problem}\n\n"
                    "Think step by step. End with: Final answer: <answer>"
                ),
            }],
        )
        text = resp.content[0].text
        if "Final answer:" in text:
            answer = text.split("Final answer:")[-1].strip().split("\n")[0]
            answers.append(answer)

    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return {
        "answer": winner,
        "confidence": votes / n,
        "all_answers": dict(counts),
        "samples": n,
    }

result = self_consistency("What is 47 * 83?", n=5)
print(result)

Output:

text

{'answer': '3901', 'confidence': 1.0, 'all_answers': {'3901': 5}, 'samples': 5}

Use self-consistency on problems where you can extract a clean final answer (a number, a label, a yes/no). For open-ended outputs majority vote doesn't apply — use it only where answers can be compared for equality.

Cost-aware self-consistency

You don't always need 5 samples. Run 1 sample first; if the answer matches a confidence rubric (the model says "I'm confident"), accept it; otherwise escalate to N samples. This adaptive pattern cuts cost dramatically on easy problems while preserving accuracy on hard ones.

python

def adaptive_consistency(problem: str, easy_threshold: float = 0.9) -> dict:
    # First pass: ask for answer + self-rated confidence
    first = call_claude(
        f"{problem}\n\nThink step by step. End with:\n"
        f"Final answer: <answer>\nConfidence (0-1): <num>"
    )
    if extract_confidence(first) >= easy_threshold:
        return {"answer": extract_answer(first), "samples": 1}
    # Escalate
    return self_consistency(problem, n=5)

Tree of Thoughts (ToT)

Tree of Thoughts generalizes CoT by exploring multiple reasoning branches at each step, evaluating partial solutions, and pruning dead ends. Instead of one linear chain, the model emits several candidate next steps, scores them, and continues from the best. ToT works well for tasks with discrete intermediate states (game-playing, planning, combinatorial puzzles) and is overkill for arithmetic or single-pass reasoning.

python

import anthropic

client = anthropic.Anthropic()

def expand_thoughts(state: str, k: int = 3) -> list[str]:
    """Generate k candidate next steps from the current state."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=600,
        temperature=0.8,
        messages=[{
            "role": "user",
            "content": (
                f"Current state:\n{state}\n\n"
                f"Propose {k} distinct next steps. Number them 1-{k}. "
                f"Each step should be one short paragraph."
            ),
        }],
    )
    return parse_numbered_list(resp.content[0].text, k)

def score_thought(thought: str, goal: str) -> float:
    """Heuristic score from 0-1 for how promising the thought is."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=20,
        messages=[{
            "role": "user",
            "content": (
                f"Goal: {goal}\nProposed step: {thought}\n\n"
                "Rate 0-1 how likely this step is to reach the goal. "
                "Reply with the number only."
            ),
        }],
    )
    try:
        return float(resp.content[0].text.strip())
    except ValueError:
        return 0.0

def tot_search(problem: str, max_depth: int = 4, beam: int = 3) -> str:
    """Beam-search over thought tree."""
    frontier = [problem]
    for depth in range(max_depth):
        candidates = []
        for state in frontier:
            for thought in expand_thoughts(state, k=3):
                score = score_thought(thought, problem)
                candidates.append((score, state + "\n" + thought))
        candidates.sort(reverse=True)
        frontier = [s for _, s in candidates[:beam]]
        if any("FINAL ANSWER" in s for s in frontier):
            return next(s for s in frontier if "FINAL ANSWER" in s)
    return frontier[0]

ToT is expensive — every expansion is an API call. A depth-4 beam-3 search with k=3 candidates per node costs ~36 calls per problem. Reach for it only when CoT and self-consistency have plateaued.

Reasoning models (extended thinking)

Modern reasoning-tuned models (Claude with extended thinking, OpenAI's o-series, DeepSeek-R1) bake CoT into the training and inference loop — they think privately for thousands of tokens before producing a visible answer. From a prompt-engineering perspective they replace the need for explicit "think step by step" wrappers on hard problems, but they have different cost and latency profiles.

Dimension	Prompted CoT	Reasoning model (extended thinking)
Setup	Just a phrase	API parameter (`thinking={"type": "enabled"}`)
Visible reasoning	Yes (in output)	Optional — separate `thinking` block
Cost	Output tokens for reasoning	Billed for thinking tokens separately
Latency	Higher (long output)	Higher (thinking + output)
Best for	Most multi-step tasks	Hard math, code, multi-step planning
Temperature	Free to tune	Must be 1.0
Combines with tools	Yes	Yes (some restrictions)

Extended thinking with Claude

python

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",          # Opus required for extended thinking
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,        # cap on private reasoning tokens
    },
    messages=[{
        "role": "user",
        "content": (
            "A 3-digit number is divisible by 7 and 11. Its digits sum to 13. "
            "What's the largest such number?"
        ),
    }],
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars (hidden)]")
    elif block.type == "text":
        print("Answer:", block.text)

Output:

text

[Thinking: 2104 chars (hidden)]
Answer: 847 (847 = 7 * 11 * 11, digits 8+4+7=19... no, recompute) 

Actually the largest 3-digit multiple of 77 with digit sum 13 is 616 (6+1+6=13).

When you have access to extended thinking, prefer it over prompted CoT for hard multi-step problems — the reasoning is more thorough and doesn't bloat the visible output. Use prompted CoT when the model doesn't support extended thinking (Sonnet, Haiku at time of writing) or when you want the reasoning visible for auditing.

Extended thinking requires temperature=1 and budget_tokens >= 1024. Thinking tokens are billed at the same rate as output tokens. Tool use is supported alongside thinking with some restrictions on the order of content blocks.

Step-Back prompting

Step-Back asks the model to first articulate the general principle before solving the specific instance. The first call extracts an abstract version of the problem ("what's the relevant physics principle?"); the second call applies it to the concrete numbers. Works well for physics, chemistry, and applied-math tasks where the right principle determines everything.

text

PROMPT 1 (extract principle):
Question: A 2 kg ball is dropped from 10 m. What is its KE just before hitting the ground?
What general principle is needed to solve this? Reply in one sentence.

PROMPT 2 (apply principle):
Question: A 2 kg ball is dropped from 10 m. What is its KE just before hitting the ground?
Principle: Conservation of energy — PE at the top equals KE at the bottom.
Apply this principle step by step.

Plan-and-Solve

Plan-and-Solve splits CoT into two phases: first the model writes an explicit plan of steps, then it executes the plan. This addresses a common failure mode where models skip steps or lose track mid-derivation. The plan acts as a contract the execution phase must follow.

text

Problem: {problem}

Step 1 — DEVISE A PLAN.
List the sub-steps needed to solve this problem. Number them 1, 2, 3, ...
Do NOT solve yet.

Step 2 — EXECUTE THE PLAN.
Now work through each numbered step. Show calculations.

Step 3 — FINAL ANSWER.
State the final answer on its own line: "Final answer: <answer>"

Self-critique CoT

Add a critique loop after the initial CoT: ask the model to inspect its own reasoning for errors and produce a revised answer. Two rounds (draft + critique + revise) catches a substantial fraction of arithmetic mistakes and logical gaps without paying for full self-consistency.

python

def cot_with_critique(problem: str) -> str:
    draft = call_claude(
        f"{problem}\n\nThink step by step and give your answer.",
        max_tokens=1500,
    )

    revised = call_claude(
        f"Problem: {problem}\n\n"
        f"My draft solution:\n{draft}\n\n"
        "Review this for arithmetic errors, logical gaps, and misread conditions. "
        "List any issues, then provide a corrected final answer if needed. "
        "If the draft is correct, say 'no changes' and repeat the final answer.",
        max_tokens=1500,
    )
    return revised

When NOT to use CoT

CoT is not free. It adds output tokens (cost + latency), can degrade short-task accuracy by introducing irrelevant reasoning, and is a security risk when the reasoning chain might leak sensitive context to end users. Skip CoT when:

vbnet

☐ The answer is a single token (classification, yes/no, label)
☐ Latency budget is tight (chat UX, autocomplete)
☐ Reasoning chain might expose sensitive sources (RAG over private docs)
☐ The model gets it right zero-shot — measure before adding CoT
☐ Output is for a downstream model that doesn't need the reasoning
☐ The task is style/tone rewriting where reasoning doesn't help

Hiding the chain from end users

CoT reasoning is often verbose and exposes the model's uncertainty in ways that hurt UX. Two patterns: emit reasoning in a tagged region and strip it before display, or use the thinking block (extended thinking) which is structurally separate from text blocks.

python

def user_facing_answer(problem: str) -> str:
    """Run CoT, return only the clean final answer for the user."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": (
                f"{problem}\n\n"
                "Reason in <thinking> tags, then output the user-facing answer in <answer> tags. "
                "The answer must be self-contained and friendly — do not refer to your reasoning."
            ),
        }],
    )
    text = resp.content[0].text
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else text

Common pitfalls

Recurring CoT failure modes and their fixes.

Symptom	Root cause	Fix
Reasoning is right, answer is wrong	Final-answer extraction fails	Use `Final answer:` sentinel + regex; or prefill `<answer>`
Model skips reasoning steps	"Step by step" too soft	Switch to few-shot CoT with worked examples
CoT degrades simple-task accuracy	Reasoning introduces hallucinations	Drop CoT for those tasks; measure before applying broadly
Self-consistency gives same wrong answer 5x	Temperature too low	Raise to 0.7+ for sampling diversity
ToT explodes in cost	k or depth too large	Beam search with beam ≤ 3, depth ≤ 5
Different chains, can't pick best	No scoring function	Add explicit verifier; rerun with self-consistency
Long reasoning truncates final answer	`max_tokens` too small	Increase budget; or split into two calls
Reasoning leaks system prompt to user	No output separation	Use `<thinking>` + `<answer>` tags; strip thinking before display
Inconsistent format on reasoning	No demonstration	Add 1–2 few-shot CoT examples showing the desired format
Reasoning model latency too high	Default thinking budget too large	Lower `budget_tokens` for easier problems

Real-world recipes

End-to-end CoT patterns for high-volume use cases.

Math word problem solver

python

def solve_word_problem(problem: str) -> dict:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                f"Solve this word problem step by step.\n\n"
                f"Problem: {problem}\n\n"
                f"Format:\n"
                f"Step 1: <identify variables and what's being asked>\n"
                f"Step 2: <set up equations or relationships>\n"
                f"Step 3: <solve algebraically>\n"
                f"Step 4: <check the answer is reasonable>\n"
                f"Final answer: <number with units>"
            ),
        }],
    )
    text = resp.content[0].text
    answer = text.split("Final answer:")[-1].strip()
    return {"reasoning": text, "answer": answer}

Multi-step debugger

python

def debug_with_cot(code: str, error: str) -> str:
    return call_claude(f"""You are an expert debugger. Diagnose the failure step by step.

<code>
{code}
</code>

<error>
{error}
</error>

Reason through the failure:
1. What does the error message tell us about the failing line and condition?
2. Trace execution up to that point — what variables hold what values?
3. Identify the root cause (one sentence).
4. Propose a minimal fix as a diff.

End with:
<fix>
<diff>...</diff>
</fix>""", max_tokens=2000)

Decision agent with explicit reasoning

python

def route_ticket(ticket: str) -> dict:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""Route this ticket to a queue. Show your reasoning.

Available queues:
- urgent_billing (P1 billing issues, payment failures)
- general_billing (refunds, invoice questions)
- access (password, login, permissions)
- bug (broken behavior)
- product (feature requests, how-to)

Ticket: {ticket}

Reason in <thinking> tags. Then output:
<queue>queue_name</queue>
<priority>P1|P2|P3|P4</priority>
<reason>one sentence for the routing decision</reason>"""
        }],
    )
    text = resp.content[0].text
    return {
        "queue": re.search(r"<queue>(.*?)</queue>", text).group(1),
        "priority": re.search(r"<priority>(.*?)</priority>", text).group(1),
        "reason": re.search(r"<reason>(.*?)</reason>", text).group(1),
        "raw": text,
    }

Verifier on top of generation

python

def generate_then_verify(problem: str) -> str:
    candidates = [
        call_claude(f"{problem}\n\nThink step by step.", temperature=0.7)
        for _ in range(3)
    ]

    verifier_prompt = (
        f"Problem: {problem}\n\n"
        + "\n\n---\n\n".join(f"Candidate {i+1}:\n{c}" for i, c in enumerate(candidates))
        + "\n\nWhich candidate has the most correct reasoning? Reply with the number and a one-sentence justification, then restate the final answer."
    )
    return call_claude(verifier_prompt, temperature=0.0)

Quick reference

Pattern selection table, indexed by problem shape.

Problem shape	First pattern to try
Single multi-step math problem	Zero-shot CoT
Many similar math problems, varied difficulty	Few-shot CoT with worked examples
Hard problem, can't afford a wrong answer	Self-consistency (N=5)
Combinatorial / planning / game	Tree of Thoughts
Applied physics / chemistry	Step-Back prompting
Multi-step procedural task	Plan-and-Solve
Already have a draft, want to catch errors	Self-critique loop
Have access to extended thinking	Use it instead of prompted CoT
Need to hide reasoning from end users	Tagged CoT + strip before display
Reasoning chain itself is the deliverable	Visible CoT; ask for numbered steps