Chain-of-Thought Prompting
CoT prompting techniques — zero-shot CoT, few-shot CoT, self-consistency, tree of thoughts, and how reasoning models compare with prompted reasoning.
What it is
Chain-of-thought (CoT) prompting elicits step-by-step reasoning from a language model before it commits to a final answer. The technique exploits the fact that transformers produce one token at a time with no "scratchpad" beyond the visible context — by getting the model to write its work, you give it more compute to think with. CoT consistently improves accuracy on multi-step math, logic puzzles, code debugging, and any task where the right answer requires combining several facts. The two foundational variants are zero-shot CoT (an instruction like "think step by step") and few-shot CoT (worked examples that show the desired reasoning style).
Why CoT works
A model that emits "42" as the first token of its answer has had exactly one forward pass to work it out. A model that emits 200 tokens of reasoning before "42" has had 200 forward passes' worth of compute, each conditioned on the previous tokens, effectively building a working memory through its own output. The reasoning tokens are not just for human inspection; they are part of the computation that produces the answer.
| Task type | Direct-answer accuracy (no CoT) | Accuracy with CoT | Gain from CoT |
|---|---|---|---|
| Arithmetic word problems | Low | High | Large |
| Multi-step logic puzzles | Low-medium | High | Large |
| Code debugging | Medium | High | Moderate |
| Factual lookup | High | High | None — CoT doesn't help |
| Style/tone tasks | High | High | None or slight regression |
| Short classification | High | High | None — overhead without benefit |
Apply CoT where reasoning matters; skip it on tasks where the answer is already a lookup or pattern-match. Padding short tasks with CoT just burns tokens for no quality gain.
Zero-shot CoT
The simplest variant: append "Let's think step by step" to the question. No examples, no scaffolding. This phrase (and small variants) reliably triggers reasoning behavior in instruction-tuned models. Stronger variants ask the model to produce reasoning in a tagged region, then a clean final answer outside the tag.
Question: A jug contains 3 liters of a 30% salt solution.
4 liters of pure water are added. What is the new salt concentration?
Let's think step by step.
Tagged CoT (cleaner output parsing)
{problem}
Think step by step inside <reasoning> tags.
After </reasoning>, give ONLY the final answer with no explanation.
import re
import anthropic

client = anthropic.Anthropic()

def solve_with_cot(problem: str) -> tuple[str, str]:
    """Returns (reasoning, answer)."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"{problem}\n\n"
                "Think step by step inside <reasoning> tags. "
                "After </reasoning>, give ONLY the final answer."
            ),
        }],
    )
    text = resp.content[0].text
    # Capture the tagged reasoning, then strip it to leave only the final answer.
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.sub(r"<reasoning>.*?</reasoning>", "", text, flags=re.DOTALL).strip()
    return (reasoning.group(1).strip() if reasoning else "", answer)

reasoning, answer = solve_with_cot(
    "A jug holds 3L of 30% salt solution. Add 4L pure water. New concentration?"
)
print("Answer:", answer)
Output:
Answer: ~12.86% (0.9L salt in 7L total)
For production, ask for the answer in a separate tag like <answer>...</answer> to avoid regex cleanup. Prefilling <reasoning> also works to force the structure.
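The separate-tag approach can be parsed with two searches instead of a search plus a cleanup pass. This is a minimal sketch, assuming the model wraps its work in `<reasoning>` and its result in `<answer>`; `split_tagged` is a hypothetical helper name, and the fallback for a missing `<answer>` tag is one possible choice rather than a required behavior.

```python
import re

def split_tagged(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response that uses <reasoning>/<answer> tags."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning_text = reasoning.group(1).strip() if reasoning else ""
    if answer is None:
        # Fallback (an assumption): treat everything outside the reasoning tags as the answer.
        rest = re.sub(r"<reasoning>.*?</reasoning>", "", text, flags=re.DOTALL)
        return (reasoning_text, rest.strip())
    return (reasoning_text, answer.group(1).strip())

sample = "<reasoning>0.9 L salt in 7 L total.</reasoning>\n<answer>12.86%</answer>"
print(split_tagged(sample))
```

The fallback keeps the function usable when the model forgets the `<answer>` tag, at the cost of occasionally returning stray text.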
Few-shot CoT
When zero-shot CoT produces reasoning that's wrong or in the wrong style, show the model worked examples. The model copies the format and depth of the reasoning chain along with the answer style. This is the most reliable variant for tasks where the kind of reasoning matters (case analysis vs algebraic vs proof by contradiction).
Solve each problem. Show step-by-step reasoning before the final answer.
Q: A train leaves Station A at 9am going 60 mph. Another leaves Station B at 10am
going 80 mph toward A. Stations are 280 miles apart. When do they meet?
A: Let t = hours after 9am.
Train 1 position from A: 60 * t
Train 2 starts at 10am (so t=1 onwards): 80 * (t - 1) from B
They meet when 60t + 80(t-1) = 280
60t + 80t - 80 = 280
140t = 360
t = 2.57 hours after 9am
Final answer: ~11:34 am
Q: If 5 workers can build a wall in 8 days, how long for 8 workers (same rate)?
A: Total worker-days = 5 * 8 = 40
New time = 40 / 8 = 5 days
Final answer: 5 days
Q: {new_problem}
A:
Demonstrations should match the reasoning depth and style you want — show "think it through carefully" examples, not skipped-step examples. The model imitates whatever level of detail it sees.
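When the demonstrations live in data rather than in a hand-written template, the prompt can be assembled programmatically. The sketch below assumes each example is a `(question, reasoning, final_answer)` triple; `build_few_shot_prompt` is a hypothetical helper, and the second question is invented purely for illustration.

```python
def build_few_shot_prompt(examples: list[tuple[str, str, str]], new_problem: str) -> str:
    """Assemble a few-shot CoT prompt from (question, reasoning, final_answer) triples."""
    parts = ["Solve each problem. Show step-by-step reasoning before the final answer."]
    for question, reasoning, final in examples:
        parts.append(f"Q: {question}\nA: {reasoning}\nFinal answer: {final}")
    # Leave the last answer slot open so the model continues in the same format.
    parts.append(f"Q: {new_problem}\nA:")
    return "\n\n".join(parts)

examples = [
    (
        "If 5 workers can build a wall in 8 days, how long for 8 workers (same rate)?",
        "Total worker-days = 5 * 8 = 40\nNew time = 40 / 8 = 5 days",
        "5 days",
    ),
]
prompt = build_few_shot_prompt(examples, "If 3 pumps drain a pool in 6 hours, how long for 9 pumps?")
print(prompt)
```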
Self-consistency
Self-consistency runs the same CoT prompt multiple times at non-zero temperature, collects the final answers, and takes a majority vote. The intuition: many reasoning paths can reach the right answer, while wrong reasoning paths tend to diverge and land on different wrong answers. On hard math problems self-consistency improves accuracy substantially over single-sample CoT, at the cost of N times as many API calls.
import anthropic
from collections import Counter

client = anthropic.Anthropic()

def self_consistency(problem: str, n: int = 5, temperature: float = 0.7) -> dict:
    """Sample n reasoning paths and majority-vote the final answer."""
    answers = []
    for _ in range(n):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1500,
            temperature=temperature,
            messages=[{
                "role": "user",
                "content": (
                    f"{problem}\n\n"
                    "Think step by step. End with: Final answer: <answer>"
                ),
            }],
        )
        text = resp.content[0].text
        # Keep only samples that contain the sentinel; take the first line after it.
        if "Final answer:" in text:
            answer = text.split("Final answer:")[-1].strip().split("\n")[0]
            answers.append(answer)
    if not answers:
        # No sample produced a parseable answer; avoid indexing an empty Counter.
        return {"answer": None, "confidence": 0.0, "all_answers": {}, "samples": n}
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return {
        "answer": winner,
        "confidence": votes / n,  # share of all n samples, including unparseable ones
        "all_answers": dict(counts),
        "samples": n,
    }

result = self_consistency("What is 47 * 83?", n=5)
print(result)
Output:
{'answer': '3901', 'confidence': 1.0, 'all_answers': {'3901': 5}, 'samples': 5}
Use self-consistency on problems where you can extract a clean final answer (a number, a label, a yes/no). For open-ended outputs majority vote doesn't apply — use it only where answers can be compared for equality.
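Because the vote compares answers for equality, strings such as `3901` and `3,901.` should be canonicalized before counting. This is a minimal sketch assuming mostly numeric answers; `normalize_answer` is a hypothetical helper, not part of the code above, and the "first number wins" rule is an assumption that would need adjusting for label or free-text tasks.

```python
import re
from collections import Counter

def normalize_answer(raw: str) -> str:
    """Canonicalize a short answer so equivalent strings vote together."""
    text = raw.strip().lower().rstrip(".")
    # If the answer contains a number, vote on the first number only (an assumption
    # that suits numeric tasks; drop this branch for label or yes/no tasks).
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match:
        number = float(match.group(0).replace(",", ""))
        return str(int(number)) if number.is_integer() else str(number)
    return text

votes = Counter(normalize_answer(a) for a in ["3901", "3,901.", "3901.0", "= 3901", "3911"])
print(votes.most_common(1)[0])
```

Without normalization, the five samples above would split into five distinct strings and the vote would be meaningless.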
Cost-aware self-consistency
You don't always need 5 samples. Run 1 sample first and ask the model to rate its own confidence; if the rating clears a threshold, accept the answer; otherwise escalate to N samples. This adaptive pattern cuts cost dramatically on easy problems while preserving accuracy on hard ones. Self-rated confidence is only a rough signal, so validate the threshold against labeled examples before relying on it.
def adaptive_consistency(problem: str, easy_threshold: float = 0.9) -> dict:
    # First pass: ask for answer + self-rated confidence
    first = call_claude(
        f"{problem}\n\nThink step by step. End with:\n"
        f"Final answer: <answer>\nConfidence (0-1): <num>"
    )
    if extract_confidence(first) >= easy_threshold:
        return {"answer": extract_answer(first), "samples": 1}
    # Escalate
    return self_consistency(problem, n=5)
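The snippet above relies on `call_claude`, `extract_answer`, and `extract_confidence`, which the sheet does not define. One possible shape for them is sketched below; the signatures, the regexes, and the choice to treat an unparseable confidence as 0.0 are all assumptions. The optional `client` argument is also an assumption, added so the helper can be exercised with a stub.

```python
import re

def call_claude(prompt: str, max_tokens: int = 1500, temperature: float = 1.0, client=None) -> str:
    """Send one user message and return the text of the first content block."""
    if client is None:
        import anthropic  # imported lazily so the parsers below work without the SDK
        client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=max_tokens,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def extract_answer(text: str) -> str:
    """Return the text after the last 'Final answer:' sentinel, or '' if absent."""
    matches = re.findall(r"Final answer:\s*(.+)", text)
    return matches[-1].strip() if matches else ""

def extract_confidence(text: str) -> float:
    """Return the self-rated confidence clamped to [0, 1], or 0.0 if it cannot be parsed."""
    match = re.search(r"Confidence \(0-1\):\s*([0-9]*\.?[0-9]+)", text)
    if not match:
        return 0.0  # unparseable confidence is treated as low, which forces escalation
    return min(max(float(match.group(1)), 0.0), 1.0)

reply = "6 * 7 = 42\nFinal answer: 42\nConfidence (0-1): 0.95"
print(extract_answer(reply), extract_confidence(reply))
```

Defaulting to 0.0 on a parse failure errs on the side of spending more samples rather than accepting an unverified answer.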
Tree of Thoughts (ToT)
Tree of Thoughts generalizes CoT by exploring multiple reasoning branches at each step, evaluating partial solutions, and pruning dead ends. Instead of one linear chain, the model emits several candidate next steps, scores them, and continues from the best. ToT works well for tasks with discrete intermediate states (game-playing, planning, combinatorial puzzles) and is overkill for arithmetic or single-pass reasoning.
import anthropic

client = anthropic.Anthropic()

def expand_thoughts(state: str, k: int = 3) -> list[str]:
    """Generate k candidate next steps from the current state."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=600,
        temperature=0.8,
        messages=[{
            "role": "user",
            "content": (
                f"Current state:\n{state}\n\n"
                f"Propose {k} distinct next steps. Number them 1-{k}. "
                f"Each step should be one short paragraph. "
                # Without this instruction the search below can never terminate early.
                f"If a step completes the solution, start it with FINAL ANSWER."
            ),
        }],
    )
    # parse_numbered_list: helper assumed to split the numbered reply into a list of strings
    return parse_numbered_list(resp.content[0].text, k)

def score_thought(thought: str, goal: str) -> float:
    """Heuristic score from 0-1 for how promising the thought is."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=20,
        messages=[{
            "role": "user",
            "content": (
                f"Goal: {goal}\nProposed step: {thought}\n\n"
                "Rate 0-1 how likely this step is to reach the goal. "
                "Reply with the number only."
            ),
        }],
    )
    try:
        return float(resp.content[0].text.strip())
    except ValueError:
        return 0.0  # treat an unparseable score as unpromising

def tot_search(problem: str, max_depth: int = 4, beam: int = 3) -> str:
    """Beam-search over thought tree."""
    frontier = [problem]
    for depth in range(max_depth):
        candidates = []
        for state in frontier:
            for thought in expand_thoughts(state, k=3):
                score = score_thought(thought, problem)
                candidates.append((score, state + "\n" + thought))
        # Keep the `beam` highest-scoring partial solutions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [s for _, s in candidates[:beam]]
        if any("FINAL ANSWER" in s for s in frontier):
            return next(s for s in frontier if "FINAL ANSWER" in s)
    return frontier[0]
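`expand_thoughts` calls `parse_numbered_list`, which is not defined in the sheet. A minimal sketch of one way to write it follows, assuming the model numbers its steps as `1.` or `1)` at the start of a line; the exact marker format is an assumption about model output, not a guarantee.

```python
import re

def parse_numbered_list(text: str, k: int) -> list[str]:
    """Split '1. ... 2. ...' style output into at most k non-empty items."""
    # Split at line starts that look like "1." or "2)"; text before the first marker is dropped.
    pieces = re.split(r"(?m)^\s*\d+[.)]\s+", text)
    items = [p.strip() for p in pieces[1:] if p.strip()]
    return items[:k]

reply = "Here are some options:\n1. Try 4 * 6 first.\n2) Combine 8 and 3.\n3. Subtract 1 from 13."
print(parse_numbered_list(reply, 3))
```

Returning an empty list when no markers are found lets the search simply skip a state whose expansion came back malformed.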
ToT is expensive: every expansion and every score is an API call. A depth-4, beam-3 search with k=3 candidates per node costs about 40 calls per problem (10 expansion calls plus 30 scoring calls). Reach for it only when CoT and self-consistency have plateaued.
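The call budget for a search like `tot_search` can be estimated before running it. The sketch below assumes one API call per expanded state and one per scored candidate. It also assumes every expansion returns exactly k candidates and the search never terminates early; `tot_call_count` is a hypothetical helper.

```python
def tot_call_count(max_depth: int = 4, beam: int = 3, k: int = 3) -> dict:
    """Count API calls for a beam search where each frontier state is expanded
    once (1 call) and each of its k candidates is scored once (k calls)."""
    frontier, expansions, scorings = 1, 0, 0
    for _ in range(max_depth):
        expansions += frontier
        scorings += frontier * k
        # The next frontier holds at most `beam` states.
        frontier = min(beam, frontier * k)
    return {"expansions": expansions, "scorings": scorings, "total": expansions + scorings}

print(tot_call_count())  # {'expansions': 10, 'scorings': 30, 'total': 40}
```

The first level expands a single state, so it costs 4 calls; each of the remaining three levels expands 3 states and costs 12.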
Reasoning models (extended thinking)
Modern reasoning-tuned models (Claude with extended thinking, OpenAI's o-series, DeepSeek-R1) bake CoT into the training and inference loop — they think privately for thousands of tokens before producing a visible answer. From a prompt-engineering perspective they replace the need for explicit "think step by step" wrappers on hard problems, but they have different cost and latency profiles.
| Dimension | Prompted CoT | Reasoning model (extended thinking) |
|---|---|---|
| Setup | Just a phrase | API parameter (thinking={"type": "enabled"}) |
| Visible reasoning | Yes (in output) | Optional — separate thinking block |
| Cost | Reasoning billed as output tokens | Thinking tokens billed as output tokens |
| Latency | Higher (long output) | Higher (thinking + output) |
| Best for | Most multi-step tasks | Hard math, code, multi-step planning |
| Temperature | Free to tune | Must be 1.0 |
| Combines with tools | Yes | Yes (some restrictions) |
Extended thinking with Claude
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",  # must be a model that supports extended thinking
    max_tokens=16000,  # must exceed budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # cap on private reasoning tokens
    },
    messages=[{
        "role": "user",
        "content": (
            "A 3-digit number is divisible by 7 and 11. Its digits sum to 13. "
            "What's the largest such number?"
        ),
    }],
)

# The response interleaves thinking blocks and text blocks; handle each by type.
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars (hidden)]")
    elif block.type == "text":
        print("Answer:", block.text)
Output:
[Thinking: 2104 chars (hidden)]
Answer: 616 (616 = 77 × 8, and 6 + 1 + 6 = 13)
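The result 616 is easy to confirm by brute force. This check is an illustration added alongside the API example rather than part of it: it enumerates every 3-digit number divisible by both 7 and 11 and keeps those whose digits sum to 13.

```python
candidates = [
    n for n in range(100, 1000)
    if n % 7 == 0 and n % 11 == 0 and sum(int(d) for d in str(n)) == 13
]
print(candidates, max(candidates))  # [616] 616
```

Only one number qualifies, so 616 is both the largest and the only solution.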
When you have access to extended thinking, prefer it over prompted CoT for hard multi-step problems: the reasoning is more thorough and doesn't bloat the visible output. Use prompted CoT when the model you are calling doesn't support extended thinking (support varies by model and version, so check the current model documentation) or when you want the reasoning inline in the output for auditing.
Extended thinking requires temperature=1 and budget_tokens >= 1024. Thinking tokens are billed at the same rate as output tokens. Tool use is supported alongside thinking with some restrictions on the order of content blocks.
Step-Back prompting
Step-Back asks the model to first articulate the general principle before solving the specific instance. The first call extracts an abstract version of the problem ("what's the relevant physics principle?"); the second call applies it to the concrete numbers. Works well for physics, chemistry, and applied-math tasks where the right principle determines everything.
PROMPT 1 (extract principle):
Question: A 2 kg ball is dropped from 10 m. What is its KE just before hitting the ground?
What general principle is needed to solve this? Reply in one sentence.
PROMPT 2 (apply principle):
Question: A 2 kg ball is dropped from 10 m. What is its KE just before hitting the ground?
Principle: Conservation of energy — PE at the top equals KE at the bottom.
Apply this principle step by step.
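The two prompts can be chained in code. The sketch below takes the model as a plain callable so it can be run with a stub; `step_back_solve` and `fake_llm` are hypothetical names, and the stub's numeric reply assumes g = 9.8 m/s², which the original prompt does not state.

```python
from typing import Callable

def step_back_solve(question: str, llm: Callable[[str], str]) -> dict:
    """Two-call Step-Back: extract the principle, then apply it to the question."""
    principle = llm(
        f"Question: {question}\n"
        "What general principle is needed to solve this? Reply in one sentence."
    ).strip()
    solution = llm(
        f"Question: {question}\n"
        f"Principle: {principle}\n"
        "Apply this principle step by step."
    )
    return {"principle": principle, "solution": solution}

# Stub model for a dry run; swap in a real API call in practice.
def fake_llm(prompt: str) -> str:
    if "Principle:" in prompt:
        return "KE = m * g * h = 2 * 9.8 * 10 = 196 J"
    return "Conservation of energy: PE at the top equals KE at the bottom."

result = step_back_solve(
    "A 2 kg ball is dropped from 10 m. What is its KE just before hitting the ground?",
    fake_llm,
)
print(result["solution"])
```

Passing the model in as a callable keeps the chaining logic separate from any particular SDK.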
Plan-and-Solve
Plan-and-Solve splits CoT into two phases: first the model writes an explicit plan of steps, then it executes the plan. This addresses a common failure mode where models skip steps or lose track mid-derivation. The plan acts as a contract the execution phase must follow.
Problem: {problem}
Step 1 — DEVISE A PLAN.
List the sub-steps needed to solve this problem. Number them 1, 2, 3, ...
Do NOT solve yet.
Step 2 — EXECUTE THE PLAN.
Now work through each numbered step. Show calculations.
Step 3 — FINAL ANSWER.
State the final answer on its own line: "Final answer: <answer>"
Self-critique CoT
Add a critique loop after the initial CoT: ask the model to inspect its own reasoning for errors and produce a revised answer. Two calls (a draft, then a combined critique-and-revise pass) catch a substantial fraction of arithmetic mistakes and logical gaps without paying for full self-consistency.
def cot_with_critique(problem: str) -> str:
    # Call 1: produce a draft solution with visible reasoning.
    draft = call_claude(
        f"{problem}\n\nThink step by step and give your answer.",
        max_tokens=1500,
    )
    # Call 2: critique the draft and revise it in the same pass.
    revised = call_claude(
        f"Problem: {problem}\n\n"
        f"My draft solution:\n{draft}\n\n"
        "Review this for arithmetic errors, logical gaps, and misread conditions. "
        "List any issues, then provide a corrected final answer if needed. "
        "If the draft is correct, say 'no changes' and repeat the final answer.",
        max_tokens=1500,
    )
    return revised
When NOT to use CoT
CoT is not free. It adds output tokens (cost + latency), can degrade short-task accuracy by introducing irrelevant reasoning, and is a security risk when the reasoning chain might leak sensitive context to end users. Skip CoT when:
☐ The answer is a single token (classification, yes/no, label)
☐ Latency budget is tight (chat UX, autocomplete)
☐ Reasoning chain might expose sensitive sources (RAG over private docs)
☐ The model gets it right zero-shot — measure before adding CoT
☐ Output is for a downstream model that doesn't need the reasoning
☐ The task is style/tone rewriting where reasoning doesn't help
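The "measure before adding CoT" item can be made concrete with a small A/B harness. This is a sketch under stated assumptions: solvers are plain callables returning a final answer string, exact string match is the metric, and the 5-point `min_gain` threshold is an arbitrary placeholder. All names here are hypothetical, and the stub solvers stand in for real model calls.

```python
from typing import Callable

def accuracy(solver: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of (question, expected) pairs the solver answers exactly."""
    correct = sum(solver(q).strip() == expected for q, expected in dataset)
    return correct / len(dataset)

def cot_is_worth_it(direct, cot, dataset, min_gain: float = 0.05) -> bool:
    """Adopt CoT only if it beats the direct prompt by at least min_gain."""
    return accuracy(cot, dataset) - accuracy(direct, dataset) >= min_gain

# Stub solvers standing in for real model calls.
dataset = [("2+2", "4"), ("17*3", "51"), ("capital of France", "Paris")]
direct = lambda q: {"2+2": "4", "17*3": "41", "capital of France": "Paris"}[q]
cot = lambda q: {"2+2": "4", "17*3": "51", "capital of France": "Paris"}[q]
print(accuracy(direct, dataset), accuracy(cot, dataset), cot_is_worth_it(direct, cot, dataset))
```

In practice the dataset should be large enough that a few points of difference are not noise.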
Hiding the chain from end users
CoT reasoning is often verbose and exposes the model's uncertainty in ways that hurt UX. Two patterns: emit reasoning in a tagged region and strip it before display, or use the thinking block (extended thinking) which is structurally separate from text blocks.
def user_facing_answer(problem: str) -> str:
    """Run CoT, return only the clean final answer for the user."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": (
                f"{problem}\n\n"
                "Reason in <thinking> tags, then output the user-facing answer in <answer> tags. "
                "The answer must be self-contained and friendly; do not refer to your reasoning."
            ),
        }],
    )
    text = resp.content[0].text
    # Return only the <answer> contents; fall back to the raw text if the tag is missing.
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else text
Common pitfalls
Recurring CoT failure modes and their fixes.
| Symptom | Root cause | Fix |
|---|---|---|
| Reasoning is right, answer is wrong | Final-answer extraction fails | Use Final answer: sentinel + regex; or prefill <answer> |
| Model skips reasoning steps | "Step by step" too soft | Switch to few-shot CoT with worked examples |
| CoT degrades simple-task accuracy | Reasoning introduces hallucinations | Drop CoT for those tasks; measure before applying broadly |
| Self-consistency gives same wrong answer 5x | Temperature too low | Raise to 0.7+ for sampling diversity |
| ToT explodes in cost | k or depth too large | Beam search with beam ≤ 3, depth ≤ 5 |
| Different chains, can't pick best | No scoring function | Add explicit verifier; rerun with self-consistency |
| Long reasoning truncates final answer | max_tokens too small | Increase budget; or split into two calls |
| Reasoning leaks system prompt to user | No output separation | Use <thinking> + <answer> tags; strip thinking before display |
| Inconsistent format on reasoning | No demonstration | Add 1–2 few-shot CoT examples showing the desired format |
| Reasoning model latency too high | Thinking budget too large for the task | Lower budget_tokens for easier problems |
Real-world recipes
End-to-end CoT patterns for high-volume use cases.
Math word problem solver
def solve_word_problem(problem: str) -> dict:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                f"Solve this word problem step by step.\n\n"
                f"Problem: {problem}\n\n"
                f"Format:\n"
                f"Step 1: <identify variables and what's being asked>\n"
                f"Step 2: <set up equations or relationships>\n"
                f"Step 3: <solve algebraically>\n"
                f"Step 4: <check the answer is reasonable>\n"
                f"Final answer: <number with units>"
            ),
        }],
    )
    text = resp.content[0].text
    # Everything after the last sentinel is treated as the answer.
    answer = text.split("Final answer:")[-1].strip()
    return {"reasoning": text, "answer": answer}
Multi-step debugger
def debug_with_cot(code: str, error: str) -> str:
    return call_claude(f"""You are an expert debugger. Diagnose the failure step by step.
<code>
{code}
</code>
<error>
{error}
</error>
Reason through the failure:
1. What does the error message tell us about the failing line and condition?
2. Trace execution up to that point — what variables hold what values?
3. Identify the root cause (one sentence).
4. Propose a minimal fix as a diff.
End with:
<fix>
<diff>...</diff>
</fix>""", max_tokens=2000)
Decision agent with explicit reasoning
def route_ticket(ticket: str) -> dict:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""Route this ticket to a queue. Show your reasoning.
Available queues:
- urgent_billing (P1 billing issues, payment failures)
- general_billing (refunds, invoice questions)
- access (password, login, permissions)
- bug (broken behavior)
- product (feature requests, how-to)
Ticket: {ticket}
Reason in <thinking> tags. Then output:
<queue>queue_name</queue>
<priority>P1|P2|P3|P4</priority>
<reason>one sentence for the routing decision</reason>"""
        }],
    )
    text = resp.content[0].text

    def tag(name: str) -> str | None:
        # Return the tag's contents, or None if the model omitted the tag,
        # instead of raising AttributeError on a failed match.
        match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        return match.group(1).strip() if match else None

    return {
        "queue": tag("queue"),
        "priority": tag("priority"),
        "reason": tag("reason"),
        "raw": text,
    }
Verifier on top of generation
def generate_then_verify(problem: str) -> str:
    # Sample three independent reasoning chains.
    candidates = [
        call_claude(f"{problem}\n\nThink step by step.", temperature=0.7)
        for _ in range(3)
    ]
    # Ask a deterministic verifier pass to pick the best chain.
    verifier_prompt = (
        f"Problem: {problem}\n\n"
        + "\n\n---\n\n".join(f"Candidate {i+1}:\n{c}" for i, c in enumerate(candidates))
        + "\n\nWhich candidate has the most correct reasoning? Reply with the number and a one-sentence justification, then restate the final answer."
    )
    return call_claude(verifier_prompt, temperature=0.0)
Quick reference
Pattern selection table, indexed by problem shape.
| Problem shape | First pattern to try |
|---|---|
| Single multi-step math problem | Zero-shot CoT |
| Many similar math problems, varied difficulty | Few-shot CoT with worked examples |
| Hard problem, can't afford a wrong answer | Self-consistency (N=5) |
| Combinatorial / planning / game | Tree of Thoughts |
| Applied physics / chemistry | Step-Back prompting |
| Multi-step procedural task | Plan-and-Solve |
| Already have a draft, want to catch errors | Self-critique loop |
| Have access to extended thinking | Use it instead of prompted CoT |
| Need to hide reasoning from end users | Tagged CoT + strip before display |
| Reasoning chain itself is the deliverable | Visible CoT; ask for numbered steps |