cheat sheet
DSPy
Build LLM programs in DSPy with declarative signatures, modules, and optimisers. Covers Predict, ChainOfThought, ReAct, BootstrapFewShot, COPRO, MIPRO, MIPROv2, and inference compilation.
DSPy — Programmatic Prompting and Optimisation
What it is
DSPy (Declarative Self-improving Python, from Stanford NLP) is a framework that replaces hand-tuned prompt strings with programs: typed Signatures describe the input → output contract; Modules compose them; and optimisers (also called teleprompters) compile the program by searching over few-shot examples, instructions, and demonstrations to maximise a developer-supplied metric. The slogan is "programming, not prompting": you write the logic and let DSPy figure out the prompt.
The novel contribution is inference compilation: given a labelled (or self-labelled) dataset and a metric function, DSPy uses a teacher model to bootstrap traces, then selects/refines few-shot examples and instructions that move the metric. The compiled program is portable across LLMs — swap GPT-4o-mini for Claude Sonnet at runtime without rewriting prompts.
Install
pip install dspy
pip install dspy chromadb sentence-transformers
Output:
Successfully installed dspy-2.x.x ...
The package was renamed from dspy-ai to dspy in 2024. Older tutorials use import dspy with pip install dspy-ai; new installs use pip install dspy and the same import.
Quick example — a Predict module
A signature is "inputs -> outputs". dspy.Predict(signature) turns it into a callable. The LM is configured globally with dspy.configure(lm=...).
import dspy
import os
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)
summarise = dspy.Predict("text -> summary")
out = summarise(text="DSPy compiles LLM programs by optimising prompts against a metric.")
print(out.summary)
Output:
DSPy is a framework that compiles language-model programs by tuning prompts to maximise a developer-defined metric.
When / why to use it
- You have a metric (accuracy, F1, BLEU, judge-LM score) and want the prompt tuned to it instead of guessing.
- Multi-stage LLM pipelines (decompose → retrieve → reason → answer) where hand-tuning every stage is painful.
- You expect to swap models (cheap dev model, premium prod model) without rewriting prompts.
- You want few-shot examples chosen automatically from a training set instead of cherry-picked by hand.
- Reasoning-heavy tasks (math, multi-hop QA, code) where ChainOfThought and ProgramOfThought reliably outperform plain Predict.
Common pitfalls
No metric, no optimisation: supply a metric(example, pred, trace=None) -> float | bool to every metric-driven optimiser. Without it there is nothing to optimise against; BootstrapFewShot, for instance, then accepts every bootstrapped trace as a demo without filtering.
Train/dev contamination — DSPy's optimisers select demos from the trainset and evaluate against a separate devset. Reusing the same examples in both inflates scores. Hold out at least 30% as devset.
Field name → prompt key: signature field names become prompt keys (text, summary, reasoning). Renaming a field invalidates compiled prompts. Pick stable names up front.
Tracing leaks memory: DSPy stores every LM call when dspy.settings.trace = [] is set. Reset traces between batches in long-running services.
Inspect the compiled prompt with dspy.inspect_history(n=1) after running. The exact text sent to the LM (system + few-shots + user) is printed, which is invaluable for debugging metric regressions.
Cache LM calls during optimisation: dspy.LM caches responses by default (cache=True), so leave that on while compiling. Optimisers make hundreds of calls; caching cuts wall-clock time and cost dramatically when iterating on metrics.
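The train/dev contamination pitfall above can be avoided with a seeded, disjoint split. The following is a minimal sketch, not part of DSPy: split_examples is a hypothetical helper, and plain tuples stand in for dspy.Example objects.

```python
import random

def split_examples(examples, dev_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint train and dev lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    # The first n_dev items become the devset; the rest are the trainset.
    return shuffled[n_dev:], shuffled[:n_dev]

# Plain (question, answer) tuples stand in for dspy.Example objects here.
qa_pairs = [(f"question {i}", f"answer {i}") for i in range(10)]
train_pairs, dev_pairs = split_examples(qa_pairs, dev_fraction=0.3, seed=42)

# No example may appear on both sides of the split.
assert not set(train_pairs) & set(dev_pairs)
print(len(train_pairs), len(dev_pairs))
```

Fixing the seed keeps the split stable between runs, so devset scores from different optimisers stay comparable.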
Signatures — the input/output contract
A Signature declares what goes in and what comes out. The simplest form is a string "inputs -> outputs"; the explicit form is a class subclassing dspy.Signature with InputField and OutputField, optionally annotated with descriptions.
import dspy
class GenerateAnswer(dspy.Signature):
    """Answer the question concisely using the context."""
    context: str = dspy.InputField(desc="Relevant background facts.")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="One or two short sentences.")

predictor = dspy.Predict(GenerateAnswer)
result = predictor(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(result.answer)
Output:
The capital of France is Paris.
The docstring becomes the system instruction. Field descriptions become prompt hints. Output types (str, int, float, bool, list[str], Pydantic models) drive parsing — DSPy validates and retries on parse failure.
Modules
Modules are reusable LM programs. The built-in modules wrap a signature with a particular reasoning strategy.
| Module | Strategy |
|---|---|
| dspy.Predict | Direct prediction (no intermediate reasoning). |
| dspy.ChainOfThought | Adds a reasoning field before the final output. |
| dspy.ChainOfThoughtWithHint | Same, with a hint field for evaluation-time guidance. |
| dspy.ProgramOfThought | Generates and executes Python code to produce the answer. |
| dspy.ReAct | Tool-using reasoning loop (Thought → Action → Observation). |
| dspy.MultiChainComparison | Samples N reasoning chains and picks the best. |
| dspy.Retrieve | Calls the configured retriever (RM). |
import dspy
cot = dspy.ChainOfThought("question -> answer")
result = cot(question="A train leaves at 3pm and travels 60 km/h. How far in 2.5 hours?")
print("Reasoning:", result.reasoning)
print("Answer: ", result.answer)
Output:
Reasoning: distance = speed * time = 60 * 2.5 = 150 km.
Answer: 150 km
Custom modules — composing signatures
Subclass dspy.Module and define forward(self, ...). Sub-modules become tunable as a whole.
import dspy
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        passages = self.retrieve(question).passages
        context = "\n\n".join(passages)
        return self.generate(context=context, question=question)
self.retrieve and self.generate are both tunable; optimisers traverse the module tree and compile them jointly.
Configuring LMs and RMs
DSPy talks to LMs through a single dspy.LM(...) interface backed by LiteLLM, so any provider supported by LiteLLM works without extra adapters.
import dspy
import os
# OpenAI
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
# Anthropic
lm = dspy.LM("anthropic/claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
# Local via Ollama
lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")
dspy.configure(lm=lm)
Retrieval models (RMs) are configured similarly:
from dspy.retrieve.chromadb_rm import ChromadbRM
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
rm = ChromadbRM(collection_name="docs", persist_directory="./chroma_db", k=5)
dspy.configure(lm=lm, rm=rm)
Built-in RMs include ChromadbRM, QdrantRM, WeaviateRM, PineconeRM, ColBERTv2, and MarqoRM. Any callable returning passages can also be wrapped.
Datasets and dspy.Example
DSPy expects training data as dspy.Example objects. Mark which fields are inputs with .with_inputs(...); everything else is treated as a label.
import dspy
trainset = [
    dspy.Example(question="2+2", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of Spain?", answer="Madrid").with_inputs("question"),
    dspy.Example(question="Square root of 81?", answer="9").with_inputs("question"),
]
devset = [
    dspy.Example(question="Capital of Italy?", answer="Rome").with_inputs("question"),
]
Metrics
A metric is metric(example, prediction, trace=None) -> float | bool. Optimisers maximise the metric.
def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()
For semantic matching, use an LLM judge:
import dspy
class Judge(dspy.Signature):
    """Given a gold answer and a predicted answer, decide if they are equivalent."""
    gold: str = dspy.InputField()
    predicted: str = dspy.InputField()
    correct: bool = dspy.OutputField()

judge = dspy.Predict(Judge)

def semantic_match(example, pred, trace=None) -> bool:
    return judge(gold=example.answer, predicted=pred.answer).correct
Metrics can use trace to score intermediate steps. For multi-hop QA, reward both final answer correctness and good intermediate retrieval.
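Between exact match and an LLM judge sits a graded metric such as token-level F1. The sketch below follows the metric(example, pred, trace=None) shape described above; SimpleNamespace objects stand in for dspy.Example and the prediction, and the 0.8 threshold applied when a trace is supplied is an illustrative assumption rather than a DSPy default.

```python
from collections import Counter
from types import SimpleNamespace

def tokens(text):
    """Lower-case whitespace tokenisation; deliberately simple."""
    return text.lower().split()

def token_f1(example, pred, trace=None):
    gold = tokens(example.answer)
    guess = tokens(pred.answer)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(gold) & Counter(guess)).values())
    if overlap == 0:
        score = 0.0
    else:
        precision = overlap / len(guess)
        recall = overlap / len(gold)
        score = 2 * precision * recall / (precision + recall)
    # Assumption: when a trace is passed (as during bootstrapping), return a
    # strict pass/fail so only clearly good examples are kept as demos.
    if trace is not None:
        return score >= 0.8
    return score

example = SimpleNamespace(answer="Paris is the capital")
pred = SimpleNamespace(answer="the capital is Paris France")
print(round(token_f1(example, pred), 3))
```

Returning a float for evaluation and a bool when a trace is present lets one function serve both reporting and demo filtering.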
Optimisers — compiling a program
Optimisers (formerly "teleprompters") take an unoptimised program, a trainset, and a metric, and return a compiled program with selected demos and (sometimes) refined instructions.
BootstrapFewShot — the default
Picks few-shot examples from the trainset by running the (uncompiled) program and keeping examples where the metric passes. Cheap, fast, and the default choice for getting a baseline.
import dspy
from dspy.teleprompt import BootstrapFewShot
rag = RAG()
optimiser = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4, max_labeled_demos=4)
compiled_rag = optimiser.compile(rag, trainset=trainset)
max_bootstrapped_demos = demos generated by running the teacher; max_labeled_demos = demos used as-is from the trainset.
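The selection idea behind bootstrapping can be sketched in plain Python: run the program on each training example and keep the ones the metric accepts. This is a conceptual illustration only, not DSPy's implementation; bootstrap_demos and toy_program are hypothetical names, and the toy program replaces a real LM call.

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Keep up to max_demos examples whose prediction passes the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        pred = program(example.question)
        if metric(example, pred):
            demos.append({"question": example.question, "answer": pred.answer})
    return demos

def toy_program(question):
    # Hypothetical stand-in for an LM-backed module: it only knows two facts.
    known = {"2+2": "4", "Capital of Spain?": "Madrid"}
    return SimpleNamespace(answer=known.get(question, "unknown"))

trainset = [
    SimpleNamespace(question="2+2", answer="4"),
    SimpleNamespace(question="Capital of Spain?", answer="Madrid"),
    SimpleNamespace(question="Square root of 81?", answer="9"),
]
demos = bootstrap_demos(toy_program, trainset, exact_match)
print(demos)
```

The third example is dropped because the toy program answers it incorrectly, which is the filtering role the metric plays.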
BootstrapFewShotWithRandomSearch — random search over demos
Runs BootstrapFewShot with multiple random seeds and picks the candidate with the best devset score. Better than BootstrapFewShot for non-trivial budgets.
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
optimiser = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,
    num_candidate_programs=10,
)
compiled = optimiser.compile(rag, trainset=trainset, valset=devset)
COPRO — instruction optimisation
COPRO (Coordinate Prompt Optimisation) refines the natural-language instructions in each signature. It does not change demos; it rewrites the docstring/prompts to be more effective for the task. Use when the system prompt matters more than the few-shots.
from dspy.teleprompt import COPRO
optimiser = COPRO(metric=exact_match, breadth=10, depth=3, init_temperature=1.4)
compiled = optimiser.compile(rag, trainset=trainset, eval_kwargs={"num_threads": 4})
breadth = candidate instructions per step; depth = optimisation steps.
MIPRO and MIPROv2 — joint instruction + demo optimisation
MIPRO (Multi-prompt Instruction Proposal Optimiser) jointly searches over instructions and few-shot demos using Bayesian optimisation. MIPROv2 is the current recommended large-budget optimiser — significantly better than COPRO for most tasks.
from dspy.teleprompt import MIPROv2
optimiser = MIPROv2(
    metric=exact_match,
    auto="medium",
)
compiled = optimiser.compile(
    rag,
    trainset=trainset,
    valset=devset,
    requires_permission_to_run=False,
)
auto="light" runs ~6 trials, "medium" ~12, "heavy" ~25. Each trial is ~100 LM calls — budget accordingly.
MIPROv2 cost: auto="heavy" can spend hundreds of dollars on GPT-4-class teachers for a multi-stage RAG. Always cache, start with auto="light", and confirm the metric trends upward before scaling.
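A rough budget can be worked out from the trial counts and the ~100 calls per trial quoted above. The sketch below is back-of-envelope arithmetic only: the tokens-per-call figure and the price per million tokens are hypothetical inputs, not published rates.

```python
# Approximate trial counts and calls per trial, taken from the text above.
TRIALS = {"light": 6, "medium": 12, "heavy": 25}
CALLS_PER_TRIAL = 100

def estimate_cost(auto, tokens_per_call, usd_per_million_tokens):
    """Return (LM calls, estimated USD) for one optimisation run."""
    calls = TRIALS[auto] * CALLS_PER_TRIAL
    total_tokens = calls * tokens_per_call
    return calls, total_tokens * usd_per_million_tokens / 1_000_000

# Hypothetical inputs: 4,000 tokens per call at 5 USD per million tokens.
calls, usd = estimate_cost("heavy", tokens_per_call=4_000, usd_per_million_tokens=5.0)
print(f"{calls} calls, about ${usd:.2f}")
```

Multi-stage programs multiply the calls per trial, so the same arithmetic with larger inputs reaches the higher figures mentioned above.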
KNN-FewShot — demo retrieval by similarity
For each new query, retrieves the K most similar trainset examples and uses them as few-shot demos. Useful when the task distribution is wide and a fixed demo set generalises poorly.
from dspy.teleprompt import KNNFewShot
optimiser = KNNFewShot(k=4, trainset=trainset, vectorizer=dspy.Embedder("openai/text-embedding-3-small"))
compiled = optimiser.compile(rag, trainset=trainset)
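The retrieval step can be illustrated without an embedding model. In this sketch a toy bag-of-words vector replaces the real embedder, and nearest_demos is a hypothetical helper rather than a DSPy function; it shows only how the K nearest training examples would be chosen per query.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real setup would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest_demos(query, trainset, k=2):
    """Return the k training examples most similar to the query."""
    q = embed(query)
    ranked = sorted(trainset, key=lambda ex: cosine(q, embed(ex["question"])), reverse=True)
    return ranked[:k]

trainset = [
    {"question": "What is the capital of Spain?", "answer": "Madrid"},
    {"question": "What is the square root of 81?", "answer": "9"},
    {"question": "What is the capital of Italy?", "answer": "Rome"},
]
demos = nearest_demos("What is the capital of France?", trainset, k=2)
print([d["answer"] for d in demos])
```

The two capital-city questions rank above the arithmetic one, so a geography query is shown geography demos.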
Saving and loading compiled programs
compiled.save("./compiled_rag.json")
import dspy
fresh = RAG()
fresh.load("./compiled_rag.json")
The JSON contains the chosen demos and (for COPRO/MIPRO) the refined instructions. Commit it to git as a model artefact.
Evaluation harness
dspy.evaluate.Evaluate runs a program over a devset and reports the metric.
from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=devset, metric=exact_match, num_threads=4, display_progress=True)
score = evaluator(compiled_rag)
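# Evaluate returns a percentage (0-100) in DSPy 2.x; newer releases wrap it in a result object with a .score attribute.
print(f"Devset score: {score:.2f}%")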
Output:
Devset score: 87.50%
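Conceptually, the harness averages the metric over the devset. The sketch below shows that calculation in plain Python, reporting the mean as a percentage between 0 and 100 (an assumption about the convention); evaluate and toy_program are hypothetical stand-ins, and nothing here calls an LM.

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

def evaluate(program, devset, metric):
    """Average the metric over the devset and report it as a percentage."""
    if not devset:
        raise ValueError("devset must not be empty")
    total = sum(float(metric(ex, program(ex.question))) for ex in devset)
    return round(100.0 * total / len(devset), 2)

def toy_program(question):
    # Hypothetical stand-in for a compiled module; it gets one answer wrong.
    answers = {
        "Capital of Italy?": "Rome",
        "2+3": "5",
        "Capital of Peru?": "Lima",
        "Largest planet?": "Saturn",
    }
    return SimpleNamespace(answer=answers[question])

devset = [
    SimpleNamespace(question="Capital of Italy?", answer="Rome"),
    SimpleNamespace(question="2+3", answer="5"),
    SimpleNamespace(question="Capital of Peru?", answer="Lima"),
    SimpleNamespace(question="Largest planet?", answer="Jupiter"),
]
score = evaluate(toy_program, devset, exact_match)
print(f"Devset score: {score:.2f}%")
```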
ReAct agents
dspy.ReAct implements the Thought → Action → Observation loop. Tools are Python callables with type-annotated signatures; DSPy generates the JSON-schema description automatically.
import dspy
def calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression."""
    # Stripping builtins does not make eval safe; only pass trusted expressions.
    return float(eval(expression, {"__builtins__": {}}, {}))

def get_population(country: str) -> int:
    """Return the population of a country (rough)."""
    return {"France": 68_000_000, "Spain": 47_000_000}.get(country, -1)

react = dspy.ReAct("question -> answer", tools=[calculator, get_population])
out = react(question="What is the combined population of France and Spain, plus 100?")
print(out.answer)
```
Output:
The combined population of France and Spain plus 100 is 115,000,100.
ReAct is itself a Module — pass it to an optimiser to tune the tool-selection prompt.
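Because the model chooses the tool arguments, a calculator tool is safer when it parses the expression instead of passing it to eval, which can be escaped even with builtins removed. The sketch below is one possible drop-in using the standard ast module; safe_calculator is a hypothetical name, and it deliberately supports only +, -, *, / and unary signs.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def safe_calculator(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Names, calls, attribute access and other operators all end up here.
        raise ValueError(f"Unsupported expression: {expression!r}")
    return float(walk(ast.parse(expression, mode="eval").body))

print(safe_calculator("68000000 + 47000000 + 100"))
```

Anything other than arithmetic on literals raises ValueError, which the agent loop can surface as an observation.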
ProgramOfThought — code-generated answers
For numeric, list, or table tasks, generating Python and executing it beats free-text reasoning.
pot = dspy.ProgramOfThought("question -> answer")
out = pot(question="What is the standard deviation of [10, 12, 23, 23, 16, 23, 21, 16]?")
print(out.answer)
Output:
4.898979485566356
ProgramOfThought executes generated code: use a sandbox (docker, nsjail, or a Restricted Python interpreter) for untrusted inputs.
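The sample output above can be checked by hand with the standard library. This short sketch shows that the printed figure is the population standard deviation; the sample standard deviation, which generated code might equally well choose, is slightly larger.

```python
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]

# Population standard deviation divides by n; this matches the output shown above.
population_sd = statistics.pstdev(data)
# Sample standard deviation divides by n - 1 and is therefore larger.
sample_sd = statistics.stdev(data)

print(population_sd, sample_sd)
```

Stating in the question which of the two is wanted removes this ambiguity from the generated program.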
Assertions — runtime guarantees
dspy.Assert enforces post-conditions. If the assertion fails, DSPy retries with the assertion's message included as feedback.
import dspy
class WriteSummary(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("text -> summary")

    def forward(self, text):
        out = self.cot(text=text)
        dspy.Assert(
            len(out.summary.split()) <= 30,
            "Summary must be 30 words or fewer."
        )
        return out

summariser = dspy.assert_transform_module(WriteSummary())
print(summariser(text="DSPy programs are compiled by optimising prompts against a metric.").summary)
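Use dspy.Suggest for soft constraints (logged but non-blocking) and dspy.Assert for hard ones. DSPy 2.6 and later replace both with dspy.Refine and dspy.BestOfN, so check the installed version before relying on this example.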
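The retry-with-feedback behaviour described above can be sketched independently of DSPy. In this illustration retry_with_feedback and toy_generate are hypothetical; the toy generator replaces an LM call and simply shortens its output once it receives the failure message.

```python
def retry_with_feedback(generate, check, max_retries=2):
    """Call generate until check passes, feeding each failure message back in."""
    feedback = None
    for attempt in range(max_retries + 1):
        output = generate(feedback)
        ok, message = check(output)
        if ok:
            return output, attempt
        feedback = message
    raise ValueError(f"Constraint still failing after {max_retries} retries: {message}")

def word_limit(summary, limit=30):
    """Post-condition: the summary must not exceed the word limit."""
    return len(summary.split()) <= limit, f"Summary must be {limit} words or fewer."

def toy_generate(feedback):
    # Hypothetical stand-in for an LM: verbose first, concise once it sees feedback.
    if feedback is None:
        return " ".join(["word"] * 40)
    return "DSPy compiles prompts against a metric."

summary, retries_used = retry_with_feedback(toy_generate, word_limit)
print(retries_used, summary)
```

A hard constraint raises once retries are exhausted; a soft one would log the message and return the last output instead.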
Real-world recipes
Recipe — multi-hop QA over a knowledge base
Decompose the question, retrieve per sub-question, and synthesise.
import dspy
class MultiHopRAG(dspy.Module):
    def __init__(self, num_hops=2, num_passages=3):
        super().__init__()
        self.num_hops = num_hops
        self.gen_query = dspy.ChainOfThought("context, question -> next_search_query")
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.gen_ans = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            q = self.gen_query(context="\n".join(context), question=question).next_search_query
            context.extend(self.retrieve(q).passages)
        return self.gen_ans(context="\n".join(context), question=question)
Compile with MIPROv2 and a hop-aware metric that rewards good intermediate queries.
Recipe — judge-driven metric
When the task is open-ended, use an LM-as-judge metric. Cache the judge to control cost.
class Faithfulness(dspy.Signature):
    """Is the answer faithful to the context? Output yes or no."""
    context: str = dspy.InputField()
    answer: str = dspy.InputField()
    faithful: bool = dspy.OutputField()

judge = dspy.Predict(Faithfulness)

def faithful_metric(example, pred, trace=None):
    return judge(context=example.context, answer=pred.answer).faithful
Recipe — A/B between two compiled programs
import dspy
from dspy.evaluate import Evaluate
eval_run = Evaluate(devset=devset, metric=exact_match, num_threads=4)
print("Bootstrap:", eval_run(compiled_bootstrap))
print("MIPROv2: ", eval_run(compiled_miprov2))
Promote whichever wins; keep both JSON artefacts in models/ for rollback.
Recipe — swap teacher / student LMs
Optimise once with an expensive teacher, deploy with a cheap student.
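import dspy
from dspy.teleprompt import MIPROv2

teacher = dspy.LM("openai/gpt-4o")
student = dspy.LM("openai/gpt-4o-mini")
# Every LM call inside this block, including the optimiser's own, goes to the teacher.
with dspy.context(lm=teacher):
    compiled = MIPROv2(metric=exact_match, auto="medium").compile(rag, trainset=trainset, valset=devset)
# Deploy: the compiled prompts now run against the cheaper student.
dspy.configure(lm=student)
print(compiled(question="...").answer)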
dspy.context(...) temporarily overrides the LM for the optimisation block.
Recipe — diff two compiled prompts
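import json

def instructions(state):
    # Assumption: newer saves nest the text under "signature"; older ones use "signature_instructions".
    return state.get("signature", {}).get("instructions", state.get("signature_instructions"))

with open("./old.json") as f:
    old = json.load(f)
with open("./new.json") as f:
    new = json.load(f)
for mod in old:
    if mod not in new:
        continue  # module was removed or renamed in the newer program
    if instructions(old[mod]) != instructions(new[mod]):
        print(f"{mod} instructions changed:")
        print("- ", instructions(old[mod]))
        print("+ ", instructions(new[mod]))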
Commit the JSON to git and the diff makes prompt regressions visible in PRs.
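For longer instructions a line-level diff is easier to read than printing both versions in full. The sketch below uses the standard difflib module on two instruction strings; the strings are illustrative examples, not output from a real compiled program.

```python
import difflib

def instruction_diff(old_text, new_text):
    """Return unified-diff lines between two instruction strings."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    ))

# Illustrative instruction strings; in practice these come from the saved JSON files.
old_instructions = "Answer the question concisely using the context."
new_instructions = "Answer the question in one sentence, citing the context."

diff_lines = instruction_diff(old_instructions, new_instructions)
print("\n".join(diff_lines))
```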
Recipe — streaming responses
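import asyncio
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
# A listener on the output field is what makes the stream yield StreamResponse chunks.
stream_predict = dspy.streamify(
    dspy.Predict("topic -> explanation"),
    stream_listeners=[dspy.streaming.StreamListener(signature_field_name="explanation")],
)

async def main():
    async for chunk in stream_predict(topic="Why is the sky blue?"):
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(chunk.chunk, end="", flush=True)

asyncio.run(main())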
dspy.streamify(module) wraps any module for async streaming of its output fields.
Quick reference
| Task | Code |
|---|---|
| Install | pip install dspy |
| Configure LM | dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) |
| Configure RM | dspy.configure(rm=ChromadbRM(...)) |
| Inline signature | dspy.Predict("inputs -> outputs") |
| Class signature | class S(dspy.Signature): ... InputField/OutputField |
| Chain of thought | dspy.ChainOfThought("q -> a") |
| ReAct agent | dspy.ReAct(sig, tools=[f1, f2]) |
| Program of thought | dspy.ProgramOfThought("q -> a") |
| Retrieve | dspy.Retrieve(k=5) |
| Example | dspy.Example(...).with_inputs("q") |
| Bootstrap demos | BootstrapFewShot(metric=m, max_bootstrapped_demos=4) |
| Random search | BootstrapFewShotWithRandomSearch(num_candidate_programs=10) |
| Instruction optim | COPRO(metric=m, breadth=10, depth=3) |
| Joint optim | MIPROv2(metric=m, auto="medium") |
| KNN demos | KNNFewShot(k=4, trainset=ts) |
| Evaluate | Evaluate(devset=ds, metric=m)(program) |
| Inspect history | dspy.inspect_history(n=1) |
| Save compiled | program.save("file.json") |
| Load compiled | program.load("file.json") |
| Stream | dspy.streamify(module) |
| Hard assertion | dspy.Assert(cond, "message") |
| Soft suggestion | dspy.Suggest(cond, "message") |
| Temporary LM | with dspy.context(lm=teacher): ... |