cheat sheet

DSPy

Build LLM programs in DSPy with declarative signatures, modules, and optimisers. Covers Predict, ChainOfThought, ReAct, BootstrapFewShot, COPRO, MIPRO, MIPROv2, and inference compilation.

DSPy — Programmatic Prompting and Optimisation

What it is

DSPy (Declarative Self-improving Python, from Stanford NLP) is a framework that replaces hand-tuned prompt strings with programs: typed Signatures describe the input → output contract, Modules compose them, and optimisers (formerly called teleprompters) compile the program by searching over instructions and few-shot demonstrations to maximise a developer-supplied metric. The slogan is "programming, not prompting": you write the logic and let DSPy work out the prompt.

The novel contribution is inference compilation: given a labelled (or self-labelled) dataset and a metric function, DSPy uses a teacher model to bootstrap traces, then selects and refines the few-shot examples and instructions that move the metric. The program itself is portable across LLMs: you can swap GPT-4o-mini for Claude Sonnet without hand-rewriting prompts, although demos and instructions compiled against one model are usually worth recompiling for another.

Install

bash
# Core package
pip install dspy

# Optional: vector store and embeddings for the retrieval examples
pip install dspy chromadb sentence-transformers

Output:

text
Successfully installed dspy-2.x.x ...

The package was renamed from dspy-ai to dspy in 2024. Older tutorials use import dspy with pip install dspy-ai; new installs use pip install dspy and the same import.

Quick example — a Predict module

A signature is "inputs -> outputs". dspy.Predict(signature) turns it into a callable. The LM is configured globally with dspy.configure(lm=...).

python
import dspy
import os

lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

summarise = dspy.Predict("text -> summary")
out = summarise(text="DSPy compiles LLM programs by optimising prompts against a metric.")
print(out.summary)

Output:

text
DSPy is a framework that compiles language-model programs by tuning prompts to maximise a developer-defined metric.

When / why to use it

  • You have a metric (accuracy, F1, BLEU, judge-LM score) and want the prompt tuned to it instead of guessing.
  • Multi-stage LLM pipelines (decompose → retrieve → reason → answer) where hand-tuning every stage is painful.
  • You expect to swap models (cheap dev model, premium prod model) without rewriting prompts.
  • You want few-shot examples chosen automatically from a training set instead of cherry-picked by hand.
  • Reasoning-heavy tasks (math, multi-hop QA, code) where ChainOfThought and ProgramOfThought often outperform plain Predict.

Common pitfalls

No metric, no optimisation: you must supply a metric(example, pred, trace=None) -> float | bool to the bootstrapping and instruction optimisers. Without it there is nothing to optimise against; BootstrapFewShot, for example, then accepts every bootstrapped trace as a demo regardless of quality.

Train/dev contamination — DSPy's optimisers select demos from the trainset and evaluate against a separate devset. Reusing the same examples in both inflates scores. Hold out at least 30% as devset.

Field name → prompt key — signature field names become prompt keys (text, summary, reasoning). Renaming a field invalidates compiled prompts. Pick stable names up front.

Tracing leaks memory — DSPy stores every LM call when dspy.settings.trace = [] is set. Reset traces between batches in long-running services.

Inspect the compiled prompt with dspy.inspect_history(n=1) after running. The exact text sent to the LM (system + few-shots + user) is printed, which is invaluable for debugging metric regressions.

Keep LM caching on during optimisation. dspy.LM caches responses by default (cache=True), so avoid disabling it: optimisers make hundreds of calls, and cached results cut wall-clock time and cost dramatically when you iterate on metrics.
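
To make the train/dev contamination pitfall above concrete, here is a minimal sketch of a reproducible hold-out split with an overlap check. It uses plain dicts instead of dspy.Example so it runs without DSPy installed; the helper name split_holdout, the seed, and the toy data are illustrative assumptions, not part of DSPy.

```python
import random

def split_holdout(examples, dev_fraction=0.3, seed=0):
    """Shuffle a copy of `examples` and split it into (trainset, devset)."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed => reproducible split
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical toy data standing in for dspy.Example objects.
data = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
trainset, devset = split_holdout(data)

# Guard against contamination: no question may appear in both splits.
overlap = {ex["question"] for ex in trainset} & {ex["question"] for ex in devset}
assert not overlap, f"train/dev overlap: {overlap}"
print(len(trainset), len(devset))
```

Running the overlap check before every compile is cheap and catches accidental reuse when datasets are rebuilt.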

Signatures — the input/output contract

A Signature declares what goes in and what comes out. The simplest form is a string "inputs -> outputs"; the explicit form is a class subclassing dspy.Signature with InputField and OutputField, optionally annotated with descriptions.

python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question concisely using the context."""

    context:  str = dspy.InputField(desc="Relevant background facts.")
    question: str = dspy.InputField()
    answer:   str = dspy.OutputField(desc="One or two short sentences.")

predictor = dspy.Predict(GenerateAnswer)
result = predictor(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(result.answer)

Output:

text
The capital of France is Paris.

The docstring becomes the system instruction. Field descriptions become prompt hints. Output types (str, int, float, bool, list[str], Pydantic models) drive parsing — DSPy validates and retries on parse failure.
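
The string form of a signature can be read as two comma-separated lists of field names around "->". The sketch below illustrates that reading for untyped field names only; parse_signature is a hypothetical helper written for this cheat sheet, not DSPy's own parser.

```python
def parse_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (['a', 'b'], ['c'])."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return inputs, outputs

print(parse_signature("context, question -> answer"))
```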

Modules

Modules are reusable LM programs. The built-in modules wrap a signature with a particular reasoning strategy.

Module                        Strategy
dspy.Predict                  Direct prediction (no intermediate reasoning).
dspy.ChainOfThought           Adds a reasoning field before the final output.
dspy.ChainOfThoughtWithHint   Same, with a hint field for evaluation-time guidance.
dspy.ProgramOfThought         Generates and executes Python code to produce the answer.
dspy.ReAct                    Tool-using reasoning loop (Thought → Action → Observation).
dspy.MultiChainComparison     Samples N reasoning chains and picks the best.
dspy.Retrieve                 Calls the configured retriever (RM).

python
import dspy

cot = dspy.ChainOfThought("question -> answer")
result = cot(question="A train leaves at 3pm and travels 60 km/h. How far in 2.5 hours?")
print("Reasoning:", result.reasoning)
print("Answer:   ", result.answer)

Output:

text
Reasoning: distance = speed * time = 60 * 2.5 = 150 km.
Answer:    150 km

Custom modules — composing signatures

Subclass dspy.Module and define forward(self, ...). Sub-modules assigned as attributes are discovered automatically, so the program is optimised as a whole.

python
import dspy

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        passages = self.retrieve(question).passages
        context  = "\n\n".join(passages)
        return self.generate(context=context, question=question)

self.generate holds the tunable predictor (dspy.Retrieve has no prompt to tune); optimisers traverse the module tree and compile every predictor they find.

Configuring LMs and RMs

DSPy talks to LMs through a single dspy.LM(...) interface backed by LiteLLM, so any provider supported by LiteLLM works without extra adapters.

python
import dspy
import os

# OpenAI
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

# Anthropic
lm = dspy.LM("anthropic/claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])

# Local via Ollama
lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")

dspy.configure(lm=lm)

Retrieval models (RMs) are configured similarly:

python
import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM

# ChromadbRM opens the persisted collection itself, so no separate client is needed.
rm = ChromadbRM(collection_name="docs", persist_directory="./chroma_db", k=5)
dspy.configure(lm=lm, rm=rm)  # `lm` is the dspy.LM created in the previous block

Built-in RMs include ChromadbRM, QdrantRM, WeaviateRM, PineconeRM, ColBERTv2, and MarqoRM. Any callable returning passages can also be wrapped.

Datasets and dspy.Example

DSPy expects training data as dspy.Example objects. Mark which fields are inputs with .with_inputs(...); everything else is treated as a label.

python
import dspy

trainset = [
    dspy.Example(question="2+2", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of Spain?", answer="Madrid").with_inputs("question"),
    dspy.Example(question="Square root of 81?", answer="9").with_inputs("question"),
]

devset = [
    dspy.Example(question="Capital of Italy?", answer="Rome").with_inputs("question"),
]

Metrics

A metric is metric(example, prediction, trace=None) -> float | bool. Optimisers maximise the metric.

python
def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

For semantic matching, use an LLM judge:

python
import dspy

class Judge(dspy.Signature):
    """Given a gold answer and a predicted answer, decide if they are equivalent."""

    gold:      str = dspy.InputField()
    predicted: str = dspy.InputField()
    correct:   bool = dspy.OutputField()

judge = dspy.Predict(Judge)

def semantic_match(example, pred, trace=None) -> bool:
    return judge(gold=example.answer, predicted=pred.answer).correct

Metrics can use trace to score intermediate steps. For multi-hop QA, reward both final answer correctness and good intermediate retrieval.
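
One common pattern is to return a graded score when trace is None (plain evaluation) and a strict pass/fail when a trace is supplied (an optimiser is bootstrapping demos). The sketch below assumes the prediction carries a context list of retrieved passages; that field and the SimpleNamespace stand-ins are illustrative assumptions so the snippet runs without DSPy.

```python
from types import SimpleNamespace

def answer_and_context(example, pred, trace=None):
    """Score the final answer and whether the retrieved context supports it."""
    answer_ok = example.answer.strip().lower() == pred.answer.strip().lower()
    context_ok = any(example.answer.lower() in p.lower() for p in pred.context)
    if trace is None:
        # Evaluation: return a graded score in [0, 1].
        return (answer_ok + context_ok) / 2.0
    # Bootstrapping: only accept a demo if every check passes.
    return answer_ok and context_ok

# Hypothetical stand-ins for dspy.Example and dspy.Prediction.
example = SimpleNamespace(question="Capital of Spain?", answer="Madrid")
pred = SimpleNamespace(answer="madrid", context=["Lisbon is the capital of Portugal."])

print(answer_and_context(example, pred))            # graded score
print(answer_and_context(example, pred, trace=[]))  # strict pass/fail
```

Here the answer is right but the retrieved passage does not support it, so the graded score is partial while the strict check rejects the example as a demo.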

Optimisers — compiling a program

Optimisers (formerly "teleprompters") take an unoptimised program, a trainset, and a metric, and return a compiled program with selected demos and (sometimes) refined instructions.

BootstrapFewShot — the default

Picks few-shot examples from the trainset by running the (uncompiled) program and keeping examples where the metric passes. Cheap, fast, and the default choice for getting a baseline.

python
import dspy
from dspy.teleprompt import BootstrapFewShot

rag = RAG()
optimiser = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4, max_labeled_demos=4)
compiled_rag = optimiser.compile(rag, trainset=trainset)

max_bootstrapped_demos = demos generated by running the teacher; max_labeled_demos = demos used as-is from the trainset.
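
The idea behind bootstrapping can be sketched in a few lines of plain Python: run the program on each training example and keep the outputs that pass the metric, up to a cap. This is a conceptual illustration under simplifying assumptions (a single-step program, dict examples, hypothetical toy_program and toy_metric), not DSPy's implementation.

```python
def bootstrap_demos(program, trainset, metric, max_bootstrapped_demos=4):
    """Keep question/prediction pairs whose prediction passes the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
    return demos

# Hypothetical stand-in for an LM program: it only knows some answers.
KNOWN = {"2+2": "4", "Capital of Spain?": "Madrid"}

def toy_program(question):
    return KNOWN.get(question, "unknown")

def toy_metric(example, prediction):
    return example["answer"].strip().lower() == prediction.strip().lower()

toy_trainset = [
    {"question": "2+2", "answer": "4"},
    {"question": "Capital of Spain?", "answer": "Madrid"},
    {"question": "Square root of 81?", "answer": "9"},
]
demos = bootstrap_demos(toy_program, toy_trainset, toy_metric)
print([d["question"] for d in demos])
```

The third example is dropped because the toy program gets it wrong, which is the filtering step the metric exists for.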

BootstrapFewShotWithRandomSearch — random search over demos

Runs BootstrapFewShot with multiple random seeds and picks the candidate with the best devset score. Usually a better choice than plain BootstrapFewShot when you can afford the extra compile runs.

python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimiser = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,
    num_candidate_programs=10,
)
compiled = optimiser.compile(rag, trainset=trainset, valset=devset)
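
The selection step can be pictured as sampling several candidate demo sets and keeping the one that scores best on held-out data. A rough sketch with a hypothetical scoring function follows; in a real run the score would come from evaluating the program on the validation set.

```python
import random

def random_search(trainset, score_fn, num_candidates=10, demos_per_candidate=2, seed=0):
    """Sample demo subsets and return (best_score, best_demos)."""
    rng = random.Random(seed)
    best_score, best_demos = float("-inf"), []
    for _ in range(num_candidates):
        candidate = rng.sample(trainset, k=min(demos_per_candidate, len(trainset)))
        score = score_fn(candidate)
        if score > best_score:
            best_score, best_demos = score, candidate
    return best_score, best_demos

# Hypothetical scorer: pretend longer demos help on the validation set.
def toy_score(demos):
    return sum(len(d) for d in demos)

pool = ["2+2=4", "Capital of Spain? Madrid", "Square root of 81? 9", "Hi"]
score, chosen = random_search(pool, toy_score)
print(score, chosen)
```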

COPRO — instruction optimisation

COPRO (Coordinate Prompt Optimisation) refines the natural-language instructions in each signature. It does not change demos; it rewrites the docstring/prompts to be more effective for the task. Use when the system prompt matters more than the few-shots.

python
from dspy.teleprompt import COPRO

optimiser = COPRO(metric=exact_match, breadth=10, depth=3, init_temperature=1.4)
compiled = optimiser.compile(rag, trainset=trainset, eval_kwargs={"num_threads": 4})

breadth = candidate instructions per step; depth = optimisation steps.

MIPRO and MIPROv2 — joint instruction + demo optimisation

MIPRO (Multi-prompt Instruction Proposal Optimiser) jointly searches over instructions and few-shot demos using Bayesian optimisation. MIPROv2 is the current recommended large-budget optimiser — significantly better than COPRO for most tasks.

python
from dspy.teleprompt import MIPROv2

optimiser = MIPROv2(
    metric=exact_match,
    auto="medium",
)
compiled = optimiser.compile(
    rag,
    trainset=trainset,
    valset=devset,
    requires_permission_to_run=False,
)

auto="light" runs roughly 6 trials, "medium" roughly 12, and "heavy" roughly 25; exact counts vary between DSPy releases. Each trial costs on the order of 100 LM calls, so budget accordingly.

MIPROv2 cost: auto="heavy" can spend hundreds of dollars on GPT-4-class teachers for a multi-stage RAG. Always cache, start with auto="light", and confirm the metric trends upward before scaling.
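
A back-of-the-envelope estimate helps before launching a run. The sketch below uses the approximate trial counts and calls-per-trial quoted above; the tokens-per-call and price figures are hypothetical placeholders to replace with your own.

```python
# Trial counts and calls-per-trial are the rough figures quoted above.
TRIALS = {"light": 6, "medium": 12, "heavy": 25}
CALLS_PER_TRIAL = 100

def estimate_cost(auto, tokens_per_call, usd_per_million_tokens):
    """Return (total_calls, approximate_usd) for one optimisation run."""
    calls = TRIALS[auto] * CALLS_PER_TRIAL
    usd = calls * tokens_per_call * usd_per_million_tokens / 1_000_000
    return calls, usd

# Hypothetical workload: 3,000 tokens per call at $5 per million tokens.
for mode in TRIALS:
    calls, usd = estimate_cost(mode, tokens_per_call=3_000, usd_per_million_tokens=5.0)
    print(f"{mode:>6}: {calls:>5} calls, about ${usd:.2f}")
```

A multi-stage program issues several LM calls per example, so multiply the estimate by the number of stages.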

KNN-FewShot — demo retrieval by similarity

For each new query, retrieves the K most similar trainset examples and uses them as few-shot demos. Useful when the task distribution is wide and a fixed demo set generalises poorly.

python
from dspy.teleprompt import KNNFewShot

optimiser = KNNFewShot(k=4, trainset=trainset, vectorizer=dspy.Embedder("openai/text-embedding-3-small"))
compiled = optimiser.compile(rag, trainset=trainset)
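
The retrieval step can be sketched without any embedding service. The example below swaps the embedding model for a toy bag-of-words vector and cosine similarity, purely for illustration; embed, cosine, and knn_demos are hypothetical helpers, not DSPy functions.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real setup would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_demos(query, trainset, k=2):
    """Return the k training examples most similar to the query."""
    q = embed(query)
    ranked = sorted(trainset, key=lambda ex: cosine(q, embed(ex["question"])), reverse=True)
    return ranked[:k]

toy_trainset = [
    {"question": "What is the capital of Spain?", "answer": "Madrid"},
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of Italy?", "answer": "Rome"},
]
for demo in knn_demos("What is the capital of France?", toy_trainset):
    print(demo["question"])
```

The two geography questions are selected for the geography query, while the arithmetic example is left out.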

Saving and loading compiled programs

python
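import dspy

# Persist the selected demos and instructions.
compiled.save("./compiled_rag.json")

# Later, for example in the serving process: rebuild the same program class, then load its state.
fresh = RAG()
fresh.load("./compiled_rag.json")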

The JSON contains the chosen demos and (for COPRO/MIPRO) the refined instructions. Commit it to git as a model artefact.

Evaluation harness

dspy.evaluate.Evaluate runs a program over a devset and reports the metric.

python
from dspy.evaluate import Evaluate

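evaluator = Evaluate(devset=devset, metric=exact_match, num_threads=4, display_progress=True)
result = evaluator(compiled_rag)
# DSPy 2.x returns a percentage (0-100) as a float; newer releases return a result object with a .score attribute.
score = getattr(result, "score", result)
print(f"Devset score: {score:.2f}%")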

Output:

text
Devset score: 87.50%

ReAct agents

dspy.ReAct implements the Thought → Action → Observation loop. Tools are Python callables with type-annotated signatures; DSPy generates the JSON-schema description automatically.

python
import dspy

def calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression."""
    # Demo only: eval() can be escaped even with builtins removed, so never pass it untrusted input.
    return float(eval(expression, {"__builtins__": {}}, {}))

def get_population(country: str) -> int:
    """Return the population of a country (rough)."""
    return {"France": 68_000_000, "Spain": 47_000_000}.get(country, -1)

react = dspy.ReAct("question -> answer", tools=[calculator, get_population])
out = react(question="What is the combined population of France and Spain, plus 100?")
print(out.answer)

Output:

text
The combined population of France and Spain plus 100 is 115,000,100.

ReAct is itself a Module — pass it to an optimiser to tune the tool-selection prompt.
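
Because the model chooses the tool arguments, a calculator tool built on eval() is risky. One possible alternative, sketched here with the standard ast module, walks the parsed expression and allows only numbers and basic arithmetic. safe_calculator is a hypothetical replacement for the calculator tool above, not something DSPy provides.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return float(walk(ast.parse(expression, mode="eval")))

print(safe_calculator("68_000_000 + 47_000_000 + 100"))
```

Function calls, attribute access, and names all fall through to the ValueError, so an injected expression is rejected instead of executed.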

ProgramOfThought — code-generated answers

For numeric, list, or table tasks, generating Python and executing it beats free-text reasoning.

python
pot = dspy.ProgramOfThought("question -> answer")
out = pot(question="What is the standard deviation of [10, 12, 23, 23, 16, 23, 21, 16]?")
print(out.answer)

Output:

text
4.898979485566356

ProgramOfThought executes generated code — use a sandbox (docker, nsjail, or a Restricted Python interpreter) for untrusted inputs.
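
The output shown above matches the population standard deviation. The question does not say which variant is wanted, so generated code could just as plausibly return the sample version. A quick standard-library check of both, useful when writing a metric for numeric answers:

```python
import statistics

values = [10, 12, 23, 23, 16, 23, 21, 16]
population_sd = statistics.pstdev(values)  # divides by N
sample_sd = statistics.stdev(values)       # divides by N - 1
print(population_sd, sample_sd)
```

For numeric tasks, a metric that compares with a tolerance (for example math.isclose) is more robust than exact string matching.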

Assertions — runtime guarantees

dspy.Assert enforces post-conditions. If the assertion fails, DSPy retries with the assertion's message included as feedback.

python
import dspy

class WriteSummary(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("text -> summary")

    def forward(self, text):
        out = self.cot(text=text)
        dspy.Assert(
            len(out.summary.split()) <= 30,
            "Summary must be 30 words or fewer."
        )
        return out

summariser = dspy.assert_transform_module(WriteSummary())
print(summariser(text="DSPy programs are compiled by optimising prompts against a metric.").summary)

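Use dspy.Suggest for soft constraints (logged but non-blocking) and dspy.Assert for hard ones. DSPy 2.6 and later replace both with dspy.Refine and dspy.BestOfN, so check which API your installed version provides.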
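
The retry-with-feedback behaviour can be pictured as a small loop: generate, check, and on failure pass the failure message into the next attempt. The sketch below is a conceptual stand-in with a hypothetical toy_summariser in place of an LM call; it does not reproduce DSPy's internals.

```python
def run_with_feedback(generate, check, message, max_retries=2):
    """Call generate(feedback) until check(output) passes or retries run out."""
    feedback = None
    for _ in range(max_retries + 1):
        output = generate(feedback)
        if check(output):
            return output
        feedback = message  # pass the failure message to the next attempt
    raise AssertionError(f"still failing after {max_retries} retries: {message}")

# Hypothetical stand-in for an LM: verbose at first, terse once given feedback.
def toy_summariser(feedback):
    if feedback is None:
        return "word " * 40
    return "DSPy compiles prompts against a metric."

summary = run_with_feedback(
    toy_summariser,
    check=lambda s: len(s.split()) <= 30,
    message="Summary must be 30 words or fewer.",
)
print(summary)
```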

Real-world recipes

Recipe — multi-hop QA over a knowledge base

Decompose the question, retrieve per sub-question, and synthesise.

python
import dspy

class MultiHopRAG(dspy.Module):
    def __init__(self, num_hops=2, num_passages=3):
        super().__init__()
        self.num_hops = num_hops
        self.gen_query = dspy.ChainOfThought("context, question -> next_search_query")
        self.retrieve  = dspy.Retrieve(k=num_passages)
        self.gen_ans   = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            q = self.gen_query(context="\n".join(context), question=question).next_search_query
            context.extend(self.retrieve(q).passages)
        return self.gen_ans(context="\n".join(context), question=question)

Compile with MIPROv2 and a hop-aware metric that rewards good intermediate queries.

Recipe — judge-driven metric

When the task is open-ended, use an LM-as-judge metric. Cache the judge to control cost.

python
class Faithfulness(dspy.Signature):
    """Is the answer faithful to the context? Output yes or no."""

    context:   str = dspy.InputField()
    answer:    str = dspy.InputField()
    faithful: bool = dspy.OutputField()

judge = dspy.Predict(Faithfulness)

def faithful_metric(example, pred, trace=None):
    return judge(context=example.context, answer=pred.answer).faithful

Recipe — A/B between two compiled programs

python
import dspy
from dspy.evaluate import Evaluate

eval_run = Evaluate(devset=devset, metric=exact_match, num_threads=4)
print("Bootstrap:", eval_run(compiled_bootstrap))
print("MIPROv2:  ", eval_run(compiled_miprov2))

Promote whichever wins; keep both JSON artefacts in models/ for rollback.

Recipe — swap teacher / student LMs

Optimise once with an expensive teacher, deploy with a cheap student.

python
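import dspy
from dspy.teleprompt import MIPROv2

teacher = dspy.LM("openai/gpt-4o")       # expensive model used while compiling
student = dspy.LM("openai/gpt-4o-mini")  # cheap model used in production

# Every LM call inside this block goes to the teacher.
with dspy.context(lm=teacher):
    compiled = MIPROv2(metric=exact_match, auto="medium").compile(rag, trainset=trainset, valset=devset)

# Serve the compiled program with the student.
dspy.configure(lm=student)
print(compiled(question="...").answer)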

dspy.context(...) temporarily overrides the LM for the optimisation block.

Recipe — diff two compiled prompts

python
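import json

def instructions(state):
    """Read a predictor's instructions from either saved-state layout."""
    # Assumption: newer DSPy files nest them under "signature"; older ones use "signature_instructions".
    return state.get("signature", {}).get("instructions", state.get("signature_instructions"))

with open("./old.json") as f:
    old = json.load(f)
with open("./new.json") as f:
    new = json.load(f)

for mod in old:
    if mod == "metadata" or mod not in new:  # skip the version metadata entry and removed modules
        continue
    if instructions(old[mod]) != instructions(new[mod]):
        print(f"{mod} instructions changed:")
        print("- ", instructions(old[mod]))
        print("+ ", instructions(new[mod]))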

Commit the JSON to git so that diffs make prompt regressions visible in PRs.
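
For longer instructions, a line-level diff is easier to review than printing both versions in full. A minimal sketch using the standard difflib module follows; the two instruction strings are hypothetical examples, and instruction_diff is an illustrative helper.

```python
import difflib

def instruction_diff(old_text: str, new_text: str) -> str:
    """Return a unified diff between two instruction strings."""
    lines = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    )
    return "\n".join(lines)

# Hypothetical instructions, as might be stored in two compiled artefacts.
old_instructions = "Answer the question concisely using the context."
new_instructions = "Answer the question in one sentence, citing the context."
print(instruction_diff(old_instructions, new_instructions))
```

An empty string means the instructions are unchanged, which makes the helper easy to use in a CI check.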

Recipe — streaming responses

python
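import asyncio
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# A StreamListener names the output field whose tokens should be streamed.
stream_predict = dspy.streamify(
    dspy.Predict("topic -> explanation"),
    stream_listeners=[dspy.streaming.StreamListener(signature_field_name="explanation")],
)

async def main():
    async for chunk in stream_predict(topic="Why is the sky blue?"):
        # Token chunks arrive as StreamResponse; the final item is the full Prediction.
        if isinstance(chunk, dspy.streaming.StreamResponse):
            print(chunk.chunk, end="", flush=True)

asyncio.run(main())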

dspy.streamify(module) wraps any module for async streaming of its output fields.

Quick reference

Task                Code
Install             pip install dspy
Configure LM        dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
Configure RM        dspy.configure(rm=ChromadbRM(...))
Inline signature    dspy.Predict("inputs -> outputs")
Class signature     class S(dspy.Signature): ... InputField/OutputField
Chain of thought    dspy.ChainOfThought("q -> a")
ReAct agent         dspy.ReAct(sig, tools=[f1, f2])
Program of thought  dspy.ProgramOfThought("q -> a")
Retrieve            dspy.Retrieve(k=5)
Example             dspy.Example(...).with_inputs("q")
Bootstrap demos     BootstrapFewShot(metric=m, max_bootstrapped_demos=4)
Random search       BootstrapFewShotWithRandomSearch(num_candidate_programs=10)
Instruction optim   COPRO(metric=m, breadth=10, depth=3)
Joint optim         MIPROv2(metric=m, auto="medium")
KNN demos           KNNFewShot(k=4, trainset=ts)
Evaluate            Evaluate(devset=ds, metric=m)(program)
Inspect history     dspy.inspect_history(n=1)
Save compiled       program.save("file.json")
Load compiled       program.load("file.json")
Stream              dspy.streamify(module)
Hard assertion      dspy.Assert(cond, "message")
Soft suggestion     dspy.Suggest(cond, "message")
Temporary LM        with dspy.context(lm=teacher): ...