cheat sheet
dspy
Package-level reference for DSPy on PyPI — the dspy / dspy-ai rename, install variants, version policy, and alternatives.
What it is
dspy is a framework from Stanford NLP for programming, rather than prompting, language models. You write declarative signatures (input/output schemas), compose them into modules, then run optimisers (formerly called teleprompters) that automatically tune the underlying prompts and few-shot examples for your task.
Reach for DSPy when you want LLM behaviour to be optimised programmatically (rather than hand-tuned via prompt engineering) and you have evaluation data to drive the optimiser. Reach for LangChain or LlamaIndex when you need a broader application framework with chat-history management and pre-built agents; DSPy is more of a compiler for prompts than an application framework.
Install
pip install dspy
Output: (none — exits 0 on success). This is the canonical install on modern releases.
pip install dspy-ai
Output: also valid — dspy-ai is the legacy name on PyPI. Recent releases publish both names to ease the rename, but new code should depend on dspy. Some older tutorials still show dspy-ai.
uv add dspy
Output: dependency resolved + added to pyproject.toml
poetry add dspy
Output: updated lockfile + virtualenv install
Versioning & Python support
- Current line is the 2.x series. DSPy reached 2.0 in 2024 with a major API cleanup: the Predict, ChainOfThought, and ReAct modules stabilised, and the optimiser interface (BootstrapFewShot, MIPROv2, COPRO) was unified.
- Supports Python 3.9+ on recent releases.
- Pin tightly (dspy>=2.5,<3): minor releases have repeatedly renamed optimisers (Teleprompter → Optimizer, MIPRO → MIPROv2) and changed module signatures. Notebooks written against 2.0 may not run on 2.5+ without edits.
- The dspy-ai legacy name is kept in sync with dspy for now, but the Stanford team has signalled it may stop publishing under the old name in a future major. Treat dspy as the long-term path.
Package metadata
- Maintainer: Stanford NLP Group (stanfordnlp org on GitHub); led by Omar Khattab
- Project home: github.com/stanfordnlp/dspy
- Docs: dspy.ai
- PyPI (canonical): pypi.org/project/dspy
- PyPI (legacy): pypi.org/project/dspy-ai
- License: MIT (Apache-2.0 on some files; check LICENSE)
- Governance: academic project (Stanford NLP), with contributions from Databricks, Anthropic, and the broader community
- First released: 2023 (under dspy-ai); renamed to dspy in 2024
- Downloads: hundreds of thousands per month and growing; active research project with frequent releases.
Optional dependencies & extras
DSPy keeps its install lean. Core dependencies pulled in automatically include litellm (the LLM-provider abstraction layer), openai, pydantic, numpy, and tqdm. Published extras as of recent releases:
- dspy[anthropic]: pulls in the anthropic Python SDK for Claude.
- dspy[google]: pulls in google-generativeai for Gemini.
- dspy[aws]: Bedrock support via boto3.
- dspy[chromadb] / dspy[qdrant] / dspy[weaviate] / dspy[pinecone]: retrieval-store integrations for RAG modules.
- dspy[milvus] / dspy[lancedb] / dspy[faiss]: additional retrieval backends.
- dspy[mongodb]: MongoDB Atlas Vector Search.
- dspy[snowflake]: Snowflake Cortex retrieval.
- dspy[mcp]: Model Context Protocol client integration.
- dspy[dev]: test, lint, type-check tooling; for contributing.
Many provider integrations actually come from litellm rather than DSPy itself — DSPy delegates to litellm for non-OpenAI models, so pip install litellm[anthropic] is sometimes the right install rather than a dspy[...] extra.
Alternatives
| Package | Trade-off |
|---|---|
| langchain | Broad application framework with chains, agents, memory, and prompt templates. Use when you need an end-to-end app scaffold; DSPy is narrower and more optimisation-focused. |
| llama-index | Document-indexing-first framework with strong RAG primitives. Use when ingest and retrieval are the centre of your app. |
| haystack (haystack-ai; farm-haystack is the legacy 1.x package) | Production-oriented NLP/RAG pipeline framework from deepset. More opinionated and infrastructure-aware. |
| semantic-kernel | Microsoft's planner-style LLM framework. C#-influenced API; weaker Python community. |
| outlines | Constrained generation library. Solves a different problem (forcing structured output), often paired with DSPy. |
| litellm | Just the provider-abstraction layer DSPy uses internally. Use directly when you want the unified API without DSPy's signature/optimiser model. |
| instructor | Pydantic-typed structured output over OpenAI-compatible APIs. Smaller surface area, no optimiser. |
Common gotchas
- dspy vs dspy-ai on PyPI. Historically published as dspy-ai; renamed to dspy in 2024. Both currently install the same code, but dspy is canonical. Mixing both in one requirements.txt (one from a transitive dep, one direct) can cause duplicate installs.
- Signature-based programming model is unique. A dspy.Signature declares input: type -> output: type shapes with docstrings; DSPy then generates the actual prompt template. New users frequently write the prompt themselves and wonder why optimisers do nothing. Let the signature be the abstraction.
- Teleprompters → Optimisers naming churn. Early DSPy called optimisers teleprompters (the class names were BootstrapFewShot, BootstrapFewShotWithRandomSearch, etc.). Recent versions standardised on "optimiser" in docs but kept the legacy class names. The module path dspy.teleprompt still exists for backward compatibility.
- MIPRO → MIPROv2. The original MIPRO optimiser was superseded by MIPROv2 with a different API surface (different hyperparameters, different signature). Old notebook examples often reference MIPRO; prefer MIPROv2 for new code.
- Caching is on by default and persistent. DSPy caches LLM responses on disk under ~/.dspy_cache (or wherever DSPY_CACHEDIR points). This is great for reproducibility but causes confusing "why didn't my prompt change anything?" symptoms during iteration. When debugging, bypass the cache by constructing the LM with cache=False, as in dspy.LM("openai/gpt-4o-mini", cache=False), or remove the directory.
- Optimisers require evaluation data. BootstrapFewShot.compile(...) needs a metric function and a labelled trainset. Without these, the optimiser has nothing to optimise against, and DSPy degrades to a thin LLM wrapper.
- dspy.LM replaces dspy.OpenAI / dspy.Anthropic. Recent releases unified the LM client around a single dspy.LM("openai/gpt-4o") constructor delegating to litellm. Old tutorials show dspy.OpenAI(model="..."); that still works but is deprecated.
- Module composition is implicit via forward(). A dspy.Module subclass declares sub-modules as attributes and calls them inside forward(). The graph is then introspectable by optimisers. Composing modules without using the dspy.Module base class (e.g. plain functions) hides the graph and breaks optimisation.
Real-world recipes
DSPy recipes that come up across most non-trivial deployments. Each leans on the "signature → module → optimiser" pipeline rather than ad-hoc prompt templating.
Simple Q&A chain with ChainOfThought
The smallest non-trivial DSPy program: a typed signature, a ChainOfThought module that elicits reasoning before the answer, and a one-shot invocation. The reasoning field becomes a first-class output you can inspect, log, or evaluate.
import dspy
import os

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini",
                          api_key=os.environ["OPENAI_API_KEY"]))

class FactualQA(dspy.Signature):
    """Answer a factual question with a short justification."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

cot = dspy.ChainOfThought(FactualQA)
out = cot(question="What year did the Apollo 11 mission land on the Moon?")
print("reasoning:", out.reasoning)
print("answer:", out.answer)
Output: reasoning is the model's chain of thought; answer is the extracted final answer — DSPy synthesises the prompt that elicits both.
RAG-style retriever module
A custom dspy.Module composes a retriever and a generator. The retriever is just another callable returning a list of strings; DSPy doesn't care where it comes from (Chroma, Qdrant, in-memory list — all the same to the module).
import dspy

class RAGSignature(dspy.Signature):
    """Answer the question using the supplied context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class SimpleRAG(dspy.Module):
    def __init__(self, retriever, k: int = 3):
        super().__init__()
        self.retriever = retriever
        self.k = k
        self.generate = dspy.ChainOfThought(RAGSignature)

    def forward(self, question: str):
        passages = self.retriever(question, k=self.k)
        return self.generate(context=passages, question=question)

# Toy retriever: ignores the query and returns the first k passages.
# Replace with a Chroma / Qdrant client.
def toy_retriever(q: str, k: int = 3) -> list[str]:
    corpus = ["Alice works on the Notes team.",
              "Bob ships the deploy tool.",
              "Carol owns observability."]
    return corpus[:k]

rag = SimpleRAG(toy_retriever)
print(rag(question="Who owns observability?").answer)
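Output: an answer such as "Carol owns observability." (exact wording depends on the model). The toy retriever ignores the question and returns the first k passages, so the answer comes from the generator reading all three. The module is now optimisable: pass it to an optimiser (teleprompter) to tune the few-shot demos.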
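Because toy_retriever returns corpus[:k] regardless of the question, it only works while the corpus is tiny. A minimal sketch of a query-aware replacement, using plain token overlap and the standard library only; keyword_retriever, tokenize, and CORPUS are illustrative names, not DSPy APIs, and a real deployment would more likely use an embedding store:

```python
import re

# Same toy corpus as the recipe above.
CORPUS = [
    "Alice works on the Notes team.",
    "Bob ships the deploy tool.",
    "Carol owns observability.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_retriever(q: str, k: int = 3) -> list[str]:
    """Return up to k passages ranked by how many query tokens they share."""
    query_tokens = tokenize(q)
    scored = [(len(query_tokens & tokenize(p)), p) for p in CORPUS]
    # Drop passages with no overlap; the sort is stable, so ties keep corpus order.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [p for _, p in ranked[:k]]

print(keyword_retriever("Who owns observability?", k=1))
# ['Carol owns observability.']
```

It has the same (q, k) call shape as toy_retriever, so it should slot into SimpleRAG(keyword_retriever) unchanged.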
Optimised few-shot pipeline with BootstrapFewShot
The whole point of DSPy: hand the optimiser a training set and a metric, let it select demonstrations. The compiled program is portable across LMs.
import dspy
from dspy.teleprompt import BootstrapFewShot

# Toy training data
trainset = [
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Largest planet?", answer="Jupiter").with_inputs("question"),
    dspy.Example(question="Author of Hamlet?", answer="Shakespeare").with_inputs("question"),
]

def exact_match(example, pred, trace=None) -> bool:
    return example.answer.lower() == pred.answer.strip().lower()

# FactualQA is the signature defined in the first recipe
cot = dspy.ChainOfThought(FactualQA)
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=3)
compiled = optimizer.compile(cot, trainset=trainset)

# `compiled` is a fresh module with selected demos baked into the prompt
out = compiled(question="Capital of Italy?")
print(out.answer)  # e.g. Rome
compiled.save("compiled_qa.json")
Output: the optimiser bootstraps traces, selects high-scoring demos, and produces a new module whose prompts include them. .save() persists the compiled state for later load().
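The exact_match metric above fails on answers like "Paris." or "  paris ", which can starve the optimiser of passing traces. A hedged sketch of a slightly more forgiving metric with the same (example, pred, trace) shape; normalize and normalized_match are illustrative names, and SimpleNamespace stands in for dspy.Example / dspy.Prediction so the snippet runs without an LM:

```python
import re
import string
from types import SimpleNamespace

def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def normalized_match(example, pred, trace=None) -> bool:
    """Compare gold and predicted answers after normalisation."""
    return normalize(example.answer) == normalize(pred.answer)

# SimpleNamespace is only a stand-in for dspy.Example / dspy.Prediction here.
gold = SimpleNamespace(answer="Paris")
print(normalized_match(gold, SimpleNamespace(answer=" paris. ")))  # True
print(normalized_match(gold, SimpleNamespace(answer="Lyon")))      # False
```

Whether to loosen the metric is a judgment call: a lenient metric admits more demos, a strict one admits cleaner ones.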
Mixing retrieval + reasoning + tools (ReAct)
dspy.ReAct is DSPy's agent module — it loops between thought, tool call, and observation until reaching a final answer.
import dspy

def lookup(query: str) -> str:
    """Pretend wiki lookup."""
    return f"[looked up] {query} -> 42"

def calculate(expression: str) -> str:
    """Evaluate an arithmetic expression."""
    # eval() on model-generated input is unsafe; acceptable only in a toy demo.
    return str(eval(expression))

react = dspy.ReAct(
    signature="question -> answer",
    tools=[lookup, calculate],
    max_iters=5,
)
out = react(question="What is 6 * 7?")
print(out.answer)  # e.g. 42 (via calculate)
Output: ReAct decides which tool to call, invokes it, feeds the observation back into the loop, and emits a final answer. Optimisers can tune the tool-selection prompt against a metric.
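The calculate tool above passes model-generated text to eval, which is risky outside a demo. One possible replacement, sketched with the standard library's ast module and restricted to +, -, *, / on numeric literals; safe_calculate is a hypothetical name, and it returns an error string rather than raising so the agent loop can observe the failure:

```python
import ast
import operator

# Whitelisted operators; any other syntax is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _evaluate(node: ast.AST):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")

def safe_calculate(expression: str) -> str:
    """Evaluate simple arithmetic; return an error string for anything else."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return f"error: {exc}"

print(safe_calculate("6 * 7"))  # 42
```

Passing tools=[lookup, safe_calculate] keeps the recipe's shape while refusing function calls, attribute access, and names.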
Programmatic LM swap
A compiled DSPy program is portable across LMs because the prompt template is the program, not the model. Swap models at runtime to A/B test or to use a cheaper model in development.
import os

import dspy

cheap = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
prod = dspy.LM("anthropic/claude-sonnet-4-5",
               api_key=os.environ["ANTHROPIC_API_KEY"])

dspy.configure(lm=cheap)
# `compiled` is the optimised program from the previous recipe
out_cheap = compiled(question="Capital of Spain?")

with dspy.context(lm=prod):
    out_prod = compiled(question="Capital of Spain?")
Output: the program runs against two different providers; the prompt template is identical — only the LM client is swapped. dspy.context(lm=…) scopes the swap to a with block.
Production deployment
DSPy is library code; "deploying" means embedding it in your service and wiring caching, observability, and optimiser-output persistence.
Caching expensive LLM calls
DSPy caches every LM call to disk by default (~/.dspy_cache/ or $DSPY_CACHEDIR). For services with many duplicate prompts, this is free money — repeated requests hit the cache, not the model. The cache key includes prompt, temperature, and model name.
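import os

# Custom cache directory (for a cache shared across processes).
# Set it before importing dspy: recent releases read the variable at import time.
os.environ["DSPY_CACHEDIR"] = "/var/cache/dspy"

import dspy

# Disable caching during interactive development
dev_lm = dspy.LM("openai/gpt-4o-mini", cache=False)
# Caching is on by default; cache=True only makes that explicit
prod_lm = dspy.LM("openai/gpt-4o-mini", cache=True)
dspy.configure(lm=prod_lm)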
Output: subsequent calls with identical inputs return instantly from disk; the LM provider is never contacted. For multi-replica services, point DSPY_CACHEDIR at a shared volume so all replicas hit the same cache.
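To see why a changed temperature or model name misses the cache while a repeated request hits it, here is an illustrative sketch of request-keyed caching. It is not DSPy's actual key derivation; cache_key and the choice of SHA-256 over a sorted JSON payload are assumptions made for the example:

```python
import hashlib
import json

def cache_key(model: str, prompt: str, temperature: float) -> str:
    """Hash the request fields into a stable hex digest."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "temperature": temperature},
        sort_keys=True,  # stable field order, so equal requests hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = cache_key("openai/gpt-4o-mini", "Capital of France?", 0.0)
b = cache_key("openai/gpt-4o-mini", "Capital of France?", 0.0)
c = cache_key("openai/gpt-4o-mini", "Capital of France?", 0.7)
print(a == b)  # True: identical request, same key
print(a == c)  # False: temperature differs, different key
```

The practical consequence is the one described above: editing anything outside the keyed fields will not invalidate an existing entry.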
Optimisation result persistence
A compiled program is a JSON blob of selected demos + tuned instructions. Save it after compilation; load on service start. Never recompile in the hot path.
# Build-time (CI step, notebook, or one-off script)
compiled = optimizer.compile(student, trainset=trainset)
compiled.save("models/qa-v1.json")
# Service startup
program = dspy.ChainOfThought(FactualQA)
program.load("models/qa-v1.json")
Output: the program loads its tuned state from disk at service start; the optimiser runs once at training time, not at request time.
Provider authentication
dspy.LM delegates provider calls to litellm, so auth follows litellm's conventions:
| Provider | Env var |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Google AI Studio | GEMINI_API_KEY |
| Vertex AI | GOOGLE_APPLICATION_CREDENTIALS (path to service-account JSON) |
| AWS Bedrock | standard AWS env / IAM |
| Azure OpenAI | AZURE_API_KEY + AZURE_API_BASE + AZURE_API_VERSION |
| Local (Ollama, llama.cpp) | api_base="http://localhost:11434" + model id |
import dspy
# Local model via Ollama
lm = dspy.LM("ollama/llama3.1:8b", api_base="http://localhost:11434")
dspy.configure(lm=lm)
Output: the same DSPy program runs against a local model with no code changes.
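A missing key usually surfaces as an AuthenticationError on the first request, so a start-up check can fail earlier and more clearly. A small sketch, assuming the provider is the prefix of the model id (as in "openai/gpt-4o-mini") and covering only three rows of the table above; REQUIRED_ENV and missing_env_vars are hypothetical names, and the "gemini" prefix is an assumption about the model id:

```python
import os
from collections.abc import Mapping

# Provider prefix -> required env vars, copied from the table above.
# Only a subset is listed; the "gemini" prefix is an assumption.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
}

def missing_env_vars(model: str, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return env vars the model's provider needs that are unset or empty."""
    provider = model.split("/", 1)[0]
    return [name for name in REQUIRED_ENV.get(provider, []) if not environ.get(name)]

print(missing_env_vars("anthropic/claude-sonnet-4-5", environ={}))
# ['ANTHROPIC_API_KEY']
```

Unknown prefixes (for example a local Ollama model) return an empty list, since they need no key.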
Observability — Langfuse or Phoenix
DSPy's dspy.settings.trace records each predictor call made while a program runs. Integrate with Langfuse or Arize Phoenix for production tracing:
import dspy
from langfuse.openai import openai  # or use the litellm callback

# Enable trace collection
dspy.settings.configure(trace=[])

# Run program; traces accumulate
result = compiled(question="…")

# Inspect / forward traces
for entry in dspy.settings.trace:
    print(entry)
Output: each traced call appears in the list as a (predictor, inputs, outputs) entry; export these to your observability backend. Raw prompts and token usage are kept separately in the LM's call history (lm.history, or dspy.inspect_history() for a readable dump).
Version migration guide
DSPy's 2.x line has been the most stable phase of the project, but minor releases have repeatedly renamed core classes. Pin tightly.
dspy-ai → dspy rename
Historically the package was dspy-ai on PyPI; the canonical name became dspy in 2024. Both currently install the same code but having both in one resolution graph causes duplicate installs.
# Old (legacy)
pip install dspy-ai
# New (canonical)
pip install dspy
# Audit: ensure only one is pinned in your project
pip list | grep -iE "^dspy"
Output: dspy 2.x.x is the only entry; if both appear, remove dspy-ai from requirements.txt and reinstall.
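pip list only shows what is already installed. To catch the duplicate before install, the same audit can be run over the requirement lines themselves. A minimal sketch using the standard library; dspy_requirements is a hypothetical helper, and it only sees direct requirements, not transitive ones:

```python
import re

def dspy_requirements(lines: list[str]) -> list[str]:
    """Return normalised names of dspy / dspy-ai entries found in requirement lines."""
    found = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if not match:
            continue  # blank line, option such as "-r base.txt", etc.
        # PEP 503 normalisation: lowercase, runs of - _ . become one hyphen.
        name = re.sub(r"[-_.]+", "-", match.group(0)).lower()
        if name in {"dspy", "dspy-ai"}:
            found.append(name)
    return found

reqs = ["dspy>=2.5,<3", "requests", "dspy_ai  # legacy pin"]
print(dspy_requirements(reqs))  # ['dspy', 'dspy-ai']
```

If both names come back, drop the dspy-ai line, as in the audit above.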
Teleprompters → Optimizers
Early DSPy named tuning components teleprompters; they were later renamed to optimizers in the docs, while the class names and the module path stayed the same:
# Both optimisers still import from the legacy module path
from dspy.teleprompt import BootstrapFewShot
from dspy.teleprompt import MIPROv2
# Only the docs changed; the module is still named "teleprompt"
Output: docs say "optimizer"; the import path is still dspy.teleprompt.
MIPRO → MIPROv2
The original MIPRO optimiser was superseded by MIPROv2 with a redesigned API. Old notebooks reference MIPRO; convert to MIPROv2:
# Old (deprecated)
from dspy.teleprompt import MIPRO
optimizer = MIPRO(metric=metric, num_candidates=10)
# New
from dspy.teleprompt import MIPROv2
optimizer = MIPROv2(metric=metric, auto="medium")
Output: the new API uses auto="light"/"medium"/"heavy" to set the budget; runs are typically 2–3× faster for equivalent quality.
dspy.OpenAI / dspy.Anthropic → dspy.LM
Recent releases unified the LM client around dspy.LM (delegating to litellm). Old code:
# Deprecated
lm = dspy.OpenAI(model="gpt-4", api_key=...)
lm = dspy.Anthropic(model="claude-sonnet", api_key=...)
# Canonical
lm = dspy.LM("openai/gpt-4")
lm = dspy.LM("anthropic/claude-sonnet-4-5")
Output: any litellm-supported provider works with the unified constructor; provider-specific classes are kept for backward compatibility but are no longer documented.
Testing strategies
DSPy programs are deterministic if you provide a deterministic LM. The standard testing pattern uses a mock LM that returns canned responses.
import dspy
from dspy.utils.dummies import DummyLM

def test_chain_of_thought_extracts_answer():
    # DummyLM returns canned responses for each call in order
    dspy.configure(lm=DummyLM([
        {"reasoning": "It is 1969.", "answer": "1969"},
    ]))
    program = dspy.ChainOfThought("question -> answer")
    out = program(question="When was the Moon landing?")
    assert out.answer == "1969"

def test_compiled_program_round_trip(tmp_path):
    # dspy.Predict exposes .demos directly; ChainOfThought wraps a Predict
    program = dspy.Predict("q -> a")
    program.save(tmp_path / "p.json")
    program2 = dspy.Predict("q -> a")
    program2.load(tmp_path / "p.json")
    assert program2.demos == program.demos
Output: pytest -q runs both in milliseconds; no LM provider contacted.
Key patterns:
- DummyLM([dict, ...]): preset responses; one dict per call, in order.
- dspy.context(lm=DummyLM(...)): scope the mock to a with block.
- dspy.LM(..., cache=False): bypass the disk cache in any test that does call a real LM, so results are not served from a previous run.
- Metric tests: write the metric function as ordinary Python and unit-test it independently of any LM:
def test_metric_exact_match():
    ex = dspy.Example(answer="Paris")
    pred = dspy.Prediction(answer="paris")
    assert exact_match(ex, pred) is True
Troubleshooting common errors
| Symptom | Cause | Fix |
|---|---|---|
| AuthenticationError from provider | Missing or mis-named env var | Set the right var (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) before dspy.configure(lm=...). |
| "Cache hit but prompt seems wrong" during iteration | DSPy is returning a cached response from a previous prompt with the same key | Construct the LM with cache=False, or delete ~/.dspy_cache/. |
| BootstrapFewShot.compile returns identical program to input | No metric supplied, or metric returns 0 for every example | Supply metric=callable; verify the metric returns truthy for at least one example by hand. |
| MIPRO import works but produces no improvement | Using deprecated MIPRO instead of MIPROv2; old MIPRO is a no-op in recent releases | Switch to from dspy.teleprompt import MIPROv2. |
| Optimised program performs worse than baseline | Train/dev overlap or trainset too small (< ~20 examples) | Hold out at least 30% as a separate devset; bump trainset to ~50+. |
| RetryError: Max retries exceeded from litellm | Rate limits or transient provider error | Set dspy.configure(lm=dspy.LM(..., num_retries=5, timeout=60)). |
| Signature field renamed → compiled program now broken | Field names are part of the prompt key; renaming invalidates compiled state | Either keep field names stable or recompile after rename. |
| dspy.OpenAI works but raises DeprecationWarning | Old API surface | Migrate to dspy.LM("openai/gpt-4o"). |
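
The "hold out at least 30%" fix in the table is easiest to apply with a seeded split, so that train and dev never overlap and reruns see the same partition. A sketch with the standard library; split_train_dev is a hypothetical helper, and integers stand in for dspy.Example objects:

```python
import random

def split_train_dev(examples: list, dev_fraction: float = 0.3, seed: int = 0):
    """Shuffle a copy deterministically and hold out dev_fraction as the devset."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Integers stand in for dspy.Example objects.
trainset, devset = split_train_dev(list(range(50)))
print(len(trainset), len(devset))  # 35 15
```

Pass trainset to the optimiser and score the compiled program on devset only.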
See also
- Frameworks: DSPy — signatures, modules, optimisers, ChainOfThought, ReAct
- Concept: RAG — retrieval-augmented generation patterns
- Concept: agents — agentic LLM control flow