cheat sheet

dspy

Package-level reference for DSPy on PyPI — the dspy / dspy-ai rename, install variants, version policy, and alternatives.

#pip #package #ai #llm · updated 05-31-2026

What it is

dspy is a framework from Stanford NLP for programming, rather than prompting, language models. You write declarative signatures (input/output schemas), compose them into modules, then run optimisers (formerly called teleprompters) that automatically tune the underlying prompts and few-shot examples for your task.

Reach for DSPy when you want LLM behaviour to be optimised programmatically (rather than hand-tuned via prompt engineering) and you have evaluation data to drive the optimiser. Reach for LangChain or LlamaIndex when you need a broader application framework with chat-history management and pre-built agents; DSPy is more of a compiler for prompts than an application framework.

Install

bash
pip install dspy

Output: (none — exits 0 on success). This is the canonical install on modern releases.

bash
pip install dspy-ai

Output: same as above. dspy-ai is the legacy name on PyPI; recent releases publish both names to ease the rename, but new code should depend on dspy. Some older tutorials still show dspy-ai.

bash
uv add dspy

Output: dependency resolved + added to pyproject.toml

bash
poetry add dspy

Output: updated lockfile + virtualenv install

Versioning & Python support

  • Current line is the 2.x series. DSPy reached 2.0 in 2024 with a major API cleanup — Predict, ChainOfThought, and ReAct modules stabilised, and the optimiser interface (BootstrapFewShot, MIPROv2, COPRO) was unified.
  • Supports Python 3.9+ on recent releases.
  • Pin tightly (dspy>=2.5,<3): minor releases have repeatedly renamed optimisers (Teleprompter → Optimizer, MIPRO → MIPROv2) and changed module signatures. Notebooks written against 2.0 may not run on 2.5+ without edits.
  • The dspy-ai legacy name is kept in sync with dspy for now, but the Stanford team has signalled it may stop publishing under the old name in a future major. Treat dspy as the long-term path.

Package metadata

  • Maintainer: Stanford NLP Group (stanfordnlp org on GitHub); led by Omar Khattab
  • Project home: github.com/stanfordnlp/dspy
  • Docs: dspy.ai
  • PyPI (canonical): pypi.org/project/dspy
  • PyPI (legacy): pypi.org/project/dspy-ai
  • License: MIT (Apache-2.0 on some files; check LICENSE)
  • Governance: academic project — Stanford NLP, with contributions from Databricks, Anthropic, and the broader community
  • First released: 2023 (under dspy-ai); renamed to dspy in 2024
  • Downloads: hundreds of thousands per month and growing — active research project with frequent releases.

Optional dependencies & extras

DSPy keeps its install lean. Core dependencies pulled in automatically include litellm (the LLM-provider abstraction layer), openai, pydantic, numpy, and tqdm. Published extras as of recent releases:

  • dspy[anthropic] — pulls in anthropic Python SDK for Claude.
  • dspy[google] — pulls in google-generativeai for Gemini.
  • dspy[aws] — Bedrock support via boto3.
  • dspy[chromadb] / dspy[qdrant] / dspy[weaviate] / dspy[pinecone] — retrieval-store integrations for RAG modules.
  • dspy[milvus] / dspy[lancedb] / dspy[faiss] — additional retrieval backends.
  • dspy[mongodb] — MongoDB Atlas Vector Search.
  • dspy[snowflake] — Snowflake Cortex retrieval.
  • dspy[mcp] — Model Context Protocol client integration.
  • dspy[dev] — test, lint, type-check tooling; for contributing.

Many provider integrations actually come from litellm rather than DSPy itself — DSPy delegates to litellm for non-OpenAI models, so pip install litellm[anthropic] is sometimes the right install rather than a dspy[...] extra.

Alternatives

| Package | Trade-off |
| --- | --- |
| langchain | Broad application framework with chains, agents, memory, and prompt templates. Use when you need an end-to-end app scaffold; DSPy is narrower and more optimisation-focused. |
| llama-index | Document-indexing-first framework with strong RAG primitives. Use when ingest and retrieval are the centre of your app. |
| haystack-ai (formerly farm-haystack) | Production-oriented NLP/RAG pipeline framework from deepset. More opinionated and infrastructure-aware. |
| semantic-kernel | Microsoft's planner-style LLM framework. C#-influenced API; weaker Python community. |
| outlines | Constrained generation library. Solves a different problem (forcing structured output), often paired with DSPy. |
| litellm | Just the provider-abstraction layer DSPy uses internally. Use directly when you want the unified API without DSPy's signature/optimiser model. |
| instructor | Pydantic-typed structured output over OpenAI-compatible APIs. Smaller surface area, no optimiser. |

Common gotchas

  1. dspy vs dspy-ai on PyPI. Historically published as dspy-ai; renamed to dspy in 2024. Both currently install the same code, but dspy is canonical. Mixing both in one requirements.txt (one from a transitive dep, one direct) can cause duplicate installs.
  2. Signature-based programming model is unique. A dspy.Signature declares input: type -> output: type shapes with docstrings; DSPy then generates the actual prompt template. New users frequently write the prompt themselves and wonder why optimisers do nothing — let the signature be the abstraction.
  3. Teleprompters → Optimisers naming churn. Early DSPy called optimisers teleprompters (the class names were BootstrapFewShot, BootstrapFewShotWithRandomSearch, etc.). Recent versions standardised on "optimiser" in docs but kept the legacy class names. The module path dspy.teleprompt still exists for backward compatibility.
  4. MIPRO → MIPROv2. The original MIPRO optimiser was superseded by MIPROv2 with a different API surface (different hyperparameters, different signature). Old notebook examples often reference MIPRO; prefer MIPROv2 for new code.
  5. Caching is on by default and persistent. DSPy caches LLM responses on disk under ~/.dspy_cache (or wherever DSPY_CACHEDIR points). This is great for reproducibility but causes confusing "why didn't my prompt change anything?" symptoms during iteration. When debugging, bypass the cache by constructing the client with dspy.LM(..., cache=False), or remove the cache directory.
  6. Optimisers require evaluation data. BootstrapFewShot.compile(...) needs a metric function and a labelled trainset. Without these, the optimiser has nothing to optimise against, and DSPy degrades to a thin LLM wrapper.
  7. dspy.LM replaces dspy.OpenAI / dspy.Anthropic. Recent releases unified the LM client around a single dspy.LM("openai/gpt-4o") constructor delegating to litellm. Old tutorials show dspy.OpenAI(model="...") — that still works but is deprecated.
  8. Module composition is implicit via forward(). A dspy.Module subclass declares sub-modules as attributes and calls them inside forward(). The graph is then introspectable by optimisers. Composing modules without using the dspy.Module base class (e.g. plain functions) hides the graph and breaks optimisation.

Real-world recipes

DSPy recipes that come up across most non-trivial deployments. Each leans on the "signature → module → optimiser" pipeline rather than ad-hoc prompt templating.

Simple Q&A chain with ChainOfThought

The smallest non-trivial DSPy program: a typed signature, a ChainOfThought module that elicits reasoning before the answer, and a one-shot invocation. The reasoning field becomes a first-class output you can inspect, log, or evaluate.

python
import dspy
import os

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini",
                          api_key=os.environ["OPENAI_API_KEY"]))

class FactualQA(dspy.Signature):
    """Answer a factual question with a short justification."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

cot = dspy.ChainOfThought(FactualQA)
out = cot(question="What year did the Apollo 11 mission land on the Moon?")
print("reasoning:", out.reasoning)
print("answer:", out.answer)

Output: reasoning is the model's chain of thought; answer is the extracted final answer — DSPy synthesises the prompt that elicits both.

RAG-style retriever module

A custom dspy.Module composes a retriever and a generator. The retriever is just another callable returning a list of strings; DSPy doesn't care where it comes from (Chroma, Qdrant, in-memory list — all the same to the module).

python
import dspy

class RAGSignature(dspy.Signature):
    """Answer the question using the supplied context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class SimpleRAG(dspy.Module):
    def __init__(self, retriever, k: int = 3):
        super().__init__()
        self.retriever = retriever
        self.k = k
        self.generate = dspy.ChainOfThought(RAGSignature)

    def forward(self, question: str):
        passages = self.retriever(question, k=self.k)
        return self.generate(context=passages, question=question)

# Toy retriever — replace with a Chroma / Qdrant client
def toy_retriever(q: str, k: int = 3) -> list[str]:
    corpus = ["Alice works on the Notes team.",
              "Bob ships the deploy tool.",
              "Carol owns observability."]
    return corpus[:k]

rag = SimpleRAG(toy_retriever)
print(rag(question="Who owns observability?").answer)

Output (typical; the exact wording depends on the model): Carol owns observability. The module is now optimisable: pass it to an optimiser to tune the few-shot demos.
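The toy retriever above ignores the query and always returns the first k passages. As a minimal sketch of a slightly more realistic stand-in, the block below ranks passages by how many words they share with the question. The names tokenize and make_keyword_retriever are hypothetical helpers, not part of DSPy; the returned function keeps the same (q, k) call shape that SimpleRAG expects.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def make_keyword_retriever(corpus: list[str]):
    """Build a retriever(q, k) that ranks passages by shared-word count."""
    def retriever(q: str, k: int = 3) -> list[str]:
        query_tokens = tokenize(q)
        # sorted() is stable, so passages with equal overlap keep corpus order
        ranked = sorted(
            corpus,
            key=lambda passage: len(query_tokens & tokenize(passage)),
            reverse=True,
        )
        return ranked[:k]
    return retriever

corpus = ["Alice works on the Notes team.",
          "Bob ships the deploy tool.",
          "Carol owns observability."]
keyword_retriever = make_keyword_retriever(corpus)
top = keyword_retriever("Who owns observability?", k=1)
print(top)
```

Word overlap is only a placeholder for a real vector store, but it makes the retrieved context depend on the question, which is what the generator step assumes.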

Optimised few-shot pipeline with BootstrapFewShot

The whole point of DSPy: hand the optimiser a training set and a metric, let it select demonstrations. The compiled program is portable across LMs.

python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Toy training data
trainset = [
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Largest planet?",    answer="Jupiter").with_inputs("question"),
    dspy.Example(question="Author of Hamlet?",  answer="Shakespeare").with_inputs("question"),
]

def exact_match(example, pred, trace=None) -> bool:
    return example.answer.lower() == pred.answer.strip().lower()

cot = dspy.ChainOfThought(FactualQA)
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=3)
compiled = optimizer.compile(cot, trainset=trainset)

# `compiled` is a fresh module with selected demos baked into the prompt
out = compiled(question="Capital of Italy?")
print(out.answer)        # -> Rome
compiled.save("compiled_qa.json")

Output: the optimiser bootstraps traces, selects high-scoring demos, and produces a new module whose prompts include them. .save() persists the compiled state for later load().
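A strict string comparison rejects answers that differ only in punctuation or spacing, which can leave the optimiser with too few passing demos. The sketch below shows one possible looser metric with the same (example, pred, trace) shape. The names normalize and normalized_match are hypothetical, and SimpleNamespace objects stand in for dspy.Example and dspy.Prediction so the block runs without an LM.

```python
import string
from types import SimpleNamespace

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def normalized_match(example, pred, trace=None) -> bool:
    """Return True when gold and predicted answers agree after normalisation."""
    return normalize(example.answer) == normalize(pred.answer)

# Stand-ins for dspy.Example / dspy.Prediction: any object with .answer works
gold = SimpleNamespace(answer="Paris")
guess = SimpleNamespace(answer="  paris. ")
matched = normalized_match(gold, guess)
print(matched)
```

Because the metric is plain Python, it can be checked by hand on a few examples before handing it to an optimiser.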

Mixing retrieval + reasoning + tools (ReAct)

dspy.ReAct is DSPy's agent module — it loops between thought, tool call, and observation until reaching a final answer.

python
import dspy

def lookup(query: str) -> str:
    """Pretend wiki lookup."""
    return f"[looked up] {query} -> 42"

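def calculate(expression: str) -> str:
    """Evaluate an arithmetic expression."""
    # eval() on model-generated text is unsafe; demo only. Removing builtins
    # narrows the risk but does not make eval() safe for production use.
    return str(eval(expression, {"__builtins__": {}}, {}))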

react = dspy.ReAct(
    signature="question -> answer",
    tools=[lookup, calculate],
    max_iters=5,
)
out = react(question="What is 6 * 7?")
print(out.answer)        # -> 42 (via calculate)

Output: ReAct decides which tool to call, invokes it, feeds the observation back into the loop, and emits a final answer. Optimisers can tune the tool-selection prompt against a metric.
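A calculator tool receives text written by the model, so evaluating it with eval() is risky. One possible replacement, sketched here with the standard-library ast module, walks the parsed expression and accepts only numbers and basic arithmetic. The name safe_calculate is hypothetical; it keeps the str -> str shape of the tool above.

```python
import ast
import operator

# Arithmetic operators the tool is willing to evaluate
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _evaluate(node: ast.AST):
    """Recursively evaluate a parsed expression, rejecting non-arithmetic nodes."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")

def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic and return the result as a string."""
    tree = ast.parse(expression, mode="eval")
    return str(_evaluate(tree.body))

result = safe_calculate("6 * 7")
print(result)
```

Function calls, attribute access, and names all raise ValueError, so input such as `__import__('os')` is rejected instead of executed.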

Programmatic LM swap

A compiled DSPy program is portable across LMs because the prompt template is the program, not the model. Swap models at runtime to A/B test or to use a cheaper model in development.

python
import dspy
import os

cheap = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
prod  = dspy.LM("anthropic/claude-sonnet-4-5",
                 api_key=os.environ["ANTHROPIC_API_KEY"])

dspy.configure(lm=cheap)
out_cheap = compiled(question="Capital of Spain?")

with dspy.context(lm=prod):
    out_prod = compiled(question="Capital of Spain?")

Output: the program runs against two different providers; the prompt template is identical — only the LM client is swapped. dspy.context(lm=…) scopes the swap to a with block.

Production deployment

DSPy is library code; "deploying" means embedding it in your service and wiring caching, observability, and optimiser-output persistence.

Caching expensive LLM calls

DSPy caches every LM call to disk by default (~/.dspy_cache/ or $DSPY_CACHEDIR). For services with many duplicate prompts, this is free money — repeated requests hit the cache, not the model. The cache key includes prompt, temperature, and model name.

python
import os

# Set the cache directory before importing dspy, since the location is
# resolved when the library loads (a shared path lets processes share a cache)
os.environ["DSPY_CACHEDIR"] = "/var/cache/dspy"

import dspy

# Caching is on by default. Pass cache=False to bypass it for one client,
# for example during interactive development
dev_lm = dspy.LM("openai/gpt-4o-mini", cache=False)
dspy.configure(lm=dev_lm)

Output: subsequent calls with identical inputs return instantly from disk; the LM provider is never contacted. For multi-replica services, point DSPY_CACHEDIR at a shared volume so all replicas hit the same cache.
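To see why an edited prompt misses the cache while an identical request hits it, it can help to picture the key as a hash of the request fields. The sketch below is illustrative only and is not DSPy's actual key derivation; request_cache_key is a hypothetical name, and the three fields are the ones listed above.

```python
import hashlib
import json

def request_cache_key(model: str, prompt: str, temperature: float) -> str:
    """Hash the request fields into a stable hex digest (illustrative only)."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "temperature": temperature},
        sort_keys=True,  # fixed key order keeps the digest reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

first = request_cache_key("openai/gpt-4o-mini", "Capital of France?", 0.0)
repeat = request_cache_key("openai/gpt-4o-mini", "Capital of France?", 0.0)
edited = request_cache_key("openai/gpt-4o-mini", "Capital of Spain?", 0.0)
print(first == repeat, first == edited)
```

Any change to the model, prompt text, or temperature produces a different key, so stale answers usually mean the request really was identical to an earlier one.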

Optimisation result persistence

A compiled program is a JSON blob of selected demos + tuned instructions. Save it after compilation; load on service start. Never recompile in the hot path.

python
# Build-time (CI step, notebook, or one-off script)
compiled = optimizer.compile(student, trainset=trainset)
compiled.save("models/qa-v1.json")

# Service startup
program = dspy.ChainOfThought(FactualQA)
program.load("models/qa-v1.json")

Output: the program loads its tuned state from disk at service start; the optimiser runs once at training time, not at request time.

Provider authentication

DSPy delegates to litellm for non-OpenAI models. Auth follows litellm's conventions:

| Provider | Env var |
| --- | --- |
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Google AI Studio | GEMINI_API_KEY |
| Vertex AI | GOOGLE_APPLICATION_CREDENTIALS (path to service-account JSON) |
| AWS Bedrock | standard AWS env / IAM |
| Azure OpenAI | AZURE_API_KEY + AZURE_API_BASE + AZURE_API_VERSION |
| Local (Ollama, llama.cpp) | api_base="http://localhost:11434" + model id |

python
import dspy

# Local model via Ollama
lm = dspy.LM("ollama/llama3.1:8b", api_base="http://localhost:11434")
dspy.configure(lm=lm)

Output: the same DSPy program runs against a local model with no code changes.
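A missing key usually surfaces only when the first request fails, so a startup check can save a debugging round. The following is a rough sketch using only the standard library: REQUIRED_ENV and missing_env_vars are hypothetical names, the variable names are taken from the table above, and the "gemini" and "vertex_ai" prefixes are assumed to match the provider prefix used in the model id.

```python
import os

# Env vars per provider prefix (a subset of the table above; assumed mapping)
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "vertex_ai": ["GOOGLE_APPLICATION_CREDENTIALS"],
}

def missing_env_vars(model: str, env=None) -> list[str]:
    """Return required variables that are unset or empty for a 'provider/model' id."""
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]
    return [name for name in REQUIRED_ENV.get(provider, []) if not env.get(name)]

missing = missing_env_vars(
    "anthropic/claude-sonnet-4-5",
    env={"OPENAI_API_KEY": "sk-test"},  # fake value for the demonstration
)
print(missing)
```

Providers that are not in the mapping, such as a local Ollama model, return an empty list, so the check never blocks a setup it does not know about.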

Observability — Langfuse or Phoenix

dspy.settings.trace records each predictor call made while a program runs. Integrate with Langfuse or Arize Phoenix for production tracing:

python
import dspy
# Langfuse / Phoenix exporters are set up separately (for example via a litellm callback)

# Enable trace collection
dspy.settings.configure(trace=[])

# Run program; traces accumulate
result = compiled(question="…")

# Inspect / forward traces
for entry in dspy.settings.trace:
    print(entry)

Output: each predictor call appears in the trace list as a (predictor, inputs, prediction) tuple; prompts, responses, and token usage are kept on the LM client's history. Export whichever you need to your observability backend.
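When no tracing backend is wired up yet, wall-clock latency can still be captured around any program call. This is a minimal sketch rather than a DSPy feature: timed_call is a hypothetical helper, and fake_program stands in for a compiled module so the block runs without an LM.

```python
import time

def timed_call(program, **inputs):
    """Call any callable program and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = program(**inputs)
    return result, time.perf_counter() - start

# Stand-in for a compiled DSPy module: any callable taking keyword inputs
def fake_program(question: str) -> str:
    return f"echo: {question}"

answer, seconds = timed_call(fake_program, question="ping")
print(answer, f"{seconds:.6f}s")
```

The same wrapper works around a real module call, since DSPy modules are invoked with keyword arguments in the recipes above.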

Version migration guide

DSPy's 2.x line has been the most stable phase of the project, but minor releases have repeatedly renamed core classes. Pin tightly.

dspy-ai → dspy rename

Historically the package was dspy-ai on PyPI; the canonical name became dspy in 2024. Both currently install the same code but having both in one resolution graph causes duplicate installs.

bash
# Old (legacy)
pip install dspy-ai

# New (canonical)
pip install dspy

# Audit: ensure only one is pinned in your project
pip list | grep -iE "^dspy"

Output: dspy 2.x.x is the only entry; if both appear, remove dspy-ai from requirements.txt and reinstall.
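The same audit can be run from Python, which is convenient inside a CI check or a service's startup code. This sketch uses the standard-library importlib.metadata; installed_versions is a hypothetical helper name.

```python
from importlib import metadata

def installed_versions(names=("dspy", "dspy-ai")) -> dict[str, str]:
    """Map each distribution name that is installed to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed under this name
    return found

versions = installed_versions()
if len(versions) > 1:
    print("both names installed; remove dspy-ai:", versions)
else:
    print("installed:", versions)
```

An empty result means neither name is installed in the current environment.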

Teleprompters → Optimizers

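Early DSPy named its tuning components teleprompters; the docs later switched to "optimizers" while keeping the legacy class names and module path. On recent releases the classes are reachable both from the legacy module path and from the top-level dspy namespace:

python
# Legacy module path
from dspy.teleprompt import BootstrapFewShot, MIPROv2

# Top-level namespace on recent releases (same classes)
import dspy
assert dspy.MIPROv2 is MIPROv2

# The shared base class is still named Teleprompter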

Output: docs say "optimizer"; the import path is still dspy.teleprompt.

MIPRO → MIPROv2

The original MIPRO optimiser was superseded by MIPROv2 with a redesigned API. Old notebooks reference MIPRO; convert to MIPROv2:

python
# Old (deprecated)
from dspy.teleprompt import MIPRO
optimizer = MIPRO(metric=metric, num_candidates=10)

# New
from dspy.teleprompt import MIPROv2
optimizer = MIPROv2(metric=metric, auto="medium")

Output: the new API uses auto="light"/"medium"/"heavy" to set the budget; runs are typically 2–3× faster for equivalent quality.

dspy.OpenAI / dspy.Anthropic → dspy.LM

Recent releases unified the LM client around dspy.LM (delegating to litellm). Old code:

python
# Deprecated
lm = dspy.OpenAI(model="gpt-4", api_key=...)
lm = dspy.Anthropic(model="claude-sonnet", api_key=...)

# Canonical
lm = dspy.LM("openai/gpt-4")
lm = dspy.LM("anthropic/claude-sonnet-4-5")

Output: any litellm-supported provider works with the unified constructor; provider-specific classes are kept for backward compatibility but are no longer documented.

Testing strategies

DSPy programs are deterministic if you provide a deterministic LM. The standard testing pattern uses a mock LM that returns canned responses.

python
import dspy
from dspy.utils.dummies import DummyLM

def test_chain_of_thought_extracts_answer():
    # DummyLM returns canned responses for each call in order
    dspy.configure(lm=DummyLM([
        {"reasoning": "It is 1969.", "answer": "1969"},
    ]))
    program = dspy.ChainOfThought("question -> answer")
    out = program(question="When was the Moon landing?")
    assert out.answer == "1969"

def test_compiled_program_round_trip(tmp_path):
    program = dspy.ChainOfThought("q -> a")
    program.save(tmp_path / "p.json")
    program2 = dspy.ChainOfThought("q -> a")
    program2.load(tmp_path / "p.json")
    # Compare the full serialised state rather than one attribute
    assert program2.dump_state() == program.dump_state()

Output: pytest -q runs both in milliseconds; no LM provider contacted.

Key patterns:

  • DummyLM([dict, ...]) — preset responses; one dict per call, in order.
  • dspy.context(lm=DummyLM(...)) — scope the mock to a with block.
  • Disable the disk cache in tests (for example dspy.LM(..., cache=False) for any real client) so canned responses don't get poisoned by a previous run.
  • Metric tests — write the metric function as ordinary Python and unit-test it independently of any LM:
    python
    def test_metric_exact_match():
        ex = dspy.Example(answer="Paris")
        pred = dspy.Prediction(answer="paris")
        assert exact_match(ex, pred) is True
    

Troubleshooting common errors

| Symptom | Cause | Fix |
| --- | --- | --- |
| AuthenticationError from provider | Missing or mis-named env var | Set the right var (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) before dspy.configure(lm=...). |
| "Cache hit but prompt seems wrong" during iteration | DSPy is returning a cached response from a previous prompt with the same key | Construct the client with dspy.LM(..., cache=False) or delete ~/.dspy_cache/. |
| BootstrapFewShot.compile returns identical program to input | No metric supplied, or metric returns 0 for every example | Supply metric=callable; verify the metric returns truthy for at least one example by hand. |
| MIPRO import works but produces no improvement | Using deprecated MIPRO instead of MIPROv2; old MIPRO is a no-op in recent releases | Switch to from dspy.teleprompt import MIPROv2. |
| Optimised program performs worse than baseline | Train/dev overlap or trainset too small (< ~20 examples) | Hold out at least 30% as a separate devset; bump trainset to ~50+. |
| RetryError: Max retries exceeded from litellm | Rate limits or transient provider error | Set dspy.configure(lm=dspy.LM(..., num_retries=5, timeout=60)). |
| Signature field renamed → compiled program now broken | Field names are part of the prompt key; renaming invalidates compiled state | Either keep field names stable or recompile after rename. |
| dspy.OpenAI works but raises DeprecationWarning | Old API surface | Migrate to dspy.LM("openai/gpt-4o"). |
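The train/dev overlap row is easy to guard against with a seeded split done once, before any optimiser runs. The sketch below is one simple way to do it with the standard library; split_train_dev is a hypothetical helper, and the 30% default mirrors the hold-out suggested in the table.

```python
import random

def split_train_dev(examples: list, dev_fraction: float = 0.3, seed: int = 0):
    """Shuffle a copy of the examples and split it into disjoint train and dev lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    dev_size = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]

# Placeholder items; in practice these would be dspy.Example objects
examples = [f"example-{i}" for i in range(50)]
trainset, devset = split_train_dev(examples)
print(len(trainset), len(devset))
```

Slicing one shuffled list guarantees that no item lands in both sets, and the fixed seed keeps the dev set stable across optimiser runs so scores stay comparable.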

See also