cheat sheet

guidance

Package-level reference for the guidance library on PyPI — install, LLM-provider extras, versioning, and alternatives like instructor and outlines.

What it is

guidance is a Python library for interleaving generation and constraints in LLM prompts — letting the model fill in slots, choose from enums, match regexes, and follow grammars while the host program retains control of the prompt structure. Originally a Microsoft Research project, it pioneered the "constrained decoding" approach that several modern structured-output libraries now share.

The library has gone through a significant API evolution. Early guidance (the 0.0.x series) was tightly coupled to constrained-token decoding with specific local models; the rewrite (0.1.x+) generalised to chat models and remote APIs and uses a more compositional gen() / select() / assistant() interface, imported directly from the guidance package.

Install

bash
pip install guidance

Output: installs the core library — no LLM provider bundled

bash
uv add guidance

Output: dependency resolved + added to pyproject.toml

bash
poetry add guidance

Output: updated lockfile + virtualenv install

bash
pip install guidance openai      # for OpenAI/Azure
pip install guidance anthropic   # for Claude
pip install guidance transformers torch  # for local HF models

Output: add whichever provider SDK you intend to drive

Versioning & Python support

  • Current line is 0.1.x / 0.2.x (as of late 2025). Pre-1.0 — minor releases occasionally shuffle the public API. Pin tight in production.
  • Python 3.9+ on current releases.
  • The library was rewritten between the 0.0.x and 0.1.x series — old tutorials using {{gen 'name'}} Handlebars-style templates do not run on current versions. Modern API uses Python composition.
  • Local-model constrained decoding requires a compatible transformers install; remote APIs (OpenAI, Anthropic) use the provider SDKs directly.
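
A minimal sketch of a startup guard for the tight pinning advised above. The helper names (parse_version, within_pin, check_installed) and the example bounds are illustrative assumptions, not part of guidance; only importlib.metadata from the standard library is used.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.1.16' into (0, 1, 16); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def within_pin(installed, lower, upper):
    """True when lower <= installed < upper, compared as numeric tuples."""
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)


def check_installed(package, lower, upper):
    """Return (version, ok); version is None if the package is absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return version, within_pin(version, lower, upper)
```

Calling check_installed("guidance", "0.1", "0.2") at service startup would flag an unexpected minor-version jump before any generation code runs; the bounds shown are placeholders for whatever range you have actually tested.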

Package metadata

  • Maintainer: Originally Microsoft Research; now community-maintained under the guidance-ai GitHub org
  • Project home: github.com/guidance-ai/guidance
  • Docs: guidance.readthedocs.io (when available; primary docs are in the GitHub README)
  • PyPI: pypi.org/project/guidance
  • License: MIT
  • Governance: community-led after Microsoft handed off; active contributor base
  • First released: 2023

Optional dependencies & extras

guidance is relatively self-contained. It does not bundle provider SDKs — install the ones you need:

Provider | Pip target
OpenAI / Azure OpenAI | openai
Anthropic | anthropic
Google Gemini | google-generativeai
Local Hugging Face | transformers, torch, accelerate
llama.cpp | llama-cpp-python

Core deps the package pulls in itself:

  • numpy
  • pydantic
  • tiktoken (for token-level constraints)
  • requests

There are no major [extra] groups — the canonical install is pip install guidance <provider-sdk>.

Alternatives

Package | Trade-off
instructor | Pydantic-first structured output for OpenAI, Anthropic, etc. Simpler API; doesn't do token-level constrained decoding.
outlines | Regex / grammar / Pydantic-constrained generation against local HF or Mistral models. Strong on local-model constraint enforcement.
marvin | Higher-level "AI functions" / prompt-as-function approach. Very Pythonic; lighter on grammar control.
lmql | Declarative prompting language. Different paradigm; smaller community.
jsonformer | Token-level JSON-schema enforcement for local models. Narrower scope than guidance.
Provider-native JSON mode / tool use | Built into OpenAI, Anthropic, Gemini SDKs. Use when constraints are simple enough to be expressed as JSON Schema.

Common gotchas

  1. 0.0.x Handlebars tutorials don't run. If a snippet uses {{#gen}} or {{select}} template syntax, it's pre-rewrite. Migrate to the Python composition API (with assistant(): lm += gen(...)) on current releases.
  2. LLM-provider compatibility varies. Token-level constrained decoding only works with local models where guidance can intercept logits. Remote APIs (OpenAI, Anthropic) fall back to grammar-style retry-on-mismatch, which is slower and less guaranteed.
  3. tiktoken mismatch. OpenAI's tokenizer ships with guidance's deps, but model-specific BPE updates can lag — pin tiktoken alongside major OpenAI model upgrades.
  4. Async support is partial. Some operations are sync-only; mixing guidance with FastAPI/asyncio requires asyncio.to_thread() for the generation calls.
  5. Local-model GPU placement. As with plain transformers, a model that does not fit on one GPU needs device_map="auto" plus accelerate to shard it across devices or offload part of it. guidance doesn't manage this for you.
  6. Maintenance cadence. After Microsoft handed off, release cadence is community-driven and slower than instructor or outlines. Check the issue tracker before betting a production stack on a niche feature.
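
The asyncio.to_thread() advice in gotcha 4 can be sketched as follows. blocking_generate is a hypothetical stand-in for a synchronous guidance call; it only sleeps and echoes so the example runs without a model. Whether a single model object is safe to share across threads is not something this sketch establishes.

```python
import asyncio
import time


def blocking_generate(prompt):
    """Hypothetical stand-in for a synchronous generation call."""
    time.sleep(0.05)
    return f"generated for: {prompt}"


async def generate_async(prompt):
    # Run the sync call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_generate, prompt)


async def main():
    # The two requests overlap instead of running back to back.
    return await asyncio.gather(generate_async("a"), generate_async("b"))


results = asyncio.run(main())
```

Inside a FastAPI or aiohttp handler you would await generate_async(...) directly rather than calling asyncio.run.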

Performance tuning

guidance performance breaks down into two phases: constraint evaluation per step and the underlying model forward pass.

  • Local model forward pass dominates. For a 7B model, each step is ~10-30ms; the constraint mask is ~0.1-1ms. Optimise the model first (quantisation, FlashAttention, torch.compile) before chasing constraint overhead.
  • select with N options costs O(N) to build the mask. For very large enums, prefer regex over a shared prefix.
  • regex is JIT-compiled per call. Cache compiled patterns by reusing the same gen(regex=PAT) literal — guidance memoises by pattern string.
  • KV cache reuse. Sequential gen calls on the same lm object share KV cache. Branching with lm + ... vs lm += ... matters for cache reuse — assignment preserves the cache, plain + creates a new branch.
  • Backend differences. models.Transformers benefits from torch.compile; models.LlamaCpp benefits from GGUF quantisation; models.OpenAI is dominated by network latency.
  • Batch is awkward. guidance's slot-fill is inherently sequential per request — for throughput, run multiple lm instances across multiple processes/GPUs.
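
One way to read the "regex over a shared prefix" tip: factor the common prefix out of a large option list and constrain only the remainder. The helper prefix_regex is an illustrative assumption, and the pattern is checked here with Python's re module; whether a given guidance version's regex engine accepts it unchanged is something to verify.

```python
import os
import re


def prefix_regex(options):
    """Build one pattern: escaped common prefix + alternation of the remainders."""
    prefix = os.path.commonprefix(options)
    suffixes = sorted({opt[len(prefix):] for opt in options})
    body = "|".join(re.escape(s) for s in suffixes)
    return re.escape(prefix) + "(" + body + ")"


options = ["region-us-east", "region-us-west", "region-eu-central"]
pattern = prefix_regex(options)

# Every original option matches; anything else is rejected.
all_match = all(re.fullmatch(pattern, opt) for opt in options)
```

When the remainders follow a regular shape (for example four digits), replacing the alternation with a character class shrinks the pattern further.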

Real-world recipes

guidance's value shows up when output shape matters more than free-form text. The recipes below string gen, select, and assistant/user/system into the patterns production teams actually use.

Recipe: JSON extraction with regex-enforced fields

Force the model to emit a strictly-shaped JSON object — no parsing, no retries:

python
import guidance
from guidance import models, gen, select

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct", device_map="auto")
lm += '''Extract the invoice fields as JSON.
{
  "vendor": "''' + gen("vendor", stop='"', max_tokens=40) + '''",
  "amount": ''' + gen("amount", regex=r"\d+\.\d{2}") + ''',
  "currency": "''' + select(["USD", "EUR", "GBP"], name="currency") + '''",
  "category": "''' + select(["billing", "shipping", "service", "tax"], name="category") + '''"
}'''

print(lm["vendor"], lm["amount"], lm["currency"], lm["category"])

Output: four parsed fields. The grammar bakes in shape; the model fills slots.

Recipe: multi-step planner with enums for branching

python
import guidance
from guidance import models, gen, select, system, user, assistant

lm = models.OpenAI("gpt-4o-mini")

with system():
    lm += "You decompose tasks into 1-3 step plans, then choose a primary action."

with user():
    lm += "Help me onboard a new dev to our codebase."

with assistant():
    lm += "Step 1: " + gen("step1", max_tokens=80, stop="\n") + "\n"
    lm += "Step 2: " + gen("step2", max_tokens=80, stop="\n") + "\n"
    lm += "Primary action: " + select(
        ["pair-program", "send-docs", "schedule-meeting"], name="action"
    )

Output: plan text in lm["step1"] and lm["step2"], enum-constrained action in lm["action"].

Recipe: regex-bounded numeric extraction

python
from guidance import models, gen

lm = models.OpenAI("gpt-4o-mini")
# The regex must match the whole slot, so word-boundary anchors are unnecessary.
lm += "Q: What year did the Apollo 11 moon landing happen?\nA: " + gen(
    "year", regex=r"(19|20)\d{2}", max_tokens=8,
)
print(lm["year"])

Output: 1969. The slot can only hold a four-digit year, so no parsing is required. On a hosted model such as this one, enforcement is by retry-on-mismatch rather than token masking.

Recipe: chain-of-thought with shape constraints

Combine free-form reasoning with shape-constrained conclusion — a "think step by step, then answer in JSON" pattern that doesn't fall apart on edge cases.

python
from guidance import models, gen, select, system, user, assistant

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct",
                          device_map="auto", torch_dtype="bfloat16")

with system():
    lm += "You analyse customer reviews and produce structured ratings."

with user():
    lm += "Review: 'The food was delicious, but the service was painfully slow.'"

with assistant():
    lm += "Reasoning: " + gen("reasoning", max_tokens=200, stop="\n\n") + "\n\n"
    lm += "Score (1-5): " + gen("score", regex=r"[1-5]") + "\n"
    lm += "Category: " + select(
        ["food-positive", "food-negative", "service-positive", "service-negative", "mixed"],
        name="category",
    )

print(lm["reasoning"], lm["score"], lm["category"])

Output: free-form reasoning followed by enum-constrained score and category — final fields ready for direct downstream use.

Recipe: local model with token-level constraint

Token-level constrained decoding only works against local models (or providers exposing logprobs hooks). With a local Transformers model, guidance intercepts logits before sampling.

python
import guidance
from guidance import models, gen, select

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct",
                          device_map="auto", torch_dtype="bfloat16")
lm += "Classify sentiment of: 'I loved the new release.'\nLabel: " + select(
    ["positive", "negative", "neutral"], name="label"
)
print(lm["label"])

Output: positive. The model cannot emit any other token — no retries, no parsing.
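
To make "intercepts logits before sampling" concrete, here is a toy, character-level illustration of masked greedy decoding. It is not guidance's implementation: toy_score is a hypothetical stand-in for model logits, and real systems mask over tokenizer tokens rather than characters.

```python
def allowed_next(options, generated):
    """Characters that keep `generated` a prefix of at least one option."""
    return sorted({
        opt[len(generated)]
        for opt in options
        if opt.startswith(generated) and len(opt) > len(generated)
    })


def constrained_decode(options, score):
    """Greedy decoding where only valid continuations compete."""
    generated = ""
    while generated not in options:
        candidates = allowed_next(options, generated)
        # Masking step: invalid characters never reach the argmax.
        generated += max(candidates, key=lambda ch: score(generated, ch))
    return generated


def toy_score(generated, ch):
    # Unconstrained, this "model" would prefer to start with "x".
    return {"x": 9.0, "n": 2.0, "p": 1.0}.get(ch, 0.0)


label = constrained_decode(["positive", "negative", "neutral"], toy_score)
```

The highest-scoring character overall ("x") is never considered because it cannot begin any option, which is the property the recipe relies on.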

Cost & rate-limit management

guidance saves money in two ways: it eliminates retries-for-shape and it lets a smaller model do work a larger model would otherwise have to do.

  • No retry loops for shape. With strict regex/grammar constraints, you don't need the "parse → fail → retry with stronger instructions" loop most JSON-extraction code has. The saving scales with how often unconstrained output fails to parse: if half of raw outputs are malformed, removing retries roughly halves cost.
  • Smaller models suffice. A 3B local model with select/regex constraints often matches a 70B model's accuracy on extraction tasks — at a fraction of the cost.
  • Per-token cost on remote APIs. Remote constrained decoding (OpenAI / Anthropic via grammar-style retry) does NOT save tokens. Local models do, because rejected tokens never enter the generation.
  • Token-level caching. Local models with stable prefixes benefit from KV-cache reuse across calls — guidance keeps the cache warm between sequential gen() calls on the same lm object.
  • Bound max_tokens on every gen. Unbounded gen can run the model to its context limit when the stop pattern fails to match.
  • Profile slot fill time. A single complex select over 100 enum values can be slower than a regex-bounded gen — pick the cheaper expression for the same constraint.
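
The retry saving can be estimated with a short calculation. This sketch assumes each attempt independently yields well-formed output with probability p_valid (a geometric model) and that every attempt costs the same; both are simplifying assumptions, and the numbers are illustrative only.

```python
def expected_attempts(p_valid):
    """Mean number of calls until the first well-formed output."""
    if not 0.0 < p_valid <= 1.0:
        raise ValueError("p_valid must be in (0, 1]")
    return 1.0 / p_valid


def retry_cost(cost_per_call, p_valid):
    """Expected spend per accepted output under parse-fail-retry."""
    return cost_per_call * expected_attempts(p_valid)


# Half of raw outputs parse: two calls per accepted result on average.
with_retries = retry_cost(0.002, 0.5)
# Shape guaranteed by construction: exactly one call.
constrained = retry_cost(0.002, 1.0)
```

Measure p_valid on your own prompts before quoting a saving; a well-prompted flagship model may already parse most of the time, in which case the gain is small.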

Version migration guide

guidance has been through one major architectural rewrite. The lineage matters when reading old tutorials.

Era | API style
0.0.x | Handlebars-style templates: {{#gen 'name'}}...{{/gen}}. Tight coupling to specific local models.
0.1.x+ (current) | Python composition: lm += gen("name", regex=...). Provider-agnostic; supports both local and remote.

Migration discipline:

  1. Any {{...}} template syntax is the old API. Don't try to port the templates — rewrite as Python composition.
  2. The rewrite generalised the LM abstraction; old guidance.llms.Transformers(...) calls became guidance.models.Transformers(...). Same idea, different namespace.
  3. Constraint primitives (gen, select, with_temperature) are stable since 0.1.x; build new code against them.
  4. Minor releases occasionally adjust the public API surface; pin tight and read the changelog before upgrading across minor versions.
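
A small sketch for auditing old notebooks or docs for pre-rewrite syntax, following rule 1 above. The function name and the tag list (gen and select, the two mentioned in this sheet) are assumptions; extend the pattern if your legacy code uses other tags.

```python
import re

# Matches Handlebars-style tags such as {{gen 'x'}}, {{#gen ...}}, {{/gen}}, {{select ...}}.
LEGACY_TAG = re.compile(r"\{\{\s*[#/]?\s*(gen|select)\b")


def uses_legacy_templates(source):
    """True when the text contains pre-rewrite template tags."""
    return LEGACY_TAG.search(source) is not None


old_snippet = "Answer: {{gen 'name'}}"
new_snippet = 'lm += gen("name", max_tokens=20)'
flags = (uses_legacy_templates(old_snippet), uses_legacy_templates(new_snippet))
```

A hit means the snippet should be rewritten as Python composition rather than ported tag by tag.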

Troubleshooting common errors

  • {{gen ...}} template doesn't run. That's the pre-rewrite API. Migrate to lm += gen(...) composition.
  • KeyError reading a slot. The gen/select call did not assign name=. The slot name is what you read back via lm["name"].
  • Remote-API constraints look slow. Remote providers (OpenAI, Anthropic) fall back to grammar-style retry — the model emits, guidance validates, retries on mismatch. Switch to a local model for true token-level constraints.
  • tiktoken mismatch. When OpenAI ships a new tokenizer, guidance's pin may lag. Upgrade tiktoken explicitly.
  • select(...) with 100+ options is slow. The decoding mask grows linearly. Either bucket the options or use a regex over their shared prefix.
  • gen(..., stop="\n") runs past the stop. A stop-string is matched at token boundaries — multi-byte stops can be missed mid-token. Prefer regex for hard limits.
  • Async use deadlocks. guidance is mostly synchronous; wrap calls in asyncio.to_thread() from FastAPI / aiohttp handlers.
  • Tokenizer differs between model and library. Loading a custom-tokenizer model can confuse guidance's offset calculations. Either use trust_remote_code=True (with the usual caveats) or pre-process inputs to align tokenization.

Security considerations

  • Constraint enforcement is not content safety. A grammar that forces "valid JSON" does not stop the model from emitting harmful content inside that JSON. Apply content filtering on outputs separately.
  • Local-model execution. models.Transformers(...) runs transformers under the hood — every trust_remote_code=True concern from there applies here.
  • Prompt injection through user data. Slot-filling against attacker-controlled inputs can still leak system prompts or change downstream behaviour. Treat inputs as untrusted; sanitise.
  • Regex denial-of-service. A pathological regex (catastrophic backtracking) inside gen(regex=...) can hang. Stick to linear-time patterns.
  • Output validation. Even with constraints, validate post-hoc with Pydantic. Constraints prevent shape errors; validation catches semantic errors.
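
A sketch of the post-hoc validation step. To stay dependency-free it uses a standard-library dataclass as a stand-in for the Pydantic model the bullet recommends; the Invoice fields mirror the JSON-extraction recipe, and the specific rules (non-empty vendor, positive amount) are illustrative assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


@dataclass(frozen=True)
class Invoice:
    vendor: str
    amount: Decimal
    currency: str


def validate_invoice(fields):
    """Check semantics that a shape constraint cannot express."""
    vendor = fields["vendor"].strip()
    if not vendor:
        raise ValueError("vendor is empty")
    try:
        amount = Decimal(fields["amount"])
    except InvalidOperation as exc:
        raise ValueError("amount is not a number") from exc
    if amount <= 0:
        raise ValueError("amount must be positive")
    if fields["currency"] not in ALLOWED_CURRENCIES:
        raise ValueError("unexpected currency")
    return Invoice(vendor, amount, fields["currency"])


invoice = validate_invoice({"vendor": "Acme", "amount": "12.50", "currency": "USD"})
```

An amount of "0.00" satisfies a \d+\.\d{2} shape constraint yet is rejected here, which is the gap between shape and semantics.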

When constrained decoding actually helps

The promise of guidance is "the model can't produce invalid output". That's strictly true for token-level constraints against local models — but only roughly true for grammar-style retry against hosted APIs. The distinction matters in practice.

Pattern | Local model (token-level) | Hosted API (grammar retry)
regex constraint | Enforced at decode time; rejected tokens are masked | Retry on parse failure (up to N attempts)
select enum | Token mask only allows enum tokens | Retry; may produce out-of-enum text
Grammar (CFG) | Strict; per-token rejection | Retry; complex grammars often fail repeatedly
Tool / function calling | Use provider-native function calling instead | Use provider-native function calling
Latency floor | Adds ~10-30% per step for mask building | Adds retry round-trips on failure
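
The hosted-API column boils down to a validate-and-retry loop. The sketch below shows that loop in isolation; call_model is a hypothetical stand-in for one remote request, and canned replies replace the network call so the example runs offline. It illustrates the pattern, not guidance's internal code.

```python
import re


def generate_with_retry(call_model, pattern, max_attempts=3):
    """Call the model until the output fully matches `pattern`.

    Returns (text, attempts_used); raises if every attempt fails validation.
    """
    for attempt in range(1, max_attempts + 1):
        text = call_model(attempt)
        if re.fullmatch(pattern, text):
            return text, attempt
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Canned replies: the first is malformed, the second matches.
replies = iter(["around 1969, I think", "1969"])
year, attempts = generate_with_retry(lambda _: next(replies), r"(19|20)\d{2}")
```

Each failed attempt is a full round-trip that is still billed, which is why the table lists retries as the latency floor for hosted APIs.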

When guidance shines:

  • Constrained extraction from messy text against a local model.
  • High-throughput pipelines where retries cost throughput.
  • Hard-shape outputs where downstream consumers can't tolerate format drift.
  • Research workflows where you want to express the prompt as Python code with interleaved generation.

When other tools fit better:

  • Simple JSON outputs from a hosted API → use the provider's native JSON mode.
  • Pydantic-validated extraction → instructor is simpler.
  • Grammar-defined structures locally → outlines has competitive grammar support.

Production deployment

guidance ships as a library; production deployment looks like any local-model serving stack with a constraint layer.

  • One lm object per worker. models.Transformers(...) holds GPU memory and an interpreter state. Build at startup, reuse per request.
  • Constraint cost is non-trivial. For very high QPS, the per-token mask computation adds up. Profile before committing.
  • Mixed local/remote routing. Use guidance.models.OpenAI("gpt-4o-mini") for hard cases that need flagship quality; reserve the local model for constrained extraction.
  • Backend choice matters. Local backends in guidance (Transformers, llama-cpp) have different constraint enforcement guarantees. Test the exact constraint you need against your deployment backend.
  • Container shape. Bake the local model into the image; constraints add no runtime files but consume VRAM.
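
"Build at startup, reuse per request" can be sketched with a cached loader. load_model here is a hypothetical placeholder that returns a dict; in a real worker it would construct the model object once (for example with models.Transformers) and hold GPU memory.

```python
from functools import lru_cache

load_count = 0


def load_model():
    """Hypothetical loader standing in for an expensive model construction."""
    global load_count
    load_count += 1
    return {"name": "local-model", "loaded": True}


@lru_cache(maxsize=1)
def get_model():
    # First call pays the load cost; later calls return the same object.
    return load_model()


def handle_request(prompt):
    lm = get_model()
    return f"{lm['name']} handled: {prompt}"


first = handle_request("a")
second = handle_request("b")
```

With a multi-process server each worker process gets its own cached instance, so size the worker count to the available VRAM.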

Multi-provider patterns

guidance abstracts across providers via the models.* family — Transformers, OpenAI, Anthropic, LlamaCpp, Mock. The compositional API is identical:

python
from guidance import models, gen, select

# Swap one line to change backend
lm = models.OpenAI("gpt-4o-mini")
# or
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct", device_map="auto")

lm += "Pick a fruit: " + select(["apple", "banana", "cherry"], name="fruit")

Output: different cost / latency / constraint-enforcement properties — same Python.

Practical considerations:

  • Token-level constraints only work fully on local models. Remote APIs degrade to retry-on-mismatch.
  • Provider rate-limits surface as raw exceptions; guidance does not retry across providers for you. Wrap with tenacity for production.
  • Token / cost accounting differs by backend. For mixed deployments, layer a logging wrapper around all gen calls.
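
A dependency-free sketch of the retry wrapper suggested above; tenacity offers the same behaviour as a decorator. RateLimitError is a hypothetical stand-in for whatever exception your provider SDK raises, and the sleep function is injectable so the demo records delays instead of waiting.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit exception."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `fn` on RateLimitError, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))


# Demo: fail twice, then succeed; delays are recorded rather than slept.
delays = []
outcomes = iter([RateLimitError(), RateLimitError(), "ok"])


def flaky():
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_backoff(flaky, sleep=delays.append)
```

Adding random jitter to each delay is a common refinement when many workers hit the same rate limit at once.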

Evaluation & observability

guidance outputs are typed by construction (a select over ["A","B"] cannot emit "C") — eval focuses on whether the chosen value is correct, not whether the format is parseable.

  • Confusion matrices for select-bound outputs. Standard classification metrics.
  • Regex-bounded extractions evaluate via exact match against ground-truth fields.
  • Trace via langsmith or OTel. Wrap each call with a span; capture the rendered prompt, lm["slot_name"], latency, and token counts.
  • A/B small vs large model. With constraints, a 3B-7B local model often ties or beats a flagship hosted model on extraction. Verify on your real data before assuming.
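
Because select-bound outputs are always one of a known label set, evaluation reduces to ordinary classification bookkeeping. A minimal sketch with made-up labels; the function names are illustrative.

```python
from collections import Counter


def confusion_matrix(expected, predicted):
    """Count (expected, predicted) label pairs."""
    if len(expected) != len(predicted):
        raise ValueError("label lists must be the same length")
    return Counter(zip(expected, predicted))


def accuracy(matrix):
    """Fraction of pairs where the predicted label equals the expected one."""
    total = sum(matrix.values())
    correct = sum(n for (exp, pred), n in matrix.items() if exp == pred)
    return correct / total if total else 0.0


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "positive"]
matrix = confusion_matrix(gold, pred)
score = accuracy(matrix)
```

Running the same gold set through a small local model and a hosted model gives the A/B comparison the last bullet asks for.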

When NOT to use this

  • You only need JSON mode. instructor + Pydantic on top of OpenAI/Anthropic native function calling is simpler and well-supported.
  • You want self-improving prompts. dspy optimises prompts and examples programmatically — different paradigm.
  • All your models are hosted-API and you want speed. guidance on remote APIs is grammar-style retry; you may not see the speedup of true constrained decoding.
  • You want a templating language for prompts. Use Jinja, LangChain ChatPromptTemplate, or Mustache.
  • The output is free-form prose. guidance is for shaped outputs; if you don't have a shape, the constraints add complexity without benefit.

Ecosystem integrations

guidance is intentionally light on integrations — most users plug it into a larger framework rather than building one.

Layer | What plugs in
Local models | transformers (any causal LM), llama-cpp-python (GGUF), Triton-backed servers via raw HTTP.
Hosted APIs | OpenAI, Anthropic, Azure OpenAI via their official SDKs. Gemini support varies by guidance version.
Tokenization | tiktoken (OpenAI), tokenizers (Hugging Face). Both ship transitively.
Output validation | pydantic for post-extraction shape/semantic checks.
Observability | langsmith, OpenTelemetry, custom; wrap calls in @traceable to capture before/after state.
Frameworks | LangChain doesn't wrap guidance natively; instead, use guidance for constrained extraction steps and pass results into LangChain runnables as plain Python values.

See also