cheat sheet

guidance

Package-level reference for the guidance library on PyPI — install, LLM-provider extras, versioning, and alternatives like instructor and outlines.

updated 05-31-2026

What it is

guidance is a Python library for interleaving generation and constraints in LLM prompts — letting the model fill in slots, choose from enums, match regexes, and follow grammars while the host program retains control of the prompt structure. Originally a Microsoft Research project, it pioneered the "constrained decoding" approach that several modern structured-output libraries now share.

The library has gone through a significant API evolution. Early guidance (the 0.0.x series) was tightly coupled to constrained-token decoding with specific local models; the rewrite (0.1.x+) generalised to chat models and remote APIs and uses a more compositional interface built from gen(), select(), and role blocks such as assistant().

Install

bash

pip install guidance

Output: installs the core library — no LLM provider bundled

bash

uv add guidance

Output: dependency resolved + added to pyproject.toml

bash

poetry add guidance

Output: updated lockfile + virtualenv install

bash

pip install guidance openai      # for OpenAI/Azure
pip install guidance anthropic   # for Claude
pip install guidance transformers torch  # for local HF models

Output: add whichever provider SDK you intend to drive

Versioning & Python support

Current line is 0.1.x / 0.2.x (as of late 2025). Pre-1.0 — minor releases occasionally shuffle the public API. Pin tight in production.
Python 3.9+ on current releases.
The library was rewritten between the 0.0.x and 0.1.x series — old tutorials using {{gen 'name'}} Handlebars-style templates do not run on current versions. Modern API uses Python composition.
Local-model constrained decoding requires a compatible transformers install; remote APIs (OpenAI, Anthropic) use the provider SDKs directly.
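
A quick way to tell which era an environment is on is to look at the installed version string. The sketch below is a minimal, stdlib-only illustration; the helper names is_pre_rewrite and installed_guidance_version are hypothetical, and the parsing assumes version strings that start with numeric major.minor components.

```python
from importlib.metadata import PackageNotFoundError, version


def is_pre_rewrite(version_string: str) -> bool:
    """Return True for the 0.0.x series (Handlebars-style templates)."""
    parts = version_string.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) == (0, 0)


def installed_guidance_version():
    """Return the installed guidance version string, or None if absent."""
    try:
        return version("guidance")
    except PackageNotFoundError:
        return None
```

A check like is_pre_rewrite(installed_guidance_version()) at startup can fail fast when an old pin sneaks into an environment that expects the Python composition API.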

Package metadata

Maintainer: Originally Microsoft Research; now community-maintained under the guidance-ai GitHub org
Project home: github.com/guidance-ai/guidance
Docs: guidance.readthedocs.io (when available; primary docs are in the GitHub README)
PyPI: pypi.org/project/guidance
License: MIT
Governance: community-led after Microsoft handed off; active contributor base
First released: 2023

Optional dependencies & extras

guidance is relatively self-contained. It does not bundle provider SDKs — install the ones you need:

Provider	Pip target
OpenAI / Azure OpenAI	`openai`
Anthropic	`anthropic`
Google Gemini	`google-generativeai`
Local Hugging Face	`transformers`, `torch`, `accelerate`
llama.cpp	`llama-cpp-python`

Core deps the package pulls in itself:

numpy
pydantic
tiktoken (for token-level constraints)
requests

There are no major [extra] groups — the canonical install is pip install guidance <provider-sdk>.

Alternatives

Package	Trade-off
`instructor`	Pydantic-first structured output for OpenAI, Anthropic, etc. Simpler API; doesn't do token-level constrained decoding.
`outlines`	Regex / grammar / Pydantic-constrained generation against local HF or Mistral models. Strong on local-model constraint enforcement.
`marvin`	Higher-level "AI functions" / prompt-as-function approach. Very Pythonic; lighter on grammar control.
`lmql`	Declarative prompting language. Different paradigm; smaller community.
`jsonformer`	Token-level JSON-schema enforcement for local models. Narrower scope than `guidance`.
Provider-native JSON mode / tool use	Built into OpenAI, Anthropic, Gemini SDKs. Use when constraints are simple enough to be expressed as JSON Schema.

Common gotchas

0.0.x Handlebars tutorials don't run. If a snippet uses {{#gen}} or {{select}} template syntax, it's pre-rewrite. Migrate to the Python composition API (with assistant(): lm += gen(...)) on current releases.
LLM-provider compatibility varies. Token-level constrained decoding only works with local models where guidance can intercept logits. Remote APIs (OpenAI, Anthropic) fall back to grammar-style retry-on-mismatch, which is slower and less guaranteed.
tiktoken mismatch. OpenAI's tokenizer ships with guidance's deps, but model-specific BPE updates can lag — pin tiktoken alongside major OpenAI model upgrades.
Async support is partial. Some operations are sync-only; mixing guidance with FastAPI/asyncio requires asyncio.to_thread() for the generation calls.
Local-model GPU placement. As with plain transformers, a model that does not fit on a single GPU (as a rule of thumb, anything much beyond 7B) needs device_map="auto" plus accelerate to shard or offload the weights. guidance doesn't manage this for you.
Maintenance cadence. After Microsoft handed off, release cadence is community-driven and slower than instructor or outlines. Check the issue tracker before betting a production stack on a niche feature.
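
The async gotcha above can be handled with the standard library alone. The following is a minimal sketch of the asyncio.to_thread() pattern; run_generation is a hypothetical stand-in for a blocking guidance call, not part of the library.

```python
import asyncio
import time


def run_generation(prompt: str) -> str:
    """Hypothetical stand-in for a blocking call such as lm += gen(...)."""
    time.sleep(0.05)  # simulate a blocking model call
    return prompt.upper()


async def handle_request(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(run_generation, prompt)


async def main() -> list:
    # The two requests overlap instead of running back to back.
    return await asyncio.gather(handle_request("first"), handle_request("second"))


results = asyncio.run(main())
```

Threads only help while the blocking call releases the GIL or waits on I/O; for a GPU-bound local model a single worker thread per model object is usually the safer assumption.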

Performance tuning

guidance performance breaks down into two phases: constraint evaluation per step and the underlying model forward pass.

Local model forward pass dominates. For a 7B model, each step is ~10-30ms; the constraint mask is ~0.1-1ms. Optimise the model first (quantisation, FlashAttention, torch.compile) before chasing constraint overhead.
select with N options costs O(N) to build the mask. For very large enums, prefer regex over a shared prefix.
regex is JIT-compiled per call. Cache compiled patterns by reusing the same gen(regex=PAT) literal — guidance memoises by pattern string.
KV cache reuse. Sequential gen calls on the same lm object share KV cache. The difference between lm + ... and lm += ... matters here: lm += ... rebinds lm to the extended state, so later calls continue from it, while lm + ... returns a new branch and leaves lm unchanged.
Backend differences. models.Transformers benefits from torch.compile; models.LlamaCpp benefits from GGUF quantisation; models.OpenAI is dominated by network latency.
Batch is awkward. guidance's slot-fill is inherently sequential per request — for throughput, run multiple lm instances across multiple processes/GPUs.
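
For the large-enum advice above, it helps to check that a candidate regex really covers the option list before swapping it in for select. This is an illustrative sketch using only the re module; the SKU codes and the pattern are made-up examples.

```python
import re

# Hypothetical large enum: 10,000 SKU codes that share the prefix "SKU-".
options = [f"SKU-{i:04d}" for i in range(10_000)]

# One pattern covering the same strings.
sku_pattern = r"SKU-[0-9]{4}"

# Sanity check before swapping: the pattern must accept every option.
all_match = all(re.fullmatch(sku_pattern, opt) for opt in options)

# In guidance this would correspond to gen("sku", regex=sku_pattern)
# in place of select(options, name="sku").
```

The check only proves the pattern is not too narrow. If the enum has gaps (for example, retired codes), the regex will admit values the enum would not, so post-hoc validation is still needed.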

Real-world recipes

guidance's value shows up when output shape matters more than free-form text. The recipes below string gen, select, and assistant/user/system into the patterns production teams actually use.

Recipe: JSON extraction with regex-enforced fields

Force the model to emit a strictly-shaped JSON object — no parsing, no retries:

python

import guidance
from guidance import models, gen, select

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct", device_map="auto")
lm += '''Extract the invoice fields as JSON.
{
  "vendor": "''' + gen("vendor", stop='"', max_tokens=40) + '''",
  "amount": ''' + gen("amount", regex=r"\d+\.\d{2}") + ''',
  "currency": "''' + select(["USD", "EUR", "GBP"], name="currency") + '''",
  "category": "''' + select(["billing", "shipping", "service", "tax"], name="category") + '''"
}'''

print(lm["vendor"], lm["amount"], lm["currency"], lm["category"])

Output: four parsed fields. The grammar bakes in shape; the model fills slots.

Recipe: multi-step planner with enums for branching

python

import guidance
from guidance import models, gen, select, system, user, assistant

lm = models.OpenAI("gpt-4o-mini")

with system():
    lm += "You decompose tasks into 1-3 step plans, then choose a primary action."

with user():
    lm += "Help me onboard a new dev to our codebase."

with assistant():
    lm += "Step 1: " + gen("step1", max_tokens=80, stop="\n") + "\n"
    lm += "Step 2: " + gen("step2", max_tokens=80, stop="\n") + "\n"
    lm += "Primary action: " + select(
        ["pair-program", "send-docs", "schedule-meeting"], name="action"
    )

Output: plan text in lm["step1"] and lm["step2"], enum-constrained action in lm["action"].

Recipe: regex-bounded numeric extraction

python

from guidance import models, gen

lm = models.OpenAI("gpt-4o-mini")
lm += "Q: What year did the Apollo 11 moon landing happen?\nA: " + gen(
    "year", regex=r"\b(19|20)\d{2}\b", max_tokens=8,
)
print(lm["year"])

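Output: 1969. On a local backend, tokens that violate the pattern are masked at decoding time, so no parsing is required; on a hosted backend such as the one used here, enforcement falls back to the weaker retry-on-mismatch behaviour described under Common gotchas.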

Recipe: chain-of-thought with shape constraints

Combine free-form reasoning with shape-constrained conclusion — a "think step by step, then answer in JSON" pattern that doesn't fall apart on edge cases.

python

from guidance import models, gen, select, system, user, assistant

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct",
                          device_map="auto", torch_dtype="bfloat16")

with system():
    lm += "You analyse customer reviews and produce structured ratings."

with user():
    lm += "Review: 'The food was delicious, but the service was painfully slow.'"

with assistant():
    lm += "Reasoning: " + gen("reasoning", max_tokens=200, stop="\n\n") + "\n\n"
    lm += "Score (1-5): " + gen("score", regex=r"[1-5]") + "\n"
    lm += "Category: " + select(
        ["food-positive", "food-negative", "service-positive", "service-negative", "mixed"],
        name="category",
    )

print(lm["reasoning"], lm["score"], lm["category"])

Output: free-form reasoning followed by a regex-constrained score and an enum-constrained category, with the final fields ready for direct downstream use.

Recipe: local model with token-level constraint

Token-level constrained decoding only works against local models (or providers exposing logprobs hooks). With a local Transformers model, guidance intercepts logits before sampling.

python

import guidance
from guidance import models, gen, select

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct",
                          device_map="auto", torch_dtype="bfloat16")
lm += "Classify sentiment of: 'I loved the new release.'\nLabel: " + select(
    ["positive", "negative", "neutral"], name="label"
)
print(lm["label"])

Output: positive. The model cannot emit any other token — no retries, no parsing.
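
To make the masking step concrete, the toy sketch below computes which tokens keep the output on track for a fixed option list. It is a simplified illustration of the general idea, not guidance's actual implementation; allowed_tokens and the tiny vocabulary are hypothetical.

```python
def allowed_tokens(options, generated, vocab):
    """Return the vocab tokens that keep `generated` a prefix of some option."""
    allowed = set()
    for token in vocab:
        candidate = generated + token
        if token and any(opt.startswith(candidate) for opt in options):
            allowed.add(token)
    return allowed


# Toy vocabulary; a real tokenizer has tens of thousands of entries.
vocab = ["pos", "neg", "neu", "itive", "ative", "tral", "maybe", " "]
options = ["positive", "negative", "neutral"]

first_step = allowed_tokens(options, "", vocab)      # before any token is emitted
second_step = allowed_tokens(options, "pos", vocab)  # after "pos" has been emitted
```

A real decoder would set the logits of every token outside the allowed set to negative infinity before sampling, which is why an out-of-enum label cannot appear.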

Cost & rate-limit management

guidance saves money in two ways: it eliminates retries-for-shape and it lets a smaller model do work a larger model would otherwise have to do.

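No retry loops for shape. With strict regex/grammar constraints, you don't need the "parse → fail → retry with stronger instructions" loop most JSON-extraction code has. Each avoided retry is a full extra request, so a pipeline that previously averaged one retry per request would roughly halve its spend; measure your own retry rate before assuming that figure.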
Smaller models suffice. A 3B local model with select/regex constraints often matches a 70B model's accuracy on extraction tasks — at a fraction of the cost.
Per-token cost on remote APIs. Remote constrained decoding (OpenAI / Anthropic via grammar-style retry) does NOT save tokens. Local models do, because rejected tokens never enter the generation.
Token-level caching. Local models with stable prefixes benefit from KV-cache reuse across calls — guidance keeps the cache warm between sequential gen() calls on the same lm object.
Bound max_tokens on every gen. Unbounded gen can run the model to its context limit when the stop pattern fails to match.
Profile slot fill time. A single complex select over 100 enum values can be slower than a regex-bounded gen — pick the cheaper expression for the same constraint.
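
The retry saving can be estimated with simple arithmetic. The sketch below assumes each attempt fails independently with the same probability and is retried until it succeeds; both function names are hypothetical and the numbers are illustrative, not measurements.

```python
def expected_attempts(failure_rate: float) -> float:
    """Mean number of calls when every attempt fails independently with
    probability `failure_rate` and failures are retried until success."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def retry_overhead(failure_rate: float) -> float:
    """Extra calls per request, relative to a single call."""
    return expected_attempts(failure_rate) - 1.0
```

Under this model a 20% shape-failure rate costs 1.25 calls per request on average, and only a 50% failure rate doubles the bill, so the size of the saving depends heavily on how often the unconstrained pipeline actually fails.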

Version migration guide

guidance has been through one major architectural rewrite. The lineage matters when reading old tutorials.

Era	API style
`0.0.x`	Handlebars-style templates: `{{#gen 'name'}}...{{/gen}}`. Tight coupling to specific local models.
`0.1.x+` (current)	Python composition: `lm += gen("name", regex=...)`. Provider-agnostic; supports both local and remote.

Migration discipline:

Any {{...}} template syntax is the old API. Don't try to port the templates — rewrite as Python composition.
The rewrite generalised the LM abstraction; old guidance.llms.Transformers(...) calls became guidance.models.Transformers(...). Same idea, different namespace.
Constraint primitives (gen, select, with_temperature) are stable since 0.1.x; build new code against them.
Hedge: minor releases occasionally adjust the public API surface — pin tight and read the changelog before upgrading across minor versions.
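
When auditing a codebase or a batch of old tutorials, a crude scan for template tags can flag pre-rewrite snippets. The heuristic below is a sketch; LEGACY_TAG and uses_legacy_template are hypothetical names, and the pattern only looks for the gen and select tags mentioned above.

```python
import re

# Matches Handlebars-style tags such as {{gen 'name'}}, {{#gen}}, {{/gen}}, {{select}}.
LEGACY_TAG = re.compile(r"\{\{\s*[#/]?\s*(gen|select)\b[^}]*\}\}")


def uses_legacy_template(source: str) -> bool:
    """Heuristic: True if the text contains 0.0.x-style template tags."""
    return LEGACY_TAG.search(source) is not None
```

A hit means the snippet needs rewriting as Python composition, not mechanical porting.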

Troubleshooting common errors

{{gen ...}} template doesn't run. That's the pre-rewrite API. Migrate to lm += gen(...) composition.
KeyError reading a slot. The gen/select call did not assign name=. The slot name is what you read back via lm["name"].
Remote-API constraints look slow. Remote providers (OpenAI, Anthropic) fall back to grammar-style retry — the model emits, guidance validates, retries on mismatch. Switch to a local model for true token-level constraints.
tiktoken mismatch. When OpenAI ships a new tokenizer, guidance's pin may lag. Upgrade tiktoken explicitly.
select(...) with 100+ options is slow. The decoding mask grows linearly. Either bucket the options or use a regex over their shared prefix.
gen(..., stop="\n") runs past the stop. A stop-string is matched at token boundaries — multi-byte stops can be missed mid-token. Prefer regex for hard limits.
Async use deadlocks. guidance is mostly synchronous; wrap calls in asyncio.to_thread() from FastAPI / aiohttp handlers.
Tokenizer differs between model and library. Loading a custom-tokenizer model can confuse guidance's offset calculations. Either use trust_remote_code=True (with the usual caveats) or pre-process inputs to align tokenization.
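
For the KeyError case above, a small defensive reader keeps one missing slot from crashing a pipeline. This sketch relies only on the behaviour described in that item (an unnamed slot raises KeyError on lookup); read_slot is a hypothetical helper and a plain dict stands in for the lm object.

```python
def read_slot(state, name, default=None):
    """Read a captured slot, returning `default` when the name was never assigned."""
    try:
        return state[name]
    except KeyError:
        return default


# A plain dict stands in for the lm object here (assumption for illustration).
captured = {"vendor": "Acme"}
vendor = read_slot(captured, "vendor")
amount = read_slot(captured, "amount", default="<missing>")
```

Returning a default is useful for logging and diagnostics; in production code it is often better to let the KeyError surface so the missing name= is fixed at the source.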

Security considerations

Constraint enforcement is not content safety. A grammar that forces "valid JSON" does not stop the model from emitting harmful content inside that JSON. Apply content filtering on outputs separately.
Local-model execution. models.Transformers(...) runs transformers under the hood — every trust_remote_code=True concern from there applies here.
Prompt injection through user data. Slot-filling against attacker-controlled inputs can still leak system prompts or change downstream behaviour. Treat inputs as untrusted; sanitise.
Regex denial-of-service. A pathological regex (catastrophic backtracking) inside gen(regex=...) can hang. Stick to linear-time patterns.
Output validation. Even with constraints, validate post-hoc with Pydantic. Constraints prevent shape errors; validation catches semantic errors.
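
The split between shape constraints and semantic validation can be sketched as follows. The standard library is used here as a stand-in so the example has no dependencies; a Pydantic model would play the same role. The Invoice type, the field names, and the rules are hypothetical and mirror the invoice recipe.

```python
from dataclasses import dataclass
from decimal import Decimal

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


@dataclass(frozen=True)
class Invoice:
    vendor: str
    amount: Decimal
    currency: str


def validate_invoice(vendor: str, amount: str, currency: str) -> Invoice:
    """Semantic checks that a shape constraint cannot express."""
    if not vendor.strip():
        raise ValueError("vendor is empty")
    value = Decimal(amount)  # the regex already guaranteed a numeric shape
    if value <= 0:
        raise ValueError("amount must be positive")
    if currency not in ALLOWED_CURRENCIES:
        raise ValueError(f"unexpected currency: {currency}")
    return Invoice(vendor.strip(), value, currency)
```

An amount of 0.00 passes a pattern like \d+\.\d{2} but fails the positivity check, which is the kind of error only post-hoc validation catches.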

When constrained decoding actually helps

The promise of guidance is "the model can't produce invalid output". That's strictly true for token-level constraints against local models — but only roughly true for grammar-style retry against hosted APIs. The distinction matters in practice.

Pattern	Local model (token-level)	Hosted API (grammar retry)
`regex` constraint	Enforced at decode time; rejected tokens are masked	Retry on parse failure (up to N attempts)
`select` enum	Token mask only allows enum tokens	Retry; may produce out-of-enum text
Grammar (CFG)	Strict; per-token rejection	Retry; complex grammars often fail repeatedly
Tool / function calling	Use provider-native function calling instead	Use provider-native function calling
Latency floor	Adds per-step mask-building time (roughly 0.1-1 ms against a 10-30 ms forward pass for a 7B model)	Adds retry round-trips on failure
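
The "retry on parse failure" column can be pictured as a loop like the one below. This is a generic sketch of the pattern, not guidance's internal code; generate_with_retry is a hypothetical helper and the fake replies stand in for model calls.

```python
import re
from typing import Callable


def generate_with_retry(generate: Callable[[], str], pattern: str,
                        max_attempts: int = 3) -> str:
    """Call `generate` until its output fully matches `pattern`."""
    compiled = re.compile(pattern)
    for _ in range(max_attempts):
        text = generate()
        if compiled.fullmatch(text):
            return text
    raise ValueError(f"no output matched {pattern!r} in {max_attempts} attempts")


# Fake model that gets the shape right on its third try.
replies = iter(["around 1969", "1969.", "1969"])
answer = generate_with_retry(lambda: next(replies), r"(19|20)[0-9]{2}")
```

Every failed attempt is a full round-trip that is paid for and then thrown away, which is the cost that token-level masking avoids.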

When guidance shines:

Constrained extraction from messy text against a local model.
High-throughput pipelines where retries cost throughput.
Hard-shape outputs where downstream consumers can't tolerate format drift.
Research workflows where you want to express the prompt as Python code with interleaved generation.

When other tools fit better:

Simple JSON outputs from a hosted API → use the provider's native JSON mode.
Pydantic-validated extraction → instructor is simpler.
Grammar-defined structures locally → outlines has competitive grammar support.

Production deployment

guidance ships as a library; production deployment looks like any local-model serving stack with a constraint layer.

One lm object per worker. models.Transformers(...) holds GPU memory and an interpreter state. Build at startup, reuse per request.
Constraint cost is non-trivial. For very high QPS, the per-token mask computation adds up. Profile before committing.
Mixed local/remote routing. Use guidance.models.OpenAI("gpt-4o-mini") for hard cases that need flagship quality; reserve the local model for constrained extraction.
Backend choice matters. Local backends in guidance (Transformers, llama-cpp) have different constraint enforcement guarantees. Test the exact constraint you need against your deployment backend.
Container shape. Bake the local model into the image; constraints add no runtime files but consume VRAM.
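
The "one lm object per worker" rule can be sketched with a cached loader. load_model below is a hypothetical stand-in for an expensive models.Transformers(...) construction, and the dict is a placeholder for the real model object.

```python
from functools import lru_cache

load_count = 0


def load_model(name: str):
    """Hypothetical stand-in for an expensive model construction."""
    global load_count
    load_count += 1
    return {"name": name}  # placeholder for the real model object


@lru_cache(maxsize=1)
def get_base_model(name: str):
    # The first call pays the load cost; later calls reuse the same object.
    return load_model(name)


def handle_request(prompt: str):
    base = get_base_model("local-model")
    # Real code would branch from the shared base for each request.
    return base, prompt


first, _ = handle_request("a")
second, _ = handle_request("b")
```

Calling get_base_model once during worker startup moves the load cost out of the first request's latency.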

Multi-provider patterns

guidance abstracts across providers via the models.* family — Transformers, OpenAI, Anthropic, LlamaCpp, Mock. The compositional API is identical:

python

from guidance import models, gen, select

# Swap one line to change backend
lm = models.OpenAI("gpt-4o-mini")
# or
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct", device_map="auto")

lm += "Pick a fruit: " + select(["apple", "banana", "cherry"], name="fruit")

Output: different cost / latency / constraint-enforcement properties — same Python.

Practical considerations:

Token-level constraints only work fully on local models. Remote APIs degrade to retry-on-mismatch.
Provider rate-limits surface as raw exceptions — guidance does not retry across providers for you. Wrap with tenacity for production.
Token / cost accounting differs by backend. For mixed deployments, layer a logging wrapper around all gen calls.
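
The rate-limit advice above names tenacity; the stdlib sketch below shows the same exponential-backoff idea without that dependency. RateLimited is a hypothetical stand-in for whatever exception the provider SDK raises, and call_with_backoff is an illustrative helper.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit error."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `fn` on RateLimited, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Fake call that is rate-limited twice, then succeeds.
outcomes = iter([RateLimited(), RateLimited(), "banana"])
waits = []


def flaky_call():
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item


fruit = call_with_backoff(flaky_call, sleep=waits.append)
```

Injecting sleep as a parameter keeps the helper testable without real delays; production code would typically add jitter as well.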

Evaluation & observability

guidance outputs are typed by construction (a select over ["A","B"] cannot emit "C") — eval focuses on whether the chosen value is correct, not whether the format is parseable.

Confusion matrices for select-bound outputs. Standard classification metrics.
Regex-bounded extractions evaluate via exact match against ground-truth fields.
Trace via langsmith or OTel. Wrap each call with a span; capture the prompt text, lm["slot_name"], latency, and token counts.
A/B small vs large model. With constraints, a 3B-7B local model often ties or beats a flagship hosted model on extraction. Verify on your real data before assuming.
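
Because select-bound outputs are plain labels, the evaluation code is ordinary classification scoring. A minimal sketch with made-up data follows; the function names and the labels are hypothetical.

```python
from collections import Counter


def confusion_counts(expected, predicted):
    """Count (expected, predicted) label pairs."""
    return Counter(zip(expected, predicted))


def accuracy(expected, predicted):
    """Exact-match rate over paired labels."""
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


# Hypothetical data: gold annotations vs values read back from lm["label"].
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "positive"]

matrix = confusion_counts(gold, pred)
score = accuracy(gold, pred)
```

The same accuracy function doubles as the exact-match metric for regex-bounded extractions, with field values in place of labels.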

When NOT to use this

You only need JSON mode. instructor + Pydantic on top of OpenAI/Anthropic native function calling is simpler and well-supported.
You want self-improving prompts. dspy optimises prompts and examples programmatically — different paradigm.
All your models are hosted-API and you want speed. guidance on remote APIs is grammar-style retry; you may not see the speedup of true constrained decoding.
You want a templating language for prompts. Use Jinja, LangChain ChatPromptTemplate, or Mustache.
The output is free-form prose. guidance is for shaped outputs; if you don't have a shape, the constraints add complexity without benefit.

Ecosystem integrations

guidance is intentionally light on integrations — most users plug it into a larger framework rather than building one.

Layer	What plugs in
Local models	`transformers` (any causal LM), `llama-cpp-python` (GGUF), Triton-backed servers via raw HTTP.
Hosted APIs	OpenAI, Anthropic, Azure OpenAI via their official SDKs. Gemini support varies by `guidance` version.
Tokenization	`tiktoken` (OpenAI), `tokenizers` (Hugging Face). Both ship transitively.
Output validation	`pydantic` for post-extraction shape/semantic checks.
Observability	`langsmith`, OpenTelemetry, custom — wrap calls in `@traceable` to capture before/after state.
Frameworks	LangChain doesn't wrap `guidance` natively; instead, use `guidance` for constrained extraction steps and pass results into LangChain runnables as plain Python values.
