cheat sheet
guidance
Package-level reference for the guidance library on PyPI — install, LLM-provider extras, versioning, and alternatives like instructor and outlines.
What it is
guidance is a Python library for interleaving generation and constraints in LLM prompts — letting the model fill in slots, choose from enums, match regexes, and follow grammars while the host program retains control of the prompt structure. Originally a Microsoft Research project, it pioneered the "constrained decoding" approach that several modern structured-output libraries now share.
The library has gone through a significant API evolution. Early guidance (the 0.0.x series) was tightly coupled to constrained-token decoding with specific local models; the rewrite (0.1.x+) generalised to chat models and remote APIs and uses a more compositional gen() / select() / assistant() interface.
Install
pip install guidance
Output: installs the core library — no LLM provider bundled
uv add guidance
Output: dependency resolved + added to pyproject.toml
poetry add guidance
Output: updated lockfile + virtualenv install
pip install guidance openai # for OpenAI/Azure
pip install guidance anthropic # for Claude
pip install guidance transformers torch # for local HF models
Output: add whichever provider SDK you intend to drive
Versioning & Python support
- Current line is 0.1.x/0.2.x (as of late 2025). Pre-1.0: minor releases occasionally shuffle the public API. Pin tight in production.
- Python 3.9+ on current releases.
- The library was rewritten between the 0.0.x and 0.1.x series. Old tutorials using {{gen 'name'}} Handlebars-style templates do not run on current versions. Modern API uses Python composition.
- Local-model constrained decoding requires a compatible transformers install; remote APIs (OpenAI, Anthropic) use the provider SDKs directly.
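Because the two API generations are incompatible, it can help to check which series is installed before following a tutorial. The following is a minimal sketch using only the standard library; the helper names (`release_series`, `is_pre_rewrite`, `installed_guidance_version`) are illustrative and not part of guidance.

```python
from importlib import metadata
from typing import Optional


def release_series(version: str) -> str:
    """Return the 'major.minor' series of a version string, e.g. '0.1.16' -> '0.1'."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"unexpected version string: {version!r}")
    return f"{parts[0]}.{parts[1]}"


def is_pre_rewrite(version: str) -> bool:
    """True for the 0.0.x series, which used Handlebars-style templates."""
    return release_series(version) == "0.0"


def installed_guidance_version() -> Optional[str]:
    """Return the installed guidance version, or None if it is not installed."""
    try:
        return metadata.version("guidance")
    except metadata.PackageNotFoundError:
        return None


found = installed_guidance_version()
if found is None:
    print("guidance is not installed")
elif is_pre_rewrite(found):
    print(f"guidance {found}: pre-rewrite API, template tutorials apply")
else:
    print(f"guidance {found}: Python composition API")
```

The same check can run in CI to catch an accidental downgrade or an unpinned upgrade across a minor version.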
Package metadata
- Maintainer: Originally Microsoft Research; now community-maintained under the guidance-ai GitHub org
- Project home: github.com/guidance-ai/guidance
- Docs: guidance.readthedocs.io (when available; primary docs are in the GitHub README)
- PyPI: pypi.org/project/guidance
- License: MIT
- Governance: community-led after Microsoft handed off; active contributor base
- First released: 2023
Optional dependencies & extras
guidance is relatively self-contained. It does not bundle provider SDKs — install the ones you need:
| Provider | Pip target |
|---|---|
| OpenAI / Azure OpenAI | openai |
| Anthropic | anthropic |
| Google Gemini | google-generativeai |
| Local Hugging Face | transformers, torch, accelerate |
| llama.cpp | llama-cpp-python |
Core deps the package pulls in itself:
- numpy
- pydantic
- tiktoken (for token-level constraints)
- requests
There are no major [extra] groups — the canonical install is pip install guidance <provider-sdk>.
Alternatives
| Package | Trade-off |
|---|---|
| instructor | Pydantic-first structured output for OpenAI, Anthropic, etc. Simpler API; doesn't do token-level constrained decoding. |
| outlines | Regex / grammar / Pydantic-constrained generation against local HF or Mistral models. Strong on local-model constraint enforcement. |
| marvin | Higher-level "AI functions" / prompt-as-function approach. Very Pythonic; lighter on grammar control. |
| lmql | Declarative prompting language. Different paradigm; smaller community. |
| jsonformer | Token-level JSON-schema enforcement for local models. Narrower scope than guidance. |
| Provider-native JSON mode / tool use | Built into OpenAI, Anthropic, Gemini SDKs. Use when constraints are simple enough to be expressed as JSON Schema. |
Common gotchas
- 0.0.x Handlebars tutorials don't run. If a snippet uses {{#gen}} or {{select}} template syntax, it's pre-rewrite. Migrate to the Python composition API (with assistant(): lm += gen(...)) on current releases.
- LLM-provider compatibility varies. Token-level constrained decoding only works with local models where guidance can intercept logits. Remote APIs (OpenAI, Anthropic) fall back to grammar-style retry-on-mismatch, which is slower and offers weaker guarantees.
- tiktoken mismatch. OpenAI's tokenizer ships with guidance's deps, but model-specific BPE updates can lag; pin tiktoken alongside major OpenAI model upgrades.
- Async support is partial. Some operations are sync-only; mixing guidance with FastAPI/asyncio requires asyncio.to_thread() for the generation calls.
- Local-model GPU placement. Like transformers, you need device_map="auto" + accelerate to fit anything beyond a 7B model on a single GPU. guidance doesn't manage this for you.
- Maintenance cadence. After Microsoft handed off, release cadence is community-driven and slower than instructor or outlines. Check the issue tracker before betting a production stack on a niche feature.
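The async gotcha above can be sketched with the standard library alone. In this minimal example `blocking_generate` is a hypothetical stand-in for a synchronous guidance call (it only sleeps), so the sketch shows the threading pattern rather than any guidance API. Whether a single lm object is safe to share between threads is not something this sketch establishes.

```python
import asyncio
import time


def blocking_generate(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous generation call."""
    time.sleep(0.05)  # simulates model latency
    return f"result for {prompt!r}"


async def handle_request(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_generate, prompt)


async def main() -> list:
    # The two requests overlap instead of running back to back.
    return await asyncio.gather(handle_request("a"), handle_request("b"))


results = asyncio.run(main())
print(results)
```

`asyncio.gather` returns results in the order the awaitables were passed, so callers can match outputs to inputs without extra bookkeeping.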
Performance tuning
guidance performance breaks down into two phases: constraint evaluation per step and the underlying model forward pass.
- Local model forward pass dominates. For a 7B model, each step is ~10-30ms; the constraint mask is ~0.1-1ms. Optimise the model first (quantisation, FlashAttention, torch.compile) before chasing constraint overhead.
- select with N options costs O(N) to build the mask. For very large enums, prefer regex over a shared prefix.
- regex is JIT-compiled per call. Cache compiled patterns by reusing the same gen(regex=PAT) literal; guidance memoises by pattern string.
- KV cache reuse. Sequential gen calls on the same lm object share KV cache. Branching with lm + ... vs lm += ... matters for cache reuse: assignment preserves the cache, plain + creates a new branch.
- Backend differences. models.Transformers benefits from torch.compile; models.LlamaCpp benefits from GGUF quantisation; models.OpenAI is dominated by network latency.
- Batch is awkward. guidance's slot-fill is inherently sequential per request. For throughput, run multiple lm instances across multiple processes/GPUs.
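One way to read the "regex over a shared prefix" advice is to factor the common prefix out of a large option list and pass the resulting pattern to gen(regex=...) instead of calling select. The sketch below builds such a pattern with the standard library; `options_to_regex` is a hypothetical helper, and whether the regex route is actually faster on a given backend should be measured rather than assumed.

```python
import os
import re


def options_to_regex(options: list) -> str:
    """Build one regex matching exactly the given options, factoring out their common prefix."""
    if not options:
        raise ValueError("need at least one option")
    prefix = os.path.commonprefix(options)
    # Deterministic order, longest suffix first.
    suffixes = sorted({opt[len(prefix):] for opt in options}, key=lambda s: (-len(s), s))
    alternation = "|".join(re.escape(s) for s in suffixes)
    return f"{re.escape(prefix)}(?:{alternation})"


options = ["ERR_TIMEOUT", "ERR_AUTH", "ERR_QUOTA"]
pattern = options_to_regex(options)
print(pattern)
```

The pattern is checked here with Python's `re`; a constrained-decoding engine may support a slightly different regex dialect, so test the generated pattern against the backend you deploy.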
Real-world recipes
guidance's value shows up when output shape matters more than free-form text. The recipes below string gen, select, and assistant/user/system into the patterns production teams actually use.
Recipe: JSON extraction with regex-enforced fields
Force the model to emit a strictly-shaped JSON object — no parsing, no retries:
import guidance
from guidance import models, gen, select
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct", device_map="auto")
lm += '''Extract the invoice fields as JSON.
{
"vendor": "''' + gen("vendor", stop='"', max_tokens=40) + '''",
"amount": ''' + gen("amount", regex=r"\d+\.\d{2}") + ''',
"currency": "''' + select(["USD", "EUR", "GBP"], name="currency") + '''",
"category": "''' + select(["billing", "shipping", "service", "tax"], name="category") + '''"
}'''
print(lm["vendor"], lm["amount"], lm["currency"], lm["category"])
Output: four parsed fields. The grammar bakes in shape; the model fills slots.
Recipe: multi-step planner with enums for branching
import guidance
from guidance import models, gen, select, system, user, assistant
lm = models.OpenAI("gpt-4o-mini")
with system():
    lm += "You decompose tasks into 1-3 step plans, then choose a primary action."
with user():
    lm += "Help me onboard a new dev to our codebase."
with assistant():
    # Free-form plan steps, each cut off at the first newline.
    lm += "Step 1: " + gen("step1", max_tokens=80, stop="\n") + "\n"
    lm += "Step 2: " + gen("step2", max_tokens=80, stop="\n") + "\n"
    # The final action is restricted to one of three labels.
    lm += "Primary action: " + select(
        ["pair-program", "send-docs", "schedule-meeting"], name="action"
    )
Output: plan text in lm["step1"] and lm["step2"], enum-constrained action in lm["action"].
Recipe: regex-bounded numeric extraction
from guidance import models, gen
lm = models.OpenAI("gpt-4o-mini")
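# The generated text must match the whole pattern, so word-boundary anchors add nothing.
lm += "Q: What year did the Apollo 11 moon landing happen?\nA: " + gen(
    "year", regex=r"(19|20)\d{2}", max_tokens=8,
)
print(lm["year"])
Output: 1969. On a backend with token-level enforcement, tokens that would break the pattern are rejected at decoding time, so no parsing is required.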
Recipe: chain-of-thought with shape constraints
Combine free-form reasoning with shape-constrained conclusion — a "think step by step, then answer in JSON" pattern that doesn't fall apart on edge cases.
from guidance import models, gen, select, system, user, assistant
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct",
device_map="auto", torch_dtype="bfloat16")
with system():
    lm += "You analyse customer reviews and produce structured ratings."
with user():
    lm += "Review: 'The food was delicious, but the service was painfully slow.'"
with assistant():
    # Free-form reasoning first, ended by a blank line.
    lm += "Reasoning: " + gen("reasoning", max_tokens=200, stop="\n\n") + "\n\n"
    # Then a single digit and one label from a fixed set.
    lm += "Score (1-5): " + gen("score", regex=r"[1-5]") + "\n"
    lm += "Category: " + select(
        ["food-positive", "food-negative", "service-positive", "service-negative", "mixed"],
        name="category",
    )
print(lm["reasoning"], lm["score"], lm["category"])
Output: free-form reasoning followed by a regex-constrained score and an enum-constrained category, with the final fields ready for direct downstream use.
Recipe: local model with token-level constraint
Token-level constrained decoding only works against local models (or providers exposing logprobs hooks). With a local Transformers model, guidance intercepts logits before sampling.
import guidance
from guidance import models, gen, select
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct",
device_map="auto", torch_dtype="bfloat16")
lm += "Classify sentiment of: 'I loved the new release.'\nLabel: " + select(
["positive", "negative", "neutral"], name="label"
)
print(lm["label"])
Output: positive. The model cannot emit any other token — no retries, no parsing.
Cost & rate-limit management
guidance saves money in two ways: it eliminates retries-for-shape and it lets a smaller model do work a larger model would otherwise have to do.
- No retry loops for shape. With strict regex/grammar constraints, you don't need the "parse → fail → retry with stronger instructions" loop most JSON-extraction code has. The saving scales with the failure rate: cost roughly halves only for workloads where unconstrained output needs a second attempt on average.
- Smaller models suffice. A 3B local model with select/regex constraints often matches a 70B model's accuracy on extraction tasks, at a fraction of the cost.
- Per-token cost on remote APIs. Remote constrained decoding (OpenAI / Anthropic via grammar-style retry) does NOT save tokens. Local models do, because rejected tokens never enter the generation.
- Token-level caching. Local models with stable prefixes benefit from KV-cache reuse across calls; guidance keeps the cache warm between sequential gen() calls on the same lm object.
- Bound max_tokens on every gen. Unbounded gen can run the model to its context limit when the stop pattern fails to match.
- Profile slot fill time. A single complex select over 100 enum values can be slower than a regex-bounded gen. Pick the cheaper expression for the same constraint.
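To put a number on the retry saving, the arithmetic below models a parse-and-retry loop in which each attempt independently yields valid output with probability `p_valid`, capped at `max_attempts`. The independence assumption, the function names, and the token and price figures are all illustrative and not measurements.

```python
def expected_attempts(p_valid: float, max_attempts: int) -> float:
    """Expected number of requests issued by a retry loop capped at max_attempts."""
    if not 0.0 < p_valid <= 1.0:
        raise ValueError("p_valid must be in (0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 is issued only if the first k attempts all failed.
    return sum((1.0 - p_valid) ** k for k in range(max_attempts))


def expected_cost(p_valid: float, max_attempts: int,
                  tokens_per_attempt: int, price_per_1k_tokens: float) -> float:
    """Expected spend per request, assuming every attempt uses the same token count."""
    attempts = expected_attempts(p_valid, max_attempts)
    return attempts * tokens_per_attempt / 1000 * price_per_1k_tokens


# Hypothetical figures: 500 tokens per attempt at 0.002 per 1k tokens.
with_retries = expected_cost(0.8, 3, 500, 0.002)
constrained = expected_cost(1.0, 1, 500, 0.002)
print(f"retry loop: {with_retries:.5f}, constrained: {constrained:.5f}")
```

Under this model an 80% first-try success rate costs about 1.24 requests on average, and the cost only approaches double when about half of the unconstrained attempts fail.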
Version migration guide
guidance has been through one major architectural rewrite. The lineage matters when reading old tutorials.
| Era | API style |
|---|---|
| 0.0.x | Handlebars-style templates: {{#gen 'name'}}...{{/gen}}. Tight coupling to specific local models. |
| 0.1.x+ (current) | Python composition: lm += gen("name", regex=...). Provider-agnostic; supports both local and remote. |
Migration discipline:
- Any {{...}} template syntax is the old API. Don't try to port the templates; rewrite as Python composition.
- The rewrite generalised the LM abstraction; old guidance.llms.Transformers(...) calls became guidance.models.Transformers(...). Same idea, different namespace.
- Constraint primitives (gen, select, with_temperature) have been stable since 0.1.x; build new code against them.
- Caveat: minor releases occasionally adjust the public API surface. Pin tight and read the changelog before upgrading across minor versions.
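When auditing a codebase or a folder of old notebooks, a quick scan for template tags shows what still needs rewriting. The sketch below is a hypothetical helper, not a guidance tool, and it only looks for the gen and select tags mentioned above.

```python
import re

# Matches Handlebars-style tags such as {{gen 'name'}}, {{#gen}}, {{/gen}} and {{select ...}}.
LEGACY_TAG = re.compile(r"\{\{\s*[#/]?\s*(?:gen|select)\b[^}]*\}\}")


def find_legacy_tags(source: str) -> list:
    """Return (line_number, tag) pairs for every pre-rewrite template tag in source."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in LEGACY_TAG.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


snippet = "Name: {{gen 'name'}}\nprint({'a': {1: 2}})\nPick: {{#select 'x'}}a{{/select}}"
for line_number, tag in find_legacy_tags(snippet):
    print(f"line {line_number}: {tag}")
```

Ordinary Python dict literals with nested braces are not flagged, because the pattern requires a gen or select keyword directly after the opening braces.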
Troubleshooting common errors
- {{gen ...}} template doesn't run. That's the pre-rewrite API. Migrate to lm += gen(...) composition.
- KeyError reading a slot. The gen/select call did not assign name=. The slot name is what you read back via lm["name"].
- Remote-API constraints look slow. Remote providers (OpenAI, Anthropic) fall back to grammar-style retry: the model emits, guidance validates, retries on mismatch. Switch to a local model for true token-level constraints.
- tiktoken mismatch. When OpenAI ships a new tokenizer, guidance's pin may lag. Upgrade tiktoken explicitly.
- select(...) with 100+ options is slow. The decoding mask grows linearly. Either bucket the options or use a regex over their shared prefix.
- gen(..., stop="\n") runs past the stop. A stop-string is matched at token boundaries, so multi-byte stops can be missed mid-token. Prefer regex for hard limits.
- Async use deadlocks. guidance is mostly synchronous; wrap calls in asyncio.to_thread() from FastAPI / aiohttp handlers.
- Tokenizer differs between model and library. Loading a custom-tokenizer model can confuse guidance's offset calculations. Either use trust_remote_code=True (with the usual caveats) or pre-process inputs to align tokenization.
Security considerations
- Constraint enforcement is not content safety. A grammar that forces "valid JSON" does not stop the model from emitting harmful content inside that JSON. Apply content filtering on outputs separately.
- Local-model execution. models.Transformers(...) runs transformers under the hood, so every trust_remote_code=True concern from there applies here.
- Prompt injection through user data. Slot-filling against attacker-controlled inputs can still leak system prompts or change downstream behaviour. Treat inputs as untrusted; sanitise.
- Regex denial-of-service. A pathological regex (catastrophic backtracking) inside gen(regex=...) can hang. Stick to linear-time patterns.
- Output validation. Even with constraints, validate post-hoc with Pydantic. Constraints prevent shape errors; validation catches semantic errors.
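The difference between shape errors and semantic errors is easiest to see on a concrete record. The sketch below validates extracted slot values with the standard library only, to stay dependency-free; Pydantic, as recommended above, would be the more usual choice. The `Invoice` type, the `validate_invoice` helper, and the rules it applies are illustrative assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


@dataclass(frozen=True)
class Invoice:
    vendor: str
    amount: Decimal
    currency: str


def validate_invoice(slots: dict) -> Invoice:
    """Check semantics that a shape constraint cannot express; raise ValueError on failure."""
    vendor = slots["vendor"].strip()
    if not vendor:
        raise ValueError("vendor is empty")
    try:
        amount = Decimal(slots["amount"])
    except InvalidOperation as exc:
        raise ValueError(f"amount is not a number: {slots['amount']!r}") from exc
    # A value like "0.00" has the right shape but is still not a usable amount.
    if not amount.is_finite() or amount <= 0:
        raise ValueError("amount must be a positive, finite number")
    currency = slots["currency"]
    if currency not in ALLOWED_CURRENCIES:
        raise ValueError(f"unsupported currency: {currency!r}")
    return Invoice(vendor=vendor, amount=amount, currency=currency)


invoice = validate_invoice({"vendor": "Acme Ltd", "amount": "12.50", "currency": "USD"})
print(invoice)
```

A regex such as \d+\.\d{2} accepts "0.00", so the positivity check here is exactly the kind of rule that has to live in post-hoc validation.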
When constrained decoding actually helps
The promise of guidance is "the model can't produce invalid output". That's strictly true for token-level constraints against local models — but only roughly true for grammar-style retry against hosted APIs. The distinction matters in practice.
| Pattern | Local model (token-level) | Hosted API (grammar retry) |
|---|---|---|
| regex constraint | Enforced at decode time; rejected tokens are masked | Retry on parse failure (up to N attempts) |
| select enum | Token mask only allows enum tokens | Retry; may produce out-of-enum text |
| Grammar (CFG) | Strict; per-token rejection | Retry; complex grammars often fail repeatedly |
| Tool / function calling | Use provider-native function calling instead | Use provider-native function calling |
| Latency floor | Adds ~10-30% per step for mask building | Adds retry round-trips on failure |
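The "token mask" column can be made concrete with a toy decoder. The sketch below works on single characters with fixed scores, whereas a real engine masks model logits over a tokenizer vocabulary and the scores change with context. All names are hypothetical, and the options are assumed to be prefix-free (no option is a prefix of another).

```python
def allowed_next_chars(options: list, generated: str) -> set:
    """Characters that keep `generated` a prefix of at least one option."""
    return {
        opt[len(generated)]
        for opt in options
        if opt.startswith(generated) and len(opt) > len(generated)
    }


def constrained_pick(options: list, scores: dict) -> str:
    """Greedy decode: at each step take the best-scoring character among the allowed ones."""
    out = ""
    while out not in options:
        allowed = allowed_next_chars(options, out)
        # Characters outside `allowed` are never considered, however high they score.
        out += max(allowed, key=lambda ch: scores.get(ch, 0.0))
    return out


OPTIONS = ["positive", "negative", "neutral"]
# Toy preferences: "x" scores highest but is never allowed by the mask.
SCORES = {"x": 1.0, "n": 0.9, "u": 0.6, "p": 0.5}

label = constrained_pick(OPTIONS, SCORES)
print(label)
```

The retry column has no equivalent of `allowed_next_chars`: the text is generated first and checked afterwards, which is why out-of-enum output remains possible there.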
When guidance shines:
- Constrained extraction from messy text against a local model.
- High-throughput pipelines where retries cost throughput.
- Hard-shape outputs where downstream consumers can't tolerate format drift.
- Research workflows where you want to express the prompt as Python code with interleaved generation.
When other tools fit better:
- Simple JSON outputs from a hosted API → use the provider's native JSON mode.
- Pydantic-validated extraction → instructor is simpler.
- Grammar-defined structures locally → outlines has competitive grammar support.
Production deployment
guidance ships as a library; production deployment looks like any local-model serving stack with a constraint layer.
- One lm object per worker. models.Transformers(...) holds GPU memory and an interpreter state. Build at startup, reuse per request.
- Constraint cost is non-trivial. For very high QPS, the per-token mask computation adds up. Profile before committing.
- Mixed local/remote routing. Use guidance.models.OpenAI("gpt-4o-mini") for hard cases that need flagship quality; reserve the local model for constrained extraction.
- Backend choice matters. Local backends in guidance (Transformers, llama-cpp) have different constraint enforcement guarantees. Test the exact constraint you need against your deployment backend.
- Container shape. Bake the local model into the image; constraints add no runtime files but consume VRAM.
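The "build at startup, reuse per request" rule can be sketched as a cached loader. Here `load_model` is a hypothetical stand-in for an expensive constructor such as a local model load, and it returns a plain dict so the example runs without a GPU or any model weights.

```python
from functools import lru_cache

load_calls = []  # records each expensive load, for demonstration only


def load_model(name: str) -> dict:
    """Hypothetical stand-in for an expensive model constructor."""
    load_calls.append(name)
    return {"name": name}


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    # The first call per worker process pays the load cost; later calls reuse the object.
    return load_model(name)


def handle_request(prompt: str) -> str:
    lm = get_model("local-extractor")
    return f"{lm['name']} handled {prompt!r}"


for prompt in ("first", "second", "third"):
    print(handle_request(prompt))
print("loads:", len(load_calls))
```

The cache is per process, so a server running several worker processes loads one copy per worker, which matters when sizing GPU memory.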
Multi-provider patterns
guidance abstracts across providers via the models.* family — Transformers, OpenAI, Anthropic, LlamaCpp, Mock. The compositional API is identical:
from guidance import models, gen, select
# Swap one line to change backend
lm = models.OpenAI("gpt-4o-mini")
# or
lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct", device_map="auto")
lm += "Pick a fruit: " + select(["apple", "banana", "cherry"], name="fruit")
Output: different cost / latency / constraint-enforcement properties — same Python.
Practical considerations:
- Token-level constraints only work fully on local models. Remote APIs degrade to retry-on-mismatch.
- Provider rate-limits surface as raw exceptions; guidance does not retry across providers for you. Wrap with tenacity for production.
- Token / cost accounting differs by backend. For mixed deployments, layer a logging wrapper around all gen calls.
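The retry wrapper mentioned above can be sketched without third-party packages. tenacity provides this pattern ready-made; the standard-library version below only shows the mechanics. `RateLimitError` is a hypothetical stand-in for whatever exception a provider SDK raises, and `flaky_generate` simulates a call that succeeds on the third try.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit exception."""


def call_with_backoff(fn, *, attempts: int = 4, base_delay: float = 0.01, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


calls = {"n": 0}


def flaky_generate() -> str:
    # Simulated provider call: rate-limited twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"


result = call_with_backoff(flaky_generate)
print(result, "after", calls["n"], "calls")
```

Only the rate-limit exception is retried; other errors propagate immediately, which avoids repeating requests that can never succeed.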
Evaluation & observability
guidance outputs are typed by construction (a select over ["A","B"] cannot emit "C") — eval focuses on whether the chosen value is correct, not whether the format is parseable.
- Confusion matrices for select-bound outputs. Standard classification metrics.
- Regex-bounded extractions evaluate via exact match against ground-truth fields.
- Trace via langsmith or OTel. Wrap each call with a span; capture lm.user.input, lm["slot_name"], latency, and token counts.
- A/B small vs large model. With constraints, a 3B-7B local model often ties or beats a flagship hosted model on extraction. Verify on your real data before assuming.
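Since select-bound outputs are always one of a known set of labels, evaluation reduces to ordinary classification bookkeeping. The following is a minimal sketch with the standard library; the labels and predictions are made-up sample data, not results.

```python
from collections import Counter


def confusion_matrix(gold: list, predicted: list) -> Counter:
    """Count (gold, predicted) label pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def accuracy(matrix: Counter) -> float:
    """Fraction of examples where the predicted label equals the gold label."""
    total = sum(matrix.values())
    correct = sum(n for (g, p), n in matrix.items() if g == p)
    return correct / total if total else 0.0


# Made-up sample data for illustration.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "positive"]

matrix = confusion_matrix(gold, pred)
for (g, p), n in sorted(matrix.items()):
    print(f"gold={g:<9} predicted={p:<9} count={n}")
print("accuracy:", accuracy(matrix))
```

For regex-bounded extractions the same `accuracy` function works as exact match if each field value is treated as a label.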
When NOT to use this
- You only need JSON mode. instructor + Pydantic on top of OpenAI/Anthropic native function calling is simpler and well-supported.
- You want self-improving prompts. dspy optimises prompts and examples programmatically, which is a different paradigm.
- All your models are hosted-API and you want speed. guidance on remote APIs is grammar-style retry; you may not see the speedup of true constrained decoding.
- You want a templating language for prompts. Use Jinja, LangChain ChatPromptTemplate, or Mustache.
- The output is free-form prose. guidance is for shaped outputs; if you don't have a shape, the constraints add complexity without benefit.
Ecosystem integrations
guidance is intentionally light on integrations — most users plug it into a larger framework rather than building one.
| Layer | What plugs in |
|---|---|
| Local models | transformers (any causal LM), llama-cpp-python (GGUF), Triton-backed servers via raw HTTP. |
| Hosted APIs | OpenAI, Anthropic, Azure OpenAI via their official SDKs. Gemini support varies by guidance version. |
| Tokenization | tiktoken (OpenAI), tokenizers (Hugging Face). Both ship transitively. |
| Output validation | pydantic for post-extraction shape/semantic checks. |
| Observability | langsmith, OpenTelemetry, custom — wrap calls in @traceable to capture before/after state. |
| Frameworks | LangChain doesn't wrap guidance natively; instead, use guidance for constrained extraction steps and pass results into LangChain runnables as plain Python values. |
See also
- AI: guidance — constrained decoding, grammars, examples
- Concept: api — structured-output API design