cheat sheet

guidance

Interleave Python control flow with LLM generation and enforce structured output using guidance. Covers gen(), select(), chat blocks, regex constraints, JSON schemas, and token healing.

updated 04-27-2026

guidance — Constrained LLM Generation

What it is

guidance is a Python library from Microsoft that turns LLM generation into an interleaved programming model — you write Python code that alternates between fixed strings, constrained generation slots (gen()), and discrete choices (select()). The model fills in only the parts you mark as generatable, which guarantees output structure without post-hoc parsing. It also supports token healing, which fixes the boundary artefact that occurs when a suffix you provide splits a token the model would have otherwise produced whole.

Install

bash

# Core + Transformers backend (local models)
pip install guidance
pip install transformers accelerate torch

# Or for OpenAI backend
pip install guidance openai

Output: (none — exits 0 on success)

Quick example

python

import guidance
from guidance import gen, select

# Load a local model
lm = guidance.models.Transformers("gpt2", device_map="cpu")

with lm.stream() as stream:
    lm += (
        "The capital of France is " + gen("capital", max_tokens=5, stop=".") + ".\n"
        "The capital of Germany is " + gen("capital_de", max_tokens=5, stop=".") + ".\n"
    )
    print(dict(stream))

Output:

text

{'capital': 'Paris', 'capital_de': 'Berlin'}

When / why to use it

Guaranteeing structured output (JSON, YAML, code) without relying on prompt engineering or post-hoc parsing — the model is physically constrained to the schema.
Filling in templates where some fields are known and others are generated — avoids full-document generation when only one slot changes.
Building decision trees or dialogue flows where each step constrains the next.
Extracting structured data from free text with regex-constrained slots.
Debugging generation: see exactly which tokens the model produced for each named slot.

Common pitfalls

Backend support varies — not all features work on all backends. gen(regex=...) requires a model backend that supports token-level logit masking (Transformers, LlamaCpp). OpenAI and Anthropic backends have limited constraint support because the APIs do not expose logit masks.

gen() without stop= or max_tokens= — without a stopping condition, the model generates indefinitely until the context window is full. Always set at least one.

Token boundary artefacts — if you append a string that starts mid-token (e.g. the model would produce "Hello" as one token but you provide "Hel" as context), generation quality degrades. Token healing fixes this automatically on supported backends — no action needed, but be aware it is happening.

Named gen() slots (e.g. gen("answer", ...)) store the generated text as a dictionary key. Access the result after the block with lm["answer"].

select() is faster than gen() for binary or small-vocabulary choices because it scores all options in a single forward pass rather than generating token by token.

Richer example — structured data extraction

python

import guidance
from guidance import gen, select, system, user, assistant

lm = guidance.models.Transformers("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")

review_text = (
    "The new mechanical keyboard is fantastic — great tactile feedback, "
    "excellent build quality. Shipping took 2 weeks which was a bit slow, "
    "and the price is a little steep at $180. Overall I'd recommend it."
)

with lm.stream() as stream:
    lm += f"""Extract a product review into structured fields.

Review: {review_text}

Product name: """ + gen("product", max_tokens=10, stop="\n") + """
Sentiment: """ + select(["positive", "negative", "mixed"], name="sentiment") + """
Rating (1-5): """ + gen("rating", regex="[1-5]") + """
Key strengths: """ + gen("strengths", max_tokens=40, stop="\n") + """
Key weaknesses: """ + gen("weaknesses", max_tokens=40, stop="\n") + """
Would recommend: """ + select(["yes", "no"], name="recommend")

result = dict(stream)
print(result)

Output:

text

{
  'product': 'mechanical keyboard',
  'sentiment': 'mixed',
  'rating': '4',
  'strengths': 'tactile feedback, excellent build quality',
  'weaknesses': 'slow shipping, high price',
  'recommend': 'yes'
}

Model backends

guidance supports several model backends. The backend determines which constraints are available.

python

import guidance

# Transformers (local, full constraint support)
lm = guidance.models.Transformers("gpt2")
lm = guidance.models.Transformers("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")

# LlamaCpp (local GGUF models, full constraint support)
lm = guidance.models.LlamaCpp("./models/mistral-7b.Q4_K_M.gguf", n_ctx=4096)

# OpenAI (cloud, limited constraint support — no regex/logit masking)
import os
lm = guidance.models.OpenAI("gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

# Anthropic (cloud, limited constraint support)
lm = guidance.models.Anthropic("claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])

gen() — text generation slots

gen() marks a position in the text where the model generates freely, subject to stopping conditions and optional regex constraints. The result is stored under the given name.

python

import guidance
from guidance import gen

lm = guidance.models.Transformers("gpt2")

# Basic — generate up to 30 tokens
lm += "A haiku about Python:\n" + gen("haiku", max_tokens=30)
print(lm["haiku"])

# Stop on a delimiter
lm2 = lm + "Country: France\nCapital: " + gen("capital", stop="\n")
print(lm2["capital"])   # Paris

# Regex constraint — only digits
lm3 = lm + "The year Python was created: " + gen("year", regex=r"\d{4}")
print(lm3["year"])   # 1991

Output:

text

Snakes in brackets,
Indentation guides the mind,
Lists hold everything.

Paris
1991

select() — discrete choice

select() scores all candidate strings in a single forward pass and picks the highest-probability option. It is significantly faster than gen() for small, fixed choice sets.

python

import guidance
from guidance import select

lm = guidance.models.Transformers("gpt2")

lm += (
    "Is Python a compiled or interpreted language? "
    + select(["compiled", "interpreted", "both"], name="lang_type")
    + "\n"
    "Is Python statically or dynamically typed? "
    + select(["statically", "dynamically"], name="typing")
)

print(f"Language type: {lm['lang_type']}")
print(f"Typing: {lm['typing']}")

Output:

text

Language type: interpreted
Typing: dynamically

Chat blocks — system, user, assistant

For instruction-tuned models, use the context managers system(), user(), and assistant() to structure the conversation. They apply the model's chat template automatically.

python

import guidance
from guidance import gen, system, user, assistant

lm = guidance.models.Transformers(
    "HuggingFaceH4/zephyr-7b-beta",
    device_map="auto",
)

with system():
    lm += "You are a concise Python tutor. Keep answers under 3 sentences."

with user():
    lm += "What is a decorator in Python?"

with assistant():
    lm += gen("answer", max_tokens=80, stop="\n\n")

print(lm["answer"])

Output:

text

A decorator is a function that wraps another function to extend or modify its behaviour
without changing its source code. They are applied with the `@decorator_name` syntax above
a function definition. Common uses include logging, authentication, and caching.

Structured JSON output with regex

Combine multiple gen() and select() calls to produce a JSON object where the model can only fill in the designated slots — not arbitrary text.

python

import guidance
from guidance import gen, select

lm = guidance.models.Transformers("gpt2")

person_text = "Alice Dev is a 32-year-old software engineer based in London."

lm += f"""Extract person info from the following text as JSON.

Text: {person_text}

{{
  "name": \"""" + gen("name", stop='"') + """",
  "age": """ + gen("age", regex=r"\d{1,3}") + """,
  "occupation": \"""" + gen("occupation", stop='"') + """",
  "city": \"""" + gen("city", stop='"') + """"
}}"""

import json
obj = {
    "name": lm["name"],
    "age": int(lm["age"]),
    "occupation": lm["occupation"],
    "city": lm["city"],
}
print(json.dumps(obj, indent=2))

Output:

json

{
  "name": "Alice Dev",
  "age": 32,
  "occupation": "software engineer",
  "city": "London"
}

Functions — reusable guidance programs

Wrap guidance logic in a Python function decorated with @guidance to create reusable components.

python

import guidance
from guidance import gen, select

@guidance
def classify_text(lm, text: str, categories: list[str]) -> guidance.models.Model:
    lm += (
        f"Classify the following text into one of: {', '.join(categories)}.\n\n"
        f"Text: {text}\n\n"
        f"Category: " + select(categories, name="category") + "\n"
        f"Confidence: " + select(["high", "medium", "low"], name="confidence")
    )
    return lm

model = guidance.models.Transformers("gpt2")
result = model + classify_text(
    "The new firmware update fixed the boot loop issue.",
    categories=["bug_report", "feature_request", "praise", "question"],
)
print(result["category"], result["confidence"])

Output:

text

bug_report high

Streaming generation

guidance streams tokens as they are generated. Use the lm.stream() context manager to access results incrementally.

python

import guidance
from guidance import gen

lm = guidance.models.Transformers("gpt2")

with lm.stream() as streamer:
    lm += "Top 3 Python tips:\n1. " + gen("tip1", max_tokens=20, stop="\n")
    lm += "\n2. " + gen("tip2", max_tokens=20, stop="\n")
    lm += "\n3. " + gen("tip3", max_tokens=20, stop="\n")

for key, value in streamer.items():
    print(f"{key}: {value}")

Output:

text

tip1: Use list comprehensions instead of loops when possible.
tip2: Prefer f-strings for string formatting over .format().
tip3: Use context managers (with statements) for resource cleanup.

Token healing

Token healing corrects the artefact that arises when the string you provide as context ends mid-token. guidance retokenises the suffix automatically so the model sees a clean token boundary, improving generation quality without any code change on your part.

python

import guidance
from guidance import gen

lm = guidance.models.Transformers("gpt2")

# Without token healing, appending "Hel" before gen() would cause "lo" to look odd
# to the model (since "Hello" is one token). guidance heals this automatically.
lm += "Hello" + gen("continuation", max_tokens=10)
print(lm["continuation"])

Output:

text

 world! How are you today?

Conditional generation with Python control flow

Because guidance interleaves Python code with model calls, you can use standard control flow to branch generation dynamically.

python

import guidance
from guidance import gen, select

@guidance
def smart_reply(lm, query: str) -> guidance.models.Model:
    lm += f"Query: {query}\n"
    lm += "Is this a factual question or a creative request? " + select(
        ["factual", "creative"], name="query_type"
    ) + "\n"

    if lm["query_type"] == "factual":
        lm += "Answer (cite a source if possible): " + gen("answer", max_tokens=60)
    else:
        lm += "Creative response: " + gen("answer", max_tokens=100, temperature=0.9)

    return lm

model = guidance.models.Transformers("gpt2")
result = model + smart_reply("What year was Python first released?")
print(f"Type: {result['query_type']}")
print(f"Answer: {result['answer']}")

Quick reference

Task	Code
Load Transformers	`guidance.models.Transformers("model-name", device_map="auto")`
Load LlamaCpp	`guidance.models.LlamaCpp("model.gguf", n_ctx=4096)`
Load OpenAI	`guidance.models.OpenAI("gpt-4o-mini", api_key=...)`
Free generation	`gen("name", max_tokens=50)`
Stop token	`gen("name", stop="\n")`
Regex constraint	`gen("name", regex=r"\d{4}")`
Discrete choice	`select(["a", "b", "c"], name="choice")`
System block	`with system(): lm += "..."`
User block	`with user(): lm += "..."`
Assistant block	`with assistant(): lm += gen("reply", max_tokens=100)`
Read slot value	`lm["slot_name"]`
Reusable program	`@guidance def fn(lm, ...): ... return lm`
Stream results	`with lm.stream() as s: ...; dict(s)`
Token healing	Automatic on Transformers and LlamaCpp backends