cheat sheet
transformers
Package-level reference for the Hugging Face transformers library on PyPI — install extras, backend choice, versioning, and alternatives.
What it is
transformers is Hugging Face's flagship Python library for loading and running pre-trained neural networks — language models, vision models, speech models, and multimodal models. It provides a unified AutoModel / AutoTokenizer / pipeline API on top of PyTorch, TensorFlow, JAX/Flax, and (increasingly) ONNX Runtime backends.
The library is tightly coupled to the Hugging Face Hub — from_pretrained("model-id") downloads weights, tokenizer files, and config from a hub repo. The Hub now hosts well over a million model checkpoints.
Install
pip install transformers
Output: installs the library but no ML backend — you still need PyTorch, TF, or JAX separately
pip install "transformers[torch]"
Output: installs transformers plus a compatible PyTorch wheel
pip install "transformers[torch]" accelerate
Output: the standard "modern LLM" combo — adds device_map="auto" and multi-GPU support
uv add transformers torch accelerate
Output: dependencies resolved + added to pyproject.toml
poetry add transformers torch
Output: updated lockfile + virtualenv install
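Because a bare install ships without an ML backend, it can help to check what is actually importable in the current environment. The following is a minimal sketch using only the standard library; the helper name detect_backends and the module list are illustrative choices, not part of transformers.

```python
import importlib.util
from typing import Dict, Iterable

# Import names (not PyPI names) of the backends and helpers mentioned above.
DEFAULT_MODULES = ("torch", "tensorflow", "jax", "flax", "accelerate")


def detect_backends(modules: Iterable[str] = DEFAULT_MODULES) -> Dict[str, bool]:
    """Report which top-level modules are importable, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, present in detect_backends().items():
        print(f"{name:12s} {'installed' if present else 'missing'}")
```

Running it before the first from_pretrained call makes a missing backend visible up front.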
Versioning & Python support
- Current stable is the 4.x series (and has been for the entire LLM era). Major bumps are rare; minor releases (4.45, 4.46, …) ship roughly monthly and frequently add new model architectures.
- Python 3.9+ on current releases; 3.10+ recommended.
- Loose semver: minor releases may add new APIs and deprecate old ones, but rarely break existing model loading. Patch releases are pure bug fixes.
- The library lags slightly behind the latest models on the Hub for architecture support (a brand-new model often needs trust_remote_code=True until its class is upstreamed).
- Pinning matters: transformers==4.X matched with torch>=Y is the typical compat matrix; the Hugging Face release notes call out the floors.
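A pinned pair is only useful if something checks it. The sketch below compares installed versions against floors at startup. The helper names are hypothetical, and any floor values you pass in should be copied from the release notes, not taken from this example.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Leading numeric release of a version string, e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_floor(installed: str, floor: str) -> bool:
    """True if `installed` is at least `floor`, comparing numeric components only."""
    a, b = release_tuple(installed), release_tuple(floor)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None
```

Pre-release and local suffixes (".dev0", "+cu121") are ignored here; a stricter check could use the packaging library instead.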
Package metadata
- Maintainer: Hugging Face (the huggingface GitHub org)
- Project home: github.com/huggingface/transformers
- Docs: huggingface.co/docs/transformers
- PyPI: pypi.org/project/transformers
- License: Apache-2.0
- Governance: commercial company + huge open-source contributor base
- First released: 2018 (originally pytorch-pretrained-bert)
- Downloads: tens of millions per month
Optional dependencies & extras
transformers defines many [extra] groups. The most relevant:
| Extra | Pulls in |
|---|---|
| transformers[torch] | PyTorch wheel matched to the library's tested floor |
| transformers[tf] | TensorFlow 2.x |
| transformers[flax] | JAX + Flax |
| transformers[sentencepiece] | The sentencepiece tokenizer (required for LLaMA, T5, Mistral, etc.) |
| transformers[tokenizers] | The Rust-backed tokenizers library (pulled in by default in most paths) |
| transformers[onnxruntime] | ONNX export + inference with onnxruntime |
| transformers[serving] | Adds FastAPI for transformers serve |
| transformers[vision] | Pillow + image-processing deps for vision models |
| transformers[audio] | librosa, soundfile for speech models |
| transformers[all] | Everything above (very large install) |
Companion packages from the same org:
- accelerate: multi-GPU / mixed-precision / device_map="auto"
- peft: parameter-efficient fine-tuning (LoRA, QLoRA, adapters)
- bitsandbytes: 8-bit and 4-bit quantisation
- datasets: Hugging Face dataset loading and streaming
- safetensors: fast, memory-safe checkpoint format (now default)
- huggingface-hub: Hub client, used by from_pretrained
Alternatives
| Package | Trade-off |
|---|---|
| vllm | Production-grade LLM inference server with much higher throughput than calling model.generate() directly. Use when serving at scale. |
| text-generation-inference (TGI) | Hugging Face's own production serving stack. |
| onnxruntime | Run exported ONNX models with no Python ML framework. Smaller deploy footprint. |
| tensorflow-hub | TF-native pre-trained model hub. Mostly superseded by Hugging Face Hub today. |
| mlx (Apple Silicon) | Native Apple Silicon inference; some HF models mirrored. Use for Mac-local LLMs. |
| llama-cpp-python | GGUF-quantised CPU/GPU inference. Use when you need llama.cpp's quant formats. |
Common gotchas
- Model card vs Hub repo vs Inference API are different things. The model "card" is the README; the Hub repo holds weights; the Inference API is a separate hosted service. from_pretrained only touches the Hub repo.
- trust_remote_code=True is code execution. Many cutting-edge models ship custom modeling code that loads via this flag, and it runs arbitrary Python from the Hub. Only enable for repos you trust, ideally pinned to a revision SHA.
- device_map="auto" needs accelerate. Without it, from_pretrained raises an error asking you to install accelerate. If you drop the argument to get past the error, the model loads on CPU and you wonder why inference is glacial.
- FlashAttention is opt-in. Pass attn_implementation="flash_attention_2" to from_pretrained and install flash-attn separately; it's not in any extra.
- Tokenizer mismatch with sentencepiece-based models. Loading LLaMA / Mistral / T5 without transformers[sentencepiece] can raise a cryptic ImportError. Install the extra.
- Big models OOM silently on Windows. device_map="auto" will happily spill to disk via accelerate on Linux, but pagefile semantics on Windows hit hard. Use Linux / WSL for >7B models.
- pipeline() is slow per-call. It re-runs framework overhead each invocation. Build the model + tokenizer manually and batch inputs for throughput.
- Hub auth required for gated models. LLaMA, Gemma, and others need huggingface-cli login and an approved license click on the website before from_pretrained works.
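Two of the gotchas above (remote code without a pinned revision, device_map="auto" without accelerate) can be caught before any model is loaded. This is a sketch of a hypothetical guard, loading_kwargs, that builds keyword arguments for from_pretrained. The argument names trust_remote_code, revision, and device_map come from the text above; the policy it enforces is an assumption of this example.

```python
import importlib.util
import re
from typing import Any, Dict, Optional

_FULL_SHA = re.compile(r"[0-9a-f]{40}")  # a full git commit hash


def loading_kwargs(trust_remote_code: bool = False,
                   revision: Optional[str] = None,
                   want_device_map: bool = True) -> Dict[str, Any]:
    """Build from_pretrained keyword arguments, failing early on risky combinations."""
    kwargs: Dict[str, Any] = {}
    if revision is not None:
        kwargs["revision"] = revision
    if trust_remote_code:
        # Branch names such as "main" can move; only a commit SHA is immutable.
        if revision is None or not _FULL_SHA.fullmatch(revision):
            raise ValueError("trust_remote_code=True requires a full 40-character commit SHA")
        kwargs["trust_remote_code"] = True
    if want_device_map:
        if importlib.util.find_spec("accelerate") is None:
            raise RuntimeError('device_map="auto" needs accelerate: pip install accelerate')
        kwargs["device_map"] = "auto"
    return kwargs
```

The result would be splatted into the call, for example AutoModel.from_pretrained(model_id, **loading_kwargs(...)).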
Ecosystem integrations
transformers is the trunk of a wider Hugging Face ecosystem. The companion packages overlap less than their names suggest — each owns a slice of the lifecycle.
| Package | What it owns |
|---|---|
| accelerate | Multi-GPU placement, mixed-precision, device_map="auto", distributed launch (accelerate launch …). |
| peft | Parameter-efficient fine-tuning: LoRA, QLoRA, prefix tuning, IA3, adapters. |
| bitsandbytes | 8-bit and 4-bit quantisation kernels for CUDA. Powers BitsAndBytesConfig. |
| optimum | Hardware-specific backends: ONNX Runtime, OpenVINO, TensorRT, Intel Habana, AWS Neuron. |
| safetensors | Memory-mapped, pickle-free checkpoint format. Default for new uploads. |
| datasets | Streaming / mapped tabular dataset library. The standard pre-Trainer data layer. |
| tokenizers | Rust-backed fast tokenizers; AutoTokenizer instantiates these by default. |
| huggingface-hub | Hub client. from_pretrained calls it transitively; CLI tools (huggingface-cli) come from here. |
| evaluate | Standard metrics (accuracy, f1, rouge, bleu, bertscore). Compatible with Trainer.compute_metrics. |
| trl | Reinforcement learning from human feedback: SFTTrainer, DPOTrainer, PPOTrainer. |
| transformers.js | Browser-side inference (different package; same model files). |
Inference-server siblings (not strict integrations but the same model files):
- vllm: high-throughput LLM serving with continuous batching and paged attention.
- text-generation-inference (TGI): Hugging Face's own production server.
- llama-cpp-python: GGUF-quantised inference on CPU + small GPUs.
Most production setups mix these — fine-tune with transformers/peft, serve with vLLM, observe with langsmith or OpenTelemetry.
Real-world recipes
Recipe: chat-template inference server
A FastAPI service that wraps a chat model with proper chat templating and token streaming. Each request runs its own generate call; batching across requests is not shown here.
from threading import Thread

import torch
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",  # safe default; flash_attention_2 if installed
)
tok.pad_token = tok.eos_token
app = FastAPI()

@app.post("/chat")
def chat(payload: dict):
    # Render the conversation with the model's own chat template.
    prompt = tok.apply_chat_template(payload["messages"], tokenize=False, add_generation_prompt=True)
    # The rendered prompt already contains the special tokens, so don't add them again.
    inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a thread and stream tokens as they arrive.
    Thread(target=model.generate, kwargs=dict(
        **inputs, streamer=streamer, max_new_tokens=512, do_sample=True, temperature=0.7,
    )).start()
    return StreamingResponse(streamer, media_type="text/plain")
Output: clients receive tokens incrementally; one worker per GPU is the right shape.
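The endpoint above passes payload["messages"] straight into the chat template, so a malformed request surfaces as a template error deep inside the handler. A small validation step in front of it could look like the sketch below. The allowed role set and the rule that the last turn must come from the user are assumptions of this example; some chat templates accept other roles.

```python
from typing import Any, Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: adjust per chat template


def validate_messages(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return payload['messages'] if it looks like a chat transcript, else raise ValueError."""
    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("payload['messages'] must be a non-empty list")
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {index} is not an object")
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {index} has an unsupported role")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {index} content must be a string")
    # With add_generation_prompt=True the model writes the next assistant turn.
    if messages[-1]["role"] != "user":
        raise ValueError("the last message must come from the user")
    return messages
```

In the handler, a ValueError would typically be translated into an HTTP 400 response.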
Recipe: LoRA fine-tune + merge for serving
Given a small adapter already trained with PEFT (saved here under ./lora-out/final), merge it back into the base model and save a standalone checkpoint that serving stacks can load:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct",
torch_dtype=torch.bfloat16, device_map="auto")
adapter = PeftModel.from_pretrained(base, "./lora-out/final")
merged = adapter.merge_and_unload()
merged.save_pretrained("./llama3-merged", safe_serialization=True)
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./llama3-merged")
Output: ./llama3-merged is a complete fine-tuned checkpoint ready for vLLM / TGI / any HF-compatible loader.
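To see why the adapter is small before it is merged: LoRA stores the update to a weight matrix of shape d_out × d_in as the product of two low-rank factors of shapes d_out × r and r × d_in. The arithmetic can be sketched as follows; the 4096 × 4096 layer size and rank 16 are illustrative numbers, not measurements from the recipe above.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors that LoRA trains for that matrix."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096   # illustrative projection size
    rank = 16             # illustrative LoRA rank
    full = full_param_count(d_in, d_out)
    lora = lora_param_count(d_in, d_out, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

After merge_and_unload the low-rank product is added into the base weights, so the merged checkpoint has the same shape and size as the original model.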
Recipe: streaming embedding pipeline over a parquet corpus
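import pyarrow as pa
import pyarrow.parquet as pq
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5").to("cuda").eval()

@torch.no_grad()
def embed(batch):
    enc = tok(batch["text"], padding=True, truncation=True, return_tensors="pt", max_length=512).to("cuda")
    h = model(**enc).last_hidden_state[:, 0]  # CLS pooling for BGE
    h = F.normalize(h, p=2, dim=1)
    return {"embedding": h.cpu().tolist()}

# streaming=True returns an iterable dataset, so rows are read lazily.
ds = load_dataset("parquet", data_files="docs-*.parquet", streaming=True)["train"]
out = ds.map(embed, batched=True, batch_size=64)
# Write batch by batch; works whether or not your datasets version offers to_parquet on streams.
writer = None
for rows in out.iter(batch_size=1024):
    table = pa.Table.from_pydict(rows)
    writer = writer or pq.ParquetWriter("docs-embedded.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()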
Output: memory-bounded embedding pipeline that scales to corpora larger than RAM.
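If the corpus contains repeated documents, hashing the text before embedding avoids paying for the same vector twice. The sketch below is one way to do that with an in-memory dict; embed_fn stands in for a batched embedding function and, like the helper names, is an assumption of this example.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def content_key(text: str) -> str:
    """Stable cache key derived from the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts: Sequence[str],
                     embed_fn: Callable[[List[str]], List[Vector]],
                     cache: Dict[str, Vector]) -> List[Vector]:
    """Embed texts, calling embed_fn only for content not seen before."""
    keys = [content_key(text) for text in texts]
    missing: Dict[str, str] = {}  # key -> text, deduplicated, in first-seen order
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text
    if missing:
        vectors = embed_fn(list(missing.values()))
        for key, vector in zip(missing, vectors):
            cache[key] = vector
    return [cache[key] for key in keys]
```

For a corpus larger than RAM the dict would be swapped for an on-disk key-value store; the lookup logic stays the same.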
Recipe: zero-shot classification ladder
Triage incoming support tickets cheaply with a zero-shot model, escalate to an LLM only if confidence is low.
from transformers import pipeline

zsc = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-base-zeroshot-v2.0", device=0)
CATEGORIES = ["billing", "bug report", "feature request", "account access", "other"]

def route(ticket: str):
    res = zsc(ticket, candidate_labels=CATEGORIES, multi_label=False)
    label, score = res["labels"][0], res["scores"][0]
    return label if score >= 0.75 else "escalate-to-llm"
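Output: most tickets are routed by the small classifier and only low-confidence ones reach the LLM, which is far cheaper than sending every ticket to an LLM. Zero-shot scores are not calibrated probabilities, so treat 0.75 as a starting point and tune it on labelled tickets.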
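The 0.75 cut-off in route is a judgment call. Given a set of labelled tickets, a threshold can be chosen from data instead. The sketch below picks the lowest score threshold whose accepted predictions still reach a target precision; the function name and the (score, was_correct) input format are assumptions of this example.

```python
from typing import Iterable, Optional, Tuple


def pick_threshold(samples: Iterable[Tuple[float, bool]],
                   target_precision: float) -> Optional[float]:
    """Lowest threshold t such that predictions with score >= t reach target_precision.

    samples: (top_score, top_label_was_correct) pairs from labelled tickets.
    Returns None if no threshold reaches the target.
    """
    ordered = sorted(samples, key=lambda sample: sample[0], reverse=True)
    best = None
    correct = 0
    for count, (score, was_correct) in enumerate(ordered, start=1):
        correct += bool(was_correct)
        # Tied scores are accepted or rejected together, so evaluate after the tie ends.
        if count < len(ordered) and ordered[count][0] == score:
            continue
        if correct / count >= target_precision:
            best = score
    return best
```

Lowering the threshold sends fewer tickets to the LLM but accepts more wrong labels, so the target precision is the real knob.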
Recipe: offline-first deployment
Pre-bake the model into the container; lock down Hub access at runtime:
FROM python:3.12-slim
RUN pip install --no-cache-dir transformers torch accelerate
RUN python -c "from transformers import AutoModel, AutoTokenizer; \
AutoModel.from_pretrained('intfloat/e5-small-v2'); \
AutoTokenizer.from_pretrained('intfloat/e5-small-v2')"
ENV HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1
COPY . /app
CMD ["python", "/app/serve.py"]
Output: model weights baked into the image; runtime has no Hub dependency.
Performance tuning
The default model.generate(...) call leaves a lot of throughput on the table. The biggest wins, in rough order of impact:
- Batch on the server, not the client. For inference servers, batch concurrent requests inside one generate call when possible. Even mismatched-length prompts benefit from padded batching.
- Attention implementation. attn_implementation="sdpa" (PyTorch's scaled-dot-product attention) is a safe default on modern PyTorch. flash_attention_2 is faster and lower-memory but needs pip install flash-attn and supported GPUs (Ampere+).
- KV cache is on by default for generate. Don't disable it; check that model.config.use_cache is True.
- torch.compile for stable shapes. model = torch.compile(model, mode="reduce-overhead") improves throughput materially on Ampere+ GPUs once the graph stabilises. Compile time is heavy, so it is only worth it for long-running servers.
- Quantisation. 4-bit NF4 (BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")) cuts weight memory roughly 4× relative to 16-bit weights, usually with a small quality loss that you should measure on your own task. Pre-quantised GPTQ / AWQ checkpoints load faster at inference time.
- Mixed precision. torch_dtype=torch.bfloat16 on Ampere+; float16 on older cards. Either halves memory vs float32.
- Gradient checkpointing for training. Trade recompute for memory: gradient_checkpointing=True in TrainingArguments lets larger batches/sequences fit.
- Use optim="adamw_bnb_8bit" for training. 8-bit optimiser states take about a quarter of the memory of 32-bit AdamW states, which frees a sizeable share of training VRAM.
- DataLoader workers. For training, num_workers=4 (or higher) + pin_memory=True removes CPU bottlenecks.
- TORCH_LOGS=recompiles while developing. It catches silent recompilation pitfalls under torch.compile.
- Multi-GPU. device_map="auto" splits across GPUs naively (layer by layer, so only one GPU computes at a time). Use accelerate launch --multi_gpu + Trainer for proper data-parallel training; FSDP or DeepSpeed ZeRO for sharded training of models that do not fit on one GPU.
- CPU inference. optimum-intel + OpenVINO accelerates BERT-class encoders by 3-5× on Xeon CPUs. ONNX Runtime is a portable alternative.
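The memory claims for precision and quantisation follow from simple arithmetic: weight storage is roughly the parameter count times the bytes per parameter. A rough estimator, assuming an 8-billion-parameter model and ignoring activations, KV cache, and quantisation metadata, might look like this:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal GB (weights only, no runtime overhead)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    n_params = 8e9  # illustrative: an 8B-parameter model
    for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("nf4", 4)]:
        print(f"{label:9s} ~{weight_memory_gb(n_params, bits):.0f} GB")
```

Real usage is higher than these figures because generation also needs room for the KV cache and temporary buffers.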
For production inference at scale, hand off to vLLM or TGI. model.generate() in transformers was not designed as a serving engine; those servers add continuous batching, paged attention, and tensor parallelism that you'd otherwise reimplement.
Version migration guide
The 4.x line has lasted the entire modern LLM era. Within 4.x, churn comes from monthly minor releases that frequently add models and occasionally tighten APIs.
| Roughly | What tends to break |
|---|---|
| <4.30 | Pre-safetensors default. Many tutorials still call .bin loading explicitly. |
| ~4.35 | attn_implementation arg standardised. Older code passes attention via separate flags. |
| ~4.40 | device_map="auto" paths assume accelerate is installed (it isn't always). |
| ~4.45 | Chat template defaults tightened; some older instruct models require apply_chat_template explicitly. |
| Latest | Steady drumbeat of new model architectures; trust_remote_code=True is the bridge until the class is upstreamed. |
Migration discipline:
- Pin both library and torch. transformers==4.X + torch==Y.Z are the supported pair listed in release notes.
- Re-test fine-tunes after upgrades. New defaults (loss-fns, gradient clipping, LR scheduler) occasionally affect downstream metrics.
- from_pretrained(...) arg deprecations are noisy but generally non-breaking for a release or two. Address before they go silent.
- Tokenizer special-token handling has tightened over time. Pin the tokenizer JSON when reproducibility matters.
Hedge: when in doubt, the Hugging Face Transformers Release Notes on GitHub list every behavioral change per release — search there rather than guessing.
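One way to address deprecations before they go silent is to turn them into test failures. The sketch below escalates FutureWarning and DeprecationWarning to errors inside a block. It assumes the deprecation is emitted through Python's warnings module; notices sent through the logging system would not be caught this way.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def deprecations_as_errors():
    """Inside this block, deprecation-style warnings raise instead of printing."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)
        warnings.simplefilter("error", category=DeprecationWarning)
        yield


def load_with_old_argument():
    """Stand-in for library code that still accepts a deprecated argument."""
    warnings.warn("old_argument is deprecated", FutureWarning)
    return "loaded"
```

Wrapping the model-loading step of a test suite in this context manager makes an upgrade fail loudly at the first deprecated argument.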
Security considerations
transformers runs arbitrary model weights, occasionally arbitrary model code, and pulls files from the public Hub. The threat surface deserves attention.
- trust_remote_code=True is code execution. Many cutting-edge models ship custom modeling code that loads via this flag. Audit the repo and pin a revision SHA (revision="abc123") before enabling. Treat unknown repos like unknown PyPI packages.
- Pickle in .bin checkpoints. Pre-safetensors .bin files are Python pickles, so loading one can run arbitrary code. Prefer safetensors, and pass use_safetensors=True to from_pretrained so a repo without safetensors weights fails instead of falling back to pickle.
- Hub supply chain. Anyone can publish to the Hub. Pin revision=<commit-sha> for production use, not branch names.
- Tokenizer files. tokenizer.json is data, not code, but malicious vocab can produce token IDs that confuse downstream code. Defence-in-depth: validate vocab size against model.config.vocab_size.
- Gated model licenses. LLaMA, Gemma, Mistral, others have license obligations beyond Apache-2.0. Track which models touch your build pipeline.
- PII in prompts and outputs. Local inference avoids sending data to a third party, but logs, traces, and dataset caches will still capture it. Apply the same redaction policy you'd use for hosted APIs.
- Adversarial prompts. Local models don't have the abuse mitigations a hosted service does. If users supply prompts directly, add a content classifier upstream.
- Network egress at load time. from_pretrained hits the Hub. Use HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 + a pre-baked cache in air-gapped environments.
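A cheap pre-deployment check for the pickle risk is to scan a downloaded checkpoint directory for weight files in pickle-based formats. The sketch below is a heuristic based on file extensions only (the suffix list is an assumption, and a .bin file is not always a pickle), so treat a hit as a prompt to investigate, not as proof.

```python
from pathlib import Path
from typing import List

# Extensions commonly used for pickle-based PyTorch checkpoints (heuristic).
PICKLE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".ckpt"}


def find_pickle_weights(checkpoint_dir: str) -> List[str]:
    """Relative paths of files under checkpoint_dir that look like pickle checkpoints."""
    root = Path(checkpoint_dir)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in PICKLE_SUFFIXES
    )
```

An empty result for a directory that contains .safetensors files is the shape you want before baking the model into an image.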
Troubleshooting common errors
- OSError: Can't load tokenizer for ... Usually a missing extra. pip install sentencepiece (LLaMA/T5/Mistral) or tiktoken (some OpenAI-derived tokenizers).
- ImportError: This requires you to install the latest version of bitsandbytes. Run pip install -U bitsandbytes and ensure CUDA matches your PyTorch build.
- device_map="auto" errors on CPU-only machines. Install accelerate even on CPU; without it the device-mapping logic short-circuits to errors.
- CUDA out of memory. Try quantisation (4-bit), gradient checkpointing, smaller batch, shorter sequences, or device_map="auto" with offloading. low_cpu_mem_usage=True helps with host RAM while loading, not with GPU memory.
- Generated text is gibberish for instruct models. You probably skipped apply_chat_template. Decoder-only chat models trained with template tokens produce nonsense if you skip them.
- pad_token error during batched generation. Set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id before the call, and use padding_side="left" for decoder-only models.
- Whisper transcriptions repeat indefinitely. In the openai-whisper package the usual fix is condition_on_previous_text=False; in transformers the corresponding generate argument is condition_on_prev_tokens, so make sure it is not enabled, and chunk inputs with chunk_length_s=30.
- trust_remote_code warning. Either accept and pin a revision SHA, or wait for the architecture to be upstreamed.
- Tokenizer mismatch between training and inference. Always save the tokenizer alongside the model (tokenizer.save_pretrained(dir)); loading the base checkpoint's tokenizer after fine-tuning can drop or renumber special tokens added during training.
When NOT to use this
- Production LLM inference at scale. Use vLLM or TGI. They add continuous batching, paged attention, and proper request scheduling. model.generate() processes one fixed batch per call, so a naive server handles requests one at a time and leaves the GPU idle.
- Edge / mobile deployment. ONNX Runtime, MLC LLM, llama.cpp, and Core ML are the right tools. transformers isn't built for that runtime envelope.
- Pure embedding inference at high QPS. sentence-transformers is purpose-built; fastembed (ONNX) is leaner.
- Hosted-API parity. If you only need a remote model, the provider SDK (openai, anthropic) is one HTTP call away, with no PyTorch needed.
- Tiny ML. For sub-100 MB classifiers, scikit-learn + transformers.AutoTokenizer for features is often cleaner than the full pipeline.
Production deployment
transformers is a library; production deployment means picking the right server pattern around it. For most teams the right answer is "use vLLM" — but plenty of workloads run transformers directly in production.
Single-worker GPU server. One AutoModelForCausalLM instance held by one FastAPI process per GPU. Simple, debuggable, and adequate for low-QPS internal services. Use a process supervisor (gunicorn / supervisord) — multiple workers per GPU just thrash VRAM.
CPU encoder server. BERT-class encoders served via optimum[onnxruntime] or optimum-intel (OpenVINO) on Xeon CPUs are the canonical "embedding micro-service" shape, and are often far cheaper than a GPU box for the same QPS.
Container shape. Build the image with the model pre-downloaded (saves cold-start time and reduces Hub dependency at runtime). Use multi-stage builds — the base CUDA image is large; copy only the runtime layer.
Model versioning. from_pretrained("name", revision="abc123") pins the exact Hub commit. Track the SHA in requirements.txt (as a comment) so rollback is auditable.
Inference batching. Either implement server-side micro-batching (collect requests for a few tens of milliseconds, then run one generate call) or hand the workload to vLLM, which does this automatically and far better.
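The micro-batching idea can be sketched with asyncio: requests queue up, a worker waits a short window after the first arrival, then runs everything collected as one batch. The class below is a minimal illustration, not a production scheduler. MicroBatcher, run_batch, and the window and batch-size defaults are all hypothetical names and values, and run_batch stands in for a function that tokenizes the prompts and calls generate once.

```python
import asyncio
from typing import Callable, List


class MicroBatcher:
    """Collects prompts for up to window_s seconds, then runs them as one batch."""

    def __init__(self, run_batch: Callable[[List[str]], List[str]],
                 window_s: float = 0.02, max_batch: int = 8) -> None:
        self.run_batch = run_batch  # blocking function: prompts -> outputs
        self.window_s = window_s
        self.max_batch = max_batch
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        """Called once per request; resolves when that request's batch finishes."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def worker(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]  # wait for the first request
            deadline = loop.time() + self.window_s
            while len(items) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            prompts = [prompt for prompt, _ in items]
            try:
                # Run the blocking model call in a thread so the event loop stays free.
                outputs = await loop.run_in_executor(None, self.run_batch, prompts)
            except Exception as exc:
                for _, future in items:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), output in zip(items, outputs):
                if not future.done():
                    future.set_result(output)


async def demo():
    calls = []

    def fake_generate(prompts):  # stand-in for tokenizer + model.generate
        calls.append(list(prompts))
        return [p.upper() for p in prompts]

    batcher = MicroBatcher(fake_generate, window_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"]))
    worker.cancel()
    return results, calls
```

The batcher is created inside a running event loop (as demo does) so the queue binds to the right loop on older Python versions. A longer window raises throughput at the cost of added latency for the first request in each batch.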
GPU utilisation telemetry. Export GPU metrics (the same counters nvidia-smi reports) through dcgm-exporter to Prometheus. If GPU utilisation sits below 30%, you're paying for idle silicon: batch harder, or stop using a GPU.
Health checks. /health should run a trivial 1-token generate — this catches NaN-loaded models that a process-level check wouldn't.
Cost & rate-limit management
Self-hosted transformers swaps API-side rate limits for GPU-side rate limits. The economics change, but the bookkeeping doesn't disappear.
- GPU-hour budgets. Track cost per request per model: dollars_per_GPU_hour / (sustained_requests_per_second × 3600). For an embedding model on an A10, that's pennies per million calls; for an 8B chat model on an H100, it's measurable.
- Choose the smallest model that meets quality. A 1.5B model that hits your accuracy bar beats a 70B model on TCO by 20-50×.
- Quantise. 4-bit NF4 with BitsAndBytesConfig cuts VRAM ~4× and often serves 2× more concurrent requests on the same GPU.
- Batch. Server-side batching is where most production savings come from. vLLM ships with continuous batching; if you're sticking with transformers, group requests within 20-50ms windows.
- Shut down cold GPUs, and consider spot/preemptible instances, which are typically 60-90% cheaper. For workloads tolerant of restart latency, that's the easy win.
- Embedding caching. For embedding workloads, cache by content hash; many corpora have 10-30% duplicate documents.
- Local quotas. A single misbehaving caller can saturate your GPU. Rate-limit per tenant at the application layer.
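A GPU-hour budget becomes easier to reason about once it is expressed per request. One way to do the arithmetic, assuming a single GPU running at a steady request rate, is sketched below; the price and throughput in the example are made-up numbers.

```python
def cost_per_million_requests(dollars_per_gpu_hour: float, sustained_rps: float) -> float:
    """Dollar cost of one million requests on one GPU at a steady request rate."""
    if sustained_rps <= 0:
        raise ValueError("sustained_rps must be positive")
    requests_per_hour = sustained_rps * 3600
    return dollars_per_gpu_hour / requests_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical figures: a $3.60/hour GPU sustaining 100 requests per second.
    print(f"${cost_per_million_requests(3.60, 100):.2f} per million requests")
```

Use measured sustained throughput, not peak throughput; an idle GPU costs the same per hour, so low utilisation raises the per-request figure directly.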
Multi-provider patterns
transformers itself isn't multi-provider — but it sits next to provider SDKs in many production architectures. The common patterns:
- Routing cheap workloads to local, expensive workloads to hosted. A route(query) -> provider function based on query complexity, latency budget, or model strengths. Local embedding + hosted chat is the classic split.
- Fine-tune local, serve from transformers, fall back to hosted on rate-limit. For high-availability services, treat the local model as primary and a hosted API as failover.
- vLLM as the OpenAI-compatible front door. vLLM serves transformers-format models over an OpenAI-compatible HTTP API. Any client that speaks OpenAI's wire format (LangChain, LiteLLM, openai-python) talks to it unchanged.
- LiteLLM proxy in front of mixed local + hosted. Same routing logic, centralised: one base URL for your application, multiple providers behind it.
- Tokenizer parity. Cross-provider routing needs tokenizer compatibility for token-budget calculations. tiktoken handles OpenAI; transformers.AutoTokenizer handles everything else.
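The routing and failover patterns above reduce to two small functions. The sketch below uses prompt length as a stand-in for query complexity and a custom RateLimited exception as the failover trigger; both are assumptions of this example, and the provider callables are placeholders for a local model and a hosted SDK client.

```python
from typing import Callable


class RateLimited(Exception):
    """Raised by a provider wrapper when it cannot accept more work right now."""


def route(query: str, max_local_chars: int = 500) -> str:
    """Pick a provider name; short queries stay local, long ones go to the hosted API."""
    return "local" if len(query) <= max_local_chars else "hosted"


def complete_with_fallback(prompt: str,
                           primary: Callable[[str], str],
                           fallback: Callable[[str], str]) -> str:
    """Try the primary provider; on RateLimited, retry once on the fallback."""
    try:
        return primary(prompt)
    except RateLimited:
        return fallback(prompt)
```

Only the rate-limit signal triggers failover here; other exceptions propagate so that real bugs are not masked by a silent switch to the backup provider.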
See also
- AI: transformers — pipelines, generation, fine-tuning
- Packages: pip-sentence-transformers — embedding-focused sibling
- Concept: api — client-library design