cheat sheet

transformers

Package-level reference for the Hugging Face transformers library on PyPI — install extras, backend choice, versioning, and alternatives.

What it is

transformers is Hugging Face's flagship Python library for loading and running pre-trained neural networks — language models, vision models, speech models, and multimodal models. It provides a unified AutoModel / AutoTokenizer / pipeline API on top of PyTorch, TensorFlow, JAX/Flax, and (increasingly) ONNX Runtime backends.

The library is tightly coupled to the Hugging Face Hub: from_pretrained("model-id") downloads weights, tokenizer files, and config from a Hub repo. The Hub now hosts well over a million model checkpoints.

Install

bash
pip install transformers

Output: installs the library but no ML backend — you still need PyTorch, TF, or JAX separately

bash
pip install "transformers[torch]"

Output: installs transformers plus a compatible PyTorch wheel

bash
pip install "transformers[torch]" accelerate

Output: the standard "modern LLM" combo — adds device_map="auto" and multi-GPU support

bash
uv add transformers torch accelerate

Output: dependencies resolved + added to pyproject.toml

bash
poetry add transformers torch

Output: updated lockfile + virtualenv install

Versioning & Python support

  • Current stable is the 4.x series (and has been for the entire LLM era). Major bumps are rare; minor releases (4.45, 4.46, …) ship roughly monthly and frequently add new model architectures.
  • Python 3.9+ on current releases; 3.10+ recommended.
  • Loose semver — minor releases may add new APIs and deprecate old ones, but rarely break existing model loading. Patch releases are pure bug fixes.
  • The library lags slightly behind the latest models on the Hub for architecture support (a brand-new model often needs trust_remote_code=True until its class is upstreamed).
  • Pinning matters: transformers==4.X matched with torch>=Y is the typical compat matrix; the Hugging Face release notes call out the floors.
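The pinning advice above can be checked at start-up. The following is a minimal sketch using only the standard library; the floor values in FLOORS are hypothetical placeholders (take real ones from the release notes), and the helper names are illustrative rather than part of any library.

```python
from importlib import metadata

# Hypothetical floors for illustration only; copy real values from the release notes.
FLOORS = {"transformers": "4.45", "torch": "2.1"}


def version_tuple(version: str) -> tuple:
    """Turn '4.46.1' or '2.5.1+cu121' into a comparable tuple of ints."""
    public = version.split("+")[0]  # drop local build tags such as +cu121
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1' or 'dev0'
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_floor(installed: str, floor: str) -> bool:
    """Compare versions after zero-padding, so '4.45' equals '4.45.0'."""
    a, b = version_tuple(installed), version_tuple(floor)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(floors=FLOORS) -> dict:
    """Map each package to True/False (floor met) or None (not installed)."""
    report = {}
    for name, floor in floors.items():
        try:
            report[name] = meets_floor(metadata.version(name), floor)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Calling check_environment() before loading a model turns a vague runtime failure into an explicit report of which package is missing or too old.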

Package metadata

  • Maintainer: Hugging Face (the huggingface GitHub org)
  • Project home: github.com/huggingface/transformers
  • Docs: huggingface.co/docs/transformers
  • PyPI: pypi.org/project/transformers
  • License: Apache-2.0
  • Governance: commercial company + huge open-source contributor base
  • First released: 2018 (originally pytorch-pretrained-bert)
  • Downloads: tens of millions per month

Optional dependencies & extras

transformers defines many [extra] groups. The most relevant:

Extra                         Pulls in
transformers[torch]           PyTorch wheel matched to the library's tested floor
transformers[tf]              TensorFlow 2.x
transformers[flax]            JAX + Flax
transformers[sentencepiece]   The sentencepiece tokenizer (required for LLaMA, T5, Mistral, etc.)
transformers[tokenizers]      The Rust-backed tokenizers library (pulled in by default in most paths)
transformers[onnxruntime]     ONNX export + inference with onnxruntime
transformers[serving]         Adds FastAPI for transformers serve
transformers[vision]          Pillow + image-processing deps for vision models
transformers[audio]           librosa, soundfile for speech models
transformers[all]             Everything above (very large install)

Companion packages, most of them from the same org:

  • accelerate — multi-GPU / mixed-precision / device_map="auto"
  • peft — parameter-efficient fine-tuning (LoRA, QLoRA, adapters)
  • bitsandbytes — 8-bit and 4-bit quantisation
  • datasets — Hugging Face dataset loading and streaming
  • safetensors — fast, memory-safe checkpoint format (now default)
  • huggingface-hub — Hub client, used by from_pretrained

Alternatives

Package                           Trade-off
vllm                              Production-grade LLM inference server with far higher throughput than raw model.generate(). Use when serving at scale.
text-generation-inference (TGI)   Hugging Face's own production serving stack.
onnxruntime                       Run exported ONNX models with no Python ML framework. Smaller deploy footprint.
tensorflow-hub                    TF-native pre-trained model hub. Mostly superseded by Hugging Face Hub today.
mlx (Apple Silicon)               Native Apple Silicon inference; some HF models mirrored. Use for Mac-local LLMs.
llama-cpp-python                  GGUF-quantised CPU/GPU inference. Use when you need llama.cpp's quant formats.

Common gotchas

  1. Model card vs Hub repo vs Inference API are different things. The model "card" is the README; the Hub repo holds weights; the Inference API is a separate hosted service. from_pretrained only touches the Hub repo.
  2. trust_remote_code=True is code execution. Many cutting-edge models ship custom modeling code that loads via this flag — it runs arbitrary Python from the Hub. Only enable for repos you trust, ideally pinned to a revision SHA.
  3. device_map="auto" needs accelerate. Passing it without accelerate installed raises an error; dropping the argument to work around that loads the model on CPU, and you wonder why inference is glacial.
  4. FlashAttention is opt-in. Pass attn_implementation="flash_attention_2" to from_pretrained and install flash-attn separately — it's not in any extra.
  5. Tokenizer mismatch with sentencepiece-based models. Loading LLaMA / Mistral / T5 without transformers[sentencepiece] raises a cryptic ImportError. Install the extra.
  6. Big models OOM silently on Windows. device_map="auto" will happily spill to disk via accelerate on Linux, but pagefile semantics on Windows hit hard. Use Linux / WSL for >7B models.
  7. pipeline() is slow per-call. It re-runs framework overhead each invocation. Build the model + tokenizer manually and batch inputs for throughput.
  8. Hub auth required for gated models. LLaMA, Gemma, and others need huggingface-cli login and an approved license click on the website before from_pretrained works.
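Several of the gotchas above come down to an optional package not being installed. A small pre-flight check, sketched here with the standard library only, can report them before a model is loaded. The feature-to-module mapping is an assumption drawn from the list above (flash_attn is the import name of the flash-attn package), and missing_features is a hypothetical helper name.

```python
from importlib.util import find_spec

# Feature -> top-level import name it depends on (assumed from the gotchas above).
REQUIREMENTS = {
    'device_map="auto"': "accelerate",
    "sentencepiece-based tokenizers": "sentencepiece",
    'attn_implementation="flash_attention_2"': "flash_attn",
}


def missing_features(requirements=REQUIREMENTS) -> dict:
    """Return {feature: import_name} for every dependency that cannot be imported."""
    return {
        feature: module
        for feature, module in requirements.items()
        if find_spec(module) is None  # None means the top-level module is not installed
    }


if __name__ == "__main__":
    for feature, module in missing_features().items():
        print(f"{feature} will not work: install the package providing '{module}'")
```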

Ecosystem integrations

transformers is the trunk of a wider Hugging Face ecosystem. The companion packages overlap less than their names suggest — each owns a slice of the lifecycle.

PackageWhat it owns
accelerateMulti-GPU placement, mixed-precision, device_map="auto", distributed launch (accelerate launch …).
peftParameter-efficient fine-tuning: LoRA, QLoRA, prefix tuning, IA3, adapters.
bitsandbytes8-bit and 4-bit quantisation kernels for CUDA. Powers BitsAndBytesConfig.
optimumHardware-specific backends: ONNX Runtime, OpenVINO, TensorRT, Intel Habana, AWS Neuron.
safetensorsMemory-mapped, pickle-free checkpoint format. Default for new uploads.
datasetsStreaming / mapped tabular dataset library. The standard pre-Trainer data layer.
tokenizersRust-backed fast tokenizers — AutoTokenizer instantiates these by default.
huggingface-hubHub client. from_pretrained calls it transitively; CLI tools (huggingface-cli) come from here.
evaluateStandard metrics (accuracy, f1, rouge, bleu, bertscore). Compatible with Trainer.compute_metrics.
trlReinforcement learning from human feedback — SFTTrainer, DPOTrainer, PPOTrainer.
transformers.jsBrowser-side inference (different package; same model files).

Inference-server siblings (not strict integrations but the same model files):

  • vllm — high-throughput LLM serving with continuous batching and paged attention.
  • text-generation-inference (TGI) — Hugging Face's own production server.
  • llama-cpp-python — GGUF-quantised inference on CPU + small GPUs.

Most production setups mix these — fine-tune with transformers/peft, serve with vLLM, observe with langsmith or OpenTelemetry.

Real-world recipes

Recipe: chat-template inference server

A FastAPI service that wraps a chat model with proper chat templating and token streaming. Each request runs its own generate call; the snippet does not batch concurrent requests.

python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import torch

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

tok   = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",   # safe default; flash_attention_2 if installed
)
tok.pad_token = tok.eos_token

app = FastAPI()

@app.post("/chat")
def chat(payload: dict):
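    prompt = tok.apply_chat_template(payload["messages"], tokenize=False, add_generation_prompt=True)
    # The rendered template already contains the BOS token, so don't let the tokenizer add a second one.
    inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)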
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate, kwargs=dict(
        **inputs, streamer=streamer, max_new_tokens=512, do_sample=True, temperature=0.7,
    )).start()
    return StreamingResponse((t for t in streamer), media_type="text/plain")

Output: clients receive tokens incrementally; one worker per GPU is the right shape.

Recipe: LoRA fine-tune + merge for serving

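Train a small adapter with PEFT, merge it back into the base, then export to a serving format. The snippet assumes the adapter has already been trained and saved to ./lora-out/final, and covers only the merge and export steps: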

python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct",
                                            torch_dtype=torch.bfloat16, device_map="auto")
adapter = PeftModel.from_pretrained(base, "./lora-out/final")
merged  = adapter.merge_and_unload()
merged.save_pretrained("./llama3-merged", safe_serialization=True)
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./llama3-merged")

Output: ./llama3-merged is a complete fine-tuned checkpoint ready for vLLM / TGI / any HF-compatible loader.

Recipe: streaming embedding pipeline over a parquet corpus

python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
import torch, torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5").to("cuda").eval()

@torch.no_grad()
def embed(batch):
    enc = tok(batch["text"], padding=True, truncation=True, return_tensors="pt", max_length=512).to("cuda")
    h = model(**enc).last_hidden_state[:, 0]  # CLS pooling for BGE
    h = F.normalize(h, p=2, dim=1)
    return {"embedding": h.cpu().tolist()}

ds = load_dataset("parquet", data_files="docs-*.parquet", streaming=True)["train"]
out = ds.map(embed, batched=True, batch_size=64)
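# A streaming IterableDataset is lazy and may not offer to_parquet (this depends on the
# datasets version), so write the results incrementally with pyarrow instead.
import pyarrow as pa, pyarrow.parquet as pq

writer = None
for rows in out.iter(batch_size=1024):  # rows: dict of column name -> list of values
    table = pa.Table.from_pydict(rows)
    if writer is None:
        writer = pq.ParquetWriter("docs-embedded.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()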

Output: memory-bounded embedding pipeline that scales to corpora larger than RAM.

Recipe: zero-shot classification ladder

Triage incoming support tickets cheaply with a zero-shot model, escalate to an LLM only if confidence is low.

python
from transformers import pipeline
zsc = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-base-zeroshot-v2.0", device=0)

CATEGORIES = ["billing", "bug report", "feature request", "account access", "other"]

def route(ticket: str):
    res = zsc(ticket, candidate_labels=CATEGORIES, multi_label=False)
    label, score = res["labels"][0], res["scores"][0]
    return label if score >= 0.75 else "escalate-to-llm"

Output: far cheaper than running every ticket through an LLM. Zero-shot scores are not guaranteed to be calibrated, so treat the 0.75 threshold as a starting point and tune it on labelled tickets.

Recipe: offline-first deployment

Pre-bake the model into the container; lock down Hub access at runtime:

dockerfile
FROM python:3.12-slim
RUN pip install --no-cache-dir transformers torch accelerate
RUN python -c "from transformers import AutoModel, AutoTokenizer; \
    AutoModel.from_pretrained('intfloat/e5-small-v2'); \
    AutoTokenizer.from_pretrained('intfloat/e5-small-v2')"
ENV HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1
COPY . /app
CMD ["python", "/app/serve.py"]

Output: model weights baked into the image; runtime has no Hub dependency.

Performance tuning

The default model.generate(...) call leaves a lot of throughput on the table. The biggest wins, in rough order of impact:

  • Batch on the server, not the client. For inference servers, batch concurrent requests inside one generate call when possible. Even mismatched-length prompts benefit from padded batching.
  • Attention implementation. attn_implementation="sdpa" (PyTorch's scaled-dot-product attention) is a safe default on modern PyTorch. flash_attention_2 is faster and lower-memory but needs pip install flash-attn and supported GPUs (Ampere+).
  • KV cache is on by default for generate. Don't disable it; check that model.config.use_cache is True.
  • torch.compile for stable shapes. model = torch.compile(model, mode="reduce-overhead") improves throughput materially on Ampere+ GPUs once the graph stabilises. Compile time is heavy — only worth it for long-running servers.
  • Quantisation. 4-bit NF4 (BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")) cuts weight memory ~4× versus 16-bit, usually with only a small quality loss (measure it on your own evals). Pre-quantised GPTQ / AWQ checkpoints load faster at inference time.
  • Mixed precision. torch_dtype=torch.bfloat16 on Ampere+; float16 on older cards. Save 2× memory vs float32.
  • Gradient checkpointing for training. Trade recompute for memory: training_args.gradient_checkpointing = True lets larger batches/sequences fit.
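  • Use optim="adamw_bnb_8bit" for training. 8-bit optimiser states take roughly a quarter of the memory of 32-bit AdamW states; how much total VRAM that saves depends on how large those states are relative to weights and activations.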
  • DataLoader workers. For training, num_workers=4 (or higher) + pin_memory=True removes CPU bottlenecks.
  • TORCH_LOGS=recompiles while developing — catches silent recompilation pitfalls under torch.compile.
  • Multi-GPU. device_map="auto" splits the model's layers across GPUs and runs them sequentially, so only one GPU is busy at a time. Use accelerate launch --multi_gpu + Trainer for proper data-parallel training; FSDP or DeepSpeed ZeRO for sharded training when the model does not fit on one GPU.
  • CPU inference. optimum-intel + OpenVINO accelerates BERT-class encoders by 3-5× on Xeon CPUs. ONNX Runtime is a portable alternative.

For production inference at scale, hand off to vLLM or TGI. model.generate() was never optimised for throughput; those servers add continuous batching, paged attention, and tensor parallelism that you'd otherwise reimplement.
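The memory claims for mixed precision and quantisation follow from simple arithmetic on bits per parameter. The sketch below is a back-of-envelope estimate of weight memory only; it deliberately ignores the KV cache, activations, and the small per-block overhead that quantisation formats add, and the function name is illustrative.

```python
# Bits used to store one parameter in each format.
BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "float16": 16, "int8": 8, "nf4": 4}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * BITS_PER_PARAM[dtype] / 8 / 2**30


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16", "nf4"):
        print(f"8B params in {dtype}: {weight_memory_gib(8e9, dtype):.1f} GiB")
    # 8B params in float32: 29.8 GiB
    # 8B params in bfloat16: 14.9 GiB
    # 8B params in nf4: 3.7 GiB
```

The 2× saving of bfloat16 over float32 and the ~4× saving of 4-bit over 16-bit both fall out directly; real usage will be somewhat higher than these figures.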

Version migration guide

The 4.x line has lasted the entire modern LLM era. Within 4.x, churn comes from monthly minor releases that frequently add models and occasionally tighten APIs.

Roughly   What tends to break
<4.30     Pre-safetensors default. Many tutorials still call .bin loading explicitly.
~4.35     attn_implementation arg standardised. Older code passes attention via separate flags.
~4.40     device_map="auto" paths assume accelerate is installed (it isn't always).
~4.45     Chat template defaults tightened; some older instruct models require apply_chat_template explicitly.
Latest    Steady drumbeat of new model architectures; trust_remote_code=True is the bridge until the class is upstreamed.

Migration discipline:

  1. Pin both library and torch. Record the exact transformers==4.X and torch==Y.Z pair you tested; the release notes state the minimum supported torch version, not a single blessed pair.
  2. Re-test fine-tunes after upgrades. New defaults (loss-fns, gradient clipping, LR scheduler) occasionally affect downstream metrics.
  3. from_pretrained(...) arg deprecations are noisy but generally non-breaking for a release or two. Address before they go silent.
  4. Tokenizer special-token handling has tightened over time. Pin the tokenizer JSON when reproducibility matters.
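One way to address deprecations before they become breakages is to make them fail loudly in CI. This is a minimal sketch that assumes the deprecation is raised through Python's warnings module as a FutureWarning; deprecations that are only written to the log will not be caught this way. The context-manager name is hypothetical.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def future_warnings_as_errors():
    """Inside this block, any FutureWarning is raised as an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        yield


def legacy_call():
    """Stand-in for a library call that uses a deprecated argument."""
    warnings.warn("old_arg is deprecated and will be removed", FutureWarning)
    return "ok"
```

Wrapping model-loading code in a test with future_warnings_as_errors() surfaces deprecated arguments at upgrade time rather than a release or two later.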

Hedge: when in doubt, the Hugging Face Transformers Release Notes on GitHub list every behavioral change per release — search there rather than guessing.

Security considerations

transformers runs arbitrary model weights, occasionally arbitrary model code, and pulls files from the public Hub. The threat surface deserves attention.

  • trust_remote_code=True is code execution. Many cutting-edge models ship custom modeling code that loads via this flag. Audit the repo and pin a revision SHA (revision="abc123") before enabling. Treat unknown repos like unknown PyPI packages.
  • Pickle in .bin checkpoints. Pre-safetensors .bin files are Python pickles, so loading one can run arbitrary code. Prefer safetensors; pass use_safetensors=True to from_pretrained so that loading fails instead of falling back to a pickle checkpoint.
  • Hub supply chain. Anyone can publish to the Hub. Pin revision=<commit-sha> for production use, not branch names.
  • Tokenizer files. tokenizer.json is data, not code, but malicious vocab can produce token IDs that confuse downstream code. Defence-in-depth: validate vocab size against model.config.vocab_size.
  • Gated model licenses. LLaMA, Gemma, some Mistral releases, and others ship under their own licenses, with obligations beyond Apache-2.0. Track which models touch your build pipeline.
  • PII in prompts and outputs. Local inference avoids sending data to a third party — but logs, traces, and dataset caches will still capture it. Apply the same redaction policy you'd use for hosted APIs.
  • Adversarial prompts. Local models don't have the abuse mitigations a hosted service does. If users supply prompts directly, add a content classifier upstream.
  • Network egress at load time. from_pretrained hits the Hub. Use TRANSFORMERS_OFFLINE=1 + a pre-baked cache in air-gapped environments.
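The revision-pinning advice above can be enforced in code. The sketch below builds the keyword arguments for from_pretrained and refuses anything that is not a full 40-character commit SHA. That strictness is a policy choice of this example, not a library requirement, and pinned_load_kwargs is a hypothetical helper; revision, trust_remote_code, and use_safetensors are existing from_pretrained arguments.

```python
import re

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_load_kwargs(revision: str, trust_remote_code: bool = False) -> dict:
    """Build from_pretrained kwargs, rejecting branch names, tags, and short hashes."""
    if not _FULL_SHA.fullmatch(revision):
        raise ValueError(f"revision must be a 40-character commit SHA, got {revision!r}")
    return {
        "revision": revision,                    # exact Hub commit, immune to later pushes
        "trust_remote_code": trust_remote_code,  # stays off unless explicitly requested
        "use_safetensors": True,                 # never fall back to pickle checkpoints
    }


# Intended use (requires transformers and network access, so not executed here):
#   AutoModel.from_pretrained("org/model", **pinned_load_kwargs(sha))
```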

Troubleshooting common errors

  • OSError: Can't load tokenizer for ... — usually a missing extra. pip install sentencepiece (LLaMA/T5/Mistral) or tiktoken (some OpenAI-derived tokenizers).
  • ImportError: This requires you to install the latest version of bitsandbytes. Run pip install -U bitsandbytes and ensure CUDA matches your PyTorch build.
  • device_map="auto" errors on CPU-only machines. Install accelerate even on CPU; without it the device-mapping logic short-circuits to errors.
  • CUDA out of memory. Try low_cpu_mem_usage=True, quantisation (4-bit), gradient checkpointing, smaller batch, shorter sequences, or device_map="auto" with offloading.
  • Generated text is gibberish for instruct models. You forgot apply_chat_template. Decoder-only chat models trained with template tokens produce nonsense if you skip them.
  • pad_token error during batched generation. Set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id before the call.
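  • Whisper transcriptions repeat indefinitely. Pass condition_on_prev_tokens=False to generate (the transformers counterpart of openai-whisper's condition_on_previous_text) and chunk inputs with chunk_length_s=30 in the pipeline.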
  • trust_remote_code warning. Either accept and pin a revision SHA, or wait for the architecture to be upstreamed.
  • Tokenizer mismatch between training and inference. Always save the tokenizer alongside the model (tokenizer.save_pretrained(dir)); loading the tokenizer from the base checkpoint after fine-tuning loses any special tokens or chat-template changes you added.

When NOT to use this

  • Production LLM inference at scale. Use vLLM or TGI. They add continuous batching, paged attention, and proper request scheduling. A bare model.generate() call processes one batch at a time with no scheduler, so the GPU idles between requests.
  • Edge / mobile deployment. ONNX Runtime, MLC LLM, llama.cpp, and Core ML are the right tools. transformers isn't built for that runtime envelope.
  • Pure embedding inference at high QPS. sentence-transformers is purpose-built; fastembed (ONNX) is leaner.
  • Hosted-API parity. If you only need a remote model, the provider SDK (openai, anthropic) is one HTTP call away — no PyTorch needed.
  • Tiny ML. For sub-100 MB classifiers, scikit-learn + transformers.AutoTokenizer for features is often cleaner than the full pipeline.

Production deployment

transformers is a library; production deployment means picking the right server pattern around it. For most teams the right answer is "use vLLM" — but plenty of workloads run transformers directly in production.

Single-worker GPU server. One AutoModelForCausalLM instance held by one FastAPI process per GPU. Simple, debuggable, and adequate for low-QPS internal services. Use a process supervisor (gunicorn / supervisord) — multiple workers per GPU just thrash VRAM.

CPU encoder server. BERT-class encoders served via optimum[onnxruntime] or optimum-intel (OpenVINO) on Xeon CPUs are the canonical "embedding micro-service" shape, and are often much cheaper than a GPU box for the same QPS.

Container shape. Build the image with the model pre-downloaded (saves cold-start time and reduces Hub dependency at runtime). Use multi-stage builds — the base CUDA image is large; copy only the runtime layer.

Model versioning. from_pretrained("name", revision="abc123") pins the exact Hub commit. Track the SHA in requirements.txt (as a comment) so rollback is auditable.

Inference batching. Either implement server-side micro-batching (collect requests for a short window, e.g. ~10 ms, then run one generate) or hand the workload to vLLM, which does this automatically and far better.
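The micro-batching option can be sketched as a small queue inside the server process. This is an illustrative, standard-library-only sketch, not a production scheduler: run_batch is a hypothetical callable (list of prompts in, list of outputs out) that would wrap a padded tokenizer call plus one model.generate, and the window and batch-size defaults are arbitrary.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Collect requests for a short window, then run them as one batch."""

    def __init__(self, run_batch, window_s: float = 0.01, max_batch: int = 16):
        self._run_batch = run_batch  # hypothetical: list[str] -> list[str]
        self._window_s = window_s
        self._max_batch = max_batch
        self._queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, prompt: str) -> Future:
        """Enqueue one prompt; the returned Future resolves to its output."""
        future = Future()
        self._queue.put((prompt, future))
        return future

    def _loop(self):
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            deadline = time.monotonic() + self._window_s
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                outputs = self._run_batch([prompt for prompt, _ in batch])
                if len(outputs) != len(batch):
                    raise ValueError("run_batch must return one output per prompt")
                for (_, future), output in zip(batch, outputs):
                    future.set_result(output)
            except Exception as exc:  # propagate the failure to every waiting caller
                for _, future in batch:
                    future.set_exception(exc)
```

A request handler would call batcher.submit(prompt).result() and block until its batch finishes; a longer window raises GPU utilisation at the cost of added latency.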

GPU utilisation telemetry. Export GPU metrics to Prometheus with dcgm-exporter (nvidia-smi is fine for spot checks). If GPU utilisation sits below 30%, you're paying for idle silicon: batch harder, or stop using a GPU.

Health checks. /health should run a trivial 1-token generate — this catches NaN-loaded models that a process-level check wouldn't.

Cost & rate-limit management

Self-hosted transformers swaps API-side rate limits for GPU-side rate limits. The economics change, but the bookkeeping doesn't disappear.

  • GPU-hour budgets. Track cost per request for each model: dollars_per_GPU_hour / (3600 × sustained_QPS_per_GPU). For an embedding model on an A10, that's pennies per million calls; for an 8B chat model on an H100, it's measurable.
  • Choose the smallest model that meets quality. A 1.5B model that hits your accuracy bar beats a 70B model on TCO by 20-50×.
  • Quantise. 4-bit NF4 with BitsAndBytesConfig cuts VRAM ~4× and often serves 2× more concurrent requests on the same GPU.
  • Batch. Server-side batching is where most production savings come from. vLLM ships with continuous batching; if you're sticking with transformers, group requests within 20-50ms windows.
  • Shut down cold GPUs and use spot capacity. Scale idle GPUs to zero; for workloads tolerant of restart latency, spot/preemptible instances (often 60-90% cheaper) are the easy win.
  • Embedding caching. For embedding workloads, cache by content hash — many corpora have 10-30% duplicate documents.
  • Local quotas. A single misbehaving caller can saturate your GPU. Rate-limit per tenant at the application layer.
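Two of the points above lend themselves to a short sketch: deriving a per-request cost from the GPU price and sustained throughput, and caching embeddings by content hash. The names here are illustrative, embed_fn is a hypothetical callable (text in, vector out), and the in-memory dict stands in for whatever store you would use in practice.

```python
import hashlib


def cost_per_million_requests(dollars_per_gpu_hour: float, sustained_qps: float) -> float:
    """Dollars per 1M requests for one GPU kept busy at sustained_qps."""
    requests_per_hour = sustained_qps * 3600
    return dollars_per_gpu_hour / requests_per_hour * 1_000_000


class EmbeddingCache:
    """Cache embeddings by content hash so duplicate documents are embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # hypothetical: str -> list[float]
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

With made-up numbers, a GPU at $1.00 per hour sustaining 500 requests per second works out to roughly $0.56 per million requests; the hit and miss counters show how much of a corpus the cache actually saves.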

Multi-provider patterns

transformers itself isn't multi-provider — but it sits next to provider SDKs in many production architectures. The common patterns:

  • Routing cheap workloads to local, expensive workloads to hosted. A route(query) -> provider function based on query complexity, latency budget, or model strengths. Local embedding + hosted chat is the classic split.
  • Fine-tune local, serve from transformers, fall back to hosted on rate-limit. For high-availability services, treat the local model as primary and a hosted API as failover.
  • vLLM as the OpenAI-compatible front door. vLLM serves transformers-format models over an OpenAI-compatible HTTP API. Any client that speaks OpenAI's wire format (LangChain, LiteLLM, openai-python) talks to it unchanged.
  • LiteLLM proxy in front of mixed local + hosted. Same routing logic, centralised — one base URL for your application, multiple providers behind it.
  • Tokenizer parity. Cross-provider routing needs tokenizer compatibility for token-budget calculations. tiktoken handles OpenAI; transformers.AutoTokenizer handles everything else.
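The first two patterns above (route by workload, fail over on rate-limit) can be combined in a few lines. This is a sketch under stated assumptions: local and hosted are hypothetical callables that take a query and return text, is_complex is whatever heuristic you choose, and RateLimited is an illustrative exception type rather than one raised by any real SDK.

```python
class RateLimited(Exception):
    """Raised by a provider that cannot take the request right now (illustrative)."""


def make_router(local, hosted, is_complex):
    """Return route(query): complex queries go hosted, the rest go local with hosted failover."""

    def route(query: str) -> str:
        if is_complex(query):
            return hosted(query)
        try:
            return local(query)
        except RateLimited:
            # The local model is saturated: fall back to the hosted API.
            return hosted(query)

    return route
```

In a real service the local callable would wrap a transformers or vLLM endpoint and raise RateLimited when its queue is full; keeping the routing rule in one function makes it easy to test and to move behind a proxy later.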

See also