cheat sheet

transformers

Load and run pre-trained models for NLP, vision, and audio with the Hugging Face Transformers library. Covers pipelines, AutoModel, tokenisation, generation, fine-tuning, and device placement.

updated 04-27-2026

transformers — Hugging Face

What it is

Hugging Face Transformers is the standard Python library for loading, running, and fine-tuning pre-trained neural networks: language models, image classifiers, speech recognisers, and more. It provides a unified API (pipeline, AutoModel, AutoTokenizer). Recent releases centre on PyTorch; TensorFlow and JAX (Flax) support depends on the version you install, so check the release notes if you rely on those backends. The Hugging Face Hub hosts over 900 000 public model checkpoints that can be downloaded with a single function call.

Install

bash

pip install transformers
pip install "transformers[torch]" # with PyTorch (most common); quotes stop shells such as zsh expanding the brackets
pip install accelerate            # needed for device_map="auto" and multi-GPU

Output: pip's install log (exits 0 on success)

Quick example

python

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I absolutely loved this product — exceeded every expectation.")
print(result)

Output:

text

[{'label': 'POSITIVE', 'score': 0.9998}]

When / why to use it

Running any of the 900k+ models on Hugging Face Hub without writing model code.
Tasks: text classification, summarisation, translation, question answering, image classification, ASR, text-to-image, and more.
Fine-tuning a pre-trained checkpoint on your own dataset with Trainer.
Loading a model locally for offline or privacy-constrained inference.
Integrating models into LangChain, LlamaIndex, or custom pipelines.

Common pitfalls

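pad_token not set for decoder-only models — GPT-style models have no pad token by default. When batching inputs, set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id before encoding. For batched generation, also set tokenizer.padding_side = "left" so that new tokens continue from real text rather than from padding.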

return_tensors="pt" must match your backend — if you're using PyTorch, pass return_tensors="pt" to the tokenizer. Omitting it returns Python lists, which most model forward() calls reject.

Model card licences — many models (LLaMA, Gemma, Mistral) require accepting a licence on Hugging Face before downloading. Authenticate with huggingface-cli login and accept the licence on the model's Hub page first.

Large model downloads — a 7B-parameter model in float16 is ~14 GB. Use low_cpu_mem_usage=True and device_map="auto" (requires accelerate) to load directly to GPU without duplicating the model in CPU RAM.

Prefer safetensors format checkpoints when available — they load faster and are safer than .bin (pickle) files. Most recent Hub checkpoints include model.safetensors.
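
The ~14 GB figure in the large-download pitfall is parameter count multiplied by bytes per parameter. A minimal sketch of that arithmetic follows. It assumes weights only (activations, KV cache, and framework overhead are ignored), and the helper name weight_memory_gb is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9

# Same 7B-parameter model stored at different precisions
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"7B @ {label:8s} ~ {weight_memory_gb(7e9, bits):.1f} GB")
```

Halving the bits per parameter halves the download and the memory needed for weights, which is why dtype and quantisation choices matter before anything else.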

Richer example — text summarisation pipeline

python

from transformers import pipeline

summariser = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0,       # GPU 0; use -1 or omit for CPU
)

article = """
Scientists at the Global Climate Institute have published findings showing
that ocean temperatures in the northern Atlantic rose by an average of
1.4 degrees Celsius over the past decade, the largest recorded increase
in over a century of measurements. Researchers attribute the change to
a combination of greenhouse gas accumulation and shifting ocean currents.
The study calls for immediate international policy responses.
"""

result = summariser(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])

Output:

text

Ocean temperatures in the northern Atlantic rose by 1.4°C over the past decade.
Scientists call for immediate international policy responses to the findings.

pipeline() — zero-code inference

pipeline is the fastest path from model name to prediction. It handles tokenisation, model loading, device placement, and post-processing automatically. Specify device=0 for CUDA GPU, device="mps" for Apple Silicon, or device=-1 / omit for CPU.

python

from transformers import pipeline

# Named-entity recognition
ner = pipeline("ner", aggregation_strategy="simple")   # replaces the deprecated grouped_entities=True
print(ner("Alice Dev works at Acme Corp in London."))

Output:

text

[{'entity_group': 'PER', 'score': 0.998, 'word': 'Alice Dev', 'start': 0, 'end': 9},
 {'entity_group': 'ORG', 'score': 0.994, 'word': 'Acme Corp', 'start': 19, 'end': 28},
 {'entity_group': 'LOC', 'score': 0.999, 'word': 'London', 'start': 32, 'end': 38}]

python

# Zero-shot classification — no task-specific training needed
zsc = pipeline("zero-shot-classification")
result = zsc(
    "This quarter we shipped 14 new features and fixed 32 bugs.",
    candidate_labels=["product update", "financial report", "sports"],
)
print(result["labels"][0], f"{result['scores'][0]:.2%}")

Output:

text

product update 96.74%

python

# Automatic speech recognition
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
result = asr("audio.wav")
print(result["text"])

Output:

text

 The quick brown fox jumps over the lazy dog.

Common task strings: "text-classification", "token-classification", "question-answering", "summarization", "translation_en_to_fr", "text-generation", "fill-mask", "image-classification", "object-detection", "automatic-speech-recognition".

AutoTokenizer and AutoModel

AutoTokenizer and AutoModel* load the correct class for any checkpoint automatically, making code checkpoint-agnostic. The * in AutoModel* corresponds to task-specific heads: AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, etc.

python

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texts = ["I loved every moment.", "Worst experience of my life."]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
for text, prob in zip(texts, probs):
    pred = labels[prob.argmax().item()]
    print(f"{pred}: {prob.max():.2%}  — {text}")

Output:

text

POSITIVE: 99.98%  — I loved every moment.
NEGATIVE: 99.94%  — Worst experience of my life.

Device placement — CPU, CUDA, MPS

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# The loading variants below are alternatives; run only the one that fits your hardware.

# Single GPU — manual
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)

# Multi-GPU / CPU offload — requires accelerate
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",          # distributes layers across available devices
    low_cpu_mem_usage=True,     # never loads full model into CPU RAM
)

# Apple Silicon MPS
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="mps",
)

# 4-bit quantisation — requires bitsandbytes
from transformers import BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb)

generate() — text generation parameters

generate() is the method behind all decoder-based text generation. Key parameters control length, randomness, and search strategy.

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The future of renewable energy is"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy (deterministic)
out = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Output:

text

The future of renewable energy is bright. The world is moving toward a clean energy economy...

python

# Sampling — creative, varied
out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
)

# Beam search: tracks the num_beams highest-scoring candidates (more likely text, less diverse than sampling)
out = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

Parameter	Effect
`max_new_tokens`	Hard cap on tokens generated
`do_sample=True`	Enable sampling (off = greedy)
`temperature`	Randomness when sampling: <1 sharpens the distribution, >1 flattens it (must be > 0)
`top_p`	Nucleus sampling: keep the smallest token set whose cumulative probability reaches p
`top_k`	Keep only top k tokens per step
`num_beams`	Beam search width (1 = greedy)
`repetition_penalty`	Penalise repeated tokens (1.0 = no penalty)
`no_repeat_ngram_size`	Block repeating n-grams of this length
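
To make the sampling rows of the table concrete, here is a pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is a simplified illustration of the idea, not the library's implementation; the function names and the toy logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p(probs, top_k=0, top_p=1.0):
    """Return {token_id: renormalised prob} for the tokens that survive filtering."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]               # keep only the k most likely tokens
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p / total
        if cumulative >= top_p:               # smallest prefix whose mass reaches top_p
            break
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}

logits = [2.0, 1.0, 0.5, -1.0]                # toy vocabulary of four tokens
print(top_k_top_p(softmax(logits, temperature=0.8), top_p=0.9))
```

Sampling then draws the next token from the surviving, renormalised distribution; greedy decoding would simply take the first entry.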

Chat templates — apply_chat_template

Instruction-tuned models expect input in a specific format (system/user/assistant turns). apply_chat_template encodes a list of message dicts into the model's correct prompt format.

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
    {"role": "user",   "content": "What is a list comprehension?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=120, do_sample=False)

response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Output:

text

A list comprehension is a concise way to create a list:
  squares = [x**2 for x in range(10)]
It can include a filter condition:
  evens = [x for x in range(20) if x % 2 == 0]
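
Conceptually, a chat template is a string-rendering step that runs before tokenisation. The sketch below shows the idea with a toy renderer. The tag format and the render_chat helper are invented for illustration; each real model ships its own template, which is why apply_chat_template should be used in practice.

```python
def render_chat(messages, add_generation_prompt=True, eos="</s>"):
    """Toy renderer: one tagged block per message (tags are made up for illustration)."""
    parts = [f"<|{m['role']}|>\n{m['content']}{eos}\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")       # cue the model to answer next
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
]
print(render_chat(messages))
```

Leaving out the generation prompt ends the string after the user turn, so the model may continue the user's message instead of replying.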

Trainer — fine-tuning

Trainer handles the training loop, evaluation, checkpointing, and logging. Provide a TrainingArguments config, a Dataset object, and the model.

python

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",        # named evaluation_strategy in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=50,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=args,
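    # The IMDB splits are grouped by label, so shuffle before taking a subset
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),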
    compute_metrics=compute_metrics,
)
trainer.train()

Saving and loading checkpoints

python

# Save model + tokenizer
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load later
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")

# Push to Hub
model.push_to_hub("alicedev/my-sentiment-classifier")
tokenizer.push_to_hub("alicedev/my-sentiment-classifier")

AutoModel head families

The AutoModelFor* class controls which task head is attached on top of the base transformer. Pick the head that matches the supervised task — using the wrong class either fails to load or gives the wrong output shape.

Class	Task	Output
`AutoModel`	Raw encoder	hidden states
`AutoModelForCausalLM`	Left-to-right text generation	next-token logits
`AutoModelForSeq2SeqLM`	Encoder-decoder (T5, BART)	decoder logits
`AutoModelForMaskedLM`	BERT-style masked-token prediction	masked-position logits
`AutoModelForSequenceClassification`	Single-label / multi-label classification	logits per class
`AutoModelForTokenClassification`	NER, POS tagging	logits per token per class
`AutoModelForQuestionAnswering`	Extractive QA	start/end span logits
`AutoModelForImageClassification`	Vision classification	logits per class
`AutoModelForObjectDetection`	DETR-style detection	boxes + class logits
`AutoModelForSpeechSeq2Seq`	Whisper-style ASR	transcript token logits
`AutoModelForVision2Seq`	Image → text (BLIP, LLaVA)	caption / answer logits

python

from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

model_name = "deepset/roberta-base-squad2"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForQuestionAnswering.from_pretrained(model_name)

context  = "The transformer architecture was introduced in 2017 by Vaswani et al."
question = "When was the transformer introduced?"

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax()
end   = out.end_logits.argmax() + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end])
print(answer)
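
Taking the argmax of the start and end logits independently can produce an end position before the start, which decodes to an empty answer. A common remedy is to score every valid (start, end) pair jointly. The sketch below illustrates that idea in pure Python; the best_span helper and the toy logits are assumptions, not part of the library.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end_inclusive) maximising start + end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Independent argmax would give start=4 and end=1 here, i.e. an empty slice.
start_logits = [0.1, 0.2, 0.0, 3.0, 5.0, 0.3]
end_logits   = [0.0, 4.0, 0.1, 0.2, 3.5, 0.4]
print(best_span(start_logits, end_logits))
```

The returned end index is inclusive, so slice the input ids with [start : end + 1] before decoding.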

Tokenisers — fast vs slow, special tokens, offsets

Hugging Face tokenisers come in two flavours: "slow" (pure Python) and "fast" (Rust-backed via the tokenizers crate). Always prefer the fast version — it supports offset mapping for token-to-character alignment, batched encoding, and significantly higher throughput.

python

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tok).__name__, "  is_fast:", tok.is_fast)

# Special tokens
print("CLS:", tok.cls_token, tok.cls_token_id)
print("SEP:", tok.sep_token, tok.sep_token_id)
print("PAD:", tok.pad_token, tok.pad_token_id)

# Encoding with offset mapping
out = tok(
    "Alice Dev visited London.",
    return_tensors="pt",
    return_offsets_mapping=True,
    return_attention_mask=True,
)
for tid, (start, end) in zip(out["input_ids"][0], out["offset_mapping"][0]):
    print(f"  {tok.decode([tid]):>12s}  chars[{start.item()}:{end.item()}]")

Output:

text

BertTokenizerFast   is_fast: True
CLS: [CLS] 101
SEP: [SEP] 102
PAD: [PAD] 0
       [CLS]  chars[0:0]
       alice  chars[0:5]
         dev  chars[6:9]
     visited  chars[10:17]
      london  chars[18:24]
           .  chars[24:25]
       [SEP]  chars[0:0]
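
Offset mappings are what let you translate a character span (for example an annotated entity) into token positions. A small sketch using the offsets printed above; the tokens_for_span helper is hypothetical.

```python
def tokens_for_span(offsets, char_start, char_end):
    """Indices of tokens whose character range overlaps [char_start, char_end)."""
    return [
        i for i, (s, e) in enumerate(offsets)
        if s != e and s < char_end and e > char_start   # s == e marks special tokens
    ]

# Offsets copied from the output above for "Alice Dev visited London."
offsets = [(0, 0), (0, 5), (6, 9), (10, 17), (18, 24), (24, 25), (0, 0)]
text = "Alice Dev visited London."

start = text.index("London")
print(tokens_for_span(offsets, start, start + len("London")))
print(tokens_for_span(offsets, 0, len("Alice Dev")))
```

The same lookup run in reverse (token index to offsets) recovers the original substring for any predicted token span.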

Padding, truncation, and attention masks

When batching, all sequences in a batch must be the same length. padding=True pads to the longest sequence in the batch; padding="max_length" pads to a fixed max_length. Always pair padding with the resulting attention mask so the model ignores pad tokens.

python

from transformers import AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

texts = ["Short.", "A medium-length sentence about transformers."]

# Pad to longest in batch (most efficient)
out = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=64)
print(out["input_ids"].shape, out["attention_mask"].shape)
print(out["attention_mask"][0])

# Truncate long inputs from the left or right (truncation_side is a tokenizer attribute)
long_text = "word " * 1000
tok.truncation_side = "left"
out = tok(long_text, return_tensors="pt", truncation=True, max_length=32)
print(out["input_ids"].shape)

Output:

text

torch.Size([2, 11]) torch.Size([2, 11])
tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
torch.Size([1, 32])
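
What the tokenizer does when padding can be sketched in a few lines of plain Python. This is an illustration of the mechanism only; the pad_batch helper and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad token-id lists and build the matching attention mask."""
    target = max_length or max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]                        # truncate if longer than target
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 2460, 102], [101, 1037, 5396, 6251, 102]])
print(ids)
print(mask)
```

Every 0 in the mask lines up with a pad id, which is how the model knows to ignore those positions.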

datasets — loading & mapping at scale

The datasets library (separate package) is the canonical way to feed data to Transformers. It supports streaming from the Hub, memory-mapped arrays for huge corpora, and .map() for parallel preprocessing.

bash

pip install datasets

Output: pip's install log (exits 0 on success)

python

from datasets import load_dataset, Dataset
from transformers import AutoTokenizer   # needed for the tokenizer used below

# Load from the Hub
ds = load_dataset("imdb", split="train")
print(ds)

# Streaming for huge datasets that don't fit in memory
big = load_dataset("allenai/c4", "en", split="train", streaming=True)   # C4 is hosted under the allenai namespace
for example in big.take(2):
    print(example["text"][:50])

# Parallel preprocessing
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(batch):
    return tok(batch["text"], truncation=True, padding=False, max_length=256)

ds_tok = ds.map(tokenize, batched=True, num_proc=4)
ds_tok.set_format("torch", columns=["input_ids", "attention_mask", "label"])
print(ds_tok[0]["input_ids"][:10])

Output:

text

Dataset({features: ['text', 'label'], num_rows: 25000})
Beginners tutorial: read this if you want to underst
A new feature for clouds: real-time processing of la
tensor([ 101, 1996, 2143, 2059, 1011, 2918, 1996, 1009, 2003, 2025])

Pipeline batching

Pipelines accept batch_size and an iterable of inputs for efficient inference. Avoid Python loops over the pipeline — they re-pad and re-call per element.

python

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
    batch_size=32,
)

texts = ["Great product."] * 1000 + ["Terrible."] * 1000
# Iterating yields one prediction at a time but batches internally
for i, result in enumerate(classifier(texts)):
    if i < 3:
        print(result)

Output:

text

{'label': 'POSITIVE', 'score': 0.9999}
{'label': 'POSITIVE', 'score': 0.9999}
{'label': 'POSITIVE', 'score': 0.9999}

torch.compile and Flash Attention

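For inference on Ampere+ GPUs, torch.compile plus Flash Attention 2 can cut latency noticeably. Figures in the 30–60% range are workload-dependent (model size, sequence length, batch size), so benchmark on your own hardware before relying on them.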

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",   # requires `pip install flash-attn`
)
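# generate() calls the module's forward internally, so compile forward itself;
# a static KV cache keeps tensor shapes fixed so the compiled graph can be reused.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead")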

inputs = tok("The transformer architecture", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

Output:

text

The transformer architecture replaces recurrence with self-attention, allowing
parallel processing of sequences and longer-range dependencies than RNNs...

Quantisation — bitsandbytes, GPTQ, AWQ

Quantisation reduces memory footprint at small (or sometimes negligible) accuracy cost. The cheapest path is BitsAndBytesConfig; for inference-only deployments use a pre-quantised GPTQ or AWQ checkpoint.

python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
import torch

# 4-bit NF4 with double-quantisation (best memory/accuracy trade-off)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(f"Footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

Output:

text

Footprint: 5.83 GB

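Pre-quantised checkpoints exist for many models (look for -AWQ / -GPTQ / -bnb-4bit suffixes on the Hub). They load with from_pretrained without an extra quantization_config because the settings are stored in the checkpoint's config, but the matching backend package still has to be installed.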
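
The memory/accuracy trade-off is easier to see on a toy example. The sketch below applies plain symmetric absmax quantisation to a handful of weights. It is deliberately simpler than the NF4 scheme used by bitsandbytes (which uses non-uniform levels and per-block scales); the function names and weight values are assumptions for illustration.

```python
def quantize_absmax(weights, bits=4):
    """Symmetric absmax quantisation: floats -> small signed integers plus one scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.98, -0.05, 0.31]     # made-up, non-zero weights
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"max abs error = {max_err:.3f}")
```

Each weight now needs 4 bits plus a shared scale, and the rounding error per weight is bounded by half the scale.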

PEFT and LoRA fine-tuning

For most practical fine-tuning, full-parameter training is overkill. PEFT (Parameter-Efficient Fine-Tuning) — particularly LoRA — trains a small set of adapter weights while freezing the base model.

bash

pip install peft accelerate bitsandbytes trl datasets

Output: pip's install log (exits 0 on success)

python

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
tok   = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

ds = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")
args = TrainingArguments(output_dir="./lora-out", num_train_epochs=1,
                         per_device_train_batch_size=4, gradient_accumulation_steps=4,
                         learning_rate=2e-4, fp16=False, bf16=True, logging_steps=20,
                         save_strategy="epoch")
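# tokenizer=, dataset_text_field=, and max_seq_length= match older TRL releases;
# newer ones take processing_class= and move the other two into SFTConfig.
trainer = SFTTrainer(model=model, args=args, train_dataset=ds, tokenizer=tok,
                     dataset_text_field="instruction", max_seq_length=512)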
trainer.train()
trainer.model.save_pretrained("./lora-out/final")

Output:

text

trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
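
The trainable-parameter count can be cross-checked by hand: LoRA adds two low-rank matrices per adapted weight, for r × (d_in + d_out) new parameters each. The sketch below works through that arithmetic. The projection shapes are assumptions based on an 8B Llama-style configuration (hidden size 4096, grouped-query K/V width 1024, 32 decoder layers), and lora_param_count is a hypothetical helper.

```python
def lora_param_count(shapes, r, n_layers):
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted matrix: r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers

# Assumed (d_in, d_out) shapes of the attention projections in one decoder layer
shapes = {"q_proj": (4096, 4096), "k_proj": (4096, 1024),
          "v_proj": (4096, 1024), "o_proj": (4096, 4096)}

all_four = lora_param_count(shapes, r=16, n_layers=32)
qv_only = lora_param_count({k: shapes[k] for k in ("q_proj", "v_proj")}, r=16, n_layers=32)
print(f"q,k,v,o: {all_four:,}")
print(f"q,v    : {qv_only:,}")
```

Under these assumptions, adapting all four projections at r=16 gives 13,631,488 trainable parameters, while adapting only q_proj and v_proj gives 6,815,744.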

Loading and merging LoRA adapters

python

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base, "./lora-out/final")

# Optionally merge LoRA into base weights for export
merged = model.merge_and_unload()
merged.save_pretrained("./llama3-merged")

Datasets, collators, dynamic padding

A data collator forms batches and pads on the fly. DataCollatorWithPadding is the right default for classification; DataCollatorForLanguageModeling applies random masking for MLM (or builds causal-LM labels with mlm=False); DataCollatorForSeq2Seq pads inputs and labels together, filling label padding with -100 so the loss ignores it.

python

from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tok, padding="longest", return_tensors="pt")

# ds_tok is the tokenised IMDB dataset built in the datasets section above
loader = DataLoader(ds_tok, batch_size=16, collate_fn=collator, shuffle=True)
batch = next(iter(loader))
print({k: v.shape for k, v in batch.items()})

Output:

text

{'input_ids': torch.Size([16, 142]), 'attention_mask': torch.Size([16, 142]), 'labels': torch.Size([16])}
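
The benefit of dynamic padding is that each batch is only as wide as its longest member. A rough sketch of the saving, counting token slots processed under both strategies; the sequence lengths and the padded_tokens helper are made up for illustration.

```python
def padded_tokens(lengths, batch_size, fixed_length=None):
    """Total token slots processed when each batch is padded to one common width."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        width = fixed_length if fixed_length is not None else max(batch)
        total += width * len(batch)
    return total

lengths = [12, 40, 33, 250, 18, 21, 64, 30]      # made-up sequence lengths
print("fixed 256:", padded_tokens(lengths, batch_size=4, fixed_length=256))
print("dynamic  :", padded_tokens(lengths, batch_size=4))
```

Grouping sequences of similar length into the same batch would shrink the dynamic figure further, since one long outlier sets the width for its whole batch.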

Evaluation with `evaluate`

The evaluate library ships standard metrics that align with Trainer's compute_metrics signature.

bash

pip install evaluate

Output: pip's install log (exits 0 on success)

python

import evaluate, numpy as np

accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        **accuracy.compute(predictions=preds, references=labels),
        **f1.compute(predictions=preds, references=labels, average="macro"),
    }
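
For readers unsure what average="macro" means, the sketch below computes macro F1 by hand: one F1 score per class, then an unweighted mean. It is a plain-Python illustration with toy labels, not the evaluate library's code.

```python
def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(references) | set(predictions)):
        tp = sum(p == c and r == c for p, r in zip(predictions, references))
        fp = sum(p == c and r != c for p, r in zip(predictions, references))
        fn = sum(p != c and r == c for p, r in zip(predictions, references))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

preds = [1, 1, 0, 1, 0, 0]
refs  = [1, 0, 0, 1, 1, 0]
print(round(macro_f1(preds, refs), 4))
```

Because every class counts equally, macro F1 penalises a model that ignores a rare class more heavily than accuracy does.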

Inference optimisations — KV cache, batching, beam vs sample

For generation, the most impactful knobs are: KV cache (on by default), batch size, beam width vs sampling, and quantisation. Streaming with TextIteratorStreamer makes long generations interactive.

python

from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import torch

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)

inputs = tok("The future of AI is", return_tensors="pt").to(model.device)
kwargs = dict(**inputs, streamer=streamer, max_new_tokens=80, do_sample=True, temperature=0.8)
Thread(target=model.generate, kwargs=kwargs).start()
for text in streamer:
    print(text, end="", flush=True)

Output:

text

 cooperative — a future in which intelligent systems work alongside humans rather than displacing them...
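
The reason generate runs in a separate thread is that the streamer is a blocking iterator: one thread produces text while the main thread consumes it. The sketch below reproduces that pattern with only the standard library. ToyStreamer and fake_generate are stand-ins invented for illustration, not the library's classes.

```python
import queue
import threading

class ToyStreamer:
    """Iterator fed by a producer thread; None marks the end of the stream."""
    def __init__(self):
        self._q = queue.Queue()

    def put(self, text):
        self._q.put(text)

    def end(self):
        self._q.put(None)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._q.get()          # blocks until the producer pushes something
        if item is None:
            raise StopIteration
        return item

def fake_generate(streamer, pieces):
    """Stand-in for model.generate: pushes text pieces as they are produced."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()

streamer = ToyStreamer()
worker = threading.Thread(target=fake_generate, args=(streamer, ["Hello", ", ", "world"]))
worker.start()
received = list(streamer)             # consume in the main thread
worker.join()
print("".join(received))
```

If the producer ran in the main thread instead, nothing would be printed until generation had finished, which defeats the purpose of streaming.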

Vision — image classification and object detection

python

from transformers import pipeline
from PIL import Image

# Classification
clf = pipeline("image-classification", model="google/vit-base-patch16-224")
img = Image.open("cat.jpg")
print(clf(img)[:3])

# Object detection
det = pipeline("object-detection", model="facebook/detr-resnet-50")
results = det(img)
for r in results[:3]:
    print(f"  {r['label']:15s}  score={r['score']:.2f}  box={r['box']}")

Output:

text

[{'label': 'Egyptian cat', 'score': 0.78}, {'label': 'tabby, tabby cat', 'score': 0.12}, {'label': 'tiger cat', 'score': 0.06}]
  cat            score=0.99  box={'xmin': 30, 'ymin': 18, 'xmax': 470, 'ymax': 380}
  remote         score=0.97  box={'xmin': 510, 'ymin': 70, 'xmax': 620, 'ymax': 175}
  couch          score=0.94  box={'xmin': 0,  'ymin': 200,'xmax': 640, 'ymax': 480}

Audio — ASR with Whisper

python

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
    return_timestamps=True,
    device=0,
)
out = asr("interview.wav")
print(out["text"][:120])
for chunk in out["chunks"][:3]:
    start, end = chunk["timestamp"]
    print(f"  [{start:6.2f}-{end:6.2f}]  {chunk['text']}")

Hugging Face Hub — auth, downloads, uploads

bash

huggingface-cli login                  # paste token from huggingface.co/settings/tokens

Output:

text

    _|    _|  _|    _|    _|_|_|    _|_|_|_|_|    _|_|_|
    ...
Token has been saved to /home/alice/.cache/huggingface/token.
Login successful

bash

huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama-3.1-8b

Output:

text

Downloading config.json: 100% 614/614
Downloading model-00001-of-00004.safetensors: 100% 4.98G/4.98G
Downloading tokenizer.json: 100% 9.09M/9.09M
./llama-3.1-8b

python

from huggingface_hub import upload_folder, snapshot_download

# Pull a snapshot (resumable, cached)
local = snapshot_download(repo_id="BAAI/bge-small-en-v1.5", local_dir="./bge-small")
print(local)

# Push a folder
upload_folder(folder_path="./my-model", repo_id="alicedev/my-classifier", commit_message="initial")

Real-world recipes

Recipe: text-classification micro-service

python

from transformers import pipeline
from fastapi import FastAPI

clf = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    device=0,
    batch_size=16,
)

app = FastAPI()

@app.post("/classify")
def classify(payload: dict):
    return clf(payload["texts"], top_k=None)

bash

uvicorn app:app --host 0.0.0.0 --port 8000

Output:

text

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Application startup complete.

Recipe: batch summarisation over a folder of PDFs

python

from transformers import pipeline
from pathlib import Path
import fitz       # PyMuPDF

summariser = pipeline("summarization", model="facebook/bart-large-cnn", device=0, batch_size=4)

def pdf_text(path: Path) -> str:
    return "\n".join(page.get_text() for page in fitz.open(path))

pdfs = list(Path("./docs").glob("*.pdf"))
texts = [pdf_text(p)[:4000] for p in pdfs]
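summaries = summariser(texts, max_length=120, min_length=40, do_sample=False, truncation=True)

out_dir = Path("./summaries")
out_dir.mkdir(exist_ok=True)          # write_text fails if the folder is missing
for p, s in zip(pdfs, summaries):
    (out_dir / f"{p.stem}.txt").write_text(s["summary_text"])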
print(f"Wrote {len(pdfs)} summaries")

Output:

text

Wrote 42 summaries

Recipe: offline inference behind a corporate firewall

Download model weights ahead of time, then point Transformers at the local snapshot — no Hub calls at runtime.

bash

HF_HUB_DOWNLOAD_TIMEOUT=120 huggingface-cli download intfloat/e5-small-v2 --local-dir ./e5-small-v2
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

Output:

text

Fetching 12 files: 100%|████████████████████████████████████████| 12/12
./e5-small-v2

python

from transformers import AutoTokenizer, AutoModel
tok   = AutoTokenizer.from_pretrained("./e5-small-v2")     # local path
model = AutoModel.from_pretrained("./e5-small-v2")

Recipe: extracting embeddings from any encoder

Pool hidden states from a base encoder for a custom semantic search index.

python

import torch, torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok   = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2").to("cuda").eval()

@torch.no_grad()
def embed(texts, prefix="passage: "):
    batch = tok([prefix + t for t in texts], padding=True, truncation=True,
                return_tensors="pt", max_length=512).to(model.device)
    out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1)        # mean pooling with attention mask
    return F.normalize(pooled, p=2, dim=1).cpu()

emb = embed(["Linux process control with systemd", "PyTorch optimisers"])
print(emb.shape, emb[0][:5])

Output:

text

torch.Size([2, 384]) tensor([ 0.0192, -0.0411,  0.0623, -0.0084,  0.0719])
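
Once embeddings are L2-normalised, semantic search reduces to ranking by dot product. A minimal pure-Python sketch of that lookup follows. The 3-dimensional vectors are made-up stand-ins for real embeddings, and the normalize and search helpers are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def search(query, corpus, k=2):
    """Rank corpus vectors by dot product with the query (cosine, since all are unit length)."""
    scores = [(sum(q * d for q, d in zip(query, doc)), i) for i, doc in enumerate(corpus)]
    return [(i, round(s, 3)) for s, i in sorted(scores, reverse=True)[:k]]

# Made-up 3-dimensional vectors standing in for real embeddings
corpus = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0])]
query = normalize([0.9, 0.1, 0.0])
print(search(query, corpus))
```

For more than a few thousand documents, the same ranking is usually done with a matrix multiply or an approximate nearest-neighbour index rather than a Python loop.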

Recipe: gradient checkpointing for long-sequence fine-tuning

Trade compute for memory to fit larger batches or longer sequences on a single GPU.

python

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,           # recompute activations on backward pass
    optim="adamw_bnb_8bit",                # 8-bit optimiser (memory saver)
    bf16=True,
    num_train_epochs=1,
    logging_steps=20,
)
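
With gradient accumulation, the batch size that the optimiser effectively sees is the product of the per-device batch, the accumulation steps, and the number of devices. A small sketch of that arithmetic; effective_batch_size is a hypothetical helper.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, n_devices: int = 1) -> int:
    """Number of samples contributing to each optimiser step."""
    return per_device * accumulation_steps * n_devices

# Different memory footprints, same effective batch of 16 on one GPU
for per_device, accum in [(16, 1), (4, 4), (2, 8)]:
    print(per_device, accum, effective_batch_size(per_device, accum))
```

The configuration above (batch 2, accumulation 8) therefore trains with an effective batch of 16 while holding only two samples' activations in memory at a time.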

Performance and reliability tips

Set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id for decoder-only LMs before batched generation, and pad on the left (tokenizer.padding_side = "left"). Right-padded prompts degrade generations without raising an error.
Always pass truncation=True with a max_length matched to the model's model_max_length. Otherwise long inputs raise errors mid-batch.
Use from_pretrained(..., low_cpu_mem_usage=True, device_map="auto") — without it you load the full FP32 model into CPU before moving to GPU, which OOMs on large models.
Pin model revisions in production: from_pretrained("name", revision="abc123"). Default main can change under you.
Prefer safetensors checkpoints (faster, sandboxed) over .bin pickles.
Set TRANSFORMERS_NO_ADVISORY_WARNINGS=1 and TOKENIZERS_PARALLELISM=false in production environments to cut log noise.
Cache the tokenizer once, not per call: tokenizer instantiation is slow due to vocab loading.
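For inference servers, load one pipeline per worker process. Each pipeline holds its own copy of the model weights, and a pipeline that has already initialised CUDA cannot be shared safely with forked child processes; use the spawn start method if workers are created after the model is loaded.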

Quick reference

Task	Code
Quick inference	`pipeline("task", model="checkpoint")`
Batched pipeline	`pipeline(..., batch_size=32)`
Load tokenizer	`AutoTokenizer.from_pretrained("name")`
Load causal LM	`AutoModelForCausalLM.from_pretrained("name")`
Load seq2seq	`AutoModelForSeq2SeqLM.from_pretrained("name")`
Load classifier	`AutoModelForSequenceClassification.from_pretrained("name", num_labels=N)`
Tokenize	`tokenizer(text, return_tensors="pt", padding=True, truncation=True)`
Offset mapping	`tokenizer(text, return_offsets_mapping=True)`
GPU placement	`from_pretrained(..., device_map="auto")`
Flash attention	`attn_implementation="flash_attention_2"`
4-bit load	`from_pretrained(..., quantization_config=BitsAndBytesConfig(load_in_4bit=True))`
Compile	`model = torch.compile(model, mode="reduce-overhead")`
Generate text	`model.generate(**inputs, max_new_tokens=100, do_sample=True)`
Stream tokens	`TextIteratorStreamer(tokenizer)` + thread
Chat template	`tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`
Decode output	`tokenizer.decode(out[0], skip_special_tokens=True)`
Fine-tune	`Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds).train()`
LoRA fine-tune	`get_peft_model(model, LoraConfig(...))`
Merge LoRA	`merged = peft_model.merge_and_unload()`
Data collator	`DataCollatorWithPadding(tokenizer)`
Datasets `.map`	`ds.map(tokenize, batched=True, num_proc=4)`
Streaming dataset	`load_dataset(..., streaming=True).take(N)`
Save	`model.save_pretrained("./dir")`
Push to Hub	`model.push_to_hub("owner/name")`
Snapshot download	`snapshot_download(repo_id="name", local_dir="./dir")`
HF login	`huggingface-cli login`
Offline mode	`export TRANSFORMERS_OFFLINE=1 HF_HUB_OFFLINE=1`
