cheat sheet
transformers
Load and run pre-trained models for NLP, vision, and audio with the Hugging Face Transformers library. Covers pipelines, AutoModel, tokenisation, generation, fine-tuning, and device placement.
transformers — Hugging Face
What it is
Hugging Face Transformers is the standard Python library for loading, running, and fine-tuning pre-trained neural networks — language models, image classifiers, speech recognisers, and more. It provides a unified API (pipeline, AutoModel, AutoTokenizer) that works across PyTorch, TensorFlow, and JAX backends. The Hugging Face Hub hosts over 900 000 public model checkpoints that can be downloaded with a single function call.
Install
pip install transformers
pip install "transformers[torch]" # with PyTorch (most common); quotes stop shells like zsh from expanding the brackets
pip install accelerate # needed for device_map="auto" and multi-GPU
Output: (none — exits 0 on success)
Quick example
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I absolutely loved this product — exceeded every expectation.")
print(result)
Output:
[{'label': 'POSITIVE', 'score': 0.9998}]
When / why to use it
- Running any of the 900k+ models on Hugging Face Hub without writing model code.
- Tasks: text classification, summarisation, translation, question answering, image classification, ASR, text-to-image, and more.
- Fine-tuning a pre-trained checkpoint on your own dataset with Trainer.
- Loading a model locally for offline or privacy-constrained inference.
- Integrating models into LangChain, LlamaIndex, or custom pipelines.
Common pitfalls
- pad_token not set for decoder-only models: GPT-style models have no pad token by default. When batching inputs, set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id before encoding.
- return_tensors="pt" must match your backend: if you're using PyTorch, pass return_tensors="pt" to the tokenizer. Omitting it returns Python lists, which most model forward() calls reject.
- Model card licences: many models (LLaMA, Gemma, Mistral) require accepting a licence on Hugging Face before downloading. Authenticate with huggingface-cli login and accept the licence on the model's Hub page first.
- Large model downloads: a 7B-parameter model in float16 is ~14 GB. Use low_cpu_mem_usage=True and device_map="auto" (requires accelerate) to load directly to GPU without duplicating the model in CPU RAM.
- Prefer safetensors format checkpoints when available: they load faster and are safer than .bin (pickle) files. Most recent Hub checkpoints include model.safetensors.
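The ~14 GB figure is parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming weights only (activations, KV cache, and optimiser state come on top); the helper name `estimate_weight_gb` and the dtype table are illustrative, not part of the library.

```python
# Bytes per parameter for common dtypes; the 4-bit entry ignores quantisation metadata.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "int8": 1.0, "4bit": 0.5}

def estimate_weight_gb(num_params: float, dtype: str) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("float32", "float16", "4bit"):
    print(f"7B in {dtype}: ~{estimate_weight_gb(7e9, dtype):.1f} GB")
```

Useful as a first check on whether a checkpoint can fit on a given GPU before starting the download.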
Richer example — text summarisation pipeline
from transformers import pipeline
summariser = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0,  # GPU 0; use -1 or omit for CPU
)
article = """
Scientists at the Global Climate Institute have published findings showing
that ocean temperatures in the northern Atlantic rose by an average of
1.4 degrees Celsius over the past decade, the largest recorded increase
in over a century of measurements. Researchers attribute the change to
a combination of greenhouse gas accumulation and shifting ocean currents.
The study calls for immediate international policy responses.
"""
result = summariser(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])
Output:
Ocean temperatures in the northern Atlantic rose by 1.4°C over the past decade.
Scientists call for immediate international policy responses to the findings.
pipeline() — zero-code inference
pipeline is the fastest path from model name to prediction. It handles tokenisation, model loading, device placement, and post-processing automatically. Specify device=0 for CUDA GPU, device="mps" for Apple Silicon, or device=-1 / omit for CPU.
from transformers import pipeline
# Named-entity recognition
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
print(ner("Alice Dev works at Acme Corp in London."))
Output:
[{'entity_group': 'PER', 'score': 0.998, 'word': 'Alice Dev', 'start': 0, 'end': 9},
{'entity_group': 'ORG', 'score': 0.994, 'word': 'Acme Corp', 'start': 19, 'end': 28},
{'entity_group': 'LOC', 'score': 0.999, 'word': 'London', 'start': 32, 'end': 38}]
# Zero-shot classification — no task-specific training needed
zsc = pipeline("zero-shot-classification")
result = zsc(
    "This quarter we shipped 14 new features and fixed 32 bugs.",
    candidate_labels=["product update", "financial report", "sports"],
)
print(result["labels"][0], f"{result['scores'][0]:.2%}")
Output:
product update 96.74%
# Automatic speech recognition
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
result = asr("audio.wav")
print(result["text"])
Output:
The quick brown fox jumps over the lazy dog.
Common task strings: "text-classification", "token-classification", "question-answering", "summarization", "translation_en_to_fr", "text-generation", "fill-mask", "image-classification", "object-detection", "automatic-speech-recognition".
AutoTokenizer and AutoModel
AutoTokenizer and AutoModel* load the correct class for any checkpoint automatically, making code checkpoint-agnostic. The * in AutoModel* corresponds to task-specific heads: AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, etc.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
texts = ["I loved every moment.", "Worst experience of my life."]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
for text, prob in zip(texts, probs):
    pred = labels[prob.argmax().item()]
    print(f"{pred}: {prob.max():.2%} — {text}")
Output:
POSITIVE: 99.98% — I loved every moment.
NEGATIVE: 99.94% — Worst experience of my life.
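The percentages come from a softmax over the classification head's logits, followed by an id2label lookup. A dependency-free sketch of those two steps, assuming made-up logit values and a two-class mapping shaped like model.config.id2label:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # same shape as model.config.id2label
logits = [-4.2, 4.5]                        # invented values for illustration

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(id2label[best], f"{probs[best]:.2%}")
```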
Device placement — CPU, CUDA, MPS
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
# Single GPU — manual
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
# Multi-GPU / CPU offload (requires accelerate)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",       # distributes layers across available devices
    low_cpu_mem_usage=True,  # never loads full model into CPU RAM
)
# Apple Silicon MPS
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="mps",
)
# 4-bit quantisation — requires bitsandbytes
from transformers import BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb)
generate() — text generation parameters
generate() is the method behind all decoder-based text generation. Key parameters control length, randomness, and search strategy.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = "The future of renewable energy is"
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy (deterministic)
out = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Output:
The future of renewable energy is bright. The world is moving toward a clean energy economy...
# Sampling — creative, varied
out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
)
# Beam search: keeps the 5 highest-scoring candidate sequences at each step (deterministic)
out = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3,
)
| Parameter | Effect |
|---|---|
| max_new_tokens | Hard cap on tokens generated |
| do_sample=True | Enable sampling (off = greedy, or beam search when num_beams > 1) |
| temperature | Randomness: must be > 0; values below 1 sharpen the distribution, values above 1 flatten it |
| top_p | Nucleus sampling: keep the smallest set of tokens whose cumulative probability reaches p |
| top_k | Keep only top k tokens per step |
| num_beams | Beam search width (1 = greedy) |
| repetition_penalty | Penalise repeated tokens (1.0 = no penalty) |
| no_repeat_ngram_size | Block repeating n-grams of this length |
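To make temperature, top_k, and top_p concrete, here is a simplified pure-Python sketch of how the three reshape a next-token distribution. It is not the library's implementation (Transformers applies these as logits processors and details may differ); the function name and the example logits are invented.

```python
import math

def _normalise(weights):
    """Rescale a {token_id: weight} dict so the values sum to 1."""
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_id: probability} after temperature, top-k, then top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    probs = _normalise({i: math.exp(x - peak) for i, x in enumerate(scaled)})

    # Rank tokens from most to least likely.
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
        probs = _normalise({t: probs[t] for t in ranked})
    if top_p is not None:
        kept, mass = {}, 0.0
        for t in ranked:
            kept[t] = probs[t]
            mass += probs[t]
            if mass >= top_p:  # smallest prefix whose mass reaches top_p
                break
        probs = _normalise(kept)
    return probs

logits = [2.0, 1.0, 0.5, -1.0]
print(next_token_distribution(logits, temperature=0.5))
print(next_token_distribution(logits, top_k=2))
print(next_token_distribution(logits, top_p=0.9))
```

Sampling then draws one token from the returned distribution; greedy decoding would simply take its most likely entry.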
Chat templates — apply_chat_template
Instruction-tuned models expect input in a specific format (system/user/assistant turns). apply_chat_template encodes a list of message dicts into the model's correct prompt format.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=120, do_sample=False)
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
Output:
A list comprehension is a concise way to create a list:
squares = [x**2 for x in range(10)]
It can include a filter condition:
evens = [x for x in range(20) if x % 2 == 0]
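Conceptually, a chat template flattens the message list into one string with role markers, then optionally appends the marker that cues the assistant's turn. The sketch below assumes a made-up tag format; real templates ship with each tokenizer and differ per model, so treat `render_chat` as illustration only.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten role/content dicts into one prompt string using hypothetical role tags."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")  # cue the model to answer next
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
]
prompt = render_chat(messages)
print(prompt)
```

Without the generation prompt the string ends after the user turn, and a model may continue the user's message instead of answering it.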
Trainer — fine-tuning
Trainer handles the training loop, evaluation, checkpointing, and logging. Provide a TrainingArguments config, a Dataset object, and the model.
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",  # named evaluation_strategy before transformers 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=50,
)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}
trainer = Trainer(
    model=model,
    args=args,
    # Shuffle before subsetting: the IMDB splits are ordered by label
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
    compute_metrics=compute_metrics,
)
trainer.train()
Saving and loading checkpoints
# Save model + tokenizer
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")
# Load later
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
# Push to Hub
model.push_to_hub("alicedev/my-sentiment-classifier")
tokenizer.push_to_hub("alicedev/my-sentiment-classifier")
AutoModel head families
The AutoModelFor* class controls which task head is attached on top of the base transformer. Pick the head that matches the supervised task — using the wrong class either fails to load or gives the wrong output shape.
| Class | Task | Output |
|---|---|---|
| AutoModel | Raw encoder | hidden states |
| AutoModelForCausalLM | Left-to-right text generation | next-token logits |
| AutoModelForSeq2SeqLM | Encoder-decoder (T5, BART) | decoder logits |
| AutoModelForMaskedLM | BERT-style masked-token prediction | masked-position logits |
| AutoModelForSequenceClassification | Single-label / multi-label classification | logits per class |
| AutoModelForTokenClassification | NER, POS tagging | logits per token per class |
| AutoModelForQuestionAnswering | Extractive QA | start/end span logits |
| AutoModelForImageClassification | Vision classification | logits per class |
| AutoModelForObjectDetection | DETR-style detection | boxes + class logits |
| AutoModelForSpeechSeq2Seq | Whisper-style ASR | transcript token logits |
| AutoModelForVision2Seq | Image → text (BLIP, LLaVA) | caption / answer logits |
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch
model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
context = "The transformer architecture was introduced in 2017 by Vaswani et al."
question = "When was the transformer introduced?"
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax()
end = out.end_logits.argmax() + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end])
print(answer)
Output:
2017
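Taking the start and end argmax independently can produce an end that falls before the start. One common remedy is to score every valid (start, end) pair and keep the best. A minimal sketch, with invented logits and a hypothetical helper name `best_span`:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end_exclusive) maximising start + end score with start <= end."""
    best, best_score = (0, 1), float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e + 1), score
    return best

# Invented logits where the independent argmaxes disagree: start=2 but end=1.
start_logits = [0.1, 0.3, 5.0, 0.2, 0.1]
end_logits = [0.2, 6.0, 0.1, 4.5, 0.3]
start, end = best_span(start_logits, end_logits)
print(start, end)
```

The returned pair can be used directly as the slice bounds for the input_ids that are passed to tokenizer.decode.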
Tokenisers — fast vs slow, special tokens, offsets
Hugging Face tokenisers come in two flavours: "slow" (pure Python) and "fast" (Rust-backed via the tokenizers crate). Always prefer the fast version — it supports offset mapping for token-to-character alignment, batched encoding, and significantly higher throughput.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tok).__name__, " is_fast:", tok.is_fast)
# Special tokens
print("CLS:", tok.cls_token, tok.cls_token_id)
print("SEP:", tok.sep_token, tok.sep_token_id)
print("PAD:", tok.pad_token, tok.pad_token_id)
# Encoding with offset mapping
out = tok(
    "Alice Dev visited London.",
    return_tensors="pt",
    return_offsets_mapping=True,
    return_attention_mask=True,
)
for tid, (start, end) in zip(out["input_ids"][0], out["offset_mapping"][0]):
    print(f" {tok.decode([tid]):>12s} chars[{start.item()}:{end.item()}]")
Output:
BertTokenizerFast is_fast: True
CLS: [CLS] 101
SEP: [SEP] 102
PAD: [PAD] 0
[CLS] chars[0:0]
alice chars[0:5]
dev chars[6:9]
visited chars[10:17]
london chars[18:24]
. chars[24:25]
[SEP] chars[0:0]
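Offsets are what let token-level predictions be mapped back onto the original string. The sketch below reuses the offsets printed above and assumes hypothetical per-token labels (as a token-classification head might produce) to recover entity text by character span.

```python
text = "Alice Dev visited London."
# (start, end) character offsets per token; (0, 0) marks special tokens.
offsets = [(0, 0), (0, 5), (6, 9), (10, 17), (18, 24), (24, 25), (0, 0)]
# Hypothetical per-token labels for illustration.
labels = ["O", "PER", "PER", "O", "LOC", "O", "O"]

def merge_entities(text, offsets, labels):
    """Merge consecutive tokens sharing a non-"O" label into character spans."""
    spans, prev = [], "O"
    for (start, end), label in zip(offsets, labels):
        if start == end:  # special token such as [CLS] or [SEP]
            label = "O"
        if label != "O":
            if label == prev:
                spans[-1]["end"] = end  # extend the running entity
            else:
                spans.append({"label": label, "start": start, "end": end})
        prev = label
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans

for span in merge_entities(text, offsets, labels):
    print(span)
```

Slicing the original text keeps casing and spacing intact, which decoding the token ids of an uncased model would lose.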
Padding, truncation, and attention masks
When batching, all sequences in a batch must be the same length. padding=True pads to the longest sequence in the batch; padding="max_length" pads to a fixed max_length. Always pair padding with the resulting attention mask so the model ignores pad tokens.
from transformers import AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
texts = ["Short.", "A medium-length sentence about transformers."]
# Pad to longest in batch (most efficient)
out = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=64)
print(out["input_ids"].shape, out["attention_mask"].shape)
print(out["attention_mask"][0])
# Truncate long inputs from the left or right
long_text = "word " * 1000
tok.truncation_side = "left"  # tokenizer attribute; drops tokens from the start instead of the end
out = tok(long_text, return_tensors="pt", truncation=True, max_length=32)
print(out["input_ids"].shape)
Output:
torch.Size([2, 11]) torch.Size([2, 11])
tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
torch.Size([1, 32])
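What padding and truncation do to a batch can be shown on plain lists of token ids. This is a simplified sketch: it ignores the special-token handling a real tokenizer performs during truncation, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(batch, pad_id=0, max_length=None, truncation_side="right"):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    if max_length is not None:
        if truncation_side == "left":
            batch = [ids[-max_length:] for ids in batch]  # keep the tail
        else:
            batch = [ids[:max_length] for ids in batch]   # keep the head
    width = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return input_ids, attention_mask

batch = [[101, 7, 102], [101, 5, 6, 7, 8, 102]]  # invented token ids
ids, mask = pad_batch(batch, max_length=5)
print(ids)
print(mask)
```

The mask has a 1 for every real token and a 0 for every pad position, which is what lets the model ignore padding.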
datasets — loading & mapping at scale
The datasets library (separate package) is the canonical way to feed data to Transformers. It supports streaming from the Hub, memory-mapped arrays for huge corpora, and .map() for parallel preprocessing.
pip install datasets
Output: (none — exits 0 on success)
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer
# Load from the Hub
ds = load_dataset("imdb", split="train")
print(ds)
# Streaming for huge datasets that don't fit in memory
big = load_dataset("allenai/c4", "en", split="train", streaming=True)  # C4 is hosted under the allenai namespace
for example in big.take(2):
    print(example["text"][:50])
# Parallel preprocessing
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(batch):
    return tok(batch["text"], truncation=True, padding=False, max_length=256)
ds_tok = ds.map(tokenize, batched=True, num_proc=4)
ds_tok.set_format("torch", columns=["input_ids", "attention_mask", "label"])
print(ds_tok[0]["input_ids"][:10])
Output:
Dataset({features: ['text', 'label'], num_rows: 25000})
Beginners tutorial: read this if you want to underst
A new feature for clouds: real-time processing of la
tensor([ 101, 1996, 2143, 2059, 1011, 2918, 1996, 1009, 2003, 2025])
Pipeline batching
Pipelines accept batch_size and an iterable of inputs for efficient inference. Avoid Python loops over the pipeline — they re-pad and re-call per element.
from transformers import pipeline
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
    batch_size=32,
)
texts = ["Great product."] * 1000 + ["Terrible."] * 1000
# Iterating yields one prediction at a time but batches internally
for i, result in enumerate(classifier(texts)):
    if i < 3:
        print(result)
Output:
{'label': 'POSITIVE', 'score': 0.9999}
{'label': 'POSITIVE', 'score': 0.9999}
{'label': 'POSITIVE', 'score': 0.9999}
torch.compile and Flash Attention
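For inference on Ampere or newer GPUs, torch.compile plus Flash Attention 2 can cut latency substantially. Figures of 30–60% depend on model size, sequence length, and batch size, so benchmark on your own workload before relying on them.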
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires `pip install flash-attn`
)
model = torch.compile(model, mode="reduce-overhead")
inputs = tok("The transformer architecture", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
Output:
The transformer architecture replaces recurrence with self-attention, allowing
parallel processing of sequences and longer-range dependencies than RNNs...
Quantisation — bitsandbytes, GPTQ, AWQ
Quantisation reduces memory footprint at small (or sometimes negligible) accuracy cost. The cheapest path is BitsAndBytesConfig; for inference-only deployments use a pre-quantised GPTQ or AWQ checkpoint.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
import torch
# 4-bit NF4 with double-quantisation (best memory/accuracy trade-off)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(f"Footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
Output:
Footprint: 5.83 GB
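Pre-quantised checkpoints exist for many models (look for -AWQ / -GPTQ / -bnb-4bit suffixes on the Hub). They load with from_pretrained without extra config, provided the matching quantisation backend package (for example bitsandbytes for -bnb-4bit checkpoints) is installed.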
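The core idea behind weight quantisation is storing small integers plus a scale instead of full-precision floats. The sketch below uses plain symmetric absmax rounding to 4-bit integers. This is not NF4, which uses non-uniform levels and per-block scales; it only illustrates where the memory saving and the rounding error come from. The function names and weights are invented.

```python
def quantise_absmax(weights, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = (max(abs(w) for w in weights) / levels) or 1.0  # avoid a zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]

weights = [0.70, -0.33, 0.12, -0.02]  # invented values
q, scale = quantise_absmax(weights)
restored = dequantise(q, scale)
print(q)
print([round(r, 3) for r in restored])
```

Each restored weight is within half a quantisation step of the original, which is the accuracy cost traded for the smaller footprint.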
PEFT and LoRA fine-tuning
For most practical fine-tuning, full-parameter training is overkill. PEFT (Parameter-Efficient Fine-Tuning) — particularly LoRA — trains a small set of adapter weights while freezing the base model.
pip install peft trl accelerate bitsandbytes datasets
Output: (none — exits 0 on success)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch
model_name = "meta-llama/Llama-3.1-8B-Instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
ds = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")
args = TrainingArguments(output_dir="./lora-out", num_train_epochs=1,
                         per_device_train_batch_size=4, gradient_accumulation_steps=4,
                         learning_rate=2e-4, fp16=False, bf16=True, logging_steps=20,
                         save_strategy="epoch")
# Keyword names below follow older TRL releases; newer releases take processing_class=tok
# and read the text-field and sequence-length settings from an SFTConfig instead.
trainer = SFTTrainer(model=model, args=args, train_dataset=ds, tokenizer=tok,
                     dataset_text_field="instruction", max_seq_length=512)
trainer.train()
trainer.model.save_pretrained("./lora-out/final")
Output:
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
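The small trainable fraction follows from LoRA's shape: each adapted projection gains two thin matrices, A (r × d_in) and B (d_out × r). A sketch of the count for a single projection, assuming a hypothetical square layer of width 4096:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

# Hypothetical 4096 x 4096 projection adapted with rank 16.
added = lora_params(4096, 4096, r=16)
frozen = 4096 * 4096
print(f"adapter: {added:,} weights vs frozen: {frozen:,} ({added / frozen:.2%})")
```

Summing this over every module listed in target_modules and every layer gives the figure reported by print_trainable_parameters().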
Loading and merging LoRA adapters
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base, "./lora-out/final")
# Optionally merge LoRA into base weights for export
merged = model.merge_and_unload()
merged.save_pretrained("./llama3-merged")
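Merging folds the adapter into the base weight as W + (alpha / r) · B · A, so inference no longer needs the extra matrices. A toy sketch on nested lists, assuming the standard LoRA scaling and invented 2 × 2 values:

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A as a plain nested list."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy 2x2 base weight with a rank-1 adapter (all values invented).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x d_in  = 1 x 2
B = [[0.5], [0.0]]  # d_out x r = 2 x 1
merged = merge_lora(W, A, B, alpha=2, r=1)

x = [[3.0], [4.0]]  # column-vector input
print(merged)
print(matmul(merged, x))
```

The merged matrix gives the same output as running the base layer and the scaled adapter side by side, which is why merging is safe for export.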
Datasets, collators, dynamic padding
A data collator forms batches and pads on the fly. DataCollatorWithPadding is the right default for classification; DataCollatorForLanguageModeling applies MLM masking (or prepares causal-LM labels with mlm=False); DataCollatorForSeq2Seq pads inputs and labels together for encoder-decoder training.
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tok, padding="longest", return_tensors="pt")
loader = DataLoader(ds_tok, batch_size=16, collate_fn=collator, shuffle=True)
batch = next(iter(loader))
print({k: v.shape for k, v in batch.items()})
Output:
{'input_ids': torch.Size([16, 142]), 'attention_mask': torch.Size([16, 142]), 'labels': torch.Size([16])}
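The payoff of dynamic padding is fewer wasted pad tokens per batch. A rough sketch that counts them for both strategies, using invented sequence lengths and a hypothetical helper:

```python
def pad_tokens_needed(lengths, batch_size, fixed_length=None):
    """Count pad tokens when padding each batch to its longest item or to a fixed length."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        width = fixed_length if fixed_length is not None else max(chunk)
        total += sum(width - n for n in chunk)
    return total

lengths = [12, 40, 33, 8, 256, 19, 27, 64]  # invented token counts
dynamic = pad_tokens_needed(lengths, batch_size=4)
fixed = pad_tokens_needed(lengths, batch_size=4, fixed_length=256)
print(f"dynamic padding: {dynamic} pad tokens, fixed max_length: {fixed}")
```

One long outlier still inflates its own batch, which is why grouping examples of similar length can help further.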
Evaluation with evaluate
The evaluate library ships standard metrics that align with Trainer's compute_metrics signature.
pip install evaluate
Output: (none — exits 0 on success)
import evaluate, numpy as np
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        **accuracy.compute(predictions=preds, references=labels),
        **f1.compute(predictions=preds, references=labels, average="macro"),
    }
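For intuition about what the two metrics report, here is a dependency-free sketch of accuracy and macro-averaged F1 on invented predictions. It mirrors the usual definitions and is not the evaluate library's code.

```python
def accuracy(preds, refs):
    """Fraction of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def macro_f1(preds, refs):
    """Unweighted mean of per-class F1 over every class seen in preds or refs."""
    scores = []
    for cls in sorted(set(preds) | set(refs)):
        tp = sum(p == cls and r == cls for p, r in zip(preds, refs))
        fp = sum(p == cls and r != cls for p, r in zip(preds, refs))
        fn = sum(p != cls and r == cls for p, r in zip(preds, refs))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

preds = [1, 0, 1, 1, 0, 1]  # invented predictions
refs = [1, 0, 0, 1, 0, 0]   # invented references
print(f"accuracy={accuracy(preds, refs):.3f} macro_f1={macro_f1(preds, refs):.3f}")
```

Macro averaging weights every class equally, so it penalises a model that ignores a rare class more than accuracy does.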
Inference optimisations — KV cache, batching, beam vs sample
For generation, the most impactful knobs are: KV cache (on by default), batch size, beam width vs sampling, and quantisation. Streaming with TextIteratorStreamer makes long generations interactive.
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import torch
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
inputs = tok("The future of AI is", return_tensors="pt").to(model.device)
kwargs = dict(**inputs, streamer=streamer, max_new_tokens=80, do_sample=True, temperature=0.8)
Thread(target=model.generate, kwargs=kwargs).start()
for text in streamer:
    print(text, end="", flush=True)
Output:
cooperative — a future in which intelligent systems work alongside humans rather than displacing them...
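The streaming pattern is a producer thread pushing text chunks onto a queue while the main thread consumes them. The sketch below reproduces that pattern with the standard library only; `fake_generate` is a stand-in for model.generate and the chunking is invented.

```python
import queue
import threading

def fake_generate(words, out_queue):
    """Stand-in for model.generate: emits one chunk at a time, then a sentinel."""
    for word in words:
        out_queue.put(word + " ")
    out_queue.put(None)  # signals end of generation

def stream(words):
    """Yield chunks as the background thread produces them."""
    q = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(words, q))
    worker.start()
    while (chunk := q.get()) is not None:
        yield chunk
    worker.join()

received = list(stream(["tokens", "arrive", "one", "by", "one"]))
print("".join(received))
```

Running generation on a separate thread is what keeps the consuming loop responsive while new tokens are still being produced.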
Vision — image classification and object detection
from transformers import pipeline
from PIL import Image
# Classification
clf = pipeline("image-classification", model="google/vit-base-patch16-224")
img = Image.open("cat.jpg")
print(clf(img)[:3])
# Object detection
det = pipeline("object-detection", model="facebook/detr-resnet-50")
results = det(img)
for r in results[:3]:
    print(f" {r['label']:15s} score={r['score']:.2f} box={r['box']}")
Output:
[{'label': 'Egyptian cat', 'score': 0.78}, {'label': 'tabby, tabby cat', 'score': 0.12}, {'label': 'tiger cat', 'score': 0.06}]
cat score=0.99 box={'xmin': 30, 'ymin': 18, 'xmax': 470, 'ymax': 380}
remote score=0.97 box={'xmin': 510, 'ymin': 70, 'xmax': 620, 'ymax': 175}
couch score=0.94 box={'xmin': 0, 'ymin': 200,'xmax': 640, 'ymax': 480}
Audio — ASR with Whisper
from transformers import pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
    return_timestamps=True,
    device=0,
)
out = asr("interview.wav")
print(out["text"][:120])
for chunk in out["chunks"][:3]:
    start, end = chunk["timestamp"]
    print(f" [{start:6.2f}-{end:6.2f}] {chunk['text']}")
Hugging Face Hub — auth, downloads, uploads
huggingface-cli login # paste token from huggingface.co/settings/tokens
Output:
_| _| _| _| _|_|_| _|_|_|_|_| _|_|_|
...
Token has been saved to /home/alice/.cache/huggingface/token.
Login successful
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama-3.1-8b
Output:
Downloading config.json: 100% 614/614
Downloading model-00001-of-00004.safetensors: 100% 4.98G/4.98G
Downloading tokenizer.json: 100% 9.09M/9.09M
./llama-3.1-8b
from huggingface_hub import upload_folder, snapshot_download
# Pull a snapshot (resumable, cached)
local = snapshot_download(repo_id="BAAI/bge-small-en-v1.5", local_dir="./bge-small")
print(local)
# Push a folder
upload_folder(folder_path="./my-model", repo_id="alicedev/my-classifier", commit_message="initial")
Real-world recipes
Recipe: text-classification micro-service
from transformers import pipeline
from fastapi import FastAPI
clf = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    device=0,
    batch_size=16,
)
app = FastAPI()
@app.post("/classify")
def classify(payload: dict):
    return clf(payload["texts"], top_k=None)
uvicorn app:app --host 0.0.0.0 --port 8000
Output:
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Application startup complete.
Recipe: batch summarisation over a folder of PDFs
from transformers import pipeline
from pathlib import Path
import fitz # PyMuPDF
summariser = pipeline("summarization", model="facebook/bart-large-cnn", device=0, batch_size=4)
def pdf_text(path: Path) -> str:
    return "\n".join(page.get_text() for page in fitz.open(path))
pdfs = list(Path("./docs").glob("*.pdf"))
texts = [pdf_text(p)[:4000] for p in pdfs]
# truncation=True clips inputs that still exceed the model's token limit
summaries = summariser(texts, max_length=120, min_length=40, do_sample=False, truncation=True)
Path("./summaries").mkdir(exist_ok=True)  # write_text fails if the folder is missing
for p, s in zip(pdfs, summaries):
    Path(f"./summaries/{p.stem}.txt").write_text(s["summary_text"])
print(f"Wrote {len(pdfs)} summaries")
Output:
Wrote 42 summaries
Recipe: offline inference behind a corporate firewall
Download model weights ahead of time, then point Transformers at the local snapshot — no Hub calls at runtime.
HF_HUB_DOWNLOAD_TIMEOUT=120 huggingface-cli download intfloat/e5-small-v2 --local-dir ./e5-small-v2
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
Output:
Fetching 12 files: 100%|████████████████████████████████████████| 12/12
./e5-small-v2
from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("./e5-small-v2") # local path
model = AutoModel.from_pretrained("./e5-small-v2")
Recipe: extracting embeddings from any encoder
Pool hidden states from a base encoder for a custom semantic search index.
import torch, torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2").to("cuda").eval()
@torch.no_grad()
def embed(texts, prefix="passage: "):
    batch = tok([prefix + t for t in texts], padding=True, truncation=True,
                return_tensors="pt", max_length=512).to(model.device)
    out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1)  # mean pooling with attention mask
    return F.normalize(pooled, p=2, dim=1).cpu()
emb = embed(["Linux process control with systemd", "PyTorch optimisers"])
print(emb.shape, emb[0][:5])
Output:
torch.Size([2, 384]) tensor([ 0.0192, -0.0411, 0.0623, -0.0084, 0.0719])
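The pooling step can be checked by hand on a tiny example. This pure-Python sketch assumes invented 2-dimensional hidden states and shows masked mean pooling, L2 normalisation, and the cosine similarity a search index would compute.

```python
import math

def mean_pool(hidden, mask):
    """Average token vectors where mask == 1; padded positions are ignored."""
    dim = len(hidden[0])
    kept = [vec for vec, m in zip(hidden, mask) if m]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """For unit vectors the dot product equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

hidden = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]  # last row is a pad position
mask = [1, 1, 0]
pooled = mean_pool(hidden, mask)
unit = l2_normalise(pooled)
print(pooled, cosine(unit, unit))
```

Leaving the mask out would average in the pad vector and shift the embedding, which is why the attention mask is part of the pooling step.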
Recipe: gradient checkpointing for long-sequence fine-tuning
Trade compute for memory to fit larger batches or longer sequences on a single GPU.
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # recompute activations on backward pass
    optim="adamw_bnb_8bit",       # 8-bit optimiser (memory saver; needs bitsandbytes)
    bf16=True,
    num_train_epochs=1,
    logging_steps=20,
)
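With gradient accumulation, the batch size the optimiser effectively sees is the product of the per-device batch, the accumulation steps, and the device count. A short sketch of that arithmetic; the helper name is hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to each optimiser step."""
    return per_device * grad_accum_steps * num_devices

# Values from the config above on a single GPU, then on a hypothetical 4-GPU node.
print(effective_batch_size(2, 8))
print(effective_batch_size(2, 8, num_devices=4))
```

Keeping this product constant while lowering per_device_train_batch_size is the usual way to trade speed for memory without changing the optimisation behaviour much.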
Performance and reliability tips
- Set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id for decoder-only LMs before batched generation; missing this corrupts outputs silently.
- Always pass truncation=True with a max_length matched to the model's model_max_length. Otherwise long inputs raise errors mid-batch.
- Use from_pretrained(..., low_cpu_mem_usage=True, device_map="auto"); without it you load the full FP32 model into CPU before moving to GPU, which OOMs on large models.
- Pin model revisions in production: from_pretrained("name", revision="abc123"). Default main can change under you.
- Prefer safetensors checkpoints (faster, sandboxed) over .bin pickles.
- Set TRANSFORMERS_NO_ADVISORY_WARNINGS=1 and TOKENIZERS_PARALLELISM=false in production logs to cut noise.
- Cache the tokenizer once, not per call: tokenizer instantiation is slow due to vocab loading.
- For inference servers, use a single pipeline per worker process; pipelines hold model weights and aren't safely shareable across processes without the spawn start method.
Quick reference
| Task | Code |
|---|---|
| Quick inference | pipeline("task", model="checkpoint") |
| Batched pipeline | pipeline(..., batch_size=32) |
| Load tokenizer | AutoTokenizer.from_pretrained("name") |
| Load causal LM | AutoModelForCausalLM.from_pretrained("name") |
| Load seq2seq | AutoModelForSeq2SeqLM.from_pretrained("name") |
| Load classifier | AutoModelForSequenceClassification.from_pretrained("name", num_labels=N) |
| Tokenize | tokenizer(text, return_tensors="pt", padding=True, truncation=True) |
| Offset mapping | tokenizer(text, return_offsets_mapping=True) |
| GPU placement | from_pretrained(..., device_map="auto") |
| Flash attention | attn_implementation="flash_attention_2" |
| 4-bit load | from_pretrained(..., quantization_config=BitsAndBytesConfig(load_in_4bit=True)) |
| Compile | model = torch.compile(model, mode="reduce-overhead") |
| Generate text | model.generate(**inputs, max_new_tokens=100, do_sample=True) |
| Stream tokens | TextIteratorStreamer(tokenizer) + thread |
| Chat template | tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| Decode output | tokenizer.decode(out[0], skip_special_tokens=True) |
| Fine-tune | Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds).train() |
| LoRA fine-tune | get_peft_model(model, LoraConfig(...)) |
| Merge LoRA | merged = peft_model.merge_and_unload() |
| Data collator | DataCollatorWithPadding(tokenizer) |
| Datasets .map | ds.map(tokenize, batched=True, num_proc=4) |
| Streaming dataset | load_dataset(..., streaming=True).take(N) |
| Save | model.save_pretrained("./dir") |
| Push to Hub | model.push_to_hub("owner/name") |
| Snapshot download | snapshot_download(repo_id="name", local_dir="./dir") |
| HF login | huggingface-cli login |
| Offline mode | export TRANSFORMERS_OFFLINE=1 HF_HUB_OFFLINE=1 |