cheat sheet

transformers

Load and run pre-trained models for NLP, vision, and audio with the Hugging Face Transformers library. Covers pipelines, AutoModel, tokenisation, generation, fine-tuning, and device placement.

What it is

Hugging Face Transformers is the standard Python library for loading, running, and fine-tuning pre-trained neural networks — language models, image classifiers, speech recognisers, and more. It provides a unified API (pipeline, AutoModel, AutoTokenizer) that works across PyTorch, TensorFlow, and JAX backends. The Hugging Face Hub hosts over 900 000 public model checkpoints that can be downloaded with a single function call.

Install

bash
pip install transformers
pip install "transformers[torch]" # with PyTorch (most common); quotes keep shells such as zsh from expanding the brackets
pip install accelerate            # needed for device_map="auto" and multi-GPU

Output: (pip's install log; exits 0 on success)

Quick example

python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I absolutely loved this product — exceeded every expectation.")
print(result)

Output:

text
[{'label': 'POSITIVE', 'score': 0.9998}]

When / why to use it

  • Running any of the 900k+ models on Hugging Face Hub without writing model code.
  • Tasks: text classification, summarisation, translation, question answering, image classification, ASR, text-to-image, and more.
  • Fine-tuning a pre-trained checkpoint on your own dataset with Trainer.
  • Loading a model locally for offline or privacy-constrained inference.
  • Integrating models into LangChain, LlamaIndex, or custom pipelines.

Common pitfalls

pad_token not set for decoder-only models — GPT-style models have no pad token by default. When batching inputs, set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id before encoding.

return_tensors="pt" must match your backend — if you're using PyTorch, pass return_tensors="pt" to the tokenizer. Omitting it returns Python lists, which most model forward() calls reject.

Model card licences — many models (LLaMA, Gemma, Mistral) require accepting a licence on Hugging Face before downloading. Authenticate with huggingface-cli login and accept the licence on the model's Hub page first.

Large model downloads — a 7B-parameter model in float16 is ~14 GB. Use low_cpu_mem_usage=True and device_map="auto" (requires accelerate) to load directly to GPU without duplicating the model in CPU RAM.

Prefer safetensors format checkpoints when available — they load faster and are safer than .bin (pickle) files. Most recent Hub checkpoints include model.safetensors.
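The ~14 GB figure follows from parameter count times bytes per parameter. A rough sketch of that arithmetic, covering weights only (activations, KV cache, and framework overhead are ignored); the helper name and the bytes-per-parameter table are illustrative assumptions, not part of Transformers:

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "int8": 1.0, "4bit": 0.5}

def estimate_weight_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("float32", "float16", "4bit"):
    print(f"7B parameters in {dtype}: ~{estimate_weight_gb(7e9, dtype):.1f} GB")
```

Quantised checkpoints carry some extra metadata, so real 4-bit footprints come out somewhat above this lower bound.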

Richer example — text summarisation pipeline

python
from transformers import pipeline

summariser = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0,       # GPU 0; use -1 or omit for CPU
)

article = """
Scientists at the Global Climate Institute have published findings showing
that ocean temperatures in the northern Atlantic rose by an average of
1.4 degrees Celsius over the past decade, the largest recorded increase
in over a century of measurements. Researchers attribute the change to
a combination of greenhouse gas accumulation and shifting ocean currents.
The study calls for immediate international policy responses.
"""

result = summariser(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])

Output:

text
Ocean temperatures in the northern Atlantic rose by 1.4°C over the past decade.
Scientists call for immediate international policy responses to the findings.

pipeline() — zero-code inference

pipeline is the fastest path from model name to prediction. It handles tokenisation, model loading, device placement, and post-processing automatically. Specify device=0 for CUDA GPU, device="mps" for Apple Silicon, or device=-1 / omit for CPU.

python
from transformers import pipeline

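# Named-entity recognition; aggregation_strategy merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")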
print(ner("Alice Dev works at Acme Corp in London."))

Output:

text
[{'entity_group': 'PER', 'score': 0.998, 'word': 'Alice Dev', 'start': 0, 'end': 9},
 {'entity_group': 'ORG', 'score': 0.994, 'word': 'Acme Corp', 'start': 19, 'end': 28},
 {'entity_group': 'LOC', 'score': 0.999, 'word': 'London', 'start': 32, 'end': 38}]

python
# Zero-shot classification — no task-specific training needed
zsc = pipeline("zero-shot-classification")
result = zsc(
    "This quarter we shipped 14 new features and fixed 32 bugs.",
    candidate_labels=["product update", "financial report", "sports"],
)
print(result["labels"][0], f"{result['scores'][0]:.2%}")

Output:

text
product update 96.74%

python
# Automatic speech recognition
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
result = asr("audio.wav")
print(result["text"])

Output:

text
 The quick brown fox jumps over the lazy dog.

Common task strings: "text-classification", "token-classification", "question-answering", "summarization", "translation_en_to_fr", "text-generation", "fill-mask", "image-classification", "object-detection", "automatic-speech-recognition".

AutoTokenizer and AutoModel

AutoTokenizer and AutoModel* load the correct class for any checkpoint automatically, making code checkpoint-agnostic. The * in AutoModel* corresponds to task-specific heads: AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, etc.

python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texts = ["I loved every moment.", "Worst experience of my life."]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
labels = model.config.id2label
for text, prob in zip(texts, probs):
    pred = labels[prob.argmax().item()]
    print(f"{pred}: {prob.max():.2%}  — {text}")

Output:

text
POSITIVE: 99.98%  — I loved every moment.
NEGATIVE: 99.94%  — Worst experience of my life.
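The step from logits to a labelled probability is just a softmax followed by an argmax. A dependency-free sketch of that step, using made-up logits and a hypothetical id2label mapping rather than real model output:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}   # hypothetical mapping for illustration
logits = [-4.2, 4.4]                        # made-up logits, not real model output

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the largest probability
print(id2label[best], f"{probs[best]:.2%}")
```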

Device placement — CPU, CUDA, MPS

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Single GPU — manual
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)

# Multi-GPU / CPU offload — requires accelerate
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",          # distributes layers across available devices
    low_cpu_mem_usage=True,     # never loads full model into CPU RAM
)

# Apple Silicon MPS
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="mps",
)

# 4-bit quantisation — requires bitsandbytes
from transformers import BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb)

generate() — text generation parameters

generate() is the method behind all decoder-based text generation. Key parameters control length, randomness, and search strategy.

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The future of renewable energy is"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy (deterministic)
out = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Output:

text
The future of renewable energy is bright. The world is moving toward a clean energy economy...

python
# Sampling — creative, varied
out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
)

# Beam search: deterministic; tracks several candidate sequences and favours high-likelihood output over variety
out = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

Parameter             Effect
max_new_tokens        Hard cap on the number of tokens generated
do_sample=True        Enable sampling (off = greedy or beam search)
temperature           Randomness when sampling: values below 1 sharpen the distribution, values above 1 flatten it; must be > 0
top_p                 Nucleus sampling: keep the smallest set of tokens whose cumulative probability reaches top_p (a fraction, e.g. 0.92)
top_k                 Keep only the k most likely tokens at each step
num_beams             Beam search width (1 = no beam search)
repetition_penalty    Penalise repeated tokens (1.0 = no penalty)
no_repeat_ngram_size  Block repeating n-grams of this length
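To make temperature, top_k, and top_p concrete, here is a toy sketch that applies each one to a made-up four-token distribution. It illustrates the ideas only; the function names are hypothetical and generate() implements these filters differently in detail.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax after dividing logits by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely token ids and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely ids whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

logits = [2.0, 1.0, 0.5, -1.0]            # made-up next-token logits
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # low temperature concentrates mass on the top token
flat = softmax(logits, temperature=2.0)   # high temperature spreads mass out

print("top token prob:", round(max(sharp), 3), "vs", round(max(flat), 3))
print("top_k=2 keeps ids:", sorted(top_k_filter(probs, 2)))
print("top_p=0.9 keeps ids:", sorted(top_p_filter(probs, 0.9)))
```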

Chat templates — apply_chat_template

Instruction-tuned models expect input in a specific format (system/user/assistant turns). apply_chat_template encodes a list of message dicts into the model's correct prompt format.

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
    {"role": "user",   "content": "What is a list comprehension?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=120, do_sample=False)

response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Output:

text
A list comprehension is a concise way to create a list:
  squares = [x**2 for x in range(10)]
It can include a filter condition:
  evens = [x for x in range(20) if x % 2 == 0]
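Conceptually, a chat template is a function from a list of role/content dicts to one prompt string. The sketch below uses invented <|role|> tags to show the shape of that transformation and the effect of add_generation_prompt; real templates are model-specific, so always use the tokenizer's own.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages with made-up <|role|> tags (not any real model's template)."""
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")   # cue the model to begin its reply
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
    {"role": "user", "content": "What is a list comprehension?"},
]
print(render_chat(messages))
```

Without the generation prompt the string ends after the user turn, and a model may continue the user's message instead of answering it.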

Trainer — fine-tuning

Trainer handles the training loop, evaluation, checkpointing, and logging. Provide a TrainingArguments config, a Dataset object, and the model.

python
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",          # called evaluation_strategy in releases before v4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=50,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
    compute_metrics=compute_metrics,
)
trainer.train()

Saving and loading checkpoints

python
# Save model + tokenizer
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load later
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")

# Push to Hub
model.push_to_hub("alicedev/my-sentiment-classifier")
tokenizer.push_to_hub("alicedev/my-sentiment-classifier")

AutoModel head families

The AutoModelFor* class controls which task head is attached on top of the base transformer. Pick the head that matches the supervised task — using the wrong class either fails to load or gives the wrong output shape.

Class                               Task                                       Output
AutoModel                           Base model, no task head                   hidden states
AutoModelForCausalLM                Left-to-right text generation              next-token logits
AutoModelForSeq2SeqLM               Encoder-decoder (T5, BART)                 decoder logits
AutoModelForMaskedLM                BERT-style masked-token prediction         masked-position logits
AutoModelForSequenceClassification  Single-label / multi-label classification  logits per class
AutoModelForTokenClassification     NER, POS tagging                           logits per token per class
AutoModelForQuestionAnswering       Extractive QA                              start/end span logits
AutoModelForImageClassification     Vision classification                      logits per class
AutoModelForObjectDetection         DETR-style detection                       boxes + class logits
AutoModelForSpeechSeq2Seq           Whisper-style ASR                          transcript token logits
AutoModelForVision2Seq              Image → text (BLIP, LLaVA)                 caption / answer logits

python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

model_name = "deepset/roberta-base-squad2"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForQuestionAnswering.from_pretrained(model_name)

context  = "The transformer architecture was introduced in 2017 by Vaswani et al."
question = "When was the transformer introduced?"

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax()
end   = out.end_logits.argmax() + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end]).strip()  # RoBERTa tokens keep a leading space
print(answer)

Output:

text
2017
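Taking the two argmaxes independently can produce an end position before the start. A common remedy, sketched here on made-up logits, is to score every valid (start, end) pair and keep the best; the helper name and the max_answer_len default are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end), end inclusive, maximising start + end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up logits where independent argmax gives start=2, end=1 (an invalid span).
start_logits = [0.1, 0.2, 3.0, 0.5]
end_logits = [0.3, 4.0, 0.1, 2.5]
print(best_span(start_logits, end_logits))
```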

Tokenisers — fast vs slow, special tokens, offsets

Hugging Face tokenisers come in two flavours: "slow" (pure Python) and "fast" (Rust-backed via the tokenizers crate). Always prefer the fast version — it supports offset mapping for token-to-character alignment, batched encoding, and significantly higher throughput.

python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tok).__name__, "  is_fast:", tok.is_fast)

# Special tokens
print("CLS:", tok.cls_token, tok.cls_token_id)
print("SEP:", tok.sep_token, tok.sep_token_id)
print("PAD:", tok.pad_token, tok.pad_token_id)

# Encoding with offset mapping
out = tok(
    "Alice Dev visited London.",
    return_tensors="pt",
    return_offsets_mapping=True,
    return_attention_mask=True,
)
for tid, (start, end) in zip(out["input_ids"][0], out["offset_mapping"][0]):
    print(f"  {tok.decode([tid]):>12s}  chars[{start.item()}:{end.item()}]")

Output:

text
BertTokenizerFast   is_fast: True
CLS: [CLS] 101
SEP: [SEP] 102
PAD: [PAD] 0
       [CLS]  chars[0:0]
       alice  chars[0:5]
         dev  chars[6:9]
     visited  chars[10:17]
      london  chars[18:24]
           .  chars[24:25]
       [SEP]  chars[0:0]
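Offsets are what let you recover original-case text for a token span, for example an entity predicted over lowercased tokens. A small sketch using the offsets printed above; the helper name is hypothetical.

```python
text = "Alice Dev visited London."
# (token, (start, end)) pairs as reported above; special tokens map to (0, 0).
tokens = [("[CLS]", (0, 0)), ("alice", (0, 5)), ("dev", (6, 9)),
          ("visited", (10, 17)), ("london", (18, 24)), (".", (24, 25)), ("[SEP]", (0, 0))]
offsets = [off for _, off in tokens]

def span_text(text, offsets, first, last):
    """Original substring covered by tokens first..last (inclusive)."""
    return text[offsets[first][0]:offsets[last][1]]

print(span_text(text, offsets, 1, 2))   # tokens "alice", "dev"
print(span_text(text, offsets, 4, 4))   # token "london"
```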

Padding, truncation, and attention masks

When batching, all sequences in a batch must be the same length. padding=True pads to the longest sequence in the batch; padding="max_length" pads to a fixed max_length. Always pair padding with the resulting attention mask so the model ignores pad tokens.

python
from transformers import AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

texts = ["Short.", "A medium-length sentence about transformers."]

# Pad to longest in batch (most efficient)
out = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=64)
print(out["input_ids"].shape, out["attention_mask"].shape)
print(out["attention_mask"][0])

# Truncate long inputs from the left or right
# (truncation_side is a tokenizer attribute, not a call argument)
long_text = "word " * 1000
tok.truncation_side = "left"
out = tok(long_text, return_tensors="pt", truncation=True, max_length=32)
print(out["input_ids"].shape)

Output:

text
torch.Size([2, 11]) torch.Size([2, 11])
tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
torch.Size([1, 32])
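What the tokenizer does when padding and truncating can be sketched in a few lines of plain Python. This is a simplified illustration on made-up token ids: real tokenizers also preserve special tokens when truncating, and the function names here are hypothetical.

```python
def truncate(ids, max_length, side="right"):
    """Drop tokens beyond max_length from the chosen side."""
    if len(ids) <= max_length:
        return list(ids)
    return list(ids[:max_length]) if side == "right" else list(ids[-max_length:])

def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

batch = [[101, 7, 102], [101, 5, 6, 7, 8, 102]]   # made-up token ids
ids, mask = pad_batch(batch)
print(ids[0], mask[0])
print(truncate(list(range(10)), 4, side="left"))
```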

datasets — loading & mapping at scale

The datasets library (separate package) is the canonical way to feed data to Transformers. It supports streaming from the Hub, memory-mapped arrays for huge corpora, and .map() for parallel preprocessing.

bash
pip install datasets

Output: (pip's install log; exits 0 on success)

python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load from the Hub
ds = load_dataset("imdb", split="train")
print(ds)

# Streaming for huge datasets that don't fit in memory
big = load_dataset("allenai/c4", "en", split="train", streaming=True)   # C4 lives under the allenai namespace on the Hub
for example in big.take(2):
    print(example["text"][:50])

# Parallel preprocessing
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(batch):
    return tok(batch["text"], truncation=True, padding=False, max_length=256)

ds_tok = ds.map(tokenize, batched=True, num_proc=4)
ds_tok.set_format("torch", columns=["input_ids", "attention_mask", "label"])
print(ds_tok[0]["input_ids"][:10])

Output:

text
Dataset({features: ['text', 'label'], num_rows: 25000})
Beginners tutorial: read this if you want to underst
A new feature for clouds: real-time processing of la
tensor([ 101, 1996, 2143, 2059, 1011, 2918, 1996, 1009, 2003, 2025])

Pipeline batching

Pipelines accept batch_size and an iterable of inputs for efficient inference. Avoid Python loops over the pipeline — they re-pad and re-call per element.

python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
    batch_size=32,
)

texts = ["Great product."] * 1000 + ["Terrible."] * 1000
# Iterating yields one prediction at a time but batches internally
for i, result in enumerate(classifier(texts)):
    if i < 3:
        print(result)

Output:

text
{'label': 'POSITIVE', 'score': 0.9999}
{'label': 'POSITIVE', 'score': 0.9999}
{'label': 'POSITIVE', 'score': 0.9999}

torch.compile and Flash Attention

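For inference on Ampere or newer GPUs, torch.compile combined with Flash Attention 2 can cut latency by roughly 30–60%, depending on the model, sequence length, and batch size; measure on your own workload before relying on a figure.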

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",   # requires `pip install flash-attn`
)
model = torch.compile(model, mode="reduce-overhead")

inputs = tok("The transformer architecture", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

Output:

text
The transformer architecture replaces recurrence with self-attention, allowing
parallel processing of sequences and longer-range dependencies than RNNs...

Quantisation — bitsandbytes, GPTQ, AWQ

Quantisation reduces memory footprint at small (or sometimes negligible) accuracy cost. The cheapest path is BitsAndBytesConfig; for inference-only deployments use a pre-quantised GPTQ or AWQ checkpoint.

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
import torch

# 4-bit NF4 with double-quantisation (best memory/accuracy trade-off)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(f"Footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

Output:

text
Footprint: 5.83 GB

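Pre-quantised checkpoints exist for many models (look for -AWQ / -GPTQ / -bnb-4bit suffixes on the Hub). They load with from_pretrained without an extra quantization_config, because the settings are stored in the checkpoint's config, but the matching backend package (for example bitsandbytes for -bnb-4bit checkpoints) must still be installed.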

PEFT and LoRA fine-tuning

For most practical fine-tuning, full-parameter training is overkill. PEFT (Parameter-Efficient Fine-Tuning) — particularly LoRA — trains a small set of adapter weights while freezing the base model.

bash
pip install peft trl accelerate bitsandbytes

Output: (pip's install log; exits 0 on success)

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
tok   = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

ds = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")
args = TrainingArguments(output_dir="./lora-out", num_train_epochs=1,
                         per_device_train_batch_size=4, gradient_accumulation_steps=4,
                         learning_rate=2e-4, fp16=False, bf16=True, logging_steps=20,
                         save_strategy="epoch")
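# These keyword arguments follow older TRL releases; newer ones expect
# dataset_text_field and the sequence-length limit in an SFTConfig and take
# processing_class instead of tokenizer. Check the TRL version you have installed.
trainer = SFTTrainer(model=model, args=args, train_dataset=ds, tokenizer=tok,
                     dataset_text_field="instruction", max_seq_length=512)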
trainer.train()
trainer.model.save_pretrained("./lora-out/final")

Output:

text
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
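The trainable-parameter count can be checked by hand: each adapted linear layer gains two low-rank matrices, A (r × d_in) and B (d_out × r). A sketch of the arithmetic, assuming the usual shapes of an 8B Llama-style model (hidden size 4096, 1024-wide key/value projections from grouped-query attention, 32 decoder layers); those shapes are assumptions stated here, not read from the checkpoint.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """One adapted layer adds r * d_in (matrix A) + d_out * r (matrix B) weights."""
    return r * (d_in + d_out)

# Assumed shapes for an 8B Llama-style model.
HIDDEN, KV, LAYERS = 4096, 1024, 32
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
}

def total_trainable(r: int) -> int:
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
    return per_layer * LAYERS

for r in (8, 16):
    print(f"r={r}: {total_trainable(r):,} trainable parameters")
```

The count scales linearly with r, so doubling the rank doubles the adapter size.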

Loading and merging LoRA adapters

python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base, "./lora-out/final")

# Optionally merge LoRA into base weights for export
merged = model.merge_and_unload()
merged.save_pretrained("./llama3-merged")
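Merging folds the adapter into the base weight as W + (alpha / r) · B · A, after which the adapter can be discarded. A toy pure-Python check on 2×2 matrices that the merged weight gives the same output as base plus adapter; all values are made up and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A as a dense matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # toy base weight (d_out x d_in)
A = [[0.5, -0.5]]              # r x d_in, with r = 1
B = [[1.0], [2.0]]             # d_out x r
alpha, r = 2, 1
x = [3.0, 1.0]

merged = merge_lora(W, A, B, alpha, r)
adapter_out = [alpha / r * v for v in matvec(B, matvec(A, x))]
unmerged = [w + a for w, a in zip(matvec(W, x), adapter_out)]
print(matvec(merged, x), unmerged)
```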

Datasets, collators, dynamic padding

A data collator forms batches and pads on the fly. DataCollatorWithPadding is the right default for classification; DataCollatorForLanguageModeling applies random masking for MLM (or, with mlm=False, builds causal-LM labels); DataCollatorForSeq2Seq also pads the labels, using -100 so padded positions are ignored by the loss.

python
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tok, padding="longest", return_tensors="pt")

# ds_tok is the tokenised IMDB split built with ds.map(...) in the datasets section
loader = DataLoader(ds_tok, batch_size=16, collate_fn=collator, shuffle=True)
batch = next(iter(loader))
print({k: v.shape for k, v in batch.items()})

Output:

text
{'input_ids': torch.Size([16, 142]), 'attention_mask': torch.Size([16, 142]), 'labels': torch.Size([16])}
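The saving from dynamic padding is easy to quantify: count the token slots each strategy allocates. A sketch on made-up sequence lengths; the function name is hypothetical.

```python
def padded_tokens(lengths, batch_size, fixed_length=None):
    """Total token slots after padding each batch to fixed_length or to its own maximum."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        width = fixed_length if fixed_length is not None else max(chunk)
        total += width * len(chunk)
    return total

lengths = [12, 40, 33, 256, 18, 25, 90, 14]   # made-up sequence lengths
fixed = padded_tokens(lengths, batch_size=4, fixed_length=256)
dynamic = padded_tokens(lengths, batch_size=4)
print(f"fixed: {fixed} slots, dynamic: {dynamic} slots, real tokens: {sum(lengths)}")
```

One long sequence still forces its whole batch to that width, which is why grouping examples of similar length helps further.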

Evaluation with evaluate

The evaluate library ships standard metrics that align with Trainer's compute_metrics signature.

bash
pip install evaluate

Output: (pip's install log; exits 0 on success)

python
import evaluate, numpy as np

accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        **accuracy.compute(predictions=preds, references=labels),
        **f1.compute(predictions=preds, references=labels, average="macro"),
    }
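For reference, this is what accuracy and macro-averaged F1 compute, written out in plain Python on made-up labels. It is a sketch of the definitions, not the evaluate library's implementation.

```python
def accuracy(preds, refs):
    """Fraction of predictions that equal the reference label."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def macro_f1(preds, refs):
    """Unweighted mean of per-class F1 over every class seen in preds or refs."""
    scores = []
    for c in sorted(set(refs) | set(preds)):
        tp = sum(p == c and r == c for p, r in zip(preds, refs))
        fp = sum(p == c and r != c for p, r in zip(preds, refs))
        fn = sum(p != c and r == c for p, r in zip(preds, refs))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

refs = [1, 1, 1, 0, 0, 1]    # made-up gold labels
preds = [1, 1, 0, 0, 1, 1]   # made-up predictions
print(f"accuracy={accuracy(preds, refs):.3f}  macro_f1={macro_f1(preds, refs):.3f}")
```

Macro averaging weights every class equally, so it drops faster than accuracy when a minority class is predicted poorly.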

Inference optimisations — KV cache, batching, beam vs sample

For generation, the most impactful knobs are: KV cache (on by default), batch size, beam width vs sampling, and quantisation. Streaming with TextIteratorStreamer makes long generations interactive.

python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread
import torch

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)

inputs = tok("The future of AI is", return_tensors="pt").to(model.device)
kwargs = dict(**inputs, streamer=streamer, max_new_tokens=80, do_sample=True, temperature=0.8)
Thread(target=model.generate, kwargs=kwargs).start()
for text in streamer:
    print(text, end="", flush=True)

Output:

text
 cooperative — a future in which intelligent systems work alongside humans rather than displacing them...
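The streaming pattern above is a producer thread feeding a queue that the main thread iterates over. A stdlib-only sketch of that pattern, with a stand-in for generate(); ToyStreamer and fake_generate are hypothetical names and do not mirror the real class's internals.

```python
import queue
import threading

class ToyStreamer:
    """Iterator fed by a producer thread through a queue; None marks the end."""

    def __init__(self):
        self._q = queue.Queue()

    def put(self, text):
        self._q.put(text)

    def end(self):
        self._q.put(None)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._q.get()          # blocks until the producer supplies text
        if item is None:
            raise StopIteration
        return item

def fake_generate(streamer, pieces):
    """Stand-in for a blocking generate() call that emits text piece by piece."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()

streamer = ToyStreamer()
worker = threading.Thread(target=fake_generate,
                          args=(streamer, ["The ", "future ", "is ", "streamed."]))
worker.start()
chunks = list(streamer)               # consume pieces as they arrive
worker.join()
print("".join(chunks))
```

Joining the worker thread after the loop, as here, is a sensible addition to the real pattern too, so errors in generation are not silently lost.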

Vision — image classification and object detection

python
from transformers import pipeline
from PIL import Image

# Classification
clf = pipeline("image-classification", model="google/vit-base-patch16-224")
img = Image.open("cat.jpg")
print(clf(img)[:3])

# Object detection
det = pipeline("object-detection", model="facebook/detr-resnet-50")
results = det(img)
for r in results[:3]:
    print(f"  {r['label']:15s}  score={r['score']:.2f}  box={r['box']}")

Output:

text
[{'label': 'Egyptian cat', 'score': 0.78}, {'label': 'tabby, tabby cat', 'score': 0.12}, {'label': 'tiger cat', 'score': 0.06}]
  cat            score=0.99  box={'xmin': 30, 'ymin': 18, 'xmax': 470, 'ymax': 380}
  remote         score=0.97  box={'xmin': 510, 'ymin': 70, 'xmax': 620, 'ymax': 175}
  couch          score=0.94  box={'xmin': 0,  'ymin': 200,'xmax': 640, 'ymax': 480}

Audio — ASR with Whisper

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
    return_timestamps=True,
    device=0,
)
out = asr("interview.wav")
print(out["text"][:120])
for chunk in out["chunks"][:3]:
    start, end = chunk["timestamp"]
    print(f"  [{start:6.2f}-{end:6.2f}]  {chunk['text']}")

Hugging Face Hub — auth, downloads, uploads

bash
huggingface-cli login                  # paste token from huggingface.co/settings/tokens

Output:

text
    _|    _|  _|    _|    _|_|_|    _|_|_|_|_|    _|_|_|
    ...
Token has been saved to /home/alice/.cache/huggingface/token.
Login successful

bash
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama-3.1-8b

Output:

text
Downloading config.json: 100% 614/614
Downloading model-00001-of-00004.safetensors: 100% 4.98G/4.98G
Downloading tokenizer.json: 100% 9.09M/9.09M
./llama-3.1-8b

python
from huggingface_hub import upload_folder, snapshot_download

# Pull a snapshot (resumable, cached)
local = snapshot_download(repo_id="BAAI/bge-small-en-v1.5", local_dir="./bge-small")
print(local)

# Push a folder
upload_folder(folder_path="./my-model", repo_id="alicedev/my-classifier", commit_message="initial")

Real-world recipes

Recipe: text-classification micro-service

python
from transformers import pipeline
from fastapi import FastAPI

clf = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    device=0,
    batch_size=16,
)

app = FastAPI()

@app.post("/classify")
def classify(payload: dict):
    return clf(payload["texts"], top_k=None)

bash
uvicorn app:app --host 0.0.0.0 --port 8000

Output:

text
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Application startup complete.

Recipe: batch summarisation over a folder of PDFs

python
from transformers import pipeline
from pathlib import Path
import fitz       # PyMuPDF

summariser = pipeline("summarization", model="facebook/bart-large-cnn", device=0, batch_size=4)

def pdf_text(path: Path) -> str:
    return "\n".join(page.get_text() for page in fitz.open(path))

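pdfs = list(Path("./docs").glob("*.pdf"))
texts = [pdf_text(p)[:4000] for p in pdfs]
# truncation=True guards against inputs longer than the model's maximum length
summaries = summariser(texts, max_length=120, min_length=40, do_sample=False, truncation=True)

out_dir = Path("./summaries")
out_dir.mkdir(parents=True, exist_ok=True)   # write_text fails if the folder is missing
for p, s in zip(pdfs, summaries):
    (out_dir / f"{p.stem}.txt").write_text(s["summary_text"])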
print(f"Wrote {len(pdfs)} summaries")

Output:

text
Wrote 42 summaries

Recipe: offline inference behind a corporate firewall

Download model weights ahead of time, then point Transformers at the local snapshot — no Hub calls at runtime.

bash
HF_HUB_DOWNLOAD_TIMEOUT=120 huggingface-cli download intfloat/e5-small-v2 --local-dir ./e5-small-v2
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

Output:

text
Fetching 12 files: 100%|████████████████████████████████████████| 12/12
./e5-small-v2

python
from transformers import AutoTokenizer, AutoModel
tok   = AutoTokenizer.from_pretrained("./e5-small-v2")     # local path
model = AutoModel.from_pretrained("./e5-small-v2")

Recipe: extracting embeddings from any encoder

Pool hidden states from a base encoder for a custom semantic search index.

python
import torch, torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok   = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2").to("cuda").eval()

@torch.no_grad()
def embed(texts, prefix="passage: "):
    batch = tok([prefix + t for t in texts], padding=True, truncation=True,
                return_tensors="pt", max_length=512).to(model.device)
    out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1)        # mean pooling with attention mask
    return F.normalize(pooled, p=2, dim=1).cpu()

emb = embed(["Linux process control with systemd", "PyTorch optimisers"])
print(emb.shape, emb[0][:5])

Output:

text
torch.Size([2, 384]) tensor([ 0.0192, -0.0411,  0.0623, -0.0084,  0.0719])
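The pooling and normalisation steps can be followed on a toy example without a model. The sketch below uses made-up two-dimensional token vectors; the function names are hypothetical.

```python
import math

def mean_pool(hidden, mask):
    """Average token vectors where mask == 1, ignoring padding positions."""
    kept = [vec for vec, m in zip(hidden, mask) if m]
    dim = len(hidden[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine(a, b):
    """For unit-length vectors the dot product equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

hidden = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]   # made-up token vectors; the last is padding
mask = [1, 1, 0]

pooled = mean_pool(hidden, mask)
emb = l2_normalise(pooled)
print(pooled, emb, cosine(emb, emb))
```

Because the vectors are normalised, a semantic search index can rank by plain dot product.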

Recipe: gradient checkpointing for long-sequence fine-tuning

Trade compute for memory to fit larger batches or longer sequences on a single GPU.

python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,           # recompute activations on backward pass
    optim="adamw_bnb_8bit",                # 8-bit optimiser (memory saver)
    bf16=True,
    num_train_epochs=1,
    logging_steps=20,
)
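With gradient accumulation, the batch the optimiser actually sees is larger than per_device_train_batch_size. A sketch of the arithmetic for the settings above; the 2,000-example dataset size is an assumption for illustration, and Trainer's own step count can differ slightly depending on how it handles the final partial batch.

```python
import math

def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Examples contributing to each optimiser step."""
    return per_device * accumulation_steps * num_devices

def optimiser_steps_per_epoch(num_examples, per_device, accumulation_steps, num_devices=1):
    """Approximate number of optimiser updates in one pass over the data."""
    batch = effective_batch_size(per_device, accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)

print(effective_batch_size(per_device=2, accumulation_steps=8))
print(optimiser_steps_per_epoch(2000, per_device=2, accumulation_steps=8))
```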

Performance and reliability tips

  • Set tokenizer.pad_token = tokenizer.eos_token and model.config.pad_token_id = tokenizer.eos_token_id for decoder-only LMs before batched generation — missing this corrupts outputs silently.
  • Always pass truncation=True with a max_length matched to the model's model_max_length. Otherwise long inputs raise errors mid-batch.
  • Use from_pretrained(..., low_cpu_mem_usage=True, device_map="auto") — without it you load the full FP32 model into CPU before moving to GPU, which OOMs on large models.
  • Pin model revisions in production: from_pretrained("name", revision="abc123"). Default main can change under you.
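  • Prefer safetensors checkpoints over .bin pickles: they load faster, and loading them cannot execute arbitrary code.
  • Set TRANSFORMERS_NO_ADVISORY_WARNINGS=1 to cut log noise in production, and TOKENIZERS_PARALLELISM=false to avoid the fork warning that fast tokenisers emit when worker processes are forked.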
  • Cache the tokenizer once, not per call: tokenizer instantiation is slow due to vocab loading.
  • For inference servers, use a single pipeline per worker process — pipelines hold model weights and aren't safely shareable across processes without spawn start method.

Quick reference

Task               Code
Quick inference    pipeline("task", model="checkpoint")
Batched pipeline   pipeline(..., batch_size=32)
Load tokenizer     AutoTokenizer.from_pretrained("name")
Load causal LM     AutoModelForCausalLM.from_pretrained("name")
Load seq2seq       AutoModelForSeq2SeqLM.from_pretrained("name")
Load classifier    AutoModelForSequenceClassification.from_pretrained("name", num_labels=N)
Tokenize           tokenizer(text, return_tensors="pt", padding=True, truncation=True)
Offset mapping     tokenizer(text, return_offsets_mapping=True)
GPU placement      from_pretrained(..., device_map="auto")
Flash attention    attn_implementation="flash_attention_2"
4-bit load         from_pretrained(..., quantization_config=BitsAndBytesConfig(load_in_4bit=True))
Compile            model = torch.compile(model, mode="reduce-overhead")
Generate text      model.generate(**inputs, max_new_tokens=100, do_sample=True)
Stream tokens      TextIteratorStreamer(tokenizer) + thread
Chat template      tokenizer.apply_chat_template(messages, tokenize=False)
Decode output      tokenizer.decode(out[0], skip_special_tokens=True)
Fine-tune          Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds).train()
LoRA fine-tune     get_peft_model(model, LoraConfig(...))
Merge LoRA         merged = peft_model.merge_and_unload()
Data collator      DataCollatorWithPadding(tokenizer)
Datasets .map      ds.map(tokenize, batched=True, num_proc=4)
Streaming dataset  load_dataset(..., streaming=True).take(N)
Save               model.save_pretrained("./dir")
Push to Hub        model.push_to_hub("owner/name")
Snapshot download  snapshot_download(repo_id="name", local_dir="./dir")
HF login           huggingface-cli login
Offline mode       export TRANSFORMERS_OFFLINE=1 HF_HUB_OFFLINE=1