Sentence Transformers
Comprehensive reference for the sentence-transformers Python library — embeddings, similarity, clustering, retrieval, fine-tuning, and popular models (BGE, E5, GTE, Nomic, Jina).
Sentence Transformers — Embeddings, Search & Fine-Tuning
sentence-transformers is the de facto Python library for computing dense vector embeddings from text (and images). It wraps Hugging Face Transformers and exposes a clean API for semantic similarity, search, clustering, reranking, and fine-tuning.
- GitHub: UKPLab/sentence-transformers
- Docs: sbert.net
- PyPI:
pip install sentence-transformers
What it is
Sentence Transformers (SBERT) is a Python library built on top of Hugging Face Transformers that turns sentences, paragraphs, or images into fixed-length dense vectors suitable for similarity, search, clustering, and reranking. Reach for it whenever you need semantic comparisons of text — RAG retrieval, deduplication, recommendation, classification with frozen embeddings, or fine-tuned domain-specific encoders — without writing custom training loops.
Why sentence-transformers?
| Feature | What it means in practice |
|---|---|
| Bi-encoder architecture | Single forward pass per sentence → fast, scalable to millions of docs |
| Pre-trained SBERT models | High-quality out-of-the-box without any fine-tuning |
| Cross-encoder support | Accurate reranking on top of bi-encoder retrieval |
| Hugging Face Hub integration | One-line model loading from 7 000+ compatible checkpoints |
| Multi-modal support | Some models encode both text and images in the same space |
| Built-in fine-tuning API | Contrastive, cosine, triplet, and MSE loss trainers included |
| Batch & device awareness | Automatic GPU/MPS/CPU dispatch; inputs are tokenized and encoded in batches |
| Dimensionality flexibility | Matryoshka models let you truncate embeddings without retraining |
Installation
# Minimal
pip install sentence-transformers
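# With GPU (CUDA 12.1): install the CUDA build of torch first, then the library
# (--index-url replaces PyPI, so sentence-transformers cannot be resolved in the same command)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install sentence-transformers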
# With FAISS for large-scale ANN search
pip install sentence-transformers faiss-gpu # GPU
pip install sentence-transformers faiss-cpu # CPU-only
Output:
Successfully installed sentence-transformers-3.0.1 torch-2.3.0 transformers-4.41.2 huggingface-hub-0.23.2 tokenizers-0.19.1
Model selection guide
Pick by use-case, not just benchmark score.
| Use case | Recommended model | Dim | Notes |
|---|---|---|---|
| General English semantic similarity | all-MiniLM-L6-v2 | 384 | Fast, very popular baseline |
| Best English quality | all-mpnet-base-v2 | 768 | ~5× slower than MiniLM, better accuracy |
| Best retrieval accuracy (English) | BAAI/bge-large-en-v1.5 | 1024 | Top MTEB English; requires query prefix |
| Instruction-tuned retrieval | intfloat/e5-large-v2 | 1024 | Requires query: / passage: prefixes |
| Multilingual + hybrid retrieval | BAAI/bge-m3 | 1024 | 100+ languages; dense + sparse + ColBERT |
| Multilingual (50+ languages) | paraphrase-multilingual-mpnet-base-v2 | 768 | Good cross-lingual retrieval |
| Multilingual instruction-tuned | intfloat/multilingual-e5-large-instruct | 1024 | 100+ languages, task-prefix format |
| Long documents (8 k tokens) | nomic-ai/nomic-embed-text-v1.5 | 768 | Rotary position embeddings; Apache-2 |
| Long context + task prefixes | jinaai/jina-embeddings-v3 | 1024 | 8 k tokens, 89 languages, Matryoshka |
| Matryoshka / truncatable | mixedbread-ai/mxbai-embed-large-v1 | 1024→64 | Truncate to any dim at query time |
| Highest accuracy (LLM-backed) | Alibaba-NLP/gte-Qwen2-7B-instruct | 3584 | ~14 GB VRAM; near-SOTA on MTEB |
| Code + text | flax-sentence-embeddings/st-codesearch-distilroberta-base | 768 | Semantic code search |
| Text + image | clip-ViT-B-32 | 512 | CLIP multi-modal space |
| Reranking (cross-encoder, English) | cross-encoder/ms-marco-MiniLM-L-6-v2 | scalar | Use on top of bi-encoder results |
| Reranking (multilingual) | jinaai/jina-reranker-v2-base-multilingual | scalar | 100+ languages cross-encoder reranker |
Model prefix requirements — quick reference
Several high-quality models require task-specific text prefixes on queries or passages. Using them incorrectly silently degrades retrieval quality.
| Model | Query prefix | Passage prefix |
|---|---|---|
| all-MiniLM-L6-v2 | none | none |
| all-mpnet-base-v2 | none | none |
| BAAI/bge-*-en-v1.5 | "Represent this sentence for searching relevant passages: " | none |
| BAAI/bge-m3 | none | none |
| intfloat/e5-*-v2 | "query: " | "passage: " |
| intfloat/multilingual-e5-large-instruct | "Instruct: {task}\nQuery: " | none |
| Alibaba-NLP/gte-large-en-v1.5 | none | none |
| Alibaba-NLP/gte-Qwen2-7B-instruct | "Instruct: {task}\nQuery: " | none |
| nomic-ai/nomic-embed-text-v1.5 | "search_query: " | "search_document: " |
| jinaai/jina-embeddings-v3 | task="retrieval.query" kwarg | task="retrieval.passage" kwarg |
| mixedbread-ai/mxbai-embed-large-v1 | "Represent this sentence for searching relevant passages: " | none |
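One way to avoid forgetting a prefix is to keep a few rows of this table in a lookup and apply it before encoding. The sketch below is a hypothetical helper: `PREFIXES` and `with_prefix` are illustrative names, not part of sentence-transformers, and the code only builds strings.

```python
# Hypothetical helper: prefix lookup for a few models from the table above.
PREFIXES = {
    "BAAI/bge-large-en-v1.5": {
        "query": "Represent this sentence for searching relevant passages: ",
        "passage": "",
    },
    "intfloat/e5-large-v2": {"query": "query: ", "passage": "passage: "},
    "nomic-ai/nomic-embed-text-v1.5": {
        "query": "search_query: ",
        "passage": "search_document: ",
    },
}


def with_prefix(model_name: str, text: str, kind: str) -> str:
    """Return text with the query/passage prefix for model_name (none if unknown)."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = PREFIXES.get(model_name, {}).get(kind, "")
    return prefix + text


print(with_prefix("intfloat/e5-large-v2", "how do I manage linux services", "query"))
```

Models missing from the lookup fall through to no prefix, which matches the "none" rows. Recent sentence-transformers releases also accept a `prompt=` argument on `encode()` that serves the same purpose.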
BGE models (BAAI)
BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence (BAAI) rank highly on the MTEB leaderboard and load directly into sentence-transformers.
Standard BGE retrieval models
BGE retrieval models expect a short instruction prepended to queries only — passages are encoded as-is. Omitting the prefix on queries causes a noticeable drop in retrieval quality.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
passages = [
"Linux process management with systemd",
"SSH key-based authentication on Ubuntu",
"Python virtual environments with venv",
]
passage_emb = model.encode(passages, normalize_embeddings=True)
# Queries require the instruction prefix
instruction = "Represent this sentence for searching relevant passages: "
query_emb = model.encode(
instruction + "how do I manage linux services",
normalize_embeddings=True,
)
hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
BGE model sizes:
| Model | Dim | Speed | Best for |
|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | Fast | Low-latency, CPU inference |
| BAAI/bge-base-en-v1.5 | 768 | Medium | Balanced accuracy / speed |
| BAAI/bge-large-en-v1.5 | 1024 | Slow | Highest single-model accuracy |
| BAAI/bge-m3 | 1024 | Slow | Multilingual + hybrid retrieval |
BGE-M3 — dense retrieval via sentence-transformers
BGE-M3 supports 100+ languages and 8 192 token context. No prefix is required.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("BAAI/bge-m3")
passages = [
"Linux process management with systemd",
"Gestion des processus Linux avec systemd", # French
"Linux-Prozessverwaltung mit systemd", # German
]
passage_emb = model.encode(passages, normalize_embeddings=True)
# Cross-lingual query — no prefix needed for BGE-M3
query_emb = model.encode(
"how do I manage linux services",
normalize_embeddings=True,
)
hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
BGE-M3 — dense + sparse + ColBERT via FlagEmbedding
For full hybrid capability (dense vector + lexical sparse + late-interaction ColBERT), use the FlagEmbedding library instead.
pip install FlagEmbedding
Output:
Successfully installed FlagEmbedding-1.2.10
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
passages = ["Linux systemd service management", "SSH tunneling guide"]
queries = ["how to restart a linux service"]
p_out = model.encode(passages, return_dense=True, return_sparse=True, return_colbert_vecs=True)
q_out = model.encode(queries, return_dense=True, return_sparse=True, return_colbert_vecs=True)
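# Individual retrieval modes (method names follow the FlagEmbedding BGE-M3 README; verify against your installed version)
dense_scores = q_out["dense_vecs"] @ p_out["dense_vecs"].T  # shape: (n_queries, n_passages)
for i, passage in enumerate(passages):
    # Sparse and ColBERT scores are computed per (query, passage) pair
    sparse_score = model.compute_lexical_matching_score(
        q_out["lexical_weights"][0], p_out["lexical_weights"][i]
    )
    colbert_score = float(model.colbert_score(q_out["colbert_vecs"][0], p_out["colbert_vecs"][i]))
    # Hybrid: weighted combination (tune weights for your domain)
    hybrid = 0.4 * float(dense_scores[0][i]) + 0.2 * sparse_score + 0.4 * colbert_score
    print(f"{hybrid:.3f} {passage}")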
BGE reranking
from sentence_transformers import CrossEncoder
# English reranker
reranker = CrossEncoder("BAAI/bge-reranker-large")
query = "how to restart a linux service"
candidates = [
"Use systemctl restart <service> to restart a service.",
"The ls command lists files in a directory.",
"Rebooting the machine will restart all services.",
]
scores = reranker.predict([(query, c) for c in candidates])
for score, text in sorted(zip(scores, candidates), reverse=True):
print(f" {score:.3f} {text}")
E5 models (Microsoft / intfloat)
E5 (Embeddings from bidirectional Encoder representations) models consistently rank near the top of MTEB. They require "query: " and "passage: " prefixes — omitting them causes a meaningful accuracy drop.
E5 English retrieval
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("intfloat/e5-large-v2")
# Passages: always prefix with "passage: "
passages = [
"passage: Linux process management with systemd",
"passage: SSH key-based authentication on Ubuntu",
"passage: Python virtual environments with venv",
]
# Queries: always prefix with "query: "
query = "query: how do I manage linux services"
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
Multilingual E5 with task instruction
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
def e5_query(task: str, query: str) -> str:
return f"Instruct: {task}\nQuery: {query}"
task = "Given a question, retrieve passages that answer the question"
# Cross-lingual: German query against English passages
# The instruct model takes the instruction on queries only; passages need no prefix
passages = [
    "Linux process management with systemd",
    "SSH key-based authentication guide",
]
query = e5_query(task, "wie verwalte ich Linux-Dienste") # German
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
E5 model variants:
| Model | Dim | Notes |
|---|---|---|
| intfloat/e5-small-v2 | 384 | Fast, English only |
| intfloat/e5-base-v2 | 768 | Balanced, English only |
| intfloat/e5-large-v2 | 1024 | Best English quality |
| intfloat/multilingual-e5-large-instruct | 1024 | 100+ languages, instruction-tuned |
GTE models (Alibaba / Tongyi)
GTE (General Text Embeddings) are strong all-round performers. Standard GTE models need no prefix. The instruction-tuned gte-Qwen2 variant uses the same "Instruct: ... \nQuery: " format as multilingual E5.
GTE standard
from sentence_transformers import SentenceTransformer, util
# trust_remote_code is required because the GTE v1.5 checkpoints ship custom model code
model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
passages = [
"Linux process management with systemd",
"SSH key-based authentication on Ubuntu",
]
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode("how to restart a linux service", normalize_embeddings=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
GTE-Qwen2 (LLM-backed, near-SOTA)
GTE-Qwen2-7B-instruct replaces the BERT backbone with a 7 B-parameter LLM, achieving near-SOTA MTEB scores. It requires trust_remote_code=True and roughly 14 GB of VRAM.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer(
"Alibaba-NLP/gte-Qwen2-7B-instruct",
trust_remote_code=True,
model_kwargs={"torch_dtype": "auto"},
)
task = "Given a question, retrieve passages that answer the question"
query = f"Instruct: {task}\nQuery: how do I manage linux services"
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
GTE model sizes:
| Model | Dim | VRAM | Notes |
|---|---|---|---|
| thenlper/gte-small | 384 | ~1 GB | Fast baseline (original GTE release, published under thenlper) |
| Alibaba-NLP/gte-base-en-v1.5 | 768 | ~1.5 GB | Balanced |
| Alibaba-NLP/gte-large-en-v1.5 | 1024 | ~3 GB | Best mid-size English |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 3584 | ~14 GB | Near-SOTA, LLM backbone |
Nomic Embed (long context, open license)
nomic-embed-text-v1.5 supports 8 192 tokens per document and is fully open (Apache-2). It uses task-name prefixes rather than free-text instructions.
from sentence_transformers import SentenceTransformer, util
# trust_remote_code required for the custom RoPE implementation
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
# Task prefixes: search_query / search_document / clustering / classification
passages = [
"search_document: Linux process management with systemd",
"search_document: SSH key-based authentication guide",
"search_document: Python virtual environments and pip",
]
query = "search_query: how to manage linux services"
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
Available task prefixes:
| Prefix | Use for |
|---|---|
| search_query: | User queries at retrieval time |
| search_document: | Passages / documents in the index |
| clustering: | Grouping documents by topic |
| classification: | Text classification tasks |
Jina Embeddings v3
Jina v3 exposes task type as a keyword argument rather than a text prefix, and supports Matryoshka dimension truncation down to 32 dims.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
passages = [
"Linux process management with systemd",
"SSH key-based authentication guide",
"Python virtual environments and pip",
]
query = "how to manage linux services"
# task kwarg selects the task-specific LoRA adapter used for encoding
passage_emb = model.encode(passages, task="retrieval.passage", normalize_embeddings=True)
query_emb = model.encode(query, task="retrieval.query", normalize_embeddings=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
print(f" {hit['score']:.3f} {passages[hit['corpus_id']]}")
Available task values:
| Task value | Use for |
|---|---|
| retrieval.query | Query-side encoding |
| retrieval.passage | Passage/document encoding |
| separation | Clustering and reranking |
| classification | Text classification |
| text-matching | Symmetric similarity (STS, paraphrase) |
Jina Matryoshka truncation
Jina v3's Matryoshka support lets you truncate the full 1024-dim output to any smaller size at query time, reducing index storage with minimal accuracy loss.
import numpy as np
full_emb = model.encode("Hello world", task="retrieval.query", normalize_embeddings=True)
# Truncate to 256 dims (4× smaller index, minimal quality loss)
small_emb = full_emb[:256]
small_emb = small_emb / np.linalg.norm(small_emb)
print(small_emb.shape) # (256,)
Core API
Encode sentences
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
"The quick brown fox jumps over the lazy dog.",
"A fast auburn fox leaps across a sleeping canine.",
"Python is a high-level programming language.",
]
# Returns numpy array: (3, 384)
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 384)
print(embeddings[0][:5]) # first 5 dims of sentence 0
Normalize for cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)
# After normalization, dot product == cosine similarity
scores = embeddings @ embeddings.T
print(scores)
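The identity behind this shortcut can be checked without any model. The sketch below uses plain Python lists and toy vectors (not real embeddings) to show that the dot product of two L2-normalized vectors equals the cosine similarity of the originals.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale vec to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]  # toy vectors, not real embeddings
b = [1.0, 2.0, 2.0]

cos_raw = cosine(a, b)
dot_normed = dot(l2_normalize(a), l2_normalize(b))
print(cos_raw, dot_normed)  # the two values agree up to floating-point error
```

This is why an inner-product index over normalized embeddings ranks results exactly as cosine similarity would.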
Batch encode with progress bar
embeddings = model.encode(
large_list,
batch_size=64,
show_progress_bar=True,
normalize_embeddings=True,
)
Semantic similarity
Semantic similarity measures how closely related two pieces of text are in meaning, independent of exact wording. Use it to find paraphrases, score answer relevance, or detect near-duplicate content.
Sentence pair similarity
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
s1 = "How do I reset my password?"
s2 = "What is the process for changing login credentials?"
s3 = "Show me pictures of cats."
emb = model.encode([s1, s2, s3], normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]).item()) # ~0.82 — high similarity
print(util.cos_sim(emb[0], emb[2]).item()) # ~0.11 — low similarity
All-pairs similarity matrix
from sentence_transformers import util
# paraphrase_mining is faster than brute-force for large lists
pairs = util.paraphrase_mining(model, sentences, top_k=5)
for score, i, j in pairs:
print(f"{score:.3f} [{i}] {sentences[i]}")
print(f" [{j}] {sentences[j]}")
Semantic search
Semantic search retrieves documents by meaning rather than keyword overlap. A query is encoded into the same embedding space as the corpus, then nearest-neighbour lookup finds the most relevant passages.
Brute-force cosine search (small corpora)
from sentence_transformers import SentenceTransformer, util
import torch
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
"Linux process management with systemd",
"How to configure SSH key-based authentication",
"Python virtual environments and pip",
"Kubernetes pod scheduling and affinity rules",
"Z/OS JCL syntax reference",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
def search(query: str, top_k: int = 3):
q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
for hit in hits:
print(f" {hit['score']:.3f} {corpus[hit['corpus_id']]}")
search("SSH tunneling on Linux")
# 0.741 How to configure SSH key-based authentication
# 0.612 Linux process management with systemd
# 0.389 Python virtual environments and pip
FAISS index (millions of docs)
FAISS (Facebook AI Similarity Search) is a C++ library for efficient approximate nearest-neighbour search over dense vectors. Use it when your corpus is too large for brute-force cosine search (typically >100 k documents).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
dim = 384
# --- Build index ---
corpus_emb = model.encode(corpus, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(dim) # exact (brute-force) search; inner product == cosine on normalised vecs
index.add(corpus_emb)
# Optionally persist
faiss.write_index(index, "corpus.index")
# --- Query ---
query_emb = model.encode(["SSH on Linux"], normalize_embeddings=True).astype("float32")
distances, indices = index.search(query_emb, k=5)
for dist, idx in zip(distances[0], indices[0]):
print(f"{dist:.3f} {corpus[idx]}")
FAISS with IVF (faster at scale)
Inverted File (IVF) indexing partitions the vector space into Voronoi cells. At query time FAISS only searches the nprobe nearest cells, giving a controllable speed-vs-recall trade-off that makes it practical at tens of millions of vectors.
# Approximate nearest neighbour — train on a sample, then add all vecs
nlist = 100 # number of Voronoi cells; rule of thumb: sqrt(N)
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(corpus_emb)
index.add(corpus_emb)
index.nprobe = 10 # cells to visit at query time (speed vs recall tradeoff)
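To make the cell-probing idea concrete, here is a toy pure-Python sketch. It is not how FAISS is implemented: the centroids are fixed by hand (FAISS learns them during `train()`), and every function name is illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its highest-scoring centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        cells[best].append(vid)
    return cells


def ivf_search(query, vectors, centroids, cells, nprobe=1, k=2):
    """Score only the vectors stored in the nprobe cells closest to the query."""
    ranked_cells = sorted(
        range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True
    )
    candidates = [vid for c in ranked_cells[:nprobe] for vid in cells[c]]
    scored = sorted(((dot(query, vectors[v]), v) for v in candidates), reverse=True)
    return scored[:k]


vectors = [normalize(v) for v in [[1, 0.1], [0.9, 0.2], [0.1, 1], [0.2, 0.9]]]
centroids = [normalize([1, 0]), normalize([0, 1])]  # hand-picked for the example
cells = build_ivf(vectors, centroids)
hits = ivf_search(normalize([1, 0.15]), vectors, centroids, cells, nprobe=1, k=2)
print(cells)
print([vid for _, vid in hits])
```

With `nprobe=1` only half of the toy corpus is scored. Raising `nprobe` to the number of cells restores exhaustive search, which is the speed-versus-recall dial described above.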
Reranking with cross-encoders
A cross-encoder takes (query, passage) together and outputs a relevance score. It is slower than a bi-encoder but significantly more accurate — use it to rerank a small candidate set returned by a bi-encoder first pass.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "best way to monitor Linux memory usage"
candidates = [
"vmstat reports virtual memory statistics.",
"Use free -h to view RAM and swap usage.",
"The ls command lists directory contents.",
"htop is an interactive process viewer with memory columns.",
"Python garbage collection controls object lifetimes.",
]
# Returns array of floats (relevance logits)
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
for score, text in ranked:
print(f"{score:7.2f} {text}")
Two-stage retrieval pipeline
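# Assumes bi_model (SentenceTransformer), reranker (CrossEncoder), and corpus_emb
# (the embeddings of corpus, in the same order) were created beforehand
def two_stage_search(query: str, corpus: list[str], bi_k: int = 50, final_k: int = 5):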
# Stage 1: fast bi-encoder retrieval
q_emb = bi_model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(q_emb, corpus_emb, top_k=bi_k)[0]
candidates = [corpus[h["corpus_id"]] for h in hits]
# Stage 2: accurate cross-encoder reranking
pairs = [(query, c) for c in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, candidates), reverse=True)[:final_k]
return ranked
Clustering
Clustering groups documents by topic without labelled data. Embeddings act as features; standard algorithms (K-Means, agglomerative, community detection) then identify structure in the vector space.
K-Means clustering
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
"Configure nginx reverse proxy",
"Set up Apache virtual hosts",
"nginx SSL certificate setup",
"Train a neural network with PyTorch",
"Fine-tune BERT on custom dataset",
"Gradient descent explained",
"Kubernetes deployment manifests",
"Docker multi-stage builds",
"Container orchestration with Helm",
]
emb = model.encode(docs, normalize_embeddings=True)
km = KMeans(n_clusters=3, random_state=42, n_init="auto")
labels = km.fit_predict(emb)
from collections import defaultdict
clusters = defaultdict(list)
for label, doc in zip(labels, docs):
clusters[label].append(doc)
for cluster_id, items in clusters.items():
print(f"\n--- Cluster {cluster_id} ---")
for item in items:
print(f" {item}")
Agglomerative clustering (no fixed K)
Agglomerative clustering merges the closest pair of clusters bottom-up. Unlike K-Means you do not specify K in advance — instead you cut the resulting dendrogram at a cosine-distance threshold.
from sentence_transformers import util
import numpy as np
# Compute all-pairs cosine similarity
cos_scores = util.cos_sim(emb, emb).numpy()
# scipy linkage
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
distance_matrix = 1 - cos_scores
np.fill_diagonal(distance_matrix, 0)
condensed = squareform(distance_matrix, checks=False)
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=0.35, criterion="distance") # 0.35 = cosine dist threshold
Community detection (built-in fast clustering)
util.community_detection is a greedy algorithm built into sentence-transformers that finds dense regions in the similarity graph. It requires only a similarity threshold and cluster size minimum — no target cluster count.
from sentence_transformers import util
clusters = util.community_detection(
emb,
min_community_size=2,
threshold=0.75, # cosine similarity threshold
)
for i, cluster in enumerate(clusters):
print(f"\nCluster {i+1}:")
for idx in cluster:
print(f" [{idx}] {docs[idx]}")
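The greedy idea can be sketched on a precomputed similarity matrix. This is a simplified illustration, not the library's implementation (which works on tensors in batches); `greedy_communities` and the toy matrix are hypothetical.

```python
def greedy_communities(sim, threshold=0.75, min_community_size=2):
    """Greedy clustering over a square similarity matrix (list of lists)."""
    n = len(sim)
    # Candidate community for each row: every index at or above the threshold.
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if sim[i][j] >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)
    # Largest candidates first; drop members already claimed by an earlier one.
    candidates.sort(key=len, reverse=True)
    taken, result = set(), []
    for members in candidates:
        fresh = [j for j in members if j not in taken]
        if len(fresh) >= min_community_size:
            result.append(fresh)
            taken.update(fresh)
    return result


sim = [  # toy cosine similarities for five documents
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.95],
    [0.2, 0.1, 0.1, 0.95, 1.0],
]
print(greedy_communities(sim))
```

Documents that never reach the threshold with enough neighbours are left unassigned, so this approach tolerates outliers where K-Means would force them into a cluster.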
Semantic textual similarity (STS) benchmark evaluation
The STS benchmark is a standard dataset of sentence pairs rated 0–5 for similarity. Evaluating against it gives a Spearman correlation score you can use to compare models or track fine-tuning progress on a held-out set.
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset
stsb = load_dataset("mteb/stsbenchmark-sts", split="test")
evaluator = EmbeddingSimilarityEvaluator(
sentences1=stsb["sentence1"],
sentences2=stsb["sentence2"],
scores=[s / 5.0 for s in stsb["score"]], # normalize 0–5 → 0–1
name="stsb-test",
)
result = evaluator(model)
print(result) # in v3+, a dict of metrics that includes the Spearman cosine correlation
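Spearman correlation is the Pearson correlation computed on ranks rather than raw values, so it rewards a model for ordering pairs correctly even when its absolute scores are off. A minimal stdlib sketch follows; the numbers are toy values, not benchmark data, and the helper names are illustrative.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gold = [0.9, 0.1, 0.5, 0.7]           # human similarity labels (toy values)
predicted = [0.80, 0.30, 0.35, 0.60]  # model cosine similarities (toy values)
print(spearman(gold, predicted))
```

Here the predicted scores differ from the labels but preserve their order, so the correlation is perfect.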
Matryoshka embeddings (truncatable dimensions)
Matryoshka Representation Learning (MRL) trains a model so the first N dimensions are already meaningful. You can trade accuracy for speed by reducing dimensionality without retraining the model.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# Full 1024-dim (highest accuracy)
full_emb = model.encode("Hello world", normalize_embeddings=True)
# Truncated 256-dim (4× smaller index, ~1–2% accuracy drop)
small_emb = full_emb[:256]
small_emb = small_emb / np.linalg.norm(small_emb) # re-normalize after truncation
print(full_emb.shape) # (1024,)
print(small_emb.shape) # (256,)
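The re-normalization step matters because a truncated unit vector no longer has length 1, so a plain dot product would stop being a cosine similarity. A stdlib sketch with a toy 4-dim vector standing in for a real embedding (`truncate_and_renormalize` is an illustrative name):

```python
import math


def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` values, then rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length "embedding"
small = truncate_and_renormalize(full, 2)
print(len(small), math.sqrt(sum(x * x for x in small)))
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which applies the truncation inside `encode()`. Check that your installed version supports it before relying on it.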
Multi-modal embeddings (CLIP)
CLIP (Contrastive Language–Image Pre-Training) projects both text and images into a shared embedding space. You can use it to score image-text relevance or search an image collection with a natural-language query.
from sentence_transformers import SentenceTransformer, util
from PIL import Image
model = SentenceTransformer("clip-ViT-B-32")
# Encode text
text_emb = model.encode(["a photo of a cat", "a photo of a dog"])
# Encode image
img = Image.open("cat.jpg")
img_emb = model.encode(img)
# Cross-modal similarity
sim_cat = util.cos_sim(img_emb, text_emb[0]) # text: "cat"
sim_dog = util.cos_sim(img_emb, text_emb[1]) # text: "dog"
print(f"cat: {sim_cat.item():.3f}, dog: {sim_dog.item():.3f}")
Fine-tuning
Fine-tuning adapts a pre-trained embedding model to your domain using labelled sentence pairs. Even a few hundred high-quality examples can meaningfully improve retrieval quality on domain-specific text.
Prepare training data
Fine-tuning requires sentence pairs with labels. Three common formats:
from sentence_transformers import InputExample
# Format 1: (sentence_a, sentence_b, similarity_score 0–1)
examples_sts = [
InputExample(texts=["My new laptop arrived.", "I got a new computer today."], label=0.9),
InputExample(texts=["The server crashed.", "Pizza is delicious."], label=0.05),
]
# Format 2: positive pairs only (NLI / natural paraphrases)
examples_pos = [
InputExample(texts=["Cancel my subscription", "I want to stop my plan"]),
InputExample(texts=["Reset password", "Forgot login credentials"]),
]
# Format 3: triplets (anchor, positive, negative)
examples_triplet = [
InputExample(texts=["Linux firewall rules", "iptables cheat sheet", "Windows registry keys"]),
]
CosineSimilarity loss (STS-style)
CosineSimilarityLoss trains the model to match predicted cosine similarity against a float label in [0, 1]. Use it when you have explicit numeric similarity ratings such as human-rated STS pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
model = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(examples_sts, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(
train_objectives=[(loader, loss)],
epochs=3,
warmup_steps=100,
output_path="./my-finetuned-model",
show_progress_bar=True,
)
MultipleNegativesRankingLoss (contrastive, pairs only)
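This loss is a common default for retrieval fine-tuning because it needs only positive pairs. Every other example in the batch serves as an in-batch negative (a random negative, not a mined hard negative), so larger batches give a stronger training signal.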
from sentence_transformers import losses
loader = DataLoader(examples_pos, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
train_objectives=[(loader, loss)],
epochs=5,
warmup_steps=200,
output_path="./my-retrieval-model",
)
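The objective itself is a cross-entropy over the batch similarity matrix, where row i should put its highest score on column i. The sketch below computes it in plain Python on toy matrices. The scale of 20 mirrors what is understood to be the library default and should be treated as an assumption; `mnr_loss` is an illustrative name.

```python
import math


def mnr_loss(sim, scale=20.0):
    """Mean cross-entropy where row i's correct "class" is column i.

    sim[i][j] is the similarity between anchor i and positive j, so every
    off-diagonal entry in a row is an in-batch negative for that anchor.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # equals -log softmax(logits)[i]
    return total / len(sim)


good = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]  # positives score highest
bad = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.9, 0.1]]   # positives score low
print(mnr_loss(good), mnr_loss(bad))
```

The loss is low when each anchor is closest to its own positive and high otherwise. Duplicate positives within one batch would be treated as negatives, which is a reason to deduplicate training pairs.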
TripletLoss
TripletLoss pulls an anchor closer to a positive example and pushes it away from a negative in the embedding space. Use it when data comes as (anchor, positive, negative) triplets rather than scored pairs.
loader = DataLoader(examples_triplet, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model, distance_metric=losses.TripletDistanceMetric.COSINE)
model.fit(train_objectives=[(loader, loss)], epochs=3)
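For a single triplet the objective reduces to a hinge on the difference between two distances. A stdlib sketch using cosine distance follows; the margin of 0.5 is an illustrative value, not the library default, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [1.0, 0.0]
easy = triplet_loss(anchor, [1.0, 0.0], [0.0, 1.0])   # positive identical, negative orthogonal
wrong = triplet_loss(anchor, [0.0, 1.0], [1.0, 0.0])  # roles swapped
print(easy, wrong)
```

Triplets that already satisfy the margin contribute zero loss, so training signal comes only from triplets where the negative is still too close to the anchor.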
Save and reload
# Save
model.save("./my-finetuned-model")
# Load locally
model = SentenceTransformer("./my-finetuned-model")
# Push to Hugging Face Hub
model.push_to_hub("alicedev/my-finetuned-model")
Real-world use cases
1. FAQ / support ticket routing
faqs = {
"How do I reset my password?": "visit /account/reset",
"Where is my order?": "check /orders with your tracking number",
"How do I cancel my subscription?": "go to /billing and click Cancel",
"What payment methods do you accept?": "Visa, MasterCard, PayPal",
}
faq_texts = list(faqs.keys())
faq_answers = list(faqs.values())
faq_emb = model.encode(faq_texts, normalize_embeddings=True, convert_to_tensor=True)
def route_ticket(user_query: str, threshold: float = 0.65):
q_emb = model.encode(user_query, normalize_embeddings=True, convert_to_tensor=True)
hits = util.semantic_search(q_emb, faq_emb, top_k=1)[0]
hit = hits[0]
if hit["score"] >= threshold:
return faq_answers[hit["corpus_id"]]
return "Escalate to human agent — no FAQ match above threshold."
print(route_ticket("I forgot my login")) # → password reset answer
print(route_ticket("Do you take credit cards?")) # → payment methods answer
print(route_ticket("Tell me a joke")) # → escalate
2. Duplicate / near-duplicate detection
from sentence_transformers import util
def find_duplicates(texts: list[str], threshold: float = 0.92):
    # paraphrase_mining encodes the texts itself and returns [score, i, j] triples
    pairs = util.paraphrase_mining(model, texts, top_k=10)
    return [(i, j, s) for s, i, j in pairs if s >= threshold]
tickets = [
"Server is down",
"The server is not responding",
"Production outage — site unreachable",
"Cannot connect to production server",
"Scheduled maintenance complete",
]
for i, j, score in find_duplicates(tickets):
print(f"{score:.3f} '{tickets[i]}' ↔ '{tickets[j]}'")
3. Code search
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")
code_snippets = [
"def connect_db(host, port, db): return psycopg2.connect(host=host, port=port, dbname=db)",
"async def fetch_user(user_id: int) -> dict: ...",
"for item in items: cache[item.id] = item",
"class RetryPolicy: def __init__(self, max_retries=3, backoff=2.0): ...",
"df.groupby('department')['salary'].mean()",
]
code_emb = model.encode(code_snippets, normalize_embeddings=True, convert_to_tensor=True)
def search_code(nl_query: str):
q_emb = model.encode(nl_query, normalize_embeddings=True, convert_to_tensor=True)
hits = util.semantic_search(q_emb, code_emb, top_k=3)[0]
for hit in hits:
print(f" {hit['score']:.3f} {code_snippets[hit['corpus_id']]}")
search_code("connect to postgres database")
search_code("retry with exponential backoff")
search_code("average salary by department")
4. RAG embedding pipeline
import json
from pathlib import Path
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-mpnet-base-v2")
def build_index(chunks: list[dict], output_dir: str = "./index"):
Path(output_dir).mkdir(exist_ok=True)
texts = [c["text"] for c in chunks]
emb = model.encode(texts, batch_size=64, normalize_embeddings=True, show_progress_bar=True)
np.save(f"{output_dir}/embeddings.npy", emb)
with open(f"{output_dir}/metadata.json", "w") as f:
json.dump(chunks, f)
print(f"Indexed {len(chunks)} chunks → {output_dir}")
def query_index(query: str, index_dir: str = "./index", top_k: int = 5):
emb = np.load(f"{index_dir}/embeddings.npy")
with open(f"{index_dir}/metadata.json") as f:
metadata = json.load(f)
q_emb = model.encode(query, normalize_embeddings=True)
scores = emb @ q_emb
top_idx = np.argsort(scores)[::-1][:top_k]
return [(scores[i], metadata[i]) for i in top_idx]
5. Zero-shot classification with NLI
Use a cross-encoder trained on NLI to classify text into custom labels without any labelled examples.
from sentence_transformers import CrossEncoder
nli = CrossEncoder("cross-encoder/nli-deberta-v3-small")
text = "The quarterly earnings exceeded analyst expectations by 12%."
candidate_labels = ["finance", "sports", "technology", "politics", "healthcare"]
pairs = [(text, f"This text is about {label}.") for label in candidate_labels]
scores = nli.predict(pairs, apply_softmax=True)
# scores shape: (n_labels, 3); this model's label order is contradiction, entailment, neutral
entailment_scores = scores[:, 1] # index 1 = entailment (confirm against the model card's label mapping)
ranked = sorted(zip(entailment_scores, candidate_labels), reverse=True)
for score, label in ranked:
print(f" {score:.3f} {label}")
# 0.921 finance
# 0.041 technology
6. Semantic de-duplication of a dataset
from sentence_transformers import SentenceTransformer, util
import torch
def deduplicate(texts: list[str], threshold: float = 0.90) -> list[str]:
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(texts, normalize_embeddings=True, convert_to_tensor=True)
kept = []
kept_emb = []
for text, vec in zip(texts, emb):
if not kept_emb:
kept.append(text)
kept_emb.append(vec)
continue
stack = torch.stack(kept_emb)
sims = util.cos_sim(vec.unsqueeze(0), stack)[0]
if sims.max().item() < threshold:
kept.append(text)
kept_emb.append(vec)
print(f"Kept {len(kept)}/{len(texts)} after deduplication at threshold={threshold}")
return kept
Performance tips
| Tip | Impact |
|---|---|
| normalize_embeddings=True at encode time | Avoids re-normalizing later; enables dot product as cosine similarity |
| convert_to_tensor=True | Returns PyTorch tensor on GPU if available; faster for downstream ops |
| batch_size=64–256 on GPU | Saturates GPU throughput; tune based on VRAM |
| Use all-MiniLM-L6-v2 for prototyping | 5× faster than mpnet, within ~5% on most benchmarks |
| FAISS IVF or HNSW for >100 k docs | Brute-force cosine doesn't scale; IVF gives 10–100× speedup |
| Cache embeddings to disk | Re-encode only new documents, not the full corpus |
| Pin torch version | Transitive upgrades can silently change GPU behaviour |
| model.half() (fp16) | Halves VRAM; negligible accuracy loss on most models |
# fp16 inference example
model = SentenceTransformer("all-mpnet-base-v2")
model.half() # convert weights to float16
emb = model.encode(texts, convert_to_tensor=True)
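The "cache embeddings" tip can be sketched as a content-hash lookup that calls the encoder only for unseen texts. Everything here is illustrative: `encode_with_cache` is a hypothetical helper, the cache is an in-memory dict, and `fake_embed` stands in for a real call such as `model.encode(batch)`.

```python
import hashlib


def text_key(text: str) -> str:
    """Stable cache key derived from the text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def encode_with_cache(texts, embed_fn, cache):
    """Embed only texts missing from cache; return vectors in input order."""
    missing = [t for t in dict.fromkeys(texts) if text_key(t) not in cache]
    if missing:
        for text, vec in zip(missing, embed_fn(missing)):
            cache[text_key(text)] = vec
    return [cache[text_key(t)] for t in texts]


calls = []  # records each batch sent to the (fake) encoder


def fake_embed(batch):
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]  # placeholder "embedding"


cache = {}
encode_with_cache(["a", "bb"], fake_embed, cache)
encode_with_cache(["bb", "ccc"], fake_embed, cache)
print(calls)
```

The second call sends only the new text to the encoder. In a real pipeline the cache would be persisted to disk and keyed on the model name as well, since embeddings from different checkpoints are not interchangeable.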
Common errors
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory | Batch too large | Reduce batch_size; use model.half() |
| RuntimeError: stack expects each tensor to be equal size | Variable-length inputs in manual batching | Use model.encode(), which handles padding automatically |
| Similarity scores all near 1.0 | Embeddings not normalized | Pass normalize_embeddings=True |
| Similarity scores all near 0.0 | Mixed models between index build and query | Always use the exact same model checkpoint for both |
| OSError: Can't load tokenizer for 'model-name' | Model not on Hugging Face Hub or local path typo | Check spelling; try SentenceTransformer("sentence-transformers/model-name") |
| Cross-encoder scores are logits, not probabilities | Default predict() output | Pass apply_softmax=True for multi-class models; apply a sigmoid to the logit for single-score rerankers |