cheat sheet

Sentence Transformers

Comprehensive reference for the sentence-transformers Python library — embeddings, similarity, clustering, retrieval, fine-tuning, and popular models (BGE, E5, GTE, Nomic, Jina).

Sentence Transformers — Embeddings, Search & Fine-Tuning

sentence-transformers is the de facto Python library for computing dense vector embeddings from text (and images). It wraps Hugging Face Transformers and exposes a clean API for semantic similarity, search, clustering, reranking, and fine-tuning.


What it is

Sentence Transformers (SBERT) is a Python library built on top of Hugging Face Transformers that turns sentences, paragraphs, or images into fixed-length dense vectors suitable for similarity, search, clustering, and reranking. Reach for it whenever you need semantic comparisons of text — RAG retrieval, deduplication, recommendation, classification with frozen embeddings, or fine-tuned domain-specific encoders — without writing custom training loops.


Why sentence-transformers?

| Feature | What it means in practice |
| --- | --- |
| Bi-encoder architecture | Single forward pass per sentence → fast, scalable to millions of docs |
| Pre-trained SBERT models | High-quality out-of-the-box without any fine-tuning |
| Cross-encoder support | Accurate reranking on top of bi-encoder retrieval |
| Hugging Face Hub integration | One-line model loading from 7 000+ compatible checkpoints |
| Multi-modal support | Some models encode both text and images in the same space |
| Built-in fine-tuning API | Contrastive, cosine-similarity, triplet, and MSE losses included |
| Batch & device awareness | Automatic GPU/MPS/CPU dispatch; batched encoding through encode(batch_size=...) |
| Dimensionality flexibility | Matryoshka models let you truncate embeddings without retraining |

Installation

bash
# Minimal
pip install sentence-transformers

# With GPU (CUDA 12.1): install the CUDA build of torch first, then the library from PyPI
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install sentence-transformers

# With FAISS for large-scale ANN search
pip install sentence-transformers faiss-gpu   # GPU
pip install sentence-transformers faiss-cpu   # CPU-only

Output:

text
Successfully installed sentence-transformers-3.0.1 torch-2.3.0 transformers-4.41.2 huggingface-hub-0.23.2 tokenizers-0.19.1

Model selection guide

Pick by use-case, not just benchmark score.

| Use case | Recommended model | Dim | Notes |
| --- | --- | --- | --- |
| General English semantic similarity | all-MiniLM-L6-v2 | 384 | Fast, very popular baseline |
| Best English quality | all-mpnet-base-v2 | 768 | ~5× slower than MiniLM, better accuracy |
| Best retrieval accuracy (English) | BAAI/bge-large-en-v1.5 | 1024 | Strong MTEB English scores; requires query prefix |
| Instruction-tuned retrieval | intfloat/e5-large-v2 | 1024 | Requires query: / passage: prefixes |
| Multilingual + hybrid retrieval | BAAI/bge-m3 | 1024 | 100+ languages; dense + sparse + ColBERT |
| Multilingual (50+ languages) | paraphrase-multilingual-mpnet-base-v2 | 768 | Good cross-lingual retrieval |
| Multilingual instruction-tuned | intfloat/multilingual-e5-large-instruct | 1024 | 100+ languages, task-prefix format |
| Long documents (8 k tokens) | nomic-ai/nomic-embed-text-v1.5 | 768 | Rotary position embeddings; Apache-2 |
| Long context + task prefixes | jinaai/jina-embeddings-v3 | 1024 | 8 k tokens, 89 languages, Matryoshka |
| Matryoshka / truncatable | mixedbread-ai/mxbai-embed-large-v1 | 1024→64 | Truncate to any dim at query time |
| Highest accuracy (LLM-backed) | Alibaba-NLP/gte-Qwen2-7B-instruct | 3584 | ~14 GB VRAM; near-SOTA on MTEB |
| Code + text | flax-sentence-embeddings/st-codesearch-distilroberta-base | 768 | Semantic code search |
| Text + image | clip-ViT-B-32 | 512 | CLIP multi-modal space |
| Reranking (cross-encoder, English) | cross-encoder/ms-marco-MiniLM-L-6-v2 | scalar | Use on top of bi-encoder results |
| Reranking (multilingual) | jinaai/jina-reranker-v2-base-multilingual | scalar | 100+ languages cross-encoder reranker |

Model prefix requirements — quick reference

Several high-quality models require task-specific text prefixes on queries or passages. Using them incorrectly silently degrades retrieval quality.

| Model | Query prefix | Passage prefix |
| --- | --- | --- |
| all-MiniLM-L6-v2 | none | none |
| all-mpnet-base-v2 | none | none |
| BAAI/bge-*-en-v1.5 | "Represent this sentence for searching relevant passages: " | none |
| BAAI/bge-m3 | none | none |
| intfloat/e5-*-v2 | "query: " | "passage: " |
| intfloat/multilingual-e5-large-instruct | "Instruct: {task}\nQuery: " | none |
| Alibaba-NLP/gte-large-en-v1.5 | none | none |
| Alibaba-NLP/gte-Qwen2-7B-instruct | "Instruct: {task}\nQuery: " | none |
| nomic-ai/nomic-embed-text-v1.5 | "search_query: " | "search_document: " |
| jinaai/jina-embeddings-v3 | task="retrieval.query" kwarg | task="retrieval.passage" kwarg |
| mixedbread-ai/mxbai-embed-large-v1 | "Represent this sentence for searching relevant passages: " | none |
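
One way to keep prefixes consistent between index time and query time is a small lookup keyed by model name. The sketch below is illustrative only: `PREFIXES` and `add_prefix` are hypothetical names, not part of sentence-transformers, and it covers only a few prefix-style models from the table above (Jina v3 takes a `task` keyword argument instead).

```python
# Hypothetical helper: maps a model name to its (query_prefix, passage_prefix).
# The prefix strings are copied from the table above; the names are illustrative.
BGE_QUERY = "Represent this sentence for searching relevant passages: "

PREFIXES = {
    "BAAI/bge-large-en-v1.5": (BGE_QUERY, ""),
    "intfloat/e5-large-v2": ("query: ", "passage: "),
    "nomic-ai/nomic-embed-text-v1.5": ("search_query: ", "search_document: "),
    "all-MiniLM-L6-v2": ("", ""),
}


def add_prefix(model_name: str, text: str, is_query: bool) -> str:
    """Return text with the prefix the given model expects (unchanged if none)."""
    query_prefix, passage_prefix = PREFIXES.get(model_name, ("", ""))
    return (query_prefix if is_query else passage_prefix) + text


print(add_prefix("intfloat/e5-large-v2", "how do I manage linux services", is_query=True))
```

The returned string is what you would pass to model.encode(); unknown models fall back to no prefix, which is an assumption of this sketch rather than a safe default for every checkpoint.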

BGE models (BAAI)

BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence rank highly on the MTEB leaderboard and load directly into sentence-transformers.

Standard BGE retrieval models

BGE retrieval models expect a short instruction prepended to queries only — passages are encoded as-is. Omitting the prefix on queries causes a noticeable drop in retrieval quality.

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

passages = [
    "Linux process management with systemd",
    "SSH key-based authentication on Ubuntu",
    "Python virtual environments with venv",
]
passage_emb = model.encode(passages, normalize_embeddings=True)

# Queries require the instruction prefix
instruction = "Represent this sentence for searching relevant passages: "
query_emb = model.encode(
    instruction + "how do I manage linux services",
    normalize_embeddings=True,
)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

BGE model sizes:

| Model | Dim | Speed | Best for |
| --- | --- | --- | --- |
| BAAI/bge-small-en-v1.5 | 384 | Fast | Low-latency, CPU inference |
| BAAI/bge-base-en-v1.5 | 768 | Medium | Balanced accuracy / speed |
| BAAI/bge-large-en-v1.5 | 1024 | Slow | Highest single-model accuracy |
| BAAI/bge-m3 | 1024 | Slow | Multilingual + hybrid retrieval |

BGE-M3 — dense retrieval via sentence-transformers

BGE-M3 supports 100+ languages and 8 192 token context. No prefix is required.

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")

passages = [
    "Linux process management with systemd",
    "Gestion des processus Linux avec systemd",   # French
    "Linux-Prozessverwaltung mit systemd",          # German
]
passage_emb = model.encode(passages, normalize_embeddings=True)

# Cross-lingual query — no prefix needed for BGE-M3
query_emb = model.encode(
    "how do I manage linux services",
    normalize_embeddings=True,
)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

BGE-M3 — dense + sparse + ColBERT via FlagEmbedding

For full hybrid capability (dense vector + lexical sparse + late-interaction ColBERT), use the FlagEmbedding library instead.

bash
pip install FlagEmbedding

Output:

text
Successfully installed FlagEmbedding-1.2.10

python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

passages = ["Linux systemd service management", "SSH tunneling guide"]
queries  = ["how to restart a linux service"]

p_out = model.encode(passages, return_dense=True, return_sparse=True, return_colbert_vecs=True)
q_out = model.encode(queries,  return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Score the first query against each passage in all three modes.
# Method names follow the FlagEmbedding README; check them against your installed version.
for i, passage in enumerate(passages):
    # Dense vectors are L2-normalised by default, so the dot product is the cosine similarity
    dense_score = float(q_out["dense_vecs"][0] @ p_out["dense_vecs"][i])
    # Sparse score: overlap of the per-token lexical weights of query and passage
    sparse_score = model.compute_lexical_matching_score(
        q_out["lexical_weights"][0], p_out["lexical_weights"][i]
    )
    # ColBERT score: late interaction over the per-token vectors
    colbert_score = float(model.colbert_score(q_out["colbert_vecs"][0], p_out["colbert_vecs"][i]))

    # Hybrid: weighted combination (tune weights for your domain)
    hybrid = 0.4 * dense_score + 0.2 * sparse_score + 0.4 * colbert_score
    print(f"  {hybrid:.3f}  {passage}")

BGE reranking

python
from sentence_transformers import CrossEncoder

# English reranker
reranker = CrossEncoder("BAAI/bge-reranker-large")

query = "how to restart a linux service"
candidates = [
    "Use systemctl restart <service> to restart a service.",
    "The ls command lists files in a directory.",
    "Rebooting the machine will restart all services.",
]

scores = reranker.predict([(query, c) for c in candidates])
for score, text in sorted(zip(scores, candidates), reverse=True):
    print(f"  {score:.3f}  {text}")

E5 models (Microsoft / intfloat)

E5 (Embeddings from bidirectional Encoder representations) models consistently rank near the top of MTEB. They require "query: " and "passage: " prefixes — omitting them causes a meaningful accuracy drop.

E5 English retrieval

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

# Passages: always prefix with "passage: "
passages = [
    "passage: Linux process management with systemd",
    "passage: SSH key-based authentication on Ubuntu",
    "passage: Python virtual environments with venv",
]

# Queries: always prefix with "query: "
query = "query: how do I manage linux services"

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

Multilingual E5 with task instruction

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

def e5_query(task: str, query: str) -> str:
    return f"Instruct: {task}\nQuery: {query}"

task = "Given a question, retrieve passages that answer the question"

# Cross-lingual: German query against English passages.
# Only the query carries the instruction; passages are encoded without a prefix.
passages = [
    "Linux process management with systemd",
    "SSH key-based authentication guide",
]
query = e5_query(task, "wie verwalte ich Linux-Dienste")   # German

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

E5 model variants:

| Model | Dim | Notes |
| --- | --- | --- |
| intfloat/e5-small-v2 | 384 | Fast, English only |
| intfloat/e5-base-v2 | 768 | Balanced, English only |
| intfloat/e5-large-v2 | 1024 | Best English quality |
| intfloat/multilingual-e5-large-instruct | 1024 | 100+ languages, instruction-tuned |

GTE models (Alibaba / Tongyi)

GTE (General Text Embeddings) are strong all-round performers. Standard GTE models need no prefix. The instruction-tuned gte-Qwen2 variant uses the same "Instruct: ... \nQuery: " format as multilingual E5.

GTE standard

python
from sentence_transformers import SentenceTransformer, util

# trust_remote_code is required because the checkpoint ships its own model code
model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

passages = [
    "Linux process management with systemd",
    "SSH key-based authentication on Ubuntu",
]
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode("how to restart a linux service", normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

GTE-Qwen2 (LLM-backed, near-SOTA)

GTE-Qwen2-7B-instruct replaces the BERT backbone with a 7 B-parameter LLM, achieving near-SOTA MTEB scores. It requires trust_remote_code=True and roughly 14 GB of VRAM.

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-7B-instruct",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "auto"},
)

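# Passages are encoded as-is; only the query carries the instruction
passages = [
    "Linux process management with systemd",
    "SSH key-based authentication on Ubuntu",
]

task = "Given a question, retrieve passages that answer the question"
query = f"Instruct: {task}\nQuery: how do I manage linux services"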

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

GTE model sizes:

| Model | Dim | VRAM | Notes |
| --- | --- | --- | --- |
| thenlper/gte-small | 384 | ~1 GB | Fast baseline |
| Alibaba-NLP/gte-base-en-v1.5 | 768 | ~1.5 GB | Balanced |
| Alibaba-NLP/gte-large-en-v1.5 | 1024 | ~3 GB | Best mid-size English |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 3584 | ~14 GB | Near-SOTA, LLM backbone |

Nomic Embed (long context, open license)

nomic-embed-text-v1.5 supports 8 192 tokens per document and is fully open (Apache-2). It uses task-name prefixes rather than free-text instructions.

python
from sentence_transformers import SentenceTransformer, util

# trust_remote_code required for the custom RoPE implementation
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Task prefixes: search_query / search_document / clustering / classification
passages = [
    "search_document: Linux process management with systemd",
    "search_document: SSH key-based authentication guide",
    "search_document: Python virtual environments and pip",
]
query = "search_query: how to manage linux services"

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

Available task prefixes:

| Prefix | Use for |
| --- | --- |
| search_query: | User queries at retrieval time |
| search_document: | Passages / documents in the index |
| clustering: | Grouping documents by topic |
| classification: | Text classification tasks |

Jina Embeddings v3

Jina v3 exposes task type as a keyword argument rather than a text prefix, and supports Matryoshka dimension truncation down to 32 dims.

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

passages = [
    "Linux process management with systemd",
    "SSH key-based authentication guide",
    "Python virtual environments and pip",
]
query = "how to manage linux services"

# The task kwarg selects which task-specific LoRA adapter is applied
passage_emb = model.encode(passages, task="retrieval.passage", normalize_embeddings=True)
query_emb   = model.encode(query,    task="retrieval.query",   normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

Available task values:

| Task value | Use for |
| --- | --- |
| retrieval.query | Query-side encoding |
| retrieval.passage | Passage/document encoding |
| separation | Distinguishing semantically different texts |
| classification | Text classification |
| text-matching | Symmetric similarity (STS, paraphrase) |

Jina Matryoshka truncation

Jina v3's Matryoshka support lets you truncate the full 1024-dim output to any smaller size at query time, reducing index storage with minimal accuracy loss.

python
import numpy as np

full_emb = model.encode("Hello world", task="retrieval.query", normalize_embeddings=True)

# Truncate to 256 dims (4× smaller index, minimal quality loss)
small_emb = full_emb[:256]
small_emb = small_emb / np.linalg.norm(small_emb)
print(small_emb.shape)  # (256,)

Core API

Encode sentences

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast auburn fox leaps across a sleeping canine.",
    "Python is a high-level programming language.",
]

# Returns numpy array: (3, 384)
embeddings = model.encode(sentences)

print(embeddings.shape)        # (3, 384)
print(embeddings[0][:5])       # first 5 dims of sentence 0

Normalize for cosine similarity

python
embeddings = model.encode(sentences, normalize_embeddings=True)
# After normalization, dot product == cosine similarity
scores = embeddings @ embeddings.T
print(scores)
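
The claim that a dot product equals cosine similarity after normalization can be checked by hand. The following dependency-free sketch uses two made-up 3-d vectors in place of real embeddings; the helper names are illustrative.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]   # made-up stand-ins for embeddings
b = [1.0, 2.0, 2.0]

raw_cosine = cosine(a, b)
normalized_dot = dot(l2_normalize(a), l2_normalize(b))
print(raw_cosine, normalized_dot)
```

Both values agree up to floating-point error, which is why a normalized index can be searched with a plain matrix product.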

Batch encode with progress bar

python
large_list = sentences * 1000   # stand-in for a large corpus; replace with your own list[str]
embeddings = model.encode(
    large_list,
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True,
)

Semantic similarity

Semantic similarity measures how closely related two pieces of text are in meaning, independent of exact wording. Use it to find paraphrases, score answer relevance, or detect near-duplicate content.

Sentence pair similarity

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

s1 = "How do I reset my password?"
s2 = "What is the process for changing login credentials?"
s3 = "Show me pictures of cats."

emb = model.encode([s1, s2, s3], normalize_embeddings=True)

print(util.cos_sim(emb[0], emb[1]).item())  # ~0.82 — high similarity
print(util.cos_sim(emb[0], emb[2]).item())  # ~0.11 — low similarity

All-pairs similarity matrix

python
from sentence_transformers import util

# paraphrase_mining is faster than brute-force for large lists
pairs = util.paraphrase_mining(model, sentences, top_k=5)

for score, i, j in pairs:
    print(f"{score:.3f}  [{i}] {sentences[i]}")
    print(f"       [{j}] {sentences[j]}")

Semantic search

Semantic search retrieves documents by meaning rather than keyword overlap. A query is encoded into the same embedding space as the corpus, then nearest-neighbour lookup finds the most relevant passages.

Brute-force cosine search (small corpora)

python
from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Linux process management with systemd",
    "How to configure SSH key-based authentication",
    "Python virtual environments and pip",
    "Kubernetes pod scheduling and affinity rules",
    "Z/OS JCL syntax reference",
]

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

def search(query: str, top_k: int = 3):
    q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    for hit in hits:
        print(f"  {hit['score']:.3f}  {corpus[hit['corpus_id']]}")

search("SSH tunneling on Linux")
# 0.741  How to configure SSH key-based authentication
# 0.612  Linux process management with systemd
# 0.389  Python virtual environments and pip

FAISS index (millions of docs)

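FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for efficient exact and approximate nearest-neighbour search over dense vectors. Use it when your corpus is too large for brute-force cosine search in PyTorch (typically >100 k documents). The flat index below is still exact; the IVF variant in the next subsection trades a little recall for speed.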

python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
dim = 384

# --- Build index ---
corpus_emb = model.encode(corpus, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(dim)          # exact search; inner product == cosine on normalised vecs
index.add(corpus_emb)

# Optionally persist
faiss.write_index(index, "corpus.index")

# --- Query ---
query_emb = model.encode(["SSH on Linux"], normalize_embeddings=True).astype("float32")
distances, indices = index.search(query_emb, k=5)

for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {corpus[idx]}")

FAISS with IVF (faster at scale)

Inverted File (IVF) indexing partitions the vector space into Voronoi cells. At query time FAISS only searches the nprobe nearest cells, giving a controllable speed-vs-recall trade-off that makes it practical at tens of millions of vectors.

python
# Approximate nearest neighbour — train on a sample, then add all vecs
nlist = 100   # number of Voronoi cells; rule of thumb: sqrt(N)
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

index.train(corpus_emb)   # needs at least nlist vectors; FAISS warns below roughly 39 × nlist
index.add(corpus_emb)
index.nprobe = 10   # cells to visit at query time (speed vs recall tradeoff)
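
To make the nprobe trade-off concrete, here is a toy pure-Python sketch of the IVF idea. It is not how FAISS is implemented: the centroids are fixed by hand (FAISS learns them with k-means during train()), and `build_ivf` and `ivf_search` are hypothetical names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def build_ivf(vectors, centroids):
    """Assign every vector id to the cell of its most similar centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        cells[best].append(vec_id)
    return cells


def ivf_search(query, vectors, centroids, cells, nprobe=1, k=2):
    """Score only the vectors stored in the nprobe cells closest to the query."""
    ranked_cells = sorted(
        range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True
    )
    candidates = [i for c in ranked_cells[:nprobe] for i in cells[c]]
    scored = sorted(((dot(query, vectors[i]), i) for i in candidates), reverse=True)
    return scored[:k]


# Made-up 2-d vectors; hand-picked centroids stand in for the trained ones
vectors = [normalize(v) for v in [[1, 0.1], [0.9, 0.2], [0.1, 1], [0.2, 0.9], [-1, 0.1]]]
centroids = [normalize(c) for c in [[1, 0], [0, 1], [-1, 0]]]
cells = build_ivf(vectors, centroids)

query = normalize([1, 0.15])
print(cells)
print(ivf_search(query, vectors, centroids, cells, nprobe=1))
```

With nprobe=1 only the two vectors in the nearest cell are scored; raising nprobe scans more cells, which recovers neighbours that sit just across a cell boundary at the cost of more comparisons.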

Reranking with cross-encoders

A cross-encoder takes (query, passage) together and outputs a relevance score. It is slower than a bi-encoder but significantly more accurate — use it to rerank a small candidate set returned by a bi-encoder first pass.

python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "best way to monitor Linux memory usage"
candidates = [
    "vmstat reports virtual memory statistics.",
    "Use free -h to view RAM and swap usage.",
    "The ls command lists directory contents.",
    "htop is an interactive process viewer with memory columns.",
    "Python garbage collection controls object lifetimes.",
]

# Returns array of floats (relevance logits)
scores = reranker.predict([(query, c) for c in candidates])

ranked = sorted(zip(scores, candidates), reverse=True)
for score, text in ranked:
    print(f"{score:7.2f}  {text}")
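
Because the scores above are unbounded logits, a fixed cut-off is easier to reason about after mapping them to the 0-1 range. A minimal sketch, assuming a single-score reranker; the logits are made up and `sigmoid` is a local helper, not a library call.

```python
import math


def sigmoid(logit: float) -> float:
    """Map an unbounded relevance logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up logits standing in for reranker.predict(...) output
example_logits = [7.2, 0.0, -4.5]
probabilities = [sigmoid(x) for x in example_logits]
print([round(p, 3) for p in probabilities])
```

The sigmoid is monotonic, so the ranking is unchanged; only the scale of the scores differs.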

Two-stage retrieval pipeline

python
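from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_model = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def two_stage_search(query: str, corpus: list[str], corpus_emb, bi_k: int = 50, final_k: int = 5):
    # Stage 1: fast bi-encoder retrieval over precomputed corpus embeddings
    q_emb = bi_model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=bi_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: accurate cross-encoder reranking of the candidates only
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)

    # Sort by score, highest first, and keep the best final_k
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:final_k]
    return ranked

# corpus_emb must come from the same bi-encoder, for example:
# corpus_emb = bi_model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)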

Clustering

Clustering groups documents by topic without labelled data. Embeddings act as features; standard algorithms (K-Means, agglomerative, community detection) then identify structure in the vector space.

K-Means clustering

python
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Configure nginx reverse proxy",
    "Set up Apache virtual hosts",
    "nginx SSL certificate setup",
    "Train a neural network with PyTorch",
    "Fine-tune BERT on custom dataset",
    "Gradient descent explained",
    "Kubernetes deployment manifests",
    "Docker multi-stage builds",
    "Container orchestration with Helm",
]

emb = model.encode(docs, normalize_embeddings=True)

km = KMeans(n_clusters=3, random_state=42, n_init="auto")
labels = km.fit_predict(emb)

from collections import defaultdict
clusters = defaultdict(list)
for label, doc in zip(labels, docs):
    clusters[label].append(doc)

for cluster_id, items in clusters.items():
    print(f"\n--- Cluster {cluster_id} ---")
    for item in items:
        print(f"  {item}")

Agglomerative clustering (no fixed K)

Agglomerative clustering merges the closest pair of clusters bottom-up. Unlike K-Means you do not specify K in advance — instead you cut the resulting dendrogram at a cosine-distance threshold.

python
from sentence_transformers import util
import numpy as np

# Compute all-pairs cosine similarity
cos_scores = util.cos_sim(emb, emb).numpy()

# scipy linkage
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

distance_matrix = 1 - cos_scores
np.fill_diagonal(distance_matrix, 0)
condensed = squareform(distance_matrix, checks=False)

Z = linkage(condensed, method="average")
labels = fcluster(Z, t=0.35, criterion="distance")  # 0.35 = cosine dist threshold

Community detection (built-in fast clustering)

util.community_detection is a greedy algorithm built into sentence-transformers that finds dense regions in the similarity graph. It requires only a similarity threshold and cluster size minimum — no target cluster count.

python
from sentence_transformers import util

clusters = util.community_detection(
    emb,
    min_community_size=2,
    threshold=0.75,        # cosine similarity threshold
)

for i, cluster in enumerate(clusters):
    print(f"\nCluster {i+1}:")
    for idx in cluster:
        print(f"  [{idx}] {docs[idx]}")

Semantic textual similarity (STS) benchmark evaluation

The STS benchmark is a standard dataset of sentence pairs rated 0–5 for similarity. Evaluating against it gives a Spearman correlation score you can use to compare models or track fine-tuning progress on a held-out set.

python
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

stsb = load_dataset("mteb/stsbenchmark-sts", split="test")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=stsb["sentence1"],
    sentences2=stsb["sentence2"],
    scores=[s / 5.0 for s in stsb["score"]],   # normalize 0–5 → 0–1
    name="stsb-test",
)

result = evaluator(model)
print(result)  # dict of Pearson/Spearman metrics in v3+; a single Spearman score in v2.x
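
To see what the Spearman score measures, the sketch below computes it by hand on a few made-up values. It illustrates the metric only and is not the evaluator's implementation; `ranks`, `pearson`, and `spearman` are local helpers.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gold = [0.9, 0.1, 0.5, 0.7]          # made-up human ratings, scaled to 0-1
predicted = [0.8, 0.3, 0.35, 0.6]    # made-up model cosine similarities
print(spearman(gold, predicted))
```

The predicted values differ from the gold ratings but order the pairs identically, so the correlation is perfect. Spearman rewards getting the ranking right, not matching the absolute scores.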

Matryoshka embeddings (truncatable dimensions)

Matryoshka Representation Learning (MRL) trains a model so the first N dimensions are already meaningful. You can trade accuracy for speed by reducing dimensionality without retraining the model.

python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Full 1024-dim (highest accuracy)
full_emb = model.encode("Hello world", normalize_embeddings=True)

# Truncated 256-dim (4× smaller index, ~1–2% accuracy drop)
small_emb = full_emb[:256]
small_emb = small_emb / np.linalg.norm(small_emb)   # re-normalize after truncation

print(full_emb.shape)   # (1024,)
print(small_emb.shape)  # (256,)
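
The storage saving from truncation is simple arithmetic: a flat float32 index needs roughly num_vectors × dim × 4 bytes. A small sketch, with a made-up corpus size of one million vectors and a hypothetical `index_size_mb` helper:

```python
def index_size_mb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate size of a flat float32 index in megabytes (10^6 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e6


for dim in (1024, 512, 256, 64):
    print(dim, index_size_mb(1_000_000, dim))
```

This ignores index overhead and metadata, so treat the numbers as a lower bound.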

Multi-modal embeddings (CLIP)

CLIP (Contrastive Language–Image Pre-Training) projects both text and images into a shared embedding space. You can use it to score image-text relevance or search an image collection with a natural-language query.

python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

# Encode text
text_emb = model.encode(["a photo of a cat", "a photo of a dog"])

# Encode image
img = Image.open("cat.jpg")
img_emb = model.encode(img)

# Cross-modal similarity
sim_cat = util.cos_sim(img_emb, text_emb[0])   # text: "cat"
sim_dog = util.cos_sim(img_emb, text_emb[1])   # text: "dog"
print(f"cat: {sim_cat.item():.3f}, dog: {sim_dog.item():.3f}")

Fine-tuning

Fine-tuning adapts a pre-trained embedding model to your domain using labelled sentence pairs. Even a few hundred high-quality examples can meaningfully improve retrieval quality on domain-specific text.

Prepare training data

Fine-tuning requires sentence pairs with labels. Three common formats:

python
from sentence_transformers import InputExample

# Format 1: (sentence_a, sentence_b, similarity_score 0–1)
examples_sts = [
    InputExample(texts=["My new laptop arrived.", "I got a new computer today."], label=0.9),
    InputExample(texts=["The server crashed.", "Pizza is delicious."], label=0.05),
]

# Format 2: positive pairs only (NLI / natural paraphrases)
examples_pos = [
    InputExample(texts=["Cancel my subscription", "I want to stop my plan"]),
    InputExample(texts=["Reset password", "Forgot login credentials"]),
]

# Format 3: triplets (anchor, positive, negative)
examples_triplet = [
    InputExample(texts=["Linux firewall rules", "iptables cheat sheet", "Windows registry keys"]),
]

CosineSimilarity loss (STS-style)

CosineSimilarityLoss trains the model to match predicted cosine similarity against a float label in [0, 1]. Use it when you have explicit numeric similarity ratings such as human-rated STS pairs.

python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

loader = DataLoader(examples_sts, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./my-finetuned-model",
    show_progress_bar=True,
)
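
The objective itself is small enough to write out. This sketch assumes the loss is a mean squared error between the cosine similarity and the label, and uses made-up 2-d vectors in place of model outputs; it is an illustration, not the library's code.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(batch):
    """Mean squared error between cosine(u, v) and the gold label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in batch]
    return sum(errors) / len(errors)


# Made-up 2-d "embeddings" with gold similarity labels
well_fitted = [([1.0, 0.0], [1.0, 0.0], 1.0), ([1.0, 0.0], [0.0, 1.0], 0.0)]
badly_fitted = [([1.0, 0.0], [0.0, 1.0], 1.0), ([1.0, 0.0], [1.0, 0.0], 0.0)]

print(cosine_similarity_loss(well_fitted), cosine_similarity_loss(badly_fitted))
```

Training moves the embeddings so that the first case becomes the norm: pairs labelled similar end up with cosine near 1, pairs labelled dissimilar near 0.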

MultipleNegativesRankingLoss (contrastive, pairs only)

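This is usually the most effective loss when you only have positive pairs. For each anchor, the positives of every other example in the batch act as in-batch negatives, so larger batches give a stronger training signal. The negatives are random rather than hard unless you mine hard negatives yourself.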

python
from sentence_transformers import losses

loader = DataLoader(examples_pos, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=5,
    warmup_steps=200,
    output_path="./my-retrieval-model",
)
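
The in-batch-negatives idea can be sketched without the library: build the anchor-by-positive similarity matrix, scale it, and apply cross-entropy with the diagonal as the correct class. The scale of 20 and the cosine similarity are assumptions of this sketch, and `in_batch_ranking_loss` is a hypothetical name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where row i's correct 'class' is positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        log_sum_exp = math.log(sum(math.exp(value) for value in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Made-up 2-d "embeddings": aligned pairs match, swapped pairs do not
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = in_batch_ranking_loss(anchors, aligned)
swapped_loss = in_batch_ranking_loss(anchors, swapped)
print(aligned_loss, swapped_loss)
```

The loss is near zero when each anchor is closest to its own positive and large when another example's positive scores higher, which is the signal that drives retrieval fine-tuning.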

TripletLoss

TripletLoss pulls an anchor closer to a positive example and pushes it away from a negative in the embedding space. Use it when data comes as (anchor, positive, negative) triplets rather than scored pairs.

python
loader = DataLoader(examples_triplet, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model, distance_metric=losses.TripletDistanceMetric.COSINE)

model.fit(train_objectives=[(loader, loss)], epochs=3)
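
The triplet objective is a hinge on the gap between the two distances. A minimal sketch with cosine distance and made-up vectors; the margin of 0.5 is an illustrative value, and `triplet_loss` here is a local function rather than the library class.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(gap + margin, 0.0)


anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]

satisfied = triplet_loss(anchor, positive, negative)   # positive already much closer
violated = triplet_loss(anchor, negative, positive)    # roles swapped on purpose
print(satisfied, violated)
```

Cosine distance lies in [0, 2], so a margin above 2 can never be satisfied; when using a cosine metric it is worth checking which margin your loss is configured with.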

Save and reload

python
# Save
model.save("./my-finetuned-model")

# Load locally
model = SentenceTransformer("./my-finetuned-model")

# Push to Hugging Face Hub
model.push_to_hub("alicedev/my-finetuned-model")

Real-world use cases

1. FAQ / support ticket routing

python
faqs = {
    "How do I reset my password?": "visit /account/reset",
    "Where is my order?": "check /orders with your tracking number",
    "How do I cancel my subscription?": "go to /billing and click Cancel",
    "What payment methods do you accept?": "Visa, MasterCard, PayPal",
}

faq_texts = list(faqs.keys())
faq_answers = list(faqs.values())
faq_emb = model.encode(faq_texts, normalize_embeddings=True, convert_to_tensor=True)

def route_ticket(user_query: str, threshold: float = 0.65):
    q_emb = model.encode(user_query, normalize_embeddings=True, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, faq_emb, top_k=1)[0]
    hit = hits[0]
    if hit["score"] >= threshold:
        return faq_answers[hit["corpus_id"]]
    return "Escalate to human agent — no FAQ match above threshold."

print(route_ticket("I forgot my login"))          # → password reset answer
print(route_ticket("Do you take credit cards?"))  # → payment methods answer
print(route_ticket("Tell me a joke"))             # → escalate

2. Duplicate / near-duplicate detection

python
from sentence_transformers import util

def find_duplicates(texts: list[str], threshold: float = 0.92):
    # paraphrase_mining encodes the texts itself, so no separate encode() call is needed
    pairs = util.paraphrase_mining(model, texts, top_k=10)
    duplicates = [(i, j, s) for s, i, j in pairs if s >= threshold]
    return duplicates

tickets = [
    "Server is down",
    "The server is not responding",
    "Production outage — site unreachable",
    "Cannot connect to production server",
    "Scheduled maintenance complete",
]

for i, j, score in find_duplicates(tickets):
    print(f"{score:.3f}  '{tickets[i]}' ↔ '{tickets[j]}'")

3. Semantic code search

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

code_snippets = [
    "def connect_db(host, port, db): return psycopg2.connect(host=host, port=port, dbname=db)",
    "async def fetch_user(user_id: int) -> dict: ...",
    "for item in items: cache[item.id] = item",
    "class RetryPolicy: def __init__(self, max_retries=3, backoff=2.0): ...",
    "df.groupby('department')['salary'].mean()",
]

code_emb = model.encode(code_snippets, normalize_embeddings=True, convert_to_tensor=True)

def search_code(nl_query: str):
    q_emb = model.encode(nl_query, normalize_embeddings=True, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, code_emb, top_k=3)[0]
    for hit in hits:
        print(f"  {hit['score']:.3f}  {code_snippets[hit['corpus_id']]}")

search_code("connect to postgres database")
search_code("retry with exponential backoff")
search_code("average salary by department")

4. RAG embedding pipeline

python
import json
from pathlib import Path
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-mpnet-base-v2")

def build_index(chunks: list[dict], output_dir: str = "./index"):
    Path(output_dir).mkdir(exist_ok=True)

    texts = [c["text"] for c in chunks]
    emb = model.encode(texts, batch_size=64, normalize_embeddings=True, show_progress_bar=True)

    np.save(f"{output_dir}/embeddings.npy", emb)
    with open(f"{output_dir}/metadata.json", "w") as f:
        json.dump(chunks, f)

    print(f"Indexed {len(chunks)} chunks → {output_dir}")

def query_index(query: str, index_dir: str = "./index", top_k: int = 5):
    emb = np.load(f"{index_dir}/embeddings.npy")
    with open(f"{index_dir}/metadata.json") as f:
        metadata = json.load(f)

    q_emb = model.encode(query, normalize_embeddings=True)
    scores = emb @ q_emb
    top_idx = np.argsort(scores)[::-1][:top_k]

    return [(scores[i], metadata[i]) for i in top_idx]
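
build_index expects a list of dicts with a "text" key but leaves chunking open. One simple option is overlapping word windows, sketched below; `chunk_words` and the "source" and "start_word" metadata fields are hypothetical, and the window sizes are illustrative rather than tuned.

```python
def chunk_words(text: str, source: str, max_words: int = 200, overlap: int = 40) -> list[dict]:
    """Split text into overlapping word windows, one dict per chunk."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append({"text": " ".join(window), "source": source, "start_word": start})
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(sample, source="demo.txt", max_words=4, overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

The resulting list can be passed straight to build_index. Word counts are only a rough proxy for model tokens, so keep windows comfortably below the model's maximum sequence length.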

5. Zero-shot classification with NLI

Use a cross-encoder trained on NLI to classify text into custom labels without any labelled examples.

python
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-small")

text = "The quarterly earnings exceeded analyst expectations by 12%."
candidate_labels = ["finance", "sports", "technology", "politics", "healthcare"]

pairs = [(text, f"This text is about {label}.") for label in candidate_labels]
scores = nli.predict(pairs, apply_softmax=True)

# scores shape: (n_labels, 3); this model's label order is contradiction, entailment, neutral
entailment_scores = scores[:, 1]   # index 1 = entailment
ranked = sorted(zip(entailment_scores, candidate_labels), reverse=True)

for score, label in ranked:
    print(f"  {score:.3f}  {label}")
# 0.921  finance
# 0.041  technology

6. Semantic de-duplication of a dataset

python
from sentence_transformers import SentenceTransformer, util
import torch

def deduplicate(texts: list[str], threshold: float = 0.90) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True, convert_to_tensor=True)

    kept = []
    kept_emb = []

    for text, vec in zip(texts, emb):
        if not kept_emb:
            kept.append(text)
            kept_emb.append(vec)
            continue

        stack = torch.stack(kept_emb)
        sims = util.cos_sim(vec.unsqueeze(0), stack)[0]
        if sims.max().item() < threshold:
            kept.append(text)
            kept_emb.append(vec)

    print(f"Kept {len(kept)}/{len(texts)} after deduplication at threshold={threshold}")
    return kept

Performance tips

| Tip | Impact |
| --- | --- |
| normalize_embeddings=True at encode time | Avoids re-normalizing later; enables dot product as cosine similarity |
| convert_to_tensor=True | Returns PyTorch tensor on GPU if available; faster for downstream ops |
| batch_size=64–256 on GPU | Saturates GPU throughput; tune based on VRAM |
| Use all-MiniLM-L6-v2 for prototyping | 5× faster than mpnet, within ~5% on most benchmarks |
| FAISS IVF or HNSW for >100 k docs | Brute-force cosine doesn't scale; IVF gives 10–100× speedup |
| Cache embeddings to disk | Re-encode only new documents, not the full corpus |
| Pin torch version | Transitive upgrades can silently change GPU behaviour |
| model.half() (fp16) | Halves VRAM; negligible accuracy loss on most models |

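python
from sentence_transformers import SentenceTransformer

# fp16 inference example (intended for GPU; half precision on CPU is slow or unsupported for some ops)
model = SentenceTransformer("all-mpnet-base-v2")
model.half()   # convert weights to float16

texts = ["Linux process management with systemd", "SSH key-based authentication on Ubuntu"]
emb = model.encode(texts, convert_to_tensor=True)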
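
The "cache embeddings" tip can be sketched as a dictionary keyed by a hash of the text, so that only unseen texts reach the encoder. Everything here is illustrative: `encode_with_cache` is a hypothetical helper, and `fake_encode` stands in for model.encode so the sketch runs without downloading a model.

```python
import hashlib


def text_key(text: str) -> str:
    """Stable cache key for a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def encode_with_cache(texts, encode_fn, cache):
    """Encode only texts missing from cache; return vectors in input order."""
    missing = [t for t in dict.fromkeys(texts) if text_key(t) not in cache]
    if missing:
        for text, vec in zip(missing, encode_fn(missing)):
            cache[text_key(text)] = vec
    return [cache[text_key(t)] for t in texts]


# Stand-in encoder that records which batches it was asked to encode
calls = []

def fake_encode(batch):
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]


cache = {}
first = encode_with_cache(["a", "bb"], fake_encode, cache)
second = encode_with_cache(["bb", "ccc"], fake_encode, cache)
print(calls)
```

On the second call only the new text is encoded. In a real pipeline the cache would be persisted to disk and keyed by model name as well, since embeddings from different checkpoints are not interchangeable.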

Common errors

| Error | Cause | Fix |
| --- | --- | --- |
| CUDA out of memory | Batch too large | Reduce batch_size; use model.half() |
| RuntimeError: stack expects each tensor to be equal size | Variable-length inputs in manual batching | Use model.encode(), which handles padding automatically |
| Dot-product scores far outside [-1, 1] | Embeddings not normalized | Pass normalize_embeddings=True, or score with util.cos_sim |
| Similarity scores all near 0.0 | Mixed models between index build and query | Always use the exact same model checkpoint for both |
| OSError: Can't load tokenizer for 'model-name' | Model not on Hugging Face Hub or local path typo | Check spelling; try SentenceTransformer("sentence-transformers/model-name") |
| Cross-encoder scores are logits, not probabilities | Default predict() output | Pass apply_softmax=True for multi-class models (e.g. NLI); apply a sigmoid to the scores of single-score rerankers |