cheat sheet

Sentence Transformers

Comprehensive reference for the sentence-transformers Python library — embeddings, similarity, clustering, retrieval, fine-tuning, and popular models (BGE, E5, GTE, Nomic, Jina).

updated 05-02-2026

Sentence Transformers — Embeddings, Search & Fine-Tuning

sentence-transformers is the de facto Python library for computing dense vector embeddings from text (and images). It wraps Hugging Face Transformers and exposes a clean API for semantic similarity, search, clustering, reranking, and fine-tuning.

GitHub: UKPLab/sentence-transformers
Docs: sbert.net
PyPI: pip install sentence-transformers

What it is

Sentence Transformers (SBERT) is a Python library built on top of Hugging Face Transformers that turns sentences, paragraphs, or images into fixed-length dense vectors suitable for similarity, search, clustering, and reranking. Reach for it whenever you need semantic comparisons of text — RAG retrieval, deduplication, recommendation, classification with frozen embeddings, or fine-tuned domain-specific encoders — without writing custom training loops.

Why sentence-transformers?

Feature	What it means in practice
Bi-encoder architecture	Single forward pass per sentence → fast, scalable to millions of docs
Pre-trained SBERT models	High-quality out-of-the-box without any fine-tuning
Cross-encoder support	Accurate reranking on top of bi-encoder retrieval
Hugging Face Hub integration	One-line model loading from 7 000+ compatible checkpoints
Multi-modal support	Some models encode both text and images in the same space
Built-in fine-tuning API	Contrastive, cosine-similarity, triplet, and MSE losses included
Batch & device awareness	Automatic GPU/MPS/CPU dispatch; batched encoding via the `batch_size` argument
Dimensionality flexibility	Matryoshka models let you truncate embeddings without retraining

Installation

bash

# Minimal
pip install sentence-transformers

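# With GPU (CUDA 12.1): install the CUDA build of torch first, then the library.
# Passing --index-url to a combined install would hide sentence-transformers,
# which is only published on PyPI.
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install sentence-transformers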

# With FAISS for large-scale ANN search
pip install sentence-transformers faiss-gpu   # GPU
pip install sentence-transformers faiss-cpu   # CPU-only

Output:

text

Successfully installed sentence-transformers-3.0.1 torch-2.3.0 transformers-4.41.2 huggingface-hub-0.23.2 tokenizers-0.19.1

Model selection guide

Pick by use-case, not just benchmark score.

Use case	Recommended model	Dim	Notes
General English semantic similarity	`all-MiniLM-L6-v2`	384	Fast, very popular baseline
Best English quality	`all-mpnet-base-v2`	768	~2× slower than MiniLM, better accuracy
Best retrieval accuracy (English)	`BAAI/bge-large-en-v1.5`	1024	Top MTEB English; requires query prefix
Prefix-based retrieval	`intfloat/e5-large-v2`	1024	Requires `query:` / `passage:` prefixes
Multilingual + hybrid retrieval	`BAAI/bge-m3`	1024	100+ languages; dense + sparse + ColBERT
Multilingual (50+ languages)	`paraphrase-multilingual-mpnet-base-v2`	768	Good cross-lingual retrieval
Multilingual instruction-tuned	`intfloat/multilingual-e5-large-instruct`	1024	100+ languages, task-prefix format
Long documents (8 k tokens)	`nomic-ai/nomic-embed-text-v1.5`	768	Rotary position embeddings; Apache-2
Long context + task prefixes	`jinaai/jina-embeddings-v3`	1024	8 k tokens, 89 languages, Matryoshka
Matryoshka / truncatable	`mixedbread-ai/mxbai-embed-large-v1`	1024→64	Truncate to any dim at query time
Highest accuracy (LLM-backed)	`Alibaba-NLP/gte-Qwen2-7B-instruct`	3584	~14 GB VRAM; near-SOTA on MTEB
Code + text	`flax-sentence-embeddings/st-codesearch-distilroberta-base`	768	Semantic code search
Text + image	`clip-ViT-B-32`	512	CLIP multi-modal space
Reranking (cross-encoder, English)	`cross-encoder/ms-marco-MiniLM-L-6-v2`	scalar	Use on top of bi-encoder results
Reranking (multilingual)	`jinaai/jina-reranker-v2-base-multilingual`	scalar	100+ languages cross-encoder reranker

Model prefix requirements — quick reference

Several high-quality models require task-specific text prefixes on queries or passages. Using them incorrectly silently degrades retrieval quality.

Model	Query prefix	Passage prefix
`all-MiniLM-L6-v2`	none	none
`all-mpnet-base-v2`	none	none
`BAAI/bge-*-en-v1.5`	`"Represent this sentence for searching relevant passages: "`	none
`BAAI/bge-m3`	none	none
`intfloat/e5-*-v2`	`"query: "`	`"passage: "`
`intfloat/multilingual-e5-large-instruct`	`"Instruct: {task}\nQuery: "`	none
`Alibaba-NLP/gte-large-en-v1.5`	none	none
`Alibaba-NLP/gte-Qwen2-7B-instruct`	`"Instruct: {task}\nQuery: "`	none
`nomic-ai/nomic-embed-text-v1.5`	`"search_query: "`	`"search_document: "`
`jinaai/jina-embeddings-v3`	`task="retrieval.query"` kwarg	`task="retrieval.passage"` kwarg
`mixedbread-ai/mxbai-embed-large-v1`	`"Represent this sentence for searching relevant passages: "`	none
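
One way to avoid forgetting a prefix is to route every string through a small lookup before encoding. The sketch below is a minimal illustration, not part of sentence-transformers: `PREFIXES` and `add_prefix` are hypothetical names, and the prefix strings are copied from the table above for a few of the text-prefix models.

```python
# Hypothetical helper (not a library API): maps a model name to the
# (query_prefix, passage_prefix) pair listed in the table above.
PREFIXES = {
    "BAAI/bge-large-en-v1.5": ("Represent this sentence for searching relevant passages: ", ""),
    "intfloat/e5-large-v2": ("query: ", "passage: "),
    "nomic-ai/nomic-embed-text-v1.5": ("search_query: ", "search_document: "),
    "mixedbread-ai/mxbai-embed-large-v1": ("Represent this sentence for searching relevant passages: ", ""),
}


def add_prefix(model_name: str, text: str, is_query: bool) -> str:
    """Prepend the model's query or passage prefix; unknown models get none."""
    query_prefix, passage_prefix = PREFIXES.get(model_name, ("", ""))
    return (query_prefix if is_query else passage_prefix) + text


print(add_prefix("intfloat/e5-large-v2", "how do I manage linux services", is_query=True))
print(add_prefix("intfloat/e5-large-v2", "Linux process management with systemd", is_query=False))
print(add_prefix("all-MiniLM-L6-v2", "no prefix needed", is_query=True))
```

The returned strings can be passed straight to `model.encode(...)`. Models that take a keyword argument instead of a text prefix (Jina v3) would need a different wrapper.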

BGE models (BAAI)

BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence have ranked near the top of the MTEB leaderboard and load directly into sentence-transformers.

Standard BGE retrieval models

BGE retrieval models expect a short instruction prepended to queries only; passages are encoded as-is. For the v1.5 models BAAI reports only a slight degradation when the instruction is omitted, but it is still recommended when matching short queries against longer passages.

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

passages = [
    "Linux process management with systemd",
    "SSH key-based authentication on Ubuntu",
    "Python virtual environments with venv",
]
passage_emb = model.encode(passages, normalize_embeddings=True)

# Queries require the instruction prefix
instruction = "Represent this sentence for searching relevant passages: "
query_emb = model.encode(
    instruction + "how do I manage linux services",
    normalize_embeddings=True,
)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

BGE model sizes:

Model	Dim	Speed	Best for
`BAAI/bge-small-en-v1.5`	384	Fast	Low-latency, CPU inference
`BAAI/bge-base-en-v1.5`	768	Medium	Balanced accuracy / speed
`BAAI/bge-large-en-v1.5`	1024	Slow	Highest single-model accuracy
`BAAI/bge-m3`	1024	Slow	Multilingual + hybrid retrieval

BGE-M3 — dense retrieval via sentence-transformers

BGE-M3 supports 100+ languages and 8 192 token context. No prefix is required.

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")

passages = [
    "Linux process management with systemd",
    "Gestion des processus Linux avec systemd",   # French
    "Linux-Prozessverwaltung mit systemd",          # German
]
passage_emb = model.encode(passages, normalize_embeddings=True)

# Cross-lingual query — no prefix needed for BGE-M3
query_emb = model.encode(
    "how do I manage linux services",
    normalize_embeddings=True,
)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

BGE-M3 — dense + sparse + ColBERT via FlagEmbedding

For full hybrid capability (dense vector + lexical sparse + late-interaction ColBERT), use the FlagEmbedding library instead.

bash

pip install FlagEmbedding

Output:

text

Successfully installed FlagEmbedding-1.2.10

python

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

passages = ["Linux systemd service management", "SSH tunneling guide"]
queries  = ["how to restart a linux service"]

p_out = model.encode(passages, return_dense=True, return_sparse=True, return_colbert_vecs=True)
q_out = model.encode(queries,  return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Individual retrieval modes, scored per (query, passage) pair
dense_scores = q_out["dense_vecs"] @ p_out["dense_vecs"].T   # shape: (n_queries, n_passages)

for i, passage in enumerate(passages):
    dense = float(dense_scores[0][i])
    # compute_lexical_matching_score compares one query dict with one passage dict
    sparse = model.compute_lexical_matching_score(
        q_out["lexical_weights"][0], p_out["lexical_weights"][i]
    )
    colbert = float(model.colbert_score(q_out["colbert_vecs"][0], p_out["colbert_vecs"][i]))

    # Hybrid: weighted combination (tune weights for your domain)
    hybrid = 0.4 * dense + 0.2 * sparse + 0.4 * colbert
    print(f"  {hybrid:.3f}  {passage}")

BGE reranking

python

from sentence_transformers import CrossEncoder

# Chinese/English reranker
reranker = CrossEncoder("BAAI/bge-reranker-large")

query = "how to restart a linux service"
candidates = [
    "Use systemctl restart <service> to restart a service.",
    "The ls command lists files in a directory.",
    "Rebooting the machine will restart all services.",
]

scores = reranker.predict([(query, c) for c in candidates])
for score, text in sorted(zip(scores, candidates), reverse=True):
    print(f"  {score:.3f}  {text}")

E5 models (Microsoft / intfloat)

E5 (Embeddings from bidirectional Encoder representations) models consistently rank near the top of MTEB. They require "query: " and "passage: " prefixes — omitting them causes a meaningful accuracy drop.

E5 English retrieval

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

# Passages: always prefix with "passage: "
passages = [
    "passage: Linux process management with systemd",
    "passage: SSH key-based authentication on Ubuntu",
    "passage: Python virtual environments with venv",
]

# Queries: always prefix with "query: "
query = "query: how do I manage linux services"

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

Multilingual E5 with task instruction

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

def e5_query(task: str, query: str) -> str:
    return f"Instruct: {task}\nQuery: {query}"

task = "Given a question, retrieve passages that answer the question"

# Cross-lingual: German query against English passages.
# The instruct model takes no passage prefix; only queries carry the instruction.
passages = [
    "Linux process management with systemd",
    "SSH key-based authentication guide",
]
query = e5_query(task, "wie verwalte ich Linux-Dienste")   # German

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

E5 model variants:

Model	Dim	Notes
`intfloat/e5-small-v2`	384	Fast, English only
`intfloat/e5-base-v2`	768	Balanced, English only
`intfloat/e5-large-v2`	1024	Best English quality
`intfloat/multilingual-e5-large-instruct`	1024	100+ languages, instruction-tuned

GTE models (Alibaba / Tongyi)

GTE (General Text Embeddings) are strong all-round performers. Standard GTE models need no prefix. The instruction-tuned gte-Qwen2 variant uses the same "Instruct: ... \nQuery: " format as multilingual E5.

GTE standard

python

from sentence_transformers import SentenceTransformer, util

# trust_remote_code is required: this model ships custom modeling code on the Hub
model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

passages = [
    "Linux process management with systemd",
    "SSH key-based authentication on Ubuntu",
]
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode("how to restart a linux service", normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

GTE-Qwen2 (LLM-backed, near-SOTA)

GTE-Qwen2-7B-instruct replaces the BERT backbone with a 7 B-parameter LLM, achieving near-SOTA MTEB scores. It requires trust_remote_code=True and roughly 14 GB of VRAM.

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-7B-instruct",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "auto"},
)

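# Passages take no prefix; only the query carries the instruction
passages = [
    "Linux process management with systemd",
    "SSH key-based authentication on Ubuntu",
]

task = "Given a question, retrieve passages that answer the question"
query = f"Instruct: {task}\nQuery: how do I manage linux services"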

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

GTE model sizes:

Model	Dim	VRAM	Notes
`Alibaba-NLP/gte-small`	384	~1 GB	Fast baseline
`Alibaba-NLP/gte-base-en-v1.5`	768	~1.5 GB	Balanced
`Alibaba-NLP/gte-large-en-v1.5`	1024	~3 GB	Best mid-size English
`Alibaba-NLP/gte-Qwen2-7B-instruct`	3584	~14 GB	Near-SOTA, LLM backbone

Nomic Embed (long context, open license)

nomic-embed-text-v1.5 supports 8 192 tokens per document and is fully open (Apache-2). It uses task-name prefixes rather than free-text instructions.

python

from sentence_transformers import SentenceTransformer, util

# trust_remote_code required for the custom RoPE implementation
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Task prefixes: search_query / search_document / clustering / classification
passages = [
    "search_document: Linux process management with systemd",
    "search_document: SSH key-based authentication guide",
    "search_document: Python virtual environments and pip",
]
query = "search_query: how to manage linux services"

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb   = model.encode(query,    normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

Available task prefixes:

Prefix	Use for
`search_query:`	User queries at retrieval time
`search_document:`	Passages / documents in the index
`clustering:`	Grouping documents by topic
`classification:`	Text classification tasks

Jina Embeddings v3

Jina v3 exposes task type as a keyword argument rather than a text prefix, and supports Matryoshka dimension truncation down to 32 dims.

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

passages = [
    "Linux process management with systemd",
    "SSH key-based authentication guide",
    "Python virtual environments and pip",
]
query = "how to manage linux services"

# task kwarg selects the task-specific LoRA adapter
passage_emb = model.encode(passages, task="retrieval.passage", normalize_embeddings=True)
query_emb   = model.encode(query,    task="retrieval.query",   normalize_embeddings=True)

hits = util.semantic_search(query_emb, passage_emb, top_k=3)[0]
for hit in hits:
    print(f"  {hit['score']:.3f}  {passages[hit['corpus_id']]}")

Available task values:

Task value	Use for
`retrieval.query`	Query-side encoding
`retrieval.passage`	Passage/document encoding
`separation`	Clustering and reranking applications
`classification`	Text classification
`text-matching`	Symmetric similarity (STS, paraphrase)

Jina Matryoshka truncation

Jina v3's Matryoshka support lets you truncate the full 1024-dim output to any smaller size at query time, reducing index storage with minimal accuracy loss.

python

import numpy as np

full_emb = model.encode("Hello world", task="retrieval.query", normalize_embeddings=True)

# Truncate to 256 dims (4× smaller index, minimal quality loss)
small_emb = full_emb[:256]
small_emb = small_emb / np.linalg.norm(small_emb)
print(small_emb.shape)  # (256,)

Core API

Encode sentences

python

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast auburn fox leaps across a sleeping canine.",
    "Python is a high-level programming language.",
]

# Returns numpy array: (3, 384)
embeddings = model.encode(sentences)

print(embeddings.shape)        # (3, 384)
print(embeddings[0][:5])       # first 5 dims of sentence 0

Normalize for cosine similarity

python

embeddings = model.encode(sentences, normalize_embeddings=True)
# After normalization, dot product == cosine similarity
scores = embeddings @ embeddings.T
print(scores)
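
The claim that a dot product equals cosine similarity after normalization can be checked by hand. The following is a pure-Python sketch on two made-up vectors; the helper names `l2_normalize`, `dot`, and `cosine` are illustrative, not library functions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(round(cosine(a, b), 6))            # 0.733333
print(round(dot(a_unit, b_unit), 6))     # 0.733333
```

This is why indexes built on normalized embeddings can use a plain inner-product metric.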

Batch encode with progress bar

python

embeddings = model.encode(
    large_list,
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True,
)

Semantic similarity

Semantic similarity measures how closely related two pieces of text are in meaning, independent of exact wording. Use it to find paraphrases, score answer relevance, or detect near-duplicate content.

Sentence pair similarity

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

s1 = "How do I reset my password?"
s2 = "What is the process for changing login credentials?"
s3 = "Show me pictures of cats."

emb = model.encode([s1, s2, s3], normalize_embeddings=True)

print(util.cos_sim(emb[0], emb[1]).item())  # ~0.82 — high similarity
print(util.cos_sim(emb[0], emb[2]).item())  # ~0.11 — low similarity

All-pairs similarity matrix

python

from sentence_transformers import util

# paraphrase_mining is faster than brute-force for large lists
pairs = util.paraphrase_mining(model, sentences, top_k=5)

for score, i, j in pairs:
    print(f"{score:.3f}  [{i}] {sentences[i]}")
    print(f"       [{j}] {sentences[j]}")

Semantic search

Semantic search retrieves documents by meaning rather than keyword overlap. A query is encoded into the same embedding space as the corpus, then nearest-neighbour lookup finds the most relevant passages.

Brute-force cosine search (small corpora)

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Linux process management with systemd",
    "How to configure SSH key-based authentication",
    "Python virtual environments and pip",
    "Kubernetes pod scheduling and affinity rules",
    "Z/OS JCL syntax reference",
]

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

def search(query: str, top_k: int = 3):
    q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    for hit in hits:
        print(f"  {hit['score']:.3f}  {corpus[hit['corpus_id']]}")

search("SSH tunneling on Linux")
# 0.741  How to configure SSH key-based authentication
# 0.612  Linux process management with systemd
# 0.389  Python virtual environments and pip

FAISS index (millions of docs)

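FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for efficient exact and approximate nearest-neighbour search over dense vectors. Use it when your corpus is too large for brute-force cosine search in Python (typically >100 k documents). The `IndexFlatIP` index below is still an exact search, just a heavily optimised one; the IVF index in the next section is the approximate variant.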

python

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
dim = 384

# --- Build index ---
corpus_emb = model.encode(corpus, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(dim)          # inner product == cosine on normalised vecs
index.add(corpus_emb)

# Optionally persist
faiss.write_index(index, "corpus.index")

# --- Query ---
query_emb = model.encode(["SSH on Linux"], normalize_embeddings=True).astype("float32")
distances, indices = index.search(query_emb, k=5)

for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {corpus[idx]}")

FAISS with IVF (faster at scale)

Inverted File (IVF) indexing partitions the vector space into Voronoi cells. At query time FAISS only searches the nprobe nearest cells, giving a controllable speed-vs-recall trade-off that makes it practical at tens of millions of vectors.

python

# Approximate nearest neighbour — train on a sample, then add all vecs
nlist = 100   # number of Voronoi cells; rule of thumb: sqrt(N)
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

index.train(corpus_emb)
index.add(corpus_emb)
index.nprobe = 10   # cells to visit at query time (speed vs recall tradeoff)
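
To make the `nlist` / `nprobe` trade-off concrete, here is a toy pure-Python version of the same idea. It is a sketch under simplifying assumptions, not how FAISS is implemented: centroids are picked at random instead of being trained with k-means, the data is random, and `ivf_search` is a hypothetical name.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]


random.seed(0)
dim, n_vecs, nlist = 8, 200, 4
vectors = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(n_vecs)]

# "Training": pick nlist vectors as centroids (FAISS runs k-means here).
centroids = random.sample(vectors, nlist)

# "Add": each vector goes into the inverted list of its most similar centroid.
cells = {c: [] for c in range(nlist)}
for idx, vec in enumerate(vectors):
    best_cell = max(range(nlist), key=lambda c: dot(vec, centroids[c]))
    cells[best_cell].append(idx)


def ivf_search(query, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(nlist), key=lambda c: dot(query, centroids[c]), reverse=True)[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    return sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]


query = normalize([random.gauss(0, 1) for _ in range(dim)])
exact = ivf_search(query, k=5, nprobe=nlist)   # visiting every cell == brute force
approx = ivf_search(query, k=5, nprobe=1)      # one cell: less work, may miss neighbours
print("exact :", exact)
print("approx:", approx)
print("recall@5:", len(set(exact) & set(approx)) / 5)
```

Raising `nprobe` scans more cells, so recall moves toward the exact result while query cost grows.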

Reranking with cross-encoders

A cross-encoder takes (query, passage) together and outputs a relevance score. It is slower than a bi-encoder but significantly more accurate — use it to rerank a small candidate set returned by a bi-encoder first pass.

python

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "best way to monitor Linux memory usage"
candidates = [
    "vmstat reports virtual memory statistics.",
    "Use free -h to view RAM and swap usage.",
    "The ls command lists directory contents.",
    "htop is an interactive process viewer with memory columns.",
    "Python garbage collection controls object lifetimes.",
]

# Returns array of floats (relevance logits)
scores = reranker.predict([(query, c) for c in candidates])

ranked = sorted(zip(scores, candidates), reverse=True)
for score, text in ranked:
    print(f"{score:7.2f}  {text}")

Two-stage retrieval pipeline

python

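from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_model = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def two_stage_search(query: str, corpus: list[str], corpus_emb, bi_k: int = 50, final_k: int = 5):
    # Stage 1: fast bi-encoder retrieval over precomputed corpus embeddings
    q_emb = bi_model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=bi_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: accurate cross-encoder reranking of the candidates only
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:final_k]
    return ranked

# corpus_emb must come from the same bi-encoder, e.g.:
# corpus_emb = bi_model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)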

Clustering

Clustering groups documents by topic without labelled data. Embeddings act as features; standard algorithms (K-Means, agglomerative, community detection) then identify structure in the vector space.

K-Means clustering

python

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Configure nginx reverse proxy",
    "Set up Apache virtual hosts",
    "nginx SSL certificate setup",
    "Train a neural network with PyTorch",
    "Fine-tune BERT on custom dataset",
    "Gradient descent explained",
    "Kubernetes deployment manifests",
    "Docker multi-stage builds",
    "Container orchestration with Helm",
]

emb = model.encode(docs, normalize_embeddings=True)

km = KMeans(n_clusters=3, random_state=42, n_init="auto")
labels = km.fit_predict(emb)

from collections import defaultdict
clusters = defaultdict(list)
for label, doc in zip(labels, docs):
    clusters[label].append(doc)

for cluster_id, items in clusters.items():
    print(f"\n--- Cluster {cluster_id} ---")
    for item in items:
        print(f"  {item}")

Agglomerative clustering (no fixed K)

Agglomerative clustering merges the closest pair of clusters bottom-up. Unlike K-Means you do not specify K in advance — instead you cut the resulting dendrogram at a cosine-distance threshold.

python

from sentence_transformers import util
import numpy as np

# Compute all-pairs cosine similarity
cos_scores = util.cos_sim(emb, emb).numpy()

# scipy linkage
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

distance_matrix = 1 - cos_scores
np.fill_diagonal(distance_matrix, 0)
condensed = squareform(distance_matrix, checks=False)

Z = linkage(condensed, method="average")
labels = fcluster(Z, t=0.35, criterion="distance")  # 0.35 = cosine dist threshold

Community detection (built-in fast clustering)

util.community_detection is a greedy algorithm built into sentence-transformers that finds dense regions in the similarity graph. It requires only a similarity threshold and cluster size minimum — no target cluster count.

python

from sentence_transformers import util

clusters = util.community_detection(
    emb,
    min_community_size=2,
    threshold=0.75,        # cosine similarity threshold
)

for i, cluster in enumerate(clusters):
    print(f"\nCluster {i+1}:")
    for idx in cluster:
        print(f"  [{idx}] {docs[idx]}")
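
The greedy idea behind threshold-based community detection can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `greedy_communities` is a hypothetical name and the similarity matrix is made up.

```python
def greedy_communities(sim, threshold=0.75, min_community_size=2):
    """Simplified greedy community detection over a square similarity matrix."""
    n = len(sim)

    # Candidate community for each point: everything at or above the threshold.
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if sim[i][j] >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Take the largest candidates first and drop members already claimed.
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_community_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities


# Toy matrix: items 0-2 are mutually similar, 3-4 are similar, 5 is isolated.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.0],
    [0.9, 1.0, 0.85, 0.2, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.1, 0.3, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.9, 0.3],
    [0.2, 0.1, 0.3, 0.9, 1.0, 0.1],
    [0.0, 0.1, 0.2, 0.3, 0.1, 1.0],
]
print(greedy_communities(sim))  # [[0, 1, 2], [3, 4]]
```

Item 5 is left unclustered because nothing else reaches the threshold, which mirrors how low-similarity documents drop out when `threshold` is set high.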

Semantic textual similarity (STS) benchmark evaluation

The STS benchmark is a standard dataset of sentence pairs rated 0–5 for similarity. Evaluating against it gives a Spearman correlation score you can use to compare models or track fine-tuning progress on a held-out set.

python

from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

stsb = load_dataset("mteb/stsbenchmark-sts", split="test")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=stsb["sentence1"],
    sentences2=stsb["sentence2"],
    scores=[s / 5.0 for s in stsb["score"]],   # normalize 0–5 → 0–1
    name="stsb-test",
)

result = evaluator(model)
print(result)  # v3+ returns a dict of metrics (Pearson and Spearman); older versions return a single float
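
For intuition about what the evaluator reports, Spearman correlation is the Pearson correlation of the ranks of the two score lists. The sketch below computes it in plain Python on made-up gold labels and made-up model scores; the function names are illustrative only.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    # Spearman = Pearson correlation of the rank-transformed values.
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 0.2, 0.5, 0.8, 1.0]            # made-up gold labels scaled to 0-1
predicted = [0.10, 0.35, 0.30, 0.70, 0.95]  # made-up model cosine scores
print(round(spearman(gold, predicted), 3))  # 0.9
```

Only the ordering matters: the model is penalised here for swapping the second and third pairs, not for the absolute values of its scores.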

Matryoshka embeddings (truncatable dimensions)

Matryoshka Representation Learning (MRL) trains a model so the first N dimensions are already meaningful. You can trade accuracy for speed by reducing dimensionality without retraining the model.

python

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Full 1024-dim (highest accuracy)
full_emb = model.encode("Hello world", normalize_embeddings=True)

# Truncated 256-dim (4× smaller index, ~1–2% accuracy drop)
small_emb = full_emb[:256]
small_emb = small_emb / np.linalg.norm(small_emb)   # re-normalize after truncation

print(full_emb.shape)   # (1024,)
print(small_emb.shape)  # (256,)

Multi-modal embeddings with CLIP (text + image)

CLIP (Contrastive Language–Image Pre-Training) projects both text and images into a shared embedding space. You can use it to score image-text relevance or search an image collection with a natural-language query.

python

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

# Encode text
text_emb = model.encode(["a photo of a cat", "a photo of a dog"])

# Encode image
img = Image.open("cat.jpg")
img_emb = model.encode(img)

# Cross-modal similarity
sim_cat = util.cos_sim(img_emb, text_emb[0])   # text: "cat"
sim_dog = util.cos_sim(img_emb, text_emb[1])   # text: "dog"
print(f"cat: {sim_cat.item():.3f}, dog: {sim_dog.item():.3f}")

Fine-tuning

Fine-tuning adapts a pre-trained embedding model to your domain using labelled sentence pairs. Even a few hundred high-quality examples can meaningfully improve retrieval quality on domain-specific text.

Prepare training data

Fine-tuning requires sentence pairs with labels. Three common formats:

python

from sentence_transformers import InputExample

# Format 1: (sentence_a, sentence_b, similarity_score 0–1)
examples_sts = [
    InputExample(texts=["My new laptop arrived.", "I got a new computer today."], label=0.9),
    InputExample(texts=["The server crashed.", "Pizza is delicious."], label=0.05),
]

# Format 2: positive pairs only (NLI / natural paraphrases)
examples_pos = [
    InputExample(texts=["Cancel my subscription", "I want to stop my plan"]),
    InputExample(texts=["Reset password", "Forgot login credentials"]),
]

# Format 3: triplets (anchor, positive, negative)
examples_triplet = [
    InputExample(texts=["Linux firewall rules", "iptables cheat sheet", "Windows registry keys"]),
]

CosineSimilarity loss (STS-style)

CosineSimilarityLoss trains the model to match predicted cosine similarity against a float label in [0, 1]. Use it when you have explicit numeric similarity ratings such as human-rated STS pairs.

python

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

loader = DataLoader(examples_sts, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./my-finetuned-model",
    show_progress_bar=True,
)
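
As a rough picture of the objective being minimised above, the loss is the mean squared error between the cosine similarity of each pair and its label. This is a pure-Python sketch on toy 2-d vectors, not the library's implementation; `cosine_similarity_loss` is a hypothetical name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the gold label."""
    errors = [(cosine(u, v) - label) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)


pairs = [
    ([1.0, 0.0], [1.0, 0.0]),   # same direction, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal, cosine = 0.0
]
print(cosine_similarity_loss(pairs, labels=[1.0, 0.0]))  # 0.0: predictions match labels
print(cosine_similarity_loss(pairs, labels=[0.0, 1.0]))  # 1.0: predictions are the opposite
```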

MultipleNegativesRankingLoss (contrastive, pairs only)

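This is usually the most data-efficient loss for retrieval fine-tuning because it needs positive pairs only. For a given anchor, every other positive in the batch acts as an in-batch negative, so larger batch sizes generally give a stronger training signal. These negatives are random rather than hard; mined hard negatives have to be supplied explicitly.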

python

from sentence_transformers import losses

loader = DataLoader(examples_pos, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=5,
    warmup_steps=200,
    output_path="./my-retrieval-model",
)
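
The in-batch negatives idea can be written out as a cross-entropy over a similarity matrix: row i scores anchor i against every positive in the batch, and the correct "class" is column i. The sketch below is plain Python on toy unit vectors, assuming a similarity scale of 20; `mnr_loss` is a hypothetical name, not the library function.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled similarities; positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Every positive in the batch is scored; all but index i act as negatives.
        logits = [scale * dot(anchor, p) for p in positives]
        log_sum_exp = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the other pair's positive

print(mnr_loss(anchors, aligned))    # close to 0
print(mnr_loss(anchors, swapped))    # close to 20
```

Because the negatives come from the rest of the batch, duplicate positives within a batch would be treated as negatives for each other, which is worth checking when building the dataset.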

TripletLoss

TripletLoss pulls an anchor closer to a positive example and pushes it away from a negative in the embedding space. Use it when data comes as (anchor, positive, negative) triplets rather than scored pairs.

python

loader = DataLoader(examples_triplet, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model, distance_metric=losses.TripletDistanceMetric.COSINE)

model.fit(train_objectives=[(loader, loss)], epochs=3)
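
Written out, the triplet objective is `max(0, d(anchor, positive) - d(anchor, negative) + margin)` with cosine distance `1 - cos`. The following pure-Python sketch uses toy 2-d vectors and a margin of 0.5 chosen for illustration only (it is not the library default); `triplet_loss` here is a hypothetical standalone function.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [1.0, 0.0]
near = [0.9, 0.1]   # almost the same direction as the anchor
far = [0.0, 1.0]    # orthogonal to the anchor

print(triplet_loss(anchor, positive=near, negative=far))   # 0.0: already well separated
print(triplet_loss(anchor, positive=far, negative=near))   # positive: the model would be pushed to fix this
```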

Save and reload

python

# Save
model.save("./my-finetuned-model")

# Load locally
model = SentenceTransformer("./my-finetuned-model")

# Push to Hugging Face Hub
model.push_to_hub("alicedev/my-finetuned-model")

Real-world use cases

1. FAQ / support ticket routing

python

faqs = {
    "How do I reset my password?": "visit /account/reset",
    "Where is my order?": "check /orders with your tracking number",
    "How do I cancel my subscription?": "go to /billing and click Cancel",
    "What payment methods do you accept?": "Visa, MasterCard, PayPal",
}

faq_texts = list(faqs.keys())
faq_answers = list(faqs.values())
faq_emb = model.encode(faq_texts, normalize_embeddings=True, convert_to_tensor=True)

def route_ticket(user_query: str, threshold: float = 0.65):
    q_emb = model.encode(user_query, normalize_embeddings=True, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, faq_emb, top_k=1)[0]
    hit = hits[0]
    if hit["score"] >= threshold:
        return faq_answers[hit["corpus_id"]]
    return "Escalate to human agent — no FAQ match above threshold."

print(route_ticket("I forgot my login"))          # → password reset answer
print(route_ticket("Do you take credit cards?"))  # → payment methods answer
print(route_ticket("Tell me a joke"))             # → escalate

2. Duplicate / near-duplicate detection

python

from sentence_transformers import util

def find_duplicates(texts: list[str], threshold: float = 0.92):
    # paraphrase_mining encodes the texts itself, so no separate encode() call is needed
    pairs = util.paraphrase_mining(model, texts, top_k=10)
    duplicates = [(i, j, s) for s, i, j in pairs if s >= threshold]
    return duplicates

tickets = [
    "Server is down",
    "The server is not responding",
    "Production outage — site unreachable",
    "Cannot connect to production server",
    "Scheduled maintenance complete",
]

for i, j, score in find_duplicates(tickets):
    print(f"{score:.3f}  '{tickets[i]}' ↔ '{tickets[j]}'")

3. Code search

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

code_snippets = [
    "def connect_db(host, port, db): return psycopg2.connect(host=host, port=port, dbname=db)",
    "async def fetch_user(user_id: int) -> dict: ...",
    "for item in items: cache[item.id] = item",
    "class RetryPolicy: def __init__(self, max_retries=3, backoff=2.0): ...",
    "df.groupby('department')['salary'].mean()",
]

code_emb = model.encode(code_snippets, normalize_embeddings=True, convert_to_tensor=True)

def search_code(nl_query: str):
    q_emb = model.encode(nl_query, normalize_embeddings=True, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, code_emb, top_k=3)[0]
    for hit in hits:
        print(f"  {hit['score']:.3f}  {code_snippets[hit['corpus_id']]}")

search_code("connect to postgres database")
search_code("retry with exponential backoff")
search_code("average salary by department")

4. RAG embedding pipeline

python

import json
from pathlib import Path
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-mpnet-base-v2")

def build_index(chunks: list[dict], output_dir: str = "./index"):
    # parents=True lets output_dir be a nested path that does not exist yet
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    texts = [c["text"] for c in chunks]
    emb = model.encode(texts, batch_size=64, normalize_embeddings=True, show_progress_bar=True)

    np.save(f"{output_dir}/embeddings.npy", emb)
    with open(f"{output_dir}/metadata.json", "w") as f:
        json.dump(chunks, f)

    print(f"Indexed {len(chunks)} chunks → {output_dir}")

def query_index(query: str, index_dir: str = "./index", top_k: int = 5):
    emb = np.load(f"{index_dir}/embeddings.npy")
    with open(f"{index_dir}/metadata.json") as f:
        metadata = json.load(f)

    q_emb = model.encode(query, normalize_embeddings=True)
    scores = emb @ q_emb
    top_idx = np.argsort(scores)[::-1][:top_k]

    return [(scores[i], metadata[i]) for i in top_idx]
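
`build_index` expects `chunks` to be a list of dicts that each carry a `text` key. One way to produce that list from a long document is a fixed-size word window with overlap. This is only a sketch: `chunk_text`, its default sizes, and the `source` / `start_word` metadata fields are hypothetical choices for illustration, not something the library prescribes.

```python
def chunk_text(text: str, source: str, chunk_words: int = 200, overlap: int = 40) -> list[dict]:
    """Split text into overlapping word windows, one dict per chunk."""
    if overlap >= chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append({
            "text": " ".join(window),
            "source": source,        # hypothetical metadata field
            "start_word": start,     # hypothetical metadata field
        })
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


doc = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(doc, source="example.txt", chunk_words=4, overlap=1):
    print(chunk["start_word"], chunk["text"])
# 0 w0 w1 w2 w3
# 3 w3 w4 w5 w6
# 6 w6 w7 w8 w9
```

The resulting list can be passed straight to `build_index(chunks)`. Because the metadata is stored as JSON next to the embeddings, any extra fields come back with each hit from `query_index`.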

5. Zero-shot classification with NLI

Use a cross-encoder trained on NLI to classify text into custom labels without any labelled examples.

python

from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-small")

text = "The quarterly earnings exceeded analyst expectations by 12%."
candidate_labels = ["finance", "sports", "technology", "politics", "healthcare"]

pairs = [(text, f"This text is about {label}.") for label in candidate_labels]
scores = nli.predict(pairs, apply_softmax=True)

# scores shape: (n_labels, 3); for this checkpoint the columns are
# contradiction, entailment, neutral (see the label mapping on the model card)
entailment_scores = scores[:, 1]   # index 1 = entailment
ranked = sorted(zip(entailment_scores, candidate_labels), reverse=True)

for score, label in ranked:
    print(f"  {score:.3f}  {label}")
# "finance" should rank first; exact scores depend on the model version
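
The order of the NLI output columns is not the same for every checkpoint, so hard-coding the entailment index is fragile. A safer pattern is to look it up from the model's label mapping. The sketch below assumes the mapping is available as a dict in the style of a Hugging Face config's `id2label` field; `entailment_index` is a hypothetical helper, and how you obtain the mapping from a loaded `CrossEncoder` depends on the library version.

```python
def entailment_index(id2label: dict) -> int:
    """Return the column whose label is 'entailment' (case-insensitive)."""
    for idx, label in id2label.items():
        if str(label).lower() == "entailment":
            return int(idx)
    raise ValueError(f"no entailment label in {id2label}")


# Hypothetical mapping, in the style of a Hugging Face config's id2label field
example_mapping = {0: "contradiction", 1: "entailment", 2: "neutral"}
col = entailment_index(example_mapping)

# Hypothetical softmax rows for two candidate labels
rows = [[0.05, 0.90, 0.05], [0.70, 0.10, 0.20]]
print(col, [row[col] for row in rows])  # 1 [0.9, 0.1]
```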

6. Semantic de-duplication of a dataset

python

from sentence_transformers import SentenceTransformer, util
import torch

def deduplicate(texts: list[str], threshold: float = 0.90) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True, convert_to_tensor=True)

    kept = []
    kept_emb = []

    for text, vec in zip(texts, emb):
        if not kept_emb:
            kept.append(text)
            kept_emb.append(vec)
            continue

        stack = torch.stack(kept_emb)
        sims = util.cos_sim(vec.unsqueeze(0), stack)[0]
        if sims.max().item() < threshold:
            kept.append(text)
            kept_emb.append(vec)

    print(f"Kept {len(kept)}/{len(texts)} after deduplication at threshold={threshold}")
    return kept

Performance tips

Tip	Impact
`normalize_embeddings=True` at encode time	Avoids re-normalizing later; enables dot product as cosine similarity
`convert_to_tensor=True`	Returns PyTorch tensor on GPU if available; faster for downstream ops
`batch_size=64–256` on GPU	Saturates GPU throughput; tune based on VRAM
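Use `all-MiniLM-L6-v2` for prototyping	Roughly 5× faster than `all-mpnet-base-v2`, with somewhat lower quality on most benchmarks
FAISS IVF or HNSW for >100 k docs	Brute-force search cost grows linearly with corpus size; approximate indexes often give 10–100× speedups at a small recall cost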
Cache embeddings to disk	Re-encode only new documents, not the full corpus
Pin `torch` version	Transitive upgrades can silently change GPU behaviour
`model.half()` (fp16)	Halves weight memory on GPU; accuracy loss is negligible for most models

python

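# fp16 inference example (intended for GPU; fp16 on CPU is typically slow)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
model.half()   # convert weights to float16

texts = ["first sentence", "second sentence"]   # placeholder inputs
emb = model.encode(texts, convert_to_tensor=True)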
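
The "cache embeddings to disk" tip can be sketched as follows. This is one possible approach, not a library feature: `encode_with_cache` is a hypothetical helper that keys each text by a content hash and only calls the encoder for texts it has not seen. It takes any callable as `encode_fn` (for a real model, something like `lambda batch: model.encode(batch).tolist()`), and the JSON cache file is an assumption chosen for simplicity; a stub encoder is used here so the example runs without downloading a model.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def encode_with_cache(texts, encode_fn, cache_path="embedding_cache.json"):
    """Encode texts, reusing vectors already stored on disk."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    # Key each text by a content hash so edited documents are re-encoded
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = {k: t for k, t in zip(keys, texts) if k not in cache}

    if missing:
        new_vectors = encode_fn(list(missing.values()))
        for key, vec in zip(missing, new_vectors):
            cache[key] = list(vec)
        path.write_text(json.dumps(cache))

    # Return vectors in input order plus the number of newly encoded texts
    return [cache[k] for k in keys], len(missing)


def fake_encode(batch):
    # Stand-in for a real model: maps each text to a tiny 2-d vector
    return [[float(len(t)), 1.0] for t in batch]


cache_file = os.path.join(tempfile.mkdtemp(), "cache.json")
_, n_new = encode_with_cache(["a", "bb"], fake_encode, cache_file)
print(n_new)  # 2
_, n_new = encode_with_cache(["a", "bb", "ccc"], fake_encode, cache_file)
print(n_new)  # 1
```

In practice the cache should be tied to one model checkpoint, for example by including the model name in the file name, since vectors from different models are not comparable.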

Common errors

Error	Cause	Fix
`CUDA out of memory`	Batch too large	Reduce `batch_size`; use `model.half()`
`RuntimeError: stack expects each tensor to be equal size`	Variable-length inputs in manual batching	Use `model.encode()` — it handles padding automatically
Dot-product scores far above 1.0	Embeddings not normalized	Pass `normalize_embeddings=True`, or compare with `util.cos_sim`
Similarity scores all near 0.0	Mixed models between index build and query	Always use the exact same model checkpoint for both
`OSError: Can't load tokenizer for 'model-name'`	Model not on Hugging Face Hub or local path typo	Check spelling; try `SentenceTransformer("sentence-transformers/model-name")`
Cross-encoder scores are logits, not probabilities	Default `predict()` output	Pass `apply_softmax=True` when you need probabilities
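
Why normalization matters for the scores in the table above: a dot product depends on vector length, while cosine similarity does not, and the two coincide once both vectors have unit length. The pure-Python sketch below illustrates this with two hand-written toy vectors; the helper names are for illustration only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length

print(dot(a, b))                              # 50.0, grows with vector length
print(cosine(a, b))                           # approximately 1.0
print(dot(l2_normalize(a), l2_normalize(b)))  # approximately 1.0, matches cosine
```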
