cheat sheet
sentence-transformers
Package-level reference for the sentence-transformers library on PyPI — install, transformers/torch deps, model registry, and embedding alternatives.
What it is
sentence-transformers is a high-level Python library for computing dense vector embeddings of sentences, paragraphs, and documents, plus running cross-encoder rerankers. It wraps Hugging Face transformers with the right pooling layer (mean, cls, or max) and normalisation step so a single .encode() call returns a usable embedding.
It is the default embedding library in most RAG stacks (LangChain, LlamaIndex, Haystack) — calling SentenceTransformer("BAAI/bge-large-en-v1.5") is how most Python pipelines get a vector store-ready encoder.
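To make the pooling and normalisation step concrete, here is a minimal pure-Python sketch of masked mean pooling followed by L2 normalisation. It is an illustration only: the helper names `mean_pool` and `l2_normalize` and the toy token vectors are assumptions, and the library itself works on batched torch tensors rather than Python lists.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
```

With cls pooling the first token's vector would be taken instead of the average, which is why a checkpoint trained with one strategy gives poor vectors under the other.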
Install
pip install sentence-transformers
Output: transitively installs transformers, torch, huggingface-hub, tokenizers, and friends
uv add sentence-transformers
Output: dependency resolved + added to pyproject.toml
poetry add sentence-transformers
Output: updated lockfile + virtualenv install
pip install "sentence-transformers[onnx]"
Output: installs optimum + onnxruntime for ONNX-accelerated inference
Versioning & Python support
- Current stable is the 5.x series (as of late 2025). The 2.x line was long-lived; 3.0 (mid-2024) introduced a more PyTorch-native training loop, 4.0 extended the same trainer design to cross-encoders, and 5.0 added native sparse encoder APIs.
- Python 3.9+ on current releases; 3.10+ recommended.
- Tracks transformers and torch floors closely: installing alongside an older transformers can break model loading. Re-install both together when upgrading.
- Loose semver: minor releases occasionally rename trainer classes; the API on SentenceTransformer.encode() has been stable since 1.0.
- The cross-encoder training API changed in 4.x (CrossEncoder is still importable; the trainer wrappers changed shape).
Package metadata
- Maintainer: Nils Reimers / UKP-Lab (originally); now community-led under the UKPLab GitHub org with active contribution from Hugging Face
- Project home: github.com/UKPLab/sentence-transformers
- Docs: sbert.net
- PyPI: pypi.org/project/sentence-transformers
- License: Apache-2.0
- Governance: open-source, research-originated, vendor-neutral
- First released: 2019 (paper: "Sentence-BERT")
- Downloads: millions per month
Optional dependencies & extras
sentence-transformers is a heavy install because of its transitive deps — transformers, torch, huggingface-hub, tokenizers, scikit-learn, numpy, Pillow. Direct extras:
| Extra | Purpose |
|---|---|
| sentence-transformers[train] | Adds accelerate, datasets for the trainer API |
| sentence-transformers[onnx] | optimum[onnxruntime] for CPU/GPU ONNX inference |
| sentence-transformers[openvino] | Intel OpenVINO backend |
| sentence-transformers[dev] | Linting, type-checking, doc-build deps |
Models are downloaded from the Hugging Face Hub — the most-used checkpoints (all-MiniLM-L6-v2, the BGE family, mxbai-embed-large-v1, Jina-embeddings, E5) live there. There is no separate "model registry"; the model_name_or_path argument hits the Hub directly.
Alternatives
| Package | Trade-off |
|---|---|
| openai (embeddings endpoint) | Hosted text-embedding-3-small/-large. No GPU required; per-token cost. |
| cohere (embed endpoint) | Hosted multilingual embeddings + reranker. Strong on retrieval. |
| voyageai | Specialist embedding provider; competitive quality on benchmarks. |
| fastembed | Lightweight ONNX-only embedding library from Qdrant: much smaller install, fewer model options. |
| Raw transformers + manual pooling | More control; you handle pooling/normalisation yourself. |
| instructor-embedding | Instruction-tuned embeddings; different API. |
Common gotchas
- Pooling-strategy mismatch. Some checkpoints expect mean pooling, others cls. Most modern models on the Hub set the right default in their config, but older checkpoints (or a cross-encoder checkpoint loaded as a bi-encoder) can silently produce bad vectors if forced into the wrong mode.
- GPU vs CPU model loading. SentenceTransformer("model") picks a device automatically when device is omitted (CUDA if available, then MPS on Apple Silicon, else CPU). Pass device="cuda" (or "mps") explicitly when you depend on it: a silent CPU fallback is 10–100× slower and nvidia-smi shows zero utilisation.
- Embedding dimensions vary wildly. all-MiniLM-L6-v2 is 384-d; bge-large-en-v1.5 is 1024-d; OpenAI text-embedding-3-large is 3072-d. A vector store schema baked for one is incompatible with another; re-index when switching models.
- Normalisation defaults differ. Many models recommend L2-normalising the output (normalize_embeddings=True) so that a plain dot product equals cosine similarity. The default in .encode() is False. Forgetting this while scoring with dot products gives technically-correct-but-poorly-ranked retrieval.
- Cross-encoders are not encoders. A CrossEncoder takes a pair of texts and returns a relevance score; it does not produce a single embedding. Mixing the two APIs is a frequent beginner mistake.
- Big batches OOM on small GPUs. .encode(corpus, batch_size=64) is sized for a ~16 GB card (the library default is 32). Drop to 8–16 on a laptop GPU.
- trust_remote_code=True propagates. Modern embedding models (Nomic, Jina) ship custom modeling code on the Hub; loading them requires the flag, with the same security caveats as in transformers.
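The normalisation gotcha is easiest to see with numbers. The sketch below uses two made-up 2-d "document vectors" (the names and values are hypothetical) to show that a raw dot product and a cosine score can disagree about which document ranks first.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

query = [1.0, 1.0]
docs = {
    "long_but_off_axis": [3.0, 0.0],   # large norm, 45 degrees away from the query
    "short_but_aligned": [1.0, 1.0],   # small norm, same direction as the query
}

# Raw dot product rewards vector length as well as direction.
raw_best = max(docs, key=lambda name: dot(query, docs[name]))

# After L2 normalisation the dot product is exactly the cosine similarity.
q_unit = l2_normalize(query)
cos_best = max(docs, key=lambda name: dot(q_unit, l2_normalize(docs[name])))
```

Here `raw_best` and `cos_best` name different documents, which is the "poorly ranked" failure mode in miniature.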
Real-world recipes
The recipes below are the bread-and-butter shapes sentence-transformers ships in: indexing for retrieval, reranking retrieval results, and incremental re-encoding.
Recipe: semantic search index
Build a normalised embedding index over a corpus, then query it with cosine similarity.
import glob

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

# Load every Markdown file; sorted() keeps row order reproducible across runs
corpus = []
for path in sorted(glob.glob("./docs/*.md")):
    with open(path, encoding="utf-8") as f:
        corpus.append(f.read())

# Encode corpus once; persist to disk
emb = model.encode(
    corpus,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
np.save("emb.npy", emb)

# Instruction prefix applied to queries only, never to the indexed passages
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

# Query at runtime
def search(query: str, k: int = 5):
    q = model.encode(QUERY_PREFIX + query, normalize_embeddings=True)  # shape (dim,)
    scores = emb @ q  # cosine similarity, because both sides are unit-length
    top_k = np.argsort(-scores)[:k]
    return [(corpus[i][:200], float(scores[i])) for i in top_k]
Output: ranked top-k snippets with cosine scores; the search step is a single matmul.
Recipe: cross-encoder reranker over retrieved candidates
A two-stage retrieval pipeline (fast bi-encoder retrieval, then a small cross-encoder rerank over the shortlist) usually ranks better than the bi-encoder alone and costs far less than cross-encoding the whole corpus.
from sentence_transformers import SentenceTransformer, CrossEncoder

bi = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")
ce = CrossEncoder("BAAI/bge-reranker-base", device="cuda")

def two_stage(query, corpus, k_first=50, k_final=5):
    # Stage 1: bi-encoder retrieval (in production, encode the corpus once and cache cv)
    qv = bi.encode(query, normalize_embeddings=True)
    cv = bi.encode(corpus, normalize_embeddings=True, batch_size=64)
    candidates = (cv @ qv).argsort()[-k_first:][::-1]
    # Stage 2: score each (query, candidate) pair with the cross-encoder
    pairs = [(query, corpus[i]) for i in candidates]
    scores = ce.predict(pairs, batch_size=32)
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [corpus[i] for i, _ in ranked[:k_final]]
Output: the k_final passages in cross-encoder order; better top-k precision than bi-encoder retrieval alone, and a standard pattern in production RAG.
Recipe: multi-process encoding for large corpora
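from sentence_transformers import SentenceTransformer

# Guard the entry point: worker processes are spawned and re-import this module
if __name__ == "__main__":
    huge_corpus = [f"document {i}" for i in range(100_000)]  # placeholder corpus
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    pool = model.start_multi_process_pool()
    emb = model.encode_multi_process(huge_corpus, pool, batch_size=64)
    model.stop_multi_process_pool(pool)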
Output: uses one process per GPU for embedding; near-linear scaling.
Recipe: ONNX export for CPU inference
from sentence_transformers import SentenceTransformer
from sentence_transformers.backend import export_optimized_onnx_model
model = SentenceTransformer("BAAI/bge-small-en-v1.5", backend="onnx")
# Optimise the ONNX model loaded above (level O3) and save it to a local directory:
export_optimized_onnx_model(model, optimization_config="O3", model_name_or_path="bge-small-en-v1.5-onnx")
Output: ONNX checkpoint suitable for CPU serving via optimum-onnxruntime; typically 2-5× faster than the PyTorch path on CPU.
Recipe: incremental re-encoding when the model changes
When you swap embedding models, the index must be rebuilt — but you can stream the upgrade rather than do it all in one batch:
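import sqlite3

from sentence_transformers import SentenceTransformer

db = sqlite3.connect("docs.db")
new = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

# Run once; this statement fails if emb_v2 already exists
db.execute("ALTER TABLE docs ADD COLUMN emb_v2 BLOB")

last_id = -1  # assumes non-negative integer ids
while True:
    # Page through the table by id so reads and writes never share a live cursor
    rows = db.execute(
        "SELECT id, text FROM docs WHERE id > ? ORDER BY id LIMIT 1000", (last_id,)
    ).fetchall()
    if not rows:
        break
    # Encode the whole page as one batch instead of one row at a time
    vecs = new.encode([text for _, text in rows], normalize_embeddings=True)
    db.executemany(
        "UPDATE docs SET emb_v2=? WHERE id=?",
        [(vec.tobytes(), doc_id) for (doc_id, _), vec in zip(rows, vecs)],
    )
    db.commit()  # one commit per page keeps the migration resumable
    last_id = rows[-1][0]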
Output: index migration without service downtime; switch read paths once the column is fully populated.
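Before switching read paths it helps to verify that the new column really is fully populated. The following is a small self-contained sketch using an in-memory SQLite database; the `docs` / `emb_v2` schema mirrors the recipe above, while the helper names `store`, `load`, and `migration_complete` are assumptions made for illustration.

```python
import sqlite3
from array import array

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, emb_v2 BLOB)")
db.executemany("INSERT INTO docs (id, text) VALUES (?, ?)", [(1, "alpha"), (2, "beta")])

def store(doc_id, vector):
    # array("f") packs 32-bit floats, the same byte layout as a float32 ndarray's tobytes()
    db.execute("UPDATE docs SET emb_v2=? WHERE id=?", (array("f", vector).tobytes(), doc_id))

def load(doc_id):
    (blob,) = db.execute("SELECT emb_v2 FROM docs WHERE id=?", (doc_id,)).fetchone()
    vec = array("f")
    vec.frombytes(blob)
    return list(vec)

def migration_complete():
    # Safe to switch read paths only when no row is left without a new embedding
    (missing,) = db.execute("SELECT COUNT(*) FROM docs WHERE emb_v2 IS NULL").fetchone()
    return missing == 0

store(1, [0.5, 0.25])
before = migration_complete()  # row 2 has no emb_v2 yet
store(2, [1.0, 0.0])
after = migration_complete()
```

The same `COUNT(*) ... IS NULL` query works as a progress metric while the migration is running.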
Performance tuning
Embedding throughput is dominated by batch size, model size, and whether you exit PyTorch at all.
- Batch size matters more than anything else. 32-64 on a single 16GB GPU; 128-256 on a 24GB+ GPU. Below 8, GPU utilisation drops below 30%.
- device="cuda" (or "mps" on Apple Silicon). The device is auto-detected when omitted, so confirm the model is not silently running on CPU, which is 10-100× slower for most models.
- fp16 / bf16. Pass model_kwargs={"torch_dtype": torch.bfloat16} to halve VRAM and reduce latency on Ampere+.
- ONNX backend. SentenceTransformer("name", backend="onnx") exits PyTorch for inference. 2-5× speed-up on CPU, smaller gains on GPU.
- OpenVINO backend. backend="openvino" is the Intel-CPU equivalent; competitive with ONNX on Xeon CPUs, sometimes faster.
- FlashAttention. Bigger embedding models (*-large, *-base with long context) benefit from attn_implementation="flash_attention_2" if the underlying transformers version supports it.
- Same model for ingest and query. Most setups pick one model (often 384-dim, to keep storage cost down) and use it for both the index and the queries. Mixing a smaller and a larger model only works when the two were trained to share an embedding space, for example through distillation.
- Normalised dot product = cosine. With normalize_embeddings=True, scoring is a single matmul; skip the explicit cosine formula.
- Pin memory + uvloop at the FastAPI layer for embedding servers; every 1-2ms matters for high-QPS workloads.
Version migration guide
sentence-transformers has been remarkably stable on the .encode surface; code from 2021 generally still runs. The notable shifts came with 3.0 (mid-2024), which rewrote training, 4.0, which modernised cross-encoder training, and 5.0, which added sparse encoders.
| Era | What changed |
|---|---|
| 1.x | The original API. SentenceTransformer.encode, CrossEncoder.predict, and training via the legacy SentencesDataset + LossFunctions. |
| 2.x | Added multi-process encoding, better trainer ergonomics, and a long deprecation list. |
| 3.x | Trainer rewrite (SentenceTransformerTrainer), closer to transformers.Trainer. Backend selection (onnx, openvino) on SentenceTransformer(...). |
| 4.x | CrossEncoder training rewritten around CrossEncoderTrainer, matching the 3.x trainer shape. |
| 5.x (current) | Native sparse encoder support. |
Migration discipline:
- SentenceTransformer("name").encode(...) works the same across versions; most consumer code is portable.
- Training code on 2.x may need updates for the new SentenceTransformerTrainer / CrossEncoderTrainer classes.
- Pin sentence-transformers and transformers together; resolving them independently can mismatch the model loading code path.
Hedge: exact API symbol moves across 3.0 are documented in the project's CHANGELOG.md; consult it for surface-level training breakage. The .encode surface itself has been stable.
Troubleshooting common errors
- "Failed to load model" with pooling_mode errors. The config didn't ship a pooling layer (rare). Build the model manually by passing a Transformer + Pooling module pair to SentenceTransformer(modules=[...]).
- OOM on .encode(corpus). Lower batch_size; for huge corpora use encode_multi_process or chunk the corpus.
- Garbage similarity scores. You forgot normalize_embeddings=True while scoring with a dot product. Some models (BGE family) recommend it explicitly; others don't care, but it never hurts for cosine retrieval.
- trust_remote_code=True warning for Nomic / Jina / etc. Pin to a known commit SHA before enabling.
- Slow CPU inference. Use the ONNX or OpenVINO backend; raw PyTorch on CPU leaves a lot on the table.
- Dimensions mismatch between query and index. You swapped models. Re-encode the entire index; dimensions are model-specific.
- Cross-encoder treated as encoder. CrossEncoder doesn't have .encode; it has .predict([(a, b), ...]). Mixing the two APIs is the #1 first-day mistake.
- device="mps" errors on Apple Silicon. Some models (BGE-large in fp16) don't yet run cleanly on MPS. Fall back to device="cpu" with the ONNX backend.
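For the OOM case, "chunk the corpus" can be as simple as the sketch below. It is dependency-free: `encode_fn` is a hypothetical stand-in for something like `lambda chunk: model.encode(chunk, batch_size=16)`, and the chunk size is an arbitrary example value.

```python
from typing import Callable, Iterator, List, Sequence

def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def encode_in_chunks(
    texts: Sequence[str],
    encode_fn: Callable[[Sequence[str]], List[List[float]]],
    chunk_size: int = 10_000,
) -> List[List[float]]:
    """Encode one chunk at a time so only one chunk's tensors are live at once."""
    vectors: List[List[float]] = []
    for chunk in chunked(texts, chunk_size):
        vectors.extend(encode_fn(chunk))
    return vectors

# Stand-in encoder (hypothetical): one 1-d "embedding" per text.
def fake_encode(batch):
    return [[float(len(text))] for text in batch]

vectors = encode_in_chunks(["a", "bb", "ccc", "dddd", "eeeee"], fake_encode, chunk_size=2)
```

For corpora that do not fit in RAM, the `vectors.extend(...)` line would be replaced by a write to disk per chunk.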
Cost & rate-limit management
Self-hosted embeddings flip the cost model from "cents per million tokens" to "GPU/CPU hours". The win is usually decisive — most embedding-heavy workloads pay back a single-GPU box within weeks compared to hosted APIs.
- Index once, query forever. Encoding a corpus is a one-time cost; budgeting it correctly is more important than per-query optimisation.
- Cache by content hash. Many corpora have 10-30% duplicate or near-duplicate documents. A SHA-256 → vector cache catches the exact repeats and pays for itself quickly.
- Match model to job. A 384-dim all-MiniLM-L6-v2 for a 1M-doc index uses about 1.5 GB in float32; a 1024-dim BGE-large uses about 4 GB. Storage matters at scale.
- ONNX on CPU for query-side serving. A small CPU box running ONNX often handles thousands of embedding QPS at single-digit dollars per month.
- Reranker is the throughput pinch point. Cross-encoders are 100-1000× slower per pair than bi-encoders. Limit re-rank candidates to k=20-50.
- Batch query encodes too. Single-query latency suffers from kernel launch overhead; even a small batch (8-16) cuts per-query overhead significantly.
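Two of the points above, the content-hash cache and the index storage figures, can be sketched in a few lines. The class name `CachedEmbedder`, the helper `index_size_gb`, and the stand-in embedding function are all hypothetical; a real cache would sit in front of `SentenceTransformer.encode` and probably persist to disk.

```python
import hashlib

class CachedEmbedder:
    """Cache vectors by the SHA-256 of the text so exact repeats are encoded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.misses = 0  # number of texts that actually reached the model

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

def index_size_gb(num_docs, dim, bytes_per_value=4):
    """Raw vector storage in GB (10**9 bytes); float32 by default."""
    return num_docs * dim * bytes_per_value / 1e9

# Stand-in embedding function (hypothetical): a 1-d vector holding the text length.
embedder = CachedEmbedder(lambda text: [float(len(text))])
for doc in ["same text", "same text", "other text"]:
    embedder.embed(doc)

small_index = index_size_gb(1_000_000, 384)    # all-MiniLM-L6-v2 sized vectors
large_index = index_size_gb(1_000_000, 1024)   # BGE-large sized vectors
```

Note that an exact hash only catches byte-identical text; normalising whitespace or case before hashing widens the hit rate at the cost of treating slightly different documents as the same.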
Multi-provider patterns
sentence-transformers is a local library — multi-provider patterns mean "swap to a hosted embeddings API without rewriting the call site". The shape is straightforward:
- Provider-agnostic embedding interface. Wrap SentenceTransformer.encode and OpenAI/Cohere/Voyage embedding clients behind one def embed(texts: list[str]) -> np.ndarray. Switch the implementation by config.
- LangChain Embeddings abstraction. HuggingFaceEmbeddings(model_name=...) wraps sentence-transformers; OpenAIEmbeddings, CohereEmbeddings, VoyageEmbeddings provide hosted equivalents with the same interface.
- fastembed for ONNX-only deployments. Qdrant's fastembed ships pre-quantised ONNX embedding models; no torch dependency. Use when binary size matters (Lambda, edge).
- Hybrid sparse + dense. Some retrievers combine sentence-transformers dense vectors with BM25 / SPLADE sparse vectors. sentence-transformers 5.x has native sparse encoder support; qdrant, weaviate, and others support hybrid scoring natively.
- Reranker switching. Cross-encoder rerankers swap freely between providers: Cohere's hosted reranker has the same shape as a local CrossEncoder.
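A provider-agnostic interface of the kind described in the first bullet might look like the sketch below. Everything here is hypothetical scaffolding: the `Embedder` protocol, the two wrapper classes, and the registry are illustrative names, the wrappers return fake vectors instead of calling a model or an API, and plain lists are used in place of np.ndarray to keep the example dependency-free.

```python
from typing import List, Protocol

class Embedder(Protocol):
    def embed(self, texts: List[str]) -> List[List[float]]: ...

class LocalEmbedder:
    """Hypothetical wrapper; real code would call SentenceTransformer.encode here."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(text)), 0.0] for text in texts]

class HostedEmbedder:
    """Hypothetical wrapper; real code would call a hosted embeddings API here."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        return [[0.0, float(len(text))] for text in texts]

REGISTRY = {"local": LocalEmbedder, "hosted": HostedEmbedder}

def make_embedder(provider: str) -> Embedder:
    """Pick the implementation from a config value; call sites only see Embedder."""
    try:
        return REGISTRY[provider]()
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None

vectors = make_embedder("local").embed(["hello", "hi"])
```

Swapping providers still means re-indexing, since vectors from different models are not comparable; the interface only keeps the call sites unchanged.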
Security considerations
- trust_remote_code=True for Nomic, Jina, etc. Same risks as in transformers; pin to a revision SHA.
- Hub supply chain. Pin model revisions (the revision= argument with a commit SHA) as well as package versions in requirements.txt, for reproducibility and to defend against malicious updates.
- PII in embeddings. Embeddings are reversible to a degree: adversaries can sometimes reconstruct sensitive content from vectors. Treat the vector store as carrying the same sensitivity as the source.
- Cross-tenant index isolation. Multi-tenant retrieval requires per-tenant filters; a single-index design risks cross-tenant leakage if a query accidentally bypasses the filter.
- Adversarial inputs. Out-of-distribution queries can produce near-random embeddings that match unintended documents. Cap top-k at sensible values and use a similarity threshold.
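The last two points (tenant filtering, plus a top-k cap and similarity threshold) can be enforced in one post-processing step. This is a sketch under assumptions: the `(doc_id, tenant_id, score)` hit shape, the function name `safe_top_k`, and the default threshold values are all hypothetical, and in production the tenant filter should also be applied inside the vector store query itself.

```python
def safe_top_k(hits, tenant_id, k=5, min_score=0.3, max_k=20):
    """Filter raw vector-store hits before they reach the caller.

    hits: iterable of (doc_id, tenant_id, score) tuples.
    """
    k = min(k, max_k)  # never return more than the hard cap
    allowed = [
        (doc_id, score)
        for doc_id, owner, score in hits
        if owner == tenant_id and score >= min_score
    ]
    allowed.sort(key=lambda item: item[1], reverse=True)
    return allowed[:k]

hits = [
    ("doc-1", "tenant-a", 0.82),
    ("doc-2", "tenant-b", 0.91),  # belongs to another tenant
    ("doc-3", "tenant-a", 0.12),  # below the similarity threshold
    ("doc-4", "tenant-a", 0.55),
]
results = safe_top_k(hits, tenant_id="tenant-a", k=100)
```

Applying the filter again after retrieval is deliberate defence in depth: it still holds if a query path forgets the store-side filter.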
Production deployment
sentence-transformers is most often deployed as an embedding micro-service or as a library inside a RAG application.
Embedding micro-service. FastAPI + uvicorn, single worker per device, ONNX backend on CPU, PyTorch + GPU on GPU instances. Health-check endpoint that runs a 1-token encode.
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

# Load once at import time so every request reuses the same model
model = SentenceTransformer("BAAI/bge-small-en-v1.5", backend="onnx")
app = FastAPI()

@app.post("/embed")
def embed(payload: dict):
    # Expects a JSON body of the form {"texts": ["...", "..."]}
    vecs = model.encode(payload["texts"], normalize_embeddings=True, batch_size=64)
    return {"embeddings": vecs.tolist()}

@app.get("/health")
def health():
    model.encode(["health check"])  # tiny encode proves the model is loaded
    return {"ok": True}
Output: stateless, horizontally scalable; one process per CPU/GPU.
Container shape. Pre-bake the model into the image. Set HF_HUB_OFFLINE=1 at runtime. Lock vector dimensions to what your downstream index expects.
Worker pools for batch jobs. For ingest pipelines, use encode_multi_process across multiple GPUs.
Model versioning. Tag images with the model revision SHA; rebuilding the index on a model upgrade is a planned migration, not a hot swap.
When NOT to use this
- Hosted embeddings are simpler when you don't already have a GPU. OpenAI/Cohere/Voyage embeddings are HTTP calls; no infra to run.
- You only need a vector DB's "auto-embed" mode. Some vector DBs (Pinecone, Weaviate) embed for you with their own (or third-party) models.
- Pure keyword search suffices. BM25 with a good tokenizer beats sloppy dense retrieval on many corpora — try the boring solution first.
- Edge deployment. Reach for fastembed (ONNX-only, no torch) or a Core ML / TFLite export.
- You need cross-modal embeddings. Specialist libraries (CLIP family) live elsewhere; sentence-transformers covers some but not all multimodal models.
Evaluation & observability
Embedding quality is hard to spot-check — you need a benchmark. The standard moves are:
- MTEB (Massive Text Embedding Benchmark): Hugging Face's standard suite of 50+ retrieval, clustering, and STS tasks. The mteb PyPI package runs a SentenceTransformer end-to-end:

      from mteb import MTEB
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("BAAI/bge-small-en-v1.5")
      results = MTEB(tasks=["NFCorpus", "SciFact"]).run(model, output_folder="results/bge-small")

  Output: per-task NDCG / MAP scores in results/bge-small/*.json; compare against the public leaderboard.
- Domain-specific eval datasets. MTEB tells you "average", and your domain is not average. Curate ~50-200 query-document pairs from your real traffic and score Recall@k directly.
- Cosine score distributions. Plot the histogram of top-1 cosine scores for known-good queries. A long tail near 0 suggests the model isn't a good fit; a tight distribution near 1 suggests overconfidence and likely false positives.
- Reranker calibration. Cross-encoder scores aren't directly comparable across models; calibrate against a held-out reference set if you display them.
- A/B test in production. Two indices, two query paths, route a slice of traffic to each, compare downstream success (clickthrough, dwell time, conversion). The only honest test.
- Drift monitoring. If your corpus shifts (new topics, new vocabulary), embedding quality drifts too. Re-evaluate on the latest data quarterly.
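Scoring Recall@k on a hand-curated set needs no framework. The sketch below assumes each evaluation item is a pair of (ranked retrieved ids, set of relevant ids); the function names and the toy data are illustrative, not part of any library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

runs = [
    (["d3", "d1", "d9"], {"d1"}),        # the relevant doc is at rank 2
    (["d4", "d7", "d2"], {"d2", "d8"}),  # one of two relevant docs, at rank 3
]
score_at_2 = mean_recall_at_k(runs, k=2)
score_at_3 = mean_recall_at_k(runs, k=3)
```

Running the same `runs` structure against two candidate models gives a like-for-like comparison on your own traffic.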
For traces of embedding calls in the wider system, treat them like any other LLM-adjacent operation: trace via langsmith, OpenTelemetry, or your APM of choice. Embedding latency tends to be the silent contributor to long p99 tails.
Ecosystem integrations
sentence-transformers sits at a junction in the RAG stack — most of its integrations are about feeding embeddings into vector stores and rerankers.
| Layer | Common integrations |
|---|---|
| Vector stores | chromadb, pinecone, weaviate, qdrant, milvus, pgvector (via psycopg), lancedb — all accept either raw .encode output or wrap a SentenceTransformer callable directly. |
| Frameworks | langchain-huggingface (HuggingFaceEmbeddings), llama-index (HuggingFaceEmbedding), haystack-ai (SentenceTransformersDocumentEmbedder) all wrap the library. |
| Backends | optimum-onnxruntime (ONNX), optimum-intel[openvino] (OpenVINO), optimum-neuron (AWS Inferentia). Selected via backend= on the SentenceTransformer(...) constructor. |
| Quantisation | optimum.quanto and dynamic-int8 ONNX quantisation knock ~2-4× off inference latency on CPU. |
| FlashAttention | Pass model_kwargs={"attn_implementation": "flash_attention_2"} for the large encoder backbones when flash-attn is installed. |
| Hub / Datasets | Tied closely to huggingface-hub for model loading and datasets for training corpora. |
| Rerankers | BAAI/bge-reranker-*, cross-encoder/ms-marco-MiniLM-L-12-v2, Cohere's hosted reranker — interchangeable via CrossEncoder or REST. |
The library is intentionally small — most integrations live one layer up in the consuming application.
See also
- AI: sentence-transformers (.encode, training, cross-encoders)
- Packages: pip-transformers (the lower-level library)
- Concept: rag (embeddings power retrieval)
- Concept: api (embedding API design)