cheat sheet

sentence-transformers

Package-level reference for the sentence-transformers library on PyPI — install, transformers/torch deps, model registry, and embedding alternatives.

updated 05-31-2026

What it is

sentence-transformers is a high-level Python library for computing dense vector embeddings of sentences, paragraphs, and documents, plus running cross-encoder rerankers. It wraps Hugging Face transformers with the right pooling layer (mean, cls, or max) and normalisation step so a single .encode() call returns a usable embedding.

It is the default embedding library in most RAG stacks (LangChain, LlamaIndex, Haystack) — calling SentenceTransformer("BAAI/bge-large-en-v1.5") is how most Python pipelines get a vector store-ready encoder.
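To make the pooling and normalisation step concrete, here is a minimal plain-Python sketch of masked mean pooling followed by L2 normalisation. It illustrates the idea only and is not the library's implementation; the helper names `mean_pool` and `l2_normalize` and the toy token vectors are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [t / max(count, 1) for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Toy input: three token vectors, the last one is padding
tokens = [[1.0, 2.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # the padding row is ignored
embedding = l2_normalize(pooled)      # unit-length sentence vector
```

In the real library these steps run as tensor operations inside the model pipeline, so `.encode()` callers never write them by hand.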

Install

bash

pip install sentence-transformers

Output: transitively installs transformers, torch, huggingface-hub, tokenizers, and friends

bash

uv add sentence-transformers

Output: dependency resolved + added to pyproject.toml

bash

poetry add sentence-transformers

Output: updated lockfile + virtualenv install

bash

pip install "sentence-transformers[onnx]"

Output: installs optimum + onnxruntime for ONNX-accelerated inference

Versioning & Python support

Current stable is the 5.x series (as of late 2025). The 2.x line was long-lived; 3.0 (mid-2024) introduced the SentenceTransformerTrainer training loop, 3.2 added the ONNX and OpenVINO backends, 4.0 (early 2025) reworked cross-encoder training, and 5.0 (mid-2025) added the SparseEncoder API.
Python 3.9+ on current releases; 3.10+ recommended.
Tracks transformers and torch floors closely, so installing alongside an older transformers can break model loading. Re-install both together when upgrading.
Loose semver: minor releases occasionally rename trainer classes; the API on SentenceTransformer.encode() has been stable since 1.0.
Cross-encoder training changed in 4.x (CrossEncoder is still importable; the recommended training path moved to CrossEncoderTrainer).

Package metadata

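Maintainer: Nils Reimers / UKP Lab (originally); maintained by Tom Aarsen at Hugging Face since late 2023, with the project moving from UKP Lab to Hugging Face in October 2025
Project home: github.com/huggingface/sentence-transformers (the older github.com/UKPLab/sentence-transformers URL redirects)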
Docs: sbert.net
PyPI: pypi.org/project/sentence-transformers
License: Apache-2.0
Governance: open-source, research-originated, vendor-neutral
First released: 2019 (paper: "Sentence-BERT")
Downloads: millions per month

Optional dependencies & extras

sentence-transformers is a heavy install because of its transitive deps — transformers, torch, huggingface-hub, tokenizers, scikit-learn, numpy, Pillow. Direct extras:

Extra	Purpose
`sentence-transformers[train]`	Adds `accelerate`, `datasets` for the trainer API
`sentence-transformers[onnx]`	`optimum[onnxruntime]` for CPU/GPU ONNX inference
`sentence-transformers[openvino]`	Intel OpenVINO backend
`sentence-transformers[dev]`	Linting, type-checking, doc-build deps

Models are downloaded from the Hugging Face Hub — the most-used checkpoints (all-MiniLM-L6-v2, the BGE family, mxbai-embed-large-v1, Jina-embeddings, E5) live there. There is no separate "model registry"; the model_name_or_path argument hits the Hub directly.

Alternatives

Package	Trade-off
`openai` (embeddings endpoint)	Hosted `text-embedding-3-small`/`-large`. No GPU required; per-token cost.
`cohere` (embed endpoint)	Hosted multilingual embeddings + reranker. Strong on retrieval.
`voyageai`	Specialist embedding provider; competitive quality on benchmarks.
`fastembed`	Lightweight ONNX-only embedding library from Qdrant — much smaller install, fewer model options.
Raw `transformers` + manual pooling	More control; you handle pooling/normalisation yourself.
`instructor-embedding`	Instruction-tuned embeddings; different API.

Common gotchas

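Pooling-strategy mismatch. Some checkpoints expect mean pooling, others cls. Most modern models on the Hub set the right default in their config, but older models, or plain transformers checkpoints that fall back to default mean pooling, can silently produce bad vectors if pooled the wrong way.
GPU vs CPU model loading. With device=None (the default), SentenceTransformer("model") picks an accelerator if one is visible and otherwise falls back to CPU. In a container or a CPU-only torch build that fallback is silent: inference is 10–100× slower and nvidia-smi shows zero utilisation. Pass device="cuda" (or "mps" on Apple Silicon) explicitly so a missing GPU fails loudly, and check model.device after loading.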
Embedding dimensions vary wildly. all-MiniLM-L6-v2 is 384-d; bge-large-en-v1.5 is 1024-d; OpenAI text-embedding-3-large is 3072-d. A vector store schema baked for one is incompatible with another — re-index when switching models.
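Normalisation defaults differ. Many models recommend L2-normalising the output (normalize_embeddings=True). The default in .encode() is False, although some checkpoints (the BGE family, all-MiniLM-L6-v2) include a Normalize module in their pipeline and already return unit-length vectors. Cosine similarity is unaffected by vector length, but scoring unnormalised vectors with a raw dot product gives technically-correct-but-poorly-ranked retrieval.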
Cross-encoders are not encoders. A CrossEncoder takes a pair of texts and returns a relevance score — it does not produce a single embedding. Mixing the two APIs is a frequent beginner mistake.
Big batches OOM on small GPUs. The default batch_size in .encode() is 32; values like .encode(corpus, batch_size=64) assume roughly 16 GB of VRAM. Drop to 8–16 on a laptop GPU.
trust_remote_code=True propagates. Modern embedding models (Nomic, Jina) ship custom modeling code on the Hub — loading them requires the flag, same security caveats as in transformers.
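The normalisation gotcha above is easy to reproduce without a model. The sketch below uses two hypothetical document vectors to show how a raw dot product can rank a long off-topic vector above a short on-topic one, while the same dot product on unit vectors (cosine similarity) does not. All names and numbers are made up for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


query = [1.0, 0.0]
docs = {
    "on_topic_short": [0.9, 0.1],   # points almost exactly along the query
    "off_topic_long": [3.0, 4.0],   # points elsewhere but has a large norm
}

# Raw dot product: the long vector wins purely because of its magnitude
raw_ranking = sorted(docs, key=lambda name: -dot(query, docs[name]))

# Dot product of unit vectors equals cosine similarity
q = unit(query)
cosine_ranking = sorted(docs, key=lambda name: -dot(q, unit(docs[name])))
```

This is why a dot-product index should only be fed vectors that are already normalised, whether by normalize_embeddings=True or by the model's own pipeline.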

Real-world recipes

The recipes below are the bread-and-butter shapes sentence-transformers ships in: indexing for retrieval, reranking retrieval results, and incremental re-encoding.

Recipe: semantic search index

Build a normalised embedding index over a corpus, then query it with cosine similarity.

python

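import glob

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
# Sort the paths so corpus order (and therefore row order in emb.npy) is reproducible
corpus = [open(p, encoding="utf-8").read() for p in sorted(glob.glob("./docs/*.md"))]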

# Encode corpus once; persist to disk
emb = model.encode(
    corpus,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
np.save("emb.npy", emb)

# Query at runtime
def search(query: str, k: int = 5):
    q = model.encode([f"Represent this sentence for searching relevant passages: {query}"],
                     normalize_embeddings=True)
    scores = (emb @ q.T).squeeze()
    top_k = np.argsort(-scores)[:k]
    return [(corpus[i][:200], float(scores[i])) for i in top_k]

Output: ranked top-k snippets with cosine scores; the search step is a single matmul.

Recipe: cross-encoder reranker over retrieved candidates

A two-stage retrieval pipeline — fast bi-encoder retrieval, then a small cross-encoder rerank — beats either stage alone.

python

from sentence_transformers import SentenceTransformer, CrossEncoder

bi = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")
ce = CrossEncoder("BAAI/bge-reranker-base", device="cuda")

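def two_stage(query, corpus, k_first=50, k_final=5):
    # Stage 1: bi-encoder retrieval (in production, encode the corpus once and reuse cv)
    qv = bi.encode(query, normalize_embeddings=True)
    cv = bi.encode(corpus, normalize_embeddings=True, batch_size=64)
    candidates = (cv @ qv).argsort()[-k_first:][::-1]
    # Stage 2: score each (query, candidate) pair jointly with the cross-encoder
    pairs = [(query, corpus[i]) for i in candidates]
    scores = ce.predict(pairs, batch_size=32)
    reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [corpus[i] for i, _ in reranked[:k_final]]

Output: better ordering of the final top-k (precision, nDCG) than the bi-encoder alone; recall is still bounded by the first-stage candidate set. Standard pattern in production RAG.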

Recipe: multi-process encoding for large corpora

python

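from sentence_transformers import SentenceTransformer

huge_corpus = ["example document"] * 100_000  # placeholder; use your own list of strings

if __name__ == "__main__":  # required: worker processes re-import this module
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    pool = model.start_multi_process_pool()  # one worker per visible GPU
    emb = model.encode_multi_process(huge_corpus, pool, batch_size=64)
    model.stop_multi_process_pool(pool)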

Output: uses one process per GPU for embedding; near-linear scaling.

Recipe: ONNX export for CPU inference

python

from sentence_transformers import SentenceTransformer
from sentence_transformers.backend import export_optimized_onnx_model

model = SentenceTransformer("BAAI/bge-small-en-v1.5", backend="onnx")
# Optionally save a graph-optimised copy; the model passed in must already use backend="onnx"
export_optimized_onnx_model(model, optimization_config="O3", model_name_or_path="bge-small-en-v1.5-onnx")

Output: ONNX checkpoint suitable for CPU serving via optimum-onnxruntime; typically 2-5× faster than the PyTorch path on CPU.

Recipe: incremental re-encoding when the model changes

When you swap embedding models, the index must be rebuilt — but you can stream the upgrade rather than do it all in one batch:

python

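from sentence_transformers import SentenceTransformer
import sqlite3

db = sqlite3.connect("docs.db")
new = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

# Add the new column next to the old embeddings (run once; fails if it already exists)
db.execute("ALTER TABLE docs ADD COLUMN emb_v2 BLOB")

# Page through rows that still lack a v2 vector. Each page is fetched in full
# before writing, so the UPDATE never resets a SELECT that is being iterated,
# and the job can be stopped and resumed at any point.
while True:
    rows = db.execute(
        "SELECT id, text FROM docs WHERE emb_v2 IS NULL ORDER BY id LIMIT 1000"
    ).fetchall()
    if not rows:
        break
    vecs = new.encode([text for _, text in rows], normalize_embeddings=True)
    db.executemany(
        "UPDATE docs SET emb_v2=? WHERE id=?",
        [(vec.tobytes(), doc_id) for vec, (doc_id, _) in zip(vecs, rows)],
    )
    db.commit()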

Output: index migration without service downtime; switch read paths once the column is fully populated.

Performance tuning

Embedding throughput is dominated by batch size, model size, and whether you exit PyTorch at all.

Batch size matters more than anything else. 32-64 on a single 16GB GPU; 128-256 on a 24GB+ GPU. Below 8, GPU utilisation drops below 30%.
device="cuda" (or "mps" on Apple Silicon). The device is auto-detected when omitted; a silent fallback to CPU is 10-100× slower for most models, so set it explicitly.
fp16 / bf16. Pass model_kwargs={"torch_dtype": torch.bfloat16} to halve VRAM and reduce latency on Ampere+.
ONNX backend. SentenceTransformer("name", backend="onnx") exits PyTorch for inference. 2-5× speed-up on CPU, smaller gains on GPU.
OpenVINO backend. backend="openvino" is the Intel-CPU equivalent — competitive with ONNX on Xeon CPUs, sometimes faster.
FlashAttention. Bigger embedding models (*-large, *-base with long context) benefit from attn_implementation="flash_attention_2" if the underlying transformers version supports it.
Same model for ingest and query. Query and document vectors must live in the same embedding space, so most setups use one model (for example a 384-dim model to keep index storage down) on both sides. Pairing a smaller and a larger model only works when the two were trained or distilled to share a space.
Normalised dot product = cosine. With normalize_embeddings=True, scoring is a single matmul; skip the explicit cosine formula.
Pin memory + uvloop at the FastAPI layer for embedding servers — every 1-2ms matters for high-QPS workloads.

Version migration guide

sentence-transformers has been remarkably stable on the .encode surface; code from 2021 generally still runs. The notable shifts came with 3.0 (mid-2024), which modernised training, followed by 4.0 (cross-encoder training) and 5.0 (sparse encoders).

Era	What changed
`1.x`	The original API. `SentenceTransformer.encode`, `CrossEncoder.predict`, and training via `model.fit` with the legacy `SentencesDataset` / `InputExample` data classes and the `losses` module.
`2.x`	Added multi-process encoding, better trainer ergonomics, and a long deprecation list.
`3.x`	Trainer rewrite: `SentenceTransformerTrainer`, closer to `transformers.Trainer`. Backend selection (`onnx`, `openvino`) on `SentenceTransformer(...)` arrived in 3.2.
`4.x`	Cross-encoder training reworked around `CrossEncoderTrainer`.
`5.x` (current)	Native sparse encoder support (`SparseEncoder`).

Migration discipline:

SentenceTransformer("name").encode(...) works the same across versions — most consumer code is portable.
Training code on 2.x may need updates for the new SentenceTransformerTrainer / CrossEncoderTrainer classes.
Pin sentence-transformers and transformers together; resolving them independently can mismatch the model loading code path.

Note: exact API symbol moves across major versions are documented in the project's GitHub release notes and the migration guide on sbert.net; consult them when training code breaks. The .encode surface itself has been stable.
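One way to keep the pinning discipline honest is to record what is actually installed and turn it into exact pins. The sketch below relies only on the standard library's `importlib.metadata`; the helper names `installed_versions` and `requirement_lines` are hypothetical, and the package list is an assumption you would adjust.

```python
from importlib import metadata


def installed_versions(packages=("sentence-transformers", "transformers", "torch")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def requirement_lines(versions):
    """Render exact pins for the packages that are installed."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]


report = installed_versions()
pins = requirement_lines(report)   # e.g. lines to paste into requirements.txt
```

Capturing all three pins from one known-good environment avoids resolving them independently later.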

Troubleshooting common errors

"Failed to load model" with pooling_mode errors. The checkpoint didn't ship a pooling config (rare). Build the pipeline explicitly by passing a models.Transformer + models.Pooling pair via SentenceTransformer(modules=[...]).
OOM on .encode(corpus). Lower batch_size; for huge corpora use encode_multi_process or chunk the corpus.
Garbage similarity scores. If you score with a raw dot product, you probably forgot normalize_embeddings=True; cosine similarity is unaffected by vector length. Also check that queries carry the prefix or instruction the model card asks for. Some models (BGE family) recommend normalisation explicitly; others don't care, but it never hurts for cosine retrieval.
trust_remote_code=True warning for Nomic / Jina / etc. Pin to a known commit SHA before enabling.
Slow CPU inference. Use the ONNX or OpenVINO backend; raw PyTorch on CPU leaves a lot on the table.
Dimensions mismatch between query and index. You swapped models. Re-encode the entire index — dimensions are model-specific.
Cross-encoder treated as encoder. CrossEncoder doesn't have .encode; it has .predict([(a, b), ...]). Mixing the two APIs is the #1 first-day mistake.
device="mps" errors on Apple Silicon. Some models (BGE-large in fp16) don't yet run cleanly on MPS. Fall back to device="cpu" with the ONNX backend.
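A dimension mismatch is cheaper to catch at the call site than deep inside a vector store. The following is a minimal guard, assuming you know the width your index was built with; `DimensionMismatch` and `check_dimensions` are hypothetical names, not part of sentence-transformers.

```python
class DimensionMismatch(ValueError):
    """Raised when a query vector does not match the index width."""


def check_dimensions(query_vec, index_dim, model_name="<unknown>"):
    """Fail fast with a readable message instead of a shape error downstream."""
    if len(query_vec) != index_dim:
        raise DimensionMismatch(
            f"query has {len(query_vec)} dims but the index stores {index_dim}; "
            f"was the index built with a different model than {model_name!r}?"
        )
    return query_vec


index_dim = 384  # e.g. an index built with all-MiniLM-L6-v2
ok = check_dimensions([0.0] * 384, index_dim)

try:
    check_dimensions([0.0] * 1024, index_dim, model_name="BAAI/bge-large-en-v1.5")
    error = None
except DimensionMismatch as exc:
    error = str(exc)
```

Storing the model name and dimension next to the index makes this check trivial to wire in at startup.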

Cost & rate-limit management

Self-hosted embeddings flip the cost model from "cents per million tokens" to "GPU/CPU hours". The win is usually decisive — most embedding-heavy workloads pay back a single-GPU box within weeks compared to hosted APIs.

Index once, query forever. Encoding a corpus is a one-time cost; budgeting it correctly is more important than per-query optimisation.
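Cache by content hash. Many corpora contain 10-30% duplicated documents. A SHA-256 → vector cache pays for itself quickly; it only matches byte-identical text, so normalise whitespace before hashing if trivially different copies should hit the cache too.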
Match model to job. A 384-dim all-MiniLM-L6-v2 for a 1M-doc index uses 1.5 GB; a 1024-dim BGE-large uses 4 GB. Storage matters at scale.
ONNX on CPU for query-side serving. A small CPU box running ONNX often handles thousands of embedding QPS at single-digit dollars per month.
Reranker is the throughput pinch point. Cross-encoders are 100-1000× slower per pair than bi-encoders. Limit re-rank candidates to k=20-50.
Batch query encodes too. Single-query latency suffers from kernel launch overhead; even a small batch (8-16) cuts per-query overhead significantly.
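Two of the points above can be sketched in a few lines: the float32 storage arithmetic behind the 1.5 GB and 4 GB figures, and a content-hash cache that skips re-encoding identical text. `EmbeddingCache`, `index_size_bytes`, and the stand-in `fake_embed` encoder are hypothetical; in practice `embed_fn` would wrap `SentenceTransformer.encode`.

```python
import hashlib


def index_size_bytes(num_docs, dim, bytes_per_value=4):
    """Raw float32 storage for a dense index, excluding any ANN overhead."""
    return num_docs * dim * bytes_per_value


class EmbeddingCache:
    """Skip re-encoding texts whose exact content was seen before."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # callable: list[str] -> list of vectors
        self.store = {}            # sha256 hex digest -> vector
        self.misses = 0            # number of texts actually encoded

    @staticmethod
    def key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        keys = [self.key(t) for t in texts]
        # Deduplicate within the batch as well as against the cache
        pending = {k: t for k, t in zip(keys, texts) if k not in self.store}
        if pending:
            vectors = self.embed_fn(list(pending.values()))
            self.misses += len(pending)
            for k, vec in zip(pending, vectors):
                self.store[k] = vec
        return [self.store[k] for k in keys]


def fake_embed(texts):
    """Stand-in encoder so the sketch runs without a model."""
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed)
first = cache.embed(["alpha", "beta", "alpha"])   # 2 unique texts encoded
second = cache.embed(["beta", "gamma"])           # only "gamma" is new

small_index = index_size_bytes(1_000_000, 384)    # 1,536,000,000 bytes
large_index = index_size_bytes(1_000_000, 1024)   # 4,096,000,000 bytes
```

An in-memory dict is the simplest store; a persistent key-value store keyed by the same digest would let the cache survive restarts.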

Multi-provider patterns

sentence-transformers is a local library — multi-provider patterns mean "swap to a hosted embeddings API without rewriting the call site". The shape is straightforward:

Provider-agnostic embedding interface. Wrap SentenceTransformer.encode and OpenAI/Cohere/Voyage embedding clients behind one def embed(texts: list[str]) -> np.ndarray. Switch the implementation by config.
LangChain Embeddings abstraction. HuggingFaceEmbeddings(model_name=...) wraps sentence-transformers; OpenAIEmbeddings, CohereEmbeddings, and VoyageAIEmbeddings provide hosted equivalents behind the same interface.
fastembed for ONNX-only deployments. Qdrant's fastembed ships pre-quantised ONNX embedding models; no torch dependency. Use when binary size matters (Lambda, edge).
Hybrid sparse + dense. Some retrievers combine sentence-transformers dense vectors with BM25 / SPLADE sparse vectors. sentence-transformers 5.x has native sparse encoder support; qdrant, weaviate, and others support hybrid scoring natively.
Reranker switching. Cross-encoder rerankers swap freely between providers — Cohere's hosted reranker has the same shape as a local CrossEncoder.
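The provider-agnostic interface described above could look roughly like the sketch below. It uses plain Python lists instead of `np.ndarray` so it runs without dependencies, and every class and function name here (`Embedder`, `LocalEmbedder`, `HostedEmbedder`, `make_embedder`, the stub callables) is hypothetical. Real implementations would wrap `SentenceTransformer.encode` and a hosted client respectively.

```python
from typing import Protocol, Sequence

Vector = list[float]


class Embedder(Protocol):
    """The single call site the rest of the application depends on."""

    def embed(self, texts: Sequence[str]) -> list[Vector]: ...


class LocalEmbedder:
    """Would wrap SentenceTransformer.encode in a real system."""

    def __init__(self, encode_fn):
        self._encode = encode_fn

    def embed(self, texts):
        return [[float(x) for x in vec] for vec in self._encode(list(texts))]


class HostedEmbedder:
    """Would wrap an HTTP embeddings client in a real system."""

    def __init__(self, request_fn):
        self._request = request_fn

    def embed(self, texts):
        return [[float(x) for x in vec] for vec in self._request(list(texts))]


def make_embedder(config, backends) -> Embedder:
    """Pick the implementation by config so call sites never change."""
    provider = config["provider"]
    if provider not in backends:
        raise ValueError(f"unknown embedding provider: {provider!r}")
    return backends[provider]


# Stub backends standing in for a local model and a hosted API
def stub_local(texts):
    return [[1.0, 0.0] for _ in texts]


def stub_hosted(texts):
    return [[0.0, 1.0] for _ in texts]


backends = {"local": LocalEmbedder(stub_local), "hosted": HostedEmbedder(stub_hosted)}
embedder = make_embedder({"provider": "local"}, backends)
vectors = embedder.embed(["hello", "world"])
```

Switching providers still requires re-indexing, since vectors from different models are not comparable; the interface only keeps the calling code unchanged.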

Security considerations

trust_remote_code=True for Nomic, Jina, etc. Same risks as in transformers — pin to a revision SHA.
Hub supply chain. Pin library versions in requirements.txt and pin each model with revision="<commit-sha>" when loading, for reproducibility and to defend against malicious updates.
PII in embeddings. Embeddings are reversible to a degree — adversaries can sometimes reconstruct sensitive content from vectors. Treat the vector store as carrying the same sensitivity as the source.
Cross-tenant index isolation. Multi-tenant retrieval requires per-tenant filters; a single-index design risks cross-tenant leakage if a query accidentally bypasses the filter.
Adversarial inputs. Out-of-distribution queries can produce near-random embeddings that match unintended documents. Cap top-k at sensible values and use a similarity threshold.
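The tenant-isolation and adversarial-input points above combine naturally into one guard around search results. This is a minimal sketch over already-scored hits; `safe_search`, `MAX_TOP_K`, the tuple layout, and the 0.3 threshold are all assumptions chosen for illustration, and a real deployment would apply the tenant filter inside the vector store query rather than afterwards.

```python
MAX_TOP_K = 20  # hard ceiling regardless of what the caller asks for


def safe_search(scored_hits, tenant_id, k=5, min_score=0.3):
    """Filter scored hits by tenant and score, then cap the result count.

    scored_hits: iterable of (doc_id, tenant, score) tuples in any order.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required; refusing an unfiltered query")
    k = max(1, min(k, MAX_TOP_K))
    allowed = [h for h in scored_hits if h[1] == tenant_id and h[2] >= min_score]
    allowed.sort(key=lambda h: -h[2])
    return allowed[:k]


hits = [
    ("doc-1", "acme", 0.82),
    ("doc-2", "globex", 0.91),  # other tenant: must never be returned
    ("doc-3", "acme", 0.12),    # below threshold: likely a spurious match
    ("doc-4", "acme", 0.55),
]
results = safe_search(hits, tenant_id="acme", k=100)
```

Making the tenant argument mandatory means a forgotten filter raises an error instead of leaking documents.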

Production deployment

sentence-transformers is most often deployed as an embedding micro-service or as a library inside a RAG application.

Embedding micro-service. FastAPI + uvicorn, single worker per device, ONNX backend on CPU, PyTorch + GPU on GPU instances. Health-check endpoint that runs a 1-token encode.

python

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", backend="onnx")
app = FastAPI()

@app.post("/embed")
def embed(payload: dict):
    vecs = model.encode(payload["texts"], normalize_embeddings=True, batch_size=64)
    return {"embeddings": vecs.tolist()}

@app.get("/health")
def health():
    model.encode(["health check"])
    return {"ok": True}

Output: stateless, horizontally scalable; one process per CPU/GPU.

Container shape. Pre-bake the model into the image. Set HF_HUB_OFFLINE=1 at runtime. Lock vector dimensions to what your downstream index expects.

Worker pools for batch jobs. For ingest pipelines, use encode_multi_process across multiple GPUs.

Model versioning. Tag images with the model revision SHA; rebuilding the index on a model upgrade is a planned migration, not a hot swap.

When NOT to use this

Hosted embeddings are simpler when you don't already have a GPU. OpenAI/Cohere/Voyage embeddings are HTTP calls; no infra to run.
You only need a vector DB's "auto-embed" mode. Some vector DBs (Pinecone, Weaviate) embed for you with their own (or third-party) models.
Pure keyword search suffices. BM25 with a good tokenizer beats sloppy dense retrieval on many corpora — try the boring solution first.
Edge deployment. Reach for fastembed (ONNX-only, no torch) or a Core ML / TFLite export.
You need cross-modal embeddings. Specialist libraries (CLIP family) live elsewhere — sentence-transformers covers some but not all multimodal models.

Evaluation & observability

Embedding quality is hard to spot-check — you need a benchmark. The standard moves are:

MTEB (Massive Text Embedding Benchmark): the community-standard suite of 50+ retrieval, clustering, and STS tasks, with a public leaderboard hosted on Hugging Face. The mteb PyPI package runs a SentenceTransformer end-to-end:
python
```
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# Resolve task names to task objects first; the mteb API has shifted between
# releases, so check the docs for the version you have installed
tasks = mteb.get_tasks(tasks=["NFCorpus", "SciFact"])
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/bge-small")
```
Output: per-task NDCG / MAP scores as JSON files under results/bge-small/; compare against the public leaderboard.
Domain-specific eval datasets. MTEB tells you "average" — your domain is not average. Curate ~50-200 query-document pairs from your real traffic and score Recall@k directly.
Cosine score distributions. Plot the histogram of top-1 cosine scores for known-good queries. A long tail near 0 suggests the model isn't a good fit; a tight distribution near 1 suggests overconfidence and likely false positives.
Reranker calibration. Cross-encoder scores aren't directly comparable across models — calibrate against a held-out reference set if you display them.
A/B test in production. Two indices, two query paths, route a slice of traffic to each, compare downstream success (clickthrough, dwell time, conversion). The only honest test.
Drift monitoring. If your corpus shifts (new topics, new vocabulary), embedding quality drifts too. Re-evaluate on the latest data quarterly.
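Scoring Recall@k on a small domain-specific set, as suggested above, needs very little code. The sketch below assumes each evaluation item is a pair of (ranked retrieved ids, set of relevant ids); the function names and the toy data are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


eval_set = [
    (["d3", "d1", "d7"], {"d1"}),         # relevant doc found at rank 2
    (["d2", "d9", "d4"], {"d4", "d5"}),   # one of two relevant docs, at rank 3
    (["d8", "d6", "d0"], {"d5"}),         # miss
]

r1 = mean_recall_at_k(eval_set, k=1)
r3 = mean_recall_at_k(eval_set, k=3)
```

Running the same evaluation set against each candidate model gives a like-for-like comparison on your own traffic rather than a leaderboard average.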

For traces of embedding calls in the wider system, treat them like any other LLM-adjacent operation: trace via langsmith, OpenTelemetry, or your APM of choice. Embedding latency tends to be the silent contributor to long p99 tails.

Ecosystem integrations

sentence-transformers sits at a junction in the RAG stack — most of its integrations are about feeding embeddings into vector stores and rerankers.

Layer	Common integrations
Vector stores	`chromadb`, `pinecone`, `weaviate`, `qdrant`, `milvus`, `pgvector` (via `psycopg`), `lancedb` — all accept either raw `.encode` output or wrap a `SentenceTransformer` callable directly.
Frameworks	`langchain-huggingface` (HuggingFaceEmbeddings), `llama-index` (HuggingFaceEmbedding), `haystack-ai` (`SentenceTransformersDocumentEmbedder`) all wrap the library.
Backends	`optimum[onnxruntime]` (ONNX) and `optimum-intel[openvino]` (OpenVINO), selected via `backend="onnx"` or `backend="openvino"` on the `SentenceTransformer(...)` constructor. `optimum-neuron` (AWS Inferentia) is a separate integration with its own model classes rather than a `backend=` value.
Quantisation	`optimum-quanto` and dynamic-int8 ONNX quantisation can cut CPU inference latency by roughly 2-4×.
FlashAttention	Pass `model_kwargs={"attn_implementation": "flash_attention_2"}` for the large encoder backbones when `flash-attn` is installed.
Hub / Datasets	Tied closely to `huggingface-hub` for model loading and `datasets` for training corpora.
Rerankers	`BAAI/bge-reranker-*`, `cross-encoder/ms-marco-MiniLM-L-12-v2`, Cohere's hosted reranker — interchangeable via `CrossEncoder` or REST.

The library is intentionally small — most integrations live one layer up in the consuming application.
