cheat sheet

sentence-transformers

Package-level reference for the sentence-transformers library on PyPI — install, transformers/torch deps, model registry, and embedding alternatives.

What it is

sentence-transformers is a high-level Python library for computing dense vector embeddings of sentences, paragraphs, and documents, plus running cross-encoder rerankers. It wraps Hugging Face transformers with the right pooling layer (mean, cls, or max) and normalisation step so a single .encode() call returns a usable embedding.

It is the default embedding library in most RAG stacks (LangChain, LlamaIndex, Haystack) — calling SentenceTransformer("BAAI/bge-large-en-v1.5") is how most Python pipelines get a vector store-ready encoder.

Install

bash
pip install sentence-transformers

Output: transitively installs transformers, torch, huggingface-hub, tokenizers, and friends

bash
uv add sentence-transformers

Output: dependency resolved + added to pyproject.toml

bash
poetry add sentence-transformers

Output: updated lockfile + virtualenv install

bash
pip install "sentence-transformers[onnx]"

Output: installs optimum + onnxruntime for ONNX-accelerated inference

Versioning & Python support

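  • Current stable is the 5.x series (as of late 2025). The 2.x line was long-lived; 3.0 (mid-2024) introduced a new training loop built on transformers.Trainer (SentenceTransformerTrainer), 4.0 (early 2025) brought the same trainer design to cross-encoders, and 5.0 (mid-2025) added native sparse encoder APIs.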
  • Python 3.9+ on current releases; 3.10+ recommended.
  • Tracks transformers and torch floors closely — installing alongside an older transformers can break model loading. Re-install both together when upgrading.
  • Loose semver — minor releases occasionally rename trainer classes; API on SentenceTransformer.encode() has been stable since 1.0.
  • Cross-encoder training changed shape in 4.x (CrossEncoder is still importable; training moved to the CrossEncoderTrainer class).

Package metadata

  • Maintainer: Nils Reimers / UKP-Lab (originally); now community-led under the UKPLab GitHub org with active contribution from Hugging Face
  • Project home: github.com/UKPLab/sentence-transformers
  • Docs: sbert.net
  • PyPI: pypi.org/project/sentence-transformers
  • License: Apache-2.0
  • Governance: open-source, research-originated, vendor-neutral
  • First released: 2019 (paper: "Sentence-BERT")
  • Downloads: millions per month

Optional dependencies & extras

sentence-transformers is a heavy install because of its transitive deps — transformers, torch, huggingface-hub, tokenizers, scikit-learn, numpy, Pillow. Direct extras:

Extra | Purpose
sentence-transformers[train] | Adds accelerate and datasets for the trainer API
sentence-transformers[onnx] | optimum[onnxruntime] for CPU/GPU ONNX inference
sentence-transformers[openvino] | Intel OpenVINO backend
sentence-transformers[dev] | Linting, type-checking, doc-build deps

Models are downloaded from the Hugging Face Hub — the most-used checkpoints (all-MiniLM-L6-v2, the BGE family, mxbai-embed-large-v1, Jina-embeddings, E5) live there. There is no separate "model registry"; the model_name_or_path argument hits the Hub directly.

Alternatives

Package | Trade-off
openai (embeddings endpoint) | Hosted text-embedding-3-small/-large. No GPU required; per-token cost.
cohere (embed endpoint) | Hosted multilingual embeddings + reranker. Strong on retrieval.
voyageai | Specialist embedding provider; competitive quality on benchmarks.
fastembed | Lightweight ONNX-only embedding library from Qdrant. Much smaller install, fewer model options.
Raw transformers + manual pooling | More control; you handle pooling/normalisation yourself.
instructor-embedding | Instruction-tuned embeddings; different API.

Common gotchas

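  1. Pooling-strategy mismatch. Some checkpoints expect mean pooling, others cls. Most modern models on the Hub set the right default in their config, but older models, and plain transformers checkpoints that ship no sentence-transformers config (the library then falls back to mean pooling), can silently produce bad vectors if the pooling mode is wrong.
  2. GPU vs CPU model loading. With no device argument, SentenceTransformer("model") picks an available accelerator (CUDA first, then MPS on Apple Silicon) and otherwise falls back to CPU. If nvidia-smi shows zero utilisation, the usual cause is a CPU-only torch build or a GPU the process cannot see; check model.device, and pass device="cuda" explicitly so a misconfiguration fails loudly instead of running 10–100× slower on CPU.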
  3. Embedding dimensions vary wildly. all-MiniLM-L6-v2 is 384-d; bge-large-en-v1.5 is 1024-d; OpenAI text-embedding-3-large is 3072-d. A vector store schema baked for one is incompatible with another — re-index when switching models.
  4. Normalisation defaults differ. Many models recommend L2-normalising the output (normalize_embeddings=True). The default in .encode() is False. Cosine similarity is unaffected, but if you score with a raw dot product (as the matmul recipes here and many vector stores do), unnormalised vectors give technically-correct-but-poorly-ranked retrieval.
  5. Cross-encoders are not encoders. A CrossEncoder takes a pair of texts and returns a relevance score — it does not produce a single embedding. Mixing the two APIs is a frequent beginner mistake.
  6. Big batches OOM on small GPUs. The default batch_size in .encode() is 32; memory use grows with batch size, sequence length, and model size. Drop to 8–16 on a laptop GPU.
  7. trust_remote_code=True propagates. Modern embedding models (Nomic, Jina) ship custom modeling code on the Hub — loading them requires the flag, same security caveats as in transformers.
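
Gotcha 4 can be checked with a few lines of arithmetic. The sketch below uses invented 2-d vectors (not real model output) and plain Python to show how a raw dot product and cosine similarity can disagree on the ranking when vectors are not normalised. The helper names are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity is the dot product of the unit-length vectors
    return dot(l2_normalize(a), l2_normalize(b))

# Toy vectors, invented for illustration only
query = [1.0, 0.0]
doc_aligned = [0.9, 0.1]  # points almost the same way as the query, small norm
doc_long = [3.0, 3.0]     # 45 degrees off the query, but a much larger norm

# A raw dot product rewards the long vector
raw_winner = max([doc_aligned, doc_long], key=lambda d: dot(query, d))
# Cosine similarity rewards direction only
cos_winner = max([doc_aligned, doc_long], key=lambda d: cosine(query, d))

print(raw_winner is doc_long, cos_winner is doc_aligned)
```

With normalize_embeddings=True both sides are already unit-length, so the dot product and the cosine ranking coincide.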

Real-world recipes

The recipes below are the bread-and-butter shapes sentence-transformers ships in: indexing for retrieval, reranking retrieval results, and incremental re-encoding.

Recipe: semantic search index

Build a normalised embedding index over a corpus, then query it with cosine similarity.

python
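import glob
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
# Sort the paths so the document order is stable across runs
corpus = [Path(p).read_text(encoding="utf-8") for p in sorted(glob.glob("./docs/*.md"))]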

# Encode corpus once; persist to disk
emb = model.encode(
    corpus,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
np.save("emb.npy", emb)

# Query at runtime
def search(query: str, k: int = 5):
    # BGE models suggest this instruction prefix on queries (not on passages)
    q = model.encode(
        f"Represent this sentence for searching relevant passages: {query}",
        normalize_embeddings=True,
    )
    scores = emb @ q  # 1-D array of cosine scores, since both sides are unit-length
    top_k = np.argsort(-scores)[:k]
    return [(corpus[i][:200], float(scores[i])) for i in top_k]

Output: ranked top-k snippets with cosine scores; the search step is a single matmul.

Recipe: cross-encoder reranker over retrieved candidates

A two-stage retrieval pipeline (fast bi-encoder retrieval, then a cross-encoder rerank of the top candidates) usually ranks better than the bi-encoder alone and is far cheaper than cross-encoding the whole corpus.

python
from sentence_transformers import SentenceTransformer, CrossEncoder

bi = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")
ce = CrossEncoder("BAAI/bge-reranker-base", device="cuda")

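def two_stage(query, corpus, k_first=50, k_final=5):
    qv = bi.encode(query, normalize_embeddings=True)
    # For brevity the corpus is encoded on every call; in practice encode it once and reuse cv
    cv = bi.encode(corpus, normalize_embeddings=True, batch_size=64)
    # Stage 1: indices of the k_first most similar documents, best first
    candidates = (cv @ qv).argsort()[-k_first:][::-1]
    # Stage 2: score each (query, document) pair with the cross-encoder
    pairs = [(query, corpus[i]) for i in candidates]
    scores = ce.predict(pairs, batch_size=32)
    reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [corpus[i] for i, _ in reranked[:k_final]]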

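Output: the top k_final documents in reranked order. Reranking improves the ordering of the candidates (precision and nDCG at small k) but cannot recover documents the first stage missed; standard pattern in production RAG.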

Recipe: multi-process encoding for large corpora

python
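from sentence_transformers import SentenceTransformer

def main():
    huge_corpus = ["..."]  # placeholder: load your documents here
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    pool = model.start_multi_process_pool()  # one worker per visible GPU, CPU workers if none
    emb = model.encode_multi_process(huge_corpus, pool, batch_size=64)
    model.stop_multi_process_pool(pool)
    return emb

# Worker processes re-import this module, so the entry point must be guarded
if __name__ == "__main__":
    main()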

Output: uses one process per GPU for embedding; near-linear scaling.

Recipe: ONNX export for CPU inference

python
from sentence_transformers import SentenceTransformer
from sentence_transformers.backend import export_optimized_onnx_model

model = SentenceTransformer("BAAI/bge-small-en-v1.5", backend="onnx")
# Optionally apply graph optimisations to the loaded ONNX model and save the result locally:
export_optimized_onnx_model(model, optimization_config="O3", model_name_or_path="bge-small-en-v1.5-onnx")

Output: ONNX checkpoint suitable for CPU serving via optimum-onnxruntime; typically 2-5× faster than the PyTorch path on CPU.

Recipe: incremental re-encoding when the model changes

When you swap embedding models, the index must be rebuilt — but you can stream the upgrade rather than do it all in one batch:

python
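from sentence_transformers import SentenceTransformer
import sqlite3

db = sqlite3.connect("docs.db")
new = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

db.execute("ALTER TABLE docs ADD COLUMN emb_v2 BLOB")
# Page through the table by id instead of iterating a cursor while writing:
# re-executing on the cursor being iterated would reset the SELECT.
last_id = -1  # assumes non-negative integer ids
while True:
    rows = db.execute(
        "SELECT id, text FROM docs WHERE id > ? ORDER BY id LIMIT 256", (last_id,)
    ).fetchall()
    if not rows:
        break
    vecs = new.encode([text for _, text in rows], normalize_embeddings=True)
    # Stored as raw float32 bytes; read back with np.frombuffer(blob, dtype=np.float32)
    db.executemany(
        "UPDATE docs SET emb_v2=? WHERE id=?",
        [(v.tobytes(), doc_id) for (doc_id, _), v in zip(rows, vecs)],
    )
    db.commit()
    last_id = rows[-1][0]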

Output: index migration without service downtime; switch read paths once the column is fully populated.

Performance tuning

Embedding throughput is dominated by batch size, model size, and whether you exit PyTorch at all.

  • Batch size matters more than anything else. 32-64 on a single 16GB GPU; 128-256 on a 24GB+ GPU. Below 8, GPU utilisation drops below 30%.
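  • Check the device. With no device argument the library picks an available accelerator (CUDA, then MPS on Apple Silicon) and otherwise falls back to CPU, which is 10-100× slower for most models. Pass device="cuda" (or "mps") to make the choice explicit.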
  • fp16 / bf16. Pass model_kwargs={"torch_dtype": torch.bfloat16} to halve VRAM and reduce latency on Ampere+.
  • ONNX backend. SentenceTransformer("name", backend="onnx") exits PyTorch for inference. 2-5× speed-up on CPU, smaller gains on GPU.
  • OpenVINO backend. backend="openvino" is the Intel-CPU equivalent — competitive with ONNX on Xeon CPUs, sometimes faster.
  • FlashAttention. Bigger embedding models (*-large, *-base with long context) benefit from attn_implementation="flash_attention_2" if the underlying transformers version supports it.
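  • Same model for ingest and query. Query and document vectors must come from the same embedding space, so the model that built the index also encodes the queries; many setups pick a 384-dim model to keep storage cost down. Mixing a small and a large model only works when the two were trained or distilled to share a space.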
  • Normalised dot product = cosine. With normalize_embeddings=True, scoring is a single matmul; skip the explicit cosine formula.
  • Pin memory + uvloop at the FastAPI layer for embedding servers — every 1-2ms matters for high-QPS workloads.
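
Since batch size dominates throughput, it is worth measuring rather than guessing. The following is a minimal sketch of a batch-size sweep; sweep_batch_sizes and fake_encode are hypothetical helpers, and the fake encoder stands in for a real model so the example runs anywhere.

```python
import time
from typing import Callable, Dict, Sequence

def sweep_batch_sizes(
    encode_fn: Callable[[Sequence[str], int], object],
    texts: Sequence[str],
    batch_sizes: Sequence[int] = (8, 16, 32, 64),
) -> Dict[int, float]:
    """Return texts-per-second for each batch size, measured with wall-clock time."""
    throughput = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        encode_fn(texts, bs)
        elapsed = time.perf_counter() - start
        throughput[bs] = len(texts) / max(elapsed, 1e-9)
    return throughput

# Hypothetical stand-in so the sketch runs without a model. With the real library
# you would pass: lambda texts, bs: model.encode(texts, batch_size=bs)
def fake_encode(texts, batch_size):
    return [[float(len(t))] for t in texts]

results = sweep_batch_sizes(fake_encode, ["a", "bb", "ccc"] * 10)
best = max(results, key=results.get)
print(best, results)
```

On a real GPU, run one warm-up call before timing and use a sample of texts with realistic lengths, since padding to the longest sequence in a batch affects the result.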

Version migration guide

sentence-transformers has been remarkably stable on the .encode surface; code from 2021 generally still runs. The notable shifts came with the major releases from 3.0 (mid-2024) onward, which modernised training first, then the cross-encoder API, and then added sparse encoders.

Era | What changed
1.x | The original API. SentenceTransformer.encode, CrossEncoder.predict, and training via the legacy model.fit path (SentencesDataset plus the losses module).
2.x | Added multi-process encoding, better trainer ergonomics, and a long deprecation list.
3.x | Trainer rewrite (SentenceTransformerTrainer), closer to transformers.Trainer. Backend selection (onnx, openvino) on SentenceTransformer(...) arrived in a 3.x minor release.
4.x | CrossEncoder training moved to the same trainer design (CrossEncoderTrainer).
5.x (current) | Native sparse encoder support.

Migration discipline:

  1. SentenceTransformer("name").encode(...) works the same across versions — most consumer code is portable.
  2. Training code on 2.x may need updates for the new SentenceTransformerTrainer / CrossEncoderTrainer classes.
  3. Pin sentence-transformers and transformers together; resolving them independently can mismatch the model loading code path.

Note: exact API symbol moves across major versions are documented in the project's release notes; consult them for surface-level training breakage. The .encode surface itself has been stable.
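
Because sentence-transformers and transformers need to be pinned together, it can help to log the resolved versions at start-up. This is a small sketch using only the standard library; installed_versions is a hypothetical helper name, and the package list is an assumption you would adapt.

```python
from importlib import metadata

def installed_versions(packages=("sentence-transformers", "transformers", "torch")):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

versions = installed_versions()
print(versions)
```

Comparing this output against the versions in your lockfile is a cheap way to catch an environment where one of the packages was upgraded on its own.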

Troubleshooting common errors

  • "Failed to load model" with pooling_mode errors. The config didn't ship a pooling layer (rare). Build the pipeline explicitly by passing a Transformer + Pooling module pair to SentenceTransformer(modules=[...]).
  • OOM on .encode(corpus). Lower batch_size; for huge corpora use encode_multi_process or chunk the corpus.
  • Garbage similarity scores. The usual cause is unnormalised vectors scored with a raw dot product; pass normalize_embeddings=True. Some models (BGE family) recommend it explicitly; others don't care, but it never hurts for cosine retrieval.
  • trust_remote_code=True warning for Nomic / Jina / etc. Pin to a known commit SHA before enabling.
  • Slow CPU inference. Use the ONNX or OpenVINO backend; raw PyTorch on CPU leaves a lot on the table.
  • Dimensions mismatch between query and index. You swapped models. Re-encode the entire index — dimensions are model-specific.
  • Cross-encoder treated as encoder. CrossEncoder doesn't have .encode; it has .predict([(a, b), ...]). Mixing the two APIs is the #1 first-day mistake.
  • device="mps" errors on Apple Silicon. Some models (BGE-large in fp16) don't yet run cleanly on MPS. Fall back to device="cpu" with the ONNX backend.
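
The dimension-mismatch case is easy to turn into an explicit error rather than a confusing failure inside the vector store. A minimal sketch, assuming you record the index dimension somewhere at build time; check_dimensions is a hypothetical helper.

```python
def check_dimensions(index_dim: int, query_vec) -> None:
    """Raise a descriptive error if the query vector does not match the index width."""
    if len(query_vec) != index_dim:
        raise ValueError(
            f"query has {len(query_vec)} dimensions but the index was built with "
            f"{index_dim}; re-encode the index with the current model"
        )

check_dimensions(384, [0.0] * 384)  # passes silently

try:
    # e.g. an index built with a 384-d model, queried with a 1024-d model
    check_dimensions(384, [0.0] * 1024)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```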

Cost & rate-limit management

Self-hosted embeddings flip the cost model from "cents per million tokens" to "GPU/CPU hours". The win is usually decisive — most embedding-heavy workloads pay back a single-GPU box within weeks compared to hosted APIs.

  • Index once, query forever. Encoding a corpus is a one-time cost; budgeting it correctly is more important than per-query optimisation.
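  • Cache by content hash. Many corpora have 10-30% duplicated or near-duplicate documents. A SHA-256 → vector cache catches the exact duplicates (normalise whitespace before hashing to catch trivial variants) and pays for itself quickly.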
  • Match model to job. A 384-dim all-MiniLM-L6-v2 for a 1M-doc index uses 1.5 GB; a 1024-dim BGE-large uses 4 GB. Storage matters at scale.
  • ONNX on CPU for query-side serving. A small CPU box running ONNX often handles thousands of embedding QPS at single-digit dollars per month.
  • Reranker is the throughput pinch point. Cross-encoders are 100-1000× slower per pair than bi-encoders. Limit re-rank candidates to k=20-50.
  • Batch query encodes too. Single-query latency suffers from kernel launch overhead; even a small batch (8-16) cuts per-query overhead significantly.
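
The content-hash cache mentioned above can be sketched in a few lines. This version is in-memory and assumes any batch embedding function is passed in; EmbeddingCache and fake_embed are hypothetical names, and the fake embedder exists only so the example runs without a model. An exact hash only matches identical text, so near-duplicates still get re-encoded.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]

class EmbeddingCache:
    """In-memory SHA-256 -> vector cache in front of any batch embedding function."""

    def __init__(self, embed_fn: Callable[[Sequence[str]], List[Vector]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, Vector] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: Sequence[str]) -> List[Vector]:
        keys = [self._key(t) for t in texts]
        # Collect each unseen text once, preserving first-seen order
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._store and key not in missing:
                missing[key] = text
        if missing:
            new_vecs = self._embed_fn(list(missing.values()))
            for key, vec in zip(missing, new_vecs):
                self._store[key] = vec
        return [self._store[k] for k in keys]

# Hypothetical stand-in for model.encode; records calls so cache hits are visible
calls = []
def fake_embed(texts):
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed)
first = cache.embed(["alpha", "beta", "alpha"])
second = cache.embed(["beta", "gamma"])
print(calls)
```

In a real pipeline the dictionary would typically be replaced by a persistent store keyed by the same hash, and the cache key should also include the model name so a model swap does not return stale vectors.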

Multi-provider patterns

sentence-transformers is a local library — multi-provider patterns mean "swap to a hosted embeddings API without rewriting the call site". The shape is straightforward:

  • Provider-agnostic embedding interface. Wrap SentenceTransformer.encode and OpenAI/Cohere/Voyage embedding clients behind one def embed(texts: list[str]) -> np.ndarray. Switch the implementation by config.
  • LangChain Embeddings abstraction. HuggingFaceEmbeddings(model_name=...) wraps sentence-transformers; OpenAIEmbeddings, CohereEmbeddings, VoyageEmbeddings provide hosted equivalents — same interface.
  • fastembed for ONNX-only deployments. Qdrant's fastembed ships pre-quantised ONNX embedding models; no torch dependency. Use when binary size matters (Lambda, edge).
  • Hybrid sparse + dense. Some retrievers combine sentence-transformers dense vectors with BM25 / SPLADE sparse vectors. sentence-transformers 5.x has native sparse encoder support; qdrant, weaviate, and others support hybrid scoring natively.
  • Reranker switching. Cross-encoder rerankers swap freely between providers — Cohere's hosted reranker has the same shape as a local CrossEncoder.
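
The provider-agnostic interface from the first bullet might look like the sketch below. All class and function names here (Embedder, LocalEmbedder, HostedEmbedder, build_embedder) are hypothetical, plain lists stand in for np.ndarray to keep the example dependency-free, and the fake model and API callable replace a real SentenceTransformer and a real hosted client.

```python
from typing import List, Protocol, Sequence

class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> List[List[float]]: ...

class LocalEmbedder:
    """Wraps any object with a SentenceTransformer-style .encode(texts, ...) method."""
    def __init__(self, model):
        self._model = model

    def embed(self, texts):
        vecs = self._model.encode(list(texts), normalize_embeddings=True)
        return [[float(x) for x in v] for v in vecs]

class HostedEmbedder:
    """Wraps a callable that sends texts to a hosted API and returns vectors."""
    def __init__(self, call_api):
        self._call_api = call_api

    def embed(self, texts):
        return self._call_api(list(texts))

def build_embedder(config: dict, *, model=None, call_api=None) -> Embedder:
    # The call site only sees .embed(); the provider is chosen by config
    if config["provider"] == "local":
        return LocalEmbedder(model)
    if config["provider"] == "hosted":
        return HostedEmbedder(call_api)
    raise ValueError(f"unknown provider: {config['provider']}")

# Stand-ins so the sketch runs without a model or network access
class FakeModel:
    def encode(self, texts, normalize_embeddings=False):
        return [[1.0, 0.0] for _ in texts]

local = build_embedder({"provider": "local"}, model=FakeModel())
hosted = build_embedder({"provider": "hosted"},
                        call_api=lambda texts: [[0.0, 1.0] for _ in texts])
local_out = local.embed(["a", "b"])
hosted_out = hosted.embed(["a"])
```

Switching providers still means re-indexing, since vectors from different models are not comparable; the interface only keeps the call site unchanged.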

Security considerations

  • trust_remote_code=True for Nomic, Jina, etc. Same risks as in transformers — pin to a revision SHA.
  • Hub supply chain. Pin library versions in requirements.txt and model versions with the revision argument (a commit SHA) for reproducibility and to defend against malicious updates.
  • PII in embeddings. Embeddings are reversible to a degree — adversaries can sometimes reconstruct sensitive content from vectors. Treat the vector store as carrying the same sensitivity as the source.
  • Cross-tenant index isolation. Multi-tenant retrieval requires per-tenant filters; a single-index design risks cross-tenant leakage if a query accidentally bypasses the filter.
  • Adversarial inputs. Out-of-distribution queries can produce near-random embeddings that match unintended documents. Cap top-k at sensible values and use a similarity threshold.
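
The top-k cap and similarity threshold from the last bullet can be applied as a post-filter on retrieval results. A minimal sketch, assuming hits arrive as (doc_id, score) pairs; filter_hits is a hypothetical helper and the threshold values are placeholders that need tuning per model and corpus.

```python
def filter_hits(scored_hits, max_k=10, min_score=0.3):
    """Keep at most max_k hits whose similarity is at least min_score, best first."""
    ranked = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked[:max_k] if score >= min_score]

# Invented scores for illustration
hits = [("doc-a", 0.82), ("doc-b", 0.12), ("doc-c", 0.41), ("doc-d", 0.29)]
kept = filter_hits(hits, max_k=3, min_score=0.3)
print(kept)
```

Returning fewer than k results (or none) for a query that matches nothing well is usually preferable to padding the context with weak matches.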

Production deployment

sentence-transformers is most often deployed as an embedding micro-service or as a library inside a RAG application.

Embedding micro-service. FastAPI + uvicorn, single worker per device, ONNX backend on CPU, PyTorch + GPU on GPU instances. Health-check endpoint that runs a 1-token encode.

python
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", backend="onnx")
app = FastAPI()

@app.post("/embed")
def embed(payload: dict):
    vecs = model.encode(payload["texts"], normalize_embeddings=True, batch_size=64)
    return {"embeddings": vecs.tolist()}

@app.get("/health")
def health():
    model.encode(["health check"])
    return {"ok": True}

Output: stateless, horizontally scalable; one process per CPU/GPU.

Container shape. Pre-bake the model into the image. Set HF_HUB_OFFLINE=1 at runtime. Lock vector dimensions to what your downstream index expects.

Worker pools for batch jobs. For ingest pipelines, use encode_multi_process across multiple GPUs.

Model versioning. Tag images with the model revision SHA; rebuilding the index on a model upgrade is a planned migration, not a hot swap.

When NOT to use this

  • Hosted embeddings are simpler when you don't already have a GPU. OpenAI/Cohere/Voyage embeddings are HTTP calls; no infra to run.
  • You only need a vector DB's "auto-embed" mode. Some vector DBs (Pinecone, Weaviate) embed for you with their own (or third-party) models.
  • Pure keyword search suffices. BM25 with a good tokenizer beats sloppy dense retrieval on many corpora — try the boring solution first.
  • Edge deployment. Reach for fastembed (ONNX-only, no torch) or a Core ML / TFLite export.
  • You need cross-modal embeddings. Specialist libraries (CLIP family) live elsewhere — sentence-transformers covers some but not all multimodal models.

Evaluation & observability

Embedding quality is hard to spot-check — you need a benchmark. The standard moves are:

  • MTEB (Massive Text Embedding Benchmark): the community-standard suite of 50+ retrieval, clustering, and STS tasks, with a public leaderboard hosted on Hugging Face. The mteb PyPI package runs a SentenceTransformer end-to-end:

    python
    from mteb import MTEB
    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    results = MTEB(tasks=["NFCorpus", "SciFact"]).run(model, output_folder="results/bge-small")
    

    Output: per-task NDCG / MAP scores in results/bge-small/*.json; compare against the public leaderboard.

  • Domain-specific eval datasets. MTEB tells you "average" — your domain is not average. Curate ~50-200 query-document pairs from your real traffic and score Recall@k directly.

  • Cosine score distributions. Plot the histogram of top-1 cosine scores for known-good queries. A long tail near 0 suggests the model isn't a good fit; a tight distribution near 1 suggests overconfidence and likely false positives.

  • Reranker calibration. Cross-encoder scores aren't directly comparable across models — calibrate against a held-out reference set if you display them.

  • A/B test in production. Two indices, two query paths, route a slice of traffic to each, compare downstream success (clickthrough, dwell time, conversion). The only honest test.

  • Drift monitoring. If your corpus shifts (new topics, new vocabulary), embedding quality drifts too. Re-evaluate on the latest data quarterly.
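
For the domain-specific evaluation set, Recall@k is simple enough to compute directly. The sketch below assumes two hypothetical dictionaries: relevant maps each query id to the set of document ids judged relevant, and retrieved maps each query id to the ranked list your retriever returned. The ids are invented.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids found in the top-k results, averaged over queries."""
    per_query = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # skip queries with no labelled relevant documents
        top_k = set(retrieved.get(query_id, [])[:k])
        per_query.append(len(top_k & set(relevant_ids)) / len(relevant_ids))
    return sum(per_query) / len(per_query) if per_query else 0.0

relevant = {"q1": {"d1", "d2"}, "q2": {"d9"}}
retrieved = {"q1": ["d1", "d5", "d2"], "q2": ["d3", "d4", "d9"]}

r_at_2 = recall_at_k(retrieved, relevant, k=2)  # q1 finds 1 of 2, q2 finds 0 of 1
r_at_3 = recall_at_k(retrieved, relevant, k=3)  # both queries find everything
print(r_at_2, r_at_3)
```

Running the same labelled set against two candidate models gives a like-for-like comparison on your own traffic rather than on benchmark averages.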

For traces of embedding calls in the wider system, treat them like any other LLM-adjacent operation: trace via langsmith, OpenTelemetry, or your APM of choice. Embedding latency tends to be the silent contributor to long p99 tails.

Ecosystem integrations

sentence-transformers sits at a junction in the RAG stack — most of its integrations are about feeding embeddings into vector stores and rerankers.

Layer | Common integrations
Vector stores | chromadb, pinecone, weaviate, qdrant, milvus, pgvector (via psycopg), lancedb. All accept either raw .encode output or wrap a SentenceTransformer callable directly.
Frameworks | langchain-huggingface (HuggingFaceEmbeddings), llama-index (HuggingFaceEmbedding), haystack-ai (SentenceTransformersDocumentEmbedder) all wrap the library.
Backends | optimum[onnxruntime] (ONNX) and optimum-intel[openvino] (OpenVINO), selected via backend= on the SentenceTransformer(...) constructor; optimum-neuron (AWS Inferentia) is a separate export path.
Quantisation | optimum.quanto and dynamic-int8 ONNX quantisation knock ~2-4× off inference latency on CPU.
FlashAttention | Pass model_kwargs={"attn_implementation": "flash_attention_2"} for the large encoder backbones when flash-attn is installed.
Hub / Datasets | Tied closely to huggingface-hub for model loading and datasets for training corpora.
Rerankers | BAAI/bge-reranker-*, cross-encoder/ms-marco-MiniLM-L-12-v2, Cohere's hosted reranker; interchangeable via CrossEncoder or REST.

The library is intentionally small — most integrations live one layer up in the consuming application.

See also