cheat sheet
qdrant-client
Package-level reference for qdrant-client on PyPI — install variants, server version matching, gRPC vs HTTP, fastembed extras, and alternatives.
What it is
qdrant-client is the official Python SDK for Qdrant, a Rust-written vector database focused on production-grade similarity search with rich payload filtering. The client speaks both REST (over HTTP) and gRPC to a remote Qdrant server, and ships a built-in in-memory mode (QdrantClient(":memory:")) plus an on-disk single-node mode for local development without a running server.
Reach for qdrant-client when you want strong filtering on structured payloads alongside dense (or sparse, or hybrid) vector search, and you are comfortable running a separate Qdrant server in production. Reach for chromadb when you prefer an embedded, zero-infrastructure store; reach for weaviate-client when you want hybrid search with BM25 baked in.
Install
pip install qdrant-client
Output: (none — exits 0 on success)
uv add qdrant-client
Output: dependency resolved + added to pyproject.toml
poetry add qdrant-client
Output: updated lockfile + virtualenv install
pip install "qdrant-client[fastembed]" # adds on-device embedding via fastembed
pip install "qdrant-client[fastembed-gpu]" # GPU build of fastembed (CUDA)
Output: Qdrant client plus the chosen embedding bundle
Versioning & Python support
- The client follows Qdrant server major/minor versions closely: `qdrant-client~=1.10.0` pairs with Qdrant server `1.10.x`. Cross-major combinations should be avoided; cross-minor usually works, but new server features (e.g. multivector, sparse indices, named vectors) only land in the client a release or two later.
- Recent versions support Python 3.9+. Wheels are pure-Python; gRPC support pulls in `grpcio` (a binary wheel).
- The Python client occasionally lags the server: a feature visible in the server's REST API may not yet have a typed Python wrapper. Workaround: drop to `client.http` (the auto-generated REST methods) or post the raw payload via `requests`.
- The pre-`1.0` to `1.x` jump (early 2023) added breaking changes around collection-config schemas; the `1.10` line further reshaped `VectorParams` and `OptimizersConfig`. Pin a tight range.
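The matching rule above can be checked at startup. The sketch below is a minimal illustration, assuming you already have both version strings in hand (for example from `importlib.metadata.version("qdrant-client")` and from whatever version your server reports); the function names are hypothetical and not part of qdrant-client.

```python
def parse_major_minor(version: str) -> tuple:
    """Return (major, minor) as ints from a string such as '1.10.3' or 'v1.10.3'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def compare_client_server(client_version: str, server_version: str) -> str:
    """Classify a client/server pair by major and minor version only."""
    c_major, c_minor = parse_major_minor(client_version)
    s_major, s_minor = parse_major_minor(server_version)
    if c_major != s_major:
        return "major-mismatch"   # avoid this combination
    if c_minor < s_minor:
        return "client-behind"    # newer server features may lack typed wrappers
    if c_minor > s_minor:
        return "client-ahead"     # client may send fields the server rejects
    return "match"


print(compare_client_server("1.9.2", "1.10.0"))
```

Comparing integers rather than strings matters here: as text, `"1.9"` sorts after `"1.10"`.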
Package metadata
- Maintainer: Qdrant (the company) and community contributors
- Project home: github.com/qdrant/qdrant-client
- Server repo: github.com/qdrant/qdrant
- Docs: qdrant.tech/documentation
- PyPI: pypi.org/project/qdrant-client
- License: Apache-2.0
- Governance: company-led with open contributions; Qdrant Cloud is the commercial hosted offering
- First released: 2021
- Downloads: millions per month; standard choice when Qdrant is the chosen vector DB
Optional dependencies & extras
- `qdrant-client[fastembed]`: adds the `fastembed` library so you can produce embeddings on the client side without pulling in `sentence-transformers` or calling a remote API. Uses ONNX Runtime under the hood.
- `qdrant-client[fastembed-gpu]`: same, but with the CUDA build of ONNX Runtime for GPU-accelerated embedding.
- The base install already supports both HTTP (via `httpx`) and gRPC (via `grpcio`); you opt into gRPC at runtime with `QdrantClient(prefer_grpc=True)`, no extra is needed.
Common companions installed alongside:
- `fastembed`: standalone version of the same embedding library, if you want to share it across multiple processes.
- `sentence-transformers`: alternative client-side embeddings.
- `openai` / `cohere` / `voyageai`: remote embedding APIs you can wire into `client.upsert(...)`.
- `langchain-qdrant` and `llama-index-vector-stores-qdrant`: framework adapters.
Alternatives
| Package | Trade-off |
|---|---|
| `chromadb` | Embedded, zero-infrastructure. Use for prototypes or small RAG apps. |
| `weaviate-client` | Hybrid vector + BM25 with schema-first GraphQL. Use when keyword search matters as much as vectors. |
| `pymilvus` | Milvus client. Use for very large multi-billion-vector workloads. |
| `pinecone-client` | Fully-hosted SaaS. Use when you want to outsource ops entirely. |
| `lancedb` | Embedded columnar DB on Lance/Arrow. Use when your data is already columnar. |
| `pgvector` (via `psycopg`/SQLAlchemy) | Postgres extension. Use when you already run Postgres and want one less moving part. |
Common gotchas
- HTTP vs gRPC ports. The default Qdrant server exposes `6333` (REST) and `6334` (gRPC). `prefer_grpc=True` requires the gRPC port to be reachable; many Docker examples only publish `6333`, so gRPC silently falls back to HTTP or hangs.
- Collection-config schema reshape in `1.10`. `VectorParams`, `OptimizersConfigDiff`, and `HnswConfigDiff` were tightened; old `recreate_collection(...)` calls written for `1.7`-era examples now raise validation errors. Regenerate from current docs.
- `recreate_collection` deletes data. It is `delete` + `create` in one call, not a no-op when the schema already matches. Use `create_collection` with a try/except, or check `collection_exists`, in code that should be idempotent.
- In-memory `:memory:` is single-process only. It exists for unit tests; do not use it as a "lightweight production" mode. The on-disk single-node mode (`path=...`) is more durable but still single-writer.
- Client occasionally lags server features. Sparse vectors, multivector storage, and quantization parameters often appear in the server REST API a release before they get typed Python wrappers. Drop to `client.http` for the gap.
- `grpcio` wheel size is non-trivial. Adds ~10 MB to the install. Slim Docker images that don't actually use gRPC can pin a constraint to skip it, or use a build without it.
- API key vs JWT auth. Self-hosted Qdrant supports a static API key; Qdrant Cloud uses JWT-style tokens. Both go in the `api_key=` constructor argument, but cluster-scoped permissions differ.
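For the port gotcha, a quick reachability probe before constructing the client can save a confusing hang. This is a rough sketch using only the standard library; `port_open` and `pick_transport` are hypothetical helper names, and a successful TCP connect only shows that something is listening, not that it speaks gRPC.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_transport(host: str, rest_port: int = 6333, grpc_port: int = 6334) -> str:
    """Prefer gRPC when its port is reachable, otherwise fall back to REST."""
    if port_open(host, grpc_port):
        return "grpc"
    if port_open(host, rest_port):
        return "rest"
    return "unreachable"
```

The result could then decide whether to pass `prefer_grpc=True` to the constructor.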
Real-world recipes
The recipes below focus on the install / transport / collection-config choices each pattern implies — the sections/ai/qdrant companion covers the points/filters API in depth.
In-memory client for unit tests — QdrantClient(":memory:") boots an in-process Qdrant simulation. Useful for CI; not durable, not the same code path as the production server.
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
client = QdrantClient(":memory:")
client.create_collection(
"kb",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
"kb",
points=[PointStruct(id=1, vector=[0.1] * 384, payload={"src": "intro"})],
)
print(client.search("kb", query_vector=[0.1] * 384, limit=1))
Output: the upserted point with its score and payload; no server is running — everything is in-process
gRPC client against a real server — for batch uploads and high-throughput query loads, gRPC is materially faster than REST. Make sure port 6334 is exposed.
from qdrant_client import QdrantClient
client = QdrantClient(
host="qdrant.internal",
grpc_port=6334,
prefer_grpc=True,
api_key="...", # or jwt token for Qdrant Cloud
https=True,
timeout=30,
)
Output: a client that uses gRPC for batch operations and falls back to HTTP for endpoints not yet on gRPC
Batch upload with progress and retries — upload_points and upload_collection are the bulk-load entry points. They batch internally and recover from transient errors.
from qdrant_client.models import PointStruct
client.upload_points(
collection_name="kb",
points=(PointStruct(id=i, vector=emb, payload=meta)
for i, emb, meta in iter_rows()),
batch_size=512,
parallel=4,
max_retries=3,
)
Output: points stream into the collection from 4 parallel workers, 512 points per batch; the call blocks until the generator is exhausted
HNSW + quantization for a billion-scale collection — Qdrant supports scalar (int8) and product (PQ) quantization at the index level, trading recall for memory and disk.
from qdrant_client.models import (
VectorParams, Distance, HnswConfigDiff,
ScalarQuantization, ScalarQuantizationConfig, ScalarType,
OptimizersConfigDiff,
)
client.create_collection(
"huge",
vectors_config=VectorParams(
size=768,
distance=Distance.COSINE,
on_disk=True,
),
hnsw_config=HnswConfigDiff(m=32, ef_construct=256, on_disk=True),
quantization_config=ScalarQuantization(
scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True),
),
optimizers_config=OptimizersConfigDiff(memmap_threshold=50_000),
)
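Output: a collection that stores the original vectors and the HNSW graph on disk, keeps the int8-quantized vectors in RAM for fast search, and switches segments to memmapped storage once they exceed about 50,000 KB (memmap_threshold is a size in kilobytes, not a point count)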
Hybrid (sparse + dense) with named vectors — Qdrant supports multiple named vectors per point. The classic pattern is a dense vector for semantics and a sparse vector (BM42 / SPLADE) for keyword match, fused server-side or in application code.
from qdrant_client.models import (
VectorParams, Distance, SparseVectorParams, SparseVector, NamedVector,
)
client.create_collection(
"hybrid",
vectors_config={"dense": VectorParams(size=384, distance=Distance.COSINE)},
sparse_vectors_config={"keywords": SparseVectorParams()},
)
# Query with a dense vector, then rerank with a sparse query (RRF in app code)
dense_hits = client.search("hybrid", query_vector=NamedVector(name="dense", vector=q_dense), limit=50)
Output: the collection has both a dense and a sparse vector slot per point; the search call uses the named dense vector explicitly
Production deployment
Qdrant in production is almost always a separate Docker / Kubernetes deployment with the Python client speaking gRPC over a private network. The :memory: and on-disk single-node modes are for development; do not scale them.
Topology checklist:
| Concern | Single-node Docker | Cluster (multi-node) | Qdrant Cloud |
|---|---|---|---|
| Replicas | 1 | configurable per collection | configurable per collection |
| Shards | 1 | per-collection shard count | per-collection shard count |
| Backups | filesystem snapshot of /qdrant/storage | per-shard snapshot API | managed |
| Auth | static API key via env | static API key per node | JWT tokens with role claims |
| Transport | 6333 (HTTP) + 6334 (gRPC) | same per-node | TLS-only |
| Telemetry | sends anonymous pings | same | managed |
Sharding and replication. Set at collection creation:
from qdrant_client.models import VectorParams, Distance
client.create_collection(
"kb",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
shard_number=4,
replication_factor=2,
)
Output: the collection is split into 4 shards, each stored as 2 replicas (8 shard replicas in total), distributed across the cluster nodes
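The capacity arithmetic behind those two settings is simple enough to sketch. The helper below is hypothetical and assumes replicas are spread evenly across nodes, which a real cluster does not guarantee.

```python
def shard_layout(shard_number: int, replication_factor: int, node_count: int) -> dict:
    """Estimate how many shard replicas exist and how many land on each node."""
    total_replicas = shard_number * replication_factor
    return {
        "total_shard_replicas": total_replicas,
        # Average only; actual placement depends on the cluster.
        "avg_replicas_per_node": total_replicas / node_count,
    }


print(shard_layout(shard_number=4, replication_factor=2, node_count=4))
```

Every replica holds a full copy of its shard, so storage and RAM scale with the total replica count rather than with `shard_number` alone.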
Snapshots. Snapshots are per-collection and per-shard. client.create_snapshot("kb") produces a tar archive on the server's snapshot directory; restore via recover_snapshot. Coordinate with the writer to get a consistent view — Qdrant does not freeze writes during snapshot creation.
Multi-tenancy. The robust pattern is filter-per-tenant with a tenant_id payload field plus a payload index for fast filtering. Collection-per-tenant works for low tenant counts but the cluster's collection-metadata overhead caps the practical limit at low thousands.
from qdrant_client.models import FieldCondition, Filter, MatchValue
client.create_payload_index(
"kb",
field_name="tenant_id",
field_schema="keyword",
)
# Every query carries a tenant filter as a non-optional precondition
hits = client.search(
"kb",
query_vector=q,
query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=tid))]),
limit=10,
)
Output: the payload index makes the prefilter fast; the application enforces tenant isolation by injecting the filter on every call
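One way to make the tenant filter non-optional is to build it in a single helper that every query path must call. The sketch below works on plain dicts shaped like Qdrant's REST filter JSON (`must` as a list of `{"key": ..., "match": {"value": ...}}` conditions); `with_tenant_filter` is a hypothetical name, and converting the result into the typed `Filter` model is left to the caller.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def with_tenant_filter(
    tenant_id: str, user_filter: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return a copy of user_filter with a mandatory tenant_id condition in `must`."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every query")
    merged = deepcopy(user_filter) if user_filter else {}
    tenant_condition = {"key": "tenant_id", "match": {"value": tenant_id}}
    # Assumes any caller-supplied `must` is a list of conditions.
    merged["must"] = [tenant_condition] + list(merged.get("must", []))
    return merged


print(with_tenant_filter("acme", {"must": [{"key": "lang", "match": {"value": "en"}}]}))
```

Because the helper raises on an empty tenant ID and never mutates its input, a forgotten or shared filter object cannot silently widen a query to other tenants.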
Index tuning & retrieval quality
Qdrant's HNSW parameters can be set at collection creation and updated later (a key difference from Chroma; the index is rebuilt incrementally). The three knobs are m, ef_construct, and the search-time hnsw_ef (often written ef).
from qdrant_client.models import HnswConfigDiff, SearchParams
client.update_collection(
"kb",
hnsw_config=HnswConfigDiff(m=48, ef_construct=400),
)
# Per-query hnsw_ef is the search-time knob; pass it via the typed SearchParams model
hits = client.search(
"kb",
query_vector=q,
limit=10,
search_params=SearchParams(hnsw_ef=256),
)
Output: the collection rebuilds its HNSW index in the background; per-query hnsw_ef controls recall vs latency on each call
Trade-off table:
| Parameter | Default | Higher value | Effect |
|---|---|---|---|
| `m` | 16 | 32–64 | better recall, more RAM per vector |
| `ef_construct` | 100 | 256–512 | better index quality, slower build |
| `hnsw_ef` (query) | 128 | 256–1024 | better recall, slower query |
| quantization | off | int8 / PQ | smaller index, slight recall hit |
On-disk vs in-memory. Vectors can live on_disk=True (memmapped) while the HNSW graph stays in RAM. Past a few million vectors this is often the most practical setup: with the raw vectors paged from disk, the RAM budget is dominated by the graph.
Quantization. Scalar int8 quantization cuts RAM ~4× with typically <1% recall loss. Product Quantization (PQ) goes further (~16×) at higher recall cost. Use always_ram=True to pin the quantized index in memory while raw vectors page from disk.
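The ~4× figure follows directly from component width: float32 uses 4 bytes per dimension, int8 uses 1. A back-of-the-envelope sketch, with a hypothetical collection of 100 million 768-dimensional vectors, counting vector storage only (no HNSW graph, payload, or allocator overhead):

```python
def vector_ram_gib(num_vectors: int, dim: int, bytes_per_component: int) -> float:
    """Raw vector storage in GiB, ignoring index and payload overhead."""
    return num_vectors * dim * bytes_per_component / 2**30


raw_gib = vector_ram_gib(100_000_000, 768, 4)    # float32 originals
int8_gib = vector_ram_gib(100_000_000, 768, 1)   # scalar-quantized copies

print(f"float32: {raw_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")
# prints "float32: 286.1 GiB, int8: 71.5 GiB"
```

With `always_ram=True` only the smaller figure has to fit in memory, while the float32 originals can stay on disk.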
Hybrid recipes. Qdrant supports server-side Reciprocal Rank Fusion in recent versions; for older clients/servers, fuse client-side. The reference pattern: dense top-50 + sparse top-50, fuse with RRF, take top-10.
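For the client-side path, Reciprocal Rank Fusion needs only the ranked ID lists from each search. The sketch below is a minimal version under stated assumptions: `rrf_fuse` is a hypothetical helper, `k=60` is the commonly used smoothing constant rather than a value taken from Qdrant, and ties are broken by ID for determinism.

```python
from collections import defaultdict


def rrf_fuse(rankings, k: int = 60, top_n: int = 10):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by the ID's string form.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
    return [point_id for point_id, _ in ordered[:top_n]]


dense_ids = ["a", "b", "c"]    # IDs from the dense search, best first
sparse_ids = ["b", "d", "a"]   # IDs from the sparse search, best first
print(rrf_fuse([dense_ids, sparse_ids], top_n=3))  # prints ['b', 'a', 'd']
```

RRF uses ranks only, so the dense and sparse scores never need to be on a comparable scale.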
Version migration guide
The qdrant-client library tracks server major/minor closely. Cross-major combinations should be avoided; cross-minor usually works but newer server features lag in typed Python wrappers.
0.x → 1.0 (early 2023):
- Collection-config schemas reshaped. `VectorParams`, `OptimizersConfigDiff`, and `HnswConfigDiff` moved under `qdrant_client.models` with stricter validation. Old `recreate_collection(...)` calls now raise.
- `recreate_collection` became `create_collection` + `delete_collection`. The combined helper still exists but is data-destructive; use the explicit pair for idempotent code.
1.x minor-to-minor (notable):
- `1.7` → `1.10`: `VectorParams`, sparse-vector params, and `quantization_config` shape tightened. Many fields became required that were optional before.
- Multivector support (multiple vectors per point with a single name) and named vectors evolved independently; read the changelog when adopting either.
- `points_batch` deprecated in favour of `upload_points` / `upload_collection`.
- Server features lead the client. Sparse vectors, multivector, and quantization typically appear in the server REST API a release before they get typed Python wrappers. Fall back to `client.http` (the auto-generated REST methods) or post raw JSON when the typed wrapper is missing.
Pinning strategy. Match qdrant-client minor to server minor exactly in production. qdrant-client~=1.10.0 paired with Qdrant server 1.10.x is the safe path: ~=1.10.0 allows patch upgrades but not minor drift, whereas the two-segment form ~=1.10 would also admit 1.11 and any later 1.x release.
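The number of segments in a compatible-release pin matters. Under PEP 440, `~=1.10` expands to `>=1.10, ==1.*`, while `~=1.10.0` expands to `>=1.10.0, ==1.10.*`. The sketch below reimplements that rule for plain numeric versions only (no pre-releases or local tags); both function names are hypothetical, and real tooling should rely on the `packaging` library instead.

```python
def compatible_release_range(spec: str):
    """Map '~=X.Y[.Z]' to (inclusive lower bound, exclusive upper bound) tuples."""
    if not spec.startswith("~="):
        raise ValueError("expected a '~=' specifier")
    parts = [int(p) for p in spec[2:].strip().split(".")]
    if len(parts) < 2:
        raise ValueError("'~=' needs at least two release segments")
    lower = tuple(parts)
    # Drop the last segment and bump the one before it.
    upper = tuple(parts[:-2] + [parts[-2] + 1])
    return lower, upper


def allows(spec: str, version: str) -> bool:
    """Return True if a plain numeric version satisfies the '~=' specifier."""
    lower, upper = compatible_release_range(spec)
    v = tuple(int(p) for p in version.split("."))
    v = v + (0,) * (len(lower) - len(v))  # treat 1.10 as 1.10.0
    return lower <= v < upper


print(allows("~=1.10", "1.12.0"))    # prints True: minor drift is allowed
print(allows("~=1.10.0", "1.12.0"))  # prints False: only 1.10.x patches pass
```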
Performance tuning
The two transports — REST (HTTP) and gRPC — have very different cost profiles. Use gRPC for batch upload and any hot query loop; REST is fine for ad-hoc calls.
| Lever | Mechanism | When it helps |
|---|---|---|
| `prefer_grpc=True` | gRPC over port 6334 | batch uploads, query throughput |
| `upload_points(parallel=N)` | concurrent batches | initial bulk-load |
| `wait=False` on upsert | fire-and-forget | high-throughput ingestion; consistency relaxed |
| `on_disk=True` for vectors | memmap raw vectors | large collections, RAM-bound |
| `always_ram=True` for quantization | pin quantized index | latency-critical reads |
| `payload_index` on filter fields | hashed lookup | filtered queries with selective predicates |
| `AsyncQdrantClient` | concurrent queries from asyncio | many concurrent users |
Async client. AsyncQdrantClient mirrors the sync surface with await semantics. It is the natural fit for high-concurrency FastAPI / aiohttp apps, where a blocking client call would stall the event loop.
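import asyncio
from qdrant_client import AsyncQdrantClient

async def main():
    client = AsyncQdrantClient(host="qdrant.internal", prefer_grpc=True)
    q = [0.1] * 384  # placeholder query vector; substitute a real embedding
    try:
        hits = await client.search("kb", query_vector=q, limit=10)
        print(hits)
    finally:
        await client.close()  # release gRPC channels even if the search fails

asyncio.run(main())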
Output: a coroutine that runs the search without blocking the event loop; always await client.close() to release gRPC channels
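When many queries run at once, it can help to cap how many are in flight so a burst does not overwhelm the server. The sketch below shows the pattern with a semaphore and `asyncio.gather`; `run_bounded` is a hypothetical helper and `fake_search` is a stand-in coroutine, not a qdrant-client call. In a real app the stand-in would wrap an awaited client search.

```python
import asyncio


async def run_bounded(search_one, queries, max_in_flight: int = 8):
    """Run search_one(q) for every query, with at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(query):
        async with semaphore:
            return await search_one(query)

    # gather returns results in the same order as the input queries.
    return await asyncio.gather(*(guarded(q) for q in queries))


async def fake_search(query):
    """Stand-in for an awaited search call; yields control once, then returns."""
    await asyncio.sleep(0)
    return f"hits-for-{query}"


print(asyncio.run(run_bounded(fake_search, ["a", "b", "c"], max_in_flight=2)))
```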
Batching guidance. For initial bulk-load: 256–1024 points per batch, 4–8 parallel batches. Smaller batches are network-overhead-bound; larger batches saturate server memory and trigger optimiser pauses.
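To see how batch size translates into request count, the sketch below chunks any iterable lazily. `upload_points` already batches internally, so this is only an illustration of the mechanism (or a starting point for a custom ingestion loop); `chunked` and `request_count` are hypothetical helpers.

```python
import math
from itertools import islice


def chunked(iterable, batch_size: int):
    """Yield lists of up to batch_size items without materialising the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def request_count(total_points: int, batch_size: int) -> int:
    """Number of upload requests needed for total_points at a given batch size."""
    return math.ceil(total_points / batch_size)


print(request_count(1_000_000, 512))  # prints 1954
```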
Troubleshooting common errors
- `ConnectionRefusedError` on port 6334: gRPC port not exposed. Many Docker Compose examples publish only 6333 (REST). Either expose 6334 or drop `prefer_grpc=True`.
- `ValidationError` on `VectorParams`: collection-config schema changed. Regenerate the config call from current docs; old `1.7`-era examples no longer validate on `1.10+`.
- `Service Unavailable` mid-upload: server is in optimiser pause (index rebuild). Lower `parallel=` and add `max_retries=`; the client retries with backoff.
- `recreate_collection` deleted my data: by design. Use `create_collection` with a `try/except` on `UnexpectedResponse`, or check `client.collection_exists("name")` first.
- `Unauthorized` against Qdrant Cloud: JWT token expired, or you used a Cloud token against a self-hosted instance. Tokens are tenant-scoped and time-bounded.
- Typed wrapper missing for a server feature: drop to `client.http.points_api.upsert_points(...)` (the auto-generated REST client) or post raw JSON via `requests`.
- Slow first query after restart: HNSW graph loads lazily from disk. Warm the cache with a synthetic query at startup.
- gRPC client leaks file descriptors: always `client.close()` in long-running processes; the channel pool does not auto-clean on GC.
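The `recreate_collection` entry suggests checking `collection_exists` first. A minimal sketch of that idempotent pattern follows. `ensure_collection` is a hypothetical helper, and `InMemoryFakeClient` is a stand-in exposing only the two methods used here, not the real `QdrantClient`. The check and the create are separate calls, so two processes racing on the same name could still collide.

```python
class InMemoryFakeClient:
    """Stand-in with the two methods used below; NOT the real QdrantClient."""

    def __init__(self):
        self.collections = {}

    def collection_exists(self, collection_name):
        return collection_name in self.collections

    def create_collection(self, collection_name, **config):
        if collection_name in self.collections:
            raise RuntimeError(f"collection {collection_name!r} already exists")
        self.collections[collection_name] = config


def ensure_collection(client, name, **create_kwargs) -> bool:
    """Create the collection only if it is missing; return True when created."""
    if client.collection_exists(name):
        return False  # leave existing data untouched
    client.create_collection(name, **create_kwargs)
    return True


client = InMemoryFakeClient()
print(ensure_collection(client, "kb", vectors_config={"size": 384}))  # prints True
print(ensure_collection(client, "kb", vectors_config={"size": 384}))  # prints False
```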
Security considerations
Qdrant ships with auth disabled by default — appropriate for localhost development, dangerous for any networked deployment.
- API key auth. Set `QDRANT__SERVICE__API_KEY` on the server and pass `api_key=` to the client. Read-only keys are available via `QDRANT__SERVICE__READ_ONLY_API_KEY`.
- JWT (Qdrant Cloud). Cloud customers use JWT tokens with role claims (read/write per collection). Tokens are issued from the Cloud console; rotate alongside other secrets.
- TLS. Configure `QDRANT__SERVICE__ENABLE_TLS=true` with cert/key files, or terminate TLS at an ingress proxy. Without TLS, both API keys and vectors are visible on the wire.
- mTLS for cluster traffic. Multi-node Qdrant clusters can require mutual TLS on the gossip and replication channels.
- Multi-tenant isolation. Filter-per-tenant is the recommended pattern; enforce the filter in a wrapper rather than trusting every call site.
- Payload size limits. Large per-point payloads (>10 KB) are accepted but slow queries; consider storing only IDs in Qdrant and fetching full content from a separate store. This also shrinks the blast radius if the vector DB is compromised.
- Prompt injection via retrieved content. Documents returned to the LM may carry attack payloads. Sanitise before prompt assembly.
- Snapshots are plaintext. Encrypt at the filesystem / object-store layer (`s3://...` SSE, EBS encryption, etc.).
- Telemetry. Qdrant sends anonymous usage telemetry by default; disable via `QDRANT__TELEMETRY_DISABLED=true` if your compliance regime forbids it.
Ecosystem integrations
- LangChain: `langchain-qdrant` package; `QdrantVectorStore` retriever.
- LlamaIndex: `llama-index-vector-stores-qdrant`.
- Haystack 2.x: `qdrant-haystack` from the integrations namespace.
- Semantic Kernel: `semantic-kernel[qdrant]` extra; `QdrantVectorStore` via the new `VectorStore` abstraction.
- DSPy: Qdrant retriever module ships with `dspy-ai`.
- fastembed: embedding library by the same team; runs ONNX models for fast client-side embeddings without `torch`.
- MCP: community MCP servers expose Qdrant collections as tools for agentic use.
When NOT to use this
Qdrant earns its keep when production filtering, sharding, or hybrid search matter. The trade-offs below are where another tool fits better.
- Notebook prototypes with no infra. `chromadb` in-process is friendlier: one `pip install`, no server. Move to Qdrant when latency, filtering, or scale demand it.
- You want hybrid search out of the box, not in app code. Weaviate fuses BM25 + vector server-side without RRF stitching.
- Postgres is already your operational database. `pgvector` adds vector search to an existing operational store; one less moving piece.
- Very large clusters (>10B vectors). Milvus and Vespa have more battle-tested distributed stories at that scale.
- Fully-managed-only deployments. Qdrant Cloud exists but you may prefer Pinecone if you do not want any awareness of the engine internals.
See also
- AI: qdrant — collections, points, filters, hybrid search
- Concept: RAG — retrieval-augmented generation patterns
- Concept: API — REST design fundamentals