cheat sheet

qdrant-client

Package-level reference for qdrant-client on PyPI — install variants, server version matching, gRPC vs HTTP, fastembed extras, and alternatives.

What it is

qdrant-client is the official Python SDK for Qdrant, a Rust-written vector database focused on production-grade similarity search with rich payload filtering. The client speaks both REST (over HTTP) and gRPC to a remote Qdrant server, and ships a built-in in-memory mode (QdrantClient(":memory:")) plus an on-disk single-node mode for local development without a running server.

Reach for qdrant-client when you want strong filtering on structured payloads alongside dense (or sparse, or hybrid) vector search, and you are comfortable running a separate Qdrant server in production. Reach for chromadb when you prefer an embedded, zero-infrastructure store; reach for weaviate-client when you want hybrid search with BM25 baked in.

Install

bash
pip install qdrant-client

Output: (none — exits 0 on success)

bash
uv add qdrant-client

Output: dependency resolved + added to pyproject.toml

bash
poetry add qdrant-client

Output: updated lockfile + virtualenv install

bash
pip install "qdrant-client[fastembed]"           # adds on-device embedding via fastembed
pip install "qdrant-client[fastembed-gpu]"       # GPU build of fastembed (CUDA)

Output: Qdrant client plus the chosen embedding bundle

Versioning & Python support

  • The client follows Qdrant server major/minor versions closely — qdrant-client~=1.10 pairs with Qdrant server 1.10.x. Cross-major combinations should be avoided; cross-minor usually works but new server features (e.g. multivector, sparse indices, named vectors) only land in the client a release or two later.
  • Recent versions support Python 3.9+. Wheels are pure-Python; gRPC support pulls in grpcio (a binary wheel).
  • The Python client occasionally lags the server — a feature visible in the server's REST API may not yet have a typed Python wrapper. Workaround: drop to client.http (the auto-generated REST methods) or post the raw payload via requests.
  • The 0.x → 1.0 jump (early 2023) introduced breaking changes around collection-config schemas; the 1.10 line further reshaped VectorParams and OptimizersConfig. Pin a tight range.

Package metadata

  • Maintainer: Qdrant (the company) and community contributors
  • Project home: github.com/qdrant/qdrant-client
  • Server repo: github.com/qdrant/qdrant
  • Docs: qdrant.tech/documentation
  • PyPI: pypi.org/project/qdrant-client
  • License: Apache-2.0
  • Governance: company-led with open contributions; Qdrant Cloud is the commercial hosted offering
  • First released: 2021
  • Downloads: millions per month; standard choice when Qdrant is the chosen vector DB

Optional dependencies & extras

  • qdrant-client[fastembed] — adds the fastembed library so you can produce embeddings on the client side without pulling in sentence-transformers or calling a remote API. Uses ONNX Runtime under the hood.
  • qdrant-client[fastembed-gpu] — same, but with the CUDA build of ONNX Runtime for GPU-accelerated embedding.
  • The base install already supports both HTTP (via httpx) and gRPC (via grpcio) — you opt into gRPC at runtime with QdrantClient(prefer_grpc=True), no extra is needed.

Common companions installed alongside:

  • fastembed — standalone version of the same embedding library, if you want to share it across multiple processes.
  • sentence-transformers — alternative client-side embeddings.
  • openai / cohere / voyageai — remote embedding APIs you can wire into client.upsert(...).
  • langchain-qdrant and llama-index-vector-stores-qdrant — framework adapters.

Alternatives

Package | Trade-off
chromadb | Embedded, zero-infrastructure. Use for prototypes or small RAG apps.
weaviate-client | Hybrid vector + BM25 with schema-first GraphQL. Use when keyword search matters as much as vectors.
pymilvus | Milvus client. Use for very large multi-billion-vector workloads.
pinecone-client | Fully-hosted SaaS. Use when you want to outsource ops entirely.
lancedb | Embedded columnar DB on Lance/Arrow. Use when your data is already columnar.
pgvector (via psycopg/SQLAlchemy) | Postgres extension. Use when you already run Postgres and want one less moving part.

Common gotchas

  1. HTTP vs gRPC ports. The default Qdrant server exposes 6333 (REST) and 6334 (gRPC). prefer_grpc=True requires the gRPC port to be reachable; many Docker examples publish only 6333, in which case gRPC calls fail to connect or hang until the timeout rather than falling back to HTTP.
  2. Collection-config schema reshape in 1.10. VectorParams, OptimizersConfigDiff, and HnswConfigDiff were tightened; old recreate_collection(...) calls written for 1.7-era examples now raise validation errors. Regenerate from current docs.
  3. recreate_collection deletes data. It is delete + create in one call, not a no-op when the schema already matches. Use create_collection with a try/except, or check collection_exists, in code that should be idempotent.
  4. In-memory :memory: is single-process only. It exists for unit tests; do not use it as a "lightweight production" mode. The on-disk single-node mode (path=...) is more durable but still single-writer.
  5. Client occasionally lags server features. Sparse vectors, multivector storage, and quantization parameters often appear in the server REST API a release before they get typed Python wrappers. Drop to client.http for the gap.
  6. grpcio wheel size is non-trivial. Adds ~10 MB to the install. grpcio is a required dependency of the base package, so slim Docker images carry it even when only REST is used; budget for it rather than trying to skip it.
  7. API key vs JWT auth. Self-hosted Qdrant supports a static API key; Qdrant Cloud uses JWT-style tokens. Both go in the api_key= constructor argument, but cluster-scoped permissions differ.
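
Gotcha 3 can be sidestepped with a small guard. The following is a minimal sketch, not the library's own helper: ensure_collection is a hypothetical name, and it relies only on the collection_exists and create_collection methods mentioned above. It does not compare schemas, and the check-then-create sequence can still race if two processes start at once.

```python
def ensure_collection(client, name, **create_kwargs):
    """Create `name` only if it is missing; never drops existing data.

    `client` is anything exposing collection_exists() and
    create_collection(), as QdrantClient does. Returns True when a
    collection was created and False when it already existed.
    """
    if client.collection_exists(name):
        return False
    client.create_collection(name, **create_kwargs)
    return True


# Typical call (assumes a connected QdrantClient named `client`):
# ensure_collection(
#     client, "kb",
#     vectors_config=VectorParams(size=384, distance=Distance.COSINE),
# )
```

Because the function never deletes anything, rerunning a deployment script leaves existing points untouched.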

Real-world recipes

The recipes below focus on the install / transport / collection-config choices each pattern implies — the sections/ai/qdrant companion covers the points/filters API in depth.

In-memory client for unit tests. QdrantClient(":memory:") boots an in-process Qdrant simulation. Useful for CI; not durable, not the same code path as the production server.

python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(":memory:")
client.create_collection(
    "kb",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    "kb",
    points=[PointStruct(id=1, vector=[0.1] * 384, payload={"src": "intro"})],
)
print(client.search("kb", query_vector=[0.1] * 384, limit=1))

Output: the upserted point with its score and payload; no server is running — everything is in-process

gRPC client against a real server — for batch uploads and high-throughput query loads, gRPC is materially faster than REST. Make sure port 6334 is exposed.

python
from qdrant_client import QdrantClient

client = QdrantClient(
    host="qdrant.internal",
    grpc_port=6334,
    prefer_grpc=True,
    api_key="...",     # or jwt token for Qdrant Cloud
    https=True,
    timeout=30,
)

Output: a client that uses gRPC for batch operations and falls back to HTTP for endpoints not yet on gRPC

Batch upload with progress and retries. upload_points and upload_collection are the bulk-load entry points. They batch internally and recover from transient errors.

python
from qdrant_client.models import PointStruct

client.upload_points(
    collection_name="kb",
    points=(PointStruct(id=i, vector=emb, payload=meta)
            for i, emb, meta in iter_rows()),
    batch_size=512,
    parallel=4,
    max_retries=3,
)

Output: points stream into the collection in 4 parallel batches of 512; the call blocks until the generator is exhausted
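
The upload snippet calls iter_rows() without defining it. One possible shape, purely as an illustration: the sketch below assumes a JSON Lines file named rows.jsonl with id, vector, and optional payload keys. Both the file name and the record layout are hypothetical, not something qdrant-client prescribes.

```python
import json


def iter_rows(path="rows.jsonl"):
    """Yield (id, vector, payload) tuples from a JSON Lines file.

    Assumed record shape (hypothetical):
        {"id": 1, "vector": [0.1, 0.2], "payload": {"src": "intro"}}
    Blank lines are skipped; a missing payload becomes an empty dict.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            yield record["id"], record["vector"], record.get("payload", {})
```

Keeping this a generator means the file is read lazily, so the full dataset never has to sit in memory while batches are uploaded.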

HNSW + quantization for a billion-scale collection — Qdrant supports scalar (int8) and product (PQ) quantization at the index level, trading recall for memory and disk.

python
from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
    OptimizersConfigDiff,
)

client.create_collection(
    "huge",
    vectors_config=VectorParams(
        size=768,
        distance=Distance.COSINE,
        on_disk=True,
    ),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256, on_disk=True),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True),
    ),
    optimizers_config=OptimizersConfigDiff(memmap_threshold=50_000),
)

Output: a collection that holds raw vectors on disk, keeps the int8-quantized vectors in RAM for fast search, and memmaps segments once they exceed the memmap_threshold of 50,000 KB (the threshold is a size in kilobytes, not a point count)

Hybrid (sparse + dense) with named vectors — Qdrant supports multiple named vectors per point. The classic pattern is a dense vector for semantics and a sparse vector (BM42 / SPLADE) for keyword match, fused server-side or in application code.

python
from qdrant_client.models import (
    VectorParams, Distance, SparseVectorParams, SparseVector, NamedVector,
)

client.create_collection(
    "hybrid",
    vectors_config={"dense": VectorParams(size=384, distance=Distance.COSINE)},
    sparse_vectors_config={"keywords": SparseVectorParams()},
)
# Query with a dense vector, then rerank with a sparse query (RRF in app code)
dense_hits = client.search("hybrid", query_vector=NamedVector(name="dense", vector=q_dense), limit=50)

Output: the collection has both a dense and a sparse vector slot per point; the search call uses the named dense vector explicitly

Production deployment

Qdrant in production is almost always a separate Docker / Kubernetes deployment with the Python client speaking gRPC over a private network. The :memory: and on-disk single-node modes are for development; do not scale them.

Topology checklist:

Concern | Single-node Docker | Cluster (multi-node) | Qdrant Cloud
Replicas | 1 | configurable per collection | configurable per collection
Shards | 1 | per-collection shard count | per-collection shard count
Backups | filesystem snapshot of /qdrant/storage | per-shard snapshot API | managed
Auth | static API key via env | static API key per node | JWT tokens with role claims
Transport | 6333 (HTTP) + 6334 (gRPC) | same per-node | TLS-only
Telemetry | sends anonymous pings | same | managed

Sharding and replication. Set at collection creation:

python
from qdrant_client.models import VectorParams, Distance

client.create_collection(
    "kb",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    shard_number=4,
    replication_factor=2,
)

Output: the collection is split into 4 shards, and each shard is stored as 2 copies (replication_factor=2) distributed across cluster nodes

Snapshots. Snapshots are per-collection and per-shard. client.create_snapshot("kb") produces a tar archive on the server's snapshot directory; restore via recover_snapshot. Coordinate with the writer to get a consistent view — Qdrant does not freeze writes during snapshot creation.

Multi-tenancy. The robust pattern is filter-per-tenant with a tenant_id payload field plus a payload index for fast filtering. Collection-per-tenant works for low tenant counts but the cluster's collection-metadata overhead caps the practical limit at low thousands.

python
from qdrant_client.models import FieldCondition, Filter, MatchValue

client.create_payload_index(
    "kb",
    field_name="tenant_id",
    field_schema="keyword",
)
# Every query carries a tenant filter as a non-optional precondition
hits = client.search(
    "kb",
    query_vector=q,
    query_filter=Filter(must=[FieldCondition(key="tenant_id", match=MatchValue(value=tid))]),
    limit=10,
)

Output: the payload index makes the prefilter fast; the application enforces tenant isolation by injecting the filter on every call

Index tuning & retrieval quality

Qdrant's HNSW parameters can be set at collection creation and updated later (a key difference from Chroma); after an update the index is rebuilt in the background rather than requiring a new collection. The three knobs are m, ef_construct, and the search-time ef (hnsw_ef in the client).

python
from qdrant_client.models import HnswConfigDiff, SearchParams

client.update_collection(
    "kb",
    hnsw_config=HnswConfigDiff(m=48, ef_construct=400),
)
# Per-query ef is the search-time knob; pass it as a typed SearchParams
# so it converts cleanly on both the REST and gRPC paths
hits = client.search(
    "kb",
    query_vector=q,
    limit=10,
    search_params=SearchParams(hnsw_ef=256),
)

Output: the collection rebuilds its HNSW index in the background; per-query hnsw_ef controls recall vs latency on each call

Trade-off table:

Parameter | Default | Higher value | Effect
m | 16 | 32–64 | better recall, more RAM per vector
ef_construct | 100 | 256–512 | better index quality, slower build
hnsw_ef (query) | 128 | 256–1024 | better recall, slower query
quantization | off | int8 / PQ | smaller index, slight recall hit

On-disk vs in-memory. Vectors can live on_disk=True (memmapped) while the HNSW graph stays in RAM. Past a few million vectors this is the only practical setup; the memory budget is dominated by the graph, not the raw vectors.

Quantization. Scalar int8 quantization cuts RAM ~4× with typically <1% recall loss. Product Quantization (PQ) goes further (~16×) at higher recall cost. Use always_ram=True to pin the quantized index in memory while raw vectors page from disk.
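
The ~4× figure follows from component width alone: a float32 component takes 4 bytes and an int8 component takes 1. A back-of-the-envelope sketch, assuming a hypothetical collection of 1,000,000 vectors at 768 dimensions and counting raw vector storage only (HNSW graph, payloads, and allocator overhead are ignored):

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_component=4):
    """Bytes needed for the vector components alone (no index, no payload)."""
    return num_vectors * dim * bytes_per_component


n, dim = 1_000_000, 768                       # hypothetical collection size
float32_bytes = raw_vector_bytes(n, dim)      # 4 bytes per component
int8_bytes = raw_vector_bytes(n, dim, 1)      # 1 byte per component

print(f"float32: {float32_bytes / 1e9:.2f} GB")              # float32: 3.07 GB
print(f"int8:    {int8_bytes / 1e9:.2f} GB")                 # int8:    0.77 GB
print(f"ratio:   {float32_bytes / int8_bytes:.0f}x")         # ratio:   4x
```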

Hybrid recipes. Qdrant supports server-side Reciprocal Rank Fusion in recent versions; for older clients/servers, fuse client-side. The reference pattern: dense top-50 + sparse top-50, fuse with RRF, take top-10.
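
For the client-side path, Reciprocal Rank Fusion needs only the ranked ID lists from each search. The sketch below is a plain-Python illustration, not Qdrant's server-side implementation; rrf_fuse is a hypothetical helper name, and k=60 is the constant commonly used in the RRF literature, taken here as an assumption.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse several ranked lists of point IDs (best first) with RRF.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by the ID's string form
    so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
    return [point_id for point_id, _ in ordered[:top_n]]


dense_ids = ["a", "b", "c"]    # IDs from the dense search, best first
sparse_ids = ["c", "a", "d"]   # IDs from the sparse search, best first
print(rrf_fuse([dense_ids, sparse_ids]))  # ['a', 'c', 'b', 'd']
```

In the reference pattern each input list would hold the top-50 IDs from one search and top_n would stay at 10.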

Version migration guide

The qdrant-client library tracks server major/minor closely. Cross-major combinations should be avoided; cross-minor usually works but newer server features lag in typed Python wrappers.

0.x → 1.0 (early 2023):

  • Collection-config schemas reshaped. VectorParams, OptimizersConfigDiff, and HnswConfigDiff moved under qdrant_client.models with stricter validation. Old recreate_collection(...) calls now raise.
  • recreate_collection became create_collection + delete_collection. The combined helper still exists but is data-destructive; use the explicit pair for idempotent code.

1.x minor-to-minor (notable):

  • 1.7 → 1.10: VectorParams, sparse-vector params, and quantization_config shape tightened. Many fields became required that were optional before.
  • Multivector support (multiple vectors per point with a single name) and named vectors evolved independently — read the changelog when adopting either.
  • points_batch deprecated in favour of upload_points / upload_collection.
  • Server features lead the client. Sparse vectors, multivector, and quantization typically appear in the server REST API a release before they get typed Python wrappers. Fall back to client.http (the auto-generated REST methods) or post raw JSON when the typed wrapper is missing.
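
When posting raw JSON, the request itself is small. The sketch below builds (but does not send) an upsert request against the REST endpoint PUT /collections/{name}/points, using the standard library's urllib in place of requests to stay dependency-free. build_upsert_request is a hypothetical helper, and the base URL and key in the usage comment are placeholders.

```python
import json
import urllib.request


def build_upsert_request(base_url, collection, points, api_key=None):
    """Build a PUT request that upserts `points` via the REST API.

    `points` is a list of dicts shaped like
    {"id": 1, "vector": [...], "payload": {...}}.
    """
    url = f"{base_url.rstrip('/')}/collections/{collection}/points?wait=true"
    body = json.dumps({"points": points}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["api-key"] = api_key  # header name used by Qdrant's REST API
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Sending it (placeholder URL and key):
# req = build_upsert_request("http://localhost:6333", "kb", points, api_key="...")
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status)
```

Fields the typed wrapper does not know yet can be added to the point dicts directly, since the body is plain JSON.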

Pinning strategy. Match qdrant-client minor to server minor exactly in production. qdrant-client~=1.10 paired with Qdrant server 1.10.x is the safe path; ~=1.10 allows patch upgrades but not minor drift.
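
A startup check can catch minor drift before it turns into validation errors. This is a small sketch with a hypothetical helper name; it compares plain version strings and leaves it to the caller to obtain them (the installed client version is available from importlib.metadata.version("qdrant-client"); where the server version comes from is up to your deployment).

```python
def same_minor(client_version, server_version):
    """True when both versions share major.minor, e.g. 1.10.1 and 1.10.3."""

    def major_minor(version):
        parts = version.lstrip("v").split(".")
        return int(parts[0]), int(parts[1])

    return major_minor(client_version) == major_minor(server_version)


print(same_minor("1.10.1", "1.10.3"))  # True
print(same_minor("1.9.2", "1.10.0"))   # False
```

Comparing integers rather than strings matters here: as text, "1.10" sorts before "1.9".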

Performance tuning

The two transports — REST (HTTP) and gRPC — have very different cost profiles. Use gRPC for batch upload and any hot query loop; REST is fine for ad-hoc calls.

Lever | Mechanism | When it helps
prefer_grpc=True | gRPC over port 6334 | batch uploads, query throughput
upload_points(parallel=N) | concurrent batches | initial bulk-load
wait=False on upsert | fire-and-forget | high-throughput ingestion; consistency relaxed
on_disk=True for vectors | memmap raw vectors | large collections, RAM-bound
always_ram=True for quantization | pin quantized index | latency-critical reads
payload_index on filter fields | hashed lookup | filtered queries with selective predicates
AsyncQdrantClient | concurrent queries from asyncio | many concurrent users

Async client. AsyncQdrantClient mirrors the sync surface with await semantics, which suits high-concurrency FastAPI / aiohttp apps where a blocking call would stall the event loop.

python
import asyncio
from qdrant_client import AsyncQdrantClient

async def main():
    client = AsyncQdrantClient(host="qdrant.internal", prefer_grpc=True)
    hits = await client.search("kb", query_vector=q, limit=10)
    await client.close()

asyncio.run(main())

Output: a coroutine that runs the search without blocking the event loop; always await client.close() to release gRPC channels
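
With many concurrent users it can help to cap how many searches are in flight at once. The sketch below is one way to do that with asyncio.Semaphore; gather_bounded and run_query are hypothetical names, and run_query stands in for any coroutine function, such as a thin wrapper around the async client's search call.

```python
import asyncio


async def gather_bounded(run_query, queries, max_in_flight=8):
    """Run run_query(q) for every q, at most max_in_flight at a time.

    Results come back in the same order as `queries`.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(query):
        async with semaphore:
            return await run_query(query)

    return await asyncio.gather(*(guarded(q) for q in queries))


# With an AsyncQdrantClient named `client` (inside a coroutine):
# hits = await gather_bounded(
#     lambda q: client.search("kb", query_vector=q, limit=10),
#     query_vectors,
# )
```

The cap protects the server from bursts; the right value depends on the deployment and is worth measuring rather than guessing.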

Batching guidance. For initial bulk-load: 256–1024 points per batch, 4–8 parallel batches. Smaller batches are network-overhead-bound; larger batches saturate server memory and trigger optimiser pauses.
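
upload_points batches internally, but code that drives upsert calls itself needs its own chunking. A minimal standard-library sketch (batched is a hypothetical helper here; newer Python versions ship a similar itertools.batched):

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items; the last one may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Because it consumes any iterable lazily, it pairs well with a generator of points and a batch size in the 256–1024 range suggested above.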

Troubleshooting common errors

  • ConnectionRefusedError on port 6334 — gRPC port not exposed. Many Docker Compose examples publish only 6333 (REST). Either expose 6334 or drop prefer_grpc=True.
  • ValidationError on VectorParams — collection-config schema changed. Regenerate the config call from current docs; old 1.7-era examples no longer validate on 1.10+.
  • Service Unavailable mid-upload — server is in optimiser pause (index rebuild). Lower parallel= and add max_retries=; the client retries with backoff.
  • recreate_collection deleted my data — by design. Use create_collection with a try/except on UnexpectedResponse, or check client.collection_exists("name") first.
  • Unauthorized against Qdrant Cloud — JWT token expired, or you used a Cloud token against a self-hosted instance. Tokens are tenant-scoped and time-bounded.
  • Typed wrapper missing for a server feature — drop to client.http.points_api.upsert_points(...) (the auto-generated REST client) or post raw JSON via requests.
  • Slow first query after restart — HNSW graph loads lazily from disk. Warm the cache with a synthetic query at startup.
  • gRPC client leaks file descriptors — always client.close() in long-running processes; the channel pool does not auto-clean on GC.
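
Several of the errors above come down to a port that is not reachable. A quick TCP probe, using only the standard library, separates network problems from client configuration problems. port_open is a hypothetical helper, and the host name in the usage comment is the placeholder used earlier in this page.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example: check both transports before enabling prefer_grpc=True
# for name, port in (("REST", 6333), ("gRPC", 6334)):
#     print(name, port, "reachable" if port_open("qdrant.internal", port) else "unreachable")
```

A successful probe only shows that something accepts connections on the port; it does not confirm TLS settings or authentication.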

Security considerations

Qdrant ships with auth disabled by default — appropriate for localhost development, dangerous for any networked deployment.

  • API key auth. Set QDRANT__SERVICE__API_KEY on the server and pass api_key= to the client. Read-only keys are available via QDRANT__SERVICE__READ_ONLY_API_KEY.
  • JWT (Qdrant Cloud). Cloud customers use JWT tokens with role claims (read/write per collection). Tokens are issued from the Cloud console; rotate alongside other secrets.
  • TLS. Configure QDRANT__SERVICE__ENABLE_TLS=true with cert/key files, or terminate TLS at an ingress proxy. Without TLS, both API keys and vectors are visible on the wire.
  • mTLS for cluster traffic. Multi-node Qdrant clusters can require mutual TLS on the gossip and replication channels.
  • Multi-tenant isolation. Filter-per-tenant is the recommended pattern; enforce the filter in a wrapper rather than trusting every call site.
  • Payload size limits. Large per-point payloads (>10 KB) are accepted but slow queries; consider storing only IDs in Qdrant and fetching full content from a separate store. Also a smaller blast radius if the vector DB is compromised.
  • Prompt injection via retrieved content. Documents returned to the LLM may carry attack payloads. Sanitise before prompt assembly.
  • Snapshots are plaintext. Encrypt at the filesystem / object-store layer (s3://... SSE, EBS encryption, etc.).
  • Telemetry. Qdrant sends anonymous usage telemetry by default; disable via QDRANT__TELEMETRY_DISABLED=true if your compliance regime forbids it.

Ecosystem integrations

  • LangChain: langchain-qdrant package; QdrantVectorStore retriever.
  • LlamaIndex: llama-index-vector-stores-qdrant.
  • Haystack 2.x: qdrant-haystack from the integrations namespace.
  • Semantic Kernel: semantic-kernel[qdrant] extra; QdrantVectorStore via the new VectorStore abstraction.
  • DSPy: Qdrant retriever module ships with dspy-ai.
  • fastembed: embedding library by the same team; runs ONNX models for fast client-side embeddings without torch.
  • MCP: community MCP servers expose Qdrant collections as tools for agentic use.

When NOT to use this

Qdrant earns its keep when production filtering, sharding, or hybrid search matter. The trade-offs below are where another tool fits better.

  • Notebook prototypes with no infra. chromadb in-process is friendlier — one pip install, no server. Move to Qdrant when latency, filtering, or scale demand it.
  • You want hybrid search out of the box, not in app code. Weaviate fuses BM25 + vector server-side without RRF stitching.
  • Postgres is already your operational database. pgvector adds vector search to an existing operational store; one less moving piece.
  • Very large clusters (>10B vectors). Milvus and Vespa have more battle-tested distributed stories at that scale.
  • Fully-managed-only deployments. Qdrant Cloud exists but you may prefer Pinecone if you do not want any awareness of the engine internals.

See also