cheat sheet
ChromaDB
Store and query vector embeddings locally or over a network with ChromaDB. Covers client types, collections, add, query, metadata filters, embedding functions, and LangChain/LlamaIndex integration.
ChromaDB — Embedded Vector Database
What it is
ChromaDB is an open-source vector database designed for AI applications. It stores embeddings (dense float vectors) alongside documents and metadata, and retrieves the nearest neighbours to a query vector using approximate nearest-neighbour search. Chroma runs embedded (in-process, no server), as a persistent local store, or as a client/server pair. It is the default vector store for many LangChain and LlamaIndex tutorials because it requires zero infrastructure to get started.
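To make "nearest neighbours to a query vector" concrete, here is a minimal brute-force sketch in plain Python. It is not how Chroma works internally (Chroma uses an approximate index); the toy vectors and the `nearest` helper are made up for illustration only.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, store, n_results=1):
    """Return the n_results (id, distance) pairs closest to query, nearest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
store = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 0.2, 0.9],
    "doc3": [0.8, 0.3, 0.1],
}

hits = nearest([1.0, 0.0, 0.0], store, n_results=2)
print([doc_id for doc_id, _ in hits])  # prints ['doc1', 'doc3']
```

An exact scan like this compares the query with every stored vector, which is why vector databases switch to approximate indexes as collections grow.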
Install
pip install chromadb
Output: pip's download and install log, ending with a "Successfully installed ..." line (exit code 0 on success)
Quick example
import chromadb
client = chromadb.Client() # in-memory
collection = client.create_collection("my_docs")
collection.add(
documents=["Python is a high-level language.", "Rust is a systems language."],
ids=["doc1", "doc2"],
)
results = collection.query(query_texts=["scripting language"], n_results=1)
print(results["documents"])
print(results["distances"])
Output:
[['Python is a high-level language.']]
[[0.6321]]
When / why to use it
- Adding semantic search to an application without deploying a separate database.
- Storing and querying embeddings in a RAG pipeline alongside LangChain or LlamaIndex.
- Prototyping: Chroma runs in-process with zero setup; switch to a persistent or server client for production.
- Metadata-filtered retrieval: combine vector similarity with structured filters (where={"category": "news"}).
- Multi-tenant systems: one collection per tenant, all in the same Chroma instance.
Common pitfalls
- Duplicate IDs: behaviour depends on the Chroma version. Some releases raise chromadb.errors.IDAlreadyExistsError when add() receives an ID that already exists; others silently skip the existing record. Use upsert() when you may be re-adding existing documents.
- Dimension mismatch: all vectors in a collection must have the same dimension. Mixing embedding models (e.g. OpenAI 1536-dim and HuggingFace 768-dim) in one collection raises a dimension error on the second add.
- chromadb.Client() is in-memory only: data is lost when the process exits. Use chromadb.PersistentClient(path="./chroma_db") for data that must survive restarts.
- The default embedding function (DefaultEmbeddingFunction) runs the all-MiniLM-L6-v2 sentence-transformers model locally. It is accurate enough for prototyping and requires no API key. Recent releases appear to ship it with the base chromadb package and download the model weights on first use, so the first call needs network access.
- Pass include=["documents", "metadatas", "distances"] to query() to control what the response contains. Omitting documents saves bandwidth when you only need IDs.
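The first two pitfalls can be caught before calling add(). The sketch below uses a hypothetical `validate_batch` helper (not part of Chroma) that checks a batch for repeated IDs and mixed vector dimensions. It only inspects the batch itself and knows nothing about IDs already stored in the collection.

```python
def validate_batch(ids, embeddings, expected_dim=None):
    """Hypothetical pre-flight check to run before collection.add()."""
    problems = []

    # Collect IDs that occur more than once inside this batch.
    seen, duplicates = set(), set()
    for doc_id in ids:
        if doc_id in seen:
            duplicates.add(doc_id)
        seen.add(doc_id)
    if duplicates:
        problems.append(f"duplicate ids: {sorted(duplicates)}")

    if len(ids) != len(embeddings):
        problems.append(f"{len(ids)} ids but {len(embeddings)} embeddings")

    # Every vector must share one dimension (and match the collection's, if known).
    dims = {len(vec) for vec in embeddings}
    if expected_dim is not None:
        dims.add(expected_dim)
    if len(dims) > 1:
        problems.append(f"mixed dimensions: {sorted(dims)}")

    return problems


bad = validate_batch(["a", "b", "a"], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6, 0.7]])
good = validate_batch(["a", "b"], [[0.1, 0.2], [0.3, 0.4]], expected_dim=2)
print(bad)   # ["duplicate ids: ['a']", 'mixed dimensions: [2, 3]']
print(good)  # []
```

An empty list means the batch passed these two checks; anything else is worth fixing before the data reaches the collection.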
Client types
Chroma offers three client modes. Switch modes by changing only the client construction line.
import chromadb
# In-memory — data lost on exit
client = chromadb.Client()
# Persistent — saved to disk, survives restarts
client = chromadb.PersistentClient(path="./chroma_db")
# HTTP client — connects to a running Chroma server
client = chromadb.HttpClient(host="localhost", port=8000)
Start the Chroma server for the HTTP client:
chroma run --path ./chroma_db --port 8000
Output: a startup banner and log lines; the server keeps running in the foreground until you stop it (Ctrl+C)
Creating and managing collections
A collection groups documents with the same embedding dimension. Collections are created once and retrieved by name on subsequent runs.
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
# Create (fails if exists)
col = client.create_collection("research_papers")
# Get or create (idempotent)
col = client.get_or_create_collection("research_papers")
# Get existing (raises if missing)
col = client.get_collection("research_papers")
# List all collections
print(client.list_collections()) # collection names or Collection objects, depending on the Chroma version
# Delete
client.delete_collection("research_papers")
# Collection metadata and distance function
col = client.create_collection(
"products",
metadata={"hnsw:space": "cosine"}, # cosine | l2 (default) | ip
)
Adding documents
The add() method stores documents with their embeddings (or lets Chroma embed them) and optional metadata.
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")
# Add with auto-embedding (uses default embedding function)
col.add(
documents=[
"ChromaDB is an open-source vector database for AI applications.",
"LangChain is a framework for building LLM-powered pipelines.",
"PyTorch is a deep learning framework developed by Meta.",
],
ids=["art_001", "art_002", "art_003"],
metadatas=[
{"category": "database", "year": 2023},
{"category": "framework", "year": 2022},
{"category": "ml", "year": 2016},
],
)
print(f"Collection count: {col.count()}")
Output:
Collection count: 3
# Add with pre-computed embeddings (skips the embedding step)
import numpy as np
col.add(
embeddings=np.random.rand(2, 384).tolist(), # must match collection dimension
documents=["Document A", "Document B"],
ids=["doc_a", "doc_b"],
)
Querying
query() takes one or more query texts (or pre-computed query embeddings) and returns the n_results nearest neighbours.
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")
results = col.query(
query_texts=["vector similarity search database"],
n_results=2,
include=["documents", "metadatas", "distances"],
)
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"[{dist:.4f}] {doc[:60]} | {meta}")
Output:
[0.2841] ChromaDB is an open-source vector database for AI applic | {'category': 'database', 'year': 2023}
[0.5912] LangChain is a framework for building LLM-powered pipeli | {'category': 'framework', 'year': 2022}
# Batch queries (multiple query texts at once)
results = col.query(
query_texts=["machine learning", "database storage"],
n_results=1,
)
print(results["documents"]) # list of lists, one per query
Output:
[['PyTorch is a deep learning framework developed by Meta.'],
['ChromaDB is an open-source vector database for AI applications.']]
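Because every field in a query() response is a list of lists (one inner list per query text), it is often convenient to reshape it into one list of hits per query. The sketch below assumes that fields left out of include are either missing or None; `flatten_hits` is a hypothetical helper, and the sample response, including its distances, is hand-written rather than real Chroma output.

```python
def flatten_hits(results):
    """Turn a query()-style result dict into one list of hit dicts per query text."""
    per_query = []
    for q, ids in enumerate(results["ids"]):
        hits = []
        for rank, doc_id in enumerate(ids):
            hit = {"id": doc_id}
            # Optional fields are only filled in when they were requested.
            for key, name in (
                ("documents", "document"),
                ("metadatas", "metadata"),
                ("distances", "distance"),
            ):
                column = results.get(key)
                if column is not None:
                    hit[name] = column[q][rank]
            hits.append(hit)
        per_query.append(hits)
    return per_query


# Hand-written stand-in for a response to two query texts (values are illustrative).
sample = {
    "ids": [["art_003"], ["art_001"]],
    "documents": [
        ["PyTorch is a deep learning framework developed by Meta."],
        ["ChromaDB is an open-source vector database for AI applications."],
    ],
    "metadatas": None,
    "distances": [[0.41], [0.28]],
}

flat = flatten_hits(sample)
for hits in flat:
    print(hits[0]["id"], hits[0]["distance"])
```

Each inner list keeps the ranking order of the original response, so `flat[i][0]` is the best hit for the i-th query text.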
Metadata filters — where and where_document
where= filters by document metadata before scoring; where_document= filters by document text content. Both use a MongoDB-style operator dict.
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")
# Exact match
results = col.query(
query_texts=["vector database"],
n_results=5,
where={"category": "database"},
)
# Numeric comparison
results = col.query(
query_texts=["deep learning"],
n_results=5,
where={"year": {"$gte": 2020}},
)
# Multiple conditions (AND)
results = col.query(
query_texts=["framework"],
n_results=5,
where={"$and": [{"category": "framework"}, {"year": {"$gte": 2022}}]},
)
# OR
results = col.query(
query_texts=["model"],
n_results=5,
where={"$or": [{"category": "database"}, {"category": "ml"}]},
)
# Text content filter
results = col.query(
query_texts=["pipeline"],
n_results=5,
where_document={"$contains": "LangChain"},
)
print(results["ids"])
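Supported where operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or. where_document supports $contains and $not_contains, which can also be combined with $and and $or.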
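To see what the metadata operators mean, the following pure-Python sketch evaluates a where dict against the three sample metadata records used earlier. It illustrates the semantics only and is not Chroma's implementation. The `matches` and `select` helpers are hypothetical, and treating a record that lacks the key as a non-match is an assumption of this sketch.

```python
import operator

COMPARISONS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, where):
    """Return True if metadata satisfies the where filter (illustrative only)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:  # assumption: missing keys never match
                return False
            # A bare value is shorthand for {"$eq": value}.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            for op, operand in condition.items():
                if not COMPARISONS[op](metadata[key], operand):
                    return False
    return True


records = {
    "art_001": {"category": "database", "year": 2023},
    "art_002": {"category": "framework", "year": 2022},
    "art_003": {"category": "ml", "year": 2016},
}


def select(where):
    """IDs of the sample records that pass the filter."""
    return [doc_id for doc_id, meta in records.items() if matches(meta, where)]


print(select({"year": {"$gte": 2020}}))                                          # ['art_001', 'art_002']
print(select({"$and": [{"category": "framework"}, {"year": {"$gte": 2022}}]}))   # ['art_002']
print(select({"$or": [{"category": "database"}, {"category": "ml"}]}))           # ['art_001', 'art_003']
```

In a real query the filter narrows the candidate set, and similarity ranking then orders whatever remains.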
Upsert — add or update
upsert() inserts new documents and updates existing ones by ID. Use it when your ingestion pipeline may re-process the same source documents.
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("wiki")
# First run — inserts
col.upsert(documents=["Python was created by Guido van Rossum."], ids=["py_001"])
# Second run — updates in place (same ID, new content)
col.upsert(documents=["Python was created by Guido van Rossum in 1991."], ids=["py_001"])
print(col.get(ids=["py_001"])["documents"])
Output:
['Python was created by Guido van Rossum in 1991.']
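upsert() only updates in place if a re-processed document arrives with the same ID as before. One common way to get that, sketched here with a hypothetical `chunk_id` scheme, is to derive the ID from where the chunk came from rather than generating a random one. The source path and the 16-character truncation are illustrative choices, not a Chroma convention.

```python
import hashlib


def chunk_id(source, chunk_index):
    """Derive a stable ID from a chunk's origin (hypothetical scheme)."""
    key = f"{source}#{chunk_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


first_run = [chunk_id("wiki/python.md", i) for i in range(3)]
second_run = [chunk_id("wiki/python.md", i) for i in range(3)]

print(first_run == second_run)  # True: re-ingesting yields the same IDs
print(len(set(first_run)))      # 3: each chunk gets its own ID
```

Passing these IDs to upsert() means a second ingestion run overwrites the earlier records instead of piling up near-duplicates.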
Update and delete
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")
# Update metadata only
col.update(ids=["art_001"], metadatas=[{"category": "database", "year": 2024}])
# Update document and metadata
col.update(
ids=["art_002"],
documents=["LangChain builds LLM-powered applications and agents."],
metadatas=[{"category": "framework", "year": 2024}],
)
# Delete by ID
col.delete(ids=["art_003"])
# Delete by metadata filter
col.delete(where={"year": {"$lt": 2020}})
print(f"Remaining: {col.count()}")
Embedding functions
Chroma's embedding functions convert raw text to vectors. Swap them at collection creation time.
import chromadb
from chromadb.utils import embedding_functions
# Default (all-MiniLM-L6-v2, local, no API key)
ef = embedding_functions.DefaultEmbeddingFunction()
# OpenAI
ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="sk-...",
model_name="text-embedding-3-small",
)
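# Sentence Transformers (local, any sentence-transformers model; needs pip install sentence-transformers)
# HuggingFaceEmbeddingFunction, by contrast, calls the hosted Hugging Face API and needs api_key=...
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="BAAI/bge-base-en-v1.5",
)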
# Google Generative AI
ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
api_key="...",
model_name="models/text-embedding-004",
)
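# Attach the chosen embedding function when creating or opening the collection
client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("docs", embedding_function=ef)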
Pass the same embedding function on every get_collection() call. Older Chroma releases do not persist it, and if you omit it on retrieval Chroma falls back to the default embedding function, which will mismatch dimensions if you used a different one during add().
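Conceptually, an embedding function is just a callable from a list of texts to a list of float vectors that all share one dimension. The toy embedder below makes that contract visible without downloading a model. It is not Chroma's EmbeddingFunction class and has no semantic quality; `toy_embed` and `DIM` are made-up names. Vectors produced this way could be supplied through the embeddings and query_embeddings arguments.

```python
import hashlib
import math

DIM = 8  # hypothetical; a real model fixes this (384 for all-MiniLM-L6-v2)


def toy_embed(texts):
    """Map each text to a deterministic, unit-length DIM-dimensional vector."""
    vectors = []
    for text in texts:
        counts = [0.0] * DIM
        # Hash every lower-cased token into one of DIM buckets.
        for token in text.lower().split():
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
            counts[bucket] += 1.0
        # Normalise to unit length; an empty text stays the zero vector.
        norm = math.sqrt(sum(c * c for c in counts)) or 1.0
        vectors.append([c / norm for c in counts])
    return vectors


vecs = toy_embed(["Python is a language", "python is a LANGUAGE", ""])
print(len(vecs), {len(v) for v in vecs})  # 3 {8}
print(vecs[0] == vecs[1])                 # True: same tokens, same vector
```

Whatever produces the vectors, the same function has to be used for both indexing and querying, otherwise distances between the two are meaningless.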
LangChain integration
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
import os
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
api_key=os.environ["OPENAI_API_KEY"],
)
# Create vectorstore from documents
docs = [
Document(page_content="ChromaDB stores embeddings.", metadata={"source": "chroma_docs"}),
Document(page_content="LangChain builds LLM chains.", metadata={"source": "lc_docs"}),
]
vectorstore = Chroma.from_documents(
docs,
embedding=embeddings,
persist_directory="./chroma_lc",
)
# As a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
results = retriever.invoke("vector similarity")
for doc in results:
    print(doc.page_content)
Output:
ChromaDB stores embeddings.
LangChain builds LLM chains.
LlamaIndex integration
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
chroma_client = chromadb.PersistentClient(path="./chroma_li")
collection = chroma_client.get_or_create_collection("research")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx = StorageContext.from_defaults(vector_store=vector_store)
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_ctx)
engine = index.as_query_engine()
print(engine.query("What is multi-head attention?"))
Distance functions
| Value | Meaning | Best for |
|---|---|---|
| l2 (default) | Squared Euclidean distance; smaller = more similar | Unnormalised embeddings |
| cosine | Cosine distance; 0 = identical, 2 = opposite | Normalised/sentence embeddings |
| ip | Inner product distance (1 - dot product); smaller = more similar | When vectors are already normalised |
Set at collection creation: metadata={"hnsw:space": "cosine"}.
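The three spaces can be written out in a few lines of plain Python. This sketch assumes the formulas given in Chroma's documentation (squared L2, 1 minus cosine similarity, 1 minus dot product) and is meant for building intuition, not as a replacement for the index.

```python
import math


def l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip(a, b):
    """Inner-product distance: 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine distance: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = [1.0, 0.0], [0.0, 1.0]
print(l2(a, b), ip(a, b), cosine(a, b))          # 2.0 1.0 1.0 (orthogonal vectors)
print(cosine(a, [-1.0, 0.0]))                    # 2.0 (opposite direction)
print(cosine(a, [3.0, 0.0]), ip(a, [3.0, 0.0]))  # 0.0 -2.0 (same direction, longer vector)
```

The last line shows the practical difference: cosine ignores vector length, while ip rewards longer vectors, so the two only agree on ranking when embeddings are normalised.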
Quick reference
| Task | Code |
|---|---|
| In-memory client | chromadb.Client() |
| Persistent client | chromadb.PersistentClient(path="./dir") |
| HTTP client | chromadb.HttpClient(host="host", port=8000) |
| Get or create | client.get_or_create_collection("name") |
| Add documents | col.add(documents=[...], ids=[...], metadatas=[...]) |
| Add embeddings | col.add(embeddings=[[...]], documents=[...], ids=[...]) |
| Query | col.query(query_texts=["..."], n_results=5) |
| Metadata filter | col.query(..., where={"key": "value"}) |
| Text filter | col.query(..., where_document={"$contains": "word"}) |
| Upsert | col.upsert(documents=[...], ids=[...]) |
| Update | col.update(ids=[...], documents=[...], metadatas=[...]) |
| Delete by ID | col.delete(ids=["id1"]) |
| Delete by filter | col.delete(where={"year": {"$lt": 2020}}) |
| Count | col.count() |
| Get by ID | col.get(ids=["id1"]) |
| Cosine distance | create_collection("name", metadata={"hnsw:space": "cosine"}) |
| OpenAI embeddings | OpenAIEmbeddingFunction(api_key=..., model_name="text-embedding-3-small") |
| Local sentence-transformers embeddings | SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-base-en-v1.5") |
| LangChain store | Chroma.from_documents(docs, embedding=embeddings, persist_directory="./dir") |