cheat sheet

ChromaDB

Store and query vector embeddings locally or over a network with ChromaDB. Covers client types, collections, add, query, metadata filters, embedding functions, and LangChain/LlamaIndex integration.

ChromaDB — Embedded Vector Database

What it is

ChromaDB is an open-source vector database designed for AI applications. It stores embeddings (dense float vectors) alongside documents and metadata, and retrieves the nearest neighbours to a query vector using approximate nearest-neighbour search. Chroma runs embedded (in-process, no server), as a persistent local store, or as a client/server pair. It is the default vector store for many LangChain and LlamaIndex tutorials because it requires zero infrastructure to get started.
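To make "nearest neighbours to a query vector" concrete, here is a minimal pure-Python sketch of exact (brute-force) search over a few toy vectors. It is only an illustration: the vectors, the IDs, and the `nearest` helper are made up, and Chroma itself uses an approximate HNSW index rather than a linear scan.

```python
# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
store = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 0.8, 0.6],
    "doc3": [0.7, 0.3, 0.1],
}

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, vectors, n_results=1):
    """Return the n_results closest (id, distance) pairs, closest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]

hits = nearest([1.0, 0.0, 0.0], store, n_results=2)
print([doc_id for doc_id, _ in hits])   # ['doc1', 'doc3']
```

A linear scan like this is exact but grows with the collection size, which is the cost an approximate index trades a little accuracy to avoid.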

Install

bash
pip install chromadb

Output: pip's download and install log (omitted here); the command exits 0 on success.

Quick example

python
import chromadb

client = chromadb.Client()   # in-memory

collection = client.create_collection("my_docs")

collection.add(
    documents=["Python is a high-level language.", "Rust is a systems language."],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["scripting language"], n_results=1)
print(results["documents"])
print(results["distances"])

Output:

text
[['Python is a high-level language.']]
[[0.6321]]

When / why to use it

  • Adding semantic search to an application without deploying a separate database.
  • Storing and querying embeddings in a RAG pipeline alongside LangChain or LlamaIndex.
  • Prototyping: Chroma runs in-process with zero setup; switch to a persistent or server client for production.
  • Metadata-filtered retrieval: combine vector similarity with structured filters (where={"category": "news"}).
  • Multi-tenant systems: one collection per tenant, all in the same Chroma instance.

Common pitfalls

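Duplicate IDs: behaviour depends on the Chroma version. Some releases raise an error (for example chromadb.errors.IDAlreadyExistsError, or DuplicateIDError for repeats within one batch) when add() receives an ID that already exists; others keep the existing record and silently skip the new one. Use upsert() when you may be re-adding existing documents.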

Dimension mismatch — all vectors in a collection must have the same dimension. Mixing embedding models (e.g. OpenAI 1536-dim and HuggingFace 768-dim) in one collection raises a dimension error on the second add.

chromadb.Client() is in-memory only — data is lost when the process exits. Use chromadb.PersistentClient(path="./chroma_db") for data that must survive restarts.

The default embedding function (DefaultEmbeddingFunction) runs all-MiniLM-L6-v2 locally through ONNX Runtime and produces 384-dimensional vectors. It is accurate enough for prototyping and requires no API key, but the model files are downloaded on first use, so the first call needs network access.

Pass include=["documents", "metadatas", "distances"] to query() to control what the response contains. Omitting documents saves bandwidth when you only need IDs.

Client types

Chroma offers three client modes. Switch modes by changing only the client construction line.

python
import chromadb

# In-memory — data lost on exit
client = chromadb.Client()

# Persistent — saved to disk, survives restarts
client = chromadb.PersistentClient(path="./chroma_db")

# HTTP client — connects to a running Chroma server
client = chromadb.HttpClient(host="localhost", port=8000)

Start the Chroma server for the HTTP client:

bash
chroma run --path ./chroma_db --port 8000

Output: a startup banner followed by server logs. The command runs in the foreground until you stop it (Ctrl+C).

Creating and managing collections

A collection groups documents with the same embedding dimension. Collections are created once and retrieved by name on subsequent runs.

python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# Create (fails if exists)
col = client.create_collection("research_papers")

# Get or create (idempotent)
col = client.get_or_create_collection("research_papers")

# Get existing (raises if missing)
col = client.get_collection("research_papers")

# List all collections (names or Collection objects, depending on the Chroma version)
print(client.list_collections())

# Delete
client.delete_collection("research_papers")

# Collection metadata and distance function
col = client.create_collection(
    "products",
    metadata={"hnsw:space": "cosine"},  # cosine | l2 (default) | ip
)

Adding documents

The add() method stores documents with their embeddings (or lets Chroma embed them) and optional metadata.

python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

# Add with auto-embedding (uses default embedding function)
col.add(
    documents=[
        "ChromaDB is an open-source vector database for AI applications.",
        "LangChain is a framework for building LLM-powered pipelines.",
        "PyTorch is a deep learning framework developed by Meta.",
    ],
    ids=["art_001", "art_002", "art_003"],
    metadatas=[
        {"category": "database", "year": 2023},
        {"category": "framework",  "year": 2022},
        {"category": "ml",         "year": 2016},
    ],
)

print(f"Collection count: {col.count()}")

Output:

text
Collection count: 3

python
# Add with pre-computed embeddings (skips the embedding step)
import numpy as np

col.add(
    embeddings=np.random.rand(2, 384).tolist(),   # must match collection dimension
    documents=["Document A", "Document B"],
    ids=["doc_a", "doc_b"],
)
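Because every vector in a collection must share one dimension, it can help to validate pre-computed embeddings before calling add(). The sketch below uses a hypothetical helper, `check_dimensions`, which is not part of Chroma; the expected dimension of 384 is an assumption that matches the example above.

```python
def check_dimensions(embeddings, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim.

    Hypothetical helper (not a Chroma API); run it before col.add(embeddings=...).
    """
    for index, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {index} has {len(vector)} dimensions, expected {expected_dim}"
            )
    return embeddings

good = [[0.0] * 384, [0.1] * 384]
check_dimensions(good, 384)          # passes silently

bad = [[0.0] * 384, [0.0] * 768]     # second vector came from a different model
try:
    check_dimensions(bad, 384)
except ValueError as err:
    message = str(err)
    print(message)                   # embedding 1 has 768 dimensions, expected 384
```

Failing early like this gives a clearer message than a dimension error raised part-way through an ingestion run.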

Querying

query() takes one or more query texts (or pre-computed query embeddings) and returns the n_results nearest neighbours.

python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

results = col.query(
    query_texts=["vector similarity search database"],
    n_results=2,
    include=["documents", "metadatas", "distances"],
)

for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"[{dist:.4f}] {doc[:60]} | {meta}")

Output:

text
[0.2841] ChromaDB is an open-source vector database for AI applic | {'category': 'database', 'year': 2023}
[0.5912] LangChain is a framework for building LLM-powered pipeli | {'category': 'framework', 'year': 2022}

python
# Batch queries (multiple query texts at once)
results = col.query(
    query_texts=["machine learning", "database storage"],
    n_results=1,
)
print(results["documents"])   # list of lists, one per query

Output:

text
[['PyTorch is a deep learning framework developed by Meta.'],
 ['ChromaDB is an open-source vector database for AI applications.']]
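Since every field in a query() response is a list of lists (one inner list per query text), a small helper can make the results easier to consume. The sketch below applies a hypothetical helper, `rows_per_query`, to a hand-written dictionary shaped like a response; the IDs and distances are illustrative values, not real output.

```python
def rows_per_query(results):
    """Turn a column-oriented query response into per-query lists of rows.

    Hypothetical helper; assumes "ids", "documents", and "distances" are present
    (i.e. documents and distances were not excluded via include=).
    """
    all_rows = []
    for ids, docs, dists in zip(
        results["ids"], results["documents"], results["distances"]
    ):
        rows = [
            {"id": i, "document": d, "distance": dist}
            for i, d, dist in zip(ids, docs, dists)
        ]
        all_rows.append(rows)
    return all_rows

# Hand-written stand-in for a two-query response with n_results=1.
fake_results = {
    "ids": [["art_003"], ["art_001"]],
    "documents": [
        ["PyTorch is a deep learning framework developed by Meta."],
        ["ChromaDB is an open-source vector database for AI applications."],
    ],
    "distances": [[0.41], [0.35]],
}

rows = rows_per_query(fake_results)
print(rows[0][0]["id"], rows[1][0]["id"])   # art_003 art_001
```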

Metadata filters — where and where_document

where= filters by document metadata before scoring; where_document= filters by document text content. Both use a MongoDB-style operator dict.

python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

# Exact match
results = col.query(
    query_texts=["vector database"],
    n_results=5,
    where={"category": "database"},
)

# Numeric comparison
results = col.query(
    query_texts=["deep learning"],
    n_results=5,
    where={"year": {"$gte": 2020}},
)

# Multiple conditions (AND)
results = col.query(
    query_texts=["framework"],
    n_results=5,
    where={"$and": [{"category": "framework"}, {"year": {"$gte": 2022}}]},
)

# OR
results = col.query(
    query_texts=["model"],
    n_results=5,
    where={"$or": [{"category": "database"}, {"category": "ml"}]},
)

# Text content filter
results = col.query(
    query_texts=["pipeline"],
    n_results=5,
    where_document={"$contains": "LangChain"},
)

print(results["ids"])

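Supported where operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or. where_document accepts $contains and $not_contains, which can also be combined with $and and $or.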
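To show how these operators compose, here is a minimal pure-Python sketch that evaluates a where-style filter against a single metadata dictionary. It illustrates the semantics only and is not Chroma's implementation; the `matches` function is hypothetical, and treating a missing key as a non-match is an assumption.

```python
import operator

# Comparison operators mapped onto Python equivalents.
_COMPARE = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if metadata satisfies the where-style filter (illustration only)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2020}}.
            # Assumption: a record without the key never matches.
            if key not in metadata:
                return False
            for op, operand in condition.items():
                if not _COMPARE[op](metadata[key], operand):
                    return False
        else:
            # A bare value is shorthand for $eq, e.g. {"category": "news"}.
            if metadata.get(key) != condition:
                return False
    return True

meta = {"category": "framework", "year": 2022}
print(matches(meta, {"year": {"$gte": 2020}}))                                         # True
print(matches(meta, {"$and": [{"category": "framework"}, {"year": {"$gte": 2022}}]}))  # True
print(matches(meta, {"$or": [{"category": "database"}, {"category": "ml"}]}))          # False
```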

Upsert — add or update

upsert() inserts new documents and updates existing ones by ID. Use it when your ingestion pipeline may re-process the same source documents.

python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("wiki")

# First run — inserts
col.upsert(documents=["Python was created by Guido van Rossum."], ids=["py_001"])

# Second run — updates in place (same ID, new content)
col.upsert(documents=["Python was created by Guido van Rossum in 1991."], ids=["py_001"])

print(col.get(ids=["py_001"])["documents"])

Output:

text
['Python was created by Guido van Rossum in 1991.']
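One way to make re-ingestion idempotent is to derive each ID from the document's source instead of generating a fresh one per run, so that upsert() overwrites rather than duplicates. The sketch below assumes a hypothetical `stable_id` helper keyed on a source path and chunk index; neither the helper nor the ID format comes from Chroma.

```python
import hashlib

def stable_id(source, chunk_index):
    """Derive a deterministic ID from a source identifier and chunk position.

    Hypothetical helper: the same (source, chunk_index) always maps to the same
    ID, so re-running ingestion with upsert() updates records in place.
    """
    digest = hashlib.sha256(f"{source}#{chunk_index}".encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"

first_run = [stable_id("wiki/python.md", i) for i in range(3)]
second_run = [stable_id("wiki/python.md", i) for i in range(3)]

print(first_run == second_run)   # the IDs are identical across runs
print(len(set(first_run)))       # number of distinct IDs, one per chunk
```

The resulting list can be passed as ids= to col.upsert(...) alongside the matching documents.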

Update and delete

python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

# Update metadata only
col.update(ids=["art_001"], metadatas=[{"category": "database", "year": 2024}])

# Update document and metadata
col.update(
    ids=["art_002"],
    documents=["LangChain builds LLM-powered applications and agents."],
    metadatas=[{"category": "framework", "year": 2024}],
)

# Delete by ID
col.delete(ids=["art_003"])

# Delete by metadata filter
col.delete(where={"year": {"$lt": 2020}})

print(f"Remaining: {col.count()}")

Embedding functions

Chroma's embedding functions convert raw text to vectors. Swap them at collection creation time.

python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# Each assignment below is an alternative; keep only the one you need.

# Default (all-MiniLM-L6-v2, local, no API key)
ef = embedding_functions.DefaultEmbeddingFunction()

# OpenAI
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-small",
)

# Sentence Transformers (local, any sentence-transformers model;
# needs the sentence-transformers package installed).
# HuggingFaceEmbeddingFunction, by contrast, calls the hosted
# Hugging Face Inference API and expects an API key.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-base-en-v1.5",
)

# Google Generative AI
ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key="...",
    model_name="models/text-embedding-004",
)

col = client.get_or_create_collection("docs", embedding_function=ef)

Pass the same embedding function every time you call get_collection() or get_or_create_collection(). Many Chroma releases do not persist it, and if you omit it on retrieval Chroma falls back to the default embedding function (384 dimensions). Queries then fail with a dimension mismatch if the collection was populated with a different model.

LangChain integration

python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
import os

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
)

# Create vectorstore from documents
docs = [
    Document(page_content="ChromaDB stores embeddings.", metadata={"source": "chroma_docs"}),
    Document(page_content="LangChain builds LLM chains.", metadata={"source": "lc_docs"}),
]
vectorstore = Chroma.from_documents(
    docs,
    embedding=embeddings,
    persist_directory="./chroma_lc",
)

# As a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
results = retriever.invoke("vector similarity")
for doc in results:
    print(doc.page_content)

Output:

text
ChromaDB stores embeddings.
LangChain builds LLM chains.

LlamaIndex integration

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

chroma_client = chromadb.PersistentClient(path="./chroma_li")
collection = chroma_client.get_or_create_collection("research")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx  = StorageContext.from_defaults(vector_store=vector_store)

docs  = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_ctx)

engine = index.as_query_engine()
print(engine.query("What is multi-head attention?"))

Distance functions

| Value | Meaning | Best for |
| --- | --- | --- |
| l2 (default) | Squared Euclidean distance; smaller = more similar | Unnormalised embeddings |
| cosine | Cosine distance; 0 = identical, 2 = opposite | Normalised/sentence embeddings |
| ip | Inner product, reported as 1 - dot product; smaller = more similar | When vectors are already normalised |

Set at collection creation: metadata={"hnsw:space": "cosine"}.
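The three spaces can be compared on a pair of small vectors. The sketch below computes each distance in plain Python using the formulas given in Chroma's documentation (squared L2, one minus cosine similarity, one minus inner product); the vectors are arbitrary illustrative values and the function names are not Chroma APIs.

```python
import math

def l2_squared(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """One minus the dot product, so smaller still means more similar."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [0.0, 1.0]    # orthogonal to a
c = [-1.0, 0.0]   # opposite to a

print(l2_squared(a, b))               # 2.0
print(cosine_distance(a, a))          # 0.0
print(cosine_distance(a, c))          # 2.0
print(inner_product_distance(a, b))   # 1.0
```

For unit-length vectors the cosine and inner-product distances coincide, which is why ip is a reasonable choice once embeddings are normalised.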

Quick reference

| Task | Code |
| --- | --- |
| In-memory client | chromadb.Client() |
| Persistent client | chromadb.PersistentClient(path="./dir") |
| HTTP client | chromadb.HttpClient(host="host", port=8000) |
| Get or create | client.get_or_create_collection("name") |
| Add documents | col.add(documents=[...], ids=[...], metadatas=[...]) |
| Add embeddings | col.add(embeddings=[[...]], documents=[...], ids=[...]) |
| Query | col.query(query_texts=["..."], n_results=5) |
| Metadata filter | col.query(..., where={"key": "value"}) |
| Text filter | col.query(..., where_document={"$contains": "word"}) |
| Upsert | col.upsert(documents=[...], ids=[...]) |
| Update | col.update(ids=[...], documents=[...], metadatas=[...]) |
| Delete by ID | col.delete(ids=["id1"]) |
| Delete by filter | col.delete(where={"year": {"$lt": 2020}}) |
| Count | col.count() |
| Get by ID | col.get(ids=["id1"]) |
| Cosine distance | create_collection("name", metadata={"hnsw:space": "cosine"}) |
| OpenAI embeddings | OpenAIEmbeddingFunction(api_key=..., model_name="text-embedding-3-small") |
| Local sentence-transformers embeddings | SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-base-en-v1.5") |
| LangChain store | Chroma.from_documents(docs, embedding=embeddings, persist_directory="./dir") |