cheat sheet
RAG Implementation Checklist
End-to-end checklist and code for building reliable Retrieval-Augmented Generation pipelines — chunking, embedding, vector DBs, retrieval, and evaluation.
What it is
Retrieval-Augmented Generation (RAG) is a pattern that grounds LLM responses in a retrieved document corpus to reduce hallucination and enable up-to-date answers. Instead of relying on knowledge baked into model weights, RAG embeds your documents into a vector store, retrieves the most relevant chunks at query time, and injects them into the prompt so the model reasons over real, citable sources.
Architecture overview
Documents → Chunking → Embedding → Vector DB
                                       ↓
User Query → Embed Query → Retrieve Top-K → Rerank → Context Assembly → LLM → Answer
Document ingestion checklist
The steps required to get raw documents into a retrievable state: parse, clean, chunk, deduplicate, embed, and store. Skipping any step silently degrades retrieval quality — broken chunks, duplicate vectors, and missing metadata are the most common sources of bad RAG answers.
☐ Split documents into chunks (300–600 tokens is typical for dense text)
☐ Preserve metadata per chunk: source URL, page number, section heading, date, author
☐ Handle multiple formats: PDF, HTML, Markdown, DOCX, plain text
☐ Strip boilerplate from web sources (nav, headers, footers, cookie banners)
☐ Deduplicate chunks with a content hash before embedding
☐ Test chunking on your actual data — verify no splits mid-sentence or mid-table
☐ Store chunk text alongside its vector — never rely on ID-only lookups
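The deduplication step above can be sketched with a content hash. This is a minimal illustration, assuming chunks are dicts with a "text" key; the normalize and dedupe_chunks helpers and the sample data are hypothetical, not part of any library.

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return " ".join(text.split()).lower()

def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk seen for each distinct normalized text."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk["text"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already queued for embedding
        seen.add(digest)
        # Keep the hash as metadata so later ingestion runs can reuse it
        unique.append({**chunk, "content_hash": digest})
    return unique

# Illustrative data (assumed): the second chunk differs only in whitespace
chunks = [
    {"text": "Paris is the capital of France.", "source": "a.html"},
    {"text": "Paris  is the capital of   France.", "source": "b.html"},
    {"text": "Berlin is the capital of Germany.", "source": "a.html"},
]
unique = dedupe_chunks(chunks)
print(len(unique))
```

Exact hashing only catches identical text after normalization; near-duplicates (reworded boilerplate, for instance) need a fuzzier technique.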
Chunking strategies
How you split documents directly determines what can be retrieved. Chunks that are too large dilute the signal; chunks that are too small lose context. Fixed-size splitting is fast and predictable; recursive character splitting respects natural boundaries; semantic chunking groups by meaning at the cost of extra embedding calls during ingestion.
Fixed-size (baseline)
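def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # size and overlap are counted in whitespace-separated words, not model tokens
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[i : i + size]))
        if i + size >= len(words):
            break  # this window already reached the end; avoid a redundant tail chunk
    return chunks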
Recursive character (recommended for prose)
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,  # measured in characters by default (length_function=len), not tokens
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
Semantic chunking (best quality, higher cost)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings # or any embed model
chunker = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(document_text)
For code, chunk by function/class boundaries, not by token count. For tables, keep table rows together. For long lists, chunk entire lists rather than splitting mid-list.
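For Python sources, chunking on function and class boundaries can be sketched with the standard-library ast module. This is a minimal illustration, assuming one chunk per top-level definition; chunk_python_source and the sample source are hypothetical, and module-level statements such as imports are skipped here.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive
            start = node.lineno
            if node.decorator_list:
                # Start at the first decorator so it stays attached to its definition
                start = min(d.lineno for d in node.decorator_list)
            text = "\n".join(lines[start - 1 : node.end_lineno])
            chunks.append({"name": node.name, "start_line": start, "text": text})
    return chunks

# Illustrative input (assumed)
sample = '''import os

def load(path):
    return open(path).read()

class Index:
    def add(self, item):
        self.items.append(item)
'''
code_chunks = chunk_python_source(sample)
print([c["name"] for c in code_chunks])
```

Very large classes may still exceed the chunk budget; splitting them per method is a natural extension of the same idea.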
Embedding
Embeddings are dense vector representations of text that encode semantic meaning into a fixed-length float array. The model you choose determines retrieval quality — use the same model at both ingestion and query time, or results will be incoherent. Smaller local models (sentence-transformers) are fast and free; API models like voyage-3 offer top benchmark performance.
Checklist
☐ Choose model appropriate for your domain (general vs code vs legal vs medical)
☐ Use the exact same model at query time as at ingestion time
☐ Normalize embeddings (most models expect cosine similarity on unit vectors)
☐ Batch embed during ingestion — avoid per-chunk API calls
☐ Store raw text + metadata alongside each vector
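Why normalization matters can be shown with plain arithmetic: once vectors have unit length, their dot product equals their cosine similarity, so either metric gives the same ranking. A minimal sketch with hypothetical helper names and toy two-dimensional vectors:

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# Both values agree: the dot product of unit vectors is the cosine similarity
print(cosine(a, b), dot(ua, ub))
```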
Embedding with sentence-transformers (local, free)
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2") # 384 dims, fast
chunks = ["The capital of France is Paris.", "Python 3.12 released in 2023."]
embeddings = model.encode(chunks, normalize_embeddings=True, batch_size=32)
print(embeddings.shape) # (2, 384)
Output:
(2, 384)
Embedding model comparison
| Model | Dims | Size | Best for |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80 MB | Fast general-purpose |
| all-mpnet-base-v2 | 768 | 420 MB | Higher quality general |
| text-embedding-3-small (OpenAI) | 1536 | API | Good quality, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | API | Best OpenAI quality |
| voyage-3 (Voyage AI) | 1024 | API | Best for RAG (benchmarks) |
| nomic-embed-text (Nomic) | 768 | API/local | Open, competitive quality |
Vector database options
Stores vectors alongside metadata and serves approximate nearest-neighbor queries. Embedded options (Chroma, LanceDB) require no infrastructure and suit local dev; dedicated servers (Qdrant, Weaviate) add filtering and scalability; managed SaaS (Pinecone) eliminates ops at the cost of vendor lock-in. Pick based on your existing stack — if you already run Postgres, pgvector is often the lowest-friction choice.
| DB | Type | Best for | Free tier |
|---|---|---|---|
| Chroma | Embedded/server | Local dev, prototypes | ✅ self-hosted |
| pgvector | Postgres extension | Existing Postgres stack | ✅ self-hosted |
| Qdrant | Dedicated vector DB | Production, filtering | ✅ self-hosted |
| Weaviate | Dedicated vector DB | Multi-modal, GraphQL | ✅ self-hosted |
| Pinecone | Managed SaaS | Fully managed, scale | Free tier (1 index) |
| Milvus | Distributed | High-scale production | ✅ self-hosted |
| LanceDB | Embedded (files) | Serverless, embedded | ✅ self-hosted |
Chroma (local dev)
import chromadb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client() # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")
# Ingest
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
embeddings = model.encode(texts, normalize_embeddings=True).tolist()
collection.add(
documents=texts,
embeddings=embeddings,
ids=["doc-0", "doc-1"],
metadatas=[{"source": "geography"}, {"source": "geography"}],
)
# Query
query_vec = model.encode(["What is the capital of France?"],
normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_vec, n_results=2)
print(results["documents"][0])
Output:
['Paris is the capital of France.', 'Berlin is the capital of Germany.']
pgvector (production)
-- Enable extension
CREATE EXTENSION vector;
-- Table with embedding column
CREATE TABLE doc_chunks (
id SERIAL PRIMARY KEY,
source TEXT,
chunk TEXT,
embedding VECTOR(384)
);
-- Approximate nearest-neighbor index (HNSW — fast)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);
import psycopg2
# Reuses `model` (the SentenceTransformer loaded earlier); its 384-dim output matches VECTOR(384)
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()
query_vec = model.encode(["What is the capital of France?"],
normalize_embeddings=True)[0]
cur.execute(
"""
SELECT source, chunk, 1 - (embedding <=> %s::vector) AS similarity
FROM doc_chunks
ORDER BY embedding <=> %s::vector
LIMIT 5
""",
(query_vec.tolist(), query_vec.tolist())
)
rows = cur.fetchall()
for source, chunk, sim in rows:
    print(f"{sim:.3f} [{source}] {chunk[:80]}")
Output:
0.932 [geography] Paris is the capital of France.
0.801 [geography] France is a country in Western Europe.
Retrieval
Converts the user query into an embedding, then fetches the top-k most similar chunks from the vector store. Fetching 2× the desired count and reranking with a cross-encoder improves precision significantly (~10–20% on standard benchmarks) at the cost of an extra 100ms. For queries that benefit from diversity, Maximal Marginal Relevance (MMR) reduces redundancy among returned chunks.
def retrieve(query: str, k: int = 5) -> list[dict]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    # Vector similarity search (top 2k candidates)
    vector_results = collection.query(
        query_embeddings=[query_vec.tolist()],
        n_results=k * 2
    )
    candidates = [
        {"id": doc_id, "text": doc, "metadata": meta, "score": None}
        for doc_id, doc, meta in zip(
            vector_results["ids"][0],
            vector_results["documents"][0],
            vector_results["metadatas"][0]
        )
    ]
    # Optional: cross-encoder reranking (high-value, ~100ms)
    # from sentence_transformers import CrossEncoder
    # reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # pairs = [(query, c["text"]) for c in candidates]
    # scores = reranker.predict(pairs)
    # Sort on the score only; comparing the dicts on a tie would raise TypeError
    # ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    # candidates = [c for _, c in ranked]
    return candidates[:k]
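Maximal Marginal Relevance, mentioned above, can be sketched in a few lines. This is an illustration under stated assumptions: vectors are already unit-normalized (so the dot product is the cosine similarity), lambda_mult trades relevance against diversity, and the mmr function and toy vectors are hypothetical rather than taken from any library.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5) -> list[int]:
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    remaining = list(range(len(doc_vecs)))
    selected: list[int] = []
    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = dot(query_vec, doc_vecs[i])
            # Penalize similarity to the closest already-selected document
            redundancy = max(
                (dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy unit vectors (assumed): d1 nearly duplicates d0, d2 covers a different direction
query = [1.0, 0.0, 0.0]
docs = [
    [0.8, 0.6, 0.0],  # d0: most relevant
    [0.6, 0.8, 0.0],  # d1: close to d0
    [0.6, 0.0, 0.8],  # d2: as relevant as d1, but different from d0
]
picked = mmr(query, docs, k=2)
print(picked)
```

Here d1 and d2 are equally relevant to the query, but d1 overlaps heavily with the already-selected d0, so the second pick is d2.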
Context assembly
Takes the retrieved chunks and serializes them into a single string to be injected into the prompt. Always cap the assembled context to a token budget (leaving room for the system prompt, question, and model output), attach source labels to each chunk so the model can cite them, and put the most relevant chunk first — models attend more strongly to early context.
def build_context(chunks: list[dict], max_tokens: int = 6000) -> str:
    context_parts = []
    token_count = 0
    for chunk in chunks:
        # Rough token estimate: 1 token ≈ 4 chars
        chunk_tokens = len(chunk["text"]) // 4
        if token_count + chunk_tokens > max_tokens:
            break
        source = chunk["metadata"].get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{chunk['text']}")
        token_count += chunk_tokens
    return "\n\n---\n\n".join(context_parts)
Prompt template for RAG
The standard RAG prompt instructs the model to answer using only the provided sources, refuse when the answer is not present, and cite which source it used. The explicit "do not speculate" and "say I don't know" constraints are the primary levers against hallucination — omit them and the model will fill gaps from training data.
Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.
Sources:
{context}
Question: {question}
Answer:
Full RAG pipeline
An end-to-end skeleton that wires together retrieval, context assembly, and LLM generation in a single answer() function. Use this as a starting point, then layer in reranking, streaming, and caching as your latency and cost requirements demand.
import anthropic
anthropic_client = anthropic.Anthropic()
def answer(question: str) -> str:
    chunks = retrieve(question, k=5)
    context = build_context(chunks)
    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the sources below. "
                "Cite sources. If not in sources, say so.\n\n"
                f"Sources:\n{context}\n\n"
                f"Question: {question}"
            )
        }]
    )
    return response.content[0].text
print(answer("What is the capital of France?"))
Output:
According to [Source: geography], Paris is the capital of France.
Agentic RAG
For multi-hop questions (answer depends on multiple retrieval steps), give Claude a search tool and let it decide what to retrieve.
search_tool = {
"name": "search_docs",
"description": (
"Search the documentation for relevant information. "
"Call this when you need specific facts to answer the question. "
"You may call it multiple times with different queries."
),
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"},
"max_results": {"type": "integer", "default": 5}
},
"required": ["query"]
}
}
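def handle_search(inputs: dict) -> str:
    chunks = retrieve(inputs["query"], k=inputs.get("max_results", 5))
    return build_context(chunks)
# Let Claude drive the retrieval loop.
# run_agent stands for your own tool-use loop; it is not a library function.
# It should route search_docs calls to handle_search and feed the results back to the model.
agent_answer = run_agent(
    user_message="What are the differences between Chroma and pgvector?",
    tools=[search_tool],
    max_turns=8,
)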
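The run_agent helper used above is not defined in this cheat sheet. One way such a loop might look is sketched below, with loud assumptions: call_model is a hypothetical callable you supply that wraps your LLM client and returns a dict with "stop_reason" and "content" blocks shaped like the Anthropic Messages API tool-use format; handlers maps tool names to Python functions. Both parameters are additions for illustration, and the model is replaced by a scripted stand-in so the loop can be followed end to end.

```python
from typing import Callable

def run_agent(
    user_message: str,
    tools: list[dict],
    handlers: dict[str, Callable[[dict], str]],
    call_model: Callable[[list[dict], list[dict]], dict],
    max_turns: int = 8,
) -> str:
    """Loop until the model stops requesting tools or max_turns is reached."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = call_model(messages, tools)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            # Final turn: join the text blocks into the answer
            return "".join(
                b["text"] for b in response["content"] if b["type"] == "text"
            )
        # Run every requested tool and send the results back as one user turn
        tool_results = []
        for block in response["content"]:
            if block["type"] == "tool_use":
                output = handlers[block["name"]](block["input"])
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block["id"],
                    "content": output,
                })
        messages.append({"role": "user", "content": tool_results})
    return "Stopped: reached max_turns without a final answer."

def fake_model(messages: list[dict], tools: list[dict]) -> dict:
    """Scripted stand-in for an LLM: search once, then answer from the result."""
    if len(messages) == 1:
        return {"stop_reason": "tool_use", "content": [{
            "type": "tool_use", "id": "call-1", "name": "search_docs",
            "input": {"query": "Chroma vs pgvector"},
        }]}
    tool_output = messages[-1]["content"][0]["content"]
    return {"stop_reason": "end_turn", "content": [
        {"type": "text", "text": f"Based on the search: {tool_output}"}
    ]}

result = run_agent(
    "What are the differences between Chroma and pgvector?",
    tools=[],
    handlers={"search_docs": lambda inputs: f"results for {inputs['query']}"},
    call_model=fake_model,
)
print(result)
```

In a real pipeline, the search_docs handler would be handle_search and call_model would forward messages and tools to your LLM client.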
Evaluation checklist
RAG quality is measured across two independent axes: retrieval quality (did we fetch the right chunks?) and generation quality (did the model answer faithfully from those chunks?). Measure both separately — a high generation score on top of poor retrieval just means the model is hallucinating confidently.
☐ Faithfulness — does the answer use only retrieved chunks? (no hallucination)
☐ Answer relevance — does the answer address the actual question?
☐ Context recall — does the top-k contain the chunk needed to answer?
☐ Context precision — are retrieved chunks on-topic, or noisy?
☐ Latency — p50/p95 retrieval + generation time within SLA
☐ Hallucination rate — spot-check a sample against source documents
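The two retrieval-side metrics can be computed from a small labelled set. The sketch below uses simple set-based definitions and made-up chunk IDs; evaluation frameworks may define context recall and precision differently (for example with rank weighting or an LLM judge), so treat these helpers as illustrative.

```python
def context_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of the relevant chunks that appear in the retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)

def context_precision(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of the retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved_ids if cid in relevant) / len(retrieved_ids)

# Hypothetical labelled examples: what the retriever returned vs. what was needed
eval_set = [
    {"retrieved": ["c1", "c7", "c9", "c4"], "relevant": ["c1", "c4"]},
    {"retrieved": ["c2", "c3", "c8", "c5"], "relevant": ["c6"]},
]
recalls = [context_recall(e["retrieved"], e["relevant"]) for e in eval_set]
precisions = [context_precision(e["retrieved"], e["relevant"]) for e in eval_set]
mean_recall = sum(recalls) / len(recalls)
mean_precision = sum(precisions) / len(precisions)
print(mean_recall, mean_precision)
```

A question with zero recall (the second example) cannot be answered faithfully no matter how good the generator is, which is why the retrieval axis is measured on its own.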
Hybrid retrieval (vector + keyword)
Pure vector search misses exact-term matches (product codes, error strings, proper nouns); pure keyword search misses paraphrases. Hybrid retrieval runs both and fuses the scores, recovering recall on both axes. BM25 is the standard sparse complement to dense vectors, and Reciprocal Rank Fusion (RRF) is the simplest fusion strategy — no score calibration required.
from rank_bm25 import BM25Okapi
# Pre-build BM25 index over the same corpus you embedded.
# Assumption: corpus_texts[i] is the document stored in Chroma under corpus_ids[i].
tokenized_corpus = [doc.split() for doc in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
def hybrid_retrieve(query: str, k: int = 5, k_fetch: int = 20) -> list[dict]:
    # Dense vector candidates
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    dense = collection.query(query_embeddings=[query_vec.tolist()], n_results=k_fetch)
    dense_ids = dense["ids"][0]
    # Sparse BM25 candidates, mapped from corpus positions back to Chroma IDs
    # so both rankings share one ID space
    sparse_scores = bm25.get_scores(query.split())
    sparse_ids = [corpus_ids[i] for i in sparse_scores.argsort()[::-1][:k_fetch]]
    # Reciprocal Rank Fusion: 1 / (60 + rank), ranks starting at 1 (60 is the usual constant)
    rrf_scores: dict[str, float] = {}
    for rank, doc_id in enumerate(dense_ids, start=1):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    for rank, doc_id in enumerate(sparse_ids, start=1):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    top = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:k]
    return [{"id": did, "score": score} for did, score in top]
RRF works without normalizing the dense and sparse scores onto the same scale. If you want fine control, replace it with a learned reranker (cross-encoder/ms-marco-MiniLM-L-6-v2) fed the union of candidates from both retrievers.
Query rewriting and expansion
The user's query is rarely the optimal retrieval query. Three high-leverage rewrites: (1) decompose multi-part questions into sub-queries, retrieve for each, union the results; (2) hypothetical document embedding (HyDE) — ask the LLM to draft a fake answer and embed that, which matches the answer style of your indexed chunks better than a question; (3) for short cryptic queries, expand with synonyms.
# call_claude(prompt, max_tokens) is assumed to be a thin wrapper around your LLM
# client that returns the response text; it is not defined in this cheat sheet.
def hyde_retrieve(query: str, k: int = 5) -> list[dict]:
    # Generate a hypothetical answer to embed instead of the question
    hyde_prompt = (
        "Write a short hypothetical answer to the following question. "
        "Style should match a technical documentation excerpt. "
        "Do not say 'I don't know'; invent a plausible answer.\n\n"
        f"Question: {query}"
    )
    fake_answer = call_claude(hyde_prompt, max_tokens=300)
    fake_vec = model.encode([fake_answer], normalize_embeddings=True)[0]
    results = collection.query(query_embeddings=[fake_vec.tolist()], n_results=k)
    return [
        {"text": d, "metadata": m}
        for d, m in zip(results["documents"][0], results["metadatas"][0])
    ]
def decompose_query(query: str) -> list[str]:
    """Break a compound question into atomic sub-queries."""
    prompt = (
        "Break the following question into 1-4 atomic sub-questions, each of which "
        "can be answered from a single document. Return ONE per line. No prose.\n\n"
        f"Question: {query}"
    )
    raw = call_claude(prompt, max_tokens=300)
    return [line.strip("-* ").strip() for line in raw.splitlines() if line.strip()]
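The "retrieve for each, union the results" step of query decomposition can be sketched as follows. The retrieve_union helper is hypothetical, and the retriever is a stub backed by a tiny dictionary so the merge logic can be run without a vector store; in practice retrieve_fn would be your real retrieval function.

```python
from typing import Callable

def retrieve_union(
    sub_queries: list[str],
    retrieve_fn: Callable[[str, int], list[dict]],
    k_per_query: int = 3,
) -> list[dict]:
    """Retrieve for each sub-query and merge, dropping chunks already seen."""
    seen: set[str] = set()
    merged = []
    for sub_query in sub_queries:
        for chunk in retrieve_fn(sub_query, k_per_query):
            if chunk["text"] in seen:
                continue  # the same chunk was returned for an earlier sub-query
            seen.add(chunk["text"])
            merged.append({**chunk, "matched_query": sub_query})
    return merged

# Stub retriever over made-up data (assumed, for illustration only)
fake_index = {
    "What is Chroma?": [
        {"text": "Chroma is an embedded vector store."},
        {"text": "Vector stores serve nearest-neighbor queries."},
    ],
    "What is pgvector?": [
        {"text": "pgvector is a Postgres extension."},
        {"text": "Vector stores serve nearest-neighbor queries."},
    ],
}

def fake_retrieve(query: str, k: int) -> list[dict]:
    return fake_index.get(query, [])[:k]

merged = retrieve_union(list(fake_index), fake_retrieve, k_per_query=2)
print([c["text"] for c in merged])
```

Recording which sub-query matched each chunk makes it easier to debug why a chunk ended up in the final context.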
Metadata filtering
The fastest way to improve precision is to constrain the search space before similarity ranking runs. Pre-filter by date range, source type, language, or access level — most vector DBs support attribute filters on the same query call. A pre-filter that cuts the candidate pool by 10x typically improves precision more than swapping embedding models.
# Chroma example: filter by metadata before similarity search
results = collection.query(
query_embeddings=[query_vec.tolist()],
n_results=5,
where={
"$and": [
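# Chroma's range operators ($gt, $gte, $lt, $lte) compare numbers, so dates are
# assumed to be stored as integers such as 20240101 (YYYYMMDD)
{"published_date": {"$gte": 20240101}},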
{"language": {"$eq": "en"}},
{"access_tier": {"$in": ["public", "internal"]}},
]
},
)
-- pgvector equivalent: combine vector search with WHERE clause
SELECT source, chunk, 1 - (embedding <=> $1::vector) AS similarity
FROM doc_chunks
WHERE published_date >= '2024-01-01'
AND language = 'en'
AND access_tier IN ('public', 'internal')
ORDER BY embedding <=> $1::vector
LIMIT 5;
Observability and tracing
A RAG pipeline that returns a wrong answer can fail at chunking, embedding, retrieval, reranking, prompt assembly, or generation. Without tracing, debugging is guessing. Log every stage with structured records: query, query embedding hash, top-k IDs + scores, reranker scores, assembled context length, model response, and latencies. Tools like LangSmith and Arize Phoenix automate this.
import time, json, uuid
def traced_answer(question: str) -> dict:
    trace_id = str(uuid.uuid4())
    t0 = time.time()
    # Retrieval
    t1 = time.time()
    chunks = retrieve(question, k=5)
    t_retrieve = time.time() - t1
    # Context assembly
    t1 = time.time()
    context = build_context(chunks)
    t_assemble = time.time() - t1
    # Generation: generate() stands for your LLM call (for example the
    # messages.create request in the full pipeline); it is not defined here
    t1 = time.time()
    answer = generate(question, context)
    t_generate = time.time() - t1
    trace = {
        "trace_id": trace_id,
        "question": question,
        "retrieved_ids": [c.get("id") for c in chunks],
        "retrieved_scores": [c.get("score") for c in chunks],
        "context_chars": len(context),
        "answer": answer,
        "latency_ms": {
            "retrieve": int(t_retrieve * 1000),
            "assemble": int(t_assemble * 1000),
            "generate": int(t_generate * 1000),
            "total": int((time.time() - t0) * 1000),
        },
    }
    print(json.dumps(trace))  # ship to your logging stack
    return {"answer": answer, "trace_id": trace_id}
Cost and latency tuning
RAG cost is dominated by generation tokens; latency is dominated by retrieval + reranking when k is large. The cheap wins are: cache embedding-generation calls for repeat queries, use Haiku for retrieval-grounded generation (the quality gap shrinks with good context), batch-embed during ingestion, and pre-filter aggressively.
| Lever | Cost impact | Latency impact | Quality impact |
|---|---|---|---|
| Smaller embedding model | Lower per-token | Faster query embedding | -5 to -10% recall |
| Reduce top-k from 10 to 5 | Lower input tokens | Faster reranking | Usually negligible |
| Prompt cache static context | Up to -90% on hit | -50ms TTFT | None |
| Swap Opus to Haiku for gen | -85% per output token | 2-3x faster gen | -5 to -15% subjective |
| Pre-filter by metadata | Lower retrieval cost | Faster ANN search | +precision if filter is correct |
| Cross-encoder reranker | +negligible cost | +50–150ms | +10–20% precision |
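Caching query embeddings for repeat queries can be sketched with functools.lru_cache. The embedding function below is a deterministic stand-in (an assumption, so the example runs without a model) that counts how often it is invoked; in practice it would wrap your real embedding call, and the cache size is an arbitrary illustrative value.

```python
from functools import lru_cache

calls = {"count": 0}

def embed_query_uncached(text: str) -> tuple[float, ...]:
    """Stand-in for a real embedding call (hypothetical); counts invocations."""
    calls["count"] += 1
    return tuple(float(ord(ch) % 7) for ch in text[:4])

@lru_cache(maxsize=10_000)
def embed_query(text: str) -> tuple[float, ...]:
    # Tuples are immutable, so callers cannot corrupt the cached value
    return embed_query_uncached(text)

def normalize_query(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a cache entry."""
    return " ".join(text.lower().split())

v1 = embed_query(normalize_query("What is  RAG?"))
v2 = embed_query(normalize_query("what is rag?"))
print(calls["count"], embed_query.cache_info().hits)
```

An in-process cache like this is lost on restart and is not shared across workers; a shared key-value store keyed on the normalized query is the usual next step.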
Common failure modes
Most RAG failures fall into three root causes: the right chunk was never retrieved, the right chunk was retrieved but the prompt didn't constrain the model to use it, or the index is stale relative to the source documents. The table below maps symptoms to causes so you can diagnose without guessing.
| Symptom | Likely cause | Fix |
|---|---|---|
| Wrong answer despite correct chunk retrieved | Prompt doesn't constrain to sources | Add explicit "ONLY use sources" instruction |
| Correct answer but wrong source cited | Chunk metadata lost at storage | Persist source field alongside vector |
| Good on short docs, bad on long | Fixed chunk too large (diluted) | Use smaller chunks or semantic chunking |
| Misses recent information | Stale index | Add incremental ingestion + reindex trigger |
| Slow retrieval | Full scan without index | Add HNSW/IVF index; shard by date |
| Hallucinations despite good retrieval | Context too long, key chunk buried | Use reranker; put most relevant chunk first |
| Poor performance on tables/lists | Character-level chunking splits structure | Keep tables and lists whole as single chunks |
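For the stale-index row, incremental ingestion can be driven by comparing content hashes recorded at the last ingestion run with the current source documents. A minimal sketch, assuming documents are keyed by a stable ID; plan_reindex and the sample data are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(
    current_docs: dict[str, str], indexed_hashes: dict[str, str]
) -> dict[str, list[str]]:
    """Compare source documents against the hashes stored at last ingestion."""
    to_add = sorted(d for d in current_docs if d not in indexed_hashes)
    to_update = sorted(
        d for d, text in current_docs.items()
        if d in indexed_hashes and content_hash(text) != indexed_hashes[d]
    )
    to_delete = sorted(d for d in indexed_hashes if d not in current_docs)
    return {"add": to_add, "update": to_update, "delete": to_delete}

# Illustrative state (assumed): b.md changed, c.md was removed, d.md is new
indexed_hashes = {
    "a.md": content_hash("Paris is the capital of France."),
    "b.md": content_hash("Berlin is the capital of Germany."),
    "c.md": content_hash("Rome is the capital of Italy."),
}
current_docs = {
    "a.md": "Paris is the capital of France.",
    "b.md": "Berlin is the capital of Germany (updated).",
    "d.md": "Madrid is the capital of Spain.",
}
plan = plan_reindex(current_docs, indexed_hashes)
print(plan)
```

Only the documents in "add" and "update" need to be re-chunked and re-embedded; entries in "delete" should have their vectors removed so retired content stops being retrieved.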