RAG Implementation Checklist

End-to-end checklist and code for building reliable Retrieval-Augmented Generation pipelines — chunking, embedding, vector DBs, retrieval, and evaluation.

updated 05-25-2026

What it is

Retrieval-Augmented Generation (RAG) is a pattern that grounds LLM responses in a retrieved document corpus to reduce hallucination and enable up-to-date answers. Instead of relying on knowledge baked into model weights, RAG embeds your documents into a vector store, retrieves the most relevant chunks at query time, and injects them into the prompt so the model reasons over real, citable sources.

Architecture overview

text

Documents → Chunking → Embedding → Vector DB
                                       ↓
User Query → Embed Query → Retrieve Top-K → Rerank → Context Assembly → LLM → Answer

Document ingestion checklist

The steps required to get raw documents into a retrievable state: parse, clean, chunk, deduplicate, embed, and store. Skipping any step silently degrades retrieval quality — broken chunks, duplicate vectors, and missing metadata are the most common sources of bad RAG answers.

text

☐ Split documents into chunks (300–600 tokens is typical for dense text)
☐ Preserve metadata per chunk: source URL, page number, section heading, date, author
☐ Handle multiple formats: PDF, HTML, Markdown, DOCX, plain text
☐ Strip boilerplate from web sources (nav, headers, footers, cookie banners)
☐ Deduplicate chunks with a content hash before embedding
☐ Test chunking on your actual data — verify no splits mid-sentence or mid-table
☐ Store chunk text alongside its vector — never rely on ID-only lookups
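
The deduplication and metadata items above can be sketched as follows. This is a minimal illustration using only the standard library; the helper names (`content_hash`, `dedupe_chunks`) and the chunk dictionary layout are assumptions, not part of any particular framework.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Hash whitespace- and case-normalized text so trivial variants collide."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct chunk text, preserving order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        digest = content_hash(chunk["text"])
        if digest in seen:
            continue
        seen.add(digest)
        # Store the hash with the chunk so later ingestion runs can reuse it.
        unique.append({**chunk, "content_hash": digest})
    return unique


# Hypothetical sample chunks; metadata travels with the text.
chunks = [
    {"text": "Paris is the capital of France.", "source": "a.html", "page": 1},
    {"text": "Paris  is the capital of   France.", "source": "b.html", "page": 4},
    {"text": "Berlin is the capital of Germany.", "source": "a.html", "page": 2},
]
unique_chunks = dedupe_chunks(chunks)
print(len(unique_chunks))  # 2
```

Exact-hash deduplication only catches copies that are identical after normalization; near-duplicates (reworded boilerplate, for example) need a fuzzier technique.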

Chunking strategies

How you split documents directly determines what can be retrieved. Chunks that are too large dilute the signal; chunks that are too small lose context. Fixed-size splitting is fast and predictable; recursive character splitting respects natural boundaries; semantic chunking groups by meaning at the cost of extra embedding calls during ingestion.

Fixed-size (baseline)

python

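def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()  # whitespace-separated words, a rough proxy for tokens
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i : i + size]))
        if i + size >= len(words):
            break  # this window reached the end; another would only repeat the overlap
    return chunks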

Recursive character (recommended for prose)

python

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # counted in characters by default (length_function=len), not tokens
    chunk_overlap=50,    # also characters
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)

Semantic chunking (best quality, higher cost)

python

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # or any embed model

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(document_text)

For code, chunk by function/class boundaries, not by token count. For tables, keep table rows together. For long lists, chunk entire lists rather than splitting mid-list.
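
As one way to chunk code by function/class boundaries, the sketch below uses Python's standard `ast` module to emit one chunk per top-level definition. It assumes the input is valid Python source; the function name `chunk_python_source` and the chunk fields are illustrative, and module-level statements such as imports are deliberately skipped.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with line metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so decorators stay attached.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1 : end]),
            })
    return chunks


sample = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, item):
        self.items.append(item)
'''
for c in chunk_python_source(sample):
    print(c["name"], c["start_line"], c["end_line"])
```

For other languages, a parser-based splitter plays the same role; very large classes may still need a second split at method level.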

Embedding

Embeddings are dense vector representations of text that encode semantic meaning into a fixed-length float array. The model you choose largely determines retrieval quality. Use the same model at both ingestion and query time: vectors from different models live in different spaces, so similarity scores between them are meaningless. Smaller local models (sentence-transformers) are fast and free to run; hosted API models such as voyage-3 generally score higher on public retrieval benchmarks, at a per-token cost.

Checklist

text

☐ Choose model appropriate for your domain (general vs code vs legal vs medical)
☐ Use the exact same model at query time as at ingestion time
☐ Normalize embeddings (most models expect cosine similarity on unit vectors)
☐ Batch embed during ingestion — avoid per-chunk API calls
☐ Store raw text + metadata alongside each vector
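
To illustrate the normalization item: once vectors are scaled to unit length, cosine similarity reduces to a plain dot product, which is why most vector stores can use the cheaper inner-product metric. The sketch below uses pure Python with toy 2-dimensional vectors; the helper names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# Cosine on the raw vectors equals the dot product on the normalized ones.
print(round(cosine(a, b), 6), round(dot(ua, ub), 6))  # 0.96 0.96
```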

Embedding with sentence-transformers (local, free)

python

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384 dims, fast

chunks = ["The capital of France is Paris.", "Python 3.12 released in 2023."]
embeddings = model.encode(chunks, normalize_embeddings=True, batch_size=32)

print(embeddings.shape)   # (2, 384)

Output:

text

(2, 384)

Embedding model comparison

Model	Dims	Size	Best for
`all-MiniLM-L6-v2`	384	80 MB	Fast general-purpose
`all-mpnet-base-v2`	768	420 MB	Higher quality general
`text-embedding-3-small` (OpenAI)	1536	API	Good quality, cost-effective
`text-embedding-3-large` (OpenAI)	3072	API	Best OpenAI quality
`voyage-3` (Voyage AI)	1024	API	Strong on retrieval benchmarks
`nomic-embed-text` (Nomic)	768	API/local	Open, competitive quality

Vector database options

Stores vectors alongside metadata and serves approximate nearest-neighbor queries. Embedded options (Chroma, LanceDB) require no infrastructure and suit local dev; dedicated servers (Qdrant, Weaviate) add filtering and scalability; managed SaaS (Pinecone) eliminates ops at the cost of vendor lock-in. Pick based on your existing stack — if you already run Postgres, pgvector is often the lowest-friction choice.

DB	Type	Best for	Free tier
Chroma	Embedded/server	Local dev, prototypes	✅ self-hosted
pgvector	Postgres extension	Existing Postgres stack	✅ self-hosted
Qdrant	Dedicated vector DB	Production, filtering	✅ self-hosted
Weaviate	Dedicated vector DB	Multi-modal, GraphQL	✅ self-hosted
Pinecone	Managed SaaS	Fully managed, scale	Free tier (1 index)
Milvus	Distributed	High-scale production	✅ self-hosted
LanceDB	Embedded (files)	Serverless, embedded	✅ self-hosted

Chroma (local dev)

python

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()   # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")

# Ingest
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
embeddings = model.encode(texts, normalize_embeddings=True).tolist()
collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=["doc-0", "doc-1"],
    metadatas=[{"source": "geography"}, {"source": "geography"}],
)

# Query
query_vec = model.encode(["What is the capital of France?"],
                          normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_vec, n_results=2)
print(results["documents"][0])

Output:

text

['Paris is the capital of France.', 'Berlin is the capital of Germany.']

pgvector (production)

sql

-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE doc_chunks (
    id       SERIAL PRIMARY KEY,
    source   TEXT,
    chunk    TEXT,
    embedding VECTOR(384)
);

-- Approximate nearest-neighbor index (HNSW — fast)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);

python

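import psycopg2

# `model` is the SentenceTransformer("all-MiniLM-L6-v2") instance from the embedding
# section; its 384-dim output must match VECTOR(384) in the table definition.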

conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()

query_vec = model.encode(["What is the capital of France?"],
                          normalize_embeddings=True)[0]

cur.execute(
    """
    SELECT source, chunk, 1 - (embedding <=> %s::vector) AS similarity
    FROM doc_chunks
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    (query_vec.tolist(), query_vec.tolist())
)
rows = cur.fetchall()
for source, chunk, sim in rows:
    print(f"{sim:.3f}  [{source}]  {chunk[:80]}")

Output (illustrative; depends on the rows in doc_chunks):

text

0.932  [geography]  Paris is the capital of France.
0.801  [geography]  France is a country in Western Europe.

Retrieval

Converts the user query into an embedding, then fetches the top-k most similar chunks from the vector store. Fetching 2× the desired count and reranking with a cross-encoder typically improves precision (roughly 10–20% on standard benchmarks; verify on your own data) at the cost of about 100 ms of extra latency. For queries that benefit from diversity, Maximal Marginal Relevance (MMR) reduces redundancy among returned chunks.

python

def retrieve(query: str, k: int = 5) -> list[dict]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    # Vector similarity search (top 2k candidates)
    vector_results = collection.query(
        query_embeddings=[query_vec.tolist()],
        n_results=k * 2
    )
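    # Chroma returns distances (lower = closer). Keep the ID so results can be traced.
    candidates = [
        # Negated distance, so a higher score means a more similar chunk
        {"id": doc_id, "text": doc, "metadata": meta, "score": -dist}
        for doc_id, doc, meta, dist in zip(
            vector_results["ids"][0],
            vector_results["documents"][0],
            vector_results["metadatas"][0],
            vector_results["distances"][0],
        )
    ]

    # Optional: cross-encoder reranking (high-value, ~100ms)
    # from sentence_transformers import CrossEncoder
    # reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # pairs = [(query, c["text"]) for c in candidates]
    # scores = reranker.predict(pairs)
    # for c, s in zip(candidates, scores):
    #     c["score"] = float(s)
    # candidates.sort(key=lambda c: c["score"], reverse=True)  # sort on the score only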

    return candidates[:k]
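
Maximal Marginal Relevance, mentioned above, can be sketched in a few lines. This is a minimal pure-Python version assuming all vectors are already unit-normalized (so a dot product is the cosine similarity); `mmr_select` and `lambda_mult` are illustrative names. Each step picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_select(
    query_vec: list[float],
    cand_vecs: list[list[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> list[int]:
    """Return indices of up to k candidates chosen by MMR (unit vectors assumed)."""
    selected: list[int] = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], float("-inf")
        for i in remaining:
            relevance = dot(query_vec, cand_vecs[i])
            # Similarity to the closest already-selected chunk (0 if none yet).
            redundancy = max(
                (dot(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [0.8, 0.6]
candidates = [
    [1.0, 0.0],    # relevant, but nearly parallel to candidate 1
    [0.96, 0.28],  # most relevant
    [0.0, 1.0],    # less relevant, but covers a different direction
]
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))  # [1, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [1, 0] (pure relevance)
```

With `lambda_mult=1.0` the selection degenerates to plain top-k by similarity; lower values trade relevance for diversity.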

Context assembly

Takes the retrieved chunks and serializes them into a single string to be injected into the prompt. Always cap the assembled context to a token budget (leaving room for the system prompt, question, and model output), attach source labels to each chunk so the model can cite them, and put the most relevant chunk first: models tend to use information near the start of a long context more reliably than information buried in the middle.

python

def build_context(chunks: list[dict], max_tokens: int = 6000) -> str:
    context_parts = []
    token_count = 0

    for chunk in chunks:
        # Rough token estimate: 1 token ≈ 4 chars
        chunk_tokens = len(chunk["text"]) // 4
        if token_count + chunk_tokens > max_tokens:
            break
        source = chunk["metadata"].get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{chunk['text']}")
        token_count += chunk_tokens

    return "\n\n---\n\n".join(context_parts)

Prompt template for RAG

The standard RAG prompt instructs the model to answer using only the provided sources, refuse when the answer is not present, and cite which source it used. The explicit "do not speculate" and "say I don't know" constraints are the primary levers against hallucination — omit them and the model will fill gaps from training data.

text

Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.

Sources:
{context}

Question: {question}

Answer:

Full RAG pipeline

An end-to-end skeleton that wires together retrieval, context assembly, and LLM generation in a single answer() function. Use this as a starting point, then layer in reranking, streaming, and caching as your latency and cost requirements demand.

python

import anthropic

anthropic_client = anthropic.Anthropic()

def answer(question: str) -> str:
    chunks = retrieve(question, k=5)
    context = build_context(chunks)

    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Answer using ONLY the sources below. "
                f"Cite sources. If not in sources, say so.\n\n"
                f"Sources:\n{context}\n\n"
                f"Question: {question}"
            )
        }]
    )
    return response.content[0].text

print(answer("What is the capital of France?"))

Output (illustrative; model wording will vary):

text

According to [Source: geography], Paris is the capital of France.

Agentic RAG

For multi-hop questions (answer depends on multiple retrieval steps), give Claude a search tool and let it decide what to retrieve.

python

search_tool = {
    "name": "search_docs",
    "description": (
        "Search the documentation for relevant information. "
        "Call this when you need specific facts to answer the question. "
        "You may call it multiple times with different queries."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "max_results": {"type": "integer", "default": 5}
        },
        "required": ["query"]
    }
}

def handle_search(inputs: dict) -> str:
    chunks = retrieve(inputs["query"], k=inputs.get("max_results", 5))
    return build_context(chunks)

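# Let Claude drive the retrieval loop.
# `run_agent` is a placeholder for your own tool-use loop: it should call the model,
# run handle_search for each search_docs tool call, and stop after max_turns.
# The result is bound to `result` so it does not overwrite the answer() function above.
result = run_agent(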
    user_message="What are the differences between Chroma and pgvector?",
    tools=[search_tool],
    max_turns=8,
)
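
One possible shape for such a tool-use loop is sketched below. It is an assumption-laden illustration, not a reference implementation: the signature differs from the call above (it takes explicit `handlers`, `create_message`, and `model` arguments, all hypothetical names), and `create_message` is expected to behave like the Anthropic SDK's `client.messages.create`, returning a response with `stop_reason` and a list of content blocks.

```python
from typing import Any, Callable


def run_agent(
    user_message: str,
    tools: list[dict],
    handlers: dict[str, Callable[[dict], str]],
    create_message: Callable[..., Any],
    model: str,
    max_turns: int = 8,
) -> str:
    """Call the model repeatedly, executing requested tools, until it answers."""
    messages: list[dict] = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = create_message(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        if response.stop_reason != "tool_use":
            # No tool requested: join the text blocks into the final answer.
            return "".join(b.text for b in response.content if b.type == "text")

        # Echo the assistant turn, then answer every tool call in one user turn.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                output = handlers[block.name](block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError(f"no final answer after {max_turns} turns")
```

With a real client this would be invoked roughly as `run_agent(question, [search_tool], {"search_docs": handle_search}, anthropic_client.messages.create, model=...)`. The `max_turns` cap matters: without it, a model that keeps searching can loop indefinitely and run up cost.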

Evaluation checklist

RAG quality is measured across two independent axes: retrieval quality (did we fetch the right chunks?) and generation quality (did the model answer faithfully from those chunks?). Measure both separately — a high generation score on top of poor retrieval just means the model is hallucinating confidently.

text

☐ Faithfulness — does the answer use only retrieved chunks? (no hallucination)
☐ Answer relevance — does the answer address the actual question?
☐ Context recall — does the top-k contain the chunk needed to answer?
☐ Context precision — are retrieved chunks on-topic, or noisy?
☐ Latency — p50/p95 retrieval + generation time within SLA
☐ Hallucination rate — spot-check a sample against source documents
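
The two retrieval-side metrics can be approximated with simple set arithmetic when you have a labeled evaluation set. The sketch below assumes each test question comes with the IDs of the chunks that actually contain the answer (the `eval_set` here is made-up sample data); evaluation frameworks often estimate these with an LLM judge instead, so treat this as a simplified ID-based variant.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)


# Hypothetical labeled examples: what the retriever returned vs. what was needed.
eval_set = [
    {"retrieved": ["c1", "c7", "c3"], "relevant": {"c1", "c3"}},
    {"retrieved": ["c9", "c2", "c4"], "relevant": {"c5"}},
]

k = 3
mean_recall = sum(recall_at_k(e["retrieved"], e["relevant"], k) for e in eval_set) / len(eval_set)
mean_precision = sum(precision_at_k(e["retrieved"], e["relevant"], k) for e in eval_set) / len(eval_set)
print(f"recall@{k}={mean_recall:.2f}  precision@{k}={mean_precision:.2f}")
# recall@3=0.50  precision@3=0.33
```

Tracking these per question, not just as averages, shows which queries fail at retrieval before generation is ever involved.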

Hybrid retrieval (vector + keyword)

Pure vector search misses exact-term matches (product codes, error strings, proper nouns); pure keyword search misses paraphrases. Hybrid retrieval runs both and fuses the scores, recovering recall on both axes. BM25 is the standard sparse complement to dense vectors, and Reciprocal Rank Fusion (RRF) is the simplest fusion strategy — no score calibration required.

python

from rank_bm25 import BM25Okapi

# Pre-build BM25 index over the same corpus you embedded.
# corpus_ids[i] must be the Chroma ID of corpus_texts[i] so both retrievers share one ID space.
tokenized_corpus = [doc.lower().split() for doc in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_retrieve(query: str, k: int = 5, k_fetch: int = 20) -> list[dict]:
    # Dense vector candidates
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    dense = collection.query(query_embeddings=[query_vec.tolist()], n_results=k_fetch)
    dense_ids = dense["ids"][0]

    # Sparse BM25 candidates, mapped from corpus position to Chroma ID
    sparse_scores = bm25.get_scores(query.lower().split())
    sparse_ids = [corpus_ids[i] for i in sparse_scores.argsort()[::-1][:k_fetch]]

    # Reciprocal Rank Fusion: each list contributes 1 / (60 + rank), ranks starting at 1
    # (the constant 60 is the commonly used default)
    rrf_scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (60 + rank)

    top = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:k]
    return [{"id": did, "score": score} for did, score in top]

RRF works without normalizing the dense and sparse scores onto the same scale. If you want fine control, replace it with a learned reranker (cross-encoder/ms-marco-MiniLM-L-6-v2) fed the union of candidates from both retrievers.
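
To see why no score calibration is needed, note that RRF consumes only rank positions, never the raw scores. A standalone sketch with made-up document IDs (the function name `rrf_fuse` is illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion; ranks start at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["doc-2", "doc-0", "doc-1"]   # e.g. from vector search
sparse_ranking = ["doc-0", "doc-3", "doc-2"]  # e.g. from BM25

fused = rrf_fuse([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])
# ['doc-0', 'doc-2', 'doc-3', 'doc-1']
```

`doc-0` wins because it ranks high in both lists (1/62 + 1/61), edging out `doc-2` (1/61 + 1/63); documents found by only one retriever fall to the bottom.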

Query rewriting and expansion

The user's query is rarely the optimal retrieval query. Three high-leverage rewrites: (1) decompose multi-part questions into sub-queries, retrieve for each, and union the results; (2) Hypothetical Document Embeddings (HyDE): ask the LLM to draft a hypothetical answer and embed that instead, since it resembles your indexed chunks in style and vocabulary more closely than a question does; (3) for short, cryptic queries, expand with synonyms.

python

def hyde_retrieve(query: str, k: int = 5) -> list[dict]:
    # Generate a hypothetical answer to embed instead of the question
    hyde_prompt = (
        f"Write a short hypothetical answer to the following question. "
        f"Style should match a technical documentation excerpt. "
        f"Do not say 'I don't know' — invent a plausible answer.\n\n"
        f"Question: {query}"
    )
    # call_claude is a placeholder for your own LLM helper (prompt in, text out)
    fake_answer = call_claude(hyde_prompt, max_tokens=300)
    fake_vec = model.encode([fake_answer], normalize_embeddings=True)[0]

    results = collection.query(query_embeddings=[fake_vec.tolist()], n_results=k)
    return [
        {"text": d, "metadata": m}
        for d, m in zip(results["documents"][0], results["metadatas"][0])
    ]

python

def decompose_query(query: str) -> list[str]:
    """Break a compound question into atomic sub-queries."""
    prompt = (
        "Break the following question into 1-4 atomic sub-questions, each of which "
        "can be answered from a single document. Return ONE per line. No prose.\n\n"
        f"Question: {query}"
    )
    raw = call_claude(prompt, max_tokens=300)
    return [line.strip("-* ").strip() for line in raw.splitlines() if line.strip()]
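
The "retrieve for each, union the results" step can be sketched as below. The retriever is passed in as a function so the merge logic stays independent of any vector store; `multi_query_retrieve`, the toy keyword retriever, and the sample corpus are all illustrative assumptions.

```python
from typing import Callable


def multi_query_retrieve(
    sub_queries: list[str],
    retrieve_fn: Callable[[str, int], list[dict]],
    k_per_query: int = 3,
) -> list[dict]:
    """Retrieve for each sub-query and union the results, dropping duplicate chunks."""
    seen: set[str] = set()
    merged = []
    for sub_query in sub_queries:
        for chunk in retrieve_fn(sub_query, k_per_query):
            if chunk["text"] in seen:
                continue
            seen.add(chunk["text"])
            # Record which sub-query surfaced the chunk, for debugging.
            merged.append({**chunk, "matched_query": sub_query})
    return merged


# Toy stand-in for a real retriever: keyword overlap against a tiny corpus.
corpus = [
    {"text": "Chroma runs embedded in the application process.", "metadata": {"source": "chroma"}},
    {"text": "pgvector is a Postgres extension for vector search.", "metadata": {"source": "pgvector"}},
    {"text": "Both Chroma and pgvector support metadata filters.", "metadata": {"source": "compare"}},
]


def toy_retrieve(query: str, k: int) -> list[dict]:
    terms = set(query.lower().split())
    hits = [c for c in corpus if terms & set(c["text"].lower().split())]
    return hits[:k]


merged = multi_query_retrieve(["chroma", "pgvector"], toy_retrieve)
print([c["metadata"]["source"] for c in merged])
# ['chroma', 'compare', 'pgvector']
```

The shared "compare" chunk is returned by both sub-queries but appears once. In a real pipeline you would typically rerank the merged list against the original question before context assembly.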

Metadata filtering

The fastest way to improve precision is to constrain the search space before similarity ranking runs. Pre-filter by date range, source type, language, or access level; most vector DBs support attribute filters in the same query call. A pre-filter that cuts the candidate pool by 10x often improves precision more than swapping embedding models, provided the filter does not exclude relevant chunks.

python

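# Chroma example: filter by metadata before similarity search.
# Chroma's range operators ($gt, $gte, $lt, $lte) compare numbers only, so store dates
# as integers (here YYYYMMDD) rather than ISO strings.
results = collection.query(
    query_embeddings=[query_vec.tolist()],
    n_results=5,
    where={
        "$and": [
            {"published_date": {"$gte": 20240101}},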
            {"language": {"$eq": "en"}},
            {"access_tier": {"$in": ["public", "internal"]}},
        ]
    },
)

sql

-- pgvector equivalent: combine vector search with a WHERE clause
-- (assumes doc_chunks also has published_date, language, and access_tier columns)
SELECT source, chunk, 1 - (embedding <=> $1::vector) AS similarity
FROM doc_chunks
WHERE published_date >= '2024-01-01'
  AND language = 'en'
  AND access_tier IN ('public', 'internal')
ORDER BY embedding <=> $1::vector
LIMIT 5;

Observability and tracing

A RAG pipeline that returns a wrong answer can fail at chunking, embedding, retrieval, reranking, prompt assembly, or generation. Without tracing, debugging is guessing. Log every stage with structured records: query, query embedding hash, top-k IDs + scores, reranker scores, assembled context length, model response, and latencies. Tools like LangSmith and Arize Phoenix automate this.

python

import time, json, uuid

def traced_answer(question: str) -> dict:
    trace_id = str(uuid.uuid4())
    t0 = time.time()

    # Retrieval
    t1 = time.time()
    chunks = retrieve(question, k=5)
    t_retrieve = time.time() - t1

    # Context assembly
    t1 = time.time()
    context = build_context(chunks)
    t_assemble = time.time() - t1

    # Generation (`generate` is a placeholder for your LLM call, e.g. the body of answer())
    t1 = time.time()
    answer = generate(question, context)
    t_generate = time.time() - t1

    trace = {
        "trace_id": trace_id,
        "question": question,
        "retrieved_ids": [c.get("id") for c in chunks],
        "retrieved_scores": [c.get("score") for c in chunks],
        "context_chars": len(context),
        "answer": answer,
        "latency_ms": {
            "retrieve": int(t_retrieve * 1000),
            "assemble": int(t_assemble * 1000),
            "generate": int(t_generate * 1000),
            "total": int((time.time() - t0) * 1000),
        },
    }
    print(json.dumps(trace))    # ship to your logging stack
    return {"answer": answer, "trace_id": trace_id}

Cost and latency tuning

RAG cost is usually dominated by the generation call, where the retrieved context is billed as input tokens; latency is dominated by retrieval + reranking when k is large. The cheap wins are: cache query embeddings for repeated queries, use a smaller model such as Claude Haiku for retrieval-grounded generation (the quality gap shrinks with good context), batch-embed during ingestion, and pre-filter aggressively.

Lever	Cost impact	Latency impact	Quality impact
Smaller embedding model	Lower per-token	Faster query embedding	-5 to -10% recall
Reduce top-k from 10 to 5	Lower input tokens	Faster reranking	Usually negligible
Prompt cache static context	Up to -90% on hit	-50ms TTFT	None
Swap Opus to Haiku for gen	-85% per output token	2-3x faster gen	-5 to -15% subjective
Pre-filter by metadata	Lower retrieval cost	Faster ANN search	+precision if filter is correct
Cross-encoder reranker	+negligible cost	+50–150ms	+10–20% precision

Common failure modes

Most RAG failures fall into three root causes: the right chunk was never retrieved, the right chunk was retrieved but the prompt didn't constrain the model to use it, or the index is stale relative to the source documents. The table below maps symptoms to causes so you can diagnose without guessing.

Symptom	Likely cause	Fix
Wrong answer despite correct chunk retrieved	Prompt doesn't constrain to sources	Add explicit "ONLY use sources" instruction
Correct answer but wrong source cited	Chunk metadata lost at storage	Persist `source` field alongside vector
Good on short docs, bad on long	Fixed chunk too large (diluted)	Use smaller chunks or semantic chunking
Misses recent information	Stale index	Add incremental ingestion + reindex trigger
Slow retrieval	Full scan without index	Add HNSW/IVF index; shard by date
Hallucinations despite good retrieval	Context too long, key chunk buried	Use reranker; put most relevant chunk first
Poor performance on tables/lists	Character-level chunking splits structure	Keep tables and lists whole as single chunks
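
For the stale-index row, incremental ingestion usually starts by comparing content hashes of the current source documents against the hashes recorded at the last ingestion. A minimal sketch, assuming you persist a `doc_id -> hash` map alongside the index (`plan_reindex` and the sample data are illustrative):

```python
import hashlib


def doc_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with current documents and list what must change."""
    current = {doc_id: doc_hash(text) for doc_id, text in current_docs.items()}
    return {
        "add": sorted(d for d in current if d not in indexed),
        "update": sorted(d for d in current if d in indexed and indexed[d] != current[d]),
        "delete": sorted(d for d in indexed if d not in current),
    }


# Hashes recorded at the last ingestion run (hypothetical data).
indexed_hashes = {
    "a": doc_hash("version 1"),
    "b": doc_hash("unchanged"),
    "c": doc_hash("removed later"),
}
# Documents as they exist now.
current_documents = {"a": "version 2", "b": "unchanged", "d": "brand new"}

plan = plan_reindex(indexed_hashes, current_documents)
print(plan)
# {'add': ['d'], 'update': ['a'], 'delete': ['c']}
```

Only documents in "add" and "update" need re-chunking and re-embedding; for "update" and "delete", remove the old chunks first so stale vectors do not linger in the index.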