cheat sheet

RAG Implementation Checklist

End-to-end checklist and code for building reliable Retrieval-Augmented Generation pipelines — chunking, embedding, vector DBs, retrieval, and evaluation.

What it is

Retrieval-Augmented Generation (RAG) is a pattern that grounds LLM responses in a retrieved document corpus to reduce hallucination and enable up-to-date answers. Instead of relying on knowledge baked into model weights, RAG embeds your documents into a vector store, retrieves the most relevant chunks at query time, and injects them into the prompt so the model reasons over real, citable sources.

Architecture overview

text
Documents → Chunking → Embedding → Vector DB
                                       ↓
User Query → Embed Query → Retrieve Top-K → Rerank → Context Assembly → LLM → Answer

Document ingestion checklist

The steps required to get raw documents into a retrievable state: parse, clean, chunk, deduplicate, embed, and store. Skipping any step silently degrades retrieval quality — broken chunks, duplicate vectors, and missing metadata are the most common sources of bad RAG answers.

text
☐ Split documents into chunks (300–600 tokens is typical for dense text)
☐ Preserve metadata per chunk: source URL, page number, section heading, date, author
☐ Handle multiple formats: PDF, HTML, Markdown, DOCX, plain text
☐ Strip boilerplate from web sources (nav, headers, footers, cookie banners)
☐ Deduplicate chunks with a content hash before embedding
☐ Test chunking on your actual data — verify no splits mid-sentence or mid-table
☐ Store chunk text alongside its vector — never rely on ID-only lookups
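
The deduplication step in the checklist above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper names `content_hash` and `dedupe_chunks`, the chunk dict layout, and the choice to lowercase and collapse whitespace before hashing are all assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the chunk text after lowercasing and collapsing whitespace.

    The normalization is an assumption; tighten or loosen it for your corpus.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct chunk text, preserving order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        digest = content_hash(chunk["text"])
        if digest in seen:
            continue  # same content already queued for embedding
        seen.add(digest)
        unique.append({**chunk, "content_hash": digest})
    return unique


chunks = [
    {"text": "Paris is the capital of France.", "source": "a.html"},
    {"text": "paris is  the capital of France.", "source": "b.html"},
    {"text": "Berlin is the capital of Germany.", "source": "a.html"},
]
unique = dedupe_chunks(chunks)
print([c["source"] for c in unique])
```

Storing the hash next to each chunk also makes it cheap to skip re-embedding unchanged content on later ingestion runs.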

Chunking strategies

How you split documents directly determines what can be retrieved. Chunks that are too large dilute the signal; chunks that are too small lose context. Fixed-size splitting is fast and predictable; recursive character splitting respects natural boundaries; semantic chunking groups by meaning at the cost of extra embedding calls during ingestion.

Fixed-size (baseline)

python
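def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Sizes are counted in whitespace-separated words, not model tokens.
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[i : i + size]))
        if i + size >= len(words):
            break  # window reached the end; avoid an overlap-only tail chunk
    return chunks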

Recursive character splitting (respects natural boundaries)

python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # counted in characters by default, not tokens
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)

Semantic chunking (best quality, higher cost)

python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # or any embed model

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(document_text)

For code, chunk by function/class boundaries, not by token count. For tables, keep table rows together. For long lists, chunk entire lists rather than splitting mid-list.
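
For Python source, chunking by function and class boundaries can be sketched with the standard-library `ast` module. This is one possible approach under stated assumptions: the helper name `chunk_python_source` and the chunk dict layout are hypothetical, only top-level definitions are extracted, and other languages would need their own parser.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            start = node.lineno
            if node.decorator_list:
                # Include decorators, which sit above the def/class line.
                start = min(d.lineno for d in node.decorator_list)
            text = "\n".join(lines[start - 1 : node.end_lineno])
            chunks.append({"name": node.name, "start_line": start, "text": text})
    return chunks


sample = '''import os

def load(path):
    return open(path).read()

class Index:
    def add(self, doc):
        pass
'''
code_chunks = chunk_python_source(sample)
print([c["name"] for c in code_chunks])
```

Module-level statements such as imports are skipped here; if they matter for retrieval, they could be collected into a separate preamble chunk.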

Embedding

Embeddings are dense vector representations of text that encode semantic meaning into a fixed-length float array. The model you choose largely determines retrieval quality. Use the same model at both ingestion and query time: vectors from different models live in different spaces, so similarity scores between them are meaningless. Smaller local models (sentence-transformers) are fast and free to run; hosted API models such as voyage-3 generally score higher on retrieval benchmarks but add per-token cost and a network call.

Checklist

text
☐ Choose model appropriate for your domain (general vs code vs legal vs medical)
☐ Use the exact same model at query time as at ingestion time
☐ Normalize embeddings (most models expect cosine similarity on unit vectors)
☐ Batch embed during ingestion — avoid per-chunk API calls
☐ Store raw text + metadata alongside each vector
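
The normalization item above matters because, for unit-length vectors, the dot product equals cosine similarity, so a store that ranks by inner product gives cosine rankings for free. A small pure-Python sketch (the helper names are illustrative; real pipelines would normally let the embedding library or NumPy do this):

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return vec[:]
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both numbers agree up to floating-point rounding.
print(cosine(a, b), dot(a_unit, b_unit))
```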

Embedding with sentence-transformers (local, free)

python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384 dims, fast

chunks = ["The capital of France is Paris.", "Python 3.12 released in 2023."]
embeddings = model.encode(chunks, normalize_embeddings=True, batch_size=32)

print(embeddings.shape)   # (2, 384)

Output:

text
(2, 384)

Embedding model comparison

| Model | Dims | Size | Best for |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | 80 MB | Fast general-purpose |
| all-mpnet-base-v2 | 768 | 420 MB | Higher quality general |
| text-embedding-3-small (OpenAI) | 1536 | API | Good quality, cost-effective |
| text-embedding-3-large (OpenAI) | 3072 | API | Best OpenAI quality |
| voyage-3 (Voyage AI) | 1024 | API | Best for RAG (benchmarks) |
| nomic-embed-text (Nomic) | 768 | API/local | Open, competitive quality |

Vector database options

A vector database stores vectors alongside metadata and serves approximate nearest-neighbor queries. Embedded options (Chroma, LanceDB) require no infrastructure and suit local dev; dedicated servers (Qdrant, Weaviate) add filtering and scalability; managed SaaS (Pinecone) eliminates ops at the cost of vendor lock-in. Pick based on your existing stack: if you already run Postgres, pgvector is often the lowest-friction choice.

| DB | Type | Best for | Free tier |
| --- | --- | --- | --- |
| Chroma | Embedded/server | Local dev, prototypes | ✅ self-hosted |
| pgvector | Postgres extension | Existing Postgres stack | ✅ self-hosted |
| Qdrant | Dedicated vector DB | Production, filtering | ✅ self-hosted |
| Weaviate | Dedicated vector DB | Multi-modal, GraphQL | ✅ self-hosted |
| Pinecone | Managed SaaS | Fully managed, scale | Free tier (1 index) |
| Milvus | Distributed | High-scale production | ✅ self-hosted |
| LanceDB | Embedded (files) | Serverless, embedded | ✅ self-hosted |

Chroma (local dev)

python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()   # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")

# Ingest
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
embeddings = model.encode(texts, normalize_embeddings=True).tolist()
collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=["doc-0", "doc-1"],
    metadatas=[{"source": "geography"}, {"source": "geography"}],
)

# Query
query_vec = model.encode(["What is the capital of France?"],
                          normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=query_vec, n_results=2)
print(results["documents"][0])

Output:

text
['Paris is the capital of France.', 'Berlin is the capital of Germany.']

pgvector (production)

sql
-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE doc_chunks (
    id       SERIAL PRIMARY KEY,
    source   TEXT,
    chunk    TEXT,
    embedding VECTOR(384)
);

-- Approximate nearest-neighbor index (HNSW — fast)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);
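
python
import psycopg2
from sentence_transformers import SentenceTransformer

# Same model as ingestion; its 384 dims must match the VECTOR(384) column
model = SentenceTransformer("all-MiniLM-L6-v2")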

conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()

query_vec = model.encode(["What is the capital of France?"],
                          normalize_embeddings=True)[0]

cur.execute(
    """
    SELECT source, chunk, 1 - (embedding <=> %s::vector) AS similarity
    FROM doc_chunks
    ORDER BY embedding <=> %s::vector
    LIMIT 5
    """,
    (query_vec.tolist(), query_vec.tolist())
)
rows = cur.fetchall()
for source, chunk, sim in rows:
    print(f"{sim:.3f}  [{source}]  {chunk[:80]}")

Output:

text
0.932  [geography]  Paris is the capital of France.
0.801  [geography]  France is a country in Western Europe.

Retrieval

Retrieval converts the user query into an embedding, then fetches the top-k most similar chunks from the vector store. Fetching 2× the desired count and reranking with a cross-encoder typically improves precision, on the order of 10–20% on standard benchmarks, at a cost of roughly 100 ms of extra latency. For queries that benefit from diversity, Maximal Marginal Relevance (MMR) reduces redundancy among returned chunks.

python
def retrieve(query: str, k: int = 5) -> list[dict]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    # Vector similarity search (top 2k candidates)
    vector_results = collection.query(
        query_embeddings=[query_vec.tolist()],
        n_results=k * 2
    )
    candidates = [
        # Chroma returns a distance per hit: lower means closer to the query
        {"id": doc_id, "text": doc, "metadata": meta, "score": dist}
        for doc_id, doc, meta, dist in zip(
            vector_results["ids"][0],
            vector_results["documents"][0],
            vector_results["metadatas"][0],
            vector_results["distances"][0],
        )
    ]

    # Optional: cross-encoder reranking (high-value, ~100ms)
    # from sentence_transformers import CrossEncoder
    # reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # pairs = [(query, c["text"]) for c in candidates]
    # scores = reranker.predict(pairs)
    # Sort on the score only; comparing the dicts on a tie would raise TypeError
    # ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    # candidates = [c for _, c in ranked]

    return candidates[:k]
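
The Maximal Marginal Relevance step mentioned in this section can be sketched in pure Python. This is a minimal illustration that assumes unit-normalized vectors (so the dot product is cosine similarity); the function name `mmr` and the `lambda_mult` parameter name are choices made for this example, and the toy vectors are invented.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr(
    query_vec: list[float],
    cand_vecs: list[list[float]],
    k: int = 5,
    lambda_mult: float = 0.5,
) -> list[int]:
    """Return candidate indices in MMR selection order.

    lambda_mult=1.0 ranks by relevance only; lower values penalize
    candidates that are similar to ones already selected.
    """
    remaining = list(range(len(cand_vecs)))
    selected: list[int] = []
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = dot(query_vec, cand_vecs[i])
            redundancy = max(
                (dot(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.96, 0.28],    # most relevant
    [0.936, 0.352],  # near-duplicate of the first
    [0.6, -0.8],     # less relevant but different
]
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # → [0, 2]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # → [0, 1]
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more diverse candidate; with `lambda_mult=1.0` the result is plain relevance ranking.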

Context assembly

Context assembly takes the retrieved chunks and serializes them into a single string to be injected into the prompt. Always cap the assembled context to a token budget (leaving room for the system prompt, question, and model output), attach source labels to each chunk so the model can cite them, and put the most relevant chunk first, since models tend to use material near the start of the context more reliably than material buried in the middle.

python
def build_context(chunks: list[dict], max_tokens: int = 6000) -> str:
    context_parts = []
    token_count = 0

    for chunk in chunks:
        # Rough token estimate: 1 token ≈ 4 chars
        chunk_tokens = len(chunk["text"]) // 4
        if token_count + chunk_tokens > max_tokens:
            break
        source = chunk["metadata"].get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{chunk['text']}")
        token_count += chunk_tokens

    return "\n\n---\n\n".join(context_parts)

Prompt template for RAG

The standard RAG prompt instructs the model to answer using only the provided sources, refuse when the answer is not present, and cite which source it used. The explicit "do not speculate" and "say I don't know" constraints are the primary levers against hallucination — omit them and the model will fill gaps from training data.

text
Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.

Sources:
{context}

Question: {question}

Answer:
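
Filling the template above can be sketched with plain `str.format`. The constant name `RAG_PROMPT_TEMPLATE` and the helper `render_prompt` are hypothetical names for this illustration; the template text is the one shown above.

```python
RAG_PROMPT_TEMPLATE = """\
Answer the question using ONLY the sources provided below.
If the answer is not in the sources, say "I don't have enough information."
Do not speculate or draw on outside knowledge.
Cite sources by their [Source: ...] label.

Sources:
{context}

Question: {question}

Answer:"""


def render_prompt(context: str, question: str) -> str:
    """Fill the template. Braces inside the values are left untouched,
    because str.format only interprets braces in the template itself."""
    return RAG_PROMPT_TEMPLATE.format(
        context=context.strip(), question=question.strip()
    )


prompt = render_prompt(
    context="[Source: geography]\nParis is the capital of France.",
    question="What is the capital of France?",
)
print(prompt)
```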

Full RAG pipeline

An end-to-end skeleton that wires together retrieval, context assembly, and LLM generation in a single answer() function. Use this as a starting point, then layer in reranking, streaming, and caching as your latency and cost requirements demand.

python
import anthropic

anthropic_client = anthropic.Anthropic()

def answer(question: str) -> str:
    chunks = retrieve(question, k=5)
    context = build_context(chunks)

    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Answer using ONLY the sources below. "
                f"Cite sources. If not in sources, say so.\n\n"
                f"Sources:\n{context}\n\n"
                f"Question: {question}"
            )
        }]
    )
    return response.content[0].text

print(answer("What is the capital of France?"))

Output:

text
According to [Source: geography], Paris is the capital of France.

Agentic RAG

For multi-hop questions (answer depends on multiple retrieval steps), give Claude a search tool and let it decide what to retrieve.

python
search_tool = {
    "name": "search_docs",
    "description": (
        "Search the documentation for relevant information. "
        "Call this when you need specific facts to answer the question. "
        "You may call it multiple times with different queries."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "max_results": {"type": "integer", "default": 5}
        },
        "required": ["query"]
    }
}

def handle_search(inputs: dict) -> str:
    chunks = retrieve(inputs["query"], k=inputs.get("max_results", 5))
    return build_context(chunks)

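# Let Claude drive the retrieval loop.
# run_agent is not a library function: it stands for your own tool-use loop,
# which must dispatch each search_docs call to handle_search.
agent_answer = run_agent(  # not named `answer`, which would shadow answer() above
    user_message="What are the differences between Chroma and pgvector?",
    tools=[search_tool],
    max_turns=8,
)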
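
The snippet above calls `run_agent` without defining it. One way such a loop could look is sketched below, assuming the Anthropic Messages API tool-use flow: a response whose `stop_reason` is `"tool_use"` carries `tool_use` blocks, and each is answered with a `tool_result` block in the next user turn. The `client`, `handlers`, and `model` parameters are additions for this illustration and are not part of the original call.

```python
from typing import Any, Callable


def run_agent(
    client: Any,                                   # e.g. anthropic.Anthropic()
    user_message: str,
    tools: list[dict],
    handlers: dict[str, Callable[[dict], str]],    # tool name -> handler
    model: str,                                    # the model ID you already use
    max_turns: int = 8,
) -> str:
    """Loop until the model stops requesting tools or max_turns is reached."""
    messages: list[dict] = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        if response.stop_reason != "tool_use":
            # Final turn: join the text blocks into the answer.
            return "".join(b.text for b in response.content if b.type == "text")

        # Echo the assistant turn, then answer every tool call in one user turn.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                output = handlers[block.name](block.input)
                tool_results.append(
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": output,
                    }
                )
        messages.append({"role": "user", "content": tool_results})
    return "Stopped: max_turns reached before a final answer."
```

With this version, the call would also pass `client=anthropic_client`, `handlers={"search_docs": handle_search}`, and a `model` argument. Capping `max_turns` keeps a confused model from searching indefinitely.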

Evaluation checklist

RAG quality is measured across two independent axes: retrieval quality (did we fetch the right chunks?) and generation quality (did the model answer faithfully from those chunks?). Measure both separately — a high generation score on top of poor retrieval just means the model is hallucinating confidently.

text
☐ Faithfulness — does the answer use only retrieved chunks? (no hallucination)
☐ Answer relevance — does the answer address the actual question?
☐ Context recall — does the top-k contain the chunk needed to answer?
☐ Context precision — are retrieved chunks on-topic, or noisy?
☐ Latency — p50/p95 retrieval + generation time within SLA
☐ Hallucination rate — spot-check a sample against source documents
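
The two retrieval-side items in the checklist, context recall and context precision, can be computed from a small labeled set. The sketch below assumes you have, for each test question, the set of chunk IDs that a correct answer needs; the function names and the `eval_set` layout are hypothetical.

```python
from typing import Callable


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the needed chunks that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k that is actually needed."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)


def evaluate(
    eval_set: list[tuple[str, set[str]]],
    retrieve_ids: Callable[[str, int], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average both metrics over (question, relevant_ids) pairs."""
    recalls, precisions = [], []
    for question, relevant in eval_set:
        ids = retrieve_ids(question, k)
        recalls.append(recall_at_k(ids, relevant, k))
        precisions.append(precision_at_k(ids, relevant, k))
    n = len(eval_set) or 1
    return {"recall": sum(recalls) / n, "precision": sum(precisions) / n}


retrieved = ["doc-3", "doc-7", "doc-1", "doc-9"]
relevant = {"doc-1", "doc-4"}
print(recall_at_k(retrieved, relevant, k=3))     # → 0.5
print(precision_at_k(retrieved, relevant, k=3))
```

Faithfulness and answer relevance need a judge (human or model) and are not covered by this sketch.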

Hybrid retrieval (vector + keyword)

Pure vector search misses exact-term matches (product codes, error strings, proper nouns); pure keyword search misses paraphrases. Hybrid retrieval runs both and fuses the scores, recovering recall on both axes. BM25 is the standard sparse complement to dense vectors, and Reciprocal Rank Fusion (RRF) is the simplest fusion strategy — no score calibration required.

python
from rank_bm25 import BM25Okapi

# Pre-build BM25 index over the same corpus you embedded.
# corpus_ids[i] must be the vector-store ID of corpus_texts[i], so dense and
# sparse hits refer to the same documents during fusion.
tokenized_corpus = [doc.lower().split() for doc in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_retrieve(query: str, k: int = 5, k_fetch: int = 20) -> list[dict]:
    # Dense vector candidates
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    dense = collection.query(query_embeddings=[query_vec.tolist()], n_results=k_fetch)
    dense_ids = dense["ids"][0]

    # Sparse BM25 candidates (tokenize the query the same way as the corpus)
    sparse_scores = bm25.get_scores(query.lower().split())
    ranked = sparse_scores.argsort()[::-1][:k_fetch]
    # Drop documents with no term overlap so they earn no fusion credit
    sparse_ids = [corpus_ids[i] for i in ranked if sparse_scores[i] > 0]

    # Reciprocal Rank Fusion: sum of 1 / (60 + rank), with rank starting at 1
    # (the constant 60 is the community standard)
    rrf_scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (60 + rank)

    top = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:k]
    return [{"id": did, "score": score} for did, score in top]

RRF works without normalizing the dense and sparse scores onto the same scale. If you want fine control, replace it with a learned reranker (cross-encoder/ms-marco-MiniLM-L-6-v2) fed the union of candidates from both retrievers.

Query rewriting and expansion

The user's query is rarely the optimal retrieval query. Three high-leverage rewrites: (1) decompose multi-part questions into sub-queries, retrieve for each, union the results; (2) hypothetical document embedding (HyDE) — ask the LLM to draft a fake answer and embed that, which matches the answer style of your indexed chunks better than a question; (3) for short cryptic queries, expand with synonyms.

python
def hyde_retrieve(query: str, k: int = 5) -> list[dict]:
    # Generate a hypothetical answer to embed instead of the question
    hyde_prompt = (
        f"Write a short hypothetical answer to the following question. "
        f"Style should match a technical documentation excerpt. "
        f"Do not say 'I don't know' — invent a plausible answer.\n\n"
        f"Question: {query}"
    )
    # call_claude stands for your own LLM wrapper that returns the reply text
    fake_answer = call_claude(hyde_prompt, max_tokens=300)
    fake_vec = model.encode([fake_answer], normalize_embeddings=True)[0]

    results = collection.query(query_embeddings=[fake_vec.tolist()], n_results=k)
    return [
        {"text": d, "metadata": m}
        for d, m in zip(results["documents"][0], results["metadatas"][0])
    ]

python
def decompose_query(query: str) -> list[str]:
    """Break a compound question into atomic sub-queries."""
    prompt = (
        "Break the following question into 1-4 atomic sub-questions, each of which "
        "can be answered from a single document. Return ONE per line. No prose.\n\n"
        f"Question: {query}"
    )
    raw = call_claude(prompt, max_tokens=300)
    return [line.strip("-* ").strip() for line in raw.splitlines() if line.strip()]
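
The "retrieve for each, union the results" step from rewrite (1) can be sketched as below. The helper name `retrieve_union` is hypothetical, the retriever is passed in as a callable so the sketch stays self-contained, and deduplicating on chunk text is an assumption (a chunk ID would work equally well).

```python
from typing import Callable


def retrieve_union(
    sub_queries: list[str],
    retrieve_fn: Callable[[str, int], list[dict]],
    k_per_query: int = 3,
) -> list[dict]:
    """Retrieve for each sub-query and merge, keeping first occurrences."""
    seen: set[str] = set()
    merged = []
    for sub_query in sub_queries:
        for chunk in retrieve_fn(sub_query, k_per_query):
            if chunk["text"] in seen:
                continue  # already contributed by an earlier sub-query
            seen.add(chunk["text"])
            merged.append({**chunk, "sub_query": sub_query})
    return merged


# Toy retriever standing in for retrieve(); data is invented for illustration.
FAKE_INDEX = {
    "What is Chroma?": [{"text": "Chroma is an embedded vector DB."}],
    "What is pgvector?": [
        {"text": "pgvector is a Postgres extension."},
        {"text": "Chroma is an embedded vector DB."},
    ],
}


def fake_retrieve(query: str, k: int) -> list[dict]:
    return FAKE_INDEX.get(query, [])[:k]


merged = retrieve_union(["What is Chroma?", "What is pgvector?"], fake_retrieve)
print([c["text"] for c in merged])
```

Tagging each chunk with the sub-query that found it helps later when debugging why a chunk ended up in the context.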

Metadata filtering

The fastest way to improve precision is to constrain the search space before similarity ranking runs. Pre-filter by date range, source type, language, or access level; most vector DBs support attribute filters on the same query call. A pre-filter that cuts the candidate pool by 10x can improve precision more than swapping embedding models, provided the filter never excludes the chunk you need.

python
# Chroma example: filter by metadata before similarity search
results = collection.query(
    query_embeddings=[query_vec.tolist()],
    n_results=5,
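    where={
        "$and": [
            # Chroma range operators ($gt, $gte, $lt, $lte) expect numeric
            # values, so the date is assumed to be stored as an int (YYYYMMDD)
            {"published_date": {"$gte": 20240101}},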
            {"language": {"$eq": "en"}},
            {"access_tier": {"$in": ["public", "internal"]}},
        ]
    },
)

sql
-- pgvector equivalent: combine vector search with WHERE clause
SELECT source, chunk, 1 - (embedding <=> $1::vector) AS similarity
FROM doc_chunks
WHERE published_date >= '2024-01-01'
  AND language = 'en'
  AND access_tier IN ('public', 'internal')
ORDER BY embedding <=> $1::vector
LIMIT 5;

Observability and tracing

A RAG pipeline that returns a wrong answer can fail at chunking, embedding, retrieval, reranking, prompt assembly, or generation. Without tracing, debugging is guessing. Log every stage with structured records: query, query embedding hash, top-k IDs + scores, reranker scores, assembled context length, model response, and latencies. Tools like LangSmith and Arize Phoenix automate this.

python
import time, json, uuid

def traced_answer(question: str) -> dict:
    trace_id = str(uuid.uuid4())
    t0 = time.time()

    # Retrieval
    t1 = time.time()
    chunks = retrieve(question, k=5)
    t_retrieve = time.time() - t1

    # Context assembly
    t1 = time.time()
    context = build_context(chunks)
    t_assemble = time.time() - t1

    # Generation
    t1 = time.time()
    answer = generate(question, context)
    t_generate = time.time() - t1

    trace = {
        "trace_id": trace_id,
        "question": question,
        "retrieved_ids": [c.get("id") for c in chunks],
        "retrieved_scores": [c.get("score") for c in chunks],
        "context_chars": len(context),
        "answer": answer,
        "latency_ms": {
            "retrieve": int(t_retrieve * 1000),
            "assemble": int(t_assemble * 1000),
            "generate": int(t_generate * 1000),
            "total": int((time.time() - t0) * 1000),
        },
    }
    print(json.dumps(trace))    # ship to your logging stack
    return {"answer": answer, "trace_id": trace_id}
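
Once traces like the one above are collected, the p50/p95 latency figures from the evaluation checklist can be derived from them. A minimal sketch using the nearest-rank method; the helper names are hypothetical and the sample latencies are invented for illustration.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(traces: list[dict], stage: str = "total") -> dict[str, float]:
    """Summarize one stage of the latency_ms record across traces."""
    samples = [t["latency_ms"][stage] for t in traces]
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}


traces = [
    {"latency_ms": {"retrieve": 40, "total": 120}},
    {"latency_ms": {"retrieve": 35, "total": 150}},
    {"latency_ms": {"retrieve": 50, "total": 180}},
    {"latency_ms": {"retrieve": 45, "total": 200}},
    {"latency_ms": {"retrieve": 300, "total": 950}},
]
print(latency_summary(traces))              # → {'p50': 180, 'p95': 950}
print(latency_summary(traces, "retrieve"))  # → {'p50': 45, 'p95': 300}
```

Summarizing per stage shows whether a slow tail comes from retrieval or generation.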

Cost and latency tuning

RAG cost is dominated by generation tokens; latency is dominated by retrieval + reranking when k is large. The cheap wins are: cache embedding-generation calls for repeat queries, use Haiku for retrieval-grounded generation (the quality gap shrinks with good context), batch-embed during ingestion, and pre-filter aggressively.

| Lever | Cost impact | Latency impact | Quality impact |
| --- | --- | --- | --- |
| Smaller embedding model | Lower per-token | Faster query embedding | -5 to -10% recall |
| Reduce top-k from 10 to 5 | Lower input tokens | Faster reranking | Usually negligible |
| Prompt cache static context | Up to -90% on hit | -50ms TTFT | None |
| Swap Opus to Haiku for gen | -85% per output token | 2-3x faster gen | -5 to -15% subjective |
| Pre-filter by metadata | Lower retrieval cost | Faster ANN search | +precision if filter is correct |
| Cross-encoder reranker | +negligible cost | +50–150ms | +10–20% precision |
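
Caching query embeddings for repeat queries, one of the cheap wins listed in this section, can be sketched with `functools.lru_cache`. The wrapper name `make_cached_embedder` is hypothetical, the embedding function is passed in so the sketch stays self-contained, and collapsing whitespace before lookup is an assumption about what counts as the same query.

```python
from functools import lru_cache
from typing import Callable, Sequence


def make_cached_embedder(
    embed_fn: Callable[[str], Sequence[float]], maxsize: int = 10_000
):
    """Wrap an embedding function with an in-process LRU cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(query: str) -> tuple[float, ...]:
        # Tuples are immutable, so callers cannot corrupt cached vectors.
        return tuple(embed_fn(query))

    def embed_query(query: str) -> tuple[float, ...]:
        # Collapse whitespace so trivially different inputs share an entry.
        return _cached(" ".join(query.split()))

    embed_query.cache_info = _cached.cache_info  # expose hit/miss counters
    return embed_query


calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text)), 1.0]


embed_query = make_cached_embedder(fake_embed)
embed_query("capital of France")
embed_query("capital   of France ")
print(len(calls), embed_query.cache_info().hits)  # → 1 1
```

An in-process cache is lost on restart and is not shared across workers; a shared store would be needed for that.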

Common failure modes

Most RAG failures fall into three root causes: the right chunk was never retrieved, the right chunk was retrieved but the prompt didn't constrain the model to use it, or the index is stale relative to the source documents. The table below maps symptoms to causes so you can diagnose without guessing.

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Wrong answer despite correct chunk retrieved | Prompt doesn't constrain to sources | Add explicit "ONLY use sources" instruction |
| Correct answer but wrong source cited | Chunk metadata lost at storage | Persist source field alongside vector |
| Good on short docs, bad on long | Fixed chunk too large (diluted) | Use smaller chunks or semantic chunking |
| Misses recent information | Stale index | Add incremental ingestion + reindex trigger |
| Slow retrieval | Full scan without index | Add HNSW/IVF index; shard by date |
| Hallucinations despite good retrieval | Context too long, key chunk buried | Use reranker; put most relevant chunk first |
| Poor performance on tables/lists | Character-level chunking splits structure | Keep tables and lists whole as single chunks |
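
The "stale index" fix (incremental ingestion with a reindex trigger) can be sketched by comparing a fingerprint of each source document against a stored manifest. The function names and the manifest layout (document ID mapped to a SHA-256 hex digest) are assumptions for this illustration.

```python
import hashlib


def doc_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str], manifest: dict[str, str]
) -> dict[str, list[str]]:
    """Compare current documents against the fingerprints stored at last ingest.

    Returns the IDs to re-chunk and re-embed, and the IDs to delete
    from the vector store.
    """
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if manifest.get(doc_id) != doc_fingerprint(text)  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in manifest if doc_id not in current_docs)
    return {"upsert": to_upsert, "delete": to_delete}


manifest = {
    "faq.md": doc_fingerprint("Old FAQ text"),
    "guide.md": doc_fingerprint("Guide text"),
    "retired.md": doc_fingerprint("Retired page"),
}
current_docs = {
    "faq.md": "New FAQ text",       # changed
    "guide.md": "Guide text",       # unchanged
    "pricing.md": "Pricing text",   # new
}
print(plan_reindex(current_docs, manifest))
```

Only the documents in `upsert` need to go back through chunking and embedding, which keeps reindexing cost proportional to what changed.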