Retrieval-Augmented Generation (RAG)

Grounding LLM responses in chunks retrieved from an external corpus so the model reasons over real, citable sources instead of parametric memory alone.

Definition

Retrieval-Augmented Generation (RAG) is the pattern of fetching relevant snippets from an external corpus at query time and injecting them into an LLM prompt so the model answers from real sources rather than from its frozen training weights alone. The term was coined by Lewis et al. (2020), whose original architecture paired a parametric seq2seq generator with a non-parametric dense vector index of Wikipedia; modern usage generalises to any retriever (vector, lexical, hybrid, graph) plus any chat or completion model. RAG applies whenever the answer depends on data the model could not have memorised: private documents, fresh news, regulated knowledge, or anything that changes after the training cut-off.

Why it matters

Out-of-the-box LLMs hallucinate confidently when asked about unfamiliar facts and have no way to cite sources. RAG addresses both problems at once: by constraining the model to a small, query-relevant set of retrieved chunks, you reduce fabrication and gain the metadata needed for inline citations and audit trails. It is usually the cheapest path to making a general-purpose model useful on a specific corpus, and far cheaper than fine-tuning, since the index can be rebuilt incrementally as documents change while the model itself stays untouched. RAG also unlocks compliance use cases (regulated industries that require provenance) and multi-tenant SaaS (one index per customer, one shared model). Even with million-token context windows arriving in 2025, RAG remains the right tool for dynamic corpora, cost-sensitive workloads, and any application where source attribution is non-negotiable.

How it works

A RAG pipeline has two phases that share infrastructure but run at different times.

Indexing (offline):

  1. Parse heterogeneous source files (PDFs, HTML, Markdown, Office docs) into clean text plus metadata (filename, page, section, last-modified). Libraries such as unstructured emit typed elements (titles, narrative text, list items, tables) that preserve document structure.
  2. Chunk the text into retrievable units. Fixed-size chunks (256–1024 tokens with 10–20% overlap) are the default; semantic chunking (splitting on headings or topic shifts) usually gives better retrieval quality.
  3. Embed each chunk with a sentence-encoder (BGE, E5, GTE, OpenAI text-embedding-3-*, Voyage, Nomic). The resulting dense vectors live alongside the original text and metadata in a vector store (Chroma, Qdrant, Weaviate, pgvector, Pinecone).
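
The fixed-size chunking in step 2 can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes whitespace-separated words stand in for model tokens, and the `Chunk` dataclass and `chunk_tokens` function are hypothetical names introduced here. A real pipeline would count tokens with the embedding model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # e.g. the filename, carried along so answers can cite it
    start: int   # index of the chunk's first token within the document


def chunk_tokens(text: str, source: str, size: int = 256, overlap: int = 32) -> list[Chunk]:
    """Split text into windows of `size` tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    tokens = text.split()  # assumption: whitespace words approximate model tokens
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(Chunk(" ".join(window), source, start))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With `size=256` and `overlap=32` the overlap is 12.5%, inside the 10–20% range mentioned above. Keeping `source` and `start` on every chunk is what later makes citations possible.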

Querying (online):

  1. Embed the query with the same encoder used during indexing.
  2. Retrieve the top-k chunks by approximate-nearest-neighbour search. Production systems often combine dense vectors with BM25 lexical scoring (hybrid search) and apply metadata filters to scope by tenant, date, or document type.
  3. Rerank the candidate set with a cross-encoder (Cohere Rerank, BGE reranker, ms-marco-MiniLM): fetch 20 candidates, keep the best 5. Reranking is often the highest-leverage single optimisation in production RAG.
  4. Assemble the prompt: a system instruction that constrains the model to the provided sources, the retrieved chunks (often tagged with [source: filename, p.4]), and the user query.
  5. Generate with the LLM and surface the cited sources back to the user.
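
The query path above (embed, retrieve, assemble the prompt) can be sketched end to end with the standard library. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words counter rather than a real sentence encoder, `retrieve` does exact cosine search instead of approximate-nearest-neighbour search, reranking is omitted, and the wording of `SYSTEM` is only one possible instruction.

```python
import math
import re
from collections import Counter

SYSTEM = (
    "Answer only from the sources below. "
    "If the answer is not in them, say you don't know. "
    "Cite the source tag for every claim."
)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call a sentence encoder here."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: list[tuple[str, str]]) -> list[dict]:
    """docs is a list of (source_tag, text) pairs; each entry keeps text, tag, and vector together."""
    return [{"source": source, "text": text, "vector": embed(text)} for source, text in docs]


def retrieve(query: str, index: list[dict], k: int = 5) -> list[dict]:
    """Exact top-k by cosine similarity, using the same encoder as at indexing time."""
    q = embed(query)
    ranked = sorted(index, key=lambda chunk: cosine(q, chunk["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[dict]) -> str:
    """System instruction, then source-tagged chunks, then the user question."""
    sources = "\n\n".join(f"[source: {c['source']}]\n{c['text']}" for c in chunks)
    return f"{SYSTEM}\n\nSources:\n{sources}\n\nQuestion: {query}"
```

The string returned by `build_prompt` is what would be sent to the LLM in step 5. Because each chunk travels with its source tag, the model can cite it and the application can render the citation.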

The mental model is parametric memory plus non-parametric memory: model weights remember language and reasoning, the index remembers facts, and the prompt is the bus that connects them. An ecosystem of variants has grown around this skeleton:

  • HyDE generates a hypothetical answer first and embeds that, improving recall on sparse queries.
  • RAPTOR builds hierarchical summary trees so multi-document questions can retrieve at the right zoom level.
  • GraphRAG extracts entity-relationship graphs from the corpus to enable global-structure queries.
  • Self-RAG trains the model to decide when to retrieve and to critique its own outputs.
  • FLARE triggers retrieval mid-generation when token confidence drops.
  • Agentic RAG lets a planner agent issue multiple targeted retrievals, reflect, and iterate.

All of these are refinements of the same core loop: they change when, what, and how often you retrieve, not the fundamental retrieve-then-generate pattern.
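
To make "changing what you retrieve with" concrete, here is a simplified sketch of the HyDE idea: embed a generated hypothetical answer instead of the raw query. The three callables (`generate`, `embed`, `search`) are hypothetical placeholders for an LLM call, an encoder, and a vector-store lookup, and the prompt wording is an assumption. This version uses a single hypothetical passage, a simplification of the published method.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],              # placeholder for an LLM completion call
    embed: Callable[[str], Vector],              # placeholder for the indexing-time encoder
    search: Callable[[Vector, int], list[str]],  # placeholder for a vector-store lookup
    k: int = 5,
) -> list[str]:
    """Search with the embedding of a hypothetical answer rather than of the query itself."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k)
```

The rest of the pipeline is unchanged: the retrieved chunks still go into the prompt alongside the original query, not the hypothetical passage.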

Common pitfalls

  1. Bad chunking dominates bad answers: splitting mid-sentence at a fixed character count fragments meaning. Prefer heading-aware or recursive splitters, and keep chunks well within your embedding model's maximum input length so nothing is silently truncated.
  2. Encoder mismatch between indexing and querying: embedding the query and the corpus with different models silently destroys recall, because the two sets of vectors live in unrelated spaces. Pin the encoder name and version in config and re-index whenever you change it.
  3. Vector-only search misses exact terms: product codes, legal citations, error messages, and acronyms are often retrieved poorly by pure dense search. Add BM25 in a hybrid setup; the cost is one extra index, and the gain on exact-match queries is consistent.
  4. No reranker: top-k by cosine alone is noisy. A cross-encoder reranker over 20 candidates often beats plain top-5 dense retrieval by 10–30 points on Recall@5, though the size of the gain depends on the corpus and the encoder.
  5. Stuffing the context window: passing 50 chunks does not improve answers; it dilutes attention and raises cost. Tune k, rerank, and trim.
  6. Hallucination despite retrieval: without explicit instructions such as "answer only from these sources; say I don't know if absent; cite the source ID", the model will still fill gaps from training data.
  7. No evaluation harness: shipping RAG changes without RAGAS or TruLens scores makes every refactor a coin flip. Track faithfulness, answer relevancy, context precision, and context recall in CI and fail builds on regression.
  8. Stale index: the corpus changes; the embeddings do not. Schedule re-indexing and track per-document updated_at so retrieval reflects reality.
  9. Choosing RAG when long context fits: for small static corpora that fit in a 200K–1M-token window, RAG's pipeline complexity costs more than it saves. Long-context wins on cohesive whole-document reasoning; RAG wins on dynamic data, precise citations, cost at scale, and multi-tenant isolation.
  10. No source surfacing in the UI: citations the user cannot click are citations the user cannot trust. Pass document IDs through the prompt and render them as links in the response.
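
As a starting point for the evaluation pitfall, a retrieval-only check can run in CI before any LLM-judged metrics are wired in. The sketch below computes plain Recall@k over hand-labelled chunk IDs and fails when the score regresses. It is not the RAGAS or TruLens API; the function names and the 0.02 tolerance are assumptions made for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear among the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs, one pair per test question."""
    if not runs:
        raise ValueError("need at least one labelled question")
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


def check_regression(current: float, baseline: float, tolerance: float = 0.02) -> None:
    """Raise if the current score is more than `tolerance` below the stored baseline."""
    if current < baseline - tolerance:
        raise AssertionError(
            f"Recall regressed: {current:.3f} vs baseline {baseline:.3f} (tolerance {tolerance})"
        )
```

Running `check_regression(mean_recall_at_k(runs, k=5), baseline)` inside a test makes a chunking, encoder, or reranker change fail the build when retrieval gets worse. Faithfulness and answer relevancy still need a generation-level evaluator on top.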

Sources

References consulted while writing this concept page.

  • Lewis et al. (2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — The original paper that coined RAG and framed it as a fusion of parametric (seq2seq) and non-parametric (dense vector index) memory.
  • RAGAS — list of available metrics — Canonical definitions for faithfulness, answer relevancy, context precision, and context recall used in the evaluation section.
  • RAGFlow — RAG at the Crossroads (mid-2025) — Recent survey of the modern RAG architecture landscape (GraphRAG, Agentic RAG, Self-RAG, RAPTOR, late-interaction models).
  • Data Nucleus — Enterprise RAG and Graph RAG guide (2025) — Source for the enterprise framing of agentic and graph-based variants and their production tradeoffs.
  • Tian Pan — Long-Context vs RAG production decision framework — Cost and latency numbers behind the "long-context vs RAG" pitfall (≈30–60× latency, ≈1,250× per-query cost for 1 M-token requests).
  • Li et al. (2025) — Long Context vs. RAG for LLMs: An Evaluation and Revisits — Empirical study showing long-context wins on whole-document reasoning while RAG wins on precise factual retrieval with attribution.
  • Superlinked VectorHub — Optimizing RAG with hybrid search and reranking — Source for the "retrieve 20, rerank to 5" pattern and the dominance of cross-encoder reranking over pure dense top-k.
  • Machine Learning Mastery — Beyond vector search: 5 next-gen RAG retrieval strategies — Compact overview of HyDE, RAPTOR, GraphRAG, Self-RAG, and late-interaction models cited in the variants paragraph.