cheat sheet
LlamaIndex
Build RAG pipelines and LLM-powered data applications with LlamaIndex. Covers document loading, indexing, query engines, custom LLMs and embeddings, persistent storage, and agents.
LlamaIndex — Data Framework for LLMs
What it is
LlamaIndex (formerly GPT Index) is a data framework for building LLM-powered applications over your own data. Its core workflow is: load documents with readers → index them (chunk, embed, store) → query with natural language. LlamaIndex provides a higher-level abstraction than LangChain for document-centric RAG: it handles chunking strategies, index types, and retrieval pipelines out of the box, while still allowing deep customisation of every component.
Install
pip install llama-index
pip install llama-index-llms-openai # OpenAI LLMs
pip install llama-index-llms-anthropic # Anthropic Claude
pip install llama-index-embeddings-huggingface # HuggingFace embeddings
pip install llama-index-vector-stores-chroma # ChromaDB vector store
Output: (none — exits 0 on success)
Quick example
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
import os
os.environ["OPENAI_API_KEY"] = "sk-..." # or set in environment
# Load documents from a directory
documents = SimpleDirectoryReader("./docs").load_data()
# Build index (chunks + embeds + stores in memory)
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics covered in these documents?")
print(response)
Output:
The documents cover three main topics: transformer architecture, attention mechanisms,
and the shift from recurrent neural networks to self-attention for sequence modelling.
Key themes include positional encoding, multi-head attention, and encoder-decoder design.
When / why to use it
- Building RAG systems over your own documents (PDFs, Notion, Google Docs, databases).
- When you want structured indexing strategies (vector, keyword, knowledge graph) rather than building the retrieval layer from scratch.
- Multi-document reasoning: sub-question decomposition, routing, knowledge graph traversal.
- When the primary workflow is document → query rather than tool-calling agents.
- Custom embedding models and LLMs without changing the query interface.
Common pitfalls
- Settings must be configured before indexing: if you swap the default LLM or embedding model via Settings.llm / Settings.embed_model, do it before calling VectorStoreIndex.from_documents(). Objects created before the setting change use the old model.
- Default embedding model is text-embedding-ada-002: without explicit configuration, LlamaIndex calls OpenAI's embedding API. If you have no OpenAI key or want to use a local model, configure Settings.embed_model first.
- Re-indexing cost: VectorStoreIndex.from_documents() embeds every chunk via the embedding API. For large document sets, use a persistent vector store and check whether documents already exist before re-embedding.
- Use index.storage_context.persist("./storage") to save index data locally. Load it later with StorageContext.from_defaults(persist_dir="./storage") to avoid re-embedding.
- response.source_nodes contains the retrieved chunks that grounded the answer. Print them with scores to understand retrieval quality: for n in response.source_nodes: print(n.score, n.text[:100]).
Richer example — persistent ChromaDB index
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.anthropic import Anthropic
import chromadb
import os
# Configure LLM and embeddings (no OpenAI required)
Settings.llm = Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512
Settings.chunk_overlap = 64
# ChromaDB persistent storage
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("research_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx = StorageContext.from_defaults(vector_store=vector_store)
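# Index documents. from_documents() embeds every chunk on each run; on later runs,
# reconnect to the stored vectors with VectorStoreIndex.from_vector_store(vector_store) instead.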
documents = SimpleDirectoryReader("./research").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_ctx)
# Query
engine = index.as_query_engine(similarity_top_k=5)
response = engine.query("What evidence supports the claim that attention is sufficient?")
print(response.response)
print(f"\nSources used: {len(response.source_nodes)}")
for node in response.source_nodes:
    print(f" [{node.score:.3f}] {node.metadata.get('file_name', '?')} — {node.text[:60]}...")
Settings — global configuration
Settings is a singleton that configures the default LLM, embedding model, chunk size, and other pipeline parameters. All index and engine objects created after a Settings change use the new values.
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os
# Use Anthropic + local HuggingFace embeddings
Settings.llm = Anthropic(
model="claude-sonnet-4-6",
api_key=os.environ["ANTHROPIC_API_KEY"],
max_tokens=1024,
)
Settings.embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-base-en-v1.5",
device="cpu",
)
Settings.chunk_size = 512
Settings.chunk_overlap = 50
Settings.num_output = 512
Document loaders — SimpleDirectoryReader
SimpleDirectoryReader reads files from a local directory (PDF, TXT, DOCX, Markdown, CSV, HTML) using format-specific parsers. It returns a list of Document objects.
from llama_index.core import SimpleDirectoryReader
# Load all supported files from a directory
docs = SimpleDirectoryReader("./docs").load_data()
print(f"Loaded {len(docs)} documents")
for d in docs[:2]:
    print(d.metadata["file_name"], len(d.text), "chars")
# Load specific file types only
docs = SimpleDirectoryReader("./docs", required_exts=[".pdf", ".md"]).load_data()
# Load recursively
docs = SimpleDirectoryReader("./docs", recursive=True).load_data()
Output:
Loaded 8 documents
attention_paper.pdf 43892 chars
transformers_overview.md 12041 chars
Other loaders (install separately): llama-index-readers-web, llama-index-readers-notion, llama-index-readers-google (Drive, Docs), llama-index-readers-database.
Query engine vs chat engine
A query engine answers single questions grounded in retrieved documents. A chat engine maintains conversation history and allows follow-up questions that reference prior answers.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
# Query engine — single-turn Q&A
query_engine = index.as_query_engine(similarity_top_k=4, response_mode="compact")
r = query_engine.query("What is multi-head attention?")
print(r.response[:200])
# Chat engine — multi-turn conversation
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
r1 = chat_engine.chat("What is multi-head attention?")
print(r1.response[:120])
r2 = chat_engine.chat("How many heads does BERT use?")
print(r2.response[:120]) # knows "heads" refers to attention heads from prior turn
Node parsers and chunking
Node parsers split documents into chunks (nodes). The default SentenceSplitter cuts on sentence boundaries up to chunk_size tokens. More options:
from llama_index.core.node_parser import (
SentenceSplitter,
SemanticSplitterNodeParser,
MarkdownNodeParser,
CodeSplitter,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Default — sentence-aware chunking
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
# Semantic chunking — splits where topic changes (requires embedding call per chunk)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
semantic_splitter = SemanticSplitterNodeParser(
buffer_size=1,
breakpoint_percentile_threshold=95,
embed_model=embed_model,
)
# Markdown-aware (preserves headers as metadata)
md_splitter = MarkdownNodeParser()
# Apply manually
from llama_index.core import SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
nodes = splitter.get_nodes_from_documents(docs)
print(f"Split into {len(nodes)} nodes")
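The interaction between chunk_size and chunk_overlap can be illustrated with a small stdlib-only sketch. It is not how SentenceSplitter is implemented (the real splitter respects sentence boundaries and counts tokens with a tokenizer); here whitespace-separated words stand in for tokens, and chunk_with_overlap is a hypothetical helper name.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: whitespace words stand in for model tokens.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so the last word of one chunk reappears as the first word of the next. A larger overlap keeps more context across boundaries at the cost of more chunks to embed.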
Persistent storage — StorageContext
Save and restore index data locally to avoid re-embedding documents.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage
docs = SimpleDirectoryReader("./docs").load_data()
# First run — build and save
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir="./storage")
print("Index saved")
# Subsequent runs — load from disk
storage_ctx = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_ctx)
print("Index loaded from disk")
query_engine = index.as_query_engine()
print(query_engine.query("Summarise the key findings."))
Retrieval modes and response synthesis
LlamaIndex separates retrieval (which chunks to fetch) from synthesis (how to generate the answer). Response modes control synthesis strategy:
| Mode | Behaviour |
|---|---|
| compact | Concatenate chunks into the fewest prompts possible, then refine across them |
| refine | Iteratively refine the answer over each chunk (one LLM call per chunk) |
| tree_summarize | Recursive summarisation tree (best for long docs) |
| simple_summarize | Truncate all chunks to fit a single prompt (fast, may lose detail) |
| no_text | Return retrieved nodes without generating a response |
# Retrieve top-8 nodes, use tree_summarize for a long summary
engine = index.as_query_engine(
similarity_top_k=8,
response_mode="tree_summarize",
)
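To make the difference between refine and compact concrete, here is a stdlib-only sketch of the packing idea. It is an illustration under simplifying assumptions, not LlamaIndex's synthesiser: the budget is measured in characters rather than tokens, prompt template overhead is ignored, and pack_chunks is a hypothetical helper.

```python
def pack_chunks(chunks: list[str], max_chars: int, sep: str = "\n\n") -> list[str]:
    """Greedily merge consecutive chunks so each merged prompt stays within max_chars."""
    packed: list[str] = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + sep + chunk
        if len(candidate) <= max_chars or not current:
            # Fits in the current prompt (an oversized single chunk is kept on its own)
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


chunks = ["a" * 40, "b" * 40, "c" * 40, "d" * 40, "e" * 40]
prompts = pack_chunks(chunks, max_chars=100)
print(f"refine-style calls: {len(chunks)}, compact-style calls: {len(prompts)}")
```

With five 40-character chunks and a 100-character budget, two chunks fit per prompt, so the compact-style path needs three calls where a per-chunk refine loop needs five.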
Routing and sub-question query engines
Route queries to different indexes based on content type, or decompose complex questions into sub-questions each answered by a separate engine.
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
# Two separate indexes
engine_papers = index_papers.as_query_engine()
engine_blogs = index_blogs.as_query_engine()
tools = [
QueryEngineTool(engine_papers, metadata=ToolMetadata(
name="papers", description="Academic papers on transformers"
)),
QueryEngineTool(engine_blogs, metadata=ToolMetadata(
name="blogs", description="Blog posts explaining ML concepts"
)),
]
sub_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_engine.query(
"Compare how academic papers and blog posts describe the transformer attention mechanism."
)
print(response)
Agents and tool use
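LlamaIndex agents combine query engines, function tools, and the LLM into a reasoning loop. ReActAgent drives the loop with prompted thought/action/observation steps, so it works with any LLM; OpenAIAgent and the provider-agnostic FunctionCallingAgent rely on the model's native tool calling.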
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool, ToolMetadata
from llama_index.llms.anthropic import Anthropic
import os
def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b
multiply_tool = FunctionTool.from_defaults(fn=multiply)
docs_tool = QueryEngineTool(
query_engine=engine,
metadata=ToolMetadata(name="research_docs", description="Research papers on AI"),
)
agent = ReActAgent.from_tools(
[multiply_tool, docs_tool],
llm=Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"]),
verbose=True,
)
response = agent.chat(
"What is 42 * 17, and what does the research say about attention mechanisms?"
)
print(response)
Evaluation
LlamaIndex has a built-in evaluation module for RAG quality measurement.
from llama_index.core.evaluation import (
FaithfulnessEvaluator,
RelevancyEvaluator,
BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI
import os
llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
relevancy_eval = RelevancyEvaluator(llm=llm)
response = query_engine.query("What is self-attention?")
faith_result = faithfulness_eval.evaluate_response(response=response)
relev_result = relevancy_eval.evaluate_response(query="What is self-attention?", response=response)
print(f"Faithful: {faith_result.passing} score={faith_result.score:.2f}")
print(f"Relevant: {relev_result.passing} score={relev_result.score:.2f}")
Output:
Faithful: True score=1.00
Relevant: True score=0.95
Index types — beyond vector
LlamaIndex offers several index classes, each optimised for a different query pattern. Vector is the default, but the others are useful for specific shapes of data.
| Index | Best for | Storage |
|---|---|---|
| VectorStoreIndex | Semantic similarity / RAG | Embeddings + nodes |
| SummaryIndex | "Summarise everything" queries; traverses all nodes | Linear list of nodes |
| KeywordTableIndex | Exact-keyword lookup (older docs, structured glossaries) | Keyword → node map |
| TreeIndex | Hierarchical summarisation for long documents | Recursive summary tree |
| KnowledgeGraphIndex | Entity-relation queries ("how is X connected to Y?") | Triples extracted by LLM |
| DocumentSummaryIndex | Routing across many documents by summary | Per-doc summary + vector |
from llama_index.core import (
SummaryIndex, KeywordTableIndex, TreeIndex,
KnowledgeGraphIndex, DocumentSummaryIndex,
SimpleDirectoryReader,
)
docs = SimpleDirectoryReader("./docs").load_data()
# Summary index — every chunk is read for every query
summary_idx = SummaryIndex.from_documents(docs)
# Keyword index — fast for exact terms
keyword_idx = KeywordTableIndex.from_documents(docs)
# Tree index — recursive summarisation, good for long single docs
tree_idx = TreeIndex.from_documents(docs, num_children=4)
Output: (none — exits 0 on success)
DocumentSummaryIndex — routing many docs
A DocumentSummaryIndex builds a summary per document and stores both the summary embeddings and the chunk-level nodes. Queries first match against summaries to pick relevant documents, then drill into chunks.
from llama_index.core import DocumentSummaryIndex
from llama_index.core.response_synthesizers import get_response_synthesizer
synth = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
doc_summary_idx = DocumentSummaryIndex.from_documents(
docs,
response_synthesizer=synth,
show_progress=True,
)
# Use the summary-aware retriever
engine = doc_summary_idx.as_query_engine()
response = engine.query("Which of these documents discusses positional encoding?")
print(response.response)
Output:
The document "attention_paper.pdf" introduces sinusoidal positional encoding.
"transformers_overview.md" discusses both sinusoidal and learned positional encodings.
Retrievers — splitting fetch from synthesis
A retriever returns nodes; a query engine wraps a retriever with a response synthesiser. Sometimes you want the retrieval step alone — for reranking, hybrid search, or feeding into another component.
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
retriever = VectorIndexRetriever(index=index, similarity_top_k=20)
nodes = retriever.retrieve("What is multi-head attention?")
for n in nodes[:5]:
    print(f" [{n.score:.3f}] {n.text[:60]}...")
# Build a query engine from a custom retriever + filter
engine = RetrieverQueryEngine.from_args(
retriever=retriever,
node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
response_mode="compact",
)
print(engine.query("What is multi-head attention?"))
Output:
[0.872] Multi-head attention runs h attention operations in parallel...
[0.851] Each head learns a different subspace projection of Q, K, V...
[0.834] The outputs of all heads are concatenated and projected once more...
[0.802] Multi-head attention was introduced in "Attention Is All You Need"...
[0.781] In BERT, the base model uses 12 attention heads per layer...
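What a vector retriever does with similarity_top_k can be sketched without any library: score every stored vector against the query vector and keep the k best. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), and cosine and top_k are hypothetical helpers, not LlamaIndex functions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings": invented values, for illustration only.
store = {
    "multi-head attention": [0.9, 0.1, 0.0],
    "positional encoding": [0.1, 0.9, 0.0],
    "grep tutorial": [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.2, 0.0], store, k=2)
for score, text in hits:
    print(f"[{score:.3f}] {text}")
```

A production vector store replaces the linear scan with an approximate nearest-neighbour index, but the contract is the same: scored nodes, best first, truncated to k.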
Postprocessors — rerank and filter
A postprocessor runs after retrieval and before synthesis. It can rerank, filter on similarity, deduplicate, or call a cross-encoder model.
from llama_index.core.postprocessor import (
SimilarityPostprocessor,
KeywordNodePostprocessor,
LongContextReorder,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank
import os
postprocessors = [
SimilarityPostprocessor(similarity_cutoff=0.65), # drop weak matches
KeywordNodePostprocessor(required_keywords=["attention"]),
LongContextReorder(), # put best at start/end
CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=5, model="rerank-english-v3.0"),
]
engine = index.as_query_engine(similarity_top_k=20, node_postprocessors=postprocessors)
print(engine.query("How does positional encoding interact with attention?"))
Output:
Positional encoding is added to token embeddings before they enter the attention
layer. The attention mechanism itself has no notion of order — the positional
encoding gives each query/key pair a position-dependent component...
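Two of the steps above, the similarity cutoff and the "best at start/end" reordering, are simple enough to sketch in plain Python. This is an approximation of the behaviour described, not the library's code; ScoredChunk, similarity_cutoff, and reorder_for_long_context are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float


def similarity_cutoff(nodes: list[ScoredChunk], cutoff: float) -> list[ScoredChunk]:
    """Drop chunks whose similarity score falls below the cutoff."""
    return [n for n in nodes if n.score >= cutoff]


def reorder_for_long_context(nodes: list[ScoredChunk]) -> list[ScoredChunk]:
    """Place the strongest chunks at the edges and the weakest in the middle."""
    ordered = sorted(nodes, key=lambda n: n.score)  # weakest first
    result: list[ScoredChunk] = []
    for i, node in enumerate(ordered):
        if i % 2 == 0:
            result.insert(0, node)  # push earlier (weaker) items toward the middle
        else:
            result.append(node)
    return result


retrieved = [
    ScoredChunk("A", 0.91), ScoredChunk("B", 0.84), ScoredChunk("C", 0.77),
    ScoredChunk("D", 0.70), ScoredChunk("E", 0.52),
]
kept = similarity_cutoff(retrieved, cutoff=0.65)  # E is dropped
final = reorder_for_long_context(kept)
print([n.text for n in final])
```

The printed order is B, D, C, A: the two highest-scoring chunks sit at the start and end of the context, where long-context models tend to attend most reliably, and the weaker ones are in the middle.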
Metadata filters — narrowing by structured fields
Documents carry metadata (file name, date, author, custom tags). Filters apply at retrieval time to narrow the search space.
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator
docs = [
Document(text="Pandas DataFrame operations", metadata={"section": "python", "year": 2024}),
Document(text="Polars eager vs lazy", metadata={"section": "python", "year": 2026}),
Document(text="grep tutorial", metadata={"section": "linux", "year": 2025}),
Document(text="ffmpeg recipes", metadata={"section": "linux", "year": 2026}),
]
index = VectorStoreIndex.from_documents(docs)
filters = MetadataFilters(
filters=[
MetadataFilter(key="section", value="python"),
MetadataFilter(key="year", value=2025, operator=FilterOperator.GT),
],
)
engine = index.as_query_engine(similarity_top_k=5, filters=filters)
print(engine.query("What about DataFrames?"))
Output:
Polars provides DataFrame-like operations with eager and lazy execution modes...
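The filter semantics can be checked by hand with a stdlib-only sketch: every filter must match (AND is the default condition), and each filter compares one metadata key with one operator. Filter and matches are hypothetical stand-ins for MetadataFilter and the store's internal check; the documents mirror the example above.

```python
import operator
from dataclasses import dataclass
from typing import Any

# Map operator symbols to comparison functions.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}


@dataclass
class Filter:
    key: str
    value: Any
    op: str = "=="


def matches(metadata: dict[str, Any], filters: list[Filter]) -> bool:
    """True only if every filter holds for this metadata dict (AND semantics)."""
    for f in filters:
        if f.key not in metadata:
            return False
        if not OPS[f.op](metadata[f.key], f.value):
            return False
    return True


docs = [
    {"text": "Pandas DataFrame operations", "metadata": {"section": "python", "year": 2024}},
    {"text": "Polars eager vs lazy", "metadata": {"section": "python", "year": 2026}},
    {"text": "grep tutorial", "metadata": {"section": "linux", "year": 2025}},
    {"text": "ffmpeg recipes", "metadata": {"section": "linux", "year": 2026}},
]
filters = [Filter("section", "python"), Filter("year", 2025, ">")]
candidates = [d["text"] for d in docs if matches(d["metadata"], filters)]
print(candidates)
```

Only the Polars document survives both filters, which is why the query about DataFrames is answered from it even though the Pandas document is the closer semantic match.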
Hybrid retrieval — vector + BM25
Hybrid retrieval combines dense (vector) and sparse (BM25) scores so the system handles both semantic and lexical queries well. Use QueryFusionRetriever to merge multiple retrievers.
from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
vector_ret = VectorIndexRetriever(index=index, similarity_top_k=10)
bm25_ret = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)
fusion = QueryFusionRetriever(
[vector_ret, bm25_ret],
similarity_top_k=10,
num_queries=4, # LLM-generated query variants (query expansion); set to 1 to disable
mode="reciprocal_rerank", # 'simple' or 'relative_score' also work
use_async=True,
)
nodes = fusion.retrieve("attention is all you need")
for n in nodes[:3]:
    print(f" [{n.score:.3f}] {n.text[:60]}...")
Output:
[0.0833] The transformer architecture, introduced in "Attention Is All You..."
[0.0814] Self-attention computes weighted sums where weights come from query...
[0.0667] Multi-head attention runs h parallel attention operations...
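The reciprocal_rerank mode is based on reciprocal rank fusion, which can be sketched in a few lines. This version follows the textbook formula, score = sum of 1 / (k + rank) with ranks starting at 1 and k = 60; the library's constants and rank offset may differ, and reciprocal_rank_fusion is a hypothetical helper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # high ranks contribute more
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from a dense and a sparse retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because only ranks are used, the dense and sparse scores never need to be on the same scale. Documents found by both retrievers (doc_b, doc_a) rise above those found by only one, and the fused scores are small numbers well below 1, as in the output above.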
Streaming responses
For chat UIs, stream tokens as they arrive instead of waiting for the full answer.
engine = index.as_query_engine(streaming=True, similarity_top_k=5)
streaming_response = engine.query("Explain attention in three paragraphs.")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()
Output:
Attention is a mechanism that allows a neural network to focus on different parts
of the input when producing each part of the output. Rather than compressing the
entire input into a fixed-size hidden state...
Async queries
All major engines have async equivalents (aquery, achat, aretrieve). Running independent queries concurrently with asyncio.gather overlaps the network waits, which can substantially improve throughput for batched evaluation or multi-tenant servers.
import asyncio
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine(use_async=True)
async def main():
    questions = [
        "What is self-attention?",
        "How does BERT differ from GPT?",
        "What is positional encoding?",
    ]
    results = await asyncio.gather(*(engine.aquery(q) for q in questions))
    for q, r in zip(questions, results):
        print(f"Q: {q}\nA: {r.response[:80]}\n")
asyncio.run(main())
Output:
Q: What is self-attention?
A: Self-attention is a mechanism where each token attends to every other token...
Q: How does BERT differ from GPT?
A: BERT is a bidirectional encoder trained with masked language modelling...
Q: What is positional encoding?
A: Positional encoding adds order information to token embeddings since...
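When the question list is long, an unbounded gather can hit provider rate limits. One common pattern, sketched here with the standard library only, is to cap concurrency with a semaphore. fake_query is a stand-in for engine.aquery and run_bounded is a hypothetical helper; neither is part of LlamaIndex.

```python
import asyncio


async def fake_query(question: str) -> str:
    """Stand-in for an async engine call; sleeps briefly instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


async def run_bounded(questions: list[str], limit: int) -> list[str]:
    """Run all queries concurrently, but never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)

    async def one(question: str) -> str:
        async with semaphore:  # wait here if `limit` queries are already in flight
            return await fake_query(question)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(one(q) for q in questions))


questions = [
    "What is self-attention?",
    "How does BERT differ from GPT?",
    "What is positional encoding?",
]
answers = asyncio.run(run_bounded(questions, limit=2))
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```

Swapping fake_query for a real aquery call keeps the structure unchanged; the limit is a tuning knob that depends on the provider's rate limits.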
Vector store integrations
LlamaIndex ships first-class integrations with major vector stores. The interface is uniform — swap one line to change backends.
# Qdrant
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vs = QdrantVectorStore(client=client, collection_name="docs")
# Weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore
import weaviate
wclient = weaviate.connect_to_local()
vs = WeaviateVectorStore(weaviate_client=wclient, index_name="Docs")
# pgvector
from llama_index.vector_stores.postgres import PGVectorStore
vs = PGVectorStore.from_params(
database="rag", host="localhost", password="...",
port=5432, user="postgres", table_name="llama_docs", embed_dim=768,
)
# Pinecone
from llama_index.vector_stores.pinecone import PineconeVectorStore
import os
import pinecone
pc = pinecone.Pinecone(api_key=os.environ["PINECONE_API_KEY"])
vs = PineconeVectorStore(pinecone_index=pc.Index("docs"))
Output: (none — exits 0 on success)
Ingestion pipelines — declarative transformation
An IngestionPipeline chains transformations (splitter, metadata extractor, embedding) and applies them to incoming documents. Pipelines can deduplicate, cache, and run incrementally.
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
TitleExtractor,
KeywordExtractor,
QuestionsAnsweredExtractor,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(chunk_size=512, chunk_overlap=64),
TitleExtractor(nodes=5), # LLM-derived title metadata
KeywordExtractor(keywords=8), # auto keywords
QuestionsAnsweredExtractor(questions=3), # questions the chunk answers
HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
],
docstore=SimpleDocumentStore(), # enables deduplication
cache=IngestionCache(), # skip already-processed nodes
)
nodes = pipeline.run(documents=docs, show_progress=True)
print(f"Produced {len(nodes)} nodes with auto-metadata")
print(nodes[0].metadata)
Output:
Produced 184 nodes with auto-metadata
{'file_name': 'attention_paper.pdf', 'document_title': 'Attention Is All You Need',
'excerpt_keywords': 'attention, transformer, encoder, decoder, query, key, value, softmax',
'questions_this_excerpt_can_answer': '1. What is self-attention?\n2. ...'}
Workflows — the v0.10+ orchestration primitive
A Workflow is an event-driven graph of steps with type-checked events. Workflows replace ad-hoc agent loops with a debuggable, branchable graph.
from llama_index.core.workflow import (
Workflow, StartEvent, StopEvent, Event, step,
)
from llama_index.llms.anthropic import Anthropic
import os
class RetrievedEvent(Event):
    query: str
    chunks: list[str]
class RagWorkflow(Workflow):
    @step
    async def retrieve(self, ev: StartEvent) -> RetrievedEvent:
        # `retriever` is any retriever built earlier, e.g. VectorIndexRetriever(index=index)
        nodes = await retriever.aretrieve(ev.query)
        # Carry the query forward in the event so the next step can read it
        return RetrievedEvent(query=ev.query, chunks=[n.text for n in nodes])
    @step
    async def synthesize(self, ev: RetrievedEvent) -> StopEvent:
        prompt = "Context:\n" + "\n".join(ev.chunks) + f"\n\nQuestion: {ev.query}"
        llm = Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
        answer = await llm.acomplete(prompt)
        return StopEvent(result=str(answer))
import asyncio
result = asyncio.run(RagWorkflow(timeout=30).run(query="What is positional encoding?"))
print(result)
Output:
Positional encoding is the technique used in transformers to inject information
about token order into the embeddings, since the self-attention mechanism itself
treats the input as an unordered set.
Multi-modal — text + images
LlamaIndex supports multi-modal retrieval — store images alongside text and retrieve both for a query.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
import os
docs = SimpleDirectoryReader("./mixed_media", recursive=True).load_data()
mm_index = MultiModalVectorStoreIndex.from_documents(docs)
mm_llm = OpenAIMultiModal(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"], max_new_tokens=512)
engine = mm_index.as_query_engine(multi_modal_llm=mm_llm, similarity_top_k=4, image_similarity_top_k=2)
response = engine.query("Show me figures explaining the encoder-decoder split.")
print(response.response)
for n in response.source_nodes:
    print(f" {n.metadata.get('file_name', '?')}")
Output:
The encoder-decoder split separates input encoding from output generation. Figure
1 of the original paper shows the encoder stack on the left and the decoder stack
on the right, with cross-attention connecting them.
attention_paper.pdf
fig1_arch.png
fig2_attention.png
Observability — Phoenix and Langfuse
LlamaIndex emits OpenTelemetry-style events. Connect them to Arize Phoenix (local) or Langfuse for trace inspection.
from llama_index.core import set_global_handler
# Phoenix (local, free)
set_global_handler("arize_phoenix", endpoint="http://localhost:6006")
# Or Langfuse instead (choose one handler per process).
# Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY in the environment.
set_global_handler("langfuse")
pip install arize-phoenix
phoenix serve &
python my_rag_app.py
Output:
🌍 Phoenix UI: http://localhost:6006/
📺 Phoenix collector: 127.0.0.1:6006
Real-world recipes
Recipe: incremental re-indexing of a docs folder
Use the ingestion pipeline's docstore to skip documents whose content hasn't changed.
from pathlib import Path
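from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Embedding model used as the final pipeline transformation
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
STORE = Path("./storage")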
docstore = SimpleDocumentStore.from_persist_dir(STORE) if STORE.exists() else SimpleDocumentStore()
cache = IngestionCache.from_persist_path(STORE / "cache.json") if (STORE / "cache.json").exists() else IngestionCache()
pipeline = IngestionPipeline(
transformations=[SentenceSplitter(chunk_size=512), embed_model],
docstore=docstore,
cache=cache,
)
docs = SimpleDirectoryReader("./docs").load_data(num_workers=4)
nodes = pipeline.run(documents=docs, show_progress=True)
STORE.mkdir(exist_ok=True)
docstore.persist(STORE / "docstore.json")
cache.persist(STORE / "cache.json")
print(f"Updated index — processed {len(nodes)} nodes (cached entries skipped)")
Output:
Updated index — processed 12 nodes (cached entries skipped)
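The skip-unchanged behaviour rests on comparing a content hash per document against the hash seen on the previous run. A stdlib-only sketch of that idea follows; it is not the docstore's actual implementation, and content_hash and select_changed are hypothetical helpers.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(docs: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return ids of new or modified documents and record their hashes in `seen`."""
    changed: list[str] = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen.get(doc_id) != digest:
            changed.append(doc_id)  # new document, or content differs from last run
            seen[doc_id] = digest
    return changed


seen_hashes: dict[str, str] = {}
first_run = select_changed({"a.md": "alpha", "b.md": "beta"}, seen_hashes)
second_run = select_changed({"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}, seen_hashes)
print(first_run)
print(second_run)
```

On the first run both files are processed; on the second, only the edited b.md and the new c.md are selected, so a.md is never re-embedded. Persisting the seen mapping between runs plays the role of docstore.persist above.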
Recipe: route question to domain-specific index
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool
python_engine = python_index.as_query_engine(similarity_top_k=5)
linux_engine = linux_index.as_query_engine(similarity_top_k=5)
tools = [
QueryEngineTool.from_defaults(
query_engine=python_engine,
name="python_docs",
description="Python language, packages, and ecosystem questions",
),
QueryEngineTool.from_defaults(
query_engine=linux_engine,
name="linux_docs",
description="Linux command-line, shell, and system administration",
),
]
router = RouterQueryEngine.from_defaults(query_engine_tools=tools)
print(router.query("How do I create a virtualenv?"))
print(router.query("How do I tail a log with grep?"))
Output:
Use `python -m venv .venv` then activate with `source .venv/bin/activate`...
Use `tail -f /var/log/syslog | grep --line-buffered ERROR` to filter as lines arrive...
Recipe: citation-style answers
Force the engine to include source citations inline using CitationQueryEngine.
from llama_index.core.query_engine import CitationQueryEngine
engine = CitationQueryEngine.from_args(
index,
similarity_top_k=6,
citation_chunk_size=512,
)
response = engine.query("What are the components of self-attention?")
print(response.response)
print("\nSources:")
for source in response.source_nodes:
    print(f" [{source.node.metadata.get('source_id', '?')}] {source.node.text[:60]}...")
Output:
Self-attention has three components: query, key, and value projections [1]. The
attention weights come from the softmax of query-key dot products scaled by sqrt(d_k) [2].
Sources:
[1] Multi-head attention defines Q = XW_q, K = XW_k, V = XW_v...
[2] Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V...
Recipe: golden-set evaluation in CI
from llama_index.core.evaluation import (
FaithfulnessEvaluator, RelevancyEvaluator,
)
from llama_index.llms.openai import OpenAI
import os
llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
faith = FaithfulnessEvaluator(llm=llm)
relev = RelevancyEvaluator(llm=llm)
GOLDEN = ["What is multi-head attention?", "Why scaled dot-product?"]
THRESHOLD = 0.80
for q in GOLDEN:
    r = engine.query(q)
    f = faith.evaluate_response(response=r).score
    v = relev.evaluate_response(query=q, response=r).score
    assert f >= THRESHOLD, f"FAIL faith {f:.2f} on: {q}"
    assert v >= THRESHOLD, f"FAIL relev {v:.2f} on: {q}"
print("All golden queries pass")
pytest tests/test_eval_golden.py -q
Output:
. [100%]
1 passed in 18.42s
Performance and reliability tips
- For very large corpora (> 100 k chunks), use a real vector store (Qdrant, Weaviate, pgvector). The default in-memory store is for development only.
- A similarity_top_k of 5–10 is a reasonable starting point for most RAG tasks. Higher values increase synthesis cost and can dilute answer focus.
- Use IngestionCache + SimpleDocumentStore to skip re-embedding unchanged documents. Embedding is usually the most expensive ingestion step.
- Enable streaming=True on user-facing chat to reduce perceived latency: the first tokens appear before the full answer has been generated.
- Use Settings.callback_manager = CallbackManager([TokenCountingHandler()]) to monitor token spend per query.
- For Qdrant / Weaviate, persist the vector store outside the LlamaIndex storage_context and pass it in; this lets you re-create the index pointer cheaply.
Quick reference
| Task | Code |
|---|---|
| Load directory | SimpleDirectoryReader("./docs").load_data() |
| Build index | VectorStoreIndex.from_documents(docs) |
| Summary index | SummaryIndex.from_documents(docs) |
| Doc summary index | DocumentSummaryIndex.from_documents(docs) |
| Knowledge graph | KnowledgeGraphIndex.from_documents(docs, max_triplets_per_chunk=2) |
| Query engine | index.as_query_engine(similarity_top_k=5) |
| Chat engine | index.as_chat_engine(chat_mode="condense_plus_context") |
| Streaming | index.as_query_engine(streaming=True) then iterate response.response_gen |
| Async query | await engine.aquery("...") |
| Custom retriever | VectorIndexRetriever(index=index, similarity_top_k=20) |
| Rerank | node_postprocessors=[CohereRerank(top_n=5)] |
| Hybrid (vec+BM25) | QueryFusionRetriever([vec, bm25], mode="reciprocal_rerank") |
| Metadata filter | MetadataFilters(filters=[MetadataFilter(key="section", value="python")]) |
| Set LLM | Settings.llm = Anthropic(...) |
| Set embeddings | Settings.embed_model = HuggingFaceEmbedding("model") |
| Chunk size | Settings.chunk_size = 512 |
| Save index | index.storage_context.persist("./storage") |
| Load index | load_index_from_storage(StorageContext.from_defaults(...)) |
| Source nodes | response.source_nodes |
| Citations | CitationQueryEngine.from_args(index) |
| ChromaDB store | ChromaVectorStore(chroma_collection=collection) |
| Qdrant store | QdrantVectorStore(client=qclient, collection_name="docs") |
| Ingestion pipeline | IngestionPipeline(transformations=[...], docstore=..., cache=...) |
| Sub-questions | SubQuestionQueryEngine.from_defaults(tools) |
| Router | RouterQueryEngine.from_defaults(query_engine_tools=...) |
| ReAct agent | ReActAgent.from_tools(tools, llm=..., verbose=True) |
| Multi-modal | MultiModalVectorStoreIndex.from_documents(docs) |
| Observability | set_global_handler("arize_phoenix") |