cheat sheet

LlamaIndex

Build RAG pipelines and LLM-powered data applications with LlamaIndex. Covers document loading, indexing, query engines, custom LLMs and embeddings, persistent storage, and agents.

LlamaIndex — Data Framework for LLMs

What it is

LlamaIndex (formerly GPT Index) is a data framework for building LLM-powered applications over your own data. Its core workflow is: load documents with readers → index them (chunk, embed, store) → query with natural language. LlamaIndex provides a higher-level abstraction than LangChain for document-centric RAG: it handles chunking strategies, index types, and retrieval pipelines out of the box, while still allowing deep customisation of every component.

Install

bash
pip install llama-index
pip install llama-index-llms-openai          # OpenAI LLMs
pip install llama-index-llms-anthropic       # Anthropic Claude
pip install llama-index-embeddings-huggingface  # HuggingFace embeddings
pip install llama-index-vector-stores-chroma    # ChromaDB vector store

Output: (none — exits 0 on success)

Quick example

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
import os

os.environ["OPENAI_API_KEY"] = "sk-..."   # or set in environment

# Load documents from a directory
documents = SimpleDirectoryReader("./docs").load_data()

# Build index (chunks + embeds + stores in memory)
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics covered in these documents?")
print(response)

Output:

text
The documents cover three main topics: transformer architecture, attention mechanisms,
and the shift from recurrent neural networks to self-attention for sequence modelling.
Key themes include positional encoding, multi-head attention, and encoder-decoder design.

When / why to use it

  • Building RAG systems over your own documents (PDFs, Notion, Google Docs, databases).
  • When you want structured indexing strategies (vector, keyword, knowledge graph) rather than building the retrieval layer from scratch.
  • Multi-document reasoning: sub-question decomposition, routing, knowledge graph traversal.
  • When the primary workflow is document → query rather than tool-calling agents.
  • Custom embedding models and LLMs without changing the query interface.

Common pitfalls

Settings must be configured before indexing — if you swap the default LLM or embedding model via Settings.llm / Settings.embed_model, do it before calling VectorStoreIndex.from_documents(). Objects created before the setting change use the old model.

Default model is text-embedding-ada-002 — without explicit configuration, LlamaIndex calls OpenAI's embedding API. If you have no OpenAI key or want to use a local model, configure Settings.embed_model first.

Re-indexing cost: VectorStoreIndex.from_documents() embeds every chunk via the embedding API. For large document sets, use a persistent vector store and check whether documents already exist before re-embedding.

Use index.storage_context.persist("./storage") to save index data locally. Load it later with StorageContext.from_defaults(persist_dir="./storage") to avoid re-embedding.

response.source_nodes contains the retrieved chunks that grounded the answer. Print them with scores to understand retrieval quality: for n in response.source_nodes: print(n.score, n.text[:100]).

Richer example — persistent ChromaDB index

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.anthropic import Anthropic
import chromadb
import os

# Configure LLM and embeddings (no OpenAI required)
Settings.llm = Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512
Settings.chunk_overlap = 64

# ChromaDB persistent storage
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("research_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx = StorageContext.from_defaults(vector_store=vector_store)

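# Index documents. from_documents() embeds and inserts every document on each run;
# once the collection is populated, reconnect without re-embedding via
# VectorStoreIndex.from_vector_store(vector_store).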
documents = SimpleDirectoryReader("./research").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_ctx)

# Query
engine = index.as_query_engine(similarity_top_k=5)
response = engine.query("What evidence supports the claim that attention is sufficient?")
print(response.response)
print(f"\nSources used: {len(response.source_nodes)}")
for node in response.source_nodes:
    print(f"  [{node.score:.3f}] {node.metadata.get('file_name', '?')}: {node.text[:60]}...")

Settings — global configuration

Settings is a singleton that configures the default LLM, embedding model, chunk size, and other pipeline parameters. All index and engine objects created after a Settings change use the new values.

python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os

# Use Anthropic + local HuggingFace embeddings
Settings.llm = Anthropic(
    model="claude-sonnet-4-6",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=1024,
)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    device="cpu",
)
Settings.chunk_size = 512
Settings.chunk_overlap = 50
Settings.num_output = 512

Document loaders — SimpleDirectoryReader

SimpleDirectoryReader reads files from a local directory (PDF, TXT, DOCX, Markdown, CSV, HTML) using format-specific parsers. It returns a list of Document objects.

python
from llama_index.core import SimpleDirectoryReader

# Load all supported files from a directory
docs = SimpleDirectoryReader("./docs").load_data()
print(f"Loaded {len(docs)} documents")
for d in docs[:2]:
    print(d.metadata["file_name"], len(d.text), "chars")

# Load specific file types only
docs = SimpleDirectoryReader("./docs", required_exts=[".pdf", ".md"]).load_data()

# Load recursively
docs = SimpleDirectoryReader("./docs", recursive=True).load_data()

Output:

text
Loaded 8 documents
attention_paper.pdf 43892 chars
transformers_overview.md 12041 chars

Other loaders (install separately): llama-index-readers-web, llama-index-readers-notion, llama-index-readers-google (Drive, Docs), llama-index-readers-database.

Query engine vs chat engine

A query engine answers single questions grounded in retrieved documents. A chat engine maintains conversation history and allows follow-up questions that reference prior answers.

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# Query engine — single-turn Q&A
query_engine = index.as_query_engine(similarity_top_k=4, response_mode="compact")
r = query_engine.query("What is multi-head attention?")
print(r.response[:200])

# Chat engine — multi-turn conversation
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
r1 = chat_engine.chat("What is multi-head attention?")
print(r1.response[:120])
r2 = chat_engine.chat("How many heads does BERT use?")
print(r2.response[:120])  # knows "heads" refers to attention heads from prior turn

Node parsers and chunking

Node parsers split documents into chunks (nodes). The default SentenceSplitter cuts on sentence boundaries up to chunk_size tokens. More options:

python
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    MarkdownNodeParser,
    CodeSplitter,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Default — sentence-aware chunking
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# Semantic chunking: splits where the topic changes (embeds sentence groups, so it is slower than SentenceSplitter)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

# Markdown-aware (preserves headers as metadata)
md_splitter = MarkdownNodeParser()

# Apply manually
from llama_index.core import SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
nodes = splitter.get_nodes_from_documents(docs)
print(f"Split into {len(nodes)} nodes")
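To make chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of sentence-aware chunking with overlap. It is not SentenceSplitter's implementation: it counts whitespace-separated words instead of tokens, and the function name chunk_sentences and the sample text are hypothetical.

```python
import re


def chunk_sentences(text: str, chunk_size: int = 20, chunk_overlap: int = 5) -> list[str]:
    """Greedy sentence packing: fill a chunk up to chunk_size words, then start
    the next chunk with trailing sentences worth at most chunk_overlap words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is used up.
            overlap: list[str] = []
            overlap_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if overlap_len + prev_len > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += prev_len
            current, current_len = overlap, overlap_len
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Attention lets a model weigh tokens. "
    "Each head learns a different projection. "
    "Heads run in parallel. "
    "Their outputs are concatenated. "
    "A final linear layer mixes them."
)
for i, chunk in enumerate(chunk_sentences(sample, chunk_size=12, chunk_overlap=4)):
    print(i, chunk)
```

With these settings the sample splits into three chunks, and the sentence "Their outputs are concatenated." appears at the end of the second chunk and again at the start of the third. That repetition is the overlap: it keeps context that straddles a chunk boundary retrievable from either side.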

Persistent storage — StorageContext

Save and restore index data locally to avoid re-embedding documents.

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage

docs = SimpleDirectoryReader("./docs").load_data()

# First run — build and save
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir="./storage")
print("Index saved")

# Subsequent runs — load from disk
storage_ctx = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_ctx)
print("Index loaded from disk")

query_engine = index.as_query_engine()
print(query_engine.query("Summarise the key findings."))

Retrieval modes and response synthesis

LlamaIndex separates retrieval (which chunks to fetch) from synthesis (how to generate the answer). Response modes control synthesis strategy:

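| Mode | Behaviour |
| --- | --- |
| compact | Concatenate chunks into the fewest prompts possible, then refine across them |
| refine | Iteratively refine the answer over each chunk (one LLM call per chunk) |
| tree_summarize | Recursive summarisation tree (best for long docs) |
| simple_summarize | Truncate the chunks so they fit a single prompt (fast, may lose detail) |
| no_text | Return retrieved nodes without generating a response |
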
python
# Retrieve top-8 nodes, use tree_summarize for a long summary
engine = index.as_query_engine(
    similarity_top_k=8,
    response_mode="tree_summarize",
)
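The practical difference between refine and compact is how many LLM calls are made and what each call sees. The sketch below imitates both strategies with a stub model. It is an illustration under assumptions, not LlamaIndex's synthesiser code: fake_llm is a hypothetical stand-in that only records prompts, and a character budget stands in for the context window.

```python
from typing import Callable


def refine(chunks: list[str], question: str, llm: Callable[[str], str]) -> str:
    """One call per chunk; each call sees the previous answer plus one new chunk."""
    answer = ""
    for chunk in chunks:
        prompt = f"Question: {question}\nExisting answer: {answer}\nNew context: {chunk}"
        answer = llm(prompt)
    return answer


def compact(chunks: list[str], question: str, llm: Callable[[str], str], budget: int) -> str:
    """Pack as many chunks as fit into `budget` characters, then refine over the packs."""
    packed: list[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > budget:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return refine(packed, question, llm)


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: record the prompt and return a marker string.
    calls.append(prompt)
    return f"answer after {len(calls)} call(s)"


chunks = ["chunk A " * 5, "chunk B " * 5, "chunk C " * 5]  # 40 characters each

calls.clear()
refine(chunks, "What is attention?", fake_llm)
print("refine calls:", len(calls))

calls.clear()
compact(chunks, "What is attention?", fake_llm, budget=100)
print("compact calls:", len(calls))
```

Here refine makes three calls (one per chunk), while compact fits the first two chunks into one prompt and needs only two. With a budget large enough for all three chunks, compact collapses to a single call.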

Routing and sub-question query engines

Route queries to different indexes based on content type, or decompose complex questions into sub-questions each answered by a separate engine.

python
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Two separate indexes
engine_papers = index_papers.as_query_engine()
engine_blogs   = index_blogs.as_query_engine()

tools = [
    QueryEngineTool(engine_papers, metadata=ToolMetadata(
        name="papers", description="Academic papers on transformers"
    )),
    QueryEngineTool(engine_blogs, metadata=ToolMetadata(
        name="blogs", description="Blog posts explaining ML concepts"
    )),
]

sub_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_engine.query(
    "Compare how academic papers and blog posts describe the transformer attention mechanism."
)
print(response)

Agents and tool use

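LlamaIndex agents combine query engines, function tools, and the LLM into a reasoning loop. ReActAgent prompts the model through thought / action / observation steps, so it works with any LLM; OpenAIAgent (from the llama-index-agent-openai package) and FunctionCallingAgent rely on the model's native tool calling, which also covers Anthropic models. Agent class names and constructors have changed between releases, so check them against your installed version.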

python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool, ToolMetadata
from llama_index.llms.anthropic import Anthropic
import os

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
docs_tool = QueryEngineTool(
    query_engine=engine,
    metadata=ToolMetadata(name="research_docs", description="Research papers on AI"),
)

agent = ReActAgent.from_tools(
    [multiply_tool, docs_tool],
    llm=Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"]),
    verbose=True,
)

response = agent.chat(
    "What is 42 * 17, and what does the research say about attention mechanisms?"
)
print(response)

Evaluation

LlamaIndex has a built-in evaluation module for RAG quality measurement.

python
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI
import os

llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
relevancy_eval    = RelevancyEvaluator(llm=llm)

response = query_engine.query("What is self-attention?")
faith_result = faithfulness_eval.evaluate_response(response=response)
relev_result = relevancy_eval.evaluate_response(query="What is self-attention?", response=response)

print(f"Faithful: {faith_result.passing}  score={faith_result.score:.2f}")
print(f"Relevant: {relev_result.passing}  score={relev_result.score:.2f}")

Output:

text
Faithful: True  score=1.00
Relevant: True  score=1.00

Index types — beyond vector

LlamaIndex offers several index classes, each optimised for a different query pattern. Vector is the default, but the others are useful for specific shapes of data.

| Index | Best for | Storage |
| --- | --- | --- |
| VectorStoreIndex | Semantic similarity / RAG | Embeddings + nodes |
| SummaryIndex | "Summarise everything" queries; traverses all nodes | Linear list of nodes |
| KeywordTableIndex | Exact-keyword lookup (older docs, structured glossaries) | Keyword → node map |
| TreeIndex | Hierarchical summarisation for long documents | Recursive summary tree |
| KnowledgeGraphIndex | Entity-relation queries ("how is X connected to Y?") | Triples extracted by LLM |
| DocumentSummaryIndex | Routing across many documents by summary | Per-doc summary + vector |

python
from llama_index.core import (
    SummaryIndex, KeywordTableIndex, TreeIndex,
    KnowledgeGraphIndex, DocumentSummaryIndex,
    SimpleDirectoryReader,
)

docs = SimpleDirectoryReader("./docs").load_data()

# Summary index — every chunk is read for every query
summary_idx = SummaryIndex.from_documents(docs)

# Keyword index — fast for exact terms
keyword_idx = KeywordTableIndex.from_documents(docs)

# Tree index — recursive summarisation, good for long single docs
tree_idx = TreeIndex.from_documents(docs, num_children=4)

Output: (none — exits 0 on success)

DocumentSummaryIndex — routing many docs

A DocumentSummaryIndex builds a summary per document and stores both the summary embeddings and the chunk-level nodes. Queries first match against summaries to pick relevant documents, then drill into chunks.

python
from llama_index.core import DocumentSummaryIndex
from llama_index.core.response_synthesizers import get_response_synthesizer

synth = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
doc_summary_idx = DocumentSummaryIndex.from_documents(
    docs,
    response_synthesizer=synth,
    show_progress=True,
)

# Use the summary-aware retriever
engine = doc_summary_idx.as_query_engine()
response = engine.query("Which of these documents discusses positional encoding?")
print(response.response)

Output:

text
The document "attention_paper.pdf" introduces sinusoidal positional encoding.
"transformers_overview.md" discusses both sinusoidal and learned positional encodings.

Retrievers — splitting fetch from synthesis

A retriever returns nodes; a query engine wraps a retriever with a response synthesiser. Sometimes you want the retrieval step alone — for reranking, hybrid search, or feeding into another component.

python
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

retriever = VectorIndexRetriever(index=index, similarity_top_k=20)
nodes = retriever.retrieve("What is multi-head attention?")
for n in nodes[:5]:
    print(f"  [{n.score:.3f}] {n.text[:60]}...")

# Build a query engine from a custom retriever + filter
engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
    response_mode="compact",
)
print(engine.query("What is multi-head attention?"))

Output:

text
  [0.872] Multi-head attention runs h attention operations in parallel...
  [0.851] Each head learns a different subspace projection of Q, K, V...
  [0.834] The outputs of all heads are concatenated and projected once more...
  [0.802] Multi-head attention was introduced in "Attention Is All You Need"...
  [0.781] In BERT, the base model uses 12 attention heads per layer...
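In miniature, a vector retriever scores every stored vector against the query vector and keeps the best k; a similarity cutoff then drops weak matches. The sketch below shows that logic with hand-written 3-dimensional toy vectors. The names cosine and retrieve, and the vectors themselves, are assumptions for illustration, not LlamaIndex internals.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, top_k=2, cutoff=None):
    """Return (score, text) pairs sorted by descending similarity."""
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in store), reverse=True)
    if cutoff is not None:
        # Same idea as a similarity cutoff postprocessor: drop weak matches.
        scored = [(s, t) for s, t in scored if s >= cutoff]
    return scored[:top_k]


store = [
    ("multi-head attention", [0.9, 0.1, 0.0]),
    ("positional encoding", [0.1, 0.9, 0.0]),
    ("grep tutorial", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]

for score, text in retrieve(query, store, top_k=2):
    print(f"[{score:.3f}] {text}")
```

The query vector points almost in the same direction as the first entry, so "multi-head attention" ranks first and "positional encoding" second. Passing cutoff=0.7 would leave only the first match, which is why a generous similarity_top_k combined with a cutoff is a common pairing.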

Postprocessors — rerank and filter

A postprocessor runs after retrieval and before synthesis. It can rerank, filter on similarity, deduplicate, or call a cross-encoder model.

python
from llama_index.core.postprocessor import (
    SimilarityPostprocessor,
    KeywordNodePostprocessor,
    LongContextReorder,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank
import os

postprocessors = [
    SimilarityPostprocessor(similarity_cutoff=0.65),   # drop weak matches
    KeywordNodePostprocessor(required_keywords=["attention"]),
    CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=5, model="rerank-english-v3.0"),
    LongContextReorder(),                              # run last: put best at start/end
]

engine = index.as_query_engine(similarity_top_k=20, node_postprocessors=postprocessors)
print(engine.query("How does positional encoding interact with attention?"))

Output:

text
Positional encoding is added to token embeddings before they enter the attention
layer. The attention mechanism itself has no notion of order — the positional
encoding gives each query/key pair a position-dependent component...
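The "best at start/end" reordering exists because long prompts tend to be used best near their edges. One way to get that layout is to walk the nodes from weakest to strongest and alternately push them to the front and the back. The sketch below illustrates that idea; the function name is hypothetical and it is not guaranteed to match LongContextReorder line for line.

```python
def reorder_for_long_context(scored: list[tuple[float, str]]) -> list[tuple[float, str]]:
    """Place the strongest items at both ends and the weakest in the middle."""
    result: list[tuple[float, str]] = []
    # Walk from weakest to strongest, alternating front and back insertion,
    # so the later (stronger) items end up closest to the edges.
    for i, item in enumerate(sorted(scored, key=lambda pair: pair[0])):
        if i % 2 == 0:
            result.insert(0, item)
        else:
            result.append(item)
    return result


scored = [(0.9, "a"), (0.8, "b"), (0.7, "c"), (0.6, "d"), (0.5, "e")]
print([text for _, text in reorder_for_long_context(scored)])
```

For the five items above the resulting order is a, c, e, d, b: the two strongest chunks sit at the edges and the weakest lands in the middle. Because any later reranker would overwrite this layout, a reordering step of this kind belongs at the end of the postprocessor list.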

Metadata filters — narrowing by structured fields

Documents carry metadata (file name, date, author, custom tags). Filters apply at retrieval time to narrow the search space.

python
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator

docs = [
    Document(text="Pandas DataFrame operations", metadata={"section": "python",     "year": 2024}),
    Document(text="Polars eager vs lazy",       metadata={"section": "python",     "year": 2026}),
    Document(text="grep tutorial",              metadata={"section": "linux",      "year": 2025}),
    Document(text="ffmpeg recipes",             metadata={"section": "linux",      "year": 2026}),
]
index = VectorStoreIndex.from_documents(docs)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="section", value="python"),
        MetadataFilter(key="year", value=2025, operator=FilterOperator.GT),
    ],
)
engine = index.as_query_engine(similarity_top_k=5, filters=filters)
print(engine.query("What about DataFrames?"))

Output:

text
Polars provides DataFrame-like operations with eager and lazy execution modes...
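The filter semantics can be spelled out without any vector store: each filter compares one metadata field against a value, and the filters are combined with AND. The sketch below applies the same two filters to the same four records. Filter, OPS, and matches are hypothetical names for illustration and are not the LlamaIndex classes.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable

OPS: dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
}


@dataclass
class Filter:
    key: str
    value: Any
    op: str = "=="


def matches(metadata: dict[str, Any], filters: list[Filter]) -> bool:
    """AND semantics: every filter must hold; a missing key never matches."""
    return all(
        f.key in metadata and OPS[f.op](metadata[f.key], f.value) for f in filters
    )


docs = [
    {"text": "Pandas DataFrame operations", "section": "python", "year": 2024},
    {"text": "Polars eager vs lazy", "section": "python", "year": 2026},
    {"text": "grep tutorial", "section": "linux", "year": 2025},
    {"text": "ffmpeg recipes", "section": "linux", "year": 2026},
]
filters = [Filter("section", "python"), Filter("year", 2025, ">")]

survivors = [d["text"] for d in docs if matches(d, filters)]
print(survivors)
```

Only "Polars eager vs lazy" satisfies both conditions, which is why the answer above is about Polars even though the question mentions DataFrames: the Pandas document is excluded before similarity search ever sees it.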

Hybrid retrieval — vector + BM25

Hybrid retrieval combines dense (vector) and sparse (BM25) scores so the system handles both semantic and lexical queries well. Use QueryFusionRetriever to merge multiple retrievers.

python
from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

vector_ret = VectorIndexRetriever(index=index, similarity_top_k=10)
bm25_ret   = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)

fusion = QueryFusionRetriever(
    [vector_ret, bm25_ret],
    similarity_top_k=10,
    num_queries=4,                  # original query plus 3 LLM-generated variants (query expansion, not HyDE)
    mode="reciprocal_rerank",       # 'simple' or 'relative_score' also work
    use_async=True,
)

nodes = fusion.retrieve("attention is all you need")
for n in nodes[:3]:
    print(f"  [{n.score:.3f}] {n.text[:60]}...")

Output:

text
  [0.083] The transformer architecture, introduced in "Attention Is All You..."
  [0.081] Self-attention computes weighted sums where weights come from query...
  [0.067] Multi-head attention runs h parallel attention operations...
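Reciprocal rank fusion is simple enough to write out by hand, which helps when reading fused scores: each ranked list contributes 1 / (k + rank) for every document it contains, so only positions matter and raw scores from different retrievers never need to be comparable. The sketch below is a generic version of the technique; k = 60 is the constant commonly used in the literature, and whether LlamaIndex uses exactly the same constant and rank offset is an assumption to verify. The document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # dense retriever, best first
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # sparse retriever, best first

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"[{score:.4f}] {doc_id}")
```

Documents found by both retrievers (doc_a, doc_c) rise above those found by only one (doc_b, doc_d). Fused scores are small by construction, so they should be read as a ranking signal rather than as similarities.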

Streaming responses

For chat UIs, stream tokens as they arrive instead of waiting for the full answer.

python
engine = index.as_query_engine(streaming=True, similarity_top_k=5)
streaming_response = engine.query("Explain attention in three paragraphs.")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()

Output:

text
Attention is a mechanism that allows a neural network to focus on different parts
of the input when producing each part of the output. Rather than compressing the
entire input into a fixed-size hidden state...
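If the full text is also needed after streaming (for logging or caching), collect the pieces while printing them. The sketch below shows that consumption pattern against a stand-in generator; fake_token_stream is hypothetical and simply splits a fixed string, whereas a real response generator yields model tokens.

```python
import time
from typing import Iterator


def fake_token_stream(answer: str, delay: float = 0.0) -> Iterator[str]:
    """Yield an answer piece by piece, the way a streaming generator would."""
    for word in answer.split(" "):
        time.sleep(delay)  # stand-in for per-token model latency
        yield word + " "


pieces: list[str] = []
for token in fake_token_stream("Attention weighs every token against every other token."):
    print(token, end="", flush=True)  # show each piece as soon as it arrives
    pieces.append(token)              # keep it so the full answer can be rebuilt
print()

full_text = "".join(pieces).strip()
```

A generator can be consumed only once, so accumulating during the loop is the simplest way to have both the live display and the complete answer.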

Async queries

All major engines have async equivalents (aquery, achat, aretrieve). Issuing requests concurrently with asyncio.gather overlaps the time spent waiting on LLM and embedding calls, which raises throughput for batched evaluation or multi-tenant servers.

python
import asyncio
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine(use_async=True)

async def main():
    questions = [
        "What is self-attention?",
        "How does BERT differ from GPT?",
        "What is positional encoding?",
    ]
    results = await asyncio.gather(*(engine.aquery(q) for q in questions))
    for q, r in zip(questions, results):
        print(f"Q: {q}\nA: {r.response[:80]}\n")

asyncio.run(main())

Output:

text
Q: What is self-attention?
A: Self-attention is a mechanism where each token attends to every other token...

Q: How does BERT differ from GPT?
A: BERT is a bidirectional encoder trained with masked language modelling...

Q: What is positional encoding?
A: Positional encoding adds order information to token embeddings since...
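The throughput gain comes from overlapping waits, not from faster individual calls. The sketch below demonstrates this with a stand-in coroutine: fake_aquery is hypothetical and just sleeps for a fixed latency in place of a network round trip.

```python
import asyncio
import time


async def fake_aquery(question: str, latency: float = 0.2) -> str:
    # Stand-in for an async query: the sleep simulates waiting on a network call.
    await asyncio.sleep(latency)
    return f"answer to: {question}"


async def run_batch(questions: list[str]) -> tuple[list[str], float]:
    """Run all questions concurrently and report the wall-clock time."""
    start = time.perf_counter()
    answers = await asyncio.gather(*(fake_aquery(q) for q in questions))
    return list(answers), time.perf_counter() - start


questions = [
    "What is self-attention?",
    "How does BERT differ from GPT?",
    "What is positional encoding?",
]
answers, elapsed = asyncio.run(run_batch(questions))
print(f"{len(answers)} answers in {elapsed:.2f}s")
```

Three calls of 0.2 s each finish in roughly 0.2 s rather than 0.6 s, and asyncio.gather returns results in the order the coroutines were passed, so zipping questions with answers stays correct. Against a real provider, rate limits cap how far this scales.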

Vector store integrations

LlamaIndex ships first-class integrations with major vector stores. The interface is uniform — swap one line to change backends.

python
# Qdrant
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vs = QdrantVectorStore(client=client, collection_name="docs")

# Weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore
import weaviate
wclient = weaviate.connect_to_local()
vs = WeaviateVectorStore(weaviate_client=wclient, index_name="Docs")

# pgvector
from llama_index.vector_stores.postgres import PGVectorStore
vs = PGVectorStore.from_params(
    database="rag", host="localhost", password="...",
    port=5432, user="postgres", table_name="llama_docs", embed_dim=768,
)

# Pinecone
from llama_index.vector_stores.pinecone import PineconeVectorStore
import os
import pinecone
pc = pinecone.Pinecone(api_key=os.environ["PINECONE_API_KEY"])
vs = PineconeVectorStore(pinecone_index=pc.Index("docs"))

Output: (none — exits 0 on success)

Ingestion pipelines — declarative transformation

An IngestionPipeline chains transformations (splitter, metadata extractor, embedding) and applies them to incoming documents. Pipelines can deduplicate, cache, and run incrementally.

python
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(nodes=5),                              # LLM-derived title metadata
        KeywordExtractor(keywords=8),                         # auto keywords
        QuestionsAnsweredExtractor(questions=3),              # questions the chunk answers
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=SimpleDocumentStore(),                           # enables deduplication
    cache=IngestionCache(),                                   # skip already-processed nodes
)

nodes = pipeline.run(documents=docs, show_progress=True)
print(f"Produced {len(nodes)} nodes with auto-metadata")
print(nodes[0].metadata)

Output:

text
Produced 184 nodes with auto-metadata
{'file_name': 'attention_paper.pdf', 'document_title': 'Attention Is All You Need',
 'excerpt_keywords': 'attention, transformer, encoder, decoder, query, key, value, softmax',
 'questions_this_excerpt_can_answer': '1. What is self-attention?\n2. ...'}
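The deduplication a docstore enables boils down to remembering a content hash per document and skipping anything whose hash has not changed. The sketch below shows that bookkeeping in isolation. HashDocstore and run_pipeline are hypothetical names, and the real pipeline tracks more state than this.

```python
import hashlib


class HashDocstore:
    """Toy docstore: remembers one content hash per document id."""

    def __init__(self) -> None:
        self.hashes: dict[str, str] = {}

    def needs_processing(self, doc_id: str, text: str) -> bool:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return False  # unchanged: skip splitting and embedding
        self.hashes[doc_id] = digest
        return True       # new or modified: process it


def run_pipeline(docs: dict[str, str], store: HashDocstore) -> list[str]:
    """Return the ids of documents that actually need (re)processing."""
    return [doc_id for doc_id, text in docs.items() if store.needs_processing(doc_id, text)]


store = HashDocstore()
docs = {"a.md": "attention", "b.md": "transformers"}
print(run_pipeline(docs, store))  # first run: both documents are new

docs["b.md"] = "transformers, revised"
print(run_pipeline(docs, store))  # second run: only the edited document
```

The first run processes both documents and the second only b.md. The saving is lost if the hash table is not persisted between runs, which is why the docstore and cache need to be saved to disk alongside the vectors.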

Workflows — the v0.10+ orchestration primitive

A Workflow is an event-driven graph of steps with type-checked events. Workflows replace ad-hoc agent loops with a debuggable, branchable graph.

python
from llama_index.core.workflow import (
    Workflow, StartEvent, StopEvent, Event, step,
)
from llama_index.llms.anthropic import Anthropic
import os

class RetrievedEvent(Event):
    query: str          # carried forward so the next step can build the prompt
    chunks: list[str]

class RagWorkflow(Workflow):
    @step
    async def retrieve(self, ev: StartEvent) -> RetrievedEvent:
        # `retriever` is assumed to be defined elsewhere, e.g. index.as_retriever()
        nodes = await retriever.aretrieve(ev.query)
        return RetrievedEvent(query=ev.query, chunks=[n.text for n in nodes])

    @step
    async def synthesize(self, ev: RetrievedEvent) -> StopEvent:
        # Use the query carried on the event; the workflow itself does not store it
        prompt = "Context:\n" + "\n".join(ev.chunks) + f"\n\nQuestion: {ev.query}"
        llm = Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
        answer = await llm.acomplete(prompt)
        return StopEvent(result=str(answer))

import asyncio
result = asyncio.run(RagWorkflow(timeout=30).run(query="What is positional encoding?"))
print(result)

Output:

text
Positional encoding is the technique used in transformers to inject information
about token order into the embeddings, since the self-attention mechanism itself
treats the input as an unordered set.

Multi-modal — text + images

LlamaIndex supports multi-modal retrieval — store images alongside text and retrieve both for a query.

python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
import os

docs = SimpleDirectoryReader("./mixed_media", recursive=True).load_data()
mm_index = MultiModalVectorStoreIndex.from_documents(docs)

mm_llm = OpenAIMultiModal(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"], max_new_tokens=512)
engine = mm_index.as_query_engine(multi_modal_llm=mm_llm, similarity_top_k=4, image_similarity_top_k=2)

response = engine.query("Show me figures explaining the encoder-decoder split.")
print(response.response)
for n in response.source_nodes:
    print(f"  {n.metadata.get('file_name', '?')}")

Output:

text
The encoder-decoder split separates input encoding from output generation. Figure
1 of the original paper shows the encoder stack on the left and the decoder stack
on the right, with cross-attention connecting them.
  attention_paper.pdf
  fig1_arch.png
  fig2_attention.png

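Observability — Phoenix and Langfuse

LlamaIndex emits OpenTelemetry-style events. Connect them to Arize Phoenix (local) or a hosted tracing backend such as Langfuse for trace inspection. Each handler needs its own integration package (for example llama-index-callbacks-arize-phoenix or llama-index-callbacks-langfuse).

python
import os
from llama_index.core import set_global_handler

# Phoenix (local, free)
set_global_handler("arize_phoenix", endpoint="http://localhost:6006")

# Or Langfuse: credentials are read from LANGFUSE_* environment variables
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
set_global_handler("langfuse")     # other handler names include "wandb" and "simple"
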
bash
pip install arize-phoenix
phoenix serve &
python my_rag_app.py

Output:

text
🌍 Phoenix UI: http://localhost:6006/
📺 Phoenix collector: 127.0.0.1:6006

Real-world recipes

Recipe: incremental re-indexing of a docs folder

Use the ingestion pipeline's docstore to skip documents whose content hasn't changed.

python
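from pathlib import Path
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

STORE = Path("./storage")
# Reload previous state only if the persisted files actually exist
docstore = SimpleDocumentStore.from_persist_dir(str(STORE)) if (STORE / "docstore.json").exists() else SimpleDocumentStore()
cache    = IngestionCache.from_persist_path(str(STORE / "cache.json")) if (STORE / "cache.json").exists() else IngestionCache()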

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=512), embed_model],
    docstore=docstore,
    cache=cache,
)

docs = SimpleDirectoryReader("./docs").load_data(num_workers=4)
nodes = pipeline.run(documents=docs, show_progress=True)

STORE.mkdir(exist_ok=True)
docstore.persist(STORE / "docstore.json")
cache.persist(STORE / "cache.json")
print(f"Updated index — processed {len(nodes)} nodes (cached entries skipped)")

Output:

text
Updated index — processed 12 nodes (cached entries skipped)

Recipe: route question to domain-specific index

python
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool

python_engine = python_index.as_query_engine(similarity_top_k=5)
linux_engine  = linux_index.as_query_engine(similarity_top_k=5)

tools = [
    QueryEngineTool.from_defaults(
        query_engine=python_engine,
        name="python_docs",
        description="Python language, packages, and ecosystem questions",
    ),
    QueryEngineTool.from_defaults(
        query_engine=linux_engine,
        name="linux_docs",
        description="Linux command-line, shell, and system administration",
    ),
]
router = RouterQueryEngine.from_defaults(query_engine_tools=tools)
print(router.query("How do I create a virtualenv?"))
print(router.query("How do I tail a log with grep?"))

Output:

text
Use `python -m venv .venv` then activate with `source .venv/bin/activate`...
Use `tail -f /var/log/syslog | grep --line-buffered ERROR` to filter as lines arrive...

Recipe: citation-style answers

Force the engine to include source citations inline using CitationQueryEngine.

python
from llama_index.core.query_engine import CitationQueryEngine

engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=6,
    citation_chunk_size=512,
)
response = engine.query("What are the components of self-attention?")
print(response.response)
print("\nSources:")
# Citation numbers follow the order of response.source_nodes
for i, source in enumerate(response.source_nodes, start=1):
    print(f"  [{i}] {source.node.text[:60]}...")

Output:

text
Self-attention has three components: query, key, and value projections [1]. The
attention weights come from the softmax of query-key dot products scaled by sqrt(d_k) [2].

Sources:
  [1] Multi-head attention defines Q = XW_q, K = XW_k, V = XW_v...
  [2] Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V...

Recipe: golden-set evaluation in CI

python
from llama_index.core.evaluation import (
    FaithfulnessEvaluator, RelevancyEvaluator,
)
from llama_index.llms.openai import OpenAI
import os

llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
faith = FaithfulnessEvaluator(llm=llm)
relev = RelevancyEvaluator(llm=llm)

GOLDEN = ["What is multi-head attention?", "Why scaled dot-product?"]
THRESHOLD = 0.80

def test_golden_queries():
    # `engine` is the query engine under test; build or import it at module level
    for q in GOLDEN:
        r = engine.query(q)
        f = faith.evaluate_response(response=r).score
        v = relev.evaluate_response(query=q, response=r).score
        assert f >= THRESHOLD, f"FAIL faith {f:.2f} on: {q}"
        assert v >= THRESHOLD, f"FAIL relev {v:.2f} on: {q}"

bash
pytest tests/test_eval_golden.py -q

Output:

text
.                                                                        [100%]
1 passed in 18.42s

Performance and reliability tips

  • For very large corpora (> 100 k chunks), use a real vector store (Qdrant, Weaviate, pgvector). The default in-memory store is for development only.
  • similarity_top_k of 5–10 hits a sweet spot for most RAG tasks. Higher values increase synthesis cost and hurt answer focus.
  • Use IngestionCache + SimpleDocumentStore to skip re-embedding unchanged documents. Embedding is the most expensive step.
  • Enable streaming=True on user-facing chat to cut perceived latency: users see the first tokens while the rest of the answer is still being generated.
  • Use Settings.callback_manager = CallbackManager([TokenCountingHandler()]) to monitor token spend per query.
  • For Qdrant / Weaviate, persist the vector store outside the LlamaIndex storage_context and pass it in — this lets you re-create the index pointer cheaply.

Quick reference

| Task | Code |
| --- | --- |
| Load directory | SimpleDirectoryReader("./docs").load_data() |
| Build index | VectorStoreIndex.from_documents(docs) |
| Summary index | SummaryIndex.from_documents(docs) |
| Doc summary index | DocumentSummaryIndex.from_documents(docs) |
| Knowledge graph | KnowledgeGraphIndex.from_documents(docs, max_triplets_per_chunk=2) |
| Query engine | index.as_query_engine(similarity_top_k=5) |
| Chat engine | index.as_chat_engine(chat_mode="condense_plus_context") |
| Streaming | index.as_query_engine(streaming=True) then iterate response.response_gen |
| Async query | await engine.aquery("...") |
| Custom retriever | VectorIndexRetriever(index=index, similarity_top_k=20) |
| Rerank | node_postprocessors=[CohereRerank(top_n=5)] |
| Hybrid (vec+BM25) | QueryFusionRetriever([vec, bm25], mode="reciprocal_rerank") |
| Metadata filter | MetadataFilters(filters=[MetadataFilter(key="section", value="python")]) |
| Set LLM | Settings.llm = Anthropic(...) |
| Set embeddings | Settings.embed_model = HuggingFaceEmbedding("model") |
| Chunk size | Settings.chunk_size = 512 |
| Save index | index.storage_context.persist("./storage") |
| Load index | load_index_from_storage(StorageContext.from_defaults(...)) |
| Source nodes | response.source_nodes |
| Citations | CitationQueryEngine.from_args(index) |
| ChromaDB store | ChromaVectorStore(chroma_collection=collection) |
| Qdrant store | QdrantVectorStore(client=qclient, collection_name="docs") |
| Ingestion pipeline | IngestionPipeline(transformations=[...], docstore=..., cache=...) |
| Sub-questions | SubQuestionQueryEngine.from_defaults(tools) |
| Router | RouterQueryEngine.from_defaults(query_engine_tools=...) |
| ReAct agent | ReActAgent.from_tools(tools, llm=..., verbose=True) |
| Multi-modal | MultiModalVectorStoreIndex.from_documents(docs) |
| Observability | set_global_handler("arize_phoenix") |