cheat sheet

LlamaIndex

Build RAG pipelines and LLM-powered data applications with LlamaIndex. Covers document loading, indexing, query engines, custom LLMs and embeddings, persistent storage, and agents.

updated 04-27-2026

LlamaIndex — Data Framework for LLMs

What it is

LlamaIndex (formerly GPT Index) is a data framework for building LLM-powered applications over your own data. Its core workflow is: load documents with readers → index them (chunk, embed, store) → query with natural language. LlamaIndex provides a higher-level abstraction than LangChain for document-centric RAG: it handles chunking strategies, index types, and retrieval pipelines out of the box, while still allowing deep customisation of every component.

Install

bash

pip install llama-index
pip install llama-index-llms-openai          # OpenAI LLMs
pip install llama-index-llms-anthropic       # Anthropic Claude
pip install llama-index-embeddings-huggingface  # HuggingFace embeddings
pip install llama-index-vector-stores-chroma    # ChromaDB vector store

Output: pip prints download and install progress, then exits 0 on success.

Quick example

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
import os

os.environ["OPENAI_API_KEY"] = "sk-..."   # or set in environment

# Load documents from a directory
documents = SimpleDirectoryReader("./docs").load_data()

# Build index (chunks + embeds + stores in memory)
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics covered in these documents?")
print(response)

Output:

text

The documents cover three main topics: transformer architecture, attention mechanisms,
and the shift from recurrent neural networks to self-attention for sequence modelling.
Key themes include positional encoding, multi-head attention, and encoder-decoder design.

When / why to use it

Building RAG systems over your own documents (PDFs, Notion, Google Docs, databases).
When you want structured indexing strategies (vector, keyword, knowledge graph) rather than building the retrieval layer from scratch.
Multi-document reasoning: sub-question decomposition, routing, knowledge graph traversal.
When the primary workflow is document → query rather than tool-calling agents.
Custom embedding models and LLMs without changing the query interface.

Common pitfalls

Settings must be configured before indexing — if you swap the default LLM or embedding model via Settings.llm / Settings.embed_model, do it before calling VectorStoreIndex.from_documents(). Objects created before the setting change use the old model.

Default model is text-embedding-ada-002 — without explicit configuration, LlamaIndex calls OpenAI's embedding API. If you have no OpenAI key or want to use a local model, configure Settings.embed_model first.

Re-indexing cost — VectorStoreIndex.from_documents() embeds every chunk via the embedding API. For large document sets, use a persistent vector store and check whether documents already exist before re-embedding.

Use index.storage_context.persist("./storage") to save index data locally. Load it later with StorageContext.from_defaults(persist_dir="./storage") to avoid re-embedding.

response.source_nodes contains the retrieved chunks that grounded the answer. Print them with scores to understand retrieval quality: for n in response.source_nodes: print(n.score, n.text[:100]).

Richer example — persistent ChromaDB index

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.anthropic import Anthropic
import chromadb
import os

# Configure LLM and embeddings (no OpenAI required)
Settings.llm = Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512
Settings.chunk_overlap = 64

# ChromaDB persistent storage
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("research_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx = StorageContext.from_defaults(vector_store=vector_store)

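# Index documents. from_documents() embeds and inserts every document on each run;
# on later runs, reconnect with VectorStoreIndex.from_vector_store(vector_store) instead.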
documents = SimpleDirectoryReader("./research").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_ctx)

# Query
engine = index.as_query_engine(similarity_top_k=5)
response = engine.query("What evidence supports the claim that attention is sufficient?")
print(response.response)
print(f"\nSources used: {len(response.source_nodes)}")
for node in response.source_nodes:
    print(f"  [{node.score:.3f}] {node.metadata.get('file_name', '?')} — {node.text[:60]}...")

Settings — global configuration

Settings is a singleton that configures the default LLM, embedding model, chunk size, and other pipeline parameters. All index and engine objects created after a Settings change use the new values.

python

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os

# Use Anthropic + local HuggingFace embeddings
Settings.llm = Anthropic(
    model="claude-sonnet-4-6",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=1024,
)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    device="cpu",
)
Settings.chunk_size = 512
Settings.chunk_overlap = 50
Settings.num_output = 512

Document loaders — SimpleDirectoryReader

SimpleDirectoryReader reads files from a local directory (PDF, TXT, DOCX, Markdown, CSV, HTML) using format-specific parsers. It returns a list of Document objects.

python

from llama_index.core import SimpleDirectoryReader

# Load all supported files from a directory
docs = SimpleDirectoryReader("./docs").load_data()
print(f"Loaded {len(docs)} documents")
for d in docs[:2]:
    print(d.metadata["file_name"], len(d.text), "chars")

# Load specific file types only
docs = SimpleDirectoryReader("./docs", required_exts=[".pdf", ".md"]).load_data()

# Load recursively
docs = SimpleDirectoryReader("./docs", recursive=True).load_data()

Output:

text

Loaded 8 documents
attention_paper.pdf 43892 chars
transformers_overview.md 12041 chars

Other loaders (install separately): llama-index-readers-web, llama-index-readers-notion, llama-index-readers-google (Drive, Docs), llama-index-readers-database.

Query engine vs chat engine

A query engine answers single questions grounded in retrieved documents. A chat engine maintains conversation history and allows follow-up questions that reference prior answers.

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# Query engine — single-turn Q&A
query_engine = index.as_query_engine(similarity_top_k=4, response_mode="compact")
r = query_engine.query("What is multi-head attention?")
print(r.response[:200])

# Chat engine — multi-turn conversation
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
r1 = chat_engine.chat("What is multi-head attention?")
print(r1.response[:120])
r2 = chat_engine.chat("How many heads does BERT use?")
print(r2.response[:120])  # knows "heads" refers to attention heads from prior turn

Node parsers and chunking

Node parsers split documents into chunks (nodes). The default SentenceSplitter cuts on sentence boundaries up to chunk_size tokens. More options:

python

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    MarkdownNodeParser,
    CodeSplitter,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Default — sentence-aware chunking
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# Semantic chunking — splits where the topic changes (embeds sentence groups to find breakpoints)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

# Markdown-aware (preserves headers as metadata)
md_splitter = MarkdownNodeParser()

# Apply manually
from llama_index.core import SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
nodes = splitter.get_nodes_from_documents(docs)
print(f"Split into {len(nodes)} nodes")
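
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of a sliding-window chunker. It assumes whitespace-separated words stand in for tokens and cuts at fixed positions; the real SentenceSplitter counts tokens and prefers sentence boundaries, so treat chunk_words as a hypothetical helper, not the library's algorithm.

```python
def chunk_words(words: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size words, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical input: ten placeholder "tokens"
words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last word of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.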

Persistent storage — StorageContext

Save and restore index data locally to avoid re-embedding documents.

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage

docs = SimpleDirectoryReader("./docs").load_data()

# First run — build and save
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir="./storage")
print("Index saved")

# Subsequent runs — load from disk
storage_ctx = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_ctx)
print("Index loaded from disk")

query_engine = index.as_query_engine()
print(query_engine.query("Summarise the key findings."))

Retrieval modes and response synthesis

LlamaIndex separates retrieval (which chunks to fetch) from synthesis (how to generate the answer). Response modes control synthesis strategy:

Mode	Behaviour
`compact`	Concatenate chunks into fewest prompts possible
`refine`	Iteratively refine the answer over each chunk
`tree_summarize`	Recursive summarisation tree (best for long docs)
`simple_summarize`	Truncate all chunks to fit a single prompt (fast, may lose detail)
`no_text`	Return retrieved nodes without generating a response

python

# Retrieve top-8 nodes, use tree_summarize for a long summary
engine = index.as_query_engine(
    similarity_top_k=8,
    response_mode="tree_summarize",
)
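
The difference between refine and compact comes down to how many LLM calls the retrieved chunks turn into. The sketch below illustrates the packing idea only: it assumes a character budget instead of a token budget and ignores the prompt template, and pack_chunks is a hypothetical helper rather than LlamaIndex code.

```python
def pack_chunks(chunks: list[str], max_chars: int) -> list[str]:
    """Greedily concatenate chunks so each prompt context stays within max_chars."""
    prompts: list[str] = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + "\n\n" + chunk
        if current and len(candidate) > max_chars:
            prompts.append(current)  # close the full prompt, start a new one
            current = chunk
        else:
            current = candidate
    if current:
        prompts.append(current)
    return prompts


# Four hypothetical retrieved chunks of 40 characters each
chunks = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]
packed = pack_chunks(chunks, max_chars=100)
print(f"one call per chunk (refine-style): {len(chunks)}")
print(f"packed prompts (compact-style):    {len(packed)}")
```

With a larger budget the same chunks collapse into a single prompt, which is why compact is usually the cheaper choice for short answers.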

Routing and sub-question query engines

Route queries to different indexes based on content type, or decompose complex questions into sub-questions each answered by a separate engine.

python

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Two separate indexes (index_papers and index_blogs are assumed to be built beforehand)
engine_papers = index_papers.as_query_engine()
engine_blogs   = index_blogs.as_query_engine()

tools = [
    QueryEngineTool(engine_papers, metadata=ToolMetadata(
        name="papers", description="Academic papers on transformers"
    )),
    QueryEngineTool(engine_blogs, metadata=ToolMetadata(
        name="blogs", description="Blog posts explaining ML concepts"
    )),
]

sub_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_engine.query(
    "Compare how academic papers and blog posts describe the transformer attention mechanism."
)
print(response)

Agents and tool use

LlamaIndex agents combine query engines, function tools, and the LLM into a reasoning loop. ReActAgent drives the loop with prompted reason-act-observe steps; function-calling agents such as OpenAIAgent use the model's native tool-calling API.

python

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool, ToolMetadata
from llama_index.llms.anthropic import Anthropic
import os

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
docs_tool = QueryEngineTool(
    query_engine=engine,
    metadata=ToolMetadata(name="research_docs", description="Research papers on AI"),
)

agent = ReActAgent.from_tools(
    [multiply_tool, docs_tool],
    llm=Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"]),
    verbose=True,
)

response = agent.chat(
    "What is 42 * 17, and what does the research say about attention mechanisms?"
)
print(response)

Evaluation

LlamaIndex has a built-in evaluation module for RAG quality measurement.

python

from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI
import os

llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
relevancy_eval    = RelevancyEvaluator(llm=llm)

response = query_engine.query("What is self-attention?")
faith_result = faithfulness_eval.evaluate_response(response=response)
relev_result = relevancy_eval.evaluate_response(query="What is self-attention?", response=response)

print(f"Faithful: {faith_result.passing}  score={faith_result.score:.2f}")
print(f"Relevant: {relev_result.passing}  score={relev_result.score:.2f}")

Output:

text

Faithful: True  score=1.00
Relevant: True  score=1.00

Index types — beyond vector

LlamaIndex offers several index classes, each optimised for a different query pattern. Vector is the default, but the others are useful for specific shapes of data.

Index	Best for	Storage
`VectorStoreIndex`	Semantic similarity / RAG	Embeddings + nodes
`SummaryIndex`	"Summarise everything" queries; traverses all nodes	Linear list of nodes
`KeywordTableIndex`	Exact-keyword lookup (older docs, structured glossaries)	Keyword → node map
`TreeIndex`	Hierarchical summarisation for long documents	Recursive summary tree
`KnowledgeGraphIndex`	Entity-relation queries ("how is X connected to Y?")	Triples extracted by LLM
`DocumentSummaryIndex`	Routing across many documents by summary	Per-doc summary + vector

python

from llama_index.core import (
    SummaryIndex, KeywordTableIndex, TreeIndex,
    KnowledgeGraphIndex, DocumentSummaryIndex,
    SimpleDirectoryReader,
)

docs = SimpleDirectoryReader("./docs").load_data()

# Summary index — every chunk is read for every query
summary_idx = SummaryIndex.from_documents(docs)

# Keyword index — fast for exact terms
keyword_idx = KeywordTableIndex.from_documents(docs)

# Tree index — recursive summarisation, good for long single docs
tree_idx = TreeIndex.from_documents(docs, num_children=4)

Output: (none — exits 0 on success)

DocumentSummaryIndex — routing many docs

A DocumentSummaryIndex builds a summary per document and stores both the summary embeddings and the chunk-level nodes. Queries first match against summaries to pick relevant documents, then drill into chunks.

python

from llama_index.core import DocumentSummaryIndex
from llama_index.core.response_synthesizers import get_response_synthesizer

synth = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
doc_summary_idx = DocumentSummaryIndex.from_documents(
    docs,
    response_synthesizer=synth,
    show_progress=True,
)

# Use the summary-aware retriever
engine = doc_summary_idx.as_query_engine()
response = engine.query("Which of these documents discusses positional encoding?")
print(response.response)

Output:

text

The document "attention_paper.pdf" introduces sinusoidal positional encoding.
"transformers_overview.md" discusses both sinusoidal and learned positional encodings.

Retrievers — splitting fetch from synthesis

A retriever returns nodes; a query engine wraps a retriever with a response synthesiser. Sometimes you want the retrieval step alone — for reranking, hybrid search, or feeding into another component.

python

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

retriever = VectorIndexRetriever(index=index, similarity_top_k=20)
nodes = retriever.retrieve("What is multi-head attention?")
for n in nodes[:5]:
    print(f"  [{n.score:.3f}] {n.text[:60]}...")

# Build a query engine from a custom retriever + filter
engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
    response_mode="compact",
)
print(engine.query("What is multi-head attention?"))

Output:

text

  [0.872] Multi-head attention runs h attention operations in parallel...
  [0.851] Each head learns a different subspace projection of Q, K, V...
  [0.834] The outputs of all heads are concatenated and projected once more...
  [0.802] Multi-head attention was introduced in "Attention Is All You Need"...
  [0.781] In BERT, the base model uses 12 attention heads per layer...
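
The scores printed above are similarity values between the query embedding and each node embedding. As a rough sketch of what a top-k vector retriever does, assuming cosine similarity and tiny hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions, and the store contents here are invented for illustration):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[tuple[float, str]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical embeddings, chosen by hand for the example
store = {
    "multi-head attention": [0.9, 0.1, 0.0],
    "positional encoding": [0.1, 0.9, 0.0],
    "grep tutorial": [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.2, 0.0], store, k=2)
for score, text in hits:
    print(f"  [{score:.3f}] {text}")
```

A production vector store replaces the linear scan with an approximate nearest-neighbour index, but the ranking principle is the same.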

Postprocessors — rerank and filter

A postprocessor runs after retrieval and before synthesis. It can rerank, filter on similarity, deduplicate, or call a cross-encoder model.

python

from llama_index.core.postprocessor import (
    SimilarityPostprocessor,
    KeywordNodePostprocessor,
    LongContextReorder,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank
import os

postprocessors = [
    SimilarityPostprocessor(similarity_cutoff=0.65),   # drop weak matches
    KeywordNodePostprocessor(required_keywords=["attention"]),
    CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=5, model="rerank-english-v3.0"),
    LongContextReorder(),                              # run last: put best at start/end
]

engine = index.as_query_engine(similarity_top_k=20, node_postprocessors=postprocessors)
print(engine.query("How does positional encoding interact with attention?"))

Output:

text

Positional encoding is added to token embeddings before they enter the attention
layer. The attention mechanism itself has no notion of order — the positional
encoding gives each query/key pair a position-dependent component...
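
Two of these postprocessors are simple enough to sketch in plain Python. The code below is an illustration of the ideas (a score cutoff, then a reorder that moves the strongest nodes to both ends of the context), not the library's source; nodes are modelled as hypothetical (score, text) tuples.

```python
Node = tuple[float, str]  # (similarity score, text), a stand-in for NodeWithScore


def similarity_cutoff(nodes: list[Node], cutoff: float) -> list[Node]:
    """Keep only nodes whose score reaches the cutoff."""
    return [node for node in nodes if node[0] >= cutoff]


def reorder_for_long_context(nodes: list[Node]) -> list[Node]:
    """Place the strongest nodes at both ends and the weakest in the middle."""
    ordered = sorted(nodes, key=lambda node: node[0])  # weakest first
    result: list[Node] = []
    for i, node in enumerate(ordered):
        if i % 2 == 0:
            result.insert(0, node)  # push earlier (weaker) nodes toward the middle
        else:
            result.append(node)
    return result


nodes = [(0.91, "A"), (0.84, "B"), (0.77, "C"), (0.70, "D"), (0.52, "E")]
kept = similarity_cutoff(nodes, cutoff=0.65)
reordered = reorder_for_long_context(kept)
print([text for _, text in reordered])
```

Because the reorder step depends on the final scores, it only makes sense after any reranker has run.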

Metadata filters — narrowing by structured fields

Documents carry metadata (file name, date, author, custom tags). Filters apply at retrieval time to narrow the search space.

python

from llama_index.core import VectorStoreIndex, Document
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator

docs = [
    Document(text="Pandas DataFrame operations", metadata={"section": "python",     "year": 2024}),
    Document(text="Polars eager vs lazy",       metadata={"section": "python",     "year": 2026}),
    Document(text="grep tutorial",              metadata={"section": "linux",      "year": 2025}),
    Document(text="ffmpeg recipes",             metadata={"section": "linux",      "year": 2026}),
]
index = VectorStoreIndex.from_documents(docs)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="section", value="python"),
        MetadataFilter(key="year", value=2025, operator=FilterOperator.GT),
    ],
)
engine = index.as_query_engine(similarity_top_k=5, filters=filters)
print(engine.query("What about DataFrames?"))

Output:

text

Polars provides DataFrame-like operations with eager and lazy execution modes...
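
To see why only the Polars document survives, it helps to evaluate the two conditions by hand. This is a minimal sketch of AND-combined metadata filtering in plain Python; the tuple-based filter format and the matches helper are assumptions made for the example, not the MetadataFilters API.

```python
import operator

# Map operator symbols to comparison functions
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(metadata: dict, filters: list[tuple[str, str, object]]) -> bool:
    """True when every (key, op, value) condition holds (AND semantics)."""
    return all(
        key in metadata and OPS[op](metadata[key], value)
        for key, op, value in filters
    )


docs = [
    {"text": "Pandas DataFrame operations", "section": "python", "year": 2024},
    {"text": "Polars eager vs lazy", "section": "python", "year": 2026},
    {"text": "grep tutorial", "section": "linux", "year": 2025},
    {"text": "ffmpeg recipes", "section": "linux", "year": 2026},
]
filters = [("section", "==", "python"), ("year", ">", 2025)]
candidates = [d["text"] for d in docs if matches(d, filters)]
print(candidates)
```

The Pandas document is excluded by the year condition, so the similarity search never sees it even though its text mentions DataFrames.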

Hybrid retrieval — vector + BM25

Hybrid retrieval combines dense (vector) and sparse (BM25) scores so the system handles both semantic and lexical queries well. Use QueryFusionRetriever to merge multiple retrievers.

python

from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

vector_ret = VectorIndexRetriever(index=index, similarity_top_k=10)
bm25_ret   = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)

fusion = QueryFusionRetriever(
    [vector_ret, bm25_ret],
    similarity_top_k=10,
    num_queries=4,                  # original query plus LLM-generated variants (query expansion); set to 1 to disable
    mode="reciprocal_rerank",       # 'simple' or 'relative_score' also work
    use_async=True,
)

nodes = fusion.retrieve("attention is all you need")
for n in nodes[:3]:
    print(f"  [{n.score:.3f}] {n.text[:60]}...")

Output:

text

  [0.0833] The transformer architecture, introduced in "Attention Is All You..."
  [0.0814] Self-attention computes weighted sums where weights come from query...
  [0.0667] Multi-head attention runs h parallel attention operations...
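
The fused scores are small because reciprocal rank fusion adds up terms of the form 1 / (k + rank) instead of using raw similarity values. Here is a minimal sketch of the technique, assuming the commonly used constant k = 60 and ranks starting at 1; the exact constant and rank offset inside QueryFusionRetriever may differ, and the document ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical results from a dense retriever and a BM25 retriever
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"  [{score:.4f}] {doc_id}")
```

Documents that appear in both lists rise to the top even when neither retriever ranked them first, which is the behaviour that makes rank fusion robust to incomparable score scales.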

Streaming responses

For chat UIs, stream tokens as they arrive instead of waiting for the full answer.

python

engine = index.as_query_engine(streaming=True, similarity_top_k=5)
streaming_response = engine.query("Explain attention in three paragraphs.")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()

Output:

text

Attention is a mechanism that allows a neural network to focus on different parts
of the input when producing each part of the output. Rather than compressing the
entire input into a fixed-size hidden state...

Async queries

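All major engines have async counterparts (aquery, achat, aretrieve). Running independent queries concurrently with asyncio.gather overlaps the network waits on LLM and embedding calls, which raises throughput for batched evaluation or multi-tenant servers.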

python

import asyncio
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine(use_async=True)

async def main():
    questions = [
        "What is self-attention?",
        "How does BERT differ from GPT?",
        "What is positional encoding?",
    ]
    results = await asyncio.gather(*(engine.aquery(q) for q in questions))
    for q, r in zip(questions, results):
        print(f"Q: {q}\nA: {r.response[:80]}\n")

asyncio.run(main())

Output:

text

Q: What is self-attention?
A: Self-attention is a mechanism where each token attends to every other token...

Q: How does BERT differ from GPT?
A: BERT is a bidirectional encoder trained with masked language modelling...

Q: What is positional encoding?
A: Positional encoding adds order information to token embeddings since...
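
Unbounded asyncio.gather can trip provider rate limits once the question list grows. A common pattern is to cap concurrency with a semaphore. The sketch below uses a hypothetical fake_aquery coroutine in place of engine.aquery so that it runs without any API key; the limit value is an arbitrary example.

```python
import asyncio


async def fake_aquery(question: str) -> str:
    """Stand-in for engine.aquery(); sleeps briefly to simulate network latency."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


async def run_bounded(questions: list[str], limit: int) -> list[str]:
    """Run all queries concurrently, but never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)

    async def one(question: str) -> str:
        async with semaphore:
            return await fake_aquery(question)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(q) for q in questions))


questions = [
    "What is self-attention?",
    "How does BERT differ from GPT?",
    "What is positional encoding?",
]
answers = asyncio.run(run_bounded(questions, limit=2))
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```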

Vector store integrations

LlamaIndex ships first-class integrations with major vector stores. The interface is uniform — swap one line to change backends.

python

# Qdrant
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vs = QdrantVectorStore(client=client, collection_name="docs")

# Weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore
import weaviate
wclient = weaviate.connect_to_local()
vs = WeaviateVectorStore(weaviate_client=wclient, index_name="Docs")

# pgvector
from llama_index.vector_stores.postgres import PGVectorStore
vs = PGVectorStore.from_params(
    database="rag", host="localhost", password="...",
    port=5432, user="postgres", table_name="llama_docs", embed_dim=768,
)

# Pinecone
import os
from llama_index.vector_stores.pinecone import PineconeVectorStore
import pinecone
pc = pinecone.Pinecone(api_key=os.environ["PINECONE_API_KEY"])
vs = PineconeVectorStore(pinecone_index=pc.Index("docs"))

Output: (none — exits 0 on success)

Ingestion pipelines — declarative transformation

An IngestionPipeline chains transformations (splitter, metadata extractor, embedding) and applies them to incoming documents. Pipelines can deduplicate, cache, and run incrementally.

python

from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(nodes=5),                              # LLM-derived title metadata
        KeywordExtractor(keywords=8),                         # auto keywords
        QuestionsAnsweredExtractor(questions=3),              # questions the chunk answers
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    docstore=SimpleDocumentStore(),                           # enables deduplication
    cache=IngestionCache(),                                   # skip already-processed nodes
)

nodes = pipeline.run(documents=docs, show_progress=True)
print(f"Produced {len(nodes)} nodes with auto-metadata")
print(nodes[0].metadata)

Output:

text

Produced 184 nodes with auto-metadata
{'file_name': 'attention_paper.pdf', 'document_title': 'Attention Is All You Need',
 'excerpt_keywords': 'attention, transformer, encoder, decoder, query, key, value, softmax',
 'questions_this_excerpt_can_answer': '1. What is self-attention?\n2. ...'}

Workflows — the v0.10+ orchestration primitive

A Workflow is an event-driven graph of steps with type-checked events. Workflows replace ad-hoc agent loops with a debuggable, branchable graph.

python

from llama_index.core.workflow import (
    Workflow, StartEvent, StopEvent, Event, step,
)
from llama_index.llms.anthropic import Anthropic
import os

class RetrievedEvent(Event):
    query: str          # carry the question forward to the synthesis step
    chunks: list[str]

class RagWorkflow(Workflow):
    @step
    async def retrieve(self, ev: StartEvent) -> RetrievedEvent:
        # `retriever` is any retriever built earlier, e.g. index.as_retriever()
        nodes = retriever.retrieve(ev.query)
        return RetrievedEvent(query=ev.query, chunks=[n.text for n in nodes])

    @step
    async def synthesize(self, ev: RetrievedEvent) -> StopEvent:
        prompt = "Context:\n" + "\n".join(ev.chunks) + f"\n\nQuestion: {ev.query}"
        llm = Anthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
        answer = await llm.acomplete(prompt)
        return StopEvent(result=str(answer))

import asyncio

async def main():
    # run() must be awaited inside a running event loop
    result = await RagWorkflow(timeout=30).run(query="What is positional encoding?")
    print(result)

asyncio.run(main())

Output:

text

Positional encoding is the technique used in transformers to inject information
about token order into the embeddings, since the self-attention mechanism itself
treats the input as an unordered set.
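
The core idea of an event-driven workflow is that each step declares which event type it consumes, and the runner keeps dispatching until a stop event appears. The following is a deliberately simplified, synchronous sketch of that dispatch loop using only the standard library. The event classes, the STEPS table, and the toy corpus are all hypothetical; the real Workflow class is async and infers routing from type annotations.

```python
from dataclasses import dataclass


@dataclass
class Start:
    query: str


@dataclass
class Retrieved:
    query: str
    chunks: list[str]


@dataclass
class Stop:
    result: str


def retrieve(ev: Start) -> Retrieved:
    """Toy retrieval: keep corpus lines that share a word with the query."""
    corpus = [
        "Positional encoding injects token order.",
        "Attention has no notion of order.",
        "grep filters lines.",
    ]
    words = ev.query.lower().split()
    hits = [line for line in corpus if any(w in line.lower() for w in words)]
    return Retrieved(query=ev.query, chunks=hits)


def synthesize(ev: Retrieved) -> Stop:
    """Toy synthesis: report how many chunks were found instead of calling an LLM."""
    return Stop(result=f"{len(ev.chunks)} chunk(s) for '{ev.query}'")


# Routing table: which step handles which event type
STEPS = {Start: retrieve, Retrieved: synthesize}


def run(event) -> str:
    """Dispatch events to steps until a Stop event is produced."""
    while not isinstance(event, Stop):
        event = STEPS[type(event)](event)
    return event.result


result = run(Start(query="positional encoding"))
print(result)
```

Because every hop is an explicit event, adding a branch (for example a rerank step between retrieval and synthesis) means adding one event class and one entry in the routing table.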

Multi-modal — text + images

LlamaIndex supports multi-modal retrieval: store images alongside text and retrieve both for a query.

python

from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
import os

docs = SimpleDirectoryReader("./mixed_media", recursive=True).load_data()
mm_index = MultiModalVectorStoreIndex.from_documents(docs)

mm_llm = OpenAIMultiModal(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"], max_new_tokens=512)
engine = mm_index.as_query_engine(multi_modal_llm=mm_llm, similarity_top_k=4, image_similarity_top_k=2)

response = engine.query("Show me figures explaining the encoder-decoder split.")
print(response.response)
for n in response.source_nodes:
    print(f"  {n.metadata.get('file_name', '?')}")

Output:

text

The encoder-decoder split separates input encoding from output generation. Figure
1 of the original paper shows the encoder stack on the left and the decoder stack
on the right, with cross-attention connecting them.
  attention_paper.pdf
  fig1_arch.png
  fig2_attention.png

Observability — Phoenix and Langfuse

LlamaIndex emits OpenTelemetry-style events. Connect them to Arize Phoenix (local) or Langfuse for trace inspection.

python

import os
from llama_index.core import set_global_handler

# Phoenix (local, free)
set_global_handler("arize_phoenix", endpoint="http://localhost:6006")

# Or Langfuse (the handler reads its credentials from the environment)
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
set_global_handler("langfuse")     # other built-in handler names include "wandb" and "simple"

bash

pip install arize-phoenix
phoenix serve &
python my_rag_app.py

Output:

text

🌍 Phoenix UI: http://localhost:6006/
📺 Phoenix collector: 127.0.0.1:6006

Real-world recipes

Recipe: incremental re-indexing of a docs folder

Use the ingestion pipeline's docstore to skip documents whose content hasn't changed.

python

from pathlib import Path
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

STORE = Path("./storage")
DOCSTORE_PATH = STORE / "docstore.json"
CACHE_PATH = STORE / "cache.json"
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Reload the docstore and cache from a previous run if they exist
docstore = SimpleDocumentStore.from_persist_path(str(DOCSTORE_PATH)) if DOCSTORE_PATH.exists() else SimpleDocumentStore()
cache    = IngestionCache.from_persist_path(str(CACHE_PATH)) if CACHE_PATH.exists() else IngestionCache()

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=512), embed_model],
    docstore=docstore,
    cache=cache,
)

docs = SimpleDirectoryReader("./docs").load_data(num_workers=4)
nodes = pipeline.run(documents=docs, show_progress=True)

STORE.mkdir(exist_ok=True)
docstore.persist(STORE / "docstore.json")
cache.persist(STORE / "cache.json")
print(f"Updated index — processed {len(nodes)} nodes (cached entries skipped)")

Output:

text

Updated index — processed 12 nodes (cached entries skipped)
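
The skip logic rests on comparing a content hash per document against the hash recorded on the previous run. A minimal sketch of that bookkeeping, assuming SHA-256 over the raw text and a plain dict as the hash store (select_changed and the file names are hypothetical, and the real docstore tracks more than this):

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(docs: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return ids of new or modified documents and record their hashes in `seen`."""
    changed = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen.get(doc_id) != digest:
            changed.append(doc_id)
            seen[doc_id] = digest
    return changed


seen: dict[str, str] = {}

# First run: everything is new
first = select_changed({"a.md": "alpha", "b.md": "beta"}, seen)

# Second run: a.md unchanged, b.md edited, c.md added
second = select_changed({"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}, seen)

print("first run :", first)
print("second run:", second)
```

Only the ids returned on each run need to be chunked and embedded again, which is where the cost saving comes from.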

Recipe: route question to domain-specific index

python

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool

python_engine = python_index.as_query_engine(similarity_top_k=5)
linux_engine  = linux_index.as_query_engine(similarity_top_k=5)

tools = [
    QueryEngineTool.from_defaults(
        query_engine=python_engine,
        name="python_docs",
        description="Python language, packages, and ecosystem questions",
    ),
    QueryEngineTool.from_defaults(
        query_engine=linux_engine,
        name="linux_docs",
        description="Linux command-line, shell, and system administration",
    ),
]
router = RouterQueryEngine.from_defaults(query_engine_tools=tools)
print(router.query("How do I create a virtualenv?"))
print(router.query("How do I tail a log with grep?"))

Output:

text

Use `python -m venv .venv` then activate with `source .venv/bin/activate`...
Use `tail -f /var/log/syslog | grep --line-buffered ERROR` to filter as lines arrive...
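
Routing quality depends almost entirely on the tool descriptions, because the selector only sees the question and those descriptions. The toy selector below makes that dependency visible by scoring word overlap. This is an assumption-laden stand-in: the real router asks an LLM to choose, and the descriptions here were extended with extra keywords for the demonstration.

```python
def pick_tool(question: str, tools: dict[str, str]) -> str:
    """Return the tool whose description shares the most words with the question."""
    q_words = set(question.lower().replace("?", "").split())

    def overlap(name: str) -> int:
        desc_words = set(tools[name].lower().replace(",", "").split())
        return len(q_words & desc_words)

    return max(tools, key=overlap)


# Hypothetical tool descriptions (tool name -> description)
tools = {
    "python_docs": "Python language, packages, virtualenv, and ecosystem questions",
    "linux_docs": "Linux command-line, shell, grep, and system administration",
}

choice_1 = pick_tool("How do I create a Python virtualenv?", tools)
choice_2 = pick_tool("How do I tail a log with grep?", tools)
print(choice_1)
print(choice_2)
```

Vague or overlapping descriptions give any selector, lexical or LLM-based, little to work with, so it pays to name the concrete topics each index covers.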

Recipe: citation-style answers

Force the engine to include source citations inline using CitationQueryEngine.

python

from llama_index.core.query_engine import CitationQueryEngine

engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=6,
    citation_chunk_size=512,
)
response = engine.query("What are the components of self-attention?")
print(response.response)
print("\nSources:")
# Citation numbers follow the order of source_nodes, starting at 1
for i, source in enumerate(response.source_nodes, start=1):
    print(f"  [{i}] {source.node.get_text()[:60]}...")

Output:

text

Self-attention has three components: query, key, and value projections [1]. The
attention weights come from the softmax of query-key dot products scaled by sqrt(d_k) [2].

Sources:
  [1] Multi-head attention defines Q = XW_q, K = XW_k, V = XW_v...
  [2] Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V...

Recipe: golden-set evaluation in CI

python

from llama_index.core.evaluation import (
    FaithfulnessEvaluator, RelevancyEvaluator,
)
from llama_index.llms.openai import OpenAI
import os

llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
faith = FaithfulnessEvaluator(llm=llm)
relev = RelevancyEvaluator(llm=llm)

GOLDEN = ["What is multi-head attention?", "Why scaled dot-product?"]
THRESHOLD = 0.80

def test_golden_queries():
    # `engine` is the query engine under test, built or imported at module level
    for q in GOLDEN:
        r = engine.query(q)
        f = faith.evaluate_response(response=r).score
        v = relev.evaluate_response(query=q, response=r).score
        assert f >= THRESHOLD, f"FAIL faith {f:.2f} on: {q}"
        assert v >= THRESHOLD, f"FAIL relev {v:.2f} on: {q}"
    print("All golden queries pass")

bash

pytest tests/test_eval_golden.py -q

Output:

text

.                                                                        [100%]
1 passed in 18.42s

Performance and reliability tips

For very large corpora (> 100 k chunks), use a real vector store (Qdrant, Weaviate, pgvector). The default in-memory store is for development only.
similarity_top_k of 5–10 hits a sweet spot for most RAG tasks. Higher values increase synthesis cost and hurt answer focus.
Use IngestionCache + SimpleDocumentStore to skip re-embedding unchanged documents. Embedding is the most expensive step.
Enable streaming=True on user-facing chat. Total generation time is unchanged, but the first tokens appear as soon as the model emits them, which cuts perceived latency.
Use Settings.callback_manager = CallbackManager([TokenCountingHandler()]) to monitor token spend per query.
For Qdrant / Weaviate, persist the vector store outside the LlamaIndex storage_context and pass it in — this lets you re-create the index pointer cheaply.

Quick reference

Task	Code
Load directory	`SimpleDirectoryReader("./docs").load_data()`
Build index	`VectorStoreIndex.from_documents(docs)`
Summary index	`SummaryIndex.from_documents(docs)`
Doc summary index	`DocumentSummaryIndex.from_documents(docs)`
Knowledge graph	`KnowledgeGraphIndex.from_documents(docs, max_triplets_per_chunk=2)`
Query engine	`index.as_query_engine(similarity_top_k=5)`
Chat engine	`index.as_chat_engine(chat_mode="condense_plus_context")`
Streaming	`index.as_query_engine(streaming=True)` then iterate `response.response_gen`
Async query	`await engine.aquery("...")`
Custom retriever	`VectorIndexRetriever(index=index, similarity_top_k=20)`
Rerank	`node_postprocessors=[CohereRerank(top_n=5)]`
Hybrid (vec+BM25)	`QueryFusionRetriever([vec, bm25], mode="reciprocal_rerank")`
Metadata filter	`MetadataFilters(filters=[MetadataFilter(key="section", value="python")])`
Set LLM	`Settings.llm = Anthropic(...)`
Set embeddings	`Settings.embed_model = HuggingFaceEmbedding("model")`
Chunk size	`Settings.chunk_size = 512`
Save index	`index.storage_context.persist("./storage")`
Load index	`load_index_from_storage(StorageContext.from_defaults(...))`
Source nodes	`response.source_nodes`
Citations	`CitationQueryEngine.from_args(index)`
ChromaDB store	`ChromaVectorStore(chroma_collection=collection)`
Qdrant store	`QdrantVectorStore(client=qclient, collection_name="docs")`
Ingestion pipeline	`IngestionPipeline(transformations=[...], docstore=..., cache=...)`
Sub-questions	`SubQuestionQueryEngine.from_defaults(tools)`
Router	`RouterQueryEngine.from_defaults(query_engine_tools=...)`
ReAct agent	`ReActAgent.from_tools(tools, llm=..., verbose=True)`
Multi-modal	`MultiModalVectorStoreIndex.from_documents(docs)`
Observability	`set_global_handler("arize_phoenix")`
