Measure and improve RAG pipeline quality with ragas. Covers faithfulness, answer relevancy, context precision, context recall, dataset format, LLM judges, and CI integration.
ragas — RAG Evaluation Framework
What it is
ragas (RAG Assessment) is a Python library for evaluating retrieval-augmented generation pipelines. It measures the quality of both the retrieval step (did the right chunks come back?) and the generation step (did the model answer faithfully using those chunks?). ragas uses LLMs as judges to score outputs across a set of defined metrics, producing numeric scores you can track across code changes and compare in CI.
Install
pip install ragas
pip install "ragas[all]" # includes optional integrations (LangChain, LlamaIndex, Hugging Face); quotes keep zsh from expanding the brackets
Output: (none — exits 0 on success)
Quick example
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset
data = {
"question": ["What is attention in transformers?"],
"answer": ["Attention allows the model to weigh the relevance of different input tokens."],
"contexts": [["The attention mechanism computes a weighted sum of values based on query-key similarity."]],
"ground_truth": ["Attention computes weighted sums over input tokens using query-key similarity scores."],
}
dataset = Dataset.from_dict(data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
Output:
{'faithfulness': 0.9500, 'answer_relevancy': 0.8800}
When / why to use it
- Quantifying RAG quality before and after changing the chunking strategy, retriever, or LLM.
- Catching regressions in CI: fail the build if faithfulness drops below a threshold.
- Ablation studies: compare embedding models, chunk sizes, k values, and rerankers with objective scores.
- Debugging: low context precision tells you the retriever is fetching off-topic chunks; low faithfulness tells you the LLM is hallucinating.
- Building a leaderboard across pipeline variants.
Common pitfalls
- ragas uses an LLM to judge: most metrics require an LLM (OpenAI's gpt-4o-mini by default) to evaluate outputs. Each evaluation call costs tokens. For large datasets, configure a cheaper judge model or use a local model via the custom LLM interface.
- contexts must be a list of lists: each row's contexts field is a list of strings (the retrieved chunks), even if there is only one chunk. Write [["chunk text"]], not ["chunk text"].
- ground_truth is required for some metrics: context_precision, context_recall, and answer_correctness need a reference answer. faithfulness and answer_relevancy do not; they need only the question, the answer, and the provided contexts.
- Start with faithfulness and answer_relevancy as the two most informative metrics. Add context_precision and context_recall when you suspect the retriever is the bottleneck.
- Generate evaluation datasets automatically with TestsetGenerator from ragas.testset: it creates diverse question/context/answer samples from your documents without manual labelling.
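The shape rules in this list are cheap to check before any judge tokens are spent. Below is a minimal sketch in plain Python; validate_rows is a hypothetical helper written for this cheat sheet, not a ragas function.

```python
def validate_rows(data, needs_ground_truth=False):
    """Return a list of human-readable problems found in a column-oriented dict."""
    required = ["question", "answer", "contexts"]
    if needs_ground_truth:
        required.append("ground_truth")

    missing = [col for col in required if col not in data]
    if missing:
        return [f"missing column: {col}" for col in missing]

    problems = []
    # Every column must describe the same number of rows.
    lengths = {col: len(data[col]) for col in required}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")

    # contexts must hold one list of strings per row, even for a single chunk.
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts must be a list of strings")
    return problems


good = {"question": ["q"], "answer": ["a"], "contexts": [["chunk text"]]}
bad = {"question": ["q"], "answer": ["a"], "contexts": ["chunk text"]}

print(validate_rows(good))  # []
print(validate_rows(bad))   # ['row 0: contexts must be a list of strings']
```

Running a check like this before Dataset.from_dict catches the flat-list mistake early, where it is easiest to read.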
Dataset format
ragas uses Hugging Face Dataset objects. Each row represents one Q&A interaction and must contain the columns required by the metrics you're running.
from datasets import Dataset
data = {
# Required by all metrics
"question": [
"What is a transformer?",
"How does BERT differ from GPT?",
"What is positional encoding?",
],
# Required: model-generated answer
"answer": [
"A transformer uses self-attention to process sequences in parallel.",
"BERT is bidirectional while GPT is unidirectional (autoregressive).",
"Positional encoding injects token order information into embeddings.",
],
# Required: retrieved chunks (list of strings per question)
"contexts": [
["Transformers use multi-head self-attention to process all tokens simultaneously."],
["BERT uses masked language modelling to learn bidirectional representations.",
"GPT trains as a left-to-right language model predicting the next token."],
["Since transformers have no recurrence, positional encodings are added to embeddings "
"to preserve sequence order information."],
],
# Optional: reference answer for precision/recall/correctness metrics
"ground_truth": [
"A transformer processes sequences using self-attention without recurrence.",
"BERT is bidirectional; GPT is unidirectional and autoregressive.",
"Positional encoding adds order information to token embeddings in transformers.",
],
}
dataset = Dataset.from_dict(data)
Core metrics
faithfulness
Faithfulness measures whether every claim in the answer is supported by the retrieved contexts. Score 1.0 = fully grounded; 0.0 = completely hallucinated.
from ragas import evaluate
from ragas.metrics import faithfulness
from datasets import Dataset
data = {
"question": ["What year was Python created?"],
"answer": ["Python was created in 1991 by Guido van Rossum. It is written in C and Java."],
"contexts": [["Python was created by Guido van Rossum and first released in 1991."]],
"ground_truth": ["Python was created in 1991."],
}
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness])
print(result["faithfulness"])
Output:
0.6667 ← "written in C and Java" is not supported by the context
answer_relevancy
Answer relevancy measures how well the answer addresses the question, ignoring whether it is factually correct. A verbose answer that restates the question before answering scores lower than a concise, direct answer.
from ragas.metrics import answer_relevancy
result = evaluate(Dataset.from_dict(data), metrics=[answer_relevancy])
print(result["answer_relevancy"])
context_precision
Context precision measures the signal-to-noise ratio of the retrieved chunks: how many of them are actually relevant to answering the question, with relevant chunks ranked near the top counting for more. Requires ground_truth.
from ragas.metrics import context_precision
result = evaluate(Dataset.from_dict(data), metrics=[context_precision])
print(result["context_precision"])
context_recall
Context recall measures how much of the ground-truth answer is covered by the retrieved chunks. A low score means the retriever missed relevant content. Requires ground_truth.
from ragas.metrics import context_recall
result = evaluate(Dataset.from_dict(data), metrics=[context_recall])
print(result["context_recall"])
answer_correctness
Answer correctness combines factual accuracy and semantic similarity between the generated answer and ground_truth. Requires ground_truth.
from ragas.metrics import answer_correctness
result = evaluate(Dataset.from_dict(data), metrics=[answer_correctness])
print(result["answer_correctness"])
Evaluating all core metrics at once
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
answer_correctness,
)
from datasets import Dataset
result = evaluate(
dataset,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_recall,
answer_correctness,
],
)
print(result.to_pandas())
Output:
faithfulness answer_relevancy context_precision context_recall answer_correctness
0 0.95 0.88 0.92 0.87 0.91
1 1.00 0.91 0.88 0.94 0.93
2 0.90 0.85 0.95 0.89 0.88
Configuring the LLM judge
By default ragas uses gpt-4o-mini from OpenAI. Change the judge model to any LangChain-compatible LLM.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.llms import LangchainLLMWrapper
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
import os
llm = LangchainLLMWrapper(
ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
)
embeddings = LangchainEmbeddingsWrapper(
OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
)
result = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy],
llm=llm,
embeddings=embeddings,
)
print(result)
Synthetic test set generation
ragas can generate a diverse evaluation dataset from your documents — no manual labelling required.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
import os
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()
generator_llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
critic_llm = ChatOpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
generator = TestsetGenerator.from_langchain(
generator_llm=generator_llm,
critic_llm=critic_llm,
embeddings=embeddings,
)
testset = generator.generate_with_langchain_docs(
documents,
test_size=20,
distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
df = testset.to_pandas()
print(df.head(3)[["question", "ground_truth"]])
LangChain RAG pipeline evaluation
EvaluatorChain wraps a ragas metric as a LangChain chain. Pass it the question, answer, and contexts produced by your RAG chain and it returns the score for that sample.
from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import faithfulness
evaluator = EvaluatorChain(metric=faithfulness)
# Sample run
result = evaluator.invoke({
"question": "What is multi-head attention?",
"answer": "Multi-head attention runs several attention heads in parallel.",
"contexts": ["Multi-head attention allows the model to jointly attend to "
"information from different representation subspaces."],
})
print(result["faithfulness_score"])
CI example — fail on score regression
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import load_from_disk
def test_rag_quality():
    dataset = load_from_disk("./eval_dataset")
    result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
    assert result["faithfulness"] >= 0.85, f"Faithfulness {result['faithfulness']:.2f} below 0.85"
    assert result["answer_relevancy"] >= 0.80, f"Relevancy {result['answer_relevancy']:.2f} below 0.80"
How each metric is computed under the hood
Understanding the LLM prompts behind each metric helps you interpret edge cases and write your own custom metrics.
faithfulness — claim decomposition + verification
- Ask the judge LLM to extract atomic claims from the answer (Generate a list of statements from the answer).
- For each claim, ask the judge whether the claim can be inferred from the provided contexts (Yes/No).
- Score = supported_claims / total_claims.
Implication: a long answer with many trivially supported claims can offset a single hallucinated claim. Pad the same answer with extra supported claims and the score rises, even though the hallucination is identical.
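The dilution effect is easy to see in a toy version of the scoring step. This sketch hard-codes the judge's Yes/No verdicts as booleans; faithfulness_score is an illustrative name, not the ragas implementation.

```python
def faithfulness_score(verdicts):
    """verdicts: one boolean per extracted claim, True if the contexts support it."""
    if not verdicts:
        raise ValueError("no claims were extracted from the answer")
    return sum(verdicts) / len(verdicts)


# Two supported claims plus one hallucinated claim.
short_answer = [True, True, False]
# The same hallucination, padded with seven more supported claims.
long_answer = [True] * 9 + [False]

print(round(faithfulness_score(short_answer), 4))  # 0.6667
print(round(faithfulness_score(long_answer), 4))   # 0.9
```

Because of this, it can help to look at the count of unsupported claims per answer as well as the ratio.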
answer_relevancy — reverse-question generation
- The judge generates N hypothetical questions (default 3) that the answer would be a good response to.
- Each generated question is embedded and compared to the original question via cosine similarity.
- Score = mean cosine similarity.
A verbose answer that drifts off-topic earns low relevancy because the regenerated questions don't match the original. The metric requires embeddings, not just an LLM.
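The final averaging step can be sketched without any model at all. The 2-D vectors below are made up and stand in for real embeddings of the original and regenerated questions; the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relevancy_score(question_vec, generated_vecs):
    """Mean cosine similarity between the original and regenerated questions."""
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


# Made-up vectors: three regenerated questions per answer.
original = [1.0, 0.0]
on_topic = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
off_topic = [[0.0, 1.0], [0.6, 0.8], [0.0, 1.0]]

print(round(relevancy_score(original, on_topic), 2))   # 0.8
print(round(relevancy_score(original, off_topic), 2))  # 0.2
```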
context_precision — relevance ranking of retrieved chunks
- For each retrieved chunk, the judge LLM is asked: "Is this chunk relevant to answering the question, given the ground truth?" → 1 or 0.
- The position of each "1" in the ranking matters — relevant chunks at rank 1 score higher than the same chunk at rank 5.
- Score = mean-average-precision over the retrieved sequence.
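A small sketch of why rank matters, assuming the usual average-precision formula (precision at each relevant rank, averaged over the relevant chunks). The 1/0 verdicts are hard-coded in place of judge output, and average_precision is an illustrative name.

```python
def average_precision(verdicts):
    """verdicts: 1 or 0 per retrieved chunk, in retrieval order (rank 1 first)."""
    hits = 0
    weighted = 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


print(average_precision([1, 0, 0, 0, 0]))  # 1.0  (relevant chunk at rank 1)
print(average_precision([0, 0, 0, 0, 1]))  # 0.2  (same chunk at rank 5)
```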
context_recall — ground-truth statement coverage
- The judge decomposes ground_truth into atomic statements.
- For each statement, it checks whether the statement can be inferred from the union of retrieved contexts.
- Score = covered_statements / total_statements.
answer_correctness — weighted factual + semantic
- Factual: classify each claim in the answer vs the ground truth into True Positives (in both), False Positives (only in answer), False Negatives (only in ground truth). F1 from those.
- Semantic: cosine similarity between answer and ground-truth embeddings.
- Default weights: 0.75 factual, 0.25 semantic; tune with weights.
from ragas.metrics import AnswerCorrectness
ac = AnswerCorrectness(weights=[0.5, 0.5]) # equal factual + semantic
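To see how the weights move the score, here is a sketch of the combination step. It assumes a standard F1 over the claim counts and a weighted average of the two parts; the counts and the similarity value are hypothetical, and the function names are not ragas APIs.

```python
def f1_from_counts(tp, fp, fn):
    """F1 over claim counts; returns 0.0 when there are no claims at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def answer_correctness_score(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual F1 and semantic similarity."""
    factual_w, semantic_w = weights
    factual = f1_from_counts(tp, fp, fn)
    return (factual_w * factual + semantic_w * semantic_similarity) / (factual_w + semantic_w)


# Hypothetical counts: 2 claims shared, 1 only in the answer, 1 only in the ground truth.
print(round(answer_correctness_score(2, 1, 1, semantic_similarity=0.9), 4))   # 0.725
print(round(answer_correctness_score(2, 1, 1, 0.9, weights=(0.5, 0.5)), 4))   # 0.7833
```

With equal weights, the high semantic similarity masks more of the factual gap, which is one reason to keep the factual part dominant.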
Diagnostic playbook — which metric → which problem
A drop in each metric points to a different layer of the pipeline. This table is the most actionable artefact of running ragas.
| Symptom | Most likely cause | Where to fix |
|---|---|---|
| Low faithfulness | LLM is hallucinating beyond the contexts | Tighten the prompt ("only use the context"); switch to a more obedient model |
| Low answer_relevancy | Answer drifts, restates the question, or rambles | Reduce max_tokens; add "answer the question directly" to system prompt |
| Low context_precision | Retriever returns noisy / off-topic chunks | Add a reranker; raise the score threshold; reduce k |
| Low context_recall | Retriever misses chunks that contain the answer | Increase k; switch embedding model; smaller chunks |
| High context_recall, low faithfulness | Right info reached the LLM but it didn't use it | Tighten prompt; few-shot examples that quote the context |
| Low precision and recall | The whole retrieval layer is broken | Re-embed; check chunk size; verify the index isn't stale |
| answer_correctness low but faithfulness high | The retrieved contexts are wrong | Curate the corpus; remove outdated docs |
| High variance across runs | LLM judge nondeterminism | Set temperature=0 on the judge; raise sample size |
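The playbook can be turned into a small triage helper for CI logs. This is a sketch only: diagnose and the single 0.8 cut-off are assumptions made for illustration, and real thresholds would be set per metric.

```python
def diagnose(scores, threshold=0.8):
    """Map low metric scores to hints from the playbook; threshold is an assumption."""
    low = {name for name, value in scores.items() if value < threshold}
    hints = []
    if {"context_precision", "context_recall"} <= low:
        hints.append("retrieval layer: re-embed, check chunk size, verify the index is fresh")
    elif "context_precision" in low:
        hints.append("noisy retrieval: add a reranker, raise the score threshold, reduce k")
    elif "context_recall" in low:
        hints.append("missed chunks: increase k, switch embedding model, use smaller chunks")
    if "faithfulness" in low:
        hints.append("hallucination: tighten the prompt or switch models")
    if "answer_relevancy" in low:
        hints.append("rambling answer: reduce max_tokens, ask for a direct answer")
    # "Correctness low but faithfulness high" only applies when faithfulness was measured.
    if "answer_correctness" in low and scores.get("faithfulness", 0.0) >= threshold:
        hints.append("wrong contexts: curate the corpus, remove outdated docs")
    return hints


print(diagnose({"faithfulness": 0.93, "context_precision": 0.62, "context_recall": 0.88}))
```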
Configuring metric internals
Most metrics expose parameters. Faithfulness lets you swap the claim-decomposition strategy; AnswerRelevancy exposes strictness (number of regenerated questions).
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision
faith = Faithfulness(
name="faithfulness_strict",
max_retries=2, # retry the judge on JSON parse errors
)
rel = AnswerRelevancy(
name="relevancy_high",
strictness=5, # generate 5 reverse-questions instead of 3
)
precision = ContextPrecision(
name="precision_with_ref",
# Provide a custom prompt template by subclassing if needed
)
result = evaluate(dataset, metrics=[faith, rel, precision])
Output:
{'faithfulness_strict': 0.9333, 'relevancy_high': 0.8721, 'precision_with_ref': 0.9100}
Local judge — use a small open-source model
LLM-as-judge calls add up. For dev iteration, point ragas at a local model via Ollama or vLLM.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
llm = LangchainLLMWrapper(ChatOllama(model="qwen2:7b", temperature=0))
emb = LangchainEmbeddingsWrapper(HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5"))
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy], llm=llm, embeddings=emb)
print(result)
Output:
{'faithfulness': 0.91, 'answer_relevancy': 0.84}
ollama serve &
ollama pull qwen2:7b
python eval.py
Output:
pulling manifest
pulling 73895fa49e92... 100%
verifying sha256 digest
success
{'faithfulness': 0.91, 'answer_relevancy': 0.84}
A small judge can be biased — it tends to be over-lenient on faithfulness (everything looks supported) and under-confident on relevancy. For final/release evaluation, use a strong judge (Claude Sonnet, GPT-4o); use the local judge only for fast iteration.
Cost control — caching, sampling, and judging only what changed
Each metric makes 1–5 LLM calls per row. A 500-row eval with 5 metrics easily hits 5 000 judge calls.
import os
os.environ["RAGAS_CACHE_PATH"] = "./.ragas_cache" # on-disk cache for identical (prompt, inputs); check that your ragas version reads this variable
# Sample 100 rows for fast feedback during dev
small = dataset.shuffle(seed=42).select(range(100))
quick = evaluate(small, metrics=[faithfulness, answer_relevancy])
# Full run for nightly / pre-release
full = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
Output:
quick (100 rows): {'faithfulness': 0.89, 'answer_relevancy': 0.85}
full (500 rows): {'faithfulness': 0.88, 'answer_relevancy': 0.86, 'context_precision': 0.91, 'context_recall': 0.84}
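The judge-call estimate at the start of this section is simple multiplication, and it is worth running before a large job. A minimal sketch, taking the 1 to 5 calls per metric per row quoted above as the bounds; estimate_judge_calls is a hypothetical helper.

```python
def estimate_judge_calls(rows, metrics, calls_per_metric=(1, 5)):
    """Return (low, high) bounds on judge calls for one evaluation run."""
    low, high = calls_per_metric
    return rows * metrics * low, rows * metrics * high


# Full run: 500 rows, 5 metrics.
print(estimate_judge_calls(rows=500, metrics=5))  # (2500, 12500)

# Quick dev loop: 100 sampled rows, 2 metrics.
print(estimate_judge_calls(rows=100, metrics=2))  # (200, 1000)
```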
Async and concurrent execution
evaluate() is a blocking call, but it fans judge calls out to a worker pool internally. Pass run_config=RunConfig(max_workers=...) to control how many run at once. The judge LLM's rate limit is usually the bottleneck.
from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics import faithfulness, answer_relevancy
import time
cfg = RunConfig(
timeout=120, # per-judge-call timeout
max_retries=3, # retry on transient errors
max_wait=60, # exponential backoff cap
max_workers=8, # concurrent judge calls
)
start = time.time()
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy], run_config=cfg)
print(f"Evaluated {len(dataset)} rows in {time.time()-start:.1f}s")
Output:
Evaluated 500 rows in 78.3s
Saving and loading datasets
ragas uses the Hugging Face datasets library, so the standard save_to_disk / load_from_disk round-trip works.
mkdir -p ./eval_datasets
Output: (none — exits 0 on success)
from datasets import Dataset, load_from_disk
import json
# Save dataset
dataset.save_to_disk("./eval_datasets/qa-v1")
# Save eval result alongside
result.to_pandas().to_csv("./eval_datasets/qa-v1.scores.csv", index=False)
# Reload later
ds = load_from_disk("./eval_datasets/qa-v1")
print(f"Loaded {len(ds)} rows")
Output:
Loaded 500 rows
Integration with LangSmith
Push ragas scores into LangSmith as feedback so the runs appear with metric scores in the LangSmith UI.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from langsmith import Client
import os
ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])
# Assumes each row carries a run_id column that links it back to the original LangSmith run
result_df = evaluate(dataset, metrics=[faithfulness, answer_relevancy]).to_pandas()
for _, row in result_df.iterrows():
    ls.create_feedback(
        run_id=row["run_id"],
        key="faithfulness",
        score=float(row["faithfulness"]),
        comment="ragas auto-eval",
    )
    ls.create_feedback(
        run_id=row["run_id"],
        key="answer_relevancy",
        score=float(row["answer_relevancy"]),
    )
print(f"Pushed scores for {len(result_df)} runs")
Output:
Pushed scores for 500 runs
Integration with LlamaIndex
from ragas.metrics import faithfulness, answer_relevancy
from ragas.integrations.llama_index import evaluate as li_evaluate
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
questions = [
"What is multi-head attention?",
"Why use positional encoding?",
]
ground_truths = [
"Multiple attention heads attending to different subspaces.",
"Transformers have no recurrence so order info must be added explicitly.",
]
result = li_evaluate(
query_engine=query_engine,
metrics=[faithfulness, answer_relevancy],
dataset={"question": questions, "ground_truth": ground_truths},
)
print(result)
Output:
{'faithfulness': 0.93, 'answer_relevancy': 0.87}
Real-world recipes
Recipe: A/B test two embedding models
Build the same RAG pipeline twice with different embedding models and compare ragas scores.
from datasets import Dataset
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
# Assumed to exist already: docs (chunked Documents), llm (a LangChain chat model),
# build_prompt(question, hits), eval_questions, and eval_truths.
variants = {
    "minilm": HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    "bge_base": HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"),
}
scores = {}
for name, emb in variants.items():
    # One collection per variant so the two indexes never mix
    store = Chroma.from_documents(docs, embedding=emb, collection_name=f"eval_{name}")
    retriever = store.as_retriever(search_kwargs={"k": 5})
    rows = []
    for q, gt in zip(eval_questions, eval_truths):
        hits = retriever.get_relevant_documents(q)
        answer = llm.invoke(build_prompt(q, hits)).content
        rows.append({
            "question": q,
            "answer": answer,
            "contexts": [h.page_content for h in hits],
            "ground_truth": gt,
        })
    ds = Dataset.from_list(rows)
    result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_recall])
    scores[name] = result
for name, s in scores.items():
    print(f"{name:>10} faith={s['faithfulness']:.3f} rel={s['answer_relevancy']:.3f} recall={s['context_recall']:.3f}")
Output:
minilm faith=0.852 rel=0.811 recall=0.738
bge_base faith=0.891 rel=0.842 recall=0.823
Recipe: chunk-size sweep
Find the sweet spot between chunk size and retrieval quality.
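import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness
# Assumed to exist already: raw_docs, emb, eval_questions, eval_truths, and the
# helpers build_rag(store) and run_pipeline(pipeline, questions, truths).
scores = {}
for chunk_size in (256, 512, 1024, 2048):
    # Overlap scales with chunk size (one eighth of the chunk)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 8)
    chunks = splitter.split_documents(raw_docs)
    store = Chroma.from_documents(chunks, embedding=emb)
    pipeline = build_rag(store)
    ds = run_pipeline(pipeline, eval_questions, eval_truths)
    scores[chunk_size] = evaluate(ds, metrics=[context_precision, context_recall, faithfulness])
print(json.dumps({k: {m: round(v, 3) for m, v in s.items()} for k, s in scores.items()}, indent=2))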
Output:
{
"256": {"context_precision": 0.71, "context_recall": 0.92, "faithfulness": 0.88},
"512": {"context_precision": 0.84, "context_recall": 0.91, "faithfulness": 0.91},
"1024": {"context_precision": 0.86, "context_recall": 0.83, "faithfulness": 0.90},
"2048": {"context_precision": 0.82, "context_recall": 0.71, "faithfulness": 0.86}
}
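One way to turn the sweep into a single pick is to rank chunk sizes by the harmonic mean of context precision and context recall. That ranking rule is an assumption made for this sketch, not something ragas prescribes; the numbers are copied from the output above.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Scores copied from the sweep output above.
sweep = {
    256: {"context_precision": 0.71, "context_recall": 0.92},
    512: {"context_precision": 0.84, "context_recall": 0.91},
    1024: {"context_precision": 0.86, "context_recall": 0.83},
    2048: {"context_precision": 0.82, "context_recall": 0.71},
}

# Sort chunk sizes from best to worst retrieval balance.
ranked = sorted(
    sweep,
    key=lambda size: f1(sweep[size]["context_precision"], sweep[size]["context_recall"]),
    reverse=True,
)
best = ranked[0]
print(best)  # 512
```

Here 512 wins because it keeps recall close to the 256 setting while recovering most of the precision of the larger chunks.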
Recipe: CI gate with per-metric thresholds
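from datasets import load_from_disk
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
}
def main():
    ds = load_from_disk("./eval_datasets/golden-v3")
    result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_recall])
    # Collect every metric that falls below its threshold
    failures = [
        f"{k}: {result[k]:.3f} < {v:.2f}"
        for k, v in THRESHOLDS.items()
        if result[k] < v
    ]
    if failures:
        for f in failures:
            print(f"FAIL {f}")
        raise SystemExit(1)  # non-zero exit fails the CI job
    print("PASS all thresholds met")
if __name__ == "__main__":
    main()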
python -m scripts.eval_gate
Output:
PASS all thresholds met
Recipe: bulk-score production traces
Score yesterday's production traces overnight and store the metrics for trend analysis.
from datetime import datetime, timedelta, timezone
from langsmith import Client
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
ls = Client()
runs = list(ls.list_runs(
    project_name="prod",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    run_type="chain",
    limit=2000,
))
rows = [
    {
        "question": r.inputs.get("question", ""),
        "answer": r.outputs.get("answer", ""),
        "contexts": r.outputs.get("sources", []) or [""],
        "ground_truth": "", # leave empty for ref-free metrics only
        "run_id": str(r.id),
    }
    for r in runs
    if r.outputs and r.inputs.get("question")
]
ds = Dataset.from_list(rows)
result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
# Timezone-aware timestamp, consistent with the start_time filter above
result.to_pandas().to_parquet(f"./scores/{datetime.now(timezone.utc):%Y-%m-%d}.parquet")
print(f"Scored {len(rows)} prod runs")
Output:
Scored 1184 prod runs
Performance and reliability tips
- Set temperature=0 on the judge LLM. Default sampling causes 5–10% score variance between runs.
- For long contexts, the judge may truncate. Verify the judge model's context window can hold question + all retrieved chunks + ground truth.
- ragas raises ExceptionInRunner when too many judge calls fail in a row. Lower max_workers or raise max_retries in RunConfig.
- Embedding-based metrics (answer_relevancy, answer_correctness) cache per-text embeddings; reuse the same embeddings object across calls.
- Always seed shuffle(seed=...) so subset runs are comparable across iterations.
Quick reference
| Metric | Requires | Measures |
|---|---|---|
| faithfulness | question, answer, contexts | Fraction of answer claims grounded in contexts |
| answer_relevancy | question, answer, contexts | How directly the answer addresses the question |
| context_precision | question, contexts, ground_truth | Fraction of retrieved chunks that are relevant |
| context_recall | question, contexts, ground_truth | Coverage of ground truth by retrieved chunks |
| answer_correctness | question, answer, ground_truth | Factual + semantic similarity to reference |
| Task | Code |
|---|---|
| Basic evaluation | evaluate(dataset, metrics=[...]) |
| Custom LLM judge | evaluate(..., llm=LangchainLLMWrapper(model)) |
| Custom embeddings | evaluate(..., embeddings=LangchainEmbeddingsWrapper(emb)) |
| Concurrency | evaluate(..., run_config=RunConfig(max_workers=8)) |
| Cache judge calls | os.environ["RAGAS_CACHE_PATH"] = "./.ragas_cache" |
| To pandas | result.to_pandas() |
| Save dataset | dataset.save_to_disk("./path") |
| Load dataset | load_from_disk("./path") |
| Generate testset | TestsetGenerator.from_langchain(...).generate_with_langchain_docs(...) |
| LangSmith push | ls.create_feedback(run_id, key="faithfulness", score=...) |
| LlamaIndex eval | ragas.integrations.llama_index.evaluate(query_engine, metrics, dataset) |
| Dataset format | Dataset.from_dict({"question": [], "answer": [], "contexts": [[]], "ground_truth": []}) |