cheat sheet
ragas
Package-level reference for ragas on PyPI — install variants, LLM-as-judge dependencies, metric churn, and alternative evaluators.
What it is
ragas is a Python evaluation framework for Retrieval-Augmented Generation pipelines, originated by Explodinggradients. It provides a catalogue of metrics — faithfulness, answer relevance, context precision, context recall, answer correctness, and more — most of which are implemented as LLM-as-judge evaluations: an LLM scores the output along the metric's rubric, optionally with a reference answer.
Reach for ragas when you want a batteries-included library of RAG metrics that integrates cleanly with datasets, LangChain, and LlamaIndex, and you are happy to pay for LLM calls during evaluation. Reach for trulens-eval for richer in-process tracing and a feedback-function dashboard, or deepeval for a Pytest-native evaluator.
Install
pip install ragas
Output: (none — exits 0 on success)
uv add ragas
Output: dependency resolved + added to pyproject.toml
poetry add ragas
Output: updated lockfile + virtualenv install
Versioning & Python support
- The package is still pre-1.0 and iterates quickly. Metric implementations, prompts, and even metric names occasionally change between minors; for reproducibility of historical evaluation numbers, pin the exact version in CI.
- Recent versions support Python 3.9+. Pure-Python with `datasets`, `langchain`, and one or more LLM-provider SDKs as runtime dependencies.
- The base install pulls in `langchain-core` and `langchain` for prompt and LLM-wrapping abstractions, even if you do not otherwise use LangChain. Newer versions are working to slim this dependency.
Package metadata
- Maintainer: Explodinggradients and community contributors
- Project home: github.com/explodinggradients/ragas
- Docs: docs.ragas.io
- PyPI: pypi.org/project/ragas
- License: Apache-2.0
- Governance: independent open-source project
- First released: 2023
- Downloads: hundreds of thousands per month
Optional dependencies & extras
ragas does not publish formal ragas[xxx] extras. Instead, you install the LLM-provider SDK you want to use as the judge, and ragas discovers it at runtime:
- `openai`: required if your judge is OpenAI (the most common path; default in many examples).
- `anthropic`: for Claude as judge.
- `cohere`, `google-generativeai`, `mistralai`, `vertexai`: for other providers.
- `langchain-openai`, `langchain-anthropic`, `langchain-google-genai`, …: most ragas metrics accept any LangChain `BaseChatModel`, so the provider's LangChain integration is often what you actually install.
- `datasets`: pulled in transitively; ragas's evaluation API takes a `datasets.Dataset` of `(question, contexts, answer, ground_truths)` rows.
- `sentence-transformers`: required for embedding-based metrics like `answer_similarity`.
For local-only evaluation without external API calls, install a local-LLM stack (vllm, ollama, or llama-cpp-python) and point ragas at it via its LangChain wrapper.
Alternatives
| Package | Trade-off |
|---|---|
| `trulens-eval` | Tracing + dashboard + RAG triad feedback functions. Use when you want a UI on top of evaluation. |
| `deepeval` | Pytest-native; assertion-style metrics. Use when evaluation should live next to unit tests. |
| `langsmith` | Hosted observability + datasets + evaluators from LangChain. Use when you are already on LangSmith. |
| `phoenix` (Arize) | Open-source LLM observability with trace replay. Use for trace-first workflows. |
| `promptfoo` | Node-based prompt evaluation CLI. Use when you care more about prompt A/B than RAG-specific metrics. |
| `evals` (OpenAI) | OpenAI's classic eval harness. Use for low-level eval registries. |
Common gotchas
- Metric implementation churn. Faithfulness, context precision, and answer relevance have all been rewritten across versions, sometimes with materially different prompts. Two `ragas` versions can give different scores for the same data, so pin the version in any longitudinal evaluation.
- LLM-as-judge costs add up fast. Many metrics make multiple LLM calls per row; evaluating 1,000 rows with five metrics on GPT-4-class models is a noticeable bill. Use a smaller judge model for development.
- `datasets` schema is strict. The expected columns are `question`, `contexts` (list[str]), `answer`, and `ground_truths` (list[str]). Shape mismatches (a string instead of a list) silently produce zero scores for some metrics.
- LangChain coupling. Earlier versions hard-imported LangChain at module load; even if you only used the OpenAI judge, you paid for the LangChain import time. Newer versions are slimming this, but the dependency is still present.
- Determinism is approximate. Even with `temperature=0`, LLM judges vary across providers and across days. For CI, treat ragas scores as bands (e.g. `faithfulness > 0.75`), not exact values.
- Reference-free vs reference-based metrics. `context_recall` and `answer_correctness` need `ground_truths`; `faithfulness` and `answer_relevancy` do not. Mixing the two in one call without aligning columns is a common cause of `KeyError`.
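The schema and reference-column gotchas above can be caught before any judge call is made. The following is a minimal sketch in plain Python, assuming rows are dictionaries with the column names listed above; `validate_rows` and `REFERENCE_METRICS` are hypothetical names for illustration, not part of ragas.

```python
from typing import Any

# Metrics that need a reference column, per the gotcha list above.
# Illustrative only: this set is not read from ragas itself.
REFERENCE_METRICS = {"context_recall", "answer_correctness"}


def _is_str_list(value: Any) -> bool:
    """True only for a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def validate_rows(rows: list[dict[str, Any]], metric_names: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the rows look usable."""
    needs_reference = any(name in REFERENCE_METRICS for name in metric_names)
    problems: list[str] = []
    for i, row in enumerate(rows):
        for column in ("question", "answer"):
            if not isinstance(row.get(column), str):
                problems.append(f"row {i}: '{column}' must be a str")
        if not _is_str_list(row.get("contexts")):
            problems.append(f"row {i}: 'contexts' must be a list[str]")
        if needs_reference and not _is_str_list(row.get("ground_truths")):
            problems.append(f"row {i}: 'ground_truths' must be a list[str]")
    return problems
```

Running a check like this on the raw rows before building the `Dataset` turns a silent zero score or a late `KeyError` into an explicit message that names the offending row.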
Real-world recipes
The recipes below focus on the install footprint and judge-LM topology each pattern requires — the sections/ai/ragas companion covers the metric catalogue in detail.
Minimal eval harness against an OpenAI judge — the canonical first run. pip install ragas openai datasets is the whole footprint.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
samples = Dataset.from_list([
    {
        "question": "What is HNSW?",
        "contexts": ["HNSW is a graph-based ANN algorithm..."],
        "answer": "HNSW is a layered graph index for approximate nearest neighbour search.",
        "ground_truths": ["HNSW stands for Hierarchical Navigable Small World."],
    },
    # ... more rows
])
result = evaluate(
    samples,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
Output: a per-metric mean across all rows, plus per-row scores accessible via result.to_pandas(); the judge LM is OpenAI's default (gpt-4o-mini in recent ragas versions)
Anthropic judge with LangChain wrapper — most ragas metrics accept any LangChain BaseChatModel via llm=. This is how you swap providers.
from langchain_anthropic import ChatAnthropic
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness
# `samples` is the Dataset built in the first recipe.
judge = LangchainLLMWrapper(ChatAnthropic(model="claude-3-5-sonnet-latest"))
result = evaluate(samples, metrics=[faithfulness], llm=judge)
Output: the same dataset evaluated with Claude as the judge; results differ from the OpenAI judge — this is why pinning the judge model is part of the eval methodology
Local judge via Ollama, for cost-free CI evaluation or air-gapped environments. Trade-off: smaller models produce noisier judgments.
from langchain_community.chat_models import ChatOllama
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness
# `samples` is the Dataset built in the first recipe.
judge = LangchainLLMWrapper(ChatOllama(model="llama3.1:70b", temperature=0))
result = evaluate(samples, metrics=[faithfulness], llm=judge)
Output: evaluation against a local model; expect slower runs (10× minimum) and lower-confidence scores than a frontier model judge
Embedding-based metrics need a separate embedder. answer_similarity and semantic_similarity-flavoured metrics require an embeddings= argument in addition to the LM judge.
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import answer_similarity
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))
result = evaluate(
    samples,
    metrics=[answer_similarity],
    llm=judge,
    embeddings=emb,
)
Output: cosine similarity between answer and ground_truths embedded with the supplied model
Regression CI on a small fixed dataset — wire ragas results into a pytest assertion so the eval gates merges.
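from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
def test_rag_quality():
    # load_eval_set() and judge come from your own test fixtures.
    result = evaluate(
        load_eval_set(),
        metrics=[faithfulness, answer_relevancy],
        llm=judge,
    )
    df = result.to_pandas()
    # Thresholds are bands: LM judges are not perfectly deterministic.
    assert df["faithfulness"].mean() > 0.80
    assert df["answer_relevancy"].mean() > 0.75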
Output: the test fails the build if mean scores drop below thresholds; treat scores as bands, not exact equality (LM judges are not perfectly deterministic)
Production deployment
Ragas is a batch-evaluation library, not a runtime observability tool. The deployment story is a scheduled job that runs against a curated dataset and persists results.
Topology checklist:
| Concern | Approach |
|---|---|
| Where eval runs | CI for regression gating; nightly cron for trend graphs |
| Where datasets live | versioned in git (small) or in S3/DVC (large) |
| Where results land | Postgres/DuckDB table; LangSmith dataset; CSV in object storage |
| LM judge selection | strong model for release-gating, cheap model for inner-loop dev |
| Cost control | sample subsets in dev, full set on schedule |
| Reproducibility | pin ragas version, judge model name, temperature=0 |
Batch evaluation with evaluate(...). Recent ragas versions parallelise judge calls automatically. For larger datasets, control concurrency with run_config=:
from ragas.run_config import RunConfig
cfg = RunConfig(max_workers=8, timeout=120, max_retries=2)
result = evaluate(samples, metrics=[faithfulness], llm=judge, run_config=cfg)
Output: judge calls fan out across 8 workers; the wall-clock cost drops near-linearly with concurrency until provider rate limits bite
Persisting results. result.to_pandas() returns a per-row DataFrame; persist to whatever your trends dashboard reads. The minimum useful schema is (run_id, timestamp, ragas_version, judge_model, sample_id, metric, score).
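One way to reach the minimum schema above is to flatten the per-row scores into one record per sample and metric. This is a sketch under assumptions: `to_long_records` is a hypothetical helper, the input is a list of per-row dictionaries (for example from `result.to_pandas().to_dict("records")`), and `sample_id` is simply the row position, which only works if the eval set order is stable.

```python
from datetime import datetime, timezone


def to_long_records(per_row_scores, metric_names, run_id, ragas_version, judge_model):
    """Flatten per-row score dicts into one record per (sample, metric)."""
    timestamp = datetime.now(timezone.utc).isoformat()
    records = []
    for sample_id, row in enumerate(per_row_scores):
        for metric in metric_names:
            records.append({
                "run_id": run_id,
                "timestamp": timestamp,
                "ragas_version": ragas_version,
                "judge_model": judge_model,
                "sample_id": sample_id,  # assumption: row position is a stable id
                "metric": metric,
                "score": row[metric],
            })
    return records
```

The long format keeps the table shape unchanged when metrics are added or renamed, which matters given how often metric names change between versions.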
Cost management. Each metric makes 1–5 LM calls per row. A 1,000-row eval with five metrics on GPT-4-class is non-trivial. Budgeting checklist:
- Use a smaller judge for dev (e.g. `gpt-4o-mini` or local Llama).
- Run the full eval only on a release candidate.
- Cache by `(question, contexts, answer)` hash for unchanged rows.
- Sample-then-fan-out: spot-check 50 rows daily, full 1,000 weekly.
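The caching item in the checklist could be sketched as follows. The helper names `row_cache_key` and `split_cached` are hypothetical, and the cache is assumed to be a plain dictionary from key to stored scores; ragas itself is not involved.

```python
import hashlib
import json


def row_cache_key(question: str, contexts: list[str], answer: str) -> str:
    """Stable SHA-256 key for one (question, contexts, answer) triple."""
    payload = json.dumps(
        {"question": question, "contexts": contexts, "answer": answer},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_cached(rows: list[dict], cache: dict[str, dict]) -> tuple[list[dict], list[dict]]:
    """Partition rows into (already scored, still to score) using the cache."""
    hits, misses = [], []
    for row in rows:
        key = row_cache_key(row["question"], row["contexts"], row["answer"])
        (hits if key in cache else misses).append(row)
    return hits, misses
```

Context order is deliberately part of the key, since order-sensitive metrics can score the same contexts differently when they are reordered. In practice the key would probably also need the ragas version and judge model name, because either can change the score for an otherwise unchanged row.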
Evaluation methodology
Ragas scores are rubric-based LLM-as-judge outputs. They are not ground truth — they are an LM's interpretation of a rubric applied to the data. Treating them as exact ranks rather than indicators is the most common methodology mistake.
Metric reference (canonical RAG metrics, names current at writing):
| Metric | Reference-free? | What it measures |
|---|---|---|
| `faithfulness` | yes | does answer follow from contexts? |
| `answer_relevancy` | yes | does answer actually answer question? |
| `context_precision` | needs ground_truths | are retrieved contexts relevant, ordered well? |
| `context_recall` | needs ground_truths | did retrieval pull all needed info? |
| `answer_correctness` | needs ground_truths | does answer match the reference? |
| `answer_similarity` | needs ground_truths + embeddings | embedding cosine vs reference |
Sample-size statistics. Ragas scores have non-trivial variance per row. As a rule of thumb, you need ~30 rows minimum for a stable mean per metric, and you should report 95% bootstrap CIs alongside the mean for any released number.
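A bootstrap interval for a metric mean needs only the standard library. The sketch below assumes the percentile bootstrap (one of several variants) and takes a plain list of per-row scores; `bootstrap_ci` is a hypothetical helper name.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-row scores."""
    rng = random.Random(seed)  # fixed seed so reported intervals are repeatable
    n = len(scores)
    # Resample rows with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper
```

Reporting the triple `(mean, lower, upper)` rather than the mean alone makes it visible when two runs differ by less than the interval width.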
Reference data quality dominates. A clean 100-row ground_truths set tells you more than a noisy 10,000-row crawl. Spend the eval-curation effort on writing accurate references, not scaling row count.
Human-rated baseline. Before trusting an LM judge for a release decision, sample ~30 rows, have a human rate them on the same rubric, and check the LM-vs-human agreement. Below 80% agreement, the judge is too noisy to gate releases on.
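The text does not specify how agreement is computed. One simple reading, shown here as an assumption, is pass/fail agreement: both the judge score and the human score are binarised at a threshold and compared row by row. `agreement_rate` and the default threshold of 0.5 are illustrative choices.

```python
def agreement_rate(judge_scores, human_scores, threshold=0.5):
    """Fraction of rows where judge and human fall on the same side of threshold."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(
        (j >= threshold) == (h >= threshold)
        for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)


# Three of four rows agree on pass/fail.
print(agreement_rate([0.9, 0.2, 0.7, 0.4], [1.0, 0.0, 0.0, 0.0]))  # 0.75
```

Raw agreement ignores agreement by chance; a chance-corrected statistic such as Cohen's kappa is a stricter alternative when the pass rate is very high or very low.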
Regression vs absolute. Treat ragas scores as a diff signal: did answer_relevancy drop > 5 points from the last release? Investigate. Treating absolute scores ("our pipeline scored 0.83") as comparable across versions is unreliable because metric implementations change.
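The diff-signal idea can be expressed as a small comparison of per-metric means between two runs. This sketch assumes scores on a 0 to 1 scale, so "5 points" is read as 0.05; `find_regressions` is a hypothetical helper.

```python
def find_regressions(previous, current, max_drop=0.05):
    """Return {metric: drop} for metrics whose mean fell by more than max_drop."""
    regressions = {}
    for metric, old_mean in previous.items():
        if metric not in current:
            continue  # metric renamed or removed; compare manually
        drop = old_mean - current[metric]
        if drop > max_drop:
            regressions[metric] = round(drop, 4)
    return regressions
```

Both runs should come from the same pinned ragas version and judge model; otherwise a drop may reflect a changed metric implementation rather than a changed pipeline.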
Version migration guide
Ragas is pre-1.0 and iterates quickly. Several axes change across versions:
Metric renames and behaviour changes.
- Early versions had `context_relevancy`; later versions split it into `context_precision` and `context_recall` with different semantics. Scores from the two are not comparable.
- `faithfulness` prompts have been rewritten across minors, so the same input data can score differently.
- `answer_similarity` requires an explicit `embeddings=` argument in newer versions; older versions defaulted silently.
API shape.
- `evaluate(...)` signature has gained `run_config`, `embeddings`, and `raise_exceptions` parameters across versions.
- The `RunConfig` object replaced positional concurrency arguments.
- `result.to_pandas()` is the stable way to get per-row scores.
Pin for reproducibility. For any longitudinal evaluation (week-over-week trend, release-gating thresholds), pin:
- `ragas==x.y.z`: exact version.
- Judge model name (e.g. `gpt-4o-2024-08-06`, not `gpt-4o`).
- `temperature=0` and a fixed `seed=` where the judge LM supports it.
LangChain coupling. Earlier ragas hard-imported langchain at module load; recent versions slim this. Either way, langchain-core is a transitive dep — your pip resolver will pull it in.
Schema column names. The expected columns on the datasets.Dataset input have varied: older versions accepted ground_truths (plural), some intermediate versions wanted ground_truth (singular), and the column for retrieved contexts has always been contexts (list[str]). Check the version's source if scores come back as 0 for some rows.
Index tuning & retrieval quality
Ragas doesn't index anything itself, but its context_precision and context_recall metrics are the primary signal for tuning the retrieval system feeding the pipeline. The eval loop you build with ragas is the feedback channel for HNSW parameters, chunk sizes, and reranker choices.
Tuning loop pattern:
- Build a fixed eval set with `(question, contexts, answer, ground_truths)` rows.
- For each retrieval variant (different `top_k`, chunk size, reranker), produce the `contexts` column by running retrieval against your store.
- Score with `context_precision` (are retrieved contexts relevant?) and `context_recall` (did retrieval find the needed info?).
- Compare means and per-row variance; pick the variant with the best Pareto trade-off.
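The final step of the loop, picking by Pareto trade-off, can be sketched as a filter over per-variant mean scores. `pareto_front` is a hypothetical helper, and the variant names and numbers in the usage lines are made up for illustration, not measured results.

```python
def pareto_front(variants):
    """Keep variants that no other variant beats on both precision and recall."""
    front = []
    for name, (precision, recall) in variants.items():
        dominated = any(
            other_p >= precision and other_r >= recall
            and (other_p > precision or other_r > recall)
            for other, (other_p, other_r) in variants.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative (precision, recall) means per retrieval variant.
example = {
    "top_k=3": (0.82, 0.61),
    "top_k=10": (0.70, 0.79),
    "top_k=10+rerank": (0.84, 0.75),
}
print(pareto_front(example))  # ['top_k=10', 'top_k=10+rerank']
```

The front usually contains more than one variant; choosing among them is a product decision about whether missed information or irrelevant context is the costlier failure.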
Common variants worth sweeping:
| Variant | What changes | What metric moves |
|---|---|---|
| `top_k` from 3 to 10 | wider retrieval | context_recall ↑, context_precision ↓ |
| Chunk size 500 → 1500 chars | bigger context per chunk | context_recall ↑ if previously truncated |
| Add cross-encoder reranker | reorder top-K | context_precision ↑ at fixed K |
| Hybrid (BM25 + vector) vs vector-only | retrieval method | both should ↑ on keyword-heavy queries |
| HyDE on/off | query rewriting | helpful on short noisy queries |
Caution: same-data optimisation. Sweeping retrieval against the same fixed eval set risks overfitting to that set. Hold out a separate test set and only evaluate the final candidate on it.
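A held-out split is easiest to keep honest when membership is deterministic. The sketch below assigns each question to a split by hashing its text, so a question never migrates between tuning and test sets as the eval set grows. `assign_split`, `split_eval_set`, and the 20% test fraction are assumptions for illustration.

```python
import hashlib


def assign_split(question: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a question to 'tune' or 'test' via a hash bucket."""
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "tune"


def split_eval_set(rows, test_fraction=0.2):
    """Partition rows so each question always lands in the same split."""
    tune, test = [], []
    for row in rows:
        if assign_split(row["question"], test_fraction) == "test":
            test.append(row)
        else:
            tune.append(row)
    return tune, test
```

Sweeps run against the `tune` rows only; the `test` rows are scored once, for the final candidate.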
Performance tuning
The headline cost is LM judge calls. Tuning is mostly about parallelism, caching, and choosing a cheaper judge.
| Lever | Mechanism | When it helps |
|---|---|---|
| `RunConfig(max_workers=N)` | concurrent judge calls | provider rate limit allows it |
| Cheaper judge model | smaller LM | inner-loop dev; CI smoke tests |
| Subset sampling | run on N% of data | quick feedback during prompt iteration |
| Result caching | hash (question, contexts, answer) | repeated runs on unchanged rows |
| Async LM wrapper | non-blocking calls | when wrapping ragas in an async service |
| Local judge | Ollama / vLLM / llama-cpp | no API cost, slower wall-clock |
Provider rate limits. OpenAI tier-1 limits cap throughput around hundreds of requests per minute per model. For larger eval runs, request a higher tier or batch across providers. max_retries= in RunConfig recovers from transient 429s.
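A back-of-the-envelope estimate shows when the rate limit, rather than `max_workers`, sets the wall-clock floor. The sketch assumes a flat requests-per-minute cap and a fixed number of judge calls per metric per row; `estimate_run` is a hypothetical helper and the 500 requests per minute in the example is an illustrative figure, not a quoted provider limit.

```python
import math


def estimate_run(rows, metrics, calls_per_metric_row, requests_per_minute):
    """Rough judge-call count and minimum wall-clock minutes under a rate limit."""
    total_calls = rows * metrics * calls_per_metric_row
    minutes = math.ceil(total_calls / requests_per_minute)
    return total_calls, minutes


# 1,000 rows, five metrics, three calls each, capped at 500 requests per minute.
print(estimate_run(1000, 5, 3, 500))  # (15000, 30)
```

If the estimate is already dominated by the cap, raising `max_workers` only produces more 429 responses; sampling a subset or moving to a higher tier are the levers that help.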
Async vs threaded execution. Recent ragas versions use asyncio under the hood when the LM wrapper is async. The thread-based path remains for sync wrappers. Either way, you express concurrency through max_workers.
Troubleshooting common errors
- `KeyError: 'ground_truths'`: you included a reference-based metric (e.g. `context_recall`) without supplying `ground_truths`. Either supply the column or remove the metric.
- Scores are all NaN: the judge's output could not be parsed. Common causes are a judge model that is too small, a prompt language barrier, or extreme rate limiting. Inspect `result.dataset[0]` for the raw responses.
- `RateLimitError` after a few hundred rows: concurrency too high. Lower `max_workers` or upgrade the provider tier.
- `ImportError: cannot import name 'context_relevancy'`: metric was renamed. Use `context_precision` + `context_recall` in current versions.
- Different scores on identical input across versions: metric prompts changed. Pin the version for longitudinal numbers.
- `pip install ragas` pulls in 200+ MB: the `langchain` ecosystem and provider SDKs. Use `pip install ragas --no-deps` and curate, or accept the size.
- Local judge produces zero-variance scores: small models often default to "5/5" on every rubric. Validate the judge with a human-rated baseline first.
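Two of the symptoms above, all-NaN columns and zero-variance scores, are cheap to detect automatically before anyone reads the numbers. This is a sketch over a plain list of scores for one metric; `diagnose_scores` and the 50% NaN limit are hypothetical choices.

```python
import math
import statistics


def diagnose_scores(scores, nan_limit=0.5):
    """Flag score columns that are empty, mostly NaN, or have no variance at all."""
    if not scores:
        return "empty"
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if len(valid) / len(scores) < nan_limit:
        return "mostly-nan"
    if len(valid) > 1 and statistics.pvariance(valid) == 0:
        return "zero-variance"
    return "ok"
```

A check like this fits naturally at the end of an eval job, failing the run with a clear reason instead of publishing a misleading mean.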
Ecosystem integrations
- LangChain / LlamaIndex: first-class. Ragas accepts any LangChain `BaseChatModel` via `LangchainLLMWrapper`. LlamaIndex integration via `llama_index.core.evaluation.ragas`.
- LangSmith: ragas can write per-row scores back to a LangSmith dataset for trend dashboards.
- DSPy: DSPy programs can use ragas metrics as the optimisation signal in `dspy.evaluate`.
- Haystack: `ragas-haystack` exposes ragas metrics as Haystack components.
- Phoenix (Arize): Phoenix imports ragas as one of its built-in evaluator backends.
Embeddings & chunking strategy
Some ragas metrics use embeddings rather than (or in addition to) an LM judge. answer_similarity and semantic_similarity-flavoured metrics need an explicit embeddings= argument; most other metrics are LM-only.
When embeddings enter the picture.
- `answer_similarity`: cosine between `answer` and `ground_truths` embeddings.
- `context_entity_recall` (in some versions): uses embeddings to match entities.
- Custom metrics built on the `MetricWithEmbeddings` base class.
Embedder choice. Match the embedder to your retrieval pipeline. Evaluating with text-embedding-3-large while the production pipeline uses MiniLM gives misleading numbers — use the same model.
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))
Output: the embedder is reusable across metric calls; cache it as a module-level singleton if you call evaluate(...) multiple times
Chunking is not ragas's concern. Ragas evaluates the chunks your retrieval returned. If you suspect chunking is degrading retrieval quality, evaluate the same questions with progressively wider chunks and watch context_precision and context_recall trend.
Test-set chunking parity. When building an eval dataset, the contexts column must be populated using the same chunker your production pipeline uses. Mismatched chunking inflates or deflates scores artificially.
Security considerations
Ragas's evaluation pipeline sends both your questions and your retrieved contexts to an LM judge — usually a third-party provider. This is the dominant security concern.
- Sensitive data exposure to judges. If your RAG pipeline retrieves PII, internal documents, or regulated content, those documents are sent to the judge LM. For OpenAI/Anthropic/Google APIs, that means the third party sees them. Mitigations:
  - Use a self-hosted judge (Ollama / vLLM / local Llama) for sensitive data.
  - Redact or hash PII before evaluation.
  - Use a provider with a data-residency / no-training guarantee (Azure OpenAI, Bedrock with the right config).
- Reference answers may also be sensitive. `ground_truths` are part of every reference-based metric call.
- Prompt injection in the eval data. A malicious row in the eval set can hijack the judge LM. Inspect external datasets before evaluating on them.
- Cost as a denial-of-wallet vector. Self-service eval triggers (e.g. via a PR comment) should rate-limit and cap concurrency.
- Key rotation. Judge LLM API keys live in env vars; rotate alongside the rest of your fleet.
- CI secrets exposure. Eval-in-CI exposes the judge key to every PR-triggered job. Use OIDC-issued short-lived tokens where the provider supports it (Anthropic and OpenAI have organisation-key controls).
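The redaction mitigation in the list above could start from something like the following. It is a deliberately narrow sketch: it masks only e-mail addresses with a simple regular expression, and real PII handling needs far broader coverage (names, phone numbers, identifiers). `redact_emails` and `redact_row` are hypothetical helpers.

```python
import re

# Simple pattern for common e-mail shapes; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace every e-mail address in text with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def redact_row(row: dict) -> dict:
    """Return a copy of an eval row with e-mail addresses masked in text fields."""
    clean = dict(row)
    for column in ("question", "answer"):
        if isinstance(clean.get(column), str):
            clean[column] = redact_emails(clean[column])
    for column in ("contexts", "ground_truths"):
        if isinstance(clean.get(column), list):
            clean[column] = [redact_emails(item) for item in clean[column]]
    return clean
```

Redaction changes the text the judge sees, so scores on redacted and unredacted data are not directly comparable; apply it consistently across runs.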
When NOT to use this
Ragas is the right tool for batch evaluation against a curated dataset. It's the wrong tool when:
- Your dataset is < 10 rows. Spot-checking by hand is faster and produces a better signal than a noisy LM judge.
- You need runtime tracing of a live app. Use `trulens-eval`, `langfuse`, `langsmith`, or `phoenix`.
- You want Pytest-style assertions next to unit tests. `deepeval` is Pytest-native.
- Your eval should not call out to an LM provider. A local Ollama judge works but is noisy; you may prefer a programmatic metric (BLEU, ROUGE, exact-match) for that case.
- You need legally defensible accuracy numbers. LM-judge scores are not ground truth; for compliance evidence, you need human raters with documented inter-rater agreement.
See also
- AI: ragas — metrics catalogue, eval workflow, dataset preparation
- Concept: RAG — retrieval-augmented generation patterns
- Concept: API — REST design fundamentals