cheat sheet

ragas

Package-level reference for ragas on PyPI — install variants, LLM-as-judge dependencies, metric churn, and alternative evaluators.

updated 05-31-2026

What it is

ragas is a Python evaluation framework for Retrieval-Augmented Generation pipelines, originated by Explodinggradients. It provides a catalogue of metrics — faithfulness, answer relevance, context precision, context recall, answer correctness, and more — most of which are implemented as LLM-as-judge evaluations: an LLM scores the output along the metric's rubric, optionally with a reference answer.

Reach for ragas when you want a batteries-included library of RAG metrics that integrates cleanly with datasets, LangChain, and LlamaIndex, and you are happy to pay for LLM calls during evaluation. Reach for trulens-eval for richer in-process tracing and a feedback-function dashboard, or deepeval for a Pytest-native evaluator.

Install

bash

pip install ragas

Output: (none — exits 0 on success)

bash

uv add ragas

Output: dependency resolved + added to pyproject.toml

bash

poetry add ragas

Output: updated lockfile + virtualenv install

Versioning & Python support

The package is still pre-1.0 and iterates quickly. Metric implementations, prompts, and even metric names occasionally change between minors — for reproducibility of historical evaluation numbers, pin the exact version in CI.
Recent versions support Python 3.9+. The package is pure Python; its runtime dependencies include datasets, langchain, and one or more LLM-provider SDKs.
The base install pulls in langchain-core and langchain for prompt and LLM-wrapping abstractions, even if you do not otherwise use LangChain. Newer versions are working to slim this dependency.

Package metadata

Maintainer: Explodinggradients and community contributors
Project home: github.com/explodinggradients/ragas
Docs: docs.ragas.io
PyPI: pypi.org/project/ragas
License: Apache-2.0
Governance: independent open-source project
First released: 2023
Downloads: hundreds of thousands per month

Optional dependencies & extras

ragas does not publish formal ragas[xxx] extras. Instead, you install the LLM-provider SDK you want to use as the judge, and ragas discovers it at runtime:

openai — required if your judge is OpenAI (the most common path; default in many examples).
anthropic — for Claude as judge.
cohere, google-generativeai, mistralai, vertexai — for other providers.
langchain-openai, langchain-anthropic, langchain-google-genai, … — most ragas metrics accept any LangChain BaseChatModel, so the provider's LangChain integration is often what you actually install.
datasets — pulled in transitively; ragas's evaluation API takes a datasets.Dataset of (question, contexts, answer, ground_truths) rows.
sentence-transformers: needed only when embedding-based metrics like answer_similarity run on local Hugging Face embeddings; a hosted embedder such as OpenAI's does not require it.

For local-only evaluation without external API calls, install a local-LLM stack (vllm, ollama, or llama-cpp-python) and point ragas at it via its LangChain wrapper.

Alternatives

Package	Trade-off
`trulens-eval`	Tracing + dashboard + RAG triad feedback functions. Use when you want a UI on top of evaluation.
`deepeval`	Pytest-native; assertion-style metrics. Use when evaluation should live next to unit tests.
`langsmith`	Hosted observability + datasets + evaluators from LangChain. Use when you are already on LangSmith.
`phoenix` (Arize)	Open-source LLM observability with trace replay. Use for trace-first workflows.
`promptfoo`	Node-based prompt evaluation CLI. Use when you care more about prompt A/B than RAG-specific metrics.
`evals` (OpenAI)	OpenAI's classic eval harness. Use for low-level eval registries.

Common gotchas

Metric implementation churn. Faithfulness, context precision, and answer relevance have all been rewritten across versions, sometimes with materially different prompts. Two ragas versions can give different scores for the same data — pin the version in any longitudinal evaluation.
LLM-as-judge costs add up fast. Many metrics make multiple LLM calls per row; evaluating 1,000 rows with five metrics on GPT-4 class models is a noticeable bill. Use a smaller judge model for development.
datasets schema is strict. The expected columns are question, contexts (list[str]), answer, and ground_truths (list[str]). Type mismatches (a plain string where a list is expected) can silently produce zero scores for some metrics instead of raising an error.
LangChain coupling. Earlier versions hard-imported LangChain at module load; even if you only used the OpenAI judge, you paid for the LangChain import time. Newer versions are slimming this, but the dependency is still present.
Determinism is approximate. Even with temperature=0, LLM judges vary across providers and across days. For CI, treat ragas scores as bands (e.g. faithfulness > 0.75), not exact values.
Reference-free vs reference-based metrics. context_precision, context_recall, and answer_correctness need ground_truths; faithfulness and answer_relevancy do not. Mixing the two kinds in one call without supplying the reference column is a common cause of KeyError.
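
The schema gotcha above can be caught before any judge call is made. The following is a minimal sketch, assuming the column names listed in this section; `EXPECTED` and `schema_problems` are hypothetical names, not part of ragas, and other ragas versions may expect different columns.

```python
from typing import Any

# Column names follow the schema described above; other ragas versions
# may expect different names, so adjust EXPECTED to match your pin.
EXPECTED = {
    "question": str,
    "contexts": list,
    "answer": str,
    "ground_truths": list,
}


def schema_problems(row: dict[str, Any]) -> list[str]:
    """Return human-readable problems for one eval row (empty if clean)."""
    problems = []
    for column, expected_type in EXPECTED.items():
        if column not in row:
            problems.append(f"missing column: {column}")
            continue
        value = row[column]
        if not isinstance(value, expected_type):
            problems.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif expected_type is list and not all(isinstance(v, str) for v in value):
            problems.append(f"{column}: every element must be a str")
    return problems
```

Running this over every row before `Dataset.from_list(...)` turns a silent zero score into an explicit error message. Drop `ground_truths` from `EXPECTED` if you only use reference-free metrics.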

Real-world recipes

The recipes below focus on the install footprint and judge-LM topology each pattern requires — the sections/ai/ragas companion covers the metric catalogue in detail.

Minimal eval harness against an OpenAI judge — the canonical first run. pip install ragas openai datasets is the whole footprint.

python

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

samples = Dataset.from_list([
    {
        "question": "What is HNSW?",
        "contexts": ["HNSW is a graph-based ANN algorithm..."],
        "answer": "HNSW is a layered graph index for approximate nearest neighbour search.",
        "ground_truths": ["HNSW stands for Hierarchical Navigable Small World."],
    },
    # ... more rows
])

result = evaluate(
    samples,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)

Output: a per-metric mean across all rows, plus per-row scores accessible via result.to_pandas(); the judge LM is OpenAI's default (gpt-4o-mini in recent ragas versions)

Anthropic judge with LangChain wrapper — most ragas metrics accept any LangChain BaseChatModel via llm=. This is how you swap providers.

python

from langchain_anthropic import ChatAnthropic
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# `samples` is the Dataset built in the previous recipe.
judge = LangchainLLMWrapper(ChatAnthropic(model="claude-3-5-sonnet-latest"))
result = evaluate(samples, metrics=[faithfulness], llm=judge)

Output: the same dataset evaluated with Claude as the judge; results differ from the OpenAI judge — this is why pinning the judge model is part of the eval methodology

Local judge via Ollama. Suits CI evaluation with no API cost, or air-gapped environments. Trade-off: smaller models produce noisier judgments.

python

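from langchain_community.chat_models import ChatOllama
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# `samples` is the Dataset built in the first recipe.
judge = LangchainLLMWrapper(ChatOllama(model="llama3.1:70b", temperature=0))
result = evaluate(samples, metrics=[faithfulness], llm=judge)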

Output: evaluation against a local model; expect much slower runs (the slowdown depends on hardware and model size) and lower-confidence scores than a frontier model judge

Embedding-based metrics need a separate embedder. answer_similarity and semantic_similarity-flavoured metrics require an embeddings= argument in addition to the LM judge.

python

from langchain_openai import OpenAIEmbeddings
from ragas import evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import answer_similarity

# `samples` and `judge` come from the earlier recipes.
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))
result = evaluate(
    samples,
    metrics=[answer_similarity],
    llm=judge,
    embeddings=emb,
)

Output: cosine similarity between answer and ground_truths embedded with the supplied model

Regression CI on a small fixed dataset — wire ragas results into a pytest assertion so the eval gates merges.

python

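from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# load_eval_set() and judge are defined elsewhere in your test suite.
def test_rag_quality():
    result = evaluate(
        load_eval_set(),
        metrics=[faithfulness, answer_relevancy],
        llm=judge,
    )
    df = result.to_pandas()
    # Thresholds are bands, not exact expected values.
    assert df["faithfulness"].mean() > 0.80
    assert df["answer_relevancy"].mean() > 0.75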

Output: the test fails the build if mean scores drop below thresholds; treat scores as bands, not exact equality (LM judges are not perfectly deterministic)

Production deployment

Ragas is a batch-evaluation library, not a runtime observability tool. The deployment story is a scheduled job that runs against a curated dataset and persists results.

Topology checklist:

Concern	Approach
Where eval runs	CI for regression gating; nightly cron for trend graphs
Where datasets live	versioned in git (small) or in S3/DVC (large)
Where results land	Postgres/DuckDB table; LangSmith dataset; CSV in object storage
LM judge selection	strong model for release-gating, cheap model for inner-loop dev
Cost control	sample subsets in dev, full set on schedule
Reproducibility	pin ragas version, judge model name, temperature=0

Batch evaluation with evaluate(...). Recent ragas versions parallelise judge calls automatically. For larger datasets, control concurrency with run_config=:

python

from ragas.run_config import RunConfig

cfg = RunConfig(max_workers=8, timeout=120, max_retries=2)
result = evaluate(samples, metrics=[faithfulness], llm=judge, run_config=cfg)

Output: judge calls fan out across 8 workers; the wall-clock cost drops near-linearly with concurrency until provider rate limits bite

Persisting results. result.to_pandas() returns a per-row DataFrame; persist to whatever your trends dashboard reads. The minimum useful schema is (run_id, timestamp, ragas_version, judge_model, sample_id, metric, score).
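
One way to reach that schema is to flatten the per-row scores into one record per (sample, metric). This is a sketch under assumptions: `rows` is taken to be `result.to_pandas().to_dict("records")`, every column outside `INPUT_COLUMNS` is treated as a metric score, and `to_long_records` is a hypothetical helper, not a ragas function.

```python
from datetime import datetime, timezone
import uuid

# Input columns to skip; everything else is treated as a metric score.
# This set is an assumption based on the schema described in this document.
INPUT_COLUMNS = {"question", "contexts", "answer", "ground_truths"}


def to_long_records(rows, ragas_version, judge_model, run_id=None, timestamp=None):
    """Flatten per-row score dicts into one record per (sample, metric)."""
    run_id = run_id or uuid.uuid4().hex
    timestamp = timestamp or datetime.now(timezone.utc).isoformat()
    records = []
    for sample_id, row in enumerate(rows):
        for column, value in row.items():
            if column in INPUT_COLUMNS:
                continue
            records.append({
                "run_id": run_id,
                "timestamp": timestamp,
                "ragas_version": ragas_version,
                "judge_model": judge_model,
                "sample_id": sample_id,  # row position; use a stable ID if you have one
                "metric": column,
                "score": value,
            })
    return records
```

The long format keeps the table shape stable when metrics are added or renamed, so a trend query does not need a schema migration each time the metric list changes.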

Cost management. Each metric makes 1–5 LM calls per row. A 1,000-row eval with five metrics on a GPT-4-class judge is a non-trivial spend. Budgeting checklist:

Use a smaller judge for dev (e.g. gpt-4o-mini or local Llama).
Run the full eval only on a release candidate.
Cache by (question, contexts, answer) hash for unchanged rows.
Sample-then-fan-out: spot-check 50 rows daily, full 1,000 weekly.
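
To put a number on the budget, the call count follows directly from the figures above: rows × metrics × calls per metric. A small sketch of that arithmetic, with `judge_call_range` as a hypothetical helper name:

```python
def judge_call_range(rows, metrics, calls_per_metric=(1, 5)):
    """Return (low, high) judge-call counts for one evaluation run."""
    low, high = calls_per_metric
    return rows * metrics * low, rows * metrics * high


low, high = judge_call_range(rows=1_000, metrics=5)
print(low, high)  # 5000 25000
```

Multiply the result by your provider's per-call cost for the judge model to get a spend estimate; the 50-row daily spot check comes to 250 to 1,250 calls on the same assumptions.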

Evaluation methodology

Ragas scores are rubric-based LLM-as-judge outputs. They are not ground truth — they are an LM's interpretation of a rubric applied to the data. Treating them as exact ranks rather than indicators is the most common methodology mistake.

Metric reference (canonical RAG metrics, names current at writing):

Metric	Reference-free?	What it measures
`faithfulness`	yes	does `answer` follow from `contexts`?
`answer_relevancy`	yes	does `answer` actually answer `question`?
`context_precision`	needs `ground_truths`	are retrieved contexts relevant, ordered well?
`context_recall`	needs `ground_truths`	did retrieval pull all needed info?
`answer_correctness`	needs `ground_truths`	does `answer` match the reference?
`answer_similarity`	needs `ground_truths` + embeddings	embedding cosine vs reference

Sample-size statistics. Ragas scores have non-trivial variance per row. As a rule of thumb, you need ~30 rows minimum for a stable mean per metric, and you should report 95% bootstrap CIs alongside the mean for any released number.
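
A percentile bootstrap is one simple way to produce that interval. The sketch below uses only the standard library; `bootstrap_ci` is a hypothetical helper, and the resample count and seed are arbitrary illustrative defaults.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2_000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    # Cut the same number of resampled means from each tail.
    cut = int((1.0 - confidence) / 2.0 * n_resamples)
    return means[cut], means[n_resamples - 1 - cut]
```

Pass it one metric's per-row scores, for example `bootstrap_ci(df["faithfulness"].dropna().tolist())`, and report the pair next to the mean. A wide interval is a sign the eval set is too small for the comparison you want to make.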

Reference data quality dominates. A clean 100-row ground_truths set tells you more than a noisy 10,000-row crawl. Spend the eval-curation effort on writing accurate references, not scaling row count.

Human-rated baseline. Before trusting an LM judge for a release decision, sample ~30 rows, have a human rate them on the same rubric, and check the LM-vs-human agreement. Below 80% agreement, the judge is too noisy to gate releases on.
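
The text does not say how agreement should be computed, so the sketch below makes an assumption: both judge and human scores are binarised at a pass threshold, and agreement is the fraction of rows where the two verdicts match. `agreement_rate` and `pass_threshold` are hypothetical names; a chance-corrected statistic such as Cohen's kappa may suit skewed data better.

```python
def agreement_rate(judge_scores, human_scores, pass_threshold=0.5):
    """Fraction of rows where judge and human land on the same side of the threshold."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge_scores and human_scores must have the same length")
    if not judge_scores:
        raise ValueError("need at least one rated row")
    matches = sum(
        (j >= pass_threshold) == (h >= pass_threshold)
        for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)


# Illustrative values only: the judge and human disagree on the third row.
print(agreement_rate([0.9, 0.2, 0.7, 0.4], [1.0, 0.0, 0.0, 0.0]))  # 0.75
```

Under this definition, a result below 0.80 on the sampled rows would mean the judge fails the bar described above.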

Regression vs absolute. Treat ragas scores as a diff signal: did answer_relevancy drop by more than 5 points (0.05 on the 0 to 1 scale) since the last release? If so, investigate. Treating absolute scores ("our pipeline scored 0.83") as comparable across versions is unreliable because metric implementations change.

Version migration guide

Ragas is pre-1.0 and iterates quickly. Several axes change across versions:

Metric renames and behaviour changes.

Early versions had context_relevancy; later versions split into context_precision and context_recall with different semantics. Scores from the two are not comparable.
faithfulness prompts have been rewritten across minors — same input data can score differently.
answer_similarity requires an explicit embeddings= argument in newer versions; older versions defaulted silently.

API shape.

evaluate(...) signature has gained run_config, embeddings, and raise_exceptions parameters across versions.
The RunConfig object replaced positional concurrency arguments.
result.to_pandas() is the stable way to get per-row scores.

Pin for reproducibility. For any longitudinal evaluation (week-over-week trend, release-gating thresholds), pin:

ragas==x.y.z — exact version.
Judge model name (e.g. gpt-4o-2024-08-06, not gpt-4o).
temperature=0 and a fixed seed= where the judge LM supports it.
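
Pins are easier to audit when each run records them. A minimal sketch, assuming you store a small manifest next to the scores; `eval_manifest` is a hypothetical helper, and the installed version is read with the standard library's `importlib.metadata`.

```python
from importlib import metadata


def eval_manifest(judge_model, temperature=0.0, seed=None):
    """Collect the settings that must stay fixed for scores to be comparable."""
    try:
        ragas_version = metadata.version("ragas")
    except metadata.PackageNotFoundError:
        ragas_version = None  # ragas is not installed in this environment
    return {
        "ragas_version": ragas_version,
        "judge_model": judge_model,
        "temperature": temperature,
        "seed": seed,
    }
```

Comparing two manifests before comparing two score sets catches the case where a dependency bump or a judge alias change silently moved the numbers.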

LangChain coupling. Earlier ragas hard-imported langchain at module load; recent versions slim this. Either way, langchain-core is a transitive dep — your pip resolver will pull it in.

Schema column names. The expected columns on the datasets.Dataset input have varied: older versions accepted ground_truths (plural), some intermediate versions wanted ground_truth (singular), and the column for retrieved contexts was contexts (list[str]) throughout the 0.1 series. The 0.2 series introduced a renamed schema (user_input, retrieved_contexts, response, reference). Check the pinned version's source if scores come back as 0 for some rows.

Index tuning & retrieval quality

Ragas doesn't index anything itself, but its context_precision and context_recall metrics are the primary signal for tuning the retrieval system feeding the pipeline. The eval loop you build with ragas is the feedback channel for HNSW parameters, chunk sizes, and reranker choices.

Tuning loop pattern:

Build a fixed eval set with (question, contexts, answer, ground_truths) rows.
For each retrieval variant (different top_k, chunk size, reranker), produce the contexts column by running retrieval against your store.
Score with context_precision (are retrieved contexts relevant?) and context_recall (did retrieval find the needed info?).
Compare means and per-row variance; pick the variant with the best Pareto trade-off.
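
The last step of the loop can be made mechanical by keeping only the variants that no other variant beats on both metrics. A sketch of that Pareto filter, assuming you have already computed the mean context_precision and context_recall per variant; the variant names and scores below are made-up illustrative values, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(variants):
    """Return names of variants not dominated on (precision, recall).

    `variants` maps a name to (mean_context_precision, mean_context_recall).
    A variant is dominated if another is at least as good on both metrics
    and strictly better on at least one.
    """
    front = []
    for name, (precision, recall) in variants.items():
        dominated = any(
            other_p >= precision and other_r >= recall
            and (other_p > precision or other_r > recall)
            for other, (other_p, other_r) in variants.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative numbers only, not measured results.
sweep = {
    "top_k=3": (0.90, 0.60),
    "top_k=5": (0.80, 0.55),
    "top_k=10": (0.70, 0.85),
    "top_k=10+rerank": (0.88, 0.85),
}
print(pareto_front(sweep))  # ['top_k=3', 'top_k=10+rerank']
```

The filter only narrows the field; choosing between the survivors still depends on whether your application is hurt more by missing context or by irrelevant context.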

Common variants worth sweeping:

Variant	What changes	What metric moves
`top_k` from 3 to 10	wider retrieval	`context_recall` ↑, `context_precision` ↓
Chunk size 500 → 1500 chars	bigger context per chunk	`context_recall` ↑ if previously truncated
Add cross-encoder reranker	reorder top-K	`context_precision` ↑ at fixed K
Hybrid (BM25 + vector) vs vector-only	retrieval method	both should ↑ on keyword-heavy queries
HyDE on/off	query rewriting	helpful on short noisy queries

Caution: same-data optimisation. Sweeping retrieval against the same fixed eval set risks overfitting to that set. Hold out a separate test set and only evaluate the final candidate on it.
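
A deterministic split keeps the held-out rows stable as the eval set grows. The sketch below assigns each row by hashing its question text; `assign_split` and the 20% test fraction are illustrative assumptions, not a ragas feature.

```python
import hashlib


def assign_split(question, test_fraction=0.2):
    """Deterministically map a question to 'dev' or 'test' by its hash."""
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    # First 8 bytes as a number in [0, 1): stable across runs and machines.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "dev"
```

Because the assignment depends only on the question text, adding new rows never moves an existing row from the dev split into the test split.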

Performance tuning

The headline cost is LM judge calls. Tuning is mostly about parallelism, caching, and choosing a cheaper judge.

Lever	Mechanism	When it helps
`RunConfig(max_workers=N)`	concurrent judge calls	provider rate limit allows it
Cheaper judge model	smaller LM	inner-loop dev; CI smoke tests
Subset sampling	run on N% of data	quick feedback during prompt iteration
Result caching	hash `(question, contexts, answer)`	repeated runs on unchanged rows
Async LM wrapper	non-blocking calls	when wrapping ragas in an async service
Local judge	Ollama / vLLM / llama-cpp	no API cost, slower wall-clock
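
The result-caching lever needs a stable key per row. The table suggests hashing (question, contexts, answer); the sketch below also folds in the ragas version, judge model, and metric name, which is an assumption on my part, made because scores depend on all three. `row_cache_key` is a hypothetical helper.

```python
import hashlib
import json


def row_cache_key(question, contexts, answer, ragas_version, judge_model, metric):
    """Stable SHA-256 key for one (row, metric, judge, version) score."""
    payload = json.dumps(
        {
            "question": question,
            "contexts": contexts,  # order matters: rank-aware metrics see it
            "answer": answer,
            "ragas_version": ragas_version,
            "judge_model": judge_model,
            "metric": metric,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Store scores under this key in whatever store you already use, and skip the judge call on a hit. Upgrading ragas or changing the judge changes every key, so stale scores are never reused.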

Provider rate limits. OpenAI tier-1 limits cap throughput around hundreds of requests per minute per model. For larger eval runs, request a higher tier or batch across providers. max_retries= in RunConfig recovers from transient 429s.

Async vs threaded execution. Recent ragas versions use asyncio under the hood when the LM wrapper is async. The thread-based path remains for sync wrappers. Either way, you express concurrency through max_workers.

Troubleshooting common errors

KeyError: 'ground_truths' — you included a reference-based metric (e.g. context_recall) without supplying ground_truths. Either supply the column or remove the metric.
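Scores are all NaN: judge calls failed or the judge's output could not be parsed. Common causes: judge model too small, a mismatch between the prompt language and the data, or extreme rate limiting. Where your version supports it, rerun a handful of rows with raise_exceptions=True so the underlying error surfaces instead of being recorded as NaN.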
RateLimitError after a few hundred rows — concurrency too high. Lower max_workers or upgrade the provider tier.
ImportError: cannot import name 'context_relevancy' — metric was renamed. Use context_precision + context_recall in current versions.
Different scores on identical input across versions — metric prompts changed. Pin the version for longitudinal numbers.
pip install ragas pulls in 200+ MB — langchain ecosystem and provider SDKs. Use pip install ragas --no-deps and curate, or accept the size.
Local judge produces zero-variance scores — small models often default to "5/5" on every rubric. Validate the judge with a human-rated baseline first.

Ecosystem integrations

LangChain / LlamaIndex — first-class. Ragas accepts any LangChain BaseChatModel via LangchainLLMWrapper. LlamaIndex integration via llama_index.core.evaluation.ragas.
LangSmith — ragas can write per-row scores back to a LangSmith dataset for trend dashboards.
DSPy — DSPy programs can use ragas metrics as the optimisation signal in dspy.evaluate.
Haystack — ragas-haystack exposes ragas metrics as Haystack components.
Phoenix (Arize) — Phoenix imports ragas as one of its built-in evaluator backends.

Embeddings & chunking strategy

Some ragas metrics use embeddings rather than (or in addition to) an LM judge. answer_similarity and semantic_similarity-flavoured metrics need an explicit embeddings= argument; everything else is LM-only.

When embeddings enter the picture.

answer_similarity — cosine between answer and ground_truths embeddings.
context_entity_recall (in some versions) — uses embeddings to match entities.
Custom metrics built on MetricWithEmbeddings base class.

Embedder choice. Match the embedder to your retrieval pipeline. Evaluating with text-embedding-3-large while the production pipeline uses MiniLM gives misleading numbers — use the same model.

python

from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

Output: the embedder is reusable across metric calls; cache it as a module-level singleton if you call evaluate(...) multiple times

Chunking is not ragas's concern. Ragas evaluates the chunks your retrieval returned. If you suspect chunking is degrading retrieval quality, evaluate the same questions with progressively wider chunks and watch context_precision and context_recall trend.

Test-set chunking parity. When building an eval dataset, the contexts column must be populated using the same chunker your production pipeline uses. Mismatched chunking inflates or deflates scores artificially.

Security considerations

Ragas's evaluation pipeline sends both your questions and your retrieved contexts to an LM judge — usually a third-party provider. This is the dominant security concern.

Sensitive data exposure to judges. If your RAG pipeline retrieves PII, internal documents, or regulated content, those documents are sent to the judge LM. For OpenAI/Anthropic/Google APIs, that means the third party sees them. Mitigations:
- Use a self-hosted judge (Ollama / vLLM / local Llama) for sensitive data.
- Redact or hash PII before evaluation.
- Use a provider with a data-residency / no-training guarantee (Azure OpenAI, Bedrock with the right config).
Reference answers may also be sensitive. ground_truths are part of every reference-based metric call.
Prompt injection in the eval data. A malicious row in the eval set can hijack the judge LM. Inspect external datasets before evaluating on them.
Cost as a denial-of-wallet vector. Self-service eval triggers (e.g. via a PR comment) should rate-limit and cap concurrency.
Key rotation. Judge LLM API keys live in env vars; rotate alongside the rest of your fleet.
CI secrets exposure. Eval-in-CI exposes the judge key to every PR-triggered job. Use OIDC-issued short-lived tokens where the provider supports it (Anthropic and OpenAI have organisation-key controls).
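
The redaction mitigation above can be applied as a preprocessing pass over the eval rows before they reach the judge. The sketch below is deliberately simplistic: the two regular expressions are illustrative assumptions that catch only common email and phone shapes, and `redact` and `redact_row` are hypothetical helpers. A dedicated PII-detection tool is a better fit for regulated data.

```python
import re

# Simplistic patterns for illustration; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redact_row(row):
    """Return a copy of an eval row with text fields redacted."""
    return {
        **row,
        "question": redact(row["question"]),
        "contexts": [redact(c) for c in row["contexts"]],
        "answer": redact(row["answer"]),
    }


print(redact("Contact jane.doe@example.com or +1 (555) 010-9999."))
# Contact [EMAIL] or [PHONE].
```

Redaction changes the text the judge sees, so run the same pass on every evaluation you intend to compare.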

When NOT to use this

Ragas is the right tool for batch evaluation against a curated dataset. It's the wrong tool when:

Your dataset is < 10 rows. Spot-checking by hand is faster and produces a better signal than a noisy LM judge.
You need runtime tracing of a live app. Use trulens-eval, langfuse, langsmith, or phoenix.
You want Pytest-style assertions next to unit tests. deepeval is Pytest-native.
Your eval should not call out to an LM provider. A local Ollama judge works but is noisy; you may prefer a programmatic metric (BLEU, ROUGE, exact-match) for that case.
You need legally-defensible accuracy numbers. LM-judge scores are not ground truth; for compliance evidence, you need human raters with documented inter-rater agreement.
