cheat sheet

ragas

Measure and improve RAG pipeline quality with ragas. Covers faithfulness, answer relevancy, context precision, context recall, dataset format, LLM judges, and CI integration.

updated 04-27-2026

ragas — RAG Evaluation Framework

What it is

ragas (RAG Assessment) is a Python library for evaluating retrieval-augmented generation pipelines. It measures the quality of both the retrieval step (did the right chunks come back?) and the generation step (did the model answer faithfully using those chunks?). ragas uses LLMs as judges to score outputs across a set of defined metrics, producing numeric scores you can track across code changes and compare in CI.

Install

bash

pip install ragas
pip install "ragas[all]"   # includes optional integrations (LangChain, LlamaIndex, Hugging Face); quoted for shells such as zsh

Output: pip prints its usual install log and exits 0 on success.

Quick example

python

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

data = {
    "question":  ["What is attention in transformers?"],
    "answer":    ["Attention allows the model to weigh the relevance of different input tokens."],
    "contexts":  [["The attention mechanism computes a weighted sum of values based on query-key similarity."]],
    "ground_truth": ["Attention computes weighted sums over input tokens using query-key similarity scores."],
}
dataset = Dataset.from_dict(data)

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)

Output:

text

{'faithfulness': 0.9500, 'answer_relevancy': 0.8800}

When / why to use it

Quantifying RAG quality before and after changing the chunking strategy, retriever, or LLM.
Catching regressions in CI: fail the build if faithfulness drops below a threshold.
Ablation studies: compare embedding models, chunk sizes, k values, and rerankers with objective scores.
Debugging: low context precision tells you the retriever is fetching off-topic chunks; low faithfulness tells you the LLM is hallucinating.
Building a leaderboard across pipeline variants.

Common pitfalls

ragas uses an LLM to judge: most metrics call an LLM (an OpenAI model by default; see Configuring the LLM judge) to evaluate outputs, so every evaluation costs tokens. For large datasets, configure a cheaper judge model or use a local model via the custom LLM interface.

contexts must be a list of lists — each row's contexts field is a list of strings (the retrieved chunks), even if there is only one chunk: [["chunk text"]] not ["chunk text"].

ground_truth is required for some metrics: context_precision, context_recall and answer_correctness need a reference answer. faithfulness and answer_relevancy do not; faithfulness checks the answer against the provided contexts, and answer_relevancy checks it against the question.
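
A quick shape check before calling evaluate() catches both of the column pitfalls above. The sketch below is plain Python and not part of ragas; check_columns and its messages are hypothetical names used for illustration.

```python
def check_columns(data, needs_ground_truth=False):
    """Return a list of problems found in a ragas-style column dict (hypothetical helper)."""
    problems = []
    required = ["question", "answer", "contexts"]
    if needs_ground_truth:
        required.append("ground_truth")
    for col in required:
        if col not in data:
            problems.append(f"missing column: {col}")
    # contexts must be a list of lists of strings, one inner list per row
    for i, ctx in enumerate(data.get("contexts", [])):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts must be a list of strings")
    # every present column must have the same number of rows
    lengths = {col: len(data[col]) for col in required if col in data}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")
    return problems


good = {"question": ["q"], "answer": ["a"], "contexts": [["chunk text"]]}
bad = {"question": ["q"], "answer": ["a"], "contexts": ["chunk text"]}

print(check_columns(good))   # []
print(check_columns(bad))    # ['row 0: contexts must be a list of strings']
```

Running a check like this before Dataset.from_dict keeps a malformed row from wasting judge calls.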

Start with faithfulness and answer_relevancy as the two most informative metrics. Add context_precision and context_recall when you suspect the retriever is the bottleneck.

Generate evaluation datasets automatically from your documents with TestsetGenerator (in ragas.testset): it creates diverse question/context/answer samples without manual labelling.

Dataset format

ragas uses Hugging Face Dataset objects. Each row represents one Q&A interaction and must contain the columns required by the metrics you're running.

python

from datasets import Dataset

data = {
    # Required by all metrics
    "question": [
        "What is a transformer?",
        "How does BERT differ from GPT?",
        "What is positional encoding?",
    ],
    # Required: model-generated answer
    "answer": [
        "A transformer uses self-attention to process sequences in parallel.",
        "BERT is bidirectional while GPT is unidirectional (autoregressive).",
        "Positional encoding injects token order information into embeddings.",
    ],
    # Required: retrieved chunks (list of strings per question)
    "contexts": [
        ["Transformers use multi-head self-attention to process all tokens simultaneously."],
        ["BERT uses masked language modelling to learn bidirectional representations.",
         "GPT trains as a left-to-right language model predicting the next token."],
        ["Since transformers have no recurrence, positional encodings are added to embeddings "
         "to preserve sequence order information."],
    ],
    # Optional: reference answer for precision/recall/correctness metrics
    "ground_truth": [
        "A transformer processes sequences using self-attention without recurrence.",
        "BERT is bidirectional; GPT is unidirectional and autoregressive.",
        "Positional encoding adds order information to token embeddings in transformers.",
    ],
}

dataset = Dataset.from_dict(data)

Core metrics

faithfulness

Faithfulness measures whether every claim in the answer is supported by the retrieved contexts. Score 1.0 = fully grounded; 0.0 = completely hallucinated.

python

from ragas import evaluate
from ragas.metrics import faithfulness
from datasets import Dataset

data = {
    "question":  ["What year was Python created?"],
    "answer":    ["Python was created in 1991 by Guido van Rossum. It is written in C and Java."],
    "contexts":  [["Python was created by Guido van Rossum and first released in 1991."]],
    "ground_truth": ["Python was created in 1991."],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness])
print(result["faithfulness"])

Output:

text

0.6667  ← "written in C and Java" is not supported by the context

answer_relevancy

Answer relevancy measures how well the answer addresses the question, ignoring whether it is factually correct. A verbose answer that restates the question before answering scores lower than a concise, direct answer.

python

from ragas.metrics import answer_relevancy
result = evaluate(Dataset.from_dict(data), metrics=[answer_relevancy])
print(result["answer_relevancy"])

context_precision

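Context precision measures the signal-to-noise ratio of the retrieved chunks: are the chunks that are actually relevant to answering the question ranked near the top? The score is rank-aware, so a relevant chunk at rank 1 counts for more than the same chunk at rank 5. Requires ground_truth.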

python

from ragas.metrics import context_precision
result = evaluate(Dataset.from_dict(data), metrics=[context_precision])
print(result["context_precision"])

context_recall

Context recall measures how much of the ground-truth answer is covered by the retrieved chunks. A low score means the retriever missed relevant content. Requires ground_truth.

python

from ragas.metrics import context_recall
result = evaluate(Dataset.from_dict(data), metrics=[context_recall])
print(result["context_recall"])

answer_correctness

Answer correctness combines factual accuracy and semantic similarity between the generated answer and ground_truth. Requires ground_truth.

python

from ragas.metrics import answer_correctness
result = evaluate(Dataset.from_dict(data), metrics=[answer_correctness])
print(result["answer_correctness"])

Evaluating all core metrics at once

python

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from datasets import Dataset

result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    ],
)
print(result.to_pandas())

Output:

text

   faithfulness  answer_relevancy  context_precision  context_recall  answer_correctness
0          0.95              0.88               0.92            0.87                0.91
1          1.00              0.91               0.88            0.94                0.93
2          0.90              0.85               0.95            0.89                0.88

Configuring the LLM judge

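By default ragas uses an OpenAI model as the judge (gpt-4o-mini in recent releases; the default has changed between versions). Change the judge model to any LangChain-compatible LLM. Metrics that rely on embeddings, such as answer_relevancy, also accept a custom embeddings object.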

python

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.llms import LangchainLLMWrapper
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
import os

llm = LangchainLLMWrapper(
    ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
)
embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=llm,
    embeddings=embeddings,
)
print(result)

Synthetic test set generation

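ragas can generate a diverse evaluation dataset from your documents, with no manual labelling required. The imports below follow the ragas 0.1.x layout; later releases reorganised ragas.testset, so check the API of the version you have installed.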

python

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
import os

loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

generator_llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
critic_llm    = ChatOpenAI(model="gpt-4o",      api_key=os.environ["OPENAI_API_KEY"])
embeddings    = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])

generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=20,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

df = testset.to_pandas()
print(df.head(3)[["question", "ground_truth"]])

LangChain RAG pipeline evaluation

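EvaluatorChain wraps a single ragas metric as a LangChain chain. It does not run your RAG chain for you: pass it the question, the answer your chain produced, and the retrieved contexts, and it returns the score for that sample.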

python

from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import faithfulness

evaluator = EvaluatorChain(metric=faithfulness)

# Sample run
result = evaluator.invoke({
    "question": "What is multi-head attention?",
    "answer":   "Multi-head attention runs several attention heads in parallel.",
    "contexts": ["Multi-head attention allows the model to jointly attend to "
                 "information from different representation subspaces."],
})
print(result)  # the score is keyed by metric name; the exact key has varied between ragas releases

CI example — fail on score regression

python

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import load_from_disk

def test_rag_quality():
    dataset = load_from_disk("./eval_dataset")
    result  = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

    assert result["faithfulness"]      >= 0.85, f"Faithfulness {result['faithfulness']:.2f} below 0.85"
    assert result["answer_relevancy"]  >= 0.80, f"Relevancy {result['answer_relevancy']:.2f} below 0.80"
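
The test above gates on absolute thresholds. To catch a regression relative to the last accepted run, one option is to keep a baseline score file in the repository and compare against it. A minimal sketch in plain Python, assuming the scores are available as a flat dict of floats; find_regressions, the tolerance and the numbers are hypothetical.

```python
def find_regressions(current, baseline, tolerance=0.02):
    """Return {metric: (baseline, current)} for metrics that dropped by more than tolerance."""
    regressions = {}
    for metric, old in baseline.items():
        new = current.get(metric)
        # a metric missing from the current run is treated as a regression
        if new is None or old - new > tolerance:
            regressions[metric] = (old, new)
    return regressions


baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85}
current = {"faithfulness": 0.84, "answer_relevancy": 0.86}

print(find_regressions(current, baseline))
# {'faithfulness': (0.9, 0.84)}
```

In a pytest test this reduces to asserting that the returned dict is empty. The tolerance absorbs some judge noise, so builds do not fail on small fluctuations.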

How each metric is computed under the hood

Understanding the LLM prompts behind each metric helps you interpret edge cases and write your own custom metrics.

faithfulness — claim decomposition + verification

Ask the judge LLM to extract atomic claims from the answer (Generate a list of statements from the answer).
For each claim, ask the judge whether the claim can be inferred from the provided contexts (Yes / No).
Score = supported_claims / total_claims.

Implication: the score is a proportion, so a long answer with many trivially supported claims can offset a single hallucinated claim. Two answers containing the identical hallucination score differently when one of them surrounds it with more supported claims.
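
The dilution effect follows directly from the ratio above. A small arithmetic sketch, using made-up verdict lists in place of real judge output (faithfulness_score here is an illustrative helper, not the ragas implementation):

```python
def faithfulness_score(verdicts):
    """verdicts: one bool per extracted claim, True if the judge found it supported."""
    return sum(verdicts) / len(verdicts)


short_answer = [True, True, False]      # 2 supported claims + 1 hallucination
long_answer = [True] * 9 + [False]      # 9 supported claims + the same hallucination

print(round(faithfulness_score(short_answer), 4))   # 0.6667
print(round(faithfulness_score(long_answer), 4))    # 0.9
```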

answer_relevancy — reverse-question generation

The judge generates N hypothetical questions (default 3) that the answer would be a good response to.
Each generated question is embedded and compared to the original question via cosine similarity.
Score = mean cosine similarity.

A verbose answer that drifts off-topic earns low relevancy because the regenerated questions don't match the original. The metric requires embeddings, not just an LLM.
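
The scoring step can be sketched with toy vectors. This assumes the regenerated questions have already been embedded; the 2-D vectors below stand in for real embeddings, and relevancy_score is an illustrative helper rather than the ragas code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relevancy_score(question_vec, generated_vecs):
    """Mean cosine similarity between the original question and each regenerated question."""
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0]
on_topic = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]    # regenerated questions close to the original
off_topic = [[0.0, 1.0], [0.6, 0.8], [0.0, 1.0]]   # regenerated questions that drifted

print(round(relevancy_score(original, on_topic), 4))    # 0.8
print(round(relevancy_score(original, off_topic), 4))   # 0.2
```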

context_precision — relevance ranking of retrieved chunks

For each retrieved chunk, the judge LLM is asked: "Is this chunk relevant to answering the question, given the ground truth?" → 1 or 0.
The position of each "1" in the ranking matters — relevant chunks at rank 1 score higher than the same chunk at rank 5.
Score = mean-average-precision over the retrieved sequence.
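
The rank sensitivity is easiest to see with numbers. The sketch below uses one common formulation of average precision over binary relevance verdicts; the exact normalisation inside ragas may differ, and context_precision_score is an illustrative name.

```python
def context_precision_score(verdicts):
    """verdicts: 1/0 relevance per retrieved chunk, in retrieval order."""
    hits = 0
    weighted = 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            weighted += hits / rank   # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


print(round(context_precision_score([1, 0, 0]), 4))   # 1.0
print(round(context_precision_score([0, 0, 1]), 4))   # 0.3333
print(round(context_precision_score([1, 0, 1]), 4))   # 0.8333
```

The first two lists contain the same single relevant chunk; only its rank changes, and the score drops from 1.0 to 0.3333.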

context_recall — ground-truth statement coverage

The judge decomposes ground_truth into atomic statements.
For each statement, it checks whether the statement can be inferred from the union of retrieved contexts.
Score = covered_statements / total_statements.

answer_correctness — weighted factual + semantic

Factual: classify each claim in the answer vs the ground truth into True Positives (in both), False Positives (only in answer), False Negatives (only in ground truth). F1 from those.
Semantic: cosine similarity between answer and ground-truth embeddings.
Default weights: 0.75 factual, 0.25 semantic. Tune them with the weights parameter of AnswerCorrectness.

python

from ragas.metrics import AnswerCorrectness
ac = AnswerCorrectness(weights=[0.5, 0.5])    # equal factual + semantic
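
To see how the two components combine, here is the arithmetic as a standalone sketch. It assumes the claim counts and the embedding similarity are already known; answer_correctness_score is an illustrative helper, and the inputs are made up.

```python
def answer_correctness_score(tp, fp, fn, semantic_sim, weights=(0.75, 0.25)):
    """Weighted blend of claim-level F1 and embedding similarity."""
    denom = tp + 0.5 * (fp + fn)
    f1 = tp / denom if denom else 0.0
    w_factual, w_semantic = weights
    # normalise by the weight sum so weights need not add up to 1
    return (w_factual * f1 + w_semantic * semantic_sim) / (w_factual + w_semantic)


# 2 claims in both, 1 only in the answer, 1 only in the ground truth; similarity 0.9
print(round(answer_correctness_score(2, 1, 1, 0.9), 4))                       # 0.725
print(round(answer_correctness_score(2, 1, 1, 0.9, weights=(0.5, 0.5)), 4))   # 0.7833
```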

Diagnostic playbook — which metric → which problem

A drop in each metric points to a different layer of the pipeline. This table is the most actionable artefact of running ragas.

Symptom	Most likely cause	Where to fix
Low `faithfulness`	LLM is hallucinating beyond the contexts	Tighten the prompt ("only use the context"); switch to a more obedient model
Low `answer_relevancy`	Answer drifts, restates the question, or rambles	Reduce `max_tokens`; add "answer the question directly" to system prompt
Low `context_precision`	Retriever returns noisy / off-topic chunks	Add a reranker; raise the score threshold; reduce `k`
Low `context_recall`	Retriever misses chunks that contain the answer	Increase `k`; switch embedding model; smaller chunks
High `context_recall`, low `faithfulness`	Right info reached the LLM but it didn't use it	Tighten prompt; few-shot examples that quote the context
Low precision and recall	The whole retrieval layer is broken	Re-embed; check chunk size; verify the index isn't stale
`answer_correctness` low but `faithfulness` high	The retrieved contexts are wrong	Curate the corpus; remove outdated docs
High variance across runs	LLM judge nondeterminism	Set `temperature=0` on the judge; raise sample size
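
The table can be turned into a first-pass triage helper. The sketch below is a simplification under stated assumptions: diagnose, the single 0.8 threshold and the message strings are hypothetical, and it covers only the single-metric rows plus the "low precision and recall" row.

```python
def diagnose(scores, threshold=0.8):
    """Map low metrics to the pipeline layer to inspect first (hypothetical helper)."""
    low = {metric for metric, score in scores.items() if score < threshold}
    findings = []
    if {"context_precision", "context_recall"} <= low:
        findings.append("retrieval layer: re-embed, check chunk size, verify the index")
    elif "context_precision" in low:
        findings.append("retriever noise: add a reranker or reduce k")
    elif "context_recall" in low:
        findings.append("retriever misses: increase k or change the embedding model")
    if "faithfulness" in low:
        findings.append("generation: tighten the prompt to use only the context")
    if "answer_relevancy" in low:
        findings.append("generation: ask for a direct answer, reduce max_tokens")
    return findings


print(diagnose({"faithfulness": 0.62, "answer_relevancy": 0.9,
                "context_precision": 0.91, "context_recall": 0.88}))
# ['generation: tighten the prompt to use only the context']
```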

Configuring metric internals

Most metrics expose parameters through their class form. Faithfulness accepts max_retries for judge calls that return unparseable output; AnswerRelevancy exposes strictness (the number of regenerated questions).

python

from ragas import evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision

faith = Faithfulness(
    name="faithfulness_strict",
    max_retries=2,           # retry the judge on JSON parse errors
)

rel = AnswerRelevancy(
    name="relevancy_high",
    strictness=5,            # generate 5 reverse-questions instead of 3
)

precision = ContextPrecision(
    name="precision_with_ref",
    # Provide a custom prompt template by subclassing if needed
)

result = evaluate(dataset, metrics=[faith, rel, precision])

Output:

text

{'faithfulness_strict': 0.9333, 'relevancy_high': 0.8721, 'precision_with_ref': 0.9100}

Local judge — use a small open-source model

LLM-as-judge calls add up. For dev iteration, point ragas at a local model via Ollama or vLLM.

python

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings

llm = LangchainLLMWrapper(ChatOllama(model="qwen2:7b", temperature=0))
emb = LangchainEmbeddingsWrapper(HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5"))

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy], llm=llm, embeddings=emb)
print(result)

Output:

text

{'faithfulness': 0.91, 'answer_relevancy': 0.84}

bash

ollama serve &
ollama pull qwen2:7b
python eval.py

Output:

text

pulling manifest
pulling 73895fa49e92... 100%
verifying sha256 digest
success
{'faithfulness': 0.91, 'answer_relevancy': 0.84}

A small judge can be biased — it tends to be over-lenient on faithfulness (everything looks supported) and under-confident on relevancy. For final/release evaluation, use a strong judge (Claude Sonnet, GPT-4o); use the local judge only for fast iteration.

Cost control — caching, sampling, and judging only what changed

Each metric makes 1–5 LLM calls per row. A 500-row eval with 5 metrics easily hits 5 000 judge calls.

python

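import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# On-disk cache for identical (prompt, inputs). Confirm that your ragas version honours this
# variable; some releases configure caching through a cache object passed to the LLM wrapper.
os.environ["RAGAS_CACHE_PATH"] = "./.ragas_cache"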

# Sample 100 rows for fast feedback during dev
small = dataset.shuffle(seed=42).select(range(100))
quick = evaluate(small, metrics=[faithfulness, answer_relevancy])

# Full run for nightly / pre-release
full = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])

Output:

text

quick (100 rows): {'faithfulness': 0.89, 'answer_relevancy': 0.85}
full  (500 rows): {'faithfulness': 0.88, 'answer_relevancy': 0.86, 'context_precision': 0.91, 'context_recall': 0.84}
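
The call-count estimate at the top of this section is simple multiplication. A sketch, assuming the 1 to 5 calls per metric per row quoted above (judge_call_range is an illustrative helper):

```python
def judge_call_range(rows, metrics, calls_per_metric=(1, 5)):
    """Lower and upper bound on judge calls for one evaluation run."""
    low, high = calls_per_metric
    return rows * metrics * low, rows * metrics * high


print(judge_call_range(500, 5))   # (2500, 12500)
print(judge_call_range(100, 2))   # (200, 1000)
```

The second line corresponds to the 100-row, two-metric dev subset, which is why sampling is an effective way to cut cost.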

Async and concurrent execution

evaluate() is a blocking call, but it dispatches judge calls concurrently under the hood. Pass run_config=RunConfig(max_workers=...) to raise or cap that concurrency. The judge LLM's rate limit is usually the bottleneck.

python

from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics import faithfulness, answer_relevancy
import time

cfg = RunConfig(
    timeout=120,        # per-judge-call timeout
    max_retries=3,      # retry on transient errors
    max_wait=60,        # exponential backoff cap
    max_workers=8,      # concurrent judge calls
)

start = time.time()
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy], run_config=cfg)
print(f"Evaluated {len(dataset)} rows in {time.time()-start:.1f}s")

Output:

text

Evaluated 500 rows in 78.3s

Saving and loading datasets

ragas uses the Hugging Face datasets library, so the standard save_to_disk / load_from_disk round-trip works.

bash

mkdir -p ./eval_datasets

Output: (none — exits 0 on success)

python

from datasets import load_from_disk

# Save dataset
dataset.save_to_disk("./eval_datasets/qa-v1")

# Save eval result alongside
result.to_pandas().to_csv("./eval_datasets/qa-v1.scores.csv", index=False)

# Reload later
ds = load_from_disk("./eval_datasets/qa-v1")
print(f"Loaded {len(ds)} rows")

Output:

text

Loaded 500 rows

Integration with LangSmith

Push ragas scores into LangSmith as feedback so the runs appear with metric scores in the LangSmith UI.

python

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from langsmith import Client
import os

ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])

# Assumes each dataset row carries a run_id column linking it back to the original LangSmith run,
# and that result.to_pandas() keeps that column alongside the scores
result_df = evaluate(dataset, metrics=[faithfulness, answer_relevancy]).to_pandas()
for _, row in result_df.iterrows():
    ls.create_feedback(
        run_id=row["run_id"],
        key="faithfulness",
        score=float(row["faithfulness"]),
        comment="ragas auto-eval",
    )
    ls.create_feedback(
        run_id=row["run_id"],
        key="answer_relevancy",
        score=float(row["answer_relevancy"]),
    )
print(f"Pushed scores for {len(result_df)} runs")

Output:

text

Pushed scores for 500 runs

Integration with LlamaIndex

python

from ragas.integrations.llama_index import evaluate as li_evaluate
from ragas.metrics import faithfulness, answer_relevancy
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

questions = [
    "What is multi-head attention?",
    "Why use positional encoding?",
]
ground_truths = [
    "Multiple attention heads attending to different subspaces.",
    "Transformers have no recurrence so order info must be added explicitly.",
]

result = li_evaluate(
    query_engine=query_engine,
    metrics=[faithfulness, answer_relevancy],
    dataset={"question": questions, "ground_truth": ground_truths},
)
print(result)

Output:

text

{'faithfulness': 0.93, 'answer_relevancy': 0.87}

Real-world recipes

Recipe: A/B test two embedding models

Build the same RAG pipeline twice with different embedding models and compare ragas scores.

python

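from datasets import Dataset
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# Assumed to exist already: docs (chunked Documents), llm (a chat model),
# build_prompt(question, hits), and the eval_questions / eval_truths lists.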

variants = {
    "minilm":   HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    "bge_base": HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"),
}

scores = {}
for name, emb in variants.items():
    store = Chroma.from_documents(docs, embedding=emb, collection_name=f"eval_{name}")
    retriever = store.as_retriever(search_kwargs={"k": 5})
    rows = []
    for q, gt in zip(eval_questions, eval_truths):
        hits = retriever.invoke(q)   # get_relevant_documents() is deprecated in recent LangChain
        answer = llm.invoke(build_prompt(q, hits)).content
        rows.append({
            "question": q,
            "answer":   answer,
            "contexts": [h.page_content for h in hits],
            "ground_truth": gt,
        })
    ds = Dataset.from_list(rows)
    result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_recall])
    scores[name] = result

for name, s in scores.items():
    print(f"{name:>10}  faith={s['faithfulness']:.3f}  rel={s['answer_relevancy']:.3f}  recall={s['context_recall']:.3f}")

Output:

text

    minilm  faith=0.852  rel=0.811  recall=0.738
  bge_base  faith=0.891  rel=0.842  recall=0.823

Recipe: chunk-size sweep

Find the sweet spot between chunk size and retrieval quality.

python

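from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness

# Assumed to exist already: raw_docs, emb, eval_questions, eval_truths, and the
# helpers build_rag(store) and run_pipeline(pipeline, questions, truths).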

scores = {}
for chunk_size in (256, 512, 1024, 2048):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 8)
    chunks = splitter.split_documents(raw_docs)
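    # Separate collection per chunk size so chunks from other sizes cannot pile up in a shared default collection
    store = Chroma.from_documents(chunks, embedding=emb, collection_name=f"chunks_{chunk_size}")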
    pipeline = build_rag(store)
    ds = run_pipeline(pipeline, eval_questions, eval_truths)
    scores[chunk_size] = evaluate(ds, metrics=[context_precision, context_recall, faithfulness])

import json
print(json.dumps({k: {m: round(v, 3) for m, v in s.items()} for k, s in scores.items()}, indent=2))

Output:

text

{
  "256":  {"context_precision": 0.71, "context_recall": 0.92, "faithfulness": 0.88},
  "512":  {"context_precision": 0.84, "context_recall": 0.91, "faithfulness": 0.91},
  "1024": {"context_precision": 0.86, "context_recall": 0.83, "faithfulness": 0.90},
  "2048": {"context_precision": 0.82, "context_recall": 0.71, "faithfulness": 0.86}
}

Recipe: CI gate with per-metric thresholds

python

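from datasets import load_from_disk
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

THRESHOLDS = {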
    "faithfulness":      0.85,
    "answer_relevancy":  0.80,
    "context_recall":    0.75,
}

def main():
    ds = load_from_disk("./eval_datasets/golden-v3")
    result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_recall])

    failures = [
        f"{k}: {result[k]:.3f} < {v:.2f}"
        for k, v in THRESHOLDS.items()
        if result[k] < v
    ]
    if failures:
        for f in failures:
            print(f"FAIL  {f}")
        raise SystemExit(1)
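    print("PASS  all thresholds met")


# Without this guard, `python -m scripts.eval_gate` would import the module and never run main()
if __name__ == "__main__":
    main()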

bash

python -m scripts.eval_gate

Output:

text

PASS  all thresholds met

Recipe: bulk-score production traces

Score yesterday's production traces overnight and store the metrics for trend analysis.

python

from datetime import datetime, timedelta, timezone
from langsmith import Client
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

ls = Client()
runs = list(ls.list_runs(
    project_name="prod",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    run_type="chain",
    limit=2000,
))

rows = [
    {
        "question":     r.inputs.get("question", ""),
        "answer":       r.outputs.get("answer", ""),
        "contexts":     r.outputs.get("sources", []) or [""],
        "ground_truth": "",            # leave empty for ref-free metrics only
        "run_id":       str(r.id),
    }
    for r in runs
    if r.outputs and r.inputs.get("question")
]
ds = Dataset.from_list(rows)
result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
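# datetime.utcnow() is deprecated; use an aware UTC timestamp. The ./scores directory must already exist.
result.to_pandas().to_parquet(f"./scores/{datetime.now(timezone.utc):%Y-%m-%d}.parquet")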
print(f"Scored {len(rows)} prod runs")

Output:

text

Scored 1184 prod runs

Performance and reliability tips

Set temperature=0 on the judge LLM. Default sampling causes 5–10% score variance between runs.
For long contexts, the judge may truncate. Verify the judge model's context window can hold question + all retrieved chunks + ground truth.
ragas raises ExceptionInRunner when too many judge calls fail in a row. Lower max_workers or raise max_retries in RunConfig.
Embedding-based metrics (answer_relevancy, answer_correctness) cache per-text embeddings — reuse the same embeddings object across calls.
Always seed shuffle(seed=...) so subset runs are comparable across iterations.
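
To measure how much judge nondeterminism affects your own setup, one approach is to run the same evaluation a few times and look at the spread. A sketch in plain Python, assuming each run's aggregate scores have been collected into a dict; summarise_runs and the numbers are hypothetical.

```python
import statistics


def summarise_runs(run_scores):
    """run_scores: list of {metric: score} dicts from repeated runs on the same dataset."""
    summary = {}
    for metric in run_scores[0]:
        values = [run[metric] for run in run_scores]
        # (mean, sample standard deviation) per metric
        summary[metric] = (round(statistics.mean(values), 4),
                           round(statistics.stdev(values), 4))
    return summary


runs = [
    {"faithfulness": 0.90},
    {"faithfulness": 0.86},
    {"faithfulness": 0.88},
]
print(summarise_runs(runs))   # {'faithfulness': (0.88, 0.02)}
```

A CI threshold set closer to the mean than a couple of standard deviations is likely to fail intermittently.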

Quick reference

Metric	Requires	Measures
`faithfulness`	question, answer, contexts	Fraction of answer claims grounded in contexts
`answer_relevancy`	question, answer, contexts	How directly the answer addresses the question
`context_precision`	question, contexts, ground_truth	Whether relevant chunks are ranked at the top of the retrieved list (rank-aware)
`context_recall`	question, contexts, ground_truth	Coverage of ground truth by retrieved chunks
`answer_correctness`	question, answer, ground_truth	Factual + semantic similarity to reference

Task	Code
Basic evaluation	`evaluate(dataset, metrics=[...])`
Custom LLM judge	`evaluate(..., llm=LangchainLLMWrapper(model))`
Custom embeddings	`evaluate(..., embeddings=LangchainEmbeddingsWrapper(emb))`
Concurrency	`evaluate(..., run_config=RunConfig(max_workers=8))`
Cache judge calls	`os.environ["RAGAS_CACHE_PATH"] = "./.ragas_cache"`
To pandas	`result.to_pandas()`
Save dataset	`dataset.save_to_disk("./path")`
Load dataset	`load_from_disk("./path")`
Generate testset	`TestsetGenerator.from_langchain(...).generate_with_langchain_docs(...)`
LangSmith push	`ls.create_feedback(run_id, key="faithfulness", score=...)`
LlamaIndex eval	`ragas.integrations.llama_index.evaluate(query_engine, metrics, dataset)`
Dataset format	`Dataset.from_dict({"question": [], "answer": [], "contexts": [[]], "ground_truth": []})`