cheat sheet
TruLens
Evaluate and monitor LLM applications with TruLens. Covers the RAG triad, feedback functions, TruChain, TruLlama, custom evaluators, the dashboard, and CI integration.
TruLens — LLM App Evaluation
What it is
TruLens is a Python library for evaluating and monitoring LLM-powered applications, with a particular focus on RAG pipelines. It defines the RAG Triad — three feedback functions (Answer Relevance, Context Relevance, and Groundedness) that together diagnose whether a RAG system retrieves the right information and generates faithful, on-topic answers. TruLens records every LLM call, computes feedback scores automatically, and surfaces results in a local web dashboard so you can compare runs and catch regressions.
Install
pip install trulens-eval
pip install trulens-eval[langchain] # LangChain integration
pip install trulens-eval[llama-index] # LlamaIndex integration
Output: (none — exits 0 on success)
Quick example
from trulens_eval import Tru, TruBasicApp, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

tru = Tru()  # starts local SQLite database
provider = TruOpenAI(model_engine="gpt-4o-mini")

# Simple RAG stub
def rag(question: str) -> str:
    return "Attention allows the model to weigh input tokens by relevance."

f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
tru_rag = TruBasicApp(rag, app_id="my-rag-v1", feedbacks=[f_answer_relevance])

# Call the function through tru_rag.app so the call is recorded
with tru_rag as recording:
    answer = tru_rag.app("What is attention in transformers?")
print(answer)
print(tru.get_leaderboard())
Output:
Attention allows the model to weigh input tokens by relevance.
app_id Answer Relevance total_cost total_tokens
0 my-rag-v1 0.92 0.0003 142
When / why to use it
- Systematically evaluating RAG pipelines across the three quality dimensions of the RAG Triad: context relevance, groundedness, and answer relevance.
- Comparing pipeline versions side-by-side in the built-in dashboard — swap the embedding model or chunk size, run both, compare scores.
- Detecting regressions in CI: fail the build if groundedness drops below a threshold.
- Debugging: low Context Relevance means the retriever is fetching off-topic chunks; low Groundedness means the LLM is ignoring retrieved context.
- Building a leaderboard across LLM providers, prompt templates, or retrieval strategies.
Common pitfalls
TruLens uses LLMs as judges: each feedback function makes one or more LLM calls per evaluated record, so evaluating 1 000 records × 3 feedback functions means at least 3 000 judge calls. Use a cheaper judge model (e.g. gpt-4o-mini), and keep recorded results in the database so later runs can be compared against them; call tru.reset_database() between runs only when you want a clean slate.
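A back-of-the-envelope estimate helps before launching a large run. The sketch below is plain Python with no TruLens calls; `calls_per_feedback`, `tokens_per_call`, and `usd_per_million_tokens` are illustrative assumptions, not measured values or real prices, so substitute figures for your own judge model.

```python
def estimate_judge_cost(
    n_records: int,
    n_feedbacks: int,
    calls_per_feedback: float = 1.0,       # assumption: some feedbacks make several calls
    tokens_per_call: int = 500,            # assumption: prompt plus completion tokens
    usd_per_million_tokens: float = 0.50,  # assumption: placeholder price, not a quote
) -> dict:
    """Rough count of judge-LLM calls, tokens, and cost for one evaluation run."""
    calls = n_records * n_feedbacks * calls_per_feedback
    tokens = calls * tokens_per_call
    return {
        "calls": int(calls),
        "tokens": int(tokens),
        "usd": round(tokens / 1_000_000 * usd_per_million_tokens, 2),
    }


estimate = estimate_judge_cost(n_records=1000, n_feedbacks=3)
print(estimate)
```

Treat the result as a lower bound: feedbacks that score each retrieved chunk separately multiply the call count by the number of chunks.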
Tru() writes to a local SQLite file, by default default.sqlite in the working directory. Set database_url to a persistent path: Tru(database_url="sqlite:///evals/trulens.sqlite"). Deleting or moving this file loses all recorded runs.
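SQLite creates the database file on first use but not its parent directory, so a path like evals/trulens.sqlite fails if evals/ does not exist yet. A minimal sketch of a helper that prepares the directory and builds the URL; `sqlite_url` is a hypothetical name, not part of TruLens.

```python
from pathlib import Path


def sqlite_url(path: str) -> str:
    """Create the parent directory if needed and return a SQLite database URL."""
    db_path = Path(path)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return f"sqlite:///{db_path.as_posix()}"


url = sqlite_url("evals/trulens.sqlite")
print(url)  # pass this string as database_url
```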
App ID must be unique per pipeline version: if you reuse the same app_id across different code versions, runs are merged in the dashboard. Use versioned IDs like "rag-v2-chroma-k5" to keep experiments separate.
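One way to keep IDs consistent is to derive them from the pipeline configuration rather than typing them by hand. The helper below is a sketch; `make_app_id` and its naming scheme are assumptions chosen to reproduce the example ID above.

```python
def make_app_id(name: str, version: int, **params) -> str:
    """Build an ID such as 'rag-v2-chroma-k5' from a name, a version, and config values."""
    parts = [name, f"v{version}"]
    for key, value in params.items():
        # String values are used as-is; other values get the key as a prefix (k=5 -> "k5").
        parts.append(value if isinstance(value, str) else f"{key}{value}")
    return "-".join(parts)


app_id = make_app_id("rag", 2, store="chroma", k=5)
print(app_id)
```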
Use tru.start_dashboard() to open the local Streamlit dashboard in the browser. It shows per-run scores, a leaderboard, and a record-level trace viewer, with no external service required.
The three RAG Triad metrics are complementary diagnostics, not a single score. Always evaluate all three together: a pipeline with high Answer Relevance but low Groundedness is hallucinating, while one with high Groundedness but low Context Relevance retrieved the wrong chunks.
The RAG Triad
The RAG Triad is TruLens's core evaluation framework. Each dimension measures a different failure mode in the retrieve-then-generate pipeline.
Question → Retriever → [context chunks] → LLM → Answer

Context Relevance: question vs. retrieved context chunks
Groundedness: retrieved context chunks vs. answer (is the answer supported by context?)
Answer Relevance: question vs. answer
| Metric | Question it answers | Low score means… |
|---|---|---|
| Context Relevance | Are retrieved chunks relevant to the question? | Retriever is fetching noise |
| Groundedness | Is the answer supported by retrieved chunks? | LLM is hallucinating |
| Answer Relevance | Does the answer address the question? | LLM is off-topic or verbose |
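The table can be turned into a small triage helper that maps low scores to the likely failure mode. This is a plain-Python sketch: `diagnose` is a hypothetical function, and the 0.7 threshold is an arbitrary assumption to tune per project.

```python
def diagnose(scores: dict, threshold: float = 0.7) -> list:
    """List the likely failure mode for every RAG Triad metric below the threshold."""
    hints = {
        "Context Relevance": "retriever is fetching noise",
        "Groundedness": "LLM is hallucinating",
        "Answer Relevance": "LLM is off-topic or verbose",
    }
    # A metric missing from `scores` is treated as 0.0 and therefore flagged.
    return [
        f"{metric}: {hint}"
        for metric, hint in hints.items()
        if scores.get(metric, 0.0) < threshold
    ]


findings = diagnose(
    {"Context Relevance": 0.45, "Groundedness": 0.92, "Answer Relevance": 0.88}
)
print(findings)
```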
Feedback functions
Feedback functions are the building blocks of evaluation. TruLens provides a library of pre-built feedback functions for common tasks (relevance, coherence, sentiment) and lets you define custom ones.
from trulens_eval import Feedback, TruChain
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
import numpy as np

provider = TruOpenAI(model_engine="gpt-4o-mini")

# `chain` is the LangChain app under evaluation (built in the TruChain section);
# select_context needs it to locate the retriever output.
context = TruChain.select_context(chain)

# Answer Relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance, name="Answer Relevance")
    .on_input_output()
)

# Context Relevance: are retrieved chunks relevant to the question?
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)  # one score per chunk, averaged
)

# Groundedness: is the answer supported by the retrieved chunks?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # judge the answer against all chunks together
    .on_output()
)
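Context Relevance is computed once per retrieved chunk and then reduced to a single number by the aggregator. The sketch below uses made-up per-chunk scores to show how the choice of aggregator changes what gets reported; it assumes the aggregator can be any function that reduces a list of floats, which is how np.mean is used above.

```python
from statistics import mean

# Hypothetical per-chunk Context Relevance scores for one question (k=3 chunks)
chunk_scores = [0.9, 0.8, 0.1]

avg_score = mean(chunk_scores)   # the value a mean aggregator reports
worst_score = min(chunk_scores)  # a stricter reduction: surfaces the one off-topic chunk

print(round(avg_score, 2), worst_score)
```

A mean hides a single irrelevant chunk among good ones; a minimum makes it visible at the price of noisier scores.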
TruChain — evaluating LangChain RAG pipelines
TruChain wraps a LangChain Runnable or chain, records every invocation, and runs the configured feedback functions after each call.
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import numpy as np
import os

os.makedirs("evals", exist_ok=True)  # SQLite will not create the directory itself
tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
provider = TruOpenAI(model_engine="gpt-4o-mini")

# --- build a minimal LangChain RAG chain ---
vectorstore = Chroma.from_texts(
    texts=[
        "Transformers use self-attention to process all tokens simultaneously.",
        "BERT uses masked language modelling to learn bidirectional representations.",
        "GPT trains as a left-to-right language model predicting the next token.",
    ],
    embedding=OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"]),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- feedback functions ---
context = TruChain.select_context(chain)  # path to the retriever output of this chain
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # judge the answer against all chunks together
    .on_output()
)

# --- wrap with TruChain ---
tru_chain = TruChain(
    chain,
    app_id="langchain-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

questions = [
    "What is self-attention?",
    "How does BERT differ from GPT?",
    "What is positional encoding?",
]

# Calls made inside the with-block are recorded and scored
with tru_chain as recording:
    for q in questions:
        chain.invoke(q)

leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"])
print(leaderboard)
Output:
app_id Answer Relevance Context Relevance Groundedness total_cost
0 langchain-rag-v1 0.91 0.87 0.94 0.0041
TruLlama — evaluating LlamaIndex RAG pipelines
TruLlama is the LlamaIndex equivalent of TruChain — it wraps any LlamaIndex query engine or chat engine.
from trulens_eval import Tru, TruLlama, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
import numpy as np, os
tru = Tru()
provider = TruOpenAI(model_engine="gpt-4o-mini")
# --- build a LlamaIndex query engine ---
Settings.llm = LlamaOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
# --- feedback functions ---
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
context = TruLlama.select_source_nodes().node.text  # text of each retrieved node
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # judge the answer against all nodes together
    .on_output()
)

# --- wrap with TruLlama ---
tru_query_engine = TruLlama(
    query_engine,
    app_id="llamaindex-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

with tru_query_engine as recording:
    response = query_engine.query("What are the main topics covered?")
print(response)
print(tru.get_leaderboard())
Output:
A summary of the main topics in the documents.
app_id Answer Relevance Context Relevance Groundedness
0 llamaindex-rag-v1 0.93 0.89 0.96
Custom feedback functions
Any Python function that returns a float between 0.0 and 1.0 can be a feedback function.
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
provider = TruOpenAI(model_engine="gpt-4o-mini")
# Built-in: conciseness check via LLM judge
f_conciseness = Feedback(provider.conciseness, name="Conciseness").on_output()
# Custom: Python-only word-count ratio (no LLM call)
def word_count_score(answer: str) -> float:
    """Score 1.0 for answers under 100 words, decaying for longer answers."""
    words = len(answer.split())
    if words <= 100:
        return 1.0
    return max(0.0, 1.0 - (words - 100) / 200)

f_brevity = Feedback(word_count_score, name="Brevity").on_output()

# Custom: check for citation presence
import re

def has_citation(answer: str) -> float:
    """Returns 1.0 if the answer contains a citation pattern like [1] or (source:)."""
    return 1.0 if re.search(r"\[\d+\]|\(source:", answer, re.IGNORECASE) else 0.0

f_citation = Feedback(has_citation, name="Has Citation").on_output()
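Custom functions can also take two arguments, for example the retrieved context and the answer. The sketch below is a crude lexical stand-in for groundedness that needs no LLM call; `lexical_support` and its four-letter word filter are assumptions for illustration, and word overlap is a much weaker signal than an LLM judge.

```python
import re


def _words(text: str) -> set:
    """Lower-cased words of four or more letters."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def lexical_support(context: str, answer: str) -> float:
    """Fraction of distinct answer words that also occur in the context (0.0 to 1.0)."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _words(context)) / len(answer_words)


context = "Transformers use self-attention to process all tokens simultaneously."
answer = "Transformers process tokens with attention."
score = lexical_support(context, answer)
print(score)
```

Such a function would presumably be attached the same way as the groundedness feedback, with a context selector first and `.on_output()` second, matching the argument order.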
Alternate LLM providers as judges
TruLens supports Anthropic, Hugging Face, Bedrock, and local models as judge LLMs via provider wrappers.
from trulens_eval import Feedback
from trulens_eval.feedback.provider import Bedrock as TruBedrock
from trulens_eval.feedback.provider import Huggingface as TruHuggingface
import os

# Anthropic Claude as judge (via the LangChain provider wrapper)
from trulens_eval.feedback.provider.langchain import Langchain
from langchain_anthropic import ChatAnthropic

provider = Langchain(
    chain=ChatAnthropic(model="claude-haiku-4-5-20251001", api_key=os.environ["ANTHROPIC_API_KEY"])
)

# Hugging Face provider: scores with small classifier models instead of a judge LLM.
# It calls the Hugging Face Inference API, so check whether HUGGINGFACE_API_KEY is needed.
hf_provider = TruHuggingface()
f_not_toxic_hf = (
    Feedback(hf_provider.not_toxic, name="Not Toxic")
    .on_output()
)
The TruLens dashboard
The dashboard is a local Streamlit web app that shows all recorded runs, leaderboard scores, and per-record trace details.
from trulens_eval import Tru
tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
# Open dashboard in browser (served by a background Streamlit process)
tru.start_dashboard(port=8501, force=True)
# Or just print the leaderboard to stdout
leaderboard = tru.get_leaderboard()
print(leaderboard.to_string(index=False))
# Export all records as a DataFrame for custom analysis
records, feedback_col = tru.get_records_and_feedback(app_ids=["langchain-rag-v1"])
print(records[["input", "output", "Answer Relevance", "Groundedness"]].head())
Output:
input output Answer Relevance Groundedness
0 What is self-attention? Self-attention computes ... 0.94 0.97
1 How does BERT differ from … BERT is bidirectional w… 0.90 0.93
CI integration — fail on score regression
from trulens_eval import Tru
def test_rag_quality():
    tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
    leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"])
    row = leaderboard[leaderboard["app_id"] == "langchain-rag-v1"].iloc[0]
    assert row["Answer Relevance"] >= 0.85, (
        f"Answer Relevance {row['Answer Relevance']:.2f} below 0.85"
    )
    assert row["Groundedness"] >= 0.80, (
        f"Groundedness {row['Groundedness']:.2f} below 0.80"
    )
    assert row["Context Relevance"] >= 0.75, (
        f"Context Relevance {row['Context Relevance']:.2f} below 0.75"
    )
pytest test_eval.py # fails if any metric regresses below threshold
Output: (none — exits 0 on success)
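Chained asserts stop at the first failure, so a build log shows only one regressed metric at a time. A sketch of a helper that reports every metric below its minimum in one pass; `check_thresholds` is a hypothetical name, and the sample row is invented rather than taken from a real run.

```python
def check_thresholds(row: dict, thresholds: dict) -> list:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        score = row.get(metric)
        if score is None:
            failures.append(f"{metric} missing from leaderboard row")
        elif score < minimum:
            failures.append(f"{metric} {score:.2f} below {minimum:.2f}")
    return failures


THRESHOLDS = {"Answer Relevance": 0.85, "Groundedness": 0.80, "Context Relevance": 0.75}

# Invented leaderboard row; in a real test this would come from the leaderboard DataFrame
row = {"Answer Relevance": 0.91, "Groundedness": 0.72, "Context Relevance": 0.70}
failures = check_thresholds(row, THRESHOLDS)
print(failures)
```

In a pytest test, a single `assert not failures, failures` then prints the full list when the build fails.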
Comparing pipeline versions
from trulens_eval import Tru, TruChain
tru = Tru()

# chain_k2, chain_k5, eval_questions, and the feedback functions are assumed to be
# defined as in the TruChain section (same chain, different retriever k).
# Version A: k=2 retriever
tru_v1 = TruChain(chain_k2, app_id="rag-k2", feedbacks=[f_answer_relevance, f_groundedness])
# Version B: k=5 retriever
tru_v2 = TruChain(chain_k5, app_id="rag-k5", feedbacks=[f_answer_relevance, f_groundedness])

for question in eval_questions:
    with tru_v1 as rec:
        chain_k2.invoke(question)
    with tru_v2 as rec:
        chain_k5.invoke(question)
# Both appear side-by-side in the leaderboard
print(tru.get_leaderboard(app_ids=["rag-k2", "rag-k5"]))
Output:
app_id Answer Relevance Groundedness total_cost
rag-k2 0.86 0.88 0.0021
rag-k5 0.91 0.94 0.0034
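To see at a glance what a change bought and what it cost, the two leaderboard rows can be subtracted. The sketch below works on plain dicts holding the numbers from the output above; `score_deltas` is a hypothetical helper, not a TruLens function.

```python
def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Candidate minus baseline for every column both versions report."""
    shared = baseline.keys() & candidate.keys()
    return {name: round(candidate[name] - baseline[name], 4) for name in sorted(shared)}


rag_k2 = {"Answer Relevance": 0.86, "Groundedness": 0.88, "total_cost": 0.0021}
rag_k5 = {"Answer Relevance": 0.91, "Groundedness": 0.94, "total_cost": 0.0034}

deltas = score_deltas(rag_k2, rag_k5)
print(deltas)
```

Here k=5 improves both quality scores by a few points while raising total_cost, which is the trade-off to weigh before switching versions.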
Quick reference
| Metric | What it measures | Low score means… |
|---|---|---|
| Answer Relevance | Does the answer address the question? | Off-topic or verbose response |
| Context Relevance | Are retrieved chunks relevant? | Retriever fetching noise |
| Groundedness | Is the answer supported by context? | LLM hallucinating |
| Task | Code |
|---|---|
| Init TruLens | tru = Tru(database_url="sqlite:///eval.sqlite") |
| Create provider | provider = TruOpenAI(model_engine="gpt-4o-mini") |
| Answer relevance | Feedback(provider.relevance).on_input_output() |
| Context relevance | Feedback(provider.context_relevance).on_input().on(TruChain.select_context(chain)).aggregate(np.mean) |
| Groundedness | Feedback(provider.groundedness_measure_with_cot_reasons).on(TruChain.select_context(chain).collect()).on_output() |
| Wrap LangChain | TruChain(chain, app_id="v1", feedbacks=[...]) |
| Wrap LlamaIndex | TruLlama(query_engine, app_id="v1", feedbacks=[...]) |
| Record run | with tru_chain as rec: chain.invoke(q) |
| Leaderboard | tru.get_leaderboard(app_ids=["v1"]) |
| Dashboard | tru.start_dashboard(port=8501) |
| Export records | tru.get_records_and_feedback(app_ids=["v1"]) |
| Reset DB | tru.reset_database() |