cheat sheet

TruLens

Evaluate and monitor LLM applications with TruLens. Covers the RAG triad, feedback functions, TruChain, TruLlama, custom evaluators, the dashboard, and CI integration.

TruLens — LLM App Evaluation

What it is

TruLens is a Python library for evaluating and monitoring LLM-powered applications, with a particular focus on RAG pipelines. It defines the RAG Triad — three feedback functions (Answer Relevance, Context Relevance, and Groundedness) that together diagnose whether a RAG system retrieves the right information and generates faithful, on-topic answers. TruLens records every LLM call, computes feedback scores automatically, and surfaces results in a local web dashboard so you can compare runs and catch regressions.

Install

bash
pip install trulens-eval
pip install "trulens-eval[langchain]"      # LangChain integration (quoted so zsh does not expand the brackets)
pip install "trulens-eval[llama-index]"    # LlamaIndex integration

Output: pip prints its installation log and exits 0 on success.

Quick example

python
from trulens_eval import Tru, TruBasicApp, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

tru = Tru()   # starts local SQLite database

provider = TruOpenAI(model_engine="gpt-4o-mini")

# Simple RAG stub
def rag(question: str) -> str:
    return "Attention allows the model to weigh input tokens by relevance."

f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

tru_rag = TruBasicApp(rag, app_id="my-rag-v1", feedbacks=[f_answer_relevance])

with tru_rag as recording:
    # TruBasicApp only records calls made through the wrapper's .app attribute
    answer = tru_rag.app("What is attention in transformers?")

print(answer)
print(tru.get_leaderboard())

Output:

text
Attention allows the model to weigh input tokens by relevance.
           app_id  Answer Relevance  total_cost  total_tokens
0  my-rag-v1             0.92        0.0003           142

When / why to use it

  • Systematically evaluating RAG pipelines across the three RAG Triad dimensions: Context Relevance, Groundedness, and Answer Relevance.
  • Comparing pipeline versions side-by-side in the built-in dashboard — swap the embedding model or chunk size, run both, compare scores.
  • Detecting regressions in CI: fail the build if groundedness drops below a threshold.
  • Debugging: low Context Relevance means the retriever is fetching off-topic chunks; low Groundedness means the LLM is ignoring retrieved context.
  • Building a leaderboard across LLM providers, prompt templates, or retrieval strategies.

Common pitfalls

TruLens uses LLMs as judges: each feedback function makes one or more LLM calls per evaluated record, so evaluating 1 000 records × 3 feedback functions means at least 3 000 judge calls. Use a cheaper judge model (e.g. gpt-4o-mini) and evaluate a representative sample first. Note that tru.reset_database() does not cache anything; it deletes all recorded results, so call it only when you want a clean slate.
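A rough estimate before a large run can make the cost concrete. The sketch below is plain Python, not a TruLens API; the tokens-per-call figure and the price per million tokens are placeholder assumptions that should be replaced with numbers from your own provider.

```python
def estimate_judge_cost(
    n_records: int,
    n_feedbacks: int,
    calls_per_feedback: float = 1.0,       # assumption: chunk-level metrics make more calls
    tokens_per_call: int = 500,            # assumption: prompt plus completion tokens
    usd_per_million_tokens: float = 0.50,  # assumption: placeholder price
) -> dict:
    """Return a rough call, token, and cost estimate for one evaluation run."""
    calls = n_records * n_feedbacks * calls_per_feedback
    tokens = calls * tokens_per_call
    return {
        "judge_calls": int(calls),
        "tokens": int(tokens),
        "usd": round(tokens / 1_000_000 * usd_per_million_tokens, 4),
    }


estimate = estimate_judge_cost(n_records=1000, n_feedbacks=3)
print(estimate)
```

Raising calls_per_feedback to the retriever's k gives a more pessimistic figure for metrics that score each chunk separately.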

Tru() writes to a local SQLite file — by default default.sqlite in the working directory. Set database_url to a persistent path: Tru(database_url="sqlite:///evals/trulens.sqlite"). Deleting or moving this file loses all recorded runs.
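SQLite creates the database file on first use but not a missing parent directory, so a path like evals/trulens.sqlite fails if evals/ does not exist yet. A minimal sketch of a helper that prepares the directory and builds the URL; sqlite_url is a hypothetical name, not part of TruLens.

```python
from pathlib import Path


def sqlite_url(db_path: str) -> str:
    """Create the parent directory if needed and return a SQLite database URL."""
    path = Path(db_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Three slashes followed by a relative path, matching the example above.
    return f"sqlite:///{path.as_posix()}"


url = sqlite_url("evals/trulens.sqlite")
print(url)  # pass this as Tru(database_url=url)
```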

App ID must be unique per pipeline version — if you reuse the same app_id across different code versions, runs are merged in the dashboard. Use versioned IDs like "rag-v2-chroma-k5" to keep experiments separate.
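One way to keep IDs consistent is to derive them from the pipeline configuration rather than typing them by hand. The helper below is a hypothetical sketch (make_app_id is not a TruLens function): string values are appended as-is and other values are prefixed with their parameter name.

```python
def make_app_id(name: str, version: int, **params) -> str:
    """Build an app_id such as 'rag-v2-chroma-k5' from a name, version, and parameters."""
    parts = [name, f"v{version}"]
    for key, value in params.items():  # keyword order is preserved
        parts.append(value if isinstance(value, str) else f"{key}{value}")
    return "-".join(parts)


app_id = make_app_id("rag", 2, store="chroma", k=5)
print(app_id)
```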

Use tru.start_dashboard() to open the local Streamlit dashboard in the browser. It shows per-run scores, a leaderboard, and a record-level trace viewer — no external service required.

The three RAG Triad metrics are complementary diagnostics, not a single score. Always evaluate all three together: a pipeline with high Answer Relevance but low Groundedness is hallucinating, while one with high Groundedness but low Context Relevance retrieved the wrong chunks.

The RAG Triad

The RAG Triad is TruLens's core evaluation framework. Each dimension measures a different failure mode in the retrieve-then-generate pipeline.

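text
Question → Retriever → [context chunks] → LLM → Answer

Context Relevance   Question ↔ context chunks   (are the chunks relevant to the question?)
Groundedness        context chunks ↔ Answer     (is the answer supported by context?)
Answer Relevance    Question ↔ Answer           (does the answer address the question?)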

| Metric | Question it answers | Low score means… |
| --- | --- | --- |
| Context Relevance | Are retrieved chunks relevant to the question? | Retriever is fetching noise |
| Groundedness | Is the answer supported by retrieved chunks? | LLM is hallucinating |
| Answer Relevance | Does the answer address the question? | LLM is off-topic or verbose |
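
The table above can be turned into a small triage helper. This is a sketch in plain Python; the 0.7 cutoff is an arbitrary assumption, and diagnose is a hypothetical name rather than anything TruLens provides.

```python
def diagnose(scores: dict, threshold: float = 0.7) -> list:
    """Map low RAG Triad scores to the likely failure modes from the table."""
    hints = {
        "Context Relevance": "retriever is fetching noise",
        "Groundedness": "LLM is hallucinating",
        "Answer Relevance": "LLM is off-topic or verbose",
    }
    return [
        f"{metric}: {hint}"
        for metric, hint in hints.items()
        if scores[metric] < threshold
    ]


findings = diagnose(
    {"Context Relevance": 0.90, "Groundedness": 0.40, "Answer Relevance": 0.95}
)
print(findings)
```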

Feedback functions

Feedback functions are the building blocks of evaluation. TruLens provides a library of pre-built feedback functions for common tasks (relevance, coherence, sentiment) and lets you define custom ones.

python
from trulens_eval import Feedback, TruChain
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
import numpy as np

provider = TruOpenAI(model_engine="gpt-4o-mini")

# `chain` is the LangChain RAG app under evaluation (built in the TruChain
# section); select_context needs it to locate the retriever's output.
context = TruChain.select_context(chain)

# Answer Relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance, name="Answer Relevance")
    .on_input_output()
)

# Context Relevance: are retrieved chunks relevant to the question?
# Scored per chunk, then averaged.
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

# Groundedness: is the answer supported by the retrieved chunks?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context)
    .on_output()
    .aggregate(np.mean)
)
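
When a selector yields several chunks, the feedback function is scored once per chunk and the aggregator reduces those scores to a single number. The sketch below illustrates the idea with the standard library only; the per-chunk scores are made up, and it imitates the aggregation step conceptually rather than calling TruLens.

```python
from statistics import mean

# Hypothetical per-chunk Context Relevance scores for one question with k=3.
chunk_scores = [0.9, 0.8, 0.1]

mean_score = mean(chunk_scores)  # the kind of value an averaging aggregator reports
worst_score = min(chunk_scores)  # stricter alternative: report the worst chunk

print(f"mean={mean_score:.2f} min={worst_score:.2f}")
```

Averaging hides a single irrelevant chunk, while a minimum exposes it, so the choice of aggregator changes what a "good" score means.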

TruChain — evaluating LangChain RAG pipelines

TruChain wraps a LangChain Runnable or chain, records every invocation, and runs the configured feedback functions after each call.

python
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import numpy as np, os

tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
provider = TruOpenAI(model_engine="gpt-4o-mini")

# --- build a minimal LangChain RAG chain ---
vectorstore = Chroma.from_texts(
    texts=[
        "Transformers use self-attention to process all tokens simultaneously.",
        "BERT uses masked language modelling to learn bidirectional representations.",
        "GPT trains as a left-to-right language model predicting the next token.",
    ],
    embedding=OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"]),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- feedback functions ---
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(TruChain.select_context(chain))  # select_context needs the app to find its retriever
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruChain.select_context(chain))
    .on_output()
    .aggregate(np.mean)
)

# --- wrap with TruChain ---
tru_chain = TruChain(
    chain,
    app_id="langchain-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

questions = [
    "What is self-attention?",
    "How does BERT differ from GPT?",
    "What is positional encoding?",
]

with tru_chain as recording:
    for q in questions:
        chain.invoke(q)

leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"])
print(leaderboard)

Output:

text
             app_id  Answer Relevance  Context Relevance  Groundedness  total_cost
0  langchain-rag-v1              0.91               0.87          0.94      0.0041

TruLlama — evaluating LlamaIndex RAG pipelines

TruLlama is the LlamaIndex equivalent of TruChain — it wraps any LlamaIndex query engine or chat engine.

python
from trulens_eval import Tru, TruLlama, Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
import numpy as np, os

tru = Tru()
provider = TruOpenAI(model_engine="gpt-4o-mini")

# --- build a LlamaIndex query engine ---
Settings.llm = LlamaOpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

# --- feedback functions ---
f_answer_relevance  = Feedback(provider.relevance, name="Answer Relevance").on_input_output()
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruLlama.select_source_nodes().node.text)
    .on_output()
    .aggregate(np.mean)
)

# --- wrap with TruLlama ---
tru_query_engine = TruLlama(
    query_engine,
    app_id="llamaindex-rag-v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

with tru_query_engine as recording:
    response = query_engine.query("What are the main topics covered?")
    print(response)

print(tru.get_leaderboard())

Output:

text
A summary of the main topics in the documents.
             app_id  Answer Relevance  Context Relevance  Groundedness
0  llamaindex-rag-v1              0.93               0.89          0.96

Custom feedback functions

Any Python function that returns a float between 0.0 and 1.0 can be a feedback function.

python
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

provider = TruOpenAI(model_engine="gpt-4o-mini")

# Built-in: conciseness check via LLM judge
f_conciseness = Feedback(provider.conciseness, name="Conciseness").on_output()

# Custom: Python-only word-count ratio (no LLM call)
def word_count_score(answer: str) -> float:
    """Score 1.0 for answers under 100 words, decaying for longer answers."""
    words = len(answer.split())
    if words <= 100:
        return 1.0
    return max(0.0, 1.0 - (words - 100) / 200)

f_brevity = Feedback(word_count_score, name="Brevity").on_output()

# Custom: check for citation presence
def has_citation(answer: str) -> float:
    """Returns 1.0 if the answer contains a citation pattern like [1] or (source:)."""
    import re
    return 1.0 if re.search(r"\[\d+\]|\(source:", answer, re.IGNORECASE) else 0.0

f_citation = Feedback(has_citation, name="Has Citation").on_output()
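
Custom functions can also take two arguments, for example the question and the answer. The sketch below is a hypothetical keyword-coverage check with a deliberately small stopword list; because it is plain Python, it can be unit tested without any LLM call.

```python
import re

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "what", "how", "does", "do"}


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_coverage(question: str, answer: str) -> float:
    """Fraction of non-stopword question terms that also appear in the answer."""
    keywords = _tokens(question) - STOPWORDS
    if not keywords:
        return 1.0  # nothing to cover
    return len(keywords & _tokens(answer)) / len(keywords)


score = keyword_coverage(
    "What is attention in transformers?",
    "Attention allows the model to weigh input tokens by relevance.",
)
print(score)
```

It could then be wired in like the examples above, e.g. Feedback(keyword_coverage, name="Keyword Coverage").on_input_output(), which passes the app input as the first argument and the output as the second.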

Alternate LLM providers as judges

TruLens supports Anthropic, Hugging Face, Bedrock, and local models as judge LLMs via provider wrappers.

python
import os

from langchain_anthropic import ChatAnthropic
from trulens_eval import Feedback
from trulens_eval.feedback.provider import Bedrock as TruBedrock  # AWS Bedrock judge (not used below)
from trulens_eval.feedback.provider import Huggingface as TruHuggingface
from trulens_eval.feedback.provider.langchain import Langchain as TruLangchain

# Anthropic Claude as judge (via the LangChain provider wrapper)
provider = TruLangchain(
    chain=ChatAnthropic(model="claude-haiku-4-5-20251001", api_key=os.environ["ANTHROPIC_API_KEY"])
)

# Hugging Face provider: calls hosted Hugging Face inference endpoints,
# so it expects HUGGINGFACE_API_KEY in the environment
hf_provider = TruHuggingface()

f_not_toxic = (
    Feedback(hf_provider.not_toxic, name="Not Toxic")
    .on_output()
)

The TruLens dashboard

The dashboard is a local Streamlit web app that shows all recorded runs, leaderboard scores, and per-record trace details.

python
from trulens_eval import Tru

tru = Tru(database_url="sqlite:///evals/trulens.sqlite")

# Launch the dashboard in a background process (the call returns;
# stop it later with tru.stop_dashboard())
tru.start_dashboard(port=8501, force=True)

# Or just print the leaderboard to stdout
leaderboard = tru.get_leaderboard()
print(leaderboard.to_string(index=False))

# Export all records as a DataFrame for custom analysis
records, feedback_col = tru.get_records_and_feedback(app_ids=["langchain-rag-v1"])
print(records[["input", "output", "Answer Relevance", "Groundedness"]].head())

Output:

text
                        input                        output  Answer Relevance  Groundedness
0     What is self-attention?  Self-attention computes ...              0.94          0.97
1  How does BERT differ from …  BERT is bidirectional w…              0.90          0.93
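
Once records are exported, the lowest-scoring ones are usually the first to inspect. A minimal sketch over a list of dicts, assuming the rows come from something like records.to_dict("records"); worst_records is a hypothetical helper and the rows are copied from the output above.

```python
def worst_records(rows: list, metric: str, n: int = 1) -> list:
    """Return the n rows with the lowest score for the given metric."""
    scored = [row for row in rows if row.get(metric) is not None]
    return sorted(scored, key=lambda row: row[metric])[:n]


rows = [
    {"input": "What is self-attention?", "Answer Relevance": 0.94, "Groundedness": 0.97},
    {"input": "How does BERT differ from …", "Answer Relevance": 0.90, "Groundedness": 0.93},
]
lowest = worst_records(rows, "Groundedness")
print(lowest[0]["input"])
```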

CI integration — fail on score regression

python
from trulens_eval import Tru

def test_rag_quality():
    tru = Tru(database_url="sqlite:///evals/trulens.sqlite")
    leaderboard = tru.get_leaderboard(app_ids=["langchain-rag-v1"])

    row = leaderboard[leaderboard["app_id"] == "langchain-rag-v1"].iloc[0]

    assert row["Answer Relevance"] >= 0.85, (
        f"Answer Relevance {row['Answer Relevance']:.2f} below 0.85"
    )
    assert row["Groundedness"] >= 0.80, (
        f"Groundedness {row['Groundedness']:.2f} below 0.80"
    )
    assert row["Context Relevance"] >= 0.75, (
        f"Context Relevance {row['Context Relevance']:.2f} below 0.75"
    )

bash
pytest test_eval.py   # fails if any metric regresses below threshold

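Output: pytest prints its usual summary and exits 0 when every threshold is met; a failed assertion produces a non-zero exit code, which fails the build.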
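
A chain of assert statements stops at the first failure, so a build log shows only one regressed metric at a time. The sketch below collects every failure first; find_regressions is a hypothetical helper, the thresholds mirror the test above, and the example row is made up.

```python
THRESHOLDS = {"Answer Relevance": 0.85, "Groundedness": 0.80, "Context Relevance": 0.75}


def find_regressions(row: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        score = row.get(metric)
        if score is None or score != score:  # None or NaN (feedback not computed)
            failures.append(f"{metric} missing from leaderboard")
        elif score < minimum:
            failures.append(f"{metric} {score:.2f} below {minimum:.2f}")
    return failures


# Hypothetical leaderboard row; in a real test, dict(row) would supply it.
example_row = {"Answer Relevance": 0.91, "Groundedness": 0.72, "Context Relevance": 0.70}
failures = find_regressions(example_row)
print(failures)
```

In a test, a single `assert not failures, failures` then reports all regressions together.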

Comparing pipeline versions

python
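from trulens_eval import Tru, TruChain

# Assumes chain_k2 and chain_k5 (two LangChain RAG chains), eval_questions
# (a list of strings), and the feedback functions are defined as in earlier sections.
tru = Tru()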

# Version A — k=2 retriever
tru_v1 = TruChain(chain_k2, app_id="rag-k2", feedbacks=[f_answer_relevance, f_groundedness])
# Version B — k=5 retriever
tru_v2 = TruChain(chain_k5, app_id="rag-k5", feedbacks=[f_answer_relevance, f_groundedness])

for question in eval_questions:
    with tru_v1 as rec:
        chain_k2.invoke(question)
    with tru_v2 as rec:
        chain_k5.invoke(question)

# Both appear side-by-side in the leaderboard
print(tru.get_leaderboard(app_ids=["rag-k2", "rag-k5"]))

Output:

text
  app_id  Answer Relevance  Groundedness  total_cost
  rag-k2              0.86          0.88      0.0021
  rag-k5              0.91          0.94      0.0034
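
To see what a change bought and what it cost, per-metric differences are often easier to read than two rows. A small sketch in plain Python using the values from the output above; score_deltas is a hypothetical helper, not a TruLens function.

```python
def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Per-metric change from baseline to candidate, for metrics present in both."""
    return {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in baseline
        if metric in candidate
    }


# Values copied from the leaderboard output above.
rag_k2 = {"Answer Relevance": 0.86, "Groundedness": 0.88, "total_cost": 0.0021}
rag_k5 = {"Answer Relevance": 0.91, "Groundedness": 0.94, "total_cost": 0.0034}
deltas = score_deltas(rag_k2, rag_k5)
print(deltas)
```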

Quick reference

| Metric | What it measures | Low score means… |
| --- | --- | --- |
| Answer Relevance | Does the answer address the question? | Off-topic or verbose response |
| Context Relevance | Are retrieved chunks relevant? | Retriever fetching noise |
| Groundedness | Is the answer supported by context? | LLM hallucinating |

| Task | Code |
| --- | --- |
| Init TruLens | tru = Tru(database_url="sqlite:///eval.sqlite") |
| Create provider | provider = TruOpenAI(model_engine="gpt-4o-mini") |
| Answer relevance | Feedback(provider.relevance).on_input_output() |
| Context relevance | Feedback(provider.context_relevance).on_input().on(TruChain.select_context(chain)).aggregate(np.mean) |
| Groundedness | Feedback(provider.groundedness_measure_with_cot_reasons).on(TruChain.select_context(chain)).on_output().aggregate(np.mean) |
| Wrap LangChain | TruChain(chain, app_id="v1", feedbacks=[...]) |
| Wrap LlamaIndex | TruLlama(query_engine, app_id="v1", feedbacks=[...]) |
| Record run | with tru_chain as rec: chain.invoke(q) |
| Leaderboard | tru.get_leaderboard(app_ids=["v1"]) |
| Dashboard | tru.start_dashboard(port=8501) |
| Export records | tru.get_records_and_feedback(app_ids=["v1"]) |
| Reset DB | tru.reset_database() |