cheat sheet

trulens-eval

Package-level reference for trulens-eval on PyPI — install variants, the trulens umbrella rename, framework extras, and alternative evaluators.

updated 05-31-2026

What it is

trulens-eval is the evaluation library at the core of TruLens, originally from TruEra. It instruments LLM applications — chains, agents, RAG pipelines — records every LLM call into a local SQLite database, computes feedback scores (Answer Relevance, Context Relevance, Groundedness — the "RAG triad", plus user-defined feedback functions), and ships a local Streamlit-based dashboard for browsing runs.

The TruEra-led project is rebranding around a broader trulens umbrella package, with trulens-eval continuing as the canonical install name for the evaluation pieces. Treat trulens-eval (imported as trulens_eval) as the safe choice in production code until the umbrella story stabilises.

Reach for trulens-eval when you want in-process tracing plus feedback-function evaluation with a dashboard, and you want to deeply instrument an existing LangChain or LlamaIndex app. Reach for ragas for a metrics-first dataset-driven workflow, or langsmith for hosted observability.

Install

bash

pip install trulens-eval

Output: (none — exits 0 on success)

bash

uv add trulens-eval

Output: dependency resolved + added to pyproject.toml

bash

poetry add trulens-eval

Output: updated lockfile + virtualenv install

bash

pip install "trulens-eval[langchain]"
pip install "trulens-eval[llama-index]"
pip install "trulens-eval[chroma]"

Output: TruLens plus the framework or vector-store integration extras

Versioning & Python support

The package is pre-1.0. Feedback-function signatures, dashboard internals, and the underlying SQLAlchemy schema have evolved across minors — pin a version for any longitudinal evaluation.
Recent versions support Python 3.9+. Pure-Python with SQLAlchemy, Streamlit, and one or more LLM-provider SDKs as runtime dependencies.
An umbrella trulens package on PyPI is the in-progress home for the broader project (evaluation + production observability + connectors). At time of writing, trulens-eval is still the actively-imported package name for the evaluation pieces — new tutorials may use trulens once the umbrella stabilises, so check the project README before pinning.
The SQLite-backed Tru database file format has changed between minors; treat the local default.sqlite as ephemeral when upgrading.
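
Because the install name is in flux, it can help to check which distributions are actually present before pinning. The following is a minimal sketch using only the standard library; the candidate names are an assumption drawn from the packages mentioned in this sheet, so adjust them to whatever your project pins.

```python
from importlib.metadata import PackageNotFoundError, version

# Candidate distribution names (assumption: taken from this cheat sheet).
CANDIDATES = ("trulens-eval", "trulens", "trulens-core")


def installed_versions(names=CANDIDATES):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Recording this mapping next to each evaluation run makes it easier to tell later which package line produced a given set of scores.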

Package metadata

Maintainer: Snowflake (which acquired TruEra) and community contributors
Project home: github.com/truera/trulens
Docs: trulens.org
PyPI: pypi.org/project/trulens-eval
License: MIT
Governance: vendor-led (Snowflake/TruEra) with open contributions
First released: 2023
Downloads: hundreds of thousands per month

Optional dependencies & extras

trulens-eval[langchain] — installs langchain so TruChain can instrument LangChain runnables.
trulens-eval[llama-index] — installs llama-index so TruLlama can instrument LlamaIndex query engines.
trulens-eval[chroma] — adds chromadb for examples and integration tests.
trulens-eval[openai], trulens-eval[huggingface], trulens-eval[bedrock], trulens-eval[litellm] (names may vary by version) — bundle individual feedback-provider SDKs.

Common companions installed alongside:

openai, anthropic, cohere — provider SDKs used by the built-in Feedback providers.
langchain / llama-index — the apps you are instrumenting.
streamlit — pulled in for the dashboard; you can suppress it on headless servers.
sqlalchemy — backs the runs database; PostgreSQL works in addition to SQLite.

Alternatives

Package	Trade-off
`ragas`	Dataset-driven RAG metrics with LangChain-style integration. Use when you have a fixed eval set.
`deepeval`	Pytest-native assertions. Use when evaluation belongs in unit tests.
`langsmith`	Hosted observability + datasets + evals from LangChain. Use for managed cloud evaluation.
`phoenix` (Arize)	Open-source LLM observability with trace replay. Use for trace-first OTel-style workflows.
`openllmetry` / `traceloop-sdk`	OTel-based LLM tracing for production. Use when you want vendor-neutral traces in your existing APM.
`humanloop` / `langfuse`	Hosted prompt + eval platforms. Use when you want collaboration features.

Common gotchas

Umbrella rename in progress. trulens-eval is the historical install name; the broader trulens umbrella is being introduced. New blog posts may show from trulens.core import ... while older code uses from trulens_eval import .... Check which import path the version you pinned actually exports.
SQLAlchemy backend defaults to SQLite. Fine for development, but multi-process production usage needs a real database. Set Tru(database_url="postgresql+psycopg://...") and re-run migrations.
Database schema migrations across versions. Upgrading minors sometimes adds tables or columns. Keep the runs DB out of source control and rebuild on upgrade unless you explicitly migrate.
Dashboard process lifecycle. tru.run_dashboard() launches Streamlit in a subprocess and registers an atexit cleanup hook. Killing the parent with kill -9 orphans Streamlit on port 8501; clean up manually before the next run.
Feedback-function call costs. Each LLM-judged feedback function costs at least one LLM call per record, so the RAG triad adds at least three calls per record, and more when a function scores each retrieved chunk separately. Use a cheap judge model in development and a stronger one for release-gating runs.
TruChain and TruLlama need the framework installed. Importing them raises a friendly error if langchain / llama-index isn't present; install via the appropriate extra.
Headless server installs still pull Streamlit. If you only need evaluation and not the dashboard, plan for the ~30 MB extra dependency or build a slim image that excludes streamlit.

Real-world recipes

The recipes below focus on the wrapper choice and database topology each pattern implies — the sections/ai/trulens companion covers the RAG triad and feedback functions in depth.

Wrap a LangChain RAG chain with TruChain — the canonical first run. Every chain invocation is recorded, the RAG triad is computed asynchronously, and the result is queryable in the dashboard.

python

from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI

tru = Tru()          # opens (or creates) the local runs database
provider = OpenAI()  # judge model provider; expects OPENAI_API_KEY in the environment

# RAG triad
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons).on(...)
f_answer_relevance = Feedback(provider.relevance).on_input_output()
f_context_relevance = Feedback(provider.context_relevance_with_cot_reasons).on(...)

tru_chain = TruChain(
    my_chain,
    app_id="rag-v1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

with tru_chain as recording:
    answer = my_chain.invoke({"question": "What is HNSW?"})

# Feedback scores compute in the background; query when ready
tru.get_leaderboard(app_ids=["rag-v1"])

Output: every call is persisted to the local SQLite store; the dashboard at localhost:8501 shows the leaderboard with mean feedback scores per app version

A/B test prompt variants — give each variant a different app_id and compare leaderboards.

python

tru_v1 = TruChain(chain_v1, app_id="rag-v1", feedbacks=[...])
tru_v2 = TruChain(chain_v2, app_id="rag-v2", feedbacks=[...])

for q in eval_questions:
    with tru_v1:
        chain_v1.invoke({"question": q})
    with tru_v2:
        chain_v2.invoke({"question": q})

print(tru.get_leaderboard(app_ids=["rag-v1", "rag-v2"]))

Output: a comparison table with per-app feedback means and call counts; the dashboard renders the same data with per-row drilldowns

Custom feedback function — feedback functions are just Python callables that return a numeric score. Wrap your own logic with Feedback(callable).

python

from trulens_eval import Feedback, Select

def length_penalty(response: str) -> float:
    return 1.0 if 50 <= len(response) <= 500 else 0.5

f_length = (
    Feedback(length_penalty, name="length_in_range")
    .on(Select.RecordOutput)
)

Output: a feedback function that scores 1.0 for responses of 50 to 500 characters and 0.5 otherwise; the Select.RecordOutput selector pulls the chain's output text from each record
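
Since a feedback function is a plain callable, it can be unit-tested on its own before it is wrapped in Feedback(...). The sketch below shows one such heuristic; the function name and the default keyword list are hypothetical and chosen only for illustration.

```python
def keyword_coverage(response: str, required: tuple = ("hnsw", "graph")) -> float:
    """Fraction of required keywords present in the response, in [0, 1]."""
    if not required:
        return 1.0  # nothing to check, so nothing is missing
    text = response.lower()
    hits = sum(1 for word in required if word.lower() in text)
    return hits / len(required)


if __name__ == "__main__":
    print(keyword_coverage("HNSW is a graph-based index"))
    print(keyword_coverage("It is an index"))
```

Keeping the score inside [0, 1] matches the range the built-in feedback functions use, so the custom metric sits comfortably next to them on the leaderboard.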

TruLlama for LlamaIndex query engines — same idea, different wrapper.

python

from trulens_eval import TruLlama

tru_engine = TruLlama(
    query_engine,
    app_id="li-rag-v1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
with tru_engine as recording:
    response = query_engine.query("What is HNSW?")

Output: LlamaIndex query traces are recorded with the same schema as LangChain runs; the dashboard unifies both

Move runs storage to Postgres — for multi-process workloads, SQLite locks under concurrent writes. Point Tru(database_url=...) at a real database.

python

from trulens_eval import Tru

tru = Tru(database_url="postgresql+psycopg://user:pw@db:5432/trulens")

Output: the Tru singleton now reads and writes to Postgres; all subsequent TruChain/TruLlama recordings persist there
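
Hard-coding credentials in the URL is fragile, and passwords containing characters such as @ or / break URL parsing unless they are escaped. A minimal sketch of assembling the URL from environment variables follows; the TRULENS_DB_* variable names are hypothetical, not something TruLens reads itself.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ) -> str:
    """Assemble a SQLAlchemy-style Postgres URL from environment variables."""
    user = quote_plus(env["TRULENS_DB_USER"])          # hypothetical variable names
    password = quote_plus(env["TRULENS_DB_PASSWORD"])  # escapes '@', '/', ':' and similar
    host = env.get("TRULENS_DB_HOST", "localhost")
    port = env.get("TRULENS_DB_PORT", "5432")
    name = env.get("TRULENS_DB_NAME", "trulens")
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{name}"
```

The resulting string is what would be passed as database_url; every worker that builds it from the same environment ends up pointing at the same database.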

Production deployment

TruLens has two layers: an instrumentation layer (TruChain, TruLlama, and the TruBasicApp / TruCustomApp wrappers that record every call) and an evaluation layer (feedback functions that compute scores after each record). Production usage requires choices on both.

Topology checklist:

Concern	Approach
Runs database	SQLite for dev; Postgres (`postgresql+psycopg://...`) for multi-process
Dashboard	local Streamlit (`tru.run_dashboard()`) or skip entirely for headless
Feedback execution	in-process right after each record (default), or deferred to a separate evaluator (`tru.start_evaluator()`)
LM provider for feedback	cheap model in dev, strong model on schedule
Sampling	record everything in dev; sample (e.g. 10%) in prod
Headless installs	skip `streamlit` dep if no dashboard needed

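Deferred feedback evaluation. By default, feedback is evaluated inside the app's own process shortly after each record, so judge calls compete with request handling for CPU, memory, and provider rate limits. To move that work off the serving process, construct the recorder with feedback_mode set to deferred (FeedbackMode.DEFERRED; verify the name against your pinned version) and run a background evaluator: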

python

tru.start_evaluator()       # starts a background thread that picks up unscored records

Output: the evaluator runs feedback functions on records as they accumulate; chain calls return immediately, scores fill in seconds later
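
The deferred pattern itself is easy to picture without TruLens: the request path only enqueues a record, and a worker scores it later. The toy model below illustrates that idea with the standard library; it is not how TruLens implements its evaluator, and run_deferred and score_fn are hypothetical names.

```python
import queue
import threading


def run_deferred(records, score_fn):
    """Score records on a worker thread and return {record: score}."""
    pending = queue.Queue()
    scores = {}

    def worker():
        while True:
            record = pending.get()
            if record is None:          # sentinel: no more work
                break
            scores[record] = score_fn(record)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    for record in records:              # the "request path" only enqueues
        pending.put(record)
    pending.put(None)
    thread.join()                       # a real service would keep the worker running
    return scores
```

In a real deployment the queue is the runs database and the worker is the evaluator, which is why both must point at the same store.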

Sampling in production. Recording every request to an LLM-judge feedback pipeline is expensive. Wrap your app to record a sampled fraction:

python

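import random

def maybe_record(chain, question, sample_rate=0.1):
    # tru_chain must be the recorder that wraps this same chain object
    if random.random() < sample_rate:    # 10% sample by default
        with tru_chain:
            return chain.invoke({"question": question})
    return chain.invoke({"question": question})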

Output: only about one in ten requests pays the recording and judge cost; leaderboard means get noisier with fewer samples, but trends remain visible
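
Random sampling makes a different decision on every retry and on every worker. If a stable decision per request is preferable, one option is to hash a request identifier instead. This is a sketch under the assumption that each request carries some string ID; should_record is a hypothetical helper, not a TruLens API.

```python
import hashlib


def should_record(request_id: str, rate: float = 0.1) -> bool:
    """Deterministically select roughly `rate` of all request IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)            # compare as integers to avoid float rounding
```

The same ID always yields the same answer, so a retried request is either recorded every time or never, regardless of which worker handles it.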

Multi-process and worker fleets. Each worker needs to point at the same Postgres database. The Tru singleton inside each process opens its own connection pool; there's no cross-process coordination needed beyond the shared DB.

Evaluation methodology

TruLens's signature contribution is the RAG triad: three feedback functions that together diagnose retrieval, generation, and grounding.

Feedback	What it measures	Catches
Context Relevance	are retrieved contexts relevant to the question?	retrieval failures
Answer Relevance	does the answer actually answer the question?	off-topic or evasive generation
Groundedness	does the answer follow from the contexts?	hallucination; generation that ignores or contradicts contexts

A high-quality RAG pipeline scores well on all three. A pipeline that scores high on Answer Relevance but low on Groundedness is hallucinating — the LM is making up plausible-sounding answers.

Feedback function design. A feedback function is (record selectors → score in [0, 1]). The Select.* API gives you typed accessors to record fields:

Select.RecordInput — the chain's input.
Select.RecordOutput — the chain's output.
Select.RecordCalls[...] — any intermediate call (retriever output, tool result, etc.).

LM-judge cost. Each LM-judged feedback function costs at least one LM call per record, so the RAG triad adds at least three judge calls per recorded request. For a 10,000-record eval set, that is at least 30,000 judge calls; pin a cheap model for the inner loop and a stronger model for release-gating runs.
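
The arithmetic is simple enough to script when budgeting an eval run. In the sketch below the token count and price are placeholder assumptions, not real provider pricing; substitute the figures for your judge model.

```python
def judge_calls(records: int, feedback_functions: int = 3, calls_per_function: int = 1) -> int:
    """Lower bound on judge calls for an eval run."""
    return records * feedback_functions * calls_per_function


def judge_cost(calls: int, tokens_per_call: int, price_per_1k_tokens: float) -> float:
    """Estimated spend, given an assumed average token count and price."""
    return calls * tokens_per_call / 1000 * price_per_1k_tokens


if __name__ == "__main__":
    calls = judge_calls(10_000)
    print(calls, judge_cost(calls, tokens_per_call=1_000, price_per_1k_tokens=0.5))
```

Raising calls_per_function models feedback functions that score each retrieved chunk separately, which is where the estimate tends to grow fastest.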

Reference-based vs reference-free. TruLens supports both:

Reference-free (RAG triad, custom heuristics) — no ground truth required.
Reference-based (GroundTruthAgreement) — needs (input, expected_output) pairs.

Use reference-free for live production sampling; reference-based for fixed regression sets.

Version migration guide

trulens-eval → trulens umbrella. Snowflake/TruEra are migrating from the single trulens-eval package to a broader trulens umbrella that splits evaluation, dashboard, providers, and instrumentation into separate sub-packages. The exact final layout is still settling (verify against the project README at install time) — until it does, trulens-eval remains the safe install name for evaluation code.

Symptoms of the migration in the wild:

Some tutorials show from trulens.core import ... while older code uses from trulens_eval import .... Both refer to the same library at different naming stages.
The PyPI trulens package may be a metapackage that pulls in trulens-core, trulens-dashboard, trulens-providers-*, etc.
Long-term, expect pip install trulens to be the recommended entry; for now, pip install trulens-eval is more reliably reproducible.
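
When it is unclear which naming stage an environment is on, probing for the importable top-level modules settles it without triggering a full import. A minimal standard-library sketch; the two module names are the ones discussed above.

```python
from importlib.util import find_spec


def available_import_roots(roots=("trulens_eval", "trulens")):
    """Return the top-level module names that can be imported in this environment."""
    return [root for root in roots if find_spec(root) is not None]


if __name__ == "__main__":
    print(available_import_roots() or "no TruLens package found")
```

If both names show up, the environment most likely has both package lines installed, and it is worth deciding on one import path before writing more code against it.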

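Database schema migrations. The SQLAlchemy-backed runs DB changes across minors, and schema changes are applied through Alembic migrations. Depending on the version, an out-of-date database is either migrated on startup or rejected with an error that asks you to call tru.migrate_database(); either path can fail for SQLite if another process has the file open. Pattern for safe upgrades:

Stop all writers.
Back up the runs DB (cp default.sqlite default.sqlite.bak).
Upgrade the package.
Run Tru() in a script; if it reports an out-of-date schema, call tru.migrate_database().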
Verify the dashboard loads.
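
For the backup step, a plain file copy is fine once all writers are stopped. If there is any doubt about that, SQLite's online backup API produces a consistent copy even while the file is open. A minimal sketch using the standard sqlite3 module; backup_sqlite is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def backup_sqlite(src: str, dst: str) -> Path:
    """Copy a SQLite database to `dst` using SQLite's online backup API."""
    source = sqlite3.connect(src)
    target = sqlite3.connect(dst)
    try:
        source.backup(target)   # copies every page of the source database
    finally:
        target.close()
        source.close()
    return Path(dst)
```

Usage would look like backup_sqlite("default.sqlite", "default.sqlite.bak") immediately before upgrading the package.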

Feedback function signatures. Selector syntax has evolved across versions:

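Shorthand: .on_input_output() selects the app's main input and output; check that your pinned version still provides it.
Explicit: the .on(Select.RecordInput).on(Select.RecordOutput) builder spells out the same selection and extends to intermediate calls.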
The Feedback(...).aggregate(...) API for collapsing multi-step records has been reshaped.

Pinning strategy. For longitudinal feedback numbers, pin the trulens version, the provider model name, and (where possible) the prompt revision. Treat scores as bands across versions, not exact equality.
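
Treating scores as bands can be made concrete with a tolerance check between a stored baseline and the current run. In the sketch below the 0.05 tolerance is an arbitrary assumption, and regressions is a hypothetical helper rather than part of TruLens.

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> list:
    """Names of metrics whose mean dropped by more than `tolerance`."""
    return sorted(
        name
        for name, old in baseline.items()
        if name in current and old - current[name] > tolerance
    )


if __name__ == "__main__":
    before = {"groundedness": 0.82, "context_relevance": 0.70}
    after = {"groundedness": 0.80, "context_relevance": 0.60}
    print(regressions(before, after))
```

Small movements inside the band are ignored, so a judge-model or prompt revision that shifts every score slightly does not register as a regression.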

Index tuning & retrieval quality

TruLens doesn't index data — it observes what your retriever returns. The relevant tuning lever is how TruLens's feedback functions surface retrieval problems.

Retrieval signal in feedback scores.

Low Context Relevance with high Answer Relevance → retrieval is pulling junk; the LM is paving over it. Tune top_k, chunk size, or add a reranker.
High Context Relevance but low Groundedness → retrieval is fine, the LM is hallucinating. Switch model, add citation prompts, or apply guardrails.
Low on all three → broken pipeline. Inspect raw records via tru.get_records_and_feedback().
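
The three patterns above can be encoded as a small triage helper for reading leaderboard means. This is a sketch only: the 0.5 threshold is an assumption to tune per project, and the function and its labels are hypothetical.

```python
def diagnose(context_relevance: float, answer_relevance: float,
             groundedness: float, threshold: float = 0.5) -> str:
    """Map RAG-triad means to a coarse diagnosis label."""
    low_context = context_relevance < threshold
    low_answer = answer_relevance < threshold
    low_grounded = groundedness < threshold

    if low_context and low_answer and low_grounded:
        return "pipeline"      # everything is low: inspect raw records
    if low_context and not low_answer:
        return "retrieval"     # junk contexts, the LM is paving over them
    if not low_context and low_grounded:
        return "generation"    # contexts are fine, the answer is not grounded
    if not (low_context or low_answer or low_grounded):
        return "ok"
    return "unclear"           # a combination the rules above do not cover
```

A label is a starting point for investigation, not a verdict; the per-record view is still where the actual cause shows up.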

A/B testing retrieval variants. Wrap each variant as a separate TruChain / TruLlama with a distinct app_id. Send the same eval questions through each. The dashboard shows per-app feedback means side by side.

python

chain_k5 = chain_with_topk(5)     # build each variant once and reuse it
chain_k10 = chain_with_topk(10)
tru_topk5 = TruChain(chain_k5, app_id="topk-5", feedbacks=[...])
tru_topk10 = TruChain(chain_k10, app_id="topk-10", feedbacks=[...])

for q in eval_questions:
    # invoke the same chain object that each recorder wraps
    with tru_topk5:
        chain_k5.invoke({"question": q})
    with tru_topk10:
        chain_k10.invoke({"question": q})
print(tru.get_leaderboard(app_ids=["topk-5", "topk-10"]))

Output: the leaderboard compares mean feedback scores per app version; pick the variant that wins on Context Relevance without sacrificing latency

Per-record drilldown. The dashboard surfaces individual records ordered by score — the worst-performing records are the most informative debug signal. Use them to find the chunking edge cases, missing documents, or ambiguous queries that need attention.

Performance tuning

Lever	Mechanism	When it helps
`tru.start_evaluator()`	background feedback eval	latency-sensitive apps
Sampling (record 10% of requests)	reduce judge calls	high-volume production
Cheap judge model	smaller LM	inner-loop dev
Postgres backend	multi-writer support	multi-process apps
Selective feedback functions	drop expensive metrics in dev	iteration speed
Skip dashboard install	leaner image	headless servers

Skipping Streamlit on headless servers. If you only need evaluation and not the dashboard, install with --no-deps and then manually add only the runtime deps you need; alternatively, run the dashboard separately on a different machine pointing at the same database.

Troubleshooting common errors

OperationalError: database is locked — two processes opened the same SQLite runs DB. Switch to Postgres or coordinate access via a single writer.
Dashboard hangs on startup — port 8501 already in use. Kill the stale Streamlit (pkill -f streamlit) and re-run tru.run_dashboard().
ImportError: cannot import name 'TruChain' — package name confusion. Either pip install trulens-eval (older) or pip install trulens (umbrella, newer). Check which import path your version actually exports.
All feedback scores are NaN — the judge LM is rejecting the prompt or hitting rate limits. Inspect raw responses via tru.get_records_and_feedback().
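Feedback never computes: the recorder was built in deferred mode but no evaluator is running, or feedback was disabled. Either call tru.start_evaluator() or construct the recorder with an in-process feedback_mode (for example FeedbackMode.WITH_APP; verify the name against your pinned version).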
Schema mismatch after upgrade: the Alembic migration did not complete. Back up the DB, then retry the migration with tru.migrate_database(); if the data is disposable, drop the DB and re-run.
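Atexit cleanup on kill -9: the Streamlit subprocess is orphaned because SIGKILL skips Python's atexit hooks. A plain kill (SIGTERM) also skips them unless the process installs a handler, so stop the parent with Ctrl-C (SIGINT) or call tru.stop_dashboard() to shut down cleanly.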
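
Several of the dashboard problems above come down to whether something is already listening on port 8501. A quick standard-library check, as a sketch; port_in_use is a hypothetical helper.

```python
import socket


def port_in_use(port: int = 8501, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0   # 0 means the connect succeeded


if __name__ == "__main__":
    print("8501 busy" if port_in_use() else "8501 free")
```

Running this before launching the dashboard distinguishes a stale Streamlit process from a genuine startup failure.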

Ecosystem integrations

LangChain — TruChain wraps any Runnable and instruments every step.
LlamaIndex — TruLlama wraps query engines, retrievers, and chat engines.
Native function wrapping — TruBasicApp and TruCustomApp instrument arbitrary Python callables without a framework.
Snowflake — first-class integration for running TruLens evaluations inside Snowflake (Snowflake Cortex Search, Snowflake LLMs as feedback providers).
HuggingFace — feedback provider that runs models via Inference API or local Transformers pipeline.
Cohere / Anthropic / Bedrock / LiteLLM — feedback providers via the corresponding extras.
OpenTelemetry — recent versions emit OTel traces alongside the SQLite recordings.

Embeddings & chunking strategy

Most TruLens feedback functions are LM-judged, not embedding-based — chunking and embedding decisions belong upstream in the RAG pipeline, not in TruLens itself. The interaction points are:

Groundedness feedback runs per-chunk by default, scoring each retrieved chunk's contribution to the answer. Smaller chunks produce more, smaller scoring calls; larger chunks produce fewer, larger calls. Both end at roughly the same total cost.
Context Relevance feedback scores chunks against the question. Heavily overlapping chunks inflate scores; ensure your chunker deduplicates overlap regions before indexing.
Embedding-based feedback (custom). You can write a feedback function that computes cosine similarity between input/output embeddings — useful for query-answer relevance without an LM judge. Reuse the same embedder your retrieval pipeline uses.
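
The embedding-based option can be sketched without committing to any embedder: the feedback function only needs two vectors. The version below is a minimal illustration that assumes the embeddings are plain lists of floats produced elsewhere; cosine_feedback is a hypothetical name, and clamping negative similarity to 0 is a design choice made to keep the score in [0, 1].

```python
import math


def cosine_feedback(a, b) -> float:
    """Cosine similarity of two equal-length vectors, clamped into [0, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0               # an all-zero vector carries no signal
    return max(0.0, min(1.0, dot / norm))
```

To use it as feedback, a thin wrapper would embed the input and output text with the retrieval pipeline's embedder and pass the two vectors to this function.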

Chunk-size sweeps as evaluation. Use TruLens to compare app versions with different chunk sizes. Give each chunker its own app_id, run the same questions through each, and read the leaderboard. Where context_relevance peaks depends on the corpus, the retriever, and the judge model; 800 to 1500 character chunks are a reasonable range to start the sweep from, not a guaranteed optimum.
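
A sweep needs a chunker whose size and overlap are parameters. The character-based splitter below is a deliberately simple sketch for that purpose; real pipelines usually split on sentence or token boundaries, and chunk_text is a hypothetical helper.

```python
def chunk_text(text: str, size: int, overlap: int = 0) -> list:
    """Split text into chunks of `size` characters, each sharing `overlap` with the previous."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


if __name__ == "__main__":
    for size in (400, 800, 1600):          # sizes to sweep; one app_id per size
        chunks = chunk_text("lorem ipsum " * 500, size, overlap=size // 10)
        print(f"chunk-{size}: {len(chunks)} chunks")
```

Each size would then back its own index and its own recorder, so the leaderboard rows map one-to-one onto chunking settings.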

HyDE and query rewriting are upstream of TruLens; they show up in feedback scores as improved Context Relevance.

Security considerations

TruLens persists every recorded request and the LM judge's reasoning into a local database. That database is a copy of your sensitive data, and the feedback pipeline sends data to third-party LMs.

Runs database content. Every input, retrieved context, output, and feedback rationale lives in default.sqlite (or your Postgres equivalent). Treat the DB at the same sensitivity as your production prompts.
PII in dashboards. The Streamlit dashboard renders raw records. Restrict access — bind to localhost and tunnel, or front with auth.
Provider data exposure. Feedback functions send record contents to whichever hosted provider you configure (OpenAI, Anthropic, Bedrock, HuggingFace). For sensitive data, use a self-hosted judge (local Llama via vLLM/Ollama) or a provider with a no-training guarantee.
Multi-tenant runs DB. If multiple apps share a Postgres backend, isolate by app_id and consider per-tenant databases for strict isolation.
Prompt injection in retrieved content. The same caveat as any RAG pipeline — TruLens records but does not sanitise.
Sampling in CI. Eval-in-CI can leak production data to the judge LM. Use a staging dataset for CI rather than live prompts.
Key rotation. Provider API keys live in env vars; rotate alongside the rest of the fleet.
Database export. tru.get_records_and_feedback() returns a DataFrame of raw records; restrict who can run this.

When NOT to use this

TruLens is the right tool when you want in-process tracing plus a feedback-function dashboard for an existing LLM app. The trade-offs below are where another tool fits better.

Fixed-dataset batch evaluation. Ragas is more focused — dataset in, scores out, no tracing overhead.
Pytest-native assertions. DeepEval lives next to unit tests; TruLens is a dashboard tool.
Hosted observability with a team UI. LangSmith, Langfuse, or Phoenix have multi-user dashboards out of the box; TruLens's Streamlit is single-user-friendly.
Production tracing into your existing APM. OpenTelemetry / OpenLLMetry / Traceloop emit standard OTel; TruLens is a complement, not a replacement.
You don't want a separate runs database. TruLens persistence is a feature; if you only need stateless scoring, ragas is lighter.
