cheat sheet
trulens-eval
Package-level reference for trulens-eval on PyPI — install variants, the trulens umbrella rename, framework extras, and alternative evaluators.
What it is
trulens-eval is the evaluation library at the core of TruLens, originally from TruEra. It instruments LLM applications — chains, agents, RAG pipelines — records every LLM call into a local SQLite database, computes feedback scores (Answer Relevance, Context Relevance, Groundedness — the "RAG triad", plus user-defined feedback functions), and ships a local Streamlit-based dashboard for browsing runs.
The TruEra-led project is rebranding around a broader trulens umbrella package, with trulens-eval continuing as the canonical install name for the evaluation pieces. Treat trulens-eval as the safe import name in production code until the umbrella story stabilises.
Reach for trulens-eval when you want in-process tracing plus feedback-function evaluation with a dashboard, and you want to deeply instrument an existing LangChain or LlamaIndex app. Reach for ragas for a metrics-first dataset-driven workflow, or langsmith for hosted observability.
Install
pip install trulens-eval
Output: (none — exits 0 on success)
uv add trulens-eval
Output: dependency resolved + added to pyproject.toml
poetry add trulens-eval
Output: updated lockfile + virtualenv install
pip install "trulens-eval[langchain]"
pip install "trulens-eval[llama-index]"
pip install "trulens-eval[chroma]"
Output: TruLens plus the framework or vector-store integration extras
Versioning & Python support
- The package is pre-1.0. Feedback-function signatures, dashboard internals, and the underlying SQLAlchemy schema have evolved across minors, so pin a version for any longitudinal evaluation.
- Recent versions support Python 3.9+. Pure-Python with SQLAlchemy, Streamlit, and one or more LLM-provider SDKs as runtime dependencies.
- An umbrella trulens package on PyPI is the in-progress home for the broader project (evaluation + production observability + connectors). At time of writing, trulens-eval is still the actively-imported package name for the evaluation pieces. New tutorials may use trulens once the umbrella stabilises, so check the project README before pinning.
- The SQLite-backed Tru database file format has changed between minors; treat the local default.sqlite as ephemeral when upgrading.
Package metadata
- Maintainer: Snowflake (which acquired TruEra) and community contributors
- Project home: github.com/truera/trulens
- Docs: trulens.org
- PyPI: pypi.org/project/trulens-eval
- License: MIT
- Governance: vendor-led (Snowflake/TruEra) with open contributions
- First released: 2023
- Downloads: hundreds of thousands per month
Optional dependencies & extras
- trulens-eval[langchain]: installs langchain so TruChain can instrument LangChain runnables.
- trulens-eval[llama-index]: installs llama-index so TruLlama can instrument LlamaIndex query engines.
- trulens-eval[chroma]: adds chromadb for examples and integration tests.
- trulens-eval[openai], trulens-eval[huggingface], trulens-eval[bedrock], trulens-eval[litellm] (names may vary by version): bundle individual feedback-provider SDKs.
Common companions installed alongside:
- openai, anthropic, cohere: provider SDKs used by the built-in Feedback providers.
- langchain / llama-index: the apps you are instrumenting.
- streamlit: pulled in for the dashboard; you can suppress it on headless servers.
- sqlalchemy: backs the runs database; PostgreSQL works in addition to SQLite.
Alternatives
| Package | Trade-off |
|---|---|
| ragas | Dataset-driven RAG metrics with LangChain-style integration. Use when you have a fixed eval set. |
| deepeval | Pytest-native assertions. Use when evaluation belongs in unit tests. |
| langsmith | Hosted observability + datasets + evals from LangChain. Use for managed cloud evaluation. |
| phoenix (Arize) | Open-source LLM observability with trace replay. Use for trace-first OTel-style workflows. |
| openllmetry / traceloop-sdk | OTel-based LLM tracing for production. Use when you want vendor-neutral traces in your existing APM. |
| humanloop / langfuse | Hosted prompt + eval platforms. Use when you want collaboration features. |
Common gotchas
- Umbrella rename in progress. trulens-eval is the historical install name; the broader trulens umbrella is being introduced. New blog posts may show from trulens.core import ... while older code uses from trulens_eval import .... Check which import path the version you pinned actually exports.
- SQLAlchemy backend defaults to SQLite. Fine for development, but multi-process production usage needs a real database. Set Tru(database_url="postgresql+psycopg://...") and re-run migrations.
- Database schema migrations across versions. Upgrading minors sometimes adds tables or columns. Keep the runs DB out of source control and rebuild on upgrade unless you explicitly migrate.
- Dashboard process lifecycle. tru.run_dashboard() launches Streamlit in a subprocess and registers an atexit cleanup hook. Killing the parent with kill -9 orphans Streamlit on port 8501; clean up manually before the next run.
- Feedback-function call costs. Each feedback function is itself an LLM call. The default RAG triad makes three calls per record. Use a cheap judge model in development and a stronger one for release-gating runs.
- TruChain and TruLlama need the framework installed. Importing them raises a friendly error if langchain / llama-index isn't present; install via the appropriate extra.
- Headless server installs still pull Streamlit. If you only need evaluation and not the dashboard, plan for the ~30 MB extra dependency or build a slim image that excludes streamlit.
Real-world recipes
The recipes below focus on the wrapper choice and database topology each pattern implies — the sections/ai/trulens companion covers the RAG triad and feedback functions in depth.
Wrap a LangChain RAG chain with TruChain — the canonical first run. Every chain invocation is recorded, the RAG triad is computed asynchronously, and the result is queryable in the dashboard.
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI
from trulens_eval.feedback import GroundTruthAgreement
import numpy as np
tru = Tru()
provider = OpenAI()
# RAG triad
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons).on(...)
f_answer_relevance = Feedback(provider.relevance).on_input_output()
f_context_relevance = Feedback(provider.context_relevance_with_cot_reasons).on(...)
tru_chain = TruChain(
    my_chain,
    app_id="rag-v1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
with tru_chain as recording:
    answer = my_chain.invoke({"question": "What is HNSW?"})
# Feedback scores compute in the background; query when ready
tru.get_leaderboard(app_ids=["rag-v1"])
Output: every call is persisted to the local SQLite store; the dashboard at localhost:8501 shows the leaderboard with mean feedback scores per app version
A/B test prompt variants — give each variant a different app_id and compare leaderboards.
tru_v1 = TruChain(chain_v1, app_id="rag-v1", feedbacks=[...])
tru_v2 = TruChain(chain_v2, app_id="rag-v2", feedbacks=[...])
for q in eval_questions:
    with tru_v1:
        chain_v1.invoke({"question": q})
    with tru_v2:
        chain_v2.invoke({"question": q})
print(tru.get_leaderboard(app_ids=["rag-v1", "rag-v2"]))
Output: a comparison table with per-app feedback means and call counts; the dashboard renders the same data with per-row drilldowns
Custom feedback function — feedback functions are just Python callables that return a numeric score. Wrap your own logic with Feedback(callable).
from trulens_eval import Feedback, Select
def length_penalty(response: str) -> float:
    return 1.0 if 50 <= len(response) <= 500 else 0.5
f_length = (
    Feedback(length_penalty, name="length_in_range")
    .on(Select.RecordOutput)
)
Output: a feedback function that scores 1.0 for in-range responses; the Select.RecordOutput selector pulls the chain's output text from each record
TruLlama for LlamaIndex query engines — same idea, different wrapper.
from trulens_eval import TruLlama
tru_engine = TruLlama(
    query_engine,
    app_id="li-rag-v1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
with tru_engine as recording:
    response = query_engine.query("What is HNSW?")
Output: LlamaIndex query traces are recorded with the same schema as LangChain runs; the dashboard unifies both
Move runs storage to Postgres — for multi-process workloads, SQLite locks under concurrent writes. Point Tru(database_url=...) at a real database.
from trulens_eval import Tru
tru = Tru(database_url="postgresql+psycopg://user:pw@db:5432/trulens")
Output: the Tru singleton now reads and writes to Postgres; all subsequent TruChain/TruLlama recordings persist there
Production deployment
TruLens has two layers: an instrumentation layer (TruChain / TruLlama / TruBasicApp / TruCustomApp wrappers that record every call) and an evaluation layer (feedback functions that compute scores asynchronously). Production usage requires choices on both.
Topology checklist:
| Concern | Approach |
|---|---|
| Runs database | SQLite for dev; Postgres (postgresql+psycopg://...) for multi-process |
| Dashboard | local Streamlit (tru.run_dashboard()) or skip entirely for headless |
| Feedback execution | inline (slow) or deferred (tru.start_evaluator()) |
| LM provider for feedback | cheap model in dev, strong model on schedule |
| Sampling | record everything in dev; sample (e.g. 10%) in prod |
| Headless installs | skip streamlit dep if no dashboard needed |
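Deferred feedback evaluation. By default, feedback is computed inside the recording process shortly after each call, so judge calls compete with request handling for resources. To move that work out of the request path, put the recorder in deferred mode (in trulens-eval this is the feedback_mode argument on TruChain / TruLlama; verify the name against your pinned version) and start a background evaluator: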
tru.start_evaluator() # starts a background thread that picks up unscored records
Output: the evaluator runs feedback functions on records as they accumulate; chain calls return immediately, scores fill in seconds later
Sampling in production. Recording every request to an LLM-judge feedback pipeline is expensive. Wrap your app to record a sampled fraction:
import random
def maybe_record(chain, question):
    if random.random() < 0.1:  # 10% sample
        with tru_chain:
            return chain.invoke({"question": question})
    return chain.invoke({"question": question})
Output: only one in ten requests pays the recording and judge cost; per-app means get noisier with fewer records, but trends remain visible
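A variation on the sampler above, sketched under assumptions: instead of random.random(), hash a stable request identifier so the same request always lands in or out of the sample, which keeps retries and replays consistent. The function name should_record and the idea of a request_id string are illustrative and not part of TruLens.

```python
import hashlib


def should_record(request_id: str, sample_rate: float = 0.1) -> bool:
    """Deterministically decide whether a request falls in the recorded sample."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Record when the bucket falls below the sample_rate share of the range.
    return bucket < int(sample_rate * 2**64)
```

Use it in place of the random check: enter the recorder's with block only when should_record(request_id) is true, and invoke the chain unrecorded otherwise.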
Multi-process and worker fleets. Each worker needs to point at the same Postgres database. The Tru singleton inside each process opens its own connection pool; there's no cross-process coordination needed beyond the shared DB.
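One way to keep every worker pointed at the same database is to read the URL from the environment, sketched here. The variable name TRULENS_DATABASE_URL is an assumption made for this example, not a setting TruLens reads on its own, and the SQLite fallback is only meant for local development.

```python
import os


def runs_database_url(
    env_var: str = "TRULENS_DATABASE_URL",  # hypothetical variable name
    default: str = "sqlite:///default.sqlite",
) -> str:
    """Return the runs-database URL shared by all workers, or a local fallback."""
    # An unset or empty variable falls back to the local SQLite file.
    return os.environ.get(env_var) or default
```

Each worker would then construct its recorder store with Tru(database_url=runs_database_url()), which also keeps credentials out of source control.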
Evaluation methodology
TruLens's signature contribution is the RAG triad: three feedback functions that together diagnose retrieval, generation, and grounding.
| Feedback | What it measures | Catches |
|---|---|---|
| Context Relevance | are retrieved contexts relevant to the question? | retrieval failures |
| Answer Relevance | does the answer actually answer the question? | off-topic or evasive generation |
| Groundedness | does the answer follow from the contexts? | generation that ignores or contradicts contexts |
A high-quality RAG pipeline scores well on all three. A pipeline that scores high on Answer Relevance but low on Groundedness is hallucinating — the LM is making up plausible-sounding answers.
Feedback function design. A feedback function is (record selectors → score in [0, 1]). The Select.* API gives you typed accessors to record fields:
- Select.RecordInput: the chain's input.
- Select.RecordOutput: the chain's output.
- Select.RecordCalls[...]: any intermediate call (retriever output, tool result, etc.).
LM-judge cost. Each feedback function is an LM call. The default RAG triad makes three calls per recorded request. For a 10,000-call eval set, that's 30,000 judge calls — pin a cheap model for the inner loop and a stronger model for release-gating runs.
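The arithmetic above is easy to turn into a budget check before launching a run. This is a minimal sketch: the per-call price is a placeholder you supply, and it follows the text's simplification of one judge call per feedback function per record.

```python
def judge_call_budget(
    records: int,
    feedbacks_per_record: int = 3,   # the RAG triad
    cost_per_call_usd: float = 0.0,  # placeholder: fill in your judge model's price
) -> tuple[int, float]:
    """Return (total judge calls, estimated cost in USD) for an eval run."""
    calls = records * feedbacks_per_record
    return calls, calls * cost_per_call_usd


calls, cost = judge_call_budget(10_000)
print(calls)  # 30000
```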
Reference-based vs reference-free. TruLens supports both:
- Reference-free (RAG triad, custom heuristics) — no ground truth required.
- Reference-based (GroundTruthAgreement): needs (input, expected_output) pairs.
Use reference-free for live production sampling; reference-based for fixed regression sets.
Version migration guide
trulens-eval → trulens umbrella. Snowflake/TruEra are migrating from the single trulens-eval package to a broader trulens umbrella that splits evaluation, dashboard, providers, and instrumentation into separate sub-packages. The exact final layout is still settling (verify against the project README at install time) — until it does, trulens-eval remains the safe install name for evaluation code.
Symptoms of the migration in the wild:
- Some tutorials show from trulens.core import ... while older code uses from trulens_eval import .... Both refer to the same library at different naming stages.
- The PyPI trulens package may be a metapackage that pulls in trulens-core, trulens-dashboard, trulens-providers-*, etc.
- Long-term, expect pip install trulens to be the recommended entry; for now, pip install trulens-eval is more reliably reproducible.
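To find out which naming stage an environment is on, you can probe for the top-level module without importing it. A small sketch using only the standard library; it checks the two top-level names mentioned above and makes no claim about which submodules either one exports.

```python
from importlib.util import find_spec
from typing import Optional


def detect_trulens_package() -> Optional[str]:
    """Return the top-level TruLens import name available here, if any."""
    # Check the umbrella name first, then the historical one.
    for name in ("trulens", "trulens_eval"):
        if find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    print(detect_trulens_package() or "no TruLens package installed")
```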
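Database schema migrations. The SQLAlchemy-backed runs DB changes across minors, and TruLens manages the schema with Alembic. Whether migrations run automatically at startup or need an explicit call (tru.migrate_database() in trulens-eval) depends on the version, so check the release notes; either way, the operation can fail for SQLite if the file is open in another process. Pattern for safe upgrades:
- Stop all writers.
- Back up the runs DB (cp default.sqlite default.sqlite.bak).
- Upgrade the package.
- Run Tru() in a script to trigger migrations, calling tru.migrate_database() if your version requires it.
- Verify the dashboard loads.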
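The backup step can also be scripted. This sketch swaps the plain cp for SQLite's online backup API from the standard library, which copies a consistent snapshot even if a connection is still open; the function name and file paths are illustrative.

```python
import sqlite3
from pathlib import Path


def backup_sqlite(src: str, dest: str) -> None:
    """Copy a SQLite database file using SQLite's online backup API."""
    # Refuse to run on a missing file: sqlite3.connect would silently create it.
    if not Path(src).is_file():
        raise FileNotFoundError(src)
    source = sqlite3.connect(src)
    try:
        target = sqlite3.connect(dest)
        try:
            source.backup(target)  # copies every page into the target database
        finally:
            target.close()
    finally:
        source.close()
```

For example, backup_sqlite("default.sqlite", "default.sqlite.bak") before upgrading the package.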
Feedback function signatures. Selector syntax has evolved across versions:
- Older: the .on_input_output() shorthand, which the first recipe above still uses.
- Current: the .on(Select.RecordInput).on(Select.RecordOutput) builder.
- The Feedback(...).aggregate(...) API for collapsing multi-step records has been reshaped.
Pinning strategy. For longitudinal feedback numbers, pin the trulens version, the provider model name, and (where possible) the prompt revision. Treat scores as bands across versions, not exact equality.
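A lightweight way to apply this is to store a fingerprint next to each batch of scores. The sketch below assumes the distribution is named trulens-eval and that you can pass in the judge prompt text; the field names are illustrative, not a TruLens schema.

```python
import hashlib
from importlib import metadata


def run_fingerprint(judge_model: str, prompt_text: str,
                    package: str = "trulens-eval") -> dict:
    """Collect the values that should stay fixed across comparable eval runs."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not-installed"
    return {
        "package": package,
        "package_version": version,
        "judge_model": judge_model,
        # A short hash is enough to detect that the prompt text changed.
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12],
    }
```

Two runs are only comparable score-for-score when their fingerprints match; otherwise compare bands.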
Index tuning & retrieval quality
TruLens doesn't index data — it observes what your retriever returns. The relevant tuning lever is how TruLens's feedback functions surface retrieval problems.
Retrieval signal in feedback scores.
- Low Context Relevance with high Answer Relevance → retrieval is pulling junk; the LM is papering over it. Tune top_k, chunk size, or add a reranker.
- High Context Relevance but low Groundedness → retrieval is fine, the LM is hallucinating. Switch model, add citation prompts, or apply guardrails.
- Low on all three → broken pipeline. Inspect raw records via tru.get_records_and_feedback().
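The three patterns above can be encoded as a small triage helper for batch reports. This is a sketch only: the 0.5 cut-off is an arbitrary assumption, and the function is not part of TruLens.

```python
def diagnose_triad(context_relevance: float, answer_relevance: float,
                   groundedness: float, threshold: float = 0.5) -> str:
    """Map mean RAG-triad scores in [0, 1] to the likely failure area."""
    low_ctx = context_relevance < threshold
    low_ans = answer_relevance < threshold
    low_grd = groundedness < threshold
    if low_ctx and low_ans and low_grd:
        return "broken pipeline: inspect raw records"
    if low_ctx and not low_ans:
        return "retrieval problem: tune top_k, chunk size, or add a reranker"
    if not low_ctx and low_grd:
        return "generation problem: switch model, add citation prompts, or guardrails"
    return "no dominant failure pattern"
```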
A/B testing retrieval variants. Wrap each variant as a separate TruChain / TruLlama with a distinct app_id. Send the same eval questions through each. The dashboard shows per-app feedback means side by side.
# Build each variant once so the wrapped instance is the one that gets invoked
chain_topk5 = chain_with_topk(5)
chain_topk10 = chain_with_topk(10)
tru_topk5 = TruChain(chain_topk5, app_id="topk-5", feedbacks=[...])
tru_topk10 = TruChain(chain_topk10, app_id="topk-10", feedbacks=[...])
for q in eval_questions:
    with tru_topk5:
        chain_topk5.invoke({"question": q})
    with tru_topk10:
        chain_topk10.invoke({"question": q})
print(tru.get_leaderboard(app_ids=["topk-5", "topk-10"]))
Output: the leaderboard compares mean feedback scores per app version; pick the variant that wins on Context Relevance without sacrificing latency
Per-record drilldown. The dashboard surfaces individual records ordered by score — the worst-performing records are the most informative debug signal. Use them to find the chunking edge cases, missing documents, or ambiguous queries that need attention.
Performance tuning
| Lever | Mechanism | When it helps |
|---|---|---|
| tru.start_evaluator() | background feedback eval | latency-sensitive apps |
| Sampling (record 10% of requests) | reduce judge calls | high-volume production |
| Cheap judge model | smaller LM | inner-loop dev |
| Postgres backend | multi-writer support | multi-process apps |
| Selective feedback functions | drop expensive metrics in dev | iteration speed |
| Skip dashboard install | leaner image | headless servers |
Skipping Streamlit on headless servers. If you only need evaluation and not the dashboard, install with --no-deps and then manually add only the runtime deps you need; alternatively, run the dashboard separately on a different machine pointing at the same database.
Troubleshooting common errors
- OperationalError: database is locked. Two processes opened the same SQLite runs DB. Switch to Postgres or coordinate access via a single writer.
- Dashboard hangs on startup. Port 8501 is already in use. Kill the stale Streamlit (pkill -f streamlit) and re-run tru.run_dashboard().
- ImportError: cannot import name 'TruChain'. Package name confusion. Either pip install trulens-eval (older) or pip install trulens (umbrella, newer). Check which import path your version actually exports.
- All feedback scores are NaN. The judge LM is rejecting the prompt or hitting rate limits. Inspect raw responses via tru.get_records_and_feedback().
- Feedback never computes. The recorder is in deferred mode and no evaluator is running. Either call tru.start_evaluator() or switch the recorder back to in-app evaluation (feedback_mode="with_app" on TruChain / TruLlama in trulens-eval; verify against your version).
- Schema mismatch after upgrade. Alembic migration failed silently. Back up, drop the DB, re-run.
- Atexit cleanup on kill -9. The Streamlit subprocess is orphaned. Use kill (SIGTERM) or tru.stop_dashboard() to shut down cleanly.
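Before relaunching the dashboard, it can help to check whether something is still listening on the port. A minimal standard-library sketch; the default host and the half-second timeout are assumptions.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If port_in_use(8501) is true before you have started a dashboard, a stale Streamlit process is the likely culprit.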
Ecosystem integrations
- LangChain: TruChain wraps any Runnable and instruments every step.
- LlamaIndex: TruLlama wraps query engines, retrievers, and chat engines.
- Native function wrapping: TruBasicApp and TruCustomApp instrument arbitrary Python callables without a framework.
- Snowflake: first-class integration for running TruLens evaluations inside Snowflake (Snowflake Cortex Search, Snowflake LLMs as feedback providers).
- HuggingFace: feedback provider that runs models via Inference API or local Transformers pipeline.
- Cohere / Anthropic / Bedrock / LiteLLM: feedback providers via the corresponding extras.
- OpenTelemetry: recent versions emit OTel traces alongside the SQLite recordings.
Embeddings & chunking strategy
Most TruLens feedback functions are LM-judged, not embedding-based — chunking and embedding decisions belong upstream in the RAG pipeline, not in TruLens itself. The interaction points are:
- Groundedness feedback runs per-chunk by default, scoring each retrieved chunk's contribution to the answer. Smaller chunks produce more, smaller scoring calls; larger chunks produce fewer, larger calls. Both end at roughly the same total cost.
- Context Relevance feedback scores chunks against the question. Heavily overlapping chunks inflate scores; ensure your chunker deduplicates overlap regions before indexing.
- Embedding-based feedback (custom). You can write a feedback function that computes cosine similarity between input/output embeddings — useful for query-answer relevance without an LM judge. Reuse the same embedder your retrieval pipeline uses.
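The embedding-based option in the last bullet can be sketched as a plain callable. The embed argument stands in for whatever embedder your retrieval pipeline already uses (a hypothetical function from text to a vector of floats); negative cosine values are clamped so the score stays in [0, 1].

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_embedding_feedback(embed: Callable[[str], Vector]) -> Callable[[str, str], float]:
    """Build a (question, answer) -> score function around a supplied embedder."""
    def embedding_relevance(question: str, answer: str) -> float:
        score = cosine_similarity(embed(question), embed(answer))
        return max(0.0, min(1.0, score))  # clamp to the [0, 1] feedback range
    return embedding_relevance
```

The returned function takes the record input and output as strings, so it can be wrapped like any other custom callable, for example Feedback(make_embedding_feedback(my_embed), name="embedding_relevance").on_input_output(), where my_embed is your own embedder.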
Chunk-size sweeps as evaluation. Use TruLens to compare app versions with different chunk sizes. Give each chunker its own app_id, run the same questions through each, and read the leaderboard. As a rough starting point, sweep around 800–1500 character chunks for modern long-context LMs, then confirm where context_relevance peaks on your own corpus.
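A sweep needs two small pieces: a chunker parameterised by size and a stable app_id per variant. The sketch below uses a naive fixed-width character chunker as a stand-in for your real splitter; both function names and the app_id prefix are illustrative.

```python
def chunk_text(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-width character chunks with optional overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def sweep_app_ids(sizes: list[int], prefix: str = "chunk") -> dict[int, str]:
    """Assign one app_id per chunk size so each variant gets its own leaderboard row."""
    return {size: f"{prefix}-{size}" for size in sizes}
```

Build one index and one wrapped app per entry of sweep_app_ids([800, 1200, 1500]), then send the same eval questions through each.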
HyDE and query rewriting are upstream of TruLens; they show up in feedback scores as improved Context Relevance.
Security considerations
TruLens persists every recorded request and the LM judge's reasoning into a local database. That database is a copy of your sensitive data, and the feedback pipeline sends data to third-party LMs.
- Runs database content. Every input, retrieved context, output, and feedback rationale lives in default.sqlite (or your Postgres equivalent). Treat the DB at the same sensitivity as your production prompts.
- PII in dashboards. The Streamlit dashboard renders raw records. Restrict access: bind to localhost and tunnel, or front with auth.
- Provider data exposure. Feedback functions send records to whichever judge provider you configure (OpenAI / Anthropic / Bedrock / HuggingFace). For sensitive data, use a self-hosted judge (local Llama via vLLM/Ollama) or a provider with a no-training guarantee.
- Multi-tenant runs DB. If multiple apps share a Postgres backend, isolate by app_id and consider per-tenant databases for strict isolation.
- Prompt injection in retrieved content. The same caveat as any RAG pipeline: TruLens records but does not sanitise.
- Sampling in CI. Eval-in-CI can leak production data to the judge LM. Use a staging dataset for CI rather than live prompts.
- Key rotation. Provider API keys live in env vars; rotate alongside the rest of the fleet.
- Database export. tru.get_records_and_feedback() returns a DataFrame of raw records; restrict who can run this.
When NOT to use this
TruLens is the right tool when you want in-process tracing plus a feedback-function dashboard for an existing LLM app. The trade-offs below are where another tool fits better.
- Fixed-dataset batch evaluation. Ragas is more focused — dataset in, scores out, no tracing overhead.
- Pytest-native assertions. DeepEval lives next to unit tests; TruLens is a dashboard tool.
- Hosted observability with a team UI. LangSmith, Langfuse, or Phoenix have multi-user dashboards out of the box; TruLens's Streamlit is single-user-friendly.
- Production tracing into your existing APM. OpenTelemetry / OpenLLMetry / Traceloop emit standard OTel; TruLens is a complement, not a replacement.
- You don't want a separate runs database. TruLens persistence is a feature; if you only need stateless scoring, ragas is lighter.
See also
- AI: trulens — RAG triad, feedback functions, dashboard
- Concept: RAG — retrieval-augmented generation patterns
- Concept: API — REST design fundamentals