LangSmith
Trace, debug, evaluate, and monitor LLM applications with LangSmith. Covers tracing setup, datasets, evaluators, prompt hub, comparing runs, and CI integration.
LangSmith — LLM Observability & Evaluation
What it is
LangSmith is LangChain Inc.'s platform for observability and evaluation of LLM applications. It automatically captures every prompt, response, token count, latency, and error from LangChain chains and agents, and from any Python code you instrument manually. You use it to debug failures, build evaluation datasets from production traces, run automated regression tests, and compare model/prompt versions. LangSmith has a free tier and integrates with LangChain via two environment variables.
Install
pip install langsmith
pip install langchain # optional — auto-traces all LangChain calls
Output: (none — exits 0 on success)
Quick example
import os
# Enable tracing with two env vars — that's all LangChain needs
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..." # from smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-project"
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6")
    | StrOutputParser()
)
result = chain.invoke({"text": "LangSmith traces every LLM call automatically."})
print(result)
# Every call now appears in the LangSmith UI with full prompt/response/latency
Output:
LangSmith automatically traces every LLM call and displays it in the UI.
When / why to use it
- Debugging LLM chains: inspect exactly what prompt was sent, what was received, and where the chain failed.
- Building evaluation datasets: tag production traces as "good" or "bad" and export to a dataset.
- Regression testing: run a dataset through a chain and compare scores across versions.
- Prompt management: version prompts in the LangSmith Hub and pull them by commit hash.
- Monitoring production: alert on latency regressions or error rate spikes.
Common pitfalls
Traces are sent asynchronously: LangSmith batches and sends traces in the background, so a short-lived script may exit before everything is flushed. At the end of such scripts, call wait_for_all_tracers() from langchain_core.tracers.langchain for LangChain runs, or flush() on the Client instance that is doing the tracing. A freshly constructed langsmith.Client() has its own empty queue, so flushing it does not help.
Set LANGCHAIN_TRACING_V2 early: do it before the first traced call, ideally before importing LangChain. The SDK caches environment lookups, so changing the variable later in the same process may have no effect.
Use @traceable to trace any Python function, not just LangChain objects. This captures non-LangChain steps (database calls, pre/post-processing) in the same trace tree.
with tracing_context(project_name="experiment-v2"): overrides the project for a specific block, making it easy to route A/B experiments to separate projects without changing env vars.
Tracing non-LangChain code with @traceable
@traceable instruments any Python function so its inputs, outputs, and metadata appear in LangSmith traces.
from langsmith import traceable
import anthropic
import os
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
@traceable(name="Claude API call", run_type="llm")
def call_claude(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    message = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
@traceable(name="Extract keywords", run_type="chain")
def extract_keywords(text: str) -> list[str]:
    # call_claude is itself traced, so it appears as a child of this run
    response = call_claude(f"List 5 keywords from this text as a comma-separated list: {text}")
    return [k.strip() for k in response.split(",")]
result = extract_keywords("LangSmith traces LLM calls and helps you evaluate and debug.")
print(result)
Output:
['LangSmith', 'traces', 'LLM', 'evaluate', 'debug']
Datasets — ground truth for evaluation
A dataset is a collection of input/output pairs used to evaluate a chain consistently. Build datasets from production traces (tag and export), from CSV, or programmatically.
import os
from langsmith import Client
ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])
# Create a dataset
dataset = ls.create_dataset(
    dataset_name="summarisation-v1",
    description="Summarisation test cases",
)
# Add examples
examples = [
    {
        "inputs": {"text": "The quick brown fox jumps over the lazy dog."},
        "outputs": {"summary": "A fox jumps over a dog."},
    },
    {
        "inputs": {"text": "Python is a high-level interpreted programming language."},
        "outputs": {"summary": "Python is an interpreted high-level language."},
    },
]
ls.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
print(f"Dataset '{dataset.name}' created with {len(examples)} examples")
Output:
Dataset 'summarisation-v1' created with 2 examples
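For the CSV route, one possible approach is to split each row into an inputs dict and an outputs dict and pass the two lists to create_examples. The sketch below covers only the splitting step with the standard library; csv_to_examples, the column names, and the inline CSV text are assumptions for illustration.

```python
import csv
import io

# Hypothetical CSV with one input column and one reference-output column
CSV_TEXT = """text,summary
The quick brown fox jumps over the lazy dog.,A fox jumps over a dog.
Python is a high-level interpreted programming language.,Python is an interpreted high-level language.
"""


def csv_to_examples(handle, input_cols, output_cols):
    """Split each CSV row into an inputs dict and an outputs dict."""
    inputs, outputs = [], []
    for row in csv.DictReader(handle):
        inputs.append({c: row[c] for c in input_cols})
        outputs.append({c: row[c] for c in output_cols})
    return inputs, outputs


inputs, outputs = csv_to_examples(io.StringIO(CSV_TEXT), ["text"], ["summary"])
print(len(inputs), inputs[0], outputs[0])
```

The resulting lists have the same shape as the inputs= and outputs= arguments used in the dataset example.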
Evaluators — scoring predictions
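An evaluator scores a chain's output against the expected output. LangSmith exposes LangChain's off-the-shelf evaluators (such as exact_match, embedding_distance, qa, and criteria) through LangChainStringEvaluator, and supports custom evaluators written as plain Python functions that take a run and an example and return a score.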
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os
chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
    | StrOutputParser()
)
def predict(inputs: dict) -> dict:
    return {"summary": chain.invoke(inputs)}
# LLM-as-a-judge evaluator for helpfulness
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": "helpfulness",
        "llm": ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"]),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["summary"],
        "reference": example.outputs["summary"],
        "input": example.inputs["text"],
    },
)
results = evaluate(
    predict,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)
# Each result row carries the run, the example, and that example's evaluator results;
# average the helpfulness scores across rows
scores = [
    res.score
    for row in results
    for res in row["evaluation_results"]["results"]
    if res.key == "helpfulness" and res.score is not None
]
print(f"Mean helpfulness: {sum(scores) / max(len(scores), 1):.2f}")
Custom evaluators
from langsmith.evaluation import run_evaluator
from langsmith.schemas import Run, Example
@run_evaluator
def word_count_evaluator(run: Run, example: Example) -> dict:
    """Penalise summaries that are too long."""
    prediction = run.outputs.get("summary", "")
    expected = example.outputs.get("summary", "")
    # Ratio of predicted length to reference length, in words
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    # Full marks up to 1.5x the reference length, then a linear penalty down to 0
    score = 1.0 if word_ratio <= 1.5 else max(0.0, 1.0 - (word_ratio - 1.5))
    return {"key": "conciseness", "score": score, "comment": f"word_ratio={word_ratio:.2f}"}
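To see how the conciseness formula behaves, the scoring arithmetic can be pulled out into a plain function and tried on a few lengths. This is a sketch of the same formula without the Run/Example wrappers; conciseness_score and the sample strings are illustrative names, not part of LangSmith.

```python
def conciseness_score(prediction: str, expected: str) -> float:
    """Same arithmetic as word_count_evaluator, on plain strings."""
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    return 1.0 if word_ratio <= 1.5 else max(0.0, 1.0 - (word_ratio - 1.5))


reference = "A fox jumps over a dog."          # 6 words
same_length = "The fox leaps over the dog."   # ratio 1.0
double = " ".join(["word"] * 12)               # ratio 2.0
triple = " ".join(["word"] * 18)               # ratio 3.0

print(conciseness_score(same_length, reference))  # 1.0
print(conciseness_score(double, reference))       # 0.5
print(conciseness_score(triple, reference))       # 0.0
```

Summaries up to 1.5 times the reference length score 1.0, and the score reaches 0.0 at 2.5 times the reference length.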
Prompt hub — versioned prompts
Store, version, and pull prompts from the LangSmith Hub so experiments are reproducible and rollback is trivial.
from langsmith import Client
from langchain import hub
# Pull a prompt by owner/name (uses LANGCHAIN_API_KEY)
prompt = hub.pull("alicedev/summarise-v1")
print(prompt.messages)
# Pull a specific commit for reproducibility
prompt = hub.pull("alicedev/summarise-v1:abc123")
# Push a new version
from langchain_core.prompts import ChatPromptTemplate
new_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise summariser. Use at most 20 words."),
    ("human", "{text}"),
])
hub.push("alicedev/summarise-v1", new_prompt, new_repo_is_public=False)
Comparing experiments
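Run the same dataset through two different chains (e.g. Claude vs GPT-4o) to compare scores side-by-side in the LangSmith UI. Here predict_with_claude and predict_with_gpt4 stand for prediction functions shaped like predict in the evaluator example, one per model.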
from langsmith.evaluation import evaluate
# Experiment A — Claude
results_a = evaluate(
    predict_with_claude,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)
# Experiment B — GPT-4o
results_b = evaluate(
    predict_with_gpt4,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="gpt-4o",
)
# Both experiments appear in the LangSmith UI under the same dataset
# for side-by-side score and latency comparison
Feedback — tagging individual runs
import os
from langsmith import Client
ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])
# After a production run, record human feedback
ls.create_feedback(
    run_id="<run-id-from-trace>",
    key="user_rating",
    score=1.0, # 0.0 = bad, 1.0 = good
    comment="Perfect summary, exactly right length",
)
# Query runs with negative feedback
bad_runs = ls.list_runs(
    project_name="my-project",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
)
for run in bad_runs:
    print(run.id, run.inputs, run.outputs)
CI integration — fail on score regression
from langsmith.evaluation import evaluate
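def test_summarisation_quality():
    results = evaluate(
        predict,
        data="summarisation-v1",
        evaluators=[helpfulness_evaluator],
        experiment_prefix="ci",
    )
    # Mean helpfulness over the per-example evaluation results
    scores = [
        res.score
        for row in results
        for res in row["evaluation_results"]["results"]
        if res.key == "helpfulness" and res.score is not None
    ]
    mean_score = sum(scores) / len(scores)
    assert mean_score >= 0.75, (
        f"Helpfulness score {mean_score:.2f} below threshold 0.75"
    )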
pytest test_eval.py # fails if mean helpfulness drops below threshold
Output: (none — exits 0 on success)
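A fixed threshold is one way to gate CI; comparing against a stored baseline with a small tolerance is another, and it reacts to regressions rather than to absolute quality. The following is a minimal sketch of that comparison on a plain list of scores; check_regression and its tolerance parameter are hypothetical names, and how the baseline is stored is left open.

```python
from statistics import mean


def check_regression(scores, baseline, tolerance=0.02):
    """Return (passed, mean_score); pass when the mean is within `tolerance` of baseline."""
    if not scores:
        raise ValueError("no scores to evaluate")
    current = mean(scores)
    return current >= baseline - tolerance, current


passed, current = check_regression([1.0, 1.0, 0.0, 1.0], baseline=0.75)
print(passed, current)  # True 0.75

failed, lower = check_regression([1.0, 0.0, 0.0, 1.0], baseline=0.75)
print(failed, lower)    # False 0.5
```

The tolerance absorbs small run-to-run noise from LLM-as-a-judge evaluators, so a single flipped example on a large dataset does not fail the build.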
Run types — semantic categories in the trace tree
A run type labels a span in the trace tree so the UI can render the right icon, surface token usage, and filter activity. Pick the type that matches what the function does — not the framework that produces it.
| Type | Meaning | Typical inputs/outputs |
|---|---|---|
| llm | A model call (any provider) | {prompt} → {completion, tokens, cost} |
| chain | A multi-step orchestration | composite inputs → composite outputs |
| tool | A function/tool call (search, calc, HTTP) | tool args → tool return value |
| retriever | A vector store or BM25 retriever | {query} → list of documents |
| embedding | An embedding call | {text} → vector |
| parser | Output parsing / JSON extraction | raw text → structured data |
| prompt | A prompt template render | template + vars → final string |
from langsmith import traceable
@traceable(run_type="retriever", name="pgvector_retrieve")
def retrieve(query: str, k: int = 5) -> list[dict]:
    # The UI renders this as a retriever node with a document count badge
    return [{"page_content": "...", "metadata": {"source": "doc-1"}} for _ in range(k)]
@traceable(run_type="tool", name="weather_api")
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}
Output: (none — exits 0 on success)
Tracing context — overriding project, tags, metadata
tracing_context is a context manager that mutates the active trace settings for the lifetime of a block. Use it to route A/B variants to separate projects or attach experiment metadata without changing env vars.
from langsmith import traceable
from langsmith.run_helpers import tracing_context
@traceable
def summarise(text: str) -> str:
    return call_claude(f"Summarise: {text}")
# Route this run to a different project with extra tags + metadata
with tracing_context(
    project_name="experiment-v2",
    tags=["ab-test", "treatment"],
    metadata={"variant": "B", "user_segment": "power"},
):
    summarise("LangSmith batches and flushes traces asynchronously.")
Output: (none — exits 0 on success)
Multi-turn chats — threads and sessions
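Group related runs into a thread so the LangSmith UI shows them as a conversation. Attach the same session_id, thread_id, or conversation_id metadata value to every run in the conversation: through the run config's metadata in LangChain, or through tracing_context when calling @traceable functions.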
import uuid
from langsmith import traceable
from langsmith.run_helpers import tracing_context
thread_id = str(uuid.uuid4())
@traceable(run_type="chain")
def chat_turn(user_message: str) -> str:
    return call_claude(user_message)
# Every turn inside the block shares the same session_id, so the UI groups them
with tracing_context(metadata={"session_id": thread_id, "user_id": "alice-dev"}):
    chat_turn("Hi, what is RAG?")
    chat_turn("How does it differ from fine-tuning?")
    chat_turn("Show me a Python example.")
Output: (none — exits 0 on success)
Programmatic trace inspection
The Client API lets you query, filter, and download runs without the UI — useful for nightly reports, dataset curation, and custom dashboards.
from langsmith import Client
from datetime import datetime, timedelta, timezone
ls = Client()
# Last 24h of failing runs in this project
runs = ls.list_runs(
    project_name="my-project",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    filter='eq(status, "error")',
    limit=100,
)
for r in runs:
    print(f"{r.start_time:%H:%M:%S} {r.name:30s} err={r.error[:60] if r.error else '-'}")
Output:
14:02:11 retrieve err=ConnectionError: pgvector timed out
14:09:47 call_claude err=anthropic.APIStatusError: 529 overloaded
14:31:05 parse_json err=ValueError: Expecting value: line 1 col
Filtering runs — the LangSmith query DSL
filter= accepts a small expression language for selecting runs. Combine predicates with and(...), or(...), and not(...). Operators: eq, neq, gt, gte, lt, lte, has, search.
# All runs where the question contained "rag", token cost > $0.01, and feedback score < 0.5
runs = ls.list_runs(
    project_name="my-project",
    filter=(
        'and('
        ' search("rag"),'
        ' gt(total_cost, 0.01),'
        ' and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))'
        ')'
    ),
)
print(f"{sum(1 for _ in runs)} runs matched")
Output:
17 runs matched
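Hand-concatenating filter strings gets error-prone once predicates are built from variables. One option is a pair of tiny helpers that assemble the expression and quote string values. The sketch below only builds the string; compare and all_of are hypothetical helper names, not part of the LangSmith SDK.

```python
import json


def _arg(value):
    """Quote strings with double quotes; leave numbers bare."""
    return json.dumps(value) if isinstance(value, str) else str(value)


def compare(op, field, value):
    """Build a single predicate such as lt(feedback_score, 0.5)."""
    return f"{op}({field}, {_arg(value)})"


def all_of(*clauses):
    """Combine predicates with and(...)."""
    return "and(" + ", ".join(clauses) + ")"


negative_feedback = all_of(
    compare("eq", "feedback_key", "user_rating"),
    compare("lt", "feedback_score", 0.5),
)
print(negative_feedback)
# and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))
```

The resulting string can be passed as the filter= argument of list_runs.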
Cost and token tracking
Every traced LLM call records prompt/completion token counts and a computed dollar cost (LangSmith maintains a per-model price list). Aggregate across a project with read_project(project_name=..., include_stats=True) or by iterating list_runs.
from langsmith import Client
from collections import defaultdict
ls = Client()
costs = defaultdict(float)
for r in ls.list_runs(project_name="my-project", run_type="llm", limit=1000):
    model = (r.extra or {}).get("invocation_params", {}).get("model", "unknown")
    # total_cost may be a Decimal or None; normalise to float before summing
    costs[model] += float(r.total_cost or 0)
for model, total in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f" {model:30s} ${total:7.2f}")
Output:
claude-sonnet-4-6 $ 42.18
gpt-4o $ 18.74
text-embedding-3-small $ 0.62
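When the totals feed billing or budget alerts, summing in Decimal avoids float rounding drift. The sketch below runs the same aggregation against stand-in objects so it can be tried without a LangSmith account. It assumes total_cost is a Decimal or None; fake_runs and cost_by_model are hypothetical names.

```python
from collections import defaultdict
from decimal import Decimal
from types import SimpleNamespace

# Hypothetical stand-ins for Run objects; only the two fields used below
fake_runs = [
    SimpleNamespace(extra={"invocation_params": {"model": "claude-sonnet-4-6"}},
                    total_cost=Decimal("0.0125")),
    SimpleNamespace(extra={"invocation_params": {"model": "claude-sonnet-4-6"}},
                    total_cost=Decimal("0.0075")),
    SimpleNamespace(extra=None, total_cost=None),  # no cost computed, model unknown
]


def cost_by_model(runs):
    """Sum total_cost per model, treating a missing cost as zero."""
    totals = defaultdict(Decimal)  # Decimal() is Decimal("0")
    for r in runs:
        model = (r.extra or {}).get("invocation_params", {}).get("model", "unknown")
        totals[model] += r.total_cost or Decimal("0")
    return dict(totals)


totals = cost_by_model(fake_runs)
for model, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{model:30s} ${total:.4f}")
```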
Streaming and partial outputs
For streamed token output, LangSmith records the full assembled text as the final output once the stream closes. Use streaming=True in LangChain clients; with @traceable, return an iterable or yield from a generator — LangSmith collects the full sequence automatically.
from langsmith import traceable
@traceable(run_type="llm", name="claude_stream")
def stream_completion(prompt: str):
    # client is the anthropic.Anthropic instance created in the @traceable example
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for chunk in stream.text_stream:
            yield chunk
text = "".join(stream_completion("List three uses of LangSmith."))
print(text[:120])
Output:
1. Debugging LangChain failures by inspecting the exact prompt/response per step.
2. Building evaluation
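The "final output once the stream closes" behaviour can be pictured as a pass-through generator that records chunks as they go by and stores the joined text only after the last one. The following is a conceptual standard-library sketch, not LangSmith's code; tee_stream and record are hypothetical names.

```python
def tee_stream(chunks, sink):
    """Yield chunks unchanged while recording them; store the joined text at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    # Only reached once the consumer has exhausted the stream
    sink["output"] = "".join(parts)


record = {}
stream = tee_stream(iter(["Lang", "Smith ", "traces"]), record)

first = next(stream)                    # consumer sees the first chunk immediately
seen_before_close = "output" in record  # nothing recorded yet
rest = "".join(stream)                  # drain the remaining chunks

print(first, seen_before_close, record["output"])
```

A consequence worth noting: if the consumer stops reading early, the code after the loop never runs, so a recorded output depends on the stream being fully consumed or explicitly closed.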
Datasets from production — promoting traces to ground truth
A common workflow: a user thumbs-down a response → you fix the prompt → you want to make sure the fix didn't regress other queries. The fastest loop is to clone interesting production runs into a dataset, then re-run a candidate chain against that dataset.
from langsmith import Client
ls = Client()
# Find production runs with negative feedback and clone them into a dataset
bad_runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
    limit=50,
))
dataset = ls.create_dataset("regressions-2026-05", description="Negative feedback from prod")
for r in bad_runs:
    ls.create_example(
        inputs=r.inputs,
        outputs=r.outputs, # current production output, treat as "what we had"
        dataset_id=dataset.id,
        metadata={"source_run": str(r.id)},
    )
print(f"Promoted {len(bad_runs)} runs into '{dataset.name}'")
Output:
Promoted 24 runs into 'regressions-2026-05'
Pairwise (preference) evaluation
A pairwise evaluator chooses which of two candidate outputs is better — useful for A/B tests where no single ground-truth answer exists.
from langsmith.evaluation import evaluate_comparative
from langchain_anthropic import ChatAnthropic
import os
judge = ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
def preference_judge(runs, example):
    """Pick the run with the more concise, accurate summary."""
    a, b = runs
    prompt = (
        "You are judging two summaries. Reply with 'A' or 'B' only.\n"
        f"Question: {example.inputs['text']}\n"
        f"Reference: {example.outputs['summary']}\n"
        f"A: {a.outputs['summary']}\n"
        f"B: {b.outputs['summary']}"
    )
    choice = judge.invoke(prompt).content.strip().upper()
    # Any reply other than "A" falls through to B
    winner = a if choice == "A" else b
    return {"key": "preferred", "scores": {str(a.id): int(winner is a), str(b.id): int(winner is b)}}
evaluate_comparative(
    experiments=["claude-sonnet-4-6", "gpt-4o"], # two prior experiment names
    evaluators=[preference_judge],
)
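Because the judge above treats every reply other than "A" as a vote for B, a malformed reply silently favours one side. One possible hardening is to parse the reply strictly and score an unparseable reply as a tie. The sketch below shows only the parsing and scoring; parse_choice, preference_scores, and the tie convention are assumptions for illustration.

```python
import re


def parse_choice(reply: str):
    """Return 'A' or 'B' if the reply names exactly one of them, else None."""
    found = set(re.findall(r"\b([AB])\b", reply.upper()))
    return found.pop() if len(found) == 1 else None


def preference_scores(choice, a_id, b_id):
    """Map a parsed choice to per-run scores."""
    if choice is None:
        # Assumption: an unparseable reply counts as a tie
        return {a_id: 0.5, b_id: 0.5}
    return {a_id: int(choice == "A"), b_id: int(choice == "B")}


print(parse_choice("b."))                                  # B
print(parse_choice("A or B?"))                             # None
print(preference_scores(parse_choice("A"), "run-a", "run-b"))
```

The parser is deliberately simple and still accepts a standalone article "a" as a vote, so it works best with a judge prompt that demands a one-letter reply.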
Self-hosted LangSmith
Set a different LANGCHAIN_ENDPOINT to send traces to a self-hosted LangSmith instance (Helm chart, Docker Compose). The client is identical otherwise.
export LANGCHAIN_ENDPOINT="https://langsmith.internal.example.com"
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_TRACING_V2="true"
python my_app.py
Output: (none — exits 0 on success)
Sampling traces in production
For high-traffic services, capturing every trace is expensive. Sample 1–10% with LANGCHAIN_TRACING_SAMPLING_RATE or wrap your entry point with a manual sampler.
import random
from langsmith import traceable
from langsmith.run_helpers import tracing_context
@traceable
def handle_request(payload: dict) -> dict:
    return process(payload)  # process() is your application's own handler (not shown)
def maybe_trace(payload: dict, sample_rate: float = 0.05) -> dict:
    if random.random() < sample_rate:
        return handle_request(payload)
    # Disable tracing for this call entirely
    with tracing_context(enabled=False):
        return handle_request(payload)
Output: (none — exits 0 on success)
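Random sampling decides per call, so the turns of one conversation can end up partly traced and partly not. An alternative is to derive the decision from a stable key such as a request or session ID, so the same key is always kept or always dropped. This is a minimal sketch of that idea using a hash bucket; should_trace and the key format are hypothetical.

```python
import hashlib


def should_trace(key: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


kept = sum(should_trace(f"req-{i}", 0.05) for i in range(10_000))
print(f"kept {kept} of 10000 keys")

# The decision for a given key never changes between calls
assert should_trace("session-42", 0.05) == should_trace("session-42", 0.05)
```

The result can replace the random.random() check in a wrapper like maybe_trace, with the session or request ID as the key.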
Real-world recipes
These recipes string together the building blocks above into common production patterns.
Recipe: nightly evaluation report
Run a held-out dataset against the current production chain every night and post the deltas to Slack.
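from datetime import datetime, timezone
from langsmith import Client
from langsmith.evaluation import evaluate
ls = Client()
def predict(inputs: dict) -> dict:
    return {"summary": production_chain.invoke(inputs)}
results = evaluate(
    predict,
    data="regression-suite-v3",
    evaluators=[helpfulness_evaluator, word_count_evaluator],
    experiment_prefix=f"nightly-{datetime.now(timezone.utc):%Y-%m-%d}",
)
# Average each feedback key over the per-example evaluation results
by_key = {}
for row in results:
    for res in row["evaluation_results"]["results"]:
        if res.score is not None:
            by_key.setdefault(res.key, []).append(res.score)
agg = {k: sum(v) / len(v) for k, v in by_key.items()}
# Assumption: the previous green run is an experiment named "nightly-previous",
# and read_project(..., include_stats=True) exposes per-key stats with an "avg" field
prev = ls.read_project(project_name="nightly-previous", include_stats=True)
prev_agg = {k: (s.get("avg") or 0.0) for k, s in (prev.feedback_stats or {}).items()}
delta = {k: agg[k] - prev_agg.get(k, 0) for k in agg}
print({k: round(v, 3) for k, v in delta.items()})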
Output:
{'helpfulness': 0.04, 'conciseness': -0.02}
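The recipe stops at printing the deltas. To post them to Slack, one option is to format the dict as a short text report and send it as the JSON body of an incoming-webhook request. The sketch below covers only the formatting and the payload; format_delta_report and the 0.01 threshold are assumptions, and the HTTP call is left out.

```python
import json


def format_delta_report(delta, threshold=0.01):
    """Render one line per metric, flagging changes larger than `threshold`."""
    lines = []
    for key in sorted(delta):
        change = delta[key]
        if change < -threshold:
            marker = "regressed"
        elif change > threshold:
            marker = "improved"
        else:
            marker = "unchanged"
        lines.append(f"{key}: {change:+.3f} ({marker})")
    return "\n".join(lines)


report = format_delta_report({"helpfulness": 0.04, "conciseness": -0.02})
payload = json.dumps({"text": report})  # JSON body for a Slack incoming-webhook POST
print(report)
# conciseness: -0.020 (regressed)
# helpfulness: +0.040 (improved)
```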
Recipe: prompt promotion gate
Block any merge that pushes a prompt change unless evaluation scores hold or improve on the canonical dataset.
import sys
from langsmith.evaluation import evaluate
def gate_prompt_change(prompt_path: str, baseline_score: float = 0.78) -> None:
    with open(prompt_path, encoding="utf-8") as f:
        new_prompt = f.read()
    def predict(inputs: dict) -> dict:
        return {"answer": call_claude(new_prompt.format(**inputs))}
    results = evaluate(predict, data="prompt-gate-v1", evaluators=[helpfulness_evaluator])
    # Mean helpfulness over the per-example evaluation results
    scores = [
        res.score
        for row in results
        for res in row["evaluation_results"]["results"]
        if res.key == "helpfulness" and res.score is not None
    ]
    score = sum(scores) / len(scores)
    if score < baseline_score:
        raise SystemExit(f"FAIL: helpfulness {score:.3f} < baseline {baseline_score:.3f}")
    print(f"PASS: helpfulness {score:.3f} >= {baseline_score:.3f}")
if __name__ == "__main__":
    gate_prompt_change(sys.argv[1])
python -m scripts.gate_prompt_change ./prompts/summarise.txt
Output:
PASS: helpfulness 0.812 >= 0.780
Recipe: user feedback → fine-tune dataset
Collect runs that earned a thumbs-up and export them as a Hugging Face dataset for supervised fine-tuning.
from langsmith import Client
from datasets import Dataset
ls = Client()
runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), eq(feedback_score, 1.0))',
    run_type="llm",
    limit=5000,
))
records = [
    {
        "prompt": (r.inputs.get("messages") or r.inputs.get("prompt") or [""])[0]
        if isinstance(r.inputs.get("messages"), list) else str(r.inputs),
        "completion": r.outputs.get("output") or r.outputs.get("content") or "",
    }
    for r in runs if r.outputs
]
ds = Dataset.from_list(records)
ds.save_to_disk("./sft_thumbs_up")
print(f"Exported {len(ds)} thumbs-up examples")
Output:
Exported 1438 thumbs-up examples
Recipe: cost alarm on a per-user basis
Aggregate trace cost by user_id metadata and warn on top spenders.
from collections import defaultdict
from langsmith import Client
from datetime import datetime, timezone, timedelta
ls = Client()
spend = defaultdict(float)
for r in ls.list_runs(
    project_name="prod",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    run_type="llm",
    limit=10_000,
):
    user = (r.extra or {}).get("metadata", {}).get("user_id", "anon")
    # total_cost may be a Decimal or None; normalise to float before summing
    spend[user] += float(r.total_cost or 0)
# Top ten spenders, alerting on anyone above $5 per day
for user, total in sorted(spend.items(), key=lambda kv: -kv[1])[:10]:
    if total > 5.0:
        print(f"ALERT {user} ${total:.2f}/day")
Output:
ALERT user_4711 $12.93/day
ALERT alice-dev $8.40/day
Performance and reliability tips
- Always flush before short scripts exit: call wait_for_all_tracers() for LangChain runs, or flush() on the Client instance that is doing the tracing; otherwise the background sender may drop traces on exit.
- For high-throughput services, use LANGCHAIN_TRACING_SAMPLING_RATE=0.05 and tag the kept runs with metadata={"sampled": True} so dashboards know the sampling factor.
- Avoid putting large payloads (>1 MB) directly in inputs/outputs; link to S3/R2 in metadata instead. LangSmith truncates oversized fields.
- Set LANGCHAIN_HIDE_INPUTS=true to redact inputs on PII-sensitive projects; combine with a custom hash so you can still group identical queries.
- Pin a prompt version (hub.pull("owner/name:abc123")) in production code, because the floating tag can drift under you.
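For the redaction tip, the "custom hash" can be a keyed fingerprint of the normalised query, attached as metadata so identical questions still group together while the text stays hidden. The sketch below uses HMAC-SHA256 from the standard library; query_fingerprint, the query_hash metadata key, and the placeholder secret are assumptions for illustration.

```python
import hashlib
import hmac

# Hypothetical placeholder; load the real key from a secret store in practice
SECRET = b"replace-with-a-real-secret"


def query_fingerprint(text: str, key: bytes = SECRET) -> str:
    """Keyed hash of the query after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


a = query_fingerprint("  What is   RAG? ")
b = query_fingerprint("what is rag?")
c = query_fingerprint("What is fine-tuning?")

metadata = {"query_hash": a}  # attach alongside the redacted run
print(a == b, a == c)  # True False
```

A keyed hash is used rather than a bare SHA-256 because short, guessable queries could otherwise be recovered by hashing candidate strings.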
Quick reference
| Task | Code |
|---|---|
| Enable tracing | os.environ["LANGCHAIN_TRACING_V2"] = "true" + LANGCHAIN_API_KEY |
| Set project | os.environ["LANGCHAIN_PROJECT"] = "name" |
| Trace any function | @traceable(name="step", run_type="chain") |
| Override project | with tracing_context(project_name="exp"): |
| Attach metadata | with tracing_context(metadata={"user_id": "..."}): |
| Group as thread | metadata={"session_id": uuid} on each turn |
| Disable a block | with tracing_context(enabled=False): |
| Create dataset | ls.create_dataset("name") |
| Add examples | ls.create_examples(inputs=[...], outputs=[...], dataset_id=...) |
| Run evaluation | evaluate(predict_fn, data="dataset-name", evaluators=[...]) |
| Built-in evaluator | LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}) |
| Custom evaluator | @run_evaluator def fn(run, example) -> dict: |
| Pairwise eval | evaluate_comparative(experiments=["a","b"], evaluators=[...]) |
| Query runs | ls.list_runs(project_name=..., filter='and(...)') |
| Pull prompt | hub.pull("owner/name") |
| Pin prompt | hub.pull("owner/name:abc123") |
| Push prompt | hub.push("owner/name", prompt) |
| Tag run | ls.create_feedback(run_id, key="rating", score=1.0) |
| Flush traces | ls.flush() on the tracing client, or wait_for_all_tracers() for LangChain |
| Self-hosted | export LANGCHAIN_ENDPOINT=https://... |
| Sample 5% | export LANGCHAIN_TRACING_SAMPLING_RATE=0.05 |