cheat sheet

LangSmith

Trace, debug, evaluate, and monitor LLM applications with LangSmith. Covers tracing setup, datasets, evaluators, prompt hub, comparing runs, and CI integration.

LangSmith — LLM Observability & Evaluation

What it is

LangSmith is LangChain Inc.'s platform for observability and evaluation of LLM applications. Once tracing is enabled, it captures the prompts, responses, token counts, latency, and errors from LangChain chains and agents, and from any Python code you instrument manually. You use it to debug failures, build evaluation datasets from production traces, run automated regression tests, and compare model and prompt versions. LangSmith has a free tier and integrates with LangChain via two environment variables (LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY).

Install

bash
pip install langsmith
pip install langchain   # optional — auto-traces all LangChain calls

Output: pip prints its installation log and exits 0 on success.

Quick example

python
import os

# Enable tracing with two env vars; LANGCHAIN_PROJECT is optional
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"]    = "ls__..."   # create one at smith.langchain.com
os.environ["LANGCHAIN_PROJECT"]    = "my-project"

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6")
    | StrOutputParser()
)

result = chain.invoke({"text": "LangSmith traces every LLM call automatically."})
print(result)
# Every call now appears in the LangSmith UI with full prompt/response/latency

Output:

text
LangSmith automatically traces every LLM call and displays it in the UI.

When / why to use it

  • Debugging LLM chains: inspect exactly what prompt was sent, what was received, and where the chain failed.
  • Building evaluation datasets: tag production traces as "good" or "bad" and export to a dataset.
  • Regression testing: run a dataset through a chain and compare scores across versions.
  • Prompt management: version prompts in the LangSmith Hub and pull them by commit hash.
  • Monitoring production: alert on latency regressions or error rate spikes.

Common pitfalls

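Traces are sent asynchronously: LangSmith batches traces and sends them from a background thread, so a short-lived script can exit before the queue drains. For LangChain code, call wait_for_all_tracers() (from langchain_core.tracers.langchain) before exiting. For @traceable code, keep a reference to the Client you trace with and call its flush() method (available in recent langsmith releases). Calling flush() on a freshly constructed Client() only drains that new instance's own queue.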

Set LANGCHAIN_TRACING_V2 before the first traced call, ideally before importing LangChain. The SDK may cache environment lookups, so changing the variable after tracing has started is not guaranteed to take effect.

Use @traceable to trace any Python function — not just LangChain objects. This captures non-LangChain steps (database calls, pre/post-processing) in the same trace tree.

with tracing_context(project_name="experiment-v2"): overrides the project for a specific block, making it easy to route A/B experiments to separate projects without changing env vars.
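The first pitfall above (background batching) is easier to reason about with a toy model. The sketch below is not LangSmith's implementation; ToyBatchSender and its methods are hypothetical names, and the sleep stands in for network latency. It only illustrates why work queued on a daemon thread can be lost unless something waits for the queue to drain.

```python
import queue
import threading
import time


class ToyBatchSender:
    """Toy stand-in for a background trace sender (not LangSmith's code)."""

    def __init__(self, delay: float = 0.05) -> None:
        self._queue = queue.Queue()
        self.sent = []
        self._delay = delay
        # Daemon thread: the interpreter does not wait for it at exit.
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            item = self._queue.get()
            time.sleep(self._delay)  # simulate network latency
            self.sent.append(item)
            self._queue.task_done()

    def submit(self, trace: str) -> None:
        self._queue.put(trace)  # returns immediately, nothing is sent yet

    def flush(self) -> None:
        self._queue.join()  # block until every queued trace has been sent


sender = ToyBatchSender()
for i in range(3):
    sender.submit(f"trace-{i}")
pending_before_flush = 3 - len(sender.sent)
sender.flush()
print(f"pending before flush: {pending_before_flush}, sent after flush: {len(sender.sent)}")
```

Without the flush() call, a script ending right after the loop would exit while items are still queued, which is the failure mode described above.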

Tracing non-LangChain code with @traceable

@traceable instruments any Python function so its inputs, outputs, and metadata appear in LangSmith traces.

python
from langsmith import traceable
import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@traceable(name="Claude API call", run_type="llm")
def call_claude(prompt: str, model: str = "claude-sonnet-4-6") -> str:
    message = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

@traceable(name="Extract keywords", run_type="chain")
def extract_keywords(text: str) -> list[str]:
    response = call_claude(f"List 5 keywords from this text as a comma-separated list: {text}")
    return [k.strip() for k in response.split(",")]

result = extract_keywords("LangSmith traces LLM calls and helps you evaluate and debug.")
print(result)

Output:

text
['LangSmith', 'traces', 'LLM', 'evaluate', 'debug']
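To make the idea of a trace tree concrete, here is a minimal sketch of a decorator that records inputs, outputs, and parent/child links. It is a toy, not how langsmith implements @traceable; toy_traceable, ToyRun, and fake_llm are hypothetical names, and the model call is replaced by a canned reply.

```python
import contextvars
import functools
from dataclasses import dataclass, field

# Currently active run, so nested calls can attach themselves as children.
_current_run = contextvars.ContextVar("current_run", default=None)


@dataclass
class ToyRun:
    name: str
    inputs: dict
    outputs: object = None
    children: list = field(default_factory=list)


ROOT_RUNS = []


def toy_traceable(func):
    """Record inputs, outputs, and parent/child links for each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        run = ToyRun(name=func.__name__, inputs={"args": args, "kwargs": kwargs})
        parent = _current_run.get()
        (parent.children if parent is not None else ROOT_RUNS).append(run)
        token = _current_run.set(run)
        try:
            run.outputs = func(*args, **kwargs)
            return run.outputs
        finally:
            _current_run.reset(token)  # restore the previous active run

    return wrapper


@toy_traceable
def fake_llm(prompt: str) -> str:
    return "alpha, beta"  # canned reply instead of a real model call


@toy_traceable
def extract_keywords(text: str) -> list:
    return [k.strip() for k in fake_llm(f"keywords: {text}").split(",")]


extract_keywords("some text")
root = ROOT_RUNS[0]
print(root.name, "->", [child.name for child in root.children])
```

The nested call to fake_llm ends up as a child of extract_keywords, mirroring how call_claude appears under "Extract keywords" in the example above.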

Datasets — ground truth for evaluation

A dataset is a collection of input/output pairs used to evaluate a chain consistently. Build datasets from production traces (tag and export), from CSV, or programmatically.

python
import os

from langsmith import Client

ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])

# Create a dataset
dataset = ls.create_dataset(
    dataset_name="summarisation-v1",
    description="Summarisation test cases",
)

# Add examples
examples = [
    {
        "inputs":  {"text": "The quick brown fox jumps over the lazy dog."},
        "outputs": {"summary": "A fox jumps over a dog."},
    },
    {
        "inputs":  {"text": "Python is a high-level interpreted programming language."},
        "outputs": {"summary": "Python is an interpreted high-level language."},
    },
]
ls.create_examples(inputs=[e["inputs"] for e in examples],
                   outputs=[e["outputs"] for e in examples],
                   dataset_id=dataset.id)

print(f"Dataset '{dataset.name}' created with {len(examples)} examples")

Output:

text
Dataset 'summarisation-v1' created with 2 examples
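For the CSV route mentioned above, one possible preparation step is to split each row into an inputs dict and an outputs dict. The sketch below uses only the standard library; csv_to_examples and the column names are hypothetical, and the CSV text is inlined so the snippet runs without a file.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file.
CSV_TEXT = """text,summary
The quick brown fox jumps over the lazy dog.,A fox jumps over a dog.
Python is a high-level interpreted programming language.,Python is an interpreted high-level language.
"""


def csv_to_examples(csv_text: str, input_cols: list, output_cols: list):
    """Split each CSV row into an inputs dict and an outputs dict."""
    inputs, outputs = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        inputs.append({col: row[col] for col in input_cols})
        outputs.append({col: row[col] for col in output_cols})
    return inputs, outputs


inputs, outputs = csv_to_examples(CSV_TEXT, ["text"], ["summary"])
print(len(inputs), outputs[0])
```

The two lists line up index by index, which is the shape passed to ls.create_examples(inputs=..., outputs=..., dataset_id=...) in the example above.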

Evaluators — scoring predictions

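An evaluator scores a chain's output against the expected output. Through LangChainStringEvaluator, LangSmith can load LangChain's off-the-shelf evaluators (for example exact_match, embedding_distance, qa, and criteria). Custom evaluators are plain functions that take a run and an example and return a score dict.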

python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import os

chain = (
    ChatPromptTemplate.from_template("Summarise in one sentence: {text}")
    | ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])
    | StrOutputParser()
)

def predict(inputs: dict) -> dict:
    return {"summary": chain.invoke(inputs)}

# LLM-as-a-judge evaluator for helpfulness
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": "helpfulness",
        "llm": ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"]),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["summary"],
        "reference":  example.outputs["summary"],
        "input":      example.inputs["text"],
    },
)

results = evaluate(
    predict,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)

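# Average the helpfulness scores over all rows; assumes each row exposes
# row["evaluation_results"]["results"], a list of results with .key and .score.
scores = [
    res.score
    for row in results
    for res in row["evaluation_results"]["results"]
    if res.key == "helpfulness" and res.score is not None
]
print(f"Mean helpfulness: {sum(scores) / max(len(scores), 1):.2f}")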

Custom evaluators

python
from langsmith.evaluation import run_evaluator
from langsmith.schemas import Run, Example

@run_evaluator
def word_count_evaluator(run: Run, example: Example) -> dict:
    """Penalise summaries that are too long."""
    prediction = run.outputs.get("summary", "")
    expected   = example.outputs.get("summary", "")
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    score = 1.0 if word_ratio <= 1.5 else max(0.0, 1.0 - (word_ratio - 1.5))
    return {"key": "conciseness", "score": score, "comment": f"word_ratio={word_ratio:.2f}"}
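The scoring rule inside word_count_evaluator can be checked on its own before it is wired into an evaluation. The sketch below repeats the same formula as a plain function (conciseness_score is a hypothetical name) and works through three ratios.

```python
def conciseness_score(prediction: str, expected: str) -> float:
    """Same formula as word_count_evaluator, without the Run/Example wrappers."""
    word_ratio = len(prediction.split()) / max(len(expected.split()), 1)
    return 1.0 if word_ratio <= 1.5 else max(0.0, 1.0 - (word_ratio - 1.5))


reference = "A fox jumps over a dog."  # 6 words

same_length = conciseness_score("The fox leaps over the dog.", reference)  # ratio 1.0
twice_as_long = conciseness_score(" ".join(["word"] * 12), reference)      # ratio 2.0
far_too_long = conciseness_score(" ".join(["word"] * 30), reference)       # ratio 5.0
print(same_length, twice_as_long, far_too_long)  # prints: 1.0 0.5 0.0
```

Summaries up to 1.5 times the reference length keep a full score; beyond that the score drops linearly and reaches zero at a ratio of 2.5.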

Prompt hub — versioned prompts

Store, version, and pull prompts from the LangSmith Hub so experiments are reproducible and rollback is trivial.

python
from langsmith import Client
from langchain import hub

# Pull a prompt by owner/name (uses LANGCHAIN_API_KEY)
prompt = hub.pull("alicedev/summarise-v1")
print(prompt.messages)

# Pull a specific commit for reproducibility
prompt = hub.pull("alicedev/summarise-v1:abc123")

# Push a new version
from langchain_core.prompts import ChatPromptTemplate
new_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise summariser. Use at most 20 words."),
    ("human",  "{text}"),
])
hub.push("alicedev/summarise-v1", new_prompt, new_repo_is_public=False)

Comparing experiments

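Run the same dataset through two different chains (e.g. Claude vs GPT-4o) to compare scores side by side in the LangSmith UI. The snippet assumes predict_with_claude and predict_with_gpt4 are prediction functions like predict above, one per model.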

python
from langsmith.evaluation import evaluate

# Experiment A — Claude
results_a = evaluate(
    predict_with_claude,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="claude-sonnet-4-6",
)

# Experiment B — GPT-4o
results_b = evaluate(
    predict_with_gpt4,
    data="summarisation-v1",
    evaluators=[helpfulness_evaluator],
    experiment_prefix="gpt-4o",
)

# Both experiments appear in the LangSmith UI under the same dataset
# for side-by-side score and latency comparison

Feedback — tagging individual runs

python
import os

from langsmith import Client

ls = Client(api_key=os.environ["LANGCHAIN_API_KEY"])

# After a production run, record human feedback
ls.create_feedback(
    run_id="<run-id-from-trace>",
    key="user_rating",
    score=1.0,           # 0.0 = bad, 1.0 = good
    comment="Perfect summary, exactly right length",
)

# Query runs with negative feedback
bad_runs = ls.list_runs(
    project_name="my-project",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
)
for run in bad_runs:
    print(run.id, run.inputs, run.outputs)

CI integration — fail on score regression

python
from langsmith.evaluation import evaluate

def test_summarisation_quality():
    results = evaluate(
        predict,
        data="summarisation-v1",
        evaluators=[helpfulness_evaluator],
        experiment_prefix="ci",
    )
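    # Mean helpfulness over all rows; assumes each row exposes
    # row["evaluation_results"]["results"] with .key and .score attributes.
    scores = [
        res.score
        for row in results
        for res in row["evaluation_results"]["results"]
        if res.key == "helpfulness" and res.score is not None
    ]
    mean = sum(scores) / max(len(scores), 1)
    assert mean >= 0.75, f"Helpfulness score {mean:.2f} below threshold 0.75"

bash
pytest test_eval.py  # fails if mean helpfulness drops below threshold

Output: pytest prints its usual test report and exits 0 when the assertion holds.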
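The aggregation and threshold check in a CI gate can be unit-tested without calling LangSmith at all. The sketch below is an illustration under one assumption: that each result row looks like {"evaluation_results": {"results": [...]}} with objects carrying key and score. FakeResult, mean_scores, and check_threshold are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Stand-in for one evaluator result (key plus numeric score)."""
    key: str
    score: Optional[float]


def mean_scores(rows) -> dict:
    """Average every feedback key across rows, skipping missing scores."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        for result in row["evaluation_results"]["results"]:
            if result.score is not None:
                totals[result.key] += result.score
                counts[result.key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def check_threshold(scores: dict, key: str, threshold: float) -> None:
    """Raise AssertionError with a readable message when a score regresses."""
    assert key in scores, f"No '{key}' feedback was recorded"
    assert scores[key] >= threshold, f"{key} {scores[key]:.2f} below threshold {threshold}"


rows = [
    {"evaluation_results": {"results": [FakeResult("helpfulness", 1.0)]}},
    {"evaluation_results": {"results": [FakeResult("helpfulness", 0.5)]}},
    {"evaluation_results": {"results": [FakeResult("helpfulness", None)]}},
]
scores = mean_scores(rows)
check_threshold(scores, "helpfulness", 0.75)
print(scores)  # prints: {'helpfulness': 0.75}
```

Checking that the key exists before comparing avoids a silent pass when an evaluator fails to produce any feedback.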

Run types — semantic categories in the trace tree

A run type labels a span in the trace tree so the UI can render the right icon, surface token usage, and filter activity. Pick the type that matches what the function does — not the framework that produces it.

| Type | Meaning | Typical inputs/outputs |
|---|---|---|
| llm | A model call (any provider) | {prompt} → {completion, tokens, cost} |
| chain | A multi-step orchestration | composite inputs → composite outputs |
| tool | A function/tool call (search, calc, HTTP) | tool args → tool return value |
| retriever | A vector store or BM25 retriever | {query} → list of documents |
| embedding | An embedding call | {text} → vector |
| parser | Output parsing / JSON extraction | raw text → structured data |
| prompt | A prompt template render | template + vars → final string |

python
from langsmith import traceable

@traceable(run_type="retriever", name="pgvector_retrieve")
def retrieve(query: str, k: int = 5) -> list[dict]:
    # The UI renders this as a retriever node with a document count badge
    return [{"page_content": "...", "metadata": {"source": "doc-1"}} for _ in range(k)]

@traceable(run_type="tool", name="weather_api")
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}

Output: (none — exits 0 on success)

Tracing context — overriding project, tags, metadata

tracing_context is a context manager that mutates the active trace settings for the lifetime of a block. Use it to route A/B variants to separate projects or attach experiment metadata without changing env vars.

python
from langsmith import traceable
from langsmith.run_helpers import tracing_context

@traceable
def summarise(text: str) -> str:
    return call_claude(f"Summarise: {text}")

# Route this run to a different project with extra tags + metadata
with tracing_context(
    project_name="experiment-v2",
    tags=["ab-test", "treatment"],
    metadata={"variant": "B", "user_segment": "power"},
):
    summarise("LangSmith batches and flushes traces asynchronously.")

Output: (none — exits 0 on success)

Multi-turn chats — threads and sessions

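Group related runs into a thread so the LangSmith UI shows them as a conversation. Attach the same value under a session_id, thread_id, or conversation_id metadata key to every run in the conversation, either through the LangChain run config or through tracing_context when calling @traceable functions.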

python
import uuid
from langsmith import traceable
from langsmith.run_helpers import tracing_context

thread_id = str(uuid.uuid4())

@traceable(run_type="chain")
def chat_turn(user_message: str) -> str:
    return call_claude(user_message)

with tracing_context(metadata={"session_id": thread_id, "user_id": "alice-dev"}):
    chat_turn("Hi, what is RAG?")
    chat_turn("How does it differ from fine-tuning?")
    chat_turn("Show me a Python example.")

Output: (none — exits 0 on success)

Programmatic trace inspection

The Client API lets you query, filter, and download runs without the UI — useful for nightly reports, dataset curation, and custom dashboards.

python
from langsmith import Client
from datetime import datetime, timedelta, timezone

ls = Client()

# Last 24h of failing runs in this project
runs = ls.list_runs(
    project_name="my-project",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    filter='eq(status, "error")',
    limit=100,
)
for r in runs:
    print(f"{r.start_time:%H:%M:%S}  {r.name:30s}  err={r.error[:60] if r.error else '-'}")

Output:

text
14:02:11  retrieve                       err=ConnectionError: pgvector timed out
14:09:47  call_claude                    err=anthropic.APIStatusError: 529 overloaded
14:31:05  parse_json                     err=ValueError: Expecting value: line 1 col

Filtering runs — the LangSmith query DSL

filter= accepts a small expression language for selecting runs. Combine predicates with and(...), or(...), and not(...). Comparison operators include eq, neq, gt, gte, lt, lte, has, and search.

python
# All runs where the question contained "rag", token cost > $0.01, and feedback score < 0.5
runs = ls.list_runs(
    project_name="my-project",
    filter=(
        'and('
        '  search("rag"),'
        '  gt(total_cost, 0.01),'
        '  and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))'
        ')'
    ),
)
print(f"{sum(1 for _ in runs)} runs matched")

Output:

text
17 runs matched
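Hand-written filter strings are easy to get wrong once they nest. One option is a few small helpers that assemble the expression. The sketch below only builds strings in the same shape as the examples above; eq, lt, gt, and and_ are hypothetical helper names, not part of the langsmith package.

```python
import json


def _fmt(value) -> str:
    """Quote strings as in the examples above; leave numbers bare."""
    return json.dumps(value) if isinstance(value, str) else str(value)


def eq(field: str, value) -> str:
    return f"eq({field}, {_fmt(value)})"


def lt(field: str, value) -> str:
    return f"lt({field}, {_fmt(value)})"


def gt(field: str, value) -> str:
    return f"gt({field}, {_fmt(value)})"


def and_(*clauses: str) -> str:
    return f"and({', '.join(clauses)})"


low_rated = and_(eq("feedback_key", "user_rating"), lt("feedback_score", 0.5))
expensive_and_low_rated = and_(gt("total_cost", 0.01), low_rated)
print(low_rated)
# prints: and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))
```

The resulting string can be passed as filter= to ls.list_runs, and the low_rated clause can be reused across queries instead of being retyped.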

Cost and token tracking

Every traced LLM call records prompt/completion token counts and a computed dollar cost (LangSmith maintains a per-model price list). Aggregate across a project by iterating list_runs, as below, or read project-level statistics with read_project(project_name=..., include_stats=True).

python
from langsmith import Client
from collections import defaultdict

ls = Client()
costs = defaultdict(float)
for r in ls.list_runs(project_name="my-project", run_type="llm", limit=1000):
    model = (r.extra or {}).get("invocation_params", {}).get("model", "unknown")
    costs[model] += r.total_cost or 0.0

for model, total in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"  {model:30s}  ${total:7.2f}")

Output:

text
  claude-sonnet-4-6              $  42.18
  gpt-4o                         $  18.74
  text-embedding-3-small         $   0.62
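The "computed dollar cost" is plain arithmetic: token counts multiplied by a per-token price. The sketch below works through that calculation with a made-up price table. The model names and prices are placeholders for illustration only, not real provider pricing and not LangSmith's price list.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens: (input, output).
PRICES_PER_MTOK = {
    "model-a": (3.00, 15.00),
    "model-b": (0.50, 1.50),
}


def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one call: tokens times price per token, input plus output."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


# (model, prompt_tokens, completion_tokens) for three pretend calls
calls = [("model-a", 1_000, 200), ("model-a", 2_000, 400), ("model-b", 10_000, 1_000)]

totals = defaultdict(float)
for model, prompt_tokens, completion_tokens in calls:
    totals[model] += call_cost(model, prompt_tokens, completion_tokens)

for model, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{model:10s} ${total:.4f}")
```

For the first call, 1,000 input tokens at $3.00 per million plus 200 output tokens at $15.00 per million comes to $0.006.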

Streaming and partial outputs

For streamed token output, LangSmith records the output once the stream closes. Use streaming=True in LangChain clients. With @traceable, yield from a generator: the yielded chunks are collected automatically, and the decorator's reduce_fn argument can combine them into a single value (for example, joining text chunks).

python
from langsmith import traceable

@traceable(run_type="llm", name="claude_stream")
def stream_completion(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for chunk in stream.text_stream:
            yield chunk

text = "".join(stream_completion("List three uses of LangSmith."))
print(text[:120])

Output:

text
1. Debugging LangChain failures by inspecting the exact prompt/response per step.
2. Building evaluation
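The collect-then-record behaviour described above can be pictured as a generator that passes chunks through while keeping a copy. The sketch below is a toy illustration of that pattern, not langsmith's internals; record_stream and fake_token_stream are hypothetical names.

```python
from typing import Callable, Iterator


def record_stream(chunks: Iterator[str], on_complete: Callable[[str], None]) -> Iterator[str]:
    """Yield chunks unchanged, then report the joined text once the stream ends."""
    collected = []
    for chunk in chunks:
        collected.append(chunk)
        yield chunk  # the caller still sees tokens as they arrive
    on_complete("".join(collected))  # only reached when the stream is exhausted


def fake_token_stream() -> Iterator[str]:
    # Canned chunks standing in for a model's streamed output.
    yield from ["Lang", "Smith ", "traces ", "streams."]


recorded = []
seen = [chunk for chunk in record_stream(fake_token_stream(), recorded.append)]
print(seen, recorded)
```

In this toy version the final text is reported only if the consumer reads the stream to the end; a caller that stops early never triggers on_complete.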

Datasets from production — promoting traces to ground truth

A common workflow: a user thumbs-down a response → you fix the prompt → you want to make sure the fix didn't regress other queries. The fastest loop is to clone interesting production runs into a dataset, then re-run a candidate chain against that dataset.

python
from langsmith import Client

ls = Client()

# Find production runs with negative feedback and clone them into a dataset
bad_runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5))',
    limit=50,
))

dataset = ls.create_dataset("regressions-2026-05", description="Negative feedback from prod")
for r in bad_runs:
    ls.create_example(
        inputs=r.inputs,
        outputs=r.outputs,            # current production output, treat as "what we had"
        dataset_id=dataset.id,
        metadata={"source_run": str(r.id)},
    )
print(f"Promoted {len(bad_runs)} runs into '{dataset.name}'")

Output:

text
Promoted 24 runs into 'regressions-2026-05'

Pairwise (preference) evaluation

A pairwise evaluator chooses which of two candidate outputs is better — useful for A/B tests where no single ground-truth answer exists.

python
from langsmith.evaluation import evaluate_comparative
from langchain_anthropic import ChatAnthropic
import os

judge = ChatAnthropic(model="claude-sonnet-4-6", api_key=os.environ["ANTHROPIC_API_KEY"])

def preference_judge(runs, example):
    """Pick the run with the more concise, accurate summary."""
    a, b = runs
    prompt = (
        "You are judging two summaries. Reply with 'A' or 'B' only.\n"
        f"Question: {example.inputs['text']}\n"
        f"Reference: {example.outputs['summary']}\n"
        f"A: {a.outputs['summary']}\n"
        f"B: {b.outputs['summary']}"
    )
    choice = judge.invoke(prompt).content.strip().upper()
    winner = a if choice == "A" else b
    return {"key": "preferred", "scores": {str(a.id): int(winner is a), str(b.id): int(winner is b)}}

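# The experiments are passed positionally. Replace these with the full names
# (or IDs) of two existing experiments; LangSmith appends a suffix to the prefix.
evaluate_comparative(
    ["claude-sonnet-4-6", "gpt-4o"],
    evaluators=[preference_judge],
)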
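The judge above treats any reply other than "A" as a vote for B, so a refusal or a rambling answer silently counts for one side. A stricter parse is one way to guard against that. In the sketch below, parse_choice and preference_scores are hypothetical helpers, and scoring an unclear reply as a 0.5/0.5 tie is a design choice of this example, not a LangSmith convention.

```python
from typing import Optional


def parse_choice(reply: str) -> Optional[str]:
    """Return 'A' or 'B' only when the reply is exactly that letter."""
    cleaned = reply.strip().upper().rstrip(".")
    return cleaned if cleaned in ("A", "B") else None


def preference_scores(reply: str, id_a: str, id_b: str) -> dict:
    """Map a judge reply to per-run scores; unclear replies count as a tie."""
    choice = parse_choice(reply)
    if choice is None:
        return {id_a: 0.5, id_b: 0.5}
    return {id_a: int(choice == "A"), id_b: int(choice == "B")}


print(preference_scores(" b\n", "run-a", "run-b"))            # {'run-a': 0, 'run-b': 1}
print(preference_scores("Both are fine", "run-a", "run-b"))   # {'run-a': 0.5, 'run-b': 0.5}
```

The returned dict has the same shape as the "scores" value built in preference_judge, keyed by run ID.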

Self-hosted LangSmith

Set a different LANGCHAIN_ENDPOINT to send traces to a self-hosted LangSmith instance (Helm chart, Docker Compose). The client is identical otherwise.

bash
export LANGCHAIN_ENDPOINT="https://langsmith.internal.example.com"
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_TRACING_V2="true"
python my_app.py

Output: (none — exits 0 on success)

Sampling traces in production

For high-traffic services, capturing every trace is expensive. Sample 1–10% with LANGCHAIN_TRACING_SAMPLING_RATE or wrap your entry point with a manual sampler.

python
import random

from langsmith import traceable
from langsmith.run_helpers import tracing_context

@traceable
def handle_request(payload: dict) -> dict:
    return process(payload)  # process() is your application's own handler

def maybe_trace(payload: dict, sample_rate: float = 0.05) -> dict:
    if random.random() < sample_rate:
        return handle_request(payload)
    # Disable tracing for this call entirely
    with tracing_context(enabled=False):
        return handle_request(payload)

Output: (none — exits 0 on success)
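A variation on the random sampler above is to decide from a hash of a request or thread ID, so every turn of the same conversation gets the same keep/drop decision. This is a different technique from random.random(), shown here as a sketch; should_trace and the ID format are hypothetical.

```python
import hashlib


def should_trace(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically keep roughly sample_rate of request IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # First 8 bytes as an integer in [0, 2**64), scaled to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


ids = [f"req-{i}" for i in range(20_000)]
kept = sum(should_trace(request_id) for request_id in ids)
print(f"kept {kept / len(ids):.1%} of requests")
```

The result of should_trace(thread_id) could replace the random.random() < sample_rate test in maybe_trace, with the rest of that function unchanged.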

Real-world recipes

These recipes string together the building blocks above into common production patterns.

Recipe: nightly evaluation report

Run a held-out dataset against the current production chain every night and post the deltas to Slack.

python
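from collections import defaultdict
from datetime import datetime, timezone

from langsmith import Client
from langsmith.evaluation import evaluate

ls = Client()

def predict(inputs: dict) -> dict:
    return {"summary": production_chain.invoke(inputs)}

results = evaluate(
    predict,
    data="regression-suite-v3",
    evaluators=[helpfulness_evaluator, word_count_evaluator],
    experiment_prefix=f"nightly-{datetime.now(timezone.utc):%Y-%m-%d}",
)

# Mean score per feedback key; assumes each row exposes
# row["evaluation_results"]["results"] with .key and .score attributes.
totals, counts = defaultdict(float), defaultdict(int)
for row in results:
    for res in row["evaluation_results"]["results"]:
        if res.score is not None:
            totals[res.key] += res.score
            counts[res.key] += 1
agg = {k: totals[k] / counts[k] for k in totals}

# Convention: "nightly-previous" names the last green run. Experiments are
# stored as projects; feedback_stats is assumed to map key -> {"avg": ...}.
prev = ls.read_project(project_name="nightly-previous", include_stats=True)
prev_agg = {k: v.get("avg", 0) for k, v in (prev.feedback_stats or {}).items()}

delta = {k: agg[k] - prev_agg.get(k, 0) for k in agg}
print({k: round(v, 3) for k, v in delta.items()})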

Output:

text
{'helpfulness': 0.04, 'conciseness': -0.02}
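Before the deltas go to Slack, it helps to turn them into a short, readable message. The sketch below covers only that formatting step. score_deltas, format_report, and the 0.01 tolerance are hypothetical choices, and unlike the recipe above it skips metrics with no previous value rather than treating them as a gain from zero.

```python
def score_deltas(current: dict, previous: dict) -> dict:
    """Per-key change; keys missing from the previous run are skipped."""
    return {k: round(current[k] - previous[k], 3) for k in current if k in previous}


def format_report(deltas: dict, tolerance: float = 0.01) -> str:
    """One line per metric, flagging drops larger than the tolerance."""
    lines = []
    for key, delta in sorted(deltas.items()):
        flag = "REGRESSION" if delta < -tolerance else "ok"
        lines.append(f"{key}: {delta:+.3f} ({flag})")
    return "\n".join(lines)


current = {"helpfulness": 0.84, "conciseness": 0.70, "new_metric": 0.5}
previous = {"helpfulness": 0.80, "conciseness": 0.72}
report = format_report(score_deltas(current, previous))
print(report)
# conciseness: -0.020 (REGRESSION)
# helpfulness: +0.040 (ok)
```

The returned string can be sent as the message text through whichever Slack client or webhook the team already uses.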

Recipe: prompt promotion gate

Block any merge that pushes a prompt change unless evaluation scores hold or improve on the canonical dataset.

python
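import sys
from pathlib import Path

from langsmith.evaluation import evaluate

def gate_prompt_change(prompt_path: str, baseline_score: float = 0.78) -> None:
    new_prompt = Path(prompt_path).read_text()

    def predict(inputs: dict) -> dict:
        # helpfulness_evaluator reads run.outputs["summary"], so use that key
        return {"summary": call_claude(new_prompt.format(**inputs))}

    results = evaluate(predict, data="prompt-gate-v1", evaluators=[helpfulness_evaluator])
    # Mean helpfulness; assumes rows expose row["evaluation_results"]["results"]
    scores = [
        res.score
        for row in results
        for res in row["evaluation_results"]["results"]
        if res.key == "helpfulness" and res.score is not None
    ]
    score = sum(scores) / max(len(scores), 1)
    if score < baseline_score:
        raise SystemExit(f"FAIL: helpfulness {score:.3f} < baseline {baseline_score:.3f}")
    print(f"PASS: helpfulness {score:.3f} >= {baseline_score:.3f}")

if __name__ == "__main__":
    gate_prompt_change(sys.argv[1])

bash
python -m scripts.gate_prompt_change ./prompts/summarise.txt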

Output:

text
PASS: helpfulness 0.812 >= 0.780

Recipe: user feedback → fine-tune dataset

Collect runs that earned a thumbs-up and export them as a Hugging Face dataset for supervised fine-tuning.

python
from langsmith import Client
from datasets import Dataset

ls = Client()
runs = list(ls.list_runs(
    project_name="prod",
    filter='and(eq(feedback_key, "user_rating"), eq(feedback_score, 1.0))',
    run_type="llm",
    limit=5000,
))

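def to_prompt(inputs: dict) -> str:
    """Best-effort prompt text; input shapes vary by integration."""
    if isinstance(inputs.get("prompt"), str):
        return inputs["prompt"]
    messages = inputs.get("messages")
    if isinstance(messages, list) and messages:
        return str(messages[0])
    return str(inputs)

records = [
    {
        "prompt": to_prompt(r.inputs or {}),
        "completion": r.outputs.get("output") or r.outputs.get("content") or "",
    }
    for r in runs if r.outputs
]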
ds = Dataset.from_list(records)
ds.save_to_disk("./sft_thumbs_up")
print(f"Exported {len(ds)} thumbs-up examples")

Output:

text
Exported 1438 thumbs-up examples

Recipe: cost alarm on a per-user basis

Aggregate trace cost by user_id metadata and warn on top spenders.

python
from collections import defaultdict
from langsmith import Client
from datetime import datetime, timezone, timedelta

ls = Client()
spend = defaultdict(float)
for r in ls.list_runs(
    project_name="prod",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    run_type="llm",
    limit=10_000,
):
    user = (r.extra or {}).get("metadata", {}).get("user_id", "anon")
    spend[user] += r.total_cost or 0.0

for user, total in sorted(spend.items(), key=lambda kv: -kv[1])[:10]:
    if total > 5.0:
        print(f"ALERT  {user}  ${total:.2f}/day")

Output:

text
ALERT  user_4711  $12.93/day
ALERT  alice-dev  $8.40/day

Performance and reliability tips

  • Flush before short scripts exit (wait_for_all_tracers() for LangChain code, flush() on the Client you trace with otherwise); the background sender may not finish before the process ends.
  • For high-throughput services, use LANGCHAIN_TRACING_SAMPLING_RATE=0.05 and tag the kept runs with metadata={"sampled": True} so dashboards know the sampling factor.
  • Avoid putting large payloads (>1 MB) directly in inputs/outputs — link to S3/R2 in metadata instead. LangSmith truncates oversized fields.
  • Set LANGCHAIN_HIDE_INPUTS=true to redact inputs on PII-sensitive projects; combine with a custom hash so you can still group identical queries.
  • Pin a prompt version (hub.pull("owner/name:abc123")) in production code; an unpinned pull returns the latest commit, which can change underneath you.
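The "custom hash" idea from the redaction tip can be sketched with a keyed hash of the normalised query: identical questions map to the same fingerprint while the text itself stays out of the trace. query_fingerprint, the normalisation rule, and the query_hash metadata key are hypothetical choices for this example.

```python
import hashlib
import hmac


def query_fingerprint(text: str, secret: bytes) -> str:
    """Keyed hash of a normalised query: stable for grouping, hides the text."""
    normalised = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return hmac.new(secret, normalised.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


SECRET = b"replace-with-a-real-secret"  # hypothetical key; load from a secret store

a = query_fingerprint("What is  RAG?", SECRET)
b = query_fingerprint("what is rag?", SECRET)
c = query_fingerprint("What is fine-tuning?", SECRET)
print(a == b, a == c)  # prints: True False
```

The fingerprint could then travel with the run, for example as metadata={"query_hash": a} inside tracing_context, so dashboards can group repeated queries without storing them.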

Quick reference

| Task | Code |
|---|---|
| Enable tracing | os.environ["LANGCHAIN_TRACING_V2"] = "true" + LANGCHAIN_API_KEY |
| Set project | os.environ["LANGCHAIN_PROJECT"] = "name" |
| Trace any function | @traceable(name="step", run_type="chain") |
| Override project | with tracing_context(project_name="exp"): |
| Attach metadata | with tracing_context(metadata={"user_id": "..."}): |
| Group as thread | metadata={"session_id": uuid} on each turn |
| Disable a block | with tracing_context(enabled=False): |
| Create dataset | ls.create_dataset("name") |
| Add examples | ls.create_examples(inputs=[...], outputs=[...], dataset_id=...) |
| Run evaluation | evaluate(predict_fn, data="dataset-name", evaluators=[...]) |
| Built-in evaluator | LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}) |
| Custom evaluator | @run_evaluator def fn(run, example) -> dict: |
| Pairwise eval | evaluate_comparative(["a", "b"], evaluators=[...]) |
| Query runs | ls.list_runs(project_name=..., filter='and(...)') |
| Pull prompt | hub.pull("owner/name") |
| Pin prompt | hub.pull("owner/name:abc123") |
| Push prompt | hub.push("owner/name", prompt) |
| Tag run | ls.create_feedback(run_id, key="rating", score=1.0) |
| Flush traces | flush() on the Client used for tracing; wait_for_all_tracers() for LangChain |
| Self-hosted | export LANGCHAIN_ENDPOINT=https://... |
| Sample 5% | export LANGCHAIN_TRACING_SAMPLING_RATE=0.05 |