cheat sheet

langsmith

Package-level reference for the langsmith SDK on PyPI — install, versioning, env-var setup, and observability alternatives.

updated 05-31-2026

What it is

langsmith is the Python client SDK for the LangSmith platform — LangChain Inc.'s hosted observability, tracing, evaluation, and prompt-management service for LLM applications. It captures every LLM call, tool use, and chain step as a span on a hosted backend so you can debug failures, replay runs, and score outputs against datasets.

The SDK works with or without LangChain. Decorating any function with @traceable is enough to ship spans; you do not need LCEL or langchain-core to use it.

Install

bash

pip install langsmith

Output: (none — exits 0 on success)

bash

uv add langsmith

Output: dependency resolved + added to pyproject.toml

bash

poetry add langsmith

Output: updated lockfile + virtualenv install

The SDK reads two environment variables to enable tracing:

bash

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...

Output: any subsequent @traceable function emits spans to LangSmith
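Because a missing variable leaves tracing off without any error, a small preflight check can make the failure visible. The sketch below only inspects the two variables named above; the helper name tracing_env_problems is hypothetical and not part of the langsmith SDK.

```python
import os


def tracing_env_problems(env=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: it only checks the two variables from the setup
    snippet and never contacts the LangSmith backend.
    """
    env = os.environ if env is None else env
    problems = []
    if env.get("LANGCHAIN_TRACING_V2", "").strip().lower() != "true":
        problems.append("LANGCHAIN_TRACING_V2 is not set to 'true'")
    if not env.get("LANGCHAIN_API_KEY", "").strip():
        problems.append("LANGCHAIN_API_KEY is missing or empty")
    return problems
```

Calling tracing_env_problems() at service start-up and logging the result is usually enough; it validates presence only, not whether the key is accepted.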

Versioning & Python support

Current stable line is 0.1.x / 0.2.x (as of late 2025). Pre-1.0 — minor bumps can change span schema and rename SDK helpers.
Python 3.9+. Async transport (AsyncClient) requires 3.10+ for best ergonomics.
The SDK is forward-compatible with the LangSmith backend, but old SDKs may miss new span fields (e.g. token-usage breakdowns, cost). Upgrade alongside any major feature you want to use.
Independent release cadence from langchain itself — the langchain-core callback handler that ships spans to LangSmith pins a langsmith>=X range, so a stale langchain-core can hold back upgrades.

Package metadata

Maintainer: LangChain Inc.
Project home: github.com/langchain-ai/langsmith-sdk
Docs: docs.smith.langchain.com
PyPI: pypi.org/project/langsmith
License: MIT (SDK); LangSmith backend is closed-source / commercial
Governance: commercial product, hosted SaaS + self-hosted enterprise
First released: 2023
Downloads: millions per month — pulled in transitively by most langchain installs

Optional dependencies & extras

langsmith is intentionally light. Core deps:

requests — synchronous HTTP transport
httpx — async transport
orjson — fast JSON serialisation
pydantic — span / run schemas

Extras worth noting:

langsmith[pytest] — pytest plugin and @unit decorator for evaluation-as-test workflows.
langsmith[openai-agents] — integration helpers for the OpenAI Agents SDK.
langsmith[vcr] — record/replay LLM calls for hermetic test runs.

The SDK does not depend on langchain — install one without the other freely.

Alternatives

Package	Trade-off
`langfuse`	Open-source observability backend + Python SDK. Self-hostable; UI is the closest comparable.
`arize-phoenix`	Open-source local + cloud LLM tracing (OpenInference / OpenTelemetry). Strong on retrieval-eval.
`wandb` (Weights & Biases)	Broader ML experiment tracking; LLM tracing via `wandb-traces` / `weave`.
`helicone`	Proxy-based observability — point your LLM client base URL at Helicone and it logs every call.
`opentelemetry-instrumentation-openai` (and similar)	OTel-native — ship spans to any OTel backend (Jaeger, Honeycomb, Datadog).

Common gotchas

LANGCHAIN_API_KEY must be set before import for the langchain-core callback handler to attach. Setting it after from langchain_openai import ChatOpenAI is too late — re-import, or use the explicit LangChainTracer callback.
Trace persistence is on the LangSmith backend by default. Prompts, model outputs, and tool arguments are stored on LangChain Inc.'s servers. For regulated data (PII, GDPR), use the self-hosted deployment or the hide_inputs / hide_outputs redaction hooks.
Sampling is off by default. Every traced call hits the backend. For high-QPS production, set the LANGCHAIN_TRACING_SAMPLING_RATE environment variable, or apply your own sampling decision around the @traceable call.
Async transport batches in the background. client.flush() is required before process exit, otherwise the last few seconds of spans drop. Most CLI scripts hit this.
@traceable is contagious. Spans inherit project, tags, and metadata from the parent run via context-vars. Forking subprocesses or concurrent.futures.ProcessPoolExecutor breaks the context — pass the run ID explicitly.
Project name defaults to default. Set LANGCHAIN_PROJECT=my-app to keep environments separate; otherwise every dev/CI/prod call piles into the same project.
SDK errors are silently swallowed. By design, tracing failures never crash the host app. Enable LANGCHAIN_VERBOSE=true while debugging missing traces.
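The context-var gotcha above is easier to see in isolation. This is a minimal stdlib sketch, not the SDK's own mechanism: current_run_id stands in for langsmith's internal context variable, and a fresh contextvars.Context simulates a worker process that starts with no inherited state. All names here are hypothetical.

```python
import contextvars
import json
import uuid

# Stand-in for the SDK's internal context variable (assumption for illustration).
current_run_id = contextvars.ContextVar("current_run_id", default=None)


def make_job_payload(task_args):
    """Parent side: capture the active run ID next to the task arguments."""
    return json.dumps({"parent_run_id": current_run_id.get(), "args": task_args})


def run_job(payload):
    """Worker side: restore the parent run ID before doing the work."""
    job = json.loads(payload)
    token = current_run_id.set(job["parent_run_id"])
    try:
        # Real code would open a child span parented to job["parent_run_id"] here.
        return {"parent_run_id": current_run_id.get(), "result": sum(job["args"])}
    finally:
        current_run_id.reset(token)


parent_id = str(uuid.uuid4())
current_run_id.set(parent_id)

# A fresh, empty context behaves like a new process: the variable is gone.
lost = contextvars.Context().run(current_run_id.get)

# The JSON string is what actually crosses the process boundary.
payload = make_job_payload([1, 2, 3])
outcome = contextvars.Context().run(run_job, payload)
```

Here lost is None while outcome carries the original parent ID, which is the whole point of forwarding the ID in the job payload.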

Real-world recipes

These patterns string the SDK primitives into the workflows production teams actually run. They lean on @traceable, Client.list_runs, datasets, and evaluators — the same building blocks shown in the companion article.

Recipe: framework-agnostic instrumentation

langsmith does not require LangChain. Decorate any Python function with @traceable and you get a span with inputs, outputs, latency, tokens (if reported), and errors.

python

import os, anthropic
from langsmith import traceable

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"]    = "ls__..."
os.environ["LANGCHAIN_PROJECT"]    = "rag-prod"

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@traceable(run_type="retriever", name="pgvector_search")
def retrieve(query: str, k: int = 4) -> list[dict]:
    return [{"page_content": "...", "metadata": {"source": "doc-1"}}]

@traceable(run_type="llm", name="claude_call")
def call_claude(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

@traceable(run_type="chain", name="rag_pipeline")
def rag(question: str) -> str:
    docs = retrieve(question)
    context = "\n\n".join(d["page_content"] for d in docs)
    return call_claude(f"Context:\n{context}\n\nQuestion: {question}")

print(rag("What is RAG?"))

Output: three nested spans — rag_pipeline containing pgvector_search then claude_call — appear in LangSmith with the right run-type icons.

Recipe: a CI evaluation gate

Run a canonical regression dataset against the current chain on every PR; fail the build if scores regress.

python

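from langsmith.evaluation import evaluate

# rag, helpfulness_evaluator, and length_evaluator are assumed to be defined
# elsewhere in the project (see the other recipes and the evaluation section).

def predict(inputs):
    return {"answer": rag(inputs["question"])}

def test_no_regression():
    results = evaluate(
        predict,
        data="regression-suite-v3",
        evaluators=[helpfulness_evaluator, length_evaluator],
        experiment_prefix="ci",
    )
    # Assumes the results object exposes an aggregate-feedback accessor;
    # confirm against your installed SDK version.
    agg = results.get_aggregate_feedback()
    assert agg["helpfulness"] >= 0.78, f"helpfulness regressed: {agg['helpfulness']:.2f}"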

Output: the CI job posts the experiment URL as a PR comment; threshold violations fail the build with a useful link.
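The threshold logic itself does not depend on the SDK and can be unit-tested separately. A minimal sketch, assuming you have already collected per-example scores into plain lists keyed by evaluator name (the gate function and its input shape are hypothetical):

```python
from statistics import mean


def gate(scores_by_key, thresholds):
    """Return a list of failure messages; an empty list means the gate passes.

    scores_by_key: {"helpfulness": [0.9, 0.7, ...]}  per-example scores
    thresholds:    {"helpfulness": 0.78}             minimum acceptable mean
    """
    failures = []
    for key, minimum in thresholds.items():
        scores = scores_by_key.get(key, [])
        if not scores:
            # A missing metric is treated as a failure rather than a silent pass.
            failures.append(f"{key}: no scores recorded")
            continue
        avg = mean(scores)
        if avg < minimum:
            failures.append(f"{key}: mean {avg:.2f} < threshold {minimum:.2f}")
    return failures
```

In a pytest job, `assert not gate(scores, thresholds), gate(scores, thresholds)` reports every regressed metric at once instead of stopping at the first.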

Recipe: cost dashboard ingest

Aggregate yesterday's spend by model and ship to a dashboard.

python

from datetime import datetime, timezone, timedelta
from collections import defaultdict
from langsmith import Client

ls = Client()
since = datetime.now(timezone.utc) - timedelta(days=1)

spend = defaultdict(float)
tokens = defaultdict(int)
for r in ls.list_runs(project_name="prod", run_type="llm", start_time=since, limit=10_000):
    model = (r.extra or {}).get("invocation_params", {}).get("model", "unknown")
    spend[model]  += float(r.total_cost or 0)  # total_cost may be a Decimal; normalise to float
    tokens[model] += (r.prompt_tokens or 0) + (r.completion_tokens or 0)

for m in sorted(spend, key=lambda k: -spend[k]):
    print(f"{m:30s}  ${spend[m]:7.2f}   {tokens[m]:>10,} tokens")

Output: ranked table; pipe into Prometheus / OpenTelemetry / your warehouse for retention.

Recipe: prompt promotion pipeline

Use the Hub as a versioned store; tag the candidate, evaluate against a gold dataset, and promote on pass.

python

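from pathlib import Path

from langchain import hub
from langchain_core.prompts import ChatPromptTemplate
from langsmith.evaluation import evaluate

# predict and helpfulness_evaluator are assumed to be defined as in the CI gate recipe
candidate = ChatPromptTemplate.from_template(Path("prompts/v3.txt").read_text())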
hub.push("alicedev/summarise-v1", candidate, new_repo_is_public=False)

results = evaluate(predict, data="summarise-gold", evaluators=[helpfulness_evaluator])
if results.get_aggregate_feedback()["helpfulness"] >= 0.82:
    # alias the candidate as 'prod' (operator-defined convention)
    hub.push("alicedev/summarise-v1:prod", candidate)

Output: prod-aliased commit hash drives the live chain; rollback is a single hub.push to the prior hash.

Recipe: dataset bootstrap from production traces

Take a week of production runs that earned a thumbs-up and seed a fine-tune dataset.

python

from langsmith import Client

ls = Client()
runs = ls.list_runs(
    project_name="prod",
    run_type="llm",
    filter='and(eq(feedback_key, "user_rating"), gte(feedback_score, 0.8))',
    limit=5000,
)
ds = ls.create_dataset("sft-bootstrap-2026Q1")
for r in runs:
    ls.create_example(inputs=r.inputs, outputs=r.outputs, dataset_id=ds.id)

Output: dataset ready for export to JSONL / HuggingFace for SFT.

Production deployment

langsmith is a thin SDK; the server side is the LangSmith backend (SaaS or self-hosted). Production hardening is mostly about graceful failure, sampling, and PII discipline.

Flush on shutdown. In long-running services, register an exit hook on the client your application traces with, for example atexit.register(client.flush). Without it, the last few seconds of spans drop on rolling deploys.
Sampling. Set LANGCHAIN_TRACING_SAMPLING_RATE=0.05 for high-traffic services (keeps roughly 5% of traces). Tag kept runs with metadata={"sampled": True} so dashboards can scale counts back up by dividing by the sample rate (×20 at 5%).
Self-hosted endpoint. export LANGCHAIN_ENDPOINT=https://langsmith.internal/... — the SDK is identical otherwise. Self-host is the answer for regulated data and air-gapped environments.
Async transport is the default. Spans queue in-process and ship in background tasks. The HTTP failure modes (DNS, TLS, 5xx from the backend) never propagate to your request handler — by design.
hide_inputs / hide_outputs. Pass these env vars or per-@traceable parameters to redact payloads before they leave the host. Useful for PII regulations.
One project per environment. Set LANGCHAIN_PROJECT=prod in production deployments, =staging in staging, =ci in CI. Otherwise all environments land in the same project and dashboards become useless.
Run-ID propagation. When you fan out work to a worker pool (concurrent.futures, Celery), context-vars do NOT cross process boundaries. Pass the parent run ID via your job payload and rebuild the trace tree on the worker side via with tracing_context(parent=...):.
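If you sample manually rather than through the environment variable, a deterministic decision keyed on a stable ID keeps all spans of one request together. The sketch below is one way to do that under stated assumptions: keep_trace and estimate_total are hypothetical helpers, and the hash-bucket scheme is a generic technique, not how the SDK samples internally.

```python
import hashlib


def keep_trace(trace_key: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces, keyed on a stable ID."""
    digest = hashlib.sha256(trace_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def estimate_total(sampled_count: int, rate: float) -> float:
    """Scale a count observed in sampled data back up to the full population."""
    return sampled_count / rate
```

With a 5% rate, 50 sampled errors correspond to an estimated 1,000 errors overall, which is the scaling a dashboard needs to apply to sampled runs.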

Version migration guide

langsmith is pre-1.0 and the public surface has been remarkably stable, but the span schema on the backend and a few SDK helpers have shifted.

Era	Notable changes
`0.0.x` (early)	Initial release. Sync-only `Client`. Many helpers later moved out of `langsmith` proper.
`0.1.x`	Stable `@traceable` and `Client` API. Async client (`AsyncClient`) introduced. Native cost / token fields per LLM span.
`0.2.x` (current)	Pytest plugin shipped as `langsmith[pytest]`. Pairwise evaluation helpers (`evaluate_comparative`). Stable run-type taxonomy.

General migration discipline:

Upgrade langsmith alongside langchain-core — the callback handler depends on a specific SDK floor.
Replace any private imports (anything starting with _) before upgrading; they vanish without warning.
Re-check that custom evaluator return shapes match {"key": ..., "score": ...} — older variants accepted positional args.
Pinning: langchain-core declares a langsmith>=X range in its package metadata, and pip resolves through it. Let the resolver pick the floor rather than hard-pinning langsmith== unless you have a specific reason.
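Finding private imports before an upgrade can be automated with the standard ast module. This is a rough sketch, assuming "private" simply means any underscore-prefixed module segment or imported name under the langsmith package; the function name is hypothetical.

```python
import ast


def private_langsmith_imports(source: str) -> list[str]:
    """List import statements in `source` that touch underscore-prefixed langsmith names."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            parts = node.module.split(".")
            if parts[0] != "langsmith":
                continue
            private_module = any(p.startswith("_") for p in parts)
            private_name = any(a.name.startswith("_") for a in node.names)
            if private_module or private_name:
                found.append(ast.unparse(node))  # ast.unparse needs Python 3.9+
        elif isinstance(node, ast.Import):
            for alias in node.names:
                parts = alias.name.split(".")
                if parts[0] == "langsmith" and any(p.startswith("_") for p in parts):
                    found.append(f"import {alias.name}")
    return found
```

Running it over each file in the repository (for example via pathlib.Path.rglob("*.py")) gives a list to clear before bumping the version.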

Evaluation & observability

LangSmith's evaluation surface is broader than most teams use. The mental model: an evaluator scores a (run, example) pair and returns {key, score}. Everything else — dataset loops, statistical aggregation, comparative experiments — is plumbing around that.

Built-in string evaluators wrap LLM-as-a-judge templates:

python

from langsmith.evaluation import LangChainStringEvaluator
helpfulness = LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"})
correctness = LangChainStringEvaluator("labeled_criteria", config={"criteria": "correctness"})
embedding   = LangChainStringEvaluator("embedding_distance")

Custom evaluators for anything mechanical:

python

import json

from langsmith.evaluation import run_evaluator
from langsmith.schemas import Run, Example

@run_evaluator
def schema_evaluator(run: Run, example: Example):
    # Score 1.0 when the answer parses as a JSON object with a "title" key, else 0.0.
    try:
        parsed = json.loads(run.outputs["answer"])
    except (KeyError, TypeError, ValueError):
        # Missing outputs, a missing "answer" key, or invalid JSON all count as failures.
        return {"key": "valid_schema", "score": 0.0}
    ok = isinstance(parsed, dict) and "title" in parsed
    return {"key": "valid_schema", "score": 1.0 if ok else 0.0}

Pairwise / preference evaluations — use when no single ground-truth answer exists. evaluate_comparative takes two prior experiments and runs a judge that picks the winner.

Co-existence with other observability tools — langsmith runs happily next to OpenTelemetry instrumentation. Many teams ship spans to LangSmith for LLM-specific UI and to OTel collectors for SRE dashboards.

Multi-provider patterns

langsmith is provider-agnostic — @traceable works the same wrapping OpenAI, Anthropic, Gemini, or a local vLLM endpoint. The patterns worth knowing:

Token usage normalisation. Different providers report token counts in different envelope shapes. The langchain integration normalises them; raw @traceable users should attach token counts to metadata explicitly so dashboards work.
LiteLLM + LangSmith. Point a LiteLLM proxy at any provider; LiteLLM emits OpenAI-compat responses; LangSmith traces capture them when wrapped with @traceable or via LangChain. This is the common "one SDK, many providers" backbone.
OpenInference / OTel parallel export. For teams that already run an OTel collector, the openinference-instrumentation-* packages produce spans your collector already understands; you can dual-export to LangSmith and OTel simultaneously.
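Token usage normalisation can be as small as one mapping function. The sketch below handles the OpenAI-style keys (prompt_tokens, completion_tokens) and the Anthropic-style keys (input_tokens, output_tokens); the output key names are this sketch's own convention rather than a documented LangSmith schema, so adjust them to whatever your dashboards read.

```python
def normalise_usage(raw: dict) -> dict:
    """Map provider-specific usage dicts onto one shape.

    Unknown or missing fields fall back to 0 so downstream sums never fail.
    """
    prompt = raw.get("prompt_tokens", raw.get("input_tokens", 0)) or 0
    completion = raw.get("completion_tokens", raw.get("output_tokens", 0)) or 0
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        # Prefer the provider's own total when it reports one.
        "total_tokens": raw.get("total_tokens") or prompt + completion,
    }
```

The result can then be attached to the traced call's metadata, as the first item above suggests.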

Troubleshooting common errors

No traces appear. Almost always order-of-import: LANGCHAIN_TRACING_V2/LANGCHAIN_API_KEY must be set before from langchain .... Fix by setting in your process manager / shell, or use the explicit LangChainTracer callback.
Last few traces missing on script exit. The background flusher had not drained. Call flush() on your client before exiting, or register it with atexit.
401 Unauthorized from the SDK. Either the key is wrong or it's for the wrong workspace. Each LangSmith workspace gets its own key prefix; copy from the right org in the UI.
429 from the backend. Backend ingest rate-limit. Lower your sampling rate or batch traces.
Spans show up under "default" project. LANGCHAIN_PROJECT not set in that environment; check the actual env in the running process (os.environ.get("LANGCHAIN_PROJECT")).
Forbidden: project does not exist when calling list_runs. The SDK auto-creates projects on the first traced call — but read-only ops on a never-traced project 404. Create it manually in the UI.
Subprocess loses context. Context-vars don't cross process boundaries. Forward run IDs via job payloads.
Huge inputs/outputs truncated. LangSmith truncates large fields server-side (~1 MB). Link to S3/R2 in metadata rather than embedding raw content.
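For the truncation case, one option is to swap oversized fields for a small reference before tracing. A minimal sketch, assuming a hypothetical store callable that uploads bytes and returns a URI (for example a thin wrapper around an S3 or R2 client); the 1,000,000-byte default simply mirrors the rough figure above and is not an exact server limit.

```python
import json

LIMIT_BYTES = 1_000_000  # rough figure from the note above, not an exact server limit


def offload_large_fields(payload: dict, store, limit: int = LIMIT_BYTES) -> dict:
    """Replace oversized values with a small reference dict.

    `store` is a hypothetical callable (name, data_bytes) -> URI.
    """
    slim = {}
    for name, value in payload.items():
        data = json.dumps(value, default=str).encode("utf-8")
        if len(data) > limit:
            slim[name] = {"offloaded_to": store(name, data), "bytes": len(data)}
        else:
            slim[name] = value
    return slim
```

The returned dict is what you would pass as inputs or metadata; the full content stays retrievable through the stored URI.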

Security considerations

Data residency. LangSmith stores full prompts, completions, tool arguments, and tool results on LangChain Inc.'s servers by default. For PII or regulated data, use the self-hosted deployment or hide_inputs/hide_outputs.
Secrets in prompts. API keys, customer tokens, and PII in prompts get persisted. Strip before tracing — metadata is also persisted.
API key rotation. Rotate LANGCHAIN_API_KEY on staff offboarding; revoke from the LangSmith workspace settings.
Egress filtering. Self-host the backend if your environment blocks egress to public LangSmith. The SDK's only outbound is LANGCHAIN_ENDPOINT.
Project isolation. A single workspace's projects share access. For multi-tenant scenarios, segment by workspace, not by project tag.
Replay safety. langsmith[vcr] records LLM calls for hermetic tests; the recorded fixtures contain real prompts/responses — treat them like the live data they came from.
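Stripping secrets before tracing usually comes down to a recursive scrub over the payload. The patterns below are illustrative only (an `sk-` or `ls__` prefixed token and a simple email shape); real deployments need patterns matched to their own secrets. The redact function is a hypothetical helper, and how it is wired into the hide_inputs / hide_outputs hooks is not shown here.

```python
import re

# Illustrative patterns only; extend them for the secrets your system handles.
PATTERNS = [
    (re.compile(r"\b(?:sk-|ls__)[A-Za-z0-9_-]{8,}"), "[REDACTED_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
]


def redact(value):
    """Recursively scrub strings inside dicts and lists; other types pass through."""
    if isinstance(value, str):
        for pattern, replacement in PATTERNS:
            value = pattern.sub(replacement, value)
        return value
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    return value
```

Apply the same scrub to metadata, since the list above notes that metadata is persisted as well.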

Cost & rate-limit management

LangSmith itself meters by traces ingested per month on the SaaS plan; the more impactful budget conversation is the LLM cost you observe through LangSmith. The SDK helps that conversation in three ways:

Per-run cost is computed server-side from the model field LangSmith captures. Wrong / missing model → cost shows as $0. Always pass model= in your @traceable metadata or use the LangChain integration, which sets it for you.
Token counts must be reported. Decorated functions need to return / report token counts somehow — either via the LangChain integration, by attaching metadata={"usage": {...}}, or by setting them on the run object directly.
Sample rate vs ingest cost. Sampling 5% of traces cuts ingest volume roughly 20×; tag the sampled runs so dashboards can scale counts back up. Don't pay for high-cardinality dev traffic at the same level as prod incident triage.
Per-user / per-tenant accounting. Stuff metadata={"user_id": ..., "tenant": ...} into every traced call. The query DSL lets you slice cost by metadata key arbitrarily.
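Once runs carry tenant metadata, the rollup itself is a few lines. This sketch works on plain dicts shaped like {"total_cost": ..., "metadata": {...}}; that shape is an assumption for illustration, so adapt the accessors to whatever your run objects expose.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_metadata(runs, key: str) -> dict:
    """Sum total_cost per value of one metadata key.

    Runs without the key land in an "unknown" bucket so gaps stay visible.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each bucket at 0
    for run in runs:
        bucket = (run.get("metadata") or {}).get(key, "unknown")
        # Go through str() so float inputs like 0.1 do not carry binary noise.
        totals[bucket] += Decimal(str(run.get("total_cost") or 0))
    return dict(totals)
```

A non-empty "unknown" bucket is a quick signal that some call sites are not attaching the metadata yet.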

Query DSL reference

The filter= argument to list_runs accepts a small expression language. Mastery saves a lot of post-fetch Python filtering.

Operators:

Operator	Meaning	Example
`eq(field, value)`	Equal	`eq(status, "error")`
`ne(field, value)`	Not equal	`ne(name, "test-run")`
`gt`, `gte`, `lt`, `lte`	Numeric comparisons	`gt(total_cost, 0.01)`
`has(field, value)`	Contains (lists / tags)	`has(tags, "experiment-v2")`
`search(text)`	Free-text search across inputs/outputs	`search("rag")`
`and(...)`, `or(...)`, `not(...)`	Boolean composition	`and(eq(status,"error"), gt(total_cost,0.01))`
`eq(feedback_key, ...) ∧ lt(feedback_score, ...)`	Joined feedback predicate (the only feedback shape)	`and(eq(feedback_key,"user_rating"), lt(feedback_score, 0.5))`

Common fields: status, name, run_type, start_time, end_time, total_cost, prompt_tokens, completion_tokens, total_tokens, tags, error, feedback_key, feedback_score.

Practical idioms:

python

# All errors in the last hour
ls.list_runs(project_name="prod",
             start_time=datetime.now(timezone.utc) - timedelta(hours=1),
             filter='eq(status, "error")')

# Expensive successful runs
ls.list_runs(filter='and(eq(status, "success"), gt(total_cost, 0.50))')

# Tagged + thumbs-down
ls.list_runs(filter='and(has(tags, "ab-test-b"), '
                    'and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5)))')

Tip: build complex filters by composing predicate strings programmatically to avoid quoting traps.
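One way to compose filters programmatically is a handful of tiny string builders. These helpers are hypothetical and not part of the SDK; they only emit the operator syntax from the table above, and they assume string literals in the filter language can be quoted the way JSON quotes them.

```python
import json


def lit(value):
    """Render a Python value as a filter literal (strings get double quotes)."""
    return json.dumps(value)


def eq(field, value):
    return f"eq({field}, {lit(value)})"


def lt(field, value):
    return f"lt({field}, {lit(value)})"


def gt(field, value):
    return f"gt({field}, {lit(value)})"


def has(field, value):
    return f"has({field}, {lit(value)})"


def and_(*predicates):
    """Join predicates with and(...); a single predicate is returned unchanged."""
    return predicates[0] if len(predicates) == 1 else f"and({', '.join(predicates)})"


thumbs_down_b = and_(
    has("tags", "ab-test-b"),
    eq("feedback_key", "user_rating"),
    lt("feedback_score", 0.5),
)
```

The resulting string can be passed as the filter= argument in the same way as the hand-written idioms above.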

Ecosystem integrations

langsmith plugs into more than just LangChain. The notable hookups:

Integration	What you get
LangChain / LangGraph	Zero-config tracing once env vars are set — every chain, runnable, retriever, and tool emits spans automatically.
OpenAI Agents SDK	`langsmith[openai-agents]` provides a callback handler that maps Agents-SDK events to LangSmith run types.
Llama-Index	First-party callback (`llama_index.callbacks.LangSmithHandler`) attaches to the global `CallbackManager`.
Pytest plugin	`langsmith[pytest]` adds the `@unit` decorator and `pytest-langsmith` CLI for evaluation-as-test. Each test becomes a tracked experiment.
VCR	`langsmith[vcr]` records LLM API responses so test suites become hermetic. Recorded cassettes live alongside tests; CI replays without hitting providers.
OpenInference	The OTel-flavoured tracing layer used by Arize and others — `openinference-instrumentation-langchain` produces OpenInference spans alongside LangSmith's native ones.
CI/CD	`evaluate(...)` returns an experiment object with a stable URL; the GitHub Action ecosystem includes wrappers that post the URL as PR comments.
Slack / PagerDuty / webhooks	The backend supports alert rules — wire spend or score regressions to your incident channel.

For projects already on a non-LangChain stack, the @traceable decorator and Client.list_runs API alone deliver most of the value; the deeper integrations earn their keep as the surface area grows.

When NOT to use this

langsmith fits most projects built on LLMs, but it isn't always the right tool:

Pure non-LLM ML pipelines. Use MLflow, Weights & Biases, or Neptune. langsmith is shaped around prompt/completion/tool-call spans.
You need full OTel parity. Arize Phoenix, openinference-instrumentation, or vanilla OTel collectors give you broader interop with general APM stacks.
Compliance forbids egress. The SaaS won't work; the self-hosted edition does, but it's an operational commitment.
Single-shot scripts in CI. A print(token_count) and a CSV append might be all you need.
Cost is the only signal. Provider-side dashboards (Anthropic Console, OpenAI Usage) cover spend without any SDK at all.
