cheat sheet
langsmith
Package-level reference for the langsmith SDK on PyPI — install, versioning, env-var setup, and observability alternatives.
What it is
langsmith is the Python client SDK for the LangSmith platform — LangChain Inc.'s hosted observability, tracing, evaluation, and prompt-management service for LLM applications. It captures every LLM call, tool use, and chain step as a span on a hosted backend so you can debug failures, replay runs, and score outputs against datasets.
The SDK works with or without LangChain. Decorating any function with @traceable is enough to ship spans; you do not need LCEL or langchain-core to use it.
Install
pip install langsmith
Output: (none — exits 0 on success)
uv add langsmith
Output: dependency resolved + added to pyproject.toml
poetry add langsmith
Output: updated lockfile + virtualenv install
The SDK reads two environment variables to enable tracing:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
Output: any subsequent @traceable function emits spans to LangSmith
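A quick preflight check can catch a missing variable before the first traced call. The sketch below uses only the standard library; the helper name tracing_env_problems is hypothetical, and it assumes that only the value "true" switches tracing on, as in the export above.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")

def tracing_env_problems(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable problems; an empty list means the setup looks sane."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    # Assumption: only the string "true" enables tracing, as in the export above.
    flag = env.get("LANGCHAIN_TRACING_V2", "")
    if flag and flag.lower() != "true":
        problems.append(f"LANGCHAIN_TRACING_V2 is {flag!r}, expected 'true'")
    return problems
```

Calling tracing_env_problems() at startup and logging the result is usually enough; passing a plain dict instead of os.environ makes the check easy to unit-test.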
Versioning & Python support
- Current stable line is 0.1.x / 0.2.x (as of late 2025). Pre-1.0: minor bumps can change span schema and rename SDK helpers.
- Python 3.9+. Async transport (AsyncClient) requires 3.10+ for best ergonomics.
- The SDK is forward-compatible with the LangSmith backend, but old SDKs may miss new span fields (e.g. token-usage breakdowns, cost). Upgrade alongside any major feature you want to use.
- Independent release cadence from langchain itself: the langchain-core callback handler that ships spans to LangSmith pins a langsmith>=X range, so a stale langchain-core can hold back upgrades.
Package metadata
- Maintainer: LangChain Inc.
- Project home: github.com/langchain-ai/langsmith-sdk
- Docs: docs.smith.langchain.com
- PyPI: pypi.org/project/langsmith
- License: MIT (SDK); LangSmith backend is closed-source / commercial
- Governance: commercial product, hosted SaaS + self-hosted enterprise
- First released: 2023
- Downloads: millions per month, pulled in transitively by most langchain installs
Optional dependencies & extras
langsmith is intentionally light. Core deps:
- requests: synchronous HTTP transport
- httpx: async transport
- orjson: fast JSON serialisation
- pydantic: span / run schemas
Extras worth noting:
- langsmith[pytest]: pytest plugin and @unit decorator for evaluation-as-test workflows.
- langsmith[openai-agents]: integration helpers for the OpenAI Agents SDK.
- langsmith[vcr]: record/replay LLM calls for hermetic test runs.
The SDK does not depend on langchain — install one without the other freely.
Alternatives
| Package | Trade-off |
|---|---|
| langfuse | Open-source observability backend + Python SDK. Self-hostable; UI is the closest comparable. |
| arize-phoenix | Open-source local + cloud LLM tracing (OpenInference / OpenTelemetry). Strong on retrieval-eval. |
| wandb (Weights & Biases) | Broader ML experiment tracking; LLM tracing via wandb-traces / weave. |
| helicone | Proxy-based observability: point your LLM client base URL at Helicone and it logs every call. |
| opentelemetry-instrumentation-openai (and similar) | OTel-native: ship spans to any OTel backend (Jaeger, Honeycomb, Datadog). |
Common gotchas
- LANGCHAIN_API_KEY must be set before import for the langchain-core callback handler to attach. Setting it after from langchain_openai import ChatOpenAI is too late: re-import, or use the explicit LangChainTracer callback.
- Trace persistence is on the LangSmith backend by default. Prompts, model outputs, and tool arguments are stored on LangChain Inc.'s servers. For regulated data (PII, GDPR), use the self-hosted deployment or the hide_inputs / hide_outputs redaction hooks.
- Sampling is off by default. Every traced call hits the backend. For high-QPS production, configure LANGCHAIN_TRACING_SAMPLING_RATE (env) or sample manually in your own code around @traceable.
- Async transport batches in the background. client.flush() is required before process exit, otherwise the last few seconds of spans drop. Most CLI scripts hit this.
- @traceable is contagious. Spans inherit project, tags, and metadata from the parent run via context-vars. Forking subprocesses or concurrent.futures.ProcessPoolExecutor breaks the context, so pass the run ID explicitly.
- Project name defaults to default. Set LANGCHAIN_PROJECT=my-app to keep environments separate; otherwise every dev/CI/prod call piles into the same project.
- SDK errors are silently swallowed. By design, tracing failures never crash the host app. Enable LANGCHAIN_VERBOSE=true while debugging missing traces.
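To make the context-var behaviour concrete, here is a toy decorator that links child spans to their parent the same general way. It is a standard-library illustration only, not the SDK's implementation; toy_traced, SPANS, and _current_span are hypothetical names.

```python
import contextvars
import functools
import time
import uuid

# ID of the span currently executing in this context (None at top level).
_current_span = contextvars.ContextVar("current_span", default=None)
SPANS = []  # stand-in for the backend: finished spans are appended here

def toy_traced(name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"id": uuid.uuid4().hex, "name": name,
                    "parent_id": _current_span.get(), "error": None}
            token = _current_span.set(span["id"])  # children see this span as parent
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                span["latency_s"] = time.perf_counter() - start
                _current_span.reset(token)  # restore the previous parent
                SPANS.append(span)
        return wrapper
    return decorator

@toy_traced("inner")
def inner(x):
    return x * 2

@toy_traced("outer")
def outer(x):
    return inner(x) + 1
```

Calling outer(3) records two spans, with inner pointing at outer through parent_id. A freshly started process begins with _current_span unset, which is why the parent link disappears across process boundaries unless an ID travels in the job payload.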
Real-world recipes
These patterns string the SDK primitives into the workflows production teams actually run. They lean on @traceable, Client.list_runs, datasets, and evaluators — the same building blocks shown in the companion article.
Recipe: framework-agnostic instrumentation
langsmith does not require LangChain. Decorate any Python function with @traceable and you get a span with inputs, outputs, latency, tokens (if reported), and errors.
import os

import anthropic
from langsmith import traceable

# Tracing configuration; in production set these in the environment instead.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_PROJECT"] = "rag-prod"

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@traceable(run_type="retriever", name="pgvector_search")
def retrieve(query: str, k: int = 4) -> list[dict]:
    # Placeholder result; a real implementation would query the vector store.
    return [{"page_content": "...", "metadata": {"source": "doc-1"}}]

@traceable(run_type="llm", name="claude_call")
def call_claude(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

@traceable(run_type="chain", name="rag_pipeline")
def rag(question: str) -> str:
    docs = retrieve(question)
    context = "\n\n".join(d["page_content"] for d in docs)
    return call_claude(f"Context:\n{context}\n\nQuestion: {question}")

print(rag("What is RAG?"))
Output: three nested spans — rag_pipeline containing pgvector_search then claude_call — appear in LangSmith with the right run-type icons.
Recipe: a CI evaluation gate
Run a canonical regression dataset against the current chain on every PR; fail the build if scores regress.
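from langsmith.evaluation import evaluate

# rag() comes from the instrumentation recipe; helpfulness_evaluator and
# length_evaluator are evaluators defined elsewhere in the test module.
def predict(inputs):
    return {"answer": rag(inputs["question"])}

def test_no_regression():
    results = evaluate(
        predict,
        data="regression-suite-v3",
        evaluators=[helpfulness_evaluator, length_evaluator],
        experiment_prefix="ci",
    )
    # Assumes the results object exposes get_aggregate_feedback(); if your SDK
    # version does not, average the per-row scores instead.
    agg = results.get_aggregate_feedback()
    assert agg["helpfulness"] >= 0.78, f"helpfulness regressed: {agg['helpfulness']:.2f}"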
Output: the CI job posts the experiment URL as a PR comment; threshold violations fail the build with a useful link.
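The gate itself is just averaging and comparing. As a minimal sketch that does not depend on any SDK aggregation helper, the functions below average {"key", "score"} results per metric and report which ones fall below their floor; mean_scores and failed_thresholds are hypothetical names, and the input shape is the evaluator return shape used throughout this page.

```python
from collections import defaultdict
from statistics import mean

def mean_scores(results):
    """Average scores per evaluator key across a list of {"key", "score"} dicts."""
    buckets = defaultdict(list)
    for r in results:
        if r.get("score") is not None:  # skip evaluators that abstained
            buckets[r["key"]].append(float(r["score"]))
    return {key: mean(vals) for key, vals in buckets.items()}

def failed_thresholds(aggregates, thresholds):
    """Return {key: (actual, required)} for every metric below its floor."""
    failures = {}
    for key, required in thresholds.items():
        actual = aggregates.get(key)
        # A metric that never reported counts as a failure, not a pass.
        if actual is None or actual < required:
            failures[key] = (actual, required)
    return failures
```

In a test, assert that failed_thresholds(...) is empty and put the returned dict in the assertion message so the CI log shows every regressed metric at once.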
Recipe: cost dashboard ingest
Aggregate yesterday's spend by model and ship to a dashboard.
from collections import defaultdict
from datetime import datetime, timedelta, timezone

from langsmith import Client

ls = Client()
since = datetime.now(timezone.utc) - timedelta(days=1)
spend = defaultdict(float)
tokens = defaultdict(int)

for r in ls.list_runs(project_name="prod", run_type="llm", start_time=since, limit=10_000):
    model = (r.extra or {}).get("invocation_params", {}).get("model", "unknown")
    # total_cost may be a Decimal or None; coerce so it adds cleanly to a float.
    spend[model] += float(r.total_cost or 0)
    tokens[model] += (r.prompt_tokens or 0) + (r.completion_tokens or 0)

# Most expensive model first.
for m in sorted(spend, key=lambda k: -spend[k]):
    print(f"{m:30s} ${spend[m]:7.2f} {tokens[m]:>10,} tokens")
Output: ranked table; pipe into Prometheus / OpenTelemetry / your warehouse for retention.
Recipe: prompt promotion pipeline
Use the Hub as a versioned store; tag the candidate, evaluate against a gold dataset, and promote on pass.
from langchain import hub
from langchain_core.prompts import ChatPromptTemplate
from langsmith.evaluation import evaluate

# predict and helpfulness_evaluator are defined as in the CI gate recipe.
with open("prompts/v3.txt") as f:
    candidate = ChatPromptTemplate.from_template(f.read())
hub.push("alicedev/summarise-v1", candidate, new_repo_is_public=False)

results = evaluate(predict, data="summarise-gold", evaluators=[helpfulness_evaluator])
if results.get_aggregate_feedback()["helpfulness"] >= 0.82:
    # alias the candidate as 'prod' (operator-defined convention)
    hub.push("alicedev/summarise-v1:prod", candidate)
Output: prod-aliased commit hash drives the live chain; rollback is a single hub.push to the prior hash.
Recipe: dataset bootstrap from production traces
Take a week of production runs that earned a thumbs-up and seed a fine-tune dataset.
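from datetime import datetime, timedelta, timezone

from langsmith import Client

ls = Client()
runs = ls.list_runs(
    project_name="prod",
    run_type="llm",
    # Restrict to the last week, matching the recipe description.
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    filter='and(eq(feedback_key, "user_rating"), gte(feedback_score, 0.8))',
    limit=5000,
)
ds = ls.create_dataset("sft-bootstrap-2026Q1")
for r in runs:
    ls.create_example(inputs=r.inputs, outputs=r.outputs, dataset_id=ds.id)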
Output: dataset ready for export to JSONL / HuggingFace for SFT.
Production deployment
langsmith is a thin SDK; the server side is the LangSmith backend (SaaS or self-hosted). Production hardening is mostly about graceful failure, sampling, and PII discipline.
- Flush on shutdown. In long-running services, register an exit hook: atexit.register(Client().flush). Without it, the last few seconds of spans drop on rolling deploys.
- Sampling. Set LANGCHAIN_TRACING_SAMPLING_RATE=0.05 for high-traffic services (5% of spans). Tag kept runs with metadata={"sampled": True} so dashboards can divide observed counts by the sample rate to estimate true totals.
- Self-hosted endpoint. export LANGCHAIN_ENDPOINT=https://langsmith.internal/... (the SDK is identical otherwise). Self-host is the answer for regulated data and air-gapped environments.
- Async transport is the default. Spans queue in-process and ship in background tasks. The HTTP failure modes (DNS, TLS, 5xx from the backend) never propagate to your request handler, by design.
- hide_inputs / hide_outputs. Pass these env vars or per-@traceable parameters to redact payloads before they leave the host. Useful for PII regulations.
- One project per environment. Set LANGCHAIN_PROJECT=prod in production deployments, =staging in staging, =ci in CI. Otherwise all environments land in the same project and dashboards become useless.
- Run-ID propagation. When you fan out work to a worker pool (concurrent.futures, Celery), context-vars do NOT cross process boundaries. Pass the parent run ID via your job payload and rebuild the trace tree on the worker side via with tracing_context(parent=...):.
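The sampling arithmetic can be sketched without the SDK. The example below shows one way to sample manually (a deterministic decision keyed on a trace ID) and how a dashboard would scale a sampled count back up; keep_trace and estimate_total are hypothetical helpers, and the SDK's own env-var sampling may work differently.

```python
import hashlib

def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces, keyed on the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a float in [0, 1) and compare against the rate.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

def estimate_total(kept_count: int, rate: float) -> float:
    """Scale a count observed in the sample back up to the full population."""
    if rate <= 0:
        raise ValueError("rate must be positive to extrapolate")
    return kept_count / rate
```

With a 5% rate, 50 sampled errors correspond to roughly 1,000 real ones. Hashing the ID rather than calling random() means every service that sees the same trace makes the same keep-or-drop decision.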
Version migration guide
langsmith is pre-1.0 and the public surface has been remarkably stable, but the span schema on the backend and a few SDK helpers have shifted.
| Era | Notable changes |
|---|---|
| 0.0.x (early) | Initial release. Sync-only Client. Many helpers later moved out of langsmith proper. |
| 0.1.x | Stable @traceable and Client API. Async client (AsyncClient) introduced. Native cost / token fields per LLM span. |
| 0.2.x (current) | Pytest plugin shipped as langsmith[pytest]. Pairwise evaluation helpers (evaluate_comparative). Stable run-type taxonomy. |
General migration discipline:
- Upgrade langsmith alongside langchain-core: the callback handler depends on a specific SDK floor.
- Replace any private imports (anything starting with _) before upgrading; they vanish without warning.
- Re-check that custom evaluator return shapes match {"key": ..., "score": ...}; older variants accepted positional args.
- Pinning: langchain-core's metadata declares a langsmith>=X range and pip resolves through it. Let pip pick the floor rather than hard-pinning langsmith== unless you have a reason.
Evaluation & observability
LangSmith's evaluation surface is broader than most teams use. The mental model: an evaluator scores a (run, example) pair and returns {key, score}. Everything else — dataset loops, statistical aggregation, comparative experiments — is plumbing around that.
Built-in string evaluators wrap LLM-as-a-judge templates:
from langsmith.evaluation import LangChainStringEvaluator
helpfulness = LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"})
correctness = LangChainStringEvaluator("labeled_criteria", config={"criteria": "correctness"})
embedding = LangChainStringEvaluator("embedding_distance")
Custom evaluators for anything mechanical:
import json

from langsmith.evaluation import run_evaluator
from langsmith.schemas import Run, Example

@run_evaluator
def schema_evaluator(run: Run, example: Example):
    # Score 1.0 when the answer parses as a JSON object with a "title" key.
    try:
        parsed = json.loads(run.outputs["answer"])
        ok = isinstance(parsed, dict) and "title" in parsed
        return {"key": "valid_schema", "score": 1.0 if ok else 0.0}
    except Exception:
        return {"key": "valid_schema", "score": 0.0}
Pairwise / preference evaluations — use when no single ground-truth answer exists. evaluate_comparative takes two prior experiments and runs a judge that picks the winner.
Co-existence with other observability tools — langsmith runs happily next to OpenTelemetry instrumentation. Many teams ship spans to LangSmith for LLM-specific UI and to OTel collectors for SRE dashboards.
Multi-provider patterns
langsmith is provider-agnostic — @traceable works the same wrapping OpenAI, Anthropic, Gemini, or a local vLLM endpoint. The patterns worth knowing:
- Token usage normalisation. Different providers report token counts in different envelope shapes. The langchain integration normalises them; raw @traceable users should attach token counts to metadata explicitly so dashboards work.
- LiteLLM + LangSmith. Point a LiteLLM proxy at any provider; LiteLLM emits OpenAI-compat responses; LangSmith traces capture them when wrapped with @traceable or via LangChain. This is the common "one SDK, many providers" backbone.
- OpenInference / OTel parallel export. For teams that already run an OTel collector, the openinference-instrumentation-* packages produce spans your collector already understands; you can dual-export to LangSmith and OTel simultaneously.
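For the token-usage point, a small normaliser is often all that raw @traceable users need. This sketch handles OpenAI-style keys (prompt_tokens / completion_tokens) and Anthropic-style keys (input_tokens / output_tokens); normalise_usage and the output shape are illustrative choices, not something the SDK mandates.

```python
def normalise_usage(usage: dict) -> dict:
    """Map provider-specific usage dicts to one shape.

    Accepts OpenAI-style keys (prompt_tokens / completion_tokens) and
    Anthropic-style keys (input_tokens / output_tokens); missing values count as 0.
    """
    prompt = usage.get("prompt_tokens", usage.get("input_tokens", 0)) or 0
    completion = usage.get("completion_tokens", usage.get("output_tokens", 0)) or 0
    return {
        "prompt_tokens": int(prompt),
        "completion_tokens": int(completion),
        # Recomputed rather than trusted, so both providers yield the same field.
        "total_tokens": int(prompt) + int(completion),
    }
```

The resulting dict can then be attached to a run's metadata so that dashboards see one consistent set of field names regardless of provider.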
Troubleshooting common errors
- No traces appear. Almost always order-of-import: LANGCHAIN_TRACING_V2 / LANGCHAIN_API_KEY must be set before from langchain .... Fix by setting in your process manager / shell, or use the explicit LangChainTracer callback.
- Last few traces missing on script exit. Background flusher hadn't drained. Call Client().flush() before exit, or register it with atexit.register(...).
- 401 Unauthorized from the SDK. Either the key is wrong or it's for the wrong workspace. Each LangSmith workspace gets its own key prefix; copy from the right org in the UI.
- 429 from the backend. Backend ingest rate-limit. Lower your sampling rate or batch traces.
- Spans show up under "default" project. LANGCHAIN_PROJECT not set in that environment; check the actual env in the running process (os.environ.get("LANGCHAIN_PROJECT")).
- Forbidden: project does not exist when calling list_runs. The SDK auto-creates projects on the first traced call, but read-only ops on a never-traced project return 404. Create it manually in the UI.
- Subprocess loses context. Context-vars don't cross process boundaries. Forward run IDs via job payloads.
- Huge inputs/outputs truncated. LangSmith truncates large fields server-side (~1 MB). Link to S3/R2 in metadata rather than embedding raw content.
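For the truncation case, one option is to swap an oversized field for a small pointer before it is traced. The sketch below is an assumption-laden illustration: compact_field and uri_for are hypothetical, the 1,000,000-byte limit simply mirrors the ~1 MB figure above, and uploading the raw bytes is left to the caller.

```python
import hashlib
import json

MAX_INLINE_BYTES = 1_000_000  # assumption: mirrors the ~1 MB figure above

def compact_field(value, uri_for=None, max_bytes=MAX_INLINE_BYTES):
    """Return `value` unchanged if small, else a pointer dict describing it."""
    raw = json.dumps(value, default=str).encode("utf-8")
    if len(raw) <= max_bytes:
        return value
    digest = hashlib.sha256(raw).hexdigest()
    pointer = {"truncated": True, "bytes": len(raw), "sha256": digest}
    if uri_for is not None:
        # uri_for is caller-supplied, e.g. a function that uploads to object
        # storage and returns the resulting location.
        pointer["uri"] = uri_for(digest, raw)
    return pointer
```

The hash lets you match the pointer to the stored object later, and the trace stays small enough to avoid server-side truncation.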
Security considerations
- Data residency. LangSmith stores full prompts, completions, tool arguments, and tool results on LangChain Inc.'s servers by default. For PII or regulated data, use the self-hosted deployment or hide_inputs / hide_outputs.
- Secrets in prompts. API keys, customer tokens, and PII in prompts get persisted. Strip before tracing; metadata is also persisted.
- API key rotation. Rotate LANGCHAIN_API_KEY on staff offboarding; revoke from the LangSmith workspace settings.
- Egress filtering. Self-host the backend if your environment blocks egress to public LangSmith. The SDK's only outbound is LANGCHAIN_ENDPOINT.
- Project isolation. A single workspace's projects share access. For multi-tenant scenarios, segment by workspace, not by project tag.
- Replay safety. langsmith[vcr] records LLM calls for hermetic tests; the recorded fixtures contain real prompts/responses, so treat them like the live data they came from.
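Stripping secrets before tracing usually comes down to a redaction function applied to inputs and outputs. Below is a minimal standard-library sketch; the two regular expressions are illustrative and far from exhaustive, and how the function is wired in (for example as a callable redaction hook on the client) is an assumption to verify against the SDK version in use.

```python
import re

# Illustrative patterns only; real deployments need a broader, reviewed set.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SECRET_RE = re.compile(r"\b(?:sk-|ls__)[A-Za-z0-9_-]{8,}")

def redact_text(text: str) -> str:
    """Mask email addresses and key-like tokens in a single string."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SECRET_RE.sub("[SECRET]", text)

def redact(payload):
    """Recursively redact strings inside dicts and lists; other types pass through."""
    if isinstance(payload, str):
        return redact_text(payload)
    if isinstance(payload, dict):
        return {k: redact(v) for k, v in payload.items()}
    if isinstance(payload, (list, tuple)):
        return [redact(v) for v in payload]
    return payload
```

Because metadata is persisted too, the same function should run over metadata values, not just prompts and completions.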
Cost & rate-limit management
LangSmith itself meters by traces ingested per month on the SaaS plan; the more impactful budget conversation is the LLM cost you observe through LangSmith. The SDK helps that conversation in four ways:
- Per-run cost is computed server-side from the model field LangSmith captures. Wrong / missing model → cost shows as $0. Always pass model= in your @traceable metadata or use the LangChain integration, which sets it for you.
- Token counts must be reported. Decorated functions need to return / report token counts somehow: either via the LangChain integration, by attaching metadata={"usage": {...}}, or by setting them on the run object directly.
- Sample rate vs ingest cost. Sampling 5% of spans reduces ingest cost by ~20× and is invisible to debugging if you tag the samples. Don't pay for high-cardinality dev traffic at the same level as prod incident triage.
- Per-user / per-tenant accounting. Stuff metadata={"user_id": ..., "tenant": ...} into every traced call. The query DSL lets you slice cost by metadata key arbitrarily.
Query DSL reference
The filter= argument to list_runs accepts a small expression language. Mastery saves a lot of post-fetch Python filtering.
Operators:
| Operator | Meaning | Example |
|---|---|---|
| eq(field, value) | Equal | eq(status, "error") |
| ne(field, value) | Not equal | ne(name, "test-run") |
| gt, gte, lt, lte | Numeric comparisons | gt(total_cost, 0.01) |
| has(field, value) | Contains (lists / tags) | has(tags, "experiment-v2") |
| search(text) | Free-text search across inputs/outputs | search("rag") |
| and(...), or(...), not(...) | Boolean composition | and(eq(status,"error"), gt(total_cost,0.01)) |
| eq(feedback_key, ...) ∧ lt(feedback_score, ...) | Joined feedback predicate (the only feedback shape) | and(eq(feedback_key,"user_rating"), lt(feedback_score, 0.5)) |
Common fields: status, name, run_type, start_time, end_time, total_cost, prompt_tokens, completion_tokens, total_tokens, tags, error, feedback_key, feedback_score.
Practical idioms:
# All errors in the last hour
ls.list_runs(
    project_name="prod",
    start_time=datetime.now(timezone.utc) - timedelta(hours=1),
    filter='eq(status, "error")',
)

# Expensive successful runs
ls.list_runs(filter='and(eq(status, "success"), gt(total_cost, 0.50))')

# Tagged + thumbs-down
ls.list_runs(
    filter='and(has(tags, "ab-test-b"), '
           'and(eq(feedback_key, "user_rating"), lt(feedback_score, 0.5)))'
)
Tip: build complex filters by composing predicate strings programmatically and avoid quoting traps.
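One way to follow that tip is a handful of tiny builder functions that return predicate strings. This is a sketch rather than part of the SDK: the helper names are hypothetical, and it assumes the filter language accepts JSON-style double-quoted strings with backslash escapes.

```python
import json

def _lit(value):
    """Render a value as a filter literal: numbers bare, everything else quoted."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return repr(value)
    # Assumption: JSON-style quoting and escaping is accepted by the filter parser.
    return json.dumps(str(value))

def _binary(op):
    def build(field, value):
        return f"{op}({field}, {_lit(value)})"
    return build

eq, ne = _binary("eq"), _binary("ne")
gt, gte = _binary("gt"), _binary("gte")
lt, lte = _binary("lt"), _binary("lte")
has = _binary("has")

def and_(*preds):
    return "and(" + ", ".join(preds) + ")"

def or_(*preds):
    return "or(" + ", ".join(preds) + ")"
```

The thumbs-down idiom above then becomes and_(has("tags", "ab-test-b"), and_(eq("feedback_key", "user_rating"), lt("feedback_score", 0.5))), with no hand-managed quotes.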
Ecosystem integrations
langsmith plugs into more than just LangChain. The notable hookups:
| Integration | What you get |
|---|---|
| LangChain / LangGraph | Zero-config tracing once env vars are set — every chain, runnable, retriever, and tool emits spans automatically. |
| OpenAI Agents SDK | langsmith[openai-agents] provides a callback handler that maps Agents-SDK events to LangSmith run types. |
| Llama-Index | First-party callback (llama_index.callbacks.LangSmithHandler) attaches to the global CallbackManager. |
| Pytest plugin | langsmith[pytest] adds the @unit decorator and pytest-langsmith CLI for evaluation-as-test. Each test becomes a tracked experiment. |
| VCR | langsmith[vcr] records LLM API responses so test suites become hermetic. Recorded cassettes live alongside tests; CI replays without hitting providers. |
| OpenInference | The OTel-flavoured tracing layer used by Arize and others — openinference-instrumentation-langchain produces OpenInference spans alongside LangSmith's native ones. |
| CI/CD | evaluate(...) returns an experiment object with a stable URL; the GitHub Action ecosystem includes wrappers that post the URL as PR comments. |
| Slack / PagerDuty / webhooks | The backend supports alert rules — wire spend or score regressions to your incident channel. |
For projects already on a non-LangChain stack, the @traceable decorator and Client.list_runs API alone deliver most of the value; the deeper integrations earn their keep as the surface area grows.
When NOT to use this
langsmith is excellent for any project building on LLMs — but it isn't always the right tool:
- Pure non-LLM ML pipelines. Use MLflow, Weights & Biases, or Neptune. langsmith is shaped around prompt/completion/tool-call spans.
- You need full OTel parity. Arize Phoenix, openinference-instrumentation, or vanilla OTel collectors give you broader interop with general APM stacks.
- Compliance forbids egress. The SaaS won't work; the self-hosted edition does, but it's an operational commitment.
- Single-shot scripts in CI. A print(token_count) and a CSV append might be all you need.
- Cost is the only signal. Provider-side dashboards (Anthropic Console, OpenAI Usage) cover spend without any SDK at all.
See also
- AI: LangSmith — tracing, evaluation, datasets
- Packages: pip-langchain — the framework most users pair with LangSmith
- Concept: api — client SDK design
- Concept: agents — what you're observing