Install
Terminal · npx
$ npx skills add https://github.com/langchain-ai/langsmith-skills --skill langsmith-evaluator
Works with Paperclip
How Langsmith Evaluator fits into a Paperclip company.

Langsmith Evaluator drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
Source file
SKILL.md · 367 lines · markdown
---
name: langsmith-evaluator
description: "INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run via LangSmith. Uses the langsmith CLI tool."
---

<oneliner>
Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories for evaluation; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators. Python and TypeScript examples included.
</oneliner>

<setup>
Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # REQUIRED
LANGSMITH_PROJECT=your-project-name                   # Check this to know which project has traces
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge
```

Authentication is REQUIRED: either set the `LANGSMITH_API_KEY` environment variable, or pass the `--api-key` flag to CLI commands (preferred):
```bash
langsmith evaluator list --api-key $LANGSMITH_API_KEY
```

**IMPORTANT:** Always check the environment variables or `.env` file for `LANGSMITH_PROJECT` before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If `LANGSMITH_PROJECT` is not set, use your best judgement to identify the right project.

Python Dependencies
```bash
pip install langsmith langchain-openai python-dotenv
```

CLI Tool (for uploading evaluators)
```bash
curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh
```

JavaScript Dependencies
```bash
npm install langsmith openai
```
</setup>

<crucial_requirement>
## Golden Rule: Inspect Before You Implement

**CRITICAL:** Before writing ANY evaluator or extraction logic, you MUST:
1. **Run your agent** on sample inputs and capture the actual output
2. **Inspect the output** - print it, query LangSmith traces, understand the exact structure
3. **Only then** write code that processes that output

Output structures vary significantly by framework, agent type, and configuration. Never assume the shape - always verify first. When the output doesn't contain the data you need, query the LangSmith trace to work out how to extract it from the execution.
</crucial_requirement>

<evaluator_format>
## Offline vs Online Evaluators

**Offline Evaluators** (attached to datasets):
- Function signature: `(run, example)` - receives both run outputs and dataset example
- Use case: Comparing agent outputs to expected values in a dataset
- Upload with: `--dataset "Dataset Name"`

**Online Evaluators** (attached to projects):
- Function signature: `(run)` - receives only run outputs, NO example parameter
- Use case: Real-time quality checks on production runs (no reference data)
- Upload with: `--project "Project Name"`

**CRITICAL - Return Format:**
- Each evaluator returns **ONE metric only**. For multiple metrics, create multiple evaluator functions.
- Do NOT return `{"metric_name": value}` or lists of metrics - this will error.

**CRITICAL - Local vs Uploaded Differences:**

| | Local `evaluate()` | Uploaded to LangSmith |
|---|---|---|
| **Column name** | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key` - it creates a duplicate column |
| **Python `run` type** | `RunTree` object → `run.outputs` (attribute) | `dict` → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` |
| **TypeScript `run` type** | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` |
| **Python return** | `{"score": value, "comment": "..."}` | `{"score": value, "comment": "..."}` |
| **TypeScript return** | `{ key: "name", score: value, comment: "..." }` | `{ score: value, comment: "..." }` |
</evaluator_format>

<evaluator_types>
- **LLM as Judge** - Uses an LLM to grade outputs. Best for subjective quality (accuracy, helpfulness, relevance).
- **Custom Code** - Deterministic logic. Best for objective checks (exact match, trajectory validation, format compliance).
</evaluator_types>

<llm_judge>
## LLM as Judge Evaluators

**NOTE:** LLM-as-Judge upload is currently not supported by the CLI - only code evaluators are supported. For evaluations against a dataset, STRONGLY PREFER defining local evaluators to use with `evaluate(evaluators=[...])`.

<python>
```python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class Grade(TypedDict):
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Grade, method="json_schema", strict=True)

async def accuracy_evaluator(run, example):
    # Parenthesized so the `or {}` fallback also covers a RunTree whose outputs are None
    run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    example_outputs = (example.outputs if hasattr(example, "outputs") else example.get("outputs")) or {}
    grade = await judge.ainvoke([{"role": "user", "content": f"Expected: {example_outputs}\nActual: {run_outputs}\nIs this accurate?"}])
    return {"score": 1 if grade["is_accurate"] else 0, "comment": grade["reasoning"]}
```
</python>

<typescript>
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function accuracyEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};

    const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        temperature: 0,
        response_format: { type: "json_object" },
        messages: [
            { role: "system", content: 'Respond with JSON: {"is_accurate": boolean, "reasoning": string}' },
            { role: "user", content: `Expected: ${JSON.stringify(exampleOutputs)}\nActual: ${JSON.stringify(runOutputs)}\nIs this accurate?` }
        ]
    });

    const grade = JSON.parse(response.choices[0].message.content);
    return { score: grade.is_accurate ? 1 : 0, comment: grade.reasoning };
}
```
</typescript>
</llm_judge>

<code_evaluators>
## Custom Code Evaluators

**Before writing an evaluator:**
1. Inspect your dataset to understand expected field names (see Golden Rule above)
2. Test your run function and verify its output structure matches the dataset schema
3. Query LangSmith traces to debug any mismatches

<python>
```python
def trajectory_evaluator(run, example):
    # Parenthesized so the `or {}` fallback also covers a RunTree whose outputs are None
    run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
    example_outputs = (example.outputs if hasattr(example, "outputs") else example.get("outputs")) or {}
    # IMPORTANT: Replace these placeholders with your actual field names
    # 1. Query your LangSmith trace to see what fields exist in run outputs
    # 2. Check your dataset schema for expected field names
    # Note: Trajectory data may not appear in default output - verify against trace!
    actual = run_outputs.get("YOUR_TRAJECTORY_FIELD", [])
    expected = example_outputs.get("YOUR_EXPECTED_FIELD", [])
    return {"score": 1 if actual == expected else 0, "comment": f"Expected {expected}, got {actual}"}
```
</python>

<typescript>
```javascript
function trajectoryEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};
    // IMPORTANT: Replace these placeholders with your actual field names
    // 1. Query your LangSmith trace to see what fields exist in run outputs
    // 2. Check your dataset schema for expected field names
    const actual = runOutputs.YOUR_TRAJECTORY_FIELD ?? [];
    const expected = exampleOutputs.YOUR_EXPECTED_FIELD ?? [];
    const match = JSON.stringify(actual) === JSON.stringify(expected);
    return { score: match ? 1 : 0, comment: `Expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}` };
}
```
</typescript>
</code_evaluators>

<run_functions>
## Defining Run Functions

Run functions execute your agent and return outputs for evaluation.

**CRITICAL - Test Your Run Function First:**
Before writing evaluators, you MUST test your run function and inspect the actual output structure. Output shapes vary by framework, agent type, and configuration.

**Debugging workflow:**
1. Run your agent once on sample input
2. Query the trace to see the execution structure
3. Print the raw output and check it against the trace to confirm the output contains the right data
4. Adjust the run function as needed
5. Verify your output matches your dataset schema

**Try your hardest to match your run function output to your dataset schema.** This makes evaluators simple and reusable. If matching isn't possible, your evaluator must know how to extract and compare the right fields from each side.

<python>
```python
def run_agent(inputs: dict) -> dict:
    result = your_agent.run(inputs)
    # ALWAYS inspect output shape first - run this, check the print, query traces
    print(f"DEBUG - type: {type(result)}, keys: {result.keys() if hasattr(result, 'keys') else 'N/A'}")
    print(f"DEBUG - value: {result}")
    return {"output": result}  # Adjust to match your dataset schema
```
</python>

<typescript>
```javascript
async function runAgent(inputs) {
    const result = await yourAgent.invoke(inputs);
    // ALWAYS inspect output shape first
    console.log("DEBUG - type:", typeof result, "keys:", Object.keys(result));
    console.log("DEBUG - value:", result);
    return { output: result };  // Adjust to match your dataset schema
}
```
</typescript>

### Capturing Trajectories

For trajectory evaluation, your run function must capture tool calls during execution.

**CRITICAL:** Run output formats vary significantly by framework and agent type. You MUST inspect before implementing:

**LangGraph agents (LangChain OSS):** Use `stream_mode="debug"` with `subgraphs=True` to capture nested subagent tool calls.

```python
import uuid

def run_agent_with_trajectory(agent, inputs: dict) -> dict:
    config = {"configurable": {"thread_id": f"eval-{uuid.uuid4()}"}}
    trajectory = []
    final_result = None

    for chunk in agent.stream(inputs, config=config, stream_mode="debug", subgraphs=True):
        # STEP 1: Print chunks to understand the structure
        print(f"DEBUG chunk: {chunk}")

        # STEP 2: Write extraction based on YOUR observed structure
        # ... your extraction logic here ...

    # IMPORTANT: After running, query the LangSmith trace to verify
    # your trajectory data is complete. Default output may be missing
    # tool calls that appear in the trace.
    return {"output": final_result, "trajectory": trajectory}
```

**Custom / Non-LangChain Agents:**

1. **Inspect output first** - Run your agent and inspect the result structure. Trajectory data may already be included in the output (e.g., `result.tool_calls`, `result.steps`, etc.)
2. **Callbacks/Hooks** - If your framework supports execution callbacks, register a hook that records tool names on each invocation
3. **Parse execution logs** - As a last resort, extract tool names from structured logs or trace data

The key is to capture the tool name at execution time, not at definition time.
</run_functions>

<upload>
## Uploading Evaluators to LangSmith

**IMPORTANT - Auto-Run Behavior:**
Evaluators uploaded to a dataset **automatically run** when you run experiments on that dataset. You do NOT need to pass them to `evaluate()` - just run your agent against the dataset and the uploaded evaluators execute automatically.

**IMPORTANT - Local vs Uploaded:**
Uploaded evaluators run in a sandboxed environment with very limited package access. Only use built-in/standard library imports, and place all imports **inside** the evaluator function body. For dataset (offline) evaluators, prefer running locally with `evaluate(evaluators=[...])` first - this gives you full package access.

**IMPORTANT - Code vs Structured Evaluators:**
- **Code evaluators** (what the CLI uploads): Run in a limited environment without external packages. Use for deterministic logic (exact match, trajectory validation).
- **Structured evaluators** (LLM-as-Judge): Configured via LangSmith UI, use a specific payload format with model/prompt/schema. The CLI does not support this format yet.

**IMPORTANT - Choose the right target:**
- `--dataset`: Offline evaluator with `(run, example)` signature - for comparing to expected values
- `--project`: Online evaluator with `(run)` signature - for real-time quality checks

You must specify one. Global evaluators are not supported.

```bash
# List all evaluators
langsmith evaluator list --api-key $LANGSMITH_API_KEY

# Upload offline evaluator (attached to dataset)
langsmith evaluator upload my_evaluators.py \
  --name "Trajectory Match" --function trajectory_evaluator \
  --dataset "My Dataset" --replace --api-key $LANGSMITH_API_KEY

# Upload online evaluator (attached to project)
langsmith evaluator upload my_evaluators.py \
  --name "Quality Check" --function quality_check \
  --project "Production Agent" --replace --api-key $LANGSMITH_API_KEY

# Delete
langsmith evaluator delete "Trajectory Match" --api-key $LANGSMITH_API_KEY
```

**IMPORTANT - Safety Prompts:**
- The CLI prompts for confirmation before destructive operations
- **NEVER use `--yes` flag unless the user explicitly requests it**
</upload>

<best_practices>
1. **Use structured output for LLM judges** - More reliable than parsing free-text
2. **Match evaluator to dataset type**
   - Final Response → LLM as Judge for quality
   - Trajectory → Custom Code for sequence
3. **Use async for LLM judges** - Enables parallel evaluation
4. **Test evaluators independently** - Validate on known good/bad examples first
5. **Choose the right language**
   - Python: Use for Python agents, langchain integrations
   - JavaScript: Use for TypeScript/Node.js agents
</best_practices>

<running_evaluations>
## Running Evaluations

**Uploaded evaluators** auto-run when you run experiments - no code needed. **Local evaluators** are passed directly for development/testing.

<python>
```python
from langsmith import evaluate

# Uploaded evaluators run automatically
results = evaluate(run_agent, data="My Dataset", experiment_prefix="eval-v1")

# Or pass local evaluators for testing
results = evaluate(run_agent, data="My Dataset", evaluators=[my_evaluator], experiment_prefix="eval-v1")
```
</python>

<typescript>
```javascript
import { evaluate } from "langsmith/evaluation";

// Uploaded evaluators run automatically
const results = await evaluate(runAgent, {
  data: "My Dataset",
  experimentPrefix: "eval-v1",
});

// Or pass local evaluators for testing
// (named differently so both calls can live in the same module)
const localResults = await evaluate(runAgent, {
  data: "My Dataset",
  evaluators: [myEvaluator],
  experimentPrefix: "eval-v1",
});
```
</typescript>
</running_evaluations>

<troubleshooting>
## Common Issues

**Output doesn't match what you expect:** Query the LangSmith trace. It shows exact inputs/outputs at each step - compare what you find to what you're trying to extract.

**One metric per evaluator:** Return `{"score": value, "comment": "..."}`. For multiple metrics, create separate functions.

**Field name mismatch:** Your run function output must match dataset schema exactly. Inspect dataset first with `client.read_example(example_id)`.

**RunTree vs dict (Python only):** Local `evaluate()` passes `RunTree`, uploaded evaluators receive `dict`. Handle both:
```python
run_outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}
```
TypeScript always uses attribute access: `run.outputs?.field`
</troubleshooting>

<resources>
- [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
- [Custom Code Evaluators](https://changelog.langchain.com/announcements/custom-code-evaluators-in-langsmith)
- [OpenEvals - Readymade Evaluators](https://github.com/langchain-ai/openevals)
</resources>
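
The upload commands in SKILL.md reference a `quality_check` function for the online (`--project`) case, but the file only shows offline `(run, example)` evaluators. The following is a minimal sketch of what such an online evaluator might look like, not the skill author's implementation. It assumes the run output carries its answer under an `output` field, which is a placeholder: inspect a real trace and substitute your own field name. It takes only `run`, returns a single metric, uses no third-party packages, and accepts both the `RunTree` and `dict` shapes described in the troubleshooting notes.

```python
def quality_check(run):
    # Uploaded evaluators should keep any imports inside the function body;
    # this sketch needs none beyond Python built-ins.

    # Accept both a RunTree-like object (local) and a plain dict (uploaded).
    outputs = (run.outputs if hasattr(run, "outputs") else run.get("outputs")) or {}

    # ASSUMPTION: "output" is a hypothetical field name. Replace it with the
    # field you actually observe in your LangSmith traces.
    answer = outputs.get("output")

    # Pass only when the agent produced a non-blank string answer.
    ok = isinstance(answer, str) and bool(answer.strip())

    # One metric per evaluator: a single score plus an explanatory comment.
    return {
        "score": 1 if ok else 0,
        "comment": "Non-empty output" if ok else "Output missing or empty",
    }
```

Because there is no `example` argument, a check like this can only look at properties of the run itself (presence, format, length) rather than compare against reference values.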