Install

Terminal · npx

$ npx skills add https://github.com/vercel-labs/agent-skills --skill vercel-react-best-practices

Works with Paperclip

How Agent Eval fits into a Paperclip company.

Agent Eval drops into any Paperclip agent that compares or benchmarks coding agents. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat, with no prompt engineering or tool wiring.

Source file

SKILL.md · 145 lines · markdown

---
name: agent-eval
description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

# Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes; this tool systematizes it.

## When to Activate

- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team

## Installation

> **Note:** Install agent-eval from its repository after reviewing the source.

## Core Concepts

### YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```

### Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.

### Metrics Collected

| Metric | What It Measures |
|--------|-----------------|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |

## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:
1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time

### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance, since agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task, because LLM judges add noise
- **Track cost alongside pass rate**: a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions**: they are test fixtures, treat them as code

## Links

- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)
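
As a rough illustration of how the four metrics in the Metrics Collected table could be derived from per-run records, here is a minimal Python sketch. It is not agent-eval's actual implementation: the `RunResult` record, the `summarize` helper, and the sample numbers are all hypothetical, and averaging cost and time across runs is an assumption about how per-task figures might be reported.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One agent run on one task (hypothetical record shape)."""
    agent: str
    passed: bool
    cost_usd: float
    seconds: float


def summarize(runs: list[RunResult]) -> dict[str, dict[str, object]]:
    """Group runs by agent and compute pass rate, cost, time, and consistency."""
    by_agent: dict[str, list[RunResult]] = {}
    for run in runs:
        by_agent.setdefault(run.agent, []).append(run)

    summary: dict[str, dict[str, object]] = {}
    for agent, agent_runs in by_agent.items():
        passes = sum(r.passed for r in agent_runs)
        total = len(agent_runs)
        summary[agent] = {
            "pass_rate": f"{passes}/{total}",
            # Assumption: cost and time are averaged over the repeated runs.
            "mean_cost_usd": round(mean(r.cost_usd for r in agent_runs), 2),
            "mean_seconds": round(mean(r.seconds for r in agent_runs)),
            # Consistency is the pass rate over repeated runs, as a percentage.
            "consistency_pct": round(100 * passes / total),
        }
    return summary


# Made-up run records, for illustration only.
runs = [
    RunResult("claude-code", True, 0.12, 44.0),
    RunResult("claude-code", True, 0.11, 47.0),
    RunResult("claude-code", True, 0.13, 44.0),
    RunResult("aider", True, 0.08, 36.0),
    RunResult("aider", False, 0.09, 41.0),
    RunResult("aider", True, 0.07, 37.0),
]

for agent, metrics in summarize(runs).items():
    print(agent, metrics)
```

With only three runs per agent, a single failure moves consistency from 100% to 67%, which is one reason to run more trials when the results are close.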

Related skills

Agent Harness Construction

Install Agent Harness Construction skill for Claude Code from affaan-m/everything-claude-code.

Agent Payment X402

Install Agent Payment X402 skill for Claude Code from affaan-m/everything-claude-code.

Agentic Engineering

Install Agentic Engineering skill for Claude Code from affaan-m/everything-claude-code.