Claude Agent Skill · by Affaan M

Agent Eval

Install the Agent Eval skill for Claude Code from affaan-m/everything-claude-code.

Install
Terminal · npx
$ npx skills add https://github.com/affaan-m/everything-claude-code --skill agent-eval

Source file: SKILL.md (145 lines)

---
name: agent-eval
description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

# Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Most "which coding agent is best?" comparisons run on vibes; this tool systematizes them.

## When to Activate

- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team

## Installation

> **Note:** Install agent-eval from its repository after reviewing the source.

## Core Concepts

### YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```

### Git Worktree Isolation

Each agent run gets its own git worktree, with no Docker required. This isolation keeps runs reproducible: agents cannot interfere with each other or corrupt the base repo.

### Metrics Collected

| Metric | What It Measures |
|--------|-----------------|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |

## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:

1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time

### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance, since agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task, because LLM judges add noise
- **Track cost alongside pass rate**: a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions**; they are test fixtures, so treat them as code

## Links

- Repository: [github.com/joaquinhuigomez/agent-eval](https://github.com/joaquinhuigomez/agent-eval)
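
To make the four metrics from "Metrics Collected" concrete, here is a minimal Python sketch of how per-run records might be rolled up into the rows of the comparison report. It is not agent-eval's actual implementation: the `RunResult` record, the `summarize` function, and the choice to report cost and time as per-run means are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    """One agent run on one task (hypothetical record shape, not agent-eval's)."""
    agent: str
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent reports no API spend


def summarize(results: list[RunResult]) -> dict[str, dict]:
    """Group runs by agent and compute pass rate, consistency, cost, and time."""
    by_agent: dict[str, list[RunResult]] = {}
    for r in results:
        by_agent.setdefault(r.agent, []).append(r)

    summary: dict[str, dict] = {}
    for agent, runs in by_agent.items():
        passes = sum(r.passed for r in runs)
        costs = [r.cost_usd for r in runs if r.cost_usd is not None]
        summary[agent] = {
            "pass_rate": f"{passes}/{len(runs)}",
            # Consistency is the pass rate across repeated runs, as a percentage.
            "consistency_pct": round(100 * passes / len(runs)),
            # Assumption: cost and time are averaged per run; cost may be missing.
            "mean_cost_usd": round(mean(costs), 2) if costs else None,
            "mean_seconds": round(mean([r.seconds for r in runs])),
        }
    return summary


if __name__ == "__main__":
    # Made-up runs chosen to mirror the sample report table.
    runs = [
        RunResult("claude-code", True, 45.0, 0.12),
        RunResult("claude-code", True, 45.0, 0.12),
        RunResult("claude-code", True, 45.0, 0.12),
        RunResult("aider", True, 36.0, 0.08),
        RunResult("aider", True, 38.0, 0.08),
        RunResult("aider", False, 40.0, 0.08),
    ]
    for agent, row in summarize(runs).items():
        print(agent, row)
```

With only three runs per agent, a single failure moves consistency from 100% to 67%, which is why the best practices call for at least three trials and for reading pass rate next to cost.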