Claude Agent Skill · by Affaan M

Eval Harness

Install the Eval Harness skill for Claude Code from affaan-m/everything-claude-code.

Install
```shell
npx skills add https://github.com/affaan-m/everything-claude-code --skill eval-harness
```
Works with Paperclip

How Eval Harness fits into a Paperclip company.

Eval Harness drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Source file
SKILL.md · 270 lines
---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

# Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

## When to Activate

- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions

## Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement

## Eval Types

### Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
```

### Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

## Grader Types

### 1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```

### 2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

### 3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

## Metrics

### pass@k

"At least one success in k attempts"

- pass@1: first-attempt success rate
- pass@3: success within 3 attempts
- Typical target: pass@3 > 90%

### pass^k

"All k trials succeed"

- A higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths

## Eval Workflow

### 1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

### 2. Implement

Write code to pass the defined evals.

### 3. Evaluate

```bash
# Run capability evals
[Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

### 4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user:     PASS (pass@1)
  validate-email:  PASS (pass@2)
  hash-password:   PASS (pass@1)
  Overall:         3/3 passed

Regression Evals:
  login-flow:      PASS
  session-mgmt:    PASS
  logout-flow:     PASS
  Overall:         3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

## Integration Patterns

### Pre-Implementation

```
/eval define feature-name
```

Creates an eval definition file at `.claude/evals/feature-name.md`.

### During Implementation

```
/eval check feature-name
```

Runs the current evals and reports status.

### Post-Implementation

```
/eval report feature-name
```

Generates the full eval report.

## Eval Storage

Store evals in the project:

```
.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
```

## Best Practices

1. **Define evals BEFORE coding** - forces clear thinking about success criteria
2. **Run evals frequently** - catch regressions early
3. **Track pass@k over time** - monitor reliability trends
4. **Use code graders when possible** - deterministic > probabilistic
5. **Human review for security** - never fully automate security checks
6. **Keep evals fast** - slow evals don't get run
7. **Version evals with code** - evals are first-class artifacts

## Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```

## Product Evals (v1.8)

Use product evals when behavior quality cannot be captured by unit tests alone.

### Grader Types

1. Code grader (deterministic assertions)
2. Rule grader (regex/schema constraints)
3. Model grader (LLM-as-judge rubric)
4. Human grader (manual adjudication for ambiguous outputs)

### pass@k Guidance

- `pass@1`: direct reliability
- `pass@3`: practical reliability under controlled retries
- `pass^3`: stability test (all 3 runs must pass)

Recommended thresholds:

- Capability evals: pass@3 >= 0.90
- Regression evals: pass^3 = 1.00 for release-critical paths

### Eval Anti-Patterns

- Overfitting prompts to known eval examples
- Measuring only happy-path outputs
- Ignoring cost and latency drift while chasing pass rates
- Allowing flaky graders in release gates

### Minimal Eval Artifact Layout

- `.claude/evals/<feature>.md` definition
- `.claude/evals/<feature>.log` run history
- `docs/releases/<version>/eval-summary.md` release snapshot
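The pass@k and pass^k metrics described above can be measured with a small grader loop: run the same code-based grader k times and check whether any run passed (pass@k) and whether all runs passed (pass^k). This is a minimal POSIX-sh sketch, not part of the skill itself; the `true` on the grader line is a placeholder you would replace with a real check such as the `npm test` grader shown earlier.

```shell
# Sketch: run a grader k times and report pass@k / pass^k.
k=3
passes=0
i=1
while [ "$i" -le "$k" ]; do
  # Placeholder grader; substitute e.g.: npm test -- --testPathPattern="auth"
  if true; then
    passes=$((passes + 1))
  fi
  i=$((i + 1))
done

# pass@k: at least one of the k runs succeeded.
[ "$passes" -ge 1 ] && echo "pass@$k: PASS" || echo "pass@$k: FAIL"
# pass^k: every one of the k runs succeeded.
[ "$passes" -eq "$k" ] && echo "pass^$k: PASS" || echo "pass^$k: FAIL"
```

With a deterministic grader this distinction is moot; it matters for flaky or model-graded evals, where pass@3 can hide instability that pass^3 exposes.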
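A regression baseline check like the one the storage layout implies can also be sketched in a few lines of shell: flag any eval that passed at the baseline but fails now. The file names and the plain `name STATUS` line format here are assumptions for illustration (the skill stores baselines as `baseline.json`); the sample data is written inline so the sketch is self-contained.

```shell
# Sketch: regression gate comparing current results to a recorded baseline.
# File names and "name STATUS" format are illustrative assumptions.
baseline="baseline-results.txt"
current="current-results.txt"

# Sample data for the demo: session-mgmt regressed from PASS to FAIL.
printf 'login-flow PASS\nsession-mgmt PASS\n' > "$baseline"
printf 'login-flow PASS\nsession-mgmt FAIL\n' > "$current"

# A regression is any eval that passed at the baseline but fails now.
regressions=0
while read -r name status; do
  if [ "$status" = "PASS" ] && grep -q "^$name FAIL$" "$current"; then
    echo "REGRESSION: $name"
    regressions=$((regressions + 1))
  fi
done < "$baseline"
echo "regressions: $regressions"
```

Wiring a gate like this into CI enforces the pass^3 = 1.00 release-critical threshold mechanically instead of by convention.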