Install
Terminal
$ npx skills add https://github.com/affaan-m/everything-claude-code --skill eval-harness

Works with Paperclip
How Eval Harness fits into a Paperclip company.
Eval Harness drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
SaaS Factory (Paired)
Pre-configured AI company — 18 agents, 18 skills, one-time purchase.
$27 (regular $59)
Explore pack

Source file
SKILL.md (270 lines)
---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
---

# Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

## When to Activate

- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions

## Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement

## Eval Types

### Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```

### Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

## Grader Types

### 1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```

### 2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

### 3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

## Metrics

### pass@k

"At least one success in k attempts"

- pass@1: First attempt success rate
- pass@3: Success within 3 attempts
- Typical target: pass@3 > 90%

### pass^k

"All k trials succeed"

- Higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths

## Eval Workflow

### 1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

### 2. Implement

Write code to pass the defined evals.

### 3. Evaluate

```bash
# Run capability evals
# [Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

### 4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

## Integration Patterns

### Pre-Implementation

```
/eval define feature-name
```

Creates eval definition file at `.claude/evals/feature-name.md`

### During Implementation

```
/eval check feature-name
```

Runs current evals and reports status

### Post-Implementation

```
/eval report feature-name
```

Generates full eval report

## Eval Storage

Store evals in the project:

```
.claude/
  evals/
    feature-xyz.md    # Eval definition
    feature-xyz.log   # Eval run history
    baseline.json     # Regression baselines
```

## Best Practices

1. **Define evals BEFORE coding** - Forces clear thinking about success criteria
2. **Run evals frequently** - Catch regressions early
3. **Track pass@k over time** - Monitor reliability trends
4. **Use code graders when possible** - Deterministic > probabilistic
5. **Human review for security** - Never fully automate security checks
6. **Keep evals fast** - Slow evals don't get run
7. **Version evals with code** - Evals are first-class artifacts

## Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
===============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```

## Product Evals (v1.8)

Use product evals when behavior quality cannot be captured by unit tests alone.

### Grader Types

1. Code grader (deterministic assertions)
2. Rule grader (regex/schema constraints)
3. Model grader (LLM-as-judge rubric)
4. Human grader (manual adjudication for ambiguous outputs)

### pass@k Guidance

- `pass@1`: direct reliability
- `pass@3`: practical reliability under controlled retries
- `pass^3`: stability test (all 3 runs must pass)

Recommended thresholds:

- Capability evals: pass@3 >= 0.90
- Regression evals: pass^3 = 1.00 for release-critical paths

### Eval Anti-Patterns

- Overfitting prompts to known eval examples
- Measuring only happy-path outputs
- Ignoring cost and latency drift while chasing pass rates
- Allowing flaky graders in release gates

### Minimal Eval Artifact Layout

- `.claude/evals/<feature>.md` definition
- `.claude/evals/<feature>.log` run history
- `docs/releases/<version>/eval-summary.md` release snapshot

Related skills
Agent Eval
Install Agent Eval skill for Claude Code from affaan-m/everything-claude-code.
Agent Harness Construction
Install Agent Harness Construction skill for Claude Code from affaan-m/everything-claude-code.
Agent Payment X402
Install Agent Payment X402 skill for Claude Code from affaan-m/everything-claude-code.
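The pass@k and pass^k metrics that the SKILL.md above relies on can be sketched in a few lines. This is a minimal illustration, not part of the eval-harness tooling: the function names (`pass_at_k`, `pass_hat_k`, `suite_rate`) and the sample results dictionary are hypothetical, chosen to mirror the report format in the "4. Report" step.

```python
# Illustrative sketch of the pass@k / pass^k metrics from the skill.
# Function and variable names are hypothetical, not part of any API.

def pass_at_k(trials: list[bool], k: int) -> bool:
    """pass@k: at least one success within the first k attempts."""
    return any(trials[:k])

def pass_hat_k(trials: list[bool], k: int) -> bool:
    """pass^k: all of the first k attempts succeed (the stability bar)."""
    return len(trials) >= k and all(trials[:k])

def suite_rate(results: dict[str, list[bool]], k: int, strict: bool = False) -> float:
    """Fraction of evals in a suite meeting pass^k (strict) or pass@k."""
    metric = pass_hat_k if strict else pass_at_k
    passed = sum(metric(trials, k) for trials in results.values())
    return passed / len(results)

# Trial histories matching the sample "EVAL REPORT: feature-xyz" above:
# validate-email failed once before passing, the others passed first try.
results = {
    "create-user":    [True],
    "validate-email": [False, True],
    "hash-password":  [True],
}
print(f"pass@1: {suite_rate(results, 1):.0%}")  # 67% (2/3)
print(f"pass@3: {suite_rate(results, 3):.0%}")  # 100% (3/3)
```

With `strict=True` the same helper enforces the pass^3 = 100% bar the skill recommends for release-critical regression evals.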