Claude Agent Skill · by GitHub

Agentic Eval

Implements self-critique loops where Claude generates output, evaluates it against your criteria, then refines based on its own feedback. Includes evaluator-optimizer pipelines, test-driven code refinement, and rubric-based scoring.

Install
Terminal · npx
$ npx skills add https://github.com/github/awesome-copilot --skill agentic-eval
Works with Paperclip

How Agentic Eval fits into a Paperclip company.

Agentic Eval drops into any Paperclip agent that handles output evaluation and refinement work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat, with no extra prompt engineering or tool wiring.

Source file
SKILL.md · 189 lines
---
name: agentic-eval
description: |
  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
  - Implementing self-critique and reflection loops
  - Building evaluator-optimizer pipelines for quality-critical generation
  - Creating test-driven code refinement workflows
  - Designing rubric-based or LLM-as-judge evaluation systems
  - Adding iterative improvement to agent outputs (code, reports, analysis)
  - Measuring and improving agent response quality
---

# Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

## Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

```
Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘
```

## When to Use

- **Quality-critical generation**: Code, reports, analysis requiring high accuracy
- **Tasks with clear evaluation criteria**: Defined success metrics exist
- **Content requiring specific standards**: Style guides, compliance, formatting

---

## Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

```python
import json

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    # `llm(prompt)` is the text-in/text-out model helper assumed throughout this file.
    output = llm(f"Complete this task:\n{task}")

    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)

        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output
```

**Key insight**: Use structured JSON output for reliable parsing of critique results.

---

## Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

```python
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
```
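To see the loop end to end, here is one hypothetical way to back the `llm()` helper and drive the class; the canned responses and the task string are illustrative only and not part of the skill.

```python
def llm(prompt: str) -> str:
    """Text-in/text-out helper the patterns in this file assume.

    Replace the body with a real model call; the canned responses below
    exist only so this example runs end to end.
    """
    if "Evaluate output" in prompt:
        return '{"overall_score": 0.9, "dimensions": {"accuracy": 0.9, "clarity": 0.9}}'
    return "Draft summary of this week's deployment incidents."

optimizer = EvaluatorOptimizer(score_threshold=0.85)
result = optimizer.run("Summarize this week's deployment incidents in one paragraph")
print(result)  # first draft already scores 0.9 >= 0.85, so the loop exits after one evaluation
```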
---

## Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

```python
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            # `run_tests` is assumed to execute the tests and return
            # {"success": bool, "error": str}.
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
```

---

## Evaluation Strategies

### Outcome-Based

Evaluate whether output achieves the expected result.

```python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
```

### LLM-as-Judge

Use an LLM to compare and rank outputs.

```python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?\nA: {output_a}\nB: {output_b}")
```

### Rubric-Based

Score outputs against weighted dimensions.

```python
RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Expects JSON like {"accuracy": 4, "clarity": 5, "completeness": 3}.
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    # Weighted average of 1-5 ratings, normalized to 0-1.
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
```

---

## Best Practices

| Practice | Rationale |
|----------|-----------|
| **Clear criteria** | Define specific, measurable evaluation criteria upfront |
| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |
| **Convergence check** | Stop if the output score isn't improving between iterations |
| **Log history** | Keep the full trajectory for debugging and analysis |
| **Structured output** | Use JSON for reliable parsing of evaluation results |

---

## Quick Start Checklist

```markdown
## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
```
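The three Safety items map directly onto the `run()` loop from Pattern 2. The sketch below is one possible way to wire them together; `safe_evaluate`, `run_with_safety`, and the `min_improvement` default are illustrative names and choices, not part of this skill.

```python
import json

def safe_evaluate(optimizer: EvaluatorOptimizer, output: str, task: str) -> dict:
    """Evaluate, but degrade gracefully when the critique isn't valid JSON."""
    try:
        return optimizer.evaluate(output, task)
    except json.JSONDecodeError:
        # Parse failure: report a zero score so the loop keeps refining.
        return {"overall_score": 0.0, "dimensions": {}, "parse_error": True}

def run_with_safety(optimizer: EvaluatorOptimizer, task: str,
                    max_iterations: int = 3, min_improvement: float = 0.01) -> str:
    history = []                          # full trajectory for debugging
    output = optimizer.generate(task)
    best_score = -1.0

    for i in range(max_iterations):
        evaluation = safe_evaluate(optimizer, output, task)
        score = evaluation.get("overall_score", 0.0)
        history.append({"iteration": i, "score": score, "output": output})

        if score >= optimizer.score_threshold:
            break                         # good enough
        if i > 0 and score - best_score < min_improvement:
            break                         # convergence: no meaningful improvement
        best_score = max(best_score, score)
        output = optimizer.optimize(output, evaluation)

    return output
```

Returning the best-scoring entry from `history` instead of the last draft is another reasonable variant if later refinements sometimes regress.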