Claude Agent Skill · by GitHub

Agentic Eval

Implements self-critique loops where Claude generates output, evaluates it against your criteria, then refines based on its own feedback. Includes evaluator-optimizer pipelines, test-driven code refinement, and rubric-based scoring.

Install
Terminal · npx
$ npx skills add https://github.com/github/awesome-copilot --skill agentic-eval
Works with Paperclip

How Agentic Eval fits into a Paperclip company.

Agentic Eval drops into any Paperclip agent that handles output evaluation and refinement work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat, with no extra prompt engineering or tool wiring.

Source file
SKILL.md · 189 lines
---
name: agentic-eval
description: |
  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
  - Implementing self-critique and reflection loops
  - Building evaluator-optimizer pipelines for quality-critical generation
  - Creating test-driven code refinement workflows
  - Designing rubric-based or LLM-as-judge evaluation systems
  - Adding iterative improvement to agent outputs (code, reports, analysis)
  - Measuring and improving agent response quality
---

# Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

## Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

```
Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘
```

## When to Use

- **Quality-critical generation**: Code, reports, analysis requiring high accuracy
- **Tasks with clear evaluation criteria**: Defined success metrics exist
- **Content requiring specific standards**: Style guides, compliance, formatting

---

## Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

```python
import json

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    # `llm(prompt)` is the text-in/text-out model helper assumed throughout this file.
    output = llm(f"Complete this task:\n{task}")

    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)

        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output
```

**Key insight**: Use structured JSON output for reliable parsing of critique results.

---

## Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

```python
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
```
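To see the loop end to end, here is one hypothetical way to back the `llm()` helper and drive the class; the canned responses and the task string are illustrative only and not part of the skill.

```python
def llm(prompt: str) -> str:
    """Text-in/text-out helper the patterns in this file assume.

    Replace the body with a real model call; the canned responses below
    exist only so this example runs end to end.
    """
    if "Evaluate output" in prompt:
        return '{"overall_score": 0.9, "dimensions": {"accuracy": 0.9, "clarity": 0.9}}'
    return "Draft summary of this week's deployment incidents."

optimizer = EvaluatorOptimizer(score_threshold=0.85)
result = optimizer.run("Summarize this week's deployment incidents in one paragraph")
print(result)  # first draft already scores 0.9 >= 0.85, so the loop exits after one evaluation
```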
---

## Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

```python
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            # `run_tests` is assumed to execute the tests and return
            # {"success": bool, "error": str}.
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
```

---

## Evaluation Strategies

### Outcome-Based

Evaluate whether output achieves the expected result.

```python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
```

### LLM-as-Judge

Use an LLM to compare and rank outputs.

```python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?\nA: {output_a}\nB: {output_b}")
```

### Rubric-Based

Score outputs against weighted dimensions.

```python
RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Expects JSON like {"accuracy": 4, "clarity": 5, "completeness": 3}.
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    # Weighted average of 1-5 ratings, normalized to 0-1.
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
```

---

## Best Practices

| Practice | Rationale |
|----------|-----------|
| **Clear criteria** | Define specific, measurable evaluation criteria upfront |
| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |
| **Convergence check** | Stop if the output score isn't improving between iterations |
| **Log history** | Keep the full trajectory for debugging and analysis |
| **Structured output** | Use JSON for reliable parsing of evaluation results |

---

## Quick Start Checklist

```markdown
## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
```
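The three Safety items map directly onto the `run()` loop from Pattern 2. The sketch below is one possible way to wire them together; `safe_evaluate`, `run_with_safety`, and the `min_improvement` default are illustrative names and choices, not part of this skill.

```python
import json

def safe_evaluate(optimizer: EvaluatorOptimizer, output: str, task: str) -> dict:
    """Evaluate, but degrade gracefully when the critique isn't valid JSON."""
    try:
        return optimizer.evaluate(output, task)
    except json.JSONDecodeError:
        # Parse failure: report a zero score so the loop keeps refining.
        return {"overall_score": 0.0, "dimensions": {}, "parse_error": True}

def run_with_safety(optimizer: EvaluatorOptimizer, task: str,
                    max_iterations: int = 3, min_improvement: float = 0.01) -> str:
    history = []                          # full trajectory for debugging
    output = optimizer.generate(task)
    best_score = -1.0

    for i in range(max_iterations):
        evaluation = safe_evaluate(optimizer, output, task)
        score = evaluation.get("overall_score", 0.0)
        history.append({"iteration": i, "score": score, "output": output})

        if score >= optimizer.score_threshold:
            break                         # good enough
        if i > 0 and score - best_score < min_improvement:
            break                         # convergence: no meaningful improvement
        best_score = max(best_score, score)
        output = optimizer.optimize(output, evaluation)

    return output
```

Returning the best-scoring entry from `history` instead of the last draft is another reasonable variant if later refinements sometimes regress.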