npx skills add https://github.com/github/awesome-copilot --skill agentic-eval

How Agentic Eval fits into a Paperclip company
Agentic Eval drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
Pre-configured AI company — 18 agents, 18 skills, one-time purchase.
SKILL.md (189 lines)
---
name: agentic-eval
description: |
  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
  - Implementing self-critique and reflection loops
  - Building evaluator-optimizer pipelines for quality-critical generation
  - Creating test-driven code refinement workflows
  - Designing rubric-based or LLM-as-judge evaluation systems
  - Adding iterative improvement to agent outputs (code, reports, analysis)
  - Measuring and improving agent response quality
---

# Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

## Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

```
Generate → Evaluate → Critique → Refine → Output
    ↑                                        │
    └────────────────────────────────────────┘
```

## When to Use

- **Quality-critical generation**: Code, reports, analysis requiring high accuracy
- **Tasks with clear evaluation criteria**: Defined success metrics exist
- **Content requiring specific standards**: Style guides, compliance, formatting

---

## Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

```python
import json

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    # llm() is an assumed helper that sends a prompt to the model and returns its text
    output = llm(f"Complete this task:\n{task}")

    for i in range(max_iterations):
        # Self-critique against the supplied criteria, returned as JSON
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        critique_data = json.loads(critique)

        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output
```

**Key insight**: Use structured JSON output for reliable parsing of critique results.

---

## Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

```python
import json

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
```

---

## Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

```python
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            # run_tests() is an assumed helper that executes the tests and reports success/error
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")

        return code
```
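Pattern 3 leaves `run_tests` undefined. A minimal sketch of such a helper, assuming pytest is installed, that the generated tests import the candidate code from `solution.py`, and that executing model-generated code in a temporary directory is acceptable in your environment (the file names here are illustrative, not part of the skill):

```python
import subprocess
import tempfile
from pathlib import Path

def run_tests(code: str, tests: str, timeout: int = 60) -> dict:
    """Write the candidate code and its tests to a temp directory, run pytest, and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(code)
        (workdir / "test_solution.py").write_text(tests)

        proc = subprocess.run(
            ["python", "-m", "pytest", "-q", str(workdir)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )

    # Pattern 3 only needs a success flag plus an error message to feed back into the fix prompt
    return {"success": proc.returncode == 0, "error": proc.stdout + proc.stderr}
```

For anything beyond trusted specs, a sandbox (container or restricted subprocess) is worth adding before running generated code this way.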
---

## Evaluation Strategies

### Outcome-Based

Evaluate whether output achieves the expected result.

```python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
```

### LLM-as-Judge

Use an LLM to compare and rank outputs.

```python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")
```

### Rubric-Based

Score outputs against weighted dimensions.

```python
import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Expects the model to return JSON mapping each dimension to a 1-5 score
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
```

---

## Best Practices

| Practice | Rationale |
|----------|-----------|
| **Clear criteria** | Define specific, measurable evaluation criteria upfront |
| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |
| **Convergence check** | Stop if the output score isn't improving between iterations |
| **Log history** | Keep the full trajectory for debugging and analysis |
| **Structured output** | Use JSON for reliable parsing of evaluation results |

---

## Quick Start Checklist

```markdown
## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
```
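The Best Practices table and the Safety checklist call for convergence detection and graceful handling of evaluation parse failures, which none of the patterns above demonstrate. A minimal sketch of both, assuming the same `llm()` helper used throughout; the names `safe_parse`, `refine_until_converged`, and `min_gain` are illustrative:

```python
import json

def safe_parse(raw: str, fallback: dict) -> dict:
    """Return parsed JSON, or the fallback when the evaluator's reply isn't valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return fallback

def refine_until_converged(task: str, max_iterations: int = 3, min_gain: float = 0.02) -> str:
    # llm() is the same assumed model-call helper used in the patterns above
    output = llm(f"Complete: {task}")
    history = []          # keep the full trajectory for debugging and analysis
    prev_score = 0.0

    for i in range(max_iterations):
        raw = llm(f'Score this output 0-1 as JSON {{"score": ...}}.\nTask: {task}\nOutput: {output}')
        evaluation = safe_parse(raw, fallback={"score": prev_score})  # parse failure: reuse last score
        score = float(evaluation.get("score", prev_score))
        history.append({"iteration": i, "score": score, "output": output})

        # Convergence check: stop once the score stops improving meaningfully
        if i > 0 and score - prev_score < min_gain:
            break

        prev_score = score
        output = llm(f"Improve this output (current score: {score}):\n{output}")

    return output
```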
Add Educational Comments
Takes any code file and transforms it into a teaching resource by adding educational comments that explain syntax, design choices, and language concepts. Automa
Agent Governance
When your AI agents start calling APIs, touching databases, or executing shell commands, you need guardrails before something goes sideways. This gives you comp
AI Prompt Engineering Safety Review
The ai-prompt-engineering-safety-review skill analyzes AI prompts for safety risks, bias, security vulnerabilities, and effectiveness using a structured evaluat