Claude Agent Skill · by Wshobson

LLM Evaluation

You know those moments when you deploy an LLM change and wonder whether you just made things better or worse? This skill helps you measure that with real numbers.

Install

Run in a terminal:

```
npx skills add https://github.com/wshobson/agents --skill llm-evaluation
```
Works with Paperclip

How LLM Evaluation fits into a Paperclip company.

LLM Evaluation drops into any Paperclip agent that handles evaluation work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat: no prompt engineering, no tool wiring.

SaaS Factory (paired pack): a pre-configured AI company with 18 agents and 18 skills, one-time purchase for $27 (regularly $59).
Source file: SKILL.md (666 lines)
---
name: llm-evaluation
description: Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
---

# LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

## When to Use This Skill

- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

## Core Evaluation Types

### 1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

**Text Generation:**

- **BLEU**: N-gram overlap (translation)
- **ROUGE**: Recall-oriented (summarization)
- **METEOR**: Semantic similarity
- **BERTScore**: Embedding-based similarity
- **Perplexity**: Language model confidence

**Classification:**

- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality

**Retrieval (RAG):**

- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumulative Gain
- **Precision@K**: Relevant in top K
- **Recall@K**: Coverage in top K

### 2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

**Dimensions:**

- **Accuracy**: Factual correctness
- **Coherence**: Logical flow
- **Relevance**: Answers the question
- **Fluency**: Natural language quality
- **Safety**: No harmful content
- **Helpfulness**: Useful to the user

### 3. LLM-as-Judge

Use stronger LLMs to evaluate weaker model outputs.

**Approaches:**

- **Pointwise**: Score individual responses
- **Pairwise**: Compare two responses
- **Reference-based**: Compare to gold standard
- **Reference-free**: Judge without ground truth

## Quick Start

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        # calculate_accuracy is not defined in this file;
        # an exact-match check against the reference is a common choice
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)


class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            # Every metric function takes (prediction, reference, context)
            # as keyword arguments and ignores whatever it does not need
            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }


# Usage (calculate_bleu, calculate_bertscore, and calculate_groundedness
# are implemented in the sections below)
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", calculate_groundedness)
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
]

results = await suite.evaluate(model=your_model, test_cases=test_cases)
```

## Automated Metrics Implementation

### BLEU Score

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def calculate_bleu(prediction: str, reference: str, **kwargs) -> float:
    """Calculate BLEU score between reference and prediction."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        prediction.split(),
        smoothing_function=smoothie
    )
```

### ROUGE Score

```python
from rouge_score import rouge_scorer


def calculate_rouge(prediction: str, reference: str, **kwargs) -> dict:
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'],
        use_stemmer=True
    )
    scores = scorer.score(reference, prediction)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
```

### BERTScore

```python
from bert_score import score


def calculate_bertscore(prediction: str, reference: str, **kwargs) -> float:
    """Calculate BERTScore F1 using a pre-trained model."""
    # bert_score operates on batches, so wrap the single pair in lists
    P, R, F1 = score(
        [prediction],
        [reference],
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    # Return F1 so EvaluationSuite can average it; precision and
    # recall are available from P and R the same way
    return F1.mean().item()
```
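### Retrieval Metrics

The retrieval metrics listed under Core Evaluation Types are not implemented elsewhere in this skill, so here is a minimal sketch of MRR, Precision@K, and Recall@K (NDCG is omitted for brevity). The function names and the ranked-document-ID representation are illustrative assumptions; average the per-query values across your test set to get the aggregate metric.

```python
def calculate_mrr(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank for one query: 1/rank of the first
    relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def calculate_precision_at_k(
    ranked_ids: list[str], relevant_ids: set[str], k: int
) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    # Divide by the actual cutoff so result lists shorter than k
    # are not penalized twice
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)


def calculate_recall_at_k(
    ranked_ids: list[str], relevant_ids: set[str], k: int
) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)
```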
### Custom Metrics

```python
def calculate_groundedness(prediction: str, context: str, **kwargs) -> float:
    """Check if the prediction is grounded in the provided context."""
    from transformers import pipeline

    # Loads the model on every call; cache the pipeline in production
    nli = pipeline(
        "text-classification",
        model="microsoft/deberta-large-mnli"
    )

    result = nli(f"{context} [SEP] {prediction}")[0]

    # Return confidence that the prediction is entailed by the context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0


def calculate_toxicity(text: str, **kwargs) -> float:
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score


def calculate_factuality(claim: str, sources: list[str], **kwargs) -> float:
    """Verify factual claims against sources."""
    from transformers import pipeline

    nli = pipeline("text-classification", model="facebook/bart-large-mnli")

    scores = []
    for source in sources:
        result = nli(f"{source}</s></s>{claim}")[0]
        if result['label'] == 'entailment':
            scores.append(result['score'])

    return max(scores) if scores else 0.0
```
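### Perplexity

Perplexity, the remaining metric from the taxonomy above, can be computed with any causal language model. A minimal sketch using Hugging Face transformers follows; `gpt2` is an illustrative scoring model, not a recommendation, and in real use the model should be cached rather than reloaded per call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def calculate_perplexity(text: str, model_name: str = "gpt2", **kwargs) -> float:
    """Perplexity of `text` under a causal LM (lower is better)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model compute mean cross-entropy loss
        outputs = model(**inputs, labels=inputs["input_ids"])

    return torch.exp(outputs.loss).item()
```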
## LLM-as-Judge Patterns

### Single Output Evaluation

```python
import json
from typing import Optional

from anthropic import Anthropic
from pydantic import BaseModel, Field


class QualityRating(BaseModel):
    accuracy: int = Field(ge=1, le=10, description="Factual correctness")
    helpfulness: int = Field(ge=1, le=10, description="Answers the question")
    clarity: int = Field(ge=1, le=10, description="Well-written and understandable")
    reasoning: str = Field(description="Brief explanation")


async def llm_judge_quality(
    response: str,
    question: str,
    context: Optional[str] = None
) -> QualityRating:
    """Use Claude to judge response quality."""
    client = Anthropic()

    system = """You are an expert evaluator of AI responses.
    Rate responses on accuracy, helpfulness, and clarity (1-10 scale).
    Provide brief reasoning for your ratings."""

    prompt = f"""Rate the following response:

Question: {question}
{f'Context: {context}' if context else ''}
Response: {response}

Provide ratings in JSON format:
{{
  "accuracy": <1-10>,
  "helpfulness": <1-10>,
  "clarity": <1-10>,
  "reasoning": "<brief explanation>"
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )

    # Assumes the model returns bare JSON; strip markdown fences if needed
    return QualityRating(**json.loads(message.content[0].text))
```

### Pairwise Comparison

```python
from typing import Literal

from pydantic import BaseModel, Field


class ComparisonResult(BaseModel):
    winner: Literal["A", "B", "tie"]
    reasoning: str
    confidence: int = Field(ge=1, le=10)


async def compare_responses(
    question: str,
    response_a: str,
    response_b: str
) -> ComparisonResult:
    """Compare two responses using an LLM judge."""
    client = Anthropic()

    prompt = f"""Compare these two responses and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Consider accuracy, helpfulness, and clarity. Answer with JSON:
{{
  "winner": "A" or "B" or "tie",
  "reasoning": "<explanation>",
  "confidence": <1-10>
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ComparisonResult(**json.loads(message.content[0].text))
```
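LLM judges tend to favor whichever response is shown first. A common mitigation, sketched here on top of `compare_responses` (the wrapper name is illustrative, not part of the original skill), is to judge both presentation orders and treat disagreement as a tie:

```python
async def compare_responses_order_robust(
    question: str,
    response_a: str,
    response_b: str
) -> ComparisonResult:
    """Run the pairwise judge in both orders to reduce position bias."""
    forward = await compare_responses(question, response_a, response_b)
    backward = await compare_responses(question, response_b, response_a)

    # Map the swapped-order verdict back to the original labels
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[backward.winner]

    if forward.winner == unswapped:
        return forward  # Both orders agree

    return ComparisonResult(
        winner="tie",
        reasoning="Verdict flipped with presentation order (likely position bias).",
        confidence=min(forward.confidence, backward.confidence)
    )
```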
### Reference-Based Evaluation

```python
class ReferenceEvaluation(BaseModel):
    semantic_similarity: float = Field(ge=0, le=1)
    factual_accuracy: float = Field(ge=0, le=1)
    completeness: float = Field(ge=0, le=1)
    issues: list[str]


async def evaluate_against_reference(
    response: str,
    reference: str,
    question: str
) -> ReferenceEvaluation:
    """Evaluate a response against a gold-standard reference."""
    client = Anthropic()

    prompt = f"""Compare the response to the reference answer.

Question: {question}
Reference Answer: {reference}
Response to Evaluate: {response}

Evaluate:
1. Semantic similarity (0-1): How similar is the meaning?
2. Factual accuracy (0-1): Are all facts correct?
3. Completeness (0-1): Does it cover all key points?
4. List any specific issues or errors.

Respond in JSON:
{{
  "semantic_similarity": <0-1>,
  "factual_accuracy": <0-1>,
  "completeness": <0-1>,
  "issues": ["issue1", "issue2"]
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ReferenceEvaluation(**json.loads(message.content[0].text))
```

## Human Evaluation Frameworks

### Annotation Guidelines

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotationTask:
    """Structure for a human annotation task."""
    response: str
    question: str
    context: Optional[str] = None

    def get_annotation_form(self) -> dict:
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
```

### Inter-Rater Agreement

```python
from sklearn.metrics import cohen_kappa_score


def calculate_agreement(
    rater1_scores: list[int],
    rater2_scores: list[int]
) -> dict:
    """Calculate inter-rater agreement between two raters."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    # Standard Landis & Koch interpretation bands
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```
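Cohen's kappa covers exactly two raters. For three or more, Fleiss' kappa is the standard extension; a sketch using statsmodels follows (the function name and the rating-matrix layout are assumptions of this example):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def calculate_multi_rater_agreement(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for three or more raters.

    `ratings` has one row per annotated item and one column per rater,
    e.g. [[3, 3, 4], [1, 2, 1]] for two items rated by three raters.
    """
    # aggregate_raters turns rater columns into per-category counts
    table, _ = aggregate_raters(np.asarray(ratings))
    return fleiss_kappa(table)
```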
## A/B Testing

### Statistical Testing Framework

```python
from dataclasses import dataclass, field

import numpy as np
from scipy import stats


@dataclass
class ABTest:
    variant_a_name: str = "A"
    variant_b_name: str = "B"
    variant_a_scores: list[float] = field(default_factory=list)
    variant_b_scores: list[float] = field(default_factory=list)

    def add_result(self, variant: str, score: float):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a_scores.append(score)
        else:
            self.variant_b_scores.append(score)

    def analyze(self, alpha: float = 0.05) -> dict:
        """Perform statistical analysis."""
        a_scores = np.array(self.variant_a_scores)
        b_scores = np.array(self.variant_b_scores)

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self._interpret_cohens_d(cohens_d),
            "winner": self.variant_b_name if np.mean(b_scores) > np.mean(a_scores) else self.variant_a_name
        }

    @staticmethod
    def _interpret_cohens_d(d: float) -> str:
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```

## Regression Testing

### Regression Detection

```python
from dataclasses import dataclass


@dataclass
class RegressionResult:
    metric: str
    baseline: float
    current: float
    change: float
    is_regression: bool


class RegressionDetector:
    def __init__(self, baseline_results: dict, threshold: float = 0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results: dict) -> dict:
        """Detect whether new results show a regression."""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            if new_score is None:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            is_regression = relative_change < -self.threshold
            if is_regression:
                regressions.append(RegressionResult(
                    metric=metric,
                    baseline=baseline_score,
                    current=new_score,
                    change=relative_change,
                    is_regression=True
                ))

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions,
            "summary": f"{len(regressions)} metric(s) regressed"
        }
```

## LangSmith Evaluation Integration

```python
from langsmith import Client
from langsmith.evaluation import aevaluate, LangChainStringEvaluator

# Initialize LangSmith client
client = Client()

# Create dataset
dataset = client.create_dataset("qa_test_cases")
client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in expected_answers],
    dataset_id=dataset.id
)

# Define evaluators
evaluators = [
    LangChainStringEvaluator("qa"),          # QA correctness
    LangChainStringEvaluator("context_qa"),  # Context-grounded QA
    LangChainStringEvaluator("cot_qa"),      # Chain-of-thought QA
]

# Run evaluation (aevaluate is the async counterpart of evaluate,
# needed because the target function is async)
async def target_function(inputs: dict) -> dict:
    result = await your_chain.ainvoke(inputs)
    return {"answer": result}

experiment_results = await aevaluate(
    target_function,
    data=dataset.name,
    evaluators=evaluators,
    experiment_prefix="v1.0.0",
    metadata={"model": "claude-sonnet-4-6", "version": "1.0.0"}
)

print(f"Mean score: {experiment_results.aggregate_metrics['qa']['mean']}")
```

## Benchmarking

### Running Benchmarks

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BenchmarkResult:
    metric: str
    mean: float
    std: float
    min: float
    max: float


class BenchmarkRunner:
    def __init__(self, benchmark_dataset: list[dict]):
        self.dataset = benchmark_dataset

    async def run_benchmark(
        self,
        model,
        metrics: list[Metric]
    ) -> dict[str, BenchmarkResult]:
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = await model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: BenchmarkResult(
                metric=metric,
                mean=np.mean(scores),
                std=np.std(scores),
                min=min(scores),
                max=max(scores)
            )
            for metric, scores in results.items()
        }
```
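To close the loop, a usage sketch; the dataset rows and `your_model` are placeholders, and `Metric` comes from the Quick Start:

```python
# Hypothetical benchmark data: "input" and "reference" are required,
# "context" is optional
runner = BenchmarkRunner([
    {"input": "What is the capital of France?", "reference": "Paris"},
    {"input": "Who wrote Hamlet?", "reference": "William Shakespeare"},
])

results = await runner.run_benchmark(
    model=your_model,
    metrics=[Metric.bleu(), Metric.bertscore()]
)

for name, result in results.items():
    print(f"{name}: mean={result.mean:.3f} (std={result.std:.3f})")
```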