Claude Agent Skill · by Affaan M

Santa Method

Install the Santa Method skill for Claude Code from affaan-m/everything-claude-code.

Works with Paperclip

How Santa Method fits into a Paperclip company.

Santa Method drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Source file: SKILL.md (306 lines)
---
name: santa-method
description: "Multi-agent adversarial verification with convergence loop. Two independent review agents must both pass before output ships."
origin: "Ronald Skelton - Founder, RapportScore.ai"
---

# Santa Method

Multi-agent adversarial verification framework. Make a list, check it twice. If it's naughty, fix it until it's nice.

The core insight: a single agent reviewing its own output shares the same biases, knowledge gaps, and systematic errors that produced the output. Two independent reviewers with no shared context break this failure mode.

## When to Activate

Invoke this skill when:

- Output will be published, deployed, or consumed by end users
- Compliance, regulatory, or brand constraints must be enforced
- Code ships to production without human review
- Content accuracy matters (technical docs, educational material, customer-facing copy)
- Batch generation at scale where spot-checking misses systemic patterns
- Hallucination risk is elevated (claims, statistics, API references, legal language)

Do NOT use for internal drafts, exploratory research, or tasks with deterministic verification (use build/test/lint pipelines for those).
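The activation criteria above can be reduced to a small gating predicate. A minimal sketch, assuming hypothetical signal names; nothing here is part of the skill's API:

```python
# Illustrative gating helper. The signal names mirror the activation list
# above; they are made up for this sketch, not defined by the skill.
HIGH_STAKES_SIGNALS = {
    "published_output",      # published, deployed, or user-facing
    "compliance_bound",      # regulatory or brand constraints
    "unreviewed_prod_code",  # ships to production without human review
    "accuracy_critical",     # technical docs, educational, customer-facing copy
    "batch_at_scale",        # spot-checking misses systemic patterns
    "hallucination_risk",    # claims, statistics, API references, legal language
}

def should_run_santa(signals: set[str], deterministic_checks_exist: bool) -> bool:
    """Invoke Santa only for high-stakes output lacking a deterministic gate."""
    if deterministic_checks_exist:
        return False  # prefer build/test/lint pipelines for those tasks
    return bool(signals & HIGH_STAKES_SIGNALS)
```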
## Architecture

```
┌──────────────┐
│  GENERATOR   │   Phase 1: Make a List
│  (Agent A)   │   Produce the deliverable
└──────┬───────┘
       │ output
       ▼
┌───────────────────────────────────┐
│      DUAL INDEPENDENT REVIEW      │   Phase 2: Check It Twice
│                                   │
│  ┌────────────┐   ┌────────────┐  │   Two agents, same rubric,
│  │ Reviewer B │   │ Reviewer C │  │   no shared context
│  └─────┬──────┘   └─────┬──────┘  │
└────────┼────────────────┼─────────┘
         │                │
         ▼                ▼
┌───────────────────────────────────┐
│            VERDICT GATE           │   Phase 3: Naughty or Nice
│                                   │
│  B passes AND C passes → NICE     │   Both must pass.
│  Otherwise → NAUGHTY              │   No exceptions.
└───────┬───────────────────┬───────┘
        │                   │
      NICE               NAUGHTY
        │                   │
        ▼                   ▼
    [ SHIP ]        ┌──────────────┐
                    │  FIX CYCLE   │   Phase 4: Fix Until Nice
                    │              │
                    │ iteration++  │   Collect all flags.
                    │ if i > MAX:  │   Fix all issues.
                    │   escalate   │   Re-run both reviewers.
                    │ else:        │   Loop until convergence.
                    │   goto Ph.2  │
                    └──────────────┘
```

## Phase Details

### Phase 1: Make a List (Generate)

Execute the primary task. No changes to your normal generation workflow. Santa Method is a post-generation verification layer, not a generation strategy.

```python
# The generator runs as normal
output = generate(task_spec)
```

### Phase 2: Check It Twice (Independent Dual Review)

Spawn two review agents in parallel. Critical invariants:

1. **Context isolation** — neither reviewer sees the other's assessment
2. **Identical rubric** — both receive the same evaluation criteria
3. **Same inputs** — both receive the original spec AND the generated output
4. **Structured output** — each returns a typed verdict, not prose

```python
# NOTE: JSON braces are doubled ({{ }}) so str.format leaves them literal.
REVIEWER_PROMPT = """You are an independent quality reviewer. You have NOT seen any other review of this output.

## Task Specification
{task_spec}

## Output Under Review
{output}

## Evaluation Rubric
{rubric}

## Instructions
Evaluate the output against EACH rubric criterion. For each:
- PASS: criterion fully met, no issues
- FAIL: specific issue found (cite the exact problem)

Return your assessment as structured JSON:
{{
  "verdict": "PASS" | "FAIL",
  "checks": [
    {{"criterion": "...", "result": "PASS|FAIL", "detail": "..."}}
  ],
  "critical_issues": ["..."],   // blockers that must be fixed
  "suggestions": ["..."]        // non-blocking improvements
}}

Be rigorous. Your job is to find problems, not to approve."""
```

```python
# Spawn reviewers in parallel (Claude Code subagents)
review_b = Agent(prompt=REVIEWER_PROMPT.format(...), description="Santa Reviewer B")
review_c = Agent(prompt=REVIEWER_PROMPT.format(...), description="Santa Reviewer C")

# Both run concurrently — neither sees the other
```

### Rubric Design

The rubric is the most important input. Vague rubrics produce vague reviews. Every criterion must have an objective pass/fail condition.
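One way to keep criteria objective is to carry the rubric as structured data and render it into the reviewer prompt. A minimal sketch; the criterion texts come from this document's rubric, while the `RUBRIC` structure and `render_rubric` helper are illustrative, not part of the skill:

```python
# Illustrative: rubric as (criterion, pass condition) pairs, rendered into
# a numbered section for the reviewer prompt.
RUBRIC = [
    ("Factual accuracy", "All claims verifiable against source material or common knowledge"),
    ("Hallucination-free", "No fabricated entities, quotes, URLs, or references"),
    ("Completeness", "Every requirement in the spec is addressed"),
    ("Internal consistency", "No contradictions within the output"),
]

def render_rubric(rubric: list[tuple[str, str]]) -> str:
    """Format each criterion as a numbered line with its explicit pass condition."""
    return "\n".join(
        f"{i}. {name}: PASS only if {cond}"
        for i, (name, cond) in enumerate(rubric, start=1)
    )
```

Keeping the rubric as data also makes it reusable across both reviewers, which is exactly the "identical rubric" invariant above.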
| Criterion | Pass Condition | Failure Signal |
|-----------|----------------|----------------|
| Factual accuracy | All claims verifiable against source material or common knowledge | Invented statistics, wrong version numbers, nonexistent APIs |
| Hallucination-free | No fabricated entities, quotes, URLs, or references | Links to pages that don't exist, attributed quotes with no source |
| Completeness | Every requirement in the spec is addressed | Missing sections, skipped edge cases, incomplete coverage |
| Compliance | Passes all project-specific constraints | Banned terms used, tone violations, regulatory non-compliance |
| Internal consistency | No contradictions within the output | Section A says X, section B says not-X |
| Technical correctness | Code compiles/runs, algorithms are sound | Syntax errors, logic bugs, wrong complexity claims |

#### Domain-Specific Rubric Extensions

**Content/Marketing:**
- Brand voice adherence
- SEO requirements met (keyword density, meta tags, structure)
- No competitor trademark misuse
- CTA present and correctly linked

**Code:**
- Type safety (no `any` leaks, proper null handling)
- Error handling coverage
- Security (no secrets in code, input validation, injection prevention)
- Test coverage for new paths

**Compliance-Sensitive (regulated, legal, financial):**
- No outcome guarantees or unsubstantiated claims
- Required disclaimers present
- Approved terminology only
- Jurisdiction-appropriate language

### Phase 3: Naughty or Nice (Verdict Gate)

```python
def santa_verdict(review_b, review_c):
    """Both reviewers must pass. No partial credit."""
    if review_b.verdict == "PASS" and review_c.verdict == "PASS":
        return "NICE", [], []  # Ship it

    # Merge flags from both reviewers, deduplicate
    all_issues = dedupe(review_b.critical_issues + review_c.critical_issues)
    all_suggestions = dedupe(review_b.suggestions + review_c.suggestions)

    return "NAUGHTY", all_issues, all_suggestions
```

Why both must pass: if only one reviewer catches an issue, that issue is real. The other reviewer's blind spot is exactly the failure mode Santa Method exists to eliminate.

### Phase 4: Fix Until Nice (Convergence Loop)

```python
MAX_ITERATIONS = 3

for iteration in range(MAX_ITERATIONS):
    verdict, issues, suggestions = santa_verdict(review_b, review_c)

    if verdict == "NICE":
        log_santa_result(output, iteration, "passed")
        return ship(output)

    # Fix all critical issues (suggestions are optional)
    output = fix_agent.execute(
        output=output,
        issues=issues,
        instruction="Fix ONLY the flagged issues. Do not refactor or add unrequested changes."
    )

    # Re-run BOTH reviewers on fixed output (fresh agents, no memory of previous round)
    review_b = Agent(prompt=REVIEWER_PROMPT.format(output=output, ...))
    review_c = Agent(prompt=REVIEWER_PROMPT.format(output=output, ...))

# Exhausted iterations — escalate
log_santa_result(output, MAX_ITERATIONS, "escalated")
escalate_to_human(output, issues)
```

Critical: each review round uses **fresh agents**. Reviewers must not carry memory from previous rounds, as prior context creates anchoring bias.

## Implementation Patterns

### Pattern A: Claude Code Subagents (Recommended)

Subagents provide true context isolation. Each reviewer is a separate process with no shared state.
```bash
# In a Claude Code session, use the Agent tool to spawn reviewers
# Both agents run in parallel for speed
```

```python
# Pseudocode for Agent tool invocation
reviewer_b = Agent(
    description="Santa Review B",
    prompt=f"Review this output for quality...\n\nRUBRIC:\n{rubric}\n\nOUTPUT:\n{output}"
)
reviewer_c = Agent(
    description="Santa Review C",
    prompt=f"Review this output for quality...\n\nRUBRIC:\n{rubric}\n\nOUTPUT:\n{output}"
)
```

### Pattern B: Sequential Inline (Fallback)

When subagents aren't available, simulate isolation with explicit context resets:

1. Generate output
2. New context: "You are Reviewer 1. Evaluate ONLY against this rubric. Find problems."
3. Record findings verbatim
4. Clear context completely
5. New context: "You are Reviewer 2. Evaluate ONLY against this rubric. Find problems."
6. Compare both reviews, fix, repeat

The subagent pattern is strictly superior — inline simulation risks context bleed between reviewers.

### Pattern C: Batch Sampling

For large batches (100+ items), full Santa on every item is cost-prohibitive. Use stratified sampling:

1. Run Santa on a random sample (10-15% of batch, minimum 5 items)
2. Categorize failures by type (hallucination, compliance, completeness, etc.)
3. If systematic patterns emerge, apply targeted fixes to the entire batch
4. Re-sample and re-verify the fixed batch
5. Continue until a clean sample passes

```python
import random

def santa_batch(items, rubric, sample_rate=0.15):
    # Cap at the batch size so random.sample never exceeds the population
    sample_size = min(len(items), max(5, int(len(items) * sample_rate)))
    sample = random.sample(items, sample_size)

    for item in sample:
        result = santa_full(item, rubric)
        if result.verdict == "NAUGHTY":
            pattern = classify_failure(result.issues)
            items = batch_fix(items, pattern)   # Fix all items matching pattern
            return santa_batch(items, rubric)   # Re-sample

    return items  # Clean sample → ship batch
```

## Failure Modes and Mitigations

| Failure Mode | Symptom | Mitigation |
|--------------|---------|------------|
| Infinite loop | Reviewers keep finding new issues after fixes | Max iteration cap (3). Escalate. |
| Rubber stamping | Both reviewers pass everything | Adversarial prompt: "Your job is to find problems, not approve." |
| Subjective drift | Reviewers flag style preferences, not errors | Tight rubric with objective pass/fail criteria only |
| Fix regression | Fixing issue A introduces issue B | Fresh reviewers each round catch regressions |
| Reviewer agreement bias | Both reviewers miss the same thing | Mitigated by independence, not eliminated. For critical output, add a third reviewer or human spot-check. |
| Cost explosion | Too many iterations on large outputs | Batch sampling pattern. Budget caps per verification cycle. |

## Integration with Other Skills

| Skill | Relationship |
|-------|--------------|
| Verification Loop | Use for deterministic checks (build, lint, test). Santa for semantic checks (accuracy, hallucinations). Run verification-loop first, Santa second. |
| Eval Harness | Santa Method results feed eval metrics. Track pass@k across Santa runs to measure generator quality over time. |
| Continuous Learning v2 | Santa findings become instincts. Repeated failures on the same criterion → learned behavior to avoid the pattern. |
| Strategic Compact | Run Santa BEFORE compacting. Don't lose review context mid-verification. |

## Metrics

Track these to measure Santa Method effectiveness:

- **First-pass rate**: % of outputs that pass Santa on round 1 (target: >70%)
- **Mean iterations to convergence**: average rounds to NICE (target: <1.5)
- **Issue taxonomy**: distribution of failure types (hallucination vs. completeness vs. compliance)
- **Reviewer agreement**: % of issues flagged by both reviewers vs. only one (low agreement = rubric needs tightening)
- **Escape rate**: issues found post-ship that Santa should have caught (target: 0)

## Cost Analysis

Santa Method costs approximately 2-3x the token cost of generation alone per verification cycle. For most high-stakes output, this is a bargain:

```
Cost of Santa = (generation tokens) + 2×(review tokens per round) × (avg rounds)
Cost of NOT Santa = (reputation damage) + (correction effort) + (trust erosion)
```

For batch operations, the sampling pattern reduces cost to ~15-20% of full verification while catching >90% of systematic issues.
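The cost formula can be turned into a quick estimator. The token counts in the example comment are hypothetical placeholders, not measurements:

```python
def santa_token_cost(gen_tokens: int, review_tokens: int, avg_rounds: float) -> float:
    """Token cost per verification cycle: generation plus two reviews per round."""
    return gen_tokens + 2 * review_tokens * avg_rounds

# Example with made-up numbers: 4k generation tokens, 2k tokens per review,
# averaging 1.5 rounds to convergence:
#   4000 + 2 * 2000 * 1.5 = 10000 tokens, i.e. ~2.5x generation alone,
# which lands inside the 2-3x range quoted above.
```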