Name: SRE Engineer
Author: Jeffallan

Install

$ npx skills add https://github.com/jeffallan/claude-skills --skill sre-engineer

Works with Paperclip

How SRE Engineer fits into a Paperclip company.

SRE Engineer drops into any Paperclip agent that handles site reliability work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat, with no prompt engineering and no tool wiring.


Source file

SKILL.md · 181 lines · markdown


---
name: sre-engineer
description: Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, incident management, chaos engineering, toil reduction, or capacity planning.
license: MIT
metadata:
  author: https://github.com/Jeffallan
  version: "1.1.0"
  domain: devops
  triggers: SRE, site reliability, SLO, SLI, error budget, incident management, chaos engineering, toil reduction, on-call, MTTR
  role: specialist
  scope: implementation
  output-format: code
  related-skills: devops-engineer, cloud-architect, kubernetes-specialist
---

# SRE Engineer

## Core Workflow

1. **Assess reliability** - Review architecture, SLOs, incidents, toil levels
2. **Define SLOs** - Identify meaningful SLIs and set appropriate targets
3. **Verify alignment** - Confirm SLO targets reflect user expectations before proceeding
4. **Implement monitoring** - Build golden signal dashboards and alerting
5. **Automate toil** - Identify repetitive tasks and build automation
6. **Test resilience** - Design and execute chaos experiments; validate recovery end-to-end and confirm it meets RTO/RPO targets before marking an experiment complete

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| SLO/SLI | `references/slo-sli-management.md` | Defining SLOs, calculating error budgets |
| Error Budgets | `references/error-budget-policy.md` | Managing budgets, burn rates, policies |
| Monitoring | `references/monitoring-alerting.md` | Golden signals, alert design, dashboards |
| Automation | `references/automation-toil.md` | Toil reduction, automation patterns |
| Incidents | `references/incident-chaos.md` | Incident response, chaos engineering |

## Constraints

### MUST DO
- Define quantitative SLOs (e.g., 99.9% availability)
- Calculate error budgets from SLO targets
- Monitor golden signals (latency, traffic, errors, saturation)
- Write blameless postmortems for all incidents
- Measure toil and track reduction progress
- Automate repetitive operational tasks
- Test failure scenarios with chaos engineering
- Balance reliability with feature velocity

### MUST NOT DO
- Set SLOs without user impact justification
- Alert on symptoms without actionable runbooks
- Tolerate >50% toil without an automation plan
- Skip postmortems or assign blame
- Implement manual processes for recurring tasks
- Deploy without capacity planning
- Ignore error budget exhaustion
- Build systems that can't degrade gracefully

## Output Templates

When implementing SRE practices, provide:
1. SLO definitions with SLI measurements and targets
2. Monitoring/alerting configuration (Prometheus, etc.)
3. Automation scripts (Python, Go, Terraform)
4. Runbooks with clear remediation steps
5. Brief explanation of reliability impact

## Concrete Examples

### SLO Definition & Error Budget Calculation

```
# 99.9% availability SLO over a 30-day window
# Allowed downtime: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes/month
# Error budget (request-based): 0.001 * total_requests

# Example: 10M requests/month → 10,000 error budget requests
# If 5,000 errors consumed in week 1 → 50% budget burned in ~23% of window (7 of 30 days)
# → Trigger error budget policy: freeze non-critical releases
```

### Prometheus SLO Alerting Rule (Multiwindow Burn Rate)

```yaml
groups:
  - name: slo_availability
    rules:
      # Fast burn: 2% budget in 1h (14.4x burn rate)
      # Threshold = 14.4 * 0.001 (allowed error ratio for a 99.9% SLO)
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.014400
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.014400
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate detected"
          runbook: "https://wiki.internal/runbooks/high-error-burn"

      # Slow burn: 5% budget in 6h (6x burn rate over a 30-day window)
      # Threshold = 6 * 0.001; a 1x threshold (0.001) would only spend ~0.8% of budget in 6h
      - alert: SlowErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > 0.006
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained error budget consumption"
          runbook: "https://wiki.internal/runbooks/slow-error-burn"
```

### PromQL Golden Signal Queries

```promql
# Latency: 99th percentile request duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Traffic: requests per second by service
sum(rate(http_requests_total[5m])) by (service)

# Errors: error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
sum(rate(http_requests_total[5m])) by (service)

# Saturation: CPU throttling ratio
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)
  /
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)
```

### Toil Automation Script (Python)

```python
#!/usr/bin/env python3
"""Auto-remediation: restart a deployment whose error rate exceeds the threshold."""
import json
import subprocess
import sys
import urllib.parse
import urllib.request

ERROR_THRESHOLD = 0.05  # 5% error rate triggers restart
PROMETHEUS_URL = "http://prometheus:9090"


def get_error_rate(service: str) -> float:
    """Query Prometheus for the service's 5xx ratio over the last 5 minutes."""
    query = (
        f'sum(rate(http_requests_total{{status=~"5..",service="{service}"}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
    )
    url = f"{PROMETHEUS_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    results = data["data"]["result"]
    # An empty result means no matching series; treat it as no errors.
    return float(results[0]["value"][1]) if results else 0.0


def restart_deployment(namespace: str, deployment: str) -> None:
    """Trigger a rolling restart; raises CalledProcessError if kubectl fails."""
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    print(f"Restarted {namespace}/{deployment}")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        sys.exit(f"usage: {sys.argv[0]} SERVICE NAMESPACE DEPLOYMENT")
    service, namespace, deployment = sys.argv[1:4]
    rate = get_error_rate(service)
    print(f"Error rate for {service}: {rate:.2%}")
    if rate > ERROR_THRESHOLD:
        restart_deployment(namespace, deployment)
    else:
        print("Within error threshold, no action required")
```
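
The error budget arithmetic in the SLO definition example above can also be sketched in Python. This is a minimal illustration and not part of the skill's source: the `ErrorBudget` class and its method names are hypothetical, and the inputs (99.9% target, 30-day window, 10M requests, 5,000 failures after 7 days) are taken from that example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical helper)."""

    slo_target: float    # e.g. 0.999 for 99.9% availability
    window_days: int     # length of the SLO window, e.g. 30
    total_requests: int  # expected requests over the whole window

    @property
    def allowed_downtime_minutes(self) -> float:
        """Time-based budget: minutes of full outage the SLO tolerates."""
        return (1 - self.slo_target) * self.window_days * 24 * 60

    @property
    def budget_requests(self) -> float:
        """Request-based budget: failed requests the SLO tolerates."""
        return (1 - self.slo_target) * self.total_requests

    def consumed_fraction(self, failed_requests: int) -> float:
        """Share of the budget already spent (1.0 means exhausted)."""
        return failed_requests / self.budget_requests

    def burn_rate(self, failed_requests: int, elapsed_days: float) -> float:
        """Spend speed relative to an even spend over the window (1.0 = on pace)."""
        elapsed_fraction = elapsed_days / self.window_days
        return self.consumed_fraction(failed_requests) / elapsed_fraction


budget = ErrorBudget(slo_target=0.999, window_days=30, total_requests=10_000_000)
print(f"Allowed downtime: {budget.allowed_downtime_minutes:.1f} min")
print(f"Budget: {budget.budget_requests:,.0f} failed requests")
print(f"Consumed after week 1: {budget.consumed_fraction(5_000):.0%}")
print(f"Burn rate after week 1: {budget.burn_rate(5_000, elapsed_days=7):.2f}x")
```

With these inputs, half the budget is gone after 7 of 30 days, a burn rate of roughly 2.1. Any sustained value above 1.0 means the budget runs out before the window ends, which is the condition the error budget policy is meant to catch.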
