```bash
npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates
```
---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact                     | Response Time     | Example                 |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Runbook Templates

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification

| Symptom              | Likely Cause        | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down        | Section 4.1   |
| High latency         | Database/dependency | Section 4.2   |
| Partial failures     | Code bug            | Section 4.3   |
| Spike in errors      | Traffic surge       | Section 4.4   |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 minutes';"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 192.168.1.0/24 # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | \
  jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition                     | Escalate To         | Contact             |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1      | Engineering Manager | @manager (Slack)    |
| Data breach suspected         | Security Team       | #security-incidents |
| Financial impact > $10k       | Finance + Legal     | @finance-oncall     |
| Customer communication needed | Support Lead        | @support-lead       |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

````markdown
# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT CASE
  WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
  ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "
  SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
  FROM pg_catalog.pg_statio_user_tables
  ORDER BY pg_total_relation_size(relid) DESC
  LIMIT 10;"

# VACUUM to reclaim space
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
````

## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage.
For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

```bash
# Step: Check pod status
kubectl get pods -n payments

# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
```

### On-call engineer panics and runs steps out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

```markdown
## Quick Checklist

- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
```

### Runbook is outdated: commands reference old cluster names or endpoints

Runbooks rot because they are updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all `curl` endpoints and `kubectl` context names are still valid:

```markdown
## Runbook Metadata

| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
```

### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates.
Add a standing agenda in the communication template:

```
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
```

### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates active connections. Verify count first.

-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
```

## Related Skills

- `postmortem-writing` - After resolving an incident, use postmortem templates to capture root cause and preventive actions
- `on-call-handoff-patterns` - Structure shift handoffs so the incoming responder has full context on active incidents
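The Troubleshooting section recommends a CI check that catches stale endpoints before an incident does. A minimal sketch in POSIX shell, assuming runbooks live as markdown files in the repo; the function names and the sample runbook path are illustrative, not part of this skill:

```shell
# extract_urls: list every unique http(s) URL referenced in a runbook file.
# Matches inline markdown links, dashboard references, and bare curl targets.
extract_urls() {
  grep -oE 'https?://[A-Za-z0-9./_:-]+' "$1" | sort -u
}

# check_runbook: probe each URL; report and fail on any that no longer answers.
check_runbook() {
  fail=0
  for url in $(extract_urls "$1"); do
    # --head keeps the probe cheap; -m 5 stops CI hanging on a dead host
    if curl -sf --head -m 5 "$url" >/dev/null; then
      echo "OK    $url"
    else
      echo "STALE $url"
      fail=1
    fi
  done
  return "$fail"
}

# In CI: check_runbook docs/runbooks/payments-outage.md || exit 1
```

Run it on every runbook edit and on a weekly schedule so dead dashboards and renamed endpoints surface in review rather than at 3 AM.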