Claude Agent Skill · by Wshobson

Incident Runbook Templates

When your payment service goes down at 2 AM and you're fumbling through Slack history for that kubectl command, you need structured runbooks that work under pressure.

Install
Terminal · npx
$ npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates
Works with Paperclip

How Incident Runbook Templates fits into a Paperclip company.

Incident Runbook Templates drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Source file: SKILL.md (472 lines)
---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact                     | Response Time     | Example                 |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Runbook Templates

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification

| Symptom              | Likely Cause        | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down        | Section 4.1   |
| High latency         | Database/dependency | Section 4.2   |
| Partial failures     | Code bug            | Section 4.3   |
| Spike in errors      | Traffic surge       | Section 4.4   |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (use a pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend([PID]);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition                     | Escalate To         | Contact             |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1      | Engineering Manager | @manager (Slack)    |
| Data breach suspected         | Security Team       | #security-incidents |
| Financial impact > $10k       | Finance + Legal     | @finance-oncall     |
| Customer communication needed | Support Lead        | @support-lead       |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

````markdown
# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
````

## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage.
For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

```bash
# Step: Check pod status
kubectl get pods -n payments
# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
```

### On-call engineer panics and skips steps or runs them out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

```markdown
## Quick Checklist

- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
```

### Runbook is outdated — commands reference old cluster names or endpoints

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all `curl` endpoints and `kubectl` context names are still valid:

```markdown
## Runbook Metadata

| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
```

### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates.
Add a standing agenda in the communication template:

```
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
```

### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates active connections. Verify count first.

-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
```

## Related Skills

- `postmortem-writing` - After resolving an incident, use postmortem templates to capture root cause and preventive actions
- `on-call-handoff-patterns` - Structure shift handoffs so the incoming responder has full context on active incidents
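For teams that wire alerting to the severity targets in this skill, the severity table can be encoded as a small lookup so response-time SLAs are checked programmatically rather than read off a wiki page. This is a minimal sketch: the `Severity` class and `is_breached` helper are illustrative names, not part of the skill, and the "next business day" target for SEV4 is approximated here as 24 hours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    name: str
    impact: str
    response_minutes: int  # target time to first response

# Encoding of the skill's severity table (SEV4's "next business day"
# is approximated as 24 hours — an assumption, not from the skill).
SEVERITIES = {
    "SEV1": Severity("SEV1", "Complete outage, data loss", 15),
    "SEV2": Severity("SEV2", "Major degradation", 30),
    "SEV3": Severity("SEV3", "Minor impact", 120),
    "SEV4": Severity("SEV4", "Minimal impact", 24 * 60),
}

def is_breached(sev: str, minutes_since_alert: int) -> bool:
    """True when an alert has waited longer than its severity allows."""
    return minutes_since_alert > SEVERITIES[sev].response_minutes

print(is_breached("SEV1", 20))  # → True (SEV1 allows 15 min)
print(is_breached("SEV3", 20))  # → False (SEV3 allows 2 hours)
```

A check like this is useful in a paging bot that re-escalates automatically when the first responder has not acknowledged within the severity's window.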
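The troubleshooting advice about runbook rot suggests a CI check that validates endpoints and metadata freshness. A sketch of such a linter, under assumed conventions: the `Last verified` metadata row matches the format shown earlier, and `KNOWN_HOSTS` is a hypothetical allow-list your team would maintain — neither is defined by the skill itself.

```python
import re
from datetime import date, timedelta

# Hypothetical allow-list of hostnames the team still recognizes.
KNOWN_HOSTS = {"api.company.com", "prometheus", "grafana", "sentry.io", "status.stripe.com"}
MAX_AGE = timedelta(days=90)  # assumed staleness threshold

def lint_runbook(text: str, today: date) -> list[str]:
    """Return a list of problems found in a runbook's markdown source."""
    problems = []

    # 1. Every URL must point at a host the team still recognizes.
    for host_port in re.findall(r'https?://([^/\s")]+)', text):
        host = host_port.split(":")[0]
        if host not in KNOWN_HOSTS:
            problems.append(f"unknown host: {host}")

    # 2. The "Last verified" metadata row must exist and be recent.
    m = re.search(r"\|\s*Last verified\s*\|\s*(\d{4}-\d{2}-\d{2})\s*\|", text)
    if not m:
        problems.append("missing 'Last verified' metadata row")
    elif today - date.fromisoformat(m.group(1)) > MAX_AGE:
        problems.append(f"stale: last verified {m.group(1)}")

    return problems

sample = """
| Last verified | 2024-11-15 |
Check https://grafana/d/payments and https://old-cluster.internal/health
"""
print(lint_runbook(sample, date(2025, 6, 1)))
# → ['unknown host: old-cluster.internal', 'stale: last verified 2024-11-15']
```

Run against every `*.md` runbook in CI so a renamed cluster or retired endpoint fails the build instead of surfacing at 3 AM.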