Claude Agent Skill · by Wshobson

Incident Runbook Templates

When your payment service goes down at 2 AM and you're fumbling through Slack history for that kubectl command, you need structured runbooks that work under pressure.

Install
Terminal · npx
$ npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates
Works with Paperclip

How Incident Runbook Templates fits into a Paperclip company.

Incident Runbook Templates drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Source file: SKILL.md (472 lines)
---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact                     | Response Time     | Example                 |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min            | Production down         |
| **SEV2** | Major degradation          | 30 min            | Critical feature broken |
| **SEV3** | Minor impact               | 2 hours           | Non-critical bug        |
| **SEV4** | Minimal impact             | Next business day | Cosmetic issue          |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Runbook Templates

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification

| Symptom              | Likely Cause        | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down        | Section 4.1   |
| High latency         | Database/dependency | Section 4.2   |
| Partial failures     | Code bug            | Section 4.3   |
| Spike in errors      | Traffic surge       | Section 4.4   |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (use a pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend([PID]);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition                     | Escalate To         | Contact             |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1      | Engineering Manager | @manager (Slack)    |
| Data breach suspected         | Security Team       | #security-incidents |
| Financial impact > $10k       | Finance + Legal     | @finance-oncall     |
| Customer communication needed | Support Lead        | @support-lead       |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

````markdown
# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
````

## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Troubleshooting

### Runbook steps work in staging but fail during a real incident

Steps often assume preconditions that are true in a healthy environment but not during an outage.
For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:

```bash
# Step: Check pod status
kubectl get pods -n payments
# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
```

### On-call engineer panics and skips steps or runs them out of order

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

```markdown
## Quick Checklist

- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
```

### Runbook is outdated — commands reference old cluster names or endpoints

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all `curl` endpoints and `kubectl` context names are still valid:

```markdown
## Runbook Metadata

| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
```

### Stakeholder communication is delayed while engineers are heads-down

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates.
Add a standing agenda in the communication template:

```
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
```

### Database runbook commands cause additional downtime when run incorrectly

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates active connections. Verify count first.

-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';

-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
```

## Related Skills

- `postmortem-writing` - After resolving an incident, use postmortem templates to capture root cause and preventive actions
- `on-call-handoff-patterns` - Structure shift handoffs so the incoming responder has full context on active incidents
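For teams that wire alerting to the severity targets in this skill, the severity table can be encoded as a small lookup so response-time SLAs are checked programmatically rather than read off a wiki page. This is a minimal sketch: the `Severity` class and `is_breached` helper are illustrative names, not part of the skill, and the "next business day" target for SEV4 is approximated here as 24 hours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    name: str
    impact: str
    response_minutes: int  # target time to first response

# Encoding of the skill's severity table (SEV4's "next business day"
# is approximated as 24 hours — an assumption, not from the skill).
SEVERITIES = {
    "SEV1": Severity("SEV1", "Complete outage, data loss", 15),
    "SEV2": Severity("SEV2", "Major degradation", 30),
    "SEV3": Severity("SEV3", "Minor impact", 120),
    "SEV4": Severity("SEV4", "Minimal impact", 24 * 60),
}

def is_breached(sev: str, minutes_since_alert: int) -> bool:
    """True when an alert has waited longer than its severity allows."""
    return minutes_since_alert > SEVERITIES[sev].response_minutes

print(is_breached("SEV1", 20))  # → True (SEV1 allows 15 min)
print(is_breached("SEV3", 20))  # → False (SEV3 allows 2 hours)
```

A check like this is useful in a paging bot that re-escalates automatically when the first responder has not acknowledged within the severity's window.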
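The troubleshooting advice about runbook rot suggests a CI check that validates endpoints and metadata freshness. A sketch of such a linter, under assumed conventions: the `Last verified` metadata row matches the format shown earlier, and `KNOWN_HOSTS` is a hypothetical allow-list your team would maintain — neither is defined by the skill itself.

```python
import re
from datetime import date, timedelta

# Hypothetical allow-list of hostnames the team still recognizes.
KNOWN_HOSTS = {"api.company.com", "prometheus", "grafana", "sentry.io", "status.stripe.com"}
MAX_AGE = timedelta(days=90)  # assumed staleness threshold

def lint_runbook(text: str, today: date) -> list[str]:
    """Return a list of problems found in a runbook's markdown source."""
    problems = []

    # 1. Every URL must point at a host the team still recognizes.
    for host_port in re.findall(r'https?://([^/\s")]+)', text):
        host = host_port.split(":")[0]
        if host not in KNOWN_HOSTS:
            problems.append(f"unknown host: {host}")

    # 2. The "Last verified" metadata row must exist and be recent.
    m = re.search(r"\|\s*Last verified\s*\|\s*(\d{4}-\d{2}-\d{2})\s*\|", text)
    if not m:
        problems.append("missing 'Last verified' metadata row")
    elif today - date.fromisoformat(m.group(1)) > MAX_AGE:
        problems.append(f"stale: last verified {m.group(1)}")

    return problems

sample = """
| Last verified | 2024-11-15 |
Check https://grafana/d/payments and https://old-cluster.internal/health
"""
print(lint_runbook(sample, date(2025, 6, 1)))
# → ['unknown host: old-cluster.internal', 'stale: last verified 2024-11-15']
```

Run against every `*.md` runbook in CI so a renamed cluster or retired endpoint fails the build instead of surfacing at 3 AM.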