Claude Agent Skill · by Wshobson

Postmortem Writing

Does the structured work of writing blameless postmortems after incidents. Walks you through building proper timelines, running 5-whys root cause analysis, and turning the findings into actionable follow-up items.

Install
Terminal · npx
$ npx skills add https://github.com/wshobson/agents --skill postmortem-writing
Works with Paperclip

How Postmortem Writing fits into a Paperclip company.

Postmortem Writing drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

SaaS Factory · Paired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27 (regularly $59)
Explore pack
Source file
SKILL.md · 390 lines
---
name: postmortem-writing
description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
---

# Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

## When to Use This Skill

- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture

## Core Concepts

### 1. Blameless Culture

| Blame-Focused            | Blameless                         |
| ------------------------ | --------------------------------- |
| "Who caused this?"       | "What conditions allowed this?"   |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals       | Improve systems                   |
| Hide information         | Share learnings                   |
| Fear of speaking up      | Psychological safety              |

### 2. Postmortem Triggers

- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention
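The trigger list above lends itself to a simple triage check. As a minimal sketch of that idea, assuming a hypothetical `Incident` record (the fields and thresholds below are illustrative, not part of the skill):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    severity: int               # 1 = SEV1 (most severe)
    outage_minutes: int         # customer-facing downtime
    data_loss: bool
    security_related: bool
    near_miss: bool             # could have been severe
    novel_failure: bool         # failure mode not seen before
    unusual_intervention: bool  # required unusual manual steps

def needs_postmortem(i: Incident) -> bool:
    """Apply the trigger criteria listed above."""
    return (
        i.severity <= 2
        or i.outage_minutes > 15
        or i.data_loss
        or i.security_related
        or i.near_miss
        or i.novel_failure
        or i.unusual_intervention
    )

# A 12-minute SEV3 near-miss still warrants a (quick) postmortem.
print(needs_postmortem(Incident(3, 12, False, False, True, False, False)))  # True
```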
## Quick Start

### Postmortem Timeline

```
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
```

## Templates

### Template 1: Standard Postmortem

````markdown
# Postmortem: [Incident Title]

**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes

## Executive Summary

On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:

- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications

## Timeline (All times UTC)

| Time  | Event                                            |
| ----- | ------------------------------------------------ |
| 14:23 | Deployment v2.3.4 completed to production        |
| 14:31 | First alert: `payment_error_rate > 5%`           |
| 14:33 | On-call engineer @alice acknowledges alert       |
| 14:35 | Initial investigation begins, error rate at 23%  |
| 14:41 | Incident declared SEV2, @bob joins               |
| 14:45 | Database connection exhaustion identified        |
| 14:52 | Decision to roll back deployment                 |
| 14:58 | Rollback to v2.3.3 initiated                     |
| 15:10 | Rollback complete, error rate dropping           |
| 15:18 | Service fully recovered, incident resolved       |

## Root Cause Analysis

### What Happened

The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently called endpoint. Each request opened a new database connection instead of reusing pooled connections.

### Why It Happened

1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.

2. **Contributing Factors**:
   - Code review did not catch the connection handling change
   - No integration tests specifically for connection pool behavior
   - Staging environment has lower traffic, masking the issue
   - Database connection metrics alert threshold was too high (90%)

3. **5 Whys Analysis**:
   - Why did the service fail? → Database connections exhausted
   - Why were connections exhausted? → Each request opened new connection
   - Why did each request open new connection? → Code bypassed connection pool
   - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
   - Why was developer unfamiliar? → No documentation on connection management patterns

### System Diagram

```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    |
                         Connection Pool (broken)
                         Direct connections (cause)
```

## Detection

### What Worked

- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2-minute acknowledgment)

### What Didn't Work

- Database connection metric alert threshold too high
- No deployment-correlated alerting
- Canary deployment would have caught this earlier

### Detection Gap

The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.

## Response

### What Worked

- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel

### What Could Be Improved

- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)

## Impact

### Customer Impact

- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (23% of affected users)
- Customer satisfaction score dropped 12 points

### Business Impact

- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours

### Technical Impact

- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems

## Lessons Learned

### What Went Well

1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely

### What Went Wrong

1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly

### Where We Got Lucky

1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously

## Action Items

| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |

## Appendix

### Supporting Data

#### Error Rate Graph
[Link to Grafana dashboard snapshot]

#### Database Connection Graph
[Link to metrics]

### Related Incidents

- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)

### References

- [Connection Pool Best Practices](internal-wiki/connection-pools)
- [Deployment Runbook](internal-wiki/deployment-runbook)
````
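The detection gap in the template above (and action item OPS-568) points at deployment-correlated alerting. A minimal sketch of the correlation step, assuming a hypothetical in-memory deploy log rather than any real deployment tooling:

```python
from datetime import datetime, timedelta

# Hypothetical deploy log: (completion time, version), oldest first.
DEPLOYS = [
    (datetime(2024, 1, 15, 13, 2), "v2.3.3"),
    (datetime(2024, 1, 15, 14, 23), "v2.3.4"),
]

def suspect_deploys(alert_time: datetime, window_minutes: int = 30):
    """List deploys that finished shortly before an alert fired, so
    responders can correlate an alert with a recent release at a glance."""
    window = timedelta(minutes=window_minutes)
    return [
        (finished, version)
        for finished, version in DEPLOYS
        if timedelta(0) <= alert_time - finished <= window
    ]

# The 14:31 payment_error_rate alert points straight at v2.3.4 (deployed 14:23).
print(suspect_deploys(datetime(2024, 1, 15, 14, 31)))
```

Wiring a check like this into the alert annotation itself would have cut the 10 minutes the responders spent manually checking deployment history.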
### Template 2: 5 Whys Analysis

```markdown
# 5 Whys Analysis: [Incident]

## Problem Statement

Payment service experienced 47-minute outage due to database connection exhaustion.

## Analysis

### Why #1: Why did the service fail?

**Answer**: Database connections were exhausted, causing all new requests to fail.

**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.

---

### Why #2: Why were database connections exhausted?

**Answer**: Each incoming request opened a new database connection instead of using the connection pool.

**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.

---

### Why #3: Why did the code bypass the connection pool?

**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.

**Evidence**: PR #1234 shows the change, made while fixing a different bug.

---

### Why #4: Why wasn't this caught in code review?

**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.

**Evidence**: Review comments only discuss business logic.

---

### Why #5: Why isn't there a safety net for this type of change?

**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.

**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.

## Root Causes Identified

1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations

## Systemic Improvements

| Root Cause    | Improvement                       | Type       |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs  | Document connection patterns      | Prevention |
| Review gaps   | Update review checklist           | Detection  |
| No canary     | Implement canary deployments      | Mitigation |
```

### Template 3: Quick Postmortem (Minor Incidents)

```markdown
# Quick Postmortem: [Brief Title]

**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

## What Happened

API latency spiked to 5s due to cache miss storm after cache flush.

## Timeline

- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized

## Root Cause

Full cache flush for minor config update caused thundering herd.

## Fix

- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)

## Lessons

Don't full-flush cache in production; use targeted invalidation.
```
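Template 3's lesson, targeted invalidation instead of a full flush, can be sketched in a few lines. The namespaced key scheme here is a hypothetical illustration, not the cache layout of any real service:

```python
cache = {
    "config:payments:timeout": "30s",
    "config:search:timeout": "10s",
    "session:u123": "active",
}

def invalidate_prefix(cache: dict, prefix: str) -> int:
    """Evict only the keys under one namespace, keeping the rest of the
    cache warm and avoiding a thundering herd of simultaneous misses."""
    stale = [key for key in cache if key.startswith(prefix)]
    for key in stale:
        del cache[key]
    return len(stale)

# A payments config update touches 1 key instead of flushing all 3.
print(invalidate_prefix(cache, "config:payments:"))  # 1
```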
## Facilitation Guide

### Running a Postmortem Meeting

```markdown
## Meeting Structure (60 minutes)

### 1. Opening (5 min)

- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms

### 2. Timeline Review (15 min)

- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline

### 3. Analysis Discussion (20 min)

- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?

### 4. Action Items (15 min)

- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates

### 5. Closing (5 min)

- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed

## Facilitation Tips

- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
```

## Anti-Patterns to Avoid

| Anti-Pattern            | Problem                    | Better Approach                 |
| ----------------------- | -------------------------- | ------------------------------- |
| **Blame game**          | Shuts down learning        | Focus on systems                |
| **Shallow analysis**    | Doesn't prevent recurrence | Ask "why" 5 times               |
| **No action items**     | Waste of time              | Always have concrete next steps |
| **Unrealistic actions** | Never completed            | Scope to achievable tasks       |
| **No follow-up**        | Actions forgotten          | Track in ticketing system       |

## Best Practices

### Do's

- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning

### Don'ts

- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed
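To make "track in ticketing system" and "verify actions completed" concrete, here is a tiny overdue check over the action-item table format used above. The tuples are illustrative stand-ins, not a real tracker API:

```python
from datetime import date

# (priority, action, owner, due date), mirroring the Action Items table.
ACTION_ITEMS = [
    ("P0", "Add integration test for connection pool behavior", "@alice", date(2024, 1, 22)),
    ("P0", "Lower database connection alert threshold to 70%", "@bob", date(2024, 1, 17)),
    ("P1", "Document connection management patterns", "@alice", date(2024, 1, 29)),
]

def overdue(items, today):
    """Flag action items past their due date so follow-up isn't skipped."""
    return [(p, action, owner) for p, action, owner, due in items if due < today]

for priority, action, owner in overdue(ACTION_ITEMS, date(2024, 2, 1)):
    print(f"[{priority}] OVERDUE: {action} ({owner})")
```

Running a check like this on a schedule, or reviewing it in the quarterly pattern review, keeps postmortem actions from quietly expiring.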