Incident Postmortem
Write blameless incident postmortems with structured timeline reconstruction, impact quantification, contributing factor analysis, and actionable follow-up items with owners and deadlines.
# Incident Postmortem

## Before you start

Gather the following from the user:

1. **What happened?** (Service name, symptoms, error messages, alerts that fired)
2. **When did it happen?** (Detection time, start time if known, resolution time — all in UTC)
3. **Who was involved?** (On-call responder, escalation chain, any external parties)
4. **What was the blast radius?** (Affected users, regions, services, revenue impact)
5. **What fixed it?** (Mitigation steps taken, in order)

If the user gives you a vague summary ("the site went down for a bit"), push back: "What specific errors did users see? Which services were affected? When exactly did alerts fire vs. when was the issue resolved?"

## Postmortem template

Use the following structure for every postmortem:

### Incident Summary

Write 3-5 sentences covering: what broke, who was affected, how long it lasted, and how it was resolved. This should be understandable by someone outside the team.

```
On 2024-03-12 at 14:32 UTC, the checkout service began returning 500 errors
for all payment processing requests. Approximately 12,000 users were unable to
complete purchases during the 47-minute outage. The issue was caused by an
expired TLS certificate on the payment gateway. Service was restored at 15:19
UTC by rotating the certificate.
```

### Timeline

Use UTC timestamps. Include detection lag (the time between incident start and first alert). Mark each entry with a category tag.

```
14:32 UTC [ONSET] First 500 errors appear in payment service logs
14:38 UTC [DETECTION] PagerDuty alert fires for checkout error rate > 5%
14:40 UTC [RESPONSE] On-call engineer acknowledges alert
14:45 UTC [DIAGNOSIS] Engineer identifies TLS handshake failures in logs
14:52 UTC [ESCALATION] Platform team paged for certificate access
15:10 UTC [MITIGATION] New certificate issued and deployed to staging
15:15 UTC [MITIGATION] Certificate deployed to production
15:19 UTC [RESOLUTION] Error rates return to baseline, incident closed
```

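Detection lag and total duration fall straight out of a tagged timeline like the one above. A minimal sketch of computing them (the `parse_timeline` and `lag_minutes` helpers and the entry format are assumptions based on the example, not required tooling):

```python
from datetime import datetime

def parse_timeline(lines):
    """Parse '14:32 UTC [ONSET] ...' entries into (time, tag, note) tuples."""
    entries = []
    for line in lines:
        time_part, rest = line.split(" UTC ", 1)
        tag, note = rest.lstrip("[").split("] ", 1)
        entries.append((datetime.strptime(time_part, "%H:%M"), tag, note))
    return entries

def lag_minutes(entries, start_tag, end_tag):
    """Minutes between the first entry tagged start_tag and the first tagged end_tag."""
    start = next(t for t, tag, _ in entries if tag == start_tag)
    end = next(t for t, tag, _ in entries if tag == end_tag)
    return int((end - start).total_seconds() // 60)

timeline = [
    "14:32 UTC [ONSET] First 500 errors appear in payment service logs",
    "14:38 UTC [DETECTION] PagerDuty alert fires for checkout error rate > 5%",
    "15:19 UTC [RESOLUTION] Error rates return to baseline, incident closed",
]
entries = parse_timeline(timeline)
print(lag_minutes(entries, "ONSET", "DETECTION"))   # detection lag: 6 minutes
print(lag_minutes(entries, "ONSET", "RESOLUTION"))  # total duration: 47 minutes
```

Computing these numbers rather than eyeballing them keeps the timeline and the impact section consistent with each other.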
### Impact

Quantify impact with actual numbers, not vague language:

- **Duration**: Total outage time (onset to resolution) and user-facing downtime
- **Users affected**: Count or percentage, segmented if possible
- **Revenue impact**: Lost transactions, failed payments, SLA credits issued
- **Downstream effects**: Other services or teams that were impacted
- **Detection time**: How long between onset and first alert

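Duration maps directly onto availability and error-budget numbers, which is often how SLA credits are assessed. A quick sketch, assuming a 30-day month and a hypothetical 99.9% monthly availability target:

```python
outage_minutes = 47
month_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day month

# Availability for the month, counting only this incident's downtime.
availability = 1 - outage_minutes / month_minutes
print(f"{availability:.4%}")

# A 99.9% monthly target allows roughly 43.2 minutes of downtime, so this
# single 47-minute incident spends more than the whole month's error budget.
budget_minutes = month_minutes * (1 - 0.999)
print(f"{budget_minutes:.1f} minutes of budget")
```

Stating impact this way ("consumed 109% of the monthly error budget") is far harder to wave off than "the site was down for a bit".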
### Contributing Factors

List every factor that contributed to the incident occurring or lasting longer than it should have. Frame these as system failures, not personal failures.

```
- Certificate expiry was tracked in a spreadsheet with no automated alerting
- The payment service had no fallback path when TLS negotiation fails
- Runbook for certificate rotation was last updated 18 months ago and
  referenced a deprecated tool
- On-call engineer did not have permissions to rotate certificates,
  requiring escalation
```

### Root Cause

Identify the deepest systemic cause. The root cause is never "someone made a mistake" — it is the system condition that allowed the mistake to have impact.

```
Root cause: Certificate lifecycle management relied on manual tracking without
automated expiry alerts or rotation. The system had no defense against expiry
because it was treated as a one-time setup rather than an ongoing concern.
```

### Action Items

Every action item must have an owner, a deadline, and a priority. Use this format:

| Priority | Action Item | Owner | Deadline | Ticket |
|----------|-------------|-------|----------|--------|
| **P0** | Add automated certificate expiry alerting (30/14/7 day warnings) | @platform-team | 2024-03-19 | OPS-891 |
| **P1** | Implement certificate auto-rotation for payment service | @platform-team | 2024-04-01 | OPS-892 |
| **P1** | Grant on-call engineers certificate rotation permissions | @security-team | 2024-03-15 | SEC-234 |
| **P2** | Add TLS handshake failure to checkout service health check | @checkout-team | 2024-04-15 | CHK-567 |

Priority definitions: **P0** — before next on-call rotation, prevents recurrence. **P1** — within 2 weeks, reduces severity or detection time. **P2** — within 30 days, improves resilience or observability.

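The 30/14/7-day warning scheme in the P0 item above reduces to a small threshold check. A sketch of that alerting logic (the `warning_level` helper is hypothetical, and how the certificate expiry date is fetched is deliberately left out):

```python
from datetime import date

# Warning thresholds from the P0 action item: alert at 30, 14, and 7 days out.
THRESHOLDS = [30, 14, 7]

def warning_level(expiry: date, today: date):
    """Return the tightest threshold crossed, or None if no alert is due yet."""
    days_left = (expiry - today).days
    if days_left < 0:
        return "EXPIRED"
    crossed = [t for t in THRESHOLDS if days_left <= t]
    return f"{min(crossed)}-day warning" if crossed else None

today = date(2024, 3, 12)
print(warning_level(date(2024, 3, 22), today))  # "14-day warning" (10 days left)
print(warning_level(date(2024, 4, 30), today))  # None (49 days left, no alert)
print(warning_level(date(2024, 3, 11), today))  # "EXPIRED"
```

Escalating severity as thresholds tighten (ticket at 30 days, page at 7) keeps the alert actionable without being noisy.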
### Lessons Learned

Include three categories:

- **What went well**: Response actions, tools, or processes that worked as intended
- **What went poorly**: Gaps that made the incident worse or slower to resolve
- **Where we got lucky**: Things that could have made this much worse but didn't

## Quality checklist

Before delivering the postmortem, verify:

- [ ] Summary is understandable by someone outside the engineering team
- [ ] Timeline uses UTC and includes detection lag
- [ ] Impact section contains actual numbers, not "some users were affected"
- [ ] Contributing factors describe system failures, not individual mistakes
- [ ] Root cause identifies a systemic issue, not "human error"
- [ ] Every action item has an owner, deadline, priority, and ticket reference
- [ ] At least one P0 action item exists that prevents immediate recurrence
- [ ] Lessons learned include all three categories (well, poorly, lucky)

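The structural items on this checklist can be linted mechanically before the human review pass. A minimal sketch, assuming the section names from the template above (the `lint_postmortem` helper is illustrative, not part of this skill):

```python
import re

REQUIRED_SECTIONS = [
    "Incident Summary", "Timeline", "Impact", "Contributing Factors",
    "Root Cause", "Action Items", "Lessons Learned",
]

def lint_postmortem(text: str) -> list[str]:
    """Return a list of structural problems found in a postmortem draft."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not re.search(rf"^###?\s+{re.escape(section)}\s*$", text, re.MULTILINE):
            problems.append(f"missing section: {section}")
    if "P0" not in text:
        problems.append("no P0 action item")
    if re.search(r"\d{2}:\d{2} UTC", text) is None:
        problems.append("timeline has no UTC timestamps")
    return problems

draft = "### Incident Summary\n...\n### Timeline\n14:32 UTC [ONSET] ...\n"
print(lint_postmortem(draft))  # flags the five missing sections and the absent P0 item
```

A linter like this only catches absence, not quality; "Summary is understandable by an outsider" still needs a human reader.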
## Common mistakes to avoid

- **Blaming individuals**. "John forgot to renew the certificate" is a blame statement. "Certificate renewal depended on manual tracking with no automated alerts" is a system observation. Always describe the system gap, not the person.
- **Vague action items**. "Improve monitoring" is not actionable. "Add PagerDuty alert when certificate expiry is within 30 days (OPS-891, @platform-team, due 2024-03-19)" is actionable.
- **Missing the detection gap**. Always call out how long the incident was occurring before anyone noticed. An issue that took 2 minutes to fix but 45 minutes to detect is a monitoring problem, not just an infrastructure problem.
- **Action items without owners**. An action item assigned to a team mailing list or "TBD" will not get done. Every item needs a specific person or team lead who is accountable.
- **Skipping "where we got lucky"**. This section surfaces near-misses that deserve preventive action even though they didn't cause damage this time.