Incident Postmortem
Write blameless incident postmortems with structured timeline reconstruction, impact quantification, contributing factor analysis, and actionable follow-up items with owners and deadlines.
# Incident Postmortem

## Before you start

Gather the following from the user:

1. **What happened?** (Service name, symptoms, error messages, alerts that fired)
2. **When did it happen?** (Detection time, start time if known, resolution time — all in UTC)
3. **Who was involved?** (On-call responder, escalation chain, any external parties)
4. **What was the blast radius?** (Affected users, regions, services, revenue impact)
5. **What fixed it?** (Mitigation steps taken, in order)

If the user gives you a vague summary ("the site went down for a bit"), push back: "What specific errors did users see? Which services were affected? When exactly did alerts fire vs. when was the issue resolved?"

## Postmortem template

Use the following structure for every postmortem:

### Incident Summary

Write 3-5 sentences covering: what broke, who was affected, how long it lasted, and how it was resolved. This should be understandable by someone outside the team.

```
On 2024-03-12 at 14:32 UTC, the checkout service began returning 500 errors
for all payment processing requests. Approximately 12,000 users were unable to
complete purchases during the 47-minute outage. The issue was caused by an
expired TLS certificate on the payment gateway. Service was restored at 15:19
UTC by rotating the certificate.
```

### Timeline

Use UTC timestamps. Include detection lag (the time between incident start and first alert). Mark each entry with a category tag.

```
14:32 UTC [ONSET] First 500 errors appear in payment service logs
14:38 UTC [DETECTION] PagerDuty alert fires for checkout error rate > 5%
14:40 UTC [RESPONSE] On-call engineer acknowledges alert
14:45 UTC [DIAGNOSIS] Engineer identifies TLS handshake failures in logs
14:52 UTC [ESCALATION] Platform team paged for certificate access
15:10 UTC [MITIGATION] New certificate issued and deployed to staging
15:15 UTC [MITIGATION] Certificate deployed to production
15:19 UTC [RESOLUTION] Error rates return to baseline, incident closed
```

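Detection lag and total duration fall straight out of a tagged timeline like the one above. A minimal sketch of computing them (the `parse_timeline` and `lag_minutes` helpers and the entry format are assumptions based on the example, not required tooling):

```python
from datetime import datetime

def parse_timeline(lines):
    """Parse '14:32 UTC [ONSET] ...' entries into (time, tag, note) tuples."""
    entries = []
    for line in lines:
        time_part, rest = line.split(" UTC ", 1)
        tag, note = rest.lstrip("[").split("] ", 1)
        entries.append((datetime.strptime(time_part, "%H:%M"), tag, note))
    return entries

def lag_minutes(entries, start_tag, end_tag):
    """Minutes between the first entry tagged start_tag and the first tagged end_tag."""
    start = next(t for t, tag, _ in entries if tag == start_tag)
    end = next(t for t, tag, _ in entries if tag == end_tag)
    return int((end - start).total_seconds() // 60)

timeline = [
    "14:32 UTC [ONSET] First 500 errors appear in payment service logs",
    "14:38 UTC [DETECTION] PagerDuty alert fires for checkout error rate > 5%",
    "15:19 UTC [RESOLUTION] Error rates return to baseline, incident closed",
]
entries = parse_timeline(timeline)
print(lag_minutes(entries, "ONSET", "DETECTION"))   # detection lag: 6 minutes
print(lag_minutes(entries, "ONSET", "RESOLUTION"))  # total duration: 47 minutes
```

Computing these numbers rather than eyeballing them keeps the timeline and the impact section consistent with each other.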
### Impact

Quantify impact with actual numbers, not vague language:

- **Duration**: Total outage time (onset to resolution) and user-facing downtime
- **Users affected**: Count or percentage, segmented if possible
- **Revenue impact**: Lost transactions, failed payments, SLA credits issued
- **Downstream effects**: Other services or teams that were impacted
- **Detection time**: How long between onset and first alert

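Duration maps directly onto availability and error-budget numbers, which is often how SLA credits are assessed. A quick sketch, assuming a 30-day month and a hypothetical 99.9% monthly availability target:

```python
outage_minutes = 47
month_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day month

# Availability for the month, counting only this incident's downtime.
availability = 1 - outage_minutes / month_minutes
print(f"{availability:.4%}")

# A 99.9% monthly target allows roughly 43.2 minutes of downtime, so this
# single 47-minute incident spends more than the whole month's error budget.
budget_minutes = month_minutes * (1 - 0.999)
print(f"{budget_minutes:.1f} minutes of budget")
```

Stating impact this way ("consumed 109% of the monthly error budget") is far harder to wave off than "the site was down for a bit".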
### Contributing Factors

List every factor that contributed to the incident occurring or lasting longer than it should have. Frame these as system failures, not personal failures.

```
- Certificate expiry was tracked in a spreadsheet with no automated alerting
- The payment service had no fallback path when TLS negotiation fails
- Runbook for certificate rotation was last updated 18 months ago and
  referenced a deprecated tool
- On-call engineer did not have permissions to rotate certificates,
  requiring escalation
```

### Root Cause

Identify the deepest systemic cause. The root cause is never "someone made a mistake" — it is the system condition that allowed the mistake to have impact.

```
Root cause: Certificate lifecycle management relied on manual tracking without
automated expiry alerts or rotation. The system had no defense against expiry
because it was treated as a one-time setup rather than an ongoing concern.
```

### Action Items

Every action item must have an owner, a deadline, and a priority. Use this format:

| Priority | Action Item | Owner | Deadline | Ticket |
|----------|-------------|-------|----------|--------|
| **P0** | Add automated certificate expiry alerting (30/14/7 day warnings) | @platform-team | 2024-03-19 | OPS-891 |
| **P1** | Implement certificate auto-rotation for payment service | @platform-team | 2024-04-01 | OPS-892 |
| **P1** | Grant on-call engineers certificate rotation permissions | @security-team | 2024-03-15 | SEC-234 |
| **P2** | Add TLS handshake failure to checkout service health check | @checkout-team | 2024-04-15 | CHK-567 |

Priority definitions: **P0** — before next on-call rotation, prevents recurrence. **P1** — within 2 weeks, reduces severity or detection time. **P2** — within 30 days, improves resilience or observability.

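The 30/14/7-day warning scheme in the P0 item above reduces to a small threshold check. A sketch of that alerting logic (the `warning_level` helper is hypothetical, and how the certificate expiry date is fetched is deliberately left out):

```python
from datetime import date

# Warning thresholds from the P0 action item: alert at 30, 14, and 7 days out.
THRESHOLDS = [30, 14, 7]

def warning_level(expiry: date, today: date):
    """Return the tightest threshold crossed, or None if no alert is due yet."""
    days_left = (expiry - today).days
    if days_left < 0:
        return "EXPIRED"
    crossed = [t for t in THRESHOLDS if days_left <= t]
    return f"{min(crossed)}-day warning" if crossed else None

today = date(2024, 3, 12)
print(warning_level(date(2024, 3, 22), today))  # "14-day warning" (10 days left)
print(warning_level(date(2024, 4, 30), today))  # None (49 days left, no alert)
print(warning_level(date(2024, 3, 11), today))  # "EXPIRED"
```

Escalating severity as thresholds tighten (ticket at 30 days, page at 7) keeps the alert actionable without being noisy.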
### Lessons Learned

Include three categories:

- **What went well**: Response actions, tools, or processes that worked as intended
- **What went poorly**: Gaps that made the incident worse or slower to resolve
- **Where we got lucky**: Things that could have made this much worse but didn't

## Quality checklist

Before delivering the postmortem, verify:

- [ ] Summary is understandable by someone outside the engineering team
- [ ] Timeline uses UTC and includes detection lag
- [ ] Impact section contains actual numbers, not "some users were affected"
- [ ] Contributing factors describe system failures, not individual mistakes
- [ ] Root cause identifies a systemic issue, not "human error"
- [ ] Every action item has an owner, deadline, priority, and ticket reference
- [ ] At least one P0 action item exists that prevents immediate recurrence
- [ ] Lessons learned include all three categories (well, poorly, lucky)

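The structural items on this checklist can be linted mechanically before the human review pass. A minimal sketch, assuming the section names from the template above (the `lint_postmortem` helper is illustrative, not part of this skill):

```python
import re

REQUIRED_SECTIONS = [
    "Incident Summary", "Timeline", "Impact", "Contributing Factors",
    "Root Cause", "Action Items", "Lessons Learned",
]

def lint_postmortem(text: str) -> list[str]:
    """Return a list of structural problems found in a postmortem draft."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not re.search(rf"^###?\s+{re.escape(section)}\s*$", text, re.MULTILINE):
            problems.append(f"missing section: {section}")
    if "P0" not in text:
        problems.append("no P0 action item")
    if re.search(r"\d{2}:\d{2} UTC", text) is None:
        problems.append("timeline has no UTC timestamps")
    return problems

draft = "### Incident Summary\n...\n### Timeline\n14:32 UTC [ONSET] ...\n"
print(lint_postmortem(draft))  # flags the five missing sections and the absent P0 item
```

A linter like this only catches absence, not quality; "Summary is understandable by an outsider" still needs a human reader.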
## Common mistakes to avoid

- **Blaming individuals**. "John forgot to renew the certificate" is a blame statement. "Certificate renewal depended on manual tracking with no automated alerts" is a system observation. Always describe the system gap, not the person.
- **Vague action items**. "Improve monitoring" is not actionable. "Add PagerDuty alert when certificate expiry is within 30 days (OPS-891, @platform-team, due 2024-03-19)" is actionable.
- **Missing the detection gap**. Always call out how long the incident was occurring before anyone noticed. An issue that took 2 minutes to fix but 45 minutes to detect is a monitoring problem, not just an infrastructure problem.
- **Action items without owners**. An action item assigned to a team mailing list or "TBD" will not get done. Every item needs a specific person or team lead who is accountable.
- **Skipping "where we got lucky"**. This section surfaces near-misses that deserve preventive action even though they didn't cause damage this time.