
Write blameless incident postmortems with structured timeline reconstruction, impact quantification, contributing factor analysis, and actionable follow-up items with owners and deadlines.

Tags: incidents, postmortem, blameless, reliability, incident-response

# Incident Postmortem

## Before you start

Gather the following from the user:

1. **What happened?** (Service name, symptoms, error messages, alerts that fired)
2. **When did it happen?** (Detection time, start time if known, resolution time — all in UTC)
3. **Who was involved?** (On-call responder, escalation chain, any external parties)
4. **What was the blast radius?** (Affected users, regions, services, revenue impact)
5. **What fixed it?** (Mitigation steps taken, in order)

If the user gives you a vague summary ("the site went down for a bit"), push back: "What specific errors did users see? Which services were affected? When exactly did alerts fire vs. when was the issue resolved?"
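
The intake questions above can be captured as a simple structure so nothing is missed before drafting. A minimal sketch in Python; the field names and the `missing_fields` helper are illustrative, not part of any incident tooling:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentIntake:
    """Facts to collect before writing a postmortem (names are illustrative)."""
    service: str                    # what happened: affected service
    symptoms: str                   # error messages, alerts that fired
    started_at_utc: str             # incident start, if known (UTC)
    detected_at_utc: str            # first alert / detection time (UTC)
    resolved_at_utc: str            # resolution time (UTC)
    responders: list = field(default_factory=list)        # on-call + escalation chain
    blast_radius: str = ""          # affected users, regions, revenue impact
    mitigation_steps: list = field(default_factory=list)  # fixes, in order

    def missing_fields(self) -> list:
        """Names of required fields still empty — use these to push back."""
        required = {
            "service": self.service,
            "symptoms": self.symptoms,
            "detected_at_utc": self.detected_at_utc,
            "resolved_at_utc": self.resolved_at_utc,
        }
        return [name for name, value in required.items() if not value.strip()]
```

Anything returned by `missing_fields` maps directly to a follow-up question from the list above.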

## Postmortem template

Use the following structure for every postmortem:

### Incident Summary

Write 3-5 sentences covering: what broke, who was affected, how long it lasted, and how it was resolved. This should be understandable by someone outside the team.

```
On 2024-03-12 at 14:32 UTC, the checkout service began returning 500 errors
for all payment processing requests. Approximately 12,000 users were unable to
complete purchases during the 47-minute outage. The issue was caused by an
expired TLS certificate on the payment gateway. Service was restored at 15:19
UTC by rotating the certificate.
```

### Timeline

Use UTC timestamps. Include detection lag (time between incident start and first alert). Mark each entry with a category tag.

```
14:32 UTC [ONSET] First 500 errors appear in payment service logs
14:38 UTC [DETECTION] PagerDuty alert fires for checkout error rate > 5%
14:40 UTC [RESPONSE] On-call engineer acknowledges alert
14:45 UTC [DIAGNOSIS] Engineer identifies TLS handshake failures in logs
14:52 UTC [ESCALATION] Platform team paged for certificate access
15:10 UTC [MITIGATION] New certificate issued and deployed to staging
15:15 UTC [MITIGATION] Certificate deployed to production
15:19 UTC [RESOLUTION] Error rates return to baseline, incident closed
```
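
Detection lag and total duration fall straight out of entries in this format. A sketch, assuming the `[TAG]` vocabulary shown above and all timestamps on a single UTC day:

```python
import re
from datetime import datetime, timezone

# Matches lines like "14:32 UTC [ONSET] First 500 errors appear ..."
TIMELINE_LINE = re.compile(r"^(\d{2}:\d{2}) UTC \[(\w+)\] (.+)$")

def parse_timeline(lines, day="2024-03-12"):
    """Parse 'HH:MM UTC [TAG] text' entries; assumes all fall on one UTC day."""
    entries = []
    for line in lines:
        m = TIMELINE_LINE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{day} {m.group(1)}", "%Y-%m-%d %H:%M")
            entries.append((ts.replace(tzinfo=timezone.utc), m.group(2), m.group(3)))
    return entries

def minutes_between(entries, start_tag, end_tag):
    """Minutes from the first start_tag entry to the first end_tag entry."""
    start = next(t for t, tag, _ in entries if tag == start_tag)
    end = next(t for t, tag, _ in entries if tag == end_tag)
    return (end - start).total_seconds() / 60
```

With the example timeline, `minutes_between(entries, "ONSET", "DETECTION")` gives the 6-minute detection lag and `minutes_between(entries, "ONSET", "RESOLUTION")` the 47-minute outage.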

### Impact

Quantify impact with actual numbers, not vague language:

- **Duration**: Total outage time (onset to resolution) and user-facing downtime
- **Users affected**: Count or percentage, segmented if possible
- **Revenue impact**: Lost transactions, failed payments, SLA credits issued
- **Downstream effects**: Other services or teams that were impacted
- **Detection time**: How long between onset and first alert
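
Two of these numbers are simple arithmetic. A sketch, assuming a 30-day month as the availability window; the $40 average order value in the usage note is a made-up figure, not from the example incident:

```python
def availability_pct(outage_minutes: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Availability over the reporting period as a percentage (default: 30-day month)."""
    return 100 * (1 - outage_minutes / period_minutes)

def revenue_impact(failed_orders: int, avg_order_value: float) -> float:
    """Rough lost-revenue estimate: failed orders times average order value."""
    return failed_orders * avg_order_value
```

For the 47-minute example outage, `availability_pct(47)` is roughly 99.89% for the month, and `revenue_impact(12_000, 40.0)` would estimate $480,000 at the hypothetical $40 average order.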

### Contributing Factors

List every factor that contributed to the incident occurring or lasting longer than it should have. Frame these as system failures, not personal failures.

```
- Certificate expiry was tracked in a spreadsheet with no automated alerting
- The payment service had no fallback path when TLS negotiation failed
- Runbook for certificate rotation was last updated 18 months ago and
  referenced a deprecated tool
- On-call engineer did not have permissions to rotate certificates,
  requiring escalation
```
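
The first factor above (expiry tracked in a spreadsheet with no alerting) is the kind of gap that is cheap to close in code. A sketch of threshold-based warning logic; the 30/14/7-day thresholds are an assumption borrowed from this template's example action items:

```python
from datetime import date

WARNING_DAYS = (30, 14, 7)  # example thresholds; tune to your rotation lead time

def expiry_warnings(expires_on: date, today: date, thresholds=WARNING_DAYS):
    """Return the warning thresholds already crossed for a certificate expiry date."""
    days_left = (expires_on - today).days
    return [t for t in thresholds if days_left <= t]
```

An empty list means no alert yet; each crossed threshold can page at escalating severity.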

### Root Cause

Identify the deepest systemic cause. The root cause is never "someone made a mistake" — it is the system condition that allowed the mistake to have impact.

```
Root cause: Certificate lifecycle management relied on manual tracking without
automated expiry alerts or rotation. The system had no defense against expiry
because it was treated as a one-time setup rather than an ongoing concern.
```

### Action Items

Every action item must have an owner, a deadline, a priority, and a ticket reference. Use this format:

| Priority | Action Item | Owner | Deadline | Ticket |
|----------|-------------|-------|----------|--------|
| **P0** | Add automated certificate expiry alerting (30/14/7 day warnings) | @platform-team | 2024-03-19 | OPS-891 |
| **P1** | Implement certificate auto-rotation for payment service | @platform-team | 2024-04-01 | OPS-892 |
| **P1** | Grant on-call engineers certificate rotation permissions | @security-team | 2024-03-15 | SEC-234 |
| **P2** | Add TLS handshake failure to checkout service health check | @checkout-team | 2024-04-15 | CHK-567 |

Priority definitions: **P0** — before next on-call rotation, prevents recurrence. **P1** — within 2 weeks, reduces severity or detection time. **P2** — within 30 days, improves resilience or observability.
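
Rows in this format can be checked mechanically before the postmortem ships. A sketch; the regex assumes exactly the five-column layout above, with `@handle` owners, ISO dates, and `ABC-123` ticket keys:

```python
import re

# One pattern per column of the action-item table.
ROW = re.compile(
    r"^\|\s*\*\*(P[0-2])\*\*\s*"    # priority: **P0**..**P2**
    r"\|\s*(.+?)\s*"                # action item text
    r"\|\s*(@[\w-]+)\s*"            # owner: a specific @handle, never TBD
    r"\|\s*(\d{4}-\d{2}-\d{2})\s*"  # deadline: ISO date
    r"\|\s*([A-Z]+-\d+)\s*\|$"      # ticket reference, e.g. OPS-891
)

def valid_action_row(row: str) -> bool:
    """True if a markdown table row carries priority, owner, deadline, and ticket."""
    return ROW.match(row.strip()) is not None
```

A row like `| **P1** | Improve monitoring | TBD | soon | |` fails on both owner and deadline, which is exactly the point.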

### Lessons Learned

Include three categories:

- **What went well**: Response actions, tools, or processes that worked as intended
- **What went poorly**: Gaps that made the incident worse or slower to resolve
- **Where we got lucky**: Things that could have made this much worse but didn't

## Quality checklist

Before delivering the postmortem, verify:

- [ ] Summary is understandable by someone outside the engineering team
- [ ] Timeline uses UTC and includes detection lag
- [ ] Impact section contains actual numbers, not "some users were affected"
- [ ] Contributing factors describe system failures, not individual mistakes
- [ ] Root cause identifies a systemic issue, not "human error"
- [ ] Every action item has an owner, deadline, priority, and ticket reference
- [ ] At least one P0 action item exists that prevents immediate recurrence
- [ ] Lessons learned include all three categories (well, poorly, lucky)
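
The structural half of this checklist can be automated. A minimal sketch that only verifies the template headings are present — it cannot judge their quality; the section names mirror the template above:

```python
REQUIRED_SECTIONS = (
    "Incident Summary", "Timeline", "Impact", "Contributing Factors",
    "Root Cause", "Action Items", "Lessons Learned",
)

def missing_sections(postmortem_md: str) -> list:
    """Template headings absent from a draft postmortem."""
    return [s for s in REQUIRED_SECTIONS if f"### {s}" not in postmortem_md]
```

Run it on the draft before the human review: an empty result means every section exists, not that every section is good.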

## Common mistakes to avoid

- **Blaming individuals**. "John forgot to renew the certificate" is a blame statement. "Certificate renewal depended on manual tracking with no automated alerts" is a system observation. Always describe the system gap, not the person.
- **Vague action items**. "Improve monitoring" is not actionable. "Add PagerDuty alert when certificate expiry is within 30 days (OPS-891, @platform-team, due 2024-03-19)" is actionable.
- **Missing the detection gap**. Always call out how long the incident was occurring before anyone noticed. A 2-minute outage with a 45-minute detection gap is a monitoring problem, not just an infrastructure problem.
- **Action items without owners**. An action item assigned to a team mailing list or "TBD" will not get done. Every item needs a specific person or team lead who is accountable.
- **Skipping "where we got lucky"**. This section surfaces near-misses that deserve preventive action even though they didn't cause damage this time.
