Incident review
A postmortem template that produces actions, not theater.
Use this when
- You had an outage, data loss, security incident, or serious degradation.
- You felt lucky that it didn’t get worse.
- The same class of failure keeps repeating.
Inputs
- Timeline (UTC timestamps)
- Impact (users, money, data, trust)
- Logs, traces, dashboards, and relevant deploys
- Who was on call and who was paged
Checklist
1) Summary (the one-paragraph version)
- What happened?
- Who was impacted and how?
- When did it start and when did it end?
2) Detection
- How did we notice (monitoring, user report, luck)?
- What signal did we miss?
- What would have detected it earlier?
3) Timeline
- Facts only. No conclusions in the timeline.
- Include deploys, config changes, and paging events.
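One way to keep the timeline honest is to store each entry as a small record and render it in UTC. The following is a minimal sketch, not a prescribed format: the `TimelineEvent` type, the `KINDS` set (including the extra "observation" kind), and the sample events are all hypothetical and only illustrate the rules above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical event kinds. The template names deploys, config changes, and
# paging events; "observation" is an assumed catch-all for other facts.
KINDS = {"deploy", "config_change", "page", "observation"}


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime  # must be timezone-aware so it can be converted to UTC
    kind: str     # one of KINDS
    fact: str     # what happened, with no interpretation

    def __post_init__(self):
        # Reject naive timestamps: their UTC value would be a guess.
        if self.at.tzinfo is None or self.at.utcoffset() is None:
            raise ValueError("timestamp must be timezone-aware")
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind}")


def render_timeline(events):
    """Return timeline lines sorted by time, with every timestamp in UTC."""
    ordered = sorted(events, key=lambda e: e.at)
    return [
        f"{e.at.astimezone(timezone.utc):%Y-%m-%d %H:%M}Z [{e.kind}] {e.fact}"
        for e in ordered
    ]


if __name__ == "__main__":
    # Invented sample data: a page recorded in UTC and a deploy recorded
    # in a UTC+1 local time.
    events = [
        TimelineEvent(
            datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
            "page",
            "On-call paged for elevated error rate",
        ),
        TimelineEvent(
            datetime(2024, 1, 1, 13, 0, tzinfo=timezone(timedelta(hours=1))),
            "deploy",
            "Release deployed to production",
        ),
    ]
    for line in render_timeline(events):
        print(line)
```

Entries recorded in different local times sort correctly because aware datetimes compare by instant, so the deploy in the sample lands before the page once both are shown in UTC.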
4) Root cause (and contributing factors)
- Root cause: the smallest set of conditions that made the incident possible.
- Contributing factors: everything that made it worse or harder to recover.
- Separate “trigger” (the event that set the incident off) from “root cause” (why that event could do damage). The trigger is rarely the real story.
5) What went well
- What actions reduced impact?
- What tooling or process actually helped?
6) What didn’t go well
- Where did we waste time?
- What information was missing?
- Where were handoffs unclear?
7) Action items (the only part that really matters)
- Write actions as changes to the system, not reminders to be careful.
- Each action has an owner, a deadline, and a measurable outcome.
- Prefer small, reversible mitigations first, then structural fixes.
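The rules above can be checked mechanically before the review closes. This is a rough sketch under stated assumptions: the `ActionItem` fields, the `problems` helper, and the `REMINDER_PHRASES` list are hypothetical, and the check only confirms that an outcome is written down, not that it is truly measurable.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    change: str                      # the change to the system
    owner: Optional[str] = None      # one named person
    deadline: Optional[date] = None  # a real date, not "soon"
    outcome: Optional[str] = None    # how we will know the change worked


# Hypothetical phrases that suggest a reminder rather than a system change.
REMINDER_PHRASES = ("be careful", "be more careful", "remember to", "pay attention")


def problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item is not ready, or an empty list."""
    found = []
    if not (item.owner or "").strip():
        found.append("missing owner")
    if item.deadline is None:
        found.append("missing deadline")
    if not (item.outcome or "").strip():
        found.append("missing measurable outcome")
    lowered = item.change.lower()
    if any(phrase in lowered for phrase in REMINDER_PHRASES):
        found.append("reads as a reminder, not a system change")
    return found


if __name__ == "__main__":
    # Invented examples: one complete action and one reminder in disguise.
    ready = ActionItem(
        change="Add an alert on queue depth",
        owner="dana",
        deadline=date(2025, 1, 31),
        outcome="Alert fires when the incident is replayed in staging",
    )
    vague = ActionItem(change="Remember to check the dashboard before deploying")
    print(problems(ready))
    print(problems(vague))
```

A phrase list is a blunt filter. It catches the obvious “be careful” items, and the rest still needs a human to ask whether the system itself changes.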
Output
- A short summary you could publish without embarrassment (even if you keep it internal).
- A prioritized action list that fits into real sprint capacity.