Checklist

Incident review

A postmortem template that produces actions, not theater.


Use this when

You had an outage, data loss, security incident, or serious degradation.
You felt lucky that it didn’t get worse.
The same class of failure keeps repeating.

Inputs

Timeline (UTC timestamps)
Impact (users, money, data, trust)
Logs, traces, dashboards, and relevant deploys
Who was on call and who was paged

Checklist

1) Summary (the one-paragraph version)

What happened?
Who was impacted and how?
When did it start and when did it end?

2) Detection

How did we notice (monitoring, user report, luck)?
What signal did we miss?
What would have detected it earlier?

3) Timeline

Facts only. No conclusions in the timeline.
Include deploys, config changes, and paging events.
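One way to keep the timeline factual and consistently ordered is to record each entry as structured data and render it from there. The sketch below is only an illustration, not a prescribed format: the `TimelineEvent` fields, the `kind` labels, and the output layout are all assumptions. It rejects timestamps without a timezone and prints everything in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """One timeline entry (hypothetical structure, for illustration only)."""
    at: datetime  # must be timezone-aware
    kind: str     # assumed labels, e.g. "deploy", "config", "page", "observation"
    fact: str     # what happened, stated without interpretation

    def __post_init__(self):
        # Naive timestamps are ambiguous across teams and regions, so refuse them.
        if self.at.tzinfo is None or self.at.utcoffset() is None:
            raise ValueError("timestamp must be timezone-aware")


def render_timeline(events):
    """Return one line per event, sorted by time and normalized to UTC."""
    ordered = sorted(events, key=lambda e: e.at)
    return [
        f"{e.at.astimezone(timezone.utc):%Y-%m-%d %H:%M}Z [{e.kind}] {e.fact}"
        for e in ordered
    ]
```

Keeping interpretation out of the `fact` field is still a human discipline; the structure only makes the ordering and the timezone handling hard to get wrong.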

4) Root cause (and contributing factors)

Root cause: the smallest set of conditions that made the incident possible.
Contributing factors: everything that made it worse or harder to recover.
Separate “trigger” from “root cause”. The trigger is rarely the real story.

5) What went well

What actions reduced impact?
What tooling or process actually helped?

6) What didn’t go well

Where did we waste time?
What information was missing?
Where were handoffs unclear?

7) Action items (the only part that really matters)

Write actions as changes to the system, not reminders to be careful.
Each action has: owner, deadline, and measurable outcome.
Prefer small, reversible mitigations first, then structural fixes.
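The rules above can be checked mechanically before an action item is accepted. The following is a minimal sketch under stated assumptions: the `ActionItem` fields mirror the three required attributes, while the `VAGUE_PHRASES` list is a made-up heuristic for spotting "be careful" reminders and would need tuning for a real team.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical record for one action item."""
    change: str                # the change to the system
    owner: str                 # who is responsible
    deadline: Optional[date]   # when it is due
    outcome: str               # how we will know it worked


# Assumed heuristic: phrases that suggest a reminder rather than a system change.
VAGUE_PHRASES = ("be more careful", "remember to", "pay attention", "double-check")


def problems(item):
    """Return the reasons an action item is not ready to accept (empty if none)."""
    found = []
    if not item.owner.strip():
        found.append("missing owner")
    if item.deadline is None:
        found.append("missing deadline")
    if not item.outcome.strip():
        found.append("missing measurable outcome")
    lowered = item.change.lower()
    if any(phrase in lowered for phrase in VAGUE_PHRASES):
        found.append("reads as a reminder, not a system change")
    return found
```

A check like this cannot judge whether an outcome is truly measurable; it only catches items where the field was left empty.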

Output

A short summary you could publish without embarrassment, even if it stays internal.
A prioritized action list that fits into real sprint capacity.
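As a rough illustration of fitting the action list into sprint capacity, the sketch below walks the actions in priority order and takes each one that still fits, deferring the rest. The tuple layout, the use of point estimates, and the greedy strategy are all assumptions; a team may well prefer to negotiate the cut line by hand.

```python
def plan_sprint(actions, capacity_points):
    """Split actions into (selected, deferred) for one sprint.

    `actions` is a list of (name, priority, estimate_points) tuples, where a
    lower priority number means more important (an assumed convention).
    Actions are visited in priority order; each one is selected if its
    estimate fits in the remaining capacity, otherwise it is deferred.
    """
    selected, deferred = [], []
    remaining = capacity_points
    for name, _priority, estimate in sorted(actions, key=lambda a: a[1]):
        if estimate <= remaining:
            selected.append(name)
            remaining -= estimate
        else:
            deferred.append(name)
    return selected, deferred
```

Because a large item that does not fit is deferred rather than blocking the queue, smaller lower-priority mitigations can still land in the same sprint, which matches the preference for small, reversible changes first.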