
To write a postmortem after an incident, do five things in order: reconstruct a minute-by-minute timeline from logs and chat, run a five-whys analysis to find the system-level root cause (not the human who pressed the button), write a customer-facing summary separate from the internal report, list action items with named owners and dated commitments, and publish the document somewhere searchable within five business days. The goal is not blame. The goal is a permanent, queryable record of how your system fails so the next incident is shorter.
Most postmortems fail because they get written by the engineer who broke prod, late on a Friday, under duress, and read like an apology. A good postmortem reads like a forensic report: dry, specific, full of timestamps, with the human emotion stripped out. This guide covers the format, a reusable template, and the five mistakes that turn postmortems into theater.
Three things changed in the last 24 months. First, deploys got faster: the median team on Vercel, Render, or Fly ships 8 to 15 times a day in 2026, up from 2 to 3 in 2022. More deploys, more blast radius.
Second, AI-assisted coding shortened the gap between intent and code. A Cursor or Claude Code session ships a 400-line PR in 90 minutes. The postmortem now has to investigate not just "why did the engineer write this" but "why did the AI suggest this, and why did the reviewer trust it".
Third, on-call rotations got smaller. Teams of 4 to 8 engineers operate systems that used to need 20. Every incident has to produce a permanent learning, because you cannot afford to relearn the same lesson twice.
Most teams reach for a template that looks like this:
What happened: We had a database outage.
Why: Someone ran a bad migration.
Action items: Be more careful with migrations.
Owner: The team.
This is a confession, not a postmortem. It has zero forensic value, blames a person without naming them, and produces an action item nobody owns. Three months later the same thing happens, and nobody can find the document.
The failure mode is structural: the format does not force specificity. A real postmortem has five sections that each take real work to fill in.
Open Slack, your alerting tool (PagerDuty, Opsgenie, Grafana OnCall), your deploy log, and your error tracker (Sentry, Honeycomb, Datadog). Reconstruct the incident minute by minute, with UTC timestamps.
A good timeline entry has three parts: the timestamp, the event, and the source. Like this:
14:03:12 UTC. Vercel deploy #1287 completed. Source: Vercel deploy log.
14:03:47 UTC. First 500 errors appear in Sentry. 12 events/min. Source: Sentry alert #4421.
14:05:01 UTC. PagerDuty alerts on-call (Priya). Source: PagerDuty incident #87.
14:07:33 UTC. Priya acknowledges, opens #incident-2026-05-18 in Slack. Source: Slack.
14:11:09 UTC. Priya identifies deploy #1287 as suspect. Asks team in Slack. Source: Slack.
14:14:50 UTC. Team confirms rollback. Vercel rollback initiated. Source: Vercel deploy log.
14:16:22 UTC. Rollback complete. 500 rate returns to baseline. Source: Sentry.
14:21:00 UTC. Incident declared resolved. Source: Slack.
Notice the discipline: every line has a source. If you can't cite the source, the entry doesn't go in. This is the single most valuable habit in postmortem writing, because it forces you to look at actual data instead of writing from memory. Memory is wrong about 40% of the time on the second day after an incident.
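The "no source, no entry" rule is easy to enforce mechanically. Below is a minimal sketch, assuming entries are written in the `HH:MM:SS UTC. <event> Source: <source>.` shape used above. The `TimelineEntry` type and `parse_entry` helper are hypothetical names for illustration, not part of any tool mentioned here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime  # timezone-aware UTC timestamp of the event
    event: str    # what happened
    source: str   # where the evidence lives (log, alert, chat)


def parse_entry(line: str, day: str) -> TimelineEntry:
    """Parse one 'HH:MM:SS UTC. <event> Source: <source>.' line.

    `day` is the incident date as YYYY-MM-DD. Raises ValueError when the
    line has no source or no UTC timestamp, so uncited entries are rejected.
    """
    head, sep, source = line.rpartition(" Source: ")
    if not sep or not source.strip(" ."):
        raise ValueError(f"timeline entry has no source: {line!r}")
    stamp, sep, event = head.partition(" UTC. ")
    if not sep:
        raise ValueError(f"timeline entry has no UTC timestamp: {line!r}")
    at = datetime.strptime(f"{day} {stamp}", "%Y-%m-%d %H:%M:%S")
    return TimelineEntry(
        at=at.replace(tzinfo=timezone.utc),
        event=event.strip(),
        source=source.strip().rstrip("."),
    )


entry = parse_entry(
    "14:03:12 UTC. Vercel deploy #1287 completed. Source: Vercel deploy log.",
    day="2026-05-18",
)
print(entry.at.isoformat(), "|", entry.source)
```

A line written from memory, with no `Source:` part, raises instead of silently entering the record.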
Three numbers fall out of a good timeline for free: time to detect, time to acknowledge, and time to resolve. Averaged across incidents, these become MTTD, MTTA, and MTTR. Track them per incident. Trends over a quarter tell you more about ops health than any vendor dashboard.
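As a rough sketch, the three durations for the example incident can be computed straight from the timeline. Definitions vary between teams, so the choices here are assumptions: impact starts at the first Sentry errors, detection is the PagerDuty alert, and resolution is the moment the error rate returned to baseline. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc(hms: str, day: str = "2026-05-18") -> datetime:
    """Build a UTC datetime from an HH:MM:SS timeline stamp."""
    parsed = datetime.strptime(f"{day} {hms}", "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def incident_durations(
    impact_start: datetime,
    detected: datetime,
    acknowledged: datetime,
    resolved: datetime,
) -> dict[str, timedelta]:
    """Per-incident durations; average them over a quarter for the means."""
    return {
        "time_to_detect": detected - impact_start,
        "time_to_acknowledge": acknowledged - detected,
        "time_to_resolve": resolved - impact_start,
    }


durations = incident_durations(
    impact_start=utc("14:03:47"),   # first 500 errors in Sentry
    detected=utc("14:05:01"),       # PagerDuty alerts on-call
    acknowledged=utc("14:07:33"),   # on-call acknowledges
    resolved=utc("14:16:22"),       # rollback complete, errors at baseline
)
for name, value in durations.items():
    print(f"{name}: {value}")
```

Whichever definitions you pick, write them down once and keep them fixed, otherwise the quarterly trend compares different things.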
The "five whys" technique is famous and frequently misused. The point is not to ask "why" five times and write down whatever sounds reasonable. The point is to keep asking until you stop blaming a person and start blaming a system.
Worked example using the deploy above:
null on an unseeded test database.
Root cause: stale test fixtures, not the engineer. The action item is "audit test fixtures against current account states," not "be more careful." See our writeup on setting up E2E testing for a SaaS for how the fixture problem usually starts.
The blameless framing matters. Saying "the engineer should have known" is functionally the same as saying "we accept that a human will hit this trap again." Saying "our fixtures don't represent reality" is something you can fix in code and prevent permanently. Pick the framing that produces a permanent fix.
The internal postmortem is for engineers. It has names, swears, screenshots of broken SQL, and honest assessments of who was tired. The customer-facing summary is for users, journalists, enterprise security reviewers, and your support team. They are different documents, full stop.
A customer-facing summary has four short sections:
| Section | What goes in | What does NOT go in |
|---|---|---|
| What happened | Plain-language description of the user impact, with start and end times in UTC. | Internal codenames, employee names, tool names. |
| Who was affected | Percentage of users, regions, plans. | "All paying customers in tier 3 of SKU 4421" kind of jargon. |
| What we did | The remediation step (rolled back, scaled up, patched). | The five-whys analysis. |
| What we are doing to prevent it | The top 2 action items, plainly stated, with dates. | The full action item list and owners. |
Keep it under 400 words. Publish it within 48 hours. If you operate B2B, your enterprise contracts probably require this anyway. If you operate B2C, doing it builds trust that compounds.
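A small pre-publish check can catch a summary that is missing a section or has grown past the limit. This is a sketch only: the section names come from the table above, while the `summary_problems` helper and the dict-of-sections input shape are assumptions made for illustration.

```python
REQUIRED_SECTIONS = (
    "What happened",
    "Who was affected",
    "What we did",
    "What we are doing to prevent it",
)
MAX_WORDS = 400  # limit stated in the guide


def summary_problems(sections: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means ready to publish."""
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if not sections.get(name, "").strip()
    ]
    # Count whitespace-separated words across every section body.
    words = sum(len(text.split()) for text in sections.values())
    if words > MAX_WORDS:
        problems.append(f"too long: {words} words (limit {MAX_WORDS})")
    return problems


draft = {
    "What happened": "Checkout returned errors between 14:03 and 14:16 UTC.",
    "Who was affected": "A subset of users attempting to pay.",
    "What we did": "We rolled back the release.",
}
print(summary_problems(draft))
```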
Every action item has three fields: what, who, when. If any of those is missing, it is not an action item. It is a wish.
Bad action item: "Improve test coverage on Stripe webhooks."
Good action item: "Add an integration test for the 'no active subscription' branch of handleSubscriptionUpdated. Owner: Priya. Due: 2026-05-25. Tracked: LIN-4421."
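The what/who/when rule can be expressed as a tiny validator. The sketch below is illustrative: `ActionItem`, `missing_fields`, and the list of owners treated as "nobody" are assumptions, not an existing schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Owners that do not name a person; treated as missing (assumed list).
VAGUE_OWNERS = {"", "the team", "team", "everyone", "tbd"}


@dataclass
class ActionItem:
    what: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """List which of what/who/when are absent. Non-empty means it is a wish."""
    missing = []
    if not item.what.strip():
        missing.append("what")
    if item.owner is None or item.owner.strip().lower() in VAGUE_OWNERS:
        missing.append("who")
    if item.due is None:
        missing.append("when")
    return missing


bad = ActionItem(what="Improve test coverage on Stripe webhooks.")
good = ActionItem(
    what="Add an integration test for the 'no active subscription' "
         "branch of handleSubscriptionUpdated.",
    owner="Priya",
    due=date(2026, 5, 25),
    ticket="LIN-4421",
)
print(missing_fields(bad), missing_fields(good))
```

Note that this only checks the fields exist. Whether the "what" is specific enough is still a human review call.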
Action items should fall into three categories (prevent, detect, and contain), and your postmortem should have at least one from each.
If you only ever fix the specific bug, you'll patch this incident and miss the class. The "detect faster" and "contain" items are where compounding ops maturity comes from. Our guide on how to handle Stripe webhooks correctly is essentially a list of contain-the-blast-radius patterns generalized from real postmortems.
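A coverage check for the three categories might look like the sketch below. The category names follow the Prevent / Detect / Contain rows of the template in this guide. The `missing_categories` helper is a hypothetical name.

```python
REQUIRED_TYPES = ("prevent", "detect", "contain")


def missing_categories(action_types: list[str]) -> list[str]:
    """Return the required categories with no action item, in a fixed order."""
    seen = {t.strip().lower() for t in action_types}
    return [t for t in REQUIRED_TYPES if t not in seen]


# A postmortem that only fixes the specific bug:
print(missing_categories(["Prevent", "Prevent"]))
```

Running it as a review step makes "we only patched the bug" visible before the document is published.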
Publish the finished postmortem to one searchable place. Notion, Linear docs, a /postmortems folder in your monorepo, a private Confluence space. Pick one and never deviate. Tag every postmortem with the systems involved (stripe, auth, database, deploy-pipeline) so future engineers can find prior art when they get paged on the same surface.
Once a quarter, read every postmortem from the past 90 days in one sitting. Patterns jump out that you can't see one at a time. We have seen teams discover that 6 of 9 incidents in a quarter all traced back to the same migration tool, which the team then replaced. That replacement decision is only legible when you read the postmortems as a corpus.
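If every postmortem carries system tags, the quarterly read can start from a simple tally. This is a minimal sketch with made-up data: the file names and the `migration-tool` tag are hypothetical, and `recurring_systems` is an illustrative helper, not a feature of Notion, Linear, or Confluence.

```python
from collections import Counter


def recurring_systems(
    postmortems: dict[str, list[str]], min_count: int = 2
) -> list[tuple[str, int]]:
    """Count how many postmortems mention each tag; keep the repeat offenders."""
    # set(tags) so a tag repeated inside one document counts once.
    counts = Counter(tag for tags in postmortems.values() for tag in set(tags))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(tag, n) for tag, n in ranked if n >= min_count]


# Hypothetical quarter: document name -> system tags.
quarter = {
    "2026-03-02-db-outage": ["database", "migration-tool"],
    "2026-03-19-checkout-errors": ["stripe", "migration-tool"],
    "2026-04-07-login-loop": ["auth"],
}
print(recurring_systems(quarter))
```

The tally only points at where to look. The reading in one sitting is still what turns a repeated tag into a decision.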
# Incident YYYY-MM-DD: [Short title]
**Severity**: SEV-1 / SEV-2 / SEV-3
**Detected**: HH:MM UTC
**Resolved**: HH:MM UTC
**Duration**: NN minutes
**Customer impact**: [Plain-language summary, 1 sentence]
**Author**: [Name]
**Reviewers**: [Names]
## Timeline (UTC)
| Time | Event | Source |
|---|---|---|
| HH:MM | ... | ... |
## What happened
[2-4 paragraphs of forensic narrative. Past tense. Names of systems, not people.]
## Five whys
1. Why did X happen? Because...
2. Why? Because...
3. Why? Because...
4. Why? Because...
5. Why? Because... [← This is the root cause.]
**Root cause**: [One sentence, system-level.]
## Customer-facing summary
[The exact text you would publish on status.yourdomain.com. ≤400 words.]
## Action items
| Type | Action | Owner | Due | Ticket |
|---|---|---|---|---|
| Prevent | ... | ... | YYYY-MM-DD | LIN-xxxx |
| Detect | ... | ... | YYYY-MM-DD | LIN-xxxx |
| Contain | ... | ... | YYYY-MM-DD | LIN-xxxx |
## What went well
[Yes, really. List 2-3 things. The fast rollback. The clean alert. The teammate who jumped in. This matters for morale and for spotting practices to formalize.]
## Open questions
[Anything still unknown. Assign each to a person to investigate.]
If a Cadence senior engineer ($1,500/week) is the on-call lead during an incident, the postmortem is part of the deliverable for that week; the senior tier owns the writeup unprompted. The mid tier ($1,000/week) can write a good postmortem from a template once they've seen one.
After reading hundreds of these, five patterns show up over and over.
Not every incident needs a 1,500-word document. A reasonable threshold: any SEV-1 (customer impact, data loss, security) gets a full postmortem. SEV-2 (degraded service, internal only) gets a 200-word "lite" version with timeline and root cause. SEV-3 (caught before customer impact) gets a Slack thread that ends with "any action items?" and nothing else.
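The threshold above can be written down as a small decision rule so nobody has to argue about it mid-incident. The sketch below mirrors the three tiers as described. The boolean inputs and the `classify` helper are simplifying assumptions, and real severity calls usually need judgment.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# What each severity owes, per the thresholds in this guide.
WRITEUP = {
    Severity.SEV1: "full postmortem",
    Severity.SEV2: "lite version: timeline and root cause",
    Severity.SEV3: "Slack thread ending with 'any action items?'",
}


def classify(
    customer_impact: bool, data_loss: bool, security: bool, degraded: bool
) -> Severity:
    """Map incident facts to a severity; the worst applicable tier wins."""
    if customer_impact or data_loss or security:
        return Severity.SEV1
    if degraded:
        return Severity.SEV2
    return Severity.SEV3  # caught before customer impact


sev = classify(customer_impact=False, data_loss=False, security=False, degraded=True)
print(sev.name, "->", WRITEUP[sev])
```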
If you write postmortems for every blip, the team stops reading them. Calibrate the bar to your team's actual rate. A 4-person team that writes 8 postmortems a month is doing too many; a team that writes none is hiding incidents from itself. For a guide to the broader engineering hygiene this connects to, see our writeup on managing technical debt in a startup.
If your team is too small to own the postmortem discipline, an experienced contractor can install the practice in a week. Booking a senior Cadence engineer on a one-week engagement to set up the template, run the first three postmortems with the team, and write the runbook is a common scope. Every Cadence engineer is AI-native, vetted on Cursor / Claude / Copilot fluency, so they can do the log reconstruction work fast.
Pick one open incident from the last 30 days that never got a writeup. Use the template above. Time-box it to two hours. Publish the result somewhere your team can find it in 90 days. Then write the second one when the next incident hits, while the timeline is fresh.
If you want an honest grade on your current incident response and deploy practices before the next outage, the Cadence Ship-or-Skip audit takes about ten minutes and tells you which of your engineering hygiene gaps are actually worth fixing this quarter versus deferring. We grade postmortem discipline as one of the eight signals.
For a SEV-1, 1,500 to 2,500 words including the timeline. For a SEV-2, 400 to 800 words. The timeline is the heaviest section; the prose around it should be tight.
The on-call engineer who handled the incident writes the first draft, because they have the freshest context. A second engineer who was not involved reviews it for blame and gaps before publication. The engineering lead or manager publishes it.
Internally, within 5 business days. Customer-facing summaries within 48 hours for SEV-1. If you wait longer than a week, the action items stop feeling urgent and never get done.
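Both deadlines can be derived from the resolution time. The sketch below assumes business days mean Monday to Friday and ignores public holidays, which a real calendar would need to handle. `add_business_days` and `publication_deadlines` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by `days` weekdays (Mon-Fri). Holidays are NOT considered."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def publication_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """48 hours for the customer summary, 5 business days for the internal doc."""
    return {
        "customer_summary": resolved_at + timedelta(hours=48),
        "internal_postmortem": add_business_days(resolved_at, 5),
    }


resolved = datetime(2026, 5, 18, 14, 21, tzinfo=timezone.utc)
for name, due in publication_deadlines(resolved).items():
    print(name, due.isoformat())
```

Putting the computed dates into the incident ticket at resolution time makes the deadline visible before the urgency fades.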
Cloudflare, GitLab, and Stripe publish their postmortems publicly, and it builds trust. For most startups, publishing the customer-facing summary on a status page is enough.
Whatever your team reads daily. Notion, Linear docs, a postmortems/ folder in the monorepo as markdown, or Confluence. We prefer markdown files in the repo for teams that already do technical specs as markdown docs.
Make the postmortem part of "resolved." An incident is not closed until the postmortem is drafted. Wire it into the incident response runbook. The first three feel like overhead; by the fifth, the team will demand them.