May 17, 2026 · 11 min read · Cadence Editorial

How to write a postmortem after an incident


To write a postmortem after an incident, do five things in order: reconstruct a minute-by-minute timeline from logs and chat, run a five-whys analysis to find the system-level root cause (not the human who pressed the button), write a customer-facing summary separate from the internal report, list action items with named owners and dated commitments, and publish the document somewhere searchable within five business days. The goal is not blame. The goal is a permanent, queryable record of how your system fails so the next incident is shorter.

Most postmortems fail because they get written by the engineer who broke prod, late on a Friday, under duress, and read like an apology. A good postmortem reads like a forensic report: dry, specific, full of timestamps, with the human emotion stripped out. This guide covers the format, a reusable template, and the five mistakes that turn postmortems into theater.

Why postmortem discipline matters more in 2026

Three things changed in the last few years. First, deploys got faster: the median team on Vercel, Render, or Fly ships 8 to 15 times a day in 2026, up from 2 to 3 in 2022. More deploys mean more chances for a bad change to reach production.

Second, AI-assisted coding shortened the gap between intent and code. A Cursor or Claude Code session can ship a 400-line PR in 90 minutes. The postmortem now has to investigate not just "why did the engineer write this" but "why did the AI suggest this, and why did the reviewer trust it".

Third, on-call rotations got smaller. Teams of 4 to 8 engineers operate systems that used to need 20. Every incident has to produce a permanent learning, because you cannot afford to relearn the same lesson twice.

The default postmortem (and why it fails)

Most teams reach for a template that looks like this:

What happened: We had a database outage.
Why: Someone ran a bad migration.
Action items: Be more careful with migrations.
Owner: The team.

This is a confession, not a postmortem. It has zero forensic value, blames a person without naming them, and produces an action item nobody owns. Three months later the same thing happens, and nobody can find the document.

The failure mode is structural: the format does not force specificity. A real postmortem has five sections that each take real work to fill in.

The five sections of a real postmortem

1. Timeline reconstruction

Open Slack, your alerting tool (PagerDuty, Opsgenie, Grafana OnCall), your deploy log, and your error tracker (Sentry, Honeycomb, Datadog). Reconstruct the incident minute by minute, with UTC timestamps.

A good timeline entry has three parts: the timestamp, the event, and the source. Like this:

14:03:12 UTC. Vercel deploy #1287 completed. Source: Vercel deploy log.
14:03:47 UTC. First 500 errors appear in Sentry. 12 events/min. Source: Sentry alert #4421.
14:05:01 UTC. PagerDuty alerts on-call (Priya). Source: PagerDuty incident #87.
14:07:33 UTC. Priya acknowledges, opens #incident-2026-05-18 in Slack. Source: Slack.
14:11:09 UTC. Priya identifies deploy #1287 as suspect. Asks team in Slack. Source: Slack.
14:14:50 UTC. Team confirms rollback. Vercel rollback initiated. Source: Vercel deploy log.
14:16:22 UTC. Rollback complete. 500 rate returns to baseline. Source: Sentry.
14:21:00 UTC. Incident declared resolved. Source: Slack.

Notice the discipline: every line has a source. If you can't cite the source, the entry doesn't go in. This is the single most valuable habit in postmortem writing, because it forces you to look at actual data instead of writing from memory. Memory is wrong about 40% of the time on the second day after an incident.
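The "no source, no entry" rule is easy to enforce mechanically. Here is a minimal Python sketch, assuming your timeline lines follow the same shape as the example entries above (timestamp, event, then "Source: ..."). The `TimelineEntry` class and `parse_timeline` function are illustrative names, not part of any tool mentioned here.

```python
import re
from dataclasses import dataclass
from datetime import time

# Matches lines shaped like the example entries:
# "HH:MM:SS UTC. <event> Source: <source>."
ENTRY = re.compile(r"^(\d{2}:\d{2}:\d{2}) UTC\. (.+?) Source: (.+?)\.?$")


@dataclass
class TimelineEntry:
    at: time
    event: str
    source: str


def parse_timeline(lines):
    """Split raw lines into sourced entries and rejected (unsourced or malformed) lines."""
    entries, rejected = [], []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            # No timestamp or no "Source:" part, so the line does not go in.
            rejected.append(line)
            continue
        stamp, event, source = match.groups()
        entries.append(TimelineEntry(time.fromisoformat(stamp), event.strip(), source.strip()))
    return entries, rejected
```

Anything that lands in `rejected` goes back to the author: either find the log line or Slack message that backs it up, or drop it.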

Three numbers fall out of a good timeline for free: time to detect, time to acknowledge, and time to resolve. Track these per incident; averaged across incidents they become MTTD, MTTA, and MTTR. Trends over a quarter tell you more about ops health than any vendor dashboard.
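As a rough illustration, here is how those three durations could be computed from the example timeline. This is a sketch under stated assumptions: the incident date is taken from the Slack channel name, the incident is assumed to start at the first 500 errors, detection is the PagerDuty alert, and resolution is the moment the error rate returns to baseline. Teams define these boundaries differently, so pick your own and keep them consistent.

```python
from datetime import datetime, timezone


def utc(stamp):
    """Parse an 'HH:MM:SS' stamp on the example incident date (2026-05-18) as UTC."""
    parsed = datetime.strptime(f"2026-05-18 {stamp}", "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def incident_metrics(started, detected, acknowledged, resolved):
    """Return the per-incident durations in seconds."""
    return {
        "time_to_detect": (detected - started).total_seconds(),
        "time_to_acknowledge": (acknowledged - detected).total_seconds(),
        "time_to_resolve": (resolved - started).total_seconds(),
    }


metrics = incident_metrics(
    started=utc("14:03:47"),       # first 500 errors appear in Sentry
    detected=utc("14:05:01"),      # PagerDuty alerts on-call
    acknowledged=utc("14:07:33"),  # on-call acknowledges
    resolved=utc("14:16:22"),      # 500 rate returns to baseline
)
```

With these boundaries the example incident comes out at 74 seconds to detect, 152 seconds to acknowledge, and 755 seconds (about 12.5 minutes) to resolve.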

2. Five whys, with the brakes on

The "five whys" technique is famous and frequently misused. The point is not to ask "why" five times and write down whatever sounds reasonable. The point is to keep asking until you stop blaming a person and start blaming a system.

Worked example using the deploy above:

  • Why did the site 500 for 13 minutes? Because deploy #1287 introduced an unhandled exception in the Stripe webhook handler.
  • Why was the exception unhandled? Because the new branch called a method that returns null for accounts with no Stripe subscription, and nothing downstream checked for null.
  • Why did the code path not get caught in tests? Because the integration tests only run against seeded data, and this branch only fires for accounts with no Stripe subscription.
  • Why did the integration tests not cover that case? Because we never wrote a test for the "free-trial-not-yet-converted" account state.
  • Why is "free-trial-not-yet-converted" not in our test fixtures? Because the fixture file was last updated 14 months ago, before we introduced free trials.

Root cause: stale test fixtures, not the engineer. The action item is "audit test fixtures against current account states," not "be more careful." See our writeup on setting up E2E testing for a SaaS for how the fixture problem usually starts.
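A fixture audit like this can be a small check that runs in CI. The sketch below is hypothetical throughout: the `AccountState` values and the fixture shape are invented for illustration, and a real version would read the states from your application code and the fixtures from your fixture file.

```python
from enum import Enum


class AccountState(Enum):
    """Hypothetical account states; a real list would come from the application."""
    PAID_ACTIVE = "paid_active"
    PAID_PAST_DUE = "paid_past_due"
    FREE_TRIAL_NOT_YET_CONVERTED = "free_trial_not_yet_converted"
    CANCELLED = "cancelled"


def missing_fixture_states(fixtures):
    """Return the names of account states that no fixture covers, sorted by name."""
    covered = {fixture["state"] for fixture in fixtures}
    return sorted(state.name for state in AccountState if state.value not in covered)


# Hypothetical fixture data, last updated before free trials existed.
fixtures = [
    {"id": "acct_1", "state": "paid_active"},
    {"id": "acct_2", "state": "paid_past_due"},
    {"id": "acct_3", "state": "cancelled"},
]

missing = missing_fixture_states(fixtures)
```

Here `missing` contains only `FREE_TRIAL_NOT_YET_CONVERTED`. Failing the build when that list is non-empty turns "keep fixtures current" from a habit into a guardrail.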

The blameless framing matters. Saying "the engineer should have known" is functionally the same as saying "we accept that a human will hit this trap again." Saying "our fixtures don't represent reality" is something you can fix in code and prevent permanently. Pick the framing that produces a permanent fix.

3. Customer-facing summary, separate from internal

The internal postmortem is for engineers. It has names, swears, screenshots of broken SQL, and honest assessments of who was tired. The customer-facing summary is for users, journalists, enterprise security reviewers, and your support team. They are different documents, full stop.

A customer-facing summary has four short sections:

| Section | What goes in | What does NOT go in |
|---|---|---|
| What happened | Plain-language description of the user impact, with start and end times in UTC. | Internal codenames, employee names, tool names. |
| Who was affected | Percentage of users, regions, plans. | "All paying customers in tier 3 of SKU 4421" kind of jargon. |
| What we did | The remediation step (rolled back, scaled up, patched). | The five-whys analysis. |
| What we are doing to prevent it | The top 2 action items, plainly stated, with dates. | The full action item list and owners. |

Keep it under 400 words. Publish it within 48 hours. If you operate B2B, your enterprise contracts probably require this anyway. If you operate B2C, doing it builds trust that compounds.

4. Action items: named owners, dated commitments

Every action item has three fields: what, who, when. If any of those is missing, it is not an action item. It is a wish.

Bad action item: "Improve test coverage on Stripe webhooks."

Good action item: "Add an integration test for the 'no active subscription' branch of handleSubscriptionUpdated. Owner: Priya. Due: 2026-05-25. Tracked: LIN-4421."
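The what/who/when rule is simple enough to check before a postmortem gets published. A minimal sketch, assuming action items are kept as structured records; the `ActionItem` class and the list of non-owners ("the team", "everyone") are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    what: str
    owner: Optional[str] = None
    due: Optional[date] = None


def missing_fields(item):
    """Name the fields that keep this from being a real action item."""
    missing = []
    if not item.what.strip():
        missing.append("what")
    # A collective owner is treated as no owner: nobody in particular is accountable.
    if not item.owner or item.owner.strip().lower() in {"the team", "team", "everyone"}:
        missing.append("who")
    if item.due is None:
        missing.append("when")
    return missing
```

Run against the two examples above, the bad item reports `["who", "when"]` and the good one reports an empty list.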

Action items should fall into three categories, and your postmortem should have at least one from each:

  • Prevent: the change that would have stopped this specific incident from happening.
  • Detect faster: the alert, dashboard, or test that would have caught it earlier.
  • Contain blast radius: the change that would have made the failure smaller (feature flag, kill switch, circuit breaker).

If you only ever fix the specific bug, you'll patch this incident and miss the class. The "detect faster" and "contain" items are where compounding ops maturity comes from. Our guide on how to handle Stripe webhooks correctly is essentially a list of contain-the-blast-radius patterns generalized from real postmortems.
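One way to keep yourself honest about covering all three categories is a check on the action item table. This sketch assumes each item carries a `type` field matching the three category names; the field name and the dict shape are assumptions for illustration.

```python
REQUIRED_TYPES = ("prevent", "detect", "contain")


def missing_types(action_items):
    """Return the required categories that have no action item, in a fixed order."""
    present = {item["type"].lower() for item in action_items}
    return [kind for kind in REQUIRED_TYPES if kind not in present]


# A postmortem that only fixes the specific bug.
items = [
    {"type": "Prevent", "action": "Add an integration test for the 'no active subscription' branch."},
]

gaps = missing_types(items)
```

For this list, `gaps` is `["detect", "contain"]`, which is the signal that the postmortem patched the incident and missed the class.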

5. The lessons file

Publish the finished postmortem to one searchable place. Notion, Linear docs, a /postmortems folder in your monorepo, a private Confluence space. Pick one and never deviate. Tag every postmortem with the systems involved (stripe, auth, database, deploy-pipeline) so future engineers can find prior art when they get paged on the same surface.

Once a quarter, read every postmortem from the past 90 days in one sitting. Patterns jump out that you can't see one at a time. We have seen teams discover that 6 of 9 incidents in a quarter all traced back to the same migration tool, which the team then replaced. That replacement decision is only legible when you read the postmortems as a corpus.
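If every postmortem carries system tags, the quarterly read can start with a simple count. The sketch below assumes you have already loaded each postmortem's tags into a list of dicts; the titles and the loading step are hypothetical and depend on where your documents live.

```python
from collections import Counter


def recurring_systems(postmortems, min_count=2):
    """Count postmortems per system tag and keep tags seen at least `min_count` times."""
    # set() guards against a tag listed twice on the same postmortem.
    counts = Counter(tag for pm in postmortems for tag in set(pm["tags"]))
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(tag, n) for tag, n in ranked if n >= min_count]


# Hypothetical quarter of postmortems.
quarter = [
    {"title": "Checkout 500s", "tags": ["stripe", "deploy-pipeline"]},
    {"title": "Slow dashboard", "tags": ["database"]},
    {"title": "Failed migration", "tags": ["database", "deploy-pipeline"]},
    {"title": "Webhook retries", "tags": ["stripe", "database"]},
]

hotspots = recurring_systems(quarter)
```

The count only points you at where to look. The pattern itself, such as one migration tool behind most incidents, still comes from reading the documents together.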

Reusable template (copy this)

# Incident YYYY-MM-DD: [Short title]

**Severity**: SEV-1 / SEV-2 / SEV-3
**Detected**: HH:MM UTC
**Resolved**: HH:MM UTC
**Duration**: NN minutes
**Customer impact**: [Plain-language summary, 1 sentence]
**Author**: [Name]
**Reviewers**: [Names]

## Timeline (UTC)

| Time | Event | Source |
|---|---|---|
| HH:MM | ... | ... |

## What happened

[2-4 paragraphs of forensic narrative. Past tense. Names of systems, not people.]

## Five whys

1. Why did X happen? Because...
2. Why? Because...
3. Why? Because...
4. Why? Because...
5. Why? Because... [← This is the root cause.]

**Root cause**: [One sentence, system-level.]

## Customer-facing summary

[The exact text you would publish on status.yourdomain.com. ≤400 words.]

## Action items

| Type | Action | Owner | Due | Ticket |
|---|---|---|---|---|
| Prevent | ... | ... | YYYY-MM-DD | LIN-xxxx |
| Detect | ... | ... | YYYY-MM-DD | LIN-xxxx |
| Contain | ... | ... | YYYY-MM-DD | LIN-xxxx |

## What went well

[Yes, really. List 2-3 things. The fast rollback. The clean alert. The teammate who jumped in. This matters for morale and for spotting practices to formalize.]

## Open questions

[Anything still unknown. Assign each to a person to investigate.]

If a Cadence senior engineer ($1,500/week) is the on-call lead during an incident, the postmortem is part of the deliverable for that week; the senior tier owns the writeup unprompted. The mid tier ($1,000/week) can write a good postmortem from a template once they've seen one.

Common postmortem mistakes

After reading hundreds of these, five patterns show up over and over.

  1. Vague AI-assisted writeups. When you paste raw logs into ChatGPT and ask for a postmortem, you get something that reads like a postmortem but cites nothing. The five whys come out generic, the timeline is reconstructed from prose instead of source data, and the action items have no owners. Use AI for first-draft prose, but the timeline and the whys have to be your own work, sourced line by line. This is the single most common 2026 failure mode.
  2. Blame masquerading as analysis. Sentences like "the engineer should have caught this in review" or "QA failed to test this case" are blame. Convert every such sentence into a system fix. "The PR template did not require a test for the negative case" is something you can change. "The engineer should have been more careful" is not.
  3. Missing communications timeline. Half of postmortems track the technical events and forget the comms events. When did you tell customers? When did support get the canned response? When did the CEO find out? Comms gaps cause more reputational damage than the outage itself. Track them in the timeline.
  4. Action items that are essays. "Improve our deploy pipeline" is not an action item. "Add a canary stage to the Vercel deploy that holds 5% of traffic for 5 minutes before rolling forward" is. If you can't put it in a Linear ticket today, it is not an action item.
  5. No second review. The first draft, written by the person on call, is always too close to the event. Have one engineer who was not involved read it before publishing. They will spot blame, missing whys, and timeline gaps the author cannot see.
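A crude linter can help the second reviewer spot blame-shaped sentences, though it is no substitute for a human read. This is a heuristic sketch: the phrase list is an assumption drawn from the examples above, and it will miss plenty of blame and occasionally flag innocent sentences.

```python
import re

# Heuristic phrases only; extend or trim this list for your own team's writing.
BLAME_PATTERNS = [r"\bshould have\b", r"\bfailed to\b", r"\bmore careful\b"]


def flag_blame(text):
    """Return the sentences that contain a blame-shaped phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        sentence
        for sentence in sentences
        if any(re.search(pattern, sentence, re.IGNORECASE) for pattern in BLAME_PATTERNS)
    ]


draft = (
    "The engineer should have caught this in review. "
    "The PR template did not require a test for the negative case. "
    "QA failed to test this case."
)

flagged = flag_blame(draft)
```

Each flagged sentence is a prompt to rewrite it as a system fix, the way the PR-template sentence in the middle already does.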

When you can skip the formal postmortem

Not every incident needs a 1,500-word document. A reasonable threshold: any SEV-1 (customer impact, data loss, security) gets a full postmortem. SEV-2 (degraded service, internal only) gets a 200-word "lite" version with timeline and root cause. SEV-3 (caught before customer impact) gets a Slack thread that ends with "any action items?" and nothing else.

If you write postmortems for every blip, the team stops reading them. Calibrate the bar to your team's actual rate. A 4-person team that writes 8 postmortems a month is doing too many; a team that writes none is hiding incidents from itself. For a guide to the broader engineering hygiene this connects to, see our writeup on managing technical debt in a startup.

If your team is too small to own the postmortem discipline, an experienced contractor can install the practice in a week. Booking a senior Cadence engineer on a one-week engagement to set up the template, run the first three postmortems with the team, and write the runbook is a common scope. Every Cadence engineer is AI-native, vetted on Cursor / Claude / Copilot fluency, so they can do the log reconstruction work fast.

What to do next

Pick one open incident from the last 30 days that never got a writeup. Use the template above. Time-box it to two hours. Publish the result somewhere your team can find it in 90 days. Then write the second one when the next incident hits, while the timeline is fresh.

If you want an honest grade on your current incident response and deploy practices before the next outage, the Cadence Ship-or-Skip audit takes about ten minutes and tells you which of your engineering hygiene gaps are actually worth fixing this quarter versus deferring. We grade postmortem discipline as one of the eight signals.

FAQ

How long should a postmortem be?

For a SEV-1, 1,500 to 2,500 words including the timeline. For a SEV-2, the lite version of roughly 200 words covering the timeline and root cause. The timeline is the heaviest section; the prose around it should be tight.

Who should write the postmortem?

The on-call engineer who handled the incident writes the first draft, because they have the freshest context. A second engineer who was not involved reviews it for blame and gaps before publication. The engineering lead or manager publishes it.

How fast should we publish?

Internally, within 5 business days. Customer-facing summaries within 48 hours for SEV-1. If you wait longer than a week, the action items stop feeling urgent and never get done.
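If you want the runbook to state the deadlines explicitly, they are easy to compute. A small sketch, assuming both clocks start when the incident is declared resolved and that business days mean Monday to Friday with no holiday calendar; both of those are assumptions you may want to change.

```python
from datetime import date, datetime, timedelta, timezone


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def publish_deadlines(resolved_at: datetime) -> dict:
    """Return the customer-summary and internal-postmortem deadlines."""
    return {
        "customer_summary": resolved_at + timedelta(hours=48),
        "internal_postmortem": add_business_days(resolved_at.date(), 5),
    }


deadlines = publish_deadlines(datetime(2026, 5, 18, 14, 21, tzinfo=timezone.utc))
```

For an incident resolved on Monday 2026-05-18 at 14:21 UTC, the customer-facing summary is due by 2026-05-20 at 14:21 UTC and the internal postmortem by 2026-05-25.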

Should we publish postmortems publicly?

Cloudflare, GitLab, and Stripe publish theirs and it builds trust. For most startups, publishing the customer-facing summary on a status page is enough.

What tool should we use?

Whatever your team reads daily. Notion, Linear docs, a postmortems/ folder in the monorepo as markdown, or Confluence. We prefer markdown files in the repo for teams that already do technical specs as markdown docs.

How do we get the team to do this?

Make the postmortem part of "resolved." An incident is not closed until the postmortem is drafted. Wire it into the incident response runbook. The first three feel like overhead; by the fifth, the team will demand them.
