How to set up a startup-grade incident response process

A startup incident response process needs five things: a severity ladder (SEV1/SEV2/SEV3), a rotating incident commander even if you only have three engineers, a dedicated #incidents Slack channel, a 15-minute response SLA for SEV1, and a one-page runbook per service. Skip everything else until you have 50 paying customers. Google SRE-style command structures will sink a seed-stage team.

You do not need PagerDuty on day one. You do need a written rule for what counts as an incident, who responds, and where the conversation happens. Most startups break the second condition first: nobody knows whether the database alert is "I should look at this tomorrow" or "wake up the founder."

Why most startup incident response advice is wrong for you

The internet's incident response playbooks are written by people at Google, Stripe, and Cloudflare. They assume you have a 24/7 on-call rotation, a dedicated SRE function, and an actual SLA contract with customers who will sue you. You have none of those.

The advice still leaks down. We see seed-stage teams trying to run blameless postmortems with Five Whys analysis when their actual problem is that nobody noticed the Postgres CPU was at 98% for three days. Process without observability is theater.

What you actually need at 3 to 15 engineers is the smallest amount of structure that prevents three failure modes: silent outages (nobody saw it), confused responses (everyone fixes the same thing twice), and lost learnings (the same incident happens again in six weeks). Everything in this post is calibrated to that bar.

The severity ladder: SEV1, SEV2, SEV3

Pick three severities. Not five, not seven. Three. Anyone on the team can declare any severity in under 10 seconds.

Here is the ladder we recommend for startups under $5M ARR:

Severity	Definition	Response SLA	Who responds	Customer comms
SEV1	Product is down for >25% of users, or any data loss/security incident	15 minutes, 24/7	Incident commander + on-call eng	Statuspage update within 30 min
SEV2	Major feature broken, billing/checkout impaired, or significant degradation	1 hour, business hours	On-call eng, IC if not resolved in 2 hrs	Statuspage if >1 hour
SEV3	Minor bug, single-customer issue, non-blocking degradation	Next business day	Whoever owns the area	Customer reply only
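
If you want the ladder to live somewhere more enforceable than a handbook page, it fits in a few lines of code. The sketch below is one way to encode the definitions column of the table; the `Impact` fields and the `classify` function are hypothetical names chosen for illustration, not part of any tool mentioned in this post.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower number = more severe, matching the SEV1/SEV2/SEV3 naming.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Impact:
    """Illustrative description of what is broken (field names are assumptions)."""
    users_affected_pct: float = 0.0      # share of users for whom the product is down
    data_loss_or_security: bool = False
    major_feature_broken: bool = False
    billing_impaired: bool = False
    significant_degradation: bool = False


def classify(impact: Impact) -> Severity:
    """Map an impact description onto the three-step ladder from the table."""
    if impact.users_affected_pct > 25 or impact.data_loss_or_security:
        return Severity.SEV1
    if (impact.major_feature_broken
            or impact.billing_impaired
            or impact.significant_degradation):
        return Severity.SEV2
    return Severity.SEV3
```

A human still makes the call during a real incident. The value of writing it down this way is that the thresholds stop being a matter of opinion.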

The two rules that matter: anyone can declare a SEV1, and you can upgrade severity at any time but never downgrade mid-incident. Downgrading turns into politics. If it started as a SEV1, write the postmortem like it was a SEV1 and reclassify in retro.

The 15-minute response SLA is the load-bearing number. It is not "fix in 15 minutes." It is "human acknowledges the alert in 15 minutes." That is what customers actually care about and what is achievable at 3 engineers without burnout.
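
Both rules are easy to express as checks. The following is a minimal sketch, assuming a hypothetical `Incident` record with a numeric severity and two timestamps. It refuses mid-incident downgrades and flags a SEV1 that nobody has acknowledged within 15 minutes. For simplicity it measures the SLA from the moment the incident was declared, even if it was upgraded to SEV1 later.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

SEV1_ACK_SLA = timedelta(minutes=15)  # time to acknowledge, not time to fix


@dataclass
class Incident:
    severity: int                        # 1, 2, or 3; lower is more severe
    declared_at: datetime
    acked_at: Optional[datetime] = None  # set when a human acknowledges

    def reclassify(self, new_severity: int) -> None:
        """Allow upgrades (3 -> 2 -> 1) mid-incident, refuse downgrades."""
        if new_severity > self.severity:
            raise ValueError("no mid-incident downgrades; reclassify in retro")
        self.severity = new_severity

    def sev1_ack_breached(self, now: datetime) -> bool:
        """True if this is a SEV1 that went unacknowledged past the SLA."""
        if self.severity != 1:
            return False
        # If nobody has acknowledged yet, measure against the current time.
        acked = self.acked_at or now
        return acked - self.declared_at > SEV1_ACK_SLA
```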

The incident commander role at 3 engineers

Yes, you need an incident commander. Yes, even at 3 engineers. The IC does not have to be the most senior person. The IC has one job: keep the response coordinated so people stop typing the same kubectl command into the same terminal.

The IC role rotates weekly. We have seen this work with teams as small as 2 engineers, who simply alternate weeks. The pattern at most seed startups:

Week of: Engineer A is IC. Anyone who pages an incident pings A first.
A's job during an incident: declare severity, open the channel, assign a fixer (which may be themselves), assign a comms person, write the timeline.
A's job after: drive the postmortem within 5 business days.
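
A weekly rotation does not need a scheduling tool. As a rough sketch, assuming a hypothetical roster and a start date that falls on a Monday, the IC for any given day is just the number of whole weeks since the start, modulo the team size:

```python
from datetime import date

# Hypothetical roster and start date; replace with your own.
ROSTER = ["ana", "sam", "jordan"]
ROTATION_START = date(2024, 1, 1)  # a Monday; the first roster entry is IC that week


def ic_for(day: date, roster: list[str] = ROSTER, start: date = ROTATION_START) -> str:
    """Return the incident commander for the week containing `day`."""
    weeks_elapsed = (day - start).days // 7
    return roster[weeks_elapsed % len(roster)]
```

Counting weeks from a fixed start date, rather than using the calendar week number, keeps the rotation fair across year boundaries. With a two-person roster the same function gives the alternating-weeks pattern.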

The mistake we see is founders assuming they must be IC for every incident because they care most. That breaks when you raise a Series A and stop being on Slack at 11pm. Rotate from day one. Read our companion guide on how to write a postmortem after an incident for the document the IC produces.

The incident channel and Statuspage

You need exactly one place where incident conversation happens. Not the engineering channel, not DMs, not a Zoom call without notes. A dedicated #incidents Slack channel that is silent 99% of the time.

When an incident starts, the IC opens a thread in #incidents with this template:

SEV2: Stripe webhooks failing intermittently
IC: @ana
Fixer: @sam
Comms: @jordan
Timeline: starts in this thread
Statuspage: posting in 10 min

Everyone responding works in that thread. Side conversations get pulled back in with "let's keep this in the thread." This is how you avoid the postmortem black hole where the IC spends two hours reconstructing who did what.
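
If you would rather generate the opening message than retype it, the template is a six-line string. A small sketch follows; the function name and parameters are our own invention for illustration.

```python
def incident_opener(severity: str, summary: str, ic: str, fixer: str,
                    comms: str, statuspage_eta_min: int) -> str:
    """Render the opening message for the #incidents thread."""
    return "\n".join([
        f"{severity}: {summary}",
        f"IC: @{ic}",
        f"Fixer: @{fixer}",
        f"Comms: @{comms}",
        "Timeline: starts in this thread",
        f"Statuspage: posting in {statuspage_eta_min} min",
    ])
```

Calling `incident_opener("SEV2", "Stripe webhooks failing intermittently", "ana", "sam", "jordan", 10)` reproduces the example thread opener shown earlier in this section.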

Statuspage (the Atlassian one, around $29/month) or a free alternative like Instatus solves customer comms. Set up two components: API and Web App. That is enough. Customers do not need a 40-component dashboard of your microservices. They need to know if the thing they pay for is working.

The runbook template: one page per service

A runbook is a one-page markdown file in your repo at docs/runbooks/<service>.md. It answers six questions:

What does this service do (one sentence)?
Where does it run (Vercel, Fly, Render, AWS region)?
Where are the logs (Datadog link, Axiom dashboard, CloudWatch group)?
What are the common failures and the first thing to try?
Who owns it (one human name, not a team)?
How do you roll back (the literal command or button)?

That is it. No flowcharts, no architectural diagrams. If your runbook is longer than one page, the on-call engineer at 2am will not read it. Resist the urge to document every edge case; document the three failures you have actually seen.
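
To keep runbooks uniform, you can generate the skeleton and check that no question is left blank. This is a minimal sketch, assuming each question becomes a second-level markdown heading and that an unanswered section still contains the placeholder `TODO`. Both conventions are ours, not a standard.

```python
RUNBOOK_QUESTIONS = [
    "What does this service do (one sentence)?",
    "Where does it run?",
    "Where are the logs?",
    "What are the common failures and the first thing to try?",
    "Who owns it (one human name)?",
    "How do you roll back (the literal command or button)?",
]


def runbook_skeleton(service: str) -> str:
    """Return markdown for docs/runbooks/<service>.md, one heading per question."""
    lines = [f"# {service} runbook", ""]
    for question in RUNBOOK_QUESTIONS:
        lines += [f"## {question}", "", "TODO", ""]
    return "\n".join(lines)


def unanswered(markdown: str) -> list[str]:
    """List the questions whose section is missing, empty, or still says TODO."""
    missing = []
    for question in RUNBOOK_QUESTIONS:
        heading = f"## {question}"
        if heading not in markdown:
            missing.append(question)
            continue
        # Take the text between this heading and the next second-level heading.
        body = markdown.split(heading, 1)[1].split("\n## ", 1)[0].strip()
        if body in ("", "TODO"):
            missing.append(question)
    return missing
```

Running `unanswered` over every file in docs/runbooks/ as a CI step is one cheap way to notice a runbook that was created but never filled in.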

A seed-stage team with 4 services should have 4 runbooks. You can write all of them in a single Friday afternoon. If you have a backlog of services without runbooks, book a mid engineer for a week to write them; this is the exact scope a $1,000/week Cadence engineer ships in 4 days because every engineer on Cadence is AI-native, vetted on Cursor and Claude Code fluency before they unlock the platform, and runbook drafting is a near-perfect AI-assisted task.

The Slack workflow for declaring an incident

You can build this in the Slack Workflow Builder in 20 minutes. No code. The flow:

Anyone types /incident in any channel
A modal opens asking: severity, one-line description, suspected affected area
Slack auto-posts to #incidents with the template above
Slack DMs the on-call engineer
Slack adds a calendar event with "Postmortem due" set for 5 business days out

This single workflow removes 80% of the friction. The reason teams skip declaring incidents is not laziness, it is that declaring feels heavy. Make it a slash command and people will use it.
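
The only step in that flow with any arithmetic is the postmortem deadline. If you ever move it out of Workflow Builder and into a script, "5 business days out" can be computed as below. This is a sketch that skips weekends only; it assumes no holiday calendar.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Step forward `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def postmortem_due(declared_on: date) -> date:
    """Postmortem deadline: 5 business days after the incident was declared."""
    return add_business_days(declared_on, 5)
```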

For teams that want a real tool, incident.io starts at $0/month for small teams and runs around $200 to $500/month for early-stage startups. It wraps the Slack workflow above with a proper UI, automatic postmortem drafts, and follow-up tracking. Most Series A teams we work with move to it within 6 months of hitting product-market fit.

Choosing tools by stage

Honest pricing comparison. The right tool depends on how many paying customers will notice if you go down at 3am.

Stage	Paying customers	Suggested tooling	Monthly cost
Pre-seed / prototype	0 to 10	Slack workflow + Statuspage Free + a single shared on-call calendar	$0
Seed	10 to 200	Slack workflow + Statuspage ($29) + Better Stack or Cronitor for uptime ($30)	$50 to $100
Post-PMF seed	200 to 1,500	incident.io Pro + Statuspage + Datadog or Axiom	$300 to $700
Series A	1,500+	incident.io + PagerDuty ($21/user/mo) + full observability stack	$1,500 to $5,000

PagerDuty is the gold standard for actual paging. It is also overkill if you have 4 engineers and 50 customers. Many seed teams use Slack DMs as their pager (the IC's phone is set to push for #incidents) and that is fine until somebody starts ignoring Slack at night. The moment that happens, you upgrade.
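
The table reduces to a lookup keyed on paying customers. In the sketch below the stage names and costs are copied from the table, with tooling descriptions abbreviated. The table's ranges share their endpoints (10, 200, 1,500), so sending a boundary value to the later stage is our assumption.

```python
# (exclusive upper bound on paying customers, stage, suggested tooling, monthly cost)
TIERS = [
    (10, "Pre-seed / prototype",
     "Slack workflow + Statuspage Free + shared on-call calendar", "$0"),
    (200, "Seed",
     "Slack workflow + Statuspage + Better Stack or Cronitor", "$50 to $100"),
    (1500, "Post-PMF seed",
     "incident.io Pro + Statuspage + Datadog or Axiom", "$300 to $700"),
]
SERIES_A = ("Series A",
            "incident.io + PagerDuty + full observability stack", "$1,500 to $5,000")


def suggest_tooling(paying_customers: int) -> tuple[str, str, str]:
    """Return (stage, suggested tooling, monthly cost) for a customer count."""
    if paying_customers < 0:
        raise ValueError("paying_customers cannot be negative")
    for upper, stage, tooling, cost in TIERS:
        if paying_customers < upper:
            return stage, tooling, cost
    return SERIES_A
```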

The "what to do this week" plan

If you have nothing today, here is the actual order. Each item takes under 90 minutes:

Write the severity ladder in your team handbook. Copy ours above.
Create #incidents in Slack. Pin a message with the template.
Build the /incident slash command in Workflow Builder.
Sign up for Statuspage Free or Instatus. Add two components.
Set up one uptime monitor on your homepage with Better Stack ($30/month).
Write a runbook for your most fragile service (you know which one).
Designate the IC rotation. Put it in a shared calendar.

That is your week one. If you want this audited end-to-end before you start, run your stack through Ship or Skip for an honest grade on what is missing.

Common founder mistakes

We have watched founders over-engineer this badly. The pattern repeats:

Buying PagerDuty before they have a rotation. Tools do not replace the agreement about who responds. Decide the rotation first, then automate it.
Writing a 40-page incident playbook. Nobody reads it. Three severities and one channel beats a 40-page document every time.
Declaring blameless postmortems but skipping the action items. The point of the postmortem is the list of changes you will ship, not the document. If your postmortems do not produce 2 to 5 tracked tasks, you are LARPing.
Treating customer comms as the legal team's job. At seed stage, the IC writes the Statuspage update in plain English within 30 minutes. "We are investigating reports of failed Stripe webhooks. Updates every 30 min." That is the entire template.
Founder-as-permanent-IC. Burns out the founder, hides the failure mode from the team, and creates a single point of failure when you raise.

These mirror the broader pattern in our bootstrap startup engineering playbook: pick the least process that prevents the worst failure, then add structure only when something breaks.

Need an engineer to stand this up? Booking a mid engineer on Cadence for one week ($1,000) covers the runbook drafts, Slack workflow, Statuspage setup, and Better Stack monitors. Start the 48-hour free trial; you only pay if the work ships.

FAQ

How long should a startup incident response process take to build?

About 4 to 6 hours of focused work to ship the minimum viable version: severity ladder, #incidents channel, Slack workflow, Statuspage, one runbook. Most teams stretch it over a week because they over-design. Resist that. Ship the minimum, then iterate after your first real incident.

Do we need PagerDuty at seed stage?

Probably not. Until you have customers paying enough that 3am downtime risks churn, Slack notifications on the IC's phone work. The PagerDuty switch usually flips somewhere between $30K and $100K MRR, when one bad night costs more than $250/month in tooling.

Who should be the incident commander on a 3-engineer team?

Rotate weekly between all three engineers, including the founder if they still ship code. The IC's job is coordination, not technical seniority, so a mid engineer is often a better IC than a senior who wants to dive into the fix. The CTO should explicitly not be IC every week; that hides the bus-factor risk.

What is the difference between a SEV1 and a SEV2?

SEV1 means the product is down or data is at risk and you wake people up. SEV2 means something significant is broken but customers can mostly still use the product, so you respond in business hours. The line we use: if you would refund a customer for the impact, it is SEV1. If you would apologize but not refund, it is SEV2.

How often should we do incident response drills?

Once a quarter is enough at seed stage. Pick a recent real incident, replay it as a tabletop exercise with the team, and time how long it takes to declare, open the channel, and post to Statuspage. If it is over 15 minutes, your slash command or your rotation is broken. This is also where new hires get inducted; we cover this in our Series A engineering hiring playbook.
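
Timing a drill is simple enough to script. In this rough sketch someone records a timestamp at each of the three steps, and the drill passes if the slowest step lands within 15 minutes of the start. The step names and the report shape are hypothetical.

```python
from datetime import datetime, timedelta

DRILL_BUDGET = timedelta(minutes=15)
STEPS = ("declared", "channel_opened", "statuspage_posted")


def drill_report(started_at: datetime, marks: dict[str, datetime]) -> dict:
    """Summarise a tabletop drill: elapsed time per step and pass/fail."""
    elapsed = {step: marks[step] - started_at for step in STEPS}
    total = max(elapsed.values())  # the drill is done when the last step is done
    return {"elapsed": elapsed, "total": total, "passed": total <= DRILL_BUDGET}
```

Keeping the per-step times matters more than the pass/fail flag, because they show whether the delay is in declaring, in opening the channel, or in writing the customer update.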

What should the postmortem include?

Timeline, root cause, contributing factors, customer impact, and 2 to 5 specific action items with owners and due dates. Keep it to one page. The action items are the only part anyone will reread; everything else is paperwork. We have a full template in our postmortem writing guide.
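
The action-item rule can be checked mechanically before a postmortem is marked done. A minimal sketch, assuming a hypothetical `ActionItem` record; it reports every problem it finds rather than stopping at the first.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def action_item_problems(items: list[ActionItem]) -> list[str]:
    """Return human-readable problems; an empty list means the postmortem passes."""
    problems = []
    if not 2 <= len(items) <= 5:
        problems.append(f"expected 2 to 5 action items, found {len(items)}")
    for item in items:
        if not item.owner:
            problems.append(f"no owner: {item.description}")
        if item.due is None:
            problems.append(f"no due date: {item.description}")
    return problems
```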

Ayush Singh

Growth & Talent Partner

Sits between growth and talent at withRemote. Writes on partnership-driven hiring, referral economics, and growth loops for engineering teams.
