
A startup incident response process needs five things: a severity ladder (SEV1/SEV2/SEV3), a rotating incident commander even if you only have three engineers, a dedicated #incidents Slack channel, a 15-minute response SLA for SEV1, and a one-page runbook per service. Skip everything else until you have 50 paying customers. Google SRE-style command structures will sink a seed-stage team.
You do not need PagerDuty on day one. You do need a written rule for what counts as an incident, who responds, and where the conversation happens. Most startups break the second condition first: nobody knows whether the database alert is "I should look at this tomorrow" or "wake up the founder."
The internet's incident response playbooks are written by people at Google, Stripe, and Cloudflare. They assume you have a 24/7 on-call rotation, a dedicated SRE function, and an actual SLA contract with customers who will sue you. You have none of those.
The advice still leaks down. We see seed-stage teams trying to run blameless postmortems with Five Whys analysis when their actual problem is that nobody noticed the Postgres CPU was at 98% for three days. Process without observability is theater.
What you actually need at 3 to 15 engineers is the smallest amount of structure that prevents three failure modes: silent outages (nobody saw it), confused responses (everyone fixes the same thing twice), and lost learnings (the same incident happens again in six weeks). Everything in this post is calibrated to that bar.
Pick three severities. Not five, not seven. Three. Anyone on the team can declare any severity in under 10 seconds.
Here is the ladder we recommend for startups under $5M ARR:
| Severity | Definition | Response SLA | Who responds | Customer comms |
|---|---|---|---|---|
| SEV1 | Product is down for >25% of users, or any data loss/security incident | 15 minutes, 24/7 | Incident commander + on-call eng | Statuspage update within 30 min |
| SEV2 | Major feature broken, billing/checkout impaired, or significant degradation | 1 hour, business hours | On-call eng, IC if not resolved in 2 hrs | Statuspage if >1 hour |
| SEV3 | Minor bug, single-customer issue, non-blocking degradation | Next business day | Whoever owns the area | Customer reply only |
The two rules that matter: anyone can declare a SEV1, and you can upgrade severity at any time but never downgrade mid-incident. Downgrading turns into politics. If it started as a SEV1, write the postmortem like it was a SEV1 and reclassify in retro.
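If it helps to see the ladder as logic rather than a table, here is a minimal sketch. The function names (`classify`, `reclassify`) and the boolean inputs are our own illustration, not part of any tool; the thresholds come straight from the table above.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more severe, matching SEV1/SEV2/SEV3.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def classify(pct_users_down: float, data_loss_or_security: bool,
             major_feature_broken: bool) -> Severity:
    """Map the table's definitions onto a severity (hypothetical inputs)."""
    if data_loss_or_security or pct_users_down > 25:
        return Severity.SEV1
    if major_feature_broken:
        return Severity.SEV2
    return Severity.SEV3


def reclassify(current: Severity, proposed: Severity) -> Severity:
    """Upgrades apply immediately; downgrades are ignored until the retro."""
    # min() keeps whichever severity is more severe (the lower number).
    return min(current, proposed)
```

The one design choice worth copying is `reclassify`: mid-incident, the severity can only move toward SEV1.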
The 15-minute response SLA is the load-bearing number. It is not "fix in 15 minutes." It is "human acknowledges the alert in 15 minutes." That is what customers actually care about and what is achievable at 3 engineers without burnout.
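To make the distinction concrete, the check is on acknowledgement time, not resolution time. A minimal sketch, assuming you record when the incident was declared and when a human acknowledged it (both names are hypothetical):

```python
from datetime import datetime, timedelta

SEV1_ACK_SLA = timedelta(minutes=15)


def sev1_ack_met(declared_at: datetime, acknowledged_at: datetime) -> bool:
    """True if a human acknowledged within 15 minutes of the declaration."""
    # Resolution time is deliberately not an input: the SLA covers the ack only.
    return acknowledged_at - declared_at <= SEV1_ACK_SLA
```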
Yes, you need an incident commander. Yes, even at 3 engineers. The IC does not have to be the most senior person. The IC has one job: keep the response coordinated so people stop typing the same kubectl command into the same terminal.
The IC role rotates weekly. We have seen this work with teams as small as 2 engineers (they alternate every other week). The pattern at most seed startups:
The mistake we see is founders assuming they must be IC for every incident because they care most. That breaks when you raise a Series A and stop being on Slack at 11pm. Rotate from day one. Read our companion guide on how to write a postmortem after an incident for the document the IC produces.
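A weekly rotation does not need a tool. One way to sketch it, assuming a fixed roster and Monday-to-Sunday weeks (the function name and roster are illustrative):

```python
from datetime import date


def incident_commander(engineers: list[str], today: date) -> str:
    """Pick this week's IC by cycling through the roster one week at a time."""
    # date.toordinal() counts days from 0001-01-01, which is a Monday, so
    # this index changes every Monday and never resets at a year boundary.
    week_index = (today.toordinal() - 1) // 7
    return engineers[week_index % len(engineers)]
```

With two engineers this alternates weeks; with three, each person is IC one week in three. Swaps for vacations are left out of the sketch and would need a manual override.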
You need exactly one place where incident conversation happens. Not the engineering channel, not DMs, not a Zoom call without notes. A dedicated #incidents Slack channel that is silent 99% of the time.
When an incident starts, the IC opens a thread in #incidents with this template:
SEV2: Stripe webhooks failing intermittently
IC: @ana
Fixer: @sam
Comms: @jordan
Timeline: starts in this thread
Statuspage: posting in 10 min
Everyone responding works in that thread. Side conversations get pulled back in with "let's keep this in the thread." This is how you avoid the postmortem black hole where the IC spends two hours reconstructing who did what.
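If you would rather generate the opening message than copy and paste it, the template is simple enough to render from a few fields. This is a sketch only; the function and parameter names are ours, and posting the text to Slack is left out.

```python
def incident_opening_message(severity: str, summary: str, ic: str,
                             fixer: str, comms: str,
                             statuspage_note: str) -> str:
    """Render the #incidents thread opener in the template's field order."""
    lines = [
        f"{severity}: {summary}",
        f"IC: @{ic}",
        f"Fixer: @{fixer}",
        f"Comms: @{comms}",
        "Timeline: starts in this thread",
        f"Statuspage: {statuspage_note}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(incident_opening_message(
        "SEV2", "Stripe webhooks failing intermittently",
        ic="ana", fixer="sam", comms="jordan",
        statuspage_note="posting in 10 min",
    ))
```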
Statuspage (the Atlassian one, around $29/month) or a free alternative like Instatus solves customer comms. Set up two components: API and Web App. That is enough. Customers do not need a 40-component dashboard of your microservices. They need to know if the thing they pay for is working.
A runbook is a one-page markdown file in your repo at docs/runbooks/<service>.md. It answers six questions:
That is it. No flowcharts, no architectural diagrams. If your runbook is longer than one page, the on-call engineer at 2am will not read it. Resist the urge to document every edge case; document the three failures you have actually seen.
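A small check can keep runbooks honest about both rules: one file per service, and short. The sketch below assumes the docs/runbooks/<service>.md layout described above; the 60-line cutoff is our own stand-in for "one page", and the service list is whatever you pass in.

```python
from pathlib import Path

MAX_LINES = 60  # assumption: rough proxy for "one page"; tune to taste


def runbook_problems(repo_root: Path, services: list[str]) -> list[str]:
    """Report services whose runbook is missing or longer than one page."""
    problems = []
    for service in services:
        path = repo_root / "docs" / "runbooks" / f"{service}.md"
        if not path.is_file():
            relative = path.relative_to(repo_root).as_posix()
            problems.append(f"{service}: missing {relative}")
            continue
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > MAX_LINES:
            problems.append(f"{service}: {line_count} lines, longer than one page")
    return problems
```

Run it in CI or by hand before the on-call handoff; an empty list means every service has a runbook that fits the bar.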
A seed-stage team with 4 services should have 4 runbooks. You can write all of them in a single Friday afternoon. If you have a backlog of services without runbooks, book a mid engineer for a week to write them; this is the exact scope a $1,000/week Cadence engineer ships in 4 days because every engineer on Cadence is AI-native, vetted on Cursor and Claude Code fluency before they unlock the platform, and runbook drafting is a near-perfect AI-assisted task.
You can build this in the Slack Workflow Builder in 20 minutes. No code. The flow:
Someone types /incident in any channel.
The workflow posts to #incidents with the template above.
This single workflow removes 80% of the friction. The reason teams skip declaring incidents is not laziness, it is that declaring feels heavy. Make it a slash command and people will use it.
For teams that want a real tool, incident.io starts at $0/month for small teams and runs around $200 to $500/month for early-stage startups. It wraps the Slack workflow above with a proper UI, automatic postmortem drafts, and follow-up tracking. Most Series A teams we work with move to it within 6 months of hitting product-market fit.
Honest pricing comparison. The right tool depends on how many paying customers will notice if you go down at 3am.
| Stage | Paying customers | Suggested tooling | Monthly cost |
|---|---|---|---|
| Pre-seed / prototype | 0 to 10 | Slack workflow + Statuspage Free + a single shared on-call calendar | $0 |
| Seed | 10 to 200 | Slack workflow + Statuspage ($29) + Better Stack or Cronitor for uptime ($30) | $50 to $100 |
| Post-PMF seed | 200 to 1,500 | incident.io Pro + Statuspage + Datadog or Axiom | $300 to $700 |
| Series A | 1,500+ | incident.io + PagerDuty ($21/user/mo) + full observability stack | $1,500 to $5,000 |
PagerDuty is the gold standard for actual paging. It is also overkill if you have 4 engineers and 50 customers. Many seed teams use Slack DMs as their pager (the IC's phone is set to push for #incidents) and that is fine until somebody starts ignoring Slack at night. The moment that happens, you upgrade.
If you have nothing today, here is the actual order. Each item takes under 90 minutes:
Create #incidents in Slack. Pin a message with the template.
Build the /incident slash command in Workflow Builder.
That is your week one. If you want this audited end-to-end before you start, run your stack through Ship or Skip for an honest grade on what is missing.
We have watched founders over-engineer this badly. The pattern repeats:
These mirror the broader pattern in our bootstrap startup engineering playbook: pick the least process that prevents the worst failure, then add structure only when something breaks.
Need an engineer to stand this up? Booking a mid engineer on Cadence for one week ($1,000) covers the runbook drafts, Slack workflow, Statuspage setup, and Better Stack monitors. Start the 48-hour free trial; you only pay if the work ships.
Setting this up takes about 4 to 6 hours of focused work for the minimum viable version: severity ladder, #incidents channel, Slack workflow, Statuspage, one runbook. Most teams stretch it over a week because they over-design. Resist that. Ship the minimum, then iterate after your first real incident.
You probably do not need PagerDuty yet. Until you have customers paying enough that 3am downtime risks churn, Slack notifications on the IC's phone work. The PagerDuty switch usually flips somewhere between $30K and $100K MRR, when one bad night costs more than $250/month in tooling.
With three engineers, rotate the IC role weekly among all three, including the founder if they still ship code. The IC's job is coordination, not technical seniority, so a mid engineer is often a better IC than a senior who wants to dive into the fix. The CTO should explicitly not be IC every week; that hides the bus-factor risk.
SEV1 means the product is down or data is at risk and you wake people up. SEV2 means something significant is broken but customers can mostly still use the product, so you respond in business hours. The line we use: if you would refund a customer for the impact, it is SEV1. If you would apologize but not refund, it is SEV2.
Running an incident drill once a quarter is enough at seed stage. Pick a recent real incident, replay it as a tabletop exercise with the team, and time how long it takes to declare, open the channel, and post to Statuspage. If it is over 15 minutes, your slash command or your rotation is broken. This is also where new hires get inducted; we cover this in our Series A engineering hiring playbook.
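Timing the drill is easier if someone notes four timestamps as you go. A minimal sketch of the arithmetic, with hypothetical field names, that breaks the total into the three steps and compares it to the 15-minute bar:

```python
from datetime import datetime, timedelta

DRILL_BUDGET = timedelta(minutes=15)


def drill_report(started_at: datetime, declared_at: datetime,
                 channel_opened_at: datetime,
                 statuspage_posted_at: datetime) -> dict:
    """Split a tabletop drill into its three timed steps and check the total."""
    steps = {
        "declare": declared_at - started_at,
        "open channel": channel_opened_at - declared_at,
        "post to Statuspage": statuspage_posted_at - channel_opened_at,
    }
    total = statuspage_posted_at - started_at
    return {"steps": steps, "total": total, "within_budget": total <= DRILL_BUDGET}
```

The per-step breakdown is the useful part: it shows whether the slow step was declaring, opening the thread, or the customer update.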
A postmortem needs a timeline, root cause, contributing factors, customer impact, and 2 to 5 specific action items with owners and due dates. Keep it to one page. The action items are the only part anyone will reread; everything else is paperwork. We have a full template in our postmortem writing guide.
Sits between growth and talent at withRemote. Writes on partnership-driven hiring, referral economics, and growth loops for engineering teams.