May 8, 2026 · 12 min read · Cadence Editorial

How to set up disaster recovery for a SaaS

Photo by [Brett Sayles](https://www.pexels.com/@brett-sayles) on [Pexels](https://www.pexels.com/photo/server-racks-on-data-center-5480781/)

Disaster recovery for a SaaS is the engineering discipline of getting your product back online (RTO) with acceptable data loss (RPO) after something breaks. The five artifacts you need: written RPO/RTO targets, automated backups, a tested restore drill, a written runbook, and a multi-region plan once you outgrow single-region risk. Most teams ship the first two, then call it done. The expensive failures live in the last three.

Why disaster recovery looks different in 2026

Two shifts changed the playbook. First, managed Postgres providers (Neon, Supabase, Render, AWS RDS) now ship continuous WAL archiving and point-in-time recovery as defaults, so most teams already have a 7-to-35-day restore window without writing a single line of config. Second, copy-on-write storage (Neon branching, in particular) made restore-to-a-prior-second a sub-second operation rather than a 4-hour ordeal.

That is the good news. The bad news: defaults solve backup, not restore. Teams that never drill discover, mid-incident, that the backup is encrypted with a key on the laptop of the person who left. The work in 2026 is not "set up backups." It is "build the restore muscle."

RPO and RTO, defined in plain English

RPO is how much data you can afford to lose, measured in time. RTO is how long you can be down.

If you back up nightly at 2am and disaster strikes at 3pm, you lose 13 hours of data. Your achieved RPO that day was 13 hours. If your written RPO target is 1 hour, you missed.

RTO is the time from incident-start to service-restored. If you discover a corrupted database at 9am and customers can sign in again at 11am, your RTO was 2 hours. RTO does not include the time to learn what went wrong; it ends when production traffic is healthy again.

Set both targets by stage, not by aspiration. A pre-revenue founder writing "RPO = 1 minute, RTO = 5 minutes" is signing up for $5,000/month of standby infrastructure they do not need. Here is a defensible starting matrix:

| Stage | RPO target | RTO target | Strategy |
| --- | --- | --- | --- |
| Pre-revenue | 24h | 4-8h | Backup & Restore |
| $10k MRR | 1h | 1h | Backup & Restore + warm DB |
| $100k MRR | 5min | 15min | Pilot Light |
| $1M MRR | <1min | <5min | Warm Standby or Multi-Site |
| Regulated (SOC 2, HIPAA) | per contract | per contract | Per customer commitment |

The targets ratchet as customer commitments tighten. SOC 2 auditors will read your contractual SLAs and check your DR plan against them, which is one of the things you handle during SOC 2 audit preparation. If you committed 99.9% uptime to a customer (43 minutes/month of downtime), an 8-hour RTO is a finding.

The four DR strategies, ranked by cost and complexity

The AWS Well-Architected framework codified four strategies, but they apply to any cloud. You pick one (or one per system) and engineer to it.

| Strategy | RPO | RTO | Monthly cost (rough) | Best fit |
| --- | --- | --- | --- | --- |
| Backup & Restore | 24h | 4-24h | $0-200 | Pre-revenue to $10k MRR |
| Pilot Light | 5-60min | 30-60min | $200-2k | $50-100k MRR |
| Warm Standby | <1min | 5-15min | $2k-15k | $100k-1M MRR |
| Multi-Site Active/Active | <1s | <1min | $15k+ | $1M+ MRR, regulated |

Backup & Restore is what most teams do by default. Database snapshot, object-storage replication, infrastructure-as-code in git. On disaster, you spin up new infrastructure from code, restore the snapshot, point DNS. Honest RTO: 4 hours when drilled, 24+ hours when not.

Pilot Light keeps your data live in a second region but compute scaled to zero. You replicate the database, keep S3 cross-region replication on, and have a Terraform plan that scales the compute up. RTO drops to 30-60 minutes because you do not wait for data to copy.
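
To make the failover concrete, here is a minimal sketch of promoting a Pilot Light region, assuming an RDS cross-region read replica and a Terraform root per region; the instance identifier, directory, and variable name are placeholders, not a prescribed layout:

```bash
# Promote the cross-region read replica to a standalone primary (placeholder name)
aws rds promote-read-replica --db-instance-identifier prod-replica-us-west-2

# Scale the standby region's compute up from zero (assumed Terraform variable)
terraform -chdir=infra/us-west-2 apply -var="app_desired_count=4" -auto-approve

# Finish with the DNS cutover (Route 53 failover or weighted record)
```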

Warm Standby runs a scaled-down replica of production in a second region. Compute is on, just smaller. Failover is a DNS flip plus an autoscale event. This is where cost starts to bite: you are paying for two environments.

Multi-Site Active/Active runs both regions hot, taking traffic. No failover, just traffic-shifting. RTO measured in seconds, RPO in milliseconds. You also need conflict-resolution logic because two regions writing the same row is a real concern.

Most SaaS companies should be in Backup & Restore until $50-100k MRR, then climb the ladder one rung at a time as customer SLAs demand.

Backups for Postgres: logical, physical, and PITR

Three backup types, each with a job.

Logical (pg_dump) produces a SQL file you can restore into any Postgres version. It is portable, human-readable, and slow on large databases. Use it for nightly archival snapshots, one-off rescue exports, and migrating between providers.

```bash
pg_dump --format=custom --file=prod-2026-05-13.dump \
  --no-owner --no-acl \
  "$DATABASE_URL"
# Restore: pg_restore --dbname="$NEW_DATABASE_URL" prod-2026-05-13.dump
```
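
Before trusting the dump, confirm it is readable. pg_restore can print the archive's table of contents without touching any database:

```bash
# Prints the dump's table of contents; fails loudly if the file is corrupt
pg_restore --list prod-2026-05-13.dump >/dev/null && echo "dump readable"
```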

Physical (pg_basebackup + WAL archiving) captures the on-disk database files plus a continuous stream of write-ahead log segments. Restore replays WAL up to a chosen second. This is what powers point-in-time recovery.

```ini
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://your-wal-bucket/%f'
archive_timeout = 60
```

PITR (Point-in-Time Recovery) is the combination: a base backup plus continuous WAL gives you the ability to restore to any second within your retention window. It is the only backup type that delivers sub-minute RPO without continuous replication.
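
What the restore side looks like on self-hosted Postgres 12 or later, as a hedged sketch: it assumes a base backup has already been unpacked into $PGDATA and that WAL lands in the bucket from the config above; the target timestamp is a placeholder.

```bash
# Point recovery at the archived WAL and the moment to stop at
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'aws s3 cp s3://your-wal-bucket/%f %p'
recovery_target_time = '2026-05-13 14:59:00 UTC'
recovery_target_action = 'promote'
EOF

touch "$PGDATA/recovery.signal"  # tells Postgres to start in recovery mode
pg_ctl -D "$PGDATA" start        # replays WAL to the target, then promotes
```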

If you are self-hosting Postgres, pgBackRest and WAL-G are the tools. They handle parallelism, encryption, retention, and S3 upload.
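
For a flavor of the workflow, a minimal WAL-G sketch; the bucket prefix is a placeholder, and encryption and retention settings are omitted:

```bash
export WALG_S3_PREFIX=s3://your-backup-bucket/prod
# In postgresql.conf: archive_command = 'wal-g wal-push %p'

wal-g backup-push "$PGDATA"                 # take a base backup
wal-g backup-list                           # confirm it landed in S3
wal-g backup-fetch /restore/pgdata LATEST   # pull the newest base backup back down
```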

In 2026, most SaaS teams should not self-host. The managed options:

| Provider | Default backup | PITR window | Cost signal |
| --- | --- | --- | --- |
| AWS RDS | Daily snapshot | Up to 35 days | Included in instance cost |
| Render Postgres | Daily snapshot | 7 days (Standard+) | Included in plan |
| Supabase | Daily snapshot | 7-day PITR add-on (Pro) | $100/mo PITR add-on |
| Neon | Continuous WAL | 7-30 days by plan | $0.20/GB-month of changes |

Neon deserves a special mention. Because Neon stores data in a copy-on-write log, a "restore" is a new branch pointed at a prior LSN. The actual restore happens in under a second. That is a different category of recovery experience than waiting 20 minutes for an RDS snapshot to materialize. If you are starting a new SaaS in 2026 and DR is on your radar, Neon is the path of least resistance to a sub-minute RTO.

Object storage and assets: do not forget S3

Half of the DR posts on the internet stop at the database. Your customer-uploaded files, generated PDFs, and CDN-cached static assets are also stateful. Losing them is, in many jurisdictions, a notifiable data incident, which overlaps with HIPAA compliance for SaaS if you handle health data and with GDPR for a SaaS app if you serve EU customers.

Three things to configure on day one:

  1. Versioning on. Every PUT writes a new version; deletes are soft. Reverts are a metadata flip.
  2. Lifecycle rules. Expire non-current versions after 30-90 days so you do not pay storage forever.
  3. Cross-region replication. S3 CRR copies new objects to a second region. Cost is roughly the inter-region transfer (~$0.02/GB) plus storage in the second region.

For 100GB of new objects per month, CRR runs about $2-4/month plus $2.30/month of storage. Cheaper than a single hour of incident response.
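
The first two items in the list above take minutes with the AWS CLI. A sketch with placeholder bucket names; replication additionally needs an IAM role and a put-bucket-replication call, omitted here:

```bash
# 1. Turn on versioning: every PUT writes a new version, deletes are soft
aws s3api put-bucket-versioning --bucket your-assets-bucket \
  --versioning-configuration Status=Enabled

# 2. Expire non-current versions after 60 days so old versions stop accruing cost
aws s3api put-bucket-lifecycle-configuration --bucket your-assets-bucket \
  --lifecycle-configuration '{"Rules":[{"ID":"expire-noncurrent","Status":"Enabled",
    "Filter":{},"NoncurrentVersionExpiration":{"NoncurrentDays":60}}]}'
```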

Vercel Blob, Cloudflare R2, and Supabase Storage offer similar features, but check the specifics for each. R2 has zero egress fees, which makes cross-region copies effectively free. Supabase Storage uses S3 under the hood and exposes versioning controls.

The disaster runbook nobody writes

A backup without a runbook is a sandbag without a sandbagger. Write the document before you need it. The skeleton:

Who declares a disaster. One named person. A single decider, not a committee. If they are unavailable, a named backup. Disaster declaration unlocks emergency spend, customer comms, and the restore sequence.

Communication order. Status page first (within 15 minutes), customer email second (within 1 hour), internal Slack last. Reverse this order and you spend the incident answering "is it down?" instead of fixing it. Status page first reduces inbound support volume by roughly 70% in our team's experience.

Restore order. Database first (Postgres), then object storage if separately affected, then application servers, then DNS cutover. Application servers are the easiest to rebuild and the slowest to discover bugs in, so you want the data back first.

Decision log. A shared doc with timestamped decisions. "10:42 declared disaster. 10:45 began RDS restore. 10:51 noticed WAL retention exceeded for 12-minute gap, accepted as RPO miss." The log is gold for the post-mortem and required by SOC 2. Pair it with structured request tracing (the OpenTelemetry guide for 2026 covers this) so you can reconstruct what was happening when the incident started.

A useful exercise: write the runbook in plain English and store it in your repo at docs/runbooks/disaster.md. Print a copy. Keep one in the company drive that does not depend on your auth provider being up.

Drill the restore monthly (the most-skipped step)

This is the work nobody does and the work that separates teams who survive from teams who shut down.

The drill: once a month, on a calendar invite, one engineer restores last night's backup into a scratch environment, runs a smoke test, and times the entire process. They write the actual time-to-restore in the runbook. They rotate next month.

The first drill almost always fails. The backup is missing the new schema. The restore script references a deleted IAM role. The S3 bucket is in the wrong region. The encryption key is on the laptop of the person who left. This is exactly why you drill: you are buying the failure now, in a controlled setting, instead of at 2am with customers waiting.

The drill goes on the calendar. The owner is named. The output is a one-line status in your Slack: "May DR drill: 47 minutes to full restore, target 60. Pass." Without that ritual, your runbook is a story you tell yourself.
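
A minimal drill script, assuming a local scratch Postgres and last night's custom-format dump; the table in the smoke test is a placeholder for your own most important table:

```bash
set -euo pipefail
START=$(date +%s)

createdb drill_restore
pg_restore --dbname=drill_restore --no-owner prod-2026-05-13.dump

# Smoke test: the most important table exists and is non-empty (placeholder query)
test "$(psql -d drill_restore -tAc 'SELECT count(*) FROM users;')" -gt 0

echo "DR drill: restore + smoke test in $(( $(date +%s) - START ))s"
dropdb drill_restore
```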

Cost math by stage

Real numbers, not "it depends":

  • Pre-revenue. Render Postgres Standard ($20/mo) + S3 versioning + a weekly pg_dump to a separate cloud account ($5). Total: $25/mo. RPO 24h, RTO 4-8h.
  • $10k MRR. Render Pro Postgres ($95/mo) + S3 cross-region replication ($10/mo) + a small staging environment for monthly drills ($25/mo). Total: ~$130/mo. RPO 1h via WAL, RTO 1-2h.
  • $100k MRR. AWS RDS Multi-AZ ($400/mo) + RDS PITR included + S3 CRR ($30/mo) + read replica in second region ($300/mo). Total: ~$730/mo. RPO 5min, RTO 30min.
  • $1M MRR. Multi-region warm standby on RDS ($2,500/mo) + Route 53 health-check failover ($50/mo) + drill automation ($500/mo of engineering time). Total: ~$3,000/mo. RPO <1min, RTO <5min.

The $1M MRR number assumes you have one engineer who understands the system. If you do not, double everything for incident-induced overtime.

Steps

These are the literal commands and decisions, in order, to set up disaster recovery for a SaaS from a cold start.

  1. Define RPO and RTO. Pick targets from the stage table above. Write them in docs/runbooks/disaster.md. One sentence each.
  2. Turn on managed backups. For Render, Supabase, RDS, or Neon, confirm the PITR window in the dashboard. Note the retention. Default is usually 7 days.
  3. Add a logical backup as a second copy. Run pg_dump nightly via a cron job or GitHub Action; upload the dump to a different cloud provider so a single-vendor outage does not kill both copies (see the sketch after this list).
  4. Configure S3 versioning and cross-region replication. Apply to every bucket holding customer data. Verify with a test PUT and a delete.
  5. Write the runbook. Use the skeleton above. Single decider, comms order, restore order, decision-log location. One page.
  6. Schedule the first drill. Calendar invite, named owner, scratch environment. Restore last night's backup. Time it. Update the runbook with actual numbers.
  7. Repeat the drill monthly. Rotate owners. Track the trend of time-to-restore.
  8. Plan multi-region when you cross $50-100k MRR. Move from Backup & Restore to Pilot Light. Add a warm read replica in a second region. Re-drill including a region failover.
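
A sketch of step 3 as a nightly script (cron or a scheduled GitHub Action); the bucket name, the R2 endpoint variable, and the cleanup step are assumptions, and any second provider with an S3-compatible API works the same way:

```bash
set -euo pipefail
STAMP=$(date +%F)

# Nightly logical backup in custom format (restorable with pg_restore)
pg_dump --format=custom --no-owner --no-acl \
  --file="prod-$STAMP.dump" "$DATABASE_URL"

# Ship the dump to a different cloud (here: Cloudflare R2 via its S3-compatible API)
aws s3 cp "prod-$STAMP.dump" "s3://offsite-backups/prod-$STAMP.dump" \
  --endpoint-url "$R2_ENDPOINT"

rm "prod-$STAMP.dump"
```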

Steps 1 through 5 are about a senior engineer-day of work. Step 6 is two hours. Step 7 is one hour a month. Step 8 is a project, usually two weeks for a senior engineer who has done it before.

If you are not sure which step your stack is missing, run Ship or Skip on your current setup; it grades the gaps in your DR posture against what your stage actually needs and tells you what to fix first.

When you can skip most of this

Best practices have ROI curves. Two scenarios where the full plan is overkill:

Pre-launch consumer side project. Two founders, no paying customers, no contracts. Render's daily snapshot is fine. Write down the runbook anyway, but skip the drill until you have revenue. The same scope discipline applies to scaling an MVP to production-ready: do the cheap things first.

Internal tooling for a 5-person team. If the worst case is rebuilding from a 24-hour-old backup and slacking the team to re-enter a day's work, you do not need warm standby. You need a backup that works.

The full plan is non-optional once you have: paying customers with SLAs, regulated data (HIPAA, PCI), or a public outage that would damage trust beyond the dollar value of recovery infrastructure. If any of those apply, run all eight steps.

Where Cadence engineers fit

Setting up the full DR plan (steps 1-7) is a 3-to-5-day project for a senior engineer who has done it on Postgres before. Most founders book this work on Cadence at the senior tier ($1,500/week) and ship it as a single sprint, including the first drill and the runbook commit. Cadence's engineer pool ships to first commit in a median of 27 hours after the booking, and every engineer is AI-native by default (Cursor, Claude Code, Copilot in daily use), which matters for DR work because most of the heavy lifting is config and runbook prose, not novel code.

If you would rather grind it out yourself, the playbook above is enough. If you would rather have it done end-to-end with a working drill on day five, that is the booking shape.

Audit your stack honestly. Run Ship or Skip to grade your current DR posture in two minutes. You get a per-step score, the cheapest fix, and whether your stage warrants paying for warm standby yet. No signup, no email gate.

FAQ

What is the difference between RPO and RTO?

RPO is how much data you can lose, measured in time. RTO is how long you can be down. RPO answers "how recent is the most recent backup?" RTO answers "how fast can we restore service?" You set both as targets and engineer your stack to meet them.

How often should I back up my SaaS database?

Match backup frequency to your RPO target. A 1-hour RPO means hourly backups or continuous WAL archiving. Most managed Postgres providers (RDS, Neon, Supabase, Render) archive WAL continuously by default, giving you sub-minute effective RPO inside their PITR window.

Do managed databases handle disaster recovery for me?

Partially. Managed providers handle backups, PITR, and storage redundancy. They do not handle your runbook, your drills, your application-layer recovery, your DNS failover, or your customer communications. Those are still your job.

How often should I test my disaster recovery plan?

Monthly at minimum. Quarterly is the floor for SOC 2 and similar audits. The first drill almost always uncovers a broken backup or a missing credential, so you want that discovery to happen on a Tuesday afternoon, not at 2am during a real incident.

What is a realistic RTO for an early-stage SaaS?

4 to 8 hours for pre-revenue, 1 to 2 hours by $10k MRR, 30 minutes by $100k MRR. Drop the target as customer SLA commitments tighten. Writing "5 minutes" before you have the warm standby infrastructure to back it is worse than writing "4 hours" and hitting it consistently.

Should I use Neon for the database if DR matters to me?

Neon's branching makes restore a sub-second operation, which is uniquely good for accidental-deletion scenarios (engineer drops a table, you branch from 30 seconds ago). For region-level disaster, Neon is similar to other managed Postgres options. If you are starting fresh in 2026 and value restore speed for human-error recovery, Neon is the strongest default.
