May 8, 2026 · 12 min read · Cadence Editorial

How to handle database migrations safely in production

Photo by [panumas nikhomkhai](https://www.pexels.com/@cookiecutter) on [Pexels](https://www.pexels.com/photo/line-of-pc-towers-17489151/)


To handle database migrations safely in production, use the expand-migrate-contract pattern, set lock_timeout and statement_timeout before any DDL, test the migration on a Neon branch first, and require a rollback plan in every pull request. Never ship a single ALTER TABLE against a hot table; break it into small, independently revertible deploys.

That's the playbook. The rest of this post is the working code, the lock numbers that matter, an honest comparison of the six migration tools you'd actually consider in 2026, and the Postgres gotchas that bite at scale.

Why migrations break production in 2026

Most schema changes don't break because the SQL was wrong. They break because the SQL was correct, but it ran inside a single deploy that coupled the schema change to code that depended on it.

Postgres takes locks when you run DDL. The most dangerous one is ACCESS EXCLUSIVE, which blocks every reader and writer for the duration of the statement. On a 200-million-row users table, an ALTER TABLE ... ADD COLUMN with a volatile non-null default (or any non-null default on Postgres 10 and older) rewrites the table and can hold that lock for minutes. During those minutes, every API call that touches the table queues up. Your p99 latency graph looks like a cliff.

The second failure mode is tooling. Drizzle Kit, Prisma Migrate, and most ORMs generate migration SQL from schema diffs. They are not deployment safety systems. They will happily emit a migration that drops a column the running application is still reading. The diff is correct. The deploy is suicide.

The fix is the same fix engineering teams have been converging on for a decade, but with sharper tools. You stop thinking of a migration as a single atomic event and start thinking of it as a coordinated rollout across multiple deploys.

The expand-migrate-contract pattern, explained

The pattern decomposes any breaking schema change into three logical phases: expand, migrate, contract. In practice it's usually five or six deploys, and each one is independently revertible.

The shape:

  1. Expand. Add the new schema (a column, a table, an index) alongside the old one. Nothing reads or writes it yet.
  2. Migrate. Update application code to dual-write. Backfill historical data. Cut reads over to the new shape once dual-write is verified.
  3. Contract. Stop dual-writing. Drop the old schema in a final, separate migration.

At every checkpoint between phases, the system is in a known good state. If anything goes wrong, you stop, you don't have to run a "down" migration, and the old schema is still there as the source of truth. This is the property that makes the pattern worth the extra deploys.

The classic example is renaming a column. ALTER TABLE users RENAME COLUMN email TO email_address looks atomic. It is not safe. The instant the rename commits, the running application (which still has email in its query strings) starts throwing errors. You'd need a perfectly synchronized code-and-schema deploy, which doesn't exist.

The expand-contract version: add email_address as a new nullable column, dual-write to both, backfill, switch reads, stop dual-writing, drop email. Six deploys instead of one. Zero downtime instead of an outage.
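
Schema-side, those six deploys reduce to two small migrations with application deploys in between. A sketch using the same users table (the dual-write, backfill, and read-switch steps are code changes, not SQL):

-- Deploy 1 (expand): new column, nullable, no default, so it's metadata-only
ALTER TABLE users ADD COLUMN email_address text;

-- Deploys 2-5 are application changes: dual-write, batched backfill, switch reads, stop dual-writing

-- Deploy 6 (contract): only after nothing reads or writes the old column
ALTER TABLE users DROP COLUMN email;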

Postgres locking strategies that actually work

Before any production DDL, set two session-level limits:

SET lock_timeout = '2s';
SET statement_timeout = '5s';

lock_timeout caps how long the migration waits in the lock queue. Without it, a long-running query in the background can pin your ALTER indefinitely while every new query queues behind it. With a 2-second cap, the migration fails fast instead of taking the database down.

statement_timeout caps how long the migration itself can run. Five seconds is a reasonable upper bound for any metadata-only DDL on Postgres 11+. If you're hitting it, you're doing something that rewrites the table, and you should not be doing that synchronously.
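
Put together, a guarded metadata-only migration might look like this sketch; SET LOCAL scopes the limits to the transaction so they don't leak into the session:

BEGIN;
SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '5s';
ALTER TABLE users ADD COLUMN email_address text;  -- catalog-only on Postgres 11+
COMMIT;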

For indexes, never use plain CREATE INDEX on a production table. Use the concurrent variant:

CREATE INDEX CONCURRENTLY idx_users_org_id ON users (org_id);

CONCURRENTLY builds the index without blocking writes. It takes longer wall-clock time and uses more disk space, but it doesn't take an ACCESS EXCLUSIVE lock. The catch is that it cannot run inside a transaction, which breaks Prisma Migrate by default (more on that below).
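
A second catch: if a concurrent build fails partway (lock_timeout fires, the session drops), Postgres leaves behind an INVALID index that still adds write overhead. Find it and rebuild:

-- indexes left invalid by a failed CREATE INDEX CONCURRENTLY
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;

DROP INDEX CONCURRENTLY idx_users_org_id;
CREATE INDEX CONCURRENTLY idx_users_org_id ON users (org_id);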

For constraints, split validation into two steps. Add the constraint as NOT VALID first, which only takes a brief lock to update the catalog. Then VALIDATE CONSTRAINT separately, which takes a much weaker SHARE UPDATE EXCLUSIVE lock that doesn't block reads or writes.

ALTER TABLE orders ADD CONSTRAINT fk_org FOREIGN KEY (org_id) REFERENCES orgs(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT fk_org;
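
One caveat: the VALIDATE step scans the whole table, so the 5-second statement_timeout from earlier would kill it on a big table. Relax the timeout for that one statement; the scan only holds SHARE UPDATE EXCLUSIVE, so letting it run long is safe:

SET statement_timeout = 0;  -- the validation scan is slow but doesn't block reads or writes
ALTER TABLE orders VALIDATE CONSTRAINT fk_org;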

Migration tools compared

The 2026 landscape has six tools you'd seriously consider. They cover different ends of the spectrum: TypeScript-native, declarative, enterprise, SQL-first.

| Tool | Style | Concurrent index? | Rollback support | Best for |
| --- | --- | --- | --- | --- |
| Drizzle Kit | TS schema, generates SQL | Yes (manual edit) | Manual | TypeScript teams who want SQL output |
| Prisma Migrate | schema.prisma, generates SQL | Blocked (tx wrap) | Manual | Full-stack TS apps using Prisma Client |
| Atlas | Declarative HCL | Yes | Plan-based | Teams treating schema as code |
| Sqitch | Deploy / verify / revert triplets | Yes | First-class | Teams that need strict reversibility |
| Liquibase | XML / YAML changesets | Yes | Built-in | Enterprise, JVM-heavy stacks |
| Flyway | Versioned SQL files | Yes | Paid tier only | SQL-first teams of any size |

Where each one wins:

Drizzle Kit is the cleanest TypeScript story, especially if you've already chosen Drizzle as your ORM. It generates a SQL file you can read and edit. The downside: it doesn't sequence anything. You generate the SQL, then you edit it to insert CONCURRENTLY, split constraints, and stage deploys yourself. If you want a deeper look at the ORM itself, our Drizzle ORM guide covers the runtime side.
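
Concretely, that editing pass often amounts to rewriting one generated statement by hand. A sketch (the file name is hypothetical, and check how your migration runner handles transactions before adding CONCURRENTLY):

-- drizzle/0042_add_org_index.sql, after hand-editing the generated statement
-- generated: CREATE INDEX "idx_users_org_id" ON "users" ("org_id");
CREATE INDEX CONCURRENTLY "idx_users_org_id" ON "users" ("org_id");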

Prisma Migrate has the friendliest DX for full-stack TypeScript apps. It also wraps every migration in a transaction, which means CREATE INDEX CONCURRENTLY flat-out doesn't work without manual SQL. The mitigation is a custom migration with hand-written SQL, which most teams figure out the second time they take production down. The honest Drizzle vs Prisma comparison walks through the trade-off in detail.

Atlas (from the Ariga team) takes a Terraform-style declarative approach. You write the desired schema, Atlas computes the diff. It supports CREATE INDEX CONCURRENTLY natively and has a strong "plan / apply" workflow. It's the most interesting newcomer for teams that already think in IaC.

Sqitch is the rigorous choice. Every migration is a triplet: a deploy script, a verify script, and a revert script. If your team has been burned by un-rollbackable migrations, Sqitch's discipline is worth the verbosity.

Liquibase and Flyway are the JVM-world veterans. Both work fine with Postgres. Liquibase is heavier (XML or YAML changesets, change-tracking metadata in the DB) and shines in regulated environments. Flyway is lighter (just versioned SQL files) and is what most teams pick if they want SQL-first migrations without ORM lock-in. Note Flyway's rollback feature requires the paid tier.

Postgres gotchas that bite at scale

A handful of specific operations are worth memorizing because they look harmless and aren't:

NOT NULL on an existing column. This does a full table scan under an ACCESS EXCLUSIVE lock to verify there are no nulls. (The Postgres 11 fast path people remember is for a different operation: ADD COLUMN with a constant non-null default, which now only touches the catalog instead of rewriting the table.) On a large table, stage it: backfill the column to non-null values, add a CHECK (col IS NOT NULL) constraint as NOT VALID, VALIDATE it separately, then set NOT NULL.
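
On Postgres 12 and newer, the staged version looks roughly like this (the constraint name is illustrative); the validated check lets SET NOT NULL skip the verification scan:

-- after the batched backfill has removed every NULL
ALTER TABLE users ADD CONSTRAINT users_email_address_not_null
  CHECK (email_address IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_address_not_null;

-- Postgres 12+ sees the validated check and skips the full-table verification scan
ALTER TABLE users ALTER COLUMN email_address SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_address_not_null;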

Type changes. ALTER COLUMN ... TYPE rewrites the table for almost any change that isn't a simple binary-compatible widen (like varchar(50) to varchar(100)). Going from int to bigint rewrites every row. The safe path is the expand-contract dance: add a new bigint column, dual-write, backfill, swap reads, drop the old column.
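
For int to bigint specifically, the column usually has to keep its name, so the contract step becomes a quick double rename in one transaction instead of a drop. A sketch with a hypothetical events.view_count column, ignoring the extra index and constraint rebuilds a real key column would need:

-- expand: add the wider column alongside the old one
ALTER TABLE events ADD COLUMN view_count_big bigint;

-- ...dual-write (app code or a trigger) and a batched backfill happen here...

-- contract: swap the columns under the original name in one quick transaction,
-- so the application keeps querying view_count throughout
BEGIN;
SET LOCAL lock_timeout = '2s';
ALTER TABLE events RENAME COLUMN view_count TO view_count_old;
ALTER TABLE events RENAME COLUMN view_count_big TO view_count;
COMMIT;

-- drop view_count_old in a later, separate migration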

Foreign key additions. ADD FOREIGN KEY blocks writes on both the referencing and referenced tables (a SHARE ROW EXCLUSIVE lock on each) until every existing row has been validated. Use the NOT VALID / VALIDATE split shown above to keep that blocking window to milliseconds.

Large table backfills. Never run a single UPDATE users SET new_col = old_col on a multi-million-row table. It holds row locks for the duration and bloats the WAL. Batch it:

DO $$
DECLARE
  batch_size INT := 5000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE users
    SET email_address = email
    WHERE id IN (
      SELECT id FROM users
      WHERE email_address IS NULL
      LIMIT batch_size
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;                 -- release each batch's row locks (Postgres 11+, run outside a wrapping transaction)
    PERFORM pg_sleep(0.1);  -- let autovacuum and live traffic breathe between batches
  END LOOP;
END $$;

Five thousand rows, commit, hundred-millisecond sleep, repeat until done. Committing after each batch releases its row locks, and the sleep gives autovacuum and the rest of your workload room to breathe. Tune the batch and sleep based on table size and write traffic; 1k rows with a 50ms sleep is conservative, 10k rows with a 200ms sleep is aggressive.

Test on Neon branches before production

The most under-used safety tool in 2026 is database branching. Neon, Xata, and other Postgres-flavored cloud providers offer copy-on-write database branches that materialize in seconds against your production data.

The flow looks like this:

  1. Run neon branches create --name migration-test (or the equivalent in your provider).
  2. Get back a fresh connection string with a full copy of production.
  3. Run the migration against the branch.
  4. Time the lock window. Inspect the resulting schema.
  5. Drop the branch.

This catches the failures that staging never catches: lock contention against real data volumes, queries that suddenly need a sequential scan, foreign key validations that take five minutes instead of five seconds. It's the closest thing to running the migration in production without the consequences. Cadence engineers use this flow on every client running Postgres because it's faster than maintaining a representative staging dataset.
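
To put a number on that lock window, run a blocking query in a second session while the migration executes on the branch. Nothing here is Neon-specific; pg_stat_activity and pg_blocking_pids() are stock Postgres:

SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity blocking ON blocking.pid = b.pid
WHERE blocked.wait_event_type = 'Lock';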

If you're not sure whether your current migration tooling is up to the job, our Ship or Skip stack audit gives you an honest grade on what to keep and what to replace before the next big schema change.

The 2026 rule: every migration PR ships a rollback plan

This is the practice that separates teams that have been burned from teams that haven't been burned yet. Every pull request that contains a migration must include four sections in the description:

  1. Down migration or recovery path. The exact SQL or steps to reverse the change. If the change is destructive (a dropped column), document how you'd restore from backup.
  2. Feature flag or code gate. What flag controls the new code path, and how do you turn it off?
  3. Backfill reversal. If the migration writes data, how do you undo it? Is there a reversible UPDATE?
  4. Estimated lock window. How long did the migration hold a lock on the Neon branch test? If it's more than 100ms on a hot table, the PR doesn't merge.

The PR template enforces this. The reviewer's job is to read the rollback plan first, before the schema diff. If the rollback plan is "rebuild from backup," that's a real answer, but it should be an explicit one, not a discovery you make at 2am.

For an end-to-end view of how this fits into a production stack, our scale-MVP-to-production guide covers the broader operational checklist. And if you're tracing migration-induced latency cliffs in production, the OpenTelemetry guide shows how to instrument your DDL and connect it to your traces.

Common pitfalls

A few patterns that look correct and break:

  • Squashing migrations that already ran. Fine for dev branches, dangerous after staging or prod has applied them. The migrations table now disagrees with the squashed file, and the next deploy panics.
  • Coupling app deploys with migration deploys. If the migration takes 30 seconds, your app rolls out late. If it locks for 10 minutes, your CI times out and you're stuck mid-deploy.
  • No staging traffic to expose lock contention. A staging DB with no concurrent reads will not surface the lock-queue problem. Use synthetic traffic or test against a Neon branch of production.
  • Trusting ORM-generated SQL without reading it. Drizzle Kit and Prisma Migrate generate SQL that's correct in dev and dangerous in production. Read it, every time.

When you can skip all of this

Be honest about scope. Expand-contract has overhead. If your situation is any of these, ship the simple migration:

  • Pre-launch with no users. No traffic to lock out. Run whatever you want.
  • Maintenance window is acceptable. A B2B tool used 9-to-5 in one timezone can take a 2am downtime cleanly.
  • Read-only or append-only data. No reads to lock, or no writes to coordinate.
  • Single-tenant SaaS with low concurrency. A 50-second ALTER against five users isn't an incident.

The pattern has real overhead and real payoff; use it where the stakes justify the extra deploys. The mid-tier engineers on Cadence ($1,000/week) handle most simple migrations with the right safety nets in place; the senior tier ($1,500/week) is where you reach when you need someone to design the multi-deploy expand-contract sequence for a 200-million-row table without taking the system down.

Need a senior engineer for a tricky migration this week? Cadence shortlists vetted Postgres-fluent engineers in 2 minutes, with a 48-hour free trial. Every engineer on the platform is AI-native by default, so they'll move through the expand-contract dance with Cursor and Claude Code as their daily drivers. Audit your migration stack or book a senior engineer to own the rollout.

Steps

  1. Expand. Add the new column nullable. Skip defaults that would rewrite the table.
  2. Dual-write. Deploy application code that writes to both the old and new columns.
  3. Backfill. Run batched UPDATE statements (1k-10k rows) with pg_sleep(0.1) between batches until no rows remain.
  4. Switch reads. Deploy application code that reads from the new column. Verify with metrics, not vibes.
  5. Stop dual-writing. Deploy application code that writes only to the new column.
  6. Contract. In a separate, final migration, drop the old column. Add NOT NULL to the new column if needed, using the NOT VALID then VALIDATE split.

FAQ

How long should a production migration take?

Most safe migrations finish in under 100ms because they only touch the catalog (adding a nullable column, creating a NOT VALID constraint). Backfills can run for hours, but they run in batches that never block live traffic. Anything that needs to hold an ACCESS EXCLUSIVE lock for more than a couple of seconds should be reworked; with lock_timeout and statement_timeout set, it fails fast instead of taking the table down.

Can I just run ALTER TABLE during off-hours?

On a small, low-traffic table, sometimes. On any hot table, no. Off-hours doesn't change the fact that ACCESS EXCLUSIVE blocks every reader and writer for the duration. If your "off-hours" still has 50 requests per second, you're queuing 50 requests for every second the migration holds the lock.

Do I need expand-contract for every migration?

No. Adding a new table is safe. Adding a nullable column with no default is safe (and since Postgres 11, so is adding one with a constant default). Creating an index CONCURRENTLY is safe. The pattern is for changes that would otherwise lock a table or break code that's still running: mostly column renames, type changes, and constraint additions on existing data.

What's the fastest way to test a migration on real data?

Branch your production database. Neon, Xata, and other Postgres-branching providers give you a copy-on-write replica in seconds. Run the migration there, time the lock window with pg_locks, and verify the resulting schema before you touch production.

Should every migration have a down migration?

Every migration PR should ship with a written rollback plan. A reversible down script is the best version. When reversibility is impossible (you dropped a column), the rollback plan documents the recovery path: which backup to restore from, how to reconstruct the data, and the estimated recovery time. The plan, not the script, is what matters.
