
To handle database migrations safely in production, use the expand-migrate-contract pattern, set lock_timeout and statement_timeout before any DDL, test the migration on a Neon branch first, and require a rollback plan in every pull request. Never ship a single ALTER TABLE against a hot table; break it into small, independently revertible deploys.
That's the playbook. The rest of this post is the working code, the lock numbers that matter, an honest comparison of the six migration tools you'd actually consider in 2026, and the Postgres gotchas that bite at scale.
Most schema changes don't break because the SQL was wrong. They break because the SQL was correct, but it ran inside a single deploy that coupled the schema change to code that depended on it.
Postgres takes locks when you run DDL. The most dangerous one is ACCESS EXCLUSIVE, which blocks every reader and writer for the duration of the statement. On a 200-million-row users table, an ALTER TABLE ... ADD COLUMN with a volatile default (or any non-null default on Postgres 10 and earlier) rewrites the table and can hold that lock for minutes. During those minutes, every API call that touches the table queues up. Your p99 latency graph looks like a cliff.
The second failure mode is tooling. Drizzle Kit, Prisma Migrate, and most ORMs generate migration SQL from schema diffs. They are not deployment safety systems. They will happily emit a migration that drops a column the running application is still reading. The diff is correct. The deploy is suicide.
The fix is the same fix engineering teams have been converging on for a decade, but with sharper tools. You stop thinking of a migration as a single atomic event and start thinking of it as a coordinated rollout across multiple deploys.
The pattern decomposes any breaking schema change into three logical phases: expand, migrate, contract. In practice it's usually five or six deploys, and each one is independently revertible.
The shape:

1. Expand: add the new schema (column, table, index) alongside the old. Nothing reads it yet.
2. Migrate: dual-write to old and new, backfill existing rows in batches, then switch reads to the new schema.
3. Contract: stop writing to the old schema, then drop it.
At every checkpoint between phases, the system is in a known good state. If anything goes wrong, you stop, you don't have to run a "down" migration, and the old schema is still there as the source of truth. This is the property that makes the pattern worth the extra deploys.
The classic example is renaming a column. ALTER TABLE users RENAME COLUMN email TO email_address looks atomic. It is not safe. The instant the rename commits, the running application (which still has email in its query strings) starts throwing errors. You'd need a perfectly synchronized code-and-schema deploy, which doesn't exist.
The expand-contract version: add email_address as a new nullable column, dual-write to both, backfill, switch reads, stop dual-writing, drop email. Six deploys instead of one. Zero downtime instead of an outage.
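Sketched as deploy boundaries (illustrative SQL; the dual-write lives in application code in this sketch, though a database trigger works too):

```sql
-- Deploy 1 (expand): new nullable column, catalog-only, no table rewrite
ALTER TABLE users ADD COLUMN email_address text;

-- Deploy 2 (migrate): application starts writing BOTH email and email_address

-- Deploy 3 (migrate): batched backfill of historical rows
-- UPDATE users SET email_address = email WHERE email_address IS NULL ... in batches

-- Deploy 4 (migrate): application reads switch to email_address

-- Deploy 5: application stops writing email

-- Deploy 6 (contract): drop the old column (brief lock, catalog-only)
ALTER TABLE users DROP COLUMN email;
```

At any point before deploy 6, rolling back is just reverting the most recent code deploy; the old column is still there and still correct.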
Before any production DDL, set two session-level limits:

```sql
SET lock_timeout = '2s';
SET statement_timeout = '5s';
```
lock_timeout caps how long the migration waits in the lock queue. Without it, a long-running query in the background can pin your ALTER indefinitely while every new query queues behind it. With a 2-second cap, the migration fails fast instead of taking the database down.
statement_timeout caps how long the migration itself can run. Five seconds is a reasonable upper bound for any metadata-only DDL on Postgres 11+. If you're hitting it, you're doing something that rewrites the table, and you should not be doing that synchronously.
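Put together, a conservative preamble for any synchronous DDL might look like this (a sketch; SET LOCAL scopes the limits to this transaction, and the users column is illustrative):

```sql
BEGIN;
SET LOCAL lock_timeout = '2s';       -- fail fast instead of queuing behind a long query
SET LOCAL statement_timeout = '5s';  -- anything slower is rewriting the table

-- Catalog-only change: nullable column, no default, no rewrite on Postgres 11+
ALTER TABLE users ADD COLUMN last_seen_at timestamptz;

COMMIT;
```

If the ALTER can't get its lock within two seconds, the transaction aborts cleanly and you retry at a quieter moment instead of stalling production traffic.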
For indexes, never use plain CREATE INDEX on a production table. Use the concurrent variant:

```sql
CREATE INDEX CONCURRENTLY idx_users_org_id ON users (org_id);
```
CONCURRENTLY builds the index without blocking writes. It takes longer wall-clock time and uses more disk space, but it doesn't take an ACCESS EXCLUSIVE lock. Two catches: it cannot run inside a transaction, which breaks Prisma Migrate by default (more on that below), and if the build fails it leaves behind an INVALID index that you have to DROP INDEX before retrying.
For constraints, split validation into two steps. Add the constraint as NOT VALID first, which only takes a brief lock to update the catalog. Then VALIDATE CONSTRAINT separately, which takes a much weaker SHARE UPDATE EXCLUSIVE lock that doesn't block reads or writes.
```sql
ALTER TABLE orders ADD CONSTRAINT fk_org
  FOREIGN KEY (org_id) REFERENCES orgs(id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT fk_org;
```
The 2026 landscape has six tools you'd seriously consider. They sit at different points on the spectrum: TypeScript-native, declarative, enterprise, SQL-first.
| Tool | Style | Concurrent index? | Rollback support | Best for |
|---|---|---|---|---|
| Drizzle Kit | TS schema, generates SQL | Yes (manual edit) | Manual | TypeScript teams who want SQL output |
| Prisma Migrate | schema.prisma, generates SQL | Blocked (tx wrap) | Manual | Full-stack TS apps using Prisma Client |
| Atlas | Declarative HCL | Yes | Plan-based | Teams treating schema as code |
| Sqitch | Deploy / verify / revert triplets | Yes | First-class | Teams that need strict reversibility |
| Liquibase | XML / YAML changesets | Yes | Built-in | Enterprise, JVM-heavy stacks |
| Flyway | Versioned SQL files | Yes | Paid tier only | SQL-first teams of any size |
Where each one wins:
Drizzle Kit is the cleanest TypeScript story, especially if you've already chosen Drizzle as your ORM. It generates a SQL file you can read and edit. The downside: it doesn't sequence anything. You generate the SQL, then you edit it to insert CONCURRENTLY, split constraints, and stage deploys yourself. If you want a deeper look at the ORM itself, our Drizzle ORM guide covers the runtime side.
Prisma Migrate has the friendliest DX for full-stack TypeScript apps. It also wraps every migration in a transaction, which means CREATE INDEX CONCURRENTLY flat-out doesn't work without manual SQL. The mitigation is a custom migration with hand-written SQL, which most teams figure out the second time they take production down. The honest Drizzle vs Prisma comparison walks through the trade-off in detail.
Atlas (from the Ariga team) takes a Terraform-style declarative approach. You write the desired schema, Atlas computes the diff. It supports CREATE INDEX CONCURRENTLY natively and has a strong "plan / apply" workflow. It's the most interesting newcomer for teams that already think in IaC.
Sqitch is the rigorous choice. Every migration is a triplet: a deploy script, a verify script, and a revert script. If your team has been burned by un-rollbackable migrations, Sqitch's discipline is worth the verbosity.
Liquibase and Flyway are the JVM-world veterans. Both work fine with Postgres. Liquibase is heavier (XML or YAML changesets, change-tracking metadata in the DB) and shines in regulated environments. Flyway is lighter (just versioned SQL files) and is what most teams pick if they want SQL-first migrations without ORM lock-in. Note Flyway's rollback feature requires the paid tier.
A handful of specific operations are worth memorizing because they look harmless and aren't:
NOT NULL on an existing column. Setting an existing column NOT NULL forces a full table scan to verify every row while holding ACCESS EXCLUSIVE. (Related gotcha: pre-Postgres 11, ADD COLUMN with a default rewrote the whole table; Postgres 11+ stores a constant default in the catalog and skips the rewrite, but volatile defaults still trigger one.) On a large table, do this in steps: backfill the column to non-null values, add a CHECK (col IS NOT NULL) constraint as NOT VALID, then VALIDATE it.
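In SQL, assuming a hypothetical users.email column that has already been backfilled to non-null values:

```sql
-- Brief lock only: catalog change, no scan
ALTER TABLE users
  ADD CONSTRAINT users_email_not_null CHECK (email IS NOT NULL) NOT VALID;

-- Full scan, but under SHARE UPDATE EXCLUSIVE: reads and writes continue
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;

-- Postgres 12+: the validated CHECK proves non-nullness, so this skips the scan
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
```

The last two statements are a Postgres 12+ optimization; on Postgres 11, keeping the validated CHECK constraint in place gives you the same guarantee.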
Type changes. ALTER COLUMN ... TYPE rewrites the table for almost any change that isn't a simple binary-compatible widen (like varchar(50) to varchar(100)). Going from int to bigint rewrites every row. The safe path is the expand-contract dance: add a new bigint column, dual-write, backfill, swap reads, drop the old column.
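A sketch of the int-to-bigint dance on a hypothetical orders.amount_cents column (trigger-based dual-write shown; application-level dual-writes work just as well):

```sql
-- Expand: add the wide column (catalog-only, nullable)
ALTER TABLE orders ADD COLUMN amount_cents_big bigint;

-- Migrate: keep new writes in sync via a trigger
CREATE OR REPLACE FUNCTION sync_amount() RETURNS trigger AS $$
BEGIN
  NEW.amount_cents_big := NEW.amount_cents;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_sync_amount
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION sync_amount();

-- ...then batch-backfill historical rows and switch application reads.

-- Contract: once nothing reads amount_cents, remove the dual-write and drop it
DROP TRIGGER orders_sync_amount ON orders;
DROP FUNCTION sync_amount();
ALTER TABLE orders DROP COLUMN amount_cents;
```

Each of these statements is its own small, revertible deploy; the trigger creation and column drop are catalog-only and fit comfortably under the timeouts set earlier.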
Foreign key additions. ADD FOREIGN KEY takes ACCESS EXCLUSIVE on both the referenced and referencing tables until the constraint is validated. Use the NOT VALID / VALIDATE split shown above.
Large table backfills. Never run a single UPDATE users SET new_col = old_col on a multi-million-row table. It holds row locks for the duration and bloats the WAL. Batch it:
```sql
DO $$
DECLARE
  batch_size INT := 5000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE users
    SET email_address = email
    WHERE id IN (
      SELECT id FROM users
      WHERE email_address IS NULL
      LIMIT batch_size
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    PERFORM pg_sleep(0.1);  -- let autovacuum and live traffic breathe
  END LOOP;
END $$;
-- Caveat: a DO block runs as a single transaction, so row locks accumulate
-- until it commits. For very large tables, drive the same loop from a script
-- that commits each batch in its own transaction.
```
Five thousand rows, hundred-millisecond sleep, repeat until done. The sleep gives autovacuum and the rest of your workload room to breathe. Tune the batch and sleep based on table size and write traffic; 1k rows with 50ms sleep is conservative, 10k rows with 200ms sleep is aggressive.
The most under-used safety tool in 2026 is database branching. Neon, Xata, and Postgres-flavored cloud providers offer copy-on-write database branches that materialize in seconds against your production data.
The flow looks like this:
1. Create a branch: neon branches create --name migration-test (or the equivalent in your provider).
2. Run the full migration against the branch, with the same lock_timeout and statement_timeout you'd use in production.
3. Time the lock window (pg_locks) and check query plans against the real data volume.
4. Verify the resulting schema, then delete the branch.

This catches the failures that staging never catches: lock contention against real data volumes, queries that suddenly need a sequential scan, foreign key validations that take five minutes instead of five seconds. It's the closest thing to running the migration in production without the consequences. Cadence engineers use this flow on every Postgres-shaped client because it's faster than maintaining a useful staging dataset.
If you're not sure whether your current migration tooling is up to the job, our Ship or Skip stack audit gives you an honest grade on what to keep and what to replace before the next big schema change.
This is the practice that separates teams that have been burned from teams that haven't been burned yet. Every pull request that contains a migration must include four sections in the description: what the migration changes, which locks it takes and for how long, the rollback plan, and how the result will be verified.
The PR template enforces this. The reviewer's job is to read the rollback plan first, before the schema diff. If the rollback plan is "rebuild from backup," that's a real answer, but it should be an explicit one, not a discovery you make at 2am.
For an end-to-end view of how this fits into a production stack, our scale-MVP-to-production guide covers the broader operational checklist. And if you're tracing migration-induced latency cliffs in production, the OpenTelemetry guide shows how to instrument your DDL and connect it to your traces.
A few patterns that look correct and break:

- A column rename (or any rename) shipped in the same deploy as the code that depends on it. There is no perfectly synchronized code-and-schema deploy.
- Plain CREATE INDEX on a hot table instead of CREATE INDEX CONCURRENTLY.
- A single unbatched UPDATE backfill across millions of rows.
- Any DDL run without lock_timeout and statement_timeout set first.

Be honest about scope. Expand-contract has overhead. If your situation is any of these, ship the simple migration:

- The table is small and low-traffic.
- The change is purely additive: a new table, a nullable column with no default, a CONCURRENTLY-built index.
- It's an internal tool; a blocking ALTER against five users isn't an incident.

The pattern has ROI curves. Use it where the math says yes. The mid-tier engineers on Cadence ($1,000/week) handle most simple migrations with the right safety nets in place; the senior tier ($1,500/week) is where you reach when you need someone to design the multi-deploy expand-contract sequence for a 200-million-row table without taking the system down.
Need a senior engineer for a tricky migration this week? Cadence shortlists vetted Postgres-fluent engineers in 2 minutes, with a 48-hour free trial. Every engineer on the platform is AI-native by default, so they'll move through the expand-contract dance with Cursor and Claude Code as their daily drivers. Audit your migration stack or book a senior engineer to own the rollout.
To recap the backfill mechanics: run batched UPDATE statements (1k-10k rows) with pg_sleep(0.1) between batches until no rows remain, then add NOT NULL to the new column if needed, using the NOT VALID then VALIDATE split.

Most safe migrations finish in under 100ms because they only touch the catalog (adding a nullable column, creating a NOT VALID constraint). Backfills can run for hours, but they run in batches that never block live traffic. Anything that holds an ACCESS EXCLUSIVE lock for longer than the lock_timeout should fail and be reworked.
On a small, low-traffic table, sometimes. On any hot table, no. Off-hours doesn't change the fact that ACCESS EXCLUSIVE blocks every reader and writer for the duration. If your "off-hours" still has 50 requests per second, you're queuing 50 requests for every second the migration holds the lock.
No. Adding a new table is safe. Adding a nullable column with no default is safe in Postgres 11+. Creating an index CONCURRENTLY is safe. The pattern is for changes that would otherwise lock a table or break code that's still running, which is mostly column renames, type changes, and constraint additions on existing data.
Branch your production database. Neon, Xata, and other Postgres-branching providers give you a copy-on-write replica in seconds. Run the migration there, time the lock window with pg_locks, and verify the resulting schema before you touch production.
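While the migration runs on the branch, a query like this (joining the pg_locks and pg_stat_activity system views) shows who is waiting on whom:

```sql
SELECT a.pid,
       a.query,
       l.locktype,
       l.relation::regclass AS relation,
       l.mode,
       l.granted
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted
ORDER BY a.query_start;
```

If your migration shows up here waiting on an ACCESS EXCLUSIVE lock for more than a moment, the production run would have queued real traffic behind it.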
Every migration PR should ship with a written rollback plan. A reversible down script is the best version. When reversibility is impossible (you dropped a column), the rollback plan documents the recovery path: which backup to restore from, how to reconstruct the data, and the estimated recovery time. The plan, not the script, is what matters.