How to write production-grade tests in 2026

Production-grade tests in 2026 are the ones that catch the bug before a paying customer does, and stay green through a year of refactors. Skip the testing pyramid. Optimize for integration tests against real services, contract tests at every boundary, and a small E2E layer that runs on real data. Quarantine flaky tests within 24 hours. Treat coverage percentage as a smoke alarm, not a goal.

That's the short answer. The long answer is that most teams are still writing 2018-shaped test suites in 2026, and AI-generated code has made the cost of getting this wrong much higher than it used to be.

Why testing changed in 2026

Three things shifted: AI writes a meaningful share of your code, your SaaS depends on more third-party APIs than ever (Stripe, Supabase, Resend, Clerk, Vercel, OpenAI), and serverless made "spin up a real Postgres" a one-line CI step.

The effect: unit tests catch a smaller share of real bugs, integration tests are cheaper than ever, and the bugs AI ships are subtler. A generated function that "looks right" and passes a generated unit test is the modern typo: structurally fine, semantically wrong.

The testing pyramid is dead. Use the trophy.

The classic Mike Cohn pyramid (lots of unit tests, fewer integration, very few E2E) made sense when integration tests were slow and flaky and unit tests were the only thing CI could finish before lunch. Neither constraint holds anymore.

Kent C. Dodds's testing trophy is closer to the truth: a thick middle of integration tests, a smaller base of static analysis (TypeScript, ESLint, Biome), a small layer of unit tests for pure functions, and a thin cap of E2E. Here's how the two compare in practice:

Dimension	Pyramid (2018)	Trophy (2026)
Bulk of suite	Unit tests (~70%)	Integration tests (~60%)
Static analysis	Optional add-on	Foundational layer
Unit tests	High coverage required	Only for pure functions and tricky logic
Integration tests	"Too slow for CI"	Default for any code touching a DB or API
E2E	Aspirational, rarely runs	Small, mandatory, on real-ish data
CI runtime target	< 10 min	< 8 min for unit + integration, < 15 min including E2E
What catches Stripe webhook bugs	Nothing	Integration + contract tests

The shift matters most for SaaS. A SaaS bug is rarely "this function returned the wrong number." It's "this function returned the right number, but the webhook hadn't fired yet, so the user record was stale, so the entitlement check failed." That's an integration concern. You cannot mock your way to confidence in it.

Why integration tests beat unit tests for SaaS

Most SaaS code is glue. It takes input from one third party (a Stripe webhook, a Clerk user event), threads it through your database, and pushes the result to another third party (a Resend email, a Slack notification). The interesting bugs live in the glue.

A unit test of the glue function with everything mocked passes as long as the function is internally consistent. It keeps passing when, say, you upgrade @clerk/nextjs and the payload shape changes in a way your TypeScript types didn't catch because you wrote them by hand. The mock still returns the old shape, so the test stays green while production breaks.

Integration tests that spin up real Postgres (via Testcontainers, Docker Compose, or Render's preview environments) and hit real third-party sandboxes catch the bugs that ship. The pattern we use across Cadence engagements: every booking flow has an integration test that talks to a real Stripe test-mode account, a real Postgres instance, and a real Resend sandbox. If any of the three moves, the test breaks loudly.

A practical companion read: our guide to running integration tests in CI walks through the service-container setup that makes this fast enough to run on every PR.

Contract tests at every boundary

Contract tests are the missing layer in most 2026 test suites. They live between your code and the third party, and they answer one question: "Does the API still match the assumption we coded against?"

You write a contract test once per integration, run it nightly (or hourly if the vendor is volatile), and it fails the second the upstream shape changes. We've watched Stripe quietly rename fields, Supabase tighten RLS defaults, and OpenAI shift response formats. Without contract tests, you find out from a customer.

Tools worth knowing: Pact for cross-service contracts, MSW for HTTP-layer mocks generated from a contract, OpenAPI/zod schemas as the contract itself. The pattern matters more than the tool. Pin the shape. Test the shape against reality on a schedule. Treat a contract failure as a P1.
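"Pin the shape, test the shape against reality" can be sketched without any library. This is a minimal hand-rolled version, not Pact's or zod's API: the invoiceContract fields are hypothetical, and the two literal payloads stand in for what a scheduled job would fetch from the vendor's sandbox.

```typescript
// The fields our code reads from a vendor response, and the type we assume.
// Field names here are illustrative, not a real vendor schema.
type FieldType = "string" | "number" | "boolean";
type Contract = Record<string, FieldType>;

const invoiceContract: Contract = {
  id: "string",
  amount_due: "number",
  paid: "boolean",
};

// Returns one message per mismatch. An empty array means the payload
// still matches the shape we coded against.
function checkContract(
  payload: Record<string, unknown>,
  contract: Contract,
): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(contract)) {
    const expected = contract[field];
    if (!(field in payload)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof payload[field] !== expected) {
      problems.push(
        `field ${field}: expected ${expected}, got ${typeof payload[field]}`,
      );
    }
  }
  return problems;
}

// Stand-ins for sandbox responses: today's shape, and the shape after
// an upstream rename plus a type change.
const currentPayload = { id: "inv_1", amount_due: 1500, paid: false };
const driftedPayload = { id: "inv_1", amount_owed: 1500, paid: "false" };

const okProblems = checkContract(currentPayload, invoiceContract);
const driftProblems = checkContract(driftedPayload, invoiceContract);
// okProblems is empty; driftProblems names the missing field and the type change.
```

In a real suite the same contract object can also generate the per-PR mocks, so the mock and the scheduled check cannot drift apart.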

This pairs naturally with the discipline of mocking external APIs in tests for the per-PR run. Contract tests verify your mocks are still honest.

Snapshot tests are a trap (mostly)

Snapshot tests look efficient and produce confident-looking green checkmarks. They are also the single largest source of "we updated the snapshots without reading the diff" incidents.

When a snapshot fails, the natural reaction is npm test -- -u. The engineer who shipped the actual bug also shipped the snapshot that bakes the bug into "expected behavior." We've seen this destroy six months of test value in one PR.

Use snapshots in exactly two cases:

The output is genuinely opaque and inspecting the diff is the test (e.g., compiled SQL, rendered HTML email).
The snapshot file is small enough that a human will actually read it on every change.

For everything else, write an explicit assertion. expect(result.totalCents).toBe(1500) is more typing than expect(result).toMatchSnapshot(), but it tells the next reader what the test means. Snapshot files don't. If you're working in Next.js, our Vitest setup guide for Next.js shows the assertion patterns we default to before reaching for any snapshot.

Flaky tests need a 24-hour quarantine rule

The most damaging thing about a flaky test isn't the failure. It's the slow normalization of red. Once one test flakes weekly, the team learns to re-run CI without reading the failure. The next real bug ships under cover of "probably just the flaky one."

The rule we apply everywhere: if a test fails on a PR that didn't touch the relevant code, the test goes into a quarantine tag within 24 hours. Quarantined tests still run, still report, but don't block merges. The engineer who owns the area has a week to either fix it or delete it. If neither happens, it gets deleted automatically.
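The rule is simple enough to encode, which is what makes the automatic deletion enforceable. A sketch under assumptions: the TestRecord shape and status names are hypothetical, and a real version would read these timestamps from your CI history.

```typescript
type TestRecord = {
  name: string;
  // When the test first failed on a PR that did not touch the relevant code.
  firstUnrelatedFailureAt: Date | null;
  // When the test was moved to the quarantine tag. null means it still blocks merges.
  quarantinedAt: Date | null;
};

type Status =
  | "healthy"
  | "needs-quarantine"
  | "quarantine-overdue"
  | "quarantined"
  | "delete";

const DAY_MS = 24 * 60 * 60 * 1000;
const WEEK_MS = 7 * DAY_MS;

function triage(test: TestRecord, now: Date): Status {
  if (test.quarantinedAt !== null) {
    // The owner gets one week to fix or delete; after that, delete automatically.
    const age = now.getTime() - test.quarantinedAt.getTime();
    return age > WEEK_MS ? "delete" : "quarantined";
  }
  if (test.firstUnrelatedFailureAt === null) return "healthy";
  // A flaky test must be quarantined within 24 hours of the unrelated failure.
  const sinceFailure = now.getTime() - test.firstUnrelatedFailureAt.getTime();
  return sinceFailure > DAY_MS ? "quarantine-overdue" : "needs-quarantine";
}

const now = new Date("2026-01-09T00:00:00Z");

const freshFlake = triage(
  { name: "sends receipt email", firstUnrelatedFailureAt: new Date("2026-01-08T12:00:00Z"), quarantinedAt: null },
  now,
);
const staleQuarantine = triage(
  { name: "syncs seats", firstUnrelatedFailureAt: new Date("2025-12-30T00:00:00Z"), quarantinedAt: new Date("2026-01-01T00:00:00Z") },
  now,
);
// freshFlake is "needs-quarantine"; staleQuarantine is "delete".
```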

This sounds harsh. It is the only thing that works at scale. Companies that don't do this have CI suites with 4% flake rates and engineers who treat green as advisory.

A short flake-triage checklist:

Did the test depend on timing? Replace setTimeout waits with explicit polling on a condition.
Did it depend on order? Add seeded randomness or isolate state between tests.
Did it hit a real network? Add retry with jitter or move it to the contract-test schedule.
Did it depend on a clock? Inject the clock as a dependency and freeze it in tests.
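Two of the checklist items, the injected clock and retry with jitter, share one idea: pass the source of nondeterminism in as a parameter so the test controls it. A minimal sketch, where the 14-day trial, isTrialExpired, and backoffDelayMs are hypothetical names for illustration:

```typescript
// A clock is anything that can report the current time in milliseconds.
type Clock = { now: () => number };

// Production code would pass this one.
const systemClock: Clock = { now: () => Date.now() };

// Tests pass a frozen clock so the result never depends on when CI runs.
function frozenClock(ms: number): Clock {
  return { now: () => ms };
}

const TRIAL_MS = 14 * 24 * 60 * 60 * 1000; // hypothetical 14-day trial

function isTrialExpired(startedAtMs: number, clock: Clock): boolean {
  return clock.now() - startedAtMs >= TRIAL_MS;
}

// Retry delay with jitter: a random delay between 0 and an exponentially
// growing ceiling, capped at capMs. The random source is injected for the
// same reason as the clock.
function backoffDelayMs(
  attempt: number,
  random: () => number,
  baseMs: number = 100,
  capMs: number = 5000,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

const justBefore = isTrialExpired(0, frozenClock(TRIAL_MS - 1)); // false
const exactlyAt = isTrialExpired(0, frozenClock(TRIAL_MS)); // true
const thirdRetry = backoffDelayMs(3, () => 0.5); // 400
const cappedRetry = backoffDelayMs(10, () => 0.5); // 2500
```

In production the call sites pass systemClock and Math.random; nothing else about the code changes.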

AI-generated tests are different and need different scrutiny

If you have Claude Code, Cursor, or Copilot writing tests (you should), the failure mode shifts. AI is excellent at writing tests that pass. It is mediocre at writing tests that catch bugs. The two are not the same thing.

Three patterns we've seen repeatedly:

Mock-everything tests. The AI mocks the function under test's dependencies in a way that hard-codes the expected behavior. The test asserts the mock got called. The code could be deleted and the test would still pass against a stub.
Happy-path bias. AI writes the test for the path that's documented in the function's JSDoc. It rarely writes the test for the empty array, the duplicate insert, the timezone boundary, the unicode in the name field.
Tautological assertions. expect(result).toEqual(expected) where expected was computed by calling the function being tested. This passes forever and tests nothing.

The fix is review discipline, not a different tool. When you accept an AI-generated test, ask: "If I delete the implementation, does this test fail for the right reason?" If you can't answer yes in five seconds, rewrite the test.
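The deletion question can be run as an experiment rather than asked as a thought exercise. In this sketch, applyDiscount is a hypothetical function under test, and deletedImpl is a stub standing in for "the implementation is gone". Each check returns true when its assertions pass.

```typescript
type Discount = (totalCents: number, percent: number) => number;

// Hypothetical function under test.
const applyDiscount: Discount = (totalCents, percent) =>
  Math.round(totalCents * (1 - percent / 100));

// Stand-in for a deleted implementation: a stub that always returns 0.
const deletedImpl: Discount = () => 0;

// Tautological: the expected value is computed by the function being tested,
// so the check passes for any deterministic implementation.
function tautologicalCheck(impl: Discount): boolean {
  const expected = impl(1500, 10);
  return impl(1500, 10) === expected;
}

// Constraining: expected values are worked out by hand.
// 10% off 1500 cents is 1350; 0% off leaves the total unchanged.
function constrainingCheck(impl: Discount): boolean {
  return (
    impl(1500, 10) === 1350 &&
    impl(1500, 0) === 1500 &&
    impl(0, 10) === 0
  );
}

const results = {
  tautologicalOnReal: tautologicalCheck(applyDiscount), // true
  tautologicalOnDeleted: tautologicalCheck(deletedImpl), // true: tests nothing
  constrainingOnReal: constrainingCheck(applyDiscount), // true
  constrainingOnDeleted: constrainingCheck(deletedImpl), // false: fails for the right reason
};
```

A check that stays green when the implementation is swapped for a stub is the one to rewrite.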

Every engineer on Cadence is AI-native by default (Cursor, Claude Code, Copilot used daily, vetted on a voice interview before they unlock bookings). The vetting specifically probes for this. We've seen too many AI-generated suites pass review because no one checked whether the tests actually constrained the code.

Coverage gates: the percentage trap

A coverage target of 80% is one of the most counterproductive metrics in software. It optimizes for lines exercised, not bugs prevented. It rewards tests of trivial getters and punishes complex integration scenarios that touch fewer lines per assertion.

What to do instead:

Drop the global percentage gate. Replace it with per-file or per-folder gates on the files that matter (billing, auth, anything touching money or PII).
Add a coverage delta gate. A PR that drops coverage on a critical path by more than 2% gets a warning, not a block.
Track uncovered critical paths explicitly. A spreadsheet of "things the suite does not cover and why" is more useful than 87% global coverage.
Use mutation testing on the critical paths. Stryker (JS/TS) or Pitest (JVM) tells you whether your tests would catch a deliberately-introduced bug. This is the real coverage metric.
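The first two items can be sketched as one small function that a CI step runs over a coverage report. The folder names, thresholds, and report shape below are assumptions for illustration, not the output format of any particular coverage tool.

```typescript
// Folder path -> line coverage percent.
type Coverage = Record<string, number>;

type GateResult = { failures: string[]; warnings: string[] };

// Hypothetical gates: only the folders that matter get a threshold.
const gates: Coverage = { "src/billing": 90, "src/auth": 85 };

// A drop larger than this many percentage points produces a warning.
const MAX_DROP_POINTS = 2;

function evaluateCoverage(
  current: Coverage,
  baseline: Coverage,
  thresholds: Coverage,
): GateResult {
  const failures: string[] = [];
  const warnings: string[] = [];
  for (const folder of Object.keys(thresholds)) {
    const now = current[folder] ?? 0;
    const min = thresholds[folder] ?? 0;
    // Per-folder gate: this one blocks the merge.
    if (now < min) {
      failures.push(`${folder}: ${now}% is below the ${min}% gate`);
    }
    // Delta gate: this one warns, it does not block.
    const before = baseline[folder];
    if (before !== undefined && before - now > MAX_DROP_POINTS) {
      warnings.push(
        `${folder}: dropped ${(before - now).toFixed(1)} points since baseline`,
      );
    }
  }
  return { failures, warnings };
}

// src/ui has no gate, so its 40% is ignored entirely.
const report = evaluateCoverage(
  { "src/billing": 92, "src/auth": 80, "src/ui": 40 },
  { "src/billing": 95, "src/auth": 81, "src/ui": 40 },
  gates,
);
// report.failures: src/auth is below its gate.
// report.warnings: src/billing dropped 3.0 points but still passes its gate.
```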

The percentage-gate failure mode: a junior engineer needs to merge a hotfix, the global gate trips at 79.8%, they add three meaningless tests on a config file, the number climbs back over the line, the hotfix ships, and no one ever revisits the meaningless tests.
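Mutation testing, by contrast, is hard to game, and the idea fits in a few lines. This is a hand-rolled illustration, not Stryker's or Pitest's API: the free-shipping rule is hypothetical, and the mutant is written by hand where a tool would generate it.

```typescript
type Impl = (cents: number) => boolean;

// Hypothetical rule: orders of 5000 cents or more ship free.
const original: Impl = (cents) => cents >= 5000;

// A mutant: the same code with one operator changed.
const mutant: Impl = (cents) => cents > 5000;

// A suite returns true when every assertion in it passes.
type Suite = (impl: Impl) => boolean;

// Exercises the only line in the function, so line coverage is 100%.
const weakSuite: Suite = (impl) =>
  impl(10000) === true && impl(100) === false;

// Same assertions plus the boundary case.
const strongSuite: Suite = (impl) =>
  weakSuite(impl) && impl(5000) === true;

// A mutant is "killed" when the suite passes on the original code
// and fails on the mutated code.
function kills(suite: Suite, orig: Impl, mut: Impl): boolean {
  return suite(orig) && !suite(mut);
}

const weakKills = kills(weakSuite, original, mutant); // false: mutant survives
const strongKills = kills(strongSuite, original, mutant); // true: mutant killed
```

Both suites report the same line coverage. Only one of them would notice the off-by-one.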

End-to-end tests on real data

The thinnest layer of the trophy is also the most controversial. E2E tests are slow, occasionally flaky, and the temptation is to skip them entirely. Don't.

The rule: E2E covers your top three to five revenue-critical user journeys, runs on every deploy to staging, and uses data that resembles production. "Real data" doesn't mean "actual customer data" (that's a compliance bomb). It means an anonymized snapshot, a synthetic generator (Snaplet, Drizzle Seed) seeded from realistic distributions, or a long-lived staging tenant with curated fixtures.
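For the synthetic-generator option, the property that matters is that the same seed always produces the same fixtures, so a failing E2E run can be reproduced. A minimal sketch that does not use Snaplet's or Drizzle Seed's API: the plan mix, seat range, and FixtureUser shape are made-up assumptions, and the generator is a small mulberry32-style PRNG.

```typescript
// Small seeded pseudo-random generator (mulberry32-style).
// Returns numbers in [0, 1). Same seed, same sequence.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type FixtureUser = {
  id: string;
  plan: "free" | "pro" | "team";
  seats: number;
};

function makeUsers(seed: number, count: number): FixtureUser[] {
  const rand = seededRandom(seed);
  const users: FixtureUser[] = [];
  for (let i = 0; i < count; i++) {
    const r = rand();
    // Hypothetical plan mix: 70% free, 25% pro, 5% team.
    const plan = r < 0.7 ? "free" : r < 0.95 ? "pro" : "team";
    // Team plans get 2 to 50 seats; everyone else gets 1.
    const seats = plan === "team" ? 2 + Math.floor(rand() * 49) : 1;
    users.push({ id: `user_${i + 1}`, plan, seats });
  }
  return users;
}

const firstRun = makeUsers(42, 50);
const secondRun = makeUsers(42, 50);
// The two runs are identical, field for field.
```

Log the seed in the E2E report so a red run can be replayed against exactly the same data.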

Playwright is the default in 2026. Cypress is fine if you already have it. Both produce traces and video on failure, which collapses debug time from hours to minutes. A typical Cadence-built E2E suite covers: signup, paywall, the core "aha moment" workflow, and one billing path. Four tests, 8 minutes, 95% of the value. For the deeper setup, see our E2E testing for SaaS guide.

What to do this week

If you read this and your current suite is 90% unit tests with mocked everything, here's the order that recovers the most value fastest:

1. Audit the top five integration points (auth, billing, email, the main third-party API, your own webhooks). Write one integration test per point that hits the real sandbox.
2. Add a contract test for each of the same five integrations, scheduled hourly.
3. Quarantine every test that's failed without a related code change in the last 30 days. Triage in week two.
4. Replace any snapshot test you can't justify in one sentence with an explicit assertion.
5. Drop the global coverage percentage gate. Add per-folder gates on billing/, auth/, and anything that writes to the database.

If you're a solo founder or a two-person team pre-revenue, you can skip steps 2 and 4 until you have real customers. Best practices have ROI curves and you're not on the curve yet. Ship the product, then come back.

If you're past the point where you can fit the whole suite in your head, this is exactly the work a Cadence senior engineer ($1,500/week) tends to own. Two weeks of dedicated test-suite work usually pays for itself the first time a billing edge case doesn't reach production. You can audit your current stack with our ship-or-skip tool to get an honest grade on what to prioritize first.

When you can skip most of this

A two-person team pre-revenue does not need contract tests against a sandbox Stripe account. They need to ship the product to someone who will pay. Best practices apply when their cost is lower than the bugs they prevent.

The line: once you have paying customers, once a regression would cost more than a day of engineering time, once two engineers can both deploy to production, you're past the skip threshold. Until then, write the integration test for the billing flow and call it done.

If you're staring at a flaky CI pipeline, a coverage number that lies, and a roadmap that won't wait, a Cadence senior engineer can take it on for a week at $1,500 with a 48-hour free trial. The first commit usually lands within 27 hours of booking. If they don't move the suite forward, you don't pay.

FAQ

What's the difference between production-grade tests and just having good test coverage?

Production-grade tests are evaluated by whether they catch real bugs before customers do, not by what percentage of lines they touch. A suite with 95% coverage and no contract tests is less production-grade than a suite with 60% coverage that catches every Stripe API change overnight.

How many integration tests are enough?

Enough to cover every external boundary (databases, APIs, queues, webhooks) at least once on the happy path and once on the most common failure mode. For a typical SaaS, that's 20 to 60 integration tests. Below 10 and you're probably under-covered; above 200 and you're probably testing the same boundaries redundantly.

Should we use AI to write tests?

Yes, but treat AI-generated tests with more scrutiny than AI-generated code. Ask whether the test would fail if the implementation were deleted. If the answer is "no" or "I'm not sure," rewrite the test. The bias of generative models is toward tests that pass, not tests that constrain behavior.

Are snapshot tests ever a good idea?

Two cases: opaque output where the diff is the assertion (rendered HTML email, compiled SQL), and small snapshots that humans will actually read on every change. For everything else, write explicit assertions; future-you will be grateful.

How long should our CI test run take?

Aim for under 8 minutes for unit and integration, under 15 including E2E on a typical PR. Past those numbers, engineers start avoiding CI, splitting work into smaller PRs to dodge the runtime, or worst case, marking tests as .skip to unblock merges. Speed is a feature of the suite.

Should every PR run the full E2E suite?

No. Run full E2E on merges to main and deploys to staging. On PRs, run the subset of E2E tests touching code paths the PR modifies. The savings on PR time are large and the regression risk is small.
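One way to pick that subset is a hand-maintained map from source folders to the E2E specs that exercise them. A sketch only: the folder names and spec file names are hypothetical, and the changed-file list would come from your CI provider or git diff.

```typescript
// Hypothetical mapping from source folders to the E2E specs that cover them.
const e2eMap: Record<string, string[]> = {
  "src/billing/": ["e2e/checkout.spec.ts", "e2e/paywall.spec.ts"],
  "src/auth/": ["e2e/signup.spec.ts"],
  "src/onboarding/": ["e2e/signup.spec.ts", "e2e/core-workflow.spec.ts"],
};

// Returns the de-duplicated, sorted list of specs to run for a PR.
function selectE2E(
  changedFiles: string[],
  map: Record<string, string[]>,
): string[] {
  const selected: string[] = [];
  for (const prefix of Object.keys(map)) {
    const touched = changedFiles.some((file) => file.startsWith(prefix));
    if (!touched) continue;
    for (const spec of map[prefix] ?? []) {
      if (selected.indexOf(spec) === -1) selected.push(spec);
    }
  }
  return selected.sort();
}

const forAuthChange = selectE2E(
  ["src/auth/session.ts", "src/onboarding/steps.ts", "README.md"],
  e2eMap,
);
const forDocsChange = selectE2E(["README.md"], e2eMap);
// forAuthChange: the signup and core-workflow specs, each listed once.
// forDocsChange: empty, so the PR skips E2E and the merge to main runs the full suite.
```

The map can go stale, which is why the full suite still runs on every merge to main.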

Jayesh Patil

Web Developer

Web developer at withRemote. Writes on accessibility, responsive design, and the boring-but-correct front-end fundamentals.
