
Production-grade tests in 2026 are the ones that catch the bug before a paying customer does, and stay green through a year of refactors. Skip the testing pyramid. Optimize for integration tests against real services, contract tests at every boundary, and a small E2E layer that runs on real data. Quarantine flaky tests within 24 hours. Treat coverage percentage as a smoke alarm, not a goal.
That's the short answer. The long answer is that most teams are still writing 2018-shaped test suites in 2026, and AI-generated code has made the cost of getting this wrong much higher than it used to be.
Three things shifted: AI writes a meaningful share of your code, your SaaS depends on more third-party APIs than ever (Stripe, Supabase, Resend, Clerk, Vercel, OpenAI), and serverless made "spin up a real Postgres" a one-line CI step.
The effect: unit tests catch a smaller share of real bugs, integration tests are cheaper than ever, and the bugs AI ships are subtler. A generated function that "looks right" and passes a generated unit test is the modern typo: structurally fine, semantically wrong.
The classic Mike Cohn pyramid (lots of unit tests, fewer integration, very few E2E) made sense when integration tests were slow and flaky and unit tests were the only thing CI could finish before lunch. Neither constraint holds anymore.
Kent C. Dodds's testing trophy is closer to the truth: a thick middle of integration tests, a smaller base of static analysis (TypeScript, ESLint, Biome), a small layer of unit tests for pure functions, and a thin cap of E2E. Here's how the two compare in practice:
| Dimension | Pyramid (2018) | Trophy (2026) |
|---|---|---|
| Bulk of suite | Unit tests (~70%) | Integration tests (~60%) |
| Static analysis | Optional add-on | Foundational layer |
| Unit tests | High coverage required | Only for pure functions and tricky logic |
| Integration tests | "Too slow for CI" | Default for any code touching a DB or API |
| E2E | Aspirational, rarely runs | Small, mandatory, on real-ish data |
| CI runtime target | < 10 min | < 8 min for unit + integration, < 15 min including E2E |
| What catches Stripe webhook bugs | Nothing | Integration + contract tests |
The shift matters most for SaaS. A SaaS bug is rarely "this function returned the wrong number." It's "this function returned the right number, but the webhook hadn't fired yet, so the user record was stale, so the entitlement check failed." That's an integration concern. You cannot mock your way to confidence in it.
Most SaaS code is glue. It takes input from one third party (a Stripe webhook, a Clerk user event), threads it through your database, and pushes the result to another third party (a Resend email, a Slack notification). The interesting bugs live in the glue.
A unit test of the glue function with everything mocked passes as long as the function is internally consistent. It keeps passing when, say, you upgrade `@clerk/nextjs` and the payload shape changes in a way your TypeScript types don't catch because you wrote them by hand. The mock still returns the old shape, so the test stays green while production breaks.
Integration tests that spin up real Postgres (via Testcontainers, Docker Compose, or Render's preview environments) and hit real third-party sandboxes catch the bugs that ship. The pattern we use across Cadence engagements: every booking flow has an integration test that talks to a real Stripe test-mode account, a real Postgres instance, and a real Resend sandbox. If any of the three moves, the test breaks loudly.
A practical companion read: our guide to running integration tests in CI walks through the service-container setup that makes this fast enough to run on every PR.
Contract tests are the missing layer in most 2026 test suites. They live between your code and the third party, and they answer one question: "Does the API still match the assumption we coded against?"
You write a contract test once per integration, run it nightly (or hourly if the vendor is volatile), and it fails the second the upstream shape changes. We've watched Stripe quietly rename fields, Supabase tighten RLS defaults, and OpenAI shift response formats. Without contract tests, you find out from a customer.
Tools worth knowing: Pact for cross-service contracts, MSW for HTTP-layer mocks generated from a contract, OpenAPI/zod schemas as the contract itself. The pattern matters more than the tool. Pin the shape. Test the shape against reality on a schedule. Treat a contract failure as a P1.
This pairs naturally with the discipline of mocking external APIs in tests for the per-PR run. Contract tests verify your mocks are still honest.
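To make "pin the shape, test the shape against reality" concrete, here is a minimal dependency-free sketch. The contract, the field names (`amountCents`, `paid`), and both payloads are hypothetical and do not reflect any vendor's real schema. In practice the pinned shape would more likely be a zod or OpenAPI schema, and the upstream payload would come from a scheduled call to the vendor sandbox.

```typescript
// Hypothetical pinned contract: only the fields our code actually reads.
type FieldType = "string" | "number" | "boolean";
type Contract = Record<string, FieldType>;

const paymentEventContract: Contract = {
  id: "string",
  amountCents: "number",
  currency: "string",
  paid: "boolean",
};

// Returns human-readable drift messages. An empty list means the payload
// still satisfies every assumption we coded against. Extra fields are
// ignored on purpose: additive upstream changes do not break consumers.
function findContractDrift(contract: Contract, payload: unknown): string[] {
  if (typeof payload !== "object" || payload === null) {
    return ["payload is not an object"];
  }
  const record = payload as Record<string, unknown>;
  const drift: string[] = [];
  for (const [field, expected] of Object.entries(contract)) {
    if (!(field in record)) {
      drift.push(`missing field "${field}"`);
    } else if (typeof record[field] !== expected) {
      drift.push(`field "${field}" is ${typeof record[field]}, expected ${expected}`);
    }
  }
  return drift;
}

// The fixture our per-PR mocks return (assumed), and a stand-in for a
// sandbox response after the vendor renamed a field (also assumed).
const mockFixture = { id: "evt_1", amountCents: 1500, currency: "usd", paid: true };
const renamedUpstream = { id: "evt_2", amount_cents: 1500, currency: "usd", paid: true };

const mockDrift = findContractDrift(paymentEventContract, mockFixture);
const upstreamDrift = findContractDrift(paymentEventContract, renamedUpstream);

console.log(mockDrift); // no drift: the mock is still honest
console.log(upstreamDrift); // reports the renamed field as missing
```

Running the same check against both the mock fixture and the live sandbox response is what keeps the two in sync: if the mock passes and the sandbox fails, the mock has gone stale.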
Snapshot tests look efficient and produce confident-looking green checkmarks. They are also the single largest source of "we updated the snapshots without reading the diff" incidents.
When a snapshot fails, the natural reaction is `npm test -- -u`. The engineer who shipped the actual bug also shipped the snapshot that bakes the bug into "expected behavior." We've seen this destroy six months of test value in one PR.
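Use snapshots in exactly two cases:

- Opaque output where the diff is the assertion (rendered HTML email, compiled SQL).
- Small snapshots that humans will actually read on every change.
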
For everything else, write an explicit assertion. `expect(result.totalCents).toBe(1500)` is more typing than `expect(result).toMatchSnapshot()`, but it tells the next reader what the test means. Snapshot files don't. If you're working in Next.js, our Vitest setup guide for Next.js shows the assertion patterns we default to before reaching for any snapshot.
The most damaging thing about a flaky test isn't the failure. It's the slow normalization of red. Once one test flakes weekly, the team learns to re-run CI without reading the failure. The next real bug ships under cover of "probably just the flaky one."
The rule we apply everywhere: if a test fails on a PR that didn't touch the relevant code, the test goes into a quarantine tag within 24 hours. Quarantined tests still run, still report, but don't block merges. The engineer who owns the area has a week to either fix it or delete it. If neither happens, it gets deleted automatically.
This sounds harsh. It is the only thing that works at scale. Companies that don't do this have CI suites with 4% flake rates and engineers who treat green as advisory.
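The quarantine rule is simple enough to automate. The sketch below is one possible encoding, not a description of any particular CI tool: the record shape and function names are hypothetical, and it assumes the one-week fix window starts when the test is quarantined (the rule above does not say whether the clock starts at the first flake or at quarantine).

```typescript
type TestStatus = "blocking" | "quarantined" | "delete";

// Hypothetical bookkeeping record for one test.
interface QuarantineRecord {
  name: string;
  quarantinedAt: Date | null; // null: never quarantined
  fixedAt: Date | null; // null: owner has not fixed it yet
}

const HOUR_MS = 60 * 60 * 1000;
const QUARANTINE_SLA_MS = 24 * HOUR_MS; // tag the test within 24 hours of the flake
const FIX_WINDOW_MS = 7 * 24 * HOUR_MS; // owner has a week to fix or delete

// Latest moment by which a flaky test must carry the quarantine tag.
function quarantineDeadline(flakedAt: Date): Date {
  return new Date(flakedAt.getTime() + QUARANTINE_SLA_MS);
}

// Decides how CI should treat a test right now.
function triage(test: QuarantineRecord, now: Date): TestStatus {
  // Never quarantined, or quarantined and since fixed: it gates merges again.
  if (test.quarantinedAt === null || test.fixedAt !== null) return "blocking";
  const elapsed = now.getTime() - test.quarantinedAt.getTime();
  // Still runs and reports inside the window, but does not block merges.
  return elapsed > FIX_WINDOW_MS ? "delete" : "quarantined";
}

const flaky: QuarantineRecord = {
  name: "checkout retries on timeout",
  quarantinedAt: new Date("2026-01-01T00:00:00Z"),
  fixedAt: null,
};

console.log(triage(flaky, new Date("2026-01-05T00:00:00Z"))); // inside the window
console.log(triage(flaky, new Date("2026-01-09T00:00:00Z"))); // window expired
```

A nightly job that runs `triage` over every tagged test and opens a deletion PR for each `"delete"` result removes the need for anyone to remember the deadline.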
A short flake-triage checklist:
- Replace `setTimeout` waits with explicit polling on a condition.

If you have Claude Code, Cursor, or Copilot writing tests (you should), the failure mode shifts. AI is excellent at writing tests that pass. It is mediocre at writing tests that catch bugs. The two are not the same thing.
Three patterns we've seen repeatedly:
- `expect(result).toEqual(expected)` where `expected` was computed by calling the function being tested. This passes forever and tests nothing.

The fix is review discipline, not a different tool. When you accept an AI-generated test, ask: "If I delete the implementation, does this test fail for the right reason?" If you can't answer yes in five seconds, rewrite the test.
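The "delete the implementation" question can be demonstrated directly. In this sketch the discount function and its broken variants are hypothetical. The two checks are written as plain boolean functions rather than Vitest tests so the example runs without a test runner.

```typescript
// Hypothetical function under test: percentage discount on a price in cents.
type Discounter = (priceCents: number, percentOff: number) => number;

const correct: Discounter = (priceCents, percentOff) =>
  Math.round(priceCents * (1 - percentOff / 100));

// Structurally fine, semantically wrong: subtracts the percentage as if it were cents.
const buggy: Discounter = (priceCents, percentOff) => priceCents - percentOff;

// What "delete the implementation" looks like: a stub that returns nothing useful.
const deleted: Discounter = () => 0;

// Tautological check: the expected value comes from the function being
// tested, so it compares the function with itself and cannot fail.
function tautologicalCheckPasses(impl: Discounter): boolean {
  const expected = impl(2000, 25);
  return impl(2000, 25) === expected;
}

// Constraining check: the expected value is worked out by hand
// (25% off 2000 cents is 1500 cents), independent of the implementation.
function constrainingCheckPasses(impl: Discounter): boolean {
  return impl(2000, 25) === 1500;
}

for (const [name, impl] of [["correct", correct], ["buggy", buggy], ["deleted", deleted]] as const) {
  console.log(name, {
    tautological: tautologicalCheckPasses(impl),
    constraining: constrainingCheckPasses(impl),
  });
}
```

The tautological check reports a pass for all three implementations, including the stub. Only the hand-computed expectation distinguishes the correct one.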
Every engineer on Cadence is AI-native by default (Cursor, Claude Code, Copilot used daily, vetted on a voice interview before they unlock bookings). The vetting specifically probes for this. We've seen too many AI-generated suites pass review because no one checked whether the tests actually constrained the code.
A coverage target of 80% is one of the most counterproductive metrics in software. It optimizes for lines exercised, not bugs prevented. It rewards tests of trivial getters and punishes complex integration scenarios that touch fewer lines per assertion.
What to do instead:
The percentage-gate failure mode: a junior engineer needs to merge a hotfix, the global gate trips at 79.8%, they add three meaningless tests on a config file, the bar rises, the hotfix ships, no one ever revisits the meaningless tests.
The thinnest layer of the trophy is also the most controversial. E2E tests are slow, occasionally flaky, and the temptation is to skip them entirely. Don't.
The rule: E2E covers your top three to five revenue-critical user journeys, runs on every deploy to staging, and uses data that resembles production. "Real data" doesn't mean "actual customer data" (that's a compliance bomb). It means an anonymized snapshot, a synthetic generator (Snaplet, Drizzle Seed) seeded from realistic distributions, or a long-lived staging tenant with curated fixtures.
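As an illustration of the synthetic-generator option, here is a hand-rolled seeded generator. It is a stand-in for the idea, not the API of Snaplet or Drizzle Seed, and the tenant shape, plan mix, and seat ranges are assumptions made up for the example. The property that matters is determinism: the same seed always produces the same staging data, so an E2E failure can be reproduced.

```typescript
type Plan = "free" | "pro" | "enterprise";

interface SyntheticTenant {
  id: string;
  ownerEmail: string;
  plan: Plan;
  seats: number;
}

// Small deterministic PRNG (mulberry32-style): same seed, same sequence.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed plan mix; replace with the distribution observed in production.
const PLAN_WEIGHTS: Array<[Plan, number]> = [
  ["free", 0.7],
  ["pro", 0.25],
  ["enterprise", 0.05],
];

function pickPlan(rand: () => number): Plan {
  let roll = rand();
  for (const [plan, weight] of PLAN_WEIGHTS) {
    if (roll < weight) return plan;
    roll -= weight;
  }
  return "enterprise"; // guards against floating-point leftovers
}

function generateTenants(count: number, seed: number): SyntheticTenant[] {
  const rand = seededRandom(seed);
  const tenants: SyntheticTenant[] = [];
  for (let i = 0; i < count; i++) {
    const plan = pickPlan(rand);
    // Illustrative upper bounds on seat counts per plan.
    const maxSeats = plan === "free" ? 3 : plan === "pro" ? 25 : 500;
    tenants.push({
      id: `tenant_${i + 1}`,
      ownerEmail: `owner${i + 1}@example.test`, // reserved test TLD, never a real address
      plan,
      seats: 1 + Math.floor(rand() * maxSeats),
    });
  }
  return tenants;
}

const stagingTenants = generateTenants(50, 2026);
console.log(stagingTenants.slice(0, 3));
```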
Playwright is the default in 2026. Cypress is fine if you already have it. Both produce traces and video on failure, which collapses debug time from hours to minutes. A typical Cadence-built E2E suite covers: signup, paywall, the core "aha moment" workflow, and one billing path. Four tests, 8 minutes, 95% of the value. For the deeper setup, see our E2E testing for SaaS guide.
If you read this and your current suite is 90% unit tests with mocked everything, here's the order that recovers the most value fastest:
`billing/`, `auth/`, and anything that writes to the database.

If you're a solo founder or a two-person team pre-revenue, you can skip steps 2 and 4 until you have real customers. Best practices have ROI curves and you're not on the curve yet. Ship the product, then come back.
If you're past the point where you can fit the whole suite in your head, this is exactly the work a Cadence senior engineer ($1,500/week) tends to own. Two weeks of dedicated test-suite work usually pays for itself the first time a billing edge case doesn't reach production. You can audit your current stack with our ship-or-skip tool to get an honest grade on what to prioritize first.
A two-person team pre-revenue does not need contract tests against a sandbox Stripe account. They need to ship the product to someone who will pay. Best practices apply when their cost is lower than the bugs they prevent.
The line: once you have paying customers, once a regression would cost more than a day of engineering time, once two engineers can both deploy to production, you're past the skip threshold. Until then, write the integration test for the billing flow and call it done.
If you're staring at a flaky CI pipeline, a coverage number that lies, and a roadmap that won't wait, a Cadence senior engineer can take it on for a week at $1,500 with a 48-hour free trial. The first commit usually lands within 27 hours of booking. If they don't move the suite forward, you don't pay.
Production-grade tests are evaluated by whether they catch real bugs before customers do, not by what percentage of lines they touch. A suite with 95% coverage and no contract tests is less production-grade than a suite with 60% coverage that catches every Stripe API change overnight.
Write enough integration tests to cover every external boundary (databases, APIs, queues, webhooks) at least once on the happy path and once on the most common failure mode. For a typical SaaS, that's 20 to 60 integration tests. Below 10 and you're probably under-covered; above 200 and you're probably testing the same boundaries redundantly.
Using AI to write tests is fine, but treat AI-generated tests with more scrutiny than AI-generated code. Ask whether the test would fail if the implementation were deleted. If the answer is "no" or "I'm not sure," rewrite the test. The bias of generative models is toward tests that pass, not tests that constrain behavior.
Snapshot tests fit two cases: opaque output where the diff is the assertion (rendered HTML email, compiled SQL), and small snapshots that humans will actually read on every change. For everything else, write explicit assertions; future-you will be grateful.
Aim for under 8 minutes for unit and integration, under 15 including E2E on a typical PR. Past those numbers, engineers start avoiding CI, splitting work into smaller PRs to dodge the runtime, or worst case, marking tests as .skip to unblock merges. Speed is a feature of the suite.
E2E tests don't need to run in full on every PR. Run full E2E on merges to main and deploys to staging. On PRs, run the subset of E2E tests touching code paths the PR modifies. The savings on PR time are large and the regression risk is small.
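One way to pick that PR subset is a hand-maintained map from each E2E spec to the source directories it exercises. The sketch below assumes such a map. The spec names and `src/` paths are hypothetical, and a real setup would feed `changedFiles` from the diff against the base branch.

```typescript
// Hypothetical map: E2E spec file to the source path prefixes it exercises.
const E2E_COVERAGE: Record<string, string[]> = {
  "e2e/signup.spec.ts": ["src/auth/", "src/onboarding/"],
  "e2e/paywall.spec.ts": ["src/billing/", "src/entitlements/"],
  "e2e/core-workflow.spec.ts": ["src/workflow/"],
  "e2e/billing.spec.ts": ["src/billing/"],
};

// runAll is true for merges to main and staging deploys. On PRs, only specs
// whose covered paths overlap the changed files are selected.
function selectE2eSpecs(changedFiles: string[], runAll: boolean): string[] {
  const entries = Object.entries(E2E_COVERAGE);
  if (runAll) return entries.map(([spec]) => spec);
  return entries
    .filter(([, prefixes]) =>
      prefixes.some((prefix) => changedFiles.some((file) => file.startsWith(prefix)))
    )
    .map(([spec]) => spec);
}

console.log(selectE2eSpecs(["src/billing/invoice.ts"], false)); // the two billing-related specs
console.log(selectE2eSpecs(["README.md"], false)); // nothing to run
console.log(selectE2eSpecs([], true).length); // full suite on main
```

The map is the weak point of this approach: a spec that exercises a directory missing from its entry will silently be skipped on PRs, which is why the full run on main and staging stays mandatory.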