How to set up E2E testing for a SaaS

To set up E2E testing for a SaaS in 2026, pick Playwright, write tests for five critical flows (signup, onboarding, billing, core happy path, account deletion), seed a fresh tenant per CI run, save auth state once with storageState, shard across 4-8 workers, and retry flaky tests once before quarantining them. Five reliable tests gating every PR catch about 80% of the bugs that used to escape to production.

The rest of this post is the playbook. Real code, real tradeoffs, and an honest section on when to skip E2E entirely.

Why E2E testing is different for a SaaS

Generic E2E advice (record a happy path, run it on every commit) was written for content sites and marketing pages. A SaaS has three things those products do not.

First, multi-tenancy. Tests cannot pollute each other's data, and "the same email signing up twice" is a real bug class you have to test. Second, billing. A broken upgrade flow is invisible until a customer hits it, and by then you have already paid the CAC. Third, async webhooks. Stripe sends checkout.session.completed 200ms to 4 seconds after the redirect. Your test has to wait for it or it asserts against an empty subscription row.

For a SaaS spending $5k/month on ads, a broken signup page wastes about $7 of ad spend per hour, or roughly $165 per day, and at a 3% trial-start rate every trial you lose has already been paid for. E2E exists to make sure that day never happens.

Pick the tool first: Playwright, Cypress, or raw Puppeteer

In 2026 there are three real choices, and one is the default.

Tool	Best for	Cost	Cross-browser	Verdict
Playwright	New SaaS in 2026	Free	Chromium + Firefox + WebKit	Default
Cypress	Existing Cypress suites	Free; Cypress Cloud $75-300/mo	Chromium + Firefox; WebKit experimental (slower)	Migrate gradually
Puppeteer	Headless scraping	Free	Chromium only	Wrong tool for user-flow E2E

Playwright is the answer for any new SaaS this year. The State of JS 2024 survey showed it overtaking Cypress in both satisfaction and usage growth. The reasons: free parallel execution via --shard, three real browser engines bundled in, multi-language bindings, and an auto-waiting expect API that eliminates most timing-related flake.

Cypress is still good software. If you already have a 200-test Cypress suite, do not throw it away. Migrate new flows to Playwright and let the Cypress suite shrink by attrition. The one place Cypress still wins is the interactive test runner, a better debugging experience than Playwright's UI mode, though the gap is closing.

Puppeteer is for headless Chromium scraping and PDF generation. Not a user-flow E2E framework. Skip.

For deeper coverage of the Playwright API itself, see our Playwright E2E testing deep dive. This post stays at the SaaS-application level.

The five SaaS flows worth testing first

You do not need 200 tests on day one. You need five, and they all have to actually work. Twenty reliable tests beat 200 flaky ones every time.

1. Signup with email verification. Sign up with a fresh email, poll the inbox for the verification link, click it, land on the post-verification page. This is the single most valuable test in the suite because it gates revenue.
2. Onboarding to first value. Whatever your product's "aha moment" is (creating the first project, sending the first message, importing the first file), test that the user can reach it from a fresh account in under 60 seconds.
3. Billing upgrade with Stripe test mode. Click upgrade, enter card 4242 4242 4242 4242, complete checkout, wait for the webhook, assert the subscription row flipped to active and the feature gate unlocked.
4. Core feature happy path. Whatever the product actually does. For a CRM it is "create a contact, log an interaction, see it in the timeline." For a project tool it is "create a task, assign it, mark it done."
5. Account deletion. Delete the account, assert the row is soft-deleted (or hard-deleted if you have already implemented GDPR data deletion properly), assert the user is logged out, assert the email can be reused for a new signup.

That is the floor. Everything else (admin flows, multi-user collaboration, edge cases on permissions) gets added one test per sprint as the product surfaces real bugs.

Data seeding: fresh tenant, shared staging, or production mock

How you give a test its starting state is the second-biggest decision after tool choice. Three patterns, three real tradeoffs.

Fresh tenant per CI run is the default and what you should reach for first. Before each test (or each worker), hit an API endpoint or a seed script that provisions a clean tenant: a new org, a new admin user, a known set of feature flags. After the test, soft-delete it. This costs 2-5 seconds of setup per test but gives you full isolation, which is what makes parallelism safe. Pair this pattern with a robust multi-tenancy schema so that creating and tearing down tenants is a database operation, not a re-deploy.
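A minimal sketch of the provision-then-soft-delete cycle described above. The `TenantSeed` shape, the flag names, and the `api` object are all hypothetical stand-ins for whatever seed endpoint or script your backend exposes; the only point is that names are unique per run and per worker, and that teardown runs even when the test throws.

```typescript
// Hypothetical seed shape; a real provisioning API will have its own fields.
interface TenantSeed {
  orgSlug: string;
  adminEmail: string;
  featureFlags: Record<string, boolean>;
}

// Build a seed that is unique per CI run and per worker, so parallel
// workers never collide on org slugs or unique-email constraints.
function buildTenantSeed(runId: string, workerIndex: number): TenantSeed {
  const suffix = `${runId}-w${workerIndex}`
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, "-");
  return {
    orgSlug: `e2e-${suffix}`,
    adminEmail: `test+${suffix}@example.com`,
    // Illustrative flag names only.
    featureFlags: { billing: true, betaDashboard: false },
  };
}

// Hypothetical client for the seed endpoint.
interface TenantApi {
  create(seed: TenantSeed): Promise<string>; // resolves to the tenant id
  softDelete(tenantId: string): Promise<void>;
}

// Provision before the test body and always soft-delete afterwards,
// including when the body throws.
async function withFreshTenant<T>(
  seed: TenantSeed,
  api: TenantApi,
  run: (tenantId: string) => Promise<T>,
): Promise<T> {
  const tenantId = await api.create(seed);
  try {
    return await run(tenantId);
  } finally {
    await api.softDelete(tenantId);
  }
}
```

In a Playwright suite this would typically live inside a fixture, with the CI run id read from an environment variable.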

Shared staging environment is what most teams accidentally end up with. It works for the first three tests and breaks the moment you parallelize. Two workers signing up "test+ci@example.com" simultaneously hit your unique-email constraint and one fails for reasons unrelated to the actual code under test. Avoid.

Production mock with msw + Stripe test mode runs the entire stack in-process: real React app, mocked API responses, real Stripe test webhooks. Fastest to run (no backend round-trips), hardest to keep in sync with the real backend. Use this for the happy-path PR gate and run a smaller fresh-tenant suite on merge.

For most teams: fresh tenant for the regression suite, msw + Stripe test mode for the PR smoke suite. That gives you sub-5-minute PR gates and full-fidelity regression on main.

Authentication: save state once, replay everywhere

Logging in through the UI in every test is the most common cause of slow E2E suites. Do it once and replay the cookie.

Playwright's storageState pattern: a global setup file logs in via the API (or once via the UI), saves the cookies and localStorage to a JSON file, and every other test loads that file as its starting context.

// global-setup.ts
// Logs in once through the UI and saves cookies + localStorage for reuse.
import { chromium, type FullConfig } from '@playwright/test';

export default async function globalSetup(_config: FullConfig) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(`${process.env.BASE_URL}/login`);
  await page.fill('[data-test="email"]', process.env.TEST_USER_EMAIL!);
  await page.fill('[data-test="password"]', process.env.TEST_USER_PASSWORD!);
  await page.click('[data-test="login-submit"]');
  // Wait for the post-login redirect so the session is set before saving.
  await page.waitForURL('**/dashboard');

  // Every test that points storageState at this file starts logged in.
  await page.context().storageState({ path: 'auth/admin.json' });
  await browser.close();
}

Then in playwright.config.ts:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: require.resolve('./global-setup'),
  use: { storageState: 'auth/admin.json' },
  fullyParallel: true,
  retries: process.env.CI ? 1 : 0,
  workers: process.env.CI ? 4 : undefined,
  reporter: [['html'], ['github']],
});

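For tests that mutate user state (changing the password, deleting the account), use per-worker accounts: assign worker 0 to test+w0@example.com, worker 1 to test+w1@example.com, and so on. Playwright exposes testInfo.parallelIndex for exactly this; it stays between 0 and workers - 1, whereas testInfo.workerIndex keeps counting up each time a worker process restarts after a failure and can run past a fixed account pool. Test the actual UI login flow exactly once, in a dedicated test file that does not load storageState.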
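The per-worker mapping can be sketched as a small pure function. It takes the slot as a plain number, so it works with whichever index you read from testInfo; the range check matters because an index that can grow past the pool size (as workerIndex can after a worker restart) would otherwise point at an account that does not exist. The `auth/worker-N.json` path convention and the function name are assumptions for illustration.

```typescript
interface WorkerAccount {
  email: string;
  storageStatePath: string;
}

// Map a parallel slot (0-based) onto a pre-provisioned account.
// poolSize is the number of accounts that actually exist.
function accountForWorker(slot: number, poolSize: number): WorkerAccount {
  if (!Number.isInteger(slot) || slot < 0 || slot >= poolSize) {
    throw new RangeError(
      `slot ${slot} is outside the account pool (size ${poolSize})`,
    );
  }
  return {
    email: `test+w${slot}@example.com`,
    // Hypothetical naming scheme: one saved auth state per slot.
    storageStatePath: `auth/worker-${slot}.json`,
  };
}
```

Inside a test or fixture, the slot would come from `test.info().parallelIndex`, with `poolSize` matching the `workers` setting in the config.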

If your auth provider supports it, programmatic login via API is faster and more stable than UI login. We covered the broader picture in implementing authentication in 2026; if you are on Clerk, Supabase Auth, or Auth.js, each offers a server-side route to creating a session without the UI, though the mechanism differs by provider.

Parallelism: shard to 4-8 workers

A 200-test suite running serially is a 25-minute CI job. The same suite sharded across 4 parallel CI jobs runs in about 7 minutes; across 8, in under 4. This is the single largest CI speedup available, and Playwright supports it natively with --shard.
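The arithmetic behind those figures is a simple division plus a fixed per-job cost. The sketch below assumes tests split evenly across shards and that each job pays about 45 seconds of overhead (checkout, install, browser launch); that overhead value is an assumption chosen to match the numbers above, not a measurement.

```typescript
// Estimate wall-clock minutes for a sharded run.
// serialMinutes: pure test time when run on one machine.
// overheadMinutes: assumed fixed per-job setup cost.
function estimateShardedMinutes(
  serialMinutes: number,
  shards: number,
  overheadMinutes = 0.75,
): number {
  if (!Number.isInteger(shards) || shards < 1) {
    throw new RangeError("shards must be a positive integer");
  }
  return serialMinutes / shards + overheadMinutes;
}

console.log(estimateShardedMinutes(25, 4)); // 7
console.log(estimateShardedMinutes(25, 8)); // 3.875
```

The fixed overhead is why the returns diminish: going from 8 to 16 shards saves about a minute and a half of test time but doubles the number of jobs paying setup cost.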

# .github/workflows/e2e.yml
name: e2e
on:
  pull_request:
  push:
    branches: [main]
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        # Index and total are separate values because artifact names cannot contain "/".
        shardIndex: [1, 2, 3, 4]
        shardTotal: [4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
        env:
          BASE_URL: ${{ secrets.STAGING_URL }}
          TEST_USER_EMAIL: ${{ secrets.TEST_USER_EMAIL }}
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report-${{ matrix.shardIndex }}
          path: playwright-report/
          retention-days: 7

GitHub Actions bills each job separately and rounds up to the minute, so four shards at about 7 minutes each cost roughly 28 minutes per run. On the free tier for private repositories (2,000 minutes/month) that covers about 70 PR runs per month; public repositories on standard runners are not metered. If you outgrow that, Currents.dev orchestrates Playwright (and Cypress) test runs across machines with a unified dashboard, costing roughly $75-200/month depending on parallelism. It also stitches the per-shard HTML reports into one view, which is worth the price the first time you debug a flake at 2 AM.

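For multi-browser coverage, add browser: [chromium, firefox, webkit] to the matrix, pass it to the test command as --project=${{ matrix.browser }} (this assumes projects with those names in playwright.config.ts), and install the matching browser in the install step. In practice, run all three on main and Chromium-only on PR. Cross-browser bugs are real but rare, and tripling your PR runtime to catch one a quarter is a bad trade.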

Flaky test management: retry once, then quarantine

Every E2E suite gets flaky. The question is whether you have a process for it.

The rule we use: retries: 1 in CI. A test that fails once and passes on retry is flaky; a test that fails both attempts is genuinely broken. The single retry absorbs transient network blips and Stripe webhook lag. Allowing more than one retry starts to hide real bugs.
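Written out as code, the rule is a three-way classification over the attempts of a single test. This is only an illustration of the decision logic; the `Verdict` type and function name are made up here, and Playwright's own reporters already label a test that passes on retry as "flaky".

```typescript
type Verdict = "passed" | "flaky" | "broken";

// attempts[i] is true when attempt i passed.
// With retries: 1 there are at most two entries.
function classify(attempts: boolean[]): Verdict {
  if (attempts.length === 0) {
    throw new Error("no attempts recorded");
  }
  if (attempts[0]) return "passed";
  // First attempt failed: any later pass means flaky, otherwise broken.
  return attempts.slice(1).some(Boolean) ? "flaky" : "broken";
}

console.log(classify([true])); // passed
console.log(classify([false, true])); // flaky
console.log(classify([false, false])); // broken
```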

When a test flakes, mark it test.fixme() and open a ticket. Playwright does not run a fixme test, so the failure stops blocking merges, but the test still appears as skipped in the report, so it cannot quietly disappear. The ticket has a one-week deadline: fix the flake or delete the test. No third option. Tests that limp along for three months erode trust in the entire suite, and a suite people do not trust gets ignored.

The three root causes of flake, in order:

1. Sleep statements. Replace every page.waitForTimeout(2000) with await expect(page.locator('...')).toBeVisible(). Auto-waiting is the single best feature Playwright shipped.
2. Test data contention. Two tests sharing a tenant is the second cause. Per-worker accounts fix it.
3. Third-party webhooks. Stripe, Auth0, Mailgun. Wrap these in a waitForWebhook helper that polls the database for the expected state with a 30-second timeout.
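One way the waitForWebhook helper mentioned above could look. It is a generic poller: the `check` callback is where your own database query goes (for example, looking up the subscription row and returning it only once its status is active). The 500 ms interval is an assumed default; the 30-second timeout comes from the text.

```typescript
interface PollOptions {
  timeoutMs?: number;
  intervalMs?: number;
}

// Poll `check` until it returns something other than null/undefined,
// or fail once the timeout would be exceeded.
async function waitForWebhook<T>(
  check: () => Promise<T | null | undefined>,
  { timeoutMs = 30_000, intervalMs = 500 }: PollOptions = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await check();
    if (result !== null && result !== undefined) return result;
    // Stop early if the next sleep would run past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`webhook state not observed within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Playwright also ships `expect.poll` and `expect(...).toPass()`, which cover the same need inside a test body; a standalone helper like this is mainly useful when the same wait is shared with seed scripts or fixtures.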

For selectors, use data-test attributes in the application markup, not CSS classes. Classes change for styling reasons; data-test attributes change only when someone deliberately edits the test contract.

CI integration: the GitHub Actions file

The matrix file above is the production version. Two more details worth pinning down.

Video on failure, screenshots, traces. In playwright.config.ts:

use: {
  trace: 'on-first-retry',
  screenshot: 'only-on-failure',
  video: 'retain-on-failure',
}

Trace files are gold. Open one in npx playwright show-trace and you get a frame-by-frame replay of the failed test with network logs, console output, and the DOM snapshot at every action. Most flake debugs that used to take an hour now take five minutes.

Report as artifact. Upload playwright-report/ on failure (the matrix YAML above does this). Reviewers click through to a static HTML report from the PR check, which is faster than rerunning locally to reproduce.

Caching the browser binaries. Playwright bundles Chromium, Firefox, and WebKit, which is about 300MB. Cache them between runs:

- uses: actions/cache@v4
  with:
    path: ~/.cache/ms-playwright
    key: playwright-${{ hashFiles('package-lock.json') }}

This shaves 60-90 seconds off every CI run.

If you are still figuring out the bigger CI/CD picture, our CI/CD pipeline for startups post covers PR gates, deploy previews, and the right place E2E sits in the chain.

Reporting: HTML report + Currents.dev for teams

Solo project: Playwright's built-in HTML reporter is enough. Run npx playwright show-report and you get a searchable per-test view with traces, screenshots, and timing.

Team of 3+: install Currents.dev. The killer feature is parallel-run aggregation, which the standalone HTML reporter does not handle well across sharded jobs. You also get historical pass/fail rates per test, which makes flake quarantine objective ("this test failed 11 of the last 100 runs, into the freezer it goes") instead of vibes-based.
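The "11 of the last 100 runs" rule is easy to make explicit once you have per-test history, wherever it comes from. A minimal sketch, assuming the history is an array of pass/fail booleans in chronological order; the 5% threshold is an arbitrary assumption to tune for your team.

```typescript
// Fraction of failed runs within the most recent `window` results.
// results[i] is true when run i passed; oldest first.
function flakeRate(results: boolean[], window = 100): number {
  const recent = results.slice(-window);
  if (recent.length === 0) return 0;
  const failures = recent.filter((passed) => !passed).length;
  return failures / recent.length;
}

// Quarantine when the recent failure rate reaches the threshold.
function shouldQuarantine(
  results: boolean[],
  threshold = 0.05,
  window = 100,
): boolean {
  return flakeRate(results, window) >= threshold;
}
```

With 89 passes and 11 failures in the window, the rate is 0.11 and the test goes into the freezer under the assumed threshold.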

Slack integration: post on failure only. Teams that post on every pass quickly learn to ignore the channel. We use a GitHub Actions step that pings #engineering only when the matrix job fails on main.

When you can skip E2E entirely

E2E has an ROI curve. It bends sharply down for these cases:

Pre-revenue, two founders, one weekend deploy cycle. Your unit and integration tests already cover the surface area. E2E adds 3 days of setup that you would spend better shipping the next feature.
Single-page calculator or static site. Lighthouse CI and Vitest snapshots cover what matters.
Internal admin tools used by 3 employees. A 5-minute manual smoke test on Friday is cheaper than the first quarantined test.

If your unit and integration tests already gate signup, payment, and the core action, you are 80% of the way to E2E's protection. We covered the broader testing-tool tradeoff in Jest vs Vitest 2026, and if you want a single number for how much your test suite actually catches, see code coverage in 2026. A solid unit suite beats an absent E2E suite every time.

The Cadence connection

E2E rollouts are a typical Senior tier ($1,500/week) project on Cadence. The work is well-scoped, has a clear definition of done (5 flows green in CI, sub-10-minute runtime, retry+quarantine workflow documented), and benefits from an engineer who has done it before and will not relitigate Playwright vs Cypress for the third time.

Every engineer on Cadence is AI-native by baseline, vetted on Cursor and Claude Code fluency in the voice interview before they unlock bookings. Out of our 12,800-engineer pool, the median time to first commit on a new booking is 27 hours. For a contained piece of work like an E2E pipeline, that means tests in CI by the end of week one.

If you want to know whether your current testing setup is worth keeping, audit your stack with Ship or Skip. It will give you an honest grade on what to keep, what to replace, and what to delete.

Steps

1. Pick the tool. Playwright for any new SaaS in 2026. Cypress only if you are migrating an existing suite. Install with npm init playwright@latest.
2. Write your first test. Cover the signup flow end-to-end against a fresh tenant. Use data-test attributes, not CSS classes, for selectors.
3. Set up CI integration. Add the GitHub Actions matrix file above. Shard to 4 workers, cache the browser binaries, upload the HTML report on failure.
4. Manage flaky tests. Set retries: 1 in CI. When a test flakes, mark it test.fixme() and open a one-week ticket: fix or delete.
5. Add reporting. Start with the built-in HTML reporter. Add Currents.dev once you outgrow per-shard reports (typically around test #50 or worker #4).

Most SaaS testing rollouts stall not on tool choice but on the second engineer to touch them. If your team is one founder and one contractor, book a senior engineer on Cadence for a week. The 48-hour free trial covers writing the first three flows; if it is not in CI by Friday, you do not pay.

FAQ

How long does it take to set up E2E testing for a SaaS?

A senior engineer can ship the first 5 working tests against a fresh tenant in about 3-5 days. Hardening for CI parallelism, flake quarantine, and per-worker auth state adds another week. Most teams reach a stable, trusted suite in two sprints.

How many E2E tests should a SaaS start with?

Five. One per critical flow: signup, onboarding, billing upgrade, core feature happy path, and account deletion. Twenty reliable tests beat 200 flaky ones every time. Add one test per sprint as new features ship, not as a separate testing initiative.

Should I use Playwright or Cypress in 2026?

Playwright if you are starting fresh. Cypress if you have an existing Cypress suite worth keeping and migration cost is real. Playwright leads the State of JS 2024 satisfaction and usage scores, has Microsoft backing, ships free parallel sharding, and bundles three browser engines without paid add-ons.

How do I test Stripe billing in E2E?

Use Stripe test mode with the card number 4242 4242 4242 4242. Click your real upgrade button, complete the Stripe-hosted checkout, then poll your database (or a webhook-receipt log) for the checkout.session.completed event. Assert that the subscription row flipped to active and the feature gate unlocked in the UI.

How do I stop my E2E tests from being flaky?

Three rules. Never use waitForTimeout; use Playwright's auto-waiting expect (toBeVisible, toHaveText) instead. Isolate test data per worker with per-worker accounts or fresh tenants. Allow one retry in CI; if a test fails twice, quarantine it with test.fixme() and fix or delete it within a sprint. Tests that limp along for months erode trust in the whole suite.

Harsh Shuddhalwar

Fullstack Developer

Fullstack developer at withRemote. Ships across the stack — TypeScript, Node, Postgres, Vercel. Writes on shipping speed and pragmatic architecture.
