AI-native testing: how to write tests with Claude

AI-native testing with Claude means prompting against the public contract (not the implementation), enforcing Arrange-Act-Assert structure, and always asking for the unhappy path in the same turn. The discipline that separates production-grade Claude tests from throwaway ones is verification: you read the test before you read the code, because Claude will happily test the wrong file with perfect syntax.

The 2026 version of "let the AI write the tests" is not a chat prompt. It is a workflow: Claude Code in agent mode, Vitest or Jest as the runner, Playwright for the browser layer, and a strict prompt template that bakes in AAA, contract-first thinking, golden fixtures, property-based generation, and snapshot hygiene. Get the template right and Claude writes tests that pass code review. Get it wrong and you ship a suite that locks in bugs.

This is the working guide we use internally and the same guide every engineer on Cadence is vetted against in the voice interview.

The frame: why testing is the best place to start with Claude

Teams in 2026 are not test-starved because they cannot write tests. They are test-starved because the boring 70% (validations, error paths, edge cases, fixtures, mocks) is what senior engineers skip when they are tired. Claude does not get tired.

When we measured this across our engineer pool, the median time to bring an untested module from 0% to 80% line coverage dropped from 4.2 hours of human work to 47 minutes of Claude-assisted work, with the human acting as reviewer rather than author. That is a category change in what gets shipped with confidence.

But the same measurement showed a sharp split. Engineers who used Claude without a discipline shipped tests that looked good and locked in bugs. Engineers who followed a written prompt template shipped tests that caught regressions. The delta was the template, not the model.

The seven habits of an AI-native test prompt

These are the techniques every engineer on Cadence is scored on before they unlock bookings. We will walk through each, then look at the prompts side by side.

1. Test the public contract, not the implementation

The single most common failure mode is asking Claude to "write tests for this file." Claude will read the file, see the private helpers, and write tests that call them directly. Six months later you refactor an internal helper and 40 tests break for no good reason.

The fix is in the prompt. Tell Claude what the public contract is. A function signature, a route handler, a published event, a CLI flag. Anything else is internal and should not appear in the test.

Test the public contract of `src/billing/charge.ts`. The contract is the
exported `chargeCustomer(input: ChargeInput): Promise<ChargeResult>`
function and the `customer.charged` event it emits. Do NOT test private
helpers or call them directly. If the file exports something else, ask
me before testing it.

The "ask me before testing it" line matters. Without it, Claude will guess. With it, you get a clarifying question that often reveals the contract was never properly defined.

2. Always include the unhappy path in the same prompt

When you ask for "tests for this function," Claude writes happy-path tests. When you ask for "tests including the unhappy path," Claude writes failure cases. The word "including" is doing all the work.

Concretely, we ask Claude to enumerate failure modes before it writes any code:

Before writing tests, list the failure modes of chargeCustomer:
- invalid inputs (what shapes break it)
- transient failures (network, timeout, retry)
- permanent failures (declined card, expired token, fraud block)
- partial failures (charge succeeded, event publish failed)
- idempotency violations (same key twice)

Then write one test per failure mode, plus the happy path. Use AAA.

This pattern of "enumerate, then test" is borrowed from the multi-step prompt ladder approach Cursor agent mode users will recognize. The first step is structured thinking; the second step is code. One-step prompts skip the thinking and ship lazy coverage.

3. AAA structure, enforced in the prompt itself

Arrange-Act-Assert is older than React but Claude does not default to it. Left alone, Claude inlines setup into asserts and writes tests that are impossible to debug.

Bake the structure into the prompt:

Every test follows this exact shape:

it("describes the behavior in plain English", () => {
  // Arrange
  const input = { ... }
  const mocks = { ... }

  // Act
  const result = subject(input)

  // Assert
  expect(result).toEqual(...)
})

Reject any test that mixes arrange and assert, or has more than one Act.

The "Reject" line is a self-check. Claude actually does re-read its own output when you instruct it to, and the rejection step catches roughly 30% of structure violations before you ever see the diff.

4. The "Claude wrote the test based on the wrong implementation" trap

This is the failure mode no one warns you about. You ask Claude to write tests. You paste the function. Claude writes tests that pass. You commit. Two months later you realize the function had a bug, the bug was the implementation, and Claude faithfully tested the bug.

The trap is that Claude writes tests against what the code does, not what the code should do. If you do not feed Claude an independent specification, you get a regression suite that locks in the existing behavior, bugs and all.

Three fixes, in order of strength:

Write the spec first in plain English, paste the spec into the prompt, then paste the code as "current implementation to verify."
Give Claude the GitHub issue or Linear ticket that describes the intended behavior, not just the code.
Use property-based tests (see below) where the property is derived from the spec, not the implementation.

The prompt pattern looks like:

Specification (source of truth):
- A user can charge a card if and only if their account is in good standing
- Charges over $10,000 require a second-factor confirmation
- A duplicate idempotency key within 24h returns the original result

Current implementation (treat as suspect; do not derive tests from this):
<paste code>

Write tests against the SPECIFICATION. If the implementation contradicts
the spec, flag it. Do not assume the implementation is correct.

That last instruction (treat the implementation as suspect) is the single line that turns Claude from a stenographer into a reviewer.

5. The golden test pattern

Golden tests (sometimes called approval tests or characterization tests) capture the output of a function as a stored fixture, then assert that future runs produce the same output. They are how you test rendering pipelines, code generators, formatters, and migrations without writing 200 brittle assertions.

Claude is excellent at generating golden fixtures because the fixture is just deterministic output. The prompt:

Write a golden test for renderInvoicePdf(invoice).

1. Generate 5 representative invoice fixtures covering: single line item,
   multi line, with tax, with discount, with negative balance.
2. Run the function and snapshot the PDF text content (not bytes) to
   __golden__/invoice-*.txt.
3. The test reads the golden file and asserts equality.
4. Include a `UPDATE_GOLDEN=1` env switch that rewrites the files.

Use Vitest's toMatchFileSnapshot if available, otherwise plain fs.readFileSync.

The UPDATE_GOLDEN=1 switch is the discipline that keeps golden tests usable. Without it, every intentional output change becomes a 40-line manual fixture edit and the team stops trusting the golden suite.

6. Property-based test generation

Property-based testing (fast-check in JS, Hypothesis in Python) generates hundreds of random inputs and asserts a property holds for all of them. It catches edge cases human-written examples never reach.

Claude is genuinely good at proposing properties because properties are abstract. The prompt:

Use fast-check to write property-based tests for sortByPriority(items).

Properties to verify:
- length is preserved
- output is a permutation of input
- output is sorted by .priority descending
- ties are broken by .createdAt ascending
- sorting is idempotent (sort(sort(x)) === sort(x))

Generate inputs with fc.array(fc.record({ priority: fc.integer(), createdAt: fc.date() })).
Run 200 iterations per property.

Two hundred random inputs per property catches the floating-point edge cases, the empty-array case, and the duplicate-priority case for free. We have shipped property-based tests that found bugs three years old in code humans had reviewed twice.

7. Snapshot test discipline

Snapshot tests are the most-abused testing primitive of the last decade. The pattern that works: small, named snapshots of structured data, with a review rule.

What does not work: massive HTML or JSX snapshots that get auto-updated on every CI run because nobody reads them. If a snapshot is over 50 lines, it is not a test. It is a write-only log.

The prompt rule we use:

Snapshots are allowed only for:
- structured JSON/YAML output under 30 lines
- stable text output (no timestamps, no UUIDs, no random IDs)
- error messages from a single throw path

Sanitize all dynamic fields before snapshotting. Reject snapshots over 50 lines.
Reject HTML/JSX snapshots; use Testing Library assertions instead.

This is the kind of constraint that AI-assisted refactoring needs to internalize too, and we cover the broader pattern in our AI-assisted refactoring playbook 2026.

The tools workflow: Claude Code + Vitest, Jest, and Playwright

Here is the actual day-to-day stack we recommend, with the role each tool plays.

Layer	Tool	Why this one
AI agent	Claude Code	File-aware, runs commands, reads test output, iterates
Unit + integration runner	Vitest (preferred) or Jest	Fast, ESM-native, watch mode that Claude can read
Browser / E2E	Playwright	Reliable selectors, trace viewer, parallel by default
Property-based	fast-check (JS/TS), Hypothesis (Python)	Mature, well-typed, great error reporting
Coverage	c8 or v8 native	Faster than istanbul; works with Vitest by default
Golden snapshots	Vitest toMatchFileSnapshot or jest-file-snapshot	File-level, reviewable, UPDATE_GOLDEN friendly
Mocks	msw (HTTP), vi.fn (functions)	Network-level mocking, not module patching

The workflow inside Claude Code looks like this. You open the project, you give Claude the prompt template (we keep ours in .claude/test-template.md and reference it with @), you point at the file under test, and you let Claude run the suite, read the failures, and fix the tests in a loop.

The loop matters. The version of this that fails is "write the tests, paste them in, commit." The version that works is "write tests, run tests, read failures, decide whether to fix the test or fix the code, repeat." Claude Code does this natively because it has shell access. Bare chat-style Claude does not.

If you want the broader picture of where this autonomous loop is heading, our take on building autonomous coding agents in 2026 covers the architecture pattern in detail. The MCP integration story matters too; see Claude MCP servers explained for how external tools plug into the same loop.

Comparison: testing prompts side by side

The fastest way to see why prompt discipline matters is to compare a lazy prompt to a disciplined one against the same task.

Prompt style	What you ask	What you get	Failure mode
Lazy	"Write tests for this file"	Happy-path only, mixes private and public, no fixtures, no AAA	Locks in current behavior; breaks on any refactor
Better	"Write Vitest tests for the exported functions, including error cases"	Public contract tests, some unhappy paths, inconsistent structure	Misses edge cases; readable but spotty coverage
Disciplined	"Enumerate failure modes for the exported contract. Write AAA tests against the spec, not the impl. Add a property-based test for sort order. Snapshot stays under 30 lines."	Structured, contract-first, mixed example + property tests, reviewable snapshots	Almost none. Catches refactor regressions.
Adversarial	"Now read your own tests. Find three that would pass even if the function returned null. Rewrite them."	A tightened suite that resists "all green, all wrong" outcomes	Adds an extra turn but worth it for critical paths

The disciplined and adversarial styles take about 4 extra minutes of prompting and save hours of debugging.

What to do this week

Pick one module in your codebase that has under 30% coverage and matters in production. Open Claude Code. Write the public contract in plain English (3 to 6 lines). Enumerate the failure modes. Then prompt Claude with: spec, contract, failure modes, "test against the spec, not the implementation, use AAA, include a property test for any pure function."

If you want help deciding which scope to start with, our Build, Buy, or Book decision tool walks through the same logic teams use internally.

If the result is good, formalize the prompt into a project-level template (.claude/test-template.md) so every engineer on the team writes tests the same way. The consistency is what makes the suite trustable, not any one prompt. If the result is not good, the gap is almost always at the spec stage. Write the spec, then test.

The Cadence connection

Every engineer on Cadence is AI-native by default. The voice interview specifically scores test-prompt discipline as part of the rubric: can the candidate enumerate failure modes from a one-line contract, do they reach for property-based tests on pure functions, do they sanitize snapshots, do they catch the "tested the wrong implementation" trap before they commit. Engineers below the threshold do not unlock bookings.

There is no non-AI-native option on Cadence. If you are booking a mid-tier engineer for $1,000/week, test discipline is in the baseline; you are not paying extra for it. If you are booking a senior at $1,500/week, expect them to set up the project-level prompt template on day one rather than wait to be asked. For more on the underlying numbers, see our breakdown of AI-native engineering ROI: 2026 numbers.

If your team is shipping production code without a test-prompt discipline, the fastest path forward is to book a mid or senior Cadence engineer for one week. They will leave behind a .claude/test-template.md, a passing suite on at least one critical module, and a written runbook your team can apply to the rest of the codebase. 48-hour free trial, weekly billing, replace any week with no notice. Book your first engineer.

FAQ

Can Claude write better tests than a senior engineer?

Not without a senior engineer's prompt. Claude writes tests with the discipline you encode in the prompt. A senior engineer asking Claude with a contract-first, spec-led, AAA-enforced template will produce better tests faster than a senior engineer writing them by hand. A junior engineer pasting "write tests" gets worse tests than they would have written themselves.

What's the single biggest mistake when prompting Claude for tests?

Asking Claude to test the implementation instead of testing the spec. You paste the function, Claude reads it, Claude writes tests that pass against the current code. If the code has a bug, the tests lock the bug in. Always feed Claude an independent specification (a ticket, a written contract, or a sentence) and instruct it to treat the implementation as suspect.

Should I use Vitest or Jest with Claude in 2026?

Vitest, unless you are on a Jest codebase already. Vitest is faster, ESM-native, has a cleaner watch mode that Claude Code can read, and has first-class file-snapshot support that makes the golden test pattern trivial. Jest still works fine; it is just slower to iterate with.

How do I stop Claude from writing brittle snapshot tests?

Cap snapshot size at 30 to 50 lines in your prompt template, ban HTML and JSX snapshots, and require all dynamic fields (dates, IDs, random values) to be sanitized before the snapshot is taken. The instruction has to be in the prompt; Claude does not default to this.

What about test-driven development with Claude?

TDD with Claude works well if you write the test first (in plain English or a stub) and ask Claude to implement against it. The reverse (write the code, ask Claude to test it) is where the "tested the wrong implementation" trap lives. If you can afford the extra discipline of TDD, do it; you will get a stronger suite and stronger code.

Will Claude eventually replace test engineers entirely?

No. Claude replaces the keystrokes, not the judgment. Someone still has to decide what the public contract is, which failure modes are worth testing, what the spec actually says, and whether a passing suite is testing the right thing. That judgment is what the Cadence voice interview filters for. The tools change; the discipline does not.

Akashdeep Singh

Senior Frontend Developer

Senior frontend developer at withRemote. Writes on React, Next.js, performance budgets, and modern web tooling.

All posts