
AI-assisted code review means a bot (CodeRabbit, Greptile, Qodo, or an ad-hoc Claude Code pass) does the first read of every PR. A human reviewer then uses that summary as a map, not a verdict, and drills into intent, tests, and invariants. Done right, teams merge roughly 4x faster and catch more bugs. Done wrong, you stamp LGTM on AI output you never actually read.
The difference between those two outcomes is workflow rigor, not tool choice. Most of the loss happens because reviewers treat the AI pass as the review.
By early 2026, around 41% of commits land with meaningful AI assistance. That is the GitHub Copilot pull, the Cursor multi-file edit, the Claude Code agent run. Pull requests are about 18% larger than they were in 2023, and incidents per PR have climbed roughly 24% in the same window (Addy Osmani's writeup is the canonical source for the latter two numbers).
The shape of code review changed to absorb this. The pre-AI shape was: human writes code, second human reads diff, second human approves (covered in depth in our broader playbook on running code reviews effectively). The 2026 shape is: human plus AI write code, AI reads diff first and posts structured comments, human reads the AI summary and uses it to focus a sharper second pass.
That third step is where most teams quietly give up. They install CodeRabbit or Greptile, the bot leaves a clean summary on every PR, and the reviewer skims the summary and clicks approve. The bot did not review the code. The bot summarized it. Those are different verbs.
The pattern that works has three steps and only three.
Step 1: bot auto-comments on PR open. Greptile, CodeRabbit, Qodo, or a custom GitHub Action running Claude Code posts within 60 seconds of the PR opening. The output is a high-level summary, a list of findings with severity, and inline suggestions on the diff. This is cheap, fast, and consistent across every PR. (For a fuller bake-off of the bots, see our writeup on the best AI code review tools.)
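If you go the custom-action route, the heart of it is small. A minimal sketch, assuming the claude and gh CLIs are installed and authenticated in CI and the workflow exports PR_NUMBER (the prompt wording and temp-file paths here are illustrative, not a fixed API):

```bash
#!/usr/bin/env bash
# First-pass review bot: runs on pull_request open, posts one structured comment.
# Sketch only; assumes `claude` and `gh` are available in CI and PR_NUMBER is set.
set -euo pipefail

# Grab the diff for the PR this workflow run is attached to
gh pr diff "$PR_NUMBER" > /tmp/pr.diff

# One structured pass: summary first, then findings with severity and location
claude -p "Review this diff. Output a two-paragraph summary, then a list of \
findings, each with a severity (high/medium/low) and file:line. Skip style nits.

$(cat /tmp/pr.diff)" > /tmp/review.md

# Post the result back to the PR as a single comment
gh pr comment "$PR_NUMBER" --body-file /tmp/review.md
```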
Step 2: human reads the AI summary as a map. The reviewer treats the bot output as scaffolding. It tells you what the bot thought was important. It does not tell you what is actually important. The map shows the terrain; you still have to walk it.
Step 3: human verifies what the AI flagged AND audits what it missed. This is the step that separates rigor from theater. You confirm the flagged issues are real (bots produce false positives, especially on security). You also scan for what the bot could not see: business logic, architectural fit, the implicit invariants of your system. This is where prompt engineering for engineers at the senior level shows up; the same prompt-as-spec discipline that makes AI write good code makes you a better verifier of it.
Where it breaks: the reviewer reads the summary and skips Step 3. Greptile's own benchmark notes that it catches around 82% of test issues vs CodeRabbit's 44%, but Greptile also generates more noise. Either way, neither tool catches all bugs. The AI pass is a floor, not a ceiling.
Pre-AI, large engineering orgs already had a rubber-stamp problem. Internal audits at multiple companies have estimated 60-80% of PR approvals went through without a real read. Reviewers were busy, the diff was small, the author had a good track record, ship it.
AI summaries make this feel safer. The summary looks thorough. The bot says "no critical issues found." Approve, merge, move on. The reviewer is not lying; they genuinely believe the review happened. The bot's output created the impression of rigor.
Two things make this dangerous:

1. Around 45% of AI-generated code contains a security flaw of some kind, per Osmani's analysis of recent studies. XSS vulnerabilities show up roughly 2.74x more often in AI-generated code than in human-written code. The base rate of bad code in your PR queue is higher than it used to be.
2. The bot is biased toward what is in the diff. Architectural mistakes, missing tests for the unhappy path, and broken invariants in the surrounding code are mostly invisible to a tool that only reads the PR.
The fix is small and works: require one specific comment per approval. Even "I checked the auth path and the new claims are validated" counts. This forces the reviewer to do at least one real verification before clicking approve. Track the ratio of LGTM-only approvals in your repo. If it climbs above 30%, your AI tool is enabling theater.
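You can measure that ratio from the GitHub API. A rough sketch using the gh CLI, where an approval with a near-empty review body counts as LGTM-only (the repo slug and the 20-character threshold are placeholders to tune for your team):

```bash
#!/usr/bin/env bash
# LGTM-theater tracker: counts approvals whose review body is under 20 chars.
# Sketch only; assumes an authenticated `gh` CLI.
set -euo pipefail

repo="your-org/your-repo"
total=0
lgtm_only=0

# Walk the last 50 merged PRs and inspect every approval's review body
for pr in $(gh pr list --repo "$repo" --state merged --limit 50 \
              --json number --jq '.[].number'); do
  while read -r len; do
    total=$((total + 1))
    if [ "$len" -lt 20 ]; then lgtm_only=$((lgtm_only + 1)); fi
  done < <(gh api "repos/$repo/pulls/$pr/reviews" \
             --jq '.[] | select(.state == "APPROVED") | (.body | length)')
done

echo "LGTM-only approvals: $lgtm_only of $total"
```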
Some categories of review do not get better with AI assistance. They get worse, because the AI's confident output crowds out the careful human read these areas need.
The list is short and worth pinning to your CONTRIBUTING.md:

- Security-sensitive paths: auth, secrets handling, anything that touches user input
- Architectural fit with the surrounding system
- Business logic correctness
- Irreversible migrations: schema changes, data backfills, anything without a clean rollback
The point is not that AI cannot help here. It is that you should treat AI output in these zones as a starting prompt for a careful human read, not a substitute.
| What | AI bot catches well | Human still required |
|---|---|---|
| Style and lint drift | Yes, near-perfect | No |
| Obvious null/type errors | Yes | Spot-check |
| Test coverage gaps | Often | Verify the tests are real, not vacuous |
| Security (auth, SSRF, secrets) | Sometimes; high false-positive rate | Always |
| Architectural fit | Rarely; lacks context | Always |
| Business logic correctness | No | Always |
| Irreversible migrations | Flags but does not reason about safety | Always |
| Performance regressions | Sometimes (obvious O(n^2)) | For anything load-bearing |
| Documentation drift | Yes, with the right setup | Spot-check |
A note on test coverage: bots are good at saying "this branch lacks a test" and bad at noticing the existing tests assert nothing. A test that calls a function and never checks the return value is worse than no test. AI rarely flags this. Humans do.
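To make that concrete, here is the difference in a shell-style test; parse_config and the fixture path are hypothetical, and the same shape shows up in any test framework:

```bash
# Vacuous: exercises the code but asserts nothing. It passes even if
# parse_config (hypothetical) starts returning garbage tomorrow.
parse_config fixtures/good.conf

# Real: pins the behavior and fails loudly when it drifts.
output="$(parse_config fixtures/good.conf)"
[ "$output" = "retries=3" ] || { echo "FAIL: expected retries=3, got '$output'"; exit 1; }
```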
Not every team wants a third-party bot in the repo. For smaller teams, indie developers, or sensitive codebases, an ad-hoc review pass with Claude Sonnet 4.6 works well and costs almost nothing.
The pattern is a stack of focused prompts, not one generalist prompt. Generalist prompts give you bland summaries. Specialist prompts give you specific findings.
A working sequence:
```bash
# Security pass
claude -p "Review the diff in this PR for security issues. \
Check: auth bypasses, missing input validation, secret leakage, \
SSRF, unsafe deserialization. Cite line numbers. \
Skip style. Be specific."

# Test verification pass
claude -p "For every test added or changed in this PR, \
write one sentence: what does it actually assert? \
Flag any test that calls a function but does not assert behavior."

# Invariants pass
claude -p "List the implicit invariants this code relies on \
(ordering, idempotence, atomicity, single-writer assumptions). \
Flag any change that breaks one."
```
Three short prompts catch more than one long one because each pass narrows the model's attention. This is the same reason specialist agents beat generalist agents in production AI work, the same reason building your first AI agent with tool calling starts by giving the model one job at a time.
The output goes into the PR as comments from the human reviewer, who has now read both the diff and three structured passes over it. That is the rigor. The bot did not replace the human; it gave the human three more sets of eyes.
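To make the three passes a one-command habit, bundle them into the scripts/review-passes.sh mentioned in the checklist below. A sketch, assuming the claude CLI and a checked-out PR branch with origin/main as the base:

```bash
#!/usr/bin/env bash
# scripts/review-passes.sh -- run the three specialist passes over the PR diff.
# Sketch only; assumes the `claude` CLI and origin/main as the base branch.
set -euo pipefail

diff="$(git diff origin/main...HEAD)"

run_pass() {
  echo "== $1 pass =="
  claude -p "$2

$diff"
}

run_pass "Security" "Review this diff for security issues: auth bypasses, \
missing input validation, secret leakage, SSRF, unsafe deserialization. \
Cite line numbers. Skip style. Be specific."

run_pass "Test verification" "For every test added or changed in this diff, \
write one sentence: what does it actually assert? Flag any test that calls \
a function but does not assert behavior."

run_pass "Invariants" "List the implicit invariants this code relies on \
(ordering, idempotence, atomicity, single-writer assumptions). Flag any \
change that breaks one."
```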
When you hire someone in 2026, you should be able to watch them do an AI-assisted review live. The signal is in what they verify versus what they accept.
Specific things to watch for:

- Do they verify the bot's flagged findings are real, or accept them at face value?
- Do they audit for what the bot cannot see: business logic, architectural fit, implicit invariants?
- Do they open the tests and check what the assertions actually verify?
- Do they leave a specific comment, or just an approval?
This is part of why every engineer on Cadence is AI-native by default. The voice interview specifically scores Cursor and Claude Code fluency, prompt-as-spec discipline (the same discipline that makes Cursor rules configured right actually work), and verification habits. There are 12,800 engineers in the pool, and the AI-native baseline is what unlocks bookings; there is no non-AI-native option to opt into. If you need a senior engineer to set up your code review workflow this week, you can book a senior at $1,500/week and get a 48-hour free trial before any billing happens.
A short, concrete list:

- Require one specific comment per approval; "LGTM" alone does not count.
- Track the ratio of LGTM-only approvals weekly. Above 30%, your AI tool is enabling theater.
- Pin the human-only-zones list to CONTRIBUTING.md.
- Save the three specialist prompts in scripts/review-passes.sh. Anyone reviewing a PR can run them in 30 seconds.

If you want a sharper picture of where to spend the human time, our Decide tool gives you a Build / Buy / Book recommendation on the next workflow change you are considering.
AI does not replace human code review. It handles the mechanical pass: lint, types, obvious errors, surface-level test gaps. Architecture, security, business logic, and migration safety still need a human signoff. The 2026 workflow is AI plus human, not AI instead of human.
In public benchmarks, Greptile catches around 82% of seeded test issues vs CodeRabbit's 44%, but Greptile produces noticeably more noise per PR. Qodo's specialist-agent approach falls in between. The right pick depends on whether your team tolerates noise or prefers signal density.
To stop rubber-stamping, require one specific comment per approval, even a short one like "checked the auth path." Track the ratio of LGTM-only approvals weekly. Add a written human-only-zones list to CONTRIBUTING.md. The behavior changes within two sprints.
AI-assisted review is safe for production code when a human still owns the security, architecture, and business logic pass. Around 45% of AI-generated code contains some security flaw and XSS rates run roughly 2.74x higher than in human-written code, so AI alone is not enough. AI plus a disciplined human review is safer than the pre-AI baseline because every PR now gets a structured first pass.
Paid bots win on automatic PR-open triggers, persistent memory across PRs, and team-wide reporting. Claude Code wins on flexibility, cost for small teams, and the ability to run ad-hoc specialist passes. Many shops run both: a bot for the always-on first pass, Claude Code for the deeper drill on risky PRs.