
AI changed software engineering interviews by killing the algorithmic puzzle and replacing it with paired AI debugging. In 2026, the top format is a 30-minute trial PR: the candidate opens Cursor, picks up a real ticket, and ships a working patch with Claude or Copilot in the loop. Interviewers score prompt-as-spec discipline, verification habits, and AI code review judgment, not whiteboard recall.
LeetCode broke first. By late 2024, Copilot was solving "medium" array problems in real time, and any candidate with a second monitor could ghost-pilot through a Karat screen. By 2026, hiring managers had stopped pretending the screen worked. The interview itself is now the thing AI has changed most, more than the job description or the comp band.
This post is the operator's manual for the new format. What replaced the puzzle, how to evaluate prompt-as-spec on a live call, the 30-minute trial-PR rubric, what to score on AI code review, and the questions you should retire today. If you want the macro story (why hiring shifted, what budgets did, where recruiters went), read the companion piece on how AI changed the developer hiring process. This one is the interview.
The replacement is not one format but three layered ones. Most senior teams now run them in this order:
Together that is roughly 95 minutes of evaluation (30 for the trial PR, 45 for the paired debugging session, 20 for the code review), against the 3-4 hours the old loop ate (one phone screen, one algorithmic round, one system design, one behavioral). The new loop is shorter because it tests fewer proxies and more of the actual job.
Here is the old format versus the new, side by side.
| Stage | Old format (2022) | New format (2026) | What changed |
|---|---|---|---|
| Screen | LeetCode medium on HackerRank | 30-min trial PR in a sandbox repo | Algorithm recall is a solved problem; shipping is not |
| Technical round | Whiteboard tree traversal | Paired AI debugging on a real bug | Tests prompt-as-spec and verification, not memorization |
| Code review | Optional, async, low signal | Mandatory: 20-min review of a suspect PR | AI writes most code now; reviewing it is the highest-yield skill |
| System design | "Design Twitter" | "Wire a Claude agent into this checkout flow" | Distributed systems are still tested, but framed around agent loops and RAG |
| Behavioral | "Tell me about a conflict" | "Walk me through your last prompt ladder" | Soft skills + AI workflow exposed in one question |
| Trial period | None (offer cold) | 48-hour paid trial on real scope | Removes the hire-and-pray gamble entirely |
The columns hide one thing worth saying out loud: nothing in the new format requires the candidate to suppress the AI. Old interviews banned tools and watched for cheating. New interviews assume the tools are on and score how well the candidate steers them.
This is the format that replaced the take-home and the LeetCode screen in one move. The mechanics:
What you score, in order of weight:
A good candidate finishes in 22 minutes with a clean diff and a 90-second Loom that includes the phrase "I asked Claude to scope this first, then generated, then ran the failing test." A weak one finishes at 29 minutes, ships a 400-line diff with hallucinated imports, and narrates "I just asked Cursor to fix it."
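One of the weak-candidate signals above, hallucinated imports, is cheap to check mechanically before a human reads the diff. The sketch below is a minimal illustration, not part of any rubric in this post. It assumes the patch is Python, `unresolved_imports` is a hypothetical helper name, and it only checks that each top-level module resolves in the current environment.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level modules imported by `source` that cannot be located."""
    missing: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            # find_spec returns None when no installed module matches the name.
            if top_level not in missing and importlib.util.find_spec(top_level) is None:
                missing.append(top_level)
    return missing


# Hypothetical patch contents: two real stdlib imports and one invented package.
patch = "import json\nfrom os import path\nimport surely_not_a_real_pkg_xyz\n"
print(unresolved_imports(patch))  # ['surely_not_a_real_pkg_xyz']
```

A check like this only proves the module exists. It says nothing about whether the imported function or attribute is real, so running the failing test is still the step that matters.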
The format also kills resume inflation in a way the old loop never did. Every candidate ships something the same day. If the resume claims senior and the PR reads junior, you see it in 30 minutes instead of in week 3.
The second round is live. 45 minutes. The interviewer hands the candidate a codebase with a real bug ("the checkout total is wrong for orders with more than 5 items") and watches.
This is where prompt-as-spec really shows. A senior engineer in 2026 does this:
A weak candidate does this:
Both candidates may eventually ship a fix. The senior shipped a fix they understand and can defend. The weak one shipped a fix they cannot explain, which is the failure mode that breaks production at 3am six weeks later. The whole point of the paired session is to surface this difference, which an async take-home cannot.
The skill you are scoring is the same one a working playbook for AI-assisted refactoring documents at length: scope first, generate second, verify third, ship fourth. Reverse any two and the output regresses.
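To make "verify third" concrete, here is a minimal sketch of pinning the reported bug with a failing check before accepting any generated fix. Everything in it is an assumption for illustration: the post does not say what causes the wrong checkout total, so the paging bug, `PAGE_SIZE`, and all function names are hypothetical.

```python
from decimal import Decimal

PAGE_SIZE = 5  # hypothetical: the cart is fetched in pages of five items


def checkout_total_buggy(prices: list[Decimal]) -> Decimal:
    # Hypothetical bug: only the first page of items is summed.
    return sum(prices[:PAGE_SIZE], Decimal("0"))


def checkout_total_fixed(prices: list[Decimal]) -> Decimal:
    # Fix: sum every item, regardless of how the cart was paged.
    return sum(prices, Decimal("0"))


def six_item_total_is_correct(total_fn) -> bool:
    """Regression check for the reported case: an order with more than 5 items."""
    prices = [Decimal("10.00")] * 6
    return total_fn(prices) == Decimal("60.00")


print(six_item_total_is_correct(checkout_total_buggy))  # False: bug reproduced
print(six_item_total_is_correct(checkout_total_fixed))  # True: fix verified
```

The order matters more than the code. The check is written and seen to fail first, so a passing result after the fix is evidence rather than hope.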
This is the round most companies skip and most senior engineers wish they ran. The setup: hand the candidate a 200-line PR that looks fine. It compiles, the tests pass, the diff reads cleanly. It also contains one or two real problems that the AI generated and the original author did not catch.
Common plants:
A 2026-fluent reviewer catches at least one in 20 minutes and flags it with the kind of comment they would leave on a teammate's PR. The signal is not "found the bug." The signal is the reviewing instinct: "I do not trust this section; let me run it locally before I approve."
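As an illustration of a PR that "looks fine" while hiding real problems, consider the sketch below. It is a made-up example, not one of the plants this post has in mind: `apply_coupon` and its behavior are hypothetical. The shipped test passes, and the issues only show up when the reviewer runs a couple of probes locally.

```python
def apply_coupon(total_cents: int, coupon: dict) -> int:
    """Hypothetical AI-generated helper that a reviewer should not approve as-is.

    Problem 1: the broad except silently ignores malformed coupons.
    Problem 2: nothing stops the discounted total from going negative.
    """
    try:
        return total_cents - coupon["amount_cents"]
    except Exception:
        return total_cents


# The happy-path test the PR shipped with. It passes.
assert apply_coupon(1000, {"amount_cents": 200}) == 800

# Probes a skeptical reviewer might run locally before approving.
print(apply_coupon(1000, {"amount_cents": 5000}))  # -4000: a negative total
print(apply_coupon(1000, {"amount": 200}))         # 1000: mistyped key, no error raised
```

A strong review comment names the risk and the evidence, for example: "This swallows every exception, so a coupon with a mistyped key is silently ignored; I ran it locally and got the undiscounted total back."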
This round directly tests the skill that matters most on a small team in 2026. When AI-native workflows push each engineer to 2.5-3x throughput, the bottleneck moves from writing code to reviewing what the model wrote. Code review is the new code.
If you are still asking any of these questions, you are measuring the wrong thing.
The general rule: if a question can be answered better by Claude in 30 seconds than by a candidate in 30 minutes, the question is testing memorization, not engineering. Cut it.
If your team needs to ship the new format this quarter without rebuilding the whole loop, the fastest path is to pilot the trial PR on three open roles, keep the 45-min paired session, and drop everything else for a month. Measure offer-to-accept and 90-day retention against the old loop. The data converges within one cycle.
Cadence is one option here, and a transparent one. Every engineer on the platform is AI-native by default; the unlock criterion is a voice interview that scores Cursor / Claude / Copilot fluency, prompt-as-spec discipline, verification habits, and multi-step prompt-ladder thinking. There is no non-AI-native tier and no opt-out. A score of 50/100 unlocks bookings.
The booking flow replaces the interview loop entirely. A founder posts a spec, the platform shortlists 4 vetted engineers in 2 minutes, the founder picks one, and the engineer ships their first commit in a median of 27 hours. The 48-hour trial is free, so the trial PR is the actual work, not a synthetic sandbox. If the engineer is wrong for the scope, you replace them on Friday and try the next one.
If you want to run the new interview format yourself, you can; the rubric in this post is the same one we built ours around. If you would rather skip the loop entirely on this one role, decide whether to build, buy, or book the next feature and we will tell you straight if booking is the right call.
Pricing for the booking path:
| Tier | Weekly rate | Best for |
|---|---|---|
| Junior | $500 | Cleanup, dependency hygiene, doc-writing, well-documented integrations |
| Mid | $1,000 | Standard features, end-to-end shipping, refactors, test coverage |
| Senior | $1,500 | Owned scope, architecture, complex refactors, performance, edge cases |
| Lead | $2,000 | Architectural decisions, systems design, fractional CTO, scale work |
Billing is weekly, and you can cancel any week. Daily ratings drive auto-replacement. Engineers earn 80% of the weekly rate.
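For a quick sense of what the 80% share means per tier, the arithmetic can be sketched as follows. The rates come from the table above; treating the remaining 20% as a single retained amount is an assumption, since the post does not break it down, and `weekly_split` is a hypothetical helper name.

```python
WEEKLY_RATES_USD = {"Junior": 500, "Mid": 1000, "Senior": 1500, "Lead": 2000}
ENGINEER_SHARE_PERCENT = 80  # from the post: engineers earn 80% of the weekly rate


def weekly_split(rate_usd: int) -> tuple[int, int]:
    """Return (engineer_payout, remainder) in whole dollars for one billed week."""
    payout = rate_usd * ENGINEER_SHARE_PERCENT // 100
    return payout, rate_usd - payout


for tier, rate in WEEKLY_RATES_USD.items():
    payout, remainder = weekly_split(rate)
    print(f"{tier}: ${rate} billed, ${payout} to the engineer, ${remainder} retained")
```

For the Senior tier this works out to $1,200 to the engineer out of $1,500 billed.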
Three concrete moves, in priority order:
If the bottleneck is not the format but the pipeline, booking is the alternative. Look at a worked set of AI engineering interview questions for 2026 for the question bank, then either run the loop yourself or skip it on the next role.
The 4-hour take-home is dead. The 30-minute trial PR replaced it. Anything longer than 90 minutes loses senior candidates and gets ghost-written by AI anyway, so the signal disappears either way. Keep it short, real, and recorded.
Yes, always let candidates use AI. Banning AI in a 2026 interview is like banning Stack Overflow in 2015: you are testing artificial constraints, not the actual job. Score how well the candidate steers the tool, not whether they can pretend it does not exist.
You stop trying to prevent AI cheating. The new format assumes AI is on, then tests skills AI cannot fake: real-time prompt-as-spec on a live call, verification reasoning, code review judgment on a planted bug. Cheating the new format means already being good at the job.
The classic system design question gives way to a scoped agent-or-RAG design question on a real product surface. Instead of "design Twitter," ask "wire a Claude agent into this checkout flow; what fails first at 10x scale?" Same distributed-systems thinking, framed around the work the candidate will actually do.
The new loop runs roughly 95 minutes of evaluation across three rounds, plus a 48-hour paid trial if you can swing it. The old 4-hour loop optimized for false negatives; the new short loop plus a trial optimizes for true positives. Time-to-offer compresses from 3 weeks to 5 days for most roles.
Fullstack developer at withRemote. Ships across the stack — TypeScript, Node, Postgres, Vercel. Writes on shipping speed and pragmatic architecture.