
AI engineering interview questions in 2026 cluster into seven categories: RAG architecture, agent loops, prompt engineering, evaluation, production cost, security, and vendor API knowledge. Strong candidates answer with concrete numbers, named tools (RAGAS, LangGraph, Anthropic prompt caching), and honest trade-offs. Below is a 32-question kit, each with a one-line scoring rubric a founder can grade against, plus the meta-rule: pair the questions with a paid 4-hour trial.
Three years ago, an AI engineering interview meant CNNs, gradient descent variants, and a backprop-by-hand whiteboard. In 2026, that material is roughly 20-30% of the loop. The other 70-80% is GenAI: RAG system design, agent loop reliability, evaluation harnesses, and production cost.
The opener has changed too: it is now some variant of "design a RAG system for customer support tickets" or "add an LLM feature to an existing product without burning the budget." Roughly 80% of the senior AI eng rounds we surveyed open with a RAG system design.
Take-homes have tightened. A working chatbot is no longer enough. Companies now ask candidates to ship an evaluation harness alongside it: a small gold set, a judge model, and a CI step that fails the build when faithfulness drops below threshold. If the candidate ships the chatbot but skips the eval, that is itself the signal.
Pick five to seven questions across at least four categories. More than ten turns the call into a trivia quiz and the signal collapses.
Each rubric tells you what a strong answer covers. Score three ways: strong (covers all rubric points), partial (covers some but not all), weak (misses the rubric or recites a textbook definition). If the candidate can only define terms, downgrade. Production scars beat textbook fluency at the senior level.
Pair the questions with a short paid trial. A 4-hour scoped task on real code outperforms a 90-minute interview at every level above junior.
RAG is the new baseline. Asking a 2026 AI engineer about RAG is like asking a 2018 backend engineer about REST. Expect fluency, not surprise.
1. When would you NOT use RAG? Rubric: Names small-corpus cases (just stuff it in the prompt), highly dynamic data where the index is always stale, and cases where fine-tuning genuinely wins (style, format, structured extraction). Bonus: mentions hybrid approaches (RAG plus tool calling).
2. Compare sparse vs dense retrieval and explain hybrid. Rubric: Names BM25 and an embedding model (text-embedding-3-large, Cohere v3, Voyage). Explains why hybrid (BM25 for exact terms, dense for semantic similarity) outperforms either alone on most production datasets. (A fusion sketch follows this question set.)
3. How do you choose chunk size? Rubric: Discusses the tension between recall (smaller chunks, more relevant hits) and context (larger chunks, less assembly). Names a starting point (256-512 tokens), explains overlap, and admits the answer is dataset-dependent. Bonus: parent-document retrieval. (Chunking sketch below.)
4. When does reranking earn its latency? Rubric: Knows reranking adds 100-400ms and 1-3 cents per query at scale. Justifies it on top-K retrieval where the first-stage retriever returns 50+ candidates and precision matters more than recall (legal, medical, support).
5. How do you set up a RAG eval harness? Rubric: Names RAGAS or DeepEval. Lists the three core metrics: faithfulness (does the answer match the context), context relevance (did we retrieve the right chunks), answer relevance (does the answer match the question). Mentions a gold set of 50-200 question-answer pairs. (Harness sketch below.)
6. A user asks a multi-hop question and the system fails. How do you debug? Rubric: Decomposes the question into sub-questions. Checks retrieval first (did we find both required chunks?). Then checks generation (did the model fail to combine them?). Names query rewriting and HyDE as remedies for retrieval, and a planning step or agent for generation.
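To ground question 2: one common way to combine sparse and dense results is reciprocal rank fusion. The sketch below is illustrative; `bm25_search` and `dense_search` are hypothetical stand-ins for your actual retrievers.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60, top_n=10):
    """Fuse ranked doc-ID lists (best first) from BM25 and dense retrieval.

    k is the RRF smoothing constant; 60 is the commonly used default.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical usage -- swap in your BM25 index and vector store:
# sparse_hits = bm25_search(query, top_k=50)
# dense_hits = dense_search(query, top_k=50)
# candidates = reciprocal_rank_fusion([sparse_hits, dense_hits])
```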
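For question 3, the knobs are chunk size and overlap. A minimal sketch, with whitespace tokens standing in for real tokenizer counts (an approximation, not production practice):

```python
def chunk_text(text: str, chunk_size: int = 384, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows of roughly chunk_size tokens."""
    tokens = text.split()  # whitespace split approximates token counts for illustration
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```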
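And for question 5, the harness shape matters more than the library. A bare-bones version, where `answer_question` and the three `judge_*` calls are placeholders for your RAG pipeline and your LLM-as-judge prompts:

```python
import json

def run_rag_eval(gold_path: str) -> dict[str, float]:
    """Average the three core RAG metrics over a gold set of question/answer pairs."""
    with open(gold_path) as f:
        gold = [json.loads(line) for line in f]  # one {"question": ..., "reference": ...} per line

    totals = {"faithfulness": 0.0, "context_relevance": 0.0, "answer_relevance": 0.0}
    for ex in gold:
        answer, contexts = answer_question(ex["question"])  # placeholder: your pipeline
        totals["faithfulness"] += judge_faithfulness(answer, contexts)
        totals["context_relevance"] += judge_context_relevance(ex["question"], contexts)
        totals["answer_relevance"] += judge_answer_relevance(ex["question"], answer)
    return {metric: total / len(gold) for metric, total in totals.items()}
```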
For a deeper dive on one common failure mode, see our piece on how to handle hallucinations in production LLM apps, which goes into the layered defense pattern.
Agents are where most 2026 AI features die in production. Ask the questions that surface scars.
7. Walk me through the difference between an agent and an LLM chain. Rubric: A chain has a fixed graph; an agent picks the next step at each turn. Agents add value when state, tool selection, or multi-step planning matters. Otherwise a chain is faster and cheaper. Bonus: names LangGraph or CrewAI as agent frameworks.
8. A tool call fails. What does your loop do? Rubric: Catches the error, returns a structured error message to the model (not a stack trace), bounds retries (typically 2-3), and either falls back to a different tool or surfaces the failure to the user. Strong answers also mention idempotency and side-effect rollback. (Retry-loop sketch below.)
9. How do you prevent infinite loops? Rubric: Step budget (cap at 10-25 steps per task), cycle detection (same tool with same args twice in a row), and a "no progress" check (no new information added in N steps). Names a real incident if they have one. (Loop-guard sketch below.)
10. How do you evaluate a non-deterministic agent? Rubric: Trace-level eval (did each step do what it should?), outcome-level eval (did the final state match the goal?), and a small adversarial set. Mentions LangSmith, Arize, or a custom trace logger. Honest about how hard this is.
11. What memory types does your agent use? Rubric: Distinguishes working memory (current turn), episodic (past turns), semantic (facts about the user), and procedural (how to do tasks). Doesn't oversell; most production agents use just working plus a thin episodic store.
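A minimal sketch of the bounded-retry pattern from question 8. `call_tool` is a hypothetical dispatcher; the point is that the model always receives a structured result, never a stack trace.

```python
import json

MAX_RETRIES = 2

class TransientToolError(Exception):
    """Raised by tools for retryable failures such as timeouts or rate limits."""

def execute_tool(name: str, args: dict) -> str:
    """Run a tool with bounded retries and return a structured result string."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            result = call_tool(name, args)  # placeholder: your tool dispatcher
            return json.dumps({"ok": True, "result": result})
        except TransientToolError as exc:
            if attempt == MAX_RETRIES:
                return json.dumps({"ok": False, "error": str(exc), "retries_exhausted": True})
        except Exception as exc:
            # Non-retryable (bad arguments, auth): tell the model what went wrong so it can
            # pick another tool or escalate to the user instead of repeating the same call.
            return json.dumps({"ok": False, "error": type(exc).__name__, "detail": str(exc)})
```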
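The three guards from question 9 are a few lines each. This sketch assumes the agent loop records, per step, the tool name, its arguments, and whether the step added new information.

```python
MAX_STEPS = 20
NO_PROGRESS_LIMIT = 3

def should_stop(history: list[dict]) -> str | None:
    """Return a stop reason, or None to keep looping.

    Each history entry looks like {"tool": str, "args": dict, "added_info": bool}.
    """
    if len(history) >= MAX_STEPS:
        return "step budget exhausted"
    if (len(history) >= 2
            and history[-1]["tool"] == history[-2]["tool"]
            and history[-1]["args"] == history[-2]["args"]):
        return "cycle: same tool called with the same args twice in a row"
    recent = history[-NO_PROGRESS_LIMIT:]
    if len(recent) == NO_PROGRESS_LIMIT and not any(step["added_info"] for step in recent):
        return "no new information in the last few steps"
    return None
```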
For a primer, our walk-through on building your first AI agent with tool calling covers the loop semantics in detail.
In 2026, prompt engineering is half craft, half cost optimization. The cost half is what separates senior from mid.
12. Walk me through prompt caching. Rubric: Knows Anthropic's prompt caching gives roughly 90% input cost reduction on cache hits with a 5-minute or 1-hour TTL. Knows you cache the static prefix (system prompt, tool definitions, long context), not the changing suffix. Bonus: knows OpenAI auto-caches but Anthropic requires explicit cache_control. (Caching sketch below.)
13. How do you choose few-shot examples? Rubric: Diversity (cover edge cases, not just easy ones), recency (refresh as data drifts), and dynamic selection (retrieve nearest examples per query, don't hard-code). 3-5 examples usually beat 10 due to diminishing returns and cost.
14. How do you make a model refuse out-of-scope requests reliably? Rubric: Explicit refusal policy in the system prompt with examples. Structured outputs (refusal as a typed field, not free text). Eval set of out-of-scope cases. Honest answer: 100% refusal is impossible; aim for 95-98% with a fallback classifier. (Schema sketch below.)
15. What goes in the system prompt vs the user prompt? Rubric: System: identity, capabilities, constraints, tool definitions, output format. User: the actual request and any per-call context. The user prompt should be substitutable; the system should be cacheable.
16. What does prompt-as-spec mean? Rubric: The prompt is precise enough that a competent engineer (or another LLM) could write the same code from it. Function signature plus 3 examples plus 1 edge case. The same artifact briefs the human and the model.
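For question 12, the mechanics with the Anthropic SDK look roughly like this: the static prefix (system prompt, tool docs) carries a cache_control block; the per-request user message does not. The model ID is a placeholder, and exact TTLs and minimum cacheable lengths should be checked against the current docs.

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "...several thousand tokens of policies, formatting rules, and tool docs..."

def ask(user_question: str):
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache the static prefix
            }
        ],
        messages=[{"role": "user", "content": user_question}],  # the changing, uncached suffix
    )
```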
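For question 14, "refusal as a typed field" can be as simple as a response schema the model must fill. A sketch with Pydantic; the field names are illustrative, not a standard.

```python
from typing import Optional

from pydantic import BaseModel

class AssistantResponse(BaseModel):
    """Schema-constrained output: the refusal decision is a field, not free text."""
    in_scope: bool
    refusal_reason: Optional[str] = None  # set only when in_scope is False
    answer: Optional[str] = None          # set only when in_scope is True

def handle(raw_model_output: str) -> AssistantResponse:
    parsed = AssistantResponse.model_validate_json(raw_model_output)
    if not parsed.in_scope:
        # Hand off to the fallback classifier / canned refusal path here.
        pass
    return parsed
```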
If your team is sharpening this skill, our breakdown of prompt engineering for senior software engineers covers the prompt-as-spec discipline end to end.
Evaluation is where the bluffers crack. Ask in detail.
17. What are the limitations of LLM-as-judge? Rubric: Position bias (favors first or last response), verbosity bias (favors longer answers), self-preference (judge model favors its own family), and length bias on rubrics. Mitigations: pairwise comparison, randomized order, multiple judges, calibrated rubrics. (Pairwise-judge sketch below.)
18. How do you curate a gold set? Rubric: Sample real production traffic (not synthetic), label by hand, target 50-200 examples for a v1. Refresh quarterly or when the input distribution drifts. Reserve a hold-out so the eval set isn't overfit by prompt iteration.
19. Difference between offline and online eval? Rubric: Offline runs on the gold set in CI; catches regressions before deploy. Online measures user-facing metrics (thumbs up/down, conversion, escalation rate); catches drift and intent shifts. You need both.
20. How do you wire eval into CI/CD? Rubric: On every PR that touches prompts or models, run the eval set. Block the merge if faithfulness or accuracy drops below a threshold (often 2-3 percentage points below baseline). Names a tool: LangSmith, Braintrust, Promptfoo, or a custom GitHub Action. (CI gate sketch below.)
21. How do you slice eval metrics? Rubric: Disaggregate by intent (FAQ vs troubleshooting vs billing), by user persona (new vs power user), by language, and by error mode (hallucination vs refusal vs format break). An aggregate 92% accuracy can hide a 60% accuracy on the highest-value slice.
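Position bias (question 17) is the easiest of these to blunt in code: randomize which response the judge sees first and un-shuffle the verdict afterward. `judge` is a hypothetical LLM call that returns "A" or "B".

```python
import random

def pairwise_compare(question: str, response_1: str, response_2: str, judge) -> int:
    """Return 1 if response_1 wins, 2 if response_2 wins, with randomized presentation order."""
    flipped = random.random() < 0.5
    first, second = (response_2, response_1) if flipped else (response_1, response_2)
    verdict = judge(question=question, response_a=first, response_b=second)  # "A" or "B"
    if verdict == "A":
        return 2 if flipped else 1
    return 1 if flipped else 2
```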
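And for question 20, the CI step itself stays small: run the harness, compare against a checked-in baseline, and exit non-zero when the drop exceeds the budget. The file paths are illustrative, and run_rag_eval is the placeholder harness sketched earlier.

```python
import json
import sys

ALLOWED_DROP = 0.02  # block the merge if faithfulness falls more than 2 points below baseline

def main() -> int:
    with open("eval/baseline.json") as f:
        baseline = json.load(f)["faithfulness"]
    current = run_rag_eval("eval/gold_set.jsonl")["faithfulness"]  # placeholder harness
    print(f"baseline={baseline:.3f} current={current:.3f}")
    if current < baseline - ALLOWED_DROP:
        print("Eval regression detected; failing the build.", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```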
This is the most underweighted category in competitor question banks, which is exactly why it surfaces strong candidates.
22. We hit 10M queries/day and the bill is $80k/month. Where do you cut first? Rubric: Step 1: prompt caching (often 50-70% reduction). Step 2: model routing (cheap model handles 60-80% of queries, expensive only when escalated). Step 3: shorter context (chunk pruning, summarization). Step 4: batch API for non-real-time work (50% discount on Anthropic).
23. Walk me through model routing. Rubric: Cheap model (Haiku 4.5, ~$1/$5 per million tokens) for classification, intent detection, structured extraction. Mid model (Sonnet 4.6, ~$3/$15) for most user-facing reasoning. Expensive (Opus 4) only for hard reasoning or planning. Routing logic: a small classifier or rules.
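A routing layer can be a handful of rules in front of a cheap classifier. The tier names below mirror the question; the IDs are placeholders, not real model names.

```python
CHEAP = "haiku-tier"      # placeholder IDs for the cheap / mid / expensive tiers
MID = "sonnet-tier"
EXPENSIVE = "opus-tier"

def route(task_type: str, estimated_difficulty: float) -> str:
    """Pick a model tier from a coarse task label and a 0..1 difficulty estimate."""
    if task_type in {"classification", "intent_detection", "structured_extraction"}:
        return CHEAP
    if estimated_difficulty > 0.8 or task_type in {"planning", "multi_step_reasoning"}:
        return EXPENSIVE
    return MID
```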
For more on which model handles which job, our comparison of Claude Opus, Sonnet, and Haiku breaks down the routing matrix.
24. Latency vs cost: when do you pay for latency? Rubric: Real-time chat needs sub-second first-token; batch jobs can wait minutes. Interactive features under 800ms feel snappy; over 2s feel broken. Streaming hides latency for chat but not for tool calls.
25. When do you use the batch API? Rubric: Anything async: nightly summarization, eval runs, content moderation backfills, embedding generation. 50% discount on Anthropic and OpenAI. Wrong fit for anything user-facing that needs a response inside a minute.
Security is finally a real interview category in 2026. Two years ago you could ship without a prompt-injection defense. Not anymore.
26. How do you mitigate prompt injection? Rubric: Layered defense, not a single trick. Layer 1: input filtering (regex for obvious injection patterns). Layer 2: structured outputs (schema-constrained, not free text). Layer 3: least-privilege tools (a tool that can read but not delete). Layer 4: human-in-the-loop for destructive actions. Honest answer: prompt injection cannot be 100% prevented in agentic systems; reduce blast radius. (Defense-in-layers sketch below.)
27. How do you handle PII in an LLM pipeline? Rubric: Tokenize before the model sees it (replace names, emails, IDs with placeholders), de-tokenize on output. Use a model with a no-training data agreement (Anthropic, Azure OpenAI). Log redacted versions only. Bonus: knows Anthropic does not train on API data by default. (Redaction sketch below.)
28. How do you detect a jailbreak attempt? Rubric: Pattern detection (DAN, ignore previous instructions), output classifier (does the response violate policy?), and rate limiting on suspicious users. Strong answers admit detection is partial; the design must assume some jailbreaks succeed.
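To make layers 1 and 3 of question 26 concrete: a crude input screen plus a read-only tool allowlist. The patterns and tool names are examples only, and filtering alone is never sufficient, which is the point of layering.

```python
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now (in )?(dan|developer mode)", re.I),
]

READ_ONLY_TOOLS = {"search_docs", "get_ticket", "list_orders"}  # no delete or update tools exposed

def looks_like_injection(user_text: str) -> bool:
    """Layer 1: flag obvious injection attempts; never the only defense."""
    return any(pattern.search(user_text) for pattern in INJECTION_PATTERNS)

def is_tool_allowed(tool_name: str) -> bool:
    """Layer 3: the agent that touches untrusted content only gets read-only tools."""
    return tool_name in READ_ONLY_TOOLS
```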
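And for question 27, the tokenize/de-tokenize pattern in miniature. A regex catches emails here for illustration; a real pipeline would use a proper PII detector and cover names, IDs, and phone numbers.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with placeholders before the model sees the text."""
    mapping: dict[str, str] = {}

    def _substitute(match: re.Match) -> str:
        placeholder = f"<EMAIL_{len(mapping)}>"
        mapping[placeholder] = match.group(0)
        return placeholder

    return EMAIL_RE.sub(_substitute, text), mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into the model's output before showing the user."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```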
This is the category that filters "knows the field" from "ships in the field."
29. What's the latest in the Anthropic API that you've shipped? Rubric: Names two of: prompt caching, extended thinking, batch API, computer use, Claude Code SDK, files API. Explains a real use case for one. For more on shipping with Claude Sonnet 4.6 specifically, our production guide covers the defaults.
30. What's the latest in the OpenAI API that you've shipped? Rubric: Names two of: structured outputs, Realtime API, batch endpoint, Assistants API (knows it's deprecated for new builds), o-series reasoning models. Explains a real use case.
31. Where does Gemini 2.5 actually beat the alternatives? Rubric: 2M-token context window for whole-codebase or whole-document tasks, native multimodal (video frames as input), and price for long-context jobs. Honest about weaknesses: refusal behavior is rougher, structured outputs less mature.
32. When would you switch providers mid-product? Rubric: Cost shift over 2-3x, sustained latency regression, refusal behavior breaks a use case, or a new capability (long context, multimodal) unlocks a feature. Knows switching costs: prompt re-tuning, eval re-baselining, output drift.
Every rubric in this post follows a pattern. A strong answer names the failure mode, names the tool, names the trade-off. A weak answer is a textbook definition.
Example. Question 17 (LLM-as-judge limitations). Strong: "Position bias and verbosity bias. We mitigate with pairwise comparison and randomized order. Even then we trust the judge for relative ranking, not absolute scoring." Weak: "LLMs can be biased and may not always be accurate."
The strong answer signals production scars. The weak one signals reading a blog post once.
Don't trust certainty about a non-deterministic system. If a candidate claims their RAG is "always faithful" or their agent "never loops," ask for the eval numbers and the worst incident.
Here is the honest take. Interview signal degrades fast above question seven. A 4-hour scoped task on a real piece of your code (build a small RAG pipeline against your docs, add one tool to an existing agent, write an eval harness for an existing prompt) outperforms a 90-minute structured interview at every level above junior.
| Approach | Time to first signal | Cost | Signal quality | Best for |
|---|---|---|---|---|
| 5-round interview loop | 3-6 weeks | 10-30 hours of team time | Verbal fluency, not shipping | Senior FTE hire |
| Take-home plus 2 interviews | 1-2 weeks | 5-10 team hours plus paid take-home | Code sample plus reasoning | Mid or senior FTE |
| Cadence 48-hour trial | 48 hours | $0 trial, then $500-2k/wk | Real production output | Booking by week, scope-bounded |
Every engineer on Cadence is AI-native by baseline. Before they unlock bookings, they pass a voice interview scored specifically on the skills above: Cursor and Claude Code fluency, prompt-as-spec discipline, RAG and agent reasoning, eval habits, vendor API depth. There is no non-AI-native option on the platform.
Cadence pricing is locked: junior $500/week, mid $1,000/week, senior $1,500/week, lead $2,000/week. Weekly billing, replace any week, no notice period. The 48-hour free trial means you get two days of real shipping signal before the first invoice. For founders comparing this against a 5-round interview loop, the math usually decides itself. If you have a scoped AI feature in front of you right now, ask Cadence's Build/Buy/Book tool for a 60-second recommendation before you start the interview loop.
The questions in this post are still useful. Use them in the trial kickoff call to align on language, surface the candidate's depth, and frame the work. Just don't use them as the gate. Real code on real scope is the gate.
For a broader hiring playbook, see how to hire an AI engineer. For the generic-developer interview kit (non-AI-specific), see questions to ask a developer.
Try the trial-first approach. Spec one AI feature, book a senior Cadence engineer, and use the 48-hour free trial as your interview. You will know more about the engineer in two days of shipping than you would in five rounds of questions. Replace the week if it doesn't click; no notice period.
Five to seven across at least four categories. More than ten turns the call into a quiz and signal collapses. Pair the questions with a paid trial for real signal.
Only if the role touches model training. For applied AI engineering work (RAG, agents, eval, integration), classic ML is at best 20% of the signal. Most production AI eng work in 2026 is pipeline plumbing and eval discipline, not model architecture.
"Walk me through an LLM eval setup you actually shipped, including the gold set, the judge model, and one false positive you caught." It surfaces depth in two minutes and bluffers cannot fake it.
Yes, with the one-line rubrics in this post. The harder skill is grading vague answers, not asking the question. If you cannot tell strong from weak, ask a senior friend to sit in for the first three calls.
Every engineer passes a voice interview scored on Cursor, Claude Code, and Copilot fluency, prompt-as-spec discipline, verification habits, and multi-step prompt-ladder thinking. 50/100 unlocks bookings, and there is no opt-out tier. Daily ratings during a booking auto-replace anyone who stops shipping.