May 7, 2026 · 10 min read · Cadence Editorial

How to handle hallucinations in production LLM apps

Photo by [Google DeepMind](https://www.pexels.com/@googledeepmind) on [Pexels](https://www.pexels.com/photo/an-artist-s-illustration-of-artificial-intelligence-ai-this-image-was-inspired-neural-networks-used-in-deep-learning-it-was-created-by-novoto-studio-as-part-of-the-visualising-ai-proje-17483870/)

You handle hallucinations in production LLM apps with a layered defense: ground the model with RAG and tool calls, require citations, validate outputs with schemas and a verifier model, run continuous evals, and escalate uncertain answers to a stronger model or a human. You cannot eliminate hallucinations. You can only reduce and detect them.

That last sentence is the part most teams skip, and it is the reason most LLM features get pulled from production six weeks after launch.

The honest baseline: you cannot eliminate, only reduce and detect

A 2024 Stanford study found legal-query hallucination rates of 58% to 88% across the major frontier models. Broader 2026 measurements put general hallucination rates between 50% and 82% depending on prompting style and task. These numbers do not go to zero with a bigger model, a better prompt, or a carefully tuned temperature setting.

The deeper issue is incentive. Training objectives reward confident guessing over calibrated uncertainty, so models trained at scale learn to fabricate plausible answers rather than refuse. A NAACL 2025 study showed that hallucination-focused finetuning can cut error rates by 90% to 96% in narrow domains, but even those numbers have a tail.

Production teams that ship reliably plan for failure, not perfection. They build the layered defense below, assign each layer to a specific engineer, and treat the residual rate as something to monitor like uptime, not a bug to close.

Layer 1: structural mitigation (the cheapest wins)

Structural mitigation is the work you do before the model generates a token. It is the highest-ROI layer because it changes the input distribution, not the output.

RAG with citation requirement

Retrieval-augmented generation grounds the answer in your documents instead of the model's training memory. The pattern is well understood now: hybrid retrieval (BM25 plus dense vectors), a cross-encoder reranker, then prompt the model with retrieved chunks. We covered the full stack in our production RAG architecture guide.

The non-negotiable add-on is a citation requirement. The system prompt enforces one rule: every factual claim must attach a quoted source from the retrieved context, or the model refuses to answer. If you only do RAG without citations, the model still smooths over gaps with plausible-sounding fabrication. If you add citations, you can mechanically check that every claim maps to a real source.
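Here is a minimal sketch of that mechanical check, assuming a hypothetical citation format where the model tags each claim with a chunk ID and a verbatim quote. The format, regex, and function names are illustrative, not a standard.

```python
import re

# Assumed citation format: the system prompt asks the model to tag each claim
# with [SOURCE: <chunk_id>] "<verbatim quote>" right after the sentence it supports.
CITATION_RE = re.compile(r'\[SOURCE:\s*(?P<chunk_id>[\w-]+)\]\s*"(?P<quote>[^"]+)"')

def validate_citations(answer: str, retrieved_chunks: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every citation checks out."""
    problems = []
    citations = list(CITATION_RE.finditer(answer))
    if not citations:
        problems.append("answer contains no citations")
    for match in citations:
        chunk_id, quote = match.group("chunk_id"), match.group("quote")
        chunk = retrieved_chunks.get(chunk_id)
        if chunk is None:
            problems.append(f"cites unknown chunk {chunk_id}")
        elif quote not in chunk:
            # The quoted span must appear verbatim in the retrieved text.
            problems.append(f"quote not found in chunk {chunk_id}: {quote[:60]!r}")
    return problems

# If validate_citations() returns problems, retry with the errors appended to the
# prompt, or refuse and show the user a "we couldn't verify this answer" message.
```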

Structured outputs via JSON schema

OpenAI structured outputs and Anthropic tool-use schemas constrain the response shape so it cannot be malformed. When supported, the model returns JSON that matches your schema 100% of the time. This kills an entire class of failures: parse errors, missing fields, hallucinated keys.

Structured outputs also force the model to commit to discrete fields rather than free-form prose, which makes downstream validation trivial.
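As a sketch of the pattern, here is the OpenAI Python SDK's structured-output parse helper with a Pydantic model. The `OrderStatusReply` fields are assumptions for an order-status feature, and the helper requires a recent SDK version and a model that supports strict structured outputs.

```python
from pydantic import BaseModel
from openai import OpenAI

class OrderStatusReply(BaseModel):
    order_id: str
    status: str                  # e.g. "shipped", "processing"
    estimated_delivery: str | None
    answer_for_user: str         # prose the UI will show

client = OpenAI()

# The parse helper sends the Pydantic model as a strict JSON schema, so the
# response either matches OrderStatusReply or the call fails -- never malformed JSON.
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided order data."},
        {"role": "user", "content": "Where is my order #8123?"},
    ],
    response_format=OrderStatusReply,
)
reply = completion.choices[0].message.parsed  # an OrderStatusReply instance
```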

Tool-use grounding

For anything that involves real facts (a customer's order status, a current price, a row in your database), do not let the LLM answer from memory. Define a tool, let the model call it, and put the database response into the prompt. Our OpenAI function calling guide walks through the pattern in detail.

Tool calls turn the LLM into an editor of facts retrieved by deterministic code. That is the right division of labor. Numbers come from the database. Prose comes from the model.
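A minimal sketch of that division of labor with OpenAI function calling. The `get_order_status` lookup is a hypothetical stand-in for your database query; the tool schema and model choice are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical deterministic lookup -- in a real app this hits your database.
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped", "eta": "2026-05-09"}

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order #8123?"}]
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# Run the tool in deterministic code, then hand the real data back for phrasing.
result = get_order_status(**json.loads(call.function.arguments))
messages += [
    first.choices[0].message,
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
]
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)  # prose from the model, numbers from the database
```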

Layer 2: guardrails (validate before the user sees it)

Layer 2 catches the failures Layer 1 missed. It runs after generation, before the user sees the answer.

Schema and regex validation

The cheapest guardrails are deterministic. Run a regex and rules pass over the output and flag: dates in the future when they should be in the past, dollar amounts above the user's account balance, fields that should never be empty, and named entities that do not appear in the retrieved context. A few hours of regex work catches an embarrassing percentage of obvious failures.

If your output is structured, validate it against a JSON schema. Reject and retry on failure with a corrective prompt that includes the validation error.
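A combined sketch of both checks, assuming a hypothetical `RefundDecision` schema and an `ORD-######` order-ID format; the business rules are examples, not a recommendation.

```python
import re
from datetime import date

from pydantic import BaseModel, ValidationError

class RefundDecision(BaseModel):
    order_id: str
    refund_amount_usd: float
    explanation: str

def deterministic_checks(raw_json: str, account_balance_usd: float,
                         retrieved_context: str) -> list[str]:
    """Cheap, deterministic guardrails. Returns errors to feed back into a retry prompt."""
    try:
        decision = RefundDecision.model_validate_json(raw_json)  # schema check first
    except ValidationError as exc:
        return [f"schema violation: {exc}"]

    errors = []
    if decision.refund_amount_usd > account_balance_usd:
        errors.append("refund exceeds the customer's balance")
    if not re.fullmatch(r"ORD-\d{6}", decision.order_id):
        errors.append("order_id does not match the expected format")
    # Any date mentioned in the explanation should not be in the future.
    for y, m, d in re.findall(r"(\d{4})-(\d{2})-(\d{2})", decision.explanation):
        if date(int(y), int(m), int(d)) > date.today():
            errors.append(f"explanation cites a future date {y}-{m}-{d}")
    # Crude grounding check: the amount should appear in the retrieved context.
    if f"{decision.refund_amount_usd:.2f}" not in retrieved_context:
        errors.append("refund amount not found in retrieved context")
    return errors
```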

Guardrails AI and NeMo Guardrails

For the things regex cannot catch (factual consistency, PII leakage, off-topic drift), use a framework. Guardrails AI gives you typed validators you wrap around any model call: profanity, PII, factual consistency against a reference, valid SQL, and so on. NVIDIA NeMo Guardrails handles topic boundaries and conversation flow, useful for chatbots that need to stay in their lane.

These frameworks save you from writing the boilerplate, but the validators they ship are baselines. Plan to write a few custom ones specific to your domain.

Second-LLM verifier

For high-stakes outputs, run a second model on the first answer. The verifier prompt looks like: "Here is a user question, here is the retrieved context, here is the answer. Score 0-3 for: faithfulness to context, hallucination risk, refusal-appropriateness." If the score is below threshold, retry, escalate, or refuse.

Verifiers typically run on a cheaper model (Haiku, GPT-4o-mini, Gemini Flash) and add 20% to 40% to per-call cost. That is acceptable when a wrong answer would cost more.
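A minimal sketch of a verifier call using a cheap OpenAI model in JSON mode; the rubric axes mirror the prompt above, and the threshold and model choice are assumptions you should tune against your own eval set.

```python
import json
from openai import OpenAI

client = OpenAI()

VERIFIER_MODEL = "gpt-4o-mini"   # assumed cheap verifier model
MIN_FAITHFULNESS = 2             # assumed threshold; tune it

def verify(question: str, context: str, answer: str) -> dict:
    """Ask a second, cheaper model to score the first model's answer (0-3 per axis)."""
    prompt = (
        "You are a verifier. Score the ANSWER strictly against the CONTEXT.\n"
        'Reply with JSON only: {"faithfulness": 0-3, "hallucination_risk": 0-3, '
        '"refusal_appropriateness": 0-3}.\n\n'
        f"QUESTION:\n{question}\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    resp = client.chat.completions.create(
        model=VERIFIER_MODEL,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # JSON mode keeps parsing simple
    )
    return json.loads(resp.choices[0].message.content)

scores = verify(
    question="What is the refund window?",
    context="Policy doc: refunds are accepted within 30 days of delivery.",
    answer="Refunds are accepted within 30 days of delivery.",
)
if scores["faithfulness"] < MIN_FAITHFULNESS:
    ...  # retry, escalate to a stronger model, or refuse
```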

Layer 3: evals (you cannot improve what you do not measure)

If you do not have an eval pipeline, you do not have a production LLM app. You have a demo with users.

Custom hold-out sets

The first thing to build is a hold-out set: 100 to 300 questions from your domain with verified correct answers. Hand-curate them. Include the easy cases, the edge cases, the gotchas you have already seen in production. Lock the set, version it, and run it against every model change, prompt change, retrieval change, and dependency upgrade.

A 100-question set you actually run on every PR beats a 5,000-question set you forget exists. Start small, grow over time as you learn what your real failure modes look like.
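A sketch of the "run it on every PR" loop, assuming a hypothetical `evals/holdout.jsonl` file with one `{"question", "expected"}` object per line and a crude substring check; in practice you would swap the check for the judge scoring described below.

```python
import json
from pathlib import Path

HOLDOUT = Path("evals/holdout.jsonl")  # locked, versioned alongside the code

def run_holdout(generate_answer) -> float:
    """Run every hold-out question through the current pipeline; return the pass rate."""
    cases = [json.loads(line) for line in HOLDOUT.read_text().splitlines() if line.strip()]
    failures = []
    for case in cases:
        answer = generate_answer(case["question"])        # your production pipeline
        if case["expected"].lower() not in answer.lower():  # crude check; a judge is better
            failures.append(case["question"])
    rate = 1 - len(failures) / len(cases)
    print(f"hold-out pass rate: {rate:.1%} ({len(failures)} failures)")
    return rate

# In CI: fail the build if the pass rate regresses below the last released baseline.
# assert run_holdout(generate_answer) >= BASELINE_PASS_RATE
```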

LLM-as-judge for hallucination

For each candidate answer in the hold-out set, run a judge model that scores faithfulness against the retrieved context and the verified answer. LLM-as-judge typically agrees with human raters 75% to 85% of the time, which is good enough for relative comparisons across versions.

Track the score over time. A regression in faithfulness on your eval set is a release-blocking signal.
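A minimal judge sketch: unlike the runtime verifier, the judge also sees the verified gold answer from the hold-out set. The judge model and the single-integer rubric are assumptions.

```python
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(question: str, context: str, gold_answer: str, candidate: str) -> int:
    """Score a candidate 0-3 for faithfulness to the context and agreement with the gold answer."""
    prompt = (
        "Score the CANDIDATE answer 0-3 for faithfulness to the CONTEXT and agreement "
        "with the GOLD answer. Reply with a single integer and nothing else.\n\n"
        f"QUESTION: {question}\nCONTEXT: {context}\nGOLD: {gold_answer}\nCANDIDATE: {candidate}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any cheap model works for relative comparisons
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())
```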

Phoenix, Braintrust, LangSmith

You do not need to build the pipeline yourself. Arize Phoenix is the open-source default. Braintrust and LangSmith are the commercial options. They all do hold-out evals, judge scoring, prompt experiments, and trace inspection. Pick by ergonomics and your team's existing tooling, not feature lists. The differences in capability are smaller than the differences in how often your team will actually open the dashboard.

Layer 4: model-tier escalation and human-in-the-loop

The final layer routes work across model tiers and human reviewers based on confidence.

The pattern is: a cheap model handles the easy cases (Haiku 4.5, GPT-4o-mini, Gemini Flash). When the verifier score is low, the structured-output validation fails, or the model emits an explicit refusal, the request escalates to a stronger model (Sonnet, GPT-4o, Opus). When the stronger model also returns low confidence, or when the topic is legally or financially material, the request goes to a human reviewer.

In practice, the cheap tier covers 70% to 85% of traffic if your routing is sensible. The strong tier picks up most of the rest. The human queue stays manageable when escalation gates are tight, which is the only way human review survives contact with real volume.
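A sketch of the routing decision, combining the signals already described (verifier score, schema validity, explicit refusal, legally or financially material topics). The thresholds, topic list, and data shape are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Decision(Enum):
    SHIP = auto()             # answer goes to the user
    ESCALATE_STRONG = auto()  # re-run on the stronger model
    ESCALATE_HUMAN = auto()   # queue for human review

MATERIAL_TOPICS = {"legal", "tax", "medical"}  # assumed; always goes to a human
MIN_VERIFIER_SCORE = 2                         # assumed threshold

@dataclass
class Draft:
    answer: str
    verifier_score: int   # 0-3 from the second-LLM verifier
    schema_valid: bool
    refused: bool
    topic: str

def route(draft: Draft, current_tier: str) -> Decision:
    """current_tier is 'cheap' or 'strong'. Decide whether the draft ships or escalates."""
    if draft.topic in MATERIAL_TOPICS:
        return Decision.ESCALATE_HUMAN
    confident = (
        draft.schema_valid
        and not draft.refused
        and draft.verifier_score >= MIN_VERIFIER_SCORE
    )
    if confident:
        return Decision.SHIP
    # Low confidence: escalate once to the strong tier, then to a human.
    return Decision.ESCALATE_STRONG if current_tier == "cheap" else Decision.ESCALATE_HUMAN
```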

| Layer | Cost to add | Hallucination reduction | Latency cost | Best for |
| --- | --- | --- | --- | --- |
| Structured outputs + tool calling | 1-3 days | 30-60% | negligible | anything with a fixed schema |
| RAG with citation requirement | 1-3 weeks | 40-70% on factual queries | +200-500ms | knowledge-base, legal, medical |
| Guardrails AI / NeMo | 3-5 days | 10-30% | +50-150ms | PII, topic, format checks |
| Second-LLM verifier | 2-4 days | 20-50% catch rate | +1-3s | high-stakes outputs |
| Eval pipeline (Phoenix / Braintrust / LangSmith) | 1-2 weeks | indirect, big over time | offline | every team shipping LLMs |
| Human review on low confidence | ops setup | highest, where used | minutes to hours | legal, financial, medical |

What this looks like in legal, medical, and financial apps

The teams that absolutely cannot hallucinate ship LLM features anyway. They just architect them differently.

Legal: Casetext (now Thomson Reuters CoCounsel) and Harvey both lean hard on retrieval plus citation plus human review. Every claim cites a primary source. Refusal is the default when the corpus does not support the answer. Lawyers review before delivery to clients. The LLM accelerates the work; it does not replace the lawyer's signature.

Medical: clinical decision-support apps avoid free-form medical advice entirely. Dosing comes from FDA-approved tables retrieved by tool calls. The LLM rewrites the table into patient-facing prose, with the clinician approving every output. There is no path where the model makes up a number.

Financial: the durable pattern is "numbers from the database, prose from the model." A statement summary, a transaction explanation, an investment-allocation rationale. The LLM gets the numbers as inputs and writes around them. It never generates a number itself.

The common thread is that the LLM is the editor, not the source of truth. That framing matters more than any single technique.

Who actually owns this on your team

Mitigation work splits cleanly across engineer levels. If you are scoping the project, here is roughly how it lays out.

  • Layer 1 structural is senior or lead work. RAG architecture, retrieval tuning, structured-output schema design, and tool definitions all benefit from someone who has shipped a similar system before.
  • Layer 2 guardrails is mid-level work. Wiring up Guardrails AI, writing custom validators, and standing up a verifier model is a 1- to 2-week project for a competent mid engineer.
  • Layer 3 evals is split. A senior owns the rubric and the judge prompts. A mid runs the pipeline and integrates it into CI. A junior maintains the dataset and triages regressions.
  • Layer 4 escalation is senior or lead. Routing errors are expensive (a wrong cheap-tier answer that should have escalated, or a flood into the human queue), so the design decisions need real judgment.

Every Cadence engineer is AI-native by default, which is the relevant detail when scoping this work. The platform's voice interview specifically scores Cursor / Claude / Copilot fluency, prompt-as-spec discipline, and the verification habit, which is exactly the instinct you want from someone touching LLM code in production. There is no non-AI-native option on Cadence; the baseline is the floor.

If you are sketching the budget, Cadence weekly rates are junior $500, mid $1,000, senior $1,500, and lead $2,000. A typical hallucination-mitigation project on a young product is 2-4 weeks of senior time plus 1-2 weeks of mid time, with a junior maintaining the eval set after launch. If you want to try the approach without commitment, the Build/Buy/Book decider maps the work against your in-house options first.

What to do this week

If your team has an LLM feature in production and no real hallucination strategy, here is a one-week plan.

Pick the single user-facing surface where a hallucination would hurt most. Build a 50-question hold-out set for that surface. Add structured outputs if your model supports it; pair them with a JSON schema check. Wire a cheap second-LLM verifier on the output. Decide your refusal default and write the user-facing copy for it.

That is a week of senior work. It will not eliminate hallucinations. It will move the rate from "untracked, scary" to "measured, bounded, recoverable," which is the actual definition of production. Once that is in place, you can layer in RAG, guardrail frameworks, and eval pipelines without panic.

If you want help implementing this end-to-end, every Cadence engineer ships with verification habits baked in (it's part of the AI-native baseline that gates the platform), and the 48-hour free trial means you can prototype the eval pipeline before committing to a week.

FAQ

Can you fully eliminate LLM hallucinations in production?

No. The honest answer is that you cannot eliminate them. You can layer mitigations (RAG, structured outputs, verifiers, evals, human review) to reduce the frequency and catch most of the ones that slip through, but a non-zero rate is permanent. Plan for that.

What's the single highest-impact mitigation?

Structured outputs with JSON schema, when your task allows it. The model cannot return invalid shapes, which kills a class of failures and makes downstream parsing safe. Pair it with tool calling for facts and you have removed the easiest 30% to 60% of hallucinations before doing anything fancy.

Do I need RAG if I have a frontier model?

Yes, for any factual or proprietary-data use case. Frontier models still hallucinate on long-tail facts and cannot know your private data. RAG plus a citation requirement is the standard production pattern, and it is the difference between "the model is mostly right" and "every claim is verifiable."

How big should my eval set be?

Start with 50 to 100 hand-verified examples. Grow it to 300 to 500 over the first quarter. Bigger is better, but a small tight set you run on every change beats a 5,000-question set you never look at. The point of an eval set is to make regressions impossible to ignore.

What's a reasonable refusal rate?

5% to 15% for most B2B knowledge apps. Below 5% usually means the model is overconfident; above 20% means retrieval is broken or your threshold is too strict. Tune to the point where the business cost of a wrong answer exceeds the cost of a refusal, then live with the number.
