May 7, 2026 · 10 min read · Cadence Editorial

Building production RAG: 2026 architecture

Photo by [Brett Sayles](https://www.pexels.com/@brett-sayles) on [Pexels](https://www.pexels.com/photo/server-racks-on-data-center-5480781/)


Production RAG architecture in 2026 is hybrid retrieval (BM25 plus dense vectors), a cross-encoder reranker, contextual chunk embeddings, and an agent that can skip retrieval when the model already knows the answer. Naive vector search alone fails roughly 40% of the time at retrieval, so the modern reference stack treats retrieval as a multi-stage pipeline, not a single similarity query.

The interesting question in 2026 is not "what does production RAG look like?" That's settled. The interesting question is which third of your RAG projects shouldn't be RAG at all, now that 200k-token context windows are standard and 1M-token windows ship in Gemini and Claude.

What actually changed between 2023 and 2026

In 2023 a "RAG system" meant LangChain plus Pinecone plus text-embedding-ada-002, top-5 cosine similarity, stuff into the prompt, ship. That stack was a demo, not a system. By 2024 most teams discovered the same failure mode: retrieval recall in the 60-70% range, hallucinations on adjacent-but-wrong chunks, no evals to tell when it broke.

Three things shifted by 2026.

First, retrieval got serious. Hybrid (BM25 plus dense) is the default; pure-vector pipelines are now considered an anti-pattern outside of trivial use cases. Cross-encoder rerankers (Cohere Rerank 3, Voyage rerank-2.5, BGE reranker v2) became table stakes. Anthropic's September 2024 contextual retrieval write-up showed a 49% reduction in failed retrievals from one trick: prepending a 50-100 token chunk-level context summary before embedding. With a reranker layered on, that number hits 67%.

Second, evals got serious. Ragas, Arize Phoenix, and TruLens went from "nice to have" to "you cannot ship without one." Roughly 60% of new RAG deployments in 2026 ship with a labeled eval set on day one, up from under 30% in early 2025.

Third, long context arrived. Claude 4.5 Sonnet handles 200k tokens reliably; Gemini 2.5 Pro handles 1M. For a corpus that fits in context (most internal company wikis, most product documentation), the right answer is now "skip RAG, just paste the docs." We'll come back to this.

The 2026 reference architecture

Here's the stack that handles 80% of production cases without heroics.

Ingestion. Parse to clean markdown. Use unstructured.io or LlamaParse for PDFs; nothing else has caught up. Strip boilerplate. Preserve tables as markdown, not flattened text.

Chunking. Recursive split at 512 tokens for factoid corpora (support docs, policy), 1024 for analytical (research, contracts). Skip 128-token chunks; they fragment concepts and tank precision. Semantic chunking is worth it only for high-value documents because you pay an embedding call per sentence.
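
As a minimal sketch of that split, assuming LangChain's `RecursiveCharacterTextSplitter` with a tiktoken encoding (swap in whatever splitter you already use; the file path is a placeholder):

```python
# pip install langchain-text-splitters tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 512 tokens for factoid corpora, 1024 for analytical ones; ~50 tokens of
# overlap keeps sentences that straddle a boundary retrievable from either side.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
)

document_markdown = open("doc.md").read()  # the cleaned markdown from ingestion
chunks = splitter.split_text(document_markdown)
```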

Contextual embedding. Before embedding each chunk, prepend a one-paragraph context generated by a cheap model (Haiku 4.5 or gpt-4o-mini). The prompt: "Here is the full document. Here is one chunk. Write a 50-100 token context that situates this chunk in the document." This is the single highest-ROI change you can make in 2026, and most blog posts still don't mention it.
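
A sketch of that ingest step using the Anthropic Python SDK. The `claude-haiku-4-5` model id is an assumption for "Haiku 4.5", and the prompt wording is a lightly expanded version of the one above:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(full_document: str, chunk: str) -> str:
    """Prepend a model-written context paragraph to a chunk before embedding."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # assumed model id for "Haiku 4.5"
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": (
                "Here is the full document:\n<document>\n"
                f"{full_document}\n</document>\n\n"
                "Here is one chunk from it:\n<chunk>\n"
                f"{chunk}\n</chunk>\n\n"
                "Write a 50-100 token context that situates this chunk "
                "in the document. Reply with only the context."
            ),
        }],
    )
    context = response.content[0].text.strip()
    # Embed (and BM25-index) the contextualized string, not the raw chunk.
    return f"{context}\n\n{chunk}"
```

In practice you'd cache the full-document prefix across the per-chunk calls, which is what gets the ingest cost down to the ~$1 per million source tokens discussed below.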

Embedding. Voyage-3-large at $0.06 per 1M tokens and 1024 dimensions is the price-performance leader. OpenAI text-embedding-3-large at $0.13 per 1M and 3072 dims is fine if you're already on OpenAI. Cohere embed-v4 is competitive and supports multilingual out of the box.

Vector store. See the table below. Short version: pgvector until 1M vectors, Turbopuffer or Pinecone after.

Hybrid retrieval. Run BM25 and dense in parallel. Fuse with Reciprocal Rank Fusion (RRF). Take top 50.
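
RRF is simple enough to inline. A self-contained sketch, assuming each retriever hands back a ranked list of doc IDs, best first:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs; k=60 is the conventional RRF constant."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# bm25_ids and dense_ids come from the two retrievers run in parallel.
bm25_ids = ["d3", "d1", "d7"]
dense_ids = ["d1", "d9", "d3"]
candidates = reciprocal_rank_fusion([bm25_ids, dense_ids])[:50]
```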

Rerank. Cohere Rerank 3 (~100ms, API) or BGE reranker v2-m3 (~30ms on a small GPU) compresses top 50 down to top 5-8. This is a 10-30% precision lift for under 100ms.
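
A sketch with Cohere's Python SDK; the exact model string for Rerank 3 is an assumption, so check the current model list before shipping:

```python
# pip install cohere
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment

def rerank(query: str, docs: list[str], top_n: int = 8) -> list[str]:
    """Compress the ~50 fused candidates down to the top 5-8 for the prompt."""
    result = co.rerank(
        model="rerank-english-v3.0",  # assumed id for "Rerank 3"
        query=query,
        documents=docs,
        top_n=top_n,
    )
    return [docs[hit.index] for hit in result.results]
```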

Generation. Claude 4.5 Sonnet, GPT-5, or Gemini 2.5 Pro. Stream tokens. Cite sources inline.

Eval. Ragas for offline; Phoenix or LangSmith for live tracing. Minimum 50-100 labeled query-answer pairs; faithfulness greater than 0.9, context precision greater than 0.8.
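
A minimal offline eval sketch with Ragas. This is the classic `evaluate` interface with its conventional column names; newer releases reorganize the imports, and the library needs an LLM key configured for its judge calls:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision

# A labeled golden set: 50-100 rows like this, built from real user queries.
eval_set = Dataset.from_dict({
    "question": ["How do refunds work for EU customers?"],
    "answer": ["EU customers get a 14-day refund window..."],          # pipeline output
    "contexts": [["Chunk about the EU refund policy..."]],             # retrieved chunks
    "ground_truth": ["EU customers may return goods within 14 days."], # labeled answer
})

scores = evaluate(eval_set, metrics=[faithfulness, context_precision])
print(scores)  # gate deploys on faithfulness > 0.9, context_precision > 0.8
```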

That's the whole thing. Everything else is optimization.

Picking a vector store: honest numbers at 100k and 1M docs

Most "vector DB comparison" posts read like ad copy. Here's the actual decision matrix at two scales, assuming 1024-dim embeddings and roughly 1KB of metadata per doc.

| Store | 100k docs (cost/mo) | 1M docs (cost/mo) | Best for | Where it loses |
| --- | --- | --- | --- | --- |
| pgvector (on Supabase/RDS) | ~$25 | ~$200-400 | Teams already on Postgres, sub-1M corpora, hybrid SQL filters | HNSW recall drops past ~5M vectors; no native BM25 (use pg_trgm or ParadeDB) |
| Turbopuffer | ~$30 | ~$120 | Cost-sensitive scale; serverless; built-in BM25 | Newer, smaller ecosystem; less tooling around it |
| Pinecone (Serverless) | ~$70 | ~$500 | Teams that want zero ops; hybrid with sparse-dense indexes | Cost climbs fast; vendor lock-in |
| Weaviate (Cloud) | ~$300 | ~$1,500 | Multi-tenant SaaS, multi-vector per object, GraphQL | Most expensive of the five; overkill for single-tenant apps |
| Qdrant (self-hosted) | ~$50 (one small node) | ~$300 (one large node) | Teams that want OSS, fast filters, and will run their own infra | Ops burden; you own the upgrades |

At 100k vectors almost any choice works. The decision matters at 1M-plus, and the real winners in 2026 are pgvector (if you can stay on Postgres) and Turbopuffer (if you can't). Pinecone is still the safe enterprise pick; you just pay 3-4x for it.

The patterns most posts skip

The reference architecture above gets you to "good." The next four patterns get you to "actually shippable."

Contextual retrieval

Anthropic published this in September 2024 and it remains underused. Before embedding chunk N, prepend a context like: "This chunk discusses the refund policy for international orders, in the context of the company's 2024 returns documentation." Embed the full string. Index BM25 over the same string.

Reported results: 49% reduction in retrieval failures vs. naive embedding; 67% with a reranker on top. Cost: one Haiku call per chunk at ingest, around $1.02 per million tokens of source documents. For a 100k-doc corpus that's typically $50-100 one-time. Worth it.

Multi-query expansion and rewriting

User queries are short and ambiguous. "How do refunds work for EU customers?" Don't embed it directly. Use a small model to generate 3-5 paraphrased variants (different vocabulary, broader and narrower scope), retrieve for each, fuse results. Adds 100-200ms and roughly doubles recall on ambiguous queries.
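
A sketch of the expansion step, again assuming a Haiku-class model id; fuse the per-variant results with the RRF helper shown earlier:

```python
import anthropic

client = anthropic.Anthropic()

def expand_query(query: str, n: int = 4) -> list[str]:
    """Generate paraphrased variants with different vocabulary and scope."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # assumed small-model id
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Rewrite this search query {n} ways, one per line, "
                       f"varying vocabulary and scope:\n{query}",
        }],
    )
    variants = [ln.strip() for ln in response.content[0].text.splitlines() if ln.strip()]
    return [query] + variants[:n]

# Retrieve for each variant, then fuse with reciprocal_rank_fusion() from above.
```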

The related move is HyDE (Hypothetical Document Embeddings): ask the model to draft a hypothetical answer first, then embed that and search. Costs a model call but improves complex-query retrieval significantly.
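
A HyDE sketch under the same model-id assumption; the point is that you embed the drafted passage, not the raw query:

```python
import anthropic

client = anthropic.Anthropic()

def hyde_query(query: str) -> str:
    """Draft a hypothetical answer, then embed *that* for dense retrieval."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # assumed small-model id
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Write a short, plausible documentation passage that "
                       f"answers this question:\n{query}",
        }],
    )
    return response.content[0].text  # embed this string instead of the query
```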

Structured retrieval (BM25 over JSON)

Half your queries aren't really semantic. "Show me all tickets from last week with priority high." Don't embed that, query it. The pattern: keep your structured fields (dates, enums, IDs, tags) in normal indexes; use vector search only for the genuinely fuzzy parts. BM25 over JSON metadata covers a surprising amount of ground.

With OpenAI-style function calling wired up, the model can route each query itself: "this needs SQL, that needs vector, this one needs both."
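
A sketch of that routing with the OpenAI chat completions tools API. Both tool names and their schemas are hypothetical placeholders for your own backends, and the model id follows the post's naming:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()

tools = [
    {  # hypothetical tool: exact/structured lookups go to SQL, not the vector index
        "type": "function",
        "function": {
            "name": "query_tickets",
            "description": "Filter tickets by structured fields (date, priority, tag).",
            "parameters": {
                "type": "object",
                "properties": {
                    "since": {"type": "string", "description": "ISO date lower bound"},
                    "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                },
            },
        },
    },
    {  # hypothetical tool: genuinely fuzzy questions go to hybrid vector search
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Semantic search over the documentation corpus.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-5",  # substitute your deployed model
    messages=[{"role": "user",
               "content": "Show me all tickets from last week with priority high"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # expect query_tickets, not search_docs
```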

Agentic RAG with a skip-retrieval tool

The 2026 move: give your agent a search_docs tool, but also explicitly tell it that for general knowledge it should answer directly. Most well-known facts ("what's the capital of France") don't need retrieval. Roughly 20-40% of queries in a typical product support corpus are answerable from base model knowledge; running retrieval on them adds latency, cost, and noise.

The agent loop becomes: classify, retrieve only if needed, optionally retrieve again with a refined query, answer. This is more expensive per query (multiple model calls) but ships better answers on ambiguous inputs.
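
A condensed sketch of that loop with Anthropic tool use. The model id and the `search_docs` wiring are assumptions; your search function should return the retrieved chunks as a string:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "Answer directly from your own knowledge when you are confident "
    "(well-known facts). Call search_docs only for product-specific or "
    "time-sensitive questions. You may search again with a refined query "
    "if the first results look weak."
)

TOOLS = [{
    "name": "search_docs",
    "description": "Hybrid search over the support corpus; returns the top chunks.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def answer(question: str, search_docs) -> str:
    """Agent loop: classify, retrieve only if needed, optionally refine, answer."""
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # assumed id for "Claude 4.5 Sonnet"
            max_tokens=1024,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # Run each requested search and hand the chunks back to the model.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": search_docs(block.input["query"]),
            }
            for block in response.content
            if block.type == "tool_use"
        ]})
```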

Cost math at 100k and 1M docs

Here's a realistic monthly bill for a customer-support RAG handling 100k user queries per month, assuming hybrid retrieval, contextual embedding at ingest, Cohere Rerank 3, and Claude 4.5 Sonnet generation.

| Line item | 100k docs | 1M docs |
| --- | --- | --- |
| Embedding (Voyage-3-large, ingest) | $90 (one-time) | $900 (one-time) |
| Context generation (Haiku, ingest) | $80 (one-time) | $800 (one-time) |
| Vector store (Turbopuffer) | $30/mo | $120/mo |
| Query embeddings (100k queries) | $1/mo | $1/mo |
| Rerank (Cohere, 100k queries × top 50) | $200/mo | $200/mo |
| Generation (Claude 4.5 Sonnet, 100k queries, ~3k tokens each) | $900/mo | $900/mo |
| Evals (Phoenix, self-hosted) | $50/mo | $50/mo |
| Recurring monthly | ~$1,180 | ~$1,270 |

The recurring cost barely moves between 100k and 1M docs because generation dominates. The one-time ingest costs scale linearly. Most teams over-spend on the vector store and under-spend on rerank and evals; the table above shows why that's backwards.

When you should skip RAG entirely

This is the most contrarian point in the post and the least controversial inside teams that have shipped a few of these.

If your full corpus fits in 200k tokens (around 500 pages of text), you probably don't need RAG. Paste the whole thing in the system prompt with prompt caching enabled. Claude 4.5's prompt cache is $0.30 per million tokens cached; a 150k-token system prompt costs $0.045 per cached read. You'll often beat your RAG pipeline on quality, latency, and cost.

If your corpus is 200k to 1M tokens and your model is Gemini 2.5 Pro, same answer: in-context with caching wins for most workloads.
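
A sketch of that long-context baseline with Anthropic prompt caching. The model id is an assumption, and the file path is a placeholder; the `cache_control` marker is what makes subsequent reads of the corpus prefix bill at the cached rate:

```python
import anthropic

client = anthropic.Anthropic()
docs = open("all_docs.md").read()  # entire corpus; must fit in the context window

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed id for "Claude 4.5 Sonnet"
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": f"Answer from these documents only:\n\n{docs}",
        "cache_control": {"type": "ephemeral"},  # cache the corpus prefix
    }],
    messages=[{"role": "user", "content": "How do refunds work for EU customers?"}],
)
print(response.content[0].text)
```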

RAG genuinely earns its keep when:

  • The corpus is larger than 1M tokens
  • The corpus changes faster than you can refresh prompt caches
  • You need verifiable citations to specific source chunks
  • You have multi-tenant data isolation requirements (per-customer indexes)
  • Latency is dominated by retrieval, not generation, and you can pre-filter aggressively

If none of those apply, build the long-context version first. It's a weekend of work, costs less, and gives you a baseline to beat. This is the sort of working style change that separates 2023 RAG thinking from 2026 RAG thinking: the model is the system, retrieval is one of its tools, not the architecture.

What to do this week

  1. If you're starting fresh: pgvector or Turbopuffer, hybrid (BM25 + dense), contextual embedding at ingest, Cohere Rerank 3, Ragas eval set with 50 labeled pairs. Ship that. Don't add more until evals say you need to.
  2. If you have a naive RAG in production: add the reranker first (biggest single quality lift), then contextual retrieval, then evals. In that order.
  3. If your corpus is small enough: stop, try the long-context version with prompt caching, compare quality and cost. You may not need RAG.
  4. Set up Phoenix or LangSmith for live tracing. Without observability you cannot tell when retrieval is the bug.

If you're scoping this build and want a concrete recommendation on what to build vs. what to buy vs. who should own it, run your spec through the Build/Buy/Book tool. It returns a one-paragraph recommendation tuned for your stage.

Where Cadence fits

This stack is harder to staff than to design. The teams who ship RAG well in 2026 are the ones whose engineers reach for Cursor for multi-file refactors, Claude Code for the long-running ingestion scripts, and the right model for the right step without thinking about it. That working style is the baseline, not the upgrade path.

Every engineer on Cadence is AI-native by default. The voice interview specifically scores Cursor, Claude Code, and Copilot fluency, prompt-as-spec discipline, multi-step prompt-ladder thinking, and verification habits before an engineer unlocks bookings. There is no non-AI-native option on the platform; we don't run two pools.

For a production RAG build, a Senior at $1,500/week typically owns the architecture (chunking strategy, eval design, vector store choice) and a Mid at $1,000/week handles the ingestion pipeline, observability, and on-call. A Lead at $2,000/week is overkill unless you're shipping multi-tenant or hitting 10M-plus vectors. Weekly billing, 48-hour free trial, replace any week. Two minutes to spec, vetted shortlist back the same hour.

If you'd rather not assemble this stack alone, book an engineer for the week.

FAQ

What's the best vector database for production RAG in 2026?

For most teams: pgvector if you're already on Postgres and under 1M vectors, Turbopuffer if you need cost-efficient scale, Pinecone if you want zero ops and don't mind paying 3-4x. Weaviate and Qdrant are good but rarely the right first pick.

Do I still need RAG if I'm using Claude or Gemini with a 1M-token context?

Often no. If your corpus fits in context and doesn't change faster than you can refresh prompt caches, paste the whole thing in the system prompt with caching enabled. RAG earns its keep when the corpus exceeds 1M tokens, changes constantly, or needs verifiable citations.

What is contextual retrieval and why does it matter?

Contextual retrieval (introduced by Anthropic in September 2024) prepends a 50-100 token document-level context to each chunk before embedding. It reduces failed retrievals by 49% on its own and by 67% combined with a reranker. It's the single highest-ROI improvement you can make in 2026 and remains underused.

How much does production RAG cost at 1M documents?

For 100k user queries per month: roughly $1,200-1,500/month recurring (generation dominates) plus $1,500-2,000 one-time for ingest embedding and contextual generation. Most teams overpay for the vector store and underpay for reranking and evals.

What's the difference between agentic RAG and standard RAG?

Standard RAG always retrieves before generating. Agentic RAG gives the model a search_docs tool and lets it decide whether to retrieve, retrieve again with a refined query, or answer directly from base knowledge. It's more expensive per call but handles ambiguous queries and well-known facts better.

Should I use Ragas or Phoenix for evals?

Both. Ragas is the standard for offline metric calculation (faithfulness, context precision, answer relevance). Phoenix or LangSmith handle live tracing in production. You want offline evals on a 50-100 query golden set and live tracing on every production call.
