May 8, 2026 · 11 min read · Cadence Editorial

Cost to add RAG to a SaaS app

Photo by [Negative Space](https://www.pexels.com/@negativespace) on [Pexels](https://www.pexels.com/photo/blue-and-green-pie-chart-97080/)


Adding RAG to a SaaS app in 2026 typically costs $8,000 to $100,000 in build cost, plus $50 to $5,000 per month in infrastructure. The cheap end is a basic Q&A bot over your docs (6 to 10 engineer-weeks). The expensive end is multi-tenant retrieval with permissions, rerankers, and a real eval pipeline (30 to 50 engineer-weeks). Per-query cost runs $0.005 to $0.01 at production scale.

That spread is large because "add RAG" means three very different products depending on what you actually ship. A search box that summarizes your help center is not the same project as a per-customer assistant that respects row-level permissions, falls back gracefully when retrieval misses, and gets evaluated on every release. Below we break the cost into engineer-weeks, infra dollars, and per-query economics, then show what to skip and what to pay for.

What "RAG" actually means at three scope tiers

Before pricing anything, name the scope. The cost gap between tiers is roughly 2 to 3x at each step.

Tier 1: Basic Q&A on your docs. A chatbot that answers questions about your public help content or a single uploaded knowledge base. One tenant (effectively), no permissions, no streaming citations, no eval harness. You ingest a few thousand chunks, embed them once, run cosine search, stuff the top 5 into a Claude or GPT prompt, and call it shipped.
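Here is roughly what that Tier 1 loop looks like in code. A sketch, not production code: the OpenAI SDK calls are real, but `CHUNKS` is a placeholder for your pre-chunked corpus and the model names are just the cheap defaults.

```python
# Minimal Tier 1 RAG: embed the corpus once, cosine-search, stuff top 5 into a prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()          # reads OPENAI_API_KEY from the environment
CHUNKS = ["...", "..."]    # placeholder: your pre-chunked docs

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

corpus_vecs = embed(CHUNKS)  # one-time cost; persist these in practice

def answer(question: str) -> str:
    q = embed([question])[0]
    top5 = np.argsort(corpus_vecs @ q)[-5:][::-1]  # cosine == dot product on unit vectors
    context = "\n\n".join(CHUNKS[i] for i in top5)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```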

Tier 2: Multi-tenant with permissions. Each customer in your SaaS app gets their own corpus. Searches must respect row-level access (user A cannot see tenant B's data, and inside a tenant, free users cannot see paid-only content). You add hybrid search (BM25 plus vectors), a reranker, query rewriting, and conversation memory. This is the realistic shape of "RAG inside a B2B SaaS product."

Tier 3: Production-grade with evals and observability. Tier 2 plus a regression eval suite (RAGAS or custom), tracing per query, grounding/hallucination scores, A/B testing of retrieval changes, and a feedback loop where thumbs-down events get triaged. This is what production AI teams at companies like Notion, Linear, and Intercom actually run.

| Scope tier | Engineer-weeks | Build cost (Cadence) | Build cost (US agency) | Timeline |
| --- | --- | --- | --- | --- |
| Basic Q&A on docs | 6 to 10 | $8k to $15k | $40k to $80k | 2 to 4 weeks |
| Multi-tenant w/ permissions | 12 to 20 | $20k to $40k | $90k to $200k | 5 to 9 weeks |
| Production-grade w/ evals | 30 to 50 | $50k to $100k | $200k to $500k | 12 to 20 weeks |
| Cadence (any tier) | AI-native devs ship 2-3x faster | $500 to $2,000/wk | N/A | 48-hour trial, ship weekly |

The Cadence column uses our weekly pricing: junior $500, mid $1,000, senior $1,500, lead $2,000. Most teams run a senior plus a mid for a multi-tenant build, which is $2,500 per week against an agency line item that often crosses six figures.

The real cost line items

Below is what the dollars buy. Skip any item that doesn't match your scope.

Embedding model

You pay per million tokens to convert text into vectors. The price spread between models looks dramatic, but the dollars at this layer are trivial; the choice matters for retrieval quality.

| Model | Price per 1M tokens | Dimensions | Notes |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | $0.02 | 1,536 | Cheap default, good baseline |
| OpenAI text-embedding-3-large | $0.13 | 3,072 | Highest quality from OpenAI, 6x cost |
| Voyage-3 | $0.06 | 1,024 | Often beats OpenAI on retrieval quality, 2x storage savings |
| Cohere Embed-4 | $0.12 | variable | Strong for multilingual + long docs |
| Mistral Embed | ~$0.01 | 1,024 | Cheapest credible option |

For a corpus of 5 million tokens (roughly 10,000 pages of docs), you embed once for $0.10 to $0.65. That's the corpus cost. Query embeddings are the recurring bill: at 100,000 queries per month with 50-token queries, you spend about $0.10 to $0.65 per month on query embeddings. Trivial.
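If you want to plug in your own numbers, the arithmetic is one line; the prices here are the per-1M-token rates from the table above.

```python
# Back-of-envelope embedding spend, using $/1M-token rates from the table above.
PRICES = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13, "voyage-3": 0.06}

def embed_cost(tokens: int, model: str) -> float:
    return tokens / 1_000_000 * PRICES[model]

print(embed_cost(5_000_000, "voyage-3"))     # one-time corpus embed: $0.30
print(embed_cost(100_000 * 50, "voyage-3"))  # monthly query embeds:  $0.30
```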

The interesting decision is not "which embedding API is cheapest." It is "do I run text-embedding-3-large or Voyage-3," because the difference shows up in retrieval quality, which shows up in reranker hit rate, which shows up in your generation cost. Use Voyage-3 by default in 2026 unless you already have an OpenAI contract.

Vector database

This is where teams overspend. The honest picture:

| Option | Cost at 1M vectors | Cost at 10M vectors | Best for |
| --- | --- | --- | --- |
| pgvector (existing Postgres) | ~$0 incremental | ~$50/mo (RAM) | Anyone already on Postgres, < 5M vectors |
| Turbopuffer | ~$10/mo | ~$80/mo | Object-storage backed, cheap at scale |
| Pinecone Serverless | ~$50/mo | ~$300/mo | Managed, fast, mature |
| Qdrant Cloud | ~$30/mo | ~$200/mo | Self-host friendly, strong filters |
| Weaviate Cloud | ~$25/mo | ~$250/mo | Hybrid search built-in |

If you already run Postgres (most B2B SaaS does), start with pgvector. It is free, supports the same SQL filters you use for everything else, and handles up to a few million vectors fine on a $50/month RDS box. The "use a vector DB" advice you read in 2023 is mostly outdated; pgvector caught up. We covered the full breakdown in the best vector database for production, which is worth reading before you sign a Pinecone contract.

Switch to a dedicated vector store only when (a) you cross 5 to 10 million vectors, (b) you have a tight latency budget (under 50ms) at high concurrency, or (c) you need features pgvector lacks, like binary quantization at scale.
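Tenant isolation in pgvector is just SQL. A sketch using psycopg, assuming a hypothetical `chunks(tenant_id, content, embedding)` table with an HNSW index; `<=>` is pgvector's cosine distance operator.

```python
# Tenant-scoped vector search in plain Postgres via pgvector.
import psycopg

SQL = """
SELECT content, embedding <=> %(q)s::vector AS distance
FROM chunks
WHERE tenant_id = %(tenant)s          -- row-level isolation is one WHERE clause
ORDER BY embedding <=> %(q)s::vector  -- cosine distance, served by the HNSW index
LIMIT 50                              -- wide candidate set to feed the reranker
"""

def search(conn: psycopg.Connection, tenant_id: str, query_vec: list[float]):
    with conn.cursor() as cur:
        # str() of a Python list ("[0.1, 0.2, ...]") parses as a pgvector literal
        cur.execute(SQL, {"q": str(query_vec), "tenant": tenant_id})
        return cur.fetchall()
```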

Reranker

A reranker (Cohere Rerank, Voyage Rerank, Jina) takes the top 50 results from your vector search and re-scores them with a cross-encoder. This is the single highest-impact retrieval improvement; it typically lifts top-3 hit rate by 15 to 30 percentage points.

Cohere Rerank 3 costs $2 per 1,000 queries. At 100,000 queries per month, that's $200. Voyage Rerank-2 is similar. Skip the reranker on Tier 1 (basic Q&A); add it on Tier 2 and above. The math is unambiguous once you're past hobby scale.
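Wiring it in is a few lines with the Cohere SDK. A sketch; check the current model name against Cohere's docs before shipping.

```python
# Cross-encoder rerank: re-score the top-50 vector hits, keep the best few.
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    resp = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=candidates, top_n=top_n)
    return [candidates[r.index] for r in resp.results]
```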

LLM generation

The biggest variable in your monthly bill. With Claude Haiku 4.5 at roughly $1 per million input tokens and $5 per million output tokens, a typical RAG call (3,000 input tokens of context, 300 output tokens) costs $0.0045. At 100,000 queries per month, that's $450. At 1 million queries, $4,500.

Use a smaller model (Haiku, GPT-4o-mini, Mistral Small) for most queries; route to the bigger model only when retrieval confidence is low or the user explicitly asks for deeper reasoning. This routing decision typically halves your generation bill.
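The router itself is trivial; the hard part is picking the threshold. A sketch where the 0.5 cutoff and the model IDs are assumptions you'd tune against your own traffic:

```python
# Route cheap by default, escalate only when retrieval looks weak.
CHEAP_MODEL = "claude-haiku-4-5"    # hypothetical IDs; use your provider's
STRONG_MODEL = "claude-sonnet-4-5"

def pick_model(rerank_scores: list[float], wants_depth: bool) -> str:
    weak_retrieval = max(rerank_scores, default=0.0) < 0.5  # threshold: tune it
    if weak_retrieval or wants_depth:
        return STRONG_MODEL  # pay more per token, but only on the hard queries
    return CHEAP_MODEL       # covers the bulk of traffic
```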

Eval and observability

LangSmith, Arize Phoenix, or Braintrust run $200 to $1,500 per month depending on trace volume. You don't need this on Tier 1. On Tier 3, it is the difference between "we changed retrieval and silently broke 12% of queries" and "we caught the regression on the eval suite before merge."

Per-query cost math at three traffic levels

Here is what a Tier 2 system actually costs to operate at three sizes. Assumes pgvector, Voyage-3 embeddings, Cohere Rerank, Claude Haiku 4.5 for generation.

| Component | 1,000 queries/mo | 100,000 queries/mo | 1,000,000 queries/mo |
| --- | --- | --- | --- |
| Query embeddings (Voyage-3) | $0.003 | $0.30 | $3.00 |
| Vector search (pgvector on existing DB) | $0 | $0 | ~$50 (bigger box) |
| Reranker (Cohere) | $2 | $200 | $2,000 |
| LLM generation (Haiku 4.5, 3.3k tokens avg) | $4.50 | $450 | $4,500 |
| Observability | $0 | $200 | $800 |
| Per-query cost | $0.0065 | $0.0085 | $0.0073 |
| Total monthly | ~$7 | ~$850 | ~$7,350 |

A few notes that the typical "RAG cost" article skips:

  • Per-query cost is roughly flat across scale because reranker and LLM dominate, and both are linear in volume.
  • The fixed costs (vector DB, observability) are negligible at scale. Pinecone's $50/mo base is irrelevant when you're spending $4,500 on generation.
  • If you cache identical queries (a Redis-backed semantic cache cuts ~20% of traffic), you can shave 15 to 20% off the bill.
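To sanity-check the table, or rerun it with your own prices, the variable-cost model fits in a few lines; these are the same unit prices assumed throughout this post.

```python
# Variable costs behind the table above (fixed costs like observability excluded).
def monthly_cost(queries: int) -> dict[str, float]:
    embed = queries * 50 / 1e6 * 0.06            # Voyage-3, 50-token queries
    rerank = queries / 1_000 * 2.0               # Cohere Rerank, $2 per 1k queries
    llm = queries * (3000 * 1 + 300 * 5) / 1e6   # Haiku 4.5: $1/M in, $5/M out
    total = embed + rerank + llm
    return {"total": round(total, 2), "per_query": round(total / queries, 4)}

print(monthly_cost(100_000))  # {'total': 650.3, 'per_query': 0.0065}; add ~$200 observability ≈ $850
```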

What blows up the budget

Eight things eat budgets in real RAG projects. Pattern-match to your situation and decide which to pay for.

  1. Re-embedding when you change models. Switching from OpenAI to Voyage-3 means re-embedding the entire corpus. At 100M tokens, that's a $6 to $13 spend, but the engineer-week to run the migration safely is the real cost.
  2. Ingesting weird formats. PDFs with tables, scanned images, slide decks, and ZIP archives all need OCR, layout parsing, and chunking heuristics. This alone is 2 to 4 engineer-weeks for any non-trivial corpus.
  3. Permissions and tenancy. Filtering by tenant_id is easy; filtering by per-document ACLs that update in real time is hard. Plan 2 to 3 weeks if your SaaS has a complex sharing model.
  4. Hybrid search. Combining BM25 (keyword) with vector results helps a lot for product names, error codes, and acronyms. About a week to wire up cleanly; see the fusion sketch after this list.
  5. Streaming citations. Showing the source paragraph next to each generated claim, with click-to-jump, takes a week of careful UI work.
  6. Eval pipeline. A real eval suite with golden questions, judge prompts, and CI integration is 1 to 2 weeks if you've built one before, 4 to 6 if you haven't.
  7. Hallucination guardrails. Grounding checks, refusal patterns, and confidence thresholds. Easy to add badly, hard to add well.
  8. Latency budget under 1 second. If your product needs sub-second responses, you'll spend a week tuning chunk sizes, top-k, and reranker depth. Most teams pick wrong on the first pass.
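On item 4: the standard way to merge BM25 and vector result lists is reciprocal rank fusion, which sidesteps calibrating two incompatible score scales. A minimal sketch (k=60 is the conventional constant):

```python
# Reciprocal rank fusion: merge two ranked lists by rank, not raw score.
def rrf(bm25_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: rrf(bm25_top50, vector_top50)[:50] is what you feed the reranker.
```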

We've watched founders try to ship all eight in the first sprint. None of them shipped. Pick three for V1.

What "AI-native engineers" actually changes here

RAG is the textbook case where AI-native developer fluency is worth real money. A senior engineer who lives in Cursor and Claude Code can scaffold a Tier 1 RAG in two days, including ingestion, retrieval, generation, and a basic Streamlit eval UI. Without that fluency, the same scope takes a week of stitching together LangChain tutorials.

Every engineer on Cadence is AI-native by baseline, vetted on Cursor, Claude Code, and Copilot fluency in a voice interview before they unlock bookings. We don't sell that as a premium tier; it's the only kind of engineer on the platform. The pool is 12,800 deep with a 67% trial-to-active conversion rate, so when you book a senior at $1,500/week to build out Tier 2 RAG, you're getting someone who has shipped this exact pattern before.

The honest comparison: a $200/hour US agency working on retainer will ship the same Tier 2 system, slowly, in 12 weeks for $90k+. A senior on Cadence ships in 6 to 8 weeks for $9k to $12k. The gap is real because AI-native engineers spend less time writing boilerplate and more time on the parts that matter (retrieval quality, eval design, permissions). For the broader picture of how that affects production architecture decisions, the production RAG architecture breakdown goes deeper.

How to spend the budget

If you have $20k to spend, do this:

  1. Week 1 to 2: Ship Tier 1 (basic Q&A on your docs). Use pgvector, Voyage-3, Claude Haiku, no reranker. Measure top-3 retrieval hit rate on 50 hand-written questions (sketch after this list).
  2. Week 3 to 5: Add reranker, hybrid search, and per-tenant filtering. Now you're at Tier 2.
  3. Week 6 to 8: Add the eval pipeline (RAGAS or Braintrust), streaming citations, and one round of chunking tuning based on what the eval reveals.
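The hit-rate measurement in step 1 is small enough to write by hand. A sketch, where `retrieve` is your own top-k search function, the golden pairs are hypothetical examples, and the `.source` attribute stands in for whatever identifier your chunks carry:

```python
# Top-3 retrieval hit rate over a hand-written golden set.
GOLDEN = [
    {"question": "How do I rotate an API key?", "expected_doc": "docs/api-keys.md"},
    # ... ~50 hand-written pairs
]

def top3_hit_rate(retrieve) -> float:
    hits = sum(
        1 for case in GOLDEN
        if case["expected_doc"] in [c.source for c in retrieve(case["question"], k=3)]
    )
    return hits / len(GOLDEN)

# Track this per release; a drop is a retrieval regression caught before merge.
```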

That's 6 to 8 engineer-weeks. At Cadence senior pricing ($1,500/week), you're at $9k to $12k cash out, with the option to replace the engineer any week if pace slips. If you're sizing this against full-time hires or agency retainers, the framing in the cost to migrate JavaScript to TypeScript post covers the same engineer-week math for refactor-style projects.

If you want a Build-vs-Buy-vs-Book recommendation tailored to your stack, run the decide tool; it takes about 90 seconds and asks the right questions about scope and existing infrastructure. If you're already past the deciding phase and need to ship, book a senior on Cadence and use the 48-hour free trial to scope the work before you commit a dollar.

Try Cadence: book a vetted AI-native engineer in 2 minutes, run a 48-hour free trial, ship Tier 1 RAG in your first week. Weekly billing. Replace any week. Cancel any time.

FAQ

How long does it take to add RAG to a SaaS app?

Two to four weeks for a basic Q&A bot, five to nine weeks for multi-tenant with permissions, and twelve to twenty weeks for a production-grade system with evals. The biggest variance driver is whether your team has shipped RAG before; AI-native engineers cut these timelines roughly in half.

Should I use pgvector or a dedicated vector database?

Start with pgvector if you're on Postgres and have under 5 million vectors. It's free, plays nicely with your existing SQL filters, and removes one operational dependency. Move to Pinecone, Turbopuffer, or Qdrant when you need sub-50ms latency at high concurrency or cross 10 million vectors.

What's the cheapest production RAG stack in 2026?

pgvector for storage, Voyage-3 for embeddings, no reranker (or Cohere Rerank if quality matters), and Claude Haiku 4.5 for generation. At 100,000 queries per month, this comes to about $850 all-in, dominated by LLM generation cost.

Do I need a vector database, or can I use plain Postgres?

For most B2B SaaS use cases under 5 million vectors, pgvector inside your existing Postgres is the right call. The "you must use a dedicated vector DB" advice from 2023 reflects an earlier landscape; pgvector with HNSW indexes is fast enough for the vast majority of products.

What's the per-query cost of a typical RAG system?

Roughly $0.005 to $0.010 per query at production scale, dominated by LLM generation. If you cache repeat queries and route to a smaller model when confidence is high, you can drive that under $0.003. Embedding and vector search costs are usually negligible.

Can I add RAG without rewriting my SaaS app?

Yes. RAG is an additive feature: a new endpoint that takes a query, retrieves chunks, calls an LLM, and returns a streamed answer. You don't restructure existing code. The integration work is mostly in your front end (chat UI, streaming, citations) and your data ingestion pipeline (which docs sync, when, with what metadata).
