
Hybrid search combines BM25 (a keyword-matching algorithm that scores documents by term frequency and rarity) with vector embeddings (semantic similarity from a neural model), then fuses the two ranked lists, usually with Reciprocal Rank Fusion (RRF), and optionally re-ranks the top 50 with a cross-encoder like Cohere Rerank. The result: BM25 catches exact matches (product SKUs, error codes, names), embeddings catch paraphrases and intent, and the fusion step beats either one alone on almost every benchmark.
If you've ever shipped a search box on top of pgvector or Pinecone and watched it confidently miss a query for ECONNREFUSED 5432, you already know why pure vector search is not enough. And if you've stared at an Elasticsearch BM25 result page that returns nothing for "how do I cancel my plan" because the word "cancel" doesn't appear in the help doc (it says "end your subscription"), you know why pure keyword search is not enough either. Hybrid is the answer almost everyone converges on by their second iteration.
This post walks through the architecture, the two main fusion strategies, a working code example with score normalization, and the tool choices in 2026.
BM25 is a tuned bag-of-words scorer. It computes, for each candidate document, a score based on three signals: how often the query terms appear (term frequency), how rare those terms are across the corpus (inverse document frequency), and how long the document is (length normalization). It's been the default in Lucene, Elasticsearch, OpenSearch, and Solr for roughly a decade because it's fast, interpretable, and shockingly hard to beat for queries where the user already speaks your corpus's vocabulary.
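To make those three signals concrete, here's a minimal sketch of one common BM25 formulation in TypeScript. The `k1` and `b` defaults (1.2 and 0.75) are conventional values rather than anything this post prescribes, and the `Doc` type, the `bm25Scores` name, and the three-document corpus are made up for illustration. Real engines add tokenization, stemming, and an inverted index on top.

```typescript
// Minimal BM25 sketch over pre-tokenized documents (illustrative, not a
// drop-in replacement for Lucene's scorer).
type Doc = { id: string; tokens: string[] };

function bm25Scores(
  queryTerms: string[],
  docs: Doc[],
  k1 = 1.2, // assumed default: how quickly repeated terms stop adding score
  b = 0.75, // assumed default: strength of length normalization
): Record<string, number> {
  const n = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.tokens.length, 0) / n;
  // Deduplicate query terms so a repeated term is not counted twice.
  const terms = queryTerms.filter((t, i) => queryTerms.indexOf(t) === i);

  const scores: Record<string, number> = {};
  for (const doc of docs) {
    let score = 0;
    for (const term of terms) {
      // Term frequency: occurrences of the term in this document.
      const tf = doc.tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Inverse document frequency: rarer terms get a larger weight.
      const df = docs.filter((d) => d.tokens.indexOf(term) !== -1).length;
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      // Length normalization: longer-than-average documents are penalized.
      const lengthNorm = 1 - b + b * (doc.tokens.length / avgLen);
      score += idf * ((tf * (k1 + 1)) / (tf + k1 * lengthNorm));
    }
    scores[doc.id] = score;
  }
  return scores;
}

// Hypothetical corpus, tokenized by whitespace.
const bm25Corpus: Doc[] = [
  { id: "a", tokens: "error econnrefused on port 5432".split(" ") },
  { id: "b", tokens: "how to end your subscription".split(" ") },
  { id: "c", tokens: "connection error when starting the server".split(" ") },
];
const bm25Demo = bm25Scores(["econnrefused", "error"], bm25Corpus);
// "a" matches both terms and scores highest; "b" matches neither and scores 0.
```

Note that document "b" gets a score of exactly zero: with no shared tokens, BM25 has nothing to work with.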
The places BM25 falls apart:
You can patch some of this with synonym dictionaries, fuzzy matching, and query expansion, and teams did exactly that for a decade. Embeddings just made it tractable without hand-curating thesauri.
Vector search encodes both the query and the documents into fixed-length vectors, typically 384 to 1536 dimensions depending on the model (text-embedding-3-small is 1536, Cohere embed-v3 is 1024, BGE-base is 768), then returns the documents with the smallest cosine distance to the query vector. This is great for "what's that song with the line about a yellow submarine," and miserable for ERR_CONNECTION_REFUSED.
Specifically, embeddings struggle with:
A vector index will happily return a document with cosine similarity 0.82 to your query that contains zero of the words you searched for. Sometimes that's magic. Sometimes it's wrong in a way the user immediately notices.
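For reference, the similarity a vector index approximates is simple to state. The sketch below is a brute-force version in TypeScript, with hypothetical 3-dimensional vectors standing in for real embeddings; `cosineSimilarity` and `nearest` are illustrative names, and a production index (HNSW and friends) trades a little recall to avoid scanning every row.

```typescript
// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

type Embedded = { id: string; vector: number[] };

// Brute-force top-k: score every document, sort by similarity, keep k.
function nearest(query: number[], docs: Embedded[], k: number) {
  return docs
    .map((d) => ({ id: d.id, similarity: cosineSimilarity(query, d.vector) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}

// Toy vectors (assumed for illustration; real embeddings have hundreds of dimensions).
const toyDocs: Embedded[] = [
  { id: "a", vector: [0.9, 0.1, 0] },
  { id: "b", vector: [0, 1, 0] },
  { id: "c", vector: [0.5, 0.5, 0] },
];
const topTwo = nearest([1, 0, 0], toyDocs, 2); // "a" first, then "c"
```

Nothing in this computation looks at the words themselves, which is why a high similarity score can come back for a document that shares no terms with the query.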
The standard architecture in 2026 looks like this:
Each document is indexed twice: once as a keyword index (for example Postgres tsvector), once as a vector index (pgvector, Pinecone, Qdrant, Weaviate, Vespa, or a managed alternative like Turbopuffer).
Why RRF and not weighted score addition? Because BM25 scores and cosine similarities live in incompatible numeric ranges. BM25 outputs unbounded positive numbers (a top result might score 14.3, a bad one 0.5). Cosine similarity is bounded in [-1, 1], usually clustered in [0.3, 0.9] for sensible queries. Adding bm25_score + cosine_score gives BM25 roughly a 10x weight by accident. You can normalize, but RRF sidesteps the whole mess by looking only at the rank position, not the score magnitude.
For each document d, RRF computes:
RRF_score(d) = sum over each ranker r of: 1 / (k + rank_r(d))
where k is a smoothing constant (60 is the value from the original paper and it's robust enough that nobody really tunes it). Documents that don't appear in a ranker's top-N contribute zero. Higher is better.
Two retrievers, RRF, done. No score normalization, no per-query tuning, no weight schedule. It's the rare ML technique that just works.
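A tiny worked example makes the arithmetic easier to trust. This sketch assumes two hypothetical ranked lists of document IDs and the k = 60 constant from above; `rrf`, `keywordRanking`, and `semanticRanking` are names invented for the illustration.

```typescript
// Reciprocal Rank Fusion over lists of document IDs, best result first.
function rrf(rankedLists: string[][], k = 60): { id: string; score: number }[] {
  const totals: Record<string, number> = {};
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      totals[id] = (totals[id] || 0) + 1 / (k + rank);
    });
  }
  return Object.keys(totals)
    .map((id) => ({ id, score: totals[id] }))
    .sort((x, y) => y.score - x.score);
}

const keywordRanking = ["A", "B"]; // A is rank 1, B is rank 2
const semanticRanking = ["C", "A"]; // C is rank 1, A is rank 2
const fusedOrder = rrf([keywordRanking, semanticRanking]);
// A: 1/61 + 1/62 ≈ 0.0325 (appears in both lists)
// C: 1/61        ≈ 0.0164 (rank 1 in one list only)
// B: 1/62        ≈ 0.0161 (rank 2 in one list only)
```

Document A wins even though the semantic ranker put C above it: showing up in both lists outweighs a single first place.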
Here's a minimal hybrid retriever in TypeScript, sitting on top of Postgres with pgvector for embeddings and Postgres tsvector for BM25-ish full-text search. It returns the top 10 fused results, with optional Cohere rerank as a third pass.
```typescript
import { sql } from "./db";
import { embed } from "./embeddings"; // wraps OpenAI text-embedding-3-small
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

interface Hit { id: string; text: string; score: number; rank: number }

// Keyword leg: Postgres full-text search, ranked by ts_rank_cd.
async function bm25Search(query: string, k = 50): Promise<Hit[]> {
  const rows = await sql<{ id: string; text: string; score: number }[]>`
    SELECT id, text, ts_rank_cd(tsv, plainto_tsquery('english', ${query})) AS score
    FROM docs
    WHERE tsv @@ plainto_tsquery('english', ${query})
    ORDER BY score DESC
    LIMIT ${k}
  `;
  // Ranks are 1-based because RRF computes 1 / (k + rank).
  return rows.map((r, i) => ({ ...r, rank: i + 1 }));
}

// Semantic leg: nearest neighbors by cosine distance (pgvector's <=> operator).
async function vectorSearch(query: string, k = 50): Promise<Hit[]> {
  const qVec = await embed(query); // returns number[1536]
  // pgvector's text format is "[x,y,...]", which JSON.stringify produces for a
  // number[]; passing a string avoids depending on the driver's array handling.
  const qLiteral = JSON.stringify(qVec);
  const rows = await sql<{ id: string; text: string; score: number }[]>`
    SELECT id, text, 1 - (embedding <=> ${qLiteral}::vector) AS score
    FROM docs
    ORDER BY embedding <=> ${qLiteral}::vector
    LIMIT ${k}
  `;
  return rows.map((r, i) => ({ ...r, rank: i + 1 }));
}

// Reciprocal Rank Fusion: sum 1 / (k + rank) across every list a doc appears in.
function rrfFuse(lists: Hit[][], k = 60, topN = 20): Hit[] {
  const scores = new Map<string, { hit: Hit; score: number }>();
  for (const list of lists) {
    for (const hit of list) {
      const prev = scores.get(hit.id);
      const contribution = 1 / (k + hit.rank);
      if (prev) {
        prev.score += contribution;
      } else {
        scores.set(hit.id, { hit, score: contribution });
      }
    }
  }
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ hit, score }) => ({ ...hit, score })); // score is now the RRF score
}

export async function hybridSearch(query: string, withRerank = true) {
  // Both retrievers are independent, so run them in parallel.
  const [keyword, semantic] = await Promise.all([
    bm25Search(query, 50),
    vectorSearch(query, 50),
  ]);
  const fused = rrfFuse([keyword, semantic], 60, 20);
  if (!withRerank) return fused.slice(0, 10);

  // Optional third pass: cross-encoder rerank of the 20 fused candidates.
  const rerank = await cohere.rerank({
    model: "rerank-v3.5",
    query,
    documents: fused.map((h) => h.text),
    topN: 10,
  });
  // r.index points back into the documents array, i.e. into fused.
  return rerank.results.map((r) => fused[r.index]);
}
```
A few notes on this code, because the details matter:
- ts_rank_cd is not strict BM25; it's Postgres's cover-density ranker. For workloads under a few million rows, it's close enough. Above that, move to Elasticsearch or OpenSearch for a real BM25 implementation. The team at Cursor Rules for production teams actually has a good template for codifying these "scale up at X" rules in .cursor/rules/.
- The embed call should be cached by query hash. Re-embedding the same query is a waste.
Cohere Rerank (and its open alternatives, like BGE Reranker and Jina Reranker) is a cross-encoder: it concatenates the query and each candidate document and runs them through a transformer to produce a relevance score. This is much more accurate than bi-encoder embeddings because the model sees both pieces of text at once and can attend across them.
The trade-off is latency and cost. A bi-encoder embed is cached; a rerank call is per-query, per-candidate. Rules of thumb:
You have three architectural choices: stitch two specialized stores together, run an all-in-one engine, or stay inside Postgres. The right answer depends on corpus size, team size, and how much you want to operate.
| Approach | BM25 engine | Vector engine | Setup time | Good for | Trade-off |
|---|---|---|---|---|---|
| Postgres-only | tsvector + ts_rank_cd | pgvector + HNSW | 1 day | <5M docs, small team, RAG MVPs | Not true BM25; HNSW recall drops past 10M vectors |
| Elasticsearch / OpenSearch hybrid | Native BM25 | Native dense_vector with kNN search (since 8.0) | 3-5 days | 5M-500M docs, existing ES team | Two index types in one cluster; ops still nontrivial |
| Pinecone + Elasticsearch | Elasticsearch BM25 | Pinecone serverless | 2 days | Vector-heavy, want managed | Two vendors, two bills, latency from two round trips |
| Weaviate (all-in-one) | Built-in BM25 | Built-in HNSW | 1 day | Mid-size RAG, want one box | Less mature BM25 than Lucene; vendor lock-in |
| Vespa (all-in-one) | Native BM25 | Native HNSW or IVF | 1-2 weeks | Large scale (Spotify, Yahoo) | Steepest learning curve on this list |
| Typesense Cloud | Native BM25 | Native vector search | 1 day | Site search, e-commerce | Less flexible scoring than Elasticsearch |
| Turbopuffer | Optional BM25 | Optimized vector | 1-2 days | Cost-sensitive vector workloads | Newer, smaller community |
The single-vendor pitch (Weaviate, Vespa, Typesense) is real: one schema, one query API, one bill. The two-vendor pitch (pgvector + Elasticsearch, or Pinecone + OpenSearch) is also real: each tool is the strongest pick at its job, and you avoid betting your search layer on one company's roadmap.
If you're below 10M documents and your team already runs Postgres, just use pgvector and tsvector. Most teams who jump straight to a dedicated vector DB regret the operational overhead a year in. We covered the broader cost picture in our breakdown on the cost to add semantic search, which walks through real numbers for each path.
If you're shipping search this quarter, here's the sequence that consistently works:
nDCG@10 or just "did the right doc make top 5."
If you don't have an engineer who's shipped this before, the fastest path is to book one who has. Every engineer on Cadence is AI-native by default (Cursor, Claude Code, Copilot fluency is vetted in a voice interview before they unlock bookings), and the platform's 12,800-engineer pool means RAG and search experience is a 2-minute filter, not a 6-week recruiter loop. Search infra is exactly the kind of bounded, well-scoped project where a senior at $1,500/week on a 2-week booking ships further than a 6-week contractor search.
The other place hybrid search tends to ship cleanly is alongside agentic SaaS features; the LLM's tool-use loop almost always calls a search(query) tool, and the quality of that tool is the single biggest determinant of whether the agent works.
Decide what to build next? Run the Build / Buy / Book recommender on your search feature. It takes 2 minutes and tells you whether to build in-house, buy a managed tool, or book a vetted engineer for a 2-week sprint.
For almost every production use case, yes. The BEIR benchmark and most internal evals show 10-30% lifts in nDCG@10 from adding BM25 to a vector retriever, with the biggest gains on queries containing rare terms, identifiers, or named entities. The exception is purely semantic workloads (recommendations, "find similar images") where keyword overlap doesn't help.
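If you want to measure that lift on your own data, nDCG@10 is short enough to write yourself. The sketch below assumes you have graded relevance labels per document ID for each test query (the `labels` object here is hypothetical) and uses the linear-gain form of DCG; some libraries use 2^rel - 1 as the gain instead, so numbers may not match theirs exactly. `hitAtK` is the blunter "did the right doc make the top k" check.

```typescript
// Discounted cumulative gain over the first k gains, with a log2(rank + 1) discount.
function dcgAtK(gains: number[], k: number): number {
  let total = 0;
  for (let i = 0; i < Math.min(k, gains.length); i++) {
    // i is 0-based, so rank = i + 1 and the discount is log2(i + 2).
    total += gains[i] / (Math.log(i + 2) / Math.LN2);
  }
  return total;
}

// nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.
function ndcgAtK(
  rankedIds: string[],
  relevance: Record<string, number>,
  k = 10,
): number {
  const gains = rankedIds.map((id) => relevance[id] || 0);
  const ideal = Object.keys(relevance)
    .map((id) => relevance[id])
    .sort((x, y) => y - x);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(gains, k) / idealDcg;
}

// Simpler check: is a known-relevant document anywhere in the top k?
function hitAtK(rankedIds: string[], relevantId: string, k = 5): boolean {
  return rankedIds.slice(0, k).indexOf(relevantId) !== -1;
}

// Hypothetical labels for one query: higher means more relevant.
const labels: Record<string, number> = { a: 3, b: 2, c: 1 };
const perfect = ndcgAtK(["a", "b", "c"], labels); // ideal order, so 1
const reversed = ndcgAtK(["c", "b", "a"], labels); // worst order, below 1
```

Average `ndcgAtK` over your test queries for vector-only, BM25-only, and fused results, and compare the three numbers.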
Weighted score fusion adds normalized BM25 and cosine scores with tunable weights (alpha * bm25 + (1 - alpha) * cosine). It can outperform RRF if you tune alpha carefully per query type, but it requires score normalization (min-max or z-score) and breaks when score distributions shift. RRF only uses rank position, so it's robust without tuning. Start with RRF; only switch to weighted fusion if your eval set demands it.
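For comparison, here's roughly what the weighted alternative looks like with min-max normalization. This is a sketch under a few assumptions: `weightedFuse` and `minMax` are illustrative names, the scores are made up, and a document missing from one list is treated as scoring 0 there, which is one reasonable convention among several.

```typescript
type Scored = { id: string; score: number };

// Min-max normalization: rescale one result list's scores into [0, 1].
function minMax(hits: Scored[]): Record<string, number> {
  const out: Record<string, number> = {};
  if (hits.length === 0) return out;
  const values = hits.map((h) => h.score);
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  for (const h of hits) {
    // If every score is identical, treat them all as 1 to avoid dividing by zero.
    out[h.id] = hi === lo ? 1 : (h.score - lo) / (hi - lo);
  }
  return out;
}

// alpha * bm25 + (1 - alpha) * cosine, computed on normalized scores.
function weightedFuse(bm25: Scored[], cosine: Scored[], alpha = 0.5): Scored[] {
  const kw = minMax(bm25);
  const sem = minMax(cosine);
  const ids = Object.keys(kw).concat(
    Object.keys(sem).filter((id) => !(id in kw)),
  );
  return ids
    .map((id) => ({
      id,
      // Assumption: a doc absent from a list contributes 0 for that list.
      score: alpha * (kw[id] || 0) + (1 - alpha) * (sem[id] || 0),
    }))
    .sort((x, y) => y.score - x.score);
}

// Hypothetical raw scores on very different scales.
const bm25Hits: Scored[] = [
  { id: "a", score: 14.3 },
  { id: "b", score: 7.4 },
  { id: "c", score: 0.5 },
];
const cosineHits: Scored[] = [
  { id: "b", score: 0.82 },
  { id: "d", score: 0.6 },
  { id: "a", score: 0.31 },
];
const weighted = weightedFuse(bm25Hits, cosineHits, 0.5); // order: b, a, d, c
```

The fragility is visible in `minMax`: one outlier score in either list changes `lo` or `hi` and shifts every other document's normalized value, which is the distribution-shift problem RRF avoids.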
Postgres with pgvector and an HNSW index handles 5M to 10M vectors comfortably on a reasonable instance (32GB RAM, NVMe SSD). Above that, recall starts to degrade and query latency climbs, and a dedicated vector DB like Pinecone, Qdrant, or Turbopuffer pays for itself. Most teams overshoot here and reach for a vector DB at 100K docs, which is wasted complexity.
For RAG context selection (where you're feeding the top 5 docs to an LLM) rerank often pays for itself; the LLM is very sensitive to junk context. For user-facing search, run the A/B test. If your fused top-10 is already 80% precision, rerank gets you to 85%, which may or may not be worth 300ms of latency.
You can try. Models like text-embedding-3-large and Cohere embed-v3 are dramatically better than 2022-era embeddings, and on conversational queries they sometimes beat hybrid. But on the long tail (rare terms, exact identifiers, product codes), BM25 is still cheaper, faster, and more reliable. We have a related deep-dive on how to reduce AI coding mistakes in production that touches on the same "deterministic + probabilistic" pattern at the code level; the principle is identical.
Senior automation engineer at withRemote. Writes on CI/CD, test pyramids, and removing toil from engineering pipelines.