May 24, 2026 · 11 min read · By Deeksha Durgesh

Hybrid search: combining BM25 with embeddings

Hybrid search combines BM25 (a keyword-matching algorithm that scores documents by term frequency and rarity) with vector embeddings (semantic similarity from a neural model), then fuses the two ranked lists, usually with Reciprocal Rank Fusion (RRF), and optionally reranks the top 20-50 with a cross-encoder like Cohere Rerank. The result: BM25 catches exact matches (product SKUs, error codes, names), embeddings catch paraphrases and intent, and the fusion step beats either one alone on almost every benchmark.

If you've ever shipped a search box on top of pgvector or Pinecone and watched it confidently miss a query for ECONNREFUSED 5432, you already know why pure vector search is not enough. And if you've stared at an Elasticsearch BM25 result page that returns nothing for "how do I cancel my plan" because the word "cancel" doesn't appear in the help doc (it says "end your subscription"), you know why pure keyword search is not enough either. Hybrid is the answer almost everyone converges on by their second iteration.

This post walks through the architecture, the two main fusion strategies, a working code example with RRF fusion and optional rerank, and the tool choices in 2026.

Why BM25 alone breaks on natural-language queries

BM25 is a tuned bag-of-words scorer. It computes, for each candidate document, a score based on three signals: how often the query terms appear (term frequency), how rare those terms are across the corpus (inverse document frequency), and how long the document is (length normalization). It's been the default in Lucene, Elasticsearch, OpenSearch, and Solr for over a decade because it's fast, interpretable, and shockingly hard to beat for queries where the user already speaks your corpus's vocabulary.
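
To make those three signals concrete, here is a minimal sketch of a BM25 scorer. It is illustrative only: the tokenizer is a naive lowercase split with no stemming or stop words, the idf form and the k1 = 1.2, b = 0.75 defaults follow the common Lucene-style formulation, and tokenize and bm25Scores are hypothetical names, not part of any library.

```typescript
// Naive tokenizer (an assumption for this sketch): lowercase, split on non-alphanumerics.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter((t) => t.length > 0);
}

// Returns one BM25 score per document, in the same order as `docs`.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const n = tokenized.length;
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / Math.max(n, 1);

  // Document frequency: in how many documents each term appears.
  const df = new Map<string, number>();
  for (const doc of tokenized) {
    new Set(doc).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  }

  const queryTerms = Array.from(new Set(tokenize(query)));
  return tokenized.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length; // term frequency
      if (tf === 0) continue;
      const nq = df.get(term) ?? 0;
      const idf = Math.log(1 + (n - nq + 0.5) / (nq + 0.5)); // rarity across the corpus
      const lengthNorm = 1 - b + b * (doc.length / avgLen); // length normalization
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}

const corpus = [
  "how to end your subscription",
  "ECONNREFUSED 5432 when connecting to postgres",
  "reset your password",
];
console.log(bm25Scores("ECONNREFUSED 5432", corpus)); // only the second document scores above zero
console.log(bm25Scores("cancel my plan", corpus)); // every score is zero: no shared vocabulary
```

The second query is the failure mode in miniature: the right document exists, but none of the query terms appear in it, so its score is exactly zero.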

The places BM25 falls apart:

  • Synonyms and paraphrase: "cancel subscription" vs. "end my plan."
  • Misspellings: "stipe payment failed."
  • Intent queries: "best laptop for video editing under 2k."
  • Cross-language queries, if the corpus is multilingual.

You can patch some of this with synonym dictionaries, fuzzy matching, and query expansion, and teams did exactly that for a decade. Embeddings just made it tractable without hand-curating thesauri.

Why embeddings alone break on exact-match queries

Vector search encodes the query and every document into a fixed-length vector, typically a few hundred to a few thousand dimensions depending on the model (text-embedding-3-small is 1536, Cohere embed-v3 is 1024, BGE-base is 768), then returns the documents with the smallest cosine distance to the query vector. This is great for "what's that song with the line about a yellow submarine," and miserable for ERR_CONNECTION_REFUSED.

Specifically, embeddings struggle with:

  • Exact identifiers: SKUs, error codes, version numbers, function names.
  • Rare proper nouns that the embedding model wasn't trained on.
  • Negation: "shirts that are not red" and "red shirts" produce very similar vectors.
  • Recall on long documents, where pooling the whole text into a single vector washes out the relevant span.

A vector index will happily return a document with cosine similarity 0.82 to your query that contains zero of the words you searched for. Sometimes that's magic. Sometimes it's wrong in a way the user immediately notices.
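
The ranking step behind that behavior is small enough to sketch. The following is a brute-force version under toy assumptions: the 3-dimensional vectors are made up for illustration (real embeddings come from a model), and cosineSimilarity and nearest are hypothetical helper names. A real index such as HNSW approximates the same ordering without scanning every row.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

interface EmbeddedDoc { id: string; embedding: number[] }

// Exhaustive nearest-neighbour search by cosine distance (1 - cosine similarity).
function nearest(queryVec: number[], docs: EmbeddedDoc[], k: number) {
  return docs
    .map((d) => ({ id: d.id, distance: 1 - cosineSimilarity(queryVec, d.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Toy vectors, invented for this example.
const toyDocs: EmbeddedDoc[] = [
  { id: "close", embedding: [0.9, 0.1, 0] },
  { id: "orthogonal", embedding: [0, 1, 0] },
  { id: "opposite", embedding: [-1, 0, 0] },
];
console.log(nearest([1, 0, 0], toyDocs, 2).map((h) => h.id)); // "close" first, then "orthogonal"
```

Nothing in this computation looks at the words themselves, which is why a high similarity score says nothing about whether the query's exact identifier appears in the document.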

The hybrid recipe (BM25 + embeddings + RRF, optional rerank)

The standard architecture in 2026 looks like this:

  1. Index the corpus twice: once as a BM25 inverted index (Elasticsearch, OpenSearch, Typesense, Vespa, or Postgres with tsvector), once as a vector index (pgvector, Pinecone, Qdrant, Weaviate, Vespa, or a managed alternative like Turbopuffer).
  2. At query time, fire both searches in parallel. Ask each for the top 50 or top 100 results.
  3. Fuse the two ranked lists using Reciprocal Rank Fusion, which is the simplest method that consistently wins on benchmarks like BEIR.
  4. Optionally rerank the top 20-50 of the fused list with a cross-encoder (Cohere Rerank v3, BGE Reranker, or a small fine-tuned model). The cross-encoder reads the query and each candidate together, which is more expensive but much more accurate than the bi-encoder embedding step.
  5. Return the top 10 to the user.

Why RRF and not weighted score addition? Because BM25 scores and cosine similarities live in incompatible numeric ranges. BM25 outputs unbounded positive numbers (a top result might score 14.3, a bad one 0.5). Cosine similarity is bounded in [-1, 1], usually clustered in [0.3, 0.9] for sensible queries. Adding bm25_score + cosine_score gives BM25 a 10x weight by accident. You can normalize, but RRF sidesteps the whole mess by looking only at the rank position, not the score magnitude.

Reciprocal Rank Fusion, in one formula

For each document d, RRF computes:

RRF_score(d) = sum over each ranker r of: 1 / (k + rank_r(d))

where rank_r(d) is the 1-based position of d in ranker r's list and k is a smoothing constant (60 is the value from the original paper, and it's robust enough that nobody really tunes it). A document that doesn't appear in a ranker's top-N contributes zero for that ranker. Higher is better.

Two retrievers, RRF, done. No score normalization, no per-query tuning, no weight schedule. It's the rare ML technique that just works.
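
A quick worked example helps build intuition for the formula. This sketch just evaluates it for a few hypothetical rank combinations with k = 60; rrfScore is an illustrative helper name, and null stands for "not in that ranker's list".

```typescript
// ranks: the 1-based position of one document in each ranker's list, or null if absent.
function rrfScore(ranks: Array<number | null>, k = 60): number {
  let total = 0;
  for (const rank of ranks) {
    if (rank !== null) total += 1 / (k + rank);
  }
  return total;
}

const inBoth = rrfScore([1, 3]); // 1/61 + 1/63 ≈ 0.0323
const bm25Only = rrfScore([2, null]); // 1/62 ≈ 0.0161
const vectorOnly = rrfScore([null, 1]); // 1/61 ≈ 0.0164
const deepInBoth = rrfScore([50, 50]); // 2/110 ≈ 0.0182

console.log({ inBoth, bm25Only, vectorOnly, deepInBoth });
```

Note the last line: with k = 60 and top-50 lists, a document that both retrievers rank 50th still outscores a document that only one retriever ranks 1st. RRF rewards agreement between rankers more than a single high placement.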

A working code example with RRF fusion

Here's a minimal hybrid retriever in TypeScript, sitting on top of Postgres with pgvector for embeddings and Postgres tsvector for BM25-ish full-text search. It returns the top 10 fused results, with optional Cohere rerank as a third pass.

import { sql } from "./db";
import { embed } from "./embeddings"; // wraps OpenAI text-embedding-3-small
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

interface Hit { id: string; text: string; score: number; rank: number }

async function bm25Search(query: string, k = 50): Promise<Hit[]> {
  const rows = await sql<{ id: string; text: string; score: number }[]>`
    SELECT id, text, ts_rank_cd(tsv, plainto_tsquery('english', ${query})) AS score
    FROM docs
    WHERE tsv @@ plainto_tsquery('english', ${query})
    ORDER BY score DESC
    LIMIT ${k}
  `;
  return rows.map((r, i) => ({ ...r, rank: i + 1 }));
}

async function vectorSearch(query: string, k = 50): Promise<Hit[]> {
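  // embed() returns number[1536]; serialize it as a pgvector text literal ("[0.1,0.2,...]")
  // because most drivers do not map a JS array to the vector type on their own.
  const qVec = JSON.stringify(await embed(query));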
  const rows = await sql<{ id: string; text: string; score: number }[]>`
    SELECT id, text, 1 - (embedding <=> ${qVec}::vector) AS score
    FROM docs
    ORDER BY embedding <=> ${qVec}::vector
    LIMIT ${k}
  `;
  return rows.map((r, i) => ({ ...r, rank: i + 1 }));
}

function rrfFuse(lists: Hit[][], k = 60, topN = 20): Hit[] {
  const scores = new Map<string, { hit: Hit; score: number }>();
  for (const list of lists) {
    for (const hit of list) {
      const prev = scores.get(hit.id);
      const contribution = 1 / (k + hit.rank);
      if (prev) {
        prev.score += contribution;
      } else {
        scores.set(hit.id, { hit, score: contribution });
      }
    }
  }
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ hit, score }, i) => ({ ...hit, score, rank: i + 1 })); // score and rank now describe the fused list
}

export async function hybridSearch(query: string, withRerank = true) {
  const [keyword, semantic] = await Promise.all([
    bm25Search(query, 50),
    vectorSearch(query, 50),
  ]);
  const fused = rrfFuse([keyword, semantic], 60, 20);
  // Skip the rerank call when disabled or when there is nothing to rerank.
  if (!withRerank || fused.length === 0) return fused.slice(0, 10);

  const rerank = await cohere.rerank({
    model: "rerank-v3.5",
    query,
    documents: fused.map((h) => h.text),
    topN: 10,
  });
  return rerank.results.map((r) => fused[r.index]);
}

A few notes on this code, because the details matter:

  • ts_rank_cd is not strict BM25; it's Postgres's cover-density ranker, which scores term frequency and proximity within a document but does not use corpus-wide term rarity (IDF). For workloads under a few million rows, it's close enough. Above that, move to Elasticsearch or OpenSearch for a real BM25 implementation. The "Cursor Rules for production teams" guide has a good template for codifying these "scale up at X" rules in .cursor/rules/.
  • We fetch 50 from each retriever, fuse to 20, then rerank to 10. The 50/20/10 funnel is a reasonable default; tune it based on latency budget. Rerank is the slow step (200-500ms for 20 docs with Cohere).
  • The embed call should be cached by query hash. Re-embedding the same query is a waste.
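
The caching note can be sketched as a thin wrapper around embed. This is a minimal in-process version under stated assumptions, not the post's implementation: withEmbeddingCache and maxEntries are hypothetical names, and the cache key is the whitespace-normalized query string rather than a hash (hashing that same string would behave identically and just keeps keys short).

```typescript
type EmbedFn = (text: string) => Promise<number[]>;

// Wraps an embedding function with a bounded in-memory cache.
// Storing the promise (not the resolved vector) also deduplicates concurrent requests.
function withEmbeddingCache(embedFn: EmbedFn, maxEntries = 10000): EmbedFn {
  const cache = new Map<string, Promise<number[]>>();
  return (query: string) => {
    // Case is preserved on purpose: identifiers like ECONNREFUSED may embed differently when lowercased.
    const key = query.trim().replace(/\s+/g, " ");
    const hit = cache.get(key);
    if (hit) return hit;

    const pending = embedFn(query).catch((err) => {
      cache.delete(key); // do not keep failed calls in the cache
      throw err;
    });
    cache.set(key, pending);

    if (cache.size > maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest entry.
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    return pending;
  };
}
```

In the retriever above, this would be applied once at module load, for example `const cachedEmbed = withEmbeddingCache(embed)`, with vectorSearch calling cachedEmbed instead of embed.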

When you should add rerank as a third pass

Cohere Rerank (and its open alternatives, like BGE Reranker and Jina Reranker) is a cross-encoder: it concatenates the query and each candidate document and runs them through a transformer to produce a relevance score. This is much more accurate than bi-encoder embeddings because the model sees both pieces of text at once and can attend across them.

The trade-off is latency and cost. Document embeddings are computed once at index time and query embeddings can be cached; a rerank call runs the model again for every query and every candidate. Rules of thumb:

  • Add rerank when you're returning fewer than 20 results to a user (chat, search box, RAG context for an LLM). The accuracy bump is worth the 200-500ms.
  • Skip rerank for autocomplete, in-product nav search, or any UI where p95 latency matters more than top-1 precision.
  • Self-host rerank (BGE Reranker on a GPU) if you're doing more than 1M rerank calls per month. Cohere is great for getting started; the math flips around a million queries.

Comparison: hybrid search tooling in 2026

You have three architectural choices: stitch two specialized stores together, run an all-in-one engine, or stay inside Postgres. The right answer depends on corpus size, team size, and how much you want to operate.

| Approach | BM25 engine | Vector engine | Setup time | Good for | Trade-off |
|---|---|---|---|---|---|
| Postgres-only | tsvector + ts_rank_cd | pgvector + HNSW | 1 day | <5M docs, small team, RAG MVPs | Not true BM25; HNSW recall drops past 10M vectors |
| Elasticsearch / OpenSearch hybrid | Native BM25 | Native dense_vector (since 8.0) | 3-5 days | 5M-500M docs, existing ES team | Two index types in one cluster; ops still nontrivial |
| Pinecone + Elasticsearch | Elasticsearch BM25 | Pinecone serverless | 2 days | Vector-heavy, want managed | Two vendors, two bills, latency from two round trips |
| Weaviate (all-in-one) | Built-in BM25 | Built-in HNSW | 1 day | Mid-size RAG, want one box | Less mature BM25 than Lucene; vendor lock-in |
| Vespa (all-in-one) | Native BM25 | Native HNSW or IVF | 1-2 weeks | Large scale (Spotify, Yahoo) | Steepest learning curve on this list |
| Typesense Cloud | Native BM25 | Native vector search | 1 day | Site search, e-commerce | Less flexible scoring than Elasticsearch |
| Turbopuffer | Optional BM25 | Optimized vector | 1-2 days | Cost-sensitive vector workloads | Newer, smaller community |

The single-vendor pitch (Weaviate, Vespa, Typesense) is real: one schema, one query API, one bill. The two-vendor pitch (pgvector + Elasticsearch, or Pinecone + OpenSearch) is also real: each tool is the strongest pick at its job, and you avoid betting your search layer on one company's roadmap.

If you're below 10M documents and your team already runs Postgres, just use pgvector and tsvector. Most teams who jump straight to a dedicated vector DB regret the operational overhead a year in. We covered the broader cost picture in our breakdown on the cost to add semantic search, which walks through real numbers for each path.

What to do this week

If you're shipping search this quarter, here's the sequence that consistently works:

  1. Stand up pgvector + tsvector on your existing Postgres. One day of work.
  2. Wire the hybrid query above. Half a day.
  3. Build a 50-query eval set from real user queries (your support tickets and analytics are gold here). Score yourself on nDCG@10 or just "did the right doc make top 5."
  4. Add Cohere Rerank to the top 20. Re-run the eval. If the lift is under 5%, kill it; if it's over 10%, ship it.
  5. Only graduate to Elasticsearch + dedicated vector DB when corpus size, latency, or scoring control forces the move.
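
For the eval in step 3, both metrics fit in a few lines. This is a minimal sketch assuming graded relevance labels per query (a map from document id to a gain such as 0-3, which you would label by hand) and the common exponential-gain form of DCG; ndcgAtK and hitAtK are hypothetical helper names.

```typescript
// Discounted cumulative gain with exponential gain: (2^rel - 1) / log2(position + 1).
function dcg(gains: number[]): number {
  return gains.reduce((sum, g, i) => sum + (Math.pow(2, g) - 1) / Math.log2(i + 2), 0);
}

// rankedIds: ids returned by the retriever, best first.
// relevant: hand-labeled gains for this query; unlabeled ids count as 0.
function ndcgAtK(rankedIds: string[], relevant: Record<string, number>, k = 10): number {
  const gains = rankedIds.slice(0, k).map((id) => relevant[id] ?? 0);
  const ideal = Object.keys(relevant)
    .map((id) => relevant[id])
    .sort((a, b) => b - a)
    .slice(0, k);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(gains) / idealDcg;
}

// The cheaper check: did any relevant document make the top k?
function hitAtK(rankedIds: string[], relevant: Record<string, number>, k = 5): boolean {
  return rankedIds.slice(0, k).some((id) => (relevant[id] ?? 0) > 0);
}

// Hypothetical labels for one query: doc "a" is highly relevant, doc "b" marginally.
const labels = { a: 3, b: 1 };
console.log(ndcgAtK(["a", "b", "x"], labels)); // 1: the ideal ordering
console.log(ndcgAtK(["x", "b", "a"], labels)); // below 1: the best doc is ranked third
console.log(hitAtK(["x", "y", "a"], labels)); // true
```

Average ndcgAtK over the whole eval set with and without rerank to get the lift that step 4 compares against the 5% and 10% thresholds.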

If you don't have an engineer who's shipped this before, the fastest path is to book one who has. Every engineer on Cadence is AI-native by default (Cursor, Claude Code, Copilot fluency is vetted in a voice interview before they unlock bookings), and the platform's 12,800-engineer pool means RAG and search experience is a 2-minute filter, not a 6-week recruiter loop. Search infra is exactly the kind of bounded, well-scoped project where a senior at $1,500/week on a 2-week booking ships further than a 6-week contractor search.

The other place hybrid search tends to ship cleanly is alongside agentic SaaS features; the LLM's tool-use loop almost always calls a search(query) tool, and the quality of that tool is the single biggest determinant of whether the agent works.

FAQ

Is hybrid search always better than vector search alone?

For almost every production use case, yes. The BEIR benchmark and most internal evals show 10-30% lifts in nDCG@10 from adding BM25 to a vector retriever, with the biggest gains on queries containing rare terms, identifiers, or named entities. The exception is purely semantic workloads (recommendations, "find similar images") where keyword overlap doesn't help.

What's the difference between RRF and weighted score fusion?

Weighted score fusion adds normalized BM25 and cosine scores with tunable weights (alpha * bm25 + (1 - alpha) * cosine). It can outperform RRF if you tune alpha carefully per query type, but it requires score normalization (min-max or z-score) and breaks when score distributions shift. RRF only uses rank position, so it's robust without tuning. Start with RRF; only switch to weighted fusion if your eval set demands it.
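
For comparison, here is a minimal sketch of the weighted alternative with min-max normalization. The scores in the usage example are invented for illustration, and minMax and weightedFuse are hypothetical names; alpha is the weight on the BM25 side, as in the formula above.

```typescript
interface Scored { id: string; score: number }

// Rescales one result list to [0, 1]. A list whose scores are all equal maps to 1.
function minMax(hits: Scored[]): Map<string, number> {
  const out = new Map<string, number>();
  if (hits.length === 0) return out;
  const scores = hits.map((h) => h.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min;
  for (const h of hits) out.set(h.id, range === 0 ? 1 : (h.score - min) / range);
  return out;
}

// alpha * normalized BM25 + (1 - alpha) * normalized cosine; a missing document counts as 0.
function weightedFuse(bm25: Scored[], vector: Scored[], alpha = 0.5): Scored[] {
  const kw = minMax(bm25);
  const vec = minMax(vector);
  const ids = Array.from(new Set(bm25.map((h) => h.id).concat(vector.map((h) => h.id))));
  return ids
    .map((id) => ({ id, score: alpha * (kw.get(id) ?? 0) + (1 - alpha) * (vec.get(id) ?? 0) }))
    .sort((a, b) => b.score - a.score);
}

// Made-up raw scores: BM25 is unbounded, cosine similarity is not.
const bm25Hits: Scored[] = [{ id: "a", score: 14.3 }, { id: "b", score: 7.4 }, { id: "c", score: 0.5 }];
const vectorHits: Scored[] = [{ id: "b", score: 0.82 }, { id: "d", score: 0.8 }, { id: "a", score: 0.4 }];
console.log(weightedFuse(bm25Hits, vectorHits, 0.5).map((h) => h.id)); // "b" ranks first
```

One side effect worth knowing: min-max gives the lowest-scoring hit in each list a normalized score of 0, the same as a document that is missing from that list entirely. That sensitivity to the list's extremes is part of why this approach breaks when score distributions shift.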

How big does my corpus need to be before I need a dedicated vector DB?

Postgres with pgvector and an HNSW index handles 5M to 10M vectors comfortably on a reasonable instance (32GB RAM, NVMe SSD). Above that, recall starts to degrade and query latency climbs, and a dedicated vector DB like Pinecone, Qdrant, or Turbopuffer pays for itself. Most teams overshoot here and reach for a vector DB at 100K docs, which is wasted complexity.

Do I need Cohere Rerank, or is BM25 + embeddings enough?

For RAG context selection (where you're feeding the top 5 docs to an LLM), rerank often pays for itself; the LLM is very sensitive to junk context. For user-facing search, run the A/B test. If your fused top-10 is already at 80% precision and rerank only gets you to 85%, that may or may not be worth 300ms of latency.

Can I skip BM25 entirely and just use embeddings with a better model?

You can try. Models like text-embedding-3-large and Cohere embed-v3 are dramatically better than 2022-era embeddings, and on conversational queries they sometimes beat hybrid. But on the long tail (rare terms, exact identifiers, product codes), BM25 is still cheaper, faster, and more reliable. We have a related deep-dive on how to reduce AI coding mistakes in production that touches on the same "deterministic + probabilistic" pattern at the code level; the principle is identical.

Deeksha Durgesh
Senior Automation Developer

Senior automation engineer at withRemote. Writes on CI/CD, test pyramids, and removing toil from engineering pipelines.
