Why your RAG pipeline is lying to you (and how to fix it)

Retrieval quality is the bottleneck nobody talks about. Chunking strategies, embedding model choice, and the hybrid search trick that changed everything for our pipeline.


Here’s a pattern I see constantly: a team builds a RAG pipeline, tests it on 10 questions, gets good results, and ships it. Three months later, users are complaining that the AI “makes things up” — but the model isn’t hallucinating. It’s faithfully summarizing the wrong documents.

The retrieval is the problem. It almost always is.

The retrieval quality illusion

Most RAG demos work great because the test corpus is small and the questions are obvious. When you have 50 documents and ask “what is our refund policy,” cosine similarity over embeddings will find the right chunk almost every time.

Scale that to 50,000 documents with overlapping terminology, ambiguous queries, and chunks that are semantically similar but contextually different? Your retrieval precision drops off a cliff, and the LLM cheerfully generates authoritative-sounding answers from irrelevant context.

Three things that actually moved the needle

1. Chunking strategy matters more than embedding model choice

We tested four chunking approaches on the same corpus:

  • Fixed-size (512 tokens): Baseline. Dumb but consistent.
  • Semantic chunking: Split on topic boundaries using an LLM. Better for long documents, but slow and expensive to index.
  • Hierarchical chunking: Parent-child chunks where retrieval matches on children but returns the parent. This was our winner for clinical documents.
  • Sentence-window: Retrieve a sentence, expand to surrounding context. Good for FAQ-style content.

Hierarchical chunking improved our retrieval precision by 23% over fixed-size — without changing the embedding model at all. The intuition: clinical notes have a natural structure (subjective, objective, assessment, plan), and preserving that structure in the chunk hierarchy gives the retriever much better signal.
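The parent-child idea is simple enough to sketch in a few lines. This is an illustrative Python sketch, not our production indexer: the `Chunk` class, `build_hierarchy`, and the naive sentence split are all hypothetical stand-ins. The key invariant is that children are what get embedded and matched, while the parent is what gets returned as context.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    parent: "Chunk | None" = None
    children: list["Chunk"] = field(default_factory=list)

def build_hierarchy(sections: list[str], child_size: int = 3) -> list[Chunk]:
    """Make one parent chunk per document section, split into small children.

    Children get embedded and matched at query time; the parent is what
    gets handed to the LLM, so generation sees the full section context.
    """
    parents = []
    for section in sections:
        parent = Chunk(text=section)
        sentences = [s.strip() for s in section.split(".") if s.strip()]
        for i in range(0, len(sentences), child_size):
            child_text = ". ".join(sentences[i:i + child_size])
            parent.children.append(Chunk(text=child_text, parent=parent))
        parents.append(parent)
    return parents

def context_for(matched_child: Chunk) -> Chunk:
    # Match on the child, return the parent for generation context.
    return matched_child.parent or matched_child
```

For clinical notes, the "sections" would come from the SOAP structure (subjective, objective, assessment, plan) rather than a naive sentence split, which is where the extra retrieval signal comes from.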

2. Hybrid search is not optional

Pure vector search has a fundamental weakness: it struggles with exact terminology. If a user asks about “metformin 500mg” and your embedding model maps that close to “diabetes medication,” you might retrieve a general diabetes overview instead of the specific metformin dosing protocol.

The fix: combine vector search with keyword search (BM25). We use pgvector for embeddings and a tsvector column for full-text search in the same Postgres database, then merge results with reciprocal rank fusion.

-- Simplified version of our hybrid search
WITH vector_results AS (
  SELECT id, 1.0 / (ROW_NUMBER() OVER (ORDER BY embedding <=> $1) + 60) as rrf_score
  FROM documents
  ORDER BY embedding <=> $1
  LIMIT 20
),
keyword_results AS (
  SELECT id, 1.0 / (ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, query) DESC) + 60) as rrf_score
  FROM documents, plainto_tsquery($2) query
  WHERE tsv @@ query
  ORDER BY ts_rank(tsv, query) DESC
  LIMIT 20
)
SELECT id, SUM(rrf_score) as combined_score
FROM (SELECT * FROM vector_results UNION ALL SELECT * FROM keyword_results) combined
GROUP BY id
ORDER BY combined_score DESC
LIMIT 5;

This single change — adding keyword search alongside vector search — reduced our “wrong document retrieved” rate by 40%.
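If you prefer to merge in application code rather than SQL, the reciprocal rank fusion step is tiny. A minimal sketch (function name is ours; the `k = 60` constant matches the `+ 60` in the query above and comes from the original RRF paper):

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge several ranked lists of document ids with RRF.

    Each document contributes 1 / (rank + k) per list it appears in,
    so items ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The nice property of RRF is that it only looks at ranks, never raw scores, so you never have to normalize cosine distances against BM25 scores.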

3. Evaluate retrieval separately from generation

Most teams evaluate RAG end-to-end: “did the final answer match the expected answer?” This is necessary but not sufficient. You need to measure retrieval quality independently:

  • Retrieval precision: Of the chunks retrieved, how many were actually relevant?
  • Retrieval recall: Of the relevant chunks in the corpus, how many did we retrieve?
  • Mean reciprocal rank: Was the best chunk ranked first, or buried at position 4?
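The three metrics above take a few lines each to compute per query. A sketch, assuming each labeled test case gives you the retrieved chunk ids in rank order plus the set of relevant ids (function and variable names are illustrative):

```python
def retrieval_metrics(
    retrieved: list[str], relevant: set[str]
) -> tuple[float, float, float]:
    """Per-query precision, recall, and reciprocal rank for retrieval.

    retrieved: chunk ids in the order the retriever ranked them.
    relevant:  the manually labeled relevant chunk ids for this query.
    """
    hits = [doc for doc in retrieved if doc in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Reciprocal rank: 1 / position of the first relevant chunk, 0 if none.
    rr = 0.0
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            rr = 1.0 / i
            break
    return precision, recall, rr
```

Averaging the reciprocal rank over the whole test set gives mean reciprocal rank; averaging precision and recall gives the corpus-level numbers to track across index changes.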

We built a test set of 200 queries with manually labeled relevant chunks. It took two days to create and it’s the single most valuable asset in our RAG pipeline. When retrieval metrics drop, we know before the LLM output quality degrades.

The uncomfortable truth

RAG is sold as “just add your documents and ask questions.” The reality is that building a RAG pipeline that works reliably at scale is a search engineering problem — and search engineering is hard. The LLM is the easy part.

If your RAG pipeline is underperforming, stop tuning the prompt. Look at what documents are actually being retrieved. Nine times out of ten, that’s where the problem lives.