Building Production RAG Systems That Don't Hallucinate

Retrieval-Augmented Generation is powerful but brittle. Here's how we architect RAG pipelines that are actually reliable in production.

Retrieval-Augmented Generation sounds simple: retrieve relevant documents, stuff them in the context, get a better answer. In practice, production RAG systems fail in a dozen quiet ways. Here’s what we’ve learned building them at scale.

The retrieval problem is harder than the generation problem

Most RAG tutorials focus on the LLM prompt. The actual reliability bottleneck is retrieval. A model can only work with what it’s given — if the retriever returns the wrong chunks, no prompt engineering saves you.

The most common failure modes:

Semantic mismatch: The user asks about “contract termination” but your documents use “agreement dissolution.” Embedding similarity alone can miss the match or rank it too low to make the cut. Fix: hybrid search (dense + sparse/BM25) almost always outperforms either alone.

Chunk boundary problems: You split a document at the wrong point and cut the answer in half. The chunk with the question context and the chunk with the answer end up separated. Fix: overlapping chunks and parent-document retrieval (match on small chunks, then pass the larger enclosing section to the model).

Stale index: Your vector store doesn’t reflect the current state of your documents. Fix: deterministic document IDs, so that re-indexing an updated document replaces its old entries rather than duplicating them.
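
To make the three fixes concrete, here is a minimal sketch under a few assumptions of ours. It fuses a dense and a sparse ranking with reciprocal rank fusion (one common way to combine them, not the only one), splits text into overlapping character windows, and derives chunk IDs from a source path plus position. The function names, the constant k=60, and the window sizes are illustrative choices, and parent-document retrieval is not shown.

```python
import hashlib
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs (e.g. dense and BM25) into one."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes more for chunks it ranks near the top.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def chunk_with_overlap(text, size=200, overlap=50):
    """Split text into character windows of `size` that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_id(source_path, index):
    """Deterministic ID: the same document and position always hash the same."""
    key = f"{source_path}#{index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]
```

Upserting by these IDs means a re-indexed document overwrites its previous chunks. One caveat: if an updated document produces fewer chunks than before, the leftover higher-index IDs still need to be deleted explicitly.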

Re-ranking changes everything

After retrieval, re-rank your top-k results before sending them to the LLM. A cross-encoder re-ranker (Cohere Rerank, or a fine-tuned BERT model) dramatically improves precision at the cost of added latency, typically tens to a few hundred milliseconds depending on the model and the number of candidates scored. The improvement in answer quality is consistently worth it.
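
The shape of a re-ranking step is simple enough to sketch. The version below assumes a hypothetical score_fn callable that returns a relevance score for a (query, chunk) pair; in a real pipeline that callable would wrap a cross-encoder or a hosted re-rank API. The toy word-overlap scorer is only a stand-in so the example runs without a model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Demo-only scorer: fraction of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


if __name__ == "__main__":
    retrieved = [
        "the cat sat",
        "contract termination clause",
        "termination of employment",
    ]
    for chunk, score in rerank("contract termination", retrieved, toy_overlap_score, top_n=2):
        print(f"{score:.2f}  {chunk}")
```

Keeping the scorer behind a callable makes it easy to swap re-rankers and to measure each one against the same eval set.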

Structuring context for the model

Don’t just concatenate chunks. Structure your context:

[Source: Annual Report 2024, Section: Risk Factors]
...chunk content...

[Source: Board Minutes, 2024-03-15]
...chunk content...

This gives the model provenance signals and dramatically reduces cross-document confusion. Include the source in the prompt and ask the model to cite it in the response. You get attribution for free, plus a hallucination check: a citation that points to a source you never supplied is an immediate red flag.
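
As a rough sketch of how that layout could be produced, the helper below renders chunks with the bracketed source headers shown above and wraps them in a prompt that asks for citations. The Chunk dataclass, the function names, and the prompt wording are our own illustrative choices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    source: str                    # e.g. "Annual Report 2024"
    text: str
    section: Optional[str] = None  # e.g. "Risk Factors"


def format_context(chunks: List[Chunk]) -> str:
    """Render each chunk under a [Source: ...] header, separated by blank lines."""
    blocks = []
    for chunk in chunks:
        header = f"[Source: {chunk.source}"
        if chunk.section:
            header += f", Section: {chunk.section}"
        header += "]"
        blocks.append(f"{header}\n{chunk.text.strip()}")
    return "\n\n".join(blocks)


def build_prompt(question: str, chunks: List[Chunk]) -> str:
    """Combine citation instructions, the structured context, and the question."""
    return (
        "Answer the question using only the sources below. "
        "Cite the source label in brackets after each claim.\n\n"
        f"{format_context(chunks)}\n\n"
        f"Question: {question}"
    )
```

Because the source labels are generated in one place, the same strings can later be used to verify that every citation in the response refers to a chunk that was actually supplied.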

Knowing when not to answer

The most underrated feature of a good RAG system is knowing when to say “I don’t know.” Set a retrieval confidence threshold. If no chunk clears it, return a graceful fallback instead of having the model confabulate an answer.

Implement this as a two-stage check:

Top retrieval score below threshold → no relevant documents found
LLM self-assessment in the prompt: “Only answer if the provided sources contain the relevant information”

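Both checks are necessary. Neither is sufficient alone: the score threshold catches queries with nothing relevant in the index, while the self-assessment catches chunks that are on topic but don’t actually contain the answer.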
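
A minimal sketch of the two-stage check might look like the following. It assumes retrieval returns (text, score) pairs and that generate is a hypothetical callable wrapping your LLM. The 0.5 threshold and the NO_ANSWER sentinel are placeholders; a usable threshold depends on your retriever’s score scale and should be tuned against your eval set.

```python
from typing import Callable, List, Tuple

FALLBACK = "I don't know. The available documents don't cover this."
NO_ANSWER = "NO_ANSWER"  # sentinel the prompt asks the model to emit


def answer_or_fallback(
    question: str,
    scored_chunks: List[Tuple[str, float]],
    generate: Callable[[str], str],
    min_score: float = 0.5,
) -> str:
    # Stage 1: retrieval threshold. If no chunk clears it, skip the LLM call.
    relevant = [text for text, score in scored_chunks if score >= min_score]
    if not relevant:
        return FALLBACK

    context = "\n\n".join(relevant)
    prompt = (
        "Only answer if the provided sources contain the relevant information. "
        f"If they do not, reply with exactly {NO_ANSWER}.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )

    # Stage 2: LLM self-assessment. Treat the sentinel or an empty reply as a refusal.
    reply = generate(prompt).strip()
    if not reply or NO_ANSWER in reply:
        return FALLBACK
    return reply
```

Returning early in stage 1 also saves an LLM call on queries the index cannot answer.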

Eval before you ship

Build an eval set before you build the system. Aim for 50–100 question/answer pairs that cover your expected query distribution, with ground-truth answers verified by a domain expert. Run it on every change. Track:

Answer correctness (LLM judge or exact match)
Retrieval recall at k (did the right chunk appear in the top results?)
Faithfulness (is every claim in the answer supported by the retrieved sources?)
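
Two of these metrics are cheap to compute yourself. The sketch below assumes each eval example is a dict with question, expected_answer, and relevant_ids keys (a layout we made up for illustration), and that retrieve and answer are hypothetical callables wrapping your pipeline. Faithfulness is left out because it generally needs an LLM judge or human review.

```python
from typing import Callable, Dict, List


def recall_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each example needs at least one relevant chunk ID")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def exact_match(predicted: str, expected: str) -> bool:
    """Case- and whitespace-insensitive string equality."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return normalize(predicted) == normalize(expected)


def evaluate(
    examples: List[Dict],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str], str],
    k: int = 5,
) -> Dict[str, float]:
    """Run the whole eval set and return mean recall@k and exact-match rate."""
    if not examples:
        raise ValueError("eval set is empty")
    recalls, matches = [], []
    for ex in examples:
        recalls.append(recall_at_k(retrieve(ex["question"]), ex["relevant_ids"], k))
        matches.append(exact_match(answer(ex["question"]), ex["expected_answer"]))
    n = len(examples)
    return {"recall_at_k": sum(recalls) / n, "exact_match": sum(matches) / n}
```

Exact match is strict and mostly suits short factual answers; for free-form answers, an LLM judge scored against the same ground truth is the usual substitute.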