Retrieval-augmented generation — RAG — is the default pattern for getting a language model to answer using your organisation's own knowledge. It is also the pattern most likely to look finished when it is roughly half done. A RAG prototype can be built in an afternoon. A RAG system that holds up under real questions, real documents, and real users is a different undertaking.
This article covers where the gap between prototype and production actually lies, and the patterns we rely on to close it.
The prototype that fools everyone
The basic recipe is well known: split your documents into chunks, embed them, store the vectors, and at query time retrieve the closest chunks and hand them to the model as context. Build this on a tidy set of documents, ask it a few obvious questions, and it works. The demo is genuinely convincing.
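The recipe above fits in a few lines, which is part of why the prototype is so quick to build. The following is a minimal sketch only: `embed` is a toy bag-of-words stand-in for a real embedding model, the index is an in-memory list rather than a vector store, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk_fixed(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size chunking, as a typical prototype would do it."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Split, embed, and store: here the 'store' is just a list."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in chunk_fixed(doc)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 3) -> list[str]:
    """Return the k chunks closest to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Hand the retrieved chunks to the model as context."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

On two tidy sentences and an obvious question this retrieves the right chunk every time, which is exactly the kind of result that makes the demo convincing.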
Then it meets reality — messy documents, ambiguous questions, near-duplicate content, information spread across several sources — and the cracks show. The system retrieves the wrong passage, or a stale one, or misses the answer entirely while sounding just as confident. Nothing has crashed. The output is simply, fluently wrong.
Retrieval is the hard part
Almost every weak RAG system is weak at retrieval, not generation. If the right information reaches the model, modern models use it well. If it does not, no amount of prompt engineering will save the answer. So retrieval quality is where the engineering effort belongs.
Chunking with structure, not by character count
Splitting documents into fixed-size blocks routinely severs a sentence, a table, or an idea down the middle. We chunk along the document's actual structure — sections, headings, logical units — so each chunk is a coherent, self-contained piece of information. Better chunks are the cheapest large improvement available.
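As a rough illustration of chunking along structure, the sketch below splits a document at its headings so that each chunk is one section with its title attached. It assumes the source uses Markdown-style `#` headings, which is an assumption for this example; real documents (PDF, HTML, wiki exports) need their own structure detection, and `chunk_by_headings` is a hypothetical name.

```python
import re

# Assumption for this sketch: headings look like "# Title" through "###### Title".
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_headings(text: str) -> list[dict]:
    """Split a document into one chunk per section, keeping the section title."""
    chunks: list[dict] = []
    title = ""
    body: list[str] = []

    def flush() -> None:
        # Only emit a chunk if the section has some non-blank content.
        if any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            title = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()  # close the final section
    return chunks
```

Keeping the title with each chunk also gives the retriever and the model a short label for what the passage is about.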
Hybrid retrieval
Vector search captures meaning but can miss exact terms — a product code, an error number, a specific name. Keyword search catches those but misses paraphrase. Production systems use both and combine the results. The two methods fail in different places, and together they cover far more ground.
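One widely used way to combine two ranked lists is reciprocal rank fusion (RRF); the text above does not say which combination method is meant, so treat this as one plausible option rather than the approach described. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant `k = 60` is a conventional default, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    A document's fused score is the sum of 1 / (k + rank) over every
    list that contains it, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only rank positions, so it sidesteps the problem that vector similarity scores and keyword scores live on different scales.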
Re-ranking what you retrieve
Initial retrieval is tuned for speed and recall: get a generous set of plausible candidates fast. A second, more careful re-ranking step then scores those candidates for genuine relevance and keeps only the best. This two-stage shape — broad then precise — consistently outperforms trying to get retrieval perfect in a single pass.
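The two-stage shape can be sketched independently of any particular model. In the example below both scorers are toy stand-ins: `token_overlap` plays the fast first-stage retriever, and `phrase_match` plays the slower, more careful re-ranker (in practice often a cross-encoder model). All names are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    careful_score: Scorer,
    n_candidates: int = 20,
    k: int = 3,
) -> list[str]:
    """Stage 1: broad, fast candidate selection. Stage 2: precise re-ranking."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    reranked = sorted(candidates, key=lambda d: careful_score(query, d), reverse=True)
    return reranked[:k]


def token_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Toy re-ranker: token overlap plus a bonus for the exact query phrase."""
    bonus = 10.0 if query.lower() in doc.lower() else 0.0
    return token_overlap(query, doc) + bonus


corpus = [
    "reset password steps for admin",
    "password reset link expired",
    "how to reset your password quickly",
    "office opening hours",
]
best = retrieve_then_rerank(
    "password reset", corpus, token_overlap, phrase_match, n_candidates=3, k=1
)
```

The first stage cannot tell the three password documents apart, since each shares two tokens with the query; the second stage separates them. The careful scorer only ever sees `n_candidates` documents, which is what keeps an expensive re-ranker affordable.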
Generation patterns that prevent confident nonsense
With good retrieval in place, a few generation-side patterns keep the system honest.
- Instruct the model to ground every claim. It should answer only from the retrieved context and explicitly say when the context does not contain the answer — rather than filling the gap from its general knowledge.
- Cite sources. Every answer should point back to the passages it used. Citations let users verify claims, build trust, and make wrong answers diagnosable.
- Handle the empty case deliberately. When retrieval finds nothing relevant, the correct response is to say so. "I don't have that information" is a feature, not a failure.
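The three patterns in this list can be combined in one small wrapper around the model call. This is a sketch under stated assumptions: `generate` is a hypothetical callable standing in for whatever model client is in use, the relevance threshold `min_score` is an arbitrary illustrative value, and the prompt wording is only one possible phrasing.

```python
from dataclasses import dataclass
from typing import Callable, Optional

NO_ANSWER = "I don't have that information."


@dataclass
class Passage:
    source: str   # where the passage came from, used for citations
    text: str
    score: float  # retrieval relevance score, higher is better


def build_grounded_prompt(
    question: str, passages: list[Passage], min_score: float = 0.5
) -> Optional[str]:
    """Build a prompt that demands grounding and citations.

    Returns None when no passage clears the relevance threshold.
    """
    relevant = [p for p in passages if p.score >= min_score]
    if not relevant:
        return None
    context = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(relevant, start=1)
    )
    return (
        "Answer using only the numbered passages below. "
        "Cite the passage number, like [1], after each claim. "
        f'If the passages do not contain the answer, reply exactly: "{NO_ANSWER}"\n\n'
        f"Passages:\n{context}\n\nQuestion: {question}"
    )


def answer(
    question: str,
    passages: list[Passage],
    generate: Callable[[str], str],  # hypothetical model call
    min_score: float = 0.5,
) -> str:
    """Handle the empty case before the model is ever called."""
    prompt = build_grounded_prompt(question, passages, min_score)
    if prompt is None:
        return NO_ANSWER
    return generate(prompt)
```

Short-circuiting on empty retrieval means the model never gets the chance to fill the gap from general knowledge, and the numbered passages give each citation something concrete to point at.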
You cannot improve what you do not evaluate
The single practice that most separates production RAG from prototype RAG is evaluation. Before launch, we build a test set of real questions paired with their correct answers and the passages that should be retrieved. Every change to chunking, retrieval, or prompting is measured against that set. Without it, tuning is guesswork and every "improvement" is a coin flip.
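A test set of this kind can start very small. The sketch below scores the retrieval side only, using recall@k: for each question, the fraction of the passages that should be retrieved that actually appear in the top k results. The `EvalCase` fields mirror the description above; the retriever and passage IDs are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_answer: str
    relevant_ids: set[str]  # passages that should be retrieved


def recall_at_k(
    cases: list[EvalCase],
    retrieve: Callable[[str, int], list[str]],  # returns passage IDs
    k: int = 5,
) -> float:
    """Mean fraction of expected passages found in the top-k results."""
    if not cases:
        raise ValueError("evaluation set is empty")
    total = 0.0
    for case in cases:
        retrieved = set(retrieve(case.question, k))
        total += len(retrieved & case.relevant_ids) / len(case.relevant_ids)
    return total / len(cases)


# Hypothetical test set and a canned retriever, for illustration only.
cases = [
    EvalCase("What is the refund window?", "30 days", {"p1", "p2"}),
    EvalCase("Who approves expenses?", "The line manager", {"p3"}),
]
canned = {
    "What is the refund window?": ["p1", "p9"],
    "Who approves expenses?": ["p3", "p4"],
}
score = recall_at_k(cases, lambda q, k: canned[q][:k], k=2)
```

Re-running the same function after each change to chunking or retrieval turns "it seems better" into a number that can go up or down. Answer quality needs its own measurement, for example comparing model output against `expected_answer`.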
Evaluation continues after launch. Real user questions are harder and stranger than anything you anticipated; they are also the best material for improving the system. Logging queries, retrieved context, and outcomes turns live usage into a steady feedback loop.
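A feedback loop of this kind needs little more than one structured record per interaction. The following sketch writes JSON Lines to any file-like object; the field names and the `"bad"` feedback label are assumptions made for illustration, not a standard schema.

```python
import json
import time
from typing import Iterable, Optional, TextIO


def log_interaction(
    sink: TextIO,
    query: str,
    retrieved_ids: list[str],
    answer: str,
    feedback: Optional[str] = None,  # e.g. "good" / "bad", if the user gave any
) -> dict:
    """Append one interaction to the log as a single JSON line."""
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved_ids": list(retrieved_ids),
        "answer": answer,
        "feedback": feedback,
    }
    sink.write(json.dumps(record) + "\n")
    return record


def failed_queries(lines: Iterable[str]) -> list[str]:
    """Pull out queries the user marked as bad: candidates for the test set."""
    records = (json.loads(line) for line in lines if line.strip())
    return [r["query"] for r in records if r["feedback"] == "bad"]
```

Queries that surface through `failed_queries` are natural additions to the pre-launch test set, so each real failure becomes a permanent regression check.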
The shape of a RAG system that lasts
Production RAG is structure-aware chunking, hybrid search, a re-ranking pass, disciplined grounded generation with citations, and a real evaluation loop running before and after launch. None of it is exotic. It is simply the work that the afternoon prototype skips, and the work that decides whether the system earns trust or quietly loses it.