Internals Decoded

How RAG Systems Work

A technical walkthrough of the retrieval-augmented generation architecture, from document ingestion and chunking to vector search and prompt assembly.

RAG, LLM, vector search, embeddings, retrieval
12 min read · Updated Jun 23, 2026

Imagine you are a writer working on an article about a subject you know nothing about. You could try to remember every book you have ever read, but that will not end well. Instead, you call a librarian. You describe what you need. The librarian searches a massive catalog, pulls a few relevant books, and hands them to you. You read the passages, then write your article using that fresh information.

That is RAG. The writer is the large language model. The librarian is the retrieval system. The catalog is a vector database. The books are chunks of your documents.

Retrieval-augmented generation couples a language model with an external knowledge store. The model no longer has to rely solely on the frozen weights it was trained with. It can pull in up-to-date, domain-specific facts at inference time and condition its answer on them. For senior engineers building with AI, this is not a magic trick. It is a multi-component distributed system with real engineering tradeoffs. This article walks through how that system works under the hood.

The Two Pipelines

A RAG system has two separate but connected pipelines. One runs offline. One runs online. The offline pipeline ingests your documents, breaks them into chunks, converts those chunks into vector embeddings, and stores them in a searchable index. The online pipeline takes a user query, fetches relevant chunks from that index, and feeds them to the language model as part of the prompt. Keeping these pipelines separate lets you scale ingestion independently of serving. You can update embeddings, swap retrieval strategies, or add new data sources without touching the generation code.

Think of the offline pipeline as the librarian who organizes the books on the shelves. The online pipeline is the librarian who answers a patron's question by pulling the right books off those shelves. Both roles are critical. If the books are shelved badly, the librarian cannot find them. If the librarian misreads the patron's question, even well-shelved books will not help.

Offline: Building the Knowledge Index

Loading and Normalizing Documents

The offline pipeline starts with raw data. PDF manuals. HTML pages. Database exports. Internal wikis. The first job is to load everything and turn it into clean, plain text. This is not just about extracting strings. You have to preserve logical structure where possible. Headings, tables, and code blocks all carry meaning. A table of pricing data should not become a wall of numbers. It might need to be converted into a series of "if this, then that" statements or a structured representation that the embedding model can digest.

Frameworks like LangChain and LlamaIndex provide document loaders for common formats. They produce standardized objects that hold text and metadata. Metadata matters. You will want to filter by date, document type, or access level later. Without metadata, your vector index is just a bag of anonymous text blobs.

Chunking Strategies

Once you have clean text, you must split it into chunks. A chunk is the unit of retrieval. When a user asks a question, the system will return one or more chunks. The size and shape of those chunks determine whether the model gets the right context.

Chunking is not a trivial "split every 512 tokens" operation. It is a design decision that balances recall and precision. Small chunks are precise. They capture a single fact. But they can lose surrounding context. Large chunks preserve context. But they can bury the relevant fact in noise. There is no one right answer. You have to experiment with your own data.

Several strategies exist. Here are the most common ones.

  • Fixed-size chunking splits text into segments of a given token length, often with some overlap. It is simple but oblivious to semantic boundaries. A sentence might be cut in half.
  • Recursive chunking uses a hierarchy of separators. It tries double newlines first, then single newlines, then spaces. This keeps paragraphs and sentences intact as much as possible while still hitting the target size.
  • Semantic chunking uses embeddings to group sentences that are similar to each other. This creates chunks that are topically coherent. The downside is that you must run an embedding model during preprocessing, which adds cost.
  • Document-based chunking respects the structure of the source. It splits on Markdown headings, table boundaries, or code block markers. For a codebase, this means a function and its docstring stay together.
  • Agentic chunking goes further by asking an LLM to decide where to split. The model reads the text and identifies natural breakpoints. This is experimental but can produce highly coherent chunks.

In practice, teams often start with recursive chunking and a moderate chunk size like 512 tokens with 10% overlap. They then inspect retrieval outputs for a set of test queries. If the returned chunks feel disjointed, they increase overlap or switch to semantic chunking. If chunks are too large and full of irrelevant text, they shrink the size. Chunking is an iterative process.
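To make the recursive strategy concrete, here is a minimal sketch. It is not the LangChain or LlamaIndex implementation. The function names are hypothetical, and it counts whitespace-separated words as a stand-in for tokens, which is an assumption you would replace with your embedding model's tokenizer.

```python
def merge_small(pieces, max_words):
    """Greedily glue adjacent pieces back together without exceeding max_words."""
    merged, current, count = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        if current and count + n > max_words:
            merged.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        merged.append(" ".join(current))
    return merged


def recursive_split(text, max_words=512, separators=("\n\n", "\n", " ")):
    """Try the coarsest separator first; recurse with finer ones only where needed."""
    if len(text.split()) <= max_words:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut on words.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_words, finer))
    return merge_small(pieces, max_words)


def with_overlap(chunks, overlap_words):
    """Prefix each chunk with the last few words of the chunk before it."""
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0 and overlap_words > 0:
            tail = chunks[i - 1].split()[-overlap_words:]
            chunk = " ".join(tail) + " " + chunk
        out.append(chunk)
    return out


sample = "one two three\n\nfour five six seven\n\neight nine"
chunks = recursive_split(sample, max_words=4)
overlapped = with_overlap(chunks, overlap_words=1)
```

With a 512-word budget, a 10% overlap would correspond to `overlap_words=51`. Note that overlap is added on top of the budget in this sketch, so the final chunks can be slightly larger than `max_words`.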

From Chunks to Embeddings

After chunking, each chunk must become a vector. An embedding model takes text and outputs a dense vector of numbers. The magic is that chunks with similar meaning end up close together in this vector space. The model does not understand meaning the way a human does. It has learned statistical patterns from massive text corpora. But for retrieval, that is usually enough.
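"Close together" has a precise meaning. A small sketch with cosine similarity shows it. The three vectors below are made up for illustration and have only three dimensions. Real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three chunks (values invented for this example).
vacation_policy = [0.9, 0.1, 0.2]
time_off_rules = [0.8, 0.2, 0.3]
gpu_driver_notes = [0.1, 0.9, 0.1]

related = cosine_similarity(vacation_policy, time_off_rules)
unrelated = cosine_similarity(vacation_policy, gpu_driver_notes)
```

Here `related` comes out far higher than `unrelated`, which is the property retrieval depends on.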

The embedding model you choose matters a lot. A model trained on general web text often underperforms on specialized material such as medical records or legal contracts. If your domain is specialized, you may need to fine-tune an embedding model on your own data. Dense passage retrieval, or DPR, is one such technique: it trains two encoders, one for queries and one for passages. The training pulls relevant query-passage pairs close together and pushes irrelevant pairs apart. This reshapes the embedding space to align with your specific retrieval needs.
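The "pull together, push apart" objective can be sketched as a softmax loss over similarity scores. The version below is a toy, assuming dot-product similarity and plain Python lists in place of encoder outputs. A real training loop would compute this over batches in a framework with automatic differentiation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query, positive, negatives):
    """Negative log-likelihood of the positive passage under a softmax over scores."""
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return -math.log(exps[0] / sum(exps))


query = [1.0, 0.0]
# Low loss: the relevant passage already scores higher than the irrelevant one.
aligned = contrastive_loss(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
# High loss: the encoders currently rank the irrelevant passage first.
misaligned = contrastive_loss(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

Minimizing this loss moves the query vector toward its relevant passage and away from the others, which is the reshaping described above.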

It is important to understand that embeddings are not universal truth vectors. They reflect the training data and objectives of the model that produced them. If your documents use jargon that the embedding model has never seen, the vectors will be poor. If the model was trained on English but your documents are in French, you have a problem. Treat embeddings as a tool, not a solved problem.

Indexing the Vectors

You now have millions of high-dimensional vectors. You need to find the top k vectors closest to a query vector in milliseconds. A linear scan compares the query against every stored vector, so its cost grows with the size of the corpus and does not scale. You need an approximate nearest neighbor (ANN) index.
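For reference, this is what the exact linear scan looks like. It is a baseline sketch using inner product as the score. The function name is illustrative. It is useful on small corpora and as ground truth when you measure how much accuracy an approximate index gives up.

```python
import heapq


def top_k_exact(query, vectors, k):
    """Score every stored vector against the query and keep the k best indices."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
nearest = top_k_exact([1.0, 0.0], vectors, k=2)
```

Every query touches every vector, so the work per query is proportional to corpus size times dimensionality.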

Vector databases like Milvus and libraries like FAISS provide several index types. The two most common are HNSW and IVF-PQ.

HNSW stands for Hierarchical Navigable Small World. It is a graph-based index. Each vector is a node, connected to its nearest neighbors. The graph has multiple layers. The bottom layer contains every node. Each higher layer keeps a progressively sparser subset, so its links span longer distances. Search starts at an entry point in the top layer and greedily walks toward nodes that are closer to the query, descending a layer whenever it can no longer improve, until it reaches the bottom. This is fast and accurate but memory-intensive, because the graph links are stored alongside the full vectors.
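The layered greedy walk can be sketched in a few lines. This is a deliberate simplification and not the full algorithm. It tracks a single current node, whereas real HNSW implementations keep a list of candidates at the bottom layer and build the graph incrementally. The hand-written adjacency dicts below are assumptions for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_search(query, vectors, neighbors, entry):
    """Move to whichever neighbor is closer to the query until none is."""
    current = entry
    current_dist = squared_distance(query, vectors[current])
    while True:
        best, best_dist = current, current_dist
        for candidate in neighbors[current]:
            d = squared_distance(query, vectors[candidate])
            if d < best_dist:
                best, best_dist = candidate, d
        if best == current:
            return current
        current, current_dist = best, best_dist


def layered_search(query, vectors, layers, entry):
    """`layers` is ordered top (sparse) to bottom (dense); reuse each result as the next entry."""
    for neighbors in layers:
        entry = greedy_search(query, vectors, neighbors, entry)
    return entry


vectors = [[0.0], [2.0], [4.0], [6.0], [8.0]]
top_layer = {0: [4], 4: [0]}                                      # long-range links only
bottom_layer = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # every node
found = layered_search([5.8], vectors, [top_layer, bottom_layer], entry=0)
```

The top layer jumps from node 0 straight to node 4, and the bottom layer needs only one more step to reach node 3, the true nearest neighbor.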

IVF-PQ combines inverted file indexing with product quantization. First, vectors are clustered into groups using k-means. At query time, only the nearest few clusters are searched. Product quantization compresses each vector into a short code, saving memory. The distance is approximated using these codes. IVF-PQ trades some accuracy for much lower memory and faster search.
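Here is a sketch of the inverted file half of that design. It assumes the centroids were already produced by k-means, which is not shown, and it omits product quantization entirely, so distances are computed on full vectors. `nprobe`, the number of clusters to visit, is the name FAISS uses for this parameter. Everything else is illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: squared_distance(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    candidates.sort(key=lambda idx: squared_distance(query, vectors[idx]))
    return candidates[:k]


vectors = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]]
centroids = [[0, 0.5], [10, 10.5]]  # assumed output of k-means
lists = build_ivf(vectors, centroids)

narrow = ivf_search([6, 6], vectors, centroids, lists, k=1, nprobe=1)
wide = ivf_search([6, 6], vectors, centroids, lists, k=1, nprobe=2)
```

The query `[6, 6]` sits near a cluster boundary. With `nprobe=1` the search misses the true nearest vector (index 4) because it lives in the other cluster. With `nprobe=2` it is found. That is the accuracy tradeoff in miniature. On the memory side, a 768-dimensional float32 vector takes 3,072 bytes, while a product quantization code of 96 one-byte entries takes 96 bytes, a 32x reduction under those assumed settings.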

You will also see hybrid search setups. They combine dense vector search with sparse lexical search like BM25. BM25 is great at exact term matching. Dense search is great at semantic similarity. Together they cover each other's blind spots. Many production RAG systems use both and merge the results.
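One common way to merge the two result lists is reciprocal rank fusion (RRF). The article does not say which merge method a given system uses, so treat this as one option among several. RRF only needs ranks, which sidesteps the problem that BM25 scores and cosine similarities live on different scales. The constant 60 is a widely used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) for each document across all rankings, then sort by total."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical chunk ids returned by each retriever, best first.
dense_results = ["c3", "c1", "c7"]
bm25_results = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_results, bm25_results])
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that only one retriever found.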

Online: Answering a Query

Query Rewriting

A user types a question. "What's the policy on vacation days for remote employees hired after 2022?" That is a good query. But often users are less precise. They type "how do I get time off" or "vacation rules remote." The embedding of that sloppy query may not be close to the embedding of the relevant policy chunk. Query rewriting fixes this.

Query rewriting uses a language model to reformulate the user's raw input into a more effective search query. You give the model a system prompt like: "You are a search query rewriter. Take the user's question and rewrite it to be information dense and use the terminology found in our HR policy documents." The model might output "remote employee vacation day accrual policy for hires after 2022." That rewritten query is then embedded and used for retrieval.

This step is optional but powerful. It decouples the user's conversational language from the formal language of your documents. It can also expand abbreviations, add synonyms, and remove filler words. The cost is an extra LLM call per query. For high traffic systems, you might cache rewritten queries or use a smaller, specialized model.
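A sketch of the rewrite step with a cache in front of it follows. `call_llm` is a hypothetical callable that takes a system prompt and user text and returns a string. Substitute whatever client your stack provides. The cache key is the lowercased, whitespace-normalized query, which is an assumption that may be too aggressive if case carries meaning in your domain.

```python
from functools import lru_cache

REWRITE_SYSTEM_PROMPT = (
    "You are a search query rewriter. Take the user's question and rewrite it "
    "to be information dense and use the terminology found in our HR policy documents."
)


def make_rewriter(call_llm, cache_size=10_000):
    """Wrap an LLM call so that repeated queries are served from memory."""

    @lru_cache(maxsize=cache_size)
    def rewrite_normalized(normalized_query):
        return call_llm(REWRITE_SYSTEM_PROMPT, normalized_query).strip()

    def rewrite(raw_query):
        return rewrite_normalized(" ".join(raw_query.lower().split()))

    return rewrite


# Stand-in for a real model call, used only to show the caching behavior.
llm_calls = []

def fake_llm(system_prompt, user_text):
    llm_calls.append(user_text)
    return "remote employee vacation day accrual policy"

rewrite = make_rewriter(fake_llm)
first = rewrite("Vacation rules   remote")
second = rewrite("vacation rules remote")
```

Both calls return the same rewritten query, and the model is only invoked once.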

Retrieval: Finding the Right Chunks

The rewritten query is embedded using the same model that processed your chunks. The resulting vector is sent to the vector database. The database runs an ANN search and returns the top k chunks with the highest similarity scores. The similarity metric is usually cosine similarity or inner product.

You can also apply metadata filters at this stage. If you know the user only has access to certain document categories, you filter before or after the vector search. If the query mentions a date range, you filter chunks by their timestamp metadata. This reduces noise and improves relevance.
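As a sketch, here is the "filter before search" variant with an access-level field. The chunk layout (a dict with `id`, `vector`, and `access` keys) is hypothetical. Vector databases expose filtering through their own query syntax.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vec, chunks, allowed_access, k):
    """Drop chunks the user may not see, then rank what remains by inner product."""
    candidates = [c for c in chunks if c["access"] in allowed_access]
    candidates.sort(key=lambda c: dot(query_vec, c["vector"]), reverse=True)
    return candidates[:k]


chunks = [
    {"id": "c1", "vector": [0.9, 0.1], "access": "all"},
    {"id": "c2", "vector": [1.0, 0.0], "access": "hr_only"},
    {"id": "c3", "vector": [0.2, 0.8], "access": "all"},
]
visible = filtered_search([1.0, 0.0], chunks, allowed_access={"all"}, k=2)
visible_ids = [c["id"] for c in visible]
```

Filtering first guarantees that restricted chunks never enter the candidate set. Filtering after the search is simpler to bolt on, but it can leave you with fewer than k results when many of the top hits are filtered out.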

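The number k is a hyperparameter. Too small and you might miss the right chunk. Too large and you waste context window space and confuse the model. A typical starting point is to retrieve 10 to 20 chunks when they go straight into the prompt. If a reranking stage follows, retrieve a wider candidate set, since only the top few survive reranking.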

Reranking to Refine Results

The initial retrieval is fast but coarse. The ANN index uses a lightweight similarity function. It might return chunks that are vaguely related but not exactly what the user needs. Reranking fixes this.

A reranker is a more powerful model that takes the query and a candidate chunk and outputs a precise relevance score. Cross encoders are the usual choice. They process the query and chunk together through a transformer, allowing full attention between them. This is much more accurate than comparing separate embeddings. But it is also much slower. You cannot run a cross encoder over millions of chunks. So you use it only on the top 100 or so candidates from the initial retrieval.

The reranker reorders the chunks. The top few chunks after reranking are the ones that will go into the prompt. Some systems also use a diversity reranker to avoid returning near duplicate chunks. This ensures the model sees a variety of information.
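Maximal marginal relevance (MMR) is one well-known way to implement a diversity step. The article does not name a specific method, so this is an illustrative choice. It picks chunks one at a time, trading relevance against similarity to what has already been picked. The relevance scores and the pairwise similarity function below are invented for the example. In practice they would come from the reranker and from chunk embeddings.

```python
def mmr(candidates, relevance, similarity, top_n, lambda_=0.7):
    """Select top_n items, penalizing each candidate by its closest already-selected item."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lambda_ * relevance[c] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical reranker scores; "a_dup" is a near copy of "a".
relevance = {"a": 0.9, "a_dup": 0.88, "b": 0.6}
pair_similarity = {frozenset(("a", "a_dup")): 0.98}

def similarity(x, y):
    return pair_similarity.get(frozenset((x, y)), 0.1)

picked = mmr(["a", "a_dup", "b"], relevance, similarity, top_n=2)
```

Sorting by relevance alone would return `a` and `a_dup`. MMR returns `a` and `b`, giving the model two distinct pieces of information.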

Assembling the Prompt and Generating an Answer

You now have a set of relevant chunks. The final step is to build the prompt. A typical prompt template looks like this:

You are a helpful assistant. Use the following context to answer the user's question.
If you don't know the answer, say so.

Context:
{chunk1}
{chunk2}
{chunk3}

Question: {user_query}
Answer:

The model reads the context and the question, then generates an answer. The key instruction is to ground the answer in the provided context. Without that instruction, the model might ignore the chunks and rely on its own parametric knowledge. That defeats the purpose of RAG.

The model's response may include citations. Some systems ask the model to reference which chunks it used. This helps with debugging and trust. If the answer is wrong, you can trace it back to a specific chunk and see if the chunk was irrelevant or the model misinterpreted it.
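Assembling the prompt is plain string formatting. The sketch below reuses the template from above with one addition that is not in the original template: each chunk is prefixed with a bracketed number so the model has something to cite. The function name is illustrative.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Use the following context to answer the user's question.
If you don't know the answer, say so.

Context:
{context}

Question: {user_query}
Answer:"""


def build_prompt(chunks, user_query):
    """Number the chunks and drop them into the template."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, user_query=user_query)


prompt = build_prompt(
    ["Remote employees accrue 1.5 vacation days per month.",
     "Accrual starts on the first day of employment."],
    "How many vacation days do remote employees get?",
)
```

If you keep a mapping from those numbers back to chunk ids, a cited `[2]` in the answer can be traced to the exact source passage. The policy text in the usage example is made up.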

Where RAG Breaks Down

RAG is not a cure for hallucination. It is a way to give the model better inputs. If retrieval fails, the model can still produce a plausible-sounding but incorrect answer. The failure might be even harder to spot, because the answer is built on retrieved text that looks topically related to the question.

Retrieval can fail for many reasons. Poor chunking splits a key fact across two chunks, and only one is retrieved. The embedding model does not capture the nuance of a domain-specific term. The user's query requires multi-hop reasoning across several documents, but the system only retrieves chunks from one. The vector index returns chunks that are topically similar but do not actually answer the question. The reranker is not trained on your domain and makes bad choices.

Fixing these failures is an engineering discipline. You need evaluation sets with ground truth answers. You measure retrieval recall. You measure answer faithfulness. You look at failure cases and adjust chunking, embeddings, or retrieval parameters. You might add a knowledge graph to handle multi-hop questions. You might fine-tune the embedding model on your data. You might add a verification step where the model checks its own answer against the retrieved chunks.
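Retrieval recall is the easiest of these metrics to automate. A minimal sketch, assuming each evaluation example pairs the ranked chunk ids your system returned with the set of chunk ids a human marked as relevant:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("each evaluation example needs at least one relevant chunk")
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(eval_set, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in eval_set) / len(eval_set)


# Hypothetical evaluation set: two queries with hand-labeled relevant chunks.
eval_set = [
    (["c1", "c4", "c2"], {"c1", "c2"}),
    (["c9", "c8", "c7"], {"c5"}),
]
recall_at_2 = mean_recall_at_k(eval_set, k=2)
recall_at_3 = mean_recall_at_k(eval_set, k=3)
```

Tracking this number before and after a chunking or embedding change tells you whether the change helped retrieval, independently of what the language model does with the results.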

The feedback loop is critical. Log which chunks were retrieved and which were actually used. Log queries that got low-confidence answers. Use that data to improve the offline pipeline. Re-chunk documents that cause problems. Add missing content. RAG is not a set-and-forget system. It requires ongoing tuning.

Connecting the Pieces

A RAG system is a pipeline of transformations. Raw documents become chunks. Chunks become vectors. Vectors become a searchable index. A user query becomes a rewritten query. That query becomes a vector. That vector pulls chunks from the index. Those chunks become part of a prompt. That prompt becomes an answer.

Each transformation is a place where things can go wrong. But each is also a place where you can intervene and improve. The separation of offline and online pipelines gives you the flexibility to experiment. You can swap embedding models without changing the generation code. You can add reranking without touching the vector index. You can adjust chunking without retraining the language model.

The next time you hear someone say "just add RAG," you will know what that really means. It means building a librarian, organizing a catalog, and teaching a writer to read the books you hand it. It is not magic. It is engineering. And like all engineering, it rewards careful thought and rigorous testing.
