RAG - Review | Hun Tae Kim

This time, I’m looking back at “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” by Lewis et al. from Facebook AI Research, published in 2020 with v4 in 2021. Although it’s a few years old now, it laid out the basic idea behind combining language models with external knowledge retrieval. The core concept is to retrieve relevant documents first, then condition the generator on both the original input and the retrieved text.

Looking at the architecture diagram felt familiar: encode queries and documents, find similar documents with a vector search such as maximum inner product search (MIPS), and feed them into a generator. The paper’s framing around “parametric memory”, knowledge stored in the model’s weights, and “non-parametric memory”, the external document index, was the more interesting part for me.

The paper explores how to combine these two memory types, especially for sequence-to-sequence tasks.

Key concepts

Hybrid memory: parametric + non-parametric

RAG models explicitly combine two types of knowledge storage:

Parametric Memory: The knowledge implicitly learned and stored within the parameters of a pre-trained sequence-to-sequence model (like BART in the paper).
Non-Parametric Memory: An external knowledge source, typically a large corpus of text (like Wikipedia), indexed for fast retrieval. In RAG, this is often a dense vector index accessed via a neural retriever (like DPR).

Core architecture

The system generally consists of:

Retriever (p_η(z|x)): Takes an input x and retrieves the top-k relevant documents z from the non-parametric memory. This involves a query encoder and a document index of pre-computed document embeddings. In the paper, the query encoder is fine-tuned together with the generator, while the document encoder and the index stay fixed.
Generator (p_θ(y|x, z)): A sequence-to-sequence model (like BART) that takes the original input x and a retrieved document z to generate the output sequence y.
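To make the retriever side concrete, here is a minimal sketch of inner-product retrieval with NumPy. It assumes the query and document embeddings already exist; the function name retrieve_top_k and the 2-d toy vectors are made up for illustration, and the brute-force search stands in for the approximate MIPS index a real system would use. The softmax over the top-k scores plays the role of p_η(z|x).

```python
import numpy as np

def retrieve_top_k(query_vec, doc_matrix, k):
    """Return (indices, probabilities) of the k documents with the highest inner product."""
    scores = doc_matrix @ query_vec           # inner product of each document with the query
    top = np.argsort(-scores)[:k]             # exact search; a real index would approximate this
    top_scores = scores[top]
    # Softmax restricted to the top-k documents, standing in for p_eta(z|x).
    exp = np.exp(top_scores - top_scores.max())
    return top, exp / exp.sum()

# Hypothetical 2-d embeddings for four documents and one query.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query = np.array([1.0, 0.2])

idx, probs = retrieve_top_k(query, docs, k=2)
print(idx, probs)
```

The generator would then be run once per retrieved document, with probs used as the weights in the marginalization.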

RAG-Sequence vs. RAG-Token models

The paper proposes two main variants based on how retrieval and generation interact:

RAG-Sequence: Retrieves a single set of documents based on the input x and uses the same document z (from the retrieved set) to generate the entire output sequence y. The final probability p(y|x) involves marginalizing (summing) the sequence probability p_θ(y|x, z) over the top-k retrieved documents z, weighted by the retriever probability p_η(z|x).
RAG-Token: Can potentially use a different document z for each token y_i being generated. At each step i, it calculates the probability of the next token by marginalizing over the top-k documents, conditioned on x and the previously generated tokens y_{1:i-1}. The final sequence probability is the product of these per-token probabilities.
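The difference between the two variants is easiest to see with numbers. The sketch below is a toy calculation, not the paper's implementation: doc_probs stands for p_η(z|x) over two retrieved documents, and token_probs[z][i] stands for the generator's probability of the i-th target token given document z. All values are invented for illustration.

```python
import numpy as np

def rag_sequence_prob(doc_probs, token_probs):
    """Sum over documents of p(z|x) times the full sequence probability under that document."""
    seq_probs = token_probs.prod(axis=1)      # p_theta(y|x,z) for each document z
    return float(doc_probs @ seq_probs)

def rag_token_prob(doc_probs, token_probs):
    """Product over tokens of the per-token probability marginalized over documents."""
    per_token = doc_probs @ token_probs       # sum_z p(z|x) * p_theta(y_i|x,z,y_<i)
    return float(per_token.prod())

# Toy case: document 0 supports the first token, document 1 supports the second.
doc_probs = np.array([0.6, 0.4])
token_probs = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

p_seq = rag_sequence_prob(doc_probs, token_probs)
p_tok = rag_token_prob(doc_probs, token_probs)
print(p_seq, p_tok)
```

In this toy case RAG-Token assigns the sequence a higher probability, because it can lean on a different document for each token, while RAG-Sequence has to commit to one document for the whole output.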

Decoding strategies for RAG-Sequence

Because the RAG-Sequence likelihood p(y|x) involves a sum over documents, it doesn’t factorize neatly per token, making standard beam search difficult. The paper proposes:

Thorough Decoding: Run beam search separately for each of the top-k documents z, generating a set of candidate sequences Y. For each candidate y in Y, calculate its full probability p_θ(y|x, z_i) for every document z_i in the top-k set. If y wasn’t found in the beam search for a specific z_i, run an “additional forward pass” to compute this probability. Finally, calculate the marginal score for y by summing p_η(z_i|x) * p_θ(y|x, z_i) across all z_i.
Fast Decoding: An approximation to speed things up. After generating the candidate set Y from the per-document beam searches, assume p_θ(y|x, z_i) ≈ 0 if y did not appear in the beam search results for document z_i. This avoids the need for additional forward passes.
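The two decoding modes differ only in how they handle a hypothesis that a given document's beam did not produce. Here is a rough sketch under simplifying assumptions: the per-document beam results are given as dictionaries, and full_prob is a hypothetical callback that stands in for the additional forward pass. The names and numbers are mine, not the paper's.

```python
def score_candidates(doc_probs, beam_results, full_prob, thorough):
    """Marginal score for every hypothesis found in any per-document beam.

    doc_probs:    list of p_eta(z_i|x)
    beam_results: one dict per document, mapping hypothesis -> p_theta(y|x,z_i)
    full_prob:    callable (y, i) -> p_theta(y|x,z_i), the "additional forward pass"
    thorough:     True for Thorough Decoding, False for Fast Decoding
    """
    candidates = set().union(*(beam.keys() for beam in beam_results))
    scores = {}
    for y in candidates:
        total = 0.0
        for i, (p_z, beam) in enumerate(zip(doc_probs, beam_results)):
            if y in beam:
                p_y = beam[y]              # already computed during beam search
            elif thorough:
                p_y = full_prob(y, i)      # score the missing pair explicitly
            else:
                p_y = 0.0                  # Fast Decoding: treat as zero
            total += p_z * p_y
        scores[y] = total
    return scores

# Toy inputs: two documents, each with its own beam.
doc_probs = [0.5, 0.5]
beams = [{"Paris": 0.6, "Lyon": 0.3}, {"Paris": 0.5, "Marseille": 0.4}]
missing = {("Lyon", 1): 0.05, ("Marseille", 0): 0.1}   # invented forward-pass results

thorough_scores = score_candidates(doc_probs, beams, lambda y, i: missing[(y, i)], True)
fast_scores = score_candidates(doc_probs, beams, lambda y, i: missing[(y, i)], False)
print(thorough_scores, fast_scores)
```

Hypotheses found in every beam get the same score either way; only the ones missing from some beam are scored lower under Fast Decoding.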

What I learned

Parametric and non-parametric memory

The paper’s distinction between parametric and non-parametric memory resonated with discussions about pure neural vs. hybrid AI systems, like those involving LeCun or Chollet. RAG explicitly adds a non-parametric retrieval component, which feels somewhat symbolic, like lookup or search, alongside the parametric model. Seeing this framing in an early paper helped me think about how many powerful language-model systems are not purely parametric end-to-end functions; they often include structured components.

Untangling RAG-Sequence decoding

This was the most complex part for me initially. The key steps that became clearer through discussion were:

Run separate beam searches conditioned on each top-k document z_i.
Collect all unique hypotheses y generated across all these beams into a set Y.
For each hypothesis y in Y, calculate its final score by summing its weighted probability across all top-k documents: Score(y) = Σ_i p_η(z_i|x) * p_θ(y|x, z_i).
The tricky part: If a specific y wasn’t found in the beam search for a specific z_i, “Thorough Decoding” requires calculating that missing p_θ(y|x, z_i).
How the “additional forward pass” works: This isn’t about retrieving more documents. It means feeding the generator x and z_i and scoring the sequence y under teacher forcing, with the tokens of y supplied as decoder input rather than sampled. At each step j, you look at the probability the model’s softmax layer assigned to the actual token y_j (even if it wasn’t the most likely token). Multiplying these probabilities gives the sequence probability p_θ(y|x, z_i). This probability might be low, but it’s non-zero. “Fast Decoding” just approximates these low probabilities as zero to save computation.
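The forced scoring step can be sketched with a toy stand-in for the generator. Here toy_generator is a made-up function that returns a next-token distribution given the prefix; in a real setup this would be the seq2seq model conditioned on x and z_i, and the whole sequence would usually be scored in one batched pass rather than a Python loop.

```python
import math

def forced_sequence_prob(next_token_dist, target_tokens):
    """Probability of target_tokens when the model is forced along that exact path."""
    prefix = []
    log_prob = 0.0
    for token in target_tokens:
        dist = next_token_dist(tuple(prefix))   # distribution over the next token
        log_prob += math.log(dist[token])       # probability of the actual token, not the argmax
        prefix.append(token)
    return math.exp(log_prob)

def toy_generator(prefix):
    """Hypothetical next-token distributions, standing in for p_theta(. | x, z_i, prefix)."""
    if not prefix:
        return {"the": 0.7, "a": 0.2, "</s>": 0.1}
    if prefix[-1] == "the":
        return {"cat": 0.25, "dog": 0.7, "</s>": 0.05}
    return {"the": 0.1, "a": 0.1, "</s>": 0.8}

# "cat" is not the most likely continuation of "the", but it still has a probability.
p = forced_sequence_prob(toy_generator, ["the", "cat", "</s>"])
print(p)
```

The result is 0.7 * 0.25 * 0.8, a small but non-zero value, which is exactly the quantity Fast Decoding rounds down to zero.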

Beam search for task-specific accuracy

It clicked that the use of beam search here feels different from typical open-ended language model generation. For tasks like QA, where RAG is often applied, there’s usually a more specific target answer. Beam search helps explore different generation paths conditioned on different evidence documents to find the answer the model scores as most probable given its combined parametric and non-parametric knowledge. This contrasts with sampling strategies in generative language models, where creativity or diversity may matter more than finding one best factual output.

The synthesis advantage

The paper highlights a benefit of RAG over purely extractive QA systems. Because RAG generates the final answer based on retrieved context, it can synthesize information or rephrase findings from multiple documents. An extractive system can only return verbatim spans. This ability to combine evidence is useful, especially when the answer requires pulling different facts or phrasings from different retrieved passages.

Summary

The RAG paper introduced a framework for combining generative sequence-to-sequence models with knowledge stored in external text corpora. By retrieving relevant documents first and conditioning generation on them, RAG tries to produce more factual and specific outputs for knowledge-intensive tasks.

The distinction between RAG-Sequence and RAG-Token, especially the decoding strategies for RAG-Sequence, shows that retrieval integration is more complex than simply concatenating retrieved text. There is marginalization over documents, per-document beam search, and careful scoring.

Reflecting on it now, RAG is one way to make LLMs more grounded and verifiable. It is a hybrid system: part learned model, part external memory.