This time, I’m looking back at the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” by Lewis et al. from Facebook AI Research (published 2020, v4 2021). Although it’s a few years old now, it laid out some foundational ideas for combining large language models with external knowledge retrieval. The core concept is to augment the generation process by first retrieving relevant documents and then conditioning the language model on both the original input and the retrieved text.

Looking at the architecture diagram felt quite familiar – encoding queries and documents, finding similar documents via vector search (maximum inner product search, MIPS), and feeding them into a generator. However, the paper’s framing around combining “parametric memory” (knowledge stored in the LLM’s weights) and “non-parametric memory” (the external document index) was an interesting perspective.

The paper explores how to effectively combine these two memory types, proposing specific methods for training and decoding, particularly focusing on sequence-to-sequence tasks.


Key Concepts

Hybrid Memory: Parametric + Non-Parametric
RAG models explicitly combine two types of knowledge storage:

  • Parametric Memory: The knowledge implicitly learned and stored within the parameters of a pre-trained sequence-to-sequence model (like BART in the paper).
  • Non-Parametric Memory: An external knowledge source, typically a large corpus of text (like Wikipedia), indexed for fast retrieval. In RAG, this is a dense vector index accessed via a neural retriever (Dense Passage Retrieval, DPR, in the paper); a small index-building sketch follows this list.
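
For concreteness, the non-parametric memory is essentially a big matrix of passage embeddings computed once, offline. Here is a minimal sketch, where encode_doc is a placeholder for whatever document encoder is used (a BERT-based DPR encoder in the paper):

```python
import numpy as np

def build_dense_index(passages, encode_doc):
    """Embed every passage once and stack the vectors into the document index.

    passages:   list of text chunks (the paper splits Wikipedia into 100-word chunks)
    encode_doc: placeholder for the document encoder, mapping text -> 1-D embedding
    The resulting (num_passages, dim) matrix is the non-parametric memory; at query
    time it is searched by inner product (MIPS), e.g. via a FAISS index.
    """
    return np.stack([encode_doc(p) for p in passages])
```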

Core Architecture
The system consists of two components (a rough end-to-end sketch follows the list):

  • Retriever (p_η(z|x)): Takes an input x and retrieves a set of relevant documents z from the non-parametric memory. This involves a query encoder and a document index of pre-computed passage embeddings. In the paper, the document encoder and index stay fixed during training; only the query encoder is fine-tuned (jointly with the generator).
  • Generator (p_θ(y|x, z)): A sequence-to-sequence model (like BART) that takes the original input x and a retrieved document z to generate the output sequence y.
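
To make the two components concrete, here is a rough sketch of the retrieve-then-generate flow in NumPy. The names encode_query, doc_embeddings, docs, and generate are placeholders (a DPR query encoder, the pre-built index, the passage texts, and a BART-style generator in the paper); the point is just the MIPS-style top-k lookup followed by conditioning the generator on each retrieved passage.

```python
import numpy as np

def retrieve_top_k(query_vec, doc_embeddings, k=5):
    """MIPS-style retrieval: score every passage by inner product with the query.

    doc_embeddings: (num_docs, dim) matrix of pre-computed passage embeddings.
    Returns the indices of the k best passages and a softmax over their scores,
    which plays the role of p_η(z|x) restricted to the top-k documents.
    """
    scores = doc_embeddings @ query_vec              # inner products, shape (num_docs,)
    top_idx = np.argsort(-scores)[:k]                # k largest inner products
    top_scores = scores[top_idx]
    p_eta = np.exp(top_scores - top_scores.max())    # softmax over the top-k scores
    return top_idx, p_eta / p_eta.sum()

def rag_candidates(x, encode_query, doc_embeddings, docs, generate, k=5):
    """Retrieve top-k passages for input x, then condition the generator on each one."""
    top_idx, p_eta = retrieve_top_k(encode_query(x), doc_embeddings, k)
    hypotheses = [generate(x, docs[i]) for i in top_idx]   # one hypothesis per passage
    return hypotheses, p_eta
```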

RAG-Sequence vs. RAG-Token Models
The paper proposes two main variants based on how retrieval and generation interact (the formulas are written out after the list):

  • RAG-Sequence: Retrieves a single set of documents based on the input x and uses the same document z (from the retrieved set) to generate the entire output sequence y. The final probability p(y|x) involves marginalizing (summing) the sequence probability p_θ(y|x, z) over the top-k retrieved documents z, weighted by the retriever probability p_η(z|x).
  • RAG-Token: Can potentially use a different document z for each token y_i being generated. At each step i, it calculates the probability of the next token by marginalizing over the top-k documents, conditioned on x and the previously generated tokens y_{1:i-1}. The final sequence probability is the product of these per-token probabilities.
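
Written out (my transcription of the paper’s marginalization, restricted to the top-k retrieved documents), the two variants differ only in where the sum over documents sits relative to the product over tokens:

```latex
% RAG-Sequence: one document z is responsible for the whole output sequence
p_{\text{RAG-Sequence}}(y \mid x) \;\approx\;
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: the document is marginalized out independently at every token
p_{\text{RAG-Token}}(y \mid x) \;\approx\;
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\,
  p_\theta(y_i \mid x, z, y_{1:i-1})
```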

Decoding Strategies for RAG-Sequence
Because the RAG-Sequence likelihood p(y|x) involves a sum over documents, it doesn’t factorize neatly per token, so standard beam search can’t be applied directly. The paper proposes two options (compared in a small scoring sketch after this list):

  • Thorough Decoding: Run beam search separately for each of the top-k documents z, generating a set of candidate sequences Y. For each candidate y in Y, calculate its full probability p_θ(y|x, z_i) for every document z_i in the top-k set. If y wasn’t found in the beam search for a specific z_i, run an “additional forward pass” to compute this probability. Finally, calculate the marginal score for y by summing p_η(z_i|x) * p_θ(y|x, z_i) across all z_i.
  • Fast Decoding: An approximation to speed things up. After generating the candidate set Y from the per-document beam searches, assume p_θ(y|x, z_i) ≈ 0 if y did not appear in the beam search results for document z_i. This avoids the need for additional forward passes.
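
Here is a small scoring sketch for the final marginalization step, assuming the per-document beam searches have already produced beam_results[i]: a dict mapping hypotheses to their sequence probabilities p_θ(y|x, z_i). The seq_prob argument is a hypothetical helper that runs the extra forward pass; passing None gives the Fast Decoding behaviour.

```python
def score_candidates(candidates, p_eta, docs, beam_results, x, seq_prob=None):
    """Marginal RAG-Sequence score: Score(y) = Σ_i p_η(z_i|x) * p_θ(y|x, z_i).

    candidates:    union of hypotheses found across all per-document beams
    p_eta:         retriever probabilities for the top-k documents
    beam_results:  beam_results[i] maps hypotheses from beam i to p_θ(y|x, z_i)
    seq_prob:      optional helper (x, z, y) -> p_θ(y|x, z) used when a hypothesis
                   is missing from beam i (Thorough Decoding); if None, the missing
                   probability is approximated as zero (Fast Decoding).
    """
    scores = {}
    for y in candidates:
        total = 0.0
        for i, weight in enumerate(p_eta):
            if y in beam_results[i]:
                p = beam_results[i][y]           # already computed during beam search
            elif seq_prob is not None:
                p = seq_prob(x, docs[i], y)      # Thorough: one additional forward pass
            else:
                p = 0.0                          # Fast: skip the extra forward pass
            total += weight * p
        scores[y] = total
    return max(scores, key=scores.get), scores   # best hypothesis plus all marginal scores
```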

Key Takeaways (What I Learned)

Parametric/Non-Parametric Framing and Symbolic AI Links
The paper’s distinction between parametric and non-parametric memory resonated with ongoing discussions about pure neural vs. hybrid AI systems (like those involving LeCun or Chollet). RAG explicitly incorporates a non-parametric, retrieval component (which feels somewhat symbolic, like a lookup or search) alongside the parametric LLM. Seeing this framing in a relatively early paper helped contextualize the idea that many powerful “LM” systems aren’t purely parametric end-to-end functions but incorporate structured components.

Untangling RAG-Sequence Decoding
This was the most complex part for me initially. The key steps that became clearer through discussion were:

  1. Run separate beam searches conditioned on each top-k document z_i.
  2. Collect all unique hypotheses y generated across all these beams into a set Y.
  3. For each hypothesis y in Y, calculate its final score by summing its weighted probability across all top-k documents: Score(y) = Σ [ p_η(z_i|x) * p_θ(y|x, z_i) ].
  4. The tricky part: If a specific y wasn’t found in the beam search for a specific z_i, “Thorough Decoding” requires calculating that missing p_θ(y|x, z_i).
  5. How the “Additional Forward Pass” Works: This isn’t about retrieving more documents. It means taking the generator model, feeding it x and z_i, and forcing it to generate the sequence y token-by-token. At each step j, you look at the probability the model’s softmax layer assigned to the actual token y_j (even if it wasn’t the most likely token). Multiplying these probabilities gives the sequence probability p_θ(y|x, z_i) (see the sketch after this list). This probability might be low, but it’s non-zero. “Fast Decoding” just approximates these low probabilities as zero to save computation.
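
A minimal sketch of that forced pass, assuming a hypothetical next_token_logprobs(x, z, prefix) helper that returns the generator’s log-probabilities over the vocabulary for the next position:

```python
import math

def forced_sequence_prob(next_token_logprobs, x, z, y):
    """p_θ(y|x, z) via teacher forcing: feed x and z, then force the tokens of y.

    next_token_logprobs(x, z, prefix) is a hypothetical helper returning a mapping
    from each vocabulary token to its log-probability at the next position.
    """
    total_logprob = 0.0
    for j, token in enumerate(y):
        logprobs = next_token_logprobs(x, z, y[:j])  # condition on the forced prefix y_1..y_{j-1}
        total_logprob += logprobs[token]             # prob of the actual token y_j, not the argmax
    return math.exp(total_logprob)                   # typically small, but non-zero
```

This is exactly the quantity Fast Decoding approximates as zero when y didn’t appear in the beam for z_i.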

Beam Search for Task-Specific Accuracy vs. LM Generation
It clicked that the use of beam search here feels different from typical open-ended LM generation. For tasks like QA where RAG is often applied, there’s usually a more specific target answer. Beam search helps explore different generation paths conditioned on different evidence documents to find the overall most probable correct answer according to the model’s combined parametric and non-parametric knowledge. This contrasts with sampling strategies in generative LMs aiming for creativity or diversity rather than a single best factual output.

The Synthesis Advantage
The paper (and our discussion) highlighted a key benefit of RAG over purely extractive QA systems. Because RAG generates the final answer based on retrieved context, it can synthesize information or rephrase findings from multiple documents. An extractive system can only return verbatim spans. This ability to combine evidence seems powerful, as shown in examples where RAG might pull different facts or phrasings from different retrieved passages to construct the final answer.


Summary & Final Thoughts

The RAG paper introduced a compelling framework for combining the generative power of large sequence-to-sequence models with the vast knowledge stored in external text corpora. By retrieving relevant documents first and conditioning generation on them, RAG aims to produce more factual and specific outputs, especially for knowledge-intensive tasks.

The distinction between the RAG-Sequence and RAG-Token approaches, and particularly the detailed decoding strategies proposed for RAG-Sequence (Thorough vs. Fast), show that effectively integrating retrieval requires careful thought beyond just concatenating text. It’s more complex than a simple embed-retrieve-generate pipeline, involving marginalization and specific search/scoring techniques like beam search per document.

Reflecting on it now, RAG feels like an important step in making LLMs more grounded and verifiable, bridging the gap between purely parametric models and external knowledge sources. It also highlights the ongoing evolution of AI architectures, moving towards hybrid systems that leverage different kinds of computation and memory.