RAG - Review
This time, I’m looking back at “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” by Lewis et al. from Facebook AI Research, published in 2020 with v4 in 2021. Although it’s a few years old now, it laid out the basic idea behind combining language models with external knowledge retrieval. The core concept is to retrieve relevant documents first, then condition the generator on both the original input and the retrieved text.
Looking at the architecture diagram felt familiar: encode queries and documents, find similar documents with Maximum Inner Product Search (MIPS) over dense vectors, and feed them into a generator. The paper’s framing around “parametric memory”, knowledge stored in the model’s weights, and “non-parametric memory”, the external document index, was the more interesting part for me.
The paper explores how to combine these two memory types, especially for sequence-to-sequence tasks.
Key concepts
Hybrid memory: parametric + non-parametric
RAG models explicitly combine two types of knowledge storage:
- Parametric Memory: The knowledge implicitly learned and stored within the parameters of a pre-trained sequence-to-sequence model (like BART in the paper).
- Non-Parametric Memory: An external knowledge source, typically a large corpus of text (like Wikipedia), indexed for fast retrieval. In RAG, this is often a dense vector index accessed via a neural retriever (like DPR).
Core architecture
The system generally consists of:
- Retriever (p_η(z|x)): Takes an input x and retrieves a set of relevant documents z from the non-parametric memory. This involves a query encoder and a document index (often pre-computed document embeddings). The query encoder is typically fine-tuned.
- Generator (p_θ(y|x, z)): A sequence-to-sequence model (like BART) that takes the original input x and a retrieved document z to generate the output sequence y.
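To make the retriever step concrete, here is a minimal sketch of top-k retrieval by inner product. The paper defines p_η(z|x) as proportional to exp(d(z)ᵀq(x)), where d(z) and q(x) are the document and query embeddings. Everything else here is my own assumption for illustration: the function name retrieve_top_k, the toy 2-dimensional vectors, the brute-force scan in place of an approximate index, and normalizing the softmax over only the top-k documents.

```python
import math

def retrieve_top_k(query_vec, doc_vecs, k):
    """Return (indices, probs) for the k documents with the largest inner product.

    Brute-force MIPS: score every document, keep the k best, then turn the
    kept scores into a distribution with a softmax over the top-k only
    (an assumption of this sketch).
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    best = max(scores[i] for i in top)           # subtract the max for stability
    weights = [math.exp(scores[i] - best) for i in top]
    total = sum(weights)
    return top, [w / total for w in weights]

# Hypothetical embeddings: four documents and one query in 2 dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vec = [1.0, 0.2]

top, probs = retrieve_top_k(query_vec, doc_vecs, k=2)
print(top, probs)
```

The returned probabilities play the role of p_η(z|x) in the marginalization over documents. A real system would replace the linear scan with an approximate nearest-neighbor index.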
RAG-Sequence vs. RAG-Token models
The paper proposes two main variants based on how retrieval and generation interact:
- RAG-Sequence: Retrieves a single set of documents based on the input x and uses the same document z (from the retrieved set) to generate the entire output sequence y. The final probability p(y|x) involves marginalizing (summing) the sequence probability p_θ(y|x, z) over the top-k retrieved documents z, weighted by the retriever probability p_η(z|x).
- RAG-Token: Can potentially use a different document z for each token y_i being generated. At each step i, it calculates the probability of the next token by marginalizing over the top-k documents, conditioned on x and the previously generated tokens y_{1:i-1}. The final sequence probability is the product of these per-token probabilities.
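Written out, the two variants differ only in whether the sum over documents sits outside or inside the product over tokens. This is how I read the paper's definitions, with N standing for the length of the output sequence:

```latex
\begin{align}
p_{\text{RAG-Sequence}}(y \mid x) &\approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1}) \\
p_{\text{RAG-Token}}(y \mid x) &\approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
\end{align}
```

In the first form each document must explain the whole sequence before the results are mixed. In the second, the mixing happens at every token, so the result is an ordinary per-token distribution.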
Decoding strategies for RAG-Sequence
Because the RAG-Sequence likelihood p(y|x) sums over documents, it does not break into a conventional per-token likelihood, so a single beam search cannot solve it. The paper proposes:
- Thorough Decoding: Run beam search separately for each of the top-k documents z, generating a set of candidate sequences Y. For each candidate y in Y, calculate its full probability p_θ(y|x, z_i) for every document z_i in the top-k set. If y wasn’t found in the beam search for a specific z_i, run an “additional forward pass” to compute this probability. Finally, calculate the marginal score for y by summing p_η(z_i|x) * p_θ(y|x, z_i) across all z_i.
- Fast Decoding: An approximation to speed things up. After generating the candidate set Y from the per-document beam searches, assume p_θ(y|x, z_i) ≈ 0 if y did not appear in the beam search results for document z_i. This avoids the need for additional forward passes.
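The difference between the two strategies can be sketched with toy numbers. This is only an illustration under my own assumptions: the function marginal_scores, the callable score_fn standing in for the generator's extra forward pass, and all probabilities are made up, and the per-document beam searches are assumed to have already run.

```python
def marginal_scores(doc_probs, beam_probs, score_fn=None):
    """Score every candidate by summing p_eta(z_i|x) * p_theta(y|x, z_i) over documents.

    doc_probs:  list of retriever probabilities, one per top-k document.
    beam_probs: list of dicts; beam_probs[i][y] is p_theta(y|x, z_i) for each
                hypothesis y that appeared in the beam for document i.
    score_fn:   optional callable (y, i) -> p_theta(y|x, z_i), used when y is
                missing from beam i (thorough decoding). If None, missing
                entries count as zero (fast decoding).
    """
    candidates = set().union(*beam_probs)        # all unique hypotheses
    scores = {}
    for y in candidates:
        total = 0.0
        for i, (p_doc, beam) in enumerate(zip(doc_probs, beam_probs)):
            if y in beam:
                p_gen = beam[y]
            elif score_fn is not None:
                p_gen = score_fn(y, i)           # the extra forward pass
            else:
                p_gen = 0.0                      # fast decoding approximation
            total += p_doc * p_gen
        scores[y] = total
    return scores

# Hypothetical numbers: two documents, beam size 1 per document.
doc_probs = [0.6, 0.4]
beams = [{"Paris": 0.7}, {"Lyon": 0.5}]
# Made-up generator probabilities that a forward pass would reveal.
full_table = [{"Paris": 0.7, "Lyon": 0.2}, {"Paris": 0.3, "Lyon": 0.5}]

fast = marginal_scores(doc_probs, beams)
thorough = marginal_scores(doc_probs, beams, lambda y, i: full_table[i][y])
print(fast)      # Paris about 0.42, Lyon about 0.20
print(thorough)  # Paris about 0.54, Lyon about 0.32
```

In this toy case both strategies rank the candidates the same way, but fast decoding underestimates every score because it drops the cross-document terms.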
What I learned
Parametric and non-parametric memory
The paper’s distinction between parametric and non-parametric memory resonated with discussions about pure neural vs. hybrid AI systems, like those involving LeCun or Chollet. RAG explicitly adds a non-parametric retrieval component, which feels somewhat symbolic, like lookup or search, alongside the parametric model. Seeing this framing in an early paper helped me think about how many powerful language-model systems are not purely parametric end-to-end functions; they often include structured components.
Untangling RAG-Sequence decoding
This was the most complex part for me initially. The key steps that became clearer through discussion were:
- Run separate beam searches conditioned on each top-k document z_i.
- Collect all unique hypotheses y generated across all these beams into a set Y.
- For each hypothesis y in Y, calculate its final score by summing its weighted probability across all top-k documents: Score(y) = Σ_i [ p_η(z_i|x) * p_θ(y|x, z_i) ].
- The tricky part: If a specific y wasn’t found in the beam search for a specific z_i, “Thorough Decoding” requires calculating that missing p_θ(y|x, z_i).
- How the “Additional Forward Pass” Works: This isn’t about retrieving more documents. It means taking the generator model, feeding it x and z_i, and forcing it to generate the sequence y token-by-token. At each step j, you look at the probability the model’s softmax layer assigned to the actual token y_j (even if it wasn’t the most likely token). Multiplying these probabilities gives the sequence probability p_θ(y|x, z_i). This probability might be low, but it’s non-zero. “Fast Decoding” just approximates these low probabilities as zero to save computation.
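The forced scoring in the last step can be sketched as follows. This is a toy stand-in, not the paper's code: next_token_probs is a hypothetical callable that plays the role of the generator already conditioned on x and z_i, and the lookup table TOY_MODEL holds invented probabilities.

```python
import math

def forced_sequence_prob(next_token_probs, target_tokens):
    """Probability of target_tokens when the model is forced along that path.

    next_token_probs(prefix) must return a dict mapping each token to its
    probability given the forced prefix (the softmax output at that step).
    """
    log_p = 0.0
    prefix = []
    for token in target_tokens:
        dist = next_token_probs(tuple(prefix))
        log_p += math.log(dist[token])   # probability of the actual token y_j
        prefix.append(token)             # feed the target token, not the argmax
    return math.exp(log_p)

# Hypothetical generator conditioned on some (x, z_i), as a lookup table.
TOY_MODEL = {
    (): {"New": 0.6, "Paris": 0.4},
    ("New",): {"York": 0.9, "</s>": 0.1},
    ("New", "York"): {"</s>": 1.0},
    ("Paris",): {"</s>": 1.0},
}

def toy_next_token_probs(prefix):
    return TOY_MODEL[prefix]

# "Paris" is not the most likely first token here, so a narrow beam could
# miss it, yet forcing the path still yields a non-zero probability.
p_paris = forced_sequence_prob(toy_next_token_probs, ["Paris", "</s>"])
p_new_york = forced_sequence_prob(toy_next_token_probs, ["New", "York", "</s>"])
print(p_paris, p_new_york)
```

Summing log-probabilities and exponentiating at the end is the usual way to avoid underflow on long sequences; it is equivalent to multiplying the per-step probabilities.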
Beam search for task-specific accuracy
It clicked that beam search plays a different role here than in typical open-ended language model generation. For tasks like QA, where RAG is often applied, there’s usually a more specific target answer. Beam search explores different generation paths conditioned on different evidence documents to find the answer the model considers most probable given its combined parametric and non-parametric knowledge. This contrasts with sampling strategies in generative language models, where creativity or diversity may matter more than finding one best factual output.
The synthesis advantage
The paper highlights a benefit of RAG over purely extractive QA systems. Because RAG generates the final answer based on retrieved context, it can synthesize information or rephrase findings from multiple documents. An extractive system can only return verbatim spans. This ability to combine evidence is useful, especially when the answer requires pulling different facts or phrasings from different retrieved passages.
Summary
The RAG paper introduced a framework for combining generative sequence-to-sequence models with knowledge stored in external text corpora. By retrieving relevant documents first and conditioning generation on them, RAG tries to produce more factual and specific outputs for knowledge-intensive tasks.
The distinction between RAG-Sequence and RAG-Token, especially the decoding strategies for RAG-Sequence, shows that retrieval integration is more complex than simply concatenating retrieved text. There is marginalization over documents, per-document beam search, and careful scoring.
Reflecting on it now, RAG is one way to make LLMs more grounded and verifiable. It is a hybrid system: part learned model, part external memory.