Rethinking Sequence-to-Sequence - Review
Reading older papers often gives a clearer view of how current ideas developed. Recently, I went through the ICLR 2015 paper “Neural Machine Translation by Jointly Learning to Align and Translate” by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. It tackles a core problem in early sequence-to-sequence models for machine translation.
The main issue they identified was the “bottleneck” in the standard RNN encoder-decoder framework popular at the time. These models tried to compress the entire meaning of a source sentence, regardless of length, into a single fixed-length vector. As the paper noted, this made long sentences difficult; performance tended to drop significantly as sentences got longer.
Their proposed solution was to allow the decoder to look back at the source sentence and selectively focus on relevant parts when generating each target word. This avoids forcing all information through one fixed vector.
Key concepts
Here are the core ideas:
- The problem: fixed-length vector bottleneck: Standard encoder-decoders map an input sequence `x = (x_1, ..., x_{T_x})` to a fixed context vector `c`. The decoder then generates the output `y = (y_1, ..., y_{T_y})` based solely on `c` and previously generated words. This compression limits the model’s capacity, especially for long inputs.
- The solution: alignment mechanism: Instead of one `c`, the proposed model computes a distinct context vector `c_i` for each target word `y_i`. This `c_i` is a weighted sum of annotations `(h_1, ..., h_{T_x})` from the encoder. Each `h_j` corresponds to a source word `x_j`, or more precisely, the hidden state around it.
- How it works: alignment model and context vector:
  - The weight `a_{ij}` for each annotation `h_j` when generating `y_i` depends on how well the input around position `j` aligns with the output at position `i`.
  - These weights are calculated using an “alignment model” `a`, which takes the previous decoder hidden state `s_{i-1}` and the encoder annotation `h_j` as input to produce a score `e_{ij}`: `e_{ij} = a(s_{i-1}, h_j)`.
  - The weights `a_{ij}` are obtained by normalizing these scores with a softmax: `a_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik})`.
  - The context vector `c_i` is then the weighted sum: `c_i = Σ_j a_{ij} h_j`.
  - Crucially, the alignment model `a` (parameterized as a small feedforward network) is trained jointly with the rest of the system.
- Soft vs. hard alignment: The paper uses the term “soft alignment.” This contrasts with “hard alignment,” which would involve making a deterministic choice of which single source word aligns with the target word. Soft alignment uses a weighted average over all source annotations. This makes the mechanism differentiable and allows the model to learn alignments implicitly through backpropagation. It also handles cases where a target word depends on multiple source words, or vice versa.
- The encoder: bidirectional RNN (BiRNN): To ensure the annotation `h_j` captures context from both before and after the source word `x_j`, they used a BiRNN. This consists of a forward RNN processing the sequence from `x_1` to `x_{T_x}` and a backward RNN processing it from `x_{T_x}` to `x_1`. The annotation `h_j` is the concatenation of the forward hidden state `\overrightarrow{h}_j` and the backward hidden state `\overleftarrow{h}_j`. While BiRNNs weren’t new, their use here makes sense for creating richer annotations.
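To make the alignment steps above concrete, here is a minimal sketch of one decoder step in plain Python. It assumes the scorer `a` is a one-hidden-layer network of the form `v_a · tanh(W_a s_{i-1} + U_a h_j)`; the function names (`attend`, `alignment_score`), the toy dimensions, and the random weights and annotations are all placeholders for illustration, not the authors' code or trained values.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_score(s_prev, h_j, W_a, U_a, v_a):
    """e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j) (assumed scorer form)."""
    hidden = [math.tanh(p + q)
              for p, q in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
    return sum(v * z for v, z in zip(v_a, hidden))


def attend(s_prev, annotations, W_a, U_a, v_a):
    """Return (weights a_i, context c_i) for one decoder step."""
    scores = [alignment_score(s_prev, h_j, W_a, U_a, v_a) for h_j in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(weights, annotations))
               for d in range(dim)]
    return weights, context


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T_x, enc_dim, dec_dim, att_dim = 4, 3, 2, 5  # toy sizes, chosen arbitrarily

# Random stand-ins for BiRNN states; each annotation h_j is the
# concatenation of a forward and a backward hidden state.
forward = random_matrix(T_x, enc_dim, rng)
backward = random_matrix(T_x, enc_dim, rng)
annotations = [f + b for f, b in zip(forward, backward)]

s_prev = [rng.uniform(-1.0, 1.0) for _ in range(dec_dim)]
W_a = random_matrix(att_dim, dec_dim, rng)
U_a = random_matrix(att_dim, 2 * enc_dim, rng)
v_a = [rng.uniform(-1.0, 1.0) for _ in range(att_dim)]

weights, context = attend(s_prev, annotations, W_a, U_a, v_a)

# A "hard" alignment would instead keep only the best-scoring annotation.
hard_index = max(range(T_x), key=lambda j: weights[j])
hard_context = annotations[hard_index]

print("weights:", [round(a, 3) for a in weights])
print("hard choice:", hard_index)
```

The soft context is a blend of every annotation, so each step is differentiable with respect to the scorer's parameters. The hard variant picks one annotation through an argmax, which is why it cannot be trained by plain backpropagation.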
What I learned
Reflecting on the paper, several points stand out:
- Performance improvement on long sentences: The results clearly show the benefit. The standard RNNencdec model’s performance drops sharply with sentence length, while the proposed RNNsearch model remains much more robust. RNNsearch also scores higher in BLEU than RNNencdec overall, and on sentences without unknown words it is comparable to the phrase-based baseline (Moses) of the time.
- Interpretability via alignment: The alignment weights `a_{ij}` can be visualized. This gives some insight into what parts of the source sentence the model focuses on when generating a specific target word. The visualizations showed mostly monotonic alignments, as expected between English and French, but also the ability to handle local reordering, like adjective-noun flips. This interpretability is a nice side effect compared with trying to understand a monolithic RNN.
- Handling reordering and length differences: The soft alignment naturally deals with source and target phrases having different lengths or requiring non-trivial mappings, without needing explicit mechanisms like NULL tokens used in traditional SMT.
- Link to Transformers: Reading this after knowing about Transformers makes the connection clear. The core mechanism, scoring source annotations based on the current decoder state, using softmax for weights, and computing a weighted sum, is basically attention. The Transformer later built on this by removing recurrence and adding multi-head attention, positional encodings, and so on.
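As a rough illustration of that link, the sketch below runs the same three steps (score, softmax, weighted sum) with the scaled dot-product scoring used in the Transformer in place of a learned feedforward scorer. The function name `dot_product_attention` and the toy query, keys, and values are made up for this example; it covers a single query and a single head, with no learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, keys, values):
    """Score each key against the query, softmax, then average the values."""
    scale = math.sqrt(len(query))  # 1 / sqrt(d_k) scaling
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy data: the query points in the same direction as the second key.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [0.0, 4.0]

weights, output = dot_product_attention(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output:", [round(o, 3) for o in output])
```

In the 2015 model the query role is played by the previous decoder state `s_{i-1}`, and the annotations `h_j` serve as both keys and values. The main difference in this sketch is how the score is computed.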
Summary
This paper addressed a clear limitation in early NMT: the fixed-length vector bottleneck. The solution was straightforward but powerful: allow the decoder to learn where to focus in the source sequence. The “soft alignment” mechanism is, in essence, the attention mechanism that later became central to architectures like the Transformer.
Looking back now, the idea feels intuitive, but implementing it effectively and showing its benefits in 2014/2015 mattered. It’s a clear paper: problem, solution, evidence. Reading it helps connect older sequence-to-sequence models to the models we use today.