I recently went back to read the 2014 paper “Sequence to Sequence Learning with Neural Networks” by Sutskever, Vinyals, and Le. Since it’s such a seminal paper (practically ancient by today’s ML standards), I thought it would be interesting to look back. The method itself (an LSTM encoder-decoder) is simple compared to modern architectures, but it was insightful to follow the authors’ thinking at a time when deep learning was still in its infancy. Some things they mention almost casually felt non-trivial to me now.

Here are some of my main takeaways:

Refreshing Views on DNN Power

The paper starts by framing Deep Neural Networks (DNNs) in ways I hadn’t explicitly considered before. They state:

“DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps.”

This sentence, while true and maybe obvious in retrospect, struck me. Of course, we know matrix multiplications are parallelizable and run well on GPUs, but thinking about it from the perspective of individual neurons performing computations in parallel felt like a refreshing angle on why NNs are inherently suited for this hardware.

They also highlight:

“…their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size…”

Again, a specific example of computational power packed into a relatively simple network that I hadn’t really internalized.

The Core Problem and the LSTM Solution

The authors clearly state the limitation they were tackling:

“Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality.”

This was a major hurdle for many important tasks like machine translation or question answering, where sequence lengths vary. This is where the Long Short-Term Memory (LSTM) network comes in.

For me, the biggest contribution of this paper is that they took the LSTM and successfully trained an encoder-decoder architecture at scale on a difficult task (English-to-French translation). They showed that LSTMs weren’t just a theoretical curiosity; they were practical for real-world, large-scale NLP problems. It was one of the first papers to demonstrate that LSTMs could work this way, paving the way for much of the research that followed.
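
To make the architecture concrete, here is a minimal sketch of an LSTM encoder-decoder in PyTorch. This is not the paper’s actual setup (they used 4-layer LSTMs with 1000 cells per layer and 1000-dimensional word embeddings); the class, the names, and the hyperparameters below are my own, chosen only to show how a variable-length source sequence is compressed into a fixed-size state that then conditions the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy LSTM encoder-decoder sketch (the paper used 4 layers x 1000 cells)."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden=512, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the variable-length source; only the final hidden/cell states
        # are kept -- this is the fixed-dimensional summary of the whole input.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode the target conditioned on that state (teacher forcing:
        # the gold target tokens are fed as decoder inputs during training).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # logits over the target vocabulary

# Usage: a padded batch of 3 source sentences and 3 target prefixes.
model = Seq2Seq(src_vocab=10_000, tgt_vocab=10_000)
src = torch.randint(0, 10_000, (3, 12))   # (batch, src_len)
tgt = torch.randint(0, 10_000, (3, 15))   # (batch, tgt_len)
logits = model(src, tgt)                  # (3, 15, 10_000)
```

At inference time the decoder is instead unrolled one token at a time on its own predictions; the paper does this with a simple left-to-right beam search.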

The Input Reversal Trick

One specific technical detail that stood out was their trick of reversing the input sentence:

“…reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.”

My first reaction was: Okay, reversing the input brings the first source word closer to the first target word, which makes sense for translation since the beginning often sets the context. But what about the last words of the source sentence? Don’t they get pushed really far away from the end of the target sentence?

That’s a valid point, but I think it highlights a clever trade-off. Getting the beginning of the translation right is often very important; it lays the groundwork for the rest of the sentence. By reversing the input, they made it much easier for the optimization process (SGD) to “establish communication” between the early parts of the source and target sequences. The performance gains they reported (test perplexity dropping from 5.8 to 4.7, BLEU jumping from 25.9 to 30.6) suggest the trade-off was very effective, even if it seems counter-intuitive for the tail end of the sequences.
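
Just to make the trick tangible, here is a tiny preprocessing sketch (the helper name and the example pair are mine, not from the paper): only the source side is flipped, while the target side stays in its natural order.

```python
def reverse_source(pairs):
    """Reverse each source token list; leave the target untouched.

    With the source reversed, the first source word sits right next to the
    decoder's first step, so the "minimal time lag" between a word and its
    translation shrinks at the start of the sentence -- at the cost of a
    larger lag at the end.
    """
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

# Hypothetical example pair, purely for illustration.
pairs = [(["the", "cat", "sat"], ["le", "chat", "s'est", "assis"])]
print(reverse_source(pairs))
# [(['sat', 'cat', 'the'], ['le', 'chat', "s'est", 'assis'])]
```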

Final Thoughts

Reading this paper was a great reminder of how far the field has come, but also of the clear thinking that set the foundations. I really admire Ilya Sutskever’s way of thinking – when I listen to him on podcasts, he often speaks with such clarity and wisdom. Looking at this early work reinforces that impression. Maybe I should make a point of reading through more of his papers.