BERT - Review
Let’s talk about a paper that’s at the heart of NLP: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. I wanted to know why everyone went crazy over BERT, especially since its architecture looks so straightforward. It turned out to be even simpler than I expected, and understanding it shed some light on why later models (like RoBERTa) made the changes they did.
Key Concepts
Bidirectional Representation (Masked Language Modeling)
Instead of predicting the next word like GPT, BERT randomly masks tokens in the input sequence and forces the model to predict these masked tokens. This means BERT can look at context from both left and right simultaneously, allowing it to learn richer, bidirectional representations.
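To make that concrete for myself, here’s a minimal sketch (my own toy code, not the paper’s implementation) of the key property of the MLM objective: the loss is computed only at the positions that were hidden, while every position can still attend to the full context.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, input_ids, mask_positions):
    """Cross-entropy only at the masked positions (toy sketch).

    logits:         (batch, seq_len, vocab_size) model outputs
    input_ids:      (batch, seq_len) original, uncorrupted token ids, used as labels
    mask_positions: (batch, seq_len) boolean tensor, True where a token was masked
    """
    labels = input_ids.clone()
    labels[~mask_positions] = -100          # ignore unmasked positions in the loss
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

# toy example: batch of 1, sequence of 6 tokens, vocabulary of 100
logits = torch.randn(1, 6, 100)
input_ids = torch.randint(0, 100, (1, 6))
mask_positions = torch.tensor([[False, True, False, False, True, False]])
print(mlm_loss(logits, input_ids, mask_positions))
```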
Next Sentence Prediction (NSP)
BERT adds another pre-training objective called NSP: given two sentences, the model predicts whether the second actually follows the first in the corpus or was randomly paired with it. The idea was to help BERT learn relationships between sentences, which should improve tasks like question answering and inference.
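A rough sketch of how I picture the NSP training pairs being built (the 50/50 split is from the paper; the helper itself is just my own illustration):

```python
import random

def make_nsp_pairs(documents, rng=random.Random(0)):
    """Build (sentence_a, sentence_b, is_next) examples as NSP describes:
    half the time B really follows A, half the time B is a random sentence
    taken from a different document."""
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            a = doc[i]
            if rng.random() < 0.5:
                b, is_next = doc[i + 1], 1                     # genuine next sentence
            else:
                other = rng.choice([d for d in documents if d is not doc])
                b, is_next = rng.choice(other), 0              # random, unrelated sentence
            pairs.append((a, b, is_next))
    return pairs

docs = [["Sentence A1.", "Sentence A2.", "Sentence A3."],
        ["Sentence B1.", "Sentence B2."]]
print(make_nsp_pairs(docs))
```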
Masked Token Prediction and Noise
BERT randomly selects 15% of tokens for prediction, but with a twist: of those selected tokens, 80% are replaced by [MASK], 10% are replaced with random tokens, and 10% are left unchanged. This mixture prevents BERT from relying solely on trivial strategies, encouraging robustness.
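Here’s a toy version of that 80/10/10 rule; the [MASK] id and vocabulary size below are the usual bert-base-uncased values, assumed purely for illustration:

```python
import random

MASK_ID = 103          # [MASK] id in the standard bert-base-uncased vocab (assumed)
VOCAB_SIZE = 30522     # bert-base-uncased vocabulary size (assumed)

def mask_tokens(token_ids, rng=random.Random(0)):
    """BERT-style corruption: pick roughly 15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    token_ids = list(token_ids)
    labels = [-100] * len(token_ids)               # -100 = position not in the MLM loss
    for i, tok in enumerate(token_ids):
        if rng.random() < 0.15:                    # position selected for prediction
            labels[i] = tok                        # the original token is the label
            r = rng.random()
            if r < 0.8:
                token_ids[i] = MASK_ID             # 80%: replace with [MASK]
            elif r < 0.9:
                token_ids[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with random token
            # remaining 10%: keep the original token as-is
    return token_ids, labels

ids, labels = mask_tokens(range(2000, 2040))
print(ids)
print(labels)
```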
Fine-tuning over Feature-based
Previous methods (like ELMo) were mainly feature-based—training a model to produce embeddings and then feeding those embeddings into separate downstream models. BERT popularized the fine-tuning approach: you pre-train one model and directly adjust its parameters for downstream tasks, making the whole thing simpler and more effective.
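As a present-day illustration of that fine-tuning recipe, here’s a minimal sketch using the Hugging Face transformers library (obviously not part of the paper itself): the whole pre-trained network, not just a frozen embedding layer, gets updated for the downstream task.

```python
# pip install torch transformers  (a minimal sketch, not the paper's code)
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()  # fine-tuning: every pre-trained weight receives gradients

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # small classification head on top of [CLS]
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```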
Key Takeaways (What I Learned)
Why Bidirectional is Better
At first glance, bidirectionality seems obviously better than a unidirectional model (like GPT). But traditional language models couldn’t be trained bidirectionally, because in a deep bidirectional setup each token would indirectly “see itself” through the surrounding positions, making prediction trivial. BERT’s masking trick neatly bypasses this issue by hiding tokens at random and only predicting those. It’s simple yet effective, and I finally understood why it was a big deal: previous methods were fundamentally restricted by directionality.
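The way I picture the restriction: a left-to-right LM enforces directionality with a causal attention mask, whereas BERT lets every position attend everywhere and hides the targets in the input instead. A toy illustration:

```python
import torch

seq_len = 5

# Left-to-right LM: position i may only attend to positions <= i (lower-triangular mask).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# BERT-style MLM: every position attends to every other position; the target token
# is hidden by corrupting the *input* (e.g. [MASK]), not by restricting attention.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```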
NSP—Initially Helpful, Eventually Questioned
The NSP task seemed sensible—training the model to understand relationships between sentence pairs should help downstream tasks like QA and NLI. However, later research (notably RoBERTa) showed that NSP actually wasn’t very helpful. RoBERTa dropped NSP completely and got better performance. This suggests that NSP might have introduced unnecessary bias or noise into the model’s representations.
Robustness Through Masking
The masking strategy, with its occasional random or unchanged tokens, seemed odd at first. But the reasoning makes sense: if every prediction position were marked with [MASK], the model would meet a symbol during pre-training that never appears at fine-tuning time, and it could learn that any unmasked token is always “correct”. By sometimes leaving a token unchanged or swapping in a random one, the model can’t tell which inputs to trust, so it has to maintain a good contextual representation of every token. It feels like they were trying to cover all bases for robustness. It’s an elegant yet somewhat subtle trick.
Understanding RoBERTa through BERT
Right after reading BERT, I jumped into RoBERTa. RoBERTa basically says BERT was good but undertrained and overcomplicated by NSP. They dropped NSP entirely, trained longer with dynamic masking (sampling a new mask each time a sequence is seen, instead of fixing it once during preprocessing), and used a larger and more diverse corpus. Unsurprisingly, performance improved. It clarified for me why RoBERTa, rather than the original BERT, became the go-to choice today.
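A toy contrast between the two masking schemes as I understand them (the 15% rate comes from BERT; everything else here is illustrative):

```python
import random

def sample_mask(seq_len, rng):
    """Pick which positions get masked this time (roughly 15% of positions)."""
    return [i for i in range(seq_len) if rng.random() < 0.15]

seq_len = 20

# Static masking (original BERT preprocessing): one fixed mask, reused every epoch.
static_mask = sample_mask(seq_len, random.Random(0))

# Dynamic masking (RoBERTa): a fresh mask is drawn each time the sequence is served.
for epoch in range(3):
    dynamic_mask = sample_mask(seq_len, random.Random(epoch))
    print("epoch", epoch, "static:", static_mask, "dynamic:", dynamic_mask)
```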
Decoder vs. Encoder Models
After understanding BERT and RoBERTa, it struck me that decoder models (like GPT) took over partly because their training objective (predicting the next token) is simpler, more scalable, and more versatile. Masked language modeling, while powerful, creates a gap between pre-training and fine-tuning (the [MASK] token doesn’t appear in fine-tuning). GPT’s approach naturally aligns training and inference, making generalization smoother.
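A toy illustration of why the decoder objective lines up so cleanly with inference: the labels are just the input shifted by one position, so nothing like [MASK] ever has to be injected into the input (the token ids below are arbitrary).

```python
import torch

token_ids = torch.tensor([101, 2023, 2003, 1037, 7279, 102])  # arbitrary ids, for illustration

# Decoder-style (GPT) objective: predict token t+1 from tokens <= t.
# Inputs and labels are the same sequence, offset by one; no artificial symbol is inserted.
lm_inputs = token_ids[:-1]
lm_labels = token_ids[1:]

print(lm_inputs.tolist())  # what the model sees
print(lm_labels.tolist())  # what it must predict at each position
```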
Connection with Recent Work (Fill-in-the-Middle)
I also remembered a recent but less-known paper, “Fill-in-the-Middle”, which tries a similar idea with decoder-only models. It showed that a decoder can learn to predict a span cut out of the middle of the text, simply by moving that span to the end of the training sequence, without any architectural changes. It felt like a nod to BERT’s approach, adapted to modern, decoder-only models. A neat trick for gaining infilling ability essentially for free.
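From what I recall, the trick is a data transformation rather than in-place masking: cut a span out of the middle and append it at the end, so an ordinary left-to-right decoder learns to generate it last. A toy sketch (the sentinel strings are made up for illustration, not the paper’s exact tokens):

```python
import random

def fim_transform(text, rng=random.Random(0)):
    """Fill-in-the-Middle style rearrangement (sketch): remove a middle span and
    move it to the end, so a plain left-to-right decoder learns to 'fill it in'."""
    i, j = sorted(rng.sample(range(1, len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

print(fim_transform("def add(a, b): return a + b"))
```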
Summary & Final Thoughts
BERT is simpler than it looks. It cleverly sidesteps the limitations of unidirectional language models by masking tokens at random, achieving deep bidirectionality. Although the Next Sentence Prediction objective ended up being unnecessary (as RoBERTa later showed), BERT’s main innovation, the masked language model, is undeniably clever and valuable. It laid the groundwork for showing that bidirectionality can matter, but it also revealed how small, seemingly good ideas (like NSP) can introduce unexpected biases.
In the end, BERT’s approach taught me two things: first, that subtle adjustments (like masking instead of predicting the next word) can fundamentally shift model capabilities; and second, that just because a feature seems intuitive (like NSP), it doesn’t guarantee it will enhance performance—sometimes, simplicity works best.