Let’s talk about a paper at the heart of NLP: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. I wanted to know why everyone went crazy about BERT, especially since the actual architecture seems straightforward. It turned out to be simpler than I expected, and understanding it helped explain why later models, like RoBERTa, made the changes they did.


Key concepts

Bidirectional representation (masked language modeling)

Instead of predicting the next word like GPT, BERT randomly masks tokens in the input sequence and forces the model to predict them. This lets BERT look at context from both the left and right at the same time, so it can learn bidirectional representations.

Next Sentence Prediction (NSP)

BERT adds another pre-training objective called NSP. Given two sentences, the model predicts whether the second one actually follows the first in the source text or was picked at random from the corpus. The idea was to help BERT learn relationships between sentences, which should improve tasks like question answering and inference.
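To make the NSP setup concrete, here is a minimal sketch of how one training pair could be built. The function name `make_nsp_example` and the toy corpus format (a list of documents, each a list of sentences) are my own assumptions for illustration, not the paper's code. The 50/50 split between real and random pairs follows the paper.

```python
import random

def make_nsp_example(docs, rng=None):
    """Build one (sentence_a, sentence_b, is_next) example.

    docs: at least two documents, each a list of at least two sentences.
    This layout is an assumption made for the sketch.
    """
    rng = rng or random.Random()
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    i = rng.randrange(len(doc) - 1)  # leave room for a following sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        # Positive pair: the sentence that really comes next.
        return sentence_a, doc[i + 1], True
    # Negative pair: a random sentence from a different document.
    other_idx = rng.choice([j for j in range(len(docs)) if j != doc_idx])
    return sentence_a, rng.choice(docs[other_idx]), False
```

The model then gets both sentences in one input and is trained as a binary classifier on `is_next`.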

Masked token prediction and noise

BERT randomly selects 15% of tokens as prediction targets, but with a twist. Of those selected tokens, 80% are replaced by [MASK], 10% are replaced with a random token, and 10% are left unchanged. Since [MASK] never appears during fine-tuning, this mixture keeps the model from learning useful representations only at [MASK] positions.
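The selection-and-replacement rule above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: `mask_tokens` is a hypothetical helper, each position is selected by an independent coin flip, and special tokens are not treated separately.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) using the 80/10/10 rule.

    labels[i] is the original token at selected positions, None elsewhere.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: no prediction target here
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return corrupted, labels
```

The loss is computed only at positions where `labels[i]` is not `None`, including the positions whose input token was left unchanged.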

Fine-tuning over feature-based methods

Previous methods (like ELMo) were mainly feature-based: a model is trained to produce embeddings, and those embeddings are then fed into separate task-specific models. BERT popularized the fine-tuning approach: you pre-train one model and directly adjust its parameters for each downstream task, which makes the whole pipeline simpler and more effective.


What I learned

Why bidirectionality is better

At first glance, bidirectionality seems obviously better than a single-direction model like GPT. But a standard language model can’t simply condition on both sides: in a multi-layer network, each token would indirectly “see” itself through its neighbors, making the prediction trivial. That is why earlier approaches were either strictly left-to-right or, like ELMo, combined separate left-to-right and right-to-left models. BERT’s masking trick bypasses this issue by hiding the tokens the model has to predict.
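One way to picture the difference is through attention masks. The sketch below is only an illustration in plain Python (the helper names are mine): it builds the visibility pattern for a left-to-right model and for a bidirectional encoder.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every position, left and right."""
    return [[True] * n for _ in range(n)]

def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)

print(show(causal_mask(4)))
print()
print(show(bidirectional_mask(4)))
```

With the causal pattern, the model never sees the token it is about to predict. With the full pattern, every token is visible everywhere, so the target has to be hidden in the input itself, which is what masking does.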

NSP: sensible at first, questionable later

The NSP task seemed sensible: training the model to understand relationships between sentence pairs should help downstream tasks like QA and NLI. But later research, notably RoBERTa, showed that NSP wasn’t very helpful. RoBERTa’s ablations found that removing NSP matched or slightly improved downstream performance, so they dropped it completely. This suggests that NSP contributed little, and might even have introduced unnecessary bias or noise into the model’s representations.

Robustness through masking

The masking strategy, including replacing tokens with random or unchanged tokens, seemed odd at first. But the reasoning makes sense: if the model only ever had to predict at [MASK] positions, it could ignore the identity of every other token, and [MASK] never shows up during fine-tuning. Because a selected token is sometimes left unchanged or swapped for a random one, the model can’t tell which positions it will be asked about, so it has to keep a good contextual representation of every token. It’s a subtle trick.

Understanding RoBERTa through BERT

Right after reading BERT, I jumped into RoBERTa. RoBERTa basically says BERT was good but undertrained and overly complicated with NSP. They dropped NSP entirely, trained longer with larger batches, used dynamic masking (sampling a new mask each time a sequence is fed to the model), and used a larger and more diverse corpus. Unsurprisingly, performance improved. It clarified to me why RoBERTa, rather than the original BERT, became the go-to choice.
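The difference between static and dynamic masking is easy to see in a toy sketch. The function names here are hypothetical, and the sketch only tracks which positions get selected. The static version masks once during preprocessing and reuses that pattern, while the dynamic version samples a new pattern on every pass.

```python
import random

def pick_positions(n_tokens, rng, select_prob=0.15):
    """Choose the set of positions to mask for one pass over a sequence."""
    return frozenset(i for i in range(n_tokens) if rng.random() < select_prob)

def static_masking(n_tokens, n_epochs, seed=0):
    """Mask once up front; every epoch sees the same pattern."""
    fixed = pick_positions(n_tokens, random.Random(seed))
    return [fixed for _ in range(n_epochs)]

def dynamic_masking(n_tokens, n_epochs, seed=0):
    """Sample a fresh pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [pick_positions(n_tokens, rng) for _ in range(n_epochs)]
```

Over many epochs, the dynamic version exposes the model to far more distinct prediction targets from the same text.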

Decoder vs. encoder models

After understanding BERT and RoBERTa, it struck me that decoder models like GPT took over partly because their training objective, predicting the next token, is simpler, more scalable, and more versatile. Masked language modeling is powerful, but it creates a gap between pre-training and fine-tuning because the [MASK] token doesn’t appear during fine-tuning. GPT’s approach naturally aligns training and inference.

Connection with recent work (Fill-in-the-Middle)

I also remembered a recent but less-known paper, “Fill-in-the-Middle”, that tries a similar idea with decoder-only models. It showed that a decoder can learn to fill in a missing span in the middle of a text, somewhat like BERT, without architectural changes: the span is moved to the end of the sequence, so ordinary next-token prediction still applies. It felt like a nod to BERT’s approach, adapted to modern decoder-only models.
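Here is a rough sketch of that data transformation as I understand it. The sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the function name are placeholders I made up, and the prefix-suffix-middle ordering is one possible arrangement, not necessarily the exact format used in the paper.

```python
import random

# Placeholder sentinel tokens (assumed names, for illustration only).
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(tokens, rng=None):
    """Rearrange a token list so the middle span comes last.

    A decoder trained left to right on the result learns to generate
    the middle while conditioning on both the prefix and the suffix.
    """
    rng = rng or random.Random()
    # Two random cut points split the document into three spans.
    a, b = sorted(rng.randrange(len(tokens) + 1) for _ in range(2))
    prefix, middle, suffix = tokens[:a], tokens[a:b], tokens[b:]
    return [PRE] + prefix + [SUF] + suffix + [MID] + middle
```

Nothing about the model changes here. Only the order of the training text does, which is why the usual next-token objective keeps working.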


Summary

BERT is simpler than it looks. It sidesteps the limitations of unidirectional language models by randomly masking tokens, giving the model access to both left and right context. Although Next Sentence Prediction ended up being unnecessary, as RoBERTa later showed, BERT’s main idea still mattered a lot.

The bigger lesson for me is that a small objective change, like masking instead of next-token prediction, can shift what a model is good at. But BERT also shows the opposite lesson: an intuitive feature like NSP can still turn out to be unnecessary. Sometimes the simpler version wins.