I had known about curriculum learning for a while, a concept popularized by Yoshua Bengio and colleagues in 2009. The idea is straightforward: order the training data so that the model sees easier examples first and progressively harder ones later. The hope is that this ordering helps the model learn and generalize.
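To make the idea concrete, here is a minimal sketch of a curriculum data loader. It is only an illustration under loud assumptions: the function name `curriculum_batches`, the staging scheme, and especially the use of word count as a difficulty score are all hypothetical choices of mine, not something taken from a particular paper or library.

```python
import random

def curriculum_batches(examples, difficulty, batch_size, num_stages=3, seed=0):
    """Yield batches stage by stage, easiest stage first.

    `difficulty` is a hypothetical scoring function: lower means easier.
    """
    rng = random.Random(seed)
    ordered = sorted(examples, key=difficulty)
    # Ceiling division so every example lands in some stage.
    stage_size = max(1, -(-len(ordered) // num_stages))
    for start in range(0, len(ordered), stage_size):
        stage = ordered[start:start + stage_size]
        rng.shuffle(stage)  # shuffle inside a stage, keep the stage order
        for i in range(0, len(stage), batch_size):
            yield stage[i:i + batch_size]

sentences = [
    "The cat sat.",
    "Dogs bark loudly at night.",
    "Gradient descent minimizes a differentiable loss function iteratively.",
    "I see a red ball.",
    "Entropy quantifies the expected surprise of a random variable's outcomes.",
    "She reads books every day.",
]

# Assumption: word count as a crude stand-in for "difficulty".
word_count = lambda s: len(s.split())
batches = list(curriculum_batches(sentences, word_count, batch_size=2))

for batch in batches:
    print(batch)
```

A standard training loop would shuffle everything together instead. The only difference here is that shuffling happens within a stage, so short sentences are consumed before long ones. What counts as "easy" for a language model is the genuinely hard part, and this sketch does not try to answer it.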

I wondered whether this could apply to large language models. For instance, you could start with basic concepts, like kindergarten-level material, and then gradually introduce more advanced subjects like mathematics or science. My intuition was that this kind of structure might lead to better generalization.

However, I couldn’t find much literature exploring this idea in depth for language models, apart from Microsoft’s Phi models. As I understand it, those models lean heavily on synthetic, elementary- and textbook-level data to teach a small model English, with some success.

For larger models trained on internet-scale data, the consensus seemed to be that data diversity would wash out any benefit of curriculum order: the model would eventually encounter all parts of the dataset anyway, regardless of order. I accepted that answer too quickly and stopped thinking about it.


What made me think again

My perspective shifted when I listened to Lex Fridman’s podcast interviews with Yann LeCun and Gary Marcus. It struck me that these interviews were recorded five years ago, in 2019. In a field that moves as fast as machine learning, five years is a long time.

Back in 2019, GPT-3 and ChatGPT didn’t exist yet, and the models of the day were much weaker at language understanding and common-sense reasoning. The assessments made by LeCun and Marcus may have been accurate for their time, but they feel outdated now.

The problem of outdated information

Current large language models are pre-trained on massive datasets drawn from many sources and time periods. Those datasets inevitably contain outdated information. For example, a statement from a prominent scientist in 2010 claiming that deep learning had hit a dead end might have sounded reasonable then, but it has clearly not held up.

During pre-training, language models are not explicitly told which statements are time-dependent. This raises a practical question: how should we handle outdated training data? Should we filter it out, or let it stay?

As time passes, the amount of newer information will increase. Could we simply dilute outdated data with newer data? The pre-training stage, as far as I understand, is relatively straightforward: next-token prediction and backpropagation across the corpus.

There doesn’t seem to be an inherent mechanism in that process for handling outdated information. I’m still unsure how to address it well.
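A toy example helps me see both points: why nothing in next-token prediction distinguishes old text from new, and what "dilution" would amount to. The sketch below uses a count-based next-token table, which is the maximum-likelihood solution for a model that conditions only on the previous token. It is a deliberately crude stand-in, not how an LLM is trained, and the two corpora are invented for illustration.

```python
from collections import Counter, defaultdict

def fit_next_token(corpus):
    """Maximum-likelihood next-token table: P(next | previous token).

    Note that no date or timestamp enters anywhere; every document
    contributes to the counts in exactly the same way.
    """
    counts = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }

# Hypothetical corpora: three older documents, one newer document.
old = ["deep learning is stalled"] * 3
new = ["deep learning is thriving"] * 1

# Before dilution the older claim dominates.
p_before = fit_next_token(old + new)["is"]

# "Dilution": the older documents stay, but newer ones now outnumber them.
p_after = fit_next_token(old + new * 9)["is"]

print(p_before)
print(p_after)
```

In this toy setting the learned probabilities simply track how often each continuation appears, so adding newer text shifts the balance but never removes the older claim. Whether a large neural model behaves like this frequency table is exactly the part I am unsure about.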

The paradox of conflicting information

Then again, the internet is already full of conflicting information, outdated or not. For instance, some sources claim that global warming is false, while others say it is true. There is also a vast amount of fiction, such as novels, that was never meant to be factual.

Despite this, current language models still learn something that resembles knowledge. They hallucinate, but they also seem to build a rough factual world model. Maybe the size of the pre-training corpus matters here. Conflicting information might partially cancel out, or at least give the model enough examples to learn different contexts and perspectives.


Reconsidering curriculum learning

This brings me back to the original question: is curriculum learning unnecessary for large language models? I suspect there is so little research on curriculum learning for LLMs because its impact is marginal.

If curriculum learning is not needed, that says something interesting about the generalization ability of large language models. They can learn from a chaotic sea of information without much explicit guidance.

I still need to refine my thoughts here. The relationship between dataset scale, conflicting information, and curriculum learning is more complicated than I first thought.