This time, it's about data. About a month ago, the HuggingFace team released the FineWeb dataset, an open dataset for training large language models (LLMs). This caught my attention because most of the data used by big tech companies like Google, Facebook, and OpenAI to train their models is kept secret. HuggingFace has done the heavy lifting of scraping web data and cleaning it up for public use. But here's where it gets interesting: they followed up with the FineWeb-Edu dataset. Using Llama 3 70B, they filtered the original dataset down to high-quality educational content suitable for middle and high school students. The results were impressive: these curated datasets proved more effective for training models than previous open datasets.

The Power of Quality Data

Andrej Karpathy's experiment further validated the importance of data quality. He trained GPT-2 from scratch on the FineWeb datasets and found that it outperformed GPT-3 when trained on the same number of tokens. This underscores a crucial point: high-quality data might be even more important than we initially thought.

Thinking from first principles, it's becoming clear that data and model architecture are everything in AI. The model is never better than its data; you can see this in action when conversing with Llama, ChatGPT, or Claude. Their distinct "vibes" and characteristics stem from their training data and fine-tuning methods.

The Data Moat and the Quest for Quality

This realization led me to ponder: how can we obtain high-quality data? About a year ago, there was concern that we might be running out of training data. GPT-3.5 (ChatGPT) was trained on a vast amount of web data, and it seemed we might need even more to improve model performance. However, this hasn't turned out to be a critical issue. In an interview, Ilya Sutskever suggested that synthetic data could mitigate the problem of data scarcity at this scale. Dario Amodei of Anthropic independently echoed this sentiment, expressing hope that various methods, including synthetic data, could address data limitations.

The Synthetic Data Solution?

So, how can we create data good enough to train models for better performance? What kind of data do we actually need? To be clear, I'm talking specifically about large language models pre-trained on text corpora with semi-supervised learning, not the hypothetical AGI scenarios we've discussed before.

The success of the FineWeb-Edu dataset surprised me. It was a relatively simple procedure, just filtering for high-quality data, yet the model benefited significantly. This suggests there is much more potential to explore in data curation and generation.

Tailoring Data for Specific Abilities

Sebastien Bubeck from Microsoft AI, who led the development of their Phi model, shared an interesting approach. Since Phi is a smaller model, the team steered away from training it on general information datasets. Instead, they created synthetic datasets aimed specifically at reasoning tasks. This brings us to the ultimate question: what kinds of datasets, what types of text, do we need to help models better understand the world and improve their reasoning capabilities?

The Delicate Balance

As I reflect on these musings, I keep coming back to one question: how can we craft synthetic data that aids understanding in language models? The "bitter lesson" of AI has shown us that a sufficient scale of internet data gives large language models a surprisingly good understanding of the real world. Models like GPT-4 are quite capable in this regard. But to push further, what kind of datasets should we craft?
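Before speculating about crafting new data, it helps to be concrete about what the filtering approach behind FineWeb-Edu even looks like. Below is a minimal sketch of LLM-based quality filtering in Python. It is not the released FineWeb-Edu pipeline; as I understand it, the team used Llama 3 70B to annotate a sample and then trained a small classifier on those annotations. The prompt wording, the 0-to-5 scale, the `query_llm` helper, and the threshold of 3 are all my assumptions for illustration.

```python
# Minimal sketch of LLM-based quality filtering, in the spirit of FineWeb-Edu.
# NOT the released pipeline: it assumes a query_llm(prompt) -> str callable
# that wraps whatever chat model you have access to (a hypothetical helper).

SCORE_PROMPT = (
    "Rate the educational value of the following text for middle and high "
    "school students on a scale from 0 (none) to 5 (excellent). "
    "Answer with a single digit.\n\nText:\n{document}"
)

def educational_score(document: str, query_llm) -> int:
    """Ask the model for a 0-5 score; fall back to 0 if the reply is unusable."""
    reply = query_llm(SCORE_PROMPT.format(document=document[:4000]))  # truncate long docs
    for token in reply.split():
        if token.isdigit():
            return min(int(token), 5)
    return 0

def filter_corpus(documents, query_llm, threshold: int = 3):
    """Keep only documents the model judges sufficiently educational (assumed cutoff)."""
    return [doc for doc in documents if educational_score(doc, query_llm) >= threshold]
```

Scoring billions of documents directly with a 70B model would be prohibitively expensive, which is presumably why distilling the judgments into a small classifier makes sense at web scale.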
The FineWeb-Edu experiment revealed something quite obvious in hindsight: most of the data on the web is, well, not great. It is noisy, cluttered with ads, and often low quality. Yet we can't simply discard these vast datasets, because they offer something invaluable: diversity. They provide a glimpse of the underlying reality of human communication and knowledge.

Creating data from scratch presents its own challenges. If we rely too heavily on synthetic data, we risk hindering the model, because those datasets will ultimately follow the token distributions of the models that generated them. It's a tricky balancing act.

It might help to anthropomorphize the situation a bit: the model just wants to learn. Our job is to create data that helps it understand the world around it. This opens up exciting possibilities for data augmentation. There is a lot to play with here and plenty of avenues to explore. We could experiment with different ways of combining real-world and synthetic data, or develop new techniques for enhancing the quality of existing datasets without losing their inherent diversity (a minimal sketch of one such mixing experiment follows at the end of this post).

Of course, without empirical experiments or other validation, all of this remains speculative. But I believe we are far from the ceiling of what is possible with data quality and model understanding. There is still enormous potential waiting to be unlocked.
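As a concrete example of the kind of experiment I have in mind, here is a minimal sketch of mixing web and synthetic documents at a configurable ratio when building a training stream. The `web_docs` and `synthetic_docs` inputs, the 80/20 split, and the hypothetical loader names in the usage comment are assumptions for illustration, not a recipe anyone has validated.

```python
import random
from typing import Iterable, Iterator

def mix_streams(
    web_docs: Iterable[str],
    synthetic_docs: Iterable[str],
    web_ratio: float = 0.8,  # assumed split; the right ratio is an open question
    seed: int = 0,
) -> Iterator[str]:
    """Interleave web and synthetic documents, drawing from the web stream with
    probability `web_ratio`, so the diversity of real data is preserved while
    synthetic data fills in targeted skills."""
    rng = random.Random(seed)
    web_iter, syn_iter = iter(web_docs), iter(synthetic_docs)
    while True:
        source = web_iter if rng.random() < web_ratio else syn_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop as soon as either stream is exhausted

# Hypothetical usage: feed the mixed stream into a tokenizer / data loader.
# mixed = mix_streams(load_web_corpus(), load_synthetic_reasoning(), web_ratio=0.8)
```

Whether the right mix is 80/20, 50/50, or something curriculum-like that shifts over the course of training is exactly the kind of empirical question the paragraph above leaves open.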