Llama 3 Paper - Review (Part 1)

Today I’m reviewing Meta’s Llama 3 technical paper. The paper is long enough that I’m splitting my notes into two parts; this first part focuses on pre-training and infrastructure. Llama 3 is a big step up from Llama 2, with the flagship 405B parameter model performing competitively against models like GPT-4. What makes the paper interesting is how much detail Meta gives about scaling: data preparation, model training, and the infrastructure needed to keep everything running.

Key concepts

Scaling up pre-training

Llama 3 scales far beyond previous versions. The flagship model uses 405B parameters and was trained on approximately 15T tokens, compared with 1.8T for Llama 2. Training the flagship model used nearly 50x more compute than Llama 2’s largest model. Meta made this work with a standard dense Transformer rather than a Mixture-of-Experts (MoE) architecture.
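As a rough sanity check on the scale, a common rule of thumb estimates training compute for a dense Transformer at about 6 FLOPs per parameter per token. That approximation is my own addition rather than something taken from the notes above, and I plug in the rounded 15T token figure, so treat the result as a ballpark only.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute for a dense Transformer: about 6 * N * D."""
    return 6.0 * params * tokens


# 405B parameters and roughly 15T tokens, as quoted above.
llama3_flops = training_flops(405e9, 15e12)

# Ratio of training tokens relative to Llama 2 (15T vs 1.8T).
token_ratio = 15e12 / 1.8e12

print(f"Estimated training compute: {llama3_flops:.2e} FLOPs")
print(f"Token count ratio vs Llama 2: {token_ratio:.1f}x")
```

The estimate lands in the 10^25 FLOPs range, and it shows that the roughly 50x compute jump comes from growing both the model and the dataset, not just one of them.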

Data quality and processing

Meta emphasized data quality rather than just quantity. Their processing pipeline removed PII, applied several levels of deduplication, and used both heuristic and model-based filtering. They also used classifiers to identify and upsample higher-quality code and reasoning content. For multilingual support, they added language-specific processing and quality filtering.
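To make the pipeline idea concrete, here is a toy sketch of exact deduplication followed by a heuristic filter. Everything in it (the function names, the thresholds, the repeated-word rule) is hypothetical and far simpler than a production pipeline; it only illustrates the general shape of "deduplicate, then filter".

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_heuristics(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> bool:
    """Hypothetical quality rules: reject very short or highly repetitive documents."""
    words = text.split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def clean_corpus(docs):
    """Drop exact duplicates (after normalization), then apply the heuristic filter."""
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of a document we already processed
        seen.add(key)
        if passes_heuristics(doc):
            kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "spam spam spam spam spam spam",                  # too repetitive
    "too short",                                      # below the word minimum
]
cleaned = clean_corpus(docs)
print(cleaned)
```

A real pipeline would add fuzzy deduplication and model-based quality classifiers on top of simple rules like these.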

Context length scaling

Llama 3 was designed to handle context windows up to 128K tokens. Training on long sequences from the beginning would be very expensive, because self-attention compute scales quadratically with sequence length. Instead, they used a multi-phase approach: first training on 8K contexts, then gradually increasing the window to 128K tokens in the final stages of pre-training, using approximately 800B tokens for that long-context phase.
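A quick calculation shows why the long sequences are deferred. The sketch below assumes attention cost per sequence grows with the square of the length, and it uses a hypothetical doubling schedule from 8K to 128K; the actual stage boundaries may differ.

```python
def attention_cost_ratio(new_len: int, base_len: int) -> float:
    """Relative per-sequence attention cost, assuming cost grows with length squared."""
    return (new_len / base_len) ** 2


# Hypothetical doubling schedule; the real intermediate context lengths may differ.
stages = [8_192, 16_384, 32_768, 65_536, 131_072]
for length in stages:
    ratio = attention_cost_ratio(length, stages[0])
    print(f"{length:>7} tokens: {ratio:>5.0f}x the attention cost of an 8K sequence")
```

Measured per token instead of per sequence, the growth is linear (16x at 128K), which is still a strong reason to spend only a small share of the token budget at the longest lengths.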

Hardware and infrastructure

Training at this scale required huge hardware resources and a serious infrastructure stack. The 405B model was trained on up to 16K H100 GPUs. They used tensor parallelism, pipeline parallelism, context parallelism, and data parallelism, which they call “4D parallelism,” to distribute computation. They report 38-43% Model FLOPs Utilization (MFU) at this scale.
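The four parallelism degrees multiply together to give the total GPU count, and MFU is just achieved throughput divided by the hardware peak. Here is a minimal sketch of both. The specific degrees, the achieved throughput, and the peak figure are illustrative assumptions on my part, not values quoted from my notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    tensor: int    # tensor parallelism degree (TP)
    context: int   # context parallelism degree (CP)
    pipeline: int  # pipeline parallelism degree (PP)
    data: int      # data parallelism degree (DP)

    @property
    def world_size(self) -> int:
        """Total GPUs needed: the product of all four degrees."""
        return self.tensor * self.context * self.pipeline * self.data


def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs Utilization: useful model FLOPs per second over hardware peak."""
    return achieved_tflops / peak_tflops


# Illustrative degrees that multiply out to 16K GPUs (an assumption, not a quoted config).
config = ParallelConfig(tensor=8, context=1, pipeline=16, data=128)
print(config.world_size)

# Assumed dense BF16 peak for an H100; the 400 TFLOPs achieved value is a placeholder.
H100_BF16_PEAK_TFLOPS = 989.0
print(f"MFU: {mfu(400.0, H100_BF16_PEAK_TFLOPS):.1%}")
```

Raising the context degree for long-sequence training means lowering another degree (typically data parallelism) to stay within the same GPU budget.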

Post-training alignment

After pre-training, the models went through post-training alignment using supervised fine-tuning (SFT), rejection sampling, and Direct Preference Optimization (DPO). One interesting note is their deliberate choice to avoid more complex reinforcement learning algorithms like PPO, which they found less stable and harder to scale.
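For reference, here is the standard DPO objective for a single preference pair, written in plain Python. This is the textbook form of the loss, not necessarily Meta's exact recipe, which may include modifications not shown here. The `beta` default and the example log-probabilities are arbitrary assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Policy shifted toward the chosen response: the loss drops below log(2).
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)

print(baseline, improved)
```

The appeal is visible even in this toy version: there is no reward model rollout or value function, just a classification-style loss over log-probabilities, which helps explain why it is easier to scale than PPO.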

Specialized capabilities

The paper describes several capabilities added during post-training, including code generation with execution-based feedback, multilingual performance, reasoning, tool use, and factuality. For many of these, they built specialized data generation pipelines, often using earlier versions of Llama 3 to generate training data.

What I learned

Simple, stable architectures matter

One of the most interesting choices Meta made was sticking with a dense Transformer architecture rather than using a Mixture-of-Experts approach. They explicitly say this was to “maximize training stability,” which suggests that at huge scale, reliability and predictability can be more valuable than theoretical efficiency. This matches what DeepSeek researchers have also mentioned about the difficulty of scaling MoE models.

Data quality beats architectural cleverness

The paper spends a lot of time on data curation, which is probably the right emphasis. Even with massive compute resources, Meta still invested heavily in filtering, curation, and quality assessment. The deduplication, model-based filtering, and domain-specific pipelines all point to the same thing: the dataset still matters enormously.

Model-bootstrapped data creation

Meta used earlier versions of Llama 3 to generate data for later training iterations. For capabilities like code generation, the model generated samples, and those samples were filtered based on execution results. This self-improvement loop, where models help train their successors, is becoming more common, but seeing it at this scale is still striking.
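The execution-based filtering step can be sketched in a few lines: run each generated candidate against test cases and keep only the ones that pass. The function names and the test-case format are hypothetical, and the bare `exec` call is for illustration only; a real pipeline would run untrusted code in a sandbox.

```python
def passes_tests(source: str, func_name: str, test_cases) -> bool:
    """Execute candidate source and check the named function against the test cases."""
    namespace = {}
    try:
        exec(source, namespace)  # illustration only: never exec untrusted code unsandboxed
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def filter_by_execution(candidates, func_name, test_cases):
    """Keep only the candidates whose code runs and passes every test case."""
    return [src for src in candidates if passes_tests(src, func_name, test_cases)]


candidates = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b):\n    return a - b\n",  # runs but gives wrong answers
    "def add(a, b)\n    return a + b\n",   # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

kept = filter_by_execution(candidates, "add", tests)
print(len(kept))
```

The surviving samples become training data, so the quality of the test cases effectively sets the quality ceiling for this loop.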

Context parallelism for long sequences

The paper’s description of context parallelism (CP) for long sequences was useful. By dividing input sequences into chunks across GPUs and using all-gather operations to collect key-value tensors, they trained on 128K context lengths without excessive memory usage. Each GPU computes attention only for its own chunk of queries against the gathered keys and values. This differs from the ring-style CP implementations I’ve seen, which overlap communication with computation, and it shows how specialized LLM infrastructure is becoming.
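To convince myself the all-gather approach gives the same result as ordinary attention, I wrote a single-process simulation in plain Python. It splits the sequence into contiguous chunks, one per simulated rank, "all-gathers" the keys and values by concatenating the shards, and lets each rank compute causal attention for its own queries. The names and the simple contiguous chunking are my assumptions; a real implementation would use collective communication across GPUs and may balance the chunks differently.

```python
import math


def causal_attention(q_rows, q_offset, keys, values):
    """Causal attention for a block of queries that starts at global position q_offset."""
    dim = len(values[0])
    out = []
    for i, q in enumerate(q_rows):
        limit = q_offset + i + 1  # causal mask: attend only to positions <= own position
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys[:limit]]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values[:limit])) / total
                    for d in range(dim)])
    return out


def context_parallel_attention(q, k, v, num_ranks):
    """Simulate CP: shard the sequence, all-gather K and V, compute local-query attention."""
    if len(q) % num_ranks != 0:
        raise ValueError("sequence length must be divisible by num_ranks")
    chunk = len(q) // num_ranks
    k_shards = [k[r * chunk:(r + 1) * chunk] for r in range(num_ranks)]
    v_shards = [v[r * chunk:(r + 1) * chunk] for r in range(num_ranks)]
    # Simulated all-gather: every rank ends up with the full K and V.
    full_k = [row for shard in k_shards for row in shard]
    full_v = [row for shard in v_shards for row in shard]
    out = []
    for r in range(num_ranks):
        local_q = q[r * chunk:(r + 1) * chunk]  # each rank keeps only its own queries
        out.extend(causal_attention(local_q, r * chunk, full_k, full_v))
    return out


seq_len, dim = 8, 2
q = [[math.sin(i + d) for d in range(dim)] for i in range(seq_len)]
k = [[math.cos(i + d) for d in range(dim)] for i in range(seq_len)]
v = [[float(i), float(i * d)] for i in range(seq_len) for d in [1]]

reference = causal_attention(q, 0, k, v)
sharded = context_parallel_attention(q, k, v, num_ranks=4)
matches = all(math.isclose(a, b) for ra, rb in zip(reference, sharded) for a, b in zip(ra, rb))
print(matches)
```

The memory saving comes from the queries and activations being sharded; only the keys and values are replicated, and those are comparatively small.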

Alignment is still labor-intensive

The post-training sections show how much work alignment still takes. They performed six rounds of alignment, iteratively collecting human preferences, generating synthetic data, and fine-tuning. Each round built on the previous one, using increasingly capable models. The process still required human annotation and quality control throughout.

Infrastructure is its own problem

The sections on reliability and operational challenges show how hard training at this scale remains. During a 54-day period, they experienced 419 unexpected interruptions, with GPU issues accounting for nearly 60% of them. They also observed 1-2% daily throughput variation because environmental temperature affected GPU clocks. Those details made the infrastructure problem feel very concrete.
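Turning those numbers into rates makes the operational burden easier to picture. This is simple arithmetic on the figures above; the GPU share uses the rounded "nearly 60%" figure, so that count is approximate.

```python
def mean_hours_between(events: int, days: float) -> float:
    """Average gap between events in hours over a period of the given length."""
    return days * 24 / events


interruptions = 419
period_days = 54

hours_between = mean_hours_between(interruptions, period_days)
per_day = interruptions / period_days
approx_gpu_related = round(0.6 * interruptions)  # "nearly 60%", so an upper-ish estimate

print(f"About {per_day:.1f} unexpected interruptions per day")
print(f"One roughly every {hours_between:.1f} hours")
print(f"Roughly {approx_gpu_related} of them GPU-related")
```

An interruption every few hours for almost two months means automated detection, checkpointing, and restart are not optional at this scale.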

Summary

The Llama 3 paper gives a detailed look at how Meta trained a competitive frontier model. OpenAI and Anthropic still had some lead with proprietary models, but Llama 3 shows that enough scale and careful engineering can get publicly released weights surprisingly close.

What stood out to me is how conservative Meta’s approach is in some places. They focused on reliability, scalability, and maintainability rather than chasing more exotic architectures. The dense Transformer choice, combined with a relatively simple alignment procedure, shows a preference for methods that can be scaled reliably.

The infrastructure and pipeline engineering also stood out. From their custom HTML parser to their parallelism strategies to their reliability engineering, the paper makes clear that training models at this scale requires not just raw compute but also the systems that make that compute usable.

Overall, this paper is less about one clever trick and more about execution. Quality data, reliable infrastructure, and careful alignment are doing a lot of the work.