I recently watched Andrej Karpathy’s “Let’s Reproduce GPT-2 (124M)” video. These are my notes on the parts that stood out, especially positional embeddings, transformer architecture tweaks, and implementation details that matter when reproducing GPT-2.


Key concepts

Transformer modifications: Pre-Layer Normalization (Pre-LN)

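In the original transformer, Layer Normalization is applied after each attention and feedforward sublayer, on the sum of the residual and the sublayer output (Post-LN). GPT-2 moves it to the input of each sublayer (Pre-LN) and adds one final LayerNorm after the last block. This leaves the residual path free of normalization, so gradients can flow straight from the output back to the early layers. That stabilizes training, especially in deeper models.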
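To make the difference in ordering concrete, here is a minimal NumPy sketch of the two block layouts. It is an illustration under simplifying assumptions, not the video's PyTorch code: the learned LayerNorm scale and shift are omitted, and `toy_attn` and `toy_mlp` are hypothetical stand-ins for the real sublayers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, attn, mlp):
    # Original transformer: normalize after each residual addition.
    x = layer_norm(x + attn(x))
    x = layer_norm(x + mlp(x))
    return x

def pre_ln_block(x, attn, mlp):
    # GPT-2 style: normalize only the sublayer input,
    # so the residual stream x is passed along untouched.
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # (tokens, channels)
W = rng.normal(size=(8, 8)) * 0.1

def toy_attn(h):
    # Hypothetical stand-in: every token receives the mean of all tokens.
    return np.broadcast_to(h.mean(axis=0, keepdims=True), h.shape)

def toy_mlp(h):
    # Hypothetical stand-in for the feedforward sublayer.
    return np.tanh(h @ W)

out_pre = pre_ln_block(x, toy_attn, toy_mlp)
out_post = post_ln_block(x, toy_attn, toy_mlp)
```

One way to see the clean residual path: if both sublayers returned zeros, `pre_ln_block` would return its input unchanged, while `post_ln_block` would still renormalize it.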

Transformer as MapReduce

Karpathy uses an analogy: Transformers operate similarly to a MapReduce process. Self-attention acts like a “Reduce” step, combining information from all tokens, while the subsequent MLP acts like “Map,” independently processing each token’s representation.
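The analogy can be checked with a small experiment: perturb one token and see which output positions change. This is a toy NumPy sketch with assumptions of my own. The "reduce" step uses uniform causal averaging instead of learned attention weights, and the MLP uses ReLU where GPT-2 uses GELU.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 5, 8
x = rng.normal(size=(T, C))
W1 = rng.normal(size=(C, 4 * C))
W2 = rng.normal(size=(4 * C, C))

def mlp(h):
    # "Map": the same function is applied to every row independently.
    return np.maximum(h @ W1, 0) @ W2

def causal_mix(h):
    # "Reduce": each token averages itself and all earlier tokens
    # (uniform weights stand in for learned attention weights).
    weights = np.tril(np.ones((len(h), len(h))))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ h

x2 = x.copy()
x2[0] += 1.0   # perturb only the first token

# Per-position flags: did the output at that position change?
mlp_changed = np.any(~np.isclose(mlp(x), mlp(x2)), axis=1)
mix_changed = np.any(~np.isclose(causal_mix(x), causal_mix(x2)), axis=1)
```

With the MLP, only position 0 changes. With the mixing step, every position changes, because each one aggregates the first token.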

Optimizing attention computation

GPT-2’s attention implementation relies on tensor reshaping for efficiency. A single linear layer projects the input into queries, keys, and values for all heads at once. The result is split into three tensors, and each is reshaped and transposed so that the head index becomes a batch-like dimension. The attention computation then runs for all heads in parallel as one batched matrix multiplication.
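The reshaping is easier to follow with shapes written out. Below is a NumPy sketch of the fused projection and head split, using toy sizes I picked for readability. The weight name `W_attn` is illustrative, biases are omitted, and the final output projection is left out.

```python
import numpy as np

B, T, C, n_head = 2, 6, 16, 4        # toy sizes; GPT-2 124M uses C=768, n_head=12
head_size = C // n_head

rng = np.random.default_rng(0)
x = rng.normal(size=(B, T, C))
W_attn = rng.normal(size=(C, 3 * C)) * 0.02   # one fused projection for q, k, v

qkv = x @ W_attn                      # (B, T, 3C)
q, k, v = np.split(qkv, 3, axis=-1)   # each (B, T, C)

def split_heads(t):
    # (B, T, C) -> (B, n_head, T, head_size): heads become a batch-like axis.
    return t.reshape(B, T, n_head, head_size).transpose(0, 2, 1, 3)

q, k, v = split_heads(q), split_heads(k), split_heads(v)

# Scaled dot-product scores for all heads at once: (B, n_head, T, T)
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_size)

# Causal mask: position t may only attend to positions <= t.
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Numerically stable softmax over the key dimension.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

y = weights @ v                                  # (B, n_head, T, head_size)
y = y.transpose(0, 2, 1, 3).reshape(B, T, C)     # merge heads back to (B, T, C)
```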

FlashAttention

FlashAttention is an algorithm that computes exact attention more efficiently. It splits the computation into small “tiles” and fuses the steps into a single kernel (kernel fusion), so the full sequence-by-sequence attention matrix is never written out to GPU memory. This cuts memory traffic and memory usage, which makes attention faster and longer sequences practical.
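The core trick behind the tiling is a running ("online") softmax that lets partial results be combined tile by tile. The NumPy sketch below shows only that arithmetic. It is not the real kernel: actual FlashAttention also tiles the queries, applies the causal mask, and runs as one fused GPU kernel. The function names and the `block` parameter are my own.

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full (T, T) score matrix.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    # Visits keys/values one tile at a time, keeping a running softmax,
    # so only a (T, block) slice of scores exists at any moment.
    T, d = q.shape
    m = np.full((T, 1), -np.inf)        # running row maximum
    l = np.zeros((T, 1))                # running softmax denominator
    acc = np.zeros((T, v.shape[-1]))    # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)       # rescale earlier partial sums to the new maximum
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        acc = acc * scale + p @ vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
out_naive = naive_attention(q, k, v)
out_tiled = tiled_attention(q, k, v, block=4)
```

Both functions produce the same result up to floating-point error, which is the sense in which the tiled version is exact and not an approximation.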

Weight sharing between input and output layers

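GPT-2 shares one weight matrix between the token embedding layer (wte) and the final linear output layer (lm_head). This initially seemed counterintuitive to me, but the two layers do related jobs: tokens that are similar as inputs should also receive similar scores as outputs. Sharing helps the model learn more consistent representations and often improves generalization. It also removes a vocabulary-sized matrix, which in the 124M model is roughly 30% of all parameters.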
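A short sketch of what sharing means in practice: the same matrix is used as a lookup table on the way in and as a transposed projection on the way out. The parameter arithmetic uses GPT-2 124M's vocabulary and channel sizes. The demonstration uses a much smaller toy table, and for brevity it feeds the embeddings straight to the output instead of passing them through transformer blocks.

```python
import numpy as np

# Parameter arithmetic with GPT-2 124M sizes.
vocab_size, n_embd = 50257, 768
shared_params = vocab_size * n_embd   # parameters saved by not keeping a second matrix

# Toy demonstration with a small (vocab, channels) table.
rng = np.random.default_rng(0)
wte = rng.normal(size=(100, 16)) * 0.02

token_ids = np.array([3, 14, 15])
x = wte[token_ids]       # input side: look up one row per token
logits = x @ wte.T       # output side: the same matrix, transposed

print(shared_params)     # 38597376
```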


What I learned

Why learned positional embeddings work well

At first, positional embeddings felt abstract. Now I see why they are powerful: they are a table with one trainable vector per position, and each embedding channel can capture different positional information, like long-range versus short-range patterns. This lets the model adapt its position representations to the training data, unlike the fixed sinusoidal encodings of the original transformer.
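Here is a small sketch of how the two kinds of position information are built and added to token embeddings. Sizes are toy values, and the "learned" table `wpe` is just randomly initialized here, since in a real model it would be trained with the rest of the weights. The `sinusoidal` helper follows the fixed formula from the original transformer for comparison.

```python
import numpy as np

vocab_size, block_size, n_embd = 100, 8, 16   # toy sizes; GPT-2 uses 50257, 1024, 768
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, n_embd)) * 0.02   # token embedding table
wpe = rng.normal(size=(block_size, n_embd)) * 0.02   # learned position table (untrained here)

def sinusoidal(block_size, n_embd):
    # Fixed alternative: sines and cosines at geometrically spaced frequencies.
    pos = np.arange(block_size)[:, None]
    i = np.arange(0, n_embd, 2)[None, :]
    angles = pos / (10000 ** (i / n_embd))
    pe = np.zeros((block_size, n_embd))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

idx = np.array([5, 5, 7, 5])               # token 5 appears at positions 0, 1, and 3
x = wte[idx] + wpe[np.arange(len(idx))]    # (T, n_embd): token plus position
fixed = sinusoidal(block_size, n_embd)
```

Because the position vector is added, the same token at positions 0 and 1 enters the model as two different vectors, which is how attention can tell them apart.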

The impact of Pre-LN on training stability

Before this, I thought normalization order wouldn’t matter much. But moving LayerNorm before each attention and MLP block stabilizes gradients, helping especially in deeper transformer stacks. This highlights the subtle but meaningful effects minor architectural tweaks can have.

Autoregressive prediction and the “last vector” mystery

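I always wondered why GPT generates the next token from only the last position’s output vector, even though the whole sequence is processed. The reason is that causal attention lets the last position see every earlier token, so its vector already summarizes the entire context. The vectors at earlier positions are not wasted: each one predicts the token that follows it, and during training all of them contribute to the loss. At generation time those earlier predictions are for tokens we already have, so only the last one tells us something new.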
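The difference between training and generation can be sketched with a toy causal model. Everything here is a stand-in of my own: `toy_causal_model` replaces the whole transformer with a causal running mean, and the token ids are arbitrary. What carries over to the real model is the shape of the logits and the causal property.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_embd = 50, 16
wte = rng.normal(size=(vocab_size, n_embd))

def toy_causal_model(idx):
    # Stand-in for GPT: position t sees only tokens 0..t (causal running mean),
    # then projects to vocabulary logits with the tied embedding matrix.
    x = wte[idx]
    T = len(idx)
    weights = np.tril(np.ones((T, T)))
    weights /= weights.sum(axis=1, keepdims=True)
    h = weights @ x
    return h @ wte.T      # (T, vocab_size): one next-token prediction per position

idx = np.array([4, 8, 15, 16, 23])
logits = toy_causal_model(idx)

# Training: every position is used; the output at position t is scored against token t+1.
train_inputs, train_targets = idx[:-1], idx[1:]

# Generation: rows 0..T-2 predict tokens already in idx, so only the last row is new.
next_token = int(np.argmax(logits[-1]))   # greedy pick; sampling is more typical

# Causality check: changing the last token leaves all earlier rows unchanged.
idx_changed = idx.copy()
idx_changed[-1] = 0
logits_changed = toy_causal_model(idx_changed)
```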

Why attention head dimensionality (head_size) matters

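Each attention head working in a smaller, separate subspace (head_size) surprised me at first. In GPT-2 124M the 768 channels are split across 12 heads, so head_size is 64. Each head computes its own attention pattern, so different heads can specialize in different relationships. Because the heads together still span the same 768 channels, the split adds expressiveness at about the same cost as one full-width head.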
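A quick back-of-the-envelope check of the cost claim, using GPT-2 124M's sizes. The helper names are hypothetical, and the count covers only the query-key score computation, as a rough proxy for attention cost.

```python
n_embd, n_head = 768, 12
head_size = n_embd // n_head     # 64 channels per head

def qkv_params(n_embd):
    # Weights plus biases of the fused q, k, v projection.
    # Note that the number of heads does not appear.
    return 3 * (n_embd * n_embd + n_embd)

def score_multiply_adds(T, n_embd, n_head):
    # Multiply-adds for q @ k^T across all heads for a sequence of length T.
    per_head = T * T * (n_embd // n_head)
    return n_head * per_head

T = 1024
twelve_heads = score_multiply_adds(T, n_embd, 12)
one_head = score_multiply_adds(T, n_embd, 1)
```

`twelve_heads` and `one_head` come out equal: twelve heads of width 64 cost the same here as one head of width 768, but they produce twelve separate attention patterns.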


Summary

This GPT-2 deep dive clarified details I had treated as minor before: positional embeddings, attention optimizations, normalization order, and head dimensions. Small architectural choices can matter a lot.

The MapReduce analogy also helped me understand the internal data flow. And understanding why generation reads only the final position’s output made the autoregressive setup feel much more concrete.