Andrej Karpathy’s video “Let’s build GPT: from scratch, in code, spelled out” is a clear, practical walkthrough of a small GPT implementation. He keeps the setup intentionally simple: a character-level GPT trained on a tiny Shakespeare dataset, just 1MB, so the implementation stays approachable.

Here are the notes that stuck with me:


Key concepts

  1. Data Preparation
    • Encodes characters into numerical tokens (65 unique tokens from Tiny Shakespeare).
    • Data split into training (90%) and validation (10%).
  2. Chunking
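    • Training uses fixed-length chunks of block_size tokens rather than the full text; each chunk yields block_size training examples, one for every context length from 1 up to block_size.
    • Each step samples batch_size chunks at random and stacks them, so the GPU processes them in parallel.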
  3. Bigram Language Model
    • Simplest language model: predicting next character from current one.
    • Uses embedding tables (token embedding table) to convert tokens into prediction logits.
  4. Self-Attention Mechanism
    • Computes relevance between tokens via queries and keys.
    • Uses masking (tril matrix) to prevent looking into future tokens.
  5. Multi-Head Attention
    • Each head independently computes attention in its own lower-dimensional projection of the token embeddings (head_size = n_embd / n_head).
    • Outputs from multiple heads concatenated and projected linearly.
  6. Feed-Forward Networks
    • A small MLP applied to each token independently after attention, so each token can process the information it just gathered.
    • Typically have 4× embedding dimensionality internally.
  7. Residual Connections and LayerNorm
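    • Residual connections (identity pathways) make training more stable by giving gradients a direct path to earlier layers, which helps against vanishing gradients.
    • Layer Normalization normalizes each token’s feature vector (across the embedding dimension) to zero mean and unit variance, which keeps activations and gradients in a stable range.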
  8. Scaling the Model
    • Increasing parameters (layers, heads, embedding size) significantly improves performance.
    • Larger models need retuned hyperparameters; the video lowers the learning rate when scaling up.
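
To make items 1 to 3 concrete, here is a minimal sketch in NumPy (the video itself uses PyTorch). The short repeated sentence is a stand-in for Tiny Shakespeare, and the zero-filled table is an untrained placeholder for the learned token embedding table; the helper names follow the video's conventions as I remember them, so treat them as assumptions.

```python
import numpy as np

# Stand-in corpus (assumption); the video loads the ~1MB Tiny Shakespeare text.
text = "to be or not to be, that is the question. " * 20

# 1. Character-level tokenizer: one integer id per unique character.
chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

data = np.array(encode(text), dtype=np.int64)

# 90% / 10% train / validation split, kept in order (no shuffling).
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# 2. Sample random fixed-length chunks; targets are the inputs shifted by one.
def get_batch(source, batch_size, block_size, rng):
    starts = rng.integers(0, len(source) - block_size, size=batch_size)
    x = np.stack([source[s : s + block_size] for s in starts])
    y = np.stack([source[s + 1 : s + block_size + 1] for s in starts])
    return x, y

rng = np.random.default_rng(0)
xb, yb = get_batch(train_data, batch_size=4, block_size=8, rng=rng)

# 3. Bigram model: row i of the table holds next-character logits for token i.
token_embedding_table = np.zeros((vocab_size, vocab_size))  # untrained placeholder
logits = token_embedding_table[xb]  # shape (batch_size, block_size, vocab_size)
```

In the real model the table is a trainable parameter, and the logits are compared against yb with a cross-entropy loss.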

What clicked for me

These points stood out while I was watching and thinking through the lecture:

The Asymmetry of Queries and Keys

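At first, it felt odd to separate queries and keys, since both are linear projections of the same input. But the asymmetry is deliberate: a token’s query encodes what it is looking for, and a token’s key encodes what it offers. The score between token i and token j is the dot product of i’s query with j’s key, so with separate projections, how much i attends to j can differ from how much j attends to i. Sharing one projection for both would force the raw scores to be symmetric.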
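
A small NumPy sketch of that asymmetry, using random weights and made-up sizes purely for illustration: separate query and key projections give a score matrix that is not symmetric, while reusing one projection for both gives a symmetric one.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_embd, head_size = 4, 8, 4  # illustrative sizes, not the video's

x = rng.normal(size=(T, n_embd))            # one sequence of T token vectors
W_q = rng.normal(size=(n_embd, head_size))  # query projection
W_k = rng.normal(size=(n_embd, head_size))  # key projection

q = x @ W_q
k = x @ W_k

# scores[i, j]: how well token i's query matches token j's key.
scores = q @ k.T

# Hypothetical variant that reuses W_q for the keys as well.
tied_scores = q @ q.T

print("separate projections symmetric:", np.allclose(scores, scores.T))
print("shared projection symmetric:   ", np.allclose(tied_scores, tied_scores.T))
```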

Masking After Attention Calculation

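At first, computing the full T×T score matrix and then masking seemed inefficient, since roughly half the entries are discarded. But one dense matrix multiply followed by a mask is simpler and runs well on GPUs. The mask is applied to the raw scores before the softmax, setting future positions to -inf so they end up with exactly zero weight. The simplicity and parallelism outweigh the wasted work.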
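
Here is a minimal NumPy sketch of the masking step, with random numbers standing in for real query-key scores. The softmax helper is my own; the video uses PyTorch's built-in together with a tril buffer.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability; exp(-inf) evaluates to 0.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))  # stand-in for q @ k.T

# Lower-triangular mask: position i may look at positions 0..i only.
tril = np.tril(np.ones((T, T), dtype=bool))

# Future positions get -inf before the softmax, so their weight is exactly 0.
masked = np.where(tril, scores, -np.inf)
weights = softmax(masked)

print(np.round(weights, 3))
```

Every row still sums to 1, and the first row puts all of its weight on the first token, since there is nothing earlier to attend to.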

The Role of Value Vectors

Values store the actual content aggregated by attention. At first, they felt redundant, but it is clearer now that keys and queries only determine how much each token contributes, while values contain what gets communicated. This distinction makes the attention mechanism expressive.
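
To see the split between "how much" and "what", here is a sketch of a single causal attention head in NumPy. The weights are random and the function name is hypothetical; the 1/sqrt(head_size) scaling is the standard scaled dot-product form.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    """Single causal self-attention head over one sequence x of shape (T, n_embd)."""
    T = x.shape[0]
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # Queries and keys only decide the mixing weights ...
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    causal = np.tril(np.ones((T, T), dtype=bool))
    weights = softmax(np.where(causal, scores, -np.inf))
    # ... while the values are what actually gets averaged into the output.
    return weights @ v, weights, v

rng = np.random.default_rng(0)
T, n_embd, head_size = 5, 16, 8  # illustrative sizes
x = rng.normal(size=(T, n_embd))
W_q, W_k, W_v = (rng.normal(size=(n_embd, head_size)) for _ in range(3))

out, weights, v = attention_head(x, W_q, W_k, W_v)
print(out.shape)
```

Each output row is a weighted average of value vectors. The first token can only see itself, so its output is just its own value vector.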

Why Multi-Head Attention Works Well

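Having multiple heads isn’t arbitrary. Each head has its own query, key, and value projections, so different heads can learn to track different relationships (syntax vs. semantics is the usual illustration, though what a head actually learns is rarely that clean). Because each head works at head_size = n_embd / n_head, the heads together cost about the same as one full-width head, so the model gains several independent attention patterns without a large increase in parameters or compute.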
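
A rough NumPy sketch of the idea, with random weights and hypothetical helper names: four heads of size n_embd / 4 run independently, their outputs are concatenated back to n_embd, and a final linear projection mixes them. The parameter count at the end shows the query/key/value weights add up to the same total as a single head of width n_embd.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head(x, W_q, W_k, W_v):
    """One causal attention head; returns shape (T, head_size)."""
    T = x.shape[0]
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    causal = np.tril(np.ones((T, T), dtype=bool))
    return softmax(np.where(causal, scores, -np.inf)) @ v

def multi_head(x, heads, W_proj):
    # Run every head on the same input, concatenate along features, then project.
    concat = np.concatenate([head(x, *weights) for weights in heads], axis=-1)
    return concat @ W_proj

rng = np.random.default_rng(0)
T, n_embd, n_head = 8, 32, 4  # illustrative sizes
head_size = n_embd // n_head

x = rng.normal(size=(T, n_embd))
heads = [
    tuple(rng.normal(size=(n_embd, head_size)) * 0.1 for _ in range(3))
    for _ in range(n_head)
]
W_proj = rng.normal(size=(n_embd, n_embd)) * 0.1

y = multi_head(x, heads, W_proj)

# Total q/k/v parameters across heads vs. one head of width n_embd.
qkv_params = sum(w.size for weights in heads for w in weights)
print(y.shape, qkv_params, 3 * n_embd * n_embd)
```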

Residual Connections: Essential for Stability

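Residual connections let gradients flow more freely and help prevent vanishing gradients: the addition passes the gradient straight through to earlier layers, regardless of what each block’s own transformation does to it. Without them, deep Transformers would be much harder to train.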
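
A toy illustration, not something from the video: treat each layer as a small random linear map and compare the end-to-end Jacobian (which governs how gradients propagate back) with and without the skip connection. The depth, width, and weight scale are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 16, 20  # arbitrary toy sizes
layers = [rng.normal(size=(dim, dim)) * 0.05 for _ in range(depth)]

# Jacobian of the whole stack with respect to its input.
J_plain = np.eye(dim)     # layers of the form y = W x
J_residual = np.eye(dim)  # layers of the form y = x + W x
for W in layers:
    J_plain = W @ J_plain
    J_residual = (np.eye(dim) + W) @ J_residual

plain_norm = np.linalg.norm(J_plain)
residual_norm = np.linalg.norm(J_residual)
print(f"plain stack:    {plain_norm:.3e}")
print(f"residual stack: {residual_norm:.3e}")
```

With small weights the plain stack's Jacobian shrinks toward zero as depth grows, while the identity term keeps the residual stack's Jacobian at a healthy scale. Real Transformer blocks are nonlinear, so this only conveys the intuition.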

LayerNorm vs. BatchNorm

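Layer Normalization suits Transformers better than Batch Normalization because it normalizes each token’s feature vector on its own, rather than normalizing each feature across the batch. Its output for one token therefore does not depend on the other sequences in the batch, and it behaves the same in training and inference with no running statistics to maintain. BatchNorm’s reliance on batch statistics is awkward for variable-length sequences and small batches. I appreciated this subtle point.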
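
A simplified NumPy comparison, leaving out the learnable gain and bias that real implementations include. The batch-norm function here is a training-mode sketch of my own that normalizes each feature over the batch and time axes. Changing every sequence except the first leaves the first sequence's LayerNorm output untouched, but changes its batch-normalized output.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis: each token is normalized on its own.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Statistics per feature over the batch and time axes (training-mode sketch).
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
B, T, C = 4, 8, 16  # batch, time, channels (illustrative sizes)
x = rng.normal(size=(B, T, C))

# Same first sequence, but different (and larger-scale) companions in the batch.
x_other = x.copy()
x_other[1:] = rng.normal(size=(B - 1, T, C)) * 5.0

ln_same = np.allclose(layer_norm(x)[0], layer_norm(x_other)[0])
bn_same = np.allclose(batch_norm_train(x)[0], batch_norm_train(x_other)[0])
print("LayerNorm output for sequence 0 unchanged:", ln_same)
print("BatchNorm output for sequence 0 unchanged:", bn_same)
```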


Final thoughts

Building GPT from scratch, even on a tiny dataset, clarified a lot of my confusion about Transformers. It helped me see how queries, keys, values, attention, and normalization layers fit together. It also showed me why small implementation details like masking and residual connections matter.

Karpathy’s video made GPT’s internal workings feel much more concrete.