This time, I review the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, commonly known as ViT (Vision Transformer). My initial impression was that the idea is almost surprisingly simple: take a standard Transformer, originally designed for NLP, and apply it to images by splitting them into patches. Despite the simplicity, the results are strong, especially when scaled with large datasets.


Key concepts

Transformer on images (patches as tokens)

ViT treats an image much like a sentence: the image is divided into fixed-size patches (16x16 pixels, as in the title), each patch is flattened into a vector, and these vectors become the tokens fed into the Transformer. Each patch plays the role of a “word” in NLP, which makes the analogy easy to follow.
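
As a rough illustration of the patch-to-token step, here is a minimal NumPy sketch. The function name patchify and the 224x224 input size are my own choices for the example, not something taken from the paper’s code.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (raster order).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Example input: a dummy 224x224 RGB image (size chosen for illustration)
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

With 16x16 patches, a 224x224 image becomes a sequence of 196 tokens, which is short by NLP standards.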

Linear embedding of patches

Each image patch is flattened and projected linearly into a fixed-dimensional vector embedding. Unlike CNNs, which explicitly encode spatial locality, ViT relies on these embeddings and positional encoding to learn spatial relationships.
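
The projection itself is just one matrix multiplication shared by all patches. A small sketch, assuming randomly initialized stand-ins for the learnable weights (W, b, and the sizes below are hypothetical values for illustration):

```python
import numpy as np

def embed_patches(patches: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply the same linear projection to every flattened patch.

    patches: (num_patches, patch_dim), W: (patch_dim, d_model), b: (d_model,)
    """
    return patches @ W + b

rng = np.random.default_rng(0)
num_patches, patch_dim, d_model = 196, 768, 64  # illustrative sizes

patches = rng.standard_normal((num_patches, patch_dim))
# Stand-ins for learnable parameters; a real model would train these
W = rng.standard_normal((patch_dim, d_model)) * 0.02
b = np.zeros(d_model)

embeddings = embed_patches(patches, W, b)
print(embeddings.shape)  # (196, 64)
```

Nothing in this step knows which patches are neighbors; every patch goes through the same weights independently.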

Learnable classification token

An additional learnable “CLS” token is prepended to the patch embeddings. This special token aggregates global information through the Transformer’s layers and becomes the final representation for classification, similar to BERT.
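
A minimal sketch of how the token enters and leaves the sequence. The encoder is left out on purpose (replaced by an identity placeholder), and all names and sizes are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_model = 196, 64  # illustrative sizes

patch_embeddings = rng.standard_normal((num_patches, d_model))
# Stand-in for the learnable class token (one vector, shared by all images)
cls_token = np.zeros((1, d_model))

# Prepend the class token: the sequence grows from 196 to 197 tokens
sequence = np.concatenate([cls_token, patch_embeddings], axis=0)

# Placeholder for the Transformer encoder, which is omitted here
encoded = sequence

# The output at position 0 is what the classification head would read
image_representation = encoded[0]
print(sequence.shape, image_representation.shape)  # (197, 64) (64,)
```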

Position embeddings

ViT uses standard learnable 1D position embeddings, added to the patch embeddings so the model can tell where each patch came from. They are less spatially intuitive than explicit 2D positional encodings, but they work: the authors saw no significant gain from switching to more complex 2D-aware embeddings.
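
To make the “1D” part concrete, here is a small sketch assuming a 14x14 patch grid in raster order with the class token at index 0. The helper sequence_index and the initialization scale are mine, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, d_model = 14, 14, 64  # illustrative sizes
seq_len = 1 + grid_h * grid_w         # class token + patches

tokens = rng.standard_normal((seq_len, d_model))
# Stand-in for the learnable table: one vector per sequence position
pos_embedding = rng.standard_normal((seq_len, d_model)) * 0.02

# Position information is simply added to the token embeddings
encoder_input = tokens + pos_embedding

def sequence_index(row: int, col: int, width: int = grid_w) -> int:
    """1D index of the patch at (row, col); index 0 is the class token."""
    return 1 + row * width + col

print(sequence_index(0, 13), sequence_index(1, 0))  # 14 15
```

The last patch of row 0 and the first patch of row 1 get adjacent indices even though they sit on opposite sides of the image, so any 2D structure has to be learned rather than read off the index.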

Hybrid models

The authors also experimented with a hybrid approach where the input sequence is built from CNN feature maps instead of raw image patches. It’s worth noting, but ultimately secondary. The pure Transformer approach already performs well without CNN pre-processing, suggesting that CNN inductive biases might not be strictly necessary when data scale is sufficient.


What I learned

Surprisingly simple yet effective

The biggest surprise for me was how straightforward this approach is. It feels almost trivial: patch up an image, pass it through a Transformer, and let self-attention handle the rest. There aren’t complicated image-specific tweaks, yet it achieves solid performance. The simplicity makes the model generalizable and scalable.

Less inductive bias, more flexibility

ViT intentionally lacks many inductive biases built into CNNs, like locality, translation equivariance, and hierarchical features. This initially felt like a disadvantage. But it also means ViT has to learn spatial structure from data, which may let it find more general patterns. CNNs bake in assumptions about images; ViT does not. This made me reconsider how inductive biases might restrict what a model can learn, especially when enough data is available.

Why positional embeddings can be problematic

One point I found confusing at first was positional embeddings. If the input resolution changes (for example, when fine-tuning on higher resolution images), positional embeddings trained for shorter sequences can lose their meaning. Because the patch size stays the same, a higher resolution produces a longer sequence, and a given index no longer refers to the same relative location in the image. The paper handles this with 2D interpolation of the pre-trained positional embeddings, according to their location in the original image. However, I still wonder if using relative positional coordinates normalized within [0,1] in 2D would have been better. That could preserve relative positions across resolutions better than interpolating discrete embeddings.
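
Here is how I picture the interpolation step, as a NumPy sketch. The paper only says “2D interpolation”; bilinear interpolation with corner-aligned sampling is my assumption, and resize_pos_embedding is a hypothetical helper, not the authors’ code. It also assumes a square patch grid with the class token’s embedding in row 0.

```python
import numpy as np

def resize_pos_embedding(pos_embedding: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Bilinearly resize patch position embeddings to a new square grid.

    pos_embedding: (1 + old_grid**2, D); row 0 belongs to the class token
    and is copied unchanged.
    """
    cls_pos, patch_pos = pos_embedding[:1], pos_embedding[1:]
    d = patch_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, d)

    # Where each new grid cell falls in old-grid coordinates
    # (corner-aligned sampling: an assumption of this sketch)
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows, then along columns
    fr = frac[:, None, None]
    rows = grid[lo] * (1 - fr) + grid[hi] * fr        # (new, old, D)
    fc = frac[None, :, None]
    out = rows[:, lo] * (1 - fc) + rows[:, hi] * fc   # (new, new, D)

    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, d)], axis=0)

# Example: 224x224 -> 384x384 input with 16x16 patches, i.e. a 14x14 -> 24x24 grid
pos = np.random.default_rng(0).standard_normal((1 + 14 * 14, 8))
resized = resize_pos_embedding(pos, old_grid=14, new_grid=24)
print(resized.shape)  # (577, 8)
```

Seen this way, the interpolation already treats the embeddings as samples of a function over 2D image coordinates, which is close in spirit to the normalized-coordinate idea.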

Hybrid approach (CNN + Transformer)

The authors also describe a hybrid approach: extracting patches from CNN feature maps instead of raw pixels. This was interesting because it mixes CNN inductive bias with the global context modeling of Transformers. In the paper, hybrids slightly outperform pure ViT at small compute budgets, but the difference vanishes as the models get larger. I initially thought a hybrid model might be superior, but apparently the simplicity of pure ViT is good enough at scale.

Changing trends in Transformer norm placement

I also noticed ViT applies LayerNorm before each block, with the residual connection after it (pre-norm), unlike the original Transformer of Vaswani et al., which normalizes after the residual addition (post-norm). This pattern, common in later models like GPT-3, turned out to be older than I had realized. It made me rethink when exactly this architectural choice became standard.
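
The difference between the two placements fits in a few lines. This is a simplified sketch: layer_norm here has no learnable gain or bias, and sublayer is a placeholder for attention or the MLP, both assumptions made to keep the example short.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each token vector to zero mean and unit variance
    (learnable gain and bias omitted in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm(x: np.ndarray, sublayer) -> np.ndarray:
    # Original Transformer: normalize after the residual addition
    return layer_norm(x + sublayer(x))

def pre_norm(x: np.ndarray, sublayer) -> np.ndarray:
    # ViT-style: normalize the sublayer input; the residual path is untouched
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8)) * 0.1  # stand-in for attention or the MLP

def sublayer(h: np.ndarray) -> np.ndarray:
    return h @ W

print(pre_norm(x, sublayer).shape, post_norm(x, sublayer).shape)  # (4, 8) (4, 8)
```

In the pre-norm form the input x reaches the output through a plain addition, while in the post-norm form even the skip path passes through a normalization.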


Summary

ViT adapts Transformers to images by treating image patches as tokens. The approach feels almost too straightforward, yet it works well with enough data. The minimal inductive bias is both a limitation and an advantage: the model has to learn spatial structure from scratch, but it also gets more flexibility.

The paper reinforced my belief that simplicity, when applied thoughtfully, often beats complicated hand-crafted architectures. ViT isn’t doing something fundamentally complex. It takes a proven model and changes the input modality. That alone is the point.