ViT: Vision Transformer — Images as Sequences of Patches
Why Apply Transformers to Images?
By 2020, Transformers dominated NLP. But images seemed fundamentally different: pixels are arranged in 2D grids, and spatial relationships (pixels close together tend to be related) seemed to call for convolutions, not attention.
Two arguments for Transformers:
- Global context: attention can relate any two pixels directly, no matter how far apart — convolutions need many layers to do this.
- Scalability: in NLP, Transformer performance kept improving as data, model size, and compute grew, and the hope was that the same scaling would carry over to vision.
The challenge: a 224×224 image has 50,176 pixels. Full self-attention over all of them costs O(n²), which is roughly 2.5 billion query-key pairs per attention layer for every image. That is impractical.
The solution: patches.
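To make the saving concrete, the short calculation below compares the number of query-key pairs when every pixel is a token with the number when every 16×16 patch is a token. It is only a back-of-the-envelope sketch: `attention_pairs` is a helper named here for illustration, and it counts pairs for a single attention layer and head rather than measuring real FLOPs.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


image_size = 224
patch_size = 16  # the standard ViT patch size

pixel_tokens = image_size * image_size          # 50,176 tokens if each pixel is a token
patch_tokens = (image_size // patch_size) ** 2  # 196 tokens if each patch is a token

pixel_pairs = attention_pairs(pixel_tokens)  # 2,517,630,976
patch_pairs = attention_pairs(patch_tokens)  # 38,416

print(f"pixels:  {pixel_tokens:>6} tokens -> {pixel_pairs:,} pairs")
print(f"patches: {patch_tokens:>6} tokens -> {patch_pairs:,} pairs")
print(f"reduction: {pixel_pairs // patch_pairs:,}x")  # 65,536x
```

Because the cost is quadratic in the token count, cutting the sequence length by a factor of 256 cuts the number of attention pairs by a factor of 256² = 65,536.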
The ViT Pipeline
Key Design Choices
Patch size: 16×16 pixels gives (224/16)² = 196 patches for a 224×224 image. A patch size of 32×32 gives 49 patches (faster, lower accuracy). Smaller patches mean more tokens and therefore more compute, but generally higher accuracy.
Position embedding: ViT uses simple learned 1D position embeddings, one per sequence position (indices 0–196 for a 224×224 input: the [CLS] token plus 196 patches). Experiments showed 2D-aware position embeddings give little additional benefit.
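The patch counts above can be checked with a small NumPy sketch that cuts an image into non-overlapping patches and flattens each one into a vector. The function name `patchify` and the channels-last `(H, W, C)` layout are assumptions made for this example, not part of any particular ViT codebase.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (raster order).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


image = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder RGB image
tokens_16 = patchify(image, 16)
tokens_32 = patchify(image, 32)
print(tokens_16.shape)  # (196, 768)
print(tokens_32.shape)  # (49, 3072)
```

Halving the patch side length quadruples the number of tokens, which is why patch size is the main knob trading compute against accuracy.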
[CLS] token: Borrowed from BERT. A learnable token prepended to the patch sequence. After L layers of self-attention, the [CLS] output is used for classification — it aggregates global information from all patches.
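How the [CLS] token and the position embeddings fit together can be sketched in a few lines of NumPy. This is an illustrative sketch only: the sizes assume a ViT-Base-like setup (196 flattened 16×16×3 patches, embedding width 768), the "learned" parameters are random stand-ins, and `build_sequence` is a name chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ViT-Base-like sizes: 196 patches of 16*16*3 values, width 768.
num_patches, patch_dim, embed_dim = 196, 768, 768

# Random stand-ins for parameters that would be learned during training.
w_proj = rng.normal(scale=0.02, size=(patch_dim, embed_dim))           # linear patch projection
cls_token = rng.normal(scale=0.02, size=(1, embed_dim))                # learnable [CLS] token
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, embed_dim))  # one row per position


def build_sequence(flat_patches: np.ndarray) -> np.ndarray:
    """Project patches, prepend [CLS], and add 1D position embeddings."""
    patch_tokens = flat_patches @ w_proj                          # (196, D)
    sequence = np.concatenate([cls_token, patch_tokens], axis=0)  # (197, D)
    return sequence + pos_embed                                   # position 0 belongs to [CLS]


flat_patches = rng.normal(size=(num_patches, patch_dim))  # placeholder flattened patches
sequence = build_sequence(flat_patches)
print(sequence.shape)  # (197, 768)
```

The encoder then processes all 197 tokens, and the classification head reads only the output at index 0, the [CLS] position.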
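Transformer: A standard Transformer encoder in the style of BERT: bidirectional, full self-attention with no causal mask. One difference is that ViT applies layer normalization before each attention and MLP block rather than after it.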
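"Bidirectional, full self-attention" means no mask is applied, so every token (including [CLS]) can attend to every other token. A minimal single-head sketch in NumPy is shown below; the width of 64 and the random weights are arbitrary choices for illustration, and a real encoder would add multiple heads, an output projection, residual connections, and an MLP.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (N, N): every token vs every token
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights


rng = np.random.default_rng(0)
n_tokens, dim = 197, 64  # 196 patches + [CLS]; dim is an arbitrary illustrative width
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (197, 64) (197, 197)
```

Row 0 of `weights` holds the attention that [CLS] pays to all 197 positions, which is how it can gather information from the whole image.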
Variants
| Model | Patch size (px) | Layers | Heads | Params |
|---|---|---|---|---|
| ViT-Tiny | 16 | 12 | 3 | 5.7M |
| ViT-Small | 16 | 12 | 6 | 22M |
| ViT-Base | 16 | 12 | 12 | 86M |
| ViT-Large | 16 | 24 | 16 | 307M |
| ViT-Huge | 14 | 32 | 16 | 632M |
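The parameter counts in the table can be roughly reproduced from depth and hidden width alone. The sketch below is an approximation under stated assumptions: the hidden widths (192, 384, 768, 1024, 1280) come from the commonly published configurations rather than from the table, the MLP is assumed to be 4× the hidden width, and biases, LayerNorm, embeddings, and the classification head are ignored. `encoder_params` is a helper named for this example.

```python
# (layers, hidden width) per variant. The widths are assumptions taken from the
# commonly published configurations; they do not appear in the table.
CONFIGS = {
    "ViT-Tiny": (12, 192),
    "ViT-Small": (12, 384),
    "ViT-Base": (12, 768),
    "ViT-Large": (24, 1024),
    "ViT-Huge": (32, 1280),
}


def encoder_params(layers: int, width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of the encoder blocks.

    Ignores biases, LayerNorm, patch/position embeddings, and the head.
    """
    attention = 4 * width * width        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * width * width  # two linear layers: width <-> mlp_ratio * width
    return layers * (attention + mlp)


for name, (layers, width) in CONFIGS.items():
    print(f"{name:<10} ~{encoder_params(layers, width) / 1e6:.1f}M")
```

The estimates (about 5.3M, 21.2M, 84.9M, 302.0M, and 629.1M) land slightly below the table values, with the gap accounted for by the terms the estimate leaves out. The number of heads does not enter the formula, because heads split the hidden width rather than add to it.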
Modern successors: DeiT (data-efficient training with distillation), Swin Transformer (hierarchical windows), MAE (masked autoencoder pre-training).
✅ Key Takeaways
- ViT splits images into fixed-size patches and treats them as tokens in a standard Transformer encoder.
- Uses a [CLS] token for classification and learned 1D position embeddings for patches.
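- Requires large-scale pre-training (JFT-300M / ImageNet-21K) to outperform CNNs, because it has weaker built-in inductive biases (such as locality and translation equivariance) than convolutions.
- Spawned a large family of successors (DeiT, Swin, BEiT, MAE, DINO), making ViT-style models a dominant vision backbone as of 2024.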
