ViT: Vision Transformer — Images as Sequences of Patches

TL;DR: ViT splits a 224×224 image into 16×16-pixel patches, flattens each patch into a vector, linearly projects it, adds position embeddings, and feeds the sequence into a standard BERT-like Transformer encoder. When pre-trained on datasets larger than ImageNet alone (ImageNet-21K or JFT-300M), it matches or beats CNNs.

Why Apply Transformers to Images?

By 2020, Transformers dominated NLP. But images seemed fundamentally different: pixels are arranged in 2D grids, and spatial relationships (pixels close together tend to be related) seemed to call for convolutions, not attention.

Two arguments for Transformers:

  1. Global context: attention can relate any two pixels directly, no matter how far apart — convolutions need many layers to do this.
  2. Scalability: in NLP, Transformers had kept improving as data, model size, and compute grew, and the hope was that the same would hold for vision.

The challenge: a 224×224 image has 50,176 pixels. Full self-attention over all of them scales as O(n²): roughly 2.5 billion pairwise attention scores per layer and per head, which is far too expensive in both compute and memory.

The solution: patches.
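To make the gap concrete, the arithmetic above can be checked in a few lines. This is a minimal sketch that counts only query-key pairs for one layer and one head and ignores the hidden dimension; the helper name `attention_pairs` is illustrative, not from any library.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention (one layer, one head)."""
    return num_tokens * num_tokens

image_size = 224
patch_size = 16

pixel_tokens = image_size * image_size            # one token per pixel
patch_tokens = (image_size // patch_size) ** 2    # one token per 16x16 patch

print(pixel_tokens, attention_pairs(pixel_tokens))   # 50176 2517630976
print(patch_tokens, attention_pairs(patch_tokens))   # 196 38416
# Reduction factor in attention pairs when moving from pixels to patches.
print(attention_pairs(pixel_tokens) // attention_pairs(patch_tokens))  # 65536
```

Each patch covers 256 pixels, so the sequence is 256 times shorter and the attention matrix is 256² = 65,536 times smaller.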

The ViT Pipeline

  1. Split into patches: 224×224 → 14×14 grid of 16×16-pixel patches = 196 patches.
  2. Flatten + linear projection: each patch is 16×16×3 = 768 values → linear layer → d_model = 768.
  3. [CLS] + position embeddings: prepend a [CLS] token and add learned 1D position embeddings.
  4. Encoder: standard Transformer encoder (L layers); the [CLS] output goes to an MLP head that predicts the class (e.g. "cat").

Data requirement: ViT underperforms CNNs when trained on ImageNet alone. It needs JFT-300M or ImageNet-21K pre-training; CNNs' inductive biases help when less data is available.
Figure 1: The ViT pipeline: split image into 16×16 patches → flatten → linear projection → prepend [CLS] → add position embeddings → standard Transformer encoder → classify via [CLS] head.
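The first two steps of the pipeline (split into patches, then flatten and project) can be sketched with NumPy. This is an illustrative sketch, not the reference implementation: the `patchify` helper is a hypothetical name, the image is random noise standing in for real data, and the projection weights are random stand-ins for parameters that would be learned.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), with patches
    ordered row by row (left to right, top to bottom).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid_h, grid_w = h // patch, w // patch
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    tiles = image.reshape(grid_h, patch, grid_w, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(grid_h * grid_w, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)   # stand-in for a real image

patches = patchify(image, patch=16)                   # (196, 768)

d_model = 768
# Random stand-ins for the learned projection weights and bias.
w_proj = rng.standard_normal((patches.shape[1], d_model)).astype(np.float32) * 0.02
b_proj = np.zeros(d_model, dtype=np.float32)

tokens = patches @ w_proj + b_proj                    # (196, 768) patch embeddings
print(patches.shape, tokens.shape)                    # (196, 768) (196, 768)
```

For ViT-Base/16 the flattened patch size (16×16×3) and d_model both happen to be 768; the projection is still a learned linear map, not an identity.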

Key Design Choices

Patch size: 16×16 pixels gives 196 patches for a 224×224 image. A patch size of 32×32 gives 49 patches (faster, lower accuracy). Smaller patches mean more tokens, which costs more compute but generally improves accuracy.
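The trade-off is easy to tabulate. The sketch below assumes a square image and a patch size that divides it evenly; `num_patches` is an illustrative helper, and the "relative attention cost" column covers only the quadratic attention-score term (the projections and MLP grow linearly with token count).

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Token count (excluding [CLS]) for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch size must divide image size")
    return (image_size // patch_size) ** 2

baseline = num_patches(224, 16)   # 196 tokens for the standard /16 setup

for patch_size in (32, 16, 14, 8):
    n = num_patches(224, patch_size)
    relative_attention_cost = (n / baseline) ** 2
    print(f"patch {patch_size:2d}: {n:4d} tokens, "
          f"attention cost x{relative_attention_cost:.2f} vs patch 16")
```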

Position embedding: ViT uses simple learned 1D position embeddings, one per sequence position (indices 0–196: index 0 for [CLS], indices 1–196 for the patches in row-major order). Experiments showed 2D-aware position embeddings give little additional benefit.
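A small sketch of what "1D" means here, assuming patches are numbered in row-major (raster) order and slot 0 is reserved for [CLS]. The helper names are illustrative.

```python
GRID = 14  # 224 / 16 patches per side

def position_index(row: int, col: int, grid: int = GRID) -> int:
    """1D sequence position of the patch at (row, col); index 0 is [CLS]."""
    return 1 + row * grid + col

def grid_coords(index: int, grid: int = GRID) -> tuple[int, int]:
    """Inverse mapping for patch positions (index >= 1)."""
    return divmod(index - 1, grid)

print(position_index(0, 0))     # 1   (top-left patch)
print(position_index(0, 1))     # 2   (right neighbour: index differs by 1)
print(position_index(1, 0))     # 15  (neighbour below: index differs by 14)
print(position_index(13, 13))   # 196 (bottom-right patch)
print(grid_coords(196))         # (13, 13)
```

The indices themselves carry no notion of vertical adjacency, so any 2D structure has to be learned in the embedding vectors during training.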

[CLS] token: Borrowed from BERT. A learnable token prepended to the patch sequence. After L layers of self-attention, the [CLS] output is used for classification — it aggregates global information from all patches.
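The bookkeeping around the [CLS] token can be sketched as follows. Everything here is a stand-in: the parameters are random rather than learned, the encoder is replaced by a placeholder that passes its input through unchanged, the head is simplified to a single linear layer, and `num_classes = 1000` is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_model, num_classes = 196, 768, 1000

patch_tokens = rng.standard_normal((n_patches, d_model))  # output of the patch projection

# Learned parameters in a real model; random stand-ins here.
cls_token = rng.standard_normal((1, d_model)) * 0.02
pos_embed = rng.standard_normal((n_patches + 1, d_model)) * 0.02
w_head = rng.standard_normal((d_model, num_classes)) * 0.02

# Prepend [CLS], then add one position embedding per sequence slot.
encoder_input = np.concatenate([cls_token, patch_tokens], axis=0) + pos_embed

# A Transformer encoder would run here; it preserves the (197, 768) shape.
encoder_output = encoder_input  # placeholder, NOT a real encoder

cls_output = encoder_output[0]   # only position 0 feeds the classifier
logits = cls_output @ w_head     # one score per class

print(encoder_input.shape, logits.shape)   # (197, 768) (1000,)
```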

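Transformer: Essentially the BERT encoder: bidirectional, full self-attention with no causal mask. One difference is that ViT applies LayerNorm before each attention and MLP block (pre-norm) rather than after.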
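"Full self-attention" means every token, including [CLS], attends to every other token with no mask. Below is a minimal single-head sketch in NumPy under simplifying assumptions: random weights, one head of width 64 (768 / 12, as in ViT-Base), and no multi-head concatenation, MLP, LayerNorm, or residual connections.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n): every token vs every token
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 197, 768, 64             # [CLS] + 196 patches
x = rng.standard_normal((n_tokens, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                 for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)                      # (197, 64) (197, 197)
```

Row 0 of `weights` is the distribution the [CLS] token places over all 197 positions, which is how it can gather information from every patch.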

Variants

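| Model     | Patch | Layers | Heads | Params |
|-----------|-------|--------|-------|--------|
| ViT-Tiny  | 16    | 12     | 3     | 5.7M   |
| ViT-Small | 16    | 12     | 6     | 22M    |
| ViT-Base  | 16    | 12     | 12    | 86M    |
| ViT-Large | 16    | 24     | 16    | 307M   |
| ViT-Huge  | 14    | 32     | 16    | 632M   |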
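The parameter counts in the table can be sanity-checked with a back-of-the-envelope formula. This sketch assumes the usual hidden sizes for these variants (192, 384, 768, 1024, 1280), which are not listed in the table, and an MLP that is 4× wider than the hidden size. It counts only the large weight matrices in the encoder, so biases, LayerNorm, the patch and position embeddings, and the classification head are ignored, and the estimates come out slightly low.

```python
# name: (layers, hidden size, params reported in the table, in millions)
# Hidden sizes are an assumption taken from the commonly published configs.
configs = {
    "ViT-Tiny":  (12, 192, 5.7),
    "ViT-Small": (12, 384, 22),
    "ViT-Base":  (12, 768, 86),
    "ViT-Large": (24, 1024, 307),
    "ViT-Huge":  (32, 1280, 632),
}

def approx_encoder_params(layers: int, hidden: int) -> int:
    """Per layer: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x-wide MLP."""
    return 12 * layers * hidden * hidden

for name, (layers, hidden, reported_m) in configs.items():
    est_m = approx_encoder_params(layers, hidden) / 1e6
    print(f"{name:10s} estimate {est_m:6.1f}M   table {reported_m}M")
```

The 12·L·d² rule lands within a few percent of the reported figures for the larger models, which shows that almost all of the parameters sit in the encoder layers.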

Modern successors: DeiT (data-efficient training with distillation), Swin Transformer (hierarchical windows), MAE (masked autoencoder pre-training).

✅ Key Takeaways

  • ViT splits images into fixed-size patches and treats them as tokens in a standard Transformer encoder.
  • Uses a [CLS] token for classification and learned 1D position embeddings for patches.
  • Requires large-scale pre-training (JFT-300M / ImageNet-21K) to outperform CNNs — has weaker inductive bias.
  • Spawned a huge family (DeiT, Swin, BEiT, MAE, DINO); ViT-style models are among the most widely used vision backbones as of 2024.