ViT: Vision Transformer — Images as Sequences of Patches

Published: January 13, 2024

TL;DR: ViT splits a 224×224 image into 16×16 pixel patches, flattens each patch into a vector, linearly projects it, adds position embeddings, and feeds the sequence into a standard BERT-like Transformer encoder. When pre-trained on large datasets (JFT-300M or ImageNet-21K), it matches or beats CNNs on image classification.

Why Apply Transformers to Images?

By 2020, Transformers dominated NLP. But images seemed fundamentally different: pixels are arranged in 2D grids, and spatial relationships (pixels close together tend to be related) seemed to call for convolutions, not attention.

Two arguments for Transformers:

Global context: attention can relate any two pixels directly, no matter how far apart — convolutions need many layers to do this.
Scalability: Transformers keep improving as data and compute grow, with less sign of saturating than comparable CNNs.

The challenge: a 224×224 image has 50,176 pixels. Full self-attention over all of them costs O(n²): about 2.5 billion pairwise attention scores per head in every layer, which is impractical.

The solution: patches.
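
To make the saving concrete, here is a small back-of-the-envelope calculation. It only counts query-key pairs for one attention head in one layer (the helper name `attention_pairs` is made up for this sketch), so treat it as an illustration of the quadratic scaling rather than a full FLOP count.

```python
# Count pairwise attention scores (one per query-key pair) for a single
# attention head in a single layer: pixel tokens versus patch tokens.

def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over num_tokens."""
    return num_tokens * num_tokens

image_size = 224
patch_size = 16

pixel_tokens = image_size * image_size            # one token per pixel
patch_tokens = (image_size // patch_size) ** 2    # one token per 16x16 patch

pixel_pairs = attention_pairs(pixel_tokens)
patch_pairs = attention_pairs(patch_tokens)

print(f"pixels:  {pixel_tokens:>6} tokens -> {pixel_pairs:,} pairs")
print(f"patches: {patch_tokens:>6} tokens -> {patch_pairs:,} pairs")
print(f"reduction: {pixel_pairs // patch_pairs:,}x")
```

Going from 50,176 pixel tokens to 196 patch tokens shrinks the sequence by a factor of 256, and the attention matrix by 256² = 65,536.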

The ViT Pipeline

Figure 1: The ViT pipeline: split image into 16×16 patches → flatten → linear projection → prepend [CLS] → add position embeddings → standard Transformer encoder → classify via [CLS] head.
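
The first three steps of the pipeline (split, flatten, project) can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `patchify`, the random stand-in image, the random projection weights, and the width of 768 (the ViT-Base width) are all assumptions made for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)   # stand-in for a real image

patches = patchify(image, patch_size=16)              # (196, 768): 16*16*3 values each

embed_dim = 768                                       # assumed model width
W_proj = rng.normal(0, 0.02, (patches.shape[1], embed_dim)).astype(np.float32)
b_proj = np.zeros(embed_dim, dtype=np.float32)

tokens = patches @ W_proj + b_proj                    # (196, 768) patch embeddings
print(patches.shape, tokens.shape)
```

Note that the flattened patch length (16 × 16 × 3 = 768) only happens to equal the model width here. The projection is still a learned layer, and with other patch sizes or widths the two numbers differ.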

Key Design Choices

Patch size: 16×16 pixels → 14×14 = 196 patches for a 224×224 image. A patch size of 32×32 gives 7×7 = 49 patches (faster, lower accuracy). Smaller patches mean more tokens and more compute, and typically higher accuracy.
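
The trade-off is easy to tabulate. The helper below (`patch_stats`, a name invented for this sketch) assumes a square RGB image whose side is divisible by the patch size, and reports how many tokens each patch size produces and how long each flattened patch is.

```python
def patch_stats(image_size: int, patch_size: int, channels: int = 3):
    """Return (num_patches, flattened_patch_length) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size
    return grid * grid, patch_size * patch_size * channels

for p in (32, 16, 14):
    n, dim = patch_stats(224, p)
    print(f"patch {p:>2}: {n:>3} patches, {dim:>4} values per patch, "
          f"{n * n:,} attention pairs")
```

Halving the patch side quadruples the token count and multiplies the attention matrix by 16, which is why small patches get expensive quickly.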

Position embedding: ViT uses simple learned 1D position embeddings, one per sequence position (197 in total: index 0 for [CLS] and 1–196 for the patches). Experiments showed 2D-aware position embeddings give little additional benefit.

[CLS] token: Borrowed from BERT. A learnable token prepended to the patch sequence. After L layers of self-attention, the [CLS] output is used for classification — it aggregates global information from all patches.
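
Here is a rough NumPy sketch of how the [CLS] token and the position embeddings fit together. The sizes (width 768, 1000 classes), the random stand-in for the patch tokens, and the random "learned" parameters are assumptions for illustration, and the encoder layers themselves are skipped.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim, num_classes = 196, 768, 1000   # illustrative sizes

# Stand-in for the linearly projected patch tokens
patch_tokens = rng.normal(size=(num_patches, embed_dim))

# Learnable parameters (initialised here, trained in practice)
cls_token = np.zeros((1, embed_dim))
pos_embed = rng.normal(0, 0.02, size=(num_patches + 1, embed_dim))

# Prepend [CLS], then add one position embedding per sequence position
x = np.concatenate([cls_token, patch_tokens], axis=0)   # (197, 768)
x = x + pos_embed

# ... the L Transformer encoder layers would update x here ...

# The classification head reads only the [CLS] position (index 0)
W_head = rng.normal(0, 0.02, size=(embed_dim, num_classes))
logits = x[0] @ W_head                                  # (1000,)
print(x.shape, logits.shape)
```

Without the encoder layers the [CLS] row carries no image information at all. It only becomes a useful summary because self-attention lets it read from every patch in every layer.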

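Transformer: The same encoder design as BERT, with bidirectional, full self-attention (no causal mask). One difference: ViT applies LayerNorm before each attention and MLP block rather than after.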
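
"Bidirectional, full self-attention" simply means there is no mask: every token attends to every other token. A minimal single-head sketch in NumPy follows; the function name, the head size of 64, and the random weights are assumptions for the example, and a real encoder adds multiple heads, an output projection, residual connections, LayerNorm, and an MLP.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention with no mask."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) query-key scores
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d, d_head = 197, 768, 64                          # [CLS] + 196 patches
x = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(0, d ** -0.5, size=(d, d_head)) for _ in range(3))

out, weights = self_attention(x, W_q, W_k, W_v)
print(out.shape, weights.shape)                      # (197, 64) (197, 197)
```

Row 0 of `weights` is the [CLS] token's attention over all 197 positions, which is how it gathers global information.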

Variants

Model	Patch	Layers	Heads	Params
ViT-Tiny	16	12	3	5.7M
ViT-Small	16	12	6	22M
ViT-Base	16	12	12	86M
ViT-Large	16	24	16	307M
ViT-Huge	14	32	16	632M
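
The parameter counts can be roughly reproduced from the architecture. The table does not list hidden widths, so the sketch below assumes the commonly cited values (192, 384, 768, 1024, 1280), an MLP hidden size of 4× the width, 224×224 RGB inputs, and a 1000-class linear head. The function name `vit_params` is invented for this example, and the result is an estimate rather than an official count.

```python
def vit_params(layers, width, patch, image=224, channels=3,
               mlp_ratio=4, num_classes=1000):
    """Approximate parameter count of a ViT classifier (weights and biases)."""
    tokens = (image // patch) ** 2 + 1                   # patches + [CLS]
    hidden = mlp_ratio * width
    patch_embed = patch * patch * channels * width + width
    cls_and_pos = width + tokens * width
    attention = 4 * (width * width + width)              # Q, K, V, output proj
    mlp = (width * hidden + hidden) + (hidden * width + width)
    layer_norms = 2 * (2 * width)                        # two LayerNorms per block
    block = attention + mlp + layer_norms
    final_norm = 2 * width
    head = width * num_classes + num_classes
    return patch_embed + cls_and_pos + layers * block + final_norm + head

# name: (layers, assumed width, patch size, params quoted in the table, in millions)
variants = {
    "ViT-Tiny":  (12, 192, 16, 5.7),
    "ViT-Small": (12, 384, 16, 22),
    "ViT-Base":  (12, 768, 16, 86),
    "ViT-Large": (24, 1024, 16, 307),
    "ViT-Huge":  (32, 1280, 14, 632),
}

for name, (layers, width, patch, quoted) in variants.items():
    est = vit_params(layers, width, patch)
    print(f"{name:<10} {est / 1e6:6.1f}M (table: {quoted}M)")
```

The number of heads does not appear in the formula: heads split the width into smaller slices, so they change how attention is computed but not how many parameters it has.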

Modern successors: DeiT (data-efficient training with distillation), Swin Transformer (hierarchical windows), MAE (masked autoencoder pre-training).

✅ Key Takeaways

ViT splits images into fixed-size patches and treats them as tokens in a standard Transformer encoder.
Uses a [CLS] token for classification and learned 1D position embeddings for patches.
Requires large-scale pre-training (JFT-300M / ImageNet-21K) to outperform CNNs, because it has weaker inductive biases (no built-in locality or translation equivariance).
Spawned a huge family (DeiT, Swin, BEiT, MAE, DINO) that is among the most widely used vision backbones as of 2024.

Alessio Borgi