Class Token vs Pooling in Vision Transformers

TL;DR: The [CLS] token learns to aggregate global image information through attention. Global average pooling (GAP) simply averages all patch token outputs. Both work; [CLS] tends to better capture discriminative global features, while GAP spreads gradient signal across all patches and trains more stably. Modern self-supervised ViTs often use both.

The Problem: From Patches to Image

After L Transformer blocks, you have a sequence of token representations:

[z_CLS, z₁, z₂, ..., z_N]   ∈ ℝ^{(N+1) × d_model}

For image classification (one label per image), you need a single vector. Two approaches:

Strategy 1: The [CLS] Token

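Borrowed from BERT, a learnable vector [CLS] is prepended to the patch sequence at position 0. It has no corresponding image region: it is a trainable embedding whose initial value (zeros or small random values, depending on the implementation) does not depend on the input image.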
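As a rough illustration of the prepending step, here is a minimal NumPy sketch. The sizes (196 patches, 768 dimensions) and the small random initialisation are assumptions chosen for the example, not values taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: a 14 x 14 patch grid and a 768-dim model.
num_patches, d_model = 196, 768

# Patch embeddings for one image, one row per patch (random stand-in values).
patch_tokens = rng.normal(size=(num_patches, d_model))

# The [CLS] embedding; a real model would store this as a trainable parameter.
# Small random init is an assumption here; some implementations use zeros.
cls_token = rng.normal(scale=0.02, size=(1, d_model))

# Prepend [CLS] at position 0, giving a sequence of N + 1 tokens.
tokens = np.concatenate([cls_token, patch_tokens], axis=0)

print(tokens.shape)  # (197, 768)
```

The same [CLS] row is prepended to every image; only the patch rows change with the input.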

After L blocks of multi-head attention, the [CLS] position has attended to every patch at every layer. By the final layer, z_CLS is expected to contain a global summary of the image.

The classification head is applied only to z_CLS:

logits = W_head · z_CLS
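The readout above can be sketched in a few lines of NumPy. The function name `cls_readout` and the shapes are hypothetical, and a real model would typically apply a final LayerNorm before the head, which is omitted here.

```python
import numpy as np

def cls_readout(final_tokens, w_head):
    """Classify from the [CLS] output only.

    final_tokens: (N + 1, d_model) final-layer outputs, [CLS] at index 0.
    w_head:       (num_classes, d_model) linear head weights.
    """
    z_cls = final_tokens[0]
    return w_head @ z_cls

rng = np.random.default_rng(0)
final_tokens = rng.normal(size=(197, 768))  # stand-in for encoder outputs
w_head = rng.normal(size=(10, 768))         # assumed 10 classes

logits = cls_readout(final_tokens, w_head)
print(logits.shape)  # (10,)
```

Note that the final-layer patch outputs never enter this computation; they influence the logits only through the attention that fed z_CLS in earlier blocks.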

Why it works: Attention is content-based. The [CLS] token learns to attend to the most discriminative patches — it preferentially gathers features that matter for classification. Through training, it specialises as a global image descriptor.

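Drawback: The classification loss reaches the final-layer outputs only through z_CLS. Patch tokens receive gradient indirectly, via the attention connections that feed [CLS] in earlier blocks, which can make the learning signal slower to spread across the sequence.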

Strategy 2: Global Average Pooling (GAP)

Instead of a special token, simply average all N patch token outputs at the final layer:

z_image = (1/N) Σᵢ zᵢ

The classification head is applied to z_image:

logits = W_head · z_image
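A minimal NumPy sketch of the GAP readout follows. The name `gap_readout` and the `has_cls` flag are hypothetical; the flag only covers the case where the encoder still carries a [CLS] token that should be excluded from the mean.

```python
import numpy as np

def gap_readout(final_tokens, w_head, has_cls=True):
    """Classify from the mean of the patch outputs.

    final_tokens: (N + 1, d_model) if has_cls else (N, d_model).
    w_head:       (num_classes, d_model) linear head weights.
    """
    patches = final_tokens[1:] if has_cls else final_tokens
    z_image = patches.mean(axis=0)  # (1/N) * sum_i z_i
    return w_head @ z_image

rng = np.random.default_rng(0)
final_tokens = rng.normal(size=(197, 768))  # stand-in for encoder outputs
w_head = rng.normal(size=(10, 768))         # assumed 10 classes

logits = gap_readout(final_tokens, w_head)
print(logits.shape)  # (10,)
```

Because the mean ignores order, shuffling the final-layer patch outputs leaves the logits unchanged.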

Why it works: Every patch contributes equally to the representation (initially). Gradient flows back to every patch token during training — the learning signal is spread across the full sequence from the start.
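The difference in gradient flow between the two readouts can be written out at the final layer. This is a sketch under the assumption of a plain linear head with no final normalisation; g denotes the loss gradient with respect to the logits, a symbol introduced here for illustration.

```latex
% g = \partial \mathcal{L} / \partial \mathrm{logits} (assumed notation)
\begin{aligned}
\text{[CLS] readout:}\quad
  & \frac{\partial \mathcal{L}}{\partial z_{\mathrm{CLS}}} = W_{\mathrm{head}}^{\top} g,
  && \frac{\partial \mathcal{L}}{\partial z_i} = 0 \quad (i = 1, \dots, N) \\
\text{GAP readout:}\quad
  & \frac{\partial \mathcal{L}}{\partial z_i} = \frac{1}{N}\, W_{\mathrm{head}}^{\top} g
  && (i = 1, \dots, N)
\end{aligned}
```

With [CLS], the final-layer patch outputs get no gradient at all, and earlier-layer patch tokens are reached only through attention into the [CLS] position. With GAP, every final-layer patch output gets the same gradient, scaled by 1/N.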

Drawback: No mechanism for selective attention at readout time. All patches contribute equally regardless of relevance — a background patch contributes as much as the object of interest.

What Experiments Show

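Dosovitskiy et al. (ViT, 2020) report that both strategies perform comparably, provided the learning rate is tuned separately for each; GAP looked worse only when it reused the learning rate chosen for [CLS]. The original ViT uses [CLS] to stay close to the BERT-style Transformer design.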

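DeiT keeps the [CLS] readout and adds a second learnable token for distillation. MAE (masked autoencoder) pre-trains with a pixel-reconstruction loss on masked patches, so no image-level objective trains [CLS] during pre-training; its authors report that the model works about as well with average pooling as with a class token, and GAP is a common choice when fine-tuning it.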

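DINO and DINOv2 apply their image-level self-distillation loss to [CLS]. The frozen [CLS] embedding from DINOv2 is a strong image-level feature for linear probing and retrieval, even though no labels are used in pre-training. For dense tasks (segmentation, depth), it is the patch tokens that are used, not [CLS].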

The [CLS] token as a query over the image: In later layers, the [CLS] token's query vector asks "which patches contain the most class-relevant information?" and its attention output, a weighted sum of the patch value vectors, becomes the aggregated answer. This is why [CLS] representations from large pre-trained ViTs are strong classifiers even with a linear head: they have learned to summarise images through selective attention.
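To make the query view concrete, here is a minimal single-head sketch of the attention computed at the [CLS] position. The function name, the projection matrices `w_q`, `w_k`, `w_v`, and the tiny sizes are all assumptions for illustration; real ViTs use multiple heads, an output projection, and residual connections.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cls_attention(tokens, w_q, w_k, w_v):
    """Single-head attention output for the [CLS] position only.

    tokens: (N + 1, d_model), [CLS] at index 0.
    Returns (output, weights): the aggregated vector and the weight
    [CLS] places on each token (including itself).
    """
    d_head = w_q.shape[1]
    q_cls = tokens[0] @ w_q                            # the [CLS] query
    keys = tokens @ w_k                                # one key per token
    values = tokens @ w_v                              # one value per token
    weights = softmax(keys @ q_cls / np.sqrt(d_head))  # content-based weights
    output = weights @ values                          # weighted sum of values
    return output, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # [CLS] + 4 patches, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))

output, weights = cls_attention(tokens, w_q, w_k, w_v)
print(weights.shape, output.shape)  # (5,) (4,)
```

Unlike GAP, the weights here depend on the content of each token, so an uninformative patch can receive a weight close to zero.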

[CLS] Attention as a Localisation Signal

An important property: because [CLS] attends to all patches, its attention weights in the last layer form an attention map — a rough spatial map of which regions mattered for classification.

DINO exploits this: the attention maps from a self-supervised ViT produce surprisingly clean segmentation-like highlights of the foreground object, with no segmentation supervision whatsoever.

Input image: dog on grass
[CLS] last-layer attention: high weight on dog, low on grass
Attention map: roughly segments the dog
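The map in this example can be extracted from the last-layer attention weights roughly as follows. The function name `cls_attention_map` is hypothetical, the attention tensor is random stand-in data, and averaging over heads is a simplification made here; DINO-style visualisations often show each head separately.

```python
import numpy as np

def cls_attention_map(attn, grid_size):
    """Turn last-layer attention into a spatial map.

    attn:      (num_heads, N + 1, N + 1) attention weights, rows sum to 1,
               with [CLS] at index 0.
    grid_size: (grid_h, grid_w) with grid_h * grid_w == N.
    Returns a (grid_h, grid_w) map of [CLS]-to-patch attention,
    averaged over heads.
    """
    grid_h, grid_w = grid_size
    cls_to_patches = attn[:, 0, 1:]  # [CLS] row, dropping its self-attention
    return cls_to_patches.mean(axis=0).reshape(grid_h, grid_w)

# Stand-in attention weights: softmax over random scores.
rng = np.random.default_rng(0)
num_heads, grid = 6, (14, 14)
seq_len = 1 + grid[0] * grid[1]
scores = rng.normal(size=(num_heads, seq_len, seq_len))
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

attn_map = cls_attention_map(attn, grid)
print(attn_map.shape)  # (14, 14)
```

Upsampling this grid to the input resolution gives the kind of heat map described above; with real weights from a trained model, high values would be expected on the dog and low values on the grass.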

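This readout is specific to models with a [CLS] token: a GAP-only model has no single query position whose attention row summarises the whole image. It makes [CLS]-based ViTs useful for localisation without detection supervision.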

When to Use Each

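| Scenario | Recommendation |
| --- | --- |
| Image classification | Either; [CLS] matches the original ViT and BERT convention |
| Self-supervised pretraining (MAE-style) | GAP at fine-tuning (no image-level loss trains [CLS] during pre-training) |
| Dense prediction (segmentation, depth) | Patch tokens directly (not CLS or GAP) |
| Image retrieval / linear probing | [CLS] (especially DINOv2) |
| Multimodal models (CLIP, LLaVA) | [CLS] for CLIP-style image-text embeddings; patch tokens when feeding a language model (LLaVA-style) |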

A Hybrid Approach

Some models use both: the [CLS] token representation and the mean-pooled patch tokens are concatenated or ensembled. BEiT-3 and some CLIP variants find marginal gains from this.
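A minimal sketch of the concatenation variant is shown below. The name `hybrid_readout` is hypothetical and this is not the exact recipe of any model mentioned above; it only shows that the resulting feature has twice the model dimension, so the head must be sized accordingly.

```python
import numpy as np

def hybrid_readout(final_tokens):
    """Concatenate the [CLS] output with the mean of the patch outputs.

    final_tokens: (N + 1, d_model), [CLS] at index 0.
    Returns a feature vector of size 2 * d_model.
    """
    z_cls = final_tokens[0]
    z_gap = final_tokens[1:].mean(axis=0)
    return np.concatenate([z_cls, z_gap])

rng = np.random.default_rng(0)
final_tokens = rng.normal(size=(197, 768))  # stand-in for encoder outputs

feature = hybrid_readout(final_tokens)
print(feature.shape)  # (1536,)
```

A linear head on this feature would need a weight matrix of shape (num_classes, 2 * d_model).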

Summary

| Property | [CLS] Token | Global Average Pooling |
| --- | --- | --- |
| Architecture | Extra prepended token | No extra token |
| Readout | One token’s output | Mean of all patch outputs |
| Gradient distribution | Concentrated at position 0 | Spread across all patches |
| Selective attention | Yes (implicit via attention) | No (uniform averaging) |
| Interpretable attention map | Yes | No |
| Performance | Comparable | Comparable |
| Used by | ViT, DeiT, DINO, CLIP | MAE, some CNN hybrids |

The [CLS] token is the dominant convention in Transformer-based vision models. Understanding it — and its alternative — clarifies how image representations are formed and why ViT attention maps can serve as segmentation signals without any spatial supervision.