DeiT: Training ViTs Efficiently Without Large Datasets

TL;DR: DeiT (Touvron et al., Facebook AI, 2021) trains ViT-Base to 81.8% top-1 on ImageNet-1k in under 3 days on a single 8-GPU node, with no JFT pre-training. The recipe: strong augmentation and regularisation (RandAugment, Mixup, CutMix), plus a distillation token that learns from a RegNet teacher’s predictions and pushes accuracy higher still.

ViT’s Data Hunger Problem

The original ViT paper was clear: without pre-training on JFT-300M (Google’s internal 300M-image dataset), ViT performed significantly worse than ResNets on ImageNet-1k. Transformers lack the inductive biases of convolutions (local connectivity, translation equivariance) and need far more data to learn them from scratch.

This was a serious problem. JFT-300M is not publicly available. ViT seemed to require data most researchers could not access.

DeiT solved this.

The DeiT Solution: Two Ingredients

1. Knowledge Distillation with a Distillation Token

DeiT introduces a distillation token — a second special token (alongside [CLS]) prepended to the patch sequence. The distillation token learns to mimic the output of a teacher network (a strong ConvNet, specifically a RegNetY-16GF).

The training loss combines:

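  • Classification loss: cross-entropy between the [CLS] output and the one-hot ground truth label
  • Distillation loss: cross-entropy between the distillation token output and the teacher’s prediction, used either as a hard label (the teacher’s argmax class) or as a soft label (the teacher’s full probability distribution)

With hard labels, the two terms are weighted equally:

L = (1/2) · L_CE(z_CLS, y) + (1/2) · L_CE(z_distil, y_teacher)

where y is the ground truth label and y_teacher is the teacher’s argmax class.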

The distillation token attends to all patches via normal self-attention, but its output is supervised by the teacher, not the ground truth. This gives the student ViT a second training signal: not just “is this a cat?” but “what would a strong ConvNet predict for this particular augmented image?” The two can differ, for example when a heavy crop removes the labelled object.

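Hard distillation (DeiT default): use the teacher’s argmax as the target label.
Soft distillation: use the teacher’s full softmax distribution as the target, typically through a temperature-scaled KL divergence. The soft target carries more information per example, but in the DeiT paper’s experiments hard distillation produced the more accurate student.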
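
The two variants can be sketched in a few lines of dependency-free Python. This is a minimal illustration on toy logits, not DeiT’s actual training code: the function names (softmax, cross_entropy, distillation_loss) and the toy values are assumptions made for this example, and the soft branch uses plain cross-entropy against the teacher’s softmax with the same 1/2 weights, whereas the paper’s soft variant uses a temperature-scaled KL term with its own weighting.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

def one_hot(index, num_classes):
    """One-hot distribution over num_classes with all mass on index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def distillation_loss(cls_logits, dist_logits, teacher_logits, label, hard=True):
    """0.5 * CE([CLS] head, ground truth) + 0.5 * CE(distillation head, teacher)."""
    n = len(cls_logits)
    if hard:
        # Hard distillation: the teacher's argmax class becomes a one-hot target.
        teacher_class = max(range(n), key=lambda i: teacher_logits[i])
        teacher_target = one_hot(teacher_class, n)
    else:
        # Simplified soft distillation: the teacher's softmax is the target.
        teacher_target = softmax(teacher_logits)
    return (0.5 * cross_entropy(cls_logits, one_hot(label, n))
            + 0.5 * cross_entropy(dist_logits, teacher_target))

# Toy example with 3 classes; here the teacher disagrees with the ground truth.
cls_logits = [2.0, 0.5, -1.0]
dist_logits = [1.0, 1.5, -0.5]
teacher_logits = [0.2, 2.0, -1.0]
hard_loss = distillation_loss(cls_logits, dist_logits, teacher_logits, label=0, hard=True)
soft_loss = distillation_loss(cls_logits, dist_logits, teacher_logits, label=0, hard=False)
print(f"hard: {hard_loss:.4f}  soft: {soft_loss:.4f}")
```

Cross-entropy against a fixed teacher distribution differs from the KL divergence only by the teacher’s entropy, which is constant with respect to the student, so at temperature 1 the simplified soft branch yields the same gradients as a KL formulation.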

Why a distillation token instead of just adding a loss to [CLS]? The distillation token specialises for mimicking the teacher independently of the classification token. At inference, you can use [CLS], the distillation token, or ensemble both. Keeping them separate allows different roles to emerge.

2. Strong Data Augmentation

Without large datasets, augmentation is crucial. DeiT applies:

  • RandAugment: a few randomly chosen transformations per image (e.g. colour changes, shear, translate), applied at a set magnitude
  • Mixup: blend two images and their labels linearly
  • CutMix: replace a rectangular patch of one image with a patch from another, blending labels proportionally
  • Random Erasing: randomly erase a rectangular region
  • Label smoothing: softened one-hot targets, which discourages overconfidence (strictly a regulariser rather than an augmentation)

These augmentations effectively multiply the training set diversity. Crucially, they provide the regularisation that was previously achieved by scale.
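
To make the label arithmetic behind Mixup, CutMix, and label smoothing concrete, here is a small sketch on plain nested lists (a single-channel “image” is a list of rows). It is written without any tensor library purely to stay self-contained; a real pipeline would operate on batched tensors. The function names and the default values (alpha of 0.8 for Mixup, 1.0 for CutMix, smoothing of 0.1) are assumptions for illustration, not settings stated in this post.

```python
import random

def smooth_labels(label, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) on the true class plus eps spread uniformly."""
    return [(1.0 - eps) * (1.0 if i == label else 0.0) + eps / num_classes
            for i in range(num_classes)]

def mix_targets(target_a, target_b, lam):
    """Blend two target distributions with weight lam on the first."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(target_a, target_b)]

def mixup(image_a, image_b, target_a, target_b, alpha=0.8, rng=random):
    """Mixup: convex combination of two images and of their targets."""
    lam = rng.betavariate(alpha, alpha)
    mixed = [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
             for row_a, row_b in zip(image_a, image_b)]
    return mixed, mix_targets(target_a, target_b, lam)

def cutmix(image_a, image_b, target_a, target_b, alpha=1.0, rng=random):
    """CutMix: paste a random box from image_b into a copy of image_a."""
    height, width = len(image_a), len(image_a[0])
    lam = rng.betavariate(alpha, alpha)
    # Box covering roughly (1 - lam) of the image, centred at a random pixel.
    cut_h = int(height * (1.0 - lam) ** 0.5)
    cut_w = int(width * (1.0 - lam) ** 0.5)
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    mixed = [row[:] for row in image_a]
    for r in range(top, bottom):
        mixed[r][left:right] = image_b[r][left:right]
    # Recompute lam from the area actually pasted, since clipping can shrink the box.
    lam = 1.0 - (bottom - top) * (right - left) / (height * width)
    return mixed, mix_targets(target_a, target_b, lam)

# Toy usage: an all-zero image of class 0 and an all-one image of class 1.
rng = random.Random(0)
zeros = [[0.0] * 4 for _ in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
t0, t1 = smooth_labels(0, 3), smooth_labels(1, 3)
mixup_image, mixup_target = mixup(zeros, ones, t0, t1, rng=rng)
cutmix_image, cutmix_target = cutmix(zeros, ones, t0, t1, rng=rng)
print(mixup_target, cutmix_target)
```

In both cases the target is no longer one-hot: its weights track how much of each source image survives in the mixed input, which is what lets these methods act as regularisers.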

Architecture: DeiT = ViT + Distillation Token

DeiT-Base is architecturally identical to ViT-Base (d_model=768, 12 heads, 12 layers, 16×16 patches) with one addition: the distillation token.
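
As a sanity check on these dimensions, the sequence length and parameter count can be worked out by hand. The sketch below assumes a standard ViT layout that this post does not spell out: 224×224 RGB input, learned position embeddings, an MLP ratio of 4, biases on every linear layer, two LayerNorms per block plus a final one, and one 1000-class linear head per special token. The function name vit_param_count is hypothetical.

```python
def vit_param_count(dim=768, depth=12, mlp_ratio=4, patch=16, image=224,
                    channels=3, num_classes=1000, distilled=False):
    """Return (sequence length, parameter count) for a ViT/DeiT-style model."""
    num_patches = (image // patch) ** 2           # 14 x 14 = 196 patches
    extra_tokens = 2 if distilled else 1          # [CLS] (+ distillation token)
    seq_len = num_patches + extra_tokens

    patch_embed = patch * patch * channels * dim + dim   # linear projection + bias
    special_tokens = extra_tokens * dim                  # learned token vectors
    pos_embed = seq_len * dim                            # learned position embeddings

    hidden = mlp_ratio * dim
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)    # QKV + output proj
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)         # two linear layers
    layer_norms = 2 * (2 * dim)                                  # scale + shift, twice
    block = attention + mlp + layer_norms

    final_norm = 2 * dim
    heads = extra_tokens * (dim * num_classes + num_classes)     # one head per token

    total = (patch_embed + special_tokens + pos_embed
             + depth * block + final_norm + heads)
    return seq_len, total

print(vit_param_count())                 # (197, 86567656)
print(vit_param_count(distilled=True))   # (198, 87338192)
```

Under these assumptions the distillation token adds one sequence position, one extra position embedding, and one extra classifier head: under a million parameters on top of roughly 86M. The number of heads does not appear because splitting the 768 dimensions across 12 heads does not change the parameter count.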

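At inference time, DeiT can classify from the [CLS] head alone, from the distillation head alone, or from both. For distilled models the paper’s reference setup is late fusion: the softmax outputs of the two heads are added, which gives a small gain over either head on its own.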
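
Combining the two heads at test time is simple to sketch. The example below averages the two softmax outputs, which selects the same class as adding them; the predict function, its mode argument, and the toy logits are assumptions for illustration rather than an existing API.

```python
import math

def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(cls_logits, dist_logits, mode="fused"):
    """Pick a class from the [CLS] head, the distillation head, or both."""
    if mode == "cls":
        scores = softmax(cls_logits)
    elif mode == "dist":
        scores = softmax(dist_logits)
    elif mode == "fused":
        # Late fusion: average the two heads' probability distributions.
        scores = [(a + b) / 2.0
                  for a, b in zip(softmax(cls_logits), softmax(dist_logits))]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return max(range(len(scores)), key=scores.__getitem__)

# Toy logits where the heads disagree and the distillation head is more confident.
cls_logits = [2.0, 1.0, 0.0]
dist_logits = [0.0, 3.0, 0.0]
print(predict(cls_logits, dist_logits, "cls"),
      predict(cls_logits, dist_logits, "dist"),
      predict(cls_logits, dist_logits, "fused"))   # 0 1 1
```

Fusing probabilities rather than raw logits keeps either head from dominating just because its logits happen to have a larger scale.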

Three sizes, with ImageNet-1k top-1 accuracy at 224×224 resolution when trained without distillation:

  • DeiT-Ti (5M parameters): 72.2% top-1
  • DeiT-S (22M): 79.8% top-1
  • DeiT-B (86M): 81.8% top-1

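For reference, ViT-B/16 trained on ImageNet-1k alone, without JFT pre-training, is reported at about 78% top-1 (77.9% in the ViT paper). With the distillation token, DeiT-B rises to 83.4%.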

Why a ConvNet Teacher?

DeiT intentionally uses a ConvNet (RegNetY) as the teacher rather than another ViT. The hypothesis: CNNs have locality inductive biases baked in that ViTs lack. Distilling from a CNN transfers these biases to the ViT student — teaching it to pay attention to local patterns it would otherwise learn only with much more data.

This is supported by the paper’s teacher comparison: a student distilled from a ConvNet (RegNetY) teacher outperforms one distilled from a transformer teacher of comparable accuracy.

Legacy and Influence

DeiT established the standard training recipe for ViTs on moderate-scale data, and its influence shows in later work:

  1. DeiT-III (2022) refined the recipe further: no distillation, but a 3-Augment strategy, LayerScale, and strong regularisation
  2. Many subsequent ViT papers (e.g. BEiT, MAE, DINO) build on DeiT’s training recipe, for example for supervised baselines or fine-tuning
  3. The distillation token idea influenced multimodal models where tokens from different modalities train independently

Summary

Component                      | Role
[CLS] token                    | Learns from ground truth labels
Distillation token             | Learns from teacher (ConvNet) predictions
RandAugment + Mixup + CutMix   | Regularisation without scale
RegNet teacher                 | Transfers CNN inductive biases

DeiT showed that ViTs do not need private 300M-image datasets — they need the right training strategy. It made ViT accessible to the research community and established data augmentation + distillation as the standard recipe for data-efficient Transformer training in vision.