DeiT: Training ViTs Efficiently Without Large Datasets
ViT’s Data Hunger Problem
The original ViT paper was clear: without pre-training on JFT-300M (Google’s internal 300M-image dataset), ViT performed significantly worse than ResNets on ImageNet-1k. Transformers lack the inductive biases of convolutions (local connectivity, translation equivariance) and need far more data to learn them from scratch.
This was a serious problem. JFT-300M is not publicly available. ViT seemed to require data most researchers could not access.
DeiT solved this.
The DeiT Solution: Two Ingredients
1. Knowledge Distillation with a Distillation Token
DeiT introduces a distillation token — a second special token (alongside [CLS]) prepended to the patch sequence. The distillation token learns to mimic the output of a teacher network (a strong ConvNet, specifically a RegNetY-16GF).
The training loss combines:
- Classification loss: cross-entropy between [CLS] output and one-hot ground truth labels
- Distillation loss: cross-entropy between the distillation token output and the teacher’s prediction, using either the teacher’s argmax (hard label) or, for soft distillation, its temperature-softened class probabilities (typically via a KL-divergence term)
The distillation token attends to all patches via normal self-attention, but its output is supervised by the teacher, not the ground truth. This gives the student ViT a second training signal: not just “is this a cat?” but “what does a strong ConvNet predict for this particular augmented image?”
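Hard distillation (DeiT default): use the teacher’s argmax as the target, treated as a second hard label.
Soft distillation: match the teacher’s full softmax distribution, usually softened with a temperature. The soft target carries more information per example, but the DeiT paper reports that hard distillation gives higher accuracy, which is why it is the default.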
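The two loss variants can be sketched in plain Python as follows. This is a minimal illustration, not the reference implementation: the function names, the equal 0.5/0.5 weighting in the hard variant, and the `alpha` and `tau` defaults in the soft variant are assumptions chosen for readability, and lists of floats stand in for framework tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy(logits, target_probs):
    """Cross-entropy between softmax(logits) and a target distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """[CLS] head learns from the true label, distillation head from the teacher's argmax."""
    n = len(cls_logits)
    teacher_label = max(range(n), key=lambda i: teacher_logits[i])
    loss_cls = cross_entropy(cls_logits, one_hot(label, n))
    loss_dist = cross_entropy(dist_logits, one_hot(teacher_label, n))
    return 0.5 * loss_cls + 0.5 * loss_dist  # assumed equal weighting

def soft_distillation_loss(cls_logits, dist_logits, teacher_logits, label,
                           alpha=0.5, tau=3.0):
    """[CLS] head uses cross-entropy; distillation head matches the softened teacher via KL."""
    n = len(cls_logits)
    loss_cls = cross_entropy(cls_logits, one_hot(label, n))
    p_teacher = softmax(teacher_logits, tau)
    p_student = softmax(dist_logits, tau)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # tau**2 keeps gradient magnitudes comparable across temperatures
    return (1 - alpha) * loss_cls + alpha * tau * tau * kl
```

When the teacher’s argmax disagrees with the ground-truth label, the two heads are pulled towards different classes, which is the point of giving the teacher signal its own token.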
2. Strong Data Augmentation
Without large datasets, augmentation is crucial. DeiT applies:
- RandAugment: a small number of randomly selected operations per image (e.g. colour changes, shear, translate), applied at a preset magnitude
- Mixup: blend two images and their labels linearly
- CutMix: replace a rectangular patch of one image with a patch from another, blending labels proportionally
- Random Erasing: randomly erase a rectangular region
- Label smoothing: softened one-hot targets that discourage overconfidence (a regulariser on the labels rather than an image augmentation)
These augmentations effectively multiply the training set diversity. Crucially, they provide the regularisation that was previously achieved by scale.
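To make the label arithmetic behind Mixup, CutMix, and label smoothing concrete, here is a small dependency-free sketch. Images are nested lists of pixel values, and the function names, the explicit `lam` and `box` arguments, and the `eps=0.1` default are illustrative assumptions; in practice the mixing weight is usually sampled from a Beta distribution and the box is drawn at random.

```python
def mixup(img_a, img_b, label_a, label_b, lam):
    """Blend two images and their one-hot labels with weight lam."""
    mixed = [[lam * pa + (1 - lam) * pb for pa, pb in zip(row_a, row_b)]
             for row_a, row_b in zip(img_a, img_b)]
    label = [lam * ya + (1 - lam) * yb for ya, yb in zip(label_a, label_b)]
    return mixed, label

def cutmix(img_a, img_b, label_a, label_b, box):
    """Paste the box region of img_b into img_a and weight labels by area."""
    top, left, bottom, right = box  # half-open ranges: rows top..bottom-1
    height, width = len(img_a), len(img_a[0])
    mixed = [row[:] for row in img_a]
    for r in range(top, bottom):
        mixed[r][left:right] = img_b[r][left:right]
    # lam is the fraction of the image that still comes from img_a
    lam = 1 - (bottom - top) * (right - left) / (height * width)
    label = [lam * ya + (1 - lam) * yb for ya, yb in zip(label_a, label_b)]
    return mixed, label

def smooth_labels(one_hot, eps=0.1):
    """Move eps of the probability mass uniformly across all classes."""
    n = len(one_hot)
    return [(1 - eps) * y + eps / n for y in one_hot]

# Toy 4x4 "images": one all ones, one all zeros, with two classes
img_a = [[1.0] * 4 for _ in range(4)]
img_b = [[0.0] * 4 for _ in range(4)]
cat, dog = [1.0, 0.0], [0.0, 1.0]

mixed_up, mixup_label = mixup(img_a, img_b, cat, dog, lam=0.7)
mixed_cut, cutmix_label = cutmix(img_a, img_b, cat, dog, box=(0, 0, 2, 2))
smoothed = smooth_labels(cat)
```

In the CutMix call, a 2×2 box covers a quarter of the 4×4 image, so the blended label gives 0.75 weight to the first class and 0.25 to the second.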
Architecture: DeiT = ViT + Distillation Token
DeiT-Base is architecturally identical to ViT-Base (d_model=768, 12 heads, 12 layers, 16×16 patches) with one addition: the distillation token.
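As a quick sanity check on what “one addition” means for the input, the token count can be worked out directly. The sketch below assumes a 224×224 input, the usual ImageNet resolution, which is not stated above; the function name is hypothetical.

```python
def deit_sequence_length(image_size=224, patch_size=16, distilled=True):
    """Number of tokens fed to the Transformer encoder."""
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    special_tokens = 2 if distilled else 1        # [CLS] plus distillation token
    return num_patches + special_tokens

vit_tokens = deit_sequence_length(distilled=False)
deit_tokens = deit_sequence_length(distilled=True)
```

Under these assumptions a plain ViT-Base sees 197 tokens and a distilled DeiT-Base sees 198, each of width 768, so the extra token adds very little compute.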
At inference time, either head can classify on its own. DeiT’s reference setup fuses the two by adding the softmax outputs of the [CLS] and distillation heads, which gives a small gain over using the [CLS] head alone.
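Combining the two heads at inference can be sketched as below. This is an illustrative, framework-free version: `predict` and its `fuse` flag are hypothetical names, and averaging the two softmax outputs is used here, which yields the same argmax as summing them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(cls_logits, dist_logits, fuse=True):
    """Return the predicted class index from the [CLS] head alone or from both heads."""
    probs = softmax(cls_logits)
    if fuse:
        dist_probs = softmax(dist_logits)
        probs = [(a + b) / 2 for a, b in zip(probs, dist_probs)]
    return max(range(len(probs)), key=lambda i: probs[i])

# The [CLS] head narrowly prefers class 0; the distillation head strongly prefers class 1
cls_logits = [2.0, 1.9, 0.0]
dist_logits = [0.0, 3.0, 0.0]

cls_only = predict(cls_logits, dist_logits, fuse=False)
fused = predict(cls_logits, dist_logits, fuse=True)
```

In this toy case the [CLS] head alone picks class 0, while the fused prediction picks class 1 because the distillation head is far more confident.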
Three sizes:
- DeiT-Ti (5M parameters): 72.2% top-1
- DeiT-S (22M): 79.8% top-1
- DeiT-B (86M): 81.8% top-1
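These figures are for models trained without distillation; with the distillation token and a RegNetY-16GF teacher, the paper reports 74.5%, 81.2%, and 83.4% respectively. For reference, ViT-B/16 trained on ImageNet-1k alone, without JFT pre-training, reaches 77.9% top-1 (evaluated at 384×384 resolution).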
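The parameter counts listed above can be roughly reproduced from the architecture. The sketch below assumes a standard ViT layout (biased linear layers, MLP ratio 4, 224×224 RGB input, a 1000-class head, no distillation token) and embedding widths of 192, 384, and 768 for Ti, S, and B; the widths for Ti and S come from the DeiT paper and are not stated in this post. The function name is hypothetical.

```python
def vit_param_count(dim, depth=12, patch_size=16, image_size=224,
                    in_channels=3, num_classes=1000, mlp_ratio=4):
    """Count learnable parameters of a plain ViT encoder with one linear head."""
    hidden = mlp_ratio * dim
    # Self-attention: fused QKV projection plus output projection (weights and biases)
    attention = (3 * dim * dim + 3 * dim) + (dim * dim + dim)
    # MLP: two linear layers, dim -> hidden -> dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    # Two LayerNorms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * dim)
    per_block = attention + mlp + layer_norms

    patch_embed = in_channels * patch_size * patch_size * dim + dim
    num_patches = (image_size // patch_size) ** 2
    pos_embed = (num_patches + 1) * dim   # patches plus [CLS]
    cls_token = dim
    final_norm = 2 * dim
    head = dim * num_classes + num_classes
    return (depth * per_block + patch_embed + pos_embed
            + cls_token + final_norm + head)

counts = {name: vit_param_count(dim)
          for name, dim in [("DeiT-Ti", 192), ("DeiT-S", 384), ("DeiT-B", 768)]}
```

Under these assumptions the totals come to about 5.7M, 22.1M, and 86.6M, in line with the rounded figures above. The distilled variants add only one extra token, one extra position embedding, and a second linear head.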
Why a ConvNet Teacher?
DeiT intentionally uses a ConvNet (RegNetY) as the teacher rather than another ViT. The hypothesis: CNNs have locality inductive biases baked in that ViTs lack. Distilling from a CNN transfers these biases to the ViT student — teaching it to pay attention to local patterns it would otherwise learn only with much more data.
This is supported by empirical results in the DeiT paper: students distilled from RegNetY teachers outperform a student distilled from a DeiT-B teacher of comparable accuracy.
Legacy and Influence
DeiT established the standard training recipe for ViTs on moderate-scale data, and later work built on it:
- DeiT-III (2022) refined the recipe further, dropping distillation in favour of a simpler 3-Augment strategy (grayscale, solarisation, Gaussian blur), LayerScale, and strong regularisation
- Many subsequent ViT papers (BEiT, MAE, DINO) build on DeiT-style training recipes or use DeiT models as baselines
- The distillation token idea influenced multimodal models where tokens from different modalities train independently
Summary
| Component | Role |
|---|---|
| [CLS] token | Learns from ground truth labels |
| Distillation token | Learns from teacher (ConvNet) predictions |
| RandAugment + Mixup + CutMix | Regularisation without scale |
| RegNet teacher | Transfers CNN inductive biases |
DeiT showed that ViTs do not need private 300M-image datasets — they need the right training strategy. It made ViT accessible to the research community and established data augmentation + distillation as the standard recipe for data-efficient Transformer training in vision.
