MAE: Masked Autoencoders Are Scalable Vision Learners
The Self-Supervised Question
Can a model learn powerful visual representations without any labels?
For language, BERT answered yes — mask some tokens, predict them, and the model learns deep representations of language. For images, analogous methods existed (BEiT, iGPT) but were slow, complex, or required tokenisers.
MAE (Masked Autoencoders Are Scalable Vision Learners) found a clean, scalable answer.
The MAE Framework
The Core Idea
- Take an image and split it into non-overlapping patches (e.g., 16×16 pixels each, which gives 14×14 = 196 patches for a 224×224 image)
- Randomly mask 75% of patches — remove them from the sequence
- Run the encoder only on the 25% visible patches (much shorter sequence!)
- Use a lightweight decoder to reconstruct the pixel values of all 196 patches
- Compute pixel-space reconstruction loss only on the masked patches
Input image (196 patches)
↓ mask 75%
Visible patches (49 tokens)
↓ heavy ViT encoder
Encoded visible tokens (49)
↓ add mask tokens, positional info
Full sequence (196 tokens)
↓ lightweight ViT decoder
Reconstructed pixels (196 patches)
↓ loss on masked patches only
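The pipeline above starts from a sequence of patches. A minimal NumPy sketch of that first step follows, assuming a 224×224 RGB image and 16×16-pixel patches; the function name `patchify` and the channels-last layout are illustrative choices, not taken from the original text.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)

# Assumed input size: 224 x 224 RGB, which yields the 196 patches used above.
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image)
print(patches.shape)  # (196, 768)
```

Each row is one patch flattened to 16 × 16 × 3 = 768 values, which is also the shape of the pixel target the decoder is asked to predict.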
Why 75% Masking?
This is much higher than BERT’s 15% token masking. The reason: images are heavily spatially redundant. With 15% masking, a model can reconstruct masked patches by simple interpolation of neighbours — no semantic understanding needed.
At 75% masking, reconstruction requires genuine understanding of image structure, textures, and object semantics. The task becomes hard enough to be informative.
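To make the masking step concrete, here is a small sketch of uniform random masking at a configurable ratio. It shuffles patch indices by sorting random noise and keeps the first 25%. The helper name `random_masking` and its return values are assumptions made for illustration.

```python
import numpy as np

def random_masking(num_patches: int, mask_ratio: float, rng: np.random.Generator):
    """Pick a random subset of patches to keep.

    Returns (keep_idx, mask), where keep_idx holds the positions of the
    visible patches and mask[i] is True if patch i is masked.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    # Shuffle patch indices by sorting random noise, then keep the first num_keep.
    noise = rng.random(num_patches)
    shuffle = np.argsort(noise)
    # Sorting is only for readability; the order of visible tokens does not
    # matter once positional embeddings have been added.
    keep_idx = np.sort(shuffle[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

rng = np.random.default_rng(0)
keep_idx, mask = random_masking(196, 0.75, rng)
print(len(keep_idx), int(mask.sum()))  # 49 147
```

Only the 49 patches at `keep_idx` would be passed to the encoder; the boolean `mask` is kept for the loss.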
The Asymmetric Encoder-Decoder Design
The key efficiency insight: the encoder only sees visible patches.
| Component | Input | Architecture | Size |
|---|---|---|---|
| Encoder | 25% visible patches | Large ViT (e.g., ViT-Large) | ~307M params |
| Decoder | All 196 positions | Small Transformer | ~8-16M params |
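The encoder processes only 49 tokens (at 75% masking) instead of 196: a 4× reduction in sequence length. Self-attention cost is quadratic in sequence length, so the attention computation shrinks by about 16×, while the per-token linear layers shrink by 4×. The MAE paper reports a pre-training speedup of roughly 3× or more compared with running the same encoder over the full sequence including mask tokens.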
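The token-count arithmetic can be checked directly. The sketch below counts query-key pairs as a stand-in for the quadratic attention term; it is a back-of-the-envelope check under that assumption, not a FLOP count for a real ViT.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs in one self-attention layer (grows as n^2)."""
    return num_tokens * num_tokens

full_tokens = 196                               # 14 x 14 patches
visible_tokens = int(full_tokens * (1 - 0.75))  # 49 at 75% masking

length_ratio = full_tokens / visible_tokens
attention_ratio = attention_pair_count(full_tokens) / attention_pair_count(visible_tokens)

print(length_ratio)     # 4.0
print(attention_ratio)  # 16.0
```

The projection and MLP layers scale only linearly with the number of tokens, so the saving for the encoder as a whole lies between these two ratios.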
At each masked position, the decoder receives a shared [MASK] token plus positional embedding. The decoder’s job is reconstruction; the encoder’s representations are what actually transfer to downstream tasks.
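A rough sketch of how the decoder input could be assembled is shown below: encoder outputs are scattered back to their original positions, every other position gets the same mask-token vector, and positional embeddings are added. The function name, the tiny dimensions, and the omission of any projection between encoder and decoder widths are simplifying assumptions.

```python
import numpy as np

def build_decoder_input(encoded, keep_idx, num_patches, mask_token, pos_embed):
    """Restore the full-length sequence for the decoder.

    encoded:    (num_visible, dim) encoder outputs for the visible patches
    keep_idx:   (num_visible,) original positions of those patches
    mask_token: (dim,) one vector shared by every masked position
    pos_embed:  (num_patches, dim) decoder positional embeddings
    """
    # Start with the shared mask token at every position...
    full = np.tile(mask_token, (num_patches, 1))
    # ...then overwrite the visible positions with the encoder outputs.
    full[keep_idx] = encoded
    # Positions are what let the decoder tell one mask token from another.
    return full + pos_embed

# Toy sizes chosen for illustration: 12 positions, 3 visible, width 8.
dim = 8
keep_idx = np.array([0, 5, 10])
encoded = np.ones((3, dim))
mask_token = np.zeros(dim)
pos_embed = np.zeros((12, dim))
dec_in = build_decoder_input(encoded, keep_idx, 12, mask_token, pos_embed)
print(dec_in.shape)  # (12, 8)
```

In a trained model the mask token and positional embeddings would be learned or fixed parameters rather than zeros; zeros are used here only to keep the example easy to inspect.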
Reconstruction Target: Raw Pixels
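MAE reconstructs raw normalised pixel values (not dVAE tokens like BEiT, not CLIP features). This is simpler and, in the paper's ablations, works at least as well as token targets. The loss is the mean squared error over the masked patches:

$$\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2$$

Here x̂ᵢ is the decoder’s pixel prediction for masked patch i, xᵢ is the normalised ground-truth pixel values for that patch, and M is the set of masked patches. The loss is computed only over masked patches (not visible ones, which the encoder has direct access to).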
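The masked-patch loss described here can be sketched in a few lines of NumPy. This version assumes that "normalised" means each target patch is standardised by its own mean and variance; the flag name `norm_pix` and the `eps` value are illustrative.

```python
import numpy as np

def mae_loss(pred, target, mask, norm_pix=True, eps=1e-6):
    """Mean squared error averaged over masked patches only.

    pred, target: (num_patches, patch_dim) arrays of pixel values
    mask:         (num_patches,) bool array, True where the patch was masked
    """
    if norm_pix:
        # Assumption: normalise each target patch by its own statistics.
        mean = target.mean(axis=-1, keepdims=True)
        var = target.var(axis=-1, keepdims=True)
        target = (target - mean) / np.sqrt(var + eps)
    # Per-patch squared error, averaged over the pixels in the patch.
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    # Visible patches are excluded: only masked positions contribute.
    return per_patch[mask].mean()

rng = np.random.default_rng(0)
target = rng.random((196, 768))
pred = np.zeros((196, 768))
mask = np.zeros(196, dtype=bool)
mask[:147] = True  # pretend the first 147 patches were masked
loss = mae_loss(pred, target, mask)
```

Errors on visible patches have no effect on the result, which matches the description above: the model is graded only on what it could not see.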
What Does MAE Learn?
MAE-pretrained encoders learn rich structural representations of images. Visualising the reconstructions:
- At 25% masking: near-perfect reconstruction (task too easy)
- At 75% masking: plausible but slightly blurry reconstruction — the model fills in missing structure coherently, hallucinating textures consistent with the visible context
The representations capture:
- Semantic categories (strong fine-tuning accuracy; linear probing lags behind contrastive methods such as MoCo v3)
- Spatial structure (good for detection and segmentation)
- Textures and local patterns (captured in early layers)
Downstream Performance
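MAE ViT-Large (pre-trained on ImageNet-1k, no labels), with figures as reported in the MAE paper:
| Task | MAE ViT-L | Supervised ViT-L |
|---|---|---|
| ImageNet top-1 (fine-tune) | 85.9% | 82.5% (trained from scratch) |
| COCO object detection (Mask R-CNN) | 53.3 AP (box) | 49.3 AP (box) |
| ADE20K segmentation (UperNet) | 53.6 mIoU | 49.9 mIoU |
The 87.8% top-1 figure often quoted for MAE comes from the larger ViT-Huge fine-tuned at 448-pixel resolution, not from ViT-L.
On each of these benchmarks, self-supervised MAE pre-training outperforms the supervised baseline.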
MAE vs Other Self-Supervised Methods
| Method | Objective | Target | Masking ratio |
|---|---|---|---|
| BERT | Predict masked tokens | Token ID | 15% |
| BEiT | Predict masked tokens | dVAE tokens | ~40% |
| SimCLR | Contrastive (augmented views) | None | — (no masking) |
| MAE | Reconstruct pixels | Raw pixels | 75% |
| DINO | Self-distillation | None | — |
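Of these methods, MAE has the simplest reconstruction objective: plain pixel regression with no tokeniser, no paired augmented views, and no teacher network. Its power comes from scale, the asymmetric design, and the high masking ratio.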
Influence
MAE inspired a wave of masked autoencoder models:
- VideoMAE: extend to video, mask 90% of spatiotemporal patches
- AudioMAE: apply to spectrogram patches
- Point-MAE: extend to 3D point clouds
- MAR (Masked Autoregressive): generation variant
The core idea — mask aggressively, reconstruct efficiently, learn representations — has become a standard pre-training paradigm across modalities.
Summary
| Property | Value |
|---|---|
| Masking ratio | 75% (much higher than BERT’s 15%) |
| Encoder input | Only visible patches (efficiency) |
| Decoder input | All positions (visible + mask tokens) |
| Reconstruction target | Raw pixels (normalised) |
| Training speed | ~3× or more faster than encoding all patches with mask tokens |
| Result | Beats supervised pre-training at scale |
MAE is self-supervised learning at its most elegant: a simple objective, an efficient design, and representations that outperform their supervised counterparts.
