MAE: Masked Autoencoders Are Scalable Vision Learners

TL;DR: MAE (He et al., Facebook AI Research, 2021) pre-trains a ViT by masking 75% of patches and reconstructing their pixel values. An asymmetric design runs the heavy encoder only on visible patches, making pre-training roughly 3× faster than passing the full sequence, mask tokens included, through the encoder. The learned representations transfer better than supervised ImageNet weights on most downstream tasks.

The Self-Supervised Question

Can a model learn powerful visual representations without any labels?

For language, BERT answered yes — mask some tokens, predict them, and the model learns deep representations of language. For images, analogous methods existed (BEiT, iGPT) but were slow, complex, or required tokenisers.

MAE (Masked Autoencoders Are Scalable Vision Learners) found a clean, scalable answer.

The MAE Framework

The Core Idea

  1. Take an image and split it into non-overlapping patches (e.g., 16×16 pixels, giving 14×14 = 196 patches for a 224×224 image)
  2. Randomly mask 75% of the patches by removing them from the sequence
  3. Run the encoder only on the 25% visible patches (a much shorter sequence)
  4. Use a lightweight decoder to reconstruct the pixel values of all 196 patches
  5. Compute a pixel-space reconstruction loss only on the masked patches

Input image (196 patches)
       ↓  mask 75%
Visible patches (49 tokens)
       ↓  heavy ViT encoder
Encoded visible tokens (49)
       ↓  add mask tokens, positional info
Full sequence (196 tokens)
       ↓  lightweight ViT decoder
Reconstructed pixels (196 patches)
       ↓  loss on masked patches only
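
The masking step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name random_masking and its seed argument are assumptions made for this example, and the indices are sorted only to make the output easier to read.

```python
import random

def random_masking(num_patches, mask_ratio, seed=None):
    """Split patch indices into visible and masked sets (illustrative helper).

    Returns (ids_keep, ids_masked), each sorted in ascending order.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    ids = list(range(num_patches))
    rng.shuffle(ids)                      # uniform random permutation
    ids_keep = sorted(ids[:num_keep])     # patches the encoder will see
    ids_masked = sorted(ids[num_keep:])   # patches the decoder must reconstruct
    return ids_keep, ids_masked

# 224x224 image with 16x16-pixel patches -> 14 x 14 = 196 patches
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2

ids_keep, ids_masked = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(num_patches, len(ids_keep), len(ids_masked))  # 196 49 147
```

Only the 49 patches listed in ids_keep would be embedded and passed to the encoder; the other 147 never enter it.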

Why 75% Masking?

This is much higher than BERT’s 15% token masking. The reason: images are heavily spatially redundant. With 15% masking, a model can reconstruct masked patches by simple interpolation of neighbours — no semantic understanding needed.

At 75% masking, reconstruction requires genuine understanding of image structure, textures, and object semantics. The task becomes hard enough to be informative.

Images vs text redundancy: In text, each word carries unique semantic content — masking 75% would be gibberish. In images, adjacent patches are highly correlated (pixels vary smoothly). Aggressive masking forces the model to "understand" rather than "interpolate".

The Asymmetric Encoder-Decoder Design

The key efficiency insight: the encoder only sees visible patches.

| Component | Input | Architecture | Size |
|-----------|-------|--------------|------|
| Encoder | 25% visible patches | Large ViT (e.g., ViT-Large) | ~307M params |
| Decoder | All 196 positions | Small Transformer | ~8-16M params |

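The encoder processes only 49 tokens (at 75% masking) instead of 196, a 4× reduction in sequence length. Self-attention cost is quadratic in sequence length, so it drops by 16×, while the per-token projection and MLP layers drop by 4×. Because the decoder still runs over all 196 positions, the end-to-end saving is smaller: the paper reports a pre-training speedup of roughly 3× or more compared with passing the full sequence, mask tokens included, through the same encoder.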
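
The scaling argument can be checked with simple arithmetic. The sketch below compares only the two leading cost terms of a Transformer layer and ignores constants, so it is a rough guide, not a FLOP count. The helper name relative_encoder_cost is made up for this example.

```python
def relative_encoder_cost(full_len, visible_len):
    """Return (attention_ratio, linear_ratio): cost of the full sequence
    relative to the visible-only sequence, for the two leading terms."""
    attention_ratio = (full_len ** 2) / (visible_len ** 2)  # O(n^2) attention scores
    linear_ratio = full_len / visible_len                   # O(n) projections and MLP
    return attention_ratio, linear_ratio

attn, linear = relative_encoder_cost(full_len=196, visible_len=49)
print(attn, linear)  # 16.0 4.0
```

The encoder saving therefore lies somewhere between 4× and 16× per layer. Since the decoder still processes all 196 positions, the overall pre-training speedup is lower than either figure.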

At each masked position, the decoder receives a shared [MASK] token plus positional embedding. The decoder’s job is reconstruction; the encoder’s representations are what actually transfer to downstream tasks.
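
As a rough sketch of how the decoder input could be assembled: encoded tokens go back to their original positions and every other slot receives the same shared mask token. The helper build_decoder_input and the one-dimensional toy embeddings are assumptions for illustration. In a real model the mask token is a learned vector, and positional embeddings are added to every position afterwards.

```python
def build_decoder_input(encoded_visible, ids_keep, num_patches, mask_token):
    """Place encoded tokens at their original positions and fill
    every remaining position with the shared mask token."""
    full = [mask_token] * num_patches
    for token, idx in zip(encoded_visible, ids_keep):
        full[idx] = token
    return full

# Toy example: 8 patches, 2 visible, 1-dimensional "embeddings"
mask_token = [0.0]                  # stands in for a learned vector
ids_keep = [1, 6]                   # original positions of the visible patches
encoded_visible = [[0.7], [-0.3]]   # pretend encoder outputs
decoder_in = build_decoder_input(encoded_visible, ids_keep, 8, mask_token)
print(decoder_in)
# [[0.0], [0.7], [0.0], [0.0], [0.0], [0.0], [-0.3], [0.0]]
```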

Reconstruction Target: Raw Pixels

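MAE reconstructs pixel values, normalised per patch (not dVAE tokens as in BEiT, and not CLIP features). This removes the need for a separate tokeniser, and in the paper’s ablations normalised pixels match or slightly exceed dVAE tokens as a target: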

L = (1/M) Σ_{masked i} || x̂ᵢ − xᵢ ||²

Here x̂ᵢ is the decoder’s pixel prediction for masked patch i, xᵢ is the normalised ground-truth pixel vector for that patch, and M is the number of masked patches. The loss is computed only over masked patches; visible patches, which the encoder sees directly, contribute nothing.
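
A minimal sketch of this loss in plain Python, assuming the targets are already normalised. The function name masked_mse is illustrative. It averages the squared error over the pixels within each patch, so it differs from the squared-norm form above only by a constant factor (the patch dimension).

```python
def masked_mse(pred, target, ids_masked):
    """Mean over masked patches of the per-patch mean squared error."""
    total = 0.0
    for i in ids_masked:
        sq_err = [(p - t) ** 2 for p, t in zip(pred[i], target[i])]
        total += sum(sq_err) / len(sq_err)
    return total / len(ids_masked)

# Toy example: 4 patches of 2 pixel values each; patches 1 and 3 are masked
target = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.0, 1.0]]
pred   = [[9.0, 9.0], [1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
loss = masked_mse(pred, target, ids_masked=[1, 3])
print(loss)  # 0.25
```

The large errors on patches 0 and 2 do not affect the result, because those patches are visible and excluded from the loss.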

What Does MAE Learn?

MAE-pretrained encoders learn rich structural representations of images. Visualising the reconstructions:

  • At 25% masking: near-perfect reconstruction (task too easy)
  • At 75% masking: plausible but slightly blurry reconstruction — the model fills in missing structure coherently, hallucinating textures consistent with the visible context

The representations capture:

  • Semantic categories (strong fine-tuning accuracy; linear probing lags behind contrastive methods, because the features are less linearly separable)
  • Spatial structure (good for detection and segmentation)
  • Textures and local patterns (captured in early layers)

Downstream Performance

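MAE ViT-Large (pre-trained on ImageNet-1k, no labels) compared with supervised training of the same architecture, as reported in the paper:

| Task | MAE ViT-L | Supervised ViT-L |
|------|-----------|------------------|
| ImageNet-1k top-1 (fine-tune) | 85.9% | 82.5% |
| COCO object detection | 53.3 APᵇ | 49.3 APᵇ |
| ADE20K segmentation | 53.6 mIoU | 49.9 mIoU |

The 87.8% ImageNet figure often quoted for MAE belongs to ViT-Huge fine-tuned at 448×448 resolution, not to ViT-Large.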

Self-supervised MAE pre-training beats its supervised counterpart on all three benchmarks, despite using no labels during pre-training.

MAE vs Other Self-Supervised Methods

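| Method | Objective | Target | Masking ratio |
|--------|-----------|--------|---------------|
| BERT | Predict masked tokens | Token ID | 15% |
| BEiT | Predict masked tokens | dVAE tokens | ~40% |
| SimCLR | Contrastive (augmented views) | None | (no masking) |
| MAE | Reconstruct pixels | Raw pixels | 75% |
| DINO | Self-distillation | None | |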

Of the masked-prediction methods above, MAE has the simplest target: raw pixels, with no tokeniser. Its power comes from scale, the asymmetric design, and the high masking ratio.

Influence

MAE inspired a wave of masked autoencoder models:

  • VideoMAE: extend to video, mask 90% of spatiotemporal patches
  • AudioMAE: apply to spectrogram patches
  • Point-MAE: extend to 3D point clouds
  • MAR (Masked Autoregressive): generation variant

The core idea — mask aggressively, reconstruct efficiently, learn representations — has become a standard pre-training paradigm across modalities.

Summary

| Property | Value |
|----------|-------|
| Masking ratio | 75% (much higher than BERT’s 15%) |
| Encoder input | Only visible patches (efficiency) |
| Decoder input | All positions (visible + mask tokens) |
| Reconstruction target | Raw pixels (normalised) |
| Training speed | ~3× faster than encoding all tokens, mask tokens included |
| Result | Beats supervised pre-training at scale |

MAE is self-supervised learning at its most elegant: a simple objective, an efficient design, and representations that outperform their supervised counterparts.