MAE: Masked Autoencoders Are Scalable Vision Learners

Published: March 16, 2024

TL;DR: MAE (He et al., Facebook AI Research, 2021) pre-trains a ViT by masking 75% of image patches and reconstructing their pixel values. An asymmetric design runs the heavy encoder only on the visible patches, which the paper reports speeds up pre-training by 3× or more compared with feeding the full sequence through the encoder. The learned representations transfer better than supervised ImageNet pre-training on the downstream tasks the paper evaluates.

The Self-Supervised Question

Can a model learn powerful visual representations without any labels?

For language, BERT answered yes — mask some tokens, predict them, and the model learns deep representations of language. For images, analogous methods existed (BEiT, iGPT) but were slow, complex, or required tokenisers.

MAE (Masked Autoencoders Are Scalable Vision Learners) found a clean, scalable answer.

The MAE Framework

The Core Idea

Take an image and split it into patches (e.g., 16×16 pixels each, which gives 14×14 = 196 patches for a 224×224 input)
Randomly mask 75% of patches — remove them from the sequence
Run the encoder only on the 25% visible patches (much shorter sequence!)
Use a lightweight decoder to reconstruct the pixel values of all 196 patches
Compute pixel-space reconstruction loss only on the masked patches

Input image (196 patches)
       ↓  mask 75%
Visible patches (49 tokens)
       ↓  heavy ViT encoder
Encoded visible tokens (49)
       ↓  add mask tokens, positional info
Full sequence (196 tokens)
       ↓  lightweight ViT decoder
Reconstructed pixels (196 patches)
       ↓  loss on masked patches only
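
The masking step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `random_masking` and the `seed` argument are made up for this example, and it works on patch indices only, not on actual pixel data.

```python
import random

def random_masking(num_patches, mask_ratio, seed=None):
    """Return (visible_ids, masked_ids) for one image.

    Shuffles the patch indices, keeps the first (1 - mask_ratio) share as
    visible and treats the rest as masked. Both lists are returned sorted.
    """
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible_ids = sorted(ids[:num_keep])
    masked_ids = sorted(ids[num_keep:])
    return visible_ids, masked_ids

# A 224x224 image cut into 16x16-pixel patches gives a 14x14 grid.
num_patches = (224 // 16) ** 2  # 196
visible_ids, masked_ids = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(len(visible_ids), len(masked_ids))  # 49 147
```

Only the 49 visible indices would be gathered and passed to the encoder; the 147 masked indices are needed again later, when the decoder input is assembled and the loss is computed.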

Why 75% Masking?

This is much higher than BERT’s 15% token masking. The reason: images are heavily spatially redundant. With 15% masking, a model can reconstruct masked patches by simple interpolation of neighbours — no semantic understanding needed.

At 75% masking, reconstruction requires genuine understanding of image structure, textures, and object semantics. The task becomes hard enough to be informative.

Images vs text redundancy: In text, each word carries unique semantic content — masking 75% would be gibberish. In images, adjacent patches are highly correlated (pixels vary smoothly). Aggressive masking forces the model to "understand" rather than "interpolate".

The Asymmetric Encoder-Decoder Design

The key efficiency insight: the encoder only sees visible patches.

Component	Input	Architecture	Size
Encoder	25% visible patches	Large ViT (e.g., ViT-Large)	~307M params
Decoder	All 196 positions	Small Transformer (paper default: 8 blocks, width 512)	Under 10% of the encoder's compute per token

The encoder processes only 49 tokens (at 75% masking) instead of 196. That 4× reduction in sequence length cuts the quadratic self-attention cost by about 16× and the linear per-token cost (MLPs, projections) by 4×. Overall, the paper reports a pre-training speedup of 3× or more relative to running the same encoder on the full sequence with mask tokens.
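
The token arithmetic is easy to check. The snippet below is a back-of-the-envelope sketch: `attention_pairs` is a hypothetical helper that counts query-key pairs in one self-attention layer, and it ignores everything else that contributes to real wall-clock time.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in one self-attention layer (tokens squared)."""
    return num_tokens ** 2

full_tokens = 196
visible_tokens = int(full_tokens * 0.25)  # 49 tokens at 75% masking

length_ratio = full_tokens / visible_tokens
attention_ratio = attention_pairs(full_tokens) / attention_pairs(visible_tokens)
print(length_ratio, attention_ratio)  # 4.0 16.0
```

The sequence is 4× shorter, so the pairwise attention term shrinks by 4² = 16×. The measured end-to-end speedup is smaller than that because the per-token layers scale only linearly and the decoder still runs on all 196 positions.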

At each masked position, the decoder receives a shared [MASK] token plus positional embedding. The decoder’s job is reconstruction; the encoder’s representations are what actually transfer to downstream tasks.
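
Assembling the decoder input amounts to putting each encoded token back at its original position and filling every other slot with the same mask token. The sketch below uses strings in place of embedding vectors so the result is easy to read; `build_decoder_input` and the toy values are illustrative names, not part of any MAE codebase, and the positional embeddings that would be added afterwards are left out.

```python
def build_decoder_input(encoded_visible, visible_ids, num_patches, mask_token):
    """Place encoded tokens at their original positions; fill the rest with mask_token."""
    sequence = [mask_token] * num_patches
    for token, position in zip(encoded_visible, visible_ids):
        sequence[position] = token
    return sequence

# Toy example: 8 patches, of which patches 2 and 5 were visible to the encoder.
MASK = "[MASK]"  # stands in for the single shared, learned mask vector
toy_encoded = ["enc_2", "enc_5"]
decoder_input = build_decoder_input(toy_encoded, visible_ids=[2, 5],
                                    num_patches=8, mask_token=MASK)
print(decoder_input)
# ['[MASK]', '[MASK]', 'enc_2', '[MASK]', '[MASK]', 'enc_5', '[MASK]', '[MASK]']
```

Because every masked slot holds the identical token, the positional embedding is the only thing that tells the decoder which patch it is being asked to reconstruct.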

Reconstruction Target: Raw Pixels

MAE reconstructs raw normalised pixel values (not dVAE tokens like BEiT, not CLIP features). This is simpler, needs no separately trained tokeniser, and in the paper's ablations performs at least as well as token targets:

L = (1/M) Σ_{masked i} || x̂ᵢ − xᵢ ||²

Where M is the number of masked patches, x̂ᵢ is the decoder’s pixel prediction for masked patch i, and xᵢ is the normalised ground-truth pixel values of that patch. The loss is computed only over masked patches (not the visible ones, which the encoder sees directly).
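
A small sketch of this loss in plain Python follows. It makes two assumptions not spelled out above: "normalised" is taken to mean per-patch normalisation (subtract the patch mean, divide by the patch standard deviation, here the population variant with a small `eps`), and each patch is represented as a flat list of pixel values. The function names are hypothetical.

```python
import math

def normalise_patch(patch, eps=1e-6):
    """Per-patch normalisation (assumed): subtract the mean, divide by the std."""
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return [(p - mean) / math.sqrt(var + eps) for p in patch]

def masked_mse(pred, target, masked_ids):
    """(1/M) * sum over masked patches of the squared L2 error, as in the formula."""
    total = 0.0
    for i in masked_ids:
        total += sum((p - t) ** 2 for p, t in zip(pred[i], target[i]))
    return total / len(masked_ids)

# Toy data: 3 patches of 2 pixels each; patch 0 is visible, patches 1 and 2 are masked.
target = [[0, 0], [1, 3], [2, 2]]
pred = [[9, 9], [1, 1], [2, 4]]
loss = masked_mse(pred, target, masked_ids=[1, 2])
print(loss)  # 4.0
```

The large error on visible patch 0 does not affect the result, since only the masked patches enter the sum. In practice the targets would first be passed through something like `normalise_patch`, and implementations often average over the pixels within a patch instead of summing, which changes the loss only by a constant factor.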

What Does MAE Learn?

MAE-pretrained encoders learn rich structural representations of images. Visualising the reconstructions:

At 25% masking: near-perfect reconstruction (task too easy)
At 75% masking: plausible but slightly blurry reconstruction — the model fills in missing structure coherently, hallucinating textures consistent with the visible context

The representations capture:

Semantic categories (strong fine-tuning accuracy; linear probing is weaker than for contrastive methods)
Spatial structure (good for detection and segmentation)
Textures and local patterns (captured in early layers)

Downstream Performance

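MAE ViT-Large (pre-trained on ImageNet-1k, no labels), as reported in the paper:

Task	MAE ViT-L	Supervised ViT-L
ImageNet-1k top-1 (fine-tune)	85.9%	82.5%
COCO object detection (Mask R-CNN)	53.3 APᵇ	49.3
ADE20K segmentation (UperNet)	53.6 mIoU	49.9

The 87.8% ImageNet figure often quoted for MAE is for the larger ViT-Huge at 448×448 resolution. At the ViT-Large scale, self-supervised MAE pre-training still beats supervised pre-training on all three benchmarks, which is a remarkable result.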

MAE vs Other Self-Supervised Methods

Method	Objective	Target	Masking ratio
BERT	Predict masked tokens	Token ID	15%
BEiT	Predict masked tokens	dVAE tokens	~40%
SimCLR	Contrastive (augmented views)	None (no masking)	—
MAE	Reconstruct pixels	Raw pixels	75%
DINO	Self-distillation	None	—

MAE is the simplest reconstruction objective. Its power comes from scale, the asymmetric design, and the high masking ratio.

Influence

MAE inspired a wave of masked autoencoder models:

VideoMAE: extend to video, mask 90% of spatiotemporal patches
AudioMAE: apply to spectrogram patches
Point-MAE: extend to 3D point clouds
MAR (Masked Autoregressive): generation variant

The core idea — mask aggressively, reconstruct efficiently, learn representations — has become a standard pre-training paradigm across modalities.

Summary

Property	Value
Masking ratio	75% (much higher than BERT’s 15%)
Encoder input	Only visible patches (efficiency)
Decoder input	All positions (visible + mask tokens)
Reconstruction target	Raw pixels (normalised)
Training speed	~3× or more faster than running the encoder on the full sequence
Result	Beats supervised pre-training at scale

MAE is self-supervised learning at its most elegant: a simple objective, an efficient design, and representations that outperform their supervised counterparts.

Alessio Borgi