Swin Transformer: Hierarchical Vision with Shifted Windows

Published: January 14, 2024

TL;DR: Swin Transformer (Liu et al., Microsoft, 2021) constrains self-attention to non-overlapping local windows of patches, then shifts those windows in alternating layers to allow connections across window boundaries. Its hierarchical feature maps (like a CNN's) make it well suited to detection and segmentation, not just classification.

ViT’s Limitation

ViT computes full self-attention across all patches: 196 for a 224×224 image split into 16×16 patches. This is O(n²) in the number of patches n, which is manageable at 224×224 but becomes prohibitive for high-resolution images (e.g., 1024×1024 for detection).
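To make the quadratic growth concrete, here is a small sketch that counts patches and query-key pairs for global attention. The 16×16 patch size is an assumption (it is what yields 196 patches at 224×224), and the function name is made up for illustration.

```python
def attention_pairs(image_size, patch_size=16):
    """Return (number of patches, number of query-key pairs) for global attention."""
    side = image_size // patch_size   # patches along one side of the image
    n = side * side                   # total patches (ignoring any class token)
    return n, n * n                   # every patch attends to every patch

n_small, pairs_small = attention_pairs(224)
n_large, pairs_large = attention_pairs(1024)

print(n_small, pairs_small)    # 196 38416
print(n_large, pairs_large)    # 4096 16777216
print(f"{pairs_large / pairs_small:.0f}x more attention entries")
```

Going from 224×224 to 1024×1024 multiplies the number of patches by about 21, but the number of attention entries by several hundred.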

CNNs build hierarchical feature maps: early layers capture fine details (high-resolution feature maps with few channels), later layers capture coarse semantics (low-resolution feature maps with many channels). ViT has no such hierarchy: it keeps a single, fixed resolution throughout.

Swin Transformer fixes both problems.

Two Key Ideas

1. Window-Based Attention (W-MSA)

Instead of attending over the whole image, Swin divides the patch grid into non-overlapping local windows of M×M patches (M=7 by default).
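As a rough illustration of the partitioning step, the sketch below splits a toy patch grid into non-overlapping windows using plain Python lists. The helper name `window_partition` and the 4×4 grid with M=2 are choices made for this example only; real implementations do this with tensor reshapes.

```python
def window_partition(grid, M):
    """Split an H x W grid (list of rows) into non-overlapping M x M windows.

    Returns the windows in row-major order; each window is a flat list of
    its M*M entries. Assumes H and W are multiples of M.
    """
    H, W = len(grid), len(grid[0])
    assert H % M == 0 and W % M == 0, "grid must tile evenly into windows"
    windows = []
    for top in range(0, H, M):
        for left in range(0, W, M):
            windows.append([grid[top + i][left + j]
                            for i in range(M) for j in range(M)])
    return windows

# Toy 4 x 4 patch grid holding patch indices 0..15, with M = 2.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(grid, M=2)
print(windows)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Each inner list is one window; self-attention would then be applied to each of them separately.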

Self-attention runs within each window independently. If the image has n patches and each window has M² patches, the attention cost drops from O(n²) to O(n·M²), which is linear in the number of patches because M is fixed.
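The Swin paper gives cost estimates for both variants on an h×w patch grid with C channels (softmax excluded). The sketch below simply evaluates them; the function names are illustrative, and C=96 and the grid sizes are example values.

```python
def msa_flops(h, w, C):
    """Global attention: 4*h*w*C^2 (projections) + 2*(h*w)^2*C (attention)."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M=7):
    """Window attention: same projections, but each patch attends to M*M patches."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

for side in (56, 112, 224):
    full = msa_flops(side, side, 96)
    win = wmsa_flops(side, side, 96)
    print(f"{side}x{side} patches: global {full:.3e}, windowed {win:.3e}, "
          f"ratio {full / win:.1f}")
```

Doubling the side of the grid multiplies the windowed cost by exactly 4 (the increase in patch count), while the global cost grows much faster.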

Figure 1: Layer L uses regular windows (no cross-window attention). Layer L+1 shifts the windows by (M/2, M/2), creating new windows that cross the original boundaries — enabling cross-window information flow.

2. Shifted Windows (SW-MSA)

Window-based attention is efficient but windows are isolated: a patch at the right edge of window 1 never interacts with its neighbour at the left edge of window 2.

The shift trick: alternate between regular and shifted window configurations every layer. Shifted windows cross the old boundaries, allowing information to flow across the grid.
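A minimal sketch of why the shift helps: assign each patch a window index, once for the regular layout and once with the window grid displaced by M/2. The helper `window_id` and the 8×8 grid with M=4 are assumptions for this example.

```python
def window_id(r, c, M, shift=0):
    """Window index of patch (r, c) when the window grid is displaced by `shift`."""
    return ((r - shift) // M, (c - shift) // M)

H = W = 8
M = 4
shift = M // 2

# Two horizontally adjacent patches on either side of a regular window boundary.
a, b = (3, 3), (3, 4)
print(window_id(*a, M), window_id(*b, M))                # (0, 0) (0, 1): separated
print(window_id(*a, M, shift), window_id(*b, M, shift))  # (0, 0) (0, 0): together

regular = {window_id(r, c, M) for r in range(H) for c in range(W)}
shifted = {window_id(r, c, M, shift) for r in range(H) for c in range(W)}
print(len(regular), len(shifted))                        # 4 9
```

Note that the shifted layout has more windows (9 instead of 4), and the ones at the border are smaller than M×M.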

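Shifting leaves smaller, partial windows at the edges of the grid, which would increase the number of windows to process. Swin avoids this with a cyclic shift: the feature map is rolled so that the partial windows are packed into full M×M windows, and an attention mask stops patches that are not neighbours in the original image from attending to each other. The shift is reversed after attention.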
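One way to build such a mask, sketched here in plain Python in the spirit of the region-labelling approach used in public Swin code: label each position of the rolled grid by the region it belongs to, then allow attention only between positions with the same label. The function names, the value of `NEG`, and the toy sizes are assumptions for this example.

```python
NEG = -100.0  # large negative bias added to attention logits before softmax

def region_labels(H, W, M, shift):
    """Label each position of the rolled H x W grid by its region."""
    def band(i, size):
        # 0: rows/cols in windows untouched by the wrap-around
        # 1: non-wrapped part of the last window
        # 2: part that wrapped around from the opposite border
        if i < size - M:
            return 0
        return 1 if i < size - shift else 2
    return [[3 * band(r, H) + band(c, W) for c in range(W)] for r in range(H)]

def window_masks(labels, M):
    """One (M*M x M*M) additive mask per window: 0 if labels match, NEG otherwise."""
    H, W = len(labels), len(labels[0])
    masks = []
    for top in range(0, H, M):
        for left in range(0, W, M):
            flat = [labels[top + i][left + j] for i in range(M) for j in range(M)]
            masks.append([[0.0 if p == q else NEG for q in flat] for p in flat])
    return masks

# Toy case: 4 x 4 grid, M = 2, shift = M // 2 = 1.
masks = window_masks(region_labels(4, 4, M=2, shift=1), M=2)
for row in masks[0]:
    print(row)   # top-left window: no masking needed
for row in masks[3]:
    print(row)   # bottom-right window: each patch only sees itself
```

In this toy case the top-left window contains only genuinely adjacent patches, so its mask is all zeros, while the bottom-right window mixes four unrelated regions and is masked everywhere off the diagonal.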

Hierarchical Feature Maps

Between stages, a patch merging layer concatenates the features of each 2×2 group of neighbouring patches (giving 4d channels) and linearly projects them to 2d channels. This halves the spatial resolution in each dimension and doubles the channel width, mimicking CNN downsampling.
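A dependency-free sketch of patch merging follows: concatenate each 2×2 neighbourhood into a 4d-dimensional vector, then apply a linear projection down to 2d. The function name, the fixed weights, and the tiny sizes are hypothetical, and the normalisation that real implementations typically apply before the projection is left out.

```python
def patch_merge(x, weight):
    """x: H x W grid of d-dim feature lists; weight: (4d x 2d) projection matrix.

    Returns an (H/2) x (W/2) grid of 2d-dim features. Assumes H and W are even.
    """
    H, W = len(x), len(x[0])
    out = []
    for r in range(0, H, 2):
        row = []
        for c in range(0, W, 2):
            # Concatenate the 2 x 2 neighbourhood into one 4d-dim vector.
            cat = x[r][c] + x[r + 1][c] + x[r][c + 1] + x[r + 1][c + 1]
            # Linear projection 4d -> 2d (no bias in this sketch).
            row.append([sum(v * weight[k][j] for k, v in enumerate(cat))
                        for j in range(len(weight[0]))])
        out.append(row)
    return out

d = 3
x = [[[1.0] * d for _ in range(4)] for _ in range(4)]   # 4 x 4 grid, d channels
weight = [[0.5] * (2 * d) for _ in range(4 * d)]        # hypothetical fixed weights
y = patch_merge(x, weight)
print(len(y), len(y[0]), len(y[0][0]))                  # 2 2 6
```

The 4×4 grid with 3 channels becomes a 2×2 grid with 6 channels: half the resolution per side, twice the width.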

Stage	Spatial size	Channels (Swin-T, C = 96)
Input patches	H/4 × W/4	96
After Stage 1	H/4 × W/4	96
After Stage 2	H/8 × W/8	192
After Stage 3	H/16 × W/16	384
After Stage 4	H/32 × W/32	768
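The table can be reproduced for a concrete input size with a few lines of arithmetic. This is only a sketch: the function name is made up, and the defaults (4×4 patches, C=96, four stages) are taken from the table above.

```python
def stage_shapes(H, W, C=96, patch=4, num_stages=4):
    """Return [(height, width, channels)] of the feature map after each stage."""
    h, w, c = H // patch, W // patch, C
    shapes = [(h, w, c)]                  # stage 1 keeps the embedding resolution
    for _ in range(num_stages - 1):
        h, w, c = h // 2, w // 2, c * 2   # patch merging before stages 2..4
        shapes.append((h, w, c))
    return shapes

print(stage_shapes(224, 224))
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

For a 224×224 input, the final 7×7 map matches the default window size M=7, so the last stage attends over a single window.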

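These multi-scale features plug directly into standard detection frameworks built on feature pyramids (e.g., FPN-based detectors such as Mask R-CNN) and into segmentation decoders (e.g., UperNet), something the single-scale ViT cannot easily do.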

Where Swin Wins

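On release, Swin reported state-of-the-art results on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val). Its hierarchical design and local attention have made it a widely used Transformer backbone for dense prediction tasks.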

✅ Key Takeaways

Swin uses local window attention (O(n·M²)) instead of global attention (O(n²)) — linear in image size.
Shifted windows alternate with regular windows from layer to layer, allowing cross-window connections at little extra cost.
Hierarchical stages produce multi-scale features, making Swin compatible with detection and segmentation heads.
Set state-of-the-art results on COCO and ADE20K in 2021 and remains a widely used backbone for dense visual tasks.


Alessio Borgi
