Swin Transformer: Hierarchical Vision with Shifted Windows

TL;DR: Swin Transformer (Liu et al., Microsoft, 2021) restricts self-attention to non-overlapping local windows of patches, then shifts the window grid in alternating layers so that information can cross window boundaries. Its hierarchical, CNN-like feature maps make it well suited to detection and segmentation, not just classification.

ViT’s Limitation

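ViT computes full self-attention across all patches: 196 of them for a 224×224 image split into 16×16 patches. The cost is O(n²) in the number of patches n, which is manageable at 224×224 but becomes prohibitive for high-resolution inputs (e.g., 1024×1024 for detection, which yields 4,096 patches).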
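To make the quadratic growth concrete, here is a small sketch that counts patches and attention-matrix entries for a square image. The helper name `attention_matrix_entries` and the default 16×16 patch size are assumptions chosen for illustration.

```python
def attention_matrix_entries(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (number of patches n, entries in one n x n attention matrix)."""
    per_side = image_size // patch_size
    n = per_side * per_side
    return n, n * n


for size in (224, 1024):
    n, entries = attention_matrix_entries(size)
    print(f"{size}x{size}: {n} patches, {entries:,} attention entries per head")
```

Going from 224×224 to 1024×1024 multiplies the patch count by about 21 but the attention matrix by roughly 437.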

CNNs build hierarchical feature maps: early layers capture fine details at high spatial resolution with few channels, while later layers capture coarse semantics at low spatial resolution with many channels. ViT has no such hierarchy; it keeps a single resolution throughout.

Swin Transformer fixes both problems.

Two Key Ideas

1. Window-Based Attention (W-MSA)

Instead of attending over the whole image, Swin divides the patch grid into non-overlapping local windows of M×M patches (M=7 by default).
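The partitioning step can be sketched in plain Python. This is a minimal illustration rather than the reference implementation: `window_partition` is a hypothetical helper, and the grid holds patch ids instead of feature vectors.

```python
def window_partition(grid, M):
    """Split an H x W grid (list of rows) into non-overlapping M x M windows.

    Returns the windows in row-major order; each window is a flat list of
    the M*M grid entries it contains.
    """
    H, W = len(grid), len(grid[0])
    assert H % M == 0 and W % M == 0, "grid must be divisible by the window size"
    windows = []
    for top in range(0, H, M):
        for left in range(0, W, M):
            windows.append([grid[r][c]
                            for r in range(top, top + M)
                            for c in range(left, left + M)])
    return windows


# A 56 x 56 patch grid (a 224 x 224 image cut into 4 x 4 patches).
H = W = 56
grid = [[r * W + c for c in range(W)] for r in range(H)]
windows = window_partition(grid, M=7)
print(len(windows), len(windows[0]))  # 64 49
```

Each of the 64 windows would then run self-attention over its own 49 patches only.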

Self-attention runs within each window independently. If the image has n patches and each window holds M² patches, the attention cost drops from O(n²) to O(n·M²), which is linear in image size for a fixed M.
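A rough cost comparison, using the per-layer complexity formulas given in the Swin paper for an h×w patch grid with C channels: 4hwC² + 2(hw)²C for global attention and 4hwC² + 2M²hwC for window attention. The function names below are illustrative, and the counts ignore softmax and other minor terms.

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention over an h x w patch grid."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h, w, C, M=7):
    """Window attention with M x M windows over the same grid."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


for side in (56, 112, 224):
    g, l = msa_flops(side, side, 96), wmsa_flops(side, side, 96)
    print(f"{side}x{side} grid: global {g:.3e}, windowed {l:.3e}, ratio {g / l:.1f}x")
```

Doubling the side of the grid quadruples the windowed cost exactly, while the global cost grows by more than that because of its (hw)² term.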

[Figure 1 panels. Layer L (Window-MSA): self-attention runs inside each of Windows 1 to 4, with no attention across window boundaries. Layer L+1 (Shifted-Window-MSA): new windows cross the old boundaries, giving cross-window attention. Hierarchical stages: Stage 1 (H/4) → Stage 2 (H/8) → Stage 3 (H/16) → Stage 4 (H/32); patch merging doubles channels and halves spatial size, like a strided conv, and produces FPN-compatible features.]
Figure 1: Layer L uses regular windows (no cross-window attention). Layer L+1 shifts the windows by (M/2, M/2), creating new windows that cross the original boundaries — enabling cross-window information flow.

2. Shifted Windows (SW-MSA)

Window-based attention is efficient but windows are isolated: a patch at the right edge of window 1 never interacts with its neighbour at the left edge of window 2.

The shift trick: alternate between regular and shifted window configurations every layer. Shifted windows cross the old boundaries, allowing information to flow across the grid.
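A small sketch shows the effect on two neighbouring patches that sit on opposite sides of a window border. It assumes the shift is implemented by cyclically rolling the grid by M // 2 before partitioning; `window_id` is a hypothetical helper.

```python
def window_id(row, col, H, W, M, shift=0):
    """Window containing patch (row, col) once the grid is rolled up-left by `shift`."""
    r = (row - shift) % H
    c = (col - shift) % W
    return (r // M, c // M)


H = W = 14
M = 7
shift = M // 2  # 3

a, b = (3, 6), (3, 7)  # horizontal neighbours on either side of a window border

regular = (window_id(*a, H, W, M), window_id(*b, H, W, M))
shifted = (window_id(*a, H, W, M, shift), window_id(*b, H, W, M, shift))

print("regular:", regular)  # regular: ((0, 0), (0, 1))
print("shifted:", shifted)  # shifted: ((0, 0), (0, 0))
```

In the regular layer the two patches never attend to each other; in the shifted layer they share a window.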

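Shifting the grid leaves undersized windows along the image edges. Rather than padding them, Swin cyclically shifts the feature map so that the leftover pieces are packed into full M×M windows, then applies an attention mask so that patches that were not neighbours in the original image do not attend to each other. The number of windows, and therefore the cost, stays the same as in the regular configuration.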
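One way to build such a mask is sketched below in plain Python, assuming the roll-then-partition scheme. The helper names and the -100.0 fill value (a large negative number added to the attention logits before softmax) are illustrative assumptions, not a definitive implementation.

```python
def cyclic_shift(grid, shift):
    """Roll the grid up and left by `shift`, wrapping around the edges."""
    H, W = len(grid), len(grid[0])
    return [[grid[(r + shift) % H][(c + shift) % W] for c in range(W)]
            for r in range(H)]


def region_labels(H, W, M, shift):
    """Label each cell of the rolled grid by the contiguous region it came from."""
    def band(i, size):
        if i < size - M:
            return 0  # windows that contain no wrapped content
        if i < size - shift:
            return 1  # last window, the part that did not wrap
        return 2      # last window, the strip that wrapped around
    return [[3 * band(r, H) + band(c, W) for c in range(W)] for r in range(H)]


def shifted_window_masks(H, W, M, shift, blocked=-100.0):
    """One (M*M) x (M*M) additive mask per window: 0 = may attend, `blocked` = may not."""
    labels = region_labels(H, W, M, shift)
    masks = []
    for top in range(0, H, M):
        for left in range(0, W, M):
            ids = [labels[r][c]
                   for r in range(top, top + M)
                   for c in range(left, left + M)]
            masks.append([[0.0 if p == q else blocked for q in ids] for p in ids])
    return masks


H = W = 8
M, shift = 4, 2
grid = [[r * W + c for c in range(W)] for r in range(H)]
rolled = cyclic_shift(grid, shift)
masks = shifted_window_masks(H, W, M, shift)
print(rolled[0][0], len(masks))  # 18 4
```

The first window holds only patches that were contiguous in the original image, so its mask is all zeros; the last window mixes four separate regions and blocks every pair drawn from different regions.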

Hierarchical Feature Maps

Between consecutive stages, patch merging concatenates the features of each 2×2 group of neighbouring patches (giving 4×d dimensions) and linearly projects them to 2×d dimensions. This halves the spatial resolution and doubles the channel width, mimicking CNN downsampling.
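A minimal sketch of patch merging on nested Python lists follows. It assumes a plain linear projection with random placeholder weights and omits any normalisation layer; `patch_merging` and its argument layout are illustrative choices.

```python
import random


def patch_merging(x, weight):
    """Merge 2 x 2 neighbouring patches.

    x:      H x W x C nested lists (H and W even)
    weight: 4C x 2C projection matrix
    Returns an (H/2) x (W/2) x 2C nested list.
    """
    H, W = len(x), len(x[0])
    columns = list(zip(*weight))  # 2C columns, each of length 4C
    out = []
    for r in range(0, H, 2):
        row = []
        for c in range(0, W, 2):
            # Concatenate the four neighbours' features: 4C values.
            cat = x[r][c] + x[r + 1][c] + x[r][c + 1] + x[r + 1][c + 1]
            # Linear projection from 4C to 2C.
            row.append([sum(v * w for v, w in zip(cat, col)) for col in columns])
        out.append(row)
    return out


random.seed(0)
H, W, C = 4, 4, 3
x = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[random.gauss(0, 0.02) for _ in range(2 * C)] for _ in range(4 * C)]
y = patch_merging(x, weight)
print(len(y), len(y[0]), len(y[0][0]))  # 2 2 6
```

The 4×4 grid with 3 channels becomes a 2×2 grid with 6 channels: half the resolution per side, twice the width.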

Stage         | Spatial size  | Channels
Input patches | H/4 × W/4     | 96
After Stage 1 | H/4 × W/4     | 96
After Stage 2 | H/8 × W/8     | 192
After Stage 3 | H/16 × W/16   | 384
After Stage 4 | H/32 × W/32   | 768
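The shapes in the table can be reproduced for a concrete input size with a few lines of Python. The sketch assumes 4×4 input patches and a base width of 96 channels (the width used by the smallest Swin variants); `swin_stage_shapes` is a hypothetical helper.

```python
def swin_stage_shapes(height, width, C=96, patch=4, num_stages=4):
    """Return (stage, h, w, channels) for the output of each stage."""
    shapes = []
    h, w, c = height // patch, width // patch, C
    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Patch merging before stages 2..4: halve resolution, double channels.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((stage, h, w, c))
    return shapes


for stage, h, w, c in swin_stage_shapes(224, 224):
    print(f"Stage {stage}: {h} x {w} x {c}")
```

For a 224×224 input this gives 56×56×96, 28×28×192, 14×14×384 and 7×7×768.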

These multi-scale features plug directly into standard detection frameworks (FPN, DETR) and segmentation decoders, something the single-scale ViT cannot easily do.

Where Swin Wins

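On release, Swin set state-of-the-art results on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val). Its hierarchical design and local attention make it a popular Transformer backbone for dense prediction tasks.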

✅ Key Takeaways

  • Swin uses local window attention (O(n·M²)) instead of global attention (O(n²)) — linear in image size.
  • Regular and shifted windows alternate between consecutive layers, allowing cross-window connections at little extra cost.
  • Hierarchical stages produce multi-scale features, making Swin compatible with detection and segmentation heads.
  • Set state-of-the-art results on COCO and ADE20K in 2021 and remains a widely used backbone for dense visual tasks.