Over-smoothing vs Over-squashing: The Difference

TL;DR: Oversmoothing = forward-pass feature collapse from too much averaging (nearby nodes become identical). Oversquashing = gradient/information collapse at bottleneck edges for long-range communication. Both increase with depth but in different ways, on different nodes, and need different fixes.

The Confusion

Both oversmoothing and oversquashing:

  • Occur with deep GNNs
  • Cause performance degradation
  • Involve information loss

They are often mentioned together or confused. But they are fundamentally different phenomena.

Head-to-Head Comparison

| Property | Oversmoothing | Oversquashing |
|---|---|---|
| Root cause | Iterated averaging → all embeddings converge | Exponential neighbourhood growth → info bottleneck |
| Direction | Forward pass (computation) | Both forward (dilution) and backward (gradient) |
| Which nodes affected | All nodes, especially nearby ones | Nodes that are far apart (long paths) |
| Graph structure | Worse on dense, well-connected graphs | Worse on tree-like, sparse graphs with bridge edges |
| With more layers | Provably gets worse (converges to constant) | Could get better (reach distant nodes) but squashing increases |
| Measure | Dirichlet energy → 0; MAD → 0 | Jacobian ∂h_v/∂x_u → 0 |
| Spectral view | Low-pass filter removes high frequencies | Not spectral: it's about topology/curvature |
| Fix | Residual connections, jumping knowledge (JK-Net), APPNP | Graph rewiring, global attention, virtual nodes |

When You Have Oversmoothing

You add layers hoping to capture longer-range patterns, but performance peaks at 2-3 layers then drops. Node embeddings in the last layer have near-zero pairwise distances. The model assigns nearly the same embedding to all nodes.

Symptom: accuracy peaks at 2-3 layers, then monotonically decreases. MAD scores drop toward zero with depth.
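
The collapse described above can be reproduced without training anything. The sketch below is a simplified illustration, not a full GCN: it applies only the symmetric normalised adjacency (no weights, no nonlinearity) to random features on a small random graph, and tracks a simplified MAD (mean pairwise cosine distance) and the Dirichlet energy. The graph generator, the sizes, and the helper names are all assumptions made for the example.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style propagation matrix D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def mad(h):
    """Simplified MAD: mean cosine distance over all distinct node pairs."""
    unit = h / np.clip(np.linalg.norm(h, axis=1, keepdims=True), 1e-12, None)
    cos = unit @ unit.T
    off_diagonal = ~np.eye(h.shape[0], dtype=bool)
    return float((1.0 - cos[off_diagonal]).mean())

def dirichlet_energy(h, a_hat):
    """trace(H^T (I - A_hat) H): small when neighbouring embeddings agree."""
    laplacian = np.eye(a_hat.shape[0]) - a_hat
    return float(np.trace(h.T @ laplacian @ h))

def random_connected_graph(n, p, rng):
    """Random edges with probability p, plus a ring so the graph is connected."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    adj = (upper | upper.T).astype(float)
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

rng = np.random.default_rng(0)
a_hat = normalized_adjacency(random_connected_graph(30, 0.3, rng))
h = rng.standard_normal((30, 16))  # random stand-in for node features

mad_by_depth, energy_by_depth = {}, {}
current, applied = h.copy(), 0
for depth in [0, 1, 2, 4, 8, 16, 32, 64]:
    for _ in range(depth - applied):  # propagate up to the requested depth
        current = a_hat @ current
    applied = depth
    mad_by_depth[depth] = mad(current)
    energy_by_depth[depth] = dirichlet_energy(current, a_hat)
    print(f"depth={depth:2d}  MAD={mad_by_depth[depth]:.2e}  "
          f"energy={energy_by_depth[depth]:.2e}")
```

Both quantities shrink with depth because repeated multiplication by the normalised adjacency keeps only its dominant eigenvector. A trained model with weights and nonlinearities will not follow this curve exactly, so treat the numbers as a qualitative picture.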

Fix: residual connections (GCNII), APPNP, JK-Net (jumping knowledge). Do NOT add more layers — that makes it worse.

When You Have Oversquashing

You have a task requiring long-range reasoning (e.g., predicting whether two distant atoms in a molecule will react). The model performs well on local structure tasks but fails on long-range ones. Adding more layers doesn’t help.

Symptom: performance on long-range tasks (e.g., LRGB benchmarks) is poor regardless of depth. Jacobian norms near zero for distant node pairs.
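
To get a feel for how small these Jacobian entries become, the sketch below uses a deliberately simplified model: a linear, weight-free message-passing stack, for which the sensitivity of node v's output to node u's input after k layers is exactly the (v, u) entry of the k-th power of the normalised adjacency. That simplification, the complete binary tree, and the helper names are assumptions for illustration. A real GNN's Jacobian also depends on its weights and activations.

```python
import numpy as np

def binary_tree_adjacency(depth):
    """Complete binary tree; node 0 is the root, children of i are 2i+1 and 2i+2."""
    n = 2 ** (depth + 1) - 1
    adj = np.zeros((n, n))
    for child in range(1, n):
        parent = (child - 1) // 2
        adj[parent, child] = adj[child, parent] = 1.0
    return adj

def normalized_adjacency(adj):
    """GCN-style propagation matrix D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sensitivity(a_hat, num_layers, v, u):
    """d h_v / d x_u for a linear, weight-free stack of num_layers layers."""
    return float(np.linalg.matrix_power(a_hat, num_layers)[v, u])

depth = 4
a_hat = normalized_adjacency(binary_tree_adjacency(depth))
n = a_hat.shape[0]

left_leaf = 2 ** depth - 1   # leftmost leaf
sibling_leaf = left_leaf + 1 # shares a parent with left_leaf (2 hops away)
right_leaf = n - 1           # rightmost leaf (2 * depth hops away, via the root)

for num_layers in (2, 4, 8, 16):
    near = sensitivity(a_hat, num_layers, left_leaf, sibling_leaf)
    far = sensitivity(a_hat, num_layers, left_leaf, right_leaf)
    print(f"layers={num_layers:2d}  near pair={near:.3e}  far pair={far:.3e}")
```

With fewer layers than the hop distance, the far pair's sensitivity is exactly zero because no message can arrive. Once enough layers are stacked, it becomes nonzero but stays orders of magnitude below the nearby pair, since every message between the two subtrees has to pass through the root.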

Fix: graph rewiring (e.g., SDRF), virtual nodes, or global attention (Graph Transformers, GPS). Adding residual connections does NOT fix oversquashing: residuals leave the graph topology unchanged, so information from distant nodes is still squeezed through the same bottleneck.

A Unified View

Li et al. and Alon & Yahav propose viewing both as failures of information flow, but in different regimes:

Short range:  Oversmoothing dominates (too many hops → convergence)
Long range:   Oversquashing dominates (too few paths → bottlenecks)

They create opposing pressures on depth:

  • Oversmoothing says: use FEWER layers
  • Task requirements say: use MORE layers (to reach distant nodes)
  • Oversquashing says: more layers don’t help anyway for bottlenecks

The resolution: decouple propagation from transformation (APPNP, SGC) and/or add global attention (Graph Transformers, GPS).

The practical diagnostic: run your GNN on the same task with increasing depth (1, 2, 4, 8, 16 layers) and watch how performance changes:

  • Performance peaks early and then drops: oversmoothing.
  • Performance never improves beyond a ceiling regardless of depth, and the task requires long-range reasoning: oversquashing.
  • Both: you need both architectural and rewiring fixes.
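
The depth sweep can be turned into a rough rule of thumb in code. The helper below is hypothetical: the function name, the `target` score that stands in for "the ceiling you hoped to beat", and the `tolerance` threshold are all assumptions, and the scores in the usage example are made-up numbers rather than results from any benchmark.

```python
def diagnose_depth_sweep(scores, needs_long_range, target, tolerance=0.02):
    """Heuristic reading of a depth sweep.

    scores: dict mapping number of layers -> validation score (higher is better).
    needs_long_range: whether the task is believed to need long-range reasoning.
    target: score you would expect a model that solves the task to reach.
    tolerance: how large a drop from the peak counts as a real decline.
    """
    depths = sorted(scores)
    peak_depth = max(depths, key=lambda d: scores[d])
    peak = scores[peak_depth]
    deepest = depths[-1]

    findings = []
    # Peaks early, then clearly declines at the deepest setting.
    if peak_depth < deepest and scores[deepest] < peak - tolerance:
        findings.append("oversmoothing")
    # Never reaches the target at any depth on a long-range task.
    if needs_long_range and peak < target:
        findings.append("oversquashing")
    return findings or ["no clear depth pathology"]

# Made-up sweeps for illustration only.
declining = {1: 0.71, 2: 0.81, 4: 0.78, 8: 0.60, 16: 0.35}
flat = {1: 0.52, 2: 0.55, 4: 0.56, 8: 0.56, 16: 0.55}

print(diagnose_depth_sweep(declining, needs_long_range=False, target=0.80))
print(diagnose_depth_sweep(flat, needs_long_range=True, target=0.80))
```

A flat curve alone does not prove oversquashing; it only suggests that depth is not the limiting factor. Checking Jacobian norms or trying a rewired graph gives stronger evidence.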

Fixes Summary

Oversmoothing fixes (forward collapse):

  • GCNII: residual connections to initial representation
  • JK-Net: concatenate all layer outputs
  • APPNP: teleport back to initial features during propagation
  • DropEdge: randomly drop edges to reduce averaging
  • PairNorm: explicit normalisation to maintain diversity
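
As an illustration of why the teleport idea behind APPNP counters forward collapse, the sketch below compares plain repeated averaging with an APPNP-style update, Z ← (1 − α) Â Z + α H, on a small ring graph. It assumes random features in place of the predictions APPNP would normally propagate, an illustrative α of 0.1, and a Dirichlet energy as the diversity measure. None of this is taken from a specific library implementation.

```python
import numpy as np

def ring_adjacency(n):
    """Cycle graph on n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

def normalized_adjacency(adj):
    """GCN-style propagation matrix D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def dirichlet_energy(z, a_hat):
    """trace(Z^T (I - A_hat) Z): near zero once neighbours look alike."""
    return float(np.trace(z.T @ (np.eye(a_hat.shape[0]) - a_hat) @ z))

def plain_propagation(h, a_hat, steps):
    """Repeated neighbourhood averaging with no teleport."""
    z = h
    for _ in range(steps):
        z = a_hat @ z
    return z

def appnp_propagation(h, a_hat, steps, alpha):
    """APPNP-style update: mix each averaging step with the initial features."""
    z = h
    for _ in range(steps):
        z = (1.0 - alpha) * (a_hat @ z) + alpha * h
    return z

rng = np.random.default_rng(0)
a_hat = normalized_adjacency(ring_adjacency(12))
h = rng.standard_normal((12, 8))  # stand-in for initial node representations

initial = dirichlet_energy(h, a_hat)
plain = dirichlet_energy(plain_propagation(h, a_hat, 64), a_hat)
appnp = dirichlet_energy(appnp_propagation(h, a_hat, 64, alpha=0.1), a_hat)
print(f"initial={initial:.3e}  plain={plain:.3e}  appnp={appnp:.3e}")
```

After 64 steps the plain variant has lost almost all of its energy, while the teleport term keeps the APPNP-style variant at a nonzero fixed point. Larger α keeps more of the initial signal at the cost of a shorter effective propagation range.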

Oversquashing fixes (bottleneck communication):

  • SDRF: Ricci flow-based graph rewiring
  • Virtual node: global communication node
  • Graph Transformers: bypass message passing for long-range
  • GPS: combine local MPNN + global attention
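
The effect of a virtual node on path lengths is easy to check on a toy graph. The sketch below assumes a 20-node path graph stored as a plain adjacency dictionary, with hypothetical helper names; it shows only the change in hop distance, not a trained model.

```python
from collections import deque

def path_graph(n):
    """Nodes 0..n-1 connected in a line, as an adjacency dictionary."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def add_virtual_node(graph):
    """Return a copy with one extra node linked to every original node."""
    virtual = max(graph) + 1
    new_graph = {node: list(nbrs) + [virtual] for node, nbrs in graph.items()}
    new_graph[virtual] = list(graph)
    return new_graph

def hop_distance(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for nbr in graph[node]:
            if nbr not in distance:
                distance[nbr] = distance[node] + 1
                queue.append(nbr)
    return None

original = path_graph(20)
augmented = add_virtual_node(original)

print("end-to-end hops, original :", hop_distance(original, 0, 19))
print("end-to-end hops, augmented:", hop_distance(augmented, 0, 19))
```

Every pair of original nodes ends up at most two hops apart, so far fewer layers are needed for them to communicate. The trade-off is that all long-range traffic now passes through a single node, which is why virtual nodes are often combined with rewiring or attention rather than used alone.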

Fixes for both:

  • GPS (General, Powerful, Scalable): the local MPNN can stay shallow, which limits oversmoothing, while global attention bypasses the bottlenecks behind oversquashing

Summary

| Question | Oversmoothing | Oversquashing |
|---|---|---|
| Where does info die? | Nearby (convergence) | At bottleneck edges (long range) |
| When does it hurt? | Dense graphs, many layers | Sparse graphs with bridges, long-range tasks |
| Can more layers help? | Never (makes it worse) | Should, but squashing increases too |
| Key fix | Residuals, less aggregation | Rewiring, global attention |

These two pathologies define the fundamental challenges of deep GNNs. Understanding both — and distinguishing them — is essential for diagnosing GNN failures and choosing appropriate solutions.

References