Over-smoothing vs Over-squashing: The Difference
The Confusion
Both oversmoothing and oversquashing:
- Occur with deep GNNs
- Cause performance degradation
- Involve information loss
They are often mentioned together or confused. But they are fundamentally different phenomena.
Head-to-Head Comparison
| Property | Oversmoothing | Oversquashing |
|---|---|---|
| Root cause | Iterated averaging → all embeddings converge | Exponential neighbourhood growth → info bottleneck |
| Direction | Forward pass (computation) | Both forward (dilution) and backward (gradient) |
| Which nodes affected | All nodes, especially nearby ones | Nodes that are far apart (long paths) |
| Graph structure | Worse on dense, well-connected graphs | Worse on tree-like, sparse graphs with bridge edges |
| With more layers | Provably gets worse (converges to constant) | Could get better (reach distant nodes) but squashing increases |
| Measure | Dirichlet energy → 0; MAD → 0 | Jacobian norm ‖∂h_v/∂x_u‖ → 0 |
| Spectral view | Low-pass filter removes high frequencies | Not spectral: it’s about topology/curvature |
| Fix | Residual connections, jumping knowledge, APPNP | Graph rewiring, global attention, virtual nodes |
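For reference, the two quantities in the Measure row can be written out as below. This is one common formulation rather than the exact normalisation of any particular paper, and the notation is assumed here: $h_i^{(k)}$ is the embedding of node $i$ after $k$ layers, $x_u$ is the input feature of node $u$, $\mathcal{E}$ is the edge set, and $d(u, v)$ is the hop distance.

```latex
% Dirichlet energy (unnormalised form): oversmoothing drives it to zero with depth
E\bigl(H^{(k)}\bigr) = \sum_{(i,j) \in \mathcal{E}} \bigl\| h_i^{(k)} - h_j^{(k)} \bigr\|_2^2
\;\longrightarrow\; 0 \quad \text{as } k \to \infty

% Sensitivity of node v to the input at node u: oversquashing makes it vanish for distant pairs
\left\| \frac{\partial h_v^{(k)}}{\partial x_u} \right\| \approx 0
\quad \text{when } d(u, v) \text{ is large}
```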
When You Have Oversmoothing
You add layers hoping to capture longer-range patterns, but performance peaks at 2-3 layers then drops. Node embeddings in the last layer have near-zero pairwise distances. The model assigns nearly the same embedding to all nodes.
Symptom: accuracy peaks at 2-3 layers, then monotonically decreases. MAD scores drop toward zero with depth.
Fix: residual connections (GCNII), APPNP, JK-Net (jumping knowledge). Do NOT add more layers — that makes it worse.
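The symptom above can be reproduced on a toy graph. The following is a minimal NumPy sketch, not a trained GNN: it assumes parameter-free mean aggregation (row-normalised adjacency with self-loops), a 16-node ring graph, and random input features. The helper names (`mean_aggregation_matrix`, `mean_cosine_distance`, `smoothing_curve`, `ring_graph`) are made up for this illustration, and the cosine score is a simplified stand-in for MAD, which in its original form also applies a pair mask.

```python
import numpy as np

def mean_aggregation_matrix(adj):
    """Row-normalised adjacency with self-loops: each node averages itself and its neighbours."""
    a_hat = adj + np.eye(adj.shape[0])
    return a_hat / a_hat.sum(axis=1, keepdims=True)

def dirichlet_energy(h, adj):
    """Sum of squared embedding differences over edges, each undirected edge counted once."""
    rows, cols = np.nonzero(np.triu(adj, k=1))
    return float(np.sum((h[rows] - h[cols]) ** 2))

def mean_cosine_distance(h, eps=1e-12):
    """Simplified MAD-style score: mean cosine distance over all pairs of distinct nodes."""
    unit = h / (np.linalg.norm(h, axis=1, keepdims=True) + eps)
    cosine = unit @ unit.T
    off_diagonal = ~np.eye(h.shape[0], dtype=bool)
    return float(np.mean(1.0 - cosine[off_diagonal]))

def smoothing_curve(adj, x, depths):
    """Apply repeated mean aggregation and record both scores at the requested depths."""
    propagate = mean_aggregation_matrix(adj)
    h, curve = x.copy(), {}
    for layer in range(1, max(depths) + 1):
        h = propagate @ h
        if layer in depths:
            curve[layer] = (dirichlet_energy(h, adj), mean_cosine_distance(h))
    return curve

def ring_graph(n):
    """Undirected cycle on n nodes (hypothetical toy graph)."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

rng = np.random.default_rng(0)
adj = ring_graph(16)
x = rng.normal(size=(16, 8))  # random stand-in for node features
curve = smoothing_curve(adj, x, depths=(1, 4, 16, 64))
for layer, (energy, mad) in curve.items():
    print(f"layers={layer:3d}  dirichlet={energy:.5f}  mean_cos_dist={mad:.5f}")
```

In a real model the same two functions could be applied to the hidden states of each layer; a curve that collapses toward zero as depth grows points to oversmoothing rather than oversquashing.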
When You Have Oversquashing
You have a task requiring long-range reasoning (e.g., predicting whether two distant atoms in a molecule will react). The model performs well on local structure tasks but fails on long-range ones. Adding more layers doesn’t help.
Symptom: performance on long-range tasks (e.g., LRGB benchmarks) is poor regardless of depth. Jacobian norms near zero for distant node pairs.
Fix: graph rewiring (SDRF), virtual nodes, global attention (Graph Transformers, GPS). Adding residual connections does NOT fix oversquashing: information still can’t reach distant nodes.
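The Jacobian symptom can be illustrated without training anything. The sketch below assumes a linear GCN with identity weights, for which the derivative of node v's output with respect to node u's input is exactly the (v, u) entry of the normalised adjacency raised to the number of layers. The complete binary tree and the helper names (`binary_tree_adjacency`, `gcn_normalised`, `sensitivity_proxy`) are assumptions chosen for illustration; a real model would need the Jacobian computed by automatic differentiation.

```python
import numpy as np

def binary_tree_adjacency(depth):
    """Complete binary tree in heap order: node i has children 2i+1 and 2i+2."""
    n = 2 ** (depth + 1) - 1
    adj = np.zeros((n, n))
    for child in range(1, n):
        parent = (child - 1) // 2
        adj[parent, child] = adj[child, parent] = 1.0
    return adj

def gcn_normalised(adj):
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    inv_sqrt_deg = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

def sensitivity_proxy(adj, v, u, num_layers):
    """Entry (v, u) of A_hat^num_layers, i.e. d h_v / d x_u for a linear, identity-weight GCN."""
    powered = np.linalg.matrix_power(gcn_normalised(adj), num_layers)
    return float(powered[v, u])

adj = binary_tree_adjacency(depth=4)   # 31 nodes; the leaves are nodes 15..30
leaf, sibling, far_leaf = 15, 16, 30   # hop distance from `leaf`: 2 and 8

near = sensitivity_proxy(adj, leaf, sibling, num_layers=8)
far = sensitivity_proxy(adj, leaf, far_leaf, num_layers=8)
too_shallow = sensitivity_proxy(adj, leaf, far_leaf, num_layers=7)

print(f"8 layers, sibling leaf (2 hops):  {near:.6f}")
print(f"8 layers, far leaf (8 hops):      {far:.6f}")
print(f"7 layers, far leaf (unreachable): {too_shallow:.6f}")
```

With fewer layers than the hop distance the sensitivity is exactly zero (under-reaching). Once the depth matches the distance it becomes non-zero but stays far smaller than for the nearby pair, because the signal has to pass through the root of the tree.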
A Unified View
Taken together, Li et al. and Alon & Yahav suggest viewing both as failures of information flow, but in different regimes:
- Short range: oversmoothing dominates (too many hops → convergence)
- Long range: oversquashing dominates (too few paths → bottlenecks)
They create opposing pressures on depth:
- Oversmoothing says: use FEWER layers
- Task requirements say: use MORE layers (to reach distant nodes)
- Oversquashing says: more layers don’t help anyway for bottlenecks
The resolution: decouple propagation from transformation (APPNP, SGC) and/or add global attention (Graph Transformers, GPS).
Fixes Summary
Oversmoothing fixes (forward collapse):
- GCNII: residual connections to initial representation
- JK-Net: concatenate all layer outputs
- APPNP: teleport back to initial features during propagation
- DropEdge: randomly drop edges to reduce averaging
- PairNorm: explicit normalisation to maintain diversity
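As one concrete instance from the list above, the APPNP-style teleport can be sketched in a few lines. This is a propagation-only illustration under stated assumptions: `h0` is a random stand-in for the initial features or predictions, the graph is a 16-node ring, and `appnp_propagate` and the other helper names are hypothetical. Setting `alpha=0.0` recovers plain repeated aggregation for comparison.

```python
import numpy as np

def ring_graph(n):
    """Undirected cycle on n nodes (hypothetical toy graph)."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

def gcn_normalised(adj):
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    inv_sqrt_deg = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

def dirichlet_energy(h, adj):
    """Sum of squared embedding differences over edges, each undirected edge counted once."""
    rows, cols = np.nonzero(np.triu(adj, k=1))
    return float(np.sum((h[rows] - h[cols]) ** 2))

def appnp_propagate(h0, adj, num_steps, alpha):
    """Iterate Z <- (1 - alpha) * A_hat @ Z + alpha * H0, starting from Z = H0."""
    a_hat = gcn_normalised(adj)
    z = h0.copy()
    for _ in range(num_steps):
        z = (1.0 - alpha) * (a_hat @ z) + alpha * h0
    return z

rng = np.random.default_rng(0)
adj = ring_graph(16)
h0 = rng.normal(size=(16, 8))  # random stand-in for initial representations

plain_energy = dirichlet_energy(appnp_propagate(h0, adj, 64, alpha=0.0), adj)
teleport_energy = dirichlet_energy(appnp_propagate(h0, adj, 64, alpha=0.1), adj)

print(f"initial energy:          {dirichlet_energy(h0, adj):.4f}")
print(f"64 steps, no teleport:   {plain_energy:.6f}")
print(f"64 steps, alpha = 0.1:   {teleport_energy:.6f}")
```

The teleport term keeps re-injecting the initial signal at every step, so the embeddings settle at a smoothed but non-constant fixed point instead of collapsing. The value `alpha=0.1` is only an example setting.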
Oversquashing fixes (bottleneck communication):
- SDRF: Ricci flow-based graph rewiring
- Virtual node: global communication node
- Graph Transformers: bypass message passing for long-range
- GPS: combine local MPNN + global attention
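The effect of a virtual node on reachability can be sketched with a simple sensitivity proxy. It assumes a linear GCN with identity weights, so the influence of node u on node v after k layers is the (v, u) entry of the normalised adjacency to the power k. The binary tree and the helper names (`add_virtual_node`, `sensitivity_proxy`, and so on) are assumptions made for this illustration. The proxy only measures whether signal can travel between two nodes; it does not model the compression of every node's information into the single virtual-node vector.

```python
import numpy as np

def binary_tree_adjacency(depth):
    """Complete binary tree in heap order: node i has children 2i+1 and 2i+2."""
    n = 2 ** (depth + 1) - 1
    adj = np.zeros((n, n))
    for child in range(1, n):
        parent = (child - 1) // 2
        adj[parent, child] = adj[child, parent] = 1.0
    return adj

def add_virtual_node(adj):
    """Append one extra node that is connected to every existing node."""
    n = adj.shape[0]
    augmented = np.zeros((n + 1, n + 1))
    augmented[:n, :n] = adj
    augmented[n, :n] = augmented[:n, n] = 1.0
    return augmented

def gcn_normalised(adj):
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    inv_sqrt_deg = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

def sensitivity_proxy(adj, v, u, num_layers):
    """Entry (v, u) of A_hat^num_layers for a linear, identity-weight GCN."""
    powered = np.linalg.matrix_power(gcn_normalised(adj), num_layers)
    return float(powered[v, u])

tree = binary_tree_adjacency(depth=4)   # 31 nodes; the leaves are nodes 15..30
augmented = add_virtual_node(tree)      # the virtual node gets index 31
leaf, far_leaf = 15, 30                 # 8 hops apart in the original tree

tree_2_layers = sensitivity_proxy(tree, leaf, far_leaf, num_layers=2)
tree_8_layers = sensitivity_proxy(tree, leaf, far_leaf, num_layers=8)
virtual_2_layers = sensitivity_proxy(augmented, leaf, far_leaf, num_layers=2)

print(f"tree, 2 layers:           {tree_2_layers:.6f}")
print(f"tree, 8 layers:           {tree_8_layers:.6f}")
print(f"tree + virtual, 2 layers: {virtual_2_layers:.6f}")
```

On this toy graph the two leaves cannot exchange any signal in two layers without the virtual node. With it, every pair is at most two hops apart, and the two-layer sensitivity exceeds what the plain tree achieves even at eight layers.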
Fixes for both:
- GPS (General, Powerful, Scalable): local MPNN avoids oversmoothing; global attention bypasses oversquashing
Summary
| Question | Oversmoothing | Oversquashing |
|---|---|---|
| Where does info die? | Nearby (convergence) | At bottleneck edges (long range) |
| When does it hurt? | Dense graphs, many layers | Sparse graphs with bridges, long-range tasks |
| Can more layers help? | Never (makes it worse) | Should, but squashing increases too |
| Key fix | Residuals, less aggregation | Rewiring, global attention |
These two pathologies define the fundamental challenges of deep GNNs. Understanding both — and distinguishing them — is essential for diagnosing GNN failures and choosing appropriate solutions.
References
- Li, Q., Han, Z., & Wu, X.-M. (2018). Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Classification. AAAI 2018 (oversmoothing).
- Alon, U., & Yahav, E. (2021). On the Bottleneck of Graph Neural Networks and Its Practical Implications. ICLR 2021 (oversquashing).
- Topping, J., Di Giovanni, F., Chamberlain, B. P., Dong, X., & Bronstein, M. M. (2022). Understanding Over-Squashing and Bottlenecks on Graphs via Curvature. ICLR 2022.
