Over-smoothing vs Over-squashing: The Difference

7 minute read

Published: April 15, 2024

TL;DR: Oversmoothing = forward-pass feature collapse from too much averaging (nearby nodes become identical). Oversquashing = gradient/information collapse at bottleneck edges for long-range communication. Both increase with depth but in different ways, on different nodes, and need different fixes.

Over-smoothing vs over-squashing: two distinct failure modes in deep GNNs (Topping et al., 2022)

Intuition First

Imagine you are in a room full of people whispering a message from person to person. Oversmoothing is what happens when everyone repeats the average of all messages they heard — after enough rounds, everyone says the same thing. The content has been diluted to nothing.

Oversquashing is different: imagine two distant groups connected by a single corridor (one “bridge” person). All information between the groups must squeeze through that one person. No matter how many rounds of whispering, the bridge person cannot faithfully relay an exponentially growing flood of messages.

Same symptom (performance collapse), completely different causes.

Left: oversmoothing — node features fade toward a uniform value. Right: oversquashing — all cross-cluster information must traverse the single red bridge node.

The Confusion

Both oversmoothing and oversquashing:

Occur with deep GNNs
Cause performance degradation
Involve information loss

They are often mentioned together or confused. But they are fundamentally different phenomena.

Head-to-Head Comparison

Property	Oversmoothing	Oversquashing
Root cause	Iterated averaging → all embeddings converge	Exponential neighbourhood growth → info bottleneck
Direction	Forward pass (computation)	Both forward (dilution) and backward (gradient)
Which nodes affected	All nodes, especially nearby ones	Nodes that are far apart (long paths)
Graph structure	Worse on dense, well-connected graphs	Worse on tree-like, sparse graphs with bridge edges
With more layers	Provably gets worse (converges to constant)	Could get better (reach distant nodes) but squashing increases
Measure	Dirichlet energy → 0; MAD → 0	Jacobian norm ‖∂h_v/∂x_u‖ → 0
Spectral view	Low-pass filter removes high frequencies	Not spectral: it’s about topology/curvature
Fix	Residual connections, jump knowledge, APPNP	Graph rewiring, global attention, virtual nodes
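
The Dirichlet-energy row of the table can be checked numerically. The sketch below is a minimal illustration rather than a trained GCN: it assumes an 8-node ring graph, random input features, and weight-free propagation X ← ÂX with the usual self-loop normalisation. The helper names are hypothetical.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def dirichlet_energy(x, a_hat):
    """trace(X^T (I - A_hat) X): small when neighbouring features agree."""
    laplacian = np.eye(a_hat.shape[0]) - a_hat
    return float(np.trace(x.T @ laplacian @ x))

# Assumed toy setup: ring of 8 nodes, 4 random features per node.
n = 8
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
a_hat = normalized_adjacency(adj)

rng = np.random.default_rng(0)
x = rng.normal(size=(n, 4))

energies = []
for layer in range(17):
    energies.append(dirichlet_energy(x, a_hat))
    x = a_hat @ x  # one round of weight-free neighbourhood averaging

for layer in (0, 1, 2, 4, 8, 16):
    print(f"layer {layer:2d}: Dirichlet energy = {energies[layer]:.6f}")
```

In this linear setting the energy can only shrink from one layer to the next, which is the forward-pass collapse the table refers to. Learned weights and nonlinearities change the rate, not the direction of the trend.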

When You Have Oversmoothing

You add layers hoping to capture longer-range patterns, but performance peaks at 2-3 layers then drops. Node embeddings in the last layer have near-zero pairwise distances. The model assigns nearly the same embedding to all nodes.

Symptom: accuracy peaks at 2-3 layers, then monotonically decreases. MAD scores drop toward zero with depth.

Fix: residual connections (GCNII), APPNP, JK-Net (jumping knowledge). Do NOT add more layers — that makes it worse.

When You Have Oversquashing

You have a task requiring long-range reasoning (e.g., predicting whether two distant atoms in a molecule will react). The model performs well on local structure tasks but fails on long-range ones. Adding more layers doesn’t help.

Symptom: performance on long-range tasks (e.g., LRGB benchmarks) is poor regardless of depth. Jacobian norms near zero for distant node pairs.

Fix: graph rewiring (SDRF, add virtual nodes), global attention (Graph Transformers, GPS). Adding residual connections does NOT fix oversquashing — information still can’t reach distant nodes.
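
The "exponential neighbourhood growth" behind oversquashing can be made concrete on a tree. The following sketch assumes a complete binary tree and a weight-free linear GCN, in which case the Jacobian of the root's output with respect to another node's input is simply the matching entry of Â^r (times the identity). The variable names are illustrative.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Complete binary tree in heap order: children of i are 2i+1 and 2i+2.
depth = 6
n = 2 ** (depth + 1) - 1
adj = np.zeros((n, n))
for child in range(1, n):
    parent = (child - 1) // 2
    adj[parent, child] = adj[child, parent] = 1.0
a_hat = normalized_adjacency(adj)

root, target = 0, 0
power = np.eye(n)
receptive_sizes, sensitivities = [], []
for r in range(1, depth + 1):
    power = power @ a_hat      # A_hat^r
    target = 2 * target + 1    # leftmost node at distance r from the root
    receptive_sizes.append(int(np.count_nonzero(power[root])))
    sensitivities.append(float(power[root, target]))
    print(f"r={r}: receptive field = {receptive_sizes[-1]:3d} nodes, "
          f"sensitivity to a node at distance r = {sensitivities[-1]:.6f}")
```

The receptive field roughly doubles with every layer while the sensitivity to any single node at the frontier shrinks by about a factor of four, so a fixed-size embedding has less and less room for each distant node.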

Worked Diagnostic Example

Consider a 4-layer GCN on a path graph: A — B — C — D — E — F — G — H — I — J (10 nodes).

Oversmoothing check: Compute Mean Average Distance (MAD) between node embeddings at each layer.

Layer 1: MAD = 0.82 (distinct features)
Layer 2: MAD = 0.51
Layer 4: MAD = 0.09 (nearly uniform)

The embeddings have collapsed — all nodes look alike. If you need to classify node A differently from node J, the model cannot.
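
A MAD check of this kind can be sketched in a few lines. The version below makes several assumptions: it uses a simplified MAD (mean cosine distance over all node pairs), random 16-dimensional input features, and weight-free propagation on the 10-node path. Its absolute values will therefore not match the illustrative figures above, and the collapse is slower than in the example; only the downward trend is the point.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def mean_average_distance(h):
    """Simplified MAD: mean cosine distance over all ordered node pairs."""
    unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T
    off_diagonal = ~np.eye(h.shape[0], dtype=bool)
    return float(cos_dist[off_diagonal].mean())

# Path graph A - B - ... - J, mapped to indices 0..9.
n = 10
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
a_hat = normalized_adjacency(adj)

rng = np.random.default_rng(0)
h = rng.normal(size=(n, 16))  # assumed random input features

report_at = (0, 1, 2, 4, 8, 16, 32, 64, 128)
mad_by_depth = {}
for layer in range(max(report_at) + 1):
    if layer in report_at:
        mad_by_depth[layer] = mean_average_distance(h)
    h = a_hat @ h  # one weight-free propagation step

for layer, mad in mad_by_depth.items():
    print(f"layer {layer:3d}: MAD = {mad:.4f}")
```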

Oversquashing check: Compute the Jacobian ∂h_A / ∂x_J (how much does node J’s input affect node A’s output?).

With 4 layers, A has a 4-hop receptive field, which does not include J (distance 9). So ∂h_A / ∂x_J = 0: A literally cannot see J.
Even with 9 layers (enough to reach J), exactly one length-9 walk connects J to A, while the many other walks ending at A stay close to A, so J's contribution is diluted to near zero.
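
Both statements can be verified directly under a simplifying assumption: for a linear GCN with identity weights, ∂h_A / ∂x_J after k layers equals the (A, J) entry of Â^k. The sketch below computes that entry for 4 and 9 layers on the same path graph; the variable names are hypothetical.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Path graph A - B - ... - J, mapped to indices 0..9.
n = 10
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
a_hat = normalized_adjacency(adj)

node_a, node_j = 0, 9
influence = {}
for layers in (4, 9):
    a_power = np.linalg.matrix_power(a_hat, layers)
    influence[layers] = float(a_power[node_a, node_j])
    own = float(a_power[node_a, node_a])
    print(f"{layers} layers: influence of J on A = {influence[layers]:.3e}, "
          f"influence of A on itself = {own:.3e}")
```

With 4 layers the entry is exactly zero. With 9 layers it equals the product of the nine edge weights along the only possible walk, (1/6)·(1/3)^7 = 1/13122 ≈ 7.6e-5, which is orders of magnitude below A's sensitivity to its own input.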

Both problems can coexist: you need 9 layers to reach J (depth demand), but 9 layers cause oversmoothing. The fix is not “just add more layers.”

Key Insight: Oversmoothing is measured in the forward pass (do node embeddings converge?). Oversquashing is measured via Jacobians (does a distant node's input influence this node's output?). You can have one without the other: a 2-layer GCN on a bottleneck graph has oversquashing but not oversmoothing.

A Unified View

Read together, Li et al. (2018) and Alon & Yahav (2021) suggest viewing both as failures of information flow, but in different regimes:

Short range:  Oversmoothing dominates (too many hops → convergence)
Long range:   Oversquashing dominates (too few paths → bottlenecks)

They create opposing pressures on depth:

Oversmoothing says: use FEWER layers
Task requirements say: use MORE layers (to reach distant nodes)
Oversquashing says: more layers don’t help anyway for bottlenecks

The resolution: decouple propagation from transformation (APPNP, SGC) and/or add global attention (Graph Transformers, GPS).

The practical diagnostic: Run your GNN on the same task with increasing layers (1, 2, 4, 8, 16). If performance peaks early and then drops: oversmoothing. If performance never improves beyond a ceiling regardless of depth, and tasks require long-range reasoning: oversquashing. If both: you need both architectural and rewiring fixes.
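
The decision rule above can be written down as a small helper. This is a hypothetical heuristic, not an established tool: the function name, the `tol` threshold, and the example scores are all assumptions made for illustration, and the scores are made up rather than measured.

```python
def diagnose_depth_sweep(scores, needs_long_range, tol=0.02):
    """Classify a depth sweep given {num_layers: validation score}.

    Hypothetical heuristic; `tol` is an arbitrary threshold, not a published value.
    """
    depths = sorted(scores)
    best_depth = max(depths, key=lambda d: scores[d])
    peak, final = scores[best_depth], scores[depths[-1]]

    # Peak at a shallow depth followed by a clear drop: oversmoothing pattern.
    if best_depth != depths[-1] and peak - final > tol:
        return "oversmoothing suspected"

    # Flat ceiling at the deepest settings on a long-range task: oversquashing pattern.
    plateau = len(depths) >= 2 and abs(scores[depths[-1]] - scores[depths[-2]]) <= tol
    if plateau and needs_long_range:
        return "oversquashing suspected"

    return "inconclusive"

# Made-up scores, purely to exercise the function (not measured results).
collapsing = {1: 0.70, 2: 0.81, 4: 0.78, 8: 0.60, 16: 0.35}
flat = {1: 0.50, 2: 0.55, 4: 0.56, 8: 0.56, 16: 0.555}

print(diagnose_depth_sweep(collapsing, needs_long_range=False))
print(diagnose_depth_sweep(flat, needs_long_range=True))
```

A curve alone cannot prove either pathology, so a result from this helper is best treated as a prompt to run the MAD and Jacobian checks rather than as a verdict.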

Fixes Summary

Oversmoothing fixes (forward collapse):

GCNII: residual connections to initial representation
JK-Net: concatenate all layer outputs
APPNP: teleport back to initial features during propagation
DropEdge: randomly drop edges to reduce averaging
PairNorm: explicit normalisation to maintain diversity
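
To see why the teleport term helps, the sketch below compares plain propagation with the APPNP update Z ← (1 − α) Â Z + α H⁰ and measures Dirichlet energy after 64 steps. It is a toy comparison under stated assumptions (8-node ring, random features, no learned weights, α = 0.1), not a reproduction of any paper's experiment.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def dirichlet_energy(x, a_hat):
    """trace(X^T (I - A_hat) X): small when neighbouring features agree."""
    laplacian = np.eye(a_hat.shape[0]) - a_hat
    return float(np.trace(x.T @ laplacian @ x))

# Assumed toy setup: ring of 8 nodes, 4 random features per node.
n = 8
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
a_hat = normalized_adjacency(adj)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(n, 4))

alpha = 0.1  # teleport probability (assumed value)
plain, appnp = x0.copy(), x0.copy()
for _ in range(64):
    plain = a_hat @ plain                                # pure averaging
    appnp = (1.0 - alpha) * (a_hat @ appnp) + alpha * x0  # averaging + teleport

e0 = dirichlet_energy(x0, a_hat)
e_plain = dirichlet_energy(plain, a_hat)
e_appnp = dirichlet_energy(appnp, a_hat)
print(f"initial energy:          {e0:.6f}")
print(f"plain, 64 steps:         {e_plain:.3e}")
print(f"APPNP (alpha=0.1), 64:   {e_appnp:.6f}")
```

Plain propagation drives the energy to numerical zero, while the teleport term keeps a fixed fraction of the initial signal in every step, so the energy settles at a non-zero value.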

Oversquashing fixes (bottleneck communication):

SDRF: Ricci flow-based graph rewiring
Virtual node: global communication node
Graph Transformers: bypass message passing for long-range
GPS: combine local MPNN + global attention
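
As an illustration of how a virtual node shortens long-range paths, the sketch below adds one node connected to every node of a 10-node path graph and compares the 2-layer influence of the last node on the first, before and after. It again assumes a weight-free linear GCN, where that influence is an entry of Â²; the setup and names are illustrative.

```python
import numpy as np

def normalized_adjacency(adj):
    """GCN-style normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Path graph with 10 nodes (indices 0..9).
n = 10
path = np.zeros((n, n))
for i in range(n - 1):
    path[i, i + 1] = path[i + 1, i] = 1.0

# Same graph plus a virtual node (index 10) linked to every original node.
with_vn = np.zeros((n + 1, n + 1))
with_vn[:n, :n] = path
with_vn[n, :n] = with_vn[:n, n] = 1.0

first, last = 0, 9
two_hop_plain = float(np.linalg.matrix_power(normalized_adjacency(path), 2)[first, last])
two_hop_vn = float(np.linalg.matrix_power(normalized_adjacency(with_vn), 2)[first, last])

print(f"2-layer influence without virtual node: {two_hop_plain:.3e}")
print(f"2-layer influence with virtual node:    {two_hop_vn:.3e}")
```

Without the virtual node the two endpoints are 9 hops apart and the 2-layer influence is zero. With it, the only 2-step route runs through the virtual node and the influence is exactly 1/33, since both edges on that route have weight 1/√33.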

Fixes for both:

GPS (General, Powerful, Scalable): the local MPNN can stay shallow, which limits oversmoothing; global attention gives distant nodes a direct path, which bypasses oversquashing

Summary

Question	Oversmoothing	Oversquashing
Where does info die?	Nearby (convergence)	At bottleneck edges (long range)
When does it hurt?	Dense graphs, many layers	Sparse graphs with bridges, long-range tasks
Can more layers help?	Never (makes it worse)	Should, but squashing increases too
Key fix	Residuals, less aggregation	Rewiring, global attention

These two pathologies define the fundamental challenges of deep GNNs. Understanding both — and distinguishing them — is essential for diagnosing GNN failures and choosing appropriate solutions.

References

Li, Q., Han, Z., & Wu, X.-M. (2018). Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Classification. AAAI 2018 (oversmoothing).
Alon, U., & Yahav, E. (2021). On the Bottleneck of Graph Neural Networks and Its Practical Implications. ICLR 2021 (oversquashing).
Topping, J., Di Giovanni, F., Chamberlain, B. P., Dong, X., & Bronstein, M. M. (2022). Understanding Over-squashing and Bottlenecks on Graphs via Curvature. ICLR 2022.

Alessio Borgi
