Oversmoothing: When All Node Embeddings Become the Same
Published:
The Problem: Deep GNNs Fail
Empirically: GCN with 2 layers works. With 8 layers, accuracy drops dramatically. With 64 layers, performance collapses to near-random. This is not overfitting — validation loss also degrades. It is oversmoothing.
Why do deeper networks hurt in GNNs when they help in CNNs and Transformers? Because graph convolution is fundamentally a smoothing operation: it mixes each node’s features with its neighbours’. Repeat this enough times, and all features converge.
The Mathematics of Oversmoothing
Consider GCN’s propagation step (ignoring learnable weights for clarity):
After K iterations: H^{(K)} = S̃ᴷ H^{(0)}.
S̃ is a symmetric positive semi-definite matrix with eigenvalues in [0, 2] (for normalised Laplacian). Its eigendecomposition is S̃ = U Λ Uᵀ, so:
As K → ∞:
- Eigenvalue λ = 1 → Λᴷ[i,i] = 1ᴷ = 1 (unchanged)
- Eigenvalue λ < 1 → Λᴷ[i,i] → 0 (suppressed)
- Eigenvalue λ > 1 → impossible for normalised Laplacian
All eigenvalues except λ = 1 (the constant eigenvector u₁ = [1,…,1]/√N) shrink to zero. The limit is:
This is a rank-1 matrix: every row is the same (a weighted global mean of features). All node embeddings converge to the same vector.
Spectral Interpretation
Oversmoothing is exactly low-pass filtering taken to the limit. GCN applies the filter h(λ) = 1 - λ/2 (roughly) — amplifying low frequencies (λ ≈ 0) and suppressing high frequencies (λ ≈ 2). After many layers, only the λ = 0 component survives — the global mean.
The graph signal becomes perfectly smooth: adjacent nodes are identical. For node classification, where you need to distinguish adjacent nodes (which often have different classes in heterophilic graphs), this is catastrophic.
How Fast Does Oversmoothing Happen?
The convergence rate depends on the second eigenvalue λ₂ of S̃. The spectral gap Δ = 1 - λ₂ determines how fast:
- Large spectral gap (dense, well-connected graph): fast oversmoothing (few layers needed to destroy information)
- Small spectral gap (sparse, weakly connected graph): slower oversmoothing
For Cora (a sparse citation network), oversmoothing is slow — models can use 4-8 layers before degrading. For a complete graph (everyone connected), oversmoothing happens in 1-2 steps.
Oversmoothing vs Vanishing Gradients
These are different phenomena:
| Oversmoothing | Vanishing gradients | |
|---|---|---|
| Cause | Repeated averaging in forward pass | Exploding/vanishing in backprop |
| Fix | Architectural (residuals, jump connections, less aggregation) | Residuals, batch norm, gradient clipping |
| Deep learning parallel | Feature collapse | Training instability |
| Occurs even without | Learning (pure propagation) | Nonlinearities |
Measuring Oversmoothing
Mean Average Distance (MAD): average pairwise L2 distance between all node embeddings. MAD → 0 as oversmoothing intensifies.
| Dirichlet Energy: E(H) = Σ_{(u,v)∈E} | h_u - h_v | ². E → 0 means adjacent nodes are identical — perfect smoothing. |
Monitoring Dirichlet energy across layers reveals exactly when and how fast oversmoothing occurs.
Solutions to Oversmoothing
| Approach | Mechanism | Example |
|---|---|---|
| Residual connections | Skip connections preserve pre-aggregation features | GCNII |
| Jumping Knowledge | Concatenate outputs of all layers | JK-Net |
| DropEdge | Randomly drop edges during training | DropEdge |
| PairNorm | Normalise pair-distances to prevent collapse | PairNorm |
| Separate propagation | Propagate once; transform many times | APPNP, SGC |
| Graph Transformers | No repeated neighbourhood averaging | Graphormer, GPS |
| Sheaf GNNs | Restriction maps prevent collapse across class boundaries | Neural Sheaf Diffusion |
Summary
Oversmoothing is not a bug in implementation — it is a mathematical property of iterated graph averaging:
- Spectral view: low-pass filtering → only DC component survives → constant signal
- Power iteration view: S̃ᴷ → rank-1 projection → all nodes get the same representation
- Practical consequence: GNNs with more than ~3-4 layers fail on standard benchmarks
- Fix: prevent repeated averaging (residuals, separate propagation) or use global attention (Graph Transformers)
Understanding oversmoothing is the first step to understanding why GNN depth scaling is fundamentally different from Transformer depth scaling — and why simply adding more layers is not the solution.
References
- Li, Q., Han, Z., & Wu, X.-M. (2018). Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Classification. AAAI 2018.
- Oono, K., & Suzuki, T. (2020). Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. ICLR 2020.
- Chen, M., Wei, Z., Huang, Z., Ding, B., & Li, Y. (2020). Simple and Deep Graph Convolutional Networks. ICML 2020 (GCNII — addresses oversmoothing).
- Zhao, L., & Akoglu, L. (2020). PairNorm: Tackling Oversmoothing in GNNs. ICLR 2020.
