Oversmoothing: When All Node Embeddings Become the Same

5 minute read

Published:

TL;DR: Oversmoothing occurs when repeated graph convolution (averaging over neighbours) causes all node embeddings to converge to the same values, erasing the distinction between nodes. It is mathematically equivalent to low-pass filtering: infinite iterations → DC component only → constant signal over the graph.

The Problem: Deep GNNs Fail

Empirically: GCN with 2 layers works. With 8 layers, accuracy drops dramatically. With 64 layers, performance collapses to near-random. This is not overfitting — validation loss also degrades. It is oversmoothing.

Why do deeper networks hurt in GNNs when they help in CNNs and Transformers? Because graph convolution is fundamentally a smoothing operation: it mixes each node’s features with its neighbours’. Repeat this enough times, and all features converge.

The Mathematics of Oversmoothing

Consider GCN’s propagation step (ignoring learnable weights for clarity):

H^{(k+1)} = S̃ H^{(k)} where S̃ = D̃^{-1/2} Ã D̃^{-1/2}

After K iterations: H^{(K)} = S̃ᴷ H^{(0)}.

S̃ is a symmetric positive semi-definite matrix with eigenvalues in [0, 2] (for normalised Laplacian). Its eigendecomposition is S̃ = U Λ Uᵀ, so:

S̃ᴷ = U Λᴷ Uᵀ

As K → ∞:

  • Eigenvalue λ = 1 → Λᴷ[i,i] = 1ᴷ = 1 (unchanged)
  • Eigenvalue λ < 1 → Λᴷ[i,i] → 0 (suppressed)
  • Eigenvalue λ > 1 → impossible for normalised Laplacian

All eigenvalues except λ = 1 (the constant eigenvector u₁ = [1,…,1]/√N) shrink to zero. The limit is:

S̃ᴷ X → u₁ u₁ᵀ X (as K → ∞)

This is a rank-1 matrix: every row is the same (a weighted global mean of features). All node embeddings converge to the same vector.

Spectral Interpretation

Oversmoothing is exactly low-pass filtering taken to the limit. GCN applies the filter h(λ) = 1 - λ/2 (roughly) — amplifying low frequencies (λ ≈ 0) and suppressing high frequencies (λ ≈ 2). After many layers, only the λ = 0 component survives — the global mean.

The graph signal becomes perfectly smooth: adjacent nodes are identical. For node classification, where you need to distinguish adjacent nodes (which often have different classes in heterophilic graphs), this is catastrophic.

How Fast Does Oversmoothing Happen?

The convergence rate depends on the second eigenvalue λ₂ of S̃. The spectral gap Δ = 1 - λ₂ determines how fast:

  • Large spectral gap (dense, well-connected graph): fast oversmoothing (few layers needed to destroy information)
  • Small spectral gap (sparse, weakly connected graph): slower oversmoothing

For Cora (a sparse citation network), oversmoothing is slow — models can use 4-8 layers before degrading. For a complete graph (everyone connected), oversmoothing happens in 1-2 steps.

The diameter paradox: You might think: "I need K layers to reach nodes K hops away, so add more layers for better coverage." But adding more layers also accelerates oversmoothing for nearby nodes. The optimal depth is a trade-off between coverage (more layers = larger receptive field) and smoothing (more layers = less discrimination). For most graphs, this optimum is 2-3 layers.

Oversmoothing vs Vanishing Gradients

These are different phenomena:

 OversmoothingVanishing gradients
CauseRepeated averaging in forward passExploding/vanishing in backprop
FixArchitectural (residuals, jump connections, less aggregation)Residuals, batch norm, gradient clipping
Deep learning parallelFeature collapseTraining instability
Occurs even withoutLearning (pure propagation)Nonlinearities

Measuring Oversmoothing

Mean Average Distance (MAD): average pairwise L2 distance between all node embeddings. MAD → 0 as oversmoothing intensifies.

Dirichlet Energy: E(H) = Σ_{(u,v)∈E} h_u - h_v ². E → 0 means adjacent nodes are identical — perfect smoothing.
E(H) = tr( Hᵀ L H ) where L is the graph Laplacian

Monitoring Dirichlet energy across layers reveals exactly when and how fast oversmoothing occurs.

Solutions to Oversmoothing

ApproachMechanismExample
Residual connectionsSkip connections preserve pre-aggregation featuresGCNII
Jumping KnowledgeConcatenate outputs of all layersJK-Net
DropEdgeRandomly drop edges during trainingDropEdge
PairNormNormalise pair-distances to prevent collapsePairNorm
Separate propagationPropagate once; transform many timesAPPNP, SGC
Graph TransformersNo repeated neighbourhood averagingGraphormer, GPS
Sheaf GNNsRestriction maps prevent collapse across class boundariesNeural Sheaf Diffusion

Summary

Oversmoothing is not a bug in implementation — it is a mathematical property of iterated graph averaging:

  1. Spectral view: low-pass filtering → only DC component survives → constant signal
  2. Power iteration view: S̃ᴷ → rank-1 projection → all nodes get the same representation
  3. Practical consequence: GNNs with more than ~3-4 layers fail on standard benchmarks
  4. Fix: prevent repeated averaging (residuals, separate propagation) or use global attention (Graph Transformers)

Understanding oversmoothing is the first step to understanding why GNN depth scaling is fundamentally different from Transformer depth scaling — and why simply adding more layers is not the solution.

References