Why Sheaf Diffusion Doesn’t Oversmooth: The Null Space Account
Published:
The Classical Oversmoothing Result
Theorem (Li et al., 2018): Let H^{(0)} = X (initial features). For a k-layer GCN with normalised adjacency à = D^{-1/2}ÃD^{-1/2}:
As k → ∞ (assuming spectral radius of W < 1 and connected G):
So H^{(k)} → (1 · πᵀ X W) — each node’s representation converges to the same d-dimensional vector, weighted by its degree. All inter-node discrimination is destroyed.
In terms of the Laplacian: The convergence is equivalent to:
The projection onto the null space of L̃ = I − Ã is the projection onto constants.
Why the Null Space Determines Oversmoothing
Any graph signal x can be decomposed as:
where x₀ = proj_{ker(L)} x (the harmonic/constant part) and x₊ = proj_{im(L)} x (the non-constant part).
Applying (I−L̃):
- x₀ is unchanged: (I−L̃)x₀ = x₀ (since Lx₀ = 0)
x₊ decays: (I−L̃)x₊ = (1−λ₊)x₊ where 1−λ₊ < 1 for positive eigenvalues λ₊
After k iterations: (I−L̃)^k x = x₀ + (1−λ₊)^k x₊ → x₀ as k → ∞.
The null space = the oversmoothing attractor. The larger and richer the null space, the more information survives at large depth.
For standard L: ker(L) = {constant functions} → dimension d per component. Oversmoothing is severe.
The Sheaf Fix: Enlarging the Null Space
For the Sheaf Laplacian Δ_F:
Theorem (Bodnar et al., 2022, Prop. 1): For any non-trivial sheaf F (not all maps identical):
Proof sketch: The constant functions satisfy F_{u▷e}x_u = F_{v▷e}x_v when x_u = x_v = c (constant) only if F_{u▷e}c = F_{v▷e}c for all edges — i.e., (F_{u▷e} − F_{v▷e})c = 0. For non-identical maps and generic constants c, this fails. So constants are NOT in ker(Δ_F) for non-trivial sheaves, and ker(Δ_F) is a different (richer) space.
Consequence: Sheaf diffusion converges to the space of global sections, not to constants. Global sections can vary across nodes (in structured ways determined by the restriction maps) while still satisfying the pairwise consistency constraints.
Quantifying the Oversmoothing Improvement
Define the discrimination ratio as the ratio of the dimension of the oversmoothing attractor to the total signal dimension:
For GCN: ρ(GCN) = 1/N per feature dimension — collapses to 1 value per N nodes.
For NSD with learned diagonal maps: ρ(NSD) can be up to d/(Nd) × K for some K dependent on the map structure — potentially much larger than 1/N.
In the best case: NSD with full-rank diagonal maps on a connected graph can achieve dim H⁰ = d (same as GCN if maps are sign-consistent) up to potentially Nd − rank(δ₀) dimensions.
The Role of Stalk Dimension d
Increasing the stalk dimension d has a direct effect on the oversmoothing attractor:
For the standard Laplacian: ker(L ⊗ I_d) = {constants}^d — dimension d per component.
For a non-trivial sheaf: dim H⁰(G, F) can scale with d in complex ways. For generic diagonal maps, dim H⁰ scales as O(d) — larger d gives a richer attractor.
Practical implication: Increasing d from 1 to 2 or 3 can substantially increase the information retained after many diffusion steps — this is one reason NSD with d=2 outperforms d=1 on deep architectures.
Depth vs Oversmoothing: Sheaf vs Standard GNNs
Why standard GNNs suffer more: The constant-function attractor has zero discriminative power for node classification — adjacent nodes of different classes become indistinguishable. The global-section attractor of sheaf GNNs can maintain inter-class distinctions if the restriction maps are learned appropriately.
Rate of Convergence to H⁰
The convergence rate is determined by the spectral gap:
where λ_gap = λ_{dim H⁰ + 1}(Δ_F^{norm}) is the smallest non-zero eigenvalue of the normalised Sheaf Laplacian.
Implications:
- Large spectral gap → fast convergence → fewer layers needed to reach H⁰
- Small spectral gap → slow convergence → many layers preserve gradient components
For optimal depth, choose K such that (1 − λ_gap)^K ≈ 0.1 — i.e., K ≈ 2/λ_gap. Graphs with small spectral gap (nearly disconnected) require many more layers.
Skip Connections and Residual Sheaf Diffusion
Even with the richer attractor of sheaf diffusion, at very large K the model still collapses to H⁰. To prevent this and enable very deep sheaf GNNs, residual connections are added:
With α < 1, the update is a convex combination of the current signal and the diffused signal. As k → ∞:
This is APPNP-style personalised PageRank applied to the Sheaf Laplacian — the solution is close to X₀ (via the (1−α) term) while being partially smoothed toward H⁰. Oversmoothing is avoided regardless of depth.
Formal Connection to Dirichlet Energy
The oversmoothing of standard GNNs can be precisely characterised by the Dirichlet energy:
As layers increase, E(H^{(k)}) → 0 — features become identical. This is the mathematical definition of oversmoothing.
For sheaf GNNs, the Sheaf Dirichlet energy:
converges to 0 as well — but this means features satisfy the sheaf consistency condition, not that they are identical. The model learns to represent the relational structure of the graph, not to erase all differences.
References
- Bodnar, C., Giovanni, F. D., Chamberlain, B. P., Liò, P., & Bronstein, M. M. (2022). Neural Sheaf Diffusion. NeurIPS 2022 (the main theoretical source for the null-space account of sheaf oversmoothing avoidance).
- Li, Q., Han, Z., & Wu, X.-M. (2018). Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Classification. AAAI 2018 (the classic oversmoothing result for GCN — the standard GNN baseline being improved upon).
- Oono, K., & Suzuki, T. (2020). Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. ICLR 2020 (formal exponential convergence rate of GNN oversmoothing — quantifies what sheaf diffusion avoids).
