Neural Sheaf Diffusion: Learning Sheaves End-to-End

5 minute read

Published:

TL;DR: NSD (Bodnar et al., 2022) jointly learns restriction maps F_{u→e} (via an MLP on node features) and performs sheaf diffusion with the resulting Sheaf Laplacian. At each layer: (1) predict restriction maps from current features; (2) build the Sheaf Laplacian; (3) diffuse. This is a principled, topology-aware alternative to standard GNNs that is theoretically grounded in algebraic topology.

The NSD Architecture

NSD has two interleaved components:

1. Sheaf Predictor (Map Learner)

Given the current node features H^{(k)}, learn restriction maps for each edge:

F^{(k)}_{u→e} = MLP_F( h^{(k)}_u, h^{(k)}_v ) for edge e = (u,v)

The MLP takes both endpoints’ features and outputs a d×d matrix (or a lower-dimensional parameterisation). The maps are computed freshly at each layer — they evolve as features evolve.

2. Sheaf Diffusion

Build the normalised Sheaf Laplacian Δ_F^{(k)} from the learned maps, then diffuse:

H^{(k+1)} = ( I - Δ_{F^{(k)}}^{norm} ) H^{(k)} W^{(k)}

Where W^{(k)} is a learnable weight matrix (same as in GCN). The diffusion step updates node features using the sheaf-aware neighbourhood aggregation.

The Full Layer

Expanding the diffusion step for node v:

h^{(k+1)}_v = h^{(k)}_v - Σ_{u ∈ N(v)} Δ_F[v,u] h^{(k)}_u W^{(k)}

Where Δ_F[v,u] = -(Δ_F^{norm}){vu} = F^T{u→e} F_{v→e} / (normalisation).

Unpacking this: each neighbour u’s features are first transformed by the learned edge maps (F_{v→e} and F_{u→e}), then used to update v. The key difference from standard GCN: the transformation is per-edge and learned, not shared across all edges.

Why NSD Handles Heterophily

On homophilic graphs: the MLP learns F_{u→e} ≈ I (identity) — equivalent to standard GCN.

On heterophilic graphs: the MLP learns F_{u→e} that rotates u’s features into a compatible space with v’s features, even when they have different label-driven directions. The “agreement” condition F_{u→e} x_u ≈ F_{v→e} x_v can be satisfied with x_u ≠ x_v — the maps accommodate difference.

Theoretical result (Bodnar et al.): On heterophilic graphs, the optimal sheaf maps align features across class boundaries such that sheaf diffusion is class-preserving — nodes in the same class converge, nodes in different classes do not. This is the opposite of standard GCN oversmoothing, which makes all nodes converge regardless of class.

Heterophily resolution: Standard GCN uses Δ_trivial = L ⊗ I_d, which pushes all neighbours to be equal. NSD uses Δ_F with learned maps, which defines "equal" to mean "equal under the sheaf transformation." By learning maps that flip the feature direction for nodes of different classes, NSD can make diffusion convergent within classes and divergent across classes — the right inductive bias for heterophilic tasks.

Connection to Other Architectures

GCN: special case with F_{u→e} = I for all edges (trivial sheaf).

GCNII: residual connection to initial features + NSD = Neural Sheaf Diffusion with residuals.

H2GCN: another heterophily-focused GNN. H2GCN separates ego and neighbour aggregations and concatenates multi-hop features. NSD is more principled (topology-grounded) but similar in spirit.

GAT: attention weights α_{uv} can be seen as learning scalar restriction maps (d=1 case). NSD generalises this to full d×d matrix maps.

Oversmoothing Under NSD

Does NSD oversmooth? The convergence of sheaf diffusion depends on the Sheaf Laplacian spectrum. If the null space of Δ_F is large (many global sections), diffusion converges to that null space — which may preserve class structure.

Key theorem: if the learned sheaf has a null space that separates node classes, infinite sheaf diffusion converges to the class-consistent subspace — not to a single constant vector. This is the fundamental advantage over standard GCN, which converges to a constant.

Computational Cost

For a graph with N nodes and E edges:

  • Restriction map prediction: O(E · d²) (one MLP call per directed edge)
  • Sheaf Laplacian construction: O(E · d²) (block matrix assembly)
  • Diffusion step: O(E · d²) (sparse block matrix-vector product)

Compared to standard GCN O(E · d) per layer, NSD is O(d) more expensive. For d=64, this is 64× more computation per layer — significant for large graphs.

Summary

StepOperationPurpose
Sheaf predictorMLP(h_u, h_v) → F_{u→e}Learn per-edge restriction maps
Laplacian constructionΔ_F = δ₀^T δ₀Build sheaf-aware operator
DiffusionH ← (I - Δ_F^{norm}) H WFeature propagation with sheaf structure
ReadoutMLP(h_v)Node classification

NSD provides a principled connection between algebraic topology (cellular sheaves) and graph neural networks — offering a theoretical explanation for why standard GNNs fail on heterophilic graphs and a mathematically grounded fix.

References

  • Bodnar, C., Giovanni, F. D., Chamberlain, B. P., Liò, P., & Bronstein, M. M. (2022). Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. NeurIPS 2022 (NSD: the full framework for learning sheaf restriction maps from data via MLP predictors and applying sheaf diffusion for node classification).
  • Hansen, J., & Gebhart, T. (2020). Sheaf Neural Networks. NeurIPS 2020 GRL+ Workshop (the foundational sheaf GNN paper that NSD extends with learned instead of fixed restriction maps).
  • Chamberlain, B. P., Rowbottom, J., Gorinova, M., Webb, S., Rossi, E., & Bronstein, M. M. (2021). GRAND: Graph Neural Diffusion. ICML 2021 (GRAND: continuous graph diffusion framing of GNNs, which NSD extends to the sheaf setting).