Neural Sheaf Diffusion (Bodnar et al., NeurIPS 2022): Learning the Relational Geometry

6 minute read

Published:

Paper: Bodnar, C., Giovanni, F. D., Chamberlain, B. P., Liò, P., & Bronstein, M. M. (2022). Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. NeurIPS 2022.
Contribution: Learns sheaf restriction maps end-to-end via MLP predictors. Proves that learned sheaf diffusion avoids oversmoothing (non-trivial H⁰) and handles heterophily (maps can encode anti-alignment). Achieves state-of-the-art on heterophilic benchmarks at the time of publication.

The Central Idea

Hansen & Gebhart (2020) showed that fixed sheaf maps generalise GCN. The key limitation was: who specifies the maps?

NSD’s answer: learn them from data. For each edge (u, v), a small MLP predicts the restriction maps F_{u▷e} and F_{v▷e} from the features of u and v:

[F_{u▷e} | F_{v▷e}] = MLP(h_u, h_v) ∈ ℝ^{d × 2d}

The MLP output is reshaped into two d×d matrices. The maps are edge-specific and node-feature-dependent — they adapt to the relational geometry of each edge as observed in the data.

From these learned maps, the Sheaf Laplacian Δ_F is constructed, and sheaf diffusion is applied:

H^{(k+1)} = (I − Δ_F^{norm}) H^{(k)} W^{(k)}

The maps are re-predicted at each layer (or shared across layers as a hyperparameter choice).

The Full NSD Architecture

Input: Graph G, node features X₀ ∈ ℝ^{N×d}

For each layer k = 1, ..., K:
  1. Sheaf predictor: for each edge (u,v),
        [F_{u▷e} | F_{v▷e}] = MLP_k(h_u^{(k-1)}, h_v^{(k-1)})

  2. Build Δ_F^{(k)} from predicted maps (block matrix assembly)

  3. Normalise: Δ_F^{norm} = D_F^{-1/2} Δ_F D_F^{-1/2}

  4. Diffuse: H^{(k)} = (I − Δ_F^{norm}) H^{(k-1)} W^{(k)}

  5. Apply nonlinearity: H^{(k)} ← σ(H^{(k)})

Output: H^{(K)} for node classification / other downstream tasks

The weight matrix W^{(k)} ∈ ℝ^{d×d} is applied after diffusion — it allows the model to rotate/scale features in each stalk before the next layer’s map prediction.

Map Parameterisation Options

The paper studies four choices for the MLP output:

1. General maps (F_{v▷e} ∈ ℝ^{d×d}): most expressive, d² parameters per map.

2. Symmetric maps (F_{v▷e} = F_{v▷e}ᵀ): the Sheaf Laplacian blocks are symmetric matrices; used when the relational geometry is undirected.

3. Diagonal maps (F_{v▷e} = diag(f₁,…,f_d)): d parameters per map, feature-wise scaling. The Sheaf Laplacian is block-diagonal with diagonal blocks.

4. Orthogonal maps (F_{v▷e} ∈ O(d)): d(d−1)/2 parameters per map (Cayley parameterisation). Gauge-equivariant; the Connection Laplacian special case.

Design recommendation from the paper: Diagonal maps are the sweet spot — they add relational structure beyond identity maps, reduce parameter count, and remain interpretable (each feature dimension gets its own signed scaling). General maps achieve highest expressiveness but can overfit on small graphs.

Theoretical Analysis: Oversmoothing

Theorem (NSD, Sec. 4.1): For any non-degenerate sheaf F (i.e., not all restriction maps are the same), the null space ker(Δ_F) is not the space of constant functions. In particular:

dim ker(Δ_F) ≥ d and ker(Δ_F) ⊉ {constant functions}

Proof sketch: The global sections ker(δ₀) = ker(Δ_F) satisfy F_{u▷e}x_u = F_{v▷e}x_v for all (u,v,e). When the maps are not all identity, this system has solutions that are not constant — the maps encode “consistent” non-constant assignments.

Consequence: Sheaf diffusion converges to a richer subspace than constant functions. The long-time attractor of the diffusion carries task-relevant structure (when maps are learned appropriately), so “oversmoothing” converges to useful features rather than destroying them.

Theoretical Analysis: Heterophily

Theorem (NSD, Sec. 4.2): There exist restriction maps F such that the minimum Sheaf Dirichlet energy configuration is one where adjacent nodes have different features.

Proof sketch: Consider edge (u, v) where u and v have different labels. Choose F_{u▷e} and F_{v▷e} such that F_{u▷e}x_u = F_{v▷e}x_v implies x_u ≠ x_v (e.g., F_{u▷e} = I, F_{v▷e} = −I forces x_u = −x_v for consistency). The model can learn to represent heterophilic structure by learning such maps.

Informal summary: Standard GCN minimises Σ_{(u,v)∈E} h_u − h_v ². This penalises heterophilic pairs — neighbours with different features pay a high energy cost. Sheaf diffusion minimises Σ_{(u,v)∈E} F_{u▷e}h_u − F_{v▷e}h_v ². With learned maps, this can reward heterophilic pairs (F_{u▷e}x_u = F_{v▷e}x_v with x_u ≠ x_v) — the model learns that “consistent” means “different in this structured way”.

Connection to FAGCN and Signed Attention

FAGCN (Bo et al., 2021) uses signed attention: edge weights a_{uv} ∈ [−1, +1]. This is equivalent to a sheaf with scalar restriction maps: F_{u▷e} = 1, F_{v▷e} = a_{uv} (a scalar ±1 per edge).

NSD generalises this from scalar (1D) to matrix (d×d) restriction maps — enabling richer relational representations than simple sign-flipping.

Empirical Results

Heterophilic node classification benchmarks (Cornell, Texas, Wisconsin, Actor, Chameleon, Squirrel):

ModelCornellTexasWisconsinActorChameleonSquirrel
GCN57.059.551.827.359.836.9
GAT54.358.449.426.360.540.7
GPRGNN80.378.482.434.667.550.4
FAGCN79.282.482.634.964.343.8
NSD-diag83.687.685.336.869.456.5
NSD-orth85.088.486.036.270.257.1

NSD consistently outperforms all prior methods on heterophilic benchmarks, with orthogonal maps generally performing best.

Implementation Details

Stalk dimension d: Typically d=2 or d=3. The paper shows diminishing returns for d>5, with d=2 often optimal.

Sheaf predictor: Small MLP (2 layers, hidden dim 64). Shared or per-layer.

Map normalization: After predicting maps, optionally apply a softmax or normalisation to control the magnitude of restriction maps.

Computational cost: O(E·d²) for map prediction + O(N·d²) for Sheaf Laplacian application. For large d, this becomes expensive. In practice d=2 or d=3 keeps cost manageable.

Regularisation: L2 regularisation on restriction map magnitudes prevents the maps from becoming degenerate (all-zero or all-identity).

Limitations

  1. Stalk dimension d: The optimal d is task-dependent and requires tuning.
  2. Sheaf predictor expressiveness: The MLP maps only pairs (h_u, h_v) — it cannot use higher-order neighbourhood information to predict edge maps.
  3. Fixed diffusion filter: The filter (I − Δ_F^{norm}) is a fixed low-pass filter. PNSD (Zaghen et al., 2024) addresses this by making the filter polynomial and learnable.
  4. Scalability: Map prediction scales with number of edges; for dense graphs, this is expensive.

References