Sheaves and Heterophily: A Complete Theoretical Account

7 minute read

Published:

TL;DR: GCN's aggregation minimises the Dirichlet energy E(H) = Σ_{(u,v)} ||h_u − h_v||², penalising all feature differences across edges. On heterophilic graphs (adjacent nodes of different class), this forces features to become similar — exactly wrong. Sheaf diffusion minimises the Sheaf Dirichlet energy E_F(H) = Σ_{(u,v)} ||F_{u▷e}h_u − F_{v▷e}h_v||², which — with learned restriction maps — penalises inconsistency with the relational structure, not raw feature differences. Adjacent nodes of different classes can be consistent (zero sheaf energy) while being different.

The Homophily Assumption in GCN

Homophily: the tendency of nodes to connect to nodes of the same class. The homophily ratio h(G) ={(u,v)∈E : y_u=y_v}/Equantifies this.

GCN’s loss: minimising node classification cross-entropy implicitly minimises a combination of:

  1. Task loss: predicting the right labels
  2. Smoothness penalty: the Dirichlet energy E(H) = Σ_{(u,v)} h_u − h_v ²

The smoothness penalty penalises any feature difference across edges — whether the edge connects same-class or different-class nodes. On homophilic graphs (most edges connect same-class nodes), this is a useful regulariser. On heterophilic graphs, it is actively harmful: the model is penalised for correctly representing that different-class neighbours should have different features.

Formal statement: For a GCN with L layers and normalised aggregation, the learned representations H^{(L)} minimise:

L_{task}(H^{(L)}) + λ · Σ_{k=1}^{L} E(H^{(k)})

where λ is determined by the learning rate and layer depth. The Dirichlet energy term forces all-neighbour feature similarity.

The Heterophily Challenge: What the Energy Imposes

On a bipartite heterophilic graph (alternating class labels, every edge crosses class boundaries):

E(H) = Σ_{(u,v)∈E} ||h_u − h_v||² = 0 ⟺ h_u = h_v ∀(u,v)∈E ⟺ H is constant

The only zero-energy signal is the constant — but different classes should have different representations. So GCN is forced to trade off task accuracy against smoothness, and cannot achieve both.

What a good model should do on a heterophilic graph: Different-class pairs should have maximally different features; same-class pairs should have similar features. This is the opposite of what the standard Dirichlet energy encodes.

How Sheaf Maps Encode Heterophily

For a heterophilic edge (u, v) with different class labels y_u ≠ y_v, the sheaf restriction maps can encode the “correct” relational structure:

Example: scalar sheaf with negative map

Choose F_{u▷e} = +1 and F_{v▷e} = −1 (or vice versa). Then the consistency condition is:

F_{u▷e} x_u = F_{v▷e} x_v ⟺ x_u = −x_v

Adjacent nodes of opposite class should have opposite feature values. The global section is the antipodal assignment — not a constant. The Sheaf Dirichlet energy is zero when x_u = −x_v, even though x_u ≠ x_v.

Example: orthogonal map with 180° rotation

Choose F_{u▷e} = I and F_{v▷e} = −I (negation). Global sections satisfy x_u = −x_v — same as above, but in d dimensions.

More generally: choose F_{v▷e} = R for any rotation R. Global sections then satisfy x_u = Rᵀ x_v — adjacent nodes’ features are related by a rotation, not equal. This represents heterophilic graphs where the relational structure is a structured rotation between class representations.

The General Heterophily Theorem

Theorem (Bodnar et al., 2022, Theorem 1, informal): For any desired node assignment x* where x_u ≠ x_v for heterophilic edges (u,v), there exist restriction maps F such that x* is a global section of F (i.e., x* ∈ ker(Δ_F)).

Proof: For each heterophilic edge e = (u,v), choose F_{u▷e} and F_{v▷e} such that F_{u▷e}x_u = F_{v▷e}xv. This is always satisfiable (e.g., by setting F{u▷e} = I and F_{v▷e} = (x_v)⁻¹ x_u when these are invertible). □

Interpretation: Any desired feature assignment — including one that maximally separates different-class nodes — can be made a global section of some sheaf. NSD learns these maps from data, finding the restriction maps that make the task-optimal features lie in ker(Δ_F).

Key implication: When NSD learns restriction maps, it is simultaneously learning (1) the "correct" relational geometry between adjacent nodes and (2) the space of globally consistent features. If the maps are learned correctly, the task-optimal features are global sections, and sheaf diffusion converges to them naturally — no matter how many layers are used.

Why Signed Attention Is Not Enough

FAGCN (Bo et al., 2021) uses signed attention a_{uv} ∈ [−1, +1], which allows negative edge weights. This is the scalar sheaf case: F_{u▷e} = 1, F_{v▷e} = a_{uv}.

The scalar sheaf can represent edge-wise sign information (same or opposite direction), but cannot represent rotational relationships between features. For d-dimensional features, all d channels must agree or disagree uniformly — there is no per-channel or rotational heterophily representation.

Diagonal sheaf maps (as in NSD-diag) allow d independent signs per edge — different feature channels can have different relationship types simultaneously.

General/orthogonal sheaf maps allow arbitrary rotational relationships — representing the full d-dimensional relational geometry.

Necessary and Sufficient Conditions for Sheaf Heterophily

Necessary condition: For sheaf diffusion to correctly classify heterophilic graphs, the global sections of the learned sheaf must include the task-optimal node features (or a transformation of them that is linearly separable).

Sufficient condition: If the MLP sheaf predictor can represent the restriction maps that make the task-optimal features consistent (the theorem above guarantees such maps exist), and if the learning algorithm finds them, then sheaf diffusion converges to representations that correctly classify heterophilic nodes.

In practice: NSD with diagonal maps has been empirically shown to find approximately correct maps on most heterophilic benchmarks — the maps encode the class-conditional relational geometry automatically.

Comparison with Other Heterophily Methods

MethodKey mechanismLimitation
H2GCN (Zhu et al., 2020)Ego + neighbourhood featuresFeature engineering, no principled framework
GPRGNN (Chien et al., 2021)Learnable polynomial in LCan use high-frequency components but not richer relational structure
FAGCN (Bo et al., 2021)Signed edge attention (scalar sheaf)Scalar only — all channels treated uniformly
NSDLearned d×d restriction mapsFull relational geometry, requires choosing d
PNSDLearned maps + polynomial filterMaps + spectral flexibility

The sheaf framework provides the most principled treatment of heterophily — it explains why heterophily is hard for GCN (wrong null space) and how to fix it (change the null space via restriction maps).

Empirical Validation

On Cornell (h = 0.11, highly heterophilic): 85% accuracy for NSD vs 57% for GCN. The gap is entirely explained by the heterophily account above — GCN’s Dirichlet energy penalises the correct class-conditional feature assignment, while NSD’s Sheaf Dirichlet energy (with learned maps) rewards it.

References