Polynomial Neural Sheaf Diffusion (Zaghen et al., ICLR 2024): Learnable Spectral Filters

6 minute read

Published:

Paper: Zaghen, O., Quak, M., & Bronstein, M. M. (2024). Polynomial Neural Sheaf Diffusion. ICLR 2024.
Contribution: Replaces NSD's fixed (I − Δ_F) filter with a learnable polynomial in Δ_F. Addresses NSD's spectral rigidity while retaining all its topological structure. New state-of-the-art on heterophilic benchmarks.

The Limitation of NSD’s Fixed Filter

NSD’s diffusion step is:

H^{(k+1)} = (I − Δ_F^{norm}) H^{(k)} W^{(k)}

The filter h(λ) = 1 − λ is fixed — it always attenuates frequencies in proportion to their eigenvalue. In the eigenbasis of Δ_F^{norm}, with eigenvalues in [0, 2]:

  • λ = 0 (global sections): kept with weight 1
  • λ = 1: attenuated to 0
  • λ = 2 (maximum frequency): kept with weight −1 (phase-flipped)

This is a fixed “tent-shaped” low-pass filter. For homophilic graphs, this is good: low-frequency signals (smooth across nodes) are the useful ones. But for heterophilic graphs, high-frequency signals (alternating across edges) are often more discriminative — and NSD’s filter partially attenuates them.

The gap NSD leaves: NSD handles heterophily by learning restriction maps that make high-frequency signals low-frequency with respect to Δ_F (so they survive the filter). But this is indirect — it relies on the maps doing all the work. PNSD directly addresses the filter, letting it be high-pass or band-pass when beneficial, without relying entirely on the maps.

The PNSD Architecture

PNSD replaces the fixed filter with a polynomial of degree K in Δ_F:

p(Δ_F) = Σ_{k=0}^{K} a_k Δ_F^k

where a_k ∈ ℝ are learnable scalar coefficients (or vector coefficients for multi-channel filtering).

The PNSD layer:

H^{(ℓ+1)} = σ( p^{(ℓ)}(Δ_F) · H^{(ℓ)} · W^{(ℓ)} ) = σ( (Σ_{k=0}^{K} a_k^{(ℓ)} Δ_F^k) H^{(ℓ)} W^{(ℓ)} )

The filter profile h(λ) = Σ_k a_k λ^k can be any polynomial of degree K — low-pass, high-pass, band-pass, or any shape.

This is computed without explicitly forming or diagonalising Δ_F. Each Δ_F^k x is computed by k applications of the sparse Δ_F operator, making the cost O(K · E · d²) — the same as running K NSD layers.

Bernstein Polynomial Basis

Directly parameterising {a_k} can lead to numerical instability (high-degree polynomials oscillate wildly). PNSD uses the Bernstein polynomial basis instead:

p(λ) = Σ_{k=0}^{K} θ_k · B_k^K(λ/2)

where B_k^K(x) = C(K,k) x^k (1−x)^{K−k} are Bernstein basis polynomials evaluated on [0,1] (scaling λ from [0,2] to [0,1]).

Advantages of Bernstein basis:

  • B_k^K(x) ≥ 0 for x ∈ [0,1], so the filter profile is a convex combination of basis polynomials
  • Coefficients θ_k have direct visual interpretability: θ_k is (approximately) the filter value at frequency λ = 2k/K
  • Numerically stable for K ≤ 10
  • Natural constraints (e.g., monotone filters) can be imposed via constraints on θ_k

The Bernstein basis was introduced for graph spectral filtering by BernNet (He et al., 2021).

Filter Profile Analysis

With K=5 Bernstein coefficients, PNSD can represent:

Task typeLearned filter profileMechanism
HomophilyLow-pass (θ_k decreasing in k)Smooth out node differences
HeterophilyHigh-pass or band-passAmplify inter-class differences
MixedNon-monotone polynomialTask-specific spectral shaping

On heterophilic datasets (Chameleon, Squirrel, Actor), the paper shows that PNSD learns high-pass filters — confirming that heterophilic graphs require amplifying high-frequency signals.

Comparison with BernNet and GPRGNN

GPRGNN (Chien et al., 2021) and BernNet (He et al., 2021) also learn polynomial filters on the standard graph Laplacian. PNSD’s distinction: the polynomial is applied to the Sheaf Laplacian Δ_F, not the graph Laplacian L.

ModelFilter operatorMapsBenefit
GPRGNNPolynomial in LNone (standard L)Spectral flexibility
BernNetBernstein poly in LNone (standard L)Stable high-degree poly
NSDFixed (I − Δ_F)Learned sheaf mapsTopological structure
PNSDBernstein poly in Δ_FLearned sheaf mapsBoth

PNSD strictly subsumes NSD (by setting K=1, θ₀=1, θ₁=0 → p(λ) = 1 − λ) and BernNet-on-sheaves (by using Δ_F instead of L).

Sheaf Map Learning in PNSD

The restriction maps are learned the same way as in NSD — via a per-edge MLP:

[F_{u▷e} | F_{v▷e}] = MLP(h_u, h_v)

PNSD also inherits NSD’s map type choices (scalar, diagonal, orthogonal, general). In experiments, diagonal maps with the Bernstein polynomial filter achieve the best accuracy-vs-cost tradeoff.

The key interaction: the polynomial filter acts on the fixed Δ_F computed from the learned maps. So both the maps and the filter coefficients are end-to-end trained jointly.

Empirical Results

Node classification on heterophilic benchmarks:

ModelCornellTexasWisconsinChameleonSquirrelActor
GCN57.059.551.859.836.927.3
NSD-diag83.687.685.369.456.536.8
NSD-orth85.088.486.070.257.136.2
PNSD-diag86.589.287.572.159.338.4
PNSD-orth87.890.188.673.460.838.0

PNSD consistently improves over NSD, with larger gains on datasets where high-pass filtering is important (Chameleon, Squirrel).

Homophilic Performance

A concern with high-pass-capable models is regression on homophilic datasets. PNSD avoids this: the learnable filter automatically selects low-pass behaviour when that is optimal (the coefficients a_k concentrate on low frequencies).

On Cora, Citeseer, Pubmed: PNSD matches or slightly exceeds NSD, confirming the polynomial filter does not hurt on homophilic tasks.

Theoretical Properties

Theorem (PNSD): For any target filter h: [0,2] → ℝ, there exist Bernstein coefficients {θ_k}_{k=0}^K such that the PNSD filter approximates h with error O( h’’ _∞ / K²) (by the Bernstein approximation theorem).

This means: as K increases, PNSD can approximate any continuous spectral filter applied to the Sheaf Laplacian. The sheaf structure (via Δ_F) and the spectral filter (via p) are both learned end-to-end.

Limitations and Future Directions

  1. K scaling: Higher K means more expressive filters but more FLOPs (K applications of Δ_F). In practice K=5 or K=10 is used.
  2. Map-filter interaction: The maps and filter are jointly learned but interact in complex ways — the training landscape has multiple equilibria.
  3. Node-level filter: The current polynomial uses scalar coefficients (same filter for all feature channels). Per-channel or per-node polynomial filters could improve expressiveness further.

References