PolyNSD: Polynomial Neural Sheaf Diffusion

Published: May 26, 2026

TL;DR: Neural Sheaf Diffusion is powerful but expensive and numerically fragile. PolyNSD replaces repeated diffusion with a stable polynomial filter on the sheaf Laplacian, keeping the geometry while making training cheaper and more robust.

Paper: "Polynomial Neural Sheaf Diffusion" · arXiv:2512.00242
Authors: A. Borgi, P. Liò
Venue: arXiv preprint, 2025

Paper preview: first page of "Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves" (Borgi, 2025).

Why This Paper Exists

PolyNSD starts from a practical frustration: Neural Sheaf Diffusion is powerful, but it is harder to train and scale than it should be. The original formulation asks the model to repeatedly build and normalise a sheaf diffusion operator with dense restriction maps and expensive matrix machinery. That is a strong theoretical framework, but not always a friendly engineering one.

This paper asks a sharper question: can we keep the geometric benefits of sheaf diffusion while making the propagation rule behave more like a stable, interpretable spectral GNN?

Background: Neural Sheaf Diffusion

A Sheaf Neural Network enriches a graph with a cellular sheaf: each node and edge gets a vector space (a stalk), and each endpoint of each edge gets a restriction map encoding how node signals relate to edge signals. The sheaf Laplacian encodes this relational geometry and replaces the standard graph Laplacian in the diffusion operator.
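To make the construction concrete, here is a minimal NumPy sketch of how a sheaf Laplacian can be assembled from restriction maps through the coboundary operator δ, with Δ_F = δᵀδ. The function name `sheaf_laplacian`, its argument layout, and the edge orientation convention are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def sheaf_laplacian(num_nodes, edges, maps, d):
    """Assemble the sheaf Laplacian as delta.T @ delta.

    edges: list of (u, v) node index pairs.
    maps:  list of (F_u, F_v) pairs, each a (d, d) array that restricts
           the stalk of u (resp. v) to the stalk of the edge.
    """
    delta = np.zeros((len(edges) * d, num_nodes * d))
    for e, ((u, v), (F_u, F_v)) in enumerate(zip(edges, maps)):
        rows = slice(e * d, (e + 1) * d)
        # (delta x)_e = F_v x_v - F_u x_u: disagreement measured in the edge stalk
        delta[rows, u * d:(u + 1) * d] = -F_u
        delta[rows, v * d:(v + 1) * d] = F_v
    return delta.T @ delta

# With identity maps and d = 1 this reduces to the ordinary graph Laplacian
# of the path 0-1-2.
I1 = np.eye(1)
L_path = sheaf_laplacian(3, [(0, 1), (1, 2)], [(I1, I1), (I1, I1)], d=1)
print(L_path)
```

Building the Laplacian as δᵀδ guarantees it is symmetric and positive semi-definite, which is what allows the spectrum to be rescaled into a fixed interval later on.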

Neural Sheaf Diffusion (NSD) — the dominant sheaf GNN approach — learns restriction maps end-to-end and runs diffusion on the sheaf Laplacian. It handles heterophily well and resists oversmoothing, but has three practical problems:

SVD-based normalisation: requires an expensive singular value decomposition (SVD) when normalising the sheaf Laplacian at every layer, making Laplacian rebuilds slow.
Dense restriction maps: one d × d matrix per node-edge pair, scaling quadratically with stalk dimension d.
Brittle gradients: the normalised sheaf Laplacian construction is numerically unstable for large d, leading to gradient issues.

Figure: Left, the first four Chebyshev polynomials T₀ (flat, blue), T₁ (linear, teal), T₂ (parabola, amber), T₃ (cubic, purple). Right, the PolyNSD filter as a learned convex combination of these basis polynomials; the orange curve shows an example high-pass shape that emphasises high-frequency, heterophily-relevant components. The coefficients sum to 1 for stability.

The Main Design Choice

PolyNSD takes the perspective of spectral GNNs seriously: instead of repeatedly applying a fragile diffusion operator layer after layer, it learns a polynomial filter directly on the normalised sheaf Laplacian. That means the network can shape the frequency response explicitly, while keeping the computation sparse and stable.

The PolyNSD Fix

PolyNSD replaces the NSD propagation operator with a degree-K polynomial of a spectrally rescaled normalised sheaf Laplacian, evaluated via a stable three-term Chebyshev recurrence.

This gives:

Explicit K-hop receptive field in a single layer (independently of the stalk dimension d).
Trainable spectral response as a convex mixture of K+1 orthogonal polynomial basis responses — the model learns which frequency components to amplify or suppress.
No SVD needed: the recurrence only requires sparse matrix-vector products.
Stability via convex mixtures (coefficients sum to 1) + spectral rescaling to [−1, 1] + residual/gated paths.
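The recurrence behind these properties can be sketched in a few lines. The version below uses dense NumPy arrays for brevity; in practice the Laplacian would be stored sparsely so that each step is a sparse matrix-vector product. The names `rescale` and `chebyshev_filter` are illustrative, and the rescaling (2/λ_max)·Δ − I is the standard Chebyshev convention, assumed here rather than taken from the paper's code.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def rescale(L, lam_max):
    """Map the spectrum of L from [0, lam_max] to [-1, 1]."""
    return (2.0 / lam_max) * L - np.eye(L.shape[0])

def chebyshev_filter(L_tilde, x, alpha):
    """Return sum_k alpha[k] * T_k(L_tilde) @ x via the three-term recurrence."""
    t_prev = x                       # T_0(L~) x = x
    out = alpha[0] * t_prev
    if len(alpha) == 1:
        return out
    t_curr = L_tilde @ x             # T_1(L~) x = L~ x
    out = out + alpha[1] * t_curr
    for a_k in alpha[2:]:
        # T_k = 2 L~ T_{k-1} - T_{k-2}: one matrix-vector product per degree
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + a_k * t_curr
    return out

# Sanity check against an explicit eigendecomposition on a 3-node path.
L = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
L_tilde = rescale(L, lam_max=3.0)
x = np.array([1.0, -2.0, 0.5])
alpha = np.array([0.2, 0.3, 0.5])
w, U = np.linalg.eigh(L_tilde)
reference = U @ (chebval(w, alpha) * (U.T @ x))
print(np.allclose(chebyshev_filter(L_tilde, x, alpha), reference))
```

The recurrence never forms a matrix power or a decomposition; the eigendecomposition appears only in the check, which is exactly the cost the recurrence avoids.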

Figure 1: The PolyNSD pipeline lifts node features into stalk spaces, learns restriction maps to build the sheaf Laplacian, rescales the spectrum to a stable range, and then applies a Chebyshev polynomial filter with a gated residual correction. The important point is that diffusion is no longer a fragile repeated operator; it becomes a controlled spectral module with an explicit receptive field and better numerical behaviour.

Architecture Overview

The full architecture is deliberately clean. Node features are lifted into stalk spaces, diffusion is performed through polynomial filtering on the sheaf Laplacian, and the output head reads the result back for prediction. That simplicity is part of the contribution: the model becomes easier to reason about than earlier sheaf pipelines with heavier normalisation machinery.

Diagonal Restriction Maps

The key parameter-reduction insight: diagonal restriction maps (a vector of d scalars per node-edge pair instead of a d × d matrix) are sufficient for strong performance. This reduces per-edge parameter count from O(d²) to O(d) and decouples performance from large stalk dimensions.
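A quick count illustrates the O(d²) versus O(d) gap. The sketch below counts restriction-map entries only, assuming two maps per edge (one per endpoint); it does not count the weights of whatever network predicts those entries. The helper names are hypothetical.

```python
import numpy as np

def restriction_params(num_edges, d, diagonal):
    """Number of scalars in the restriction maps: two maps per edge."""
    per_map = d if diagonal else d * d
    return 2 * num_edges * per_map

def diagonal_edge_block(f_u, f_v):
    """Off-diagonal Laplacian block -F_u^T F_v for F_u = diag(f_u), F_v = diag(f_v)."""
    return np.diag(-f_u * f_v)

for d in (2, 4, 8):
    dense = restriction_params(1000, d, diagonal=False)
    diag = restriction_params(1000, d, diagonal=True)
    # The dense-to-diagonal ratio grows linearly with the stalk dimension d.
    print(d, dense, diag, dense // diag)
```

With diagonal maps the Laplacian blocks are themselves diagonal, so they reduce to elementwise products and never require a d × d matrix multiplication.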

The Practical Win

This is where the paper becomes especially useful. Many sheaf models implicitly suggest that more expressive geometry requires larger dense restriction maps. PolyNSD shows that this is often the wrong tradeoff. If the spectral filter is doing the right global work, the local maps can stay lightweight and still capture the anisotropic behavior that matters.

Key Insight — why diagonal maps are sufficient once the polynomial filter is strong: There is a natural division of labour in PolyNSD. The polynomial filter on the sheaf Laplacian handles global spectral shaping — deciding which frequency components of the graph signal to amplify or suppress across the entire graph. The restriction maps handle local relational structure — encoding how each node's features relate to each adjacent edge. Once the polynomial filter does the global work, the local maps only need to encode directionality and sign, not full rotational geometry. Diagonal maps (a vector of d scalars per node-edge pair) capture directional anisotropy without needing a full d×d matrix. The polynomial handles the global; the diagonal map handles the local. Splitting the task this way reduces parameters from O(d²) to O(d) per edge with negligible accuracy loss.

Why Diagonal Maps Are Enough

This is one of the paper’s most useful empirical findings. Earlier sheaf models tended to assume that expressive sheaf learning required large dense restriction matrices. PolyNSD shows that this is often unnecessary: once the spectral filter itself is strong enough, diagonal maps can already encode the right anisotropic behaviour while being much cheaper to train and much less numerically delicate.

Concrete 3-Node Chebyshev Recurrence Example

Consider a path graph A–B–C with d=1 stalks (scalar features) and diagonal restriction maps. Let F_{A→e_AB} = diag(1) = 1, F_{B→e_AB} = diag(−1) = −1 on edge AB, and F_{B→e_BC} = diag(1) = 1, F_{C→e_BC} = diag(1) = 1 on edge BC. Initial node features: x = [x_A, x_B, x_C]ᵀ.

Step 1 — Build the sheaf Laplacian block. For a path A–B–C with these maps, the (unnormalised) sheaf Laplacian is:

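Δ_F =
[ 1   1   0 ]   (row A: incident to edge AB only)
[ 1   2  −1 ]   (row B: incident to both edges)
[ 0  −1   1 ]   (row C: incident to edge BC only)

Each diagonal entry sums F_{v→e}ᵀF_{v→e} over the edges incident to v, giving 1, 2 and 1. Each off-diagonal entry is −F_{u→e}ᵀF_{v→e}: for edge AB this is −(1)(−1) = 1, and for edge BC it is −(1)(1) = −1.

The eigenvalues of Δ_F are 0, 1 and 3, so spectral rescaling to [−1, 1] uses λ_max = 3 and gives the rescaled operator Δ̃_F = (2/λ_max)·Δ_F − I = (2/3)·Δ_F − I.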
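The rescaling step for this 3-node example can be checked numerically. The snippet assumes the standard Chebyshev rescaling Δ̃ = (2/λ_max)·Δ − I; the variable names are illustrative.

```python
import numpy as np

# Sheaf Laplacian of the path A-B-C with the maps from the example.
delta_F = np.array([[1.,  1.,  0.],
                    [1.,  2., -1.],
                    [0., -1.,  1.]])

eigvals = np.linalg.eigvalsh(delta_F)      # returned in ascending order
lam_max = eigvals[-1]

# Scale by 2 / lam_max, then shift by the identity.
delta_tilde = (2.0 / lam_max) * delta_F - np.eye(3)
rescaled_eigvals = np.linalg.eigvalsh(delta_tilde)

print(eigvals)
print(rescaled_eigvals)
```

Up to floating-point error the eigenvalues are 0, 1 and 3, and after rescaling they land at −1, −1/3 and 1, so the whole spectrum sits inside the interval on which Chebyshev polynomials are bounded by 1 in absolute value.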

Step 2 — Chebyshev recurrence with K=2. The three Chebyshev basis evaluations are:

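T₀(Δ̃_F) x = x = [x_A, x_B, x_C]ᵀ (identity, 0-hop: each node sees only itself)
T₁(Δ̃_F) x = Δ̃_F x = (2/3)·Δ_F x − x = [(−x_A + 2x_B)/3, (2x_A + x_B − 2x_C)/3, (−2x_B − x_C)/3]ᵀ (1-hop: each node sees its direct neighbours)
T₂(Δ̃_F) x = 2·Δ̃_F·T₁(Δ̃_F) x − T₀(Δ̃_F) x (2-hop: each node sees its 2-hop neighbourhood)

Here Δ̃_F = (2/3)·Δ_F − I, using λ_max = 3, and the unscaled product is Δ_F x = [x_A + x_B, x_A + 2x_B − x_C, −x_B + x_C]ᵀ.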

Step 3 — 2-hop receptive field in one layer. The T₂ term gives node A access to information from node C (2 hops away) in a single PolyNSD layer with K=2. In NSD, reaching A from C requires 2 stacked message-passing layers (C→B in layer 1, B→A in layer 2). PolyNSD reaches the same 2-hop receptive field in one layer because the Chebyshev recurrence composes multi-hop aggregations algebraically rather than by stacking layers. For K=3, node A would see 3 hops with a single filter evaluation. This is the key efficiency gain: a degree-K filter in one layer covers the receptive field of K stacked NSD layers, while the restriction maps and the Laplacian are built only once.
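The receptive-field claim can be verified on a slightly longer graph. The sketch below uses a 5-node path with identity restriction maps and d = 1, so the sheaf Laplacian is the ordinary graph Laplacian; this is a simplifying assumption for illustration. It records the farthest node whose signal reaches node 0 through each T_k.

```python
import numpy as np

n = 5
L = np.zeros((n, n))
for u in range(n - 1):              # path 0-1-2-3-4
    L[u, u] += 1.0
    L[u + 1, u + 1] += 1.0
    L[u, u + 1] -= 1.0
    L[u + 1, u] -= 1.0

lam_max = np.linalg.eigvalsh(L)[-1]
L_tilde = (2.0 / lam_max) * L - np.eye(n)

def chebyshev_matrices(L_tilde, K):
    """Return [T_0(L~), ..., T_K(L~)] built with the three-term recurrence."""
    mats = [np.eye(len(L_tilde)), L_tilde.copy()]
    for _ in range(2, K + 1):
        mats.append(2.0 * L_tilde @ mats[-1] - mats[-2])
    return mats[:K + 1]

# Farthest node index with a nonzero entry in row 0 of each T_k.
reach = [int(np.flatnonzero(np.abs(T[0]) > 1e-12).max())
         for T in chebyshev_matrices(L_tilde, 3)]
print(reach)
```

Row 0 of T_k has nonzero entries only up to node k, so the degree of the polynomial sets the hop radius directly.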

Learned weights example. With K=2, PolyNSD learns weights [α₀, α₁, α₂] that form a convex mixture: each weight is non-negative and they sum to 1. For a homophilic graph, the model might learn [0.6, 0.3, 0.1], dominated by T₀, which keeps the output close to the input signal. For a heterophilic graph like Cornell, it might learn [0.1, 0.3, 0.6], dominated by T₂, whose response oscillates across the spectrum and so amplifies differences between nodes. This spectral flexibility is what makes PolyNSD work well on both homophilic and heterophilic benchmarks with a single architecture.
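One simple way to keep the coefficients convex during training is to pass unconstrained logits through a softmax. That parameterisation is an assumption here (the text above only states that the weights are convex and sum to 1), and the names `convex_weights` and `response` are illustrative.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def convex_weights(logits):
    """Softmax: non-negative weights that sum to 1."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def response(alpha, lam_tilde):
    """Filter response sum_k alpha[k] * T_k(lam) at rescaled eigenvalues lam."""
    return chebval(lam_tilde, alpha)

lam_tilde = np.linspace(-1.0, 1.0, 201)          # rescaled spectrum
alpha = convex_weights(np.array([0.5, -1.0, 2.0]))
h = response(alpha, lam_tilde)

# Every |T_k| is at most 1 on [-1, 1], so a convex mixture cannot
# amplify any spectral component beyond magnitude 1.
print(alpha, np.max(np.abs(h)))
```

This bound is the stability argument in miniature: with the spectrum confined to [−1, 1] and convex weights, the filter's gain never exceeds 1, whatever the logits are.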

Results

Figure 2: Influence decay versus hop distance on Minesweeper, comparing NSD and PolyNSD variants. The plot shows the mechanism behind PolyNSD’s stability: polynomial variants retain meaningful medium-range signal for longer, while the standard NSD curves collapse much faster as hop distance grows. This is exactly what you want from a sheaf model that should mix information beyond the immediate neighbourhood without becoming numerically brittle.

Figure 3: Influence decay versus hop distance on Roman Empire, comparing NSD and PolyNSD variants. The benchmark tells a similar story in a heterophilic regime: PolyNSD keeps the long-range influence profile substantially flatter, which means information can still travel across structurally distant but label-relevant nodes. That matters because heterophily is exactly where overly local message passing tends to fail.

Figure 4: Influence decay versus hop distance on Amazon Ratings, comparing NSD and PolyNSD variants. The polynomial filters again preserve signal over larger hop distances than their NSD counterparts. Read these curves as a sanity check on propagation: the learned filter is not just more accurate, it is shaping propagation in a way that better matches the graph’s long-range structure.

Key results vs. NSD and spectral GNN baselines:

New SOTA on both homophilic (Cora, CiteSeer, PubMed) and heterophilic (Texas, Film, Wisconsin) benchmarks — inverting the NSD trend that required large stalk dimensions for heterophilic gains.
Diagonal maps + small d match or exceed NSD with dense maps + large d.
Lower runtime and memory: no SVD, sparse recurrence, small stalk dimensions.
Spectral filter shape is interpretable: the model learns when to apply low-pass (homophilic) vs. high-pass (heterophilic) filters.

Why the Result Is Interesting Beyond This Paper

PolyNSD is more than a performance bump over NSD. It suggests a better recipe for future sheaf models: keep the geometric inductive bias, but move expensive expressivity away from fragile local parameterisations and into stable global filtering mechanisms. That is a useful design lesson whether the next step is node classification, heterophily, or more general geometric deep learning.

Why This Paper Matters

PolyNSD is important because it makes sheaf GNNs more usable. It preserves the geometric advantages of sheaf diffusion, but removes several implementation bottlenecks that previously made these models expensive or unstable. In practice, that is what turns a promising theory into something researchers can run, compare, and build on.

✅ Key Takeaways

PolyNSD replaces the NSD diffusion operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, evaluated via a stable three-term recurrence.
Diagonal restriction maps are sufficient — decoupling performance from stalk dimension and reducing parameters from O(d²) to O(d) per edge.
Stable by design: convex mixture coefficients + spectral rescaling + residual paths prevent gradient collapse.
SOTA on homo- and heterophilic benchmarks with lower runtime and memory than NSD.

Alessio Borgi
