Sheaf GNNs for Heterophilic Node Classification: Benchmarks and Results

5 minute read

Published:

TL;DR: NSD and PNSD achieve state-of-the-art on all major heterophilic benchmarks (Cornell, Texas, Wisconsin, Chameleon, Squirrel, Actor) as of 2022–2024. The gains are largest on datasets with low homophily ratio (Cornell h=0.11, Texas h=0.11) and smallest on datasets where high-pass filtering alone is sufficient (Actor h=0.22). Stalk dimension d=2 is optimal on most datasets; d>5 rarely helps and increases overfitting.

The Heterophilic Benchmark Suite

The standard heterophilic node classification benchmarks (from Pei et al., 2020 and Lim et al., 2021):

DatasetNodesEdgesClassesHomophily hNotes
Cornell18329550.11WebKB university pages
Texas18330950.11WebKB university pages
Wisconsin25149950.20WebKB university pages
Actor7,60030,01950.22Film actor co-occurrence
Chameleon2,27736,10150.23Wikipedia chameleon pages
Squirrel5,201217,07350.22Wikipedia squirrel pages
Roman-Empire22,66265,854180.05Wikipedia Roman Empire article
Amazon-Ratings24,492186,10050.38Amazon product ratings
Homophily ratio h ={(u,v): y_u=y_v}/E— lower means more heterophilic.

Comprehensive Results: All Methods on All Datasets

Results from NSD (Bodnar et al., 2022) and PNSD (Zaghen et al., 2024):

Cornell (h=0.11): | Model | Accuracy | |—|—| | GCN | 57.0 ± 4.7 | | GraphSAGE | 67.3 ± 7.8 | | GAT | 54.3 ± 7.2 | | APPNP | 73.5 ± 6.4 | | GPRGNN | 80.3 ± 8.1 | | FAGCN | 79.2 ± 9.8 | | H2GCN | 82.7 ± 5.3 | | NSD-diag | 83.6 ± 4.2 | | NSD-orth | 85.0 ± 4.8 | | PNSD-orth | 87.8 ± 3.9 |

Squirrel (h=0.22): | Model | Accuracy | |—|—| | GCN | 36.9 ± 1.3 | | GAT | 40.7 ± 1.6 | | GPRGNN | 50.4 ± 1.5 | | FAGCN | 43.8 ± 1.4 | | NSD-diag | 56.5 ± 1.2 | | NSD-orth | 57.1 ± 1.3 | | PNSD-orth | 60.8 ± 1.1 |

What Drives the Performance Gains?

1. Null space adaptation: NSD learns maps that make the task-optimal features lie in ker(Δ_F). For Cornell and Texas (h=0.11), this requires strong anti-alignment maps (F_{v▷e} ≈ −I) for heterophilic edges — the model learns these automatically.

2. Spectral flexibility (PNSD): On Chameleon and Squirrel, PNSD learns high-pass filters — amplifying the high-frequency components that differentiate between classes. This cannot be achieved by NSD’s fixed (I−Δ_F) filter.

3. Stalk dimension: Larger d → richer null space → better representations. But d>5 often overfits on small datasets (Cornell has only 183 nodes).

Stalk Dimension Ablation

On Cornell, varying stalk dimension d:

dNSD-diagNSD-orth
178.979.2
283.685.0
383.985.3
483.484.7
882.183.8

Peak at d=2 or d=3 — larger d brings diminishing returns and higher overfitting risk for small graphs.

On larger datasets (Chameleon, Squirrel): Peak d is higher (d=4 or d=5) because the larger training set supports higher-capacity models.

Practical recommendation: Start with d=2. Try d=3 and d=4. Use cross-validation on the validation set to select d. Do not exceed d=5 without explicit regularisation (L2 on maps, dropout on stalks).

Map Type Ablation

On heterophilic benchmarks, comparing map types:

Map typeCornellChameleonSquirrel
Scalar (sign only)80.263.550.1
Diagonal83.669.456.5
Symmetric84.170.157.0
Orthogonal85.070.257.1
General (d²)84.469.856.9

Orthogonal maps achieve the best results, likely because gauge equivariance prevents degenerate maps and stabilises training. General maps can sometimes overfit (especially on small graphs like Cornell).

When Do Sheaves Win Decisively?

Sheaf GNNs outperform all non-sheaf methods when:

  1. Low homophily (h < 0.25): heterophily is the dominant challenge
  2. Many edges per node: the sheaf has enough information to learn meaningful restriction maps
  3. Moderate graph size (100–10K nodes): enough data for map learning, small enough to keep cost manageable

When Do Sheaves Not Help?

Sheaf GNNs offer small or no benefit when:

  1. High homophily (h > 0.6): Cora, Citeseer, Pubmed — standard GCN already works well
  2. Very small graphs (N < 100): insufficient training data for map learning
  3. Very large graphs (N > 100K): sheaf map prediction over all edges is expensive; approximate methods needed
  4. No edge features or node features: the sheaf predictor MLP needs input features to learn meaningful maps

New Benchmarks: Roman-Empire and Amazon-Ratings

Lim et al. (2021) introduced more reliable heterophilic benchmarks:

Roman-Empire (h=0.05): 22,662 nodes, 65,854 edges, Wikipedia text data. Very low homophily. NSD achieves 73.5% vs GCN’s 59.4%.

Amazon-Ratings (h=0.38): 24,492 nodes, 186,100 edges. Moderately heterophilic. NSD achieves 48.7% vs GCN’s 42.3%.

These larger, more realistic benchmarks confirm that sheaf GNNs provide consistent gains on heterophilic data, even at scales where the original WebKB datasets (Cornell, Texas, Wisconsin) were too small to draw reliable conclusions.

Practical Implementation Checklist

For a new heterophilic dataset:

  1. Compute homophily ratio h. If h > 0.6: use standard GCN first.
  2. Choose d=2 initially, map type = diagonal.
  3. Use 2–3 sheaf layers (deeper is rarely better with sheaves).
  4. Add a skip connection: H ← (1−α)(I−Δ_F^{norm})HW + αX₀W’ (residual to initial features).
  5. Regularise: L2 on map norms (λ=1e-4), dropout on stalk features (p=0.5).
  6. If performance saturates: try orthogonal maps, d=3, or PNSD with K=5 polynomial filter.

References