Datasets and splits

These 14 node-classification benchmarks are the standard testbed for heterophily research: the regime where connected nodes tend to belong to different classes. The homophily ratio

\[ h = \frac{|\{(u,v) \in E : y_u = y_v\}|}{|E|} \]

measures the fraction of edges connecting same-class nodes. Fully homophilic graphs have \(h = 1\); heterophilic graphs have \(h \ll 1\). Methods like vanilla GCN average features over each node's neighbourhood and so implicitly assume high \(h\); Sheaf Neural Networks replace the graph Laplacian with a sheaf Laplacian \(L_\mathcal{F}\) whose restriction maps are learned per edge, so they do not depend on this assumption.
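As a worked illustration of the formula, the following minimal sketch computes \(h\) for a toy graph. The function name `edge_homophily` and the toy data are illustrative assumptions, not part of this codebase; edges are taken as a plain list of node-index pairs.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("graph has no edges; h is undefined")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Toy graph (hypothetical data): 4 nodes, 2 classes, 4 edges.
labels = [0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Edges (0, 1) and (2, 3) join same-class nodes; the other two do not.
h = edge_homophily(edges, labels)
print(h)  # 0.5
```

If a graph stores each undirected edge in both directions, both copies fall into the same category, so \(h\) is unchanged.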

Datasets are downloaded on demand into src/exp/data/ (the PyG cache). Split files for the Geom-GCN datasets are stored as .npz files in exp/splits/ and are generated by gen_splits.py on first request. One of three split strategies is selected automatically per dataset:

Geom-GCN 48/32/20

Datasets: cora (2 708 nodes, 5 429 edges, 7 classes, \(h \approx 0.81\)), citeseer (3 327 nodes, 4 732 edges, 6 classes, \(h \approx 0.74\)), chameleon (2 277 nodes, 36 101 edges, 5 classes, \(h \approx 0.23\)), squirrel (5 201 nodes, 217 073 edges, 5 classes, \(h \approx 0.22\)), cornell (183 nodes, 295 edges, 5 classes, \(h \approx 0.11\)), texas (183 nodes, 309 edges, 5 classes, \(h \approx 0.11\)), film (alias fil, 7 600 nodes, 33 544 edges, 5 classes, \(h \approx 0.22\)).

Canonical Pei et al. (2020) splits. The filename convention {dataset}_split_0.6_0.2_{fold}.npz is historical and does not reflect actual ratios; the true split is approximately 48/32/20 train/val/test.
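To make the filename caveat concrete, here is a minimal sketch that builds a split filename and measures the ratios actually stored in the file. The helpers `split_path` and `split_fractions` are hypothetical. The key names `train_mask`, `val_mask` and `test_mask` follow the original Geom-GCN release; that the files written by gen_splits.py use the same keys is an assumption. A synthetic file stands in for a real split.

```python
import tempfile
from pathlib import Path

import numpy as np


def split_path(root, dataset, fold):
    # The "0.6_0.2" in the name is historical and is not the real ratio.
    return Path(root) / f"{dataset}_split_0.6_0.2_{fold}.npz"


def split_fractions(path):
    # Assumed keys: train_mask, val_mask, test_mask (boolean or 0/1 arrays).
    with np.load(path) as f:
        masks = [f[k].astype(bool) for k in ("train_mask", "val_mask", "test_mask")]
    n = masks[0].size
    return tuple(float(m.sum()) / n for m in masks)


# Synthetic stand-in for a real split file: 100 nodes split 48/32/20.
with tempfile.TemporaryDirectory() as root:
    idx = np.arange(100)
    path = split_path(root, "toy", 0)
    np.savez(
        str(path),
        train_mask=idx < 48,
        val_mask=(idx >= 48) & (idx < 80),
        test_mask=idx >= 80,
    )
    fractions = split_fractions(path)

print(fractions)  # (0.48, 0.32, 0.2)
```

Checking the fractions this way on a real file is a quick guard against trusting the ratios implied by the filename.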

Geom-GCN filtered ~48/32/20

Datasets: chameleon_filtered, squirrel_filtered.

Platonov et al. (2023) found that the original chameleon and squirrel graphs contain large numbers of near-duplicate nodes sharing identical feature vectors, causing train/test leakage under any random split. Removing these duplicates yields the _filtered variants. Splits are embedded directly in the raw .npz files from the yandex-research release rather than generated by gen_splits.py.

Platonov 50/25/25

Datasets: amazon_ratings (\(h \approx 0.38\)), minesweeper (\(h \approx 0.68\), binary), questions (\(h \approx 0.84\), binary), roman_empire (\(h \approx 0.05\)), tolokers (\(h \approx 0.59\), binary).

Shipped with PyG HeterophilousGraphDataset. These graphs were introduced alongside the filtered variants specifically to provide large-scale, leak-free heterophily benchmarks. Split ratio is 50/25/25 train/val/test with 10 pre-defined folds.
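The automatic selection among the three strategies can be pictured as a simple lookup. The sketch below is an assumption about how such a dispatch might look, not the repository's actual code: the function `split_strategy` and the strategy labels it returns are hypothetical, while the dataset names and the `fil` alias are taken from the lists above.

```python
GEOM_GCN = {"cora", "citeseer", "chameleon", "squirrel", "cornell", "texas", "film"}
GEOM_GCN_FILTERED = {"chameleon_filtered", "squirrel_filtered"}
PLATONOV = {"amazon_ratings", "minesweeper", "questions", "roman_empire", "tolokers"}
ALIASES = {"fil": "film"}


def split_strategy(dataset):
    """Return a (hypothetical) label for the split strategy of a dataset."""
    name = ALIASES.get(dataset.lower(), dataset.lower())
    if name in GEOM_GCN:
        return "geom_gcn"           # ~48/32/20, .npz files in exp/splits/
    if name in GEOM_GCN_FILTERED:
        return "geom_gcn_filtered"  # ~48/32/20, splits embedded in the raw .npz
    if name in PLATONOV:
        return "platonov"           # 50/25/25, 10 pre-defined folds via PyG
    raise ValueError(f"unknown dataset: {dataset!r}")


print(split_strategy("squirrel_filtered"))  # geom_gcn_filtered
```

Keeping the three groups as disjoint sets makes it easy to confirm that every one of the 14 benchmarks maps to exactly one strategy.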

Metric selection

Accuracy (\(\frac{\text{correct}}{\text{total}}\)) is used for all datasets except minesweeper, tolokers, and questions, which are binary classification tasks evaluated with ROC-AUC (one-vs-rest, scikit-learn). The distinction matters because class imbalance in these three datasets makes accuracy a misleading metric; ROC-AUC is invariant to class prevalence.
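The effect of class imbalance can be seen on a toy example. The sketch below is illustrative only: `metric_for`, `accuracy` and `roc_auc` are hypothetical helpers, and the dependency-free pairwise AUC stands in for scikit-learn's `roc_auc_score`, with which it should agree for binary labels.

```python
ROC_AUC_DATASETS = {"minesweeper", "tolokers", "questions"}


def metric_for(dataset):
    # Binary, imbalanced datasets use ROC-AUC; everything else uses accuracy.
    return "roc_auc" if dataset in ROC_AUC_DATASETS else "accuracy"


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced task: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
scores = [0.1] * 10                       # uninformative constant scores
y_pred = [int(s >= 0.5) for s in scores]  # predicts the majority class everywhere

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(acc, auc)  # 0.9 0.5
```

The constant predictor reaches 90% accuracy purely from the class prior, while its ROC-AUC of 0.5 correctly marks it as no better than chance.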

Pre-fetching splits

Splits for Geom-GCN datasets are not bundled with PyG. They are generated on first request, but can also be generated ahead of training:

python -m exp.gen_splits                          # all datasets
python -m exp.gen_splits --datasets cora citeseer # subset

Platonov splits are fetched automatically by PyG on first dataset access. Running gen_splits.py does not affect them.