Datasets and splits
These 14 node-classification benchmarks are the standard testbed for heterophily research: the regime where connected nodes tend to belong to different classes. The edge homophily ratio
\[h = \frac{|\{(u, v) \in E : y_u = y_v\}|}{|E|}\]
measures the fraction of edges connecting same-class nodes, where \(E\) is the edge set and \(y_u\) is the class label of node \(u\). Fully homophilic graphs have \(h = 1\); strongly heterophilic graphs have \(h \ll 1\). Methods like vanilla GCN aggregate neighbourhood features and so implicitly assume high \(h\); Sheaf Neural Networks replace the graph Laplacian with a sheaf Laplacian \(L_\mathcal{F}\) that is learned per edge and does not rely on this assumption.
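The homophily ratio described above can be computed directly from an edge list. The following is a minimal sketch in plain Python; the function name `edge_homophily` and the toy graph are illustrative and not part of this repository. PyG also provides `torch_geometric.utils.homophily` for tensors.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a class label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Toy graph: a 4-cycle with alternating labels plus one same-class chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
labels = [0, 1, 0, 1]
h = edge_homophily(edges, labels)  # only the chord (0, 2) is same-class
```

Here one of five edges joins same-class nodes, so the toy graph is strongly heterophilic.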
Datasets are downloaded on demand into src/exp/data/ (PyG cache) and
exp/splits/ (.npz files for Geom-GCN splits, generated by
gen_splits.py on first request). Three split strategies are selected
automatically per dataset:
Datasets: cora (2 708 nodes, 5 429 edges, 7 classes, \(h \approx 0.81\)),
citeseer (3 327 nodes, 4 732 edges, 6 classes, \(h \approx 0.74\)),
chameleon (2 277 nodes, 36 101 edges, 5 classes, \(h \approx 0.23\)),
squirrel (5 201 nodes, 217 073 edges, 5 classes, \(h \approx 0.22\)),
cornell (183 nodes, 295 edges, 5 classes, \(h \approx 0.30\)),
texas (183 nodes, 309 edges, 5 classes, \(h \approx 0.11\)),
film (alias fil, 7 600 nodes, 33 544 edges, 5 classes, \(h \approx 0.22\)).
These datasets use the canonical Geom-GCN splits of Pei et al. (2020). The
filename convention {dataset}_split_0.6_0.2_{fold}.npz is historical and does
not reflect the actual ratios: the true split is approximately 48/32/20 train/val/test.
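As a rough illustration of the filename convention and the real ratios, the sketch below builds a split filename, loads the masks, and reports the fraction of nodes in each part. The helper names (`split_filename`, `load_split`, `split_ratios`) are hypothetical, and the key names `train_mask`, `val_mask`, `test_mask` are an assumption about what gen_splits.py writes. The demo uses a synthetic file so it runs without the real data.

```python
import os
import tempfile

import numpy as np


def split_filename(dataset, fold):
    # The "0.6_0.2" part is a historical label, not the real ratio.
    return f"{dataset}_split_0.6_0.2_{fold}.npz"


def load_split(splits_dir, dataset, fold):
    """Load boolean node masks. The key names are an assumption."""
    path = os.path.join(splits_dir, split_filename(dataset, fold))
    keys = ("train_mask", "val_mask", "test_mask")
    with np.load(path) as f:
        return {k: f[k].astype(bool) for k in keys}


def split_ratios(masks):
    """Fraction of all nodes that falls into each mask."""
    n = len(masks["train_mask"])
    return {k: float(m.sum()) / n for k, m in masks.items()}


# Synthetic 100-node split file with a 48/32/20 partition.
with tempfile.TemporaryDirectory() as tmp:
    idx = np.arange(100)
    np.savez(
        os.path.join(tmp, split_filename("demo", 0)),
        train_mask=idx < 48,
        val_mask=(idx >= 48) & (idx < 80),
        test_mask=idx >= 80,
    )
    masks = load_split(tmp, "demo", 0)
    ratios = split_ratios(masks)
```

Checking the ratios this way is a quick guard against reading the 0.6/0.2 in the filename as the actual split.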
Datasets: chameleon_filtered, squirrel_filtered.
Platonov et al. (2023) found that the original chameleon and squirrel
graphs contain large numbers of near-duplicate nodes sharing identical
feature vectors, causing train/test leakage under any random split.
Removing these duplicates yields the _filtered variants. Splits are
embedded directly in the raw .npz files from the yandex-research release
rather than generated by gen_splits.py.
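The leakage described above can be illustrated with a small check that flags test nodes whose feature vector also occurs in the training set. This is only a sketch of the failure mode using exact feature matches; it is not the filtering procedure of Platonov et al., and the name `leaked_test_nodes` is hypothetical.

```python
def leaked_test_nodes(features, train_idx, test_idx):
    """Return test nodes whose feature vector also appears in the training set."""
    train_features = {tuple(features[i]) for i in train_idx}
    return [i for i in test_idx if tuple(features[i]) in train_features]


# Node 2 duplicates node 0, so a random split can place the same
# feature vector on both sides of the train/test boundary.
features = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
leaked = leaked_test_nodes(features, train_idx=[0, 1], test_idx=[2, 3])
```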
Datasets: amazon_ratings (\(h \approx 0.38\)), minesweeper (\(h \approx 0.68\), binary), questions (\(h \approx 0.84\), binary), roman_empire (\(h \approx 0.05\)), tolokers (\(h \approx 0.59\), binary).
These five datasets ship with PyG's HeterophilousGraphDataset. They were
introduced alongside the filtered variants specifically to provide large-scale,
leak-free heterophily benchmarks. The split ratio is 50/25/25
train/val/test, with 10 predefined folds.
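The per-dataset selection among the three strategies above could look roughly like the following. This is a sketch, not the repository's actual dispatch code: the strategy labels and both function names are hypothetical, and the `root` default simply reuses the cache directory mentioned earlier. The PyG import is deferred so the mapping itself can be used without PyG installed.

```python
GEOM_GCN = {"cora", "citeseer", "chameleon", "squirrel", "cornell", "texas", "film"}
FILTERED = {"chameleon_filtered", "squirrel_filtered"}
PLATONOV = {"amazon_ratings", "minesweeper", "questions", "roman_empire", "tolokers"}


def split_strategy(dataset):
    """Map a dataset name to a split strategy (labels are hypothetical)."""
    if dataset in GEOM_GCN:
        return "geom_gcn"
    if dataset in FILTERED:
        return "filtered"
    if dataset in PLATONOV:
        return "platonov"
    raise ValueError(f"unknown dataset: {dataset!r}")


def load_platonov_fold(name, fold, root="src/exp/data"):
    """Load a PyG heterophilous dataset and select one of its 10 folds."""
    # Imported lazily so the mapping above works without PyG.
    from torch_geometric.datasets import HeterophilousGraphDataset

    data = HeterophilousGraphDataset(root=root, name=name)[0]
    # PyG stores the masks with one column per fold.
    train = data.train_mask[:, fold]
    val = data.val_mask[:, fold]
    test = data.test_mask[:, fold]
    return data, train, val, test
```

For example, `split_strategy("roman_empire")` returns `"platonov"`, after which `load_platonov_fold("roman_empire", fold=0)` would download the dataset on first access.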
Metric selection
Accuracy (\(\frac{\text{correct}}{\text{total}}\)) is used for all datasets except minesweeper, tolokers, and questions, which are binary classification tasks evaluated with ROC-AUC (one-vs-rest, scikit-learn). The distinction matters because these three datasets are class-imbalanced, so a model that always predicts the majority class can still score high accuracy; ROC-AUC measures how well positives are ranked above negatives and does not depend on class prevalence.
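A minimal sketch of this metric selection follows. The names `metric_name`, `accuracy`, and `evaluate` are hypothetical, not the repository's API; the ROC-AUC branch calls scikit-learn's `roc_auc_score` on positive-class scores, which is an assumption about how the scores are passed. The toy example shows why accuracy misleads on imbalanced labels.

```python
ROC_AUC_DATASETS = {"minesweeper", "tolokers", "questions"}


def metric_name(dataset):
    """Binary, imbalanced datasets use ROC-AUC; everything else uses accuracy."""
    return "roc_auc" if dataset in ROC_AUC_DATASETS else "accuracy"


def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def evaluate(dataset, y_true, y_pred=None, y_score=None):
    """y_pred holds hard labels; y_score holds positive-class scores."""
    if metric_name(dataset) == "roc_auc":
        # Imported lazily so the accuracy path works without scikit-learn.
        from sklearn.metrics import roc_auc_score

        return roc_auc_score(y_true, y_score)
    return accuracy(y_true, y_pred)


# Imbalanced toy labels: nine negatives, one positive.
y_true = [0] * 9 + [1]
# Always predicting the majority class still gets most labels right.
majority_acc = accuracy(y_true, [0] * 10)
```

The majority-class predictor reaches 90% accuracy here without identifying a single positive, whereas a constant score carries no ranking information and so gives a ROC-AUC of 0.5.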
Pre-fetching splits
Splits for Geom-GCN datasets are not bundled with PyG. They are generated on first request, but can also be generated ahead of training:
python -m exp.gen_splits # all datasets
python -m exp.gen_splits --datasets cora citeseer # subset
Platonov splits are fetched automatically by PyG on first dataset access.
Running gen_splits.py does not affect them.