DiffPool: Learning Hierarchical Graph Pooling
Published:
The Limitation of Flat Global Pooling
Global mean/sum/max pooling jumps directly from N node embeddings to a single graph embedding. For graphs with hierarchical structure โ like molecules (atoms โ functional groups โ whole molecule) or social networks (people โ communities โ factions) โ this skips all intermediate scales.
CNNs solve this with hierarchical pooling (max-pool after each conv layer). DiffPool brings the same idea to graphs, with the key challenge: graph pooling must be permutation-invariant and must handle variable numbers of nodes.
The DiffPool Architecture
DiffPool processes each pooling level l with two GNNs:
1. Embedding GNN โ computes node embeddings:
2. Pooling GNN โ computes cluster assignments:
S^{(l)} โ โ^{N_l ร k_l} is the soft assignment matrix: S[i,j] is the probability that node i belongs to cluster j. softmax is applied row-wise.
Coarsening: compute the next levelโs adjacency and embeddings:
The new adjacency A^{(l+1)} is the cluster-to-cluster connectivity โ how much two clusters share edges through their constituent nodes.
Why Soft Assignment?
Hard assignment (each node assigned to exactly one cluster) would require argmax โ not differentiable. Soft assignment (each node fractionally assigned to all clusters) allows end-to-end gradient flow.
The assignment S^{(l)} is learned jointly with the rest of the network. The model discovers which nodes should be clustered together โ without any external supervision on the clustering.
Auxiliary Loss Terms
DiffPool adds two regularisation losses to guide clustering quality:
Link prediction loss: encourages nodes connected by an edge to be assigned to the same cluster:
If S^{(l)} S^{(l)T} approximates A^{(l)}, clusters are graph-connected.
Entropy loss: encourages each nodeโs assignment to be concentrated (not uniformly spread):
Where H is entropy. Low entropy โ sharp cluster assignment โ more interpretable clusters.
Computational Cost
DiffPool runs two GNNs at each level. For a graph with N nodes coarsened to k clusters:
- Forward pass: O(Nยฒ d) for the dense assignment matrix
- Memory: O(Nยฒ) โ DiffPool operates on dense adjacency matrices
This makes DiffPool quadratic in N โ practical for graphs with hundreds of nodes (molecules), but not for social networks or knowledge graphs with millions of nodes.
When DiffPool Helps
DiffPool outperforms flat pooling when:
- The task requires multi-scale understanding: molecular property prediction benefits from atom-level and functional-group-level representations simultaneously
- Graphs have natural hierarchical structure: trees, clustered communities, hierarchical molecules
- The graph is small enough: typically โค 1000 nodes
It is a standard strong baseline on the TUDataset graph classification benchmarks (PROTEINS, MUTAG, IMDB, RDT).
Limitations
- Quadratic memory: no scaling to large graphs
- Fixed number of clusters: must specify k^{(l)} per level before training
- No guarantee of meaningful clusters: the auxiliary losses help but do not force semantically meaningful groupings
- Sensitive to the number of pooling levels: too many levels โ over-compression; too few โ flat pooling
Summary
| Component | Role |
|---|---|
| GNN_{embed} | Compute node representations at this scale |
| GNN_{pool} | Learn soft cluster assignments |
| S^T Z | Aggregate node embeddings into cluster embeddings |
| S^T A S | Coarsen adjacency to cluster graph |
| L_{LP} + L_E | Encourage connectivity-aligned, sharp clusters |
DiffPool introduced the idea of learned hierarchical pooling for graphs โ differentiable, end-to-end, and structure-aware. Its quadratic complexity limits scale, but for small-graph tasks (molecules, proteins), it remains a reference architecture.
References
- Ying, R., You, J., Morris, C., Ren, X., Hamilton, W. L., & Leskovec, J. (2018). Hierarchical Graph Representation Learning with Differentiable Pooling. NeurIPS 2018.
- Simonovsky, M., & Komodakis, N. (2018). Dynamic Edge-Conditioned Filters in Convolutional Neural Networks on Graphs. CVPR 2017.
