Global Pooling in GNNs: Mean, Sum, and Max

4 minute read

Published:

TL;DR: After message passing, a readout function aggregates all node embeddings into a single graph embedding. Mean pooling is permutation-invariant and size-normalised. Sum pooling retains count information. Max pooling captures extremes. Each has different expressivity — sum is the most expressive for distinguishing graph sizes and multisets.

The Readout Problem

A K-layer GNN produces a set of node embeddings {h^{(K)}_v : v ∈ V}. For node-level tasks (node classification, link prediction), these are used directly. For graph-level tasks (graph classification, graph regression), they must be compressed into a single vector h_G.

This compression is the readout or global pooling step. It must be:

  1. Permutation-invariant: the same graph regardless of node ordering
  2. Differentiable: end-to-end training
  3. Expressive: different graphs should map to different embeddings

Mean Pooling

h_G = (1/|V|) Σ_{v ∈ V} h^{(K)}_v

Properties:

  • Permutation-invariant: ✓
  • Normalised by graph size: yes (divides byV)
  • Sensitive to graph size: no (a graph with 10 nodes and 100 identical nodes → same embedding)
  • Captures average node behaviour

When to use: tasks where the typical node matters — e.g., average atom property in a molecule, average sentiment in a document graph.

Failure case: cannot distinguish a graph with one active node from a graph with 100 identical active nodes — mean pooling normalises out the count.

Sum Pooling

h_G = Σ_{v ∈ V} h^{(K)}_v

Properties:

  • Permutation-invariant: ✓
  • Sensitive to graph size: yes (more nodes → larger magnitude)
  • Injective over multisets of bounded node embeddings: yes (under the right conditions)
  • Captures total contribution of all nodes

When to use: tasks where the total matters — e.g., total charge of a molecule, total influence in a social network.

Expressive power: Xu et al. (GIN, 2019) proved that sum readout is strictly more expressive than mean or max for distinguishing non-isomorphic graphs. Mean collapses count information; sum preserves it.

Failure case: sensitive to graph size in ways that may not be desired — a graph with 100 zero-embedding nodes has the same sum as a graph with 0 nodes.

Max Pooling

h_G[i] = max_{v ∈ V} h^{(K)}_v[i] (elementwise maximum)

Properties:

  • Permutation-invariant: ✓
  • Captures the most prominent feature value in each dimension
  • Insensitive to count of nodes with non-maximal features

When to use: tasks where the extreme matters — e.g., is there any toxic functional group? Does any node have property X?

Failure case: cannot distinguish {1, 2} from {2} — max pooling drops information about non-maximal elements.

The multiset analogy: Think of pooling as summarising a multiset of vectors. Mean collapses {1,1,1} and {1} to the same value. Max collapses {1,2} and {2}. Sum distinguishes all three — but loses ordering (which is intended for permutation invariance).

Expressivity Ranking

For graph-level tasks requiring discrimination between non-isomorphic graphs:

Sum > Mean ≈ Max (in terms of distinguishing power)

Sum aggregation is the foundation of GIN’s graph-level expressiveness. The GIN paper proved: if node-level embeddings are injective and readout is sum, the resulting graph-level model is as expressive as 1-WL on graphs.

Combinations and Hierarchical Pooling

In practice, combining multiple pooling types often works best:

h_G = concat( mean_pool(H), sum_pool(H), max_pool(H) )

This captures average behaviour (mean), count sensitivity (sum), and extreme values (max) simultaneously.

For graphs where structure at different scales matters (molecules with atoms and functional groups, social networks with individuals and communities), hierarchical pooling — covered in DiffPool and TopK-Pool posts — is more appropriate than flat global pooling.

Summary

PoolingFormulaSensitive to SizeInformation CapturedBest For
MeanΣh / |V|NoAverage node behaviourDistribution of properties
SumΣhYesTotal + countAdditive properties
Maxmax(h)NoExtreme valuesExistence queries
Concat(all)[mean; sum; max]PartialCombinedGeneral tasks

The choice of readout is as important as the choice of message passing architecture. On graph classification benchmarks, switching from mean to sum pooling alone can change accuracy by 5-10 percentage points.

References

  • Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How Powerful are Graph Neural Networks?. ICLR 2019 (proves sum readout is strictly more expressive than mean or max).
  • Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R., & Smola, A. J. (2017). Deep Sets. NeurIPS 2017 (theory of permutation-invariant functions over sets).