Set2Set and Attention Readout: Order-Invariant Graph Summaries
Published:
Beyond Uniform Pooling
Mean and sum pooling treat all nodes identically. But for most tasks, nodes differ greatly in importance:
- In a molecule, the reactive functional group matters more than inert backbone atoms
- In a social network, hubs matter more than peripheral nodes
- In a citation graph, landmark papers matter more than derivative works
Attention readout learns these importance differences during training.
Attention Readout (Global Attention Pooling)
For each node v, compute a scalar importance score:
Normalise scores with softmax:
Compute the graph embedding as a weighted sum:
This is a single-pass soft attention over all nodes. The model learns which nodes to weight highly for the specific prediction task.
Properties:
- Permutation-invariant (softmax and weighted sum are unordered)
- Differentiable: all operations are smooth
- Task-conditioned: α_v depends on h_v which encodes local neighbourhood
Limitation: each node is scored independently. The attention over node v does not account for what other nodes contribute — the weights are computed in isolation.
Set2Set (Vinyals et al., 2015)
Set2Set produces a graph embedding using T steps of LSTM-driven attention. At each step t, the LSTM maintains a query vector q_t, which is used to compute attention over all nodes:
Step t:
Update the LSTM:
After T steps, the final graph embedding is:
(concatenation of LSTM hidden state and final attended message)
Set2Set vs Attention Readout vs Sum
| Property | Sum | Attention Readout | Set2Set |
|---|---|---|---|
| Weights nodes uniformly | Yes | No | No |
| Learns importance | No | Yes (independently) | Yes (iteratively) |
| Multiple passes over nodes | No | No | Yes (T passes) |
| Output dimension | d | d | 2d |
| Complexity | O(N d) | O(N d) | O(T N d) |
| Permutation-invariant | Yes | Yes | Yes |
When Set2Set Helps
Set2Set is particularly effective when:
- Graph-level prediction requires integrating information from multiple disjoint node subsets
- Different “aspects” of the graph matter for the prediction (Set2Set reads each in turn)
- The graph size varies widely across the dataset (attention readout adapts better than fixed pooling)
On molecular benchmarks (QM9 for molecular property prediction), Set2Set significantly outperforms mean/sum pooling and slightly outperforms single-pass attention.
Multi-head Attention Readout
A simpler extension of attention readout: compute K independent attention heads, each with its own gate MLP:
Each head learns to attend to a different subset of important nodes. This gives multi-aspect graph summarisation without the LSTM overhead of Set2Set.
Summary
| Method | Core idea | Strength |
|---|---|---|
| Sum/Mean | Uniform aggregation | Simple, fast |
| Attention readout | Learned per-node weights | Task-adaptive |
| Set2Set | LSTM queries node set T times | Rich multi-pass summary |
| Multi-head attention | Multiple independent attention pools | Balanced expressiveness/cost |
For small graphs (molecules, proteins), Set2Set and multi-head attention provide meaningful improvements over flat pooling. For large graphs, the O(T N d) cost of Set2Set may be prohibitive, making single-pass attention readout the preferred choice.
References
- Vinyals, O., Bengio, S., & Kudlur, M. (2015). Order Matters: Sequence to Sequence for Sets. ICLR 2016 (Set2Set).
- Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. (2016). Gated Graph Sequence Neural Networks. ICLR 2016 (global attention readout).
- Gilmer, J., Schütt, K. T., Matera, G., Deisenroth, M. P., & Müller, K.-R. (2017). Neural Message Passing for Quantum Chemistry. ICML 2017 (uses Set2Set for molecular property prediction).
