Oversquashing: When Too Much Information Passes Through Bottlenecks

5 minute read

Published: April 14, 2024

TL;DR: In K-layer message passing, node v's embedding must summarise information from its K-hop neighbourhood — which grows exponentially with K. If the path to a distant important node passes through a single bottleneck edge, the information from that distant node is diluted by exponentially many competing signals. This is oversquashing — distinct from oversmoothing.

Two Different Problems

Oversmoothing (too many layers → embeddings converge) and oversquashing (long-range info is lost at bottlenecks) are often confused. They are distinct:

	Oversmoothing	Oversquashing
Cause	Iterated averaging → feature collapse	Exponential neighbourhood growth → info compression
Affects	Nearby nodes most	Distant nodes most
More layers	Makes it worse	Would help (more hops) but also squashes more
Root mechanism	Low-pass filtering	Information bottleneck
Graph structure involved	Dense, connected graphs	Narrow bottleneck edges

The Exponential Growth Problem

In a K-layer MPNN, node v’s embedding h^{(K)}_v depends on all nodes within K hops. The K-hop neighbourhood of v can be exponentially large — for a tree-like graph with degree d, it has ~d^K nodes.

All information from these d^K nodes must be compressed into a single vector h^{(K)}_v of fixed dimension d_hidden. As K grows:

|N^K(v)| ~ d^K nodes → compressed into h^{(K)}_v ∈ ℝ^{d_hidden}

The information from any single distant node u contributes a vanishingly small fraction: ~1/d^K of the total. Even if u’s feature is critical for predicting v’s label, it is drowned out.

The Jacobian Analysis

Alon & Yahav (2021) formalised oversquashing via the Jacobian:

∂h^{(K)}_v / ∂x_u

This matrix measures how sensitive v’s K-layer embedding is to u’s input feature. If u is K hops from v, this involves K matrix multiplications:

||∂h^{(K)}_v / ∂x_u|| ≤ C · (1/d)^K · [product of weight norms]

For nodes far apart (large K, bottleneck edges with large fan-in degree d), this norm becomes exponentially small. Gradients of the loss with respect to x_u also vanish — the model cannot learn that u matters for v’s prediction.

The bottleneck analogy: Imagine a wide river (large neighbourhood) flowing through a narrow gorge (a single bottleneck edge connecting two parts of the graph). Most water (information) cannot pass through efficiently. The node on the other side of the gorge receives only a tiny, heavily compressed signal from the vast neighbourhood upstream.

Where Oversquashing Is Severe

Oversquashing is worst when:

The path between relevant nodes is long (diameter » number of layers)
Bottleneck edges connect high-degree subtrees — many nodes compete through a single edge
The graph has tree-like structure (few cycles, exponential neighbourhood growth)

Real examples where this matters:

Molecular property prediction: computing HOMO-LUMO gap requires whole-molecule reasoning; bottleneck edges are single bonds connecting large fragments
Social network influence: influence travels through single bridges between communities
Traffic forecasting: a road closure (bottleneck) affects distant nodes but the effect is diluted through many competing paths

Measuring Oversquashing

The sensitivity score

∂h_v^{(K)}/∂x_u

measures how much node u influences node v after K layers. Plotting this for all pairs (u,v) reveals which edges are bottlenecks.

Alternatively: the commute time between nodes (expected random walk length) — high commute time between u and v means information struggles to flow between them.

Solutions: Graph Rewiring

Graph rewiring adds or removes edges to reduce bottlenecks:

SDRF (Stochastic Discrete Ricci Flow): adds edges based on Ollivier-Ricci curvature — edges with negative curvature are bottlenecks
DIGL: adds edges between nodes with high personalized PageRank similarity
CurvDrop: removes edges with high negative curvature (bottlenecks) and adds long-range connections

Other approaches:

Global attention (Graph Transformers): bypasses all bottlenecks — every node attends to every node directly
APPNP: personalized PageRank allows distant information to flow via many paths simultaneously
Virtual node: add a single virtual node connected to all other nodes, providing a global communication channel

Curvature and Oversquashing

Topping et al. (2022) connected oversquashing to Ricci curvature. An edge (u,v) has negative Ollivier-Ricci curvature when its endpoints have few common neighbours — the edge is a bottleneck between two otherwise disconnected regions.

Adding edges at negative-curvature bottlenecks (SDRF) provably reduces oversquashing. This connects graph geometry (curvature) to information flow (oversquashing) — a beautiful theoretical result.

Summary

Property	Value
Root cause	Exponential neighbourhood growth + fixed embedding dimension
Formal measure	Jacobian	∂h_v^{(K)}/∂x_u	→ 0 exponentially
Worst cases	Long paths, tree-like graphs, single bottleneck bridges
Effect	Distant relevant information lost; gradient vanishes
Solution 1	Graph rewiring (add/remove edges)
Solution 2	Global attention (Graph Transformers)
Solution 3	Virtual nodes (global communication channel)
Relation to oversmoothing	Distinct: affects different nodes, different depth regime

Oversquashing explains why GNNs fail on long-range reasoning tasks even when depth is not the bottleneck. Solving it requires either changing the graph (rewiring) or bypassing message passing altogether (Graph Transformers).

References

Alon, U., & Yahav, E. (2021). On the Bottleneck of Graph Neural Networks and Its Practical Implications. ICLR 2021.
Topping, J., Di Giovanni, F., Chamberlain, B. P., Dong, X., & Bronstein, M. M. (2022). Understanding over-squashing and Bottlenecks on Graphs via Curvature. ICLR 2022.
Di Giovanni, F., Giusti, L., Barbero, F., Maschi, G., Lio, P., & Bronstein, M. M. (2023). On Over-Squashing in Message Passing Neural Networks: The Impact of Width, Depth, and Topology. ICML 2023.

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Alessio Borgi

Oversquashing: When Too Much Information Passes Through Bottlenecks

Two Different Problems

The Exponential Growth Problem

The Jacobian Analysis

Where Oversquashing Is Severe

Measuring Oversquashing

Solutions: Graph Rewiring

Curvature and Oversquashing

Summary

References

Share on

You May Also Enjoy

Open Problems and Future Directions in Sheaf Neural Networks

Sheaf GNNs for Molecular Property Prediction

Sheaf GNNs for Heterophilic Node Classification: Benchmarks and Results

Sheaves Meet Attention: Transformer-Inspired Sheaf Models