Depth in GNNs: Why Deeper Is Not Always Better
Depth in GNNs Is Different
In Transformers, adding layers increases representational depth: each layer refines the global representation with no structural penalty. Transformers scale to 96+ layers with consistent improvement.
In GNNs, each layer has a dual effect:
- Positive: expands the receptive field by one hop (more context)
- Negative: smooths features by averaging neighbours (more oversmoothing)
These effects fight each other. The optimal depth depends on which dominates.
What Depth Buys: Receptive Field
With K GNN layers, node v aggregates information from its K-hop neighbourhood. For node classification, the useful depth K* is the number of hops that contains task-relevant information.
For homophilic datasets (Cora, CiteSeer): most task-relevant context is at 1-2 hops. Beyond that, the neighbourhood is dominated by same-label nodes that add little information while increasing oversmoothing.
For long-range tasks (predicting molecular properties from whole-molecule context): you need K ≥ graph diameter, which is 10+ for medium-sized molecules. This immediately conflicts with oversmoothing.
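To make the receptive-field argument concrete, here is a minimal sketch in plain Python. The 12-node chain is a made-up stand-in for a long molecular backbone, and the helper names (`hop_distances`, `receptive_field`, `diameter`) are illustrative, not taken from any library.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def receptive_field(adj, source, k):
    """Nodes whose features can reach `source` after k message-passing layers."""
    return {v for v, d in hop_distances(adj, source).items() if d <= k}


def diameter(adj):
    """Longest shortest path in a connected graph (brute force, small graphs only)."""
    return max(max(hop_distances(adj, s).values()) for s in adj)


# Hypothetical toy graph: a chain of 12 nodes, 0 - 1 - 2 - ... - 11.
n = 12
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

# Receptive field size of node 0 for K = 1, 2, 4, 11 layers.
field_sizes = [len(receptive_field(chain, 0, k)) for k in (1, 2, 4, 11)]
print(field_sizes)      # [2, 3, 5, 12]
print(diameter(chain))  # 11
```

Node 0 only sees the far end of the chain once K reaches the diameter (11 here), which is the regime where oversmoothing becomes a problem.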
The Empirical Depth Cliff
On standard GNN benchmarks, accuracy as a function of depth follows a characteristic pattern:
| Layers | 1 | 2 | 3 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|---|
| GCN | 75% | 82% | 80% | 76% | 58% | 42% | 25% |
| GAT | 76% | 83% | 81% | 78% | 61% | 44% | 28% |
(Illustrative values on Cora-style datasets.) Performance peaks at 2-3 layers, then drops dramatically. At 32 layers, the model fails catastrophically: oversmoothing has collapsed all node distinctions.
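The mechanism behind the collapse can be reproduced in a few lines once learned weights and nonlinearities are stripped away. The sketch below is a toy under stated assumptions, not a GCN: each "layer" only averages a node with its neighbours, the graph is a hypothetical pair of triangles joined by one edge, and scalar features of +1 and -1 mark the two groups.

```python
def mean_aggregate(adj, x):
    """One weight-free layer: each node averages itself and its neighbours."""
    return {
        v: (x[v] + sum(x[u] for u in adj[v])) / (1 + len(adj[v]))
        for v in adj
    }


def spread(x):
    """Gap between the largest and smallest node value."""
    return max(x.values()) - min(x.values())


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5}, joined by edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
x = {0: 1.0, 1: 1.0, 2: 1.0, 3: -1.0, 4: -1.0, 5: -1.0}

# Record how distinguishable the nodes remain at selected depths.
spreads = {}
for layer in range(1, 33):
    x = mean_aggregate(adj, x)
    if layer in (1, 2, 4, 8, 16, 32):
        spreads[layer] = spread(x)

for layer, s in spreads.items():
    print(f"{layer:>2} layers: spread = {s:.3f}")
```

The spread shrinks roughly geometrically with depth, so after 32 layers the two groups are almost indistinguishable. In a real GCN, learned weights and nonlinearities change the details, but the same averaging operator is applied at every layer.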
The Depth Dilemma
This creates an uncomfortable trade-off:
- Task requires long-range context → need many layers
- Many layers → oversmoothing → performance collapse
- Few layers → undershooting the diameter → missing distant context
For graphs with large diameter (long molecules, social networks, knowledge graphs), standard GNNs are caught in this dilemma with no good resolution.
Architectural Solutions for Deep GNNs
GCNII (Chen et al., 2020)
GCNII adds two modifications to enable 64-layer GCNs:
- Initial residual: a skip connection to the initial features X at every layer, weighted by α.
- Identity mapping: each weight matrix is blended with the identity matrix, weighted by β.
α controls how much initial feature signal is retained; β controls how close the effective weight matrix stays to the identity. Together, they mitigate both oversmoothing and vanishing gradients.
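For reference, the two modifications combine into a single layer update, restated here from Chen et al. (2020). The symbols are the paper's rather than this post's: $\tilde{P}$ is the symmetrically normalised adjacency matrix with self-loops, $H^{(0)}$ is the initial representation computed from X, $\sigma$ is the nonlinearity, and $\alpha_\ell$, $\beta_\ell$ are per-layer versions of α and β.

$$
H^{(\ell+1)} = \sigma\Big( \big( (1-\alpha_\ell)\,\tilde{P}\,H^{(\ell)} + \alpha_\ell\,H^{(0)} \big)\,\big( (1-\beta_\ell)\,I + \beta_\ell\,W^{(\ell)} \big) \Big),
\qquad \tilde{P} = \tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}
$$

Setting $\alpha_\ell = 0$ and $\beta_\ell = 1$ recovers a plain GCN layer. The paper decays $\beta_\ell$ with depth, so deeper layers stay closer to the identity.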
GCNII achieves competitive accuracy with 64 layers on Cora, a depth at which a plain GCN collapses.
JK-Net (Jumping Knowledge, Xu et al., 2018)
JK-Net uses all intermediate representations, not just the last layer:

$$h_v^{\text{final}} = \mathrm{AGG}\big(h_v^{(1)}, h_v^{(2)}, \dots, h_v^{(K)}\big)$$

where AGG is concatenation, max-pooling, or an LSTM. Each node's final embedding includes information from all receptive field sizes simultaneously. Oversmoothing in deep layers is offset by the sharp early-layer representations.
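The layer-wise aggregation can be sketched in plain Python. This is a simplified illustration, not the JK-Net implementation: layers are weight-free neighbourhood averages, features are scalars, the graph is a hypothetical pair of triangles, and the LSTM aggregator is left out.

```python
def mean_aggregate(adj, h):
    """One weight-free layer: each node averages itself and its neighbours."""
    return {
        v: (h[v] + sum(h[u] for u in adj[v])) / (1 + len(adj[v]))
        for v in adj
    }


def jumping_knowledge(adj, h0, num_layers, mode="concat"):
    """Run `num_layers` layers and combine every layer's output per node."""
    layers, h = [], h0
    for _ in range(num_layers):
        h = mean_aggregate(adj, h)
        layers.append(h)
    if mode == "concat":
        # One entry per layer: early (sharp) and late (smooth) views side by side.
        return {v: [layer[v] for layer in layers] for v in adj}
    if mode == "max":
        # Max over layers; with scalar features this is a single number per node.
        return {v: max(layer[v] for layer in layers) for v in adj}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5}, joined by edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
h0 = {0: 1.0, 1: 1.0, 2: 1.0, 3: -1.0, 4: -1.0, 5: -1.0}

jk_concat = jumping_knowledge(adj, h0, num_layers=16, mode="concat")
jk_max = jumping_knowledge(adj, h0, num_layers=16, mode="max")

# Gap between nodes 0 and 5 in the first and the last layer's representation.
first_gap = abs(jk_concat[0][0] - jk_concat[5][0])
last_gap = abs(jk_concat[0][-1] - jk_concat[5][-1])
print(f"layer 1 gap: {first_gap:.3f}, layer 16 gap: {last_gap:.3f}")
```

The last layer alone barely separates nodes 0 and 5, but the concatenated embedding still carries the layer-1 entry where they are far apart, so a downstream classifier can still tell them apart.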
APPNP
As discussed in the APPNP post: separate transformation from propagation. The teleport probability α keeps each node anchored to its own features even after 20 propagation steps, preventing oversmoothing.
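A minimal sketch of the propagation step, under simplifying assumptions: the MLP that produces per-node predictions is omitted (the +1/-1 values stand in for its output), the graph is a hypothetical pair of triangles, and plain neighbourhood averaging replaces the symmetric normalisation used in the paper. Setting `alpha=0.0` gives ordinary repeated averaging for comparison.

```python
def mean_aggregate(adj, z):
    """Average each node with its neighbours (stand-in for the normalised adjacency)."""
    return {
        v: (z[v] + sum(z[u] for u in adj[v])) / (1 + len(adj[v]))
        for v in adj
    }


def appnp_propagate(adj, h, alpha=0.1, steps=20):
    """Iterate z <- (1 - alpha) * aggregate(z) + alpha * h, starting from z = h."""
    z = dict(h)
    for _ in range(steps):
        smoothed = mean_aggregate(adj, z)
        # Teleport term: pull every node back towards its own initial value.
        z = {v: (1 - alpha) * smoothed[v] + alpha * h[v] for v in adj}
    return z


def spread(x):
    """Gap between the largest and smallest node value."""
    return max(x.values()) - min(x.values())


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5}, joined by edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
h = {0: 1.0, 1: 1.0, 2: 1.0, 3: -1.0, 4: -1.0, 5: -1.0}

anchored = spread(appnp_propagate(adj, h, alpha=0.1, steps=20))
plain = spread(appnp_propagate(adj, h, alpha=0.0, steps=20))
print(f"with teleport: {anchored:.3f}, without: {plain:.3f}")
```

With the teleport term, the iteration settles at a fixed point that still separates the two groups. Without it, 20 steps of averaging have already erased most of the difference.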
DropEdge
Randomly drop a fraction of edges during each training step. This reduces the averaging effect per layer, slowing oversmoothing. Analogous to Dropout for edges.
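A minimal sketch of the sampling step in plain Python, assuming an undirected edge list and a fixed drop rate; `drop_edge` and `to_adjacency` are illustrative helper names, not a library API. The full graph would still be used at evaluation time.

```python
import random


def drop_edge(edges, drop_rate, rng):
    """Return a random subset of undirected edges with `drop_rate` of them removed."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    num_drop = int(drop_rate * len(edges))
    return rng.sample(edges, len(edges) - num_drop)


def to_adjacency(num_nodes, edges):
    """Build a symmetric adjacency list from undirected edges."""
    adj = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5}, joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
rng = random.Random(0)

for epoch in range(3):
    # Resample the graph at every training step; the model sees a different
    # sparsified neighbourhood each time.
    kept = drop_edge(edges, drop_rate=0.3, rng=rng)
    adj = to_adjacency(6, kept)
    # ... one training step on `adj` would go here ...
    print(epoch, sorted(kept))
```

Each node averages over fewer neighbours per step, so repeated aggregation mixes features more slowly, and the resampling acts as a form of data augmentation on the graph.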
The Depth vs Width Trade-off
An alternative to depth: make each layer wider (more hidden dimensions). In practice, going from d=64 to d=256 with 2-3 layers often outperforms using d=64 with 8-16 layers: wider layers better capture local structure without oversmoothing.
The GNN community is moving toward:
- Shallow local GNNs (2-4 layers) for node-level tasks
- Decoupled propagation designs (APPNP, SGC) for medium range
- Graph Transformers or GPS (Transformer + local MPNN) for long-range tasks
Summary
| Depth | Effect | Recommendation |
|---|---|---|
| 1-2 layers | Minimal smoothing, limited context | Good baseline |
| 3-4 layers | At or just past the peak on most homophilic benchmarks | Default choice |
| 8-16 layers | Oversmoothing dominates; needs residuals | Use GCNII, JK-Net |
| 32+ layers | Near-impossible without special design | GCNII or abandon local MPNN |
GNN depth scaling does not follow the same scaling laws as Transformer depth. Understanding this, and which architectural tricks restore the benefit of depth, is central to modern GNN design.
References
- Chen, M., Wei, Z., Huang, Z., Ding, B., & Li, Y. (2020). Simple and Deep Graph Convolutional Networks. ICML 2020 (GCNII).
- Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K., & Jegelka, S. (2018). Representation Learning on Graphs with Jumping Knowledge Networks. ICML 2018 (JK-Net).
- Rong, Y., Huang, W., Xu, T., & Huang, J. (2020). DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. ICLR 2020.
