Depth in GNNs: Why Deeper Is Not Always Better


Published: April 16, 2024

TL;DR: Each GNN layer expands the receptive field by one hop, which looks beneficial. But each layer also averages over neighbours, and repeated averaging causes oversmoothing. On common homophilic benchmarks, the best depth is typically 2-4 layers. Deep GNNs (8+ layers) need architectural changes (GCNII, JK-Net, APPNP) to avoid performance collapse.

Depth in GNNs Is Different

In Transformers, adding layers increases representational depth: each layer refines a representation that already attends over the whole input, so extra depth carries no graph-structural penalty. Large language models use 96 or more layers.

In GNNs, each layer has a dual effect:

Positive: expands the receptive field by one hop (more context)
Negative: smooths features by averaging neighbours (more oversmoothing)

These effects fight each other. The optimal depth depends on which dominates.

What Depth Buys: Receptive Field

With K GNN layers, node v aggregates information from its K-hop neighbourhood. For node classification, the useful depth K* is the number of hops that contains task-relevant information.

For homophilic datasets (Cora, CiteSeer): most task-relevant context is at 1-2 hops. Beyond that, the neighbourhood is dominated by same-label nodes that add little information while increasing oversmoothing.

For long-range tasks (predicting molecular properties from whole-molecule context): for every node to see every other node you need K ≥ graph diameter, which can be 10 or more for medium-sized molecules. This immediately conflicts with oversmoothing.
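To make the receptive-field claim concrete, here is a minimal sketch that computes the K-hop neighbourhood of a node with breadth-first search. The function name `k_hop_neighbourhood` and the toy path graph are assumptions for illustration, not part of any library.

```python
from collections import deque

def k_hop_neighbourhood(adj, source, k):
    """Return the set of nodes within k hops of `source` (including itself)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # nodes at distance k are not expanded further
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

# Hypothetical toy graph: a path 0-1-2-3-4-5, whose diameter is 5.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

for k in (1, 2, 5):
    print(k, sorted(k_hop_neighbourhood(path, 0, k)))
```

On this path, node 0 only reaches node 5 once K equals the diameter, which is the situation long-range tasks put a message-passing GNN in.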

The Empirical Depth Cliff

On standard GNN benchmarks, accuracy as a function of depth follows a characteristic pattern:

Layers:   1     2     3     4     8     16    32
GCN:      75%   82%   80%   76%   58%   42%   25%
GAT:      76%   83%   81%   78%   61%   44%   28%

(Illustrative values on Cora-style datasets.) Performance peaks at 2 layers, holds up at 3, then drops sharply. At 32 layers, the model fails catastrophically: oversmoothing has collapsed almost all distinctions between nodes.
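The mechanism behind the cliff can be reproduced in a few lines. The sketch below is a deliberate simplification: it applies only mean aggregation (no learned weights, no nonlinearity) on a hypothetical 8-node ring, and tracks how much node features still differ from each other. The helper names are made up for this example.

```python
import numpy as np

def mean_aggregation_matrix(A):
    """Row-normalised adjacency with self-loops: each node averages itself and its neighbours."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def feature_spread(H):
    """Standard deviation across nodes, averaged over feature dimensions."""
    return float(H.std(axis=0).mean())

# Hypothetical toy graph: a ring of 8 nodes.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

P = mean_aggregation_matrix(A)
rng = np.random.default_rng(0)
H = rng.normal(size=(n, 4))  # random initial node features

spread = {}
for step in range(1, 33):
    H = P @ H  # one round of neighbour averaging
    if step in (1, 2, 4, 8, 16, 32):
        spread[step] = feature_spread(H)
        print(f"{step:2d} steps: spread across nodes = {spread[step]:.5f}")
```

The spread shrinks with every step and is close to zero after 32 steps: all nodes end up with nearly the same feature vector, so no classifier on top can tell them apart.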

The Depth Dilemma

This creates an uncomfortable trade-off:

Task requires long-range context → need many layers
Many layers → oversmoothing → performance collapse
Few layers → undershooting the diameter → missing distant context

For graphs with large diameter (long molecules, social networks, knowledge graphs), standard GNNs are caught in this dilemma with no good resolution.

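Why CNNs don't have this problem: In image CNNs, each layer also expands the receptive field. But a convolution filter assigns a separate learned weight to each position in its window, so it can detect edges and other patterns rather than only average. A 32-layer CNN does not collapse pixel values to a uniform grey; it detects increasingly abstract patterns. GCN-style aggregation is different: all neighbours share the same weight matrix and are combined by a normalised average, which acts as a low-pass filter. Applied repeatedly, it removes the differences between neighbouring nodes instead of abstracting them.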

Architectural Solutions for Deep GNNs

GCNII (Chen et al., 2020)

GCNII adds two modifications to enable 64-layer GCNs:

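Initial residual: a skip connection to the initial representation H^{(0)} (computed from the input features X) at every layer
Identity mapping: the identity matrix is mixed into each weight matrix, so each layer's transform stays close to the identity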

H^{(k+1)} = σ( ((1−α) S̃ H^{(k)} + α H^{(0)}) ((1−β) I + β W^{(k)}) )

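Here S̃ is the normalised adjacency matrix with self-loops and σ is a nonlinearity such as ReLU. α controls how much initial feature signal is retained; β controls how close the effective weight matrix stays to identity. Together, they counteract both oversmoothing and vanishing gradients.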

GCNII achieves competitive accuracy with 64 layers on Cora, a depth at which a plain GCN collapses.
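A minimal NumPy sketch of one GCNII layer, following the update rule above with ReLU as σ. The function names, the toy path graph, the random untrained weights, and the values alpha = 0.1 and lam = 0.5 are assumptions for illustration; the decaying schedule β = log(lam / l + 1) follows the GCNII paper. This is a forward pass only, not a trained model.

```python
import numpy as np

def sym_norm_adj(A):
    """S = D^{-1/2} (A + I) D^{-1/2}: normalised adjacency with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcnii_layer(S, H, H0, W, alpha, beta):
    """One GCNII layer: initial residual, then identity-mapped transform, then ReLU."""
    propagated = (1.0 - alpha) * (S @ H) + alpha * H0        # initial residual
    mapping = (1.0 - beta) * np.eye(W.shape[0]) + beta * W   # identity mapping
    return np.maximum(propagated @ mapping, 0.0)

# Hypothetical toy setup: a 4-node path graph with 3 hidden dimensions.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = sym_norm_adj(A)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))  # stands in for the initial representation

alpha, lam = 0.1, 0.5  # illustrative hyperparameters
H = H0
for l in range(1, 65):  # 64 layers
    beta = np.log(lam / l + 1.0)  # shrinks with depth, so deep layers stay near identity
    W = rng.normal(scale=0.1, size=(3, 3))  # untrained random weights
    H = gcnii_layer(S, H, H0, W, alpha, beta)

print(H.shape)
```

Setting alpha = 1 and beta = 0 reduces the layer to ReLU(H0), and alpha = 0 with beta = 1 recovers a plain GCN layer, which is a quick way to sanity-check an implementation.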

JK-Net (Jumping Knowledge, Xu et al., 2018)

JK-Net uses all intermediate representations, not just the last layer:

h_v = AGG( h^{(1)}_v, h^{(2)}_v, ..., h^{(K)}_v )

Here AGG is concatenation, element-wise max-pooling, or an LSTM with attention over layers. Each node's final embedding includes information from all receptive field sizes simultaneously. Oversmoothing in deep layers is offset by the sharp early-layer representations.
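The two simpler aggregators can be sketched directly. The function `jk_aggregate` and the toy embeddings are hypothetical; the LSTM variant is omitted because it needs trainable parameters.

```python
import numpy as np

def jk_aggregate(layer_outputs, mode="concat"):
    """Combine per-layer node embeddings, each of shape (num_nodes, d)."""
    if mode == "concat":
        return np.concatenate(layer_outputs, axis=1)         # (num_nodes, K * d)
    if mode == "max":
        return np.stack(layer_outputs, axis=0).max(axis=0)   # (num_nodes, d)
    raise ValueError(f"unsupported mode: {mode}")

# Toy embeddings for 2 nodes, d = 2, from K = 3 layers (hypothetical values).
h1 = np.array([[1.0, 0.0], [0.0, 1.0]])  # layer 1: nodes clearly distinct
h2 = np.array([[0.7, 0.3], [0.3, 0.7]])  # layer 2: partly smoothed
h3 = np.array([[0.5, 0.5], [0.5, 0.5]])  # layer 3: fully smoothed

concat = jk_aggregate([h1, h2, h3], "concat")
pooled = jk_aggregate([h1, h2, h3], "max")
print(concat.shape)
print(pooled)
```

Even though the layer-3 embeddings are identical for both nodes, the aggregated embeddings still differ, because the layer-1 and layer-2 outputs are carried through.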

APPNP

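As discussed in the APPNP post, APPNP separates transformation from propagation: a network first computes a prediction from each node's own features, and these predictions are then propagated over the graph. The teleport probability α keeps each node anchored to its own prediction even after 20 propagation steps, which limits oversmoothing.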
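The propagation step can be sketched as the iteration Z ← (1 − α) S Z + α H, where H holds the per-node predictions and S is the normalised adjacency with self-loops. In the sketch below, random values stand in for the network output H, and the ring graph and helper names are assumptions for illustration. Setting α = 0 gives plain repeated propagation for comparison.

```python
import numpy as np

def sym_norm_adj(A):
    """S = D^{-1/2} (A + I) D^{-1/2}: normalised adjacency with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def appnp_propagate(S, H, alpha, num_steps):
    """Power iteration with teleport back to the initial predictions H."""
    Z = H
    for _ in range(num_steps):
        Z = (1.0 - alpha) * (S @ Z) + alpha * H
    return Z

def spread_across_nodes(Z):
    """Standard deviation across nodes, averaged over output dimensions."""
    return float(Z.std(axis=0).mean())

# Hypothetical toy graph: a ring of 8 nodes.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
S = sym_norm_adj(A)

rng = np.random.default_rng(0)
H = rng.normal(size=(n, 3))  # stands in for the network's per-node predictions

spread_without = spread_across_nodes(appnp_propagate(S, H, alpha=0.0, num_steps=20))
spread_with = spread_across_nodes(appnp_propagate(S, H, alpha=0.1, num_steps=20))
print(f"alpha = 0.0: spread = {spread_without:.4f}")
print(f"alpha = 0.1: spread = {spread_with:.4f}")
```

After 20 steps, the run with teleport keeps clearly more variation between nodes than the run without it, even though both use the same graph and the same number of propagation steps.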

DropEdge (Rong et al., 2020)

Randomly drop a fraction of edges during each training step. This reduces the averaging effect per layer, slowing oversmoothing. The idea is analogous to Dropout, applied to edges instead of activations.
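A minimal sketch of the edge-sampling step, using only the standard library. The function `drop_edge`, the toy edge list, and the drop rate of 0.3 are assumptions for illustration; in a real training loop the normalised adjacency would be rebuilt from the kept edges at every step.

```python
import random

def drop_edge(edges, drop_rate, rng):
    """Keep a uniformly random subset containing (1 - drop_rate) of the undirected edges."""
    num_keep = round(len(edges) * (1.0 - drop_rate))
    return rng.sample(edges, num_keep)

# Hypothetical toy graph: 5 nodes, 10 undirected edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0),
         (0, 2), (1, 3), (2, 4), (1, 4), (0, 3)]

rng = random.Random(0)
for epoch in range(3):
    kept = drop_edge(edges, drop_rate=0.3, rng=rng)
    # A training step would build the adjacency from `kept` here.
    print(epoch, len(kept), sorted(kept))
```

Each epoch sees a different sparser graph, so every node averages over fewer neighbours per layer.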

The Depth vs Width Trade-off

An alternative to depth: make each layer wider (more hidden dimensions). In practice, going from d=64 to d=256 with 2-3 layers often outperforms using d=64 with 8-16 layers — wider layers better capture local structure without oversmoothing.

The GNN community is moving toward:

Shallow local GNNs (2-4 layers) for node-level tasks
Decoupled designs that separate propagation from transformation (APPNP, SGC) for medium range
Graph Transformers or GPS (Transformer + local MPNN) for long-range tasks

Summary

Depth	Effect	Recommendation
1-2 layers	Minimal smoothing, limited context	Good baseline
3-4 layers	More context, mild smoothing; still near the peak on most homophilic benchmarks	Default upper range
8-16 layers	Oversmoothing dominates; needs residuals	Use GCNII, JK-Net
32+ layers	Near-impossible without special design	GCNII or abandon local MPNN

GNN depth scaling does not follow the same scaling laws as Transformer depth. Understanding this — and which architectural tricks restore the benefit of depth — is central to modern GNN design.

References

Chen, M., Wei, Z., Huang, Z., Ding, B., & Li, Y. (2020). Simple and Deep Graph Convolutional Networks. ICML 2020 (GCNII).
Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K., & Jegelka, S. (2018). Representation Learning on Graphs with Jumping Knowledge Networks. ICML 2018 (JK-Net).
Rong, Y., Huang, W., Xu, T., & Huang, J. (2020). DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. ICLR 2020.


Alessio Borgi