GCN: Graph Convolutional Networks

3 minute read

Published:

TL;DR: GCN's layer formula is H' = σ(÷H·W), where à = D̃^(-1/2)(A+I)D̃^(-1/2) is the degree-normalised adjacency with self-loops. Each layer averages neighbour representations and applies a learned linear transformation. Simple. Powerful. The foundation everything else builds on.

From Spectral Theory to a Practical Layer

As explained in the Graph Laplacian post, spectral graph convolution filters signals using the Laplacian’s eigenvectors. But computing eigenvectors is O(N³) — completely infeasible for large graphs.

Kipf & Welling (2016) made two elegant simplifications:

  1. Localise: approximate the spectral filter as a first-order polynomial in L — only immediate neighbours matter.
  2. Normalise: use symmetric normalisation to prevent large-degree nodes from dominating.

The result is a layer that can be computed with simple sparse matrix multiplication.

The GCN Formula

H(l+1) = σ( D̃ Ã D̃ · H(l) · W(l) )

Where:

  • H^(l) — node features at layer l (N × F matrix, one row per node).
  • Ã = A + I — adjacency with self-loops added.
  • — degree matrix of Ã.
  • W^(l) — learned weight matrix (F × F’ ).
  • σ — activation function (ReLU).

Breaking It Down Step by Step

Let’s trace what happens for a single node v with neighbours A, B, C:

Step 1: Add self-loops. Ã = A + I ensures v attends to itself too, not just its neighbours. Without this, a node’s own features would be excluded from the aggregation.

Step 2: Compute normalisation weights. The weight for the edge (u→v) is: 1 / √(deg_Ã(u) · deg_Ã(v)).

This is the symmetric normalisation. It down-weights edges from high-degree “hub” nodes (which would otherwise dominate).

Step 3: Aggregate. (D̃^(-½) Ã D̃^(-½) · H)[v] = weighted sum of features from v and all its neighbours.

Step 4: Transform. Multiply by W: linear projection to a new feature space.

Step 5: Activate. Apply ReLU (or other non-linearity).

Graph (with deg normalisation) A deg=2 v deg=4 B deg=3 C deg=2 1/√(2×4)≈0.35 1/√(4×3)≈0.29 1/√(2×4)≈0.35 self-loop 1/√(4×4)=0.25 GCN aggregation at v: h_v' = σ( W · [ 0.35 · h_A + 0.25 · h_v (self) + 0.29 · h_B + 0.35 · h_C ] ) High-degree hub A: weight 0.35 Low-degree v×v: weight 0.25 — prevents hub dominance
Figure 1: GCN aggregation at node v. Degree normalisation (1/√(deg_u · deg_v)) prevents high-degree hub nodes from dominating the aggregation. The self-loop ensures v's own features are included.

What GCN Is in MPNN Terms

  • MSG: send neighbour features multiplied by normalisation weight.
  • AGGREGATE: weighted sum (the normalised adjacency does this).
  • UPDATE: apply a single linear layer W followed by ReLU.

Stacking Layers and the Oversmoothing Problem

With k GCN layers, each node’s representation captures its k-hop neighbourhood.

  • k=2: captures 2-hop info (good for most citation/social graph tasks).
  • k≫1: oversmoothing — all node representations converge to the same vector as information diffuses uniformly across the whole graph, making nodes indistinguishable.

This is GCN’s main limitation. GAT, GPRGNN, and JK-Nets propose various solutions.

✅ Key Takeaways

  • GCN layer: H' = σ(Ã_norm · H · W), where Ã_norm is the degree-normalised adjacency with self-loops.
  • All neighbours contribute equally (weighted only by degree normalisation) — no attention, no learned weights on edges.
  • Simple and fast: just sparse matrix multiplication + a linear layer.
  • Main weakness: oversmoothing with deep stacks; all node embeddings converge to the same vector.