GAT: Graph Attention Networks

TL;DR: GAT (Veličković et al., 2018) replaces GCN's fixed degree-normalised weights with learned attention coefficients α(i,j) on each edge. Each node can learn to attend more strongly to certain neighbours — adaptive, task-specific, interpretable.

The Problem with Fixed Weights

In GCN, the aggregation weight for neighbour u contributing to node v is fixed at 1/√(deg(u)·deg(v)). This depends only on node degrees — the model can’t learn that some neighbours are more important than others for the task at hand.
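
To make the fixed weighting concrete, here is a minimal sketch that evaluates 1/√(deg(u)·deg(v)) on a small toy graph. The graph and the node names are hypothetical, and self-loops are left out for brevity. The weights fall out of the degrees alone; node features never enter the calculation.

```python
import math

# Hypothetical undirected toy graph as an adjacency list (self-loops omitted).
adj = {
    "v": ["a", "b", "c"],
    "a": ["v"],
    "b": ["v", "c"],
    "c": ["v", "b"],
}

def gcn_weight(u, v, adj):
    """Fixed GCN aggregation weight 1/sqrt(deg(u) * deg(v)) for neighbour u of node v."""
    return 1.0 / math.sqrt(len(adj[u]) * len(adj[v]))

# Weight of every neighbour of "v"; only the degrees matter, never the features.
weights = {u: gcn_weight(u, "v", adj) for u in adj["v"]}
for u, w in weights.items():
    print(f"{u} -> v: {w:.3f}")
```

Neighbours b and c receive the same weight because they have the same degree, no matter how different their features might be.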

For a citation network: when predicting paper topic, citing a paper on the exact same topic should count more than citing a survey that covers dozens of topics. GCN can’t express this.

GAT solves it by learning attention weights from features.

The Attention Mechanism

For each directed edge (j → i), GAT computes an attention coefficient:

Step 1: Linear transform. Apply a shared weight matrix W to both node features: z_i = W · h_i, z_j = W · h_j

Step 2: Concatenate and score. Compute raw attention score using a learned vector a: e_{ij} = LeakyReLU( aᵀ · [z_i ‖ z_j] )

Step 3: Softmax over neighbours. Normalise across all neighbours of i:

α_{ij} = softmax_{j∈N(i)}( e_{ij} ) = exp(e_{ij}) / Σ_{k∈N(i)} exp(e_{ik})

Step 4: Weighted aggregate: h'_i = σ( Σ_{j∈N(i)} α_{ij} · W · h_j )
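
The four steps can be sketched as a single-head layer in plain Python. This is an illustrative sketch, not a reference implementation: the function names, the LeakyReLU slope of 0.2, the use of tanh as a stand-in for σ, and the toy features and parameters are all assumptions. Self-loops are not added.

```python
import math

def matvec(W, h):
    """Multiply matrix W (a list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return x if x > 0 else slope * x

def gat_layer(h, neighbours, W, a, activation=math.tanh):
    """h: node -> feature list; neighbours: node i -> list of nodes in N(i)."""
    # Step 1: shared linear transform z = W h for every node.
    z = {node: matvec(W, feat) for node, feat in h.items()}
    out, alpha = {}, {}
    for i, nbrs in neighbours.items():
        # Step 2: raw score e_ij = LeakyReLU(a^T [z_i || z_j]); list "+" concatenates.
        e = [leaky_relu(sum(ak * x for ak, x in zip(a, z[i] + z[j]))) for j in nbrs]
        # Step 3: softmax over N(i), subtracting the max for numerical stability.
        m = max(e)
        exps = [math.exp(x - m) for x in e]
        total = sum(exps)
        alpha[i] = {j: x / total for j, x in zip(nbrs, exps)}
        # Step 4: attention-weighted sum of transformed neighbours, then sigma.
        agg = [sum(alpha[i][j] * z[j][d] for j in nbrs) for d in range(len(W))]
        out[i] = [activation(x) for x in agg]
    return out, alpha

# Hypothetical toy input: target node "i" with three neighbours.
h = {"i": [1.0, 0.0], "A": [0.9, 0.1], "B": [0.0, 1.0], "C": [-1.0, 0.5]}
neighbours = {"i": ["A", "B", "C"]}
W = [[1.0, 0.0], [0.0, 1.0]]   # identity, so z equals h here
a = [0.5, -0.5, 1.0, 0.2]      # length is 2 x output dimension
out, alpha = gat_layer(h, neighbours, W, a)
print(alpha["i"], out["i"])
```

With these made-up parameters the coefficients for node i sum to 1 and A receives the largest share, followed by B and then C.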

Figure 1: GAT learns an attention coefficient α(i,j) for each edge into the target node i. Neighbour A gets high attention (0.65), B medium (0.25), C low (0.10). The coefficients sum to 1 (0.65 + 0.25 + 0.10, softmax normalised) and weight the neighbourhood aggregation.
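
As a small worked example of that weighting, the snippet below plugs the figure's coefficients into the weighted sum, before the non-linearity σ is applied. The transformed neighbour features z_j = W · h_j are made-up values chosen only for illustration.

```python
# Attention coefficients taken from Figure 1.
alpha = {"A": 0.65, "B": 0.25, "C": 0.10}

# Hypothetical transformed neighbour features z_j = W h_j (2-dimensional).
z = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}

# Weighted aggregate sum_j alpha_ij * z_j, computed one dimension at a time.
agg = [sum(alpha[j] * z[j][d] for j in alpha) for d in range(2)]
print(agg)  # approximately [0.75, 0.35]
```

The first dimension is 0.65 · 1 + 0.25 · 0 + 0.10 · 1 = 0.75 and the second is 0.65 · 0 + 0.25 · 1 + 0.10 · 1 = 0.35, so A dominates the result.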

Multi-Head GAT

Just like Multi-Head Attention in Transformers, GAT can run K independent attention heads:

  • For intermediate layers: concatenate the K head outputs: h'_i = ‖_{k=1}^K σ( Σ_{j∈N(i)} α^k_{ij} W^k h_j ). The output dimension is K times the per-head output dimension.
  • For the final layer: average the K head outputs before applying the non-linearity: h'_i = σ( (1/K) Σ_{k=1}^K Σ_{j∈N(i)} α^k_{ij} W^k h_j ). The output keeps the per-head output dimension.
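
The two ways of combining heads can be sketched as follows. The function name `combine_heads`, the tanh stand-in for σ, and the per-head values are assumptions for illustration; each inner list plays the role of one head's pre-activation sum Σ_j α^k_{ij} W^k h_j for a single node.

```python
import math

def combine_heads(head_outputs, mode, activation=math.tanh):
    """Combine K per-head pre-activation vectors for one node."""
    if mode == "concat":
        # Intermediate layers: apply sigma per head, then concatenate (dimension K * F').
        return [activation(x) for head in head_outputs for x in head]
    if mode == "average":
        # Final layer: average over heads first, then apply sigma (dimension F').
        k = len(head_outputs)
        return [activation(sum(vals) / k) for vals in zip(*head_outputs)]
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical outputs of K = 3 heads, each with F' = 2 features.
heads = [[0.2, -0.4], [0.6, 0.0], [0.1, 0.4]]
concat = combine_heads(heads, "concat")    # 6 values
average = combine_heads(heads, "average")  # 2 values
print(len(concat), len(average))
```

Concatenation keeps every head's view at the cost of a wider representation, while averaging is the natural choice when the layer's output size has to match, for example, the number of classes.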

Each head has its own parameters, so different heads can specialise in different types of relationships, much as in Transformer multi-head attention.

GAT v2 (2022)

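The original GAT has a subtle expressiveness issue, shown by Brody et al. (2022): its attention is static. The score e_{ij} = LeakyReLU( aᵀ · [z_i ‖ z_j] ) splits into a term that depends only on i plus a term that depends only on j, and LeakyReLU is monotonic, so the ranking of neighbours j by attention score is the same for every target node i. The model can learn which neighbours score highly overall, but it cannot make the top-ranked neighbour depend on which node is doing the attending.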

GAT v2 fixes this with a small change — applying the non-linearity before the dot product with a:

e_{ij} = aᵀ · LeakyReLU( W_l · h_j + W_r · h_i )

This makes the attention dynamic: the ranking of neighbours can now differ from one target node i to another, so the score is truly a function of i and j together.
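
The difference can be checked on a toy case with scalar node features. In the sketch below every weight is hand-picked for illustration, not learned, and the helper names are hypothetical. The GATv2 weights are chosen so that the score equals -0.8 · |h_j - h_i|, which rewards neighbours that resemble the target node.

```python
def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return x if x > 0 else slope * x

def gat_score(h_i, h_j, w, a1, a2):
    # Original GAT with scalar features: a^T [z_i || z_j] = a1*z_i + a2*z_j.
    return leaky_relu(a1 * w * h_i + a2 * w * h_j)

def gatv2_score(h_i, h_j, W_l, W_r, a):
    # GATv2: non-linearity first, then the dot product with a.
    hidden = [leaky_relu(wl * h_j + wr * h_i) for wl, wr in zip(W_l, W_r)]
    return sum(ak * x for ak, x in zip(a, hidden))

def top_neighbour(score_fn, h_i, keys):
    """Index of the neighbour with the highest raw score for target feature h_i."""
    scores = [score_fn(h_i, h_j) for h_j in keys]
    return scores.index(max(scores))

queries = [0.0, 1.0]   # features of two target nodes i
keys = [0.1, 0.9]      # features of two shared neighbours j

gat_top = [top_neighbour(lambda hi, hj: gat_score(hi, hj, 1.0, 0.7, -1.3), q, keys)
           for q in queries]
gatv2_top = [top_neighbour(
                 lambda hi, hj: gatv2_score(hi, hj, [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]),
                 q, keys)
             for q in queries]
print(gat_top, gatv2_top)
```

Under the original scoring both target nodes prefer the same neighbour, and changing a1 or a2 can flip which neighbour wins but never make the two targets disagree. Under the GATv2 scoring each target node prefers the neighbour closest to itself.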

When to Use GAT over GCN?

  • When neighbour importance varies and you want the model to learn which neighbours matter.
  • When interpretability is important — the α values can be visualised as edge importance scores.
  • When edge features are available: the scoring function can be extended so that they enter the attention score alongside the two node features.

✅ Key Takeaways

  • GAT replaces GCN's fixed degree weights with learned attention coefficients α(i,j) on each edge.
  • Attention is computed from both node features — adaptive to the task, not just graph topology.
  • Multi-head GAT runs K attention heads in parallel, improving representational capacity.
  • GAT v2 fixes the static attention problem of the original by applying the non-linearity before the dot product with a, rather than after it.