GAT: Graph Attention Networks
Published:
The Problem with Fixed Weights
In GCN, the aggregation weight for neighbour u contributing to node v is fixed at 1/√(deg(u)·deg(v)). This depends only on node degrees — the model can’t learn that some neighbours are more important than others for the task at hand.
For a citation network: when predicting paper topic, citing a paper on the exact same topic should count more than citing a survey that covers dozens of topics. GCN can’t express this.
GAT solves it by learning attention weights from features.
The Attention Mechanism
For each directed edge (j → i), GAT computes an attention coefficient:
Step 1: Linear transform. Apply a shared weight matrix W to both node features: z_i = W · h_i, z_j = W · h_j
Step 2: Concatenate and score. Compute raw attention score using a learned vector a: e_{ij} = LeakyReLU( aᵀ · [z_i ‖ z_j] )
Step 3: Softmax over neighbours. Normalise across all neighbours of i:
Step 4: Weighted aggregate: h'_i = σ( Σ_{j∈N(i)} α_{ij} · W · h_j )
Multi-Head GAT
Just like Multi-Head Attention in Transformers, GAT can run K independent attention heads:
- For intermediate layers: concatenate the K head outputs:
h'_i = ‖_{k=1}^K σ(Σ_j α^k_{ij} W^k h_j)— expands the feature dimension by K. - For final layers: average the K head outputs:
h'_i = σ( (1/K) Σ_k Σ_j α^k_{ij} W^k h_j )— keeps original dimension.
Each head can specialise in different types of relationships, exactly as in Transformer multi-head attention.
GAT v2 (2022)
The original GAT has a subtle expressiveness issue: the attention is a static function of the source node, meaning the attention from u to v can be the same regardless of what v looks like (Brody et al., 2022 showed this).
GAT v2 fixes this with a small change — applying the non-linearity before the dot product with a:
e_{ij} = aᵀ · LeakyReLU( W_l · h_j + W_r · h_i )
This makes the attention dynamic — truly a function of both i and j together.
When to Use GAT over GCN?
- When neighbour importance varies and you want the model to learn which neighbours matter.
- When interpretability is important — the α values can be visualised as edge importance scores.
- When edge features are available (can be incorporated into the attention score).
✅ Key Takeaways
- GAT replaces GCN's fixed degree weights with learned attention coefficients α(i,j) on each edge.
- Attention is computed from both node features — adaptive to the task, not just graph topology.
- Multi-head GAT runs K attention heads in parallel, improving representational capacity.
- GAT v2 fixes a static attention problem in the original by applying non-linearity before the attention score.
