GAT: Graph Attention Networks

Published: February 06, 2024

TL;DR: GAT (Veličković et al., 2018) replaces GCN's fixed degree-normalised weights with learned attention coefficients α(i,j) on each edge. Each node can learn to attend more strongly to certain neighbours — adaptive, task-specific, interpretable.

The Problem with Fixed Weights

In GCN, the aggregation weight for neighbour u contributing to node v is fixed at 1/√(deg(u)·deg(v)), with degrees counted after adding self-loops. This depends only on node degrees, so the model can't learn that some neighbours are more important than others for the task at hand.
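To make the fixed weighting concrete, here is a minimal sketch in plain Python. The function name `gcn_edge_weights` and the toy star graph are illustrative assumptions, and self-loops are added before counting degrees, as in the usual GCN formulation.

```python
import math

def gcn_edge_weights(edges, num_nodes):
    """Return {(u, v): 1/sqrt(deg(u) * deg(v))} for an undirected edge list.

    Assumption: every node gets a self-loop first, so each degree
    counts the node itself.
    """
    neighbours = {n: {n} for n in range(num_nodes)}  # self-loops
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    deg = {n: len(nbrs) for n, nbrs in neighbours.items()}
    return {
        (u, v): 1.0 / math.sqrt(deg[u] * deg[v])
        for v in neighbours
        for u in neighbours[v]
    }

# Hypothetical toy graph: node 0 is linked to nodes 1, 2 and 3.
weights = gcn_edge_weights([(0, 1), (0, 2), (0, 3)], num_nodes=4)

# Neighbours 1 and 2 have the same degree, so they get the same weight
# towards node 0, whatever their features are.
print(weights[(1, 0)], weights[(2, 0)])
```

Node features never enter the computation, which is the limitation the rest of this post addresses.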

For a citation network: when predicting paper topic, citing a paper on the exact same topic should count more than citing a survey that covers dozens of topics. GCN can’t express this.

GAT solves it by learning attention weights from features.

The Attention Mechanism

For each directed edge (j → i), GAT computes an attention coefficient:

Step 1: Linear transform. Apply a shared weight matrix W to both node features: z_i = W · h_i, z_j = W · h_j

Step 2: Concatenate and score. Compute raw attention score using a learned vector a: e_{ij} = LeakyReLU( aᵀ · [z_i ‖ z_j] )

Step 3: Softmax over neighbours. Normalise across all neighbours of i:

α_{ij} = softmax_{j∈N(i)}( e_{ij} ) = exp(e_{ij}) / Σ_{k∈N(i)} exp(e_{ik})

Step 4: Weighted aggregate: h'_i = σ( Σ_{j∈N(i)} α_{ij} · W · h_j ), where σ is a non-linearity (ELU in the original paper).
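The four steps can be sketched as a single-head layer in plain Python. This is an illustration under stated assumptions rather than a reference implementation: the helper names, the toy features, W and a are made up, ELU stands in for σ, and the LeakyReLU slope is set to 0.2.

```python
import math

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def gat_layer(h, neighbours, W, a):
    """Single-head GAT layer.

    h: dict node -> feature list; neighbours: dict node -> list N(i);
    W: F' x F weight matrix; a: attention vector of length 2 * F'.
    Returns (new features, attention coefficients).
    """
    z = {n: matvec(W, feats) for n, feats in h.items()}       # Step 1
    alpha, h_new = {}, {}
    for i, nbrs in neighbours.items():
        # Step 2: score each edge j -> i from the concatenation [z_i || z_j]
        e = [leaky_relu(sum(ak * x for ak, x in zip(a, z[i] + z[j])))
             for j in nbrs]
        # Step 3: softmax over N(i), shifted by the max for numerical stability
        m = max(e)
        exp_e = [math.exp(x - m) for x in e]
        total = sum(exp_e)
        alpha[i] = {j: x / total for j, x in zip(nbrs, exp_e)}
        # Step 4: attention-weighted sum of transformed neighbours, then sigma
        agg = [sum(alpha[i][j] * z[j][d] for j in nbrs) for d in range(len(W))]
        h_new[i] = [elu(x) for x in agg]
    return h_new, alpha

# Hypothetical toy inputs: 4 nodes with 2 features each.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, -0.5]}
neighbours = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
W = [[1.0, 0.5], [-0.5, 1.0]]
a = [0.3, -0.2, 0.7, 0.1]

h_new, alpha = gat_layer(h, neighbours, W, a)
print(alpha[0])  # coefficients over node 0's neighbours, summing to 1
```

In a real model W and a are trained by gradient descent; here they are fixed only to show the data flow.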

Figure 1: GAT learns an attention coefficient α(i,j) for each edge. Neighbour A gets high attention (0.65), B medium (0.25), C low (0.10). These sum to 1 (softmax) and weight the neighbourhood aggregation.

Multi-Head GAT

Just like Multi-Head Attention in Transformers, GAT can run K independent attention heads:

For intermediate layers: concatenate the K head outputs: h'_i = ‖_{k=1}^K σ(Σ_j α^k_{ij} W^k h_j). This multiplies the feature dimension by K.
For final layers: average the K head outputs before the non-linearity: h'_i = σ( (1/K) Σ_k Σ_j α^k_{ij} W^k h_j ). This keeps the output dimension of a single head.
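The difference between the two combination rules shows up mainly in the output shape. Below is a small sketch assuming the per-head sums Σ_j α^k_{ij} W^k h_j have already been computed for one node; `combine_heads`, the made-up numbers, and the choice of ELU for σ are all illustrative assumptions.

```python
import math

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def combine_heads(head_aggregates, final_layer=False):
    """Combine K per-head pre-activation vectors for a single node.

    Intermediate layer: apply sigma per head, then concatenate (K * F' values).
    Final layer: average the heads first, then apply sigma (F' values).
    """
    if final_layer:
        k = len(head_aggregates)
        mean = [sum(vals) / k for vals in zip(*head_aggregates)]
        return [elu(x) for x in mean]
    return [elu(x) for agg in head_aggregates for x in agg]

# K = 2 heads, F' = 3 features per head (made-up numbers).
heads = [[0.2, -1.0, 0.5], [1.0, 0.0, -0.5]]

hidden = combine_heads(heads)                    # concatenation
output = combine_heads(heads, final_layer=True)  # averaging
print(len(hidden), len(output))
```

Concatenation keeps each head's view separate for the next layer, while averaging in the last layer keeps the output size equal to the number of target dimensions, for example the number of classes.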

Each head can specialise in a different type of relationship, much as in Transformer multi-head attention.

GAT v2 (2022)

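The original GAT has a subtle expressiveness issue, shown by Brody et al. (2022): its attention is static. Because aᵀ · [z_i ‖ z_j] splits into one term that depends only on i and one that depends only on j, and LeakyReLU is monotonic, the ranking of neighbours j by e_{ij} is the same for every target node i. Which neighbour scores highest therefore cannot depend on the node doing the attending.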

GAT v2 fixes this with a small change — applying the non-linearity before the dot product with a:

e_{ij} = aᵀ · LeakyReLU( W_l · h_j + W_r · h_i )

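Here W_l and W_r are weight matrices applied to the neighbour j and the target i respectively. Because the non-linearity now sits between the node features and a, the score no longer splits into independent i and j terms. This makes the attention dynamic: the ranking of neighbours can change with the target node i.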
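A toy comparison makes the static versus dynamic distinction visible. The sketch below is an illustration under assumptions: the vectors stand for already-transformed features (W · h for GAT, W_r · h_i and W_l · h_j for GATv2), and all numbers and function names are made up for the example.

```python
def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def gat_score(z_i, z_j, a):
    # Original GAT: LeakyReLU is applied after the dot product with a.
    return leaky_relu(dot(a, z_i + z_j))

def gatv2_score(s_i, s_j, a):
    # GATv2: LeakyReLU is applied before the dot product with a.
    return dot(a, [leaky_relu(x + y) for x, y in zip(s_i, s_j)])

def ranking(score, query, keys, a):
    """Indices of keys sorted from highest to lowest attention score."""
    return sorted(range(len(keys)),
                  key=lambda j: score(query, keys[j], a), reverse=True)

# Two target nodes (queries) that share the same two neighbours (keys).
query_a, query_b = [2.0, -2.0], [-2.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0]]

a_gat = [0.5, -0.5, 1.0, -1.0]  # length 2 * F', acts on [z_i || z_j]
a_v2 = [1.0, 1.0]               # length F'

gat_ranks = [ranking(gat_score, q, keys, a_gat) for q in (query_a, query_b)]
v2_ranks = [ranking(gatv2_score, q, keys, a_v2) for q in (query_a, query_b)]

print("GAT:  ", gat_ranks)   # same neighbour order for both targets
print("GATv2:", v2_ranks)    # neighbour order depends on the target
```

With the original scoring, changing the target only shifts every neighbour's score by the same amount before a monotonic function, so the order cannot change. With the GATv2 scoring, the two targets prefer different neighbours.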

When to Use GAT over GCN?

When neighbour importance varies and you want the model to learn which neighbours matter.
When interpretability is important — the α values can be visualised as edge importance scores.
When edge features are available: the original GAT does not use them, but the attention score is a natural place to incorporate them, and some extensions do so.

✅ Key Takeaways

GAT replaces GCN's fixed degree weights with learned attention coefficients α(i,j) on each edge.
Attention is computed from the features of both endpoint nodes, so it adapts to the task rather than depending only on graph topology.
Multi-head GAT runs K attention heads in parallel, improving representational capacity.
GAT v2 fixes the static attention problem of the original by applying the non-linearity before the dot product with a.

Alessio Borgi
