Graph Transformers: Bringing Attention to Graphs

5 minute read

Published:

TL;DR: A Graph Transformer treats each node as a token and runs full self-attention across all N nodes — every node can attend to every other node directly. Graph structure is injected via positional encodings (Laplacian eigenvectors, random walks) or attention biases. This overcomes over-squashing and long-range dependency limits of local message passing.

The Limits of Local Message Passing

Standard GNNs (GCN, GAT, GIN, GraphSAGE) aggregate information only from direct neighbours. To reach a node k hops away, you need k layers. Problems:

  1. Long-range dependencies: for diameter-10 graphs, you need 10 layers — causing over-smoothing
  2. Over-squashing: information from exponentially many nodes must be compressed through bottleneck edges (see the Over-squashing post)
  3. Structural rigidity: the message-passing graph IS the computation graph

The solution: full attention — every node attends to every other.

Graph Transformer Architecture

A Graph Transformer layer is essentially a standard Transformer self-attention layer, but applied to graph nodes:

Q = H W_Q,   K = H W_K,   V = H W_V
A_{ij} = softmax_j( QᵢKⱼᵀ / √d_k + b_{ij} )
H'ᵢ = Σⱼ A_{ij} Vⱼ

Where b_{ij} is an optional attention bias encoding the graph structure between nodes i and j.

Without b_{ij}: the model ignores the graph and is a standard Transformer on node features.
With b_{ij}: graph structure guides attention (edge presence, distance, structural similarity).

The Key Challenge: Positional Encodings for Graphs

Sequences have a natural positional order (position 1, 2, 3, …). Graphs do not — there is no canonical node ordering. Without positional information, nodes with identical features but different structural roles are indistinguishable.

Graph Transformers use graph-based positional encodings (covered in depth in the Graph PE section):

  • Laplacian eigenvectors: the k smallest eigenvectors of the graph Laplacian
  • Random walk PEs: landing probabilities from node i to node j in k steps
  • Shortest path distances: encoded as integer biases on attention scores

Graphormer (2021)

Graphormer (Ying et al., Microsoft) biases attention scores by three structural features:

  1. Spatial encoding: b_{ij} = φ(dist(i,j)) — a learned function of the shortest path distance
  2. Edge encoding: for each edge on the path i→j, add edge features to the attention score
  3. Centrality encoding: add degree-based biases to node embeddings at input
A_{ij} = softmax( (Qᵢ + z_{deg(i)}) (Kⱼ + z_{deg(j)})ᵀ / √d + b_dist(i,j) + b_edge(i,j) ) / √d

Graphormer achieved state-of-the-art on OGB-LSC molecular property prediction (PCQM4Mv2).

SAN (Spectral Attention Network)

SAN (Kreuzer et al., 2021) uses Laplacian eigenvectors as positional encodings and computes full attention over all node pairs, distinguishing connected from non-connected pairs:

A_{ij} = softmax( edge(i,j) · QᵢKⱼᵀ + (1-edge(i,j)) · Q̃ᵢK̃ⱼᵀ )

Separate query-key matrices for existing edges and non-edges — the model can attend differently to connected and non-connected nodes.

GPS (General, Powerful, Scalable Graph Transformer)

GPS (Rampášek et al., 2022) combines:

  • Local message passing (GCN/GAT/GIN): captures local structural patterns efficiently
  • Global self-attention (Transformer): captures long-range dependencies
GPS layer:
  h_local  ← local MPNN(h, A)    # O(|E| · d)
  h_global ← self-attention(h)   # O(N² · d)
  h_out    ← LN(h + h_local + h_global) + FFN

GPS achieves state-of-the-art on the Long-Range Graph Benchmark (LRGB) — a benchmark specifically designed to test long-range dependency learning.

Why combine MPNN and attention? Local MPNN is good at short-range structural reasoning (counting triangles, identifying motifs). Global attention is good at long-range reasoning (connecting distant relevant nodes). They are complementary, not competing.

Complexity: The O(N²) Challenge

Full attention on graphs of N nodes costs O(N²) — the same as self-attention in Transformers.

For small graphs (molecules with N < 100): no problem. For large graphs (social networks with N > 100,000): prohibitive.

Solutions:

  • K-nearest-neighbor attention: each node attends to only its K nearest neighbours in feature space
  • Linformer-style: approximate full attention with a low-rank decomposition
  • Cluster-based: run full attention within clusters, then inter-cluster attention

Summary

PropertyLocal MPNN (GCN, GAT)Graph Transformer  
Receptive fieldK-hop (K = number of layers)All nodes (direct)  
Long-range dependenciesRequires many layers (oversmoothing)Handled directly  
ComplexityO(K ·E· d)O(N² · d)
Graph structure encodingAdjacency (message passing)Positional encodings + attention biases  
Suitable graph sizeAny (with neighbour sampling)Small-medium (N < 10,000)  
Over-squashingYesNo  

Graph Transformers trade O(N²) computation for the ability to directly connect any pair of nodes. For small graphs (molecules, proteins, small networks), this is the current state of the art. For large graphs, GPS-style hybrid approaches (local MPNN + global attention) are the practical frontier.

References