Graphormer: Transformers with Structural Biases for Graphs

5 minute read

Published:

TL;DR: Graphormer (Ying et al., Microsoft, 2021) takes a standard Transformer and injects three graph-structural signals: (1) degree centrality encoded in node embeddings, (2) shortest-path distance encoded as attention biases, (3) edge features along paths encoded in attention. The result: a Transformer that provably subsumes message-passing GNNs and achieves state-of-the-art on molecular benchmarks.

The Bridge Between Transformers and GNNs

Graphormer’s key insight: a standard Transformer on graphs, without any structural information, ignores the graph entirely. Inject the right structural signals, and it becomes more expressive than any message-passing GNN.

The paper proves that Graphormer can simulate any MPNN — message-passing GNNs are a special case. This makes Graphormer a strict generalisation of the GNN family.

The Three Structural Encodings

1. Centrality Encoding

The input representation of each node v is augmented with its in-degree and out-degree:

x'_v = x_v + z⁻_{deg⁻(v)} + z⁺_{deg⁺(v)}

Where z⁻ and z⁺ are learnable embedding vectors for each degree value, indexed by in-degree deg⁻(v) and out-degree deg⁺(v).

Why: degree is a fundamental structural property. High-degree “hub” nodes play a different role than low-degree “leaf” nodes. The centrality encoding informs the model about a node’s structural importance before attention even begins.

2. Spatial Encoding (Shortest Path Distance)

For each pair of nodes (i, j), the attention score is biased by a learned function of their shortest path distance:

A_{ij} = softmax_j( QᵢKⱼᵀ / √d + φ(dist(i,j)) )

Where dist(i,j) is the shortest path distance and φ is a learnable scalar embedding indexed by distance. If no path exists (disconnected), a special value (e.g., dist = ∞ → a specific embedding) is used.

Why: nearby nodes should interact more strongly; the attention mechanism alone (based only on features) cannot discover topology. By encoding distance, the model can attend to a node’s immediate neighbours more strongly than distant ones — recovering the inductive bias of local message passing while still being able to attend globally.

3. Edge Encoding

For each directed path i → v₁ → v₂ → … → j, the edge features along the path are incorporated into the attention score:

b_{ij} = (1/|path|) Σₙ x^E_{eₙ} · w^E_n

Where x^E_{eₙ} is the feature of the n-th edge on the shortest path, and w^E_n is a learned weight vector for edge position n.

Why: in molecules, the type of bonds along the path between two atoms carries chemically relevant information. Two atoms separated by a chain of single bonds behave differently from two atoms separated by double bonds, even at the same path length.

The Full Graphormer Layer

A_{ij} = softmax_j( (Qᵢ + c_i)(Kⱼ + c_j)ᵀ / √d + φ(dist(i,j)) + b_{ij} ) H'ᵢ = Σⱼ A_{ij} Vⱼ

Where cᵢ = z⁻{deg⁻(i)} + z⁺{deg⁺(i)} is the centrality encoding for node i.

The complete layer is: Centrality-augmented queries/keys + Distance-biased scores + Path-edge-biased scores + Attention-weighted value aggregation.

Expressive Power

Theorem (Ying et al., 2021): Graphormer is strictly more expressive than 1-WL (the Weisfeiler-Lehman test). Any function computed by a message-passing GNN can be computed by Graphormer.

This follows because:

  1. Shortest-path distances capture structural information that 1-WL cannot (e.g., cycle lengths)
  2. Full attention (every node sees every other) avoids the locality constraints of MPNN
  3. Edge encoding captures richer path-level information

Results

Graphormer was submitted to the OGB-LSC 2021 competition (Large-Scale Challenge) on PCQM4Mv2 — predicting the HOMO-LUMO gap of 3.8M molecules.

ModelMAE ↓
GCN0.1684
GIN0.1537
GINE0.1195
Graphormer0.0864

A 28% improvement over the previous best — a landmark result that established Graph Transformers as the new state of the art for molecular property prediction.

Why molecules? Molecules are small graphs (typically 10-60 atoms), making O(N²) attention feasible. Structural information (bond paths, distances) is chemically meaningful. And molecular property prediction is a high-stakes application (drug discovery, materials science) where accuracy matters.

Limitations

  • O(N²) attention: limits applicability to small graphs
  • Precomputed shortest paths: all-pairs shortest paths cost O(N³) — feasible for small molecules but not large graphs
  • No explicit subgraph detection: the model sees pairwise distances but not higher-order structures like triangles or cycles (beyond what distances capture)

Graphormer-3D

A follow-up extends Graphormer to 3D molecular structures — using Euclidean distances and angles between atoms as structural encodings, rather than graph-theoretic path lengths. This is crucial for tasks where 3D geometry determines properties (conformer generation, energy prediction).

Summary

Structural elementEncoding in GraphormerPurpose
Node importanceCentrality encoding in QKHub vs. leaf distinction
Pairwise proximityShortest-path bias in attentionLocal vs. global weighting
Edge informationPath-edge feature aggregationBond type, relation quality

Graphormer is the canonical example of how to inject graph structure into a Transformer cleanly. Its three structural biases are now standard components in the Graph Transformer design space.

References