Shortest-Path Encodings for Graph Transformers

4 minute read

Published: April 20, 2024

TL;DR: For Graph Transformers, the shortest path distance dist(i,j) between every node pair can be used as an attention bias — biasing attention scores to decrease with distance, encoding the graph's metric structure without message passing.

Shortest path distance bias — Spatial encoding via shortest path distances in Graphormer (Ying et al., 2021)

Intuition First

In a standard Transformer, every token can attend to every other token — but the attention score is purely based on content similarity. For graphs, two distant nodes might have very similar features yet share no direct structural relationship. SPD encoding adds a “distance penalty” to attention: nodes far apart in the graph should attend less strongly, regardless of feature similarity.

Think of it like gravity — the force of attraction decays with distance, but the model learns the exact decay rate from data (not just 1/d²).

Graphormer's spatial encoding: attention from node v to its 1-hop, 2-hop, 3-hop, and 4-hop neighbours. The bias φ_SPD(d) is a learned scalar per distance — the model learns to attend less as distance grows, but the exact shape is data-driven.

Why Shortest Paths?

In Graphormer (Ying et al., 2021), every pair of nodes can attend to each other. But without structural information, the model has no way to know that nodes 1 hop apart should interact more strongly than nodes 10 hops apart.

Shortest-path distance (SPD) encoding injects this directly as an attention bias:

A_{ij} = softmax_j( QᵢKⱼᵀ / √d + φ_SPD(dist(i,j)) )

Where φ_SPD is a learned embedding: one scalar per distance value (0, 1, 2, 3, …, max_dist, ∞).

What It Captures

Immediate neighbours (dist=1): strongest attention bias
2-hop neighbours (dist=2): moderate bias
Distant nodes (dist=10): small or negative bias
Disconnected (dist=∞): special “no path” embedding

The model learns how distance should affect attention strength — this is not fixed (e.g., always decaying); the learned φ_SPD can assign different weights to each distance bucket.

All-Pairs Shortest Paths: Computation Cost

Computing all-pairs shortest paths (APSP) costs O(N · (N +

)) with BFS from every node. For small graphs (N < 1000): cheap, precomputable. For large graphs (N > 100,000): prohibitive.

This is why SPD encoding is primarily used for small-graph tasks (molecules with N < 100, protein structure graphs with N < 500).

SPD vs LapPE vs RWPE

Encoding	Type	Captures	Cost
LapPE	Node PE	Global position	O(k·	E	·T)
RWPE	Node PE	Local structural role	O(K·	E	)
SPD	Edge/pair PE	Graph metric distances	O(N·(N+	E	))

SPD is a pairwise encoding — it’s not a node property but a pair property. It cannot be used as a standard node embedding; it must be injected into the attention mechanism as a bias.

Beyond SPD: Distance Encoding (DE)

Distance Encoding (Li et al., 2020) uses SPD from each node to a set of landmark nodes as a node-level encoding. With landmark set S:

pe_v = [dist(v, s₁), dist(v, s₂), ..., dist(v, s_k)]

This converts pairwise distances into node features (by fixing anchor nodes), making it usable as standard node PE. With random landmark selection, it approximates SPD information efficiently.

Key Insight: SPD encoding is the only PE that directly injects pairwise graph metric information. LapPE and RWPE are node-level — they encode each node independently. SPD is edge-level — it encodes the relationship between every pair. This makes SPD strictly more informative for attention, but strictly more expensive to compute. The O(N²) cost is why SPD is limited to molecule-scale graphs (N < 500).

Summary

SPD encoding directly injects graph metric structure into Graph Transformer attention. It is simple, interpretable, and effective for small graphs. For large graphs, it is too expensive — use RWPE or LapPE instead, which capture distance information implicitly through local structure.

References

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., & Liu, T.-Y. (2021). Do Transformers Really Perform Bad for Graph Representation?. NeurIPS 2021 (Graphormer — introduces SPD and edge-distance encodings).
Li, P., Wang, Y., Wang, H., & Leskovec, J. (2020). Distance Encoding: Design Provably More Powerful Graph Neural Networks for Structural Representation Learning. NeurIPS 2020.

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Alessio Borgi

Shortest-Path Encodings for Graph Transformers

Intuition First

Why Shortest Paths?

What It Captures

All-Pairs Shortest Paths: Computation Cost

SPD vs LapPE vs RWPE

Beyond SPD: Distance Encoding (DE)

Summary

References

Share on

You May Also Enjoy

Output, Gated, and Special Activations: Softmax, GLU, SIREN, and More

Modern Activation Functions: GELU, SiLU, Mish, and Smooth Gating

Activation Functions in Neural Networks: Why Non-Linearity Matters

FoPE: Fourier Position Embedding for Length Generalization

Alessio Borgi

Intuition First

Why Shortest Paths?

What It Captures

All-Pairs Shortest Paths: Computation Cost

SPD vs LapPE vs RWPE

Beyond SPD: Distance Encoding (DE)

Summary

References

Share on

You May Also Enjoy

📄 Output, Gated, and Special Activations: Softmax, GLU, SIREN, and More

📄 Modern Activation Functions: GELU, SiLU, Mish, and Smooth Gating

📄 Activation Functions in Neural Networks: Why Non-Linearity Matters

📄 FoPE: Fourier Position Embedding for Length Generalization

Output, Gated, and Special Activations: Softmax, GLU, SIREN, and More

Modern Activation Functions: GELU, SiLU, Mish, and Smooth Gating

Activation Functions in Neural Networks: Why Non-Linearity Matters

FoPE: Fourier Position Embedding for Length Generalization