Shortest-Path Encodings for Graph Transformers
Published:
Why Shortest Paths?
In Graphormer (Ying et al., 2021), every pair of nodes can attend to each other. But without structural information, the model has no way to know that nodes 1 hop apart should interact more strongly than nodes 10 hops apart.
Shortest-path distance (SPD) encoding injects this directly as an attention bias:
Where φ_SPD is a learned embedding: one scalar per distance value (0, 1, 2, 3, …, max_dist, ∞).
What It Captures
- Immediate neighbours (dist=1): strongest attention bias
- 2-hop neighbours (dist=2): moderate bias
- Distant nodes (dist=10): small or negative bias
- Disconnected (dist=∞): special “no path” embedding
The model learns how distance should affect attention strength — this is not fixed (e.g., always decaying); the learned φ_SPD can assign different weights to each distance bucket.
All-Pairs Shortest Paths: Computation Cost
| Computing all-pairs shortest paths (APSP) costs O(N · (N + | E | )) with BFS from every node. For small graphs (N < 1000): cheap, precomputable. For large graphs (N > 100,000): prohibitive. |
This is why SPD encoding is primarily used for small-graph tasks (molecules with N < 100, protein structure graphs with N < 500).
SPD vs LapPE vs RWPE
| Encoding | Type | Captures | Cost | ||
|---|---|---|---|---|---|
| LapPE | Node PE | Global position | O(k· | E | ·T) |
| RWPE | Node PE | Local structural role | O(K· | E | ) |
| SPD | Edge/pair PE | Graph metric distances | O(N·(N+ | E | )) |
SPD is a pairwise encoding — it’s not a node property but a pair property. It cannot be used as a standard node embedding; it must be injected into the attention mechanism as a bias.
Beyond SPD: Distance Encoding (DE)
Distance Encoding (Li et al., 2020) uses SPD from each node to a set of landmark nodes as a node-level encoding. With landmark set S:
This converts pairwise distances into node features (by fixing anchor nodes), making it usable as standard node PE. With random landmark selection, it approximates SPD information efficiently.
Summary
SPD encoding directly injects graph metric structure into Graph Transformer attention. It is simple, interpretable, and effective for small graphs. For large graphs, it is too expensive — use RWPE or LapPE instead, which capture distance information implicitly through local structure.
References
- Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., & Liu, T.-Y. (2021). Do Transformers Really Perform Bad for Graph Representation?. NeurIPS 2021 (Graphormer — introduces SPD and edge-distance encodings).
- Li, P., Wang, Y., Wang, H., & Leskovec, J. (2020). Distance Encoding: Design Provably More Powerful Graph Neural Networks for Structural Representation Learning. NeurIPS 2020.
