Sheaves Meet Attention: Transformer-Inspired Sheaf Models
Published:
The Local vs Global Trade-off
Sheaf GNNs are local models — they aggregate from the K-hop neighbourhood (K = number of layers). For tasks requiring long-range dependencies (K » graph diameter), standard sheaf GNNs either need many layers (oversmoothing risk) or can’t reach the relevant nodes.
Graph Transformers solve this with global attention — every pair (u, v) computes an attention score, and every node receives information from all nodes regardless of graph distance.
The two approaches are complementary:
- Sheaf diffusion: principled local geometry, handles heterophily, avoids oversmoothing, but local
- Global attention: arbitrary long-range, but no structural inductive bias, quadratic cost
A Sheaf Transformer combines both.
Design Option 1: Sheaf-Augmented Attention
The simplest combination: use sheaf restriction maps as structural biases in the attention matrix.
Standard graph Transformer attention score (e.g., Graphormer):
where b_{uv} is a structural bias (distance, centrality, etc.).
Sheaf-augmented attention: replace or augment b_{uv} with sheaf-derived quantities:
where [Δ_F^{-1}]_{uv} is the (u,v)-block of the pseudoinverse of the Sheaf Laplacian — measuring the sheaf effective resistance between u and v.
High sheaf effective resistance → u and v are weakly connected in the sheaf → low attention bias. Low sheaf effective resistance → strongly connected in sheaf → high attention bias.
This is the sheaf generalisation of the effective resistance bias used in GRIT and related graph Transformers.
Design Option 2: Transported Attention
Inspired by SheafAN (which transports messages via restriction maps before attention), a Sheaf Transformer can transport the key vector of node u into node v’s local frame before computing the attention score:
where O_{uv} ∈ O(d) is the orthogonal restriction map “from u to v” along the shortest path (or the direct map if u and v are adjacent).
For non-adjacent nodes (no direct restriction map), O_{uv} is estimated as the composition of maps along the shortest path:
(parallel transport along the path). This requires path discovery and composition — expensive but topologically meaningful.
Design Option 3: Alternating Layers
The simplest practical approach: alternate sheaf diffusion layers and global attention layers:
For each block:
1. Sheaf diffusion layer (local, handles heterophily):
H ← (I − Δ_F^{norm}) H W
2. Global attention layer (long-range, any pair):
H ← MultiHead-Attention(Q H, K H, V H)
The sheaf layer handles local relational structure; the attention layer handles long-range dependencies. Together they can handle both.
Implementation: Use the sheaf predictor only for the sheaf layers (predicting maps from current H). The attention layers use standard Q, K, V projections. No modification of the attention mechanism is needed.
This is analogous to GPS (General, Powerful, Scalable) graph networks (Rampášek et al., 2022), which alternate MPNN layers and Transformer layers — but replace MPNN with sheaf diffusion.
Sheaf Positional Encodings
An orthogonal contribution of sheaves to Transformers: using eigenvectors of Δ_F as positional encodings.
Standard positional encodings for graph Transformers use eigenvectors of the graph Laplacian L (LSPE, Kreuzer et al., 2021). The sheaf Laplacian eigenvectors provide richer PEs:
- They depend on both topology (through G) and relational geometry (through F)
- They are task-adaptive when F is learned (the PEs change during training)
- They have dimension Nd (vs N for standard LapPE) — more discriminative
Sheaf PE computation:
where q_k^{(v)} ∈ ℝ^d is the v-th block of the k-th eigenvector of Δ_F.
These PEs can be used as input features to any graph Transformer, adding sheaf structure without modifying the attention mechanism.
Comparison: Sheaf Transformer vs Standard Approaches
| Model | Local structure | Long-range | Heterophily | Cost |
|---|---|---|---|---|
| NSD | Sheaf diffusion | None (K-hop) | Yes | O(E·d²·K) |
| GAT | Edge attention | None | Partial | O(E·d·K) |
| Graph Transformer | None | Global attention | No | O(N²·d) |
| GPS (MPNN + Attn) | MPNN | Global attention | Partial | O((E+N²)·d) |
| Sheaf Transformer | Sheaf diffusion | Global attention | Yes | O(E·d²+N²·d) |
The Sheaf Transformer achieves the best of all worlds — at the cost of combining both local (O(E·d²)) and global (O(N²·d)) computation.
Scalability
The main challenge: global attention scales as O(N²) — prohibitive for large graphs (N > 10⁴).
Solutions:
- Linear attention: approximate softmax attention with O(N) cost (Performer, Linformer). Compatible with sheaf local layers.
- Sparse attention: attend only to K-nearest neighbours in feature space + sheaf-selected structural neighbours
- Hierarchical sheaf + attention: apply sheaf diffusion locally, pool to coarser graph, apply attention at coarser scale
Open Problems
- Unified sheaf attention: Can we design a single mechanism that is simultaneously gauge-equivariant, has transported attention (SheafAN-style), and has global reach?
- Sheaf PE + Transformer: Do sheaf-based positional encodings provide measurable benefit over standard LapPE for graph Transformers?
- Sheaf Transformer theory: Can the oversmoothing and heterophily theory of sheaf GNNs be extended to sheaf Transformers?
References
- Rampášek, L., Galkin, M., Dwivedi, V. P., Lim, A. T., Wolf, G., & Beaini, D. (2022). Recipe for a General, Powerful, Scalable Graph Transformer. NeurIPS 2022 (GPS: alternating MPNN and Transformer layers — the architecture template Sheaf Transformers build on).
- Kreuzer, D., Beaini, D., Hamilton, W., Létourneau, V., & Tossou, P. (2021). Rethinking Graph Transformers with Spectral Attention. NeurIPS 2021 (SAT: spectral attention using Laplacian eigenvectors as PEs — extended to sheaves via Δ_F eigenvectors).
- Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., & Liu, T.-Y. (2021). Do Transformers Really Perform Bad for Graph Representation?. NeurIPS 2021 (Graphormer: structural biases in graph Transformer attention — the approach extended with sheaf effective resistance biases).
