Sheaf Attention Networks (Barbero et al., 2022)
Published: NeurIPS 2022 Workshop
Contribution: Introduces attention into the sheaf GNN framework. Orthogonal restriction maps combined with attention weights yield a model that is both gauge-equivariant and selectively aggregating.
Motivation: What NSD Cannot Do
NSD aggregates from all neighbours without learned per-neighbour weights (messages are weighted only by the sheaf Laplacian normalisation). GAT assigns attention weights to neighbours, learning which neighbours matter for a given node. These two capabilities are complementary:
- NSD: rich relational geometry (sheaf maps), uniform aggregation
- GAT: simple aggregation (no relational maps), adaptive weighting
SheafAN combines both: orthogonal restriction maps (for gauge equivariance and relational structure) + attention weights (for adaptive aggregation).
The SheafAN Aggregation
For each node v and edge e = (u, v), SheafAN computes:
Step 1 — Transported message (using the orthogonal restriction map O_{u▷e}):
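m_{uv} = O_{uv} h_u
where O_{uv} = O_{v▷e}ᵀ O_{u▷e} ∈ O(d) is the “relative rotation” from u to v: O_{u▷e} maps h_u into the edge space and O_{v▷e}ᵀ maps the result into v’s frame.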
Step 2 — Gauge-invariant attention score:
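e_{uv} = a([h_v ‖ O_{uv}h_u])
The score is computed between h_v and the transported message O_{uv}h_u (not raw h_u). This is what makes gauge invariance possible: under a gauge transformation {g_w ∈ O(d)}, both h_v and O_{uv}h_u transform by the same g_v, so any score that depends only on their inner products and norms is gauge-invariant. A generic learned vector a applied to the concatenation does not have this property (see Limitations).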
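Step 3 — Softmax normalisation:
α_{uv} = exp(e_{uv}) / Σ_{w ∈ N(v)} exp(e_{wv}), where N(v) is the set of neighbours of v.
Step 4 — Weighted aggregation:
h_v′ = Σ_{u ∈ N(v)} α_{uv} O_{uv} h_u
The final aggregation is a weighted sum of transported messages: each neighbour’s features are first rotated into v’s local frame (by O_{uv}), then weighted by the attention coefficient α_{uv}, then summed.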
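The four steps can be sketched in NumPy for a single node. This is a minimal illustration, not the authors' implementation: the orthogonal maps are drawn at random instead of being learned, the score is assumed to be a plain linear function of the concatenation, and the names `random_orthogonal`, `sheafan_aggregate` and `O_node_edge` are hypothetical. The relative rotation is assumed to be composed as "map h_u to the edge with O_{u▷e}, then pull back to v with O_{v▷e}ᵀ".

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw an orthogonal d x d matrix via QR (stand-in for a learned map)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def sheafan_aggregate(h, neighbours, O_node_edge, a, v):
    """One attention-weighted aggregation at node v.

    h           : (n, d) array of node features
    neighbours  : list of neighbour indices u of v
    O_node_edge : dict mapping (w, edge) -> (d, d) orthogonal map from node w to the edge
    a           : (2d,) attention vector (assumed linear score)
    """
    msgs, scores = [], []
    for u in neighbours:
        e = frozenset((u, v))
        # Step 1: transport h_u into v's frame (assumed order: u -> edge -> v)
        O_uv = O_node_edge[(v, e)].T @ O_node_edge[(u, e)]
        m = O_uv @ h[u]
        # Step 2: score between h_v and the transported message
        scores.append(a @ np.concatenate([h[v], m]))
        msgs.append(m)
    scores = np.array(scores)
    # Step 3: softmax over the neighbours of v (shifted for numerical stability)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Step 4: weighted sum of transported messages
    return alpha, alpha @ np.array(msgs)

rng = np.random.default_rng(0)
d, n = 3, 4
h = rng.standard_normal((n, d))
v, neighbours = 0, [1, 2, 3]
O_node_edge = {}
for u in neighbours:
    e = frozenset((u, v))
    O_node_edge[(u, e)] = random_orthogonal(d, rng)
    O_node_edge[(v, e)] = random_orthogonal(d, rng)
a = rng.standard_normal(2 * d)

alpha, h_v_new = sheafan_aggregate(h, neighbours, O_node_edge, a, v)
print(alpha, h_v_new)
```

Because the attention coefficients sum to one and each transported message has the same norm as the original h_u, the output norm cannot exceed the largest neighbour norm.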
Why Gauge Invariance of Attention Matters
In standard GAT, the attention score e_{uv} = a([h_u ‖ h_v]) is not gauge-invariant — it changes if we apply a local rotation g_v at node v. This means the attention weights change depending on which “frame” we use to represent node features.
In SheafAN, the attention score uses O_{uv}h_u (the message transported into v’s frame) rather than raw h_u. Under gauge transformation {g_w}:
- h_v → g_v h_v
- O_{uv}h_u → g_v O_{uv} g_u⁻¹ g_u h_u = g_v O_{uv} h_u
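So [h_v ‖ O_{uv}h_u] → [g_v h_v ‖ g_v O_{uv}h_u] = (g_v ⊕ g_v)[h_v ‖ O_{uv}h_u], where g_v ⊕ g_v is the block-diagonal matrix that applies g_v to each half of the concatenation.
For a linear score e_{uv} = aᵀ[h_v ‖ O_{uv}h_u], the transformed score is aᵀ(g_v ⊕ g_v)[h_v ‖ O_{uv}h_u], which in general differs from the original value because a fixed vector a is not preserved by g_v ⊕ g_v. The score is therefore gauge-invariant only if it is built from O(d)-invariant quantities. The inner product is one such choice: h_vᵀ O_{uv} h_u → (g_v h_v)ᵀ(g_v O_{uv} h_u) = h_vᵀ O_{uv} h_u.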
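The transformation rules above can be checked numerically. The sketch below uses random orthogonal matrices as a hypothetical gauge transformation and compares two candidate scores: a linear score on the concatenation and the inner product h_vᵀ O_{uv} h_u. All variable names are illustrative.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw an orthogonal d x d matrix via QR."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

rng = np.random.default_rng(1)
d = 3
h_u, h_v = rng.standard_normal(d), rng.standard_normal(d)
O_uv = random_orthogonal(d, rng)
g_u, g_v = random_orthogonal(d, rng), random_orthogonal(d, rng)
a = rng.standard_normal(2 * d)  # assumed linear attention vector

# Gauge-transformed quantities: h_w -> g_w h_w, O_uv -> g_v O_uv g_u^{-1}
h_u2, h_v2 = g_u @ h_u, g_v @ h_v
O_uv2 = g_v @ O_uv @ g_u.T  # g_u^{-1} = g_u^T for orthogonal g_u

m, m2 = O_uv @ h_u, O_uv2 @ h_u2
# The transported message picks up exactly g_v
transported_ok = np.allclose(m2, g_v @ m)

# Linear score on the concatenation: changes under the gauge transformation
concat_before = a @ np.concatenate([h_v, m])
concat_after = a @ np.concatenate([h_v2, m2])

# Inner-product score: unchanged
inner_before = h_v @ m
inner_after = h_v2 @ m2

print(transported_ok, concat_before, concat_after, inner_before, inner_after)
```

For generic random inputs the two concatenation scores differ while the two inner-product scores agree up to floating-point error.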
Relation to Standard GAT
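Standard GAT is recovered from SheafAN by setting every restriction map to the identity, so that O_{uv} = I:
e_{uv} = a([h_v ‖ h_u]),  h_v′ = Σ_{u ∈ N(v)} α_{uv} h_u, with the sum running over the neighbours N(v) of v.
SheafAN can also be read as NSD with a different aggregation rule: it replaces the diffusion-based aggregation with attention-based aggregation over transported messages.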
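A small check of the identity-map reduction, under simplifying assumptions: the score is linear in the concatenation, and GAT's shared weight matrix and LeakyReLU are omitted. The helper `attention_aggregate` is hypothetical.

```python
import numpy as np

def attention_aggregate(h_v, messages, a):
    """Softmax attention over messages, scored against h_v by a linear map."""
    scores = np.array([a @ np.concatenate([h_v, m]) for m in messages])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ np.stack(messages)

rng = np.random.default_rng(2)
d = 4
h_v = rng.standard_normal(d)
h_nbrs = rng.standard_normal((3, d))
a = rng.standard_normal(2 * d)

# Transported-message aggregation with identity maps O_uv = I
identity_maps = [np.eye(d)] * len(h_nbrs)
sheaf_out = attention_aggregate(h_v, [O @ h for O, h in zip(identity_maps, h_nbrs)], a)

# Plain attention over raw neighbour features
gat_out = attention_aggregate(h_v, list(h_nbrs), a)

same = np.allclose(sheaf_out, gat_out)
print(same)
```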
Orthogonal Map Learning
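The restriction maps O_{u▷e} ∈ O(d) are learned using the Cayley parameterisation:
O = (I + A)(I − A)⁻¹, with A skew-symmetric (Aᵀ = −A)
An MLP predicts the strictly lower-triangular entries of A, which determine the whole matrix because a skew-symmetric matrix has d(d−1)/2 free entries. The Cayley map sends ℝ^{d(d−1)/2} differentiably into SO(d) ⊂ O(d) (every output has determinant +1, so reflections are not reached), which enables end-to-end training.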
For d=2: A = [[0, a], [−a, 0]] → O = [[cos θ, sin θ], [−sin θ, cos θ]] where tan(θ/2) = a. The map reduces to learning a single angle per edge.
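The Cayley construction and the d=2 case can be verified with a short NumPy sketch. The sign convention O = (I + A)(I − A)⁻¹ is assumed here because it reproduces the d=2 rotation stated above. The helpers `skew_from_params` and `cayley` are hypothetical, and random numbers stand in for the MLP output.

```python
import numpy as np

def skew_from_params(params, d):
    """Place d(d-1)/2 parameters in the strict lower triangle, then antisymmetrise."""
    L = np.zeros((d, d))
    L[np.tril_indices(d, k=-1)] = params
    return L - L.T

def cayley(A):
    """Cayley transform O = (I + A)(I - A)^{-1} of a skew-symmetric matrix A."""
    I = np.eye(A.shape[0])
    return (I + A) @ np.linalg.inv(I - A)

# General d: the output is orthogonal with determinant +1
rng = np.random.default_rng(3)
d = 4
params = rng.standard_normal(d * (d - 1) // 2)  # stand-in for MLP output
O4 = cayley(skew_from_params(params, d))
is_orthogonal = np.allclose(O4.T @ O4, np.eye(d))
det4 = np.linalg.det(O4)

# d = 2: A = [[0, a], [-a, 0]] with a = tan(theta / 2) gives a rotation by theta
theta = 0.7
a = np.tan(theta / 2)
O2 = cayley(np.array([[0.0, a], [-a, 0.0]]))
expected = np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])
matches_rotation = np.allclose(O2, expected)

print(is_orthogonal, det4, matches_rotation)
```

I − A is always invertible for a real skew-symmetric A, since the eigenvalues of A are purely imaginary, so the inverse in `cayley` is well defined.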
Multi-Head Attention
SheafAN supports multi-head attention: K independent heads, each with its own orthogonal maps {O_{uv}^{(k)}} and attention vectors {a^{(k)}}:
h_v′ = ‖_{k=1}^{K} Σ_{u ∈ N(v)} α_{uv}^{(k)} O_{uv}^{(k)} h_u
Here ‖ denotes concatenation over heads and N(v) the neighbours of v. With K heads, the output dimension is Kd. Multi-head SheafAN provides K different relational perspectives on each edge: each head can learn a different rotation to represent the edge relationship.
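The head-wise concatenation can be sketched as follows for d=2, where each orthogonal map is a single rotation angle. Random angles and attention vectors stand in for learned parameters, and the helpers `rotation` and `head_output` are hypothetical.

```python
import numpy as np

def rotation(theta):
    """2 x 2 rotation matrix in the convention used for the d=2 Cayley example."""
    return np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

def head_output(h_v, h_nbrs, O_list, a):
    """One attention head: transport, score, softmax, weighted sum."""
    msgs = np.stack([O @ h for O, h in zip(O_list, h_nbrs)])
    scores = np.array([a @ np.concatenate([h_v, m]) for m in msgs])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ msgs

rng = np.random.default_rng(5)
K, d, n_nbrs = 3, 2, 4
h_v = rng.standard_normal(d)
h_nbrs = rng.standard_normal((n_nbrs, d))

heads = []
for k in range(K):
    # Each head has its own per-edge rotations and its own attention vector
    O_list = [rotation(t) for t in rng.uniform(0.0, 2.0 * np.pi, n_nbrs)]
    a_k = rng.standard_normal(2 * d)
    heads.append(head_output(h_v, h_nbrs, O_list, a_k))

out = np.concatenate(heads)  # shape (K * d,)
print(out.shape)
```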
Empirical Results
Node classification accuracy (%) on heterophilic benchmarks:
| Model | Cornell | Texas | Wisconsin | Chameleon |
|---|---|---|---|---|
| GAT | 54.3 | 58.4 | 49.4 | 60.5 |
| NSD-orth | 85.0 | 88.4 | 86.0 | 70.2 |
| SheafAN (d=2) | 86.2 | 89.1 | 86.8 | 71.3 |
| SheafAN (d=4) | 87.1 | 89.7 | 87.5 | 72.0 |
In this table SheafAN scores above NSD-orth on all four heterophilic datasets, by 0.7 to 1.2 points at d=2 and by 1.3 to 2.1 points at d=4, which suggests that the attention mechanism adds value beyond the sheaf structure alone.
On homophilic datasets (Cora, Citeseer), SheafAN matches GAT and NSD, which suggests that adding attention does not hurt on homophilic tasks.
Comparison: SheafAN vs NSD vs GAT
| Property | GAT | NSD | SheafAN |
|---|---|---|---|
| Restriction maps | None (identity) | General/diagonal/orth | Orthogonal |
| Aggregation | Attention-weighted | Sheaf-Laplacian (uniform) | Attention over transported messages |
| Gauge equivariance | No | Partial (diagonal/orth) | Yes (orthogonal) |
| Heterophily handling | Limited (softmax attention is non-negative) | Yes (via maps) | Yes (via maps + attention) |
| Parameters per edge | d (attention vector) | d² or d | d(d−1)/2 + d (maps + attention) |
Limitations
- Gauge invariance gap: The concatenation-based attention score is not gauge-invariant in general, because a fixed attention vector a does not commute with a change of frame at v. A truly gauge-invariant score would require inner-product attention: e_{uv} = h_vᵀ O_{uv} h_u.
- Orthogonal maps only: SheafAN restricts to orthogonal maps for gauge equivariance; general maps (as in NSD) are excluded. This limits expressiveness for non-gauge-symmetric tasks.
- No feature rescaling: Orthogonal maps preserve norms and so cannot scale features. For tasks where feature magnitude matters, diagonal or general maps may outperform orthogonal ones.
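The norm-preservation point in the last limitation can be illustrated with a minimal sketch. The random orthogonal matrix and the particular diagonal entries are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
x = rng.standard_normal(d)

# Orthogonal map: obtained from a QR factorisation of a random matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
# Diagonal map: rescales each coordinate independently
D = np.diag([0.5, 2.0, 3.0])

norm_x = np.linalg.norm(x)
norm_Qx = np.linalg.norm(Q @ x)  # equal to norm_x up to rounding
norm_Dx = np.linalg.norm(D @ x)  # generally different from norm_x

print(norm_x, norm_Qx, norm_Dx)
```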
References
- Barbero, F., Bodnar, C., de Ocáriz Borde, H. S., Bronstein, M., Veličković, P., & Liò, P. (2022). Sheaf Attention Networks. NeurIPS 2022 Workshop.
- Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph Attention Networks. ICLR 2018 (GAT: the attention mechanism SheafAN extends with transported messages and orthogonal restriction maps).
- Bodnar, C., Di Giovanni, F., Chamberlain, B. P., Liò, P., & Bronstein, M. M. (2022). Neural Sheaf Diffusion. NeurIPS 2022 (NSD: the predecessor architecture whose orthogonal map parameterisation SheafAN inherits).
