Sheaf Attention Networks (Barbero et al., 2022)

Paper: Barbero, F., Bodnar, C., de Ocáriz Borde, H. S., Bronstein, M., Veličković, P., & Liò, P. (2022). Sheaf Attention Networks. NeurIPS 2022 Workshop on Symmetry and Geometry in Neural Representations.
Contribution: Introduces attention into the sheaf GNN framework. Orthogonal restriction maps combined with attention weights yield a model that is both gauge-equivariant and selectively aggregating.

Motivation: What NSD Cannot Do

Neural Sheaf Diffusion (NSD) aggregates from all neighbours without learned per-neighbour weights; contributions are scaled only by the sheaf Laplacian normalisation. Graph Attention Networks (GAT) assign attention weights to neighbours, learning which neighbours matter for a given node. These two capabilities are complementary:

  • NSD: rich relational geometry (sheaf maps), uniform aggregation
  • GAT: simple aggregation (no relational maps), adaptive weighting

SheafAN combines both: orthogonal restriction maps (for gauge equivariance and relational structure) + attention weights (for adaptive aggregation).

The SheafAN Aggregation

For each node v and edge e = (u, v), SheafAN computes:

Step 1 — Transported message (using the orthogonal restriction map O_{u▷e}):

m_{u→v} = O_{v▷e}ᵀ O_{u▷e} h_u = O_{uv} h_u

where O_{uv} = O_{v▷e}ᵀ O_{u▷e} ∈ O(d) is the “relative rotation” from u to v: O_{u▷e} maps h_u into the edge stalk, and O_{v▷e}ᵀ maps the result back into v’s frame.

Step 2 — Gauge-invariant attention score:

e_{uv} = LeakyReLU( aᵀ [ h_v ‖ O_{uv} h_u ] )

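The score is computed between h_v and the transported message O_{uv}h_u, not the raw h_u. This matters for gauge behaviour: under a gauge transformation {g_w ∈ O(d)}, both h_v and O_{uv}h_u transform by the same g_v, so the two arguments of the score are always expressed in v’s frame. Whether the scalar e_{uv} is then strictly invariant depends on the scoring function, as discussed in the section on gauge invariance below.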

Step 3 — Softmax normalisation:

α_{uv} = exp(e_{uv}) / Σ_{u' ∈ N(v)} exp(e_{u'v})

Step 4 — Weighted aggregation:

h_v^{new} = σ( Σ_{u ∈ N(v)} α_{uv} · O_{uv} h_u )

The final aggregation is a weighted sum of transported messages — each neighbour’s features are first rotated into v’s local frame (by O_{uv}), then weighted by the attention score, then summed.
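The four steps can be sketched for a single target node in NumPy. This is a minimal illustration rather than the paper's implementation: the function name `sheaf_attention_aggregate`, the choice σ = tanh, the LeakyReLU slope of 0.2, and the convention that the relative rotations O_{uv} are passed in precomputed are all assumptions made for the example.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def sheaf_attention_aggregate(h_v, h_nbrs, O_rel, a):
    """Aggregate transported neighbour messages into node v.

    h_v:    (d,)      features of the target node v
    h_nbrs: (k, d)    features of the k neighbours u of v
    O_rel:  (k, d, d) relative rotations O_{uv}, one per neighbour (assumed precomputed)
    a:      (2d,)     attention vector
    """
    # Step 1: transport each neighbour's features into v's frame.
    msgs = np.einsum("kij,kj->ki", O_rel, h_nbrs)
    # Step 2: GAT-style score on [h_v || O_uv h_u].
    pairs = np.concatenate([np.broadcast_to(h_v, msgs.shape), msgs], axis=1)
    scores = leaky_relu(pairs @ a)
    # Step 3: softmax over the neighbourhood (shifted by the max for stability).
    w = np.exp(scores - scores.max())
    alpha = w / w.sum()
    # Step 4: attention-weighted sum of transported messages, then sigma (tanh here).
    return np.tanh(alpha @ msgs), alpha

# Toy example: one node with k = 4 neighbours and d = 3 features.
rng = np.random.default_rng(0)
d, k = 3, 4
h_v = rng.normal(size=d)
h_nbrs = rng.normal(size=(k, d))
# Random orthogonal matrices via QR, standing in for learned maps.
O_rel = np.stack([np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(k)])
a = rng.normal(size=2 * d)

h_new, alpha = sheaf_attention_aggregate(h_v, h_nbrs, O_rel, a)
```

Passing identity matrices for `O_rel` reduces the update to an attention-weighted sum of the raw neighbour features.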

Why Gauge Invariance of Attention Matters

In standard GAT, the attention score e_{uv} = a([h_u ‖ h_v]) is not gauge-invariant — it changes if we apply a local rotation g_v at node v. This means the attention weights change depending on which “frame” we use to represent node features.

In SheafAN, the attention score uses O_{uv}h_u (the message transported into v’s frame) rather than raw h_u. Under gauge transformation {g_w}:

  • h_v → g_v h_v
  • O_{uv}h_u → g_v O_{uv} g_u⁻¹ g_u h_u = g_v O_{uv} h_u
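The second rule relies on O_{uv} → g_v O_{uv} g_u⁻¹. A short derivation, under two assumptions: restriction maps respond to a change of frame as O_{w▷e} → O_{w▷e} g_w⁻¹ (so that the edge-stalk vector O_{w▷e} h_w is unchanged), and the transport is read as first restricting h_u to the edge and then pulling back to v.

```latex
\begin{aligned}
O_{uv} = O_{v \triangleright e}^{\top}\, O_{u \triangleright e}
  \;&\longmapsto\; \bigl(O_{v \triangleright e}\, g_v^{-1}\bigr)^{\top} \bigl(O_{u \triangleright e}\, g_u^{-1}\bigr) \\
  &= g_v^{-\top}\, O_{v \triangleright e}^{\top}\, O_{u \triangleright e}\, g_u^{-1} \\
  &= g_v\, O_{uv}\, g_u^{-1}
  \qquad \text{since } g_v^{-\top} = g_v \text{ for } g_v \in O(d).
\end{aligned}
```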

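So [h_v ‖ O_{uv}h_u] → [g_v h_v ‖ g_v O_{uv}h_u] = (g_v ⊕ g_v) [h_v ‖ O_{uv}h_u], where g_v ⊕ g_v applies g_v to each block of the concatenation.

For a fixed attention vector a, the pre-activation score aᵀ[h_v ‖ O_{uv}h_u] therefore becomes aᵀ (g_v ⊕ g_v) [h_v ‖ O_{uv}h_u], which in general differs from the original value. The inputs to the score transform consistently (equivariantly) under the gauge, but the scalar score is invariant only if the scoring function itself is O(d)-invariant, for example one built purely from inner products.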

Design insight: True gauge invariance of attention requires the score function to be invariant under O(d) rotations. The simplest such function is the inner product h_vᵀ O_{uv} h_u (no concatenation). SheafAN uses concatenation-based attention (like GAT) which is gauge-equivariant but not invariant; the paper notes this as a limitation and a direction for improvement.
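The difference between the two score functions can be checked numerically. The sketch below applies a random gauge transformation and compares a concatenation-based score with the inner-product score; the helper names (`random_orthogonal`, `concat_score`, `inner_score`) are hypothetical, the LeakyReLU is omitted for brevity, and the random orthogonal matrices stand in for learned maps.

```python
import numpy as np

def random_orthogonal(rng, d):
    # QR of a Gaussian matrix gives an orthogonal factor Q.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def concat_score(h_v, O_uv, h_u, a):
    # Pre-activation GAT-style score on [h_v || O_uv h_u].
    return a @ np.concatenate([h_v, O_uv @ h_u])

def inner_score(h_v, O_uv, h_u):
    # Inner-product score h_v^T O_uv h_u.
    return h_v @ (O_uv @ h_u)

rng = np.random.default_rng(1)
d = 3
h_u, h_v = rng.normal(size=d), rng.normal(size=d)
O_uv = random_orthogonal(rng, d)
a = rng.normal(size=2 * d)

# Gauge transformation: h_w -> g_w h_w and O_uv -> g_v O_uv g_u^{-1}.
g_u, g_v = random_orthogonal(rng, d), random_orthogonal(rng, d)
h_u2, h_v2 = g_u @ h_u, g_v @ h_v
O_uv2 = g_v @ O_uv @ g_u.T  # g_u^{-1} = g_u^T for orthogonal g_u

concat_before = concat_score(h_v, O_uv, h_u, a)
concat_after = concat_score(h_v2, O_uv2, h_u2, a)
inner_before = inner_score(h_v, O_uv, h_u)
inner_after = inner_score(h_v2, O_uv2, h_u2)

print("concat score changes:", not np.isclose(concat_before, concat_after))
print("inner score invariant:", np.isclose(inner_before, inner_after))
```

The transported message itself is equivariant (it picks up exactly g_v), which is why the inner product with h_v, which also picks up g_v, cancels the gauge.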

Relation to Standard GAT

Setting all restriction maps to the identity, O_{uv} = I, recovers GAT-style aggregation (standard GAT additionally applies a shared linear map W to the features before attention):

SheafAN with O_{uv} = I → h_v^{new} = σ( Σ_{u ∈ N(v)} α_{uv} h_u ) = GAT

SheafAN can also be viewed as NSD with attention: it replaces the diffusion-based aggregation with attention-based aggregation over transported messages.

Orthogonal Map Learning

The restriction maps O_{u▷e} ∈ O(d) are learned using the Cayley parameterisation:

O = (I − A)(I + A)⁻¹ , A = −Aᵀ (skew-symmetric)

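An MLP predicts the strictly lower-triangular entries of A (a skew-symmetric d×d matrix has d(d−1)/2 free entries). The Cayley map is a differentiable map from ℝ^{d(d−1)/2} into O(d), which enables end-to-end training. Its image consists of rotations (determinant +1) that do not have −1 as an eigenvalue, so reflections are not reachable through this parameterisation.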

For d=2: A = [[0, a], [−a, 0]] gives O = [[cos θ, −sin θ], [sin θ, cos θ]] with tan(θ/2) = a. The map reduces to learning a single angle per restriction map.
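A minimal NumPy sketch of the Cayley parameterisation follows. The helper names `skew_from_params` and `cayley` are hypothetical, and the random parameter vector stands in for whatever the MLP would predict; the example checks orthogonality and works through the d=2 case.

```python
import numpy as np

def skew_from_params(params, d):
    """Build a skew-symmetric matrix from d(d-1)/2 free parameters."""
    A = np.zeros((d, d))
    A[np.tril_indices(d, k=-1)] = params  # fill the strictly lower triangle
    return A - A.T                        # antisymmetrise: A^T = -A

def cayley(A):
    """Cayley map O = (I - A)(I + A)^{-1} for skew-symmetric A."""
    I = np.eye(A.shape[0])
    # (I - A) and (I + A)^{-1} commute, so solving (I + A) X = (I - A)
    # yields the same matrix without forming an explicit inverse.
    return np.linalg.solve(I + A, I - A)

# General case: d = 4 needs 4 * 3 / 2 = 6 parameters.
rng = np.random.default_rng(0)
d = 4
params = rng.normal(size=d * (d - 1) // 2)
O = cayley(skew_from_params(params, d))

# d = 2 case: a single parameter a, rotation angle theta = 2 * arctan(a).
a = 0.5
A2 = np.array([[0.0, a], [-a, 0.0]])
O2 = cayley(A2)
theta = 2 * np.arctan(a)

print(np.allclose(O @ O.T, np.eye(d)))  # orthogonality check
print(O2, np.cos(theta), np.sin(theta))
```

I + A is always invertible for real skew-symmetric A (its eigenvalues are 1 plus a purely imaginary number), so the linear solve does not fail for any parameter values.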

Multi-Head Attention

SheafAN supports multi-head attention: K independent heads, each with its own orthogonal maps {O_{uv}^{(k)}} and attention vectors {a^{(k)}}:

h_v^{new} = Concat_{k=1}^{K} ( σ( Σ_{u ∈ N(v)} α_{uv}^{(k)} O_{uv}^{(k)} h_u ) )

With K heads, the output dimension is Kd. Multi-head SheafAN provides K different relational perspectives on each edge — each head can learn a different rotation to represent the edge relationship.
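The concatenation over heads can be illustrated with a compact single-head update applied K times. As before this is a sketch under assumptions: `single_head` and `multi_head` are hypothetical names, σ is taken to be tanh, and random orthogonal matrices stand in for the learned per-head maps O_{uv}^{(k)}.

```python
import numpy as np

def single_head(h_v, h_nbrs, O_rel, a, slope=0.2):
    # Transport, score, softmax, aggregate for one head.
    msgs = np.einsum("kij,kj->ki", O_rel, h_nbrs)
    pairs = np.concatenate([np.tile(h_v, (len(msgs), 1)), msgs], axis=1)
    z = pairs @ a
    scores = np.where(z > 0, z, slope * z)  # LeakyReLU
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return np.tanh(alpha @ msgs)

def multi_head(h_v, h_nbrs, O_heads, a_heads):
    # Each head has its own maps and attention vector; outputs are concatenated.
    return np.concatenate(
        [single_head(h_v, h_nbrs, O, a) for O, a in zip(O_heads, a_heads)]
    )

rng = np.random.default_rng(0)
d, n_nbrs, K = 2, 5, 3
h_v = rng.normal(size=d)
h_nbrs = rng.normal(size=(n_nbrs, d))
O_heads = np.array([
    [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(n_nbrs)]
    for _ in range(K)
])                                      # shape (K, n_nbrs, d, d)
a_heads = rng.normal(size=(K, 2 * d))   # shape (K, 2d)

out = multi_head(h_v, h_nbrs, O_heads, a_heads)
print(out.shape)  # (K * d,)
```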

Empirical Results

Node classification on heterophilic benchmarks:

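| Model | Cornell | Texas | Wisconsin | Chameleon |
|---|---|---|---|---|
| GAT | 54.3 | 58.4 | 49.4 | 60.5 |
| NSD-orth | 85.0 | 88.4 | 86.0 | 70.2 |
| SheafAN (d=2) | 86.2 | 89.1 | 86.8 | 71.3 |
| SheafAN (d=4) | 87.1 | 89.7 | 87.5 | 72.0 |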

In this table SheafAN outperforms NSD-orth on all four heterophilic datasets, by 0.7 to 1.2 points for d=2 and 1.3 to 2.1 points for d=4, suggesting that the attention mechanism adds value beyond the sheaf structure alone.

On homophilic datasets (Cora, Citeseer): SheafAN matches GAT and NSD, confirming that attention does not hurt on homophilic tasks.

Comparison: SheafAN vs NSD vs GAT

| Property | GAT | NSD | SheafAN |
|---|---|---|---|
| Restriction maps | None (identity) | General/diagonal/orth | Orthogonal |
| Aggregation | Attention-weighted | Sheaf-Laplacian (uniform) | Attention over transported messages |
| Gauge equivariance | No | Partial (diagonal/orth) | Yes (orthogonal) |
| Heterophily handling | Partial (signed attention) | Yes (via maps) | Yes (via maps + attention) |
| Parameters per edge | d (attention vector) | d² or d | d(d−1)/2 + d (maps + attention) |

Limitations

  1. Gauge invariance gap: The concatenation-based attention score is gauge-equivariant but not invariant. A truly gauge-invariant score would require inner-product attention: e_{uv} = h_vᵀ O_{uv} h_u.
  2. Orthogonal maps only: SheafAN restricts to orthogonal maps for gauge equivariance; general maps (as in NSD) are excluded. This limits expressiveness for non-gauge-symmetric tasks.
  3. No feature rescaling: Orthogonal maps preserve norms and therefore cannot scale features. For tasks where feature magnitude matters, diagonal or general maps may outperform orthogonal ones.

References

  • Barbero, F., Bodnar, C., de Ocáriz Borde, H. S., Bronstein, M., Veličković, P., & Liò, P. (2022). Sheaf Attention Networks. NeurIPS 2022 Workshop.
  • Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph Attention Networks. ICLR 2018 (GAT: the attention mechanism SheafAN extends with transported messages and orthogonal restriction maps).
  • Bodnar, C., Di Giovanni, F., Chamberlain, B. P., Liò, P., & Bronstein, M. M. (2022). Neural Sheaf Diffusion. NeurIPS 2022 (NSD: the predecessor architecture whose orthogonal map parameterisation SheafAN inherits).