GraphSAGE: Inductive Learning on Large Graphs

3 minute read

Published:

TL;DR: GraphSAGE (SAmple and aggreGatE) learns to aggregate features from a sampled subset of neighbours. Because it learns the aggregation function (not per-node embeddings), it generalises to new nodes never seen during training — making it inductive.

The Inductive vs. Transductive Distinction

Transductive GNNs (GCN, GAT): learn embeddings for the specific nodes in the training graph. If you add a new node tomorrow, you have to re-train — or at least run another forward pass with the full adjacency matrix.

Inductive GNNs (GraphSAGE): learn a function that maps a node’s local neighbourhood to an embedding. Apply this function to any neighbourhood — seen or unseen — to get an embedding.

This matters enormously in practice:

  • Pinterest uses GraphSAGE to embed new pins (items) in real-time as users upload them.
  • Social networks onboard new users continuously — their profiles must be embedded immediately.

The Algorithm

For each node v at each layer k:

1. SAMPLE: S_v = random sample of min(K, |N(v)|) neighbours
2. AGG:    agg_v = AGGREGATE({ h_u^(k-1) : u ∈ S_v })
3. UPDATE: h_v^k = σ( W^k · concat(h_v^(k-1), agg_v) )
4. NORM:   h_v^k = h_v^k / ||h_v^k||₂

The key novelty: concatenate the node’s own previous representation with the aggregated neighbourhood representation, then apply a shared learned W. This ensures the node retains its own identity while incorporating neighbour information.

Full neighbourhood of v v n1 n2 n3 n4 n5 n6 Too expensive to use all 6! sample K=2 Sampled neighbourhood (K=2) v n2 sampled ✓ n5 sampled ✓ n1 n6 Only 2 neighbours needed per node! AGGREGATE({h_n2, h_n5}) → concat with h_v → W → new h_v
Figure 1: GraphSAGE samples K=2 neighbours instead of using all 6. The sampled neighbours' features are aggregated, concatenated with v's own features, then transformed via W. Same W works for any node.

Aggregator Choices

GraphSAGE offers three built-in aggregators:

AggregatorFormulaProperties
Meanmean({h_u : u ∈ S})Fast, size-invariant, similar to GCN
Max-poolingmax(σ(W·h_u)) per dimCaptures extreme features
LSTMLSTM on random order of SHighest capacity, non-symmetric

The LSTM aggregator technically violates permutation invariance (LSTMs care about input order) — GraphSAGE handles this by randomly permuting neighbour order each training step, which empirically works well.

Mini-Batch Training

Because GraphSAGE uses neighbourhood sampling, it supports mini-batch training on arbitrarily large graphs:

  1. Sample a batch of target nodes.
  2. Sample their K-hop neighbourhoods (expanding the computation graph).
  3. Compute embeddings bottom-up: 0-hop → 1-hop → … → target nodes.
  4. Update W via backprop.

This is how Pinterest’s PinSage scales to graphs with billions of nodes and edges.

✅ Key Takeaways

  • GraphSAGE is inductive: learns an aggregation function, not per-node embeddings — generalises to new nodes.
  • Neighbourhood sampling (K neighbours per node) enables mini-batch training on billion-scale graphs.
  • Concatenates own representation + aggregated neighbourhood before the linear transform — preserving node identity.
  • Used in production at Pinterest, LinkedIn, and other platforms for real-time item/user embedding.