GraphSAGE: Inductive Learning on Large Graphs

Published: February 07, 2024

TL;DR: GraphSAGE (SAmple and aggreGatE) learns to aggregate features from a sampled subset of neighbours. Because it learns the aggregation function (not per-node embeddings), it generalises to new nodes never seen during training — making it inductive.

Image: GraphSAGE, inductive representation learning via neighbourhood sampling (Hamilton et al., 2017)

The Inductive vs. Transductive Distinction

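Transductive setups (for example GCN as originally trained, full-batch on one fixed graph): learn embeddings for the specific nodes in the training graph. If a new node arrives tomorrow, you have to retrain, or at least rerun a forward pass over the updated full adjacency matrix. GAT is often grouped here as well, although its attention weights are shared across nodes and it has also been evaluated in inductive settings.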

Inductive GNNs (GraphSAGE): learn a function that maps a node’s local neighbourhood to an embedding. Apply this function to any neighbourhood — seen or unseen — to get an embedding.

This matters enormously in practice:

Pinterest’s PinSage, a GraphSAGE-based recommender model, embeds pins (items), including new pins that were not part of the training graph.
Social networks onboard new users continuously — their profiles must be embedded immediately.

The Algorithm

For each node v at each layer k:

SAMPLE: S_v = random sample of min(K, |N(v)|) neighbours
AGG:    agg_v = AGGREGATE({ h_u^(k-1) : u ∈ S_v })
UPDATE: h_v^k = σ( W^k · concat(h_v^(k-1), agg_v) )
NORM:   h_v^k = h_v^k / ||h_v^k||₂

The key novelty: concatenate the node’s own previous representation with the aggregated neighbourhood representation, then apply a shared learned W. This ensures the node retains its own identity while incorporating neighbour information.
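The four steps above can be sketched for a single node in plain Python. This is a minimal illustration, not a reference implementation: the function names, the choice of ReLU for σ, the mean aggregator, and the weight values in W are all assumptions made for the example, and the code assumes the node has at least one neighbour.

```python
import math
import random

def sample_neighbours(neighbours, K, rng):
    """SAMPLE: pick min(K, |N(v)|) neighbours uniformly, without replacement."""
    return rng.sample(neighbours, min(K, len(neighbours)))

def mean_aggregate(vectors):
    """AGG: element-wise mean of the sampled neighbours' vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sage_layer(h_v, neighbour_vectors, W, K, rng):
    """One GraphSAGE-style update for a single node v (assumes >= 1 neighbour)."""
    sampled = sample_neighbours(neighbour_vectors, K, rng)
    agg = mean_aggregate(sampled)
    z = h_v + agg  # list concatenation: [h_v ; agg_v]
    # UPDATE: shared linear map followed by ReLU (assumed choice of sigma).
    out = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in W]
    # NORM: scale to unit L2 length, leaving an all-zero vector untouched.
    norm = math.sqrt(sum(x * x for x in out))
    return [x / norm for x in out] if norm > 0 else out

rng = random.Random(0)
h_v = [1.0, 0.0]
neighbours = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5],
              [0.2, 0.8], [0.9, 0.1], [0.3, 0.3]]
# Hypothetical weights: output dim 3, input dim 2 (own) + 2 (aggregated).
W = [[0.5, -0.2, 0.1, 0.4],
     [0.3, 0.8, -0.5, 0.2],
     [-0.1, 0.6, 0.7, 0.3]]

h_new = sage_layer(h_v, neighbours, W, K=2, rng=rng)
```

Note that W has twice as many input columns as the node's feature dimension, because the concatenation doubles the input width. Nothing in sage_layer refers to a node id, which is why the same W can be reused for any node.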

Figure 1: GraphSAGE samples K=2 neighbours instead of using all 6. The sampled neighbours' features are aggregated, concatenated with v's own features, then transformed via W. Same W works for any node.

Why Inductive Learning Matters: GCN and GAT, as typically trained full-batch, compute embeddings by propagating over one fixed adjacency matrix, so the embeddings they produce are tied to that graph and the model is tuned to its structure. GraphSAGE is instead trained on sampled local neighbourhoods and learns "how should a neighbourhood that looks like this be summarised?", which is a transferable pattern. This is the difference between memorising a map and learning to navigate any city.
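The contrast between storing per-node embeddings and learning a shared function can be made concrete with a toy sketch. The lookup table below stands in for any method that learns one vector per training node id; it is not a model of GCN or GAT specifically. All names and numbers are hypothetical.

```python
# Transductive-style: one learned vector per training node id (a lookup table).
embedding_table = {"a": [0.1, 0.9], "b": [0.8, 0.2]}  # hypothetical values

def embed_transductive(node_id):
    return embedding_table[node_id]  # KeyError for any id not seen in training

# Inductive-style: a shared function of features; node ids never enter the model.
W = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]  # hypothetical shared weights

def embed_inductive(features, neighbour_features):
    n = len(neighbour_features)
    agg = [sum(col) / n for col in zip(*neighbour_features)]
    z = features + agg  # concatenate own features with the neighbourhood mean
    return [sum(w * x for w, x in zip(row, z)) for row in W]

try:
    embed_transductive("new_node")
    lookup_ok = True
except KeyError:
    lookup_ok = False  # the table has no entry for an unseen node

# The function only needs features, so an unseen node is not a special case.
new_embedding = embed_inductive([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

The lookup fails for the unseen id, while the feature-based function returns an embedding as long as the new node comes with features and at least one neighbour.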

Concrete Example: Embedding a New Node at Inference Time

Suppose we trained GraphSAGE on a product graph. A new product P is uploaded tonight with features h_P = [0.8, 0.3, 0.1] and two existing similar products as neighbours: n₁ = [0.7, 0.4, 0.2], n₂ = [0.6, 0.5, 0.1].

Without retraining:

Sample: S_P = {n₁, n₂} (both neighbours, K=2)
Aggregate (mean): agg_P = ([0.7,0.4,0.2] + [0.6,0.5,0.1]) / 2 = [0.65, 0.45, 0.15]
Concatenate + transform: h_P_new = σ(W · [0.8, 0.3, 0.1, 0.65, 0.45, 0.15])
Normalise to unit sphere.

The resulting embedding places P near existing products with similar features and neighbourhoods, so it is ready for recommendation without retraining or updating W.
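The arithmetic of this example can be checked with a few lines of Python. The article does not give W or specify σ, so the 2 x 6 weight matrix and the use of a sigmoid below are assumptions made purely so the sketch runs end to end; only the feature vectors and the mean come from the example.

```python
import math

h_P = [0.8, 0.3, 0.1]
n1 = [0.7, 0.4, 0.2]
n2 = [0.6, 0.5, 0.1]

# Aggregate (mean): approximately [0.65, 0.45, 0.15].
agg_P = [(a + b) / 2 for a, b in zip(n1, n2)]

# Concatenate own features with the aggregate: a 6-dimensional input.
z = h_P + agg_P

# Hypothetical trained weights (2 x 6); not taken from the article.
W = [[0.2, -0.1, 0.4, 0.3, 0.1, -0.2],
     [-0.3, 0.5, 0.1, 0.2, -0.4, 0.6]]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Transform, then normalise to unit L2 length.
activated = [sigmoid(sum(w * x for w, x in zip(row, z))) for row in W]
norm = math.sqrt(sum(x * x for x in activated))
h_P_new = [x / norm for x in activated]
```

No gradient step appears anywhere: embedding P is a single forward computation with weights that were fixed at training time.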

Aggregator Choices

GraphSAGE offers three built-in aggregators:

Aggregator	Formula	Properties
Mean	mean({h_u : u ∈ S})	Fast, size-invariant, similar to GCN
Max-pooling	max(σ(W·h_u)) per dim	Captures extreme features
LSTM	LSTM on random order of S	Highest capacity, non-symmetric

The LSTM aggregator technically violates permutation invariance (LSTMs care about input order) — GraphSAGE handles this by randomly permuting neighbour order each training step, which empirically works well.
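The difference between symmetric and order-sensitive aggregators can be checked directly. In the sketch below, the mean and max-pooling aggregators follow the table above (with a bias term b_pool added, as in the original paper's formulation, and ReLU assumed for σ). The third function is a deliberately simple recurrent update used as a stand-in for an LSTM; it is not an LSTM, it only shares the property of depending on input order. All weights are hypothetical.

```python
import math

def mean_aggregate(vectors):
    """Element-wise mean over the sampled neighbours."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def max_pool_aggregate(vectors, W_pool, b_pool):
    """Element-wise max over ReLU(W_pool . h_u + b_pool), one term per neighbour."""
    transformed = [
        [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
         for row, b in zip(W_pool, b_pool)]
        for h in vectors
    ]
    return [max(col) for col in zip(*transformed)]

def toy_recurrent_aggregate(vectors):
    """Order-sensitive stand-in for an LSTM (NOT an LSTM): s <- tanh(0.5*s + x)."""
    state = [0.0] * len(vectors[0])
    for h in vectors:
        state = [math.tanh(0.5 * s + x) for s, x in zip(state, h)]
    return state

def close(a, b):
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))

S = [[0.7, 0.4, 0.2], [0.6, 0.5, 0.1], [0.1, 0.9, 0.3]]
S_rev = list(reversed(S))
W_pool = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.4]]  # hypothetical pooling weights
b_pool = [0.0, 0.1]

mean_invariant = close(mean_aggregate(S), mean_aggregate(S_rev))
max_invariant = close(max_pool_aggregate(S, W_pool, b_pool),
                      max_pool_aggregate(S_rev, W_pool, b_pool))
recurrent_invariant = close(toy_recurrent_aggregate(S),
                            toy_recurrent_aggregate(S_rev))
```

Here mean_invariant and max_invariant are True while recurrent_invariant is False, which is the behaviour that random permutation during training is meant to smooth over for the LSTM aggregator.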

Mini-Batch Training

Because GraphSAGE uses neighbourhood sampling, it supports mini-batch training on arbitrarily large graphs:

Sample a batch of target nodes.
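Sample their multi-hop neighbourhoods, one hop per layer with at most K neighbours per node (expanding the computation graph).
Compute embeddings bottom-up: start from the raw features of the outermost sampled nodes and apply one layer per hop, moving inward until the target nodes are reached.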
Update W via backprop.
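The first two steps, picking target nodes and expanding their sampled neighbourhoods, can be sketched as follows. This is an illustrative sketch under assumptions: the graph is a small adjacency dict, sampling is uniform without replacement, and the function name and the per-hop fanouts list are hypothetical. The forward pass and the gradient update are left out.

```python
import random

def sample_computation_graph(adj, batch, fanouts, rng):
    """Expand a batch of target nodes hop by hop.

    Returns (layers, sampled_edges) where layers[0] is the set of target
    nodes and layers[i] is every node whose vector is needed i hops out.
    """
    layers = [set(batch)]
    sampled_edges = []
    frontier = set(batch)
    for K in fanouts:
        hop_edges = {}
        # A node also needs its own previous-layer vector, so keep the frontier.
        next_frontier = set(frontier)
        for v in sorted(frontier):  # sorted for reproducible sampling order
            nbrs = adj.get(v, [])
            chosen = rng.sample(nbrs, min(K, len(nbrs)))
            hop_edges[v] = chosen
            next_frontier.update(chosen)
        sampled_edges.append(hop_edges)
        layers.append(next_frontier)
        frontier = next_frontier
    return layers, sampled_edges

# Toy undirected graph as an adjacency dict (hypothetical data).
adj = {
    0: [1, 2, 3, 4],
    1: [0, 2, 5],
    2: [0, 1, 6, 7],
    3: [0],
    4: [0, 5],
    5: [1, 4, 6],
    6: [2, 5, 7],
    7: [2, 6],
}

rng = random.Random(0)
layers, sampled_edges = sample_computation_graph(adj, batch=[0, 1],
                                                 fanouts=[2, 2], rng=rng)
```

Each hop adds at most K new nodes per frontier node, so the number of nodes touched per batch is bounded by the batch size and the fanouts rather than by the size of the whole graph. Embeddings would then be computed from layers[-1] inward using the sampled edges of each hop.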

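Pinterest’s PinSage, which builds on GraphSAGE, uses this kind of sampled mini-batch training (with neighbours chosen by random walks rather than uniformly) to scale to a graph with billions of nodes and edges.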

✅ Key Takeaways

GraphSAGE is inductive: learns an aggregation function, not per-node embeddings — generalises to new nodes.
Neighbourhood sampling (K neighbours per node) enables mini-batch training on billion-scale graphs.
Concatenates own representation + aggregated neighbourhood before the linear transform — preserving node identity.
Used in production at Pinterest, LinkedIn, and other platforms for real-time item/user embedding.

Alessio Borgi
