Temporal Graph Networks: Learning from Events

4 minute read

Published:

TL;DR: TGN (Rossi et al., 2020) processes a continuous stream of interaction events. Each node maintains a memory vector s_v that is updated by a memory updater (GRU-based) when v participates in an event. When node embeddings are needed, a temporal graph attention module aggregates from recent neighbours using time-aware features. This combines persistent memory with structural context.

TGN’s Design Philosophy

TGN separates two concerns:

  1. Memory: long-term history of a node’s interactions, stored as a fixed-size vector s_v
  2. Embedding: current structural context, computed by aggregating recent neighbours

This separation allows efficient online updates (only memory changes on each event) while preserving rich structural context when embeddings are needed.

The TGN Components

1. Memory Module

Each node v has a memory state s_v ∈ ℝ^{d_s}. Initially s_v = 0.

When an interaction (u, v, t, e_{uv}) occurs:

  • Node u interacts with node v at time t with edge features e_{uv}

Raw messages are computed for both interacting nodes:

m_u(t) = MSG_s( s_u(t^-), s_v(t^-), Δt, e_{uv} ) m_v(t) = MSG_d( s_v(t^-), s_u(t^-), Δt, e_{uv} )

Where s_u(t^-) is u’s memory just before the event, and Δt = t - t_last_u is the time since u’s last interaction.

Memory update (GRU):

s_u(t) = MEM( m_u(t), s_u(t^-) ) s_v(t) = MEM( m_v(t), s_v(t^-) )

The GRU’s gating mechanism naturally handles the trade-off between old memory and new information.

2. Temporal Graph Attention (Embedding Module)

When we need node u’s embedding at time t (for inference), we aggregate from temporal neighbours:

Time encoding: encode elapsed time as a feature using random Fourier features or learnable frequencies:

φ(Δt) = [cos(ω₁ Δt + b₁), ..., cos(ω_d Δt + b_d)]

This gives the model a sense of recency — events further in the past have lower encoded similarity to the current time.

Temporal attention over neighbours:

h_u(t) = AGG( s_u(t), { (s_v(t), e_{uv}, φ(t - t_{uv})) : (v, t_{uv}) ∈ N_k(u, t) } )

Where N_k(u, t) is the k most recent temporal neighbours of u before time t.

Why time encoding matters: An interaction 1 second ago should have more influence than one from 1 year ago. Time encoding φ(Δt) provides this recency signal explicitly. The model learns to weight recent interactions more heavily for tasks where recency is informative (e.g., click prediction), or weight them equally for tasks where history matters uniformly.

Given node embeddings h_u(t) and h_v(t), compute interaction probability:

p(u, v, t) = σ( MLP( [h_u(t) || h_v(t)] ) )

Trained with binary cross-entropy + negative sampling (sample random non-interacting pairs as negatives).

The Full TGN Loop

For each event (u, v, t, e_{uv}) in the stream:
  1. Retrieve memories s_u(t^-), s_v(t^-)
  2. Compute messages m_u, m_v
  3. Update memories: s_u(t), s_v(t)
  4. When evaluation needed:
     a. Compute temporal embeddings h_u(t), h_v(t)
     b. Predict p(u, v, t)

Variants and Ablations

TGN subsumes several prior architectures:

ArchitectureMemoryEmbedding
DeepCoevolveGRUNone (memory = embedding)
JODIERNNLinear projection
DyRepNoneTemporal attention
TGATNoneTemporal graph attention
TGN-attnGRU memoryTemporal graph attention

TGN-attn (full TGN) outperforms all ablations on link prediction benchmarks (Wikipedia, Reddit, MOOC, LastFM).

Inductive Capability

TGN naturally handles new nodes: when a previously unseen node v appears, its memory is initialised to 0. After its first few interactions, the memory accumulates history. This is the key advantage over transductive methods that require all nodes at training time.

Batch Processing and Training

Online event processing is not directly compatible with batched GPU training. TGN uses a trick: process events in chronological order within batches, carrying memory states forward. Memory updates are detached from the computation graph to avoid BPTT over the full event history — only the embedding computation and the final prediction are differentiated.

Summary

ComponentPurpose
Memory s_vLong-term history (fixed-size, GRU-updated)
Raw messagesPer-event information for memory update
Time encodingRecency signal for temporal attention
Temporal attentionStructural context from recent neighbours
Link decoderInteraction probability from embeddings

TGN is the standard baseline for continuous-time dynamic graph link prediction. Its modular design (memory + embedding + decoder) allows ablation studies and component swapping — making it a useful research framework as well as a practical model.

References