ML Blog

Activation Functions in Neural Networks: Why Non-Linearity Matters

Activation functions are the reason neural networks can model curved decision boundaries instead of co...

⏱ 5 min activation-functions relu

Start Here · Overview

Transformers: The Architecture That Changed AI

A self-contained guide to the Transformer — the engine behind GPT, BERT, and modern AI. Learn how attention replaces recurrence and why every major AI system uses it.

📖 5 min read The complete picture in one post

🧩 Core Components

🧩

The Transformer Block: Putting It All Together

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable uni...

⏱ 5 min transformer-block architecture

🧠

Feed-Forward Networks: The Forgotten Half of Transformers

The FFN block holds two-thirds of a Transformer's parameters and does most of its factual recall. Yet ...

⏱ 5 min FFN MLP

➕

Residual Connections: Why Transformers Can Be Deep

Without residual connections, training a 96-layer Transformer would be practically impossible. The ski...

⏱ 4 min residual skip-connections

📊

Layer Normalization in Transformers

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep...

⏱ 4 min layer-norm batch-norm

🏛️

Encoder vs Decoder vs Encoder-Decoder Transformers

BERT, GPT, and T5 are all Transformers — but their architectures are fundamentally different. One comp...

⏱ 5 min BERT GPT

🔗

Cross-Attention: How Models Attend to Another Sequence

Cross-attention lets one sequence query information from a completely different sequence. It is the br...

⏱ 4 min attention cross-attention

🎭

Attention Masks: Causal, Padding, and Bidirectional

The difference between GPT, BERT, and T5 is largely a masking decision. Learn how causal, padding, and...

⏱ 5 min attention masking

🔍

Query, Key, Value: The Intuition Behind QKV

Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and ret...

⏱ 4 min attention QKV

⚖️

Scaled Dot-Product Attention: Why the √d Matters

Dividing by √d_k is not just a trick — it prevents softmax from saturating and dying in high-dimension...

⏱ 4 min attention scaling

👁️

Multi-Head Attention: Many Eyes on the Data

One attention head sees one relationship. Multiple heads running in parallel let the model capture syn...

⏱ 4 min attention multi-head

🔍

Self-Attention: Teaching Machines to Focus

Self-attention is the core of every Transformer. Learn how Query, Key, and Value vectors let every tok...

⏱ 4 min attention mechanism

📐 Positional Encodings

🔑

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppr...

⏱ 8 min positional-encoding rope

🔭

LongRoPE: Extending Context to 2 Million Tokens

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimensi...

⏱ 5 min RoPE LongRoPE

🧶

YaRN: Yet Another RoPE Extension Method

YaRN combines NTK scaling for high-frequency dimensions with linear interpolation for low-frequency on...

⏱ 5 min RoPE YaRN

📡

NTK-Aware Scaling: Extending Context Without Fine-Tuning

NTK-Aware Scaling extends the context window of RoPE-based models by rescaling frequencies using Neura...

⏱ 5 min RoPE NTK

📏

ALiBi: Attention with Linear Biases

ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from atte...

⏱ 3 min positional-encoding alibi

🔄

RoPE: Rotary Position Embeddings

RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clev...

⏱ 5 min positional-encoding rope

↔️

Relative Positional Encodings: It's All About Distance

Instead of asking 'where am I?', relative PEs ask 'how far are these two tokens apart?' Shaw et al. an...

⏱ 4 min positional-encoding relative

🎓

Learned Positional Encodings: Data-Driven Position

Instead of a fixed formula, why not just train position embeddings from scratch — like word embeddings...

⏱ 3 min positional-encoding learned

〰️

Sinusoidal Positional Encodings: The Original Solution

The PE method from the 2017 'Attention Is All You Need' paper uses sine and cosine waves at different ...

⏱ 4 min positional-encoding sinusoidal

📐

Positional Encodings: Why Position Matters

Transformers see all tokens at once — which means without help they'd treat 'cat ate mouse' and 'mouse...

⏱ 4 min positional-encoding overview

🌊

FoPE: Fourier Position Embedding for Length Generalization

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only st...

⏱ 5 min FoPE positional-encoding

🪜

Position Interpolation: Extending RoPE with Minimal Fine-Tuning

Position Interpolation rescales positions before applying RoPE so a model trained on short contexts ca...

⏱ 5 min RoPE position-interpolation

🧭

XPos: Length-Extrapolatable Rotary Embeddings

XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitu...

⏱ 4 min XPos RoPE

🌀

p-RoPE: What Makes Rotary Positional Encodings Useful?

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it...

⏱ 5 min p-RoPE RoPE

Start Here · Overview

Graph Neural Networks: Learning on Graphs

Graphs are everywhere — molecules, social networks, road maps, knowledge bases. Graph Neural Networks learn from this relational structure by propagating information between connected nodes. Here's the compl...

📖 5 min read The complete picture in one post

📊 Graph Fundamentals

📋

The Graph Adjacency Matrix: A Graph in Matrix Form

Before understanding GNNs, you need to understand how graphs are represented mathematically. The adjac...

⏱ 3 min graph adjacency-matrix

The Graph Laplacian: Spectral Graph Theory Explained Simply

The Graph Laplacian is L = D - A. Its eigenvectors reveal the graph's community structure; its eigenva...

⏱ 5 min graph laplacian

🔵

What Is a Graph? Nodes, Edges, Features, and Labels

A graph is a set of nodes connected by edges — but the power of GNNs comes from the features attached ...

⏱ 4 min graph-theory nodes

↔️

Directed, Undirected, Weighted, and Heterogeneous Graphs

Not all graphs are equal. Directed edges, edge weights, multiple node/edge types — each variant requir...

⏱ 4 min graph-types directed

🔄

Homophily vs Heterophily: When Neighbours Are Similar or Different

Most GNNs assume nearby nodes are similar — the homophily assumption. When this breaks (heterophilic g...

⏱ 5 min homophily heterophily

Graph Tasks: Node, Edge, and Graph-Level Prediction

GNNs can predict at three levels: properties of individual nodes, existence or type of edges, or prope...

⏱ 4 min node-classification link-prediction

〰️

Graph Fourier Transform: The Spectral View of Graphs

The Graph Fourier Transform decomposes a signal on a graph into frequency components using the Laplaci...

⏱ 5 min spectral Fourier

🏗️ Architectures

📨

Message Passing: The Universal GNN Framework

Every GNN — GCN, GAT, GraphSAGE, GIN — is a special case of message passing. Learn the three-step loop...

⏱ 4 min message-passing mpnn

🔵

GCN: Graph Convolutional Networks

GCN (Kipf & Welling, 2016) is the 'hello world' of GNNs. It simplifies spectral graph convolution into...

⏱ 4 min gcn spectral

GAT: Graph Attention Networks

GCN assigns the same (degree-based) weight to every neighbour. GAT learns which neighbours actually ma...

⏱ 4 min gat attention

GraphSAGE: Inductive Learning on Large Graphs

GCN and GAT learn embeddings for fixed graphs — add a new node and you're stuck. GraphSAGE (Hamilton e...

⏱ 4 min graphsage inductive

GIN: Graph Isomorphism Network — The Most Expressive GNN

How powerful can a GNN be? Xu et al. (2019) answered with a theoretical bound — and GIN is the archite...

⏱ 5 min gin expressiveness

📐

ChebNet: Spectral Graph Convolutions via Chebyshev Polynomials

ChebNet avoids the expensive full eigendecomposition by approximating spectral filters with Chebyshev ...

⏱ 5 min ChebNet spectral

SGC: Simple Graph Convolution

SGC removes all nonlinearities between GCN layers and collapses the entire propagation into a single p...

⏱ 4 min SGC simple

📊

APPNP: Personalized PageRank Meets Graph Neural Networks

APPNP decouples feature transformation from propagation. A neural network transforms features first; t...

⏱ 4 min APPNP PageRank

Graph Transformers: Bringing Attention to Graphs

Graph Transformers replace or augment local message passing with full pairwise attention — every node ...

⏱ 5 min graph-transformer attention

🏆

Graphormer: Transformers with Structural Biases for Graphs

Graphormer encodes graph structure directly into Transformer attention via three biases: node centrali...

⏱ 5 min Graphormer graph-transformer

📬

MPNN: The General Message Passing Neural Network Framework

The MPNN framework (Gilmer et al., 2017) unifies GCN, GAT, GIN, GraphSAGE, and almost all spatial GNNs...

⏱ 5 min MPNN message-passing

🔬 Expressivity & Limitations

🔬

The Weisfeiler-Lehman Test: How Powerful Are GNNs?

The 1-WL graph isomorphism test provides the exact upper bound on message-passing GNN expressivity. GI...

⏱ 5 min WL-test expressivity

🌫️

Oversmoothing: When All Node Embeddings Become the Same

Stack enough GNN layers and all node embeddings converge to the same vector — making the model useless...

⏱ 5 min oversmoothing depth

🚱

Oversquashing: When Too Much Information Passes Through Bottlenecks

Oversquashing occurs when exponentially many node features must be compressed into a fixed-size embedd...

⏱ 5 min oversquashing bottleneck

⚖️

Over-smoothing vs Over-squashing: The Difference

Oversmoothing and oversquashing are both problems with deep GNNs, but they affect different nodes, hav...

⏱ 4 min oversmoothing oversquashing

📏

Depth in GNNs: Why Deeper Is Not Always Better

In Transformers, depth = expressiveness. In GNNs, depth = both expressiveness AND over-smoothing. The ...

⏱ 4 min depth GNN

📍 Graph Positional & Structural Encodings

📍

Why GNNs Need Positional Encodings

Message-passing GNNs are permutation-equivariant by design — they cannot assign unique positions to no...

⏱ 4 min positional-encoding structural-encoding

🧮

Laplacian Eigenvectors as Graph Positional Encodings

The k smallest eigenvectors of the graph Laplacian form a natural positional embedding space — the gra...

⏱ 5 min Laplacian eigenvectors

🚶

Random Walk Positional Encodings

Random walk positional encodings encode each node's structural context by computing the probability of...

⏱ 4 min random-walk RWPE

🗺️

Shortest-Path Encodings for Graph Transformers

Shortest-path distances between nodes can be encoded as attention biases or node features — directly i...

⏱ 4 min shortest-path distance-encoding

🗂️

Structural vs Positional Encodings in Graphs

Positional encodings say where a node is in the graph. Structural encodings say what role it plays. Th...

⏱ 4 min structural-encoding positional-encoding

Sign Ambiguity in Laplacian Eigenvectors

Laplacian eigenvectors are only defined up to sign: if u is an eigenvector, so is -u. This seemingly m...

⏱ 4 min sign-ambiguity LapPE

🧺 Pooling & Graph-Level Learning

🧺

Global Pooling in GNNs: Mean, Sum, and Max

To predict a property of an entire graph, node embeddings must be aggregated into a single vector. The...

⏱ 4 min pooling readout

🔽

DiffPool: Learning Hierarchical Graph Pooling

DiffPool learns to hierarchically cluster nodes into super-nodes across layers — like a convolutional ...

⏱ 5 min diffpool hierarchical-pooling

🏆

TopKPool and SAGPool: Sparse Graph Pooling

Instead of soft cluster assignment (DiffPool), TopKPool and SAGPool select a subset of the most import...

⏱ 4 min topkpool sagpool

Set2Set and Attention Readout: Order-Invariant Graph Summaries

Mean and sum readout treat all nodes equally. Attention readout learns which nodes matter most for a g...

⏱ 4 min set2set attention-readout

🗂️

Graph Classification: From Node Embeddings to Graph Embeddings

Graph classification is the task of predicting a label for an entire graph. It requires composing mess...

⏱ 4 min graph-classification readout

🎨 Heterogeneous & Relational Graphs

🎨

Heterogeneous Graphs: When Nodes and Edges Have Types

Most real-world graphs are heterogeneous — they contain multiple node types (users, items, tags) and e...

⏱ 4 min heterogeneous-graph relational

🔗

R-GCN: Relational Graph Convolutional Networks

R-GCN extends GCN to multi-relational graphs by learning a separate weight matrix for each relation ty...

⏱ 4 min R-GCN relational

🎗️

HAN: Heterogeneous Graph Attention Networks

HAN combines meta-path decomposition with two levels of attention: node-level attention weights neighb...

⏱ 5 min HAN heterogeneous

🧠

Knowledge Graph Embeddings vs GNNs

Knowledge graph completion can be solved with shallow KG embeddings (TransE, DistMult, ComplEx) or wit...

⏱ 5 min knowledge-graph TransE

⏰

Temporal Knowledge Graphs: Facts That Change Over Time

Most knowledge graphs treat facts as timeless — but facts change. Barack Obama was president from 2009...

⏱ 4 min temporal-KG TKG

🌊 Dynamic & Temporal Graphs

🌊

Static vs Dynamic Graphs: When Structure Changes Over Time

Most GNN research assumes a fixed graph. Real graphs evolve: edges appear and disappear, node features...

⏱ 4 min dynamic-graph temporal

Temporal Graph Networks: Learning from Events

TGN (Temporal Graph Network) is the leading framework for continuous-time dynamic graphs. It maintains...

⏱ 5 min TGN temporal

∫

Graph Neural ODEs: Continuous-Time Graph Dynamics

Neural ODEs replace discrete layer-by-layer computation with continuous dynamics governed by a differe...

⏱ 4 min neural-ODE continuous-time

🗺️

Spatio-Temporal GNNs: Learning on Graphs Through Time

Spatio-temporal GNNs combine spatial message passing with temporal sequence modelling. They are the do...

⏱ 4 min spatio-temporal STGCN

🔮 Geometric & Equivariant GNNs

🔮

Why Geometry Matters in Graph Neural Networks

Many real-world graphs are embedded in 3D space — molecules, proteins, point clouds, crystal structure...

⏱ 4 min geometry 3D

🔄

Equivariance: What It Means and Why It Matters

Equivariance formalises the idea that a function should 'commute with symmetry transformations.' A rot...

⏱ 4 min equivariance invariance

⚛️

EGNN: E(n)-Equivariant Graph Neural Networks

EGNN achieves E(n)-equivariance with a simple update rule: positions updated via weighted sums of rela...

⏱ 4 min EGNN equivariant

SE(3)-Transformers: Attention with 3D Symmetry

SE(3)-Transformers extend self-attention to 3D point clouds and molecular graphs while maintaining SE(...

⏱ 5 min SE3-transformer equivariant

🌀

Tensor Field Networks and Geometric Deep Learning

Tensor Field Networks (TFN) were the first architecture to achieve SE(3) equivariance using spherical ...

⏱ 4 min TFN tensor-field-networks

💊

Molecular GNNs: Learning on Atoms and Bonds

Molecules are graphs. Molecular GNNs predict chemical properties from structure. The best models use 3...

⏱ 5 min molecular drug-discovery

🚀 Applications

🧪

GNNs for Molecules: Drug Discovery and Material Design

Graph neural networks are transforming computational drug discovery. Molecules are natural graphs, and...

⏱ 4 min molecules drug-discovery

GNNs for Recommender Systems

Recommendation is naturally a graph problem: users and items are nodes, interactions are edges. GNNs o...

⏱ 4 min recommender-systems collaborative-filtering

👥

GNNs for Social Networks: Influence, Communities, and Misinformation

Social networks are large sparse graphs with rich node features (user profiles) and heterogeneous edge...

⏱ 4 min social-network community-detection

🚦

GNNs for Traffic Forecasting

Traffic prediction is a canonical spatio-temporal graph task: sensors on roads form a fixed graph, and...

⏱ 4 min traffic forecasting