ML Blog
Welcome to my research blog, structured like a library of books. Each book covers a major AI topic; every chapter is a short, self-contained post you can read in 3 to 5 minutes. Start with the Start Here overview of any book, then dive into whichever chapters interest you most.
Transformers: The Architecture That Changed AI
A self-contained guide to the Transformer, the engine behind GPT, BERT, and modern AI. Learn how attention replaces recurrence and why every major AI system uses it.
Self-Attention: Teaching Machines to Focus
Self-attention is the core of every Transformer. Learn how Query, Key, and Value vectors let every tok...
Multi-Head Attention: Many Eyes on the Data
One attention head sees one relationship. Multiple heads running in parallel let the model capture syn...
Positional Encodings: Why Position Matters
Transformers see all tokens at once, which means without help they'd treat 'cat ate mouse' and 'mouse...
Sinusoidal Positional Encodings: The Original Solution
The PE method from the 2017 'Attention Is All You Need' paper uses sine and cosine waves at different ...
Learned Positional Encodings: Data-Driven Position
Instead of a fixed formula, why not just train position embeddings from scratch, like word embeddings...
Relative Positional Encodings: It's All About Distance
Instead of asking 'where am I?', relative PEs ask 'how far are these two tokens apart?' Shaw et al. an...
RoPE: Rotary Position Embeddings
RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clev...
ALiBi: Attention with Linear Biases
ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from atte...
BERT: Bidirectional Transformers for Language Understanding
BERT flipped the script on language models: instead of predicting the next word left-to-right, it mask...
GPT: Autoregressive Language Modelling at Scale
GPT chose the opposite bet to BERT: decoder-only, left-to-right generation. From GPT-1 at 117M paramet...
T5: Every NLP Task as Text-to-Text
T5 (Text-to-Text Transfer Transformer) from Google Research unifies all NLP tasks under one interface:...
ViT: Vision Transformer - Images as Sequences of Patches
Dosovitskiy et al. (2020) asked: what if we just cut an image into patches and treat them like words? ...
Swin Transformer: Hierarchical Vision with Shifted Windows
ViT's global attention is expensive. Swin Transformer computes attention within local windows, then sh...
Looped Transformers: Thinking More with the Same Weights
What if instead of making the model wider, you ran the same block multiple times? Looped Transformers ...
Graph Neural Networks: Learning on Graphs
Graphs are everywhere: molecules, social networks, road maps, knowledge bases. Graph Neural Networks learn from this relational structure by propagating information between connected nodes. Here's the compl...
The Graph Adjacency Matrix: A Graph in Matrix Form
Before understanding GNNs, you need to understand how graphs are represented mathematically. The adjac...
The Graph Laplacian: Spectral Graph Theory Explained Simply
The Graph Laplacian is L = D - A. Its eigenvectors reveal the graph's community structure; its eigenva...
Message Passing: The Universal GNN Framework
Every GNN (GCN, GAT, GraphSAGE, GIN) is a special case of message passing. Learn the three-step loop...
GCN: Graph Convolutional Networks
GCN (Kipf & Welling, 2016) is the 'hello world' of GNNs. It simplifies spectral graph convolution into...
GAT: Graph Attention Networks
GCN assigns a fixed, degree-based weight to every neighbour. GAT learns which neighbours actually ma...
GraphSAGE: Inductive Learning on Large Graphs
GCN and GAT learn embeddings for fixed graphs: add a new node and you're stuck. GraphSAGE (Hamilton e...
GIN: Graph Isomorphism Network - The Most Expressive GNN
How powerful can a GNN be? Xu et al. (2019) answered with a theoretical bound, and GIN is the archite...
