Posts by Category

🚀 I’m always open to collaborate, exchange ideas or just talk about anything!

👨🏻‍💻 I’m eager to work with anyone who has great ideas, wants to keep learning, and likes to share their experience with others. Don’t hesitate to write to me if you’d like to offer your help or ask for mine on a project, research, paper idea, or a moonshot you’re cooking up.

👉 Email Me ✉️

basics

Output, Gated, and Special Activations: Softmax, GLU, SIREN, and More

14 minute read

Published: June 01, 2026

Not every activation is a hidden-layer curve. Some produce probabilities, some implement learned gates, some shrink values toward zero, and some are designed for very specialized settings such as implicit neural representations.

Modern Activation Functions: GELU, SiLU, Mish, and Smooth Gating

10 minute read

Published: June 01, 2026

Once ReLU became the default, researchers started asking a better question: can we keep the easy optimization while making the activation smoother, softer, and more expressive? This chapter covers the modern answers.

Activation Functions in Neural Networks: Why Non-Linearity Matters

12 minute read

Published: June 01, 2026

Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.

gnn

GNNs for Computer Vision: Scene Graphs and Beyond

6 minute read

Published: May 27, 2024

Computer vision tasks increasingly require relational reasoning — understanding how objects relate to each other, not just what they are. Scene graph generation, visual question answering, action recognition from skeletons, and 3D point cloud processing all benefit from GNN-based relational modelling.

GNNs for Robotics: Planning, Manipulation, and Multi-Agent Systems

6 minute read

Published: May 26, 2024

Robots interact with structured environments: objects have relationships, joints form kinematic chains, agents communicate through interaction graphs. GNNs encode these relational structures — enabling generalisation across object configurations, robot morphologies, and multi-agent scenarios.

GNNs for Knowledge Graphs: Reasoning and Completion

6 minute read

Published: May 25, 2024

Knowledge graphs encode human knowledge as typed entity-relation triples. GNNs enable structure-aware entity representation, multi-hop reasoning, knowledge base completion, and entity alignment — tasks that shallow embedding methods cannot fully solve.

GNNs for Traffic Forecasting

6 minute read

Published: May 24, 2024

Traffic prediction is a canonical spatio-temporal graph task: sensors on roads form a fixed graph, and speed/volume measurements evolve over time. GNNs capture spatial correlations between sensors; RNNs or convolutions capture temporal patterns. Together they achieve state-of-the-art traffic forecasting.

GNNs for Social Networks: Influence, Communities, and Misinformation

6 minute read

Published: May 23, 2024

Social networks are large sparse graphs with rich node features (user profiles) and heterogeneous edges (friendship, follow, retweet). GNNs predict user behaviour, detect communities, identify influential spreaders, and flag misinformation — tasks with significant real-world impact.

GNNs for Recommender Systems

5 minute read

Published: May 22, 2024

Recommendation is naturally a graph problem: users and items are nodes, interactions are edges. GNNs on bipartite user-item graphs capture higher-order collaborative filtering signals — friends of friends liked this — that matrix factorisation cannot represent.

GNNs for Molecules: Drug Discovery and Material Design

5 minute read

Published: May 21, 2024

Graph neural networks are transforming computational drug discovery. Molecules are natural graphs, and GNNs learn molecular representations that predict toxicity, solubility, binding affinity, and synthesis feasibility — tasks that previously required expensive laboratory experiments.

Polynomial Neural Sheaf Diffusion

6 minute read

Published: May 20, 2024

Polynomial Neural Sheaf Diffusion (PNSD) replaces the fixed diffusion operator (I - Δ_F) with a learnable polynomial of the Sheaf Laplacian. This gives the model spectral flexibility — it can learn to amplify or suppress different frequency components of the sheaf signal.

Equivariant Sheaf Neural Networks

6 minute read

Published: May 19, 2024

Sheaves with orthogonal restriction maps define a connection on the graph — a parallel transport structure over edges. This connects sheaf GNNs to differential geometry and enables equivariant processing of data with local coordinate frames at each node.

Sheaf Neural Networks and Heterophily

6 minute read

Published: May 18, 2024

Sheaf GNNs are the principled solution to heterophily: by learning per-edge maps that transform features before comparison, they can perform diffusion that converges within classes and diverges across classes — the exact opposite of standard GCN’s collapse.

Diagonal, Orthogonal, and General Sheaf Maps

6 minute read

Published: May 17, 2024

The restriction maps in a cellular sheaf can be constrained to different matrix classes: scalars, diagonal matrices, orthogonal matrices, or general matrices. Each class offers a different trade-off between expressivity and computational cost.

Neural Sheaf Diffusion: Learning Sheaves End-to-End

7 minute read

Published: May 16, 2024

Neural Sheaf Diffusion (Bodnar et al., 2022) learns the sheaf restriction maps from data using a neural network, then performs diffusion with the learned Sheaf Laplacian. This gives a principled, topology-grounded GNN that handles heterophily without heuristic fixes.

The Sheaf Laplacian: Spectral Theory for Sheaves

7 minute read

Published: May 15, 2024

The Sheaf Laplacian generalises the graph Laplacian by incorporating per-edge restriction maps. Its spectrum reveals how consistent data is under the sheaf. Sheaf diffusion with this Laplacian generalises GCN to handle heterophilic graphs.

What Is a Sheaf? From Topology to Graph Learning

7 minute read

Published: May 14, 2024

A sheaf is a mathematical object from algebraic topology that assigns vector spaces to cells and linear maps between them. On graphs, sheaves assign feature spaces to nodes and edges, with restriction maps encoding how node features relate across edges.

Why Message Passing Is Not Enough: The Case for Sheaves

6 minute read

Published: May 13, 2024

Standard message passing aggregates neighbour features and averages. On heterophilic graphs (where neighbours often disagree), this is harmful. Cellular sheaves provide a mathematically principled framework to model per-edge relationships between node features — going beyond mere averaging.

Molecular GNNs: Learning on Atoms and Bonds

6 minute read

Published: May 12, 2024

Molecules are graphs. Molecular GNNs predict chemical properties from structure. The best models use 3D coordinates and bond angles — not just connectivity.

Tensor Field Networks and Geometric Deep Learning

6 minute read

Published: May 11, 2024

Tensor Field Networks (TFN) were the first architecture to achieve SE(3) equivariance using spherical harmonics and Clebsch-Gordan tensor products. They laid the theoretical foundation for NequIP and MACE — the current state-of-the-art in equivariant molecular force fields.

SE(3)-Transformers: Attention with 3D Symmetry

6 minute read

Published: May 10, 2024

SE(3)-Transformers extend self-attention to 3D point clouds and molecular graphs while maintaining SE(3) equivariance. Attention weights are learned between node pairs; values are equivariant features built from spherical harmonics.

EGNN: E(n)-Equivariant Graph Neural Networks

6 minute read

Published: May 09, 2024

EGNN achieves E(n)-equivariance with a simple update rule: positions updated via weighted sums of relative position vectors, features updated via invariant distances. No spherical harmonics needed.

Equivariance: What It Means and Why It Matters

6 minute read

Published: May 08, 2024

Equivariance formalises the idea that a function should ‘commute with symmetry transformations.’ A rotation-equivariant model applied to rotated input gives the rotated output — no extra training needed. This is the foundation for geometric deep learning.

Why Geometry Matters in Graph Neural Networks

5 minute read

Published: May 07, 2024

Many real-world graphs are embedded in 3D space — molecules, proteins, point clouds, crystal structures. Standard GNNs ignore coordinates and only use connectivity. Geometric GNNs incorporate spatial positions and must respect physical symmetries.

Spatio-Temporal GNNs: Learning on Graphs Through Time

6 minute read

Published: May 06, 2024

Spatio-temporal GNNs combine spatial message passing with temporal sequence modelling. They are the dominant approach for traffic forecasting, weather prediction, and any task where measurements at sensor nodes evolve over time on a fixed graph.

Graph Neural ODEs: Continuous-Time Graph Dynamics

7 minute read

Published: May 05, 2024

Neural ODEs replace discrete layer-by-layer computation with continuous dynamics governed by a differential equation. Graph Neural ODEs apply this to graph data — treating node embeddings as a dynamical system evolving in continuous time.

Temporal Graph Networks: Learning from Events

6 minute read

Published: May 04, 2024

TGN (Temporal Graph Network) is the leading framework for continuous-time dynamic graphs. It maintains a per-node memory that is updated upon each interaction, enabling efficient inductive link prediction on event streams.

Static vs Dynamic Graphs: When Structure Changes Over Time

5 minute read

Published: May 03, 2024

Most GNN research assumes a fixed graph. Real graphs evolve: edges appear and disappear, node features drift, new nodes arrive. Dynamic graph learning addresses how to model and predict on graphs whose structure changes over time.

Temporal Knowledge Graphs: Facts That Change Over Time

5 minute read

Published: May 02, 2024

Most knowledge graphs treat facts as timeless — but facts change. Barack Obama was president from 2009 to 2017. Temporal Knowledge Graphs add timestamps to triples, requiring models to reason about what was true when.

Knowledge Graph Embeddings vs GNNs

6 minute read

Published: May 01, 2024

Knowledge graph completion can be solved with shallow KG embeddings (TransE, DistMult, ComplEx) or with structural GNNs (R-GCN, CompGCN). Each approach has different inductive biases and failure modes. Understanding when to use each is the central design decision for KG tasks.

HAN: Heterogeneous Graph Attention Networks

5 minute read

Published: April 30, 2024

HAN combines meta-path decomposition with two levels of attention: node-level attention weights neighbours along a meta-path, and semantic-level attention weights different meta-paths. This lets the model learn which relationships matter most for a given task.

R-GCN: Relational Graph Convolutional Networks

5 minute read

Published: April 29, 2024

R-GCN extends GCN to multi-relational graphs by learning a separate weight matrix for each relation type. It handles knowledge graphs with typed edges and powers both entity classification and link prediction tasks.

Heterogeneous Graphs: When Nodes and Edges Have Types

5 minute read

Published: April 28, 2024

Most real-world graphs are heterogeneous — they contain multiple node types (users, items, tags) and edge types (clicks, rates, authors). Standard GNNs treat all nodes and edges identically, making them blind to this type structure.

Graph Classification: From Node Embeddings to Graph Embeddings

6 minute read

Published: April 27, 2024

Graph classification is the task of predicting a label for an entire graph. It requires composing message passing (node embeddings), readout (graph embedding), and a classifier — and all three choices interact to determine model expressiveness.

Set2Set and Attention Readout: Order-Invariant Graph Summaries

6 minute read

Published: April 26, 2024

Mean and sum readout treat all nodes equally. Attention readout learns which nodes matter most for a given task. Set2Set goes further — it uses an LSTM to iteratively query the node set, producing richer graph representations than single-pass pooling.

TopKPool and SAGPool: Sparse Graph Pooling

6 minute read

Published: April 25, 2024

Instead of soft cluster assignment (DiffPool), TopKPool and SAGPool select a subset of the most important nodes, producing a smaller, sparser graph at each level. Hard selection is scalable but requires careful score learning.

DiffPool: Learning Hierarchical Graph Pooling

6 minute read

Published: April 24, 2024

DiffPool learns to hierarchically cluster nodes into super-nodes across layers — like a convolutional pyramid for graphs. Unlike flat global pooling, it captures multi-scale graph structure by differentiably assigning nodes to clusters.

Global Pooling in GNNs: Mean, Sum, and Max

5 minute read

Published: April 23, 2024

To predict a property of an entire graph, node embeddings must be aggregated into a single vector. The choice of global pooling — mean, sum, or max — is not arbitrary: each has distinct expressive power and fits different tasks.
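As a minimal sketch of why the pooling choice matters, the snippet below compares the three readouts on two toy graphs. The helper `global_pool` and the toy embeddings are hypothetical names and values chosen only for illustration; the second graph simply duplicates the nodes of the first.

```python
def global_pool(node_embeddings, mode):
    """Aggregate a list of node embedding vectors into one graph vector."""
    dim = len(node_embeddings[0])
    # Collect each feature dimension across all nodes.
    columns = [[h[d] for h in node_embeddings] for d in range(dim)]
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Two toy graphs (illustrative values): the large one repeats the small one's nodes.
small_graph = [[1.0, 0.0], [0.0, 1.0]]
large_graph = small_graph * 2

for mode in ("mean", "max", "sum"):
    same = global_pool(small_graph, mode) == global_pool(large_graph, mode)
    print(mode, "cannot tell them apart" if same else "distinguishes them")
```

In this toy case mean and max produce identical vectors for both graphs, while sum reflects the difference in node count.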

Sign Ambiguity in Laplacian Eigenvectors

5 minute read

Published: April 22, 2024

Laplacian eigenvectors are only defined up to sign: if u is an eigenvector, so is -u. This seemingly minor issue creates a fundamental problem for learning with LapPE. Here is the problem, its consequences, and how SignNet solves it.

Structural vs Positional Encodings in Graphs

5 minute read

Published: April 21, 2024

Positional encodings say where a node is in the graph. Structural encodings say what role it plays. They are complementary — and confusing them leads to poor design choices.

Shortest-Path Encodings for Graph Transformers

4 minute read

Published: April 20, 2024

Shortest-path distances between nodes can be encoded as attention biases or node features — directly informing the model about graph proximity without requiring message passing.

Random Walk Positional Encodings

5 minute read

Published: April 19, 2024

Random walk positional encodings capture each node’s structural context through the probability that a random walk starting at the node returns to it after k steps. They are a computationally efficient alternative to Laplacian eigenvectors with no sign ambiguity.
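A rough sketch of the idea, assuming an unweighted graph with no isolated nodes: the encoding of a node collects the diagonal entries of successive powers of the random walk matrix P = D⁻¹A. The function name `random_walk_pe` and the example graph are hypothetical, chosen only for illustration.

```python
def random_walk_pe(adj, num_steps):
    """Return, for each node, its return probabilities after 1..num_steps steps."""
    n = len(adj)
    # Row-normalise the adjacency matrix: P[i][j] = probability of stepping i -> j.
    P = [[adj[i][j] / sum(adj[i]) for j in range(n)] for i in range(n)]
    power = [row[:] for row in P]  # holds P^k, starting with k = 1
    encodings = [[] for _ in range(n)]
    for _ in range(num_steps):
        for i in range(n):
            encodings[i].append(power[i][i])  # probability of being back at i
        # Advance to the next power: P^(k+1) = P^k @ P.
        power = [
            [sum(power[i][k] * P[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return encodings


# Illustrative graph: a triangle 0-1-2 with a tail node 3 attached to node 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
pe = random_walk_pe(adj, num_steps=3)
```

Nodes 0 and 1 play the same structural role in this graph and receive identical encodings, while the tail node 3 gets a different one.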

Laplacian Eigenvectors as Graph Positional Encodings

6 minute read

Published: April 18, 2024

The eigenvectors of the graph Laplacian associated with its k smallest eigenvalues form a natural positional embedding space: the graph’s own coordinate system. They capture global structure, symmetry, and community membership.

Why GNNs Need Positional Encodings

5 minute read

Published: April 17, 2024

Message-passing GNNs are permutation-equivariant by design — they cannot assign unique positions to nodes. Without positional encodings, symmetric nodes are indistinguishable. Here is why that matters and how to fix it.

Depth in GNNs: Why Deeper Is Not Always Better

6 minute read

Published: April 16, 2024

In Transformers, depth = expressiveness. In GNNs, depth = both expressiveness AND over-smoothing. The optimal GNN depth is rarely more than 3-4 layers — fundamentally different from the hundreds of layers in modern LLMs.

Over-smoothing vs Over-squashing: The Difference

7 minute read

Published: April 15, 2024

Oversmoothing and oversquashing are both problems with deep GNNs, but they affect different nodes, have different causes, and require different fixes. Confusing them leads to applying the wrong solution.

Oversquashing: When Too Much Information Passes Through Bottlenecks

7 minute read

Published: April 14, 2024

Oversquashing occurs when exponentially many node features must be compressed into a fixed-size embedding through a bottleneck edge. It is the reason GNNs struggle with long-range dependencies — not just oversmoothing.

Oversmoothing: When All Node Embeddings Become the Same

7 minute read

Published: April 13, 2024

Stack enough GNN layers and all node embeddings converge to the same vector — making the model useless. Oversmoothing is not a training problem; it is a mathematical inevitability of iterated averaging.

The Weisfeiler-Lehman Test: How Powerful Are GNNs?

7 minute read

Published: April 12, 2024

The 1-WL graph isomorphism test provides the exact upper bound on message-passing GNN expressivity. GIN achieves this bound. Any pair of graphs that 1-WL cannot distinguish cannot be distinguished by any MPNN.

MPNN: The General Message Passing Neural Network Framework

6 minute read

Published: April 11, 2024

The MPNN framework (Gilmer et al., 2017) unifies GCN, GAT, GIN, GraphSAGE, and almost all spatial GNNs under one abstraction: message functions, aggregation, and update. Understanding MPNN means understanding the whole GNN family.

Graphormer: Transformers with Structural Biases for Graphs

6 minute read

Published: April 10, 2024

Graphormer encodes graph structure directly into Transformer attention via three biases: node centrality, spatial encoding (shortest paths), and edge encoding. It won the OGB-LSC 2021 competition on molecular property prediction.

Graph Transformers: Bringing Attention to Graphs

5 minute read

Published: April 09, 2024

Graph Transformers replace or augment local message passing with full pairwise attention — every node attends to every other node. This solves long-range dependencies and over-squashing at the cost of O(N²) computation.

APPNP: Personalized PageRank Meets Graph Neural Networks

5 minute read

Published: April 08, 2024

APPNP decouples feature transformation from propagation. A neural network transforms features first; then Personalized PageRank propagates the result. This enables deep propagation without over-smoothing.

SGC: Simple Graph Convolution

5 minute read

Published: April 07, 2024

SGC removes all nonlinearities between GCN layers and collapses the entire propagation into a single pre-computed matrix power. Surprisingly, it matches GCN on most benchmarks — revealing that nonlinearities between layers may be unnecessary.

ChebNet: Spectral Graph Convolutions via Chebyshev Polynomials

5 minute read

Published: April 06, 2024

ChebNet avoids the expensive full eigendecomposition by approximating spectral filters with Chebyshev polynomials, achieving O(K|E|) computation and spatial locality without sacrificing expressiveness.

Graph Fourier Transform: The Spectral View of Graphs

6 minute read

Published: April 05, 2024

The Graph Fourier Transform decomposes a signal on a graph into frequency components using the Laplacian’s eigenvectors. This spectral view is the mathematical foundation behind spectral GNNs like ChebNet and GCN.

Graph Tasks: Node, Edge, and Graph-Level Prediction

5 minute read

Published: April 04, 2024

GNNs can predict at three levels: properties of individual nodes, existence or type of edges, or properties of entire graphs. Each level requires a different output head and training setup.

Homophily vs Heterophily: When Neighbours Are Similar or Different

5 minute read

Published: April 03, 2024

Most GNNs assume nearby nodes are similar — the homophily assumption. When this breaks (heterophilic graphs), standard message passing hurts performance. Understanding this distinction is essential for modern GNN design.

Directed, Undirected, Weighted, and Heterogeneous Graphs

5 minute read

Published: April 02, 2024

Not all graphs are equal. Directed edges, edge weights, multiple node/edge types — each variant requires different GNN design choices.

What Is a Graph? Nodes, Edges, Features, and Labels

5 minute read

Published: April 01, 2024

A graph is a set of nodes connected by edges — but the power of GNNs comes from the features attached to nodes and edges, and the labels we want to predict.

GIN: Graph Isomorphism Network — The Most Expressive GNN

4 minute read

Published: February 08, 2024

How powerful can a GNN be? Xu et al. (2019) answered with a theoretical bound — and GIN is the architecture that achieves it. The secret: use sum aggregation and an MLP, not mean or max.

GraphSAGE: Inductive Learning on Large Graphs

4 minute read

Published: February 07, 2024

GCN and GAT learn embeddings for fixed graphs — add a new node and you’re stuck. GraphSAGE (Hamilton et al., 2017) learns an aggregation function instead, so it can generate embeddings for entirely new nodes at inference time.

GAT: Graph Attention Networks

4 minute read

Published: February 06, 2024

GCN assigns the same (degree-based) weight to every neighbour. GAT learns which neighbours actually matter — using attention coefficients on edges. More expressive, more interpretable.

GCN: Graph Convolutional Networks

4 minute read

Published: February 05, 2024

GCN (Kipf & Welling, 2016) is the ‘hello world’ of GNNs. It simplifies spectral graph convolution into a single elegant layer: normalised neighbourhood averaging with a learned linear transformation.

Message Passing: The Universal GNN Framework

4 minute read

Published: February 04, 2024

Every GNN — GCN, GAT, GraphSAGE, GIN — is a special case of message passing. Learn the three-step loop that defines them all: compute messages, aggregate, update.

The Graph Laplacian: Spectral Graph Theory Explained Simply

6 minute read

Published: February 03, 2024

The Graph Laplacian is L = D - A. Its eigenvectors reveal the graph’s community structure; its eigenvalues tell you how well-connected the graph is. It’s also the mathematical bridge from spectral theory to GNNs like GCN.
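To make the definition concrete, here is a small sketch that builds L = D - A for a toy graph in plain Python. The 4-node path graph and the helper `quadratic_form` are hypothetical, chosen only for illustration.

```python
# Hypothetical 4-node path graph 0-1-2-3, used only as an example.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

# Adjacency matrix A of the undirected, unweighted graph.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
    A[j][i] = 1

# The degree matrix D holds each node's degree on its diagonal.
degrees = [sum(row) for row in A]

# Graph Laplacian L = D - A.
L = [[(degrees[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

# Every row sums to zero, so the constant vector has eigenvalue 0.
row_sums = [sum(row) for row in L]


def quadratic_form(x):
    """x^T L x, which equals the sum over edges of (x_i - x_j)^2."""
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))
```

The quadratic form is zero for a constant signal and grows as the signal varies more across edges, which is one way to read the Laplacian as a smoothness measure.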

The Graph Adjacency Matrix: A Graph in Matrix Form

4 minute read

Published: February 02, 2024

Before understanding GNNs, you need to understand how graphs are represented mathematically. The adjacency matrix is the foundation — a simple grid that tells you which nodes are connected.

Graph Neural Networks: Learning on Graphs

5 minute read

Published: February 01, 2024

Graphs are everywhere — molecules, social networks, road maps, knowledge bases. Graph Neural Networks learn from this relational structure by propagating information between connected nodes. Here’s the complete picture.

research

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

10 minute read

Published: May 26, 2026

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.

PolyNSD: Polynomial Neural Sheaf Diffusion

11 minute read

Published: May 26, 2026

PolyNSD replaces the NSD propagation operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, achieving SOTA on homo- and heterophilic benchmarks with only diagonal restriction maps and dramatically lower memory usage.

Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

10 minute read

Published: May 26, 2026

Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.

HetSheaf: Heterogeneous Graphs Meet Cellular Sheaves

7 minute read

Published: May 26, 2026

HetSheaf encodes graph heterogeneity directly in the sheaf data structure — type-aware stalks and restriction maps conditioned on node and edge types — instead of specialised architectural components, achieving +2pp on HGB with 10× fewer parameters.

sheaf

SheafPool: Basis-Invariant Graph Readout for Sheaf Neural Networks

9 minute read

Published: May 27, 2026

SheafPool solves a key missing piece in sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.

transformers

FoPE: Fourier Position Embedding for Length Generalization

6 minute read

Published: May 29, 2026

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.

Position Interpolation: Extending RoPE with Minimal Fine-Tuning

5 minute read

Published: May 29, 2026

Position Interpolation rescales positions before applying RoPE so a model trained on short contexts can be adapted to longer ones with surprisingly little fine-tuning. It became the reference baseline for long-context RoPE extension.

XPos: Length-Extrapolatable Rotary Embeddings

5 minute read

Published: May 29, 2026

XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitude at long distance. It is one of the cleanest attempts to make rotary embeddings extrapolate better.

p-RoPE: What Makes Rotary Positional Encodings Useful?

6 minute read

Published: May 29, 2026

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.

LongRoPE: Extending Context to 2 Million Tokens

7 minute read

Published: May 26, 2026

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.

YaRN: Yet Another RoPE Extension Method

7 minute read

Published: May 26, 2026

YaRN combines NTK scaling for high-frequency dimensions with linear interpolation for low-frequency ones, plus a temperature correction — achieving better long-context performance with minimal fine-tuning.

NTK-Aware Scaling: Extending Context Without Fine-Tuning

7 minute read

Published: May 26, 2026

NTK-Aware Scaling extends the context window of RoPE-based models by rescaling frequencies using Neural Tangent Kernel theory — with no fine-tuning required.

The Transformer Block: Putting It All Together

7 minute read

Published: May 26, 2026

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

Feed-Forward Networks: The Forgotten Half of Transformers

8 minute read

Published: May 26, 2026

The FFN block holds roughly two-thirds of a Transformer block’s parameters and is credited with much of the model’s factual recall. Yet it is almost always overlooked in introductions to attention.

Residual Connections: Why Transformers Can Be Deep

7 minute read

Published: May 26, 2026

Without residual connections, training a 96-layer Transformer would be practically impossible. The skip connection is a simple addition that solves the vanishing gradient problem and enables arbitrary depth.

Layer Normalization in Transformers

7 minute read

Published: May 26, 2026

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.

Encoder vs Decoder vs Encoder-Decoder Transformers

7 minute read

Published: May 26, 2026

BERT, GPT, and T5 are all Transformers — but their architectures are fundamentally different. One comparison table clarifies the entire landscape.

Cross-Attention: How Models Attend to Another Sequence

6 minute read

Published: May 26, 2026

Cross-attention lets one sequence query information from a completely different sequence. It is the bridge between encoder and decoder, and the core of multimodal AI.

Attention Masks: Causal, Padding, and Bidirectional

6 minute read

Published: May 26, 2026

The difference between GPT, BERT, and T5 is largely a masking decision. Learn how causal, padding, and bidirectional masks shape what each token is allowed to see.

Query, Key, Value: The Intuition Behind QKV

6 minute read

Published: May 26, 2026

Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and retrieved content — a framework you already understand.

Scaled Dot-Product Attention: Why the √d Matters

5 minute read

Published: May 26, 2026

Dividing by √d_k is not just a trick — it prevents softmax from saturating and dying in high-dimensional spaces. Here’s the math and the intuition.
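The effect can be sketched numerically. The snippet below, which assumes query and key entries drawn from a standard normal distribution (an illustrative assumption, as are the sizes `d_k = 256` and `n_keys = 8`), compares softmax weights computed from raw and from scaled dot products.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


random.seed(0)
d_k = 256    # illustrative key dimension
n_keys = 8   # illustrative number of keys

q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

# Raw dot products have a spread that grows with sqrt(d_k).
raw_logits = [dot(q, k) for k in keys]
# Dividing by sqrt(d_k) brings the logits back to roughly unit scale.
scaled_logits = [x / math.sqrt(d_k) for x in raw_logits]

raw_weights = softmax(raw_logits)
scaled_weights = softmax(scaled_logits)

print("largest weight without scaling:", max(raw_weights))
print("largest weight with scaling:   ", max(scaled_weights))
```

Without scaling, the largest weight tends to sit close to 1, which leaves the other keys with near-zero weight and near-zero gradient; with scaling, the distribution is flatter.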

ALiBi: Attention with Linear Biases

4 minute read

Published: May 26, 2026

ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from attention scores. Zero extra parameters, excellent extrapolation. Press et al., 2022.
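A minimal sketch of the bias described above, assuming a causal (decoder-style) setting and a head count that is a power of two. The function names `alibi_slopes` and `alibi_bias` are hypothetical; the slope schedule follows the geometric sequence described by Press et al.

```python
def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Bias matrix added to attention scores before the softmax."""
    bias = []
    for i in range(seq_len):          # query position
        row = []
        for j in range(seq_len):      # key position
            if j > i:
                row.append(float("-inf"))      # causal mask: no looking ahead
            else:
                row.append(-slope * (i - j))   # linear penalty on distance
        bias.append(row)
    return bias


slopes = alibi_slopes(8)
bias = alibi_bias(seq_len=3, slope=slopes[0])
```

Each head uses a different slope, so some heads focus on nearby tokens while others keep a longer view. Nothing here is learned, which matches the zero-extra-parameters claim.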

RoPE: Rotary Position Embeddings

5 minute read

Published: May 26, 2026

RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clever result: absolute encoding produces relative attention for free — and it’s now the dominant PE for large language models.
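The relative-position property can be checked on a single 2D pair of dimensions. This is a sketch under simplifying assumptions: one rotation frequency `theta = 0.1` and hand-picked vectors, all chosen only for illustration. A real model applies many frequencies across pairs of dimensions.

```python
import math


def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def rope_score(q, k, pos_q, pos_k, theta=0.1):
    """Attention logit after rotating q and k by angles proportional to position."""
    q_rot = rotate(q, pos_q * theta)
    k_rot = rotate(k, pos_k * theta)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]


q = (1.0, 2.0)
k = (0.5, -1.0)

score_a = rope_score(q, k, pos_q=3, pos_k=1)      # offset of 2, early in the sequence
score_b = rope_score(q, k, pos_q=103, pos_k=101)  # same offset, much later
score_c = rope_score(q, k, pos_q=3, pos_k=2)      # different offset
```

`score_a` and `score_b` agree up to floating-point error because only the offset between the two positions survives in the dot product, while `score_c` differs.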

Relative Positional Encodings: It’s All About Distance

4 minute read

Published: May 26, 2026

Instead of asking ‘where am I?’, relative PEs ask ‘how far apart are these two tokens?’ Shaw et al. and T5 both use this idea to build models that generalise better to variable-length inputs.

Learned Positional Encodings: Data-Driven Position

3 minute read

Published: May 26, 2026

Instead of a fixed formula, why not just train position embeddings from scratch — like word embeddings? That’s exactly what BERT and GPT-1 do. Here’s how and when it works.

Sinusoidal Positional Encodings: The Original Solution

4 minute read

Published: May 26, 2026

The PE method from the 2017 ‘Attention Is All You Need’ paper uses sine and cosine waves at different frequencies. Learn why this elegant choice encodes position without any training.
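For reference, a short sketch of the sine and cosine scheme from the 2017 paper, assuming an even model dimension. The function name `sinusoidal_pe` is a hypothetical choice for this example.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Positional encoding vector for one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions shares one frequency; wavelengths grow geometrically.
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even index
        pe.append(math.cos(angle))  # odd index
    return pe


first = sinusoidal_pe(0, 4)
later = sinusoidal_pe(5, 8)
```

The encoding is a fixed function of the position, so nothing is trained and any position can be encoded on demand.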

Positional Encodings: Why Position Matters

4 minute read

Published: May 26, 2026

Transformers see all tokens at once — which means without help they’d treat ‘cat ate mouse’ and ‘mouse ate cat’ the same. Positional encodings fix this. Here’s the full landscape.

Multi-Head Attention: Many Eyes on the Data

4 minute read

Published: May 26, 2026

One attention head sees one relationship. Multiple heads running in parallel let the model capture syntax, semantics, and coreference simultaneously — here’s how.

Self-Attention: Teaching Machines to Focus

6 minute read

Published: May 26, 2026

Self-attention is the core of every Transformer. Learn how Query, Key, and Value vectors let every token directly attend to every other — and why that matters.

Transformers: The Architecture That Changed AI

8 minute read

Published: May 26, 2026

A self-contained guide to the Transformer — the engine behind GPT, BERT, and modern AI. Learn how attention replaces recurrence and why every major AI system uses it.

Alessio Borgi
