Posts by Category


🚀 I’m always open to collaborate, exchange ideas or just talk about anything!

👨🏻‍💻 I’m eager to work with anyone who has great ideas, wants to keep learning, and likes to share their experience with others. Don’t hesitate to write to me if you’d like to offer your help or ask for mine on a project, research, a paper idea, or a moonshot you’re cooking up.

👉 Email Me ✉️


basics

Activation Functions in Neural Networks: Why Non-Linearity Matters

7 minute read

Published:

Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.

research

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

8 minute read

Published:

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.

PolyNSD: Polynomial Neural Sheaf Diffusion

7 minute read

Published:

PolyNSD replaces the NSD propagation operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, achieving SOTA on homo- and heterophilic benchmarks with only diagonal restriction maps and dramatically lower memory usage.

Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

5 minute read

Published:

Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.
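
As a rough illustration of why SLERP is preferred over linear interpolation here, the sketch below blends two vectors along the great circle between them. It covers only the two-vector case and is not the Z-SASLM pipeline itself; the function name `slerp` and the toy inputs are assumptions made for this example.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between two vectors (illustrative sketch)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    ua = [x / norm_a for x in a]  # project both inputs onto the unit sphere
    ub = [x / norm_b for x in b]
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    omega = math.acos(dot)  # angle between the two unit vectors
    if omega < 1e-8:
        # Nearly parallel vectors: plain linear interpolation is safe here.
        return [(1 - t) * x + t * y for x, y in zip(ua, ub)]
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(ua, ub)]

# Halfway between two orthogonal unit vectors.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
mid_norm = math.sqrt(sum(x * x for x in mid))
```

The midpoint stays on the unit sphere, whereas a linear midpoint of the same two vectors would shrink towards the origin.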

HetSheaf: Heterogeneous Graphs Meet Cellular Sheaves

5 minute read

Published:

HetSheaf encodes graph heterogeneity directly in the sheaf data structure — type-aware stalks and restriction maps conditioned on node and edge types — instead of specialised architectural components, achieving +2pp on HGB with 10× fewer parameters.

sheaf

SheafPool: Basis-Invariant Graph Readout for Sheaf Neural Networks

5 minute read

Published:

SheafPool solves a key missing piece in sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.

transformers

FoPE: Fourier Position Embedding for Length Generalization

4 minute read

Published:

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.

Position Interpolation: Extending RoPE with Minimal Fine-Tuning

4 minute read

Published:

Position Interpolation rescales positions before applying RoPE so a model trained on short contexts can be adapted to longer ones with surprisingly little fine-tuning. It became the reference baseline for long-context RoPE extension.
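
The rescaling step can be sketched in a few lines. This is a minimal illustration that assumes a simple linear scale factor of training length over target length; the function name `interpolate_positions` and the example lengths are hypothetical.

```python
def interpolate_positions(seq_len, train_len):
    """Squeeze positions 0..seq_len-1 into the range seen during training."""
    # Only shrink positions when the sequence is longer than the training context.
    scale = min(1.0, train_len / seq_len)
    return [p * scale for p in range(seq_len)]

# A model trained on 2048 tokens, evaluated on 8192 tokens.
positions = interpolate_positions(8192, 2048)
```

The rescaled, fractional positions are then fed to RoPE in place of the integer ones, so every rotation angle stays within the range the model saw during training.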

XPos: Length-Extrapolatable Rotary Embeddings

4 minute read

Published:

XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitude at long distance. It is one of the cleanest attempts to make rotary embeddings extrapolate better.

p-RoPE: What Makes Rotary Positional Encodings Useful?

4 minute read

Published:

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.

LongRoPE: Extending Context to 2 Million Tokens

6 minute read

Published:

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.

YaRN: Yet Another RoPE Extension Method

5 minute read

Published:

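YaRN interpolates RoPE frequencies piecewise ("NTK-by-parts"): high-frequency dimensions are left unscaled, low-frequency ones are linearly interpolated, and a ramp blends the two in between. Together with an attention temperature correction, this achieves better long-context performance with minimal fine-tuning.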

The Transformer Block: Putting It All Together

6 minute read

Published:

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

Residual Connections: Why Transformers Can Be Deep

5 minute read

Published:

Without residual connections, training a 96-layer Transformer would be practically impossible. The skip connection is a simple addition that mitigates the vanishing gradient problem and makes very deep stacks trainable.

Layer Normalization in Transformers

5 minute read

Published:

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.

Query, Key, Value: The Intuition Behind QKV

5 minute read

Published:

Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and retrieved content — a framework you already understand.

ALiBi: Attention with Linear Biases

3 minute read

Published:

ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from attention scores. Zero extra parameters, excellent extrapolation. Press et al., 2022.
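
A minimal sketch of the distance penalty, assuming a causal model and a number of heads that is a power of two (the case for which the paper gives a closed-form slope schedule). The helper names `alibi_slopes` and `alibi_bias` are made up for this example.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence, assuming num_heads is a power of two."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * distance, with future tokens masked."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
```

The bias is added to the raw query-key scores before the softmax. Heads with larger slopes focus on nearby tokens, while heads with smaller slopes can look further back.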

RoPE: Rotary Position Embeddings

4 minute read

Published:

RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clever result: absolute encoding produces relative attention for free — and it’s now the dominant PE for large language models.
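
The "relative attention for free" property can be checked numerically. The sketch below rotates consecutive pairs of dimensions and assumes the common base of 10000; the helper names and the toy vectors are illustrative only.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (3), very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

Both scores agree up to floating-point error, because the product of two rotations depends only on the difference between their angles.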

Relative Positional Encodings: It’s All About Distance

3 minute read

Published:

Instead of asking ‘where am I?’, relative PEs ask ‘how far are these two tokens apart?’ Shaw et al. and T5 both use this idea to build models that generalise better to variable-length inputs.

Learned Positional Encodings: Data-Driven Position

3 minute read

Published:

Instead of a fixed formula, why not just train position embeddings from scratch — like word embeddings? That’s exactly what BERT and GPT-1 do. Here’s how and when it works.

Sinusoidal Positional Encodings: The Original Solution

3 minute read

Published:

The PE method from the 2017 ‘Attention Is All You Need’ paper uses sine and cosine waves at different frequencies. Learn why this elegant choice encodes position without any training.
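
For reference, the formula can be sketched as follows, with sine on even dimensions and cosine on odd ones. The function name `sinusoidal_pe` is an assumption for this example, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))  # wavelength grows with dimension index
        pe += [math.sin(angle), math.cos(angle)]
    return pe

pe0 = sinusoidal_pe(0, 4)
pe1 = sinusoidal_pe(1, 4)
```

Nothing here is learned: the encoding for any position, including ones beyond the training length, follows directly from the formula.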

Positional Encodings: Why Position Matters

3 minute read

Published:

Transformers see all tokens at once — which means without help they’d treat ‘cat ate mouse’ and ‘mouse ate cat’ the same. Positional encodings fix this. Here’s the full landscape.

Multi-Head Attention: Many Eyes on the Data

2 minute read

Published:

One attention head sees one relationship. Multiple heads running in parallel let the model capture syntax, semantics, and coreference simultaneously — here’s how.

Self-Attention: Teaching Machines to Focus

4 minute read

Published:

Self-attention is the core of every Transformer. Learn how Query, Key, and Value vectors let every token directly attend to every other — and why that matters.
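
A minimal, unbatched sketch of scaled dot-product attention in plain Python, assuming the Q, K, and V projections have already been applied. The function names and the toy inputs are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # how much this token attends to each other token
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Two identical keys receive equal weight, so the output is the mean of the values.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
```

Each output row is a weighted average of the value vectors, with weights set by how well the query matches each key.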

Transformers: The Architecture That Changed AI

7 minute read

Published:

A self-contained guide to the Transformer — the engine behind GPT, BERT, and modern AI. Learn how attention replaces recurrence and why every major AI system uses it.