Posts by Tags

BERT

FFN

The Transformer Block: Putting It All Together

6 minute read

Published:

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

FoPE

FoPE: Fourier Position Embedding for Length Generalization

4 minute read

Published:

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.

Fourier

FoPE: Fourier Position Embedding for Length Generalization

4 minute read

Published:

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.

GPT

LLaMA

YaRN: Yet Another RoPE Extension Method

5 minute read

Published:

YaRN scales RoPE frequencies by band ("NTK-by-parts"): high-frequency dimensions are left unscaled, low-frequency dimensions are linearly interpolated, and those in between are blended. Together with an attention temperature correction, this gives better long-context performance with minimal fine-tuning.

LongRoPE

LongRoPE: Extending Context to 2 Million Tokens

6 minute read

Published:

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.

MLP

Microsoft

LongRoPE: Extending Context to 2 Million Tokens

6 minute read

Published:

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.

NTK

Post-LN

Layer Normalization in Transformers

5 minute read

Published:

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.

Pre-LN

Layer Normalization in Transformers

5 minute read

Published:

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.

QKV

Query, Key, Value: The Intuition Behind QKV

5 minute read

Published:

Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and retrieved content — a framework you already understand.

RoPE

FoPE: Fourier Position Embedding for Length Generalization

4 minute read

Published:

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.

Position Interpolation: Extending RoPE with Minimal Fine-Tuning

4 minute read

Published:

Position Interpolation rescales positions before applying RoPE so a model trained on short contexts can be adapted to longer ones with surprisingly little fine-tuning. It became the reference baseline for long-context RoPE extension.

XPos: Length-Extrapolatable Rotary Embeddings

4 minute read

Published:

XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitude at long distance. It is one of the cleanest attempts to make rotary embeddings extrapolate better.

p-RoPE: What Makes Rotary Positional Encodings Useful?

4 minute read

Published:

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.

LongRoPE: Extending Context to 2 Million Tokens

6 minute read

Published:

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.

YaRN: Yet Another RoPE Extension Method

5 minute read

Published:

YaRN scales RoPE frequencies by band ("NTK-by-parts"): high-frequency dimensions are left unscaled, low-frequency dimensions are linearly interpolated, and those in between are blended. Together with an attention temperature correction, this gives better long-context performance with minimal fine-tuning.

SwiGLU

T5

XPos

XPos: Length-Extrapolatable Rotary Embeddings

4 minute read

Published:

XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitude at long distance. It is one of the cleanest attempts to make rotary embeddings extrapolate better.

YaRN

YaRN: Yet Another RoPE Extension Method

5 minute read

Published:

YaRN scales RoPE frequencies by band ("NTK-by-parts"): high-frequency dimensions are left unscaled, low-frequency dimensions are linearly interpolated, and those in between are blended. Together with an attention temperature correction, this gives better long-context performance with minimal fine-tuning.

activation

activation-functions

Activation Functions in Neural Networks: Why Non-Linearity Matters

7 minute read

Published:

Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.

alibi

ALiBi: Attention with Linear Biases

3 minute read

Published:

ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from attention scores. Zero extra parameters, excellent extrapolation. Press et al., 2022.

architecture

The Transformer Block: Putting It All Together

6 minute read

Published:

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

Transformers: The Architecture That Changed AI

7 minute read

Published:

A self-contained guide to the Transformer, the engine behind GPT, BERT, and much of modern AI. Learn how attention replaces recurrence and why so many major AI systems are built on it.

attention

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

8 minute read

Published:

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.

The Transformer Block: Putting It All Together

6 minute read

Published:

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

Query, Key, Value: The Intuition Behind QKV

5 minute read

Published:

Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and retrieved content — a framework you already understand.

Multi-Head Attention: Many Eyes on the Data

2 minute read

Published:

One attention head sees one relationship. Multiple heads running in parallel let the model capture syntax, semantics, and coreference simultaneously — here’s how.

Self-Attention: Teaching Machines to Focus

4 minute read

Published:

Self-attention is the core of every Transformer. Learn how Query, Key, and Value vectors let every token directly attend to every other — and why that matters.

attention-bias

ALiBi: Attention with Linear Biases

3 minute read

Published:

ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from attention scores. Zero extra parameters, excellent extrapolation. Press et al., 2022.

batch-norm

Layer Normalization in Transformers

5 minute read

Published:

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.

beginner

Query, Key, Value: The Intuition Behind QKV

5 minute read

Published:

Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and retrieved content — a framework you already understand.

causal

context-length

LongRoPE: Extending Context to 2 Million Tokens

6 minute read

Published:

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.

YaRN: Yet Another RoPE Extension Method

5 minute read

Published:

YaRN scales RoPE frequencies by band ("NTK-by-parts"): high-frequency dimensions are left unscaled, low-frequency dimensions are linearly interpolated, and those in between are blended. Together with an attention temperature correction, this gives better long-context performance with minimal fine-tuning.

cross-attention

cvpr

Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

5 minute read

Published:

Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.

decoder

deep-learning

Transformers: The Architecture That Changed AI

7 minute read

Published:

A self-contained guide to the Transformer, the engine behind GPT, BERT, and much of modern AI. Learn how attention replaces recurrence and why so many major AI systems are built on it.

depth

Residual Connections: Why Transformers Can Be Deep

5 minute read

Published:

Without residual connections, training a 96-layer Transformer would be practically impossible. The skip connection is a simple addition that mitigates the vanishing gradient problem and makes very deep stacks trainable.

diffusion-models

Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

5 minute read

Published:

Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.

encoder

encoder-decoder

frequency-analysis

p-RoPE: What Makes Rotary Positional Encodings Useful?

4 minute read

Published:

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.

gelu

generative-ai

Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

5 minute read

Published:

Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.

glu

gradient-flow

Residual Connections: Why Transformers Can Be Deep

5 minute read

Published:

Without residual connections, training a 96-layer Transformer would be practically impossible. The skip connection is a simple addition that mitigates the vanishing gradient problem and makes very deep stacks trainable.

gradients

graph-classification

SheafPool: Basis-Invariant Graph Readout for Sheaf Neural Networks

5 minute read

Published:

SheafPool supplies a key missing piece of sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.

graph-neural-networks

PolyNSD: Polynomial Neural Sheaf Diffusion

7 minute read

Published:

PolyNSD replaces the NSD propagation operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, achieving SOTA on homo- and heterophilic benchmarks with only diagonal restriction maps and dramatically lower memory usage.

HetSheaf: Heterogeneous Graphs Meet Cellular Sheaves

5 minute read

Published:

HetSheaf encodes graph heterogeneity directly in the sheaf data structure — type-aware stalks and restriction maps conditioned on node and edge types — instead of specialised architectural components, achieving +2pp on HGB with 10× fewer parameters.

heterogeneous-graphs

SheafPool: Basis-Invariant Graph Readout for Sheaf Neural Networks

5 minute read

Published:

SheafPool supplies a key missing piece of sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.

HetSheaf: Heterogeneous Graphs Meet Cellular Sheaves

5 minute read

Published:

HetSheaf encodes graph heterogeneity directly in the sheaf data structure — type-aware stalks and restriction maps conditioned on node and edge types — instead of specialised architectural components, achieving +2pp on HGB with 10× fewer parameters.

interpolation

intuition

Query, Key, Value: The Intuition Behind QKV

5 minute read

Published:

Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and retrieved content — a framework you already understand.

key-value memory

latent-space

Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

5 minute read

Published:

Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.

layer-norm

The Transformer Block: Putting It All Together

6 minute read

Published:

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

Layer Normalization in Transformers

5 minute read

Published:

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.

learned

Learned Positional Encodings: Data-Driven Position

3 minute read

Published:

Instead of a fixed formula, why not just train position embeddings from scratch — like word embeddings? That’s exactly what BERT and GPT-1 do. Here’s how and when it works.

long-context

FoPE: Fourier Position Embedding for Length Generalization

4 minute read

Published:

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.

Position Interpolation: Extending RoPE with Minimal Fine-Tuning

4 minute read

Published:

Position Interpolation rescales positions before applying RoPE so a model trained on short contexts can be adapted to longer ones with surprisingly little fine-tuning. It became the reference baseline for long-context RoPE extension.

XPos: Length-Extrapolatable Rotary Embeddings

4 minute read

Published:

XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitude at long distance. It is one of the cleanest attempts to make rotary embeddings extrapolate better.

p-RoPE: What Makes Rotary Positional Encodings Useful?

4 minute read

Published:

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

8 minute read

Published:

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.

masking

mechanism

Self-Attention: Teaching Machines to Focus

4 minute read

Published:

Self-attention is the core of every Transformer. Learn how Query, Key, and Value vectors let every token directly attend to every other — and why that matters.

mish

multi-head

Multi-Head Attention: Many Eyes on the Data

2 minute read

Published:

One attention head sees one relationship. Multiple heads running in parallel let the model capture syntax, semantics, and coreference simultaneously — here’s how.

multimodal

neural-networks

Activation Functions in Neural Networks: Why Non-Linearity Matters

7 minute read

Published:

Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.

nlp

Transformers: The Architecture That Changed AI

7 minute read

Published:

A self-contained guide to the Transformer, the engine behind GPT, BERT, and much of modern AI. Learn how attention replaces recurrence and why so many major AI systems are built on it.

overview

Positional Encodings: Why Position Matters

3 minute read

Published:

Transformers see all tokens at once — which means without help they’d treat ‘cat ate mouse’ and ‘mouse ate cat’ the same. Positional encodings fix this. Here’s the full landscape.

p-RoPE

p-RoPE: What Makes Rotary Positional Encodings Useful?

4 minute read

Published:

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.

padding

polynomial-filters

PolyNSD: Polynomial Neural Sheaf Diffusion

7 minute read

Published:

PolyNSD replaces the NSD propagation operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, achieving SOTA on homo- and heterophilic benchmarks with only diagonal restriction maps and dramatically lower memory usage.

pooling

SheafPool: Basis-Invariant Graph Readout for Sheaf Neural Networks

5 minute read

Published:

SheafPool supplies a key missing piece of sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.

position-interpolation

Position Interpolation: Extending RoPE with Minimal Fine-Tuning

4 minute read

Published:

Position Interpolation rescales positions before applying RoPE so a model trained on short contexts can be adapted to longer ones with surprisingly little fine-tuning. It became the reference baseline for long-context RoPE extension.

positional-encoding

FoPE: Fourier Position Embedding for Length Generalization

4 minute read

Published:

FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.

Position Interpolation: Extending RoPE with Minimal Fine-Tuning

4 minute read

Published:

Position Interpolation rescales positions before applying RoPE so a model trained on short contexts can be adapted to longer ones with surprisingly little fine-tuning. It became the reference baseline for long-context RoPE extension.

XPos: Length-Extrapolatable Rotary Embeddings

4 minute read

Published:

XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitude at long distance. It is one of the cleanest attempts to make rotary embeddings extrapolate better.

p-RoPE: What Makes Rotary Positional Encodings Useful?

4 minute read

Published:

This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

8 minute read

Published:

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.

LongRoPE: Extending Context to 2 Million Tokens

6 minute read

Published:

LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.

YaRN: Yet Another RoPE Extension Method

5 minute read

Published:

YaRN scales RoPE frequencies by band ("NTK-by-parts"): high-frequency dimensions are left unscaled, low-frequency dimensions are linearly interpolated, and those in between are blended. Together with an attention temperature correction, this gives better long-context performance with minimal fine-tuning.

ALiBi: Attention with Linear Biases

3 minute read

Published:

ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from attention scores. Zero extra parameters, excellent extrapolation. Press et al., 2022.

RoPE: Rotary Position Embeddings

4 minute read

Published:

RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clever result: absolute encoding produces relative attention for free — and it’s now the dominant PE for large language models.

Relative Positional Encodings: It’s All About Distance

3 minute read

Published:

Instead of asking ‘where am I?’, relative PEs ask ‘how far are these two tokens apart?’ Shaw et al. and T5 both use this idea to build models that generalise better to variable-length inputs.

Learned Positional Encodings: Data-Driven Position

3 minute read

Published:

Instead of a fixed formula, why not just train position embeddings from scratch — like word embeddings? That’s exactly what BERT and GPT-1 do. Here’s how and when it works.

Sinusoidal Positional Encodings: The Original Solution

3 minute read

Published:

The PE method from the 2017 ‘Attention Is All You Need’ paper uses sine and cosine waves at different frequencies. Learn why this elegant choice encodes position without any training.

Positional Encodings: Why Position Matters

3 minute read

Published:

Transformers see all tokens at once — which means without help they’d treat ‘cat ate mouse’ and ‘mouse ate cat’ the same. Positional encodings fix this. Here’s the full landscape.

relative

Relative Positional Encodings: It’s All About Distance

3 minute read

Published:

Instead of asking ‘where am I?’, relative PEs ask ‘how far are these two tokens apart?’ Shaw et al. and T5 both use this idea to build models that generalise better to variable-length inputs.

relu

Activation Functions in Neural Networks: Why Non-Linearity Matters

7 minute read

Published:

Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.

residual

The Transformer Block: Putting It All Together

6 minute read

Published:

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

Residual Connections: Why Transformers Can Be Deep

5 minute read

Published:

Without residual connections, training a 96-layer Transformer would be practically impossible. The skip connection is a simple addition that mitigates the vanishing gradient problem and makes very deep stacks trainable.

rope

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

8 minute read

Published:

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.

RoPE: Rotary Position Embeddings

4 minute read

Published:

RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clever result: absolute encoding produces relative attention for free — and it’s now the dominant PE for large language models.

rotary

RoPE: Rotary Position Embeddings

4 minute read

Published:

RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clever result: absolute encoding produces relative attention for free — and it’s now the dominant PE for large language models.

scaling

sheaf-neural-networks

SheafPool: Basis-Invariant Graph Readout for Sheaf Neural Networks

5 minute read

Published:

SheafPool supplies a key missing piece of sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.

PolyNSD: Polynomial Neural Sheaf Diffusion

7 minute read

Published:

PolyNSD replaces the NSD propagation operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, achieving SOTA on homo- and heterophilic benchmarks with only diagonal restriction maps and dramatically lower memory usage.

HetSheaf: Heterogeneous Graphs Meet Cellular Sheaves

5 minute read

Published:

HetSheaf encodes graph heterogeneity directly in the sheaf data structure — type-aware stalks and restriction maps conditioned on node and edge types — instead of specialised architectural components, achieving +2pp on HGB with 10× fewer parameters.

sheafpool

SheafPool: Basis-Invariant Graph Readout for Sheaf Neural Networks

5 minute read

Published:

SheafPool supplies a key missing piece of sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.

sigmoid

Activation Functions in Neural Networks: Why Non-Linearity Matters

7 minute read

Published:

Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.

silu

sinusoidal

Sinusoidal Positional Encodings: The Original Solution

3 minute read

Published:

The PE method from the 2017 ‘Attention Is All You Need’ paper uses sine and cosine waves at different frequencies. Learn why this elegant choice encodes position without any training.

siren

skip-connections

Residual Connections: Why Transformers Can Be Deep

5 minute read

Published:

Without residual connections, training a 96-layer Transformer would be practically impossible. The skip connection is a simple addition that mitigates the vanishing gradient problem and makes very deep stacks trainable.

softmax

sparsemax

spectral-gnn

PolyNSD: Polynomial Neural Sheaf Diffusion

7 minute read

Published:

PolyNSD replaces the NSD propagation operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, achieving SOTA on homo- and heterophilic benchmarks with only diagonal restriction maps and dramatically lower memory usage.

style-transfer

Z-SASLM: Zero-Shot Style Blending via Spherical Interpolation

5 minute read

Published:

Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.

swiglu

swish

tanh

Activation Functions in Neural Networks: Why Non-Linearity Matters

7 minute read

Published:

Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.

training stability

Layer Normalization in Transformers

5 minute read

Published:

Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.

transformer-block

The Transformer Block: Putting It All Together

6 minute read

Published:

A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.

transformers

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

8 minute read

Published:

GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.