Encoder vs Decoder vs Encoder-Decoder Transformers
BERT, GPT, and T5 are all Transformers — but their architectures are fundamentally different. One comparison table clarifies the entire landscape.
The difference between GPT, BERT, and T5 is largely a masking decision. Learn how causal, padding, and bidirectional masks shape what each token is allowed to see.
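
The three mask types named above can be sketched in a few lines. This is a minimal illustration, not code from the post: the function names, the boolean convention (`True` means "query position i may attend to key position j"), and the `real_len` parameter are assumptions made for the example.

```python
def causal_mask(n):
    # GPT-style: position i sees only positions j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    # BERT-style: every position sees every other position.
    return [[True] * n for _ in range(n)]


def padding_mask(n, real_len):
    # Hide padded key positions j >= real_len from every query.
    return [[j < real_len for j in range(n)] for _ in range(n)]


def combine(a, b):
    # A position is visible only if both masks allow it.
    return [[x and y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Causal attention over a length-3 sequence whose last token is padding.
decoder_mask = combine(causal_mask(3), padding_mask(3, real_len=2))
```

In this sketch an encoder would use `bidirectional_mask` combined with `padding_mask`, while a decoder would use `causal_mask` combined with `padding_mask`.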
A single Transformer block combines attention, residuals, layer norm, and an FFN into one reusable unit. Understanding this block is understanding the Transformer.
The FFN block holds roughly two-thirds of the parameters in a standard Transformer block and is widely thought to store much of the model's factual knowledge. Yet it is almost always overlooked in introductions to attention.
FoPE rethinks long-context positional encoding from a frequency-domain perspective. Instead of only stretching RoPE heuristically, it explicitly improves attention’s periodic extension so Transformers generalize more gracefully to longer sequences.
YaRN combines NTK scaling for high-frequency dimensions with linear interpolation for low-frequency ones, plus a temperature correction — achieving better long-context performance with minimal fine-tuning.
LongRoPE (Microsoft, 2024) pushes RoPE-based context to 2M tokens by searching for optimal per-dimension rescaling factors — far outperforming NTK or YaRN at extreme lengths.
NTK-Aware Scaling extends the context window of RoPE-based models by rescaling the rotary frequencies, an idea motivated by Neural Tangent Kernel theory, with no fine-tuning required.
Layer norm is not optional plumbing. It determines training stability, gradient flow, and whether deep Transformers converge at all. Pre-LN vs Post-LN is not a detail — it changes training dynamics fundamentally.
Q, K, and V are not arbitrary labels. They map precisely onto search queries, database labels, and retrieved content — a framework you already understand.
Position Interpolation rescales positions before applying RoPE so a model trained on short contexts can be adapted to longer ones with surprisingly little fine-tuning. It became the reference baseline for long-context RoPE extension.
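
The rescaling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the `train_len` parameter are hypothetical, and the sketch only produces the rescaled position indices that would then be fed to RoPE.

```python
def interpolate_positions(seq_len, train_len):
    # Squeeze positions 0..seq_len-1 into the range seen during training.
    # Sequences no longer than train_len are left unchanged (scale = 1).
    scale = min(1.0, train_len / seq_len)
    return [p * scale for p in range(seq_len)]


# A model trained on 2048 positions, now run on 8192 tokens:
positions = interpolate_positions(8192, train_len=2048)
```

Every rescaled position stays below the training length, so the rotary angles remain in the range the model has already seen, at the cost of finer spacing between neighbouring tokens.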
XPos modifies RoPE with a multiplicative decay that keeps relative rotations while stabilising magnitude at long distance. It is one of the cleanest attempts to make rotary embeddings extrapolate better.
This paper does two things at once: it explains what RoPE is really doing inside a trained LLM, and it proposes p-RoPE, a partial rotary variant that drops the lowest frequencies to preserve stronger semantic channels.
Once ReLU became the default, researchers started asking a better question: can we keep the easy optimization while making the activation smoother, softer, and more expressive? This chapter covers the modern answers.
Activation functions are the reason neural networks can model curved decision boundaries instead of collapsing into one giant linear map. This chapter builds the intuition first, then walks through the classical functions that shaped deep learning.
ALiBi skips traditional positional embeddings entirely and just subtracts a distance penalty from attention scores. Zero extra parameters, excellent extrapolation. Press et al., 2022.
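
The distance penalty can be sketched as a bias matrix added to the attention scores before softmax. This is a minimal illustration: the function names are hypothetical, the slope schedule is the geometric sequence used for power-of-two head counts, and future positions are set to negative infinity here on the assumption of a causal decoder.

```python
def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., one slope per head
    # (assumes num_heads is a power of two).
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    # bias[i][j] = slope * (j - i) for past keys (j <= i), so the penalty
    # grows linearly with distance; future keys are masked out.
    return [[slope * (j - i) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
```

Because the bias depends only on the distance between query and key, nothing in it is tied to a maximum trained length.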
A self-contained guide to the Transformer — the engine behind GPT, BERT, and modern AI. Learn how attention replaces recurrence and why every major AI system uses it.
GAPE is a drop-in RoPE augmentation that adds content-aware attention logit biases: a query-gate suppresses irrelevant distant context while a key-gate preserves salient distant tokens. Provably sharper attention and improved long-context robustness — no architecture changes needed.
Cross-attention lets one sequence query information from a completely different sequence. It is the bridge between encoder and decoder, and the core of multimodal AI.
Dividing by √d_k is not just a trick — it prevents softmax from saturating and dying in high-dimensional spaces. Here’s the math and the intuition.
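
The saturation effect can be seen in a small experiment. This is a minimal sketch, assuming query and key entries drawn from a standard normal distribution; the helper names, the seed, and the sizes (`d_k = 256`, 8 keys) are arbitrary choices for illustration.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(q, keys, scale=True):
    d_k = len(q)
    scores = [dot(q, k) for k in keys]
    if scale:
        # Dot products of unit-variance vectors have variance about d_k,
        # so dividing by sqrt(d_k) brings them back to roughly unit scale.
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)


random.seed(0)
d_k = 256
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

unscaled = attention_weights(q, keys, scale=False)
scaled = attention_weights(q, keys, scale=True)
```

Comparing `max(unscaled)` with `max(scaled)` shows the unscaled softmax putting more of its mass on a single key, which is the regime where gradients through softmax become very small.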
One attention head sees one relationship. Multiple heads running in parallel let the model capture syntax, semantics, and coreference simultaneously — here’s how.
Self-attention is the core of every Transformer. Learn how Query, Key, and Value vectors let every token directly attend to every other — and why that matters.
Z-SASLM is a zero-shot, fine-tuning-free style blending pipeline that replaces linear latent interpolation with SLERP along the geodesic of the hypersphere, preserving latent manifold structure when blending multiple styles. Published at CVPR 2025 Workshop.
Without residual connections, training a 96-layer Transformer would be practically impossible. The skip connection is a simple addition that mitigates the vanishing gradient problem and makes very deep stacks trainable.
Not every activation is a hidden-layer curve. Some produce probabilities, some implement learned gates, some shrink values toward zero, and some are designed for very specialized settings such as implicit neural representations.
SheafPool solves a key missing piece in sheaf GNNs: graph-level pooling. Instead of averaging stalk vectors in arbitrary local bases, it aligns them into a shared canonical frame and builds a readout that is invariant to local basis changes.
PolyNSD replaces the NSD propagation operator with a degree-K Chebyshev polynomial in the normalised sheaf Laplacian, achieving SOTA on homo- and heterophilic benchmarks with only diagonal restriction maps and dramatically lower memory usage.
HetSheaf encodes graph heterogeneity directly in the sheaf data structure — type-aware stalks and restriction maps conditioned on node and edge types — instead of specialised architectural components, achieving +2pp on HGB with 10× fewer parameters.
Instead of a fixed formula, why not just train position embeddings from scratch — like word embeddings? That’s exactly what BERT and GPT-1 do. Here’s how and when it works.
Transformers see all tokens at once — which means without help they’d treat ‘cat ate mouse’ and ‘mouse ate cat’ the same. Positional encodings fix this. Here’s the full landscape.
RoPE encodes position by rotating query and key vectors by an angle proportional to position. The clever result: absolute encoding produces relative attention for free — and it’s now the dominant PE for large language models.
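
The "relative attention for free" property can be checked numerically. This is a minimal sketch, assuming the interleaved pairing convention (dimensions 0 and 1 form the first rotated pair, 2 and 3 the second, and so on) and the common base of 10000; the function names and example vectors are made up for illustration.

```python
import math


def rope(x, pos, base=10000.0):
    # Rotate each adjacent pair (x[i], x[i+1]) by an angle proportional
    # to the position; each pair gets its own frequency.
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

The two scores agree up to floating-point error, because rotating the query by position m and the key by position n leaves a dot product that depends only on m - n.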
Instead of asking ‘where am I?’, relative PEs ask ‘how far are these two tokens apart?’ Shaw et al. and T5 both use this idea to build models that generalise better to variable-length inputs.
The PE method from the 2017 ‘Attention Is All You Need’ paper uses sine and cosine waves at different frequencies. Learn why this elegant choice encodes position without any training.
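
The sine and cosine scheme can be sketched directly from its formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). This is a minimal illustration: the function name is hypothetical and `d_model` is assumed to be even.

```python
import math


def sinusoidal_pe(pos, d_model):
    # Even indices hold sines, odd indices hold cosines; the wavelength
    # grows geometrically with the dimension index.
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe


first = sinusoidal_pe(0, 4)
later = sinusoidal_pe(3, 8)
```

Nothing here is learned: the vector for any position is computed on demand, which is why this encoding needs no training.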