Output, Gated, and Special Activations: Softmax, GLU, SIREN, and More

Published: June 01, 2026

TL;DR: Many important activations are not “hidden-layer curves” at all. Softmax and sigmoid control outputs, GLU-style activations learn gates, shrinkage activations push values toward zero, and specialized activations such as SIREN or Gaussian RBFs are built for niche but powerful settings.

Output Activations Have a Different Job

Intuition First: In hidden layers, activations shape what the network thinks. At the output layer, activations shape what the network says. They must convert raw scores (logits) into the exact format the loss function expects. Using the wrong output activation is not just suboptimal: it violates the assumptions the loss is derived from, which can leave the loss ill-defined (for example, taking the log of a value outside (0, 1]) or quietly distort the gradients.

In hidden layers, activation functions mainly shape representation learning and gradient flow. At the output layer, they must match the task.

The three most important output cases

Task	Typical activation	Why
Binary classification	Sigmoid	Turns one logit into a probability in `(0, 1)`
Multi-class classification	Softmax	Converts logits into a probability distribution that sums to `1`
Regression	Linear / Identity	Leaves the output unconstrained

The softmax formula is:

\[ \operatorname{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \]

Concrete step-by-step: softmax on 3 logits

Say a 3-class classifier produces logits z = [2.0, 1.0, 0.1].

Step	Computation	Result
Exponentiate	e², e¹, e^0.1	7.389, 2.718, 1.105
Sum	7.389 + 2.718 + 1.105	11.212
Normalize	7.389/11.212, 2.718/11.212, 1.105/11.212	0.659, 0.242, 0.099
Check	0.659 + 0.242 + 0.099	= 1.000 ✓

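The logits (2.0, 1.0, 0.1) are now probabilities that sum to 1. They are normalized, not necessarily calibrated: calibration depends on how well the model's confidence matches its observed accuracy. Note that a 1-unit logit advantage multiplies the probability ratio by e ≈ 2.72 (0.659 / 0.242 ≈ 2.72), so the exponential produces a strong winner-take-most effect.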
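The table above can be reproduced in a few lines. This is a minimal sketch in plain Python rather than any particular library's implementation; the max-subtraction step is a standard numerical-stability trick that is not part of the formula itself but leaves the result unchanged.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])   # [0.659, 0.242, 0.099]
print(round(probs[0] / probs[1], 3))  # 2.718, i.e. e ** (2.0 - 1.0)
```

The second print shows the point made above: a logit gap of 1 corresponds to a probability ratio of e.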

Key Insight (temperature): Dividing logits by a temperature T before softmax controls sharpness. As T→0, softmax approaches argmax (a one-hot vector); as T→∞, it approaches the uniform distribution. This is why temperature scaling is a standard post-hoc calibration technique: the model's weights stay frozen, a single scalar T is fit on held-out data, and only the output distribution is reshaped.

The same logits [2.0, 1.0, 0.1] passed through softmax at three temperatures. Low T (left) collapses probability onto the top class — useful for greedy decoding. High T (right) spreads probability more evenly — useful for knowledge distillation. T=1 is the standard training setting.
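To make the temperature effect concrete, here is a small sketch in plain Python. The temperatures 0.5 and 5.0 are illustrative choices, not values taken from the figure, and the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature (temperature must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # stability shift, does not change the result
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 5.0):  # illustrative temperatures
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The top-class probability shrinks as T grows, while the ranking of the classes never changes: temperature reshapes confidence, not the argmax.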

Why Gated Activations Became So Important

Intuition First: A classic activation like ReLU asks one question of each neuron: "should this value pass?" A gated activation asks two neurons to collaborate: one produces content, the other produces a gate score, and the gate modulates how much of the content flows forward. This is the same multiplicative gating idea used in LSTMs and GRUs, applied at every feed-forward layer. Gated variants have been reported to outperform plain ReLU and GELU feed-forward layers in Transformers (Shazeer, 2020), which is a large part of why they spread.

Modern architectures often do not use a single scalar curve after an affine transform. Instead, they split the channel dimension and let one part gate another.

That gives you:

GLU: one linear branch gates another
SwiGLU: same idea, but with a SiLU/Swish-style gate
GeGLU: GELU gate
ReGLU: ReLU gate

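In the formulas below, a and b are the content and gate branches: two linear projections of the input x (equivalently, the two halves of one wider projection), and ⊗ is the element-wise product.

GLU \[ \operatorname{GLU}(x) = a \otimes \sigma(b) \]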

SwiGLU \[ \operatorname{SwiGLU}(x) = a \otimes \operatorname{SiLU}(b) \]

GeGLU / ReGLU \[ \operatorname{GeGLU}(x) = a \otimes \operatorname{GELU}(b), \qquad \operatorname{ReGLU}(x) = a \otimes \operatorname{ReLU}(b) \]

This family matters because large Transformers often rely more on gated feed-forward blocks than on plain ReLU-style MLPs.

Useful mental model: ReLU asks “should this neuron pass?” GLU-like activations ask “how strongly should this feature gate another feature?”

Worked example — GLU vs. plain linear, step by step:

Suppose an input vector is split into two halves: a = [1.2, −0.4, 0.8] (content) and b = [2.1, −1.5, 0.3] (gate input).

Step	GLU	Plain linear (no gate)
Compute gate	σ(b) = [0.89, 0.18, 0.57]	—
Element-wise product	a ⊗ σ(b) = [1.07, −0.07, 0.46]	a = [1.2, −0.4, 0.8]
Effect	The −0.4 signal is suppressed to −0.07 by a low gate value	−0.4 passes through unchanged

Here the gate value for the second feature is low (σ(b) = 0.18), so that feature is almost entirely suppressed. Because the gate is computed from the input, this decision is context-dependent; a plain linear layer has no such multiplicative mechanism.
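The worked example, and the other members of the family, can be sketched with one helper that takes the gate nonlinearity as an argument. This is a plain-Python illustration with hypothetical function names; `a` and `b` are taken as given, so the linear projections that would normally produce them are omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def gelu(x):
    # Exact GELU: x times the standard normal CDF of x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

def gated(a, b, gate_fn):
    """Element-wise a * gate_fn(b): content branch times gate branch."""
    return [ai * gate_fn(bi) for ai, bi in zip(a, b)]

a = [1.2, -0.4, 0.8]   # content branch
b = [2.1, -1.5, 0.3]   # gate input

print([round(v, 2) for v in gated(a, b, sigmoid)])  # GLU: [1.07, -0.07, 0.46]
swiglu_out = gated(a, b, silu)   # SwiGLU
geglu_out = gated(a, b, gelu)    # GeGLU
reglu_out = gated(a, b, relu)    # ReGLU
```

Only the gate function changes between variants; the content branch and the element-wise product stay the same.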

Animated GLU data-flow. The input is projected by two independent linear layers. The content branch (a) passes through unchanged. The gate branch (b) is squashed by sigmoid to produce per-channel gate values in (0,1). The element-wise product lets the gate selectively suppress or pass each content feature — all learned end-to-end.

Key Insight (SwiGLU in LLaMA-style FFN blocks): SwiGLU(x) = (W₁x) ⊗ SiLU(W₂x), followed by a final projection W₃. A standard MLP uses two weight matrices and a single activation; SwiGLU uses three (W₁, W₂, W₃). To keep the parameter budget equal, the hidden width is typically reduced to about two thirds of the MLP's, and at that matched budget Shazeer (2020) reports better perplexity. A common explanation is expressivity: the gate is its own learned linear projection of the input, not a fixed nonlinearity applied to the same pre-activation as the content. LLaMA and Mistral use SwiGLU in their feed-forward blocks, and Gemma uses the closely related GeGLU, instead of a plain GELU MLP.
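A rough sketch of such a block, with the parameter arithmetic behind the "same budget" comparison. The names `swiglu_ffn`, `mlp_params`, and `swiglu_params` are hypothetical, biases are ignored, and the sizes (`d_model = 512`, a 4x MLP expansion, a two-thirds hidden width for SwiGLU) are assumptions chosen for illustration.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))  # v * sigmoid(v)

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W2, W3):
    """W3 @ ((W1 @ x) * SiLU(W2 @ x)), following the notation above."""
    content = matvec(W1, x)
    gate = [silu(v) for v in matvec(W2, x)]
    hidden = [c * g for c, g in zip(content, gate)]
    return matvec(W3, hidden)

def mlp_params(d_model, d_ff):
    return 2 * d_model * d_ff        # up and down projections

def swiglu_params(d_model, d_hidden):
    return 3 * d_model * d_hidden    # content, gate, and output projections

d_model = 512
d_ff = 4 * d_model                   # 2048, assumed MLP width
d_hidden = 2 * d_ff // 3             # 1365, about two thirds of d_ff
print(mlp_params(d_model, d_ff))         # 2097152
print(swiglu_params(d_model, d_hidden))  # 2096640

# Tiny forward pass with identity projections, just to show the data flow.
identity = [[1.0, 0.0], [0.0, 1.0]]
y = swiglu_ffn([1.0, 2.0], identity, identity, [[1.0, 1.0]])
```

With the hidden width cut to two thirds, three matrices cost almost exactly what two matrices cost at full width, which is what makes the comparison fair.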

Figure 1: Not all activations play the same role. Hidden-layer activations shape features, output activations shape the prediction object, and gated activations decide how one feature stream modulates another.

Shrinkage and Sparse Activations

Intuition First: Standard activations pass strong signals and block weak ones. Shrinkage activations go further: they push small values all the way to zero, creating genuine sparsity in the representation. Think of it as denoising — treat small activations as noise and eliminate them, keep only the confidently large values. Sparsemax takes this idea to the output layer: unlike softmax which distributes probability mass everywhere, Sparsemax assigns exact zero probability to unlikely classes, producing a sparse probability vector. This is especially valuable for attention mechanisms and structured prediction.

Another family is built around sparsity or denoising:

TanhShrink: returns x - tanh(x)
SoftShrink: softly pushes small values toward zero
HardShrink: zeroes small values completely
Sparsemax: like softmax, but can produce exact zeros
Entmax: interpolates between dense softmax and sparse alternatives

TanhShrink \[ \operatorname{TanhShrink}(x) = x - \tanh(x) \]

SoftShrink \[ \operatorname{SoftShrink}(x) = \begin{cases} x - \lambda, & x > \lambda \\ 0, & |x| \le \lambda \\ x + \lambda, & x < -\lambda \end{cases} \]

HardShrink \[ \operatorname{HardShrink}(x) = \begin{cases} x, & |x| > \lambda \\ 0, & |x| \le \lambda \end{cases} \]

These are useful when you want more structured or selective outputs rather than dense probability mass everywhere.

Concrete example: SoftShrink vs. HardShrink (λ=0.5):

Input value	SoftShrink (λ=0.5)	HardShrink (λ=0.5)	Notes
2.0	1.5	2.0	Above λ: SoftShrink subtracts λ, HardShrink passes unchanged
0.8	0.3	0.8	Above λ: SoftShrink subtracts λ, HardShrink passes unchanged
0.4	0.0	0.0	Both zero (|x| ≤ λ)
0.1	0.0	0.0	Both zero (|x| ≤ λ)
−0.6	−0.1	−0.6	Below −λ: SoftShrink adds λ, HardShrink passes unchanged
−1.5	−1.0	−1.5	Below −λ: SoftShrink adds λ, HardShrink passes unchanged

SoftShrink moves every surviving value toward zero by λ and zeroes the rest; HardShrink either passes a value unchanged or zeroes it. SoftShrink is the classical soft-thresholding rule from wavelet denoising, and it is exactly the proximal operator of the L1 penalty used in LASSO.
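The three shrinkage formulas translate almost line for line into code. A minimal scalar sketch in plain Python (function names are hypothetical, and λ defaults to the 0.5 used in the table):

```python
import math

def soft_shrink(x, lam=0.5):
    """Move x toward zero by lam; values with |x| <= lam become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def hard_shrink(x, lam=0.5):
    """Pass x unchanged if |x| > lam, otherwise return 0."""
    return x if abs(x) > lam else 0.0

def tanh_shrink(x):
    return x - math.tanh(x)

# Same inputs as the table above.
for x in [2.0, 0.8, 0.4, 0.1, -0.6, -1.5]:
    print(x, round(soft_shrink(x), 2), hard_shrink(x))
```

The rounding in the print is only there to hide floating-point noise such as 0.8 - 0.5 evaluating to 0.30000000000000004.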

Same logits [3, 1, 0, −1, −2] through Softmax (left) vs. Sparsemax (right). Softmax distributes probability everywhere: even the lowest-scoring classes C4 and C5 still receive about 1.5% and 0.6%. Sparsemax is the Euclidean projection of the logits onto the probability simplex, computed with a thresholding operation, and it produces exact zeros for low-scoring classes. This matters for sparse attention mechanisms where you want some tokens to receive literally zero weight.
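The thresholding operation is short enough to write out. The sketch below follows the sort-based procedure described by Martins and Astudillo (2016) as I understand it: sort the logits, find the largest support size k for which 1 + k·z₍ₖ₎ exceeds the sum of the top k logits, derive a threshold τ from that support, and clip. It is a plain-Python illustration, not a reference implementation.

```python
def sparsemax(logits):
    """Euclidean projection of the logits onto the probability simplex."""
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The support grows while 1 + k * z_(k) exceeds the sum of the top k.
        if 1.0 + k * z > cumsum:
            tau = (cumsum - 1.0) / k
    # Everything at or below the threshold gets exactly zero probability.
    return [max(z - tau, 0.0) for z in logits]

print(sparsemax([3.0, 1.0, 0.0, -1.0, -2.0]))               # [1.0, 0.0, 0.0, 0.0, 0.0]
print([round(p, 3) for p in sparsemax([1.0, 0.8, 0.1])])    # [0.6, 0.4, 0.0]
```

For the logits in the figure the gap between the top two scores is large enough that all of the mass lands on the first class; with closer scores, as in the second call, several classes share the mass and the rest are exactly zero.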

Special-Purpose Activations

Some activations are not mainstream in basic classifiers, but they are extremely important in the right niche.

Maxout: takes the maximum over several learned affine responses
Sin / SIREN: uses sinusoidal activations for implicit neural representations
Gaussian / RBF: activates by distance from a center
Soft Exponential: learns whether to behave more like a log, linear, or exponential function
KAN / spline activations: learns the activation shape itself rather than choosing a fixed closed-form function

SIREN \[ f(x) = \sin(\omega x) \]

Gaussian / RBF \[ \phi(x) = \exp\!\left(-\frac{\|x-c\|^2}{2\sigma^2}\right) \]

Soft Exponential \[ f_\alpha(x) = \begin{cases} -\frac{\log(1-\alpha(x+\alpha))}{\alpha}, & \alpha < 0 \\ x, & \alpha = 0 \\ \frac{e^{\alpha x}-1}{\alpha} + \alpha, & \alpha > 0 \end{cases} \]

These remind us that “activation function” is a much broader design space than just ReLU vs GELU.
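A few of these are compact enough to sketch directly from the formulas above. The function names are hypothetical, the Gaussian width `sigma` defaults to 1 for illustration, and in a real model the centers, α, and the maxout weights would be learned parameters.

```python
import math

def gaussian_rbf(x, center, sigma=1.0):
    """exp(-||x - c||^2 / (2 sigma^2)): equals 1 at the center, decays with distance."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def soft_exponential(x, alpha):
    """Log-like for alpha < 0, identity at alpha = 0, exponential-like for alpha > 0."""
    if alpha < 0:
        # Only defined where 1 - alpha * (x + alpha) > 0.
        return -math.log(1.0 - alpha * (x + alpha)) / alpha
    if alpha == 0:
        return x
    return (math.exp(alpha * x) - 1.0) / alpha + alpha

def maxout(x, weights, biases):
    """Maximum over several affine responses w_k . x + b_k."""
    return max(
        sum(w * xi for w, xi in zip(w_k, x)) + b_k
        for w_k, b_k in zip(weights, biases)
    )

print(gaussian_rbf([1.0, 2.0], [1.0, 2.0]))                  # 1.0 at the center
print(soft_exponential(2.0, 1.0), math.exp(2.0))             # alpha = 1 recovers exp(x)
print(soft_exponential(2.0, -1.0), math.log(2.0))            # alpha = -1 recovers log(x)
print(maxout([-3.0], [[1.0], [-1.0]], [0.0, 0.0]))           # 3.0, i.e. |x|
```

The last line shows why maxout is described as learning the activation: with weights +1 and −1 it reproduces the absolute value, and other weights give other piecewise-linear shapes.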

Key Insight (why SIREN works for implicit representations): Modeling a continuous signal (image, shape, audio) as a neural function such as f(x, y) → RGB requires the network to represent fine-grained detail. ReLU-based networks produce piecewise-linear outputs, so their second and higher derivatives are zero almost everywhere and they cannot model signals whose derivatives matter. SIREN uses sin(ωx), whose derivatives are also sinusoids, so the network and all of its derivatives stay smooth at every layer. The frequency ω controls the scale of detail captured. Sitzmann et al. (2020) report that SIREN networks fit images, audio, and their derivatives considerably more accurately than ReLU networks of comparable size, and they pair the activation with a dedicated initialization scheme.

SIREN (blue, smooth) vs. a ReLU piecewise approximation of the same sinusoidal target. The SIREN can match the smooth target closely because its activations are infinitely differentiable. ReLU can approximate it, but only with many more linear pieces and with derivative discontinuities that limit precision in applications like physics-based neural fields.
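The derivative argument can be checked numerically on single units. This sketch estimates second derivatives with a central finite difference; `omega = 30.0`, the evaluation point, and the step size are illustrative assumptions, and the helper names are hypothetical.

```python
import math

def siren_unit(x, omega=30.0):
    return math.sin(omega * x)

def relu_unit(x):
    return max(0.0, x)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

x = 0.05  # away from the ReLU kink at 0
relu_d2 = second_derivative(relu_unit, x)
siren_d2 = second_derivative(siren_unit, x)
analytic = -(30.0 ** 2) * math.sin(30.0 * x)  # d2/dx2 sin(wx) = -w^2 sin(wx)

print(relu_d2)             # approximately 0: no curvature away from the kink
print(siren_d2, analytic)  # both close to -898
```

A ReLU unit carries no second-order information except at its kink, while the sine unit's second derivative is again a scaled sinusoid, which is the property the paragraph above relies on.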

Figure 2: Grid of output, gated, sparse, and special activations including Softmax, LogSoftmax, Maxout, GLU, SwiGLU, GeGLU, ReGLU, TanhShrink, SoftShrink, HardShrink, Sparsemax, Entmax, SIREN, Gaussian RBF, Soft Exponential, and spline-style activations. This last family is much more diverse. Some activations map logits to probabilities, some implement feature gating, some encourage sparsity, and some are designed for special function classes such as implicit fields or spline-based networks.

Common Mistakes

Four mistakes that show up constantly:

Applying softmax before `CrossEntropyLoss` in PyTorch. That loss expects raw logits and applies log-softmax internally, so an extra softmax squashes the scores twice.
Using sigmoid for mutually exclusive multi-class classification. Usually you want softmax instead.
Ignoring the output activation entirely. The last-layer activation should match both the task and the loss.
Assuming all gating activations are interchangeable. SwiGLU, GeGLU, and ReGLU can change optimization noticeably in large models.
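The first mistake is easy to underestimate because nothing crashes. The sketch below mimics a logits-based cross-entropy in plain Python (it does not call PyTorch; the function names are hypothetical) and feeds it probabilities instead of logits. Since probabilities live in [0, 1], the loss for a 3-class problem can never drop below log(1 + 2/e) ≈ 0.551, no matter how confident and correct the model is.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]

def cross_entropy_from_logits(logits, target):
    """Negative log-likelihood of class `target`, computed from raw logits."""
    return -log_softmax(logits)[target]

logits = [20.0, 0.0, 0.0]  # very confident, and class 0 is the right answer

correct = cross_entropy_from_logits(logits, 0)           # raw logits in
double = cross_entropy_from_logits(softmax(logits), 0)   # softmax applied twice
floor = math.log(1.0 + 2.0 * math.exp(-1.0))             # best case with inputs in [0, 1]

print(correct)        # close to 0
print(double, floor)  # both about 0.551
```

The double-softmax loss is stuck near its floor even for a perfect prediction, so training still runs but the gradients are compressed and the reported loss stops being a meaningful likelihood.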

Practical Recommendation Map

Use case	Recommended activation
Binary classification output	Sigmoid
Multi-class classification output	Softmax
Regression output	Linear
Transformer feed-forward blocks	GELU or SwiGLU
Sparse probability-like outputs	Sparsemax or Entmax
Implicit neural representations	SIREN
Radial similarity models	Gaussian / RBF

What Can Go Wrong with Output and Special Activations?

Activation family	Potential problem
Softmax	Easy to misuse with the wrong loss pipeline, especially if you apply it before losses that expect raw logits.
Sigmoid outputs	Wrong choice for mutually exclusive multi-class prediction, where softmax is usually the right tool.
GLU-style gating	More expressive, but also more parameter-heavy and architecture-dependent.
Sparsemax / Entmax	Useful for sparsity, but can change optimization behavior enough that they are not just drop-in replacements for softmax.
SIREN / RBF / spline-style activations	Very powerful in the right niche, but usually a poor default if the model and task were not designed for them.

Final Takeaway

Activation functions are not a side detail. They define:

how information flows forward,
how gradients flow backward,
what geometry the model can represent,
and what kind of output object the network produces.

That is why the full story needs more than one chapter. ReLU, GELU, Softmax, SwiGLU, Sparsemax, and SIREN are not solving the same problem. They all live under the same name, but they serve very different roles.

References

Dauphin, Y. N. et al. “Language Modeling with Gated Convolutional Networks.” 2017.
Shazeer, N. “GLU Variants Improve Transformer.” 2020.
Martins, A. and Astudillo, R. “From Softmax to Sparsemax.” ICML 2016.
Peters, B. et al. “Sparse Sequence-to-Sequence Models.” ACL 2019.
Sitzmann, V. et al. “Implicit Neural Representations with Periodic Activation Functions.” NeurIPS 2020.

Alessio Borgi