Output, Gated, and Special Activations: Softmax, GLU, SIREN, and More
Output Activations Have a Different Job
In hidden layers, activation functions mainly shape representation learning and gradient flow. At the output layer, they must match the task.
The three most important output cases
| Task | Typical activation | Why |
|---|---|---|
| Binary classification | Sigmoid | Turns one logit into a probability between 0 and 1 |
| Multi-class classification | Softmax | Converts logits into a probability distribution that sums to 1 |
| Regression | Linear / Identity | Leaves the output unconstrained |
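The softmax formula is:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z$ is the vector of $K$ logits, one per class.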
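A minimal sketch of the two probability-producing output activations from the table, written with only the Python standard library. Softmax exponentiates each logit and divides by the sum of the exponentials. Subtracting the maximum logit first is a common stability trick and does not change the result. The function names are illustrative, not taken from any particular library.

```python
import math

def sigmoid(logit):
    """Map a single logit to a probability (binary classification output)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

def softmax(logits):
    """Map a list of logits to a probability distribution that sums to 1."""
    # Shifting by the maximum leaves the ratios unchanged but avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Without the shift, `math.exp(1000.0)` raises `OverflowError`. With it, `softmax([1000.0, 1000.0])` returns an even split.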
Why Gated Activations Became So Important
Modern architectures often do not use a single scalar curve after an affine transform. Instead, they split the channel dimension and let one part gate another.
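That idea gives rise to a family of gated units that differ only in the gate function:
- GLU: one linear branch, passed through a sigmoid, gates another
- SwiGLU: same idea, but with a SiLU/Swish-style gate
- GeGLU: same idea, with a GELU gate
- ReGLU: same idea, with a ReLU gate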
This family matters because large Transformers often rely more on gated feed-forward blocks than on plain ReLU-style MLPs.
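The split-and-gate pattern can be sketched in a few lines. The example below assumes the affine projection has already been applied and works on a plain list of floats. It treats the first half as the value branch and the second half as the gate branch, which is one common convention; some formulations swap the two. All function names here are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # SiLU / Swish: the input scaled by its own sigmoid.
    return x * sigmoid(x)

def gelu(x):
    # Exact GELU: x times the standard normal CDF at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

def gated_unit(projected, gate_fn):
    """Split the channel dimension in half and let one half gate the other."""
    if len(projected) % 2 != 0:
        raise ValueError("channel dimension must be even to split into two halves")
    half = len(projected) // 2
    value, gate = projected[:half], projected[half:]
    return [v * gate_fn(g) for v, g in zip(value, gate)]

def glu(projected):
    return gated_unit(projected, sigmoid)

def swiglu(projected):
    return gated_unit(projected, silu)

def geglu(projected):
    return gated_unit(projected, gelu)

def reglu(projected):
    return gated_unit(projected, relu)

print(glu([1.0, 2.0, 0.0, 0.0]))
print(reglu([1.0, 2.0, -1.0, 3.0]))
```

The only difference between the four variants is the function applied to the gate half, which is why they are usually discussed as one family.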
Shrinkage and Sparse Activations
Another family is built around sparsity or denoising:
- TanhShrink: returns `x - tanh(x)`
- SoftShrink: softly pushes small values toward zero
- HardShrink: zeroes small values completely
- Sparsemax: like softmax, but can produce exact zeros
- Entmax: interpolates between dense softmax and sparse alternatives
These are useful when you want more structured or selective outputs rather than dense probability mass everywhere.
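A hedged sketch of the shrinkage functions and of sparsemax, using only the standard library. The threshold name `lam` and its default of 0.5 are assumptions chosen for illustration. The sparsemax routine follows the sort-and-threshold procedure described by Martins and Astudillo: find a threshold `tau` so that the clipped values `max(z - tau, 0)` sum to 1. Entmax is not covered here.

```python
import math

def tanhshrink(x):
    return x - math.tanh(x)

def softshrink(x, lam=0.5):
    # Values inside [-lam, lam] become zero; the rest move toward zero by lam.
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def hardshrink(x, lam=0.5):
    # Values inside [-lam, lam] become zero; the rest pass through unchanged.
    return x if abs(x) > lam else 0.0

def sparsemax(logits):
    """Project logits onto the probability simplex; small entries become exactly 0."""
    z_sorted = sorted(logits, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumulative += z_k
        # Entry k stays in the support while this inequality holds.
        if 1.0 + k * z_k > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(z - tau, 0.0) for z in logits]

print(sparsemax([2.0, 1.0, 0.1]))
print(sparsemax([1.0, 0.8, 0.1]))
```

With logits `[2.0, 1.0, 0.1]` the whole probability mass lands on the first entry, whereas softmax would give every entry a nonzero share.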
Special-Purpose Activations
Some activations are not mainstream in basic classifiers, but they are extremely important in the right niche.
- Maxout: takes the maximum over several learned affine responses
- Sin / SIREN: uses sinusoidal activations for implicit neural representations
- Gaussian / RBF: activates by distance from a center
- Soft Exponential: learns whether to behave more like a log, linear, or exponential function
- KAN / spline activations: learns the activation shape itself rather than choosing a fixed closed-form function
These remind us that “activation function” is a much broader design space than just ReLU vs GELU.
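To make the list concrete, here is a rough sketch of four of these activations for a single unit, written with the standard library only. The function names and signatures are illustrative. The frequency scale `omega_0=30.0` is the value I recall from the SIREN paper and should be treated as an assumption, as should the default `sigma`. The soft exponential follows the piecewise form I recall from Godfrey and Gashler, which is not in this article's reference list. KAN and spline activations are omitted because they need a learned basis.

```python
import math

def dot(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def maxout(x, weights, biases):
    """Maximum over several affine responses w_i . x + b_i."""
    return max(dot(w, x) + b for w, b in zip(weights, biases))

def sine_unit(x, w, b, omega_0=30.0):
    """SIREN-style unit: a sine applied to a scaled affine response."""
    return math.sin(omega_0 * (dot(w, x) + b))

def gaussian_rbf(x, center, sigma=1.0):
    """Response decays with squared distance from the center."""
    sq_dist = sum((x_j - c_j) ** 2 for x_j, c_j in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def soft_exponential(alpha, x):
    """Log-like for alpha < 0, identity at alpha == 0, exponential-like for alpha > 0."""
    if alpha < 0:
        # Only defined where 1 - alpha * (x + alpha) is positive.
        return -math.log(1.0 - alpha * (x + alpha)) / alpha
    if alpha == 0:
        return x
    return (math.exp(alpha * x) - 1.0) / alpha + alpha

print(maxout([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
print(gaussian_rbf([1.0, 2.0], [1.0, 2.0]))
print(soft_exponential(-1.0, 2.0), soft_exponential(0.0, 2.0), soft_exponential(1.0, 2.0))
```

At `alpha = -1` the soft exponential reduces to `log(x)`, at `alpha = 0` to `x`, and at `alpha = 1` to `exp(x)`, which is the interpolation the bullet describes.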
Common Mistakes
- Applying softmax before `CrossEntropyLoss` in PyTorch. That loss expects raw logits.
- Using sigmoid for mutually exclusive multi-class classification. Usually you want softmax instead.
- Ignoring the output activation entirely. The last-layer activation should match both the task and the loss.
- Assuming all gating activations are interchangeable. SwiGLU, GeGLU, and ReGLU can change optimization noticeably in large models.
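The first two mistakes are easy to see numerically. The sketch below is plain Python, not PyTorch: `cross_entropy_from_logits` is a hypothetical helper that mirrors the log-softmax plus negative log-likelihood computation that PyTorch documents for `CrossEntropyLoss`. Feeding it probabilities instead of logits applies softmax twice and flattens the loss.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Negative log of the softmax probability of the target class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [5.0, 0.0, 0.0]  # a confident, correct prediction for class 0

# Correct usage: pass raw logits.
correct_loss = cross_entropy_from_logits(logits, 0)

# Mistake: softmax first, so the loss applies a second softmax internally.
double_softmax_loss = cross_entropy_from_logits(softmax(logits), 0)

# Mistake: independent sigmoids on mutually exclusive classes do not sum to 1.
sigmoid_scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]

print(correct_loss, double_softmax_loss, sum(sigmoid_scores))
```

Because probabilities lie between 0 and 1, the double-softmax loss for three classes can never fall much below 0.55 however confident the model becomes, so the gradient signal stays weak.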
Practical Recommendation Map
| Use case | Recommended activation |
|---|---|
| Binary classification output | Sigmoid |
| Multi-class classification output | Softmax |
| Regression output | Linear |
| Transformer feed-forward blocks | GELU or SwiGLU |
| Sparse probability-like outputs | Sparsemax or Entmax |
| Implicit neural representations | SIREN |
| Radial similarity models | Gaussian / RBF |
What Can Go Wrong with Output and Special Activations?
| Activation family | Potential problem |
|---|---|
| Softmax | Easy to misuse with the wrong loss pipeline, especially if you apply it before losses that expect raw logits. |
| Sigmoid outputs | Wrong choice for mutually exclusive multi-class prediction, where softmax is usually the right tool. |
| GLU-style gating | More expressive, but the extra gate projection adds parameters unless the hidden width is reduced to compensate, and the benefit is architecture-dependent. |
| Sparsemax / Entmax | Useful for sparsity, but can change optimization behavior enough that they are not just drop-in replacements for softmax. |
| SIREN / RBF / spline-style activations | Very powerful in the right niche, but usually a poor default if the model and task were not designed for them. |
Final Takeaway
Activation functions are not a side detail. They define:
- how information flows forward,
- how gradients flow backward,
- what geometry the model can represent,
- and what kind of output object the network produces.
That is why the full story needs more than one chapter. ReLU, GELU, Softmax, SwiGLU, Sparsemax, and SIREN are not solving the same problem. They all live under the same name, but they serve very different roles.
References
- Dauphin, Y. N. et al. “Language Modeling with Gated Convolutional Networks.” 2017.
- Shazeer, N. “GLU Variants Improve Transformer.” 2020.
- Martins, A. and Astudillo, R. “From Softmax to Sparsemax.” ICML 2016.
- Peters, B. et al. “Sparse Sequence-to-Sequence Models.” ACL 2019.
- Sitzmann, V. et al. “Implicit Neural Representations with Periodic Activation Functions.” NeurIPS 2020.
