Output, Gated, and Special Activations: Softmax, GLU, SIREN, and More

TL;DR: Many important activations are not “hidden-layer curves” at all. Softmax and sigmoid control outputs, GLU-style activations learn gates, shrinkage activations push values toward zero, and specialized activations such as SIREN or Gaussian RBFs are built for niche but powerful settings.

Output Activations Have a Different Job

In hidden layers, activation functions mainly shape representation learning and gradient flow. At the output layer, they must match the task.

The three most important output cases

| Task | Typical activation | Why |
| --- | --- | --- |
| Binary classification | Sigmoid | Turns one logit into a probability in [0, 1] |
| Multi-class classification | Softmax | Converts logits into a probability distribution that sums to 1 |
| Regression | Linear / Identity | Leaves the output unconstrained |

The softmax formula is:

\[ \operatorname{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \]
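
As a minimal sketch of the formula above (plain Python, standard library only), the version below subtracts the largest logit before exponentiating. That shift cancels in the ratio, so the result is unchanged, but it avoids overflow for large logits. The helper names and sample values are illustrative assumptions, not part of any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # shifting by the max cancels in the ratio
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

probs = softmax([2.0, 1.0, 0.1])          # a distribution over three classes
big = softmax([1000.0, 1001.0, 1002.0])   # a naive exp(1000) would overflow
two_class = softmax([1.5, 0.0])[0]        # matches sigmoid(1.5)
```

The last line illustrates why sigmoid is the natural binary case: a two-class softmax with one logit fixed at zero reduces to the sigmoid of the other logit.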

Why Gated Activations Became So Important

Modern architectures often do not use a single scalar curve after an affine transform. Instead, they split the channel dimension and let one part gate another.

That gives you:

  • GLU: a sigmoid-activated branch gates a linear branch
  • SwiGLU: same idea, but with a SiLU/Swish-style gate
  • GeGLU: GELU gate
  • ReGLU: ReLU gate
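
In the formulas, a and b are two linear projections of the input x (equivalently, the two halves of one wider projection), and ⊗ is elementwise multiplication.

GLU: \[ \operatorname{GLU}(x) = a \otimes \sigma(b) \]
SwiGLU: \[ \operatorname{SwiGLU}(x) = a \otimes \operatorname{SiLU}(b) \]
GeGLU / ReGLU: \[ \operatorname{GeGLU}(x) = a \otimes \operatorname{GELU}(b), \qquad \operatorname{ReGLU}(x) = a \otimes \operatorname{ReLU}(b) \]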
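
The gating formulas can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: the input is split into two halves a and b, whereas real models usually produce them with learned projections, and the helper `gated` is a hypothetical name.

```python
import math

def sigmoid(x):
    # Branch on sign so exp never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def silu(x):
    return x * sigmoid(x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

def gated(x, gate_fn):
    """Split x into halves (a, b) and return a * gate_fn(b) elementwise."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [ai * gate_fn(bi) for ai, bi in zip(a, b)]

x = [1.0, -2.0, 0.0, 3.0]        # a = [1, -2], b = [0, 3]
glu_out = gated(x, sigmoid)      # GLU
swiglu_out = gated(x, silu)      # SwiGLU
geglu_out = gated(x, gelu)       # GeGLU
reglu_out = gated(x, relu)       # ReGLU
```

Note that the output has half the width of the input, which is why gated blocks need a wider (or additional) projection in front of them.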

This family matters because many recent large Transformers use gated feed-forward blocks, typically SwiGLU or GeGLU, in place of plain ReLU-style MLPs.

Useful mental model: ReLU asks “should this neuron pass?” GLU-like activations ask “how strongly should this feature gate another feature?”
[Image: diagram contrasting hidden-layer activations, output activations, and gated activations]
Figure 1 — Not all activations play the same role. Hidden-layer activations shape features, output activations shape the prediction object, and gated activations decide how one feature stream modulates another.

Shrinkage and Sparse Activations

Another family is built around sparsity or denoising:

  • TanhShrink: returns x - tanh(x)
  • SoftShrink: softly pushes small values toward zero
  • HardShrink: zeroes small values completely
  • Sparsemax: like softmax, but can produce exact zeros
  • Entmax: interpolates between dense softmax and sparse alternatives

TanhShrink: \[ \operatorname{TanhShrink}(x) = x - \tanh(x) \]
SoftShrink: \[ \operatorname{SoftShrink}(x) = \begin{cases} x - \lambda, & x > \lambda \\ 0, & |x| \le \lambda \\ x + \lambda, & x < -\lambda \end{cases} \]
HardShrink: \[ \operatorname{HardShrink}(x) = \begin{cases} x, & |x| > \lambda \\ 0, & |x| \le \lambda \end{cases} \]

Here λ > 0 is the threshold: inputs with magnitude at or below λ are set to zero.
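
A direct transcription of the three shrinkage formulas into plain Python is shown below as a sketch. The threshold value 0.5 and the sample inputs are assumptions chosen for illustration.

```python
import math

def tanhshrink(x):
    return x - math.tanh(x)

def softshrink(x, lam=0.5):
    # Values outside [-lam, lam] are pulled toward zero by lam.
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def hardshrink(x, lam=0.5):
    # Values outside [-lam, lam] pass through unchanged.
    return x if abs(x) > lam else 0.0

xs = [-2.0, -0.3, 0.0, 0.3, 2.0]
soft = [softshrink(x) for x in xs]   # [-1.5, 0.0, 0.0, 0.0, 1.5]
hard = [hardshrink(x) for x in xs]   # [-2.0, 0.0, 0.0, 0.0, 2.0]
```

The difference is visible at the large inputs: SoftShrink shifts surviving values by λ, while HardShrink leaves them untouched.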

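The shrinkage functions are useful when small values should be treated as noise and suppressed, as in denoising or sparse coding. Sparsemax and Entmax are useful when you want a probability-like output that puts exactly zero mass on some entries, rather than dense probability mass everywhere.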
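
To make the "exact zeros" property concrete, here is a minimal sketch of sparsemax using the sort-based procedure from Martins and Astudillo: find the largest support size k whose sorted logits satisfy the support condition, compute a threshold τ, and clip. The function name and inputs are illustrative; this is not an optimized or batched implementation.

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    k_star, cumsum_star = 0, 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        # Support condition: entry k stays in the support if this holds.
        if 1.0 + k * zk > cumsum:
            k_star, cumsum_star = k, cumsum
    tau = (cumsum_star - 1.0) / k_star   # k_star >= 1, since k = 1 always passes
    return [max(zi - tau, 0.0) for zi in z]

sparse = sparsemax([1.0, 0.8, 0.1])   # roughly [0.6, 0.4, 0.0]
peaked = sparsemax([2.0, 1.0, 0.1])   # all mass on the first entry
```

Unlike softmax, the smallest logit in the first example receives exactly zero probability, and a sufficiently dominant logit takes all of the mass.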

Special-Purpose Activations

Some activations are not mainstream in basic classifiers, but they are extremely important in the right niche.

  • Maxout: takes the maximum over several learned affine responses
  • Sin / SIREN: uses sinusoidal activations for implicit neural representations
  • Gaussian / RBF: activates by distance from a center
  • Soft Exponential: learns whether to behave more like a log, linear, or exponential function
  • KAN / spline activations: learns the activation shape itself rather than choosing a fixed closed-form function

SIREN: \[ f(x) = \sin(\omega x) \]
Gaussian / RBF: \[ \phi(x) = \exp\!\left(-\frac{\|x-c\|^2}{2\sigma^2}\right) \]
Soft Exponential: \[ f_\alpha(x) = \begin{cases} -\frac{\log(1-\alpha(x+\alpha))}{\alpha}, & \alpha < 0 \\ x, & \alpha = 0 \\ \frac{e^{\alpha x}-1}{\alpha} + \alpha, & \alpha > 0 \end{cases} \]

Here ω is a frequency scale, c and σ are the center and width of the radial unit, and α is a learnable shape parameter.
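
The special-purpose formulas can also be written out directly. The sketch below uses plain Python; the default ω = 30 follows the value suggested in the SIREN paper, while σ = 1, the Maxout weights, and all function names are assumptions for illustration.

```python
import math

def sine_activation(x, omega=30.0):
    return math.sin(omega * x)

def gaussian_rbf(x, center, sigma=1.0):
    # Response decays with squared Euclidean distance from the center.
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def soft_exponential(x, alpha):
    if alpha < 0:
        # Only defined where 1 - alpha * (x + alpha) > 0.
        return -math.log(1.0 - alpha * (x + alpha)) / alpha
    if alpha == 0:
        return x
    return (math.exp(alpha * x) - 1.0) / alpha + alpha

def maxout(x, weights, biases):
    """Maximum over several affine responses w_k . x + b_k."""
    return max(
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

at_center = gaussian_rbf([1.0, 2.0], [1.0, 2.0])   # 1.0 at zero distance
as_log = soft_exponential(2.0, -1.0)               # equals log(2)
as_exp = soft_exponential(2.0, 1.0)                # equals exp(2)
abs_like = maxout([-3.0, 5.0], [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])
```

The Soft Exponential calls show the interpolation the text describes: α = -1 recovers the logarithm, α = 0 the identity, and α = 1 the exponential. The Maxout example with weights ±1 on the first coordinate reproduces an absolute value, a shape no single fixed activation in this list provides.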

These remind us that “activation function” is a much broader design space than just ReLU vs GELU.

[Image: grid of output, gated, sparse, and special activations including Softmax, LogSoftmax, Maxout, GLU, SwiGLU, GeGLU, ReGLU, TanhShrink, SoftShrink, HardShrink, Sparsemax, Entmax, SIREN, Gaussian RBF, Soft Exponential, and spline-style activations]
Figure 2 — This last family is much more diverse. Some activations map logits to probabilities, some implement feature gating, some encourage sparsity, and some are designed for special function classes such as implicit fields or spline-based networks.

Common Mistakes

Four mistakes that show up constantly:
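  1. Applying softmax before `CrossEntropyLoss` in PyTorch. That loss expects raw logits and applies log-softmax internally, so an explicit softmax normalizes twice.
  2. Using sigmoid for mutually exclusive multi-class classification. Sigmoid scores each class independently, so the outputs need not sum to 1. Usually you want softmax instead, and sigmoid only for multi-label problems.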
  3. Ignoring the output activation entirely. The last-layer activation should match both the task and the loss.
  4. Assuming all gating activations are interchangeable. SwiGLU, GeGLU, and ReGLU can change optimization noticeably in large models.
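
The first mistake is easy to see numerically. The sketch below uses plain Python rather than PyTorch, with a hand-written `cross_entropy_from_logits` that mimics a loss applying log-softmax internally; the function names and logits are assumptions for illustration.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]

def cross_entropy_from_logits(logits, target):
    """Negative log-likelihood of the target class, from raw logits."""
    return -log_softmax(logits)[target]

logits = [10.0, 0.0, 0.0]   # confident and correct for class 0

# Correct usage: pass raw logits to the loss.
correct = cross_entropy_from_logits(logits, 0)

# Mistake: pass probabilities, so softmax is effectively applied twice.
double = cross_entropy_from_logits(softmax(logits), 0)

# With three classes, the doubled version can never go below this value,
# because its inputs are confined to [0, 1].
floor = math.log(1.0 + 2.0 / math.e)
```

Here `correct` is close to zero, while `double` stays above roughly 0.55 no matter how confident the model becomes. The loss can no longer reward confident correct predictions, which weakens the training signal.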

Practical Recommendation Map

| Use case | Recommended activation |
| --- | --- |
| Binary classification output | Sigmoid |
| Multi-class classification output | Softmax |
| Regression output | Linear |
| Transformer feed-forward blocks | GELU or SwiGLU |
| Sparse probability-like outputs | Sparsemax or Entmax |
| Implicit neural representations | SIREN |
| Radial similarity models | Gaussian / RBF |

What Can Go Wrong with Output and Special Activations?

| Activation family | Potential problem |
| --- | --- |
| Softmax | Easy to misuse with the wrong loss pipeline, especially if you apply it before losses that expect raw logits. |
| Sigmoid outputs | Wrong choice for mutually exclusive multi-class prediction, where softmax is usually the right tool. |
| GLU-style gating | More expressive, but the gate needs an extra projection, so the block is more parameter-heavy unless the hidden width is reduced, and the best variant is architecture-dependent. |
| Sparsemax / Entmax | Useful for sparsity, but can change optimization behavior enough that they are not just drop-in replacements for softmax. |
| SIREN / RBF / spline-style activations | Very powerful in the right niche, but usually a poor default if the model and task were not designed for them. |

Final Takeaway

Activation functions are not a side detail. They define:

  1. how information flows forward,
  2. how gradients flow backward,
  3. what geometry the model can represent,
  4. and what kind of output object the network produces.

That is why the full story needs more than one chapter. ReLU, GELU, Softmax, SwiGLU, Sparsemax, and SIREN are not solving the same problem. They all live under the same name, but they serve very different roles.

References

  1. Dauphin, Y. N. et al. “Language Modeling with Gated Convolutional Networks.” 2017.
  2. Shazeer, N. “GLU Variants Improve Transformer.” 2020.
  3. Martins, A. and Astudillo, R. “From Softmax to Sparsemax.” ICML 2016.
  4. Peters, B. et al. “Sparse Sequence-to-Sequence Models.” ACL 2019.
  5. Sitzmann, V. et al. “Implicit Neural Representations with Periodic Activation Functions.” NeurIPS 2020.