Modern Activation Functions: GELU, SiLU, Mish, and Smooth Gating
Why ReLU Was Not the End of the Story
ReLU eased the vanishing-gradient problem that saturating activations such as sigmoid and tanh caused in deep networks, but it also introduced a blunt shape:
- exactly zero on the negative side
- exactly linear on the positive side
- non-differentiable at zero
That simplicity is often a strength, but it is not always the best match for large modern architectures. Once deep learning scaled up, researchers started testing smoother alternatives that preserve gradient flow while making the network’s response less abrupt.
The Main Modern Idea
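Instead of saying:
$$f(x) = x \cdot \mathbb{1}[x > 0]$$
modern activations often say:
$$f(x) = x \cdot g(x), \quad g \text{ smooth, with values in } (0, 1)$$
That makes them feel more like soft gates than hard thresholds.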
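The contrast between a hard threshold and a soft gate can be sketched in a few lines. This is only an illustration: the names `hard_gate` and `soft_gate` are hypothetical, and the logistic sigmoid is assumed as one possible smooth gate (which happens to give SiLU).

```python
import math

def hard_gate(x: float) -> float:
    """ReLU written as x * gate(x), where the gate is a 0/1 step."""
    gate = 1.0 if x > 0 else 0.0
    return x * gate

def soft_gate(x: float) -> float:
    """Same structure, but the gate is a logistic sigmoid in (0, 1)."""
    gate = 1.0 / (1.0 + math.exp(-x))  # fine for moderate |x|
    return x * gate

# Compare the two responses on a few sample inputs.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:+.1f}  hard={hard_gate(x):+.4f}  soft={soft_gate(x):+.4f}")
```

With the hard gate, every negative input maps to exactly zero. With the soft gate, small negative inputs still produce a small negative output, and the transition around zero has no kink.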
Smooth ReLU-Like Families
ELU, SELU, and CELU
These functions keep the positive linear branch, but replace the dead negative side with a smooth saturating tail.
- ELU: negative values bend toward a negative plateau
- SELU: a self-normalizing variant designed to stabilize mean and variance
- CELU: a continuously differentiable ELU-like variant
They are especially interesting because they acknowledge that “all negatives become zero” is sometimes too crude.
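A minimal scalar sketch of the three functions follows, assuming the commonly used definitions: ELU with a scale parameter `alpha`, SELU as a scaled ELU with the fixed constants from Klambauer et al., and CELU with the exponent divided by `alpha`. The function names are illustrative, not taken from any particular library.

```python
import math

# Fixed constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def elu(x: float, alpha: float = 1.0) -> float:
    """Identity for x > 0; saturates toward -alpha for very negative x."""
    return x if x > 0 else alpha * math.expm1(x)

def selu(x: float) -> float:
    """Scaled ELU with the fixed self-normalizing constants."""
    return SELU_SCALE * elu(x, SELU_ALPHA)

def celu(x: float, alpha: float = 1.0) -> float:
    """ELU-like, but the exponent is x / alpha, so the slope at 0 is 1 for any alpha."""
    return max(0.0, x) + min(0.0, alpha * math.expm1(x / alpha))

for x in (-5.0, -1.0, 0.0, 1.0):
    print(f"x={x:+.1f}  elu={elu(x):+.4f}  selu={selu(x):+.4f}  celu={celu(x):+.4f}")
```

With `alpha = 1`, ELU and CELU coincide. They differ for other values of `alpha`, where only CELU keeps a continuous derivative at zero.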
GELU
GELU is the activation you now see throughout Transformer models such as BERT and GPT-2:
$$\mathrm{GELU}(x) = x \, \Phi(x)$$
where Φ(x) is the standard Gaussian cumulative distribution function.
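The intuition is elegant: instead of passing all positive signals and rejecting all negative ones, GELU scales each input by Φ(x), the probability that a standard normal variable falls below x. Large positive inputs pass almost unchanged, large negative inputs are pushed toward zero, and inputs near zero are partially kept.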
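As a sketch, GELU can be computed exactly from the error function, since Φ(x) = ½(1 + erf(x/√2)). The tanh-based form below is the approximation given in the GELU paper. The function names `gelu_exact` and `gelu_tanh` are illustrative.

```python
import math

def gelu_exact(x: float) -> float:
    """x * Phi(x), with Phi expressed through the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation from Hendrycks and Gimpel (2016)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  tanh={gelu_tanh(x):+.5f}")
```

The two versions agree closely over the usual input range, which is why the approximation is often offered as a cheaper alternative.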
Swish and SiLU
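Swish is the same family idea:
$$\mathrm{Swish}_\beta(x) = x \, \sigma(\beta x)$$
where σ is the logistic sigmoid and β is either a constant or a learned parameter. SiLU is the most common fixed version, with β = 1. These activations are smooth, slightly non-monotonic (they dip below zero for moderately negative inputs before returning toward zero), and behave like a gated linear response.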
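A minimal sketch, assuming the usual definition Swish(x) = x · σ(βx) with SiLU as the β = 1 case. The helper names are illustrative, and the sigmoid is written in a branch form so that large negative inputs do not overflow.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, branching on sign to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x: float, beta: float = 1.0) -> float:
    """x gated by sigmoid(beta * x); a large beta approaches ReLU."""
    return x * sigmoid(beta * x)

def silu(x: float) -> float:
    """Swish with beta fixed to 1."""
    return swish(x, beta=1.0)

# The dip below zero is what makes the curve non-monotonic.
for x in (-5.0, -1.278, -0.5, 0.0, 2.0):
    print(f"x={x:+.3f}  silu={silu(x):+.4f}")
```

The output is lowest around x ≈ -1.28 (about -0.28) and rises back toward zero for more negative inputs, so the function is not monotonic on the negative side.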
Mish
Mish pushes the same logic further:
$$\mathrm{Mish}(x) = x \, \tanh(\mathrm{softplus}(x)) = x \, \tanh\!\left(\ln(1 + e^{x})\right)$$
It is smooth, non-monotonic, and often visually looks like “a softer Swish with a richer negative-side bend.”
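A short sketch of Mish, assuming the definition from Misra's paper, Mish(x) = x · tanh(softplus(x)). The softplus helper uses a rearranged form so that large inputs do not overflow. The function names are illustrative.

```python
import math

def softplus(x: float) -> float:
    """ln(1 + e^x), rearranged to stay finite for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    """x gated by tanh(softplus(x)), a smooth gate with values in (0, 1)."""
    return x * math.tanh(softplus(x))

for x in (-5.0, -1.2, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  mish={mish(x):+.4f}")
```

Like SiLU, the curve dips below zero for moderately negative inputs and approaches the identity for large positive ones.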
Fast Approximations and Mobile-Friendly Variants
Smooth functions can be strong, but they rely on exponentials or error functions, which cost more than the comparisons and multiplications of piecewise-linear ones. That is why approximation-based activations became popular in efficient models:
- Hard Sigmoid: piecewise-linear approximation of sigmoid
- Hard Tanh: piecewise-linear approximation of tanh (the identity clipped to [-1, 1])
- Hard Swish: approximation of Swish used in mobile models
The guiding tradeoff is simple: give up a bit of smoothness to gain speed.
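The three hard variants can be sketched with nothing but `min`, `max`, and a multiplication. Definitions of Hard Sigmoid vary between libraries; the version below assumes the ReLU6-based form, relu6(x + 3) / 6, which is the one typically paired with Hard Swish. The function names are illustrative.

```python
def relu6(x: float) -> float:
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)

def hard_sigmoid(x: float) -> float:
    """Piecewise-linear sigmoid: 0 below -3, 1 above 3, linear in between."""
    return relu6(x + 3.0) / 6.0

def hard_tanh(x: float) -> float:
    """Identity clipped to [-1, 1]."""
    return min(max(x, -1.0), 1.0)

def hard_swish(x: float) -> float:
    """x gated by the hard sigmoid instead of the smooth one."""
    return x * hard_sigmoid(x)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  hsig={hard_sigmoid(x):.4f}  "
          f"htanh={hard_tanh(x):+.4f}  hswish={hard_swish(x):+.4f}")
```

None of these functions calls `exp` or `tanh`, which is the source of the speed advantage. The price is a set of kinks where the linear pieces meet.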
A Few More Interesting Curves
The visual grid also includes a few less standard but conceptually useful shapes:
- Bent Identity: almost linear, but gently nonlinear near zero
- Arctan: another smooth bounded squash
- SELU / CELU: reminders that negative values do not have to be thrown away completely
These are not the default choices in modern LLMs, but they help build the reader’s intuition: activation design is really about deciding what should happen around zero, in the tails, and in the derivative.
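For reference, the two less common curves are easy to write down. This sketch assumes the usual definition of Bent Identity, (√(x² + 1) − 1) / 2 + x, and uses the plain arctangent for the second. The function names are illustrative.

```python
import math

def bent_identity(x: float) -> float:
    """Unbounded and nearly linear; the slope stays between 0.5 and 1.5."""
    return (math.sqrt(x * x + 1.0) - 1.0) / 2.0 + x

def arctan(x: float) -> float:
    """Smooth squash with outputs bounded in (-pi/2, pi/2)."""
    return math.atan(x)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:+.1f}  bent={bent_identity(x):+.4f}  arctan={arctan(x):+.4f}")
```

The comparison shows the two extremes of tail behavior: Bent Identity keeps growing with the input, while arctan flattens out.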
Which Ones Actually Matter Most Today?
| Activation | Why people use it | Main tradeoff |
|---|---|---|
| GELU | Very strong default in Transformers | More expensive than ReLU |
| SiLU / Swish | Smooth, gated, stable, often great in efficient deep nets | Still more expensive than ReLU |
| Mish | Flexible smooth non-monotonic response | Less standard in large production stacks |
| Hard Swish | Good hardware-friendly approximation | Less smooth than the original |
Practical Advice
- Transformers: GELU remains the standard baseline.
- Efficient CNNs / mobile models: Hard Swish or SiLU are common.
- General-purpose modern MLPs: ReLU is still a valid baseline, but SiLU is worth testing.
What Can Go Wrong with Modern Activations?
| Activation | Potential problem |
|---|---|
| ELU / SELU / CELU | More expensive than ReLU, and SELU's self-normalizing behavior depends on architectural assumptions (such as a matching weight initialization) that many beginners overlook. |
| GELU | Excellent in Transformers, but often unnecessary overhead in smaller or simpler models. |
| SiLU / Swish | Smoother optimization, but still costlier than piecewise-linear activations. |
| Mish | Can work well, but is less standardized and not always worth the extra complexity. |
| Hard approximations | Faster on-device, but they give up part of the smooth behavior that motivated the original function. |
Common Misunderstanding
The best activation is not the one with the fanciest formula. It is the one whose shape matches:
- the optimization constraints,
- the architecture,
- the hardware budget,
- the role of that layer inside the model.
That is why ReLU still survives, GELU dominates Transformers, and Hard Swish shows up in mobile networks. The “best” activation is context-dependent.
References
- Hendrycks, D. and Gimpel, K. “Gaussian Error Linear Units (GELUs).” 2016.
- Ramachandran, P., Zoph, B., and Le, Q. V. “Searching for Activation Functions.” 2017.
- Misra, D. “Mish: A Self Regularized Non-Monotonic Activation Function.” 2019.
- Klambauer, G. et al. “Self-Normalizing Neural Networks.” NeurIPS 2017.
