Activation Functions in Neural Networks: Why Non-Linearity Matters
Why Activation Functions Exist
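The core equation of a hidden layer is simple:

$$h = \phi(Wx + b)$$

Here $x$ is the layer input, $W$ the weight matrix, $b$ the bias vector, and $\phi$ the activation function applied elementwise.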
The term Wx + b is only an affine transformation. If every layer did only that, then stacking ten layers would still collapse into one big affine transformation, because a composition of affine maps is itself affine. Depth would give you more parameters, but not more expressive power.
Activation functions are the thing that breaks that collapse. They inject non-linearity, which means the network can carve curved decision boundaries, represent thresholds, and model interactions that a linear model cannot.
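The collapse can be checked numerically. The sketch below uses two small layers with made-up weights (all values are illustrative assumptions) and compares the stacked layers against a single layer built from `W = W2 W1` and `b = W2 b1 + b2`. It then inserts a ReLU between the layers to show that the collapsed layer no longer reproduces the result.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def affine(W, b, x):
    """Compute W x + b."""
    return [y + bi for y, bi in zip(matvec(W, x), b)]


def relu(v):
    """Apply ReLU elementwise to a vector."""
    return [max(0.0, vi) for vi in v]


# Two hypothetical layers; the numbers are arbitrary.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]

# Single equivalent layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = affine(W2, b2, b1)

x = [3.0, -2.0]
stacked = affine(W2, b2, affine(W1, b1, x))           # no activation in between
collapsed = affine(W, b, x)                           # one affine layer
with_relu = affine(W2, b2, relu(affine(W1, b1, x)))   # ReLU in between

print(stacked == collapsed)    # the two-layer stack is one affine map
print(with_relu == collapsed)  # the non-linearity breaks the equivalence
```

For this particular input the first comparison holds and the second does not, because the ReLU zeroes a negative hidden value that the purely affine stack would have passed through.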
The Core Intuition
Think of a neuron as a tiny processor that first computes a score and then asks: should I pass this signal, suppress it, clip it, smooth it, or gate it?
- A step activation behaves like a binary rule.
- A sigmoid behaves like a soft probability gate.
- A tanh behaves like a centered soft gate.
- A ReLU behaves like a one-way valve: block negatives, pass positives.
That tiny local choice changes the global behavior of the whole network.
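The four behaviors listed above can be written as scalar functions in a few lines. This is a minimal sketch using only the standard library. The step convention (output 1 at or above the threshold) and the default threshold of 0 are assumptions made for illustration.

```python
import math


def step(z, threshold=0.0):
    """Binary rule: 1 at or above the threshold, 0 below (assumed convention)."""
    return 1.0 if z >= threshold else 0.0


def sigmoid(z):
    """Soft gate with outputs strictly between 0 and 1 (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """Centered soft gate with outputs strictly between -1 and 1."""
    return math.tanh(z)


def relu(z):
    """One-way valve: block negatives, pass positives unchanged."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```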
Historical Progression
How the field evolved
- Step / threshold activations: good for early perceptrons, but their derivative is zero almost everywhere, so they give gradient-based learning nothing to work with.
- Sigmoid and tanh: smooth and differentiable, which made backpropagation practical, but they saturate.
- ReLU: dramatically simplified optimization and became the default for CNNs and MLPs.
- Modern smooth activations: GELU, SiLU (Swish with its scaling parameter fixed to 1), Mish, and gated variants improved optimization in large modern models.
Classical Families
A. Linear and Threshold Activations
These are conceptually important because they show the two extremes.
- Linear / Identity: does nothing; useful mainly in regression outputs.
- Step / Heaviside: flips from 0 to 1 once a threshold is crossed.
The linear activation is not wrong, but if you use it in every hidden layer, you lose the entire point of deep learning.
B. Squashing Functions
The first major family maps inputs into a bounded range:
- Sigmoid: maps to the open interval (0, 1).
- Tanh: maps to the open interval (-1, 1) and is zero-centered.
- Softsign: also saturates, but more gently than tanh.
These functions were historically attractive because they are smooth and easy to differentiate. Their main weakness is saturation: for large positive or negative inputs, the derivative becomes tiny.
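Saturation is easy to see by evaluating the derivatives directly. The sketch below uses the standard closed forms: sigmoid'(z) = s(1 - s), tanh'(z) = 1 - tanh(z)^2, and softsign'(z) = 1 / (1 + |z|)^2. The sample points 0 and 10 are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s). Its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2. Its maximum is 1 at z = 0."""
    t = math.tanh(z)
    return 1.0 - t * t


def softsign_grad(z):
    """Derivative of softsign z / (1 + |z|): 1 / (1 + |z|)^2."""
    return 1.0 / (1.0 + abs(z)) ** 2


for z in (0.0, 10.0):
    print(z, sigmoid_grad(z), tanh_grad(z), softsign_grad(z))
```

At z = 10 the sigmoid and tanh derivatives have decayed exponentially, while the softsign derivative decays only polynomially, which is what "more gently" means in practice.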
C. Piecewise-Linear Functions
Then came the ReLU era:
- ReLU: keeps the positive branch and zeros out the negative one.
- Leaky ReLU: small negative slope instead of a hard zero.
- PReLU: learns that negative slope.
- RReLU: uses a random negative slope during training.
- ReLU6: same idea as ReLU, but clipped at 6.
- Thresholded ReLU: stays at zero until a chosen threshold.
These functions made optimization much easier because their positive branch keeps a strong gradient.
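The variants in the list above differ only in how they treat the negative branch or the upper end. The sketch below writes each as a scalar function. The default slope `alpha=0.01`, the RReLU bounds of 1/8 and 1/3, the use of the mean slope outside training, and the threshold `theta=1.0` are illustrative assumptions, not values taken from this article.

```python
import random


def relu(z):
    """Keep the positive branch, zero out the negative one."""
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """Small fixed negative slope instead of a hard zero."""
    return z if z > 0 else alpha * z


def prelu(z, alpha):
    """Same shape as Leaky ReLU, but alpha is a parameter learned in training."""
    return z if z > 0 else alpha * z


def rrelu(z, lower=1 / 8, upper=1 / 3, training=True):
    """Random negative slope while training; mean slope otherwise (assumed)."""
    if z > 0:
        return z
    alpha = random.uniform(lower, upper) if training else (lower + upper) / 2
    return alpha * z


def relu6(z):
    """ReLU clipped at 6."""
    return min(max(0.0, z), 6.0)


def thresholded_relu(z, theta=1.0):
    """Zero until the input exceeds the chosen threshold theta."""
    return z if z > theta else 0.0


for z in (-2.0, 0.5, 8.0):
    print(z, relu(z), leaky_relu(z), relu6(z), thresholded_relu(z))
```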
What the Shapes Are Telling You
You can often predict training behavior by looking at the curve.
| Shape pattern | What it usually implies |
|---|---|
| Flat tails | Risk of vanishing gradients |
| Hard zero region | Risk of dead neurons |
| Smooth transition | More stable optimization |
| Unbounded positive branch | Strong gradient flow for active units |
| Clipping | Better control, but less expressivity |
So activation functions are not just output transformations. They are also gradient transformations.
Gradient Perspective
Backpropagation trains a network by multiplying many derivatives together. That is where activation choice becomes decisive.
The four recurring problems
| Problem | Meaning |
|---|---|
| Vanishing gradients | Derivatives become so small that early layers barely learn. |
| Exploding gradients | Derivatives become too large and make optimization unstable. |
| Dead neurons | Some ReLU units stay permanently inactive because their pre-activation is negative for every input, so both their output and their gradient are zero. |
| Saturation | Sigmoid/tanh flatten for large magnitudes, so gradient flow collapses. |
This is why ReLU became such a turning point: it did not solve everything, but it avoided the worst saturation behavior that slowed down older deep networks.
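A rough way to see the difference is to multiply only the activation derivatives along a chain of layers. The sketch below deliberately ignores the weight matrices, which also enter the real product and are what can make gradients explode, so treat it as a simplified illustration rather than a full backpropagation. The depth of 10 and the pre-activation values are arbitrary assumptions.

```python
import math


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 on the positive branch, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, pre_activations):
    """Product of activation derivatives along a chain of layers."""
    product = 1.0
    for z in pre_activations:
        product *= local_grad(z)
    return product


depth = 10
# Best case for sigmoid: every pre-activation sits at 0, where the slope peaks.
sigmoid_best_case = chained_gradient(sigmoid_grad, [0.0] * depth)
# ReLU with every unit active: each factor is exactly 1.
relu_active = chained_gradient(relu_grad, [1.0] * depth)

print(sigmoid_best_case)  # 0.25 ** 10, already below 1e-6
print(relu_active)        # 1.0
```

Even in its best case the sigmoid chain shrinks by a factor of four per layer, while an active ReLU path passes the gradient through unchanged. A single inactive ReLU on the path zeroes the product, which is the dead-neuron risk listed in the table.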
Practical First Recommendations
If you are just starting, a strong first mental map is:
| Use case | Good default |
|---|---|
| Hidden layers in MLPs / CNNs | ReLU or Leaky ReLU |
| Very deep modern architectures | GELU or SiLU |
| Binary output | Sigmoid |
| Multi-class output | Softmax |
| Regression output | Linear |
The later chapters in this mini-series cover the smoother modern functions and the output-layer functions in more detail.
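For the output rows of the table, the sketch below shows sigmoid for a single binary logit and softmax for a vector of class logits. Subtracting the maximum logit before exponentiating is a common stability trick that does not change the result. The example logits are made up for illustration.

```python
import math


def sigmoid(z):
    """Map one logit to a probability for the positive class."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    """Map a list of logits to probabilities that sum to 1."""
    m = max(logits)  # shift by the max so exp never overflows
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


binary_prob = sigmoid(0.0)            # an undecided binary classifier
class_probs = softmax([2.0, 1.0, 0.1])  # hypothetical three-class logits

print(binary_prob)
print(class_probs, sum(class_probs))
```

A regression output needs no such function: the linear (identity) activation simply returns the final affine score.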
What Can Go Wrong with Classical Activations?
Typical failure modes
| Activation | Potential problem |
|---|---|
| Step | Not useful for standard backpropagation because the derivative is zero almost everywhere. |
| Sigmoid | Saturates in the tails and causes vanishing gradients in deep hidden stacks. |
| Tanh | Zero-centered, but still saturates for large magnitudes. |
| ReLU | Can create dead neurons that never reactivate. |
| ReLU6 / clipped variants | Gain control, but can reduce expressivity if clipping is too aggressive. |
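The dead-neuron row can be illustrated with a single scalar neuron. In the sketch below the weight, bias, inputs, and the constant upstream gradient of 1 are all hypothetical values, chosen so that the pre-activation `w * x + b` is negative for every input. Under ReLU the weight gradient is then exactly zero, so gradient descent cannot move the neuron; a leaky slope (assumed `alpha=0.01`) keeps a small signal alive.

```python
def relu_grad(z):
    """Derivative of ReLU."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU with an assumed slope alpha."""
    return 1.0 if z > 0 else alpha


def weight_gradient(local_grad, w, b, inputs, upstream=1.0):
    """Sum over inputs of d(loss)/dw, assuming a constant upstream gradient."""
    return sum(upstream * local_grad(w * x + b) * x for x in inputs)


inputs = [0.5, 1.0, 2.0]
w, b = 1.0, -5.0  # hypothetical: the bias has been pushed far negative

relu_update = weight_gradient(relu_grad, w, b, inputs)
leaky_update = weight_gradient(leaky_relu_grad, w, b, inputs)

print(relu_update)   # zero: the neuron receives no learning signal
print(leaky_update)  # small but non-zero
```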
Common Mistakes
- Thinking depth alone creates expressivity. Without non-linearity, a stack of layers collapses into one affine map.
- Using sigmoid everywhere. It is useful at the output for binary probabilities, but usually a weak default for deep hidden stacks.
- Thinking ReLU is "just a formula." It changed deep learning because of its gradient behavior, not only because it is simple.
Main Takeaway
Activation functions determine what kind of signal a neuron emits and how gradients travel backward through the network. That is why they sit at the intersection of expressivity, optimization, and practical performance.
The clean historical story is: step → sigmoid/tanh → ReLU → modern smooth and gated activations.
References
- Nair, V. and Hinton, G. E. "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML 2010.
- Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
- Glorot, X., Bordes, A., and Bengio, Y. "Deep Sparse Rectifier Neural Networks." AISTATS 2011.
