Activation Functions in Neural Networks: Why Non-Linearity Matters

TL;DR: A neural network without activation functions is just a stack of linear layers pretending to be deep. Activation functions are what bend the geometry, control gradient flow, and decide whether a network behaves like a hard switch, a soft gate, or a smooth feature extractor.

Why Activation Functions Exist

The core equation of a hidden layer is simple, with σ standing for whichever activation function the layer uses:

\[ h = \sigma(Wx + b) \]

The pre-activation Wx + b is only an affine transformation, and a composition of affine transformations is itself affine. If every layer did only that, stacking ten layers would still collapse into one big affine transformation. Depth would give you more parameters, but no additional expressive power.

Activation functions are the thing that breaks that collapse. They inject non-linearity, which means the network can carve curved decision boundaries, represent thresholds, and model interactions that a linear model cannot.
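The collapse can be checked by hand on a tiny network. The sketch below uses only the Python standard library; the weights, biases, and helper names (`affine`, `matmul`, `relu_vec`) are made up for illustration and are not taken from any particular library.

```python
def affine(W, b, x):
    """Return W @ x + b for a list-of-rows matrix W and list vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def matmul(A, B):
    """Return the matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    """Apply ReLU element-wise to a list vector."""
    return [max(0.0, t) for t in v]

# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.5, -0.5]
W2, b2 = [[1.0, 1.0]], [1.0]
x = [2.0, -1.0]

# Two affine layers with no activation in between.
stacked = affine(W2, b2, affine(W1, b1, x))

# The same map written as ONE affine layer: W = W2 W1, b = W2 b1 + b2.
W_merged = matmul(W2, W1)
b_merged = affine(W2, b2, b1)
merged = affine(W_merged, b_merged, x)

# Putting a ReLU between the layers breaks the equivalence.
with_relu = affine(W2, b2, relu_vec(affine(W1, b1, x)))

print(stacked, merged, with_relu)  # [1.0] [1.0] [4.5]
```

With these particular weights the merged matrix is all zeros, so the purely linear stack ignores its input entirely, while the ReLU version responds to the difference between the two input coordinates.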

Good intuition: an activation function decides how much of a neuron's signal should move forward. Some behave like hard on/off switches. Others behave like soft gates. Others are chosen mainly because they make the network easier to train with gradient-based optimization.

The Core Intuition

Think of a neuron as a tiny processor that first computes a score and then asks: should I pass this signal, suppress it, clip it, smooth it, or gate it?

  • A step activation behaves like a binary rule.
  • A sigmoid behaves like a soft probability gate.
  • A tanh behaves like a centered soft gate.
  • A ReLU behaves like a one-way valve: block negatives, pass positives.

That tiny local choice changes the global behavior of the whole network.
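To make the four behaviors concrete, here is a minimal sketch that pushes the same raw scores through each gate. The step threshold of 0 and the choice to output 0 exactly at the threshold are assumptions made for this example; other conventions exist.

```python
import math

def step(z, threshold=0.0):
    """Hard binary rule: 1 above the threshold, 0 otherwise (assumed convention)."""
    return 1.0 if z > threshold else 0.0

def sigmoid(z):
    """Soft gate with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """One-way valve: block negatives, pass positives unchanged."""
    return max(0.0, z)

scores = [-3.0, -0.5, 0.0, 0.5, 3.0]
for z in scores:
    print(
        f"score={z:+.1f}  step={step(z):.0f}  sigmoid={sigmoid(z):.3f}  "
        f"tanh={math.tanh(z):+.3f}  relu={relu(z):.1f}"
    )
```

Reading the printed rows side by side shows the difference in character: the step output jumps, sigmoid and tanh change gradually, and ReLU passes positive scores through untouched.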

Diagram showing a neuron computing a linear score and then passing it through different kinds of activation gates
Figure 1: The same linear score can be turned into very different behaviors depending on the activation: a hard threshold, a soft probability gate, or a one-way valve like ReLU. The activation is what decides how the raw score becomes a useful signal.

Historical Progression

How the field evolved

  1. Step / threshold activations: good for early perceptrons, but their derivative is zero almost everywhere, so they give gradient-based learning nothing to work with.
  2. Sigmoid and tanh: smooth and differentiable, which made backpropagation practical, but they saturate.
  3. ReLU: dramatically simplified optimization and became the default for CNNs and MLPs.
  4. Modern smooth activations: GELU, SiLU (also called Swish), Mish, and gated variants improved optimization in large modern models.

Classical Families

A. Linear and Threshold Activations

These are conceptually important because they show the two extremes.

  • Linear / Identity: does nothing; useful mainly in regression outputs.
  • Step / Heaviside: flips from 0 to 1 once a threshold is crossed.

The linear activation is not wrong, but if you use it in every hidden layer, you lose the entire point of deep learning.

B. Squashing Functions

The first major family maps inputs into a bounded range:

  • Sigmoid: maps to (0, 1).
  • Tanh: maps to (-1, 1) and is zero-centered.
  • Softsign: maps to (-1, 1) and also saturates, but more gently than tanh.

Sigmoid: \[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Tanh: \[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

Softsign: \[ \operatorname{softsign}(x) = \frac{x}{1 + |x|} \]

These functions were historically attractive because they are smooth and easy to differentiate. Their main weakness is saturation: for large positive or negative inputs, the derivative becomes tiny.
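Saturation is easy to see by evaluating the derivatives directly. The sketch below uses the standard closed forms σ'(x) = σ(x)(1 − σ(x)), tanh'(x) = 1 − tanh²(x), and softsign'(x) = 1 / (1 + |x|)²; the function names and sample points are chosen just for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    """Derivative of sigmoid; peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    """Derivative of tanh; peaks at 1.0 when x = 0."""
    return 1.0 - math.tanh(x) ** 2

def d_softsign(x):
    """Derivative of softsign; decays polynomially rather than exponentially."""
    return 1.0 / (1.0 + abs(x)) ** 2

for x in [0.0, 2.0, 5.0, 10.0]:
    print(
        f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.2e}  "
        f"tanh'={d_tanh(x):.2e}  softsign'={d_softsign(x):.2e}"
    )
```

All three derivatives shrink as |x| grows, but softsign's tail falls off much more slowly than the exponential tails of sigmoid and tanh, which is the sense in which it saturates more gently.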

C. Piecewise-Linear Functions

Then came the ReLU era:

  • ReLU: keeps the positive branch and zeros out the negative one.
  • Leaky ReLU: uses a small negative slope instead of a hard zero.
  • PReLU: learns that negative slope.
  • RReLU: uses a random negative slope during training.
  • ReLU6: same idea as ReLU, but clipped at 6.
  • Thresholded ReLU: stays at zero until a chosen threshold.

ReLU: \[ \operatorname{ReLU}(x) = \max(0, x) \]

Leaky ReLU, where \(\alpha\) is a small positive slope: \[ \operatorname{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \]

ReLU6: \[ \operatorname{ReLU6}(x) = \min(\max(0, x), 6) \]

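These functions made optimization much easier because the derivative on the active positive branch is exactly 1, so gradients pass backward through active units without shrinking. ReLU6 is the exception above its clip point, where the derivative drops back to zero.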
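The piecewise-linear family is short enough to write out in full. This is a scalar, standard-library sketch: the default `alpha`, the threshold `theta`, and the RReLU sampling bounds are illustrative assumptions, and the derivative at exactly 0 is set to the negative-side value by convention.

```python
import random

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Same form as PReLU; in PReLU, alpha would be a learned parameter."""
    return x if x > 0 else alpha * x

def rrelu(x, rng, lower=0.125, upper=1.0 / 3.0, training=True):
    """Random negative slope while training, fixed average slope otherwise."""
    if x > 0:
        return x
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2.0
    return alpha * x

def relu6(x):
    return min(max(0.0, x), 6.0)

def thresholded_relu(x, theta=1.0):
    """Zero until x exceeds theta, identity afterwards."""
    return x if x > theta else 0.0

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_leaky_relu(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

rng = random.Random(0)  # seeded so the RReLU sample is repeatable
for x in [-2.0, 0.5, 8.0]:
    print(x, relu(x), leaky_relu(x), rrelu(x, rng), relu6(x), thresholded_relu(x))
```

Note that `d_relu` is either 0 or 1 and never an intermediate value, which is why active units keep a strong gradient while inactive ones contribute none.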

Grid of classical activation functions including linear, step, sigmoid, tanh, ReLU, Leaky ReLU, PReLU, RReLU, Softplus, Softsign, ReLU6, and Thresholded ReLU
Figure 2: A visual cheat sheet for the classical activation family. The main story is already visible in the shapes: squashing activations saturate, ReLU-like activations keep a strong positive branch, and clipped variants trade expressivity for stability or efficiency.

What the Shapes Are Telling You

You can often predict training behavior by looking at the curve.

| Shape pattern | What it usually implies |
| --- | --- |
| Flat tails | Risk of vanishing gradients |
| Hard zero region | Risk of dead neurons |
| Smooth transition | More stable optimization |
| Unbounded positive branch | Strong gradient flow for active units |
| Clipping | Better control, but less expressivity |

So activation functions are not just output transformations. They are also gradient transformations.

Gradient Perspective

Backpropagation computes gradients with the chain rule, which means multiplying many layer-wise derivatives together. That is where activation choice becomes decisive.

The four recurring problems

| Problem | Meaning |
| --- | --- |
| Vanishing gradients | Derivatives become so small that early layers barely learn. |
| Exploding gradients | Derivatives become too large and make optimization unstable. |
| Dead neurons | Some ReLU units stay permanently inactive because they only see negative inputs. |
| Saturation | Sigmoid/tanh flatten for large magnitudes, so gradient flow collapses. |

This is why ReLU became such a turning point: it did not solve everything, but it avoided the worst saturation behavior that slowed down older deep networks.
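A toy calculation shows how quickly the product of derivatives can shrink or blow up. The sketch below assumes a deliberately simplified network: a chain of scalar layers h ← act(w · h) with one shared weight `w` and no biases. The function `chain_gradient` is a hypothetical helper written for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return max(0.0, z)

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def chain_gradient(x, depth, act, d_act, w=1.0):
    """Return d(output)/d(input) for `depth` stacked scalar layers h <- act(w * h)."""
    grad, h = 1.0, x
    for _ in range(depth):
        z = w * h               # pre-activation of this layer
        grad *= w * d_act(z)    # chain rule: multiply by the local derivative
        h = act(z)
    return grad

sigmoid_grad = chain_gradient(0.5, 10, sigmoid, d_sigmoid)     # at most 0.25 ** 10
relu_grad = chain_gradient(0.5, 10, relu, d_relu)              # exactly 1.0
relu_exploding = chain_gradient(0.5, 10, relu, d_relu, w=2.0)  # 2 ** 10 = 1024.0

print(sigmoid_grad, relu_grad, relu_exploding)
```

Because the sigmoid derivative never exceeds 0.25, ten sigmoid layers with unit weights scale the gradient by less than one millionth. The ReLU chain leaves it untouched, and the `w=2.0` case shows that ReLU does nothing to prevent explosion when weights are large.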

Diagram contrasting vanishing gradients, dead neurons, and healthy gradient flow across common activations
Figure 3: Activation choice is really a gradient-management decision. Sigmoid and tanh can flatten into tiny derivatives, ReLU can kill units on the negative side, and smoother modern activations try to preserve useful gradient flow near zero.

Practical First Recommendations

If you are just starting, a strong first mental map is:

| Use case | Good default |
| --- | --- |
| Hidden layers in MLPs / CNNs | ReLU or Leaky ReLU |
| Very deep modern architectures | GELU or SiLU |
| Binary output | Sigmoid |
| Multi-class output | Softmax |
| Regression output | Linear |
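For the output-layer rows of the table, a minimal sketch of the two probability-producing choices is given here. The logits are invented for illustration, and subtracting the maximum logit inside `softmax` is a common numerical-stability trick rather than something the table prescribes.

```python
import math

def sigmoid(z):
    """Binary output: turn one logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Multi-class output: turn a list of logits into probabilities that sum to 1."""
    m = max(logits)                            # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

binary_prob = sigmoid(0.0)                     # 0.5: the model is undecided
class_probs = softmax([2.0, 1.0, 0.1])         # largest logit gets the largest share
stable_probs = softmax([1000.0, 1000.0])       # no overflow thanks to the shift

print(binary_prob, class_probs, stable_probs)
```

A regression output needs no counterpart here, since the linear (identity) activation simply returns the raw score.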

The later chapters in this mini-series cover the smoother modern functions and the output-layer functions in more detail.

What Can Go Wrong with Classical Activations?

Typical failure modes

| Activation | Potential problem |
| --- | --- |
| Step | Not useful for standard backpropagation because the derivative is zero almost everywhere. |
| Sigmoid | Saturates in the tails and causes vanishing gradients in deep hidden stacks. |
| Tanh | Zero-centered, but still saturates for large magnitudes. |
| ReLU | Can create dead neurons that never reactivate. |
| ReLU6 / clipped variants | Gain control, but can reduce expressivity if clipping is too aggressive. |
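The dead-neuron failure can be reproduced with a single unit. The sketch below assumes a scalar unit y = act(w · x + b), a made-up batch of inputs, and an upstream gradient of 1 for every sample. The helper `unit_grads` is hypothetical; `alpha=0.0` gives plain ReLU and a positive `alpha` gives Leaky ReLU.

```python
def unit_grads(w, b, xs, alpha=0.0, upstream=1.0):
    """Sum gradients of the loss w.r.t. w and b over a batch for y = act(w*x + b)."""
    grad_w = grad_b = 0.0
    for x in xs:
        z = w * x + b
        local = 1.0 if z > 0 else alpha   # derivative of the activation at z
        grad_w += upstream * local * x
        grad_b += upstream * local
    return grad_w, grad_b

xs = [0.1, 0.4, 0.7, 1.0]                 # illustrative inputs, all in (0, 1]

alive = unit_grads(w=1.0, b=0.0, xs=xs)              # every sample is active
dead = unit_grads(w=1.0, b=-5.0, xs=xs)              # every pre-activation is negative
leaky = unit_grads(w=1.0, b=-5.0, xs=xs, alpha=0.01) # same unit with a leaky slope

print(alive, dead, leaky)
```

The `dead` unit receives exactly zero gradient for both parameters, so gradient descent has no way to move it back into the active region. The leaky variant still gets a small, non-zero signal.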

Common Mistakes

  1. Thinking depth alone creates expressivity. Without non-linearity, depth collapses into one linear map.
  2. Using sigmoid everywhere. It is useful at the output for binary probabilities, but usually a weak default for deep hidden stacks.
  3. Thinking ReLU is “just a formula.” It changed deep learning because of its gradient behavior, not only because it is simple.

Main Takeaway

Activation functions determine what kind of signal a neuron emits and how gradients travel backward through the network. That is why they sit at the intersection of expressivity, optimization, and practical performance.

The clean historical story is: step → sigmoid/tanh → ReLU → modern smooth and gated activations.

References

  1. Nair, V. and Hinton, G. E. “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML 2010.
  2. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
  3. Glorot, X., Bordes, A., and Bengio, Y. “Deep Sparse Rectifier Neural Networks.” AISTATS 2011.