Activation Functions in Neural Networks: Why Non-Linearity Matters


Published: June 01, 2026

TL;DR: A neural network without activation functions is just a stack of linear layers pretending to be deep. Activation functions are what bend the geometry, control gradient flow, and decide whether a network behaves like a hard switch, a soft gate, or a smooth feature extractor.

Why Activation Functions Exist

Intuition First: Imagine stacking transparent overlays on a map. Each overlay is a straight line drawn across the city — no matter how many you stack, you can only ever describe things that fit straight-line logic. Activation functions are what let each layer bend its overlay into a curve. Without them, ten layers of computation are exactly equivalent to one.

The core equation of a hidden layer is simple, with σ standing for whichever activation function the layer uses:

\[ h = \sigma(Wx + b) \]

The product Wx + b is only an affine transformation. If every layer did only that, two stacked layers would give W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂), which is again a single affine map, and by the same argument ten layers still collapse into one. Depth would give you more parameters, but no more expressive power.
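The collapse is easy to check numerically. The sketch below uses plain Python lists and made-up 2×2 weights (`W1`, `b1`, `W2`, `b2` are illustrative values, not taken from any real model) to show that two activation-free layers give the same output as one merged layer.

```python
def affine(W, b, x):
    """Return W @ x + b for a list-of-rows matrix W and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def matmul(A, B):
    """Plain matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Two stacked layers with no activation in between (illustrative values).
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]

# The equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = affine(W2, b2, b1)

x = [3.0, -2.0]
two_layers = affine(W2, b2, affine(W1, b1, x))
one_layer = affine(W, b, x)
print(two_layers, one_layer)  # both print the same vector
```

Putting any non-linear function between the two `affine` calls breaks this equality, which is the whole point of an activation.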

Activation functions are the thing that breaks that collapse. They inject non-linearity, which means the network can carve curved decision boundaries, represent thresholds, and model interactions that a linear model cannot.

Key Insight: An activation function decides how much of a neuron's signal should move forward. Some behave like hard on/off switches. Others behave like soft gates. Others are chosen mainly because they make the network easier to train with gradient-based optimization.

Animated — Without activations, a deep network can only ever draw a straight line as its decision boundary (left). With non-linearity, it can learn the curved boundary that actually separates the data (right).

The Core Intuition

Think of a neuron as a tiny processor that first computes a score and then asks: should I pass this signal, suppress it, clip it, smooth it, or gate it?

A step activation behaves like a binary rule.
A sigmoid behaves like a soft probability gate.
A tanh behaves like a centered soft gate.
A ReLU behaves like a one-way valve: block negatives, pass positives.

That tiny local choice changes the global behavior of the whole network.
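To make the four behaviors concrete, here is a minimal sketch that feeds the same scores through each gate. It uses only the standard library, and placing the step threshold at zero is an assumption made for illustration.

```python
import math

def step(z):
    # Hard binary rule; the threshold at 0 is an illustrative choice.
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    # Soft gate with outputs between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # One-way valve: block negatives, pass positives unchanged.
    return max(0.0, z)

for z in (-2.0, -0.5, 0.5, 2.0):
    print(f"z={z:+.1f}  step={step(z):.0f}  sigmoid={sigmoid(z):.3f}  "
          f"tanh={math.tanh(z):+.3f}  relu={relu(z):.1f}")
```

Reading the printed rows side by side shows the same score being switched, softly gated, centered, or rectified.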

The four classical activation shapes trace in sequence. Notice how the Step is a hard binary flip, Sigmoid and Tanh are smooth S-curves (but flatten in the tails), and ReLU is simply a half-rectification — zero on the left, identity on the right.

Figure 1: The same linear score can be turned into very different behaviors depending on the activation: a hard threshold, a soft probability gate, or a one-way valve like ReLU. The activation is what decides how the raw score becomes a useful signal. (Diagram: a neuron computes a linear score and then passes it through different kinds of activation gates.)

Historical Progression

How the field evolved

Step / threshold activations: good for early perceptrons, but their derivative is zero almost everywhere, so they give gradient-based learning nothing to work with.
Sigmoid and tanh: smooth and differentiable, which made backpropagation practical, but they saturate.
ReLU: dramatically simplified optimization and became the default for CNNs and MLPs.
Modern smooth activations: GELU, SiLU (also known as Swish), Mish, and gated variants improved optimization in large modern models.

Classical Families

A. Linear and Threshold Activations

These are conceptually important because they show the two extremes.

Linear / Identity: does nothing; useful mainly in regression outputs.
Step / Heaviside: flips from 0 to 1 once a threshold is crossed.

The linear activation is not wrong, but if you use it in every hidden layer, you lose the entire point of deep learning.

B. Squashing Functions

The first major family maps inputs into a bounded range:

Sigmoid: maps to the open interval (0, 1).
Tanh: maps to the open interval (-1, 1) and is zero-centered.
Softsign: also saturates, but more gently than tanh.

Sigmoid \[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Tanh \[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

Softsign \[ \operatorname{softsign}(x) = \frac{x}{1 + |x|} \]

These functions were historically attractive because they are smooth and easy to differentiate. Their main weakness is saturation: for large positive or negative inputs, the derivative becomes tiny.
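The claim that softsign saturates more gently than tanh can be checked by comparing derivatives. This is a small sketch using the standard formulas, with the sample points chosen arbitrarily: the softsign derivative is 1 / (1 + |x|)², which decays polynomially, while the tanh derivative 1 − tanh²(x) decays exponentially.

```python
import math

def softsign(x):
    return x / (1.0 + abs(x))

def softsign_grad(x):
    # d/dx [x / (1 + |x|)] = 1 / (1 + |x|)^2
    return 1.0 / (1.0 + abs(x)) ** 2

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 3.0, 6.0):
    print(f"x={x:.1f}  tanh'={tanh_grad(x):.5f}  softsign'={softsign_grad(x):.5f}")
```

Near zero tanh has the larger gradient, but in the tails softsign keeps noticeably more gradient alive.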

Key Insight: The sigmoid derivative peaks at exactly 0.25 when x=0. That means even at its best, it cuts the gradient to a quarter compared to passing it unchanged. Stack 10 sigmoid layers and the best-case gradient shrinks to 0.25¹⁰ ≈ 0.000001. That is the vanishing gradient problem in one number.

Concrete numerical example — sigmoid saturation:

Input x	σ(x)	σ′(x) = σ(x)(1−σ(x))
0	0.500	0.250 (maximum)
2	0.880	0.105
4	0.982	0.018
6	0.998	0.002
8	0.9997	0.0003

Each row shows why neurons that receive large-magnitude inputs essentially stop learning — the gradient through them is nearly zero.
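The saturation table can be recomputed in a few lines. This is a minimal sketch using only the standard library. The last line also evaluates the best-case gradient through ten sigmoid layers, ignoring the weight matrices, which is a simplifying assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # The derivative can be written in terms of the output itself.
    s = sigmoid(x)
    return s * (1.0 - s)

rows = [(x, sigmoid(x), sigmoid_grad(x)) for x in (0, 2, 4, 6, 8)]
for x, s, g in rows:
    print(f"{x}\t{s:.4f}\t{g:.4f}")

# Best case: every layer sits at x = 0, where the derivative peaks.
best_case_10_layers = sigmoid_grad(0) ** 10
print(f"best-case product over 10 layers: {best_case_10_layers:.2e}")
```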

C. Piecewise-Linear Functions

Then came the ReLU era:

ReLU: keeps the positive branch and zeros out the negative one.
Leaky ReLU: uses a small negative slope instead of a hard zero.
PReLU: learns that negative slope.
RReLU: uses a random negative slope during training.
ReLU6: same idea as ReLU, but clipped at 6.
Thresholded ReLU: stays at zero until a chosen threshold.

ReLU \[ \operatorname{ReLU}(x) = \max(0, x) \]

Leaky ReLU \[ \operatorname{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \]

ReLU6 \[ \operatorname{ReLU6}(x) = \min(\max(0, x), 6) \]

These functions made optimization much easier because their positive branch keeps a strong gradient.
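The fixed-slope members of this family are one-liners. Here is a minimal sketch of them. The defaults `alpha=0.01` and `theta=1.0` are illustrative assumptions rather than values prescribed above, and PReLU and RReLU are left out because their slopes are learned or sampled during training.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small negative-side slope (illustrative default).
    return x if x > 0 else alpha * x

def relu6(x):
    # ReLU with the positive branch clipped at 6.
    return min(max(0.0, x), 6.0)

def thresholded_relu(x, theta=1.0):
    # Identity above the threshold theta, zero otherwise (theta is illustrative).
    return x if x > theta else 0.0

for x in (-3.0, 0.5, 2.0, 8.0):
    print(f"x={x:+.1f}  relu={relu(x):.2f}  leaky={leaky_relu(x):+.2f}  "
          f"relu6={relu6(x):.2f}  thresholded={thresholded_relu(x):.2f}")
```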

Concrete step-by-step: how ReLU saves the gradient

Imagine a single neuron receives pre-activation z = 1.5 and the upstream gradient (from the loss) is δ = 0.8.

Activation	Output	Local derivative	Gradient passed back
Sigmoid	σ(1.5) = 0.818	σ′(1.5) = 0.149	0.8 × 0.149 = 0.119
Tanh	tanh(1.5) = 0.905	1−0.905² = 0.181	0.8 × 0.181 = 0.145
ReLU	max(0,1.5) = 1.5	1	0.8 × 1.0 = 0.800

ReLU passes the gradient through unchanged on the positive side. Stacked over many layers, that difference becomes enormous.
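The same comparison can be reproduced directly. The sketch below uses the values from the example (z = 1.5 and an upstream gradient of 0.8) and multiplies each activation's local derivative by the upstream gradient. The variable names are chosen for illustration.

```python
import math

z, upstream = 1.5, 0.8

s = 1.0 / (1.0 + math.exp(-z))
local_derivative = {
    "sigmoid": s * (1.0 - s),
    "tanh": 1.0 - math.tanh(z) ** 2,
    "relu": 1.0 if z > 0 else 0.0,
}

# Chain rule for one neuron: gradient passed back = upstream * local derivative.
passed_back = {name: upstream * d for name, d in local_derivative.items()}

for name in local_derivative:
    print(f"{name}\tlocal={local_derivative[name]:.3f}\tpassed back={passed_back[name]:.3f}")
```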

Figure 2: A visual cheat sheet for the classical activation family. The main story is already visible in the shapes: squashing activations saturate, ReLU-like activations keep a strong positive branch, and clipped variants trade expressivity for stability or efficiency. (Image: grid of classical activation functions including linear, step, sigmoid, tanh, ReLU, Leaky ReLU, PReLU, RReLU, Softplus, Softsign, ReLU6, and Thresholded ReLU.)

What the Shapes Are Telling You

You can often predict training behavior by looking at the curve.

Shape pattern	What it usually implies
Flat tails	Risk of vanishing gradients
Hard zero region	Risk of dead neurons
Smooth transition	More stable optimization
Unbounded positive branch	Strong gradient flow for active units
Clipping	Better control, but less expressivity

So activation functions are not just output transformations. They are also gradient transformations.

Gradient Perspective

Intuition First: Backpropagation is just the chain rule applied repeatedly. Each layer contributes a multiplier to the chain: the activation's local derivative times the layer's weights. If those multipliers are consistently smaller than 1 in magnitude, the product shrinks toward zero as it travels backward, which is the vanishing gradient problem. If they are consistently larger than 1, the product explodes. The ideal activation derivative is exactly 1 on the active side, which is what ReLU provides.

Backpropagation trains a network by multiplying many derivatives together. That is where activation choice becomes decisive.
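A rough way to see this is to multiply only the activation derivatives along a chain of layers. The sketch below deliberately ignores the weight matrices, which is a simplifying assumption. The depth of 20 and the pre-activation value 0.5 are arbitrary illustrative choices.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def backprop_multiplier(local_grad, pre_activations):
    """Product of activation derivatives along a chain of layers."""
    product = 1.0
    for z in pre_activations:
        product *= local_grad(z)
    return product

zs = [0.5] * 20  # one pre-activation per layer, all on the active side
print(f"sigmoid chain: {backprop_multiplier(sigmoid_grad, zs):.3e}")
print(f"relu chain:    {backprop_multiplier(relu_grad, zs):.3e}")
```

With every unit active, the ReLU chain leaves the gradient untouched, while the sigmoid chain shrinks it by many orders of magnitude. A single non-positive pre-activation in the ReLU chain would zero the product, which is the dead-neuron side of the trade-off.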

The four recurring problems

Problem	Meaning
Vanishing gradients	Derivatives become so small that early layers barely learn.
Exploding gradients	Derivatives become too large and make optimization unstable.
Dead neurons	Some ReLU units stay permanently inactive because their pre-activation is negative for every input.
Saturation	Sigmoid/tanh flatten for large magnitudes, so gradient flow collapses.

This is why ReLU became such a turning point: it did not solve everything, but it avoided the worst saturation behavior that slowed down older deep networks.

Animated dead neuron lifecycle. A large weight update pushes the neuron's pre-activation to z < 0 for every input. ReLU clips its output to zero, so no gradient flows back (∂ReLU/∂z = 0 for z < 0). With zero gradient the weights stop updating, and the neuron stays dead unless earlier layers shift its inputs back into the positive range.
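The freeze can be reproduced with a single ReLU neuron. This is a toy sketch under stated assumptions: the half squared-error loss, the learning rate, the data, and the helper name `sgd_step` are all made up for illustration.

```python
def relu(z):
    return max(0.0, z)

def sgd_step(w, b, inputs, targets, lr=0.1):
    """One gradient step on mean half squared error for y = relu(w * x + b)."""
    grad_w = grad_b = 0.0
    for x, t in zip(inputs, targets):
        z = w * x + b
        y = relu(z)
        dz = (y - t) * (1.0 if z > 0 else 0.0)  # dL/dz is zero whenever z <= 0
        grad_w += dz * x
        grad_b += dz
    n = len(inputs)
    return w - lr * grad_w / n, b - lr * grad_b / n

inputs = [0.2, 0.5, 1.0]
targets = [1.0, 1.0, 1.0]

# Start from weights that make z negative for every training input.
w, b = -3.0, -1.0
for _ in range(100):
    w, b = sgd_step(w, b, inputs, targets)
print(w, b)  # unchanged: the neuron never receives a gradient
```

Starting the same loop from `w, b = 1.0, 0.0` does move the parameters, because the pre-activations are positive and the gradient is non-zero.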

Figure 3: Activation choice is really a gradient-management decision. Sigmoid and tanh can flatten into tiny derivatives, ReLU can kill units on the negative side, and smoother modern activations try to preserve useful gradient flow near zero. (Diagram: vanishing gradients, dead neurons, and healthy gradient flow contrasted across common activations.)

Practical First Recommendations

If you are just starting, a strong first mental map is:

Use case	Good default
Hidden layers in MLPs / CNNs	ReLU or Leaky ReLU
Very deep modern architectures	GELU or SiLU
Binary output	Sigmoid
Multi-class output	Softmax
Regression output	Linear

The later chapters in this mini-series cover the smoother modern functions and the output-layer functions in more detail.

What Can Go Wrong with Classical Activations?

Typical failure modes

Activation	Potential problem
Step	Not useful for standard backpropagation because the derivative is zero almost everywhere.
Sigmoid	Saturates in the tails and causes vanishing gradients in deep hidden stacks.
Tanh	Zero-centered, but still saturates for large magnitudes.
ReLU	Can create dead neurons that never reactivate.
ReLU6 / clipped variants	Gain control, but can reduce expressivity if clipping is too aggressive.

Side-by-Side Comparison: Classical Activations

Each panel shows the function (solid) and its derivative (dashed/lighter). Sigmoid and Tanh derivatives flatten to near-zero in the tails — vanishing gradient territory. ReLU's derivative is exactly 1 on the positive side, so gradients pass through undistorted for active neurons.


Common Mistakes

Thinking depth alone creates expressivity. Without non-linearity, depth collapses into one linear map.
Using sigmoid everywhere. It is useful at the output for binary probabilities, but usually a weak default for deep hidden stacks.
Thinking ReLU is “just a formula.” It changed deep learning because of its gradient behavior, not only because it is simple.

Main Takeaway

Activation functions determine what kind of signal a neuron emits and how gradients travel backward through the network. That is why they sit at the intersection of expressivity, optimization, and practical performance.

The clean historical story is: step → sigmoid/tanh → ReLU → modern smooth and gated activations.

References

Nair, V. and Hinton, G. E. “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML 2010.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
Glorot, X., Bordes, A., and Bengio, Y. “Deep Sparse Rectifier Neural Networks.” AISTATS 2011.


Alessio Borgi
