Neural Network Activation Functions

Activations

Name	Formula	Range	Use / notes
Identity	f(x) = x	(−∞, ∞)	Regression output
Sigmoid	1 / (1 + e^{−x})	(0, 1)	Binary classification output; saturates
Tanh	(e^x − e^{−x}) / (e^x + e^{−x})	(−1, 1)	Zero-centered; saturates
ReLU	max(0, x)	[0, ∞)	Default hidden activation; cheap; can "die"
Leaky ReLU	max(0.01x, x)	(−∞, ∞)	Fixes dying-ReLU
PReLU	x if x>0 else αx, α learned	(−∞, ∞)	Parametric leaky
ELU	x if x>0 else α(e^x − 1)	(−α, ∞)	Smooth for α = 1; mean activation closer to zero
GELU	x · Φ(x) ≈ 0.5x(1 + tanh(…))	(≈−0.17, ∞)	Transformers (BERT, GPT)
SiLU / Swish	x · sigmoid(x)	(≈−0.28, ∞)	EfficientNet, modern LLMs
Mish	x · tanh(softplus(x))	(≈−0.31, ∞)	Alternative to Swish
Softplus	ln(1 + e^x)	(0, ∞)	Smooth ReLU
Softmax	e^{xᵢ} / Σ e^{xⱼ}	(0, 1) summing to 1	Multi-class output
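
As a rough companion to the table, here is a minimal sketch of several of the formulas in plain Python (standard library only). The function names, the default slope and α values, and the `approx_min` helper are illustrative assumptions, not part of any particular framework. GELU uses the exact x · Φ(x) form via `math.erf` rather than the tanh approximation.

```python
import math

def sigmoid(x):
    # Branch on sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # slope=0.01 matches the table; other values are common too.
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    # expm1 computes e^x - 1 accurately for x near zero.
    return x if x > 0 else alpha * math.expm1(x)

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x * sigmoid(x)

def softplus(x):
    # Rewritten as max(x, 0) + log(1 + e^{-|x|}) to avoid overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

def softmax(xs):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def approx_min(fn, lo=-5.0, hi=0.0, steps=5000):
    # Hypothetical helper: coarse grid scan for the minimum of fn on [lo, hi].
    return min(fn(lo + (hi - lo) * i / steps) for i in range(steps + 1))
```

Scanning with `approx_min(gelu)`, `approx_min(silu)`, and `approx_min(mish)` and rounding to two decimals should reproduce the approximate lower bounds listed in the Range column (−0.17, −0.28, −0.31).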

Hidden layers: ReLU for CNNs, GELU/SiLU for transformers.
Output for classification: sigmoid (binary), softmax (multi-class).
Output for regression: linear (no activation).
Dying ReLU: switch to Leaky ReLU or ELU, or lower the learning rate if it is too high.
Sigmoid/Tanh in deep nets: avoid as hidden activations; both saturate for large |x|, so gradients vanish.
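
The last two rules can be illustrated with local derivatives. The sketch below is a deliberately simplified model: it assumes every layer sees the same pre-activation value and ignores the weight matrices, so `chained_grad` (a hypothetical helper) only shows how the activation's own derivative scales a gradient across depth.

```python
import math

def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s); its maximum is 0.25 at x = 0.
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; equals 1 at x = 0, shrinks as |x| grows.
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # Zero gradient for x <= 0 is what lets a ReLU unit "die".
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    # A small nonzero slope keeps some gradient flowing for x <= 0.
    return 1.0 if x > 0 else slope

def chained_grad(grad_fn, x, depth):
    # Simplifying assumption: identical pre-activation x in every layer,
    # weights ignored, so the chain rule reduces to a power.
    return grad_fn(x) ** depth

# Even in sigmoid's best case (x = 0), ten layers scale the gradient by 0.25**10.
print(chained_grad(sigmoid_grad, 0.0, 10))
# An active ReLU path passes the gradient through unchanged.
print(chained_grad(relu_grad, 1.0, 10))
```

Under these assumptions the sigmoid chain shrinks by at least a factor of four per layer, while ReLU either preserves the gradient (active unit) or blocks it entirely (inactive unit), which is the case Leaky ReLU and ELU are meant to soften.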
