Neural Network Activation Functions

Common activations — formulas, output range, and typical use.

Updated Apr 19, 2026

Activations

Name | Formula | Range | Use / notes
Identity | f(x) = x | (−∞, ∞) | Regression output
Sigmoid | 1 / (1 + e^{−x}) | (0, 1) | Binary classification output; saturates
Tanh | (e^x − e^{−x}) / (e^x + e^{−x}) | (−1, 1) | Zero-centered; saturates
ReLU | max(0, x) | [0, ∞) | Default hidden activation; cheap; can "die"
Leaky ReLU | max(0.01x, x) | (−∞, ∞) | Mitigates dying ReLU
PReLU | max(αx, x), α learned | (−∞, ∞) | Leaky ReLU with a learned slope
ELU | x if x > 0 else α(e^x − 1) | (−α, ∞) | Smooth; pushes mean activations toward zero
GELU | x · Φ(x) ≈ 0.5x(1 + tanh(…)) | (≈−0.17, ∞) | Transformers (BERT, GPT)
SiLU / Swish | x · sigmoid(x) | (≈−0.28, ∞) | EfficientNet, modern LLMs
Mish | x · tanh(softplus(x)) | (≈−0.31, ∞) | Alternative to Swish
Softplus | ln(1 + e^x) | (0, ∞) | Smooth approximation of ReLU
Softmax | e^{xᵢ} / Σⱼ e^{xⱼ} | (0, 1), sums to 1 | Multi-class output
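
A minimal NumPy sketch of several of the formulas above; the function names and default parameters (α = 1.0, slope 0.01) are illustrative choices, not part of this reference, and the GELU uses the standard tanh approximation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))                        # range (0, 1)

    def relu(x):
        return np.maximum(0.0, x)                               # range [0, inf)

    def leaky_relu(x, slope=0.01):
        return np.maximum(slope * x, x)                         # range (-inf, inf)

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # range (-alpha, inf)

    def gelu(x):
        # tanh approximation of x * Phi(x)
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    def silu(x):
        return x * sigmoid(x)                                   # a.k.a. Swish

    def softmax(x, axis=-1):
        z = x - np.max(x, axis=axis, keepdims=True)             # shift by max for numerical stability
        e = np.exp(z)
        return e / np.sum(e, axis=axis, keepdims=True)          # entries in (0, 1), summing to 1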

Picking one

  • Hidden layers: ReLU for CNNs, GELU/SiLU for transformers (see the sketch after this list).
  • Output for classification: sigmoid (binary), softmax (multi-class).
  • Output for regression: linear (no activation).
  • Dying ReLU: use Leaky ReLU, ELU, or check that the learning rate isn't too high.
  • Sigmoid/Tanh in deep nets: avoid — gradients vanish.
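
A hypothetical PyTorch sketch of these choices; the layer sizes, head names, and dummy data are illustrative, not part of this reference. In practice, sigmoid and softmax are usually folded into the loss (BCEWithLogitsLoss, CrossEntropyLoss), so the output layer itself stays linear.

    import torch
    import torch.nn as nn

    hidden = nn.Sequential(                      # ReLU as the hidden-layer default
        nn.Linear(64, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
    )

    regression_head = nn.Linear(128, 1)          # linear output: no activation
    binary_head = nn.Linear(128, 1)              # pair with nn.BCEWithLogitsLoss (sigmoid inside the loss)
    multiclass_head = nn.Linear(128, 10)         # pair with nn.CrossEntropyLoss (softmax inside the loss)

    x = torch.randn(32, 64)                      # dummy batch of 32 feature vectors
    logits = multiclass_head(hidden(x))
    probs = torch.softmax(logits, dim=-1)        # explicit softmax only when probabilities are needed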
