Machine Learning Loss Functions
Regression, classification, and task-specific losses: what each measures and common pitfalls.
Regression
| Loss | Formula | Notes |
|---|---|---|
| MSE / L2 | (1/N) Σ (y − ŷ)² | Smooth; quadratic penalty makes outliers dominate |
| MAE / L1 | (1/N) Σ \|y − ŷ\| | Robust to outliers; gradient has constant magnitude |
| Huber | ½e² if \|e\| ≤ δ, else δ(\|e\| − δ/2) | e = y − ŷ; quadratic near zero, linear in the tails |
| Log-cosh | (1/N) Σ log(cosh(e)) | Smooth everywhere; ≈ L2 for small errors, ≈ L1 for large |
| Quantile (pinball) | (1/N) Σ max(q·e, (q−1)·e) | Targets the q-th quantile; q = 0.5 gives median regression |
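These formulas translate directly to code. A minimal NumPy sketch; the array names `y`, `y_hat` and the default values for `delta` and `q` are illustrative choices, not canonical:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    e = y - y_hat
    quad = 0.5 * e ** 2                       # L2 region near zero
    lin = delta * (np.abs(e) - 0.5 * delta)   # L1 region beyond delta
    return np.mean(np.where(np.abs(e) <= delta, quad, lin))

def quantile(y, y_hat, q=0.5):
    e = y - y_hat
    return np.mean(np.maximum(q * e, (q - 1) * e))  # pinball loss
```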
Classification
| Loss | Formula | Notes |
|---|---|---|
| Binary cross-entropy | −[y·log(p) + (1−y)·log(1−p)] | Use with sigmoid output |
| Categorical cross-entropy | −Σ yᵢ · log(pᵢ) | Use with softmax output |
| Sparse categorical CE | Index-label version | Same as above with integer labels |
| Hinge | max(0, 1 − y·ŷ) | SVMs; y ∈ {−1, +1} |
| Focal loss | −(1 − p_t)^γ · log(p_t) | Imbalanced classification |
| Label smoothing | Replace target 1 with 1−ε, spread ε over the other classes | Prevents overconfident predictions |
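A NumPy sketch of the two probability-based losses above, binary cross-entropy and focal loss; the clipping epsilon and the γ default are illustrative assumptions (γ = 2 is common but not universal):

```python
import numpy as np

def bce(y, p, eps=1e-12):
    # y in {0, 1}; p is the predicted probability of the positive class
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def focal(y, p, gamma=2.0, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)  # probability of the true class
    return np.mean(-((1 - p_t) ** gamma) * np.log(p_t))
```

The (1 − p_t)^γ factor down-weights well-classified examples, so training focuses on the hard, typically minority-class, samples.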
Task-specific
| Loss | Use |
|---|---|
| Triplet loss | Metric learning — embeddings |
| Contrastive (InfoNCE) | Self-supervised, CLIP-style training |
| Dice loss (1 − Dice coefficient) | Image segmentation (handles class imbalance) |
| IoU / Jaccard | Segmentation / detection |
| CTC loss | Sequence prediction without alignment (speech, OCR) |
| DPO / reward modeling | RLHF — preference-based fine-tuning |
| KL divergence | Distillation, variational methods |
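As one concrete example from this table, a minimal triplet-loss sketch; squared Euclidean distance and a fixed margin are common assumptions, but both vary across papers:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull anchor toward positive, push it past negative by `margin`."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # squared distances
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))
```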
Notes
- Add a small regularization term (L1 / L2 on weights) to reduce overfitting.
- Log-likelihoods should be computed in log-space to avoid numerical underflow; use log_softmax + NLL rather than taking softmax and then log (see the sketch below).
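A sketch of that stable pattern, assuming PyTorch; `F.cross_entropy` fuses the same two steps:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)           # batch of 4, 10 classes
targets = torch.randint(0, 10, (4,))  # integer class labels

# Stable: the log is fused into the softmax normalization.
loss = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# Equivalent in exact arithmetic, but softmax followed by log can
# underflow to log(0) = -inf when logits are extreme.

# F.cross_entropy(logits, targets) performs the fused computation directly.
```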