Variants
| Name | Update rule | Notes |
|---|---|---|
| SGD | θ ← θ − lr · ∇L | Baseline; requires lr tuning |
| SGD + momentum | v ← γv + ∇L; θ ← θ − lr · v | Accelerates in consistent directions |
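| Nesterov | v ← γv + ∇L(θ − lr · γv); θ ← θ − lr · v | Gradient taken at the lookahead point; better worst-case convergence rate on smooth convex problems |
| Adagrad | θ ← θ − lr · g / (√(Σg²) + ε) | Adapts per-parameter; effective lr shrinks monotonically, often too fast |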
| RMSprop | EMA of g², divide gradient by √EMA | Fixes Adagrad's lr decay |
| Adam | Momentum (EMA of g) + RMSprop (EMA of g²), with bias correction | Default go-to optimizer |
| AdamW | Adam with decoupled weight decay | Common choice for transformers; often generalizes better than Adam with L2 regularization |
| Lion | Sign of an interpolation between momentum and gradient | Memory-light (one state buffer per parameter vs. Adam's two); reported competitive with AdamW |
| L-BFGS | Limited-memory quasi-Newton | Full-batch; small datasets |
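
To make the Adam and AdamW rows concrete, here is a minimal sketch of one update step on plain Python lists. The function names `adamw_step` and `minimize_quadratic`, and the toy loss L(θ) = Σθ², are illustrative assumptions, not part of any library; real frameworks apply the same arithmetic to tensors.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on lists of floats; t is the 1-based step count.

    Hypothetical helper for illustration. Returns (theta, m, v).
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # EMA of gradient (momentum part)
        v_i = beta2 * v_i + (1 - beta2) * g * g    # EMA of squared gradient (RMSprop part)
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for zero-initialized EMAs
        v_hat = v_i / (1 - beta2 ** t)
        p = p * (1 - lr * weight_decay)            # decoupled weight decay (the "W" in AdamW)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

def minimize_quadratic(steps=300, lr=0.1):
    """Toy run: minimize L(theta) = sum(theta_i^2), whose gradient is 2 * theta."""
    theta = [1.0, -2.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * p for p in theta]
        theta, m, v = adamw_step(theta, grad, m, v, t, lr=lr, weight_decay=0.0)
    return theta
```

Setting `weight_decay=0.0` reduces the step to plain Adam. Because of bias correction, the very first step moves each parameter by roughly `lr` regardless of gradient scale, which is one reason Adam defaults transfer across problems better than a raw SGD learning rate.
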
Hyperparameters
| Hyperparameter | Guidance |
|---|---|
| Learning rate | Most important knob; start with defaults, tune on log scale |
| Batch size | Larger = lower-variance gradient estimates; smaller = more updates per epoch, sometimes better generalization |
| Momentum β₁ | Adam default 0.9; rarely changed |
| Second-moment decay β₂ | Adam default 0.999 |
| Weight decay | 0.01–0.1 typical for transformers (AdamW) |
| Warmup | Linear LR ramp over the first few percent of steps; reduces early instability |
| Schedule | Cosine or linear decay after warmup is common |
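
The warmup and schedule rows can be combined into a single function of the step index. The sketch below assumes a linear ramp followed by cosine decay; `lr_at_step` and its default values are hypothetical and chosen only for illustration.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=30, min_lr=0.0):
    """Learning rate for a 0-based step: linear warmup, then cosine decay.

    Hypothetical helper; defaults are illustrative, not recommendations.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (peak_lr - min_lr) * cosine
```

With `total_steps=1000` and `warmup_steps=30`, warmup covers 3% of training; the rate peaks at step 29, is half the peak midway through the decay phase, and reaches `min_lr` at the final step.
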
Common learning rates
| Model type | Typical LR |
|---|---|
| CNN (SGD + momentum) | 0.1 |
| CNN (Adam) | 1e-3 |
| Transformer pre-training | 1e-4 – 5e-4 |
| Transformer fine-tuning | 1e-5 – 5e-5 |
| LoRA fine-tuning | 1e-4 – 3e-4 |
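
The values above are starting points. Since learning rate is tuned on a log scale, a sweep around a typical value might use geometrically spaced candidates. The helper name `log_spaced_lrs` below is an assumption made for this sketch.

```python
def log_spaced_lrs(low, high, num):
    """Return num learning rates evenly spaced on a log scale from low to high."""
    if num < 2:
        return [low]
    ratio = (high / low) ** (1.0 / (num - 1))  # constant multiplicative gap
    return [low * ratio ** i for i in range(num)]

# Example: five candidates around the transformer fine-tuning range.
candidates = log_spaced_lrs(1e-5, 1e-3, 5)
```

Each candidate is about 3.16× the previous one (half a decade), so the sweep covers two orders of magnitude in five runs.
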
Diagnostics
- Loss exploding → LR too high; lower it, add warmup, or clip gradients.
- Loss stuck → LR too low, or the optimizer is sitting at a saddle point or plateau.
- Loss oscillating → LR too high at current point in training.
- Train/val gap growing → overfitting; add regularization or data augmentation.
- Val loss plateaus while train keeps dropping → decay LR or stop early.
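
For the exploding-loss case, gradient clipping by global norm is a common remedy. The sketch below works on a flat list of gradient values; `clip_by_global_norm` is a hypothetical name, and frameworks ship their own equivalents that operate on tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Logging the original norm
    over time is itself a useful diagnostic for instability.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm  # shrink all components uniformly
        grads = [g * scale for g in grads]
    return grads, total_norm
```

Clipping preserves the gradient's direction and only caps its length, so gradients already under the threshold pass through unchanged.
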