Gradient Descent Guide
Gradient descent variants — SGD, momentum, Adam, RMSprop — and practical tuning tips.
Reference
Variants
| Name | Update rule | Notes |
|---|---|---|
| SGD | θ ← θ − lr · ∇L | Baseline; requires lr tuning |
| SGD + momentum | v ← γv + ∇L; θ ← θ − lr · v | Accelerates in consistent directions |
| Nesterov | Lookahead momentum | Slightly better theoretical convergence |
| Adagrad | Scales lr by 1/√(Σg²) | Adapts per-parameter; lr decays too fast |
| RMSprop | EMA of g², divide gradient by √EMA | Fixes Adagrad's lr decay |
| Adam | Momentum + RMSprop, with bias correction | Default go-to optimizer |
| AdamW | Adam with decoupled weight decay | Use for transformers; better generalization |
| Lion | Sign-of-EMA update | Memory-light, competitive with AdamW |
| L-BFGS | Limited-memory quasi-Newton | Full-batch; small datasets |
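
For concreteness, here is a minimal NumPy sketch of two update rules from the table: SGD with momentum, and Adam with bias correction. The function names, hyperparameter values, and the toy quadratic at the end are illustrative assumptions, not part of the guide.

```python
import numpy as np

def sgd_momentum_step(theta, grad, v, lr=0.1, gamma=0.9):
    """One SGD + momentum step: v <- gamma*v + grad; theta <- theta - lr*v."""
    v = gamma * v + grad
    return theta - lr * v, v

def adam_step(theta, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: EMAs of the gradient (m) and squared gradient (s), bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad        # momentum term (beta1)
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSprop term (beta2)
    m_hat = m / (1 - beta1 ** t)              # bias correction; t counts from 1
    s_hat = s / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

# Toy example: minimize f(theta) = (theta - 3)^2 with Adam.
theta, m, s = np.array(0.0), 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (theta - 3)
    theta, m, s = adam_step(theta, grad, m, s, t, lr=0.05)
```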
Hyperparameters
- Learning rate
  - Most important knob; start with defaults and tune on a log scale
- Batch size
  - Larger batches give a more stable gradient estimate; smaller batches mean more steps and sometimes better generalization
- Momentum β₁
  - Adam default 0.9; rarely changed
- RMS β₂
  - Adam default 0.999
- Weight decay
  - 0.01–0.1 typical for transformers (AdamW)
- Warmup
  - Linear LR ramp over the first few % of steps prevents early instability
- Schedule
  - Cosine or linear decay common; see the sketch after this list
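
The warmup and schedule entries above combine naturally with AdamW. A minimal sketch assuming PyTorch; the model, step counts, and hyperparameter values are placeholders drawn from the typical ranges in this guide.

```python
import math
import torch

# Hypothetical model and step budget; the values below are assumptions.
model = torch.nn.Linear(128, 2)
total_steps, warmup_steps = 10_000, 500   # warmup over the first ~5% of steps

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                # tune on a log scale around the defaults
    betas=(0.9, 0.999),     # beta1 (momentum), beta2 (RMS)
    weight_decay=0.01,      # decoupled weight decay
)

def lr_lambda(step):
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call optimizer.step() then scheduler.step() once per step.
```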
Common learning rates
| Model type | Typical LR |
|---|---|
| CNN (SGD + momentum) | 0.1 |
| CNN (Adam) | 1e-3 |
| Transformer pre-training | 1e-4 – 5e-4 |
| Transformer fine-tuning | 1e-5 – 5e-5 |
| LoRA fine-tuning | 1e-4 – 3e-4 |
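
These values are starting points rather than rules, and when fine-tuning, different parameter groups often get different learning rates. A sketch assuming PyTorch parameter groups; the module names and exact values are hypothetical, chosen from the table's typical ranges.

```python
import torch

# Hypothetical two-part model: a pretrained backbone plus a new task head.
model = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(768, 768),
    "head": torch.nn.Linear(768, 2),
})

# Per-group learning rates: lower for pretrained weights, higher for new parameters.
optimizer = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": 2e-5},  # fine-tuning range
        {"params": model["head"].parameters(), "lr": 1e-4},      # new params, higher LR
    ],
    weight_decay=0.01,
)
```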
Diagnostics
- Loss exploding → LR too high, or warmup / gradient clipping needed (see the sketch after this list).
- Loss stuck → LR too low, or stuck in saddle / plateau.
- Loss oscillating → LR too high at current point in training.
- Train/val gap growing → overfitting; add regularization or data augmentation.
- Val loss plateaus while train keeps dropping → decay LR or stop early.
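
A short sketch of how the first and last diagnostics translate into code, assuming PyTorch: gradient-norm clipping inside the training step, and LR decay when the validation loss plateaus. The model, loss, and thresholds are illustrative assumptions.

```python
import torch

# Hypothetical model and optimizer; clipping threshold and plateau settings are assumptions.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
# Halve the LR after 3 epochs without validation improvement.
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)

def train_step(x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip the global gradient norm to guard against exploding loss.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

def end_of_epoch(val_loss):
    # Decay the LR when validation loss plateaus while train loss keeps dropping.
    plateau.step(val_loss)
```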