Gradient Descent Guide

Gradient descent variants — SGD, momentum, Adam, RMSprop — and practical tuning tips.

Reference · Updated Apr 19, 2026

Variants

| Name | Update rule | Notes |
| --- | --- | --- |
| SGD | θ ← θ − lr · ∇L | Baseline; requires lr tuning |
| SGD + momentum | v ← γv + ∇L; θ ← θ − lr · v | Accelerates in consistent directions |
| Nesterov | Momentum with the gradient taken at the look-ahead point | Slightly better theoretical convergence |
| Adagrad | Scales lr by 1/√(Σg²) | Adapts per parameter; lr decays too fast |
| RMSprop | EMA of g²; divide gradient by √EMA | Fixes Adagrad's lr decay |
| Adam | Momentum + RMSprop | Default go-to optimizer |
| AdamW | Adam with decoupled weight decay | Use for transformers; better generalization |
| Lion | Sign-of-EMA update | Memory-light; competitive with AdamW |
| L-BFGS | Limited-memory quasi-Newton | Full-batch; small datasets |
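
Adam's row above is just momentum plus RMSprop-style scaling, with bias correction. As a concrete reference, here is a minimal NumPy sketch of one Adam step; the function name `adam_step` and the toy quadratic objective are illustrative, not taken from any library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum EMA (m) + RMSprop-style scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize ||theta - 1||^2.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    grad = 2 * (theta - 1.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)  # approaches [1, 1, 1]
```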

Hyperparameters

| Hyperparameter | Guidance |
| --- | --- |
| Learning rate | Most important knob; start with defaults and tune on a log scale |
| Batch size | Larger batches give more stable gradient estimates; smaller batches mean more steps but sometimes better generalization |
| Momentum β₁ | Adam default 0.9; rarely changed |
| RMS β₂ | Adam default 0.999; EMA coefficient for the squared gradients |
| Weight decay | 0.01–0.1 typical for transformers (AdamW) |
| Warmup | Linear LR ramp over the first few percent of steps prevents early instability |
| Schedule | Cosine or linear decay are common (see the sketch below) |
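
A minimal sketch combining the warmup and schedule rows: linear warmup to a peak LR, then cosine decay to zero. The name `lr_at` and the defaults (peak 3e-4, 3% warmup) are illustrative assumptions, not a library API.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_frac=0.03):
    """LR at a given step: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

# e.g. lr_at(0, 10_000) is tiny; lr_at(300, 10_000) ≈ peak; lr_at(9_999, 10_000) ≈ 0
```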

Common learning rates

| Model type | Typical LR |
| --- | --- |
| CNN (SGD + momentum) | 0.1 |
| CNN (Adam) | 1e-3 |
| Transformer pre-training | 1e-4 – 5e-4 |
| Transformer fine-tuning | 1e-5 – 5e-5 |
| LoRA fine-tuning | 1e-4 – 3e-4 |
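
For example, applying the transformer fine-tuning row with PyTorch's AdamW might look like the sketch below. The tiny linear module is a stand-in for a real model, and lr=2e-5 / weight_decay=0.01 are just picks from the ranges above.

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for a real transformer + head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# Cosine decay over 1,000 steps; call scheduler.step() once per optimizer step.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```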

Diagnostics

  • Loss exploding → LR too high, or add warmup / gradient clipping (see the sketch after this list).
  • Loss stuck → LR too low, or stuck in saddle / plateau.
  • Loss oscillating → LR too high for the current point in training.
  • Train/val gap growing → overfitting; add regularization or data augmentation.
  • Val loss plateaus while train keeps dropping → decay LR or stop early.
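
A self-contained PyTorch training step showing the gradient clipping from the first diagnostic; the linear model and random batch are stand-ins.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()
# Cap the global gradient norm at 1.0 before the update to guard against spikes.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```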
