Gradient Descent Guide

Variants

Name	Update rule	Notes
SGD	θ ← θ − lr · ∇L	Baseline; requires lr tuning
SGD + momentum	v ← γv + ∇L; θ ← θ − lr · v	Accelerates in consistent directions
Nesterov	v ← γv + ∇L(θ − lr · γv); θ ← θ − lr · v	Momentum with the gradient taken at a lookahead point; better theoretical convergence rate on smooth convex problems
Adagrad	Scales lr by 1/√(Σg²)	Adapts per-parameter; effective lr only shrinks, often too fast
RMSprop	EMA of g², divide gradient by √EMA	Fixes Adagrad's lr decay
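Adam	Momentum (EMA of g) + RMSprop scaling, with bias correction	Common default choice
AdamW	Adam with decoupled weight decay	Common for transformers; often generalizes better than Adam with L2 regularization
Lion	Sign of an interpolation between the gradient EMA and the current gradient	Memory-light (one state per parameter, Adam keeps two); reported competitive with AdamW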
L-BFGS	Limited-memory quasi-Newton	Full-batch; small datasets
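
The update rules in the table can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it covers SGD, momentum, Adagrad, RMSprop, and Adam/AdamW only, operates on lists of floats, and the names `eps`, `rho`, `toy_loss`, and `toy_grad` are assumptions introduced here for the example.

```python
import math

def sgd_step(theta, g, lr):
    # theta <- theta - lr * grad
    return [t - lr * gi for t, gi in zip(theta, g)]

def momentum_step(theta, g, v, lr, gamma=0.9):
    # v <- gamma * v + grad; theta <- theta - lr * v
    v = [gamma * vi + gi for vi, gi in zip(v, g)]
    return [t - lr * vi for t, vi in zip(theta, v)], v

def adagrad_step(theta, g, s, lr, eps=1e-8):
    # s is a running SUM of squared gradients, so the effective lr only shrinks
    s = [si + gi * gi for si, gi in zip(s, g)]
    return [t - lr * gi / (math.sqrt(si) + eps)
            for t, gi, si in zip(theta, g, s)], s

def rmsprop_step(theta, g, s, lr, rho=0.9, eps=1e-8):
    # s is an EMA of squared gradients, so old gradients are gradually forgotten
    s = [rho * si + (1 - rho) * gi * gi for si, gi in zip(s, g)]
    return [t - lr * gi / (math.sqrt(si) + eps)
            for t, gi, si in zip(theta, g, s)], s

def adamw_step(theta, g, m, v, t, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    # t is the 1-based step count; weight_decay=0.0 reduces this to plain Adam
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Decoupled decay: weight_decay * theta is added outside the adaptive scaling
    theta = [th - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * th)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Hypothetical toy problem: an ill-conditioned quadratic bowl.
def toy_loss(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def toy_grad(theta):
    return [theta[0], 10.0 * theta[1]]

theta_sgd = [1.0, 1.0]
for _ in range(100):
    theta_sgd = sgd_step(theta_sgd, toy_grad(theta_sgd), lr=0.05)

theta_adam, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 201):
    theta_adam, m, v = adamw_step(theta_adam, toy_grad(theta_adam), m, v, t, lr=0.05)

print("SGD loss:", toy_loss(theta_sgd), "Adam loss:", toy_loss(theta_adam))
```

Adagrad and RMSprop differ only in how `s` is accumulated (sum versus EMA), which is the lr-decay difference noted in the table. In practice a framework's built-in optimizers would be used rather than hand-written steps like these.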

Hyperparameters

Hyperparameter	Guidance
Learning rate	Most important knob; start with defaults, tune on log scale
Batch size	Larger = lower-variance gradient estimate; smaller = more updates per epoch, sometimes better generalization
Momentum β₁	Adam default 0.9 — rarely changed
RMS β₂	Adam default 0.999
Weight decay	0.01–0.1 typical for transformers (AdamW)
Warmup	Linear LR ramp over the first few % of steps; helps prevent early instability
Schedule	Cosine or linear decay common
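
The warmup, schedule, and log-scale tuning rows can be illustrated with a short sketch. The function names `lr_at_step` and `log_scale_grid`, the 5% warmup fraction, and the search range are assumptions chosen for this example, not recommended settings.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # progress runs from 0 at the end of warmup to 1 at total_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def log_scale_grid(low, high, num):
    """Return num candidate learning rates evenly spaced on a log scale."""
    if num == 1:
        return [low]
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]

# Candidate learning rates, one per decade between 1e-5 and 1e-1
candidates = log_scale_grid(1e-5, 1e-1, 5)

# Sample the schedule at a few points of a hypothetical 1000-step run
for s in (0, 25, 49, 50, 500, 999):
    print(s, lr_at_step(s, total_steps=1000, base_lr=1e-3))
```

Spacing candidates by a constant ratio rather than a constant difference is what "tune on log scale" means here: each trial changes the learning rate by the same factor.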
