Gradient Descent Guide

Gradient descent variants — SGD, momentum, Adam, RMSprop — and practical tuning tips.

Reference · Updated Apr 19, 2026

Variants

| Name | Update rule | Notes |
| --- | --- | --- |
| SGD | θ ← θ − lr · ∇L | Baseline; requires lr tuning |
| SGD + momentum | v ← γv + ∇L; θ ← θ − lr · v | Accelerates in consistent directions |
| Nesterov | Momentum with the gradient taken at the look-ahead point | Slightly better theoretical convergence |
| Adagrad | Scales lr by 1/√(Σg²) | Adapts per parameter; lr decays too fast |
| RMSprop | EMA of g²; divide gradient by √EMA | Fixes Adagrad's lr decay |
| Adam | Momentum + RMSprop | Default go-to optimizer |
| AdamW | Adam with decoupled weight decay | Use for transformers; better generalization |
| Lion | Sign-of-EMA update | Memory-light; competitive with AdamW |
| L-BFGS | Limited-memory quasi-Newton | Full-batch; small datasets |
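
Adam's row above is just momentum plus RMSprop-style scaling, with bias correction. As a concrete reference, here is a minimal NumPy sketch of one Adam step; the function name `adam_step` and the toy quadratic objective are illustrative, not taken from any library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum EMA (m) + RMSprop-style scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize ||theta - 1||^2.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    grad = 2 * (theta - 1.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)  # approaches [1, 1, 1]
```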

Hyperparameters

| Hyperparameter | Guidance |
| --- | --- |
| Learning rate | Most important knob; start with defaults and tune on a log scale |
| Batch size | Larger batches give more stable gradient estimates; smaller batches mean more steps but sometimes better generalization |
| Momentum β₁ | Adam default 0.9; rarely changed |
| RMS β₂ | Adam default 0.999; EMA coefficient for the squared gradients |
| Weight decay | 0.01–0.1 typical for transformers (AdamW) |
| Warmup | Linear LR ramp over the first few percent of steps prevents early instability |
| Schedule | Cosine or linear decay are common (see the sketch below) |
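
A minimal sketch combining the warmup and schedule rows: linear warmup to a peak LR, then cosine decay to zero. The name `lr_at` and the defaults (peak 3e-4, 3% warmup) are illustrative assumptions, not a library API.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_frac=0.03):
    """LR at a given step: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

# e.g. lr_at(0, 10_000) is tiny; lr_at(300, 10_000) ≈ peak; lr_at(9_999, 10_000) ≈ 0
```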

Common learning rates

| Model type | Typical LR |
| --- | --- |
| CNN (SGD + momentum) | 0.1 |
| CNN (Adam) | 1e-3 |
| Transformer pre-training | 1e-4 – 5e-4 |
| Transformer fine-tuning | 1e-5 – 5e-5 |
| LoRA fine-tuning | 1e-4 – 3e-4 |
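
For example, applying the transformer fine-tuning row with PyTorch's AdamW might look like the sketch below. The tiny linear module is a stand-in for a real model, and lr=2e-5 / weight_decay=0.01 are just picks from the ranges above.

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for a real transformer + head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# Cosine decay over 1,000 steps; call scheduler.step() once per optimizer step.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```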

Diagnostics

  • Loss exploding → LR too high, or add warmup / gradient clipping (see the sketch after this list).
  • Loss stuck → LR too low, or stuck in saddle / plateau.
  • Loss oscillating → LR too high for the current point in training.
  • Train/val gap growing → overfitting; add regularization or data augmentation.
  • Val loss plateaus while train keeps dropping → decay LR or stop early.
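
A self-contained PyTorch training step showing the gradient clipping from the first diagnostic; the linear model and random batch are stand-ins.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()
# Cap the global gradient norm at 1.0 before the update to guard against spikes.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```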
