Gradient Descent Guide

Gradient descent variants — SGD, momentum, Adam, RMSprop — and practical tuning tips.

Variants

Name | Update rule | Notes
SGD | θ ← θ − lr · ∇L | Baseline; requires lr tuning
SGD + momentum | v ← γv + ∇L; θ ← θ − lr · v | Accelerates in consistent directions
Nesterov | Lookahead momentum | Slightly better theoretical convergence
Adagrad | Scales lr by 1/√(Σg²) | Adapts per-parameter; lr decays too fast
RMSprop | EMA of g², divide gradient by √EMA | Fixes Adagrad's lr decay
Adam | Momentum + RMSprop | Default go-to optimizer
AdamW | Adam with decoupled weight decay | Use for transformers; better generalization
Lion | Sign-of-EMA update | Memory-light, competitive with AdamW
L-BFGS | Limited-memory quasi-Newton | Full-batch; small datasets
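
The SGD, momentum, and Adam rows above can be sketched as single-parameter update functions. This is a minimal illustration rather than a reference implementation: the function names are made up for this example, and the Adam step uses the standard bias-corrected form with an `eps` term, neither of which the table spells out.

```python
import math


def sgd_step(theta, grad, lr):
    """Plain SGD: theta <- theta - lr * grad."""
    return theta - lr * grad


def momentum_step(theta, v, grad, lr, gamma=0.9):
    """SGD + momentum: v <- gamma * v + grad; theta <- theta - lr * v."""
    v = gamma * v + grad
    return theta - lr * v, v


def adam_step(theta, m, s, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based step count used for bias correction."""
    # Momentum part: EMA of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    # RMSprop part: EMA of the squared gradient.
    s = beta2 * s + (1 - beta2) * grad * grad
    # Bias correction compensates for m and s starting at zero.
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s
```

To try these on a toy loss such as L(θ) = θ², pass `grad = 2 * theta` at each step and carry the returned state (`v`, or `m` and `s`) into the next call.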

Hyperparameters

Learning rate: Most important knob; start with defaults, tune on a log scale
Batch size: Larger = more stable gradient; smaller = more steps, sometimes better generalization
Momentum β₁: Adam default 0.9; rarely changed
RMS β₂: Adam default 0.999
Weight decay: 0.01–0.1 typical for transformers (AdamW)
Warmup: Linear LR ramp over the first few % of steps; prevents early instability
Schedule: Cosine or linear decay common
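
Warmup and schedule are often combined into a single function of the step index. The sketch below assumes a linear ramp to a peak LR followed by cosine decay; the name `lr_at_step` and its parameters are illustrative and not taken from any particular library.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Learning rate for a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Fraction of the decay phase completed, clamped to 1.0 past the end.
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, a 1000-step run with `warmup_steps=30` spends the first 3% of training ramping up, which matches the "first few % of steps" guidance above.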

Common learning rates

Model type | Typical LR
CNN (SGD + momentum) | 0.1
CNN (Adam) | 1e-3
Transformer pre-training | 1e-4 – 5e-4
Transformer fine-tuning | 1e-5 – 5e-5
LoRA fine-tuning | 1e-4 – 3e-4
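
When tuning around one of these typical values, candidates are usually spaced evenly in log space rather than linearly. A small sketch of generating such a sweep follows; `log_spaced_lrs` is a hypothetical helper written for this example.

```python
import math


def log_spaced_lrs(low, high, num):
    """Return num learning rates spaced evenly on a log10 scale in [low, high]."""
    if low <= 0 or high <= low:
        raise ValueError("need 0 < low < high")
    if num < 2:
        raise ValueError("need at least two points")
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (num - 1)
    return [10 ** (log_low + i * step) for i in range(num)]
```

Calling `log_spaced_lrs(1e-5, 1e-1, 5)` yields one candidate per decade between the two bounds; each candidate would then be trained briefly and compared on validation loss.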

Diagnostics

  • Loss exploding → LR too high, or warmup / gradient clipping is needed.
  • Loss stuck → LR too low, or the optimizer is stuck at a saddle point or on a plateau.
  • Loss oscillating → LR too high at current point in training.
  • Train/val gap growing → overfitting; add regularization or data augmentation.
  • Val loss plateaus while train keeps dropping → decay LR or stop early.
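
Gradient clipping, mentioned above as a remedy for exploding loss, is commonly done by rescaling all gradients when their global L2 norm exceeds a threshold. The following is a minimal sketch over a flat list of floats; the function name and the flat-list representation are assumptions made for illustration, since real frameworks operate on tensors.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale grads so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm
```

Logging the returned pre-clipping norm at each step can also help with diagnosis: a norm that spikes shortly before the loss explodes suggests the LR is too high at that point in training.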