ML//neural network//gradient descent

2026-03-07

The optimization algorithm that makes neural networks learn: compute how wrong the model is (loss function), compute which direction to adjust each weight (backpropagation), then nudge weights in that direction.

The "gradient" is a vector of partial derivatives of the loss, one per parameter. It points in the direction of steepest increase in loss. You move in the opposite direction (downhill) by a small step whose size is scaled by the learning rate.
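
A minimal sketch of that update rule, assuming a made-up two-parameter loss whose gradient can be written by hand (in a real network, backpropagation would produce it). The names loss, gradient, and gradient_descent are illustrative only.

```python
# Toy loss with two parameters; its minimum is at w = [3, -1].
def loss(w):
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def gradient(w):
    # Vector of partial derivatives, one entry per parameter.
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(w, learning_rate=0.1, steps=200):
    for _ in range(steps):
        g = gradient(w)
        # Step against the gradient, scaled by the learning rate.
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


w_final = gradient_descent([0.0, 0.0])
print(w_final, loss(w_final))
```

With this loss, a learning rate of 0.1 shrinks the distance to the minimum by a constant factor each step. A much larger rate would overshoot and diverge.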

Pure gradient descent uses the entire dataset per update (impractical for large datasets). SGD (stochastic gradient descent), in its strict sense, uses one sample per update. Mini-batch SGD uses a batch (typically 32-4096 samples), trading gradient noise against hardware efficiency.
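
The three variants differ only in how many samples feed each update. The sketch below assumes a hypothetical one-feature linear model fitted to noise-free synthetic data from y = 2x + 1; batch_size=1 gives strict SGD and batch_size=len(data) gives full-batch gradient descent.

```python
import random


def make_data(n=200, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]  # targets from y = 2x + 1


def train(data, batch_size, learning_rate=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean squared error over this batch only.
            errors = [(w * x + b) - y for x, y in batch]
            grad_w = sum(2.0 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2.0 * e for e in errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


data = make_data()
w_mini, b_mini = train(data, batch_size=8)          # mini-batch SGD
w_full, b_full = train(data, batch_size=len(data))  # full-batch gradient descent
print(w_mini, b_mini, w_full, b_full)
```

For the same number of passes over the data, the mini-batch run makes 25 updates per epoch and the full-batch run makes one, which is why the mini-batch estimate ends up closer to w = 2, b = 1 in this toy setup.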

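Why it works despite non-convex landscapes: modern networks have so many parameters that, empirically, most local minima are roughly equivalent in quality. The real danger is saddle points, not bad minima. At a saddle the loss curves up in some directions and down in others, so the gradient nearby is close to zero and progress stalls.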
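
To see why a saddle slows things down, here is a toy sketch on f(x, y) = x^2 - y^2, an assumed stand-in for a loss surface (it is unbounded below, so it only illustrates the stall and the eventual escape, not a real training run). Starting almost exactly on the uphill axis, the gradient norm collapses before the tiny y component grows enough to carry the iterate away.

```python
import math


def gradient(x, y):
    # f(x, y) = x**2 - y**2 has a saddle at the origin:
    # it curves up along x and down along y.
    return 2.0 * x, -2.0 * y


def run(x, y, learning_rate=0.1, steps=80):
    history = []
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
        # Record the new position and the gradient norm used for this step.
        history.append((x, y, math.hypot(gx, gy)))
    return history


history = run(x=1.0, y=1e-6)
smallest_gradient = min(norm for _, _, norm in history)
print(smallest_gradient, history[-1])
```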

Adam is the dominant optimizer because it combines momentum (a running average of gradients, the first moment) with a per-parameter step size scaled by the second moment (a running average of squared gradients). This lets it handle sparse gradients and parameters with very different gradient scales with little manual tuning.
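
A sketch of the Adam update in plain Python, following the standard formulation with bias correction. The toy gradient function and the learning rate of 0.1 are assumptions chosen for illustration (0.001 is the commonly cited default); the two parameters deliberately have gradients on very different scales.

```python
import math


def gradient(w):
    # Toy loss (w0 - 3)^2 + 100 * (w1 + 1)^2: the two parameters have
    # gradients on very different scales.
    return [2.0 * (w[0] - 3.0), 200.0 * (w[1] + 1.0)]


def adam(w, steps, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = [0.0] * len(w)  # running average of gradients (first moment)
    v = [0.0] * len(w)  # running average of squared gradients (second moment)
    w = list(w)
    for t in range(1, steps + 1):
        g = gradient(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction: both averages start at zero.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            w[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_one_step = adam([0.0, 0.0], steps=1)
print(w_one_step)
```

On the first step the raw gradients are -6 and 200, yet both parameters move by almost exactly the learning rate, each toward its own minimum. Dividing by the square root of the second moment is what cancels the difference in scale.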

Vanishing gradients occur when gradients shrink exponentially as they are backpropagated through many layers, so the earliest layers barely learn. Residual connections mitigate this by providing a gradient highway that bypasses layers.
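
The effect can be seen in a deliberately simplified model: a chain of scalar sigmoid layers with one shared weight (an assumption for illustration, not a realistic architecture). By the chain rule, the gradient of the output with respect to the input is a product of per-layer derivatives. A sigmoid's derivative is at most 0.25, so the plain product collapses, while a residual layer h + sigmoid(w * h) contributes a factor of 1 plus that derivative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, residual, w=1.0, h=0.5):
    """d(output)/d(input) for a chain of scalar sigmoid layers."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(w * h)
        local = w * s * (1.0 - s)  # derivative of sigmoid(w * h) w.r.t. h
        if residual:
            grad *= 1.0 + local  # the identity path contributes the 1
            h = h + s
        else:
            grad *= local
            h = s
    return grad


plain = input_gradient(depth=30, residual=False)
skip = input_gradient(depth=30, residual=True)
print(plain, skip)
```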

At massive scale, gradient descent requires distributed training: split the batch across GPUs, compute gradients in parallel, then synchronize by averaging them before each update. The math is the same; the engineering is the hard part.
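
"The math is the same" can be checked directly. The sketch below simulates data parallelism in a single process, assuming a hypothetical linear model and a batch that divides evenly among workers: averaging the per-shard gradients reproduces the gradient of the whole batch. On real hardware the averaging step would be a collective operation such as an all-reduce; here it is an ordinary sum.

```python
def batch_gradient(w, b, batch):
    # Gradient of mean squared error for the model y = w * x + b.
    n = len(batch)
    grad_w = sum(2.0 * ((w * x + b) - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * ((w * x + b) - y) for x, y in batch) / n
    return grad_w, grad_b


def data_parallel_gradient(w, b, batch, num_workers):
    # Split the batch into equal shards, one per simulated worker.
    # Assumes len(batch) is divisible by num_workers.
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # Each worker computes a gradient on its own shard (in parallel on real hardware).
    local = [batch_gradient(w, b, shard) for shard in shards]
    # Synchronize: average the per-worker gradients.
    grad_w = sum(g[0] for g in local) / num_workers
    grad_b = sum(g[1] for g in local) / num_workers
    return grad_w, grad_b


batch = [(float(x), 3.0 * x - 2.0) for x in range(16)]
single = batch_gradient(0.5, 0.0, batch)
parallel = data_parallel_gradient(0.5, 0.0, batch, num_workers=4)
print(single, parallel)
```

The equality relies on equal shard sizes. With uneven shards, the average would need to be weighted by the number of samples on each worker.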