ML//neural network//optimizer//learning rate

2026-03-05

The step size for gradient descent: how much to adjust weights per update. Too high and training diverges (loss explodes), too low and it takes forever or gets stuck in shallow minima.

The step size for gradient descent: how much to adjust weights per update. Too high and training diverges (loss explodes), too low and it takes forever or gets stuck in shallow minima.

The single most important hyperparameter in all of deep learning. Everything else (architecture, data, optimizer) matters, but a wrong learning rate can make the best setup fail completely.

Schedules

Warmup: start with a tiny learning rate and increase linearly for the first 1-5% of training. Without warmup, early gradients are noisy (random weights) and large steps destabilize training. Adam's adaptive moments partially mitigate this, but warmup still helps.

Cosine decay: after warmup, decrease the learning rate following a cosine curve down to near zero. This lets the model make large exploratory steps early, then fine-grained adjustments late. The standard schedule for pre-training large models.

Scaling laws interact directly: as you scale batch size, you typically scale the learning rate proportionally (linear scaling rule). Distributed training across more GPUs means larger effective batch size, which means adjusting the schedule.

For fine-tuning, the learning rate is typically 10-100x smaller than pre-training. The model's weights are already near a good solution; large steps would destroy what it learned. This is why LoRA works: it adds small trainable deltas instead of updating all weights with a small learning rate.