ML//neural network//vanishing gradient

- Gradients shrink exponentially through deep layers — chain rule multiplies many small numbers.


Gradients shrink exponentially through deep layers — chain rule multiplies many small numbers.

Why deep nets couldn't train before ResNet skip connections.

Why RNNs couldn't handle long sequences before LSTM gates — the hidden state carries "memory" from previous steps, but information from the beginning dilutes exponentially.

The fundamental reason we don't save enriched embeddings as persistent state: that's exactly what RNNs did, and the hidden state corrupted over time.