ML//neural network//vanishing gradient

2018-03-08

- Gradients shrink exponentially through deep layers: chain rule multiplies many small numbers.

Gradients shrink exponentially through deep layers: chain rule multiplies many small numbers.

Why deep nets couldn't train before ResNet skip connections.

Why RNNs couldn't handle long sequences before LSTM gates: the hidden state carries "memory" from previous steps, but information from the beginning dilutes exponentially.

The fundamental reason we don't save enriched embeddings as persistent state: that's exactly what RNNs did, and the hidden state corrupted over time.