ML//neural network//vanishing gradient
- Gradients shrink exponentially through deep layers — chain rule multiplies many small numbers.
Gradients shrink exponentially through deep layers — chain rule multiplies many small numbers.
Why deep nets couldn't train before ResNet skip connections.
Why RNNs couldn't handle long sequences before LSTM gates — the hidden state carries "memory" from previous steps, but information from the beginning dilutes exponentially.
The fundamental reason we don't save enriched embeddings as persistent state: that's exactly what RNNs did, and the hidden state corrupted over time.