ML//neural network//activation function//ReLU
- max(0, x) — dead simple. The threshold gate in MLP layers
max(0, x) — dead simple. The threshold gate in MLP layers
Solved the vanishing gradient problem for deep nets: gradient is either 0 or 1, never shrinks.
In transformers: acts as a feature compatibility threshold — if the input aligns with a feature direction above zero, that feature fires. Below zero, it's suppressed.
"Dead neurons" downside: once a unit goes negative, it never recovers. GELU is the modern fix (smooth approximation)