ML//neural network//activation function//ReLU

2017-12-03

- max(0, x), dead simple. The threshold gate in MLP layers

max(0, x), dead simple. The threshold gate in MLP layers

Solved the vanishing gradient problem for deep nets: gradient is either 0 or 1, never shrinks.

In transformers: acts as a feature compatibility threshold. If the input aligns with a feature direction above zero, that feature fires. Below zero, it's suppressed.

"Dead neurons" downside: once a unit goes negative, it never recovers. GELU is the modern fix (smooth approximation)