ML//neural network//activation function//ReLU

- max(0, x) — dead simple. The threshold gate in MLP layers


max(0, x) — dead simple. The threshold gate in MLP layers

Solved the vanishing gradient problem for deep nets: gradient is either 0 or 1, never shrinks.

In transformers: acts as a feature compatibility threshold — if the input aligns with a feature direction above zero, that feature fires. Below zero, it's suppressed.

"Dead neurons" downside: once a unit goes negative, it never recovers. GELU is the modern fix (smooth approximation)