ML//Transformer//layer normalization
Normalizes each token's activations across the feature dimension (not across the batch, as batch normalization does).
Stabilizes training: keeps the scale of activations consistent from layer to layer, which helps stop values from exploding or vanishing.
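A minimal sketch of the computation in plain Python, assuming one feature vector per token. The names `layer_norm`, `gamma`, `beta`, and `eps` are illustrative choices, not taken from this note. It shows that statistics are computed per token over features, so two tokens with very different scales come out nearly identical.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's feature vector to roughly zero mean, unit variance."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)       # eps guards against division by zero
    # Learned scale and shift; default to identity for this sketch.
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Each row is one token. Statistics are computed per row, never across rows.
batch = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]
normed = [layer_norm(row) for row in batch]
```

Because nothing here depends on the other rows, the result for a token is the same at batch size 1 as at batch size 1000, unlike batch normalization.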
Pre-norm (normalize before attention/FFN, inside the residual branch) vs post-norm (normalize after the residual addition): pre-norm is generally more stable to train for deep models.
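The two placements can be sketched as follows, assuming a single token vector and omitting the learned scale and shift. `sublayer` is a hypothetical stand-in for attention or the FFN, and all function names are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned scale/shift omitted).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return [2.0 * v + 1.0 for v in x]

def pre_norm_block(x, f):
    # x + f(LN(x)): the residual path carries x through untouched.
    return [a + b for a, b in zip(x, f(layer_norm(x)))]

def post_norm_block(x, f):
    # LN(x + f(x)): the residual sum itself is normalized.
    return layer_norm([a + b for a, b in zip(x, f(x))])

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_block(x, sublayer)
post = post_norm_block(x, sublayer)
```

In the pre-norm form the input reaches the output through a plain addition, so gradients have a direct identity path through every block. This is the usual explanation for its better stability at depth. In the post-norm form every block's output is renormalized, including the residual path.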