ML//Transformer//feed-forward network
The "other half" of each transformer layer, next to attention: linear → activation → linear. FFN(x) = W₂ · ReLU(W₁ · x + b₁) + b₂ (modern models often use GELU in place of ReLU).
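It's basically a bank of perceptrons that detect and combine features: each hidden unit checks whether the current embedding aligns with its input direction strongly enough to clear the threshold (bias + ReLU), and if it does, the unit's output direction gets added to the embedding.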
Example: if a vector already points toward "Michael" and "Jordan", the MLP detects that combination and adds the "basketball" direction. In the classic formulation both linear layers have bias terms (b₁, b₂), though some recent models omit them.
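The "Michael Jordan" example can be sketched in a few lines of NumPy. This is a toy, hand-built illustration, not how a trained model actually lays out its features: the 4-dimensional space with one axis per concept, the single hidden unit, the threshold of -1.5, and the residual addition `x + ffn(x)` are all assumptions chosen to make the mechanism visible.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffn(x, W1, b1, W2, b2):
    """Feed-forward block: W2 @ ReLU(W1 @ x + b1) + b2."""
    return W2 @ relu(W1 @ x + b1) + b2

# Hypothetical embedding space: one orthogonal axis per concept.
MICHAEL, JORDAN, BASKETBALL, OTHER = np.eye(4)

# One hidden unit that reads "Michael" + "Jordan" (assumed weights).
W1 = np.array([[1.0, 1.0, 0.0, 0.0]])  # shape (d_ff=1, d_model=4)
b1 = np.array([-1.5])                  # threshold: one name alone is not enough
W2 = 2.0 * BASKETBALL.reshape(4, 1)    # shape (4, 1): writes "basketball"
b2 = np.zeros(4)

x_both = MICHAEL + JORDAN  # embedding pointing toward both names
x_one = MICHAEL            # embedding pointing toward only one name

# Assumed residual connection: the FFN output is added to its input.
out_both = x_both + ffn(x_both, W1, b1, W2, b2)
out_one = x_one + ffn(x_one, W1, b1, W2, b2)

print(out_both)  # basketball axis is now 1.0
print(out_one)   # unchanged: the unit stayed below threshold
```

With both names present the pre-activation is 2 - 1.5 = 0.5, so the unit fires and writes 0.5 × 2 = 1 along the basketball axis. With only one name it is -0.5, ReLU clips it to zero, and nothing is added.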
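Most of each layer's weights live here: roughly 2/3 in the FFN versus 1/3 in attention, assuming the usual hidden width of 4× the embedding dimension and not counting the embedding tables. Rough division of labor: the MLP stores facts, attention routes context.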
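The 2/3 versus 1/3 split falls out of simple counting. The sketch below assumes standard multi-head attention with four d_model × d_model projections (Q, K, V, output), an FFN hidden width of 4 × d_model, and ignores biases, layer norms, and embedding tables. The function names and the value d_model = 768 are illustrative only.

```python
def attention_params(d_model):
    # Q, K, V and output projections, each d_model x d_model (biases ignored).
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    # W1 is d_ff x d_model and W2 is d_model x d_ff (biases ignored).
    return 2 * d_model * d_ff

d_model = 768          # illustrative embedding dimension
d_ff = 4 * d_model     # assumed conventional FFN hidden width

attn = attention_params(d_model)
ffn = ffn_params(d_model, d_ff)
ffn_share = ffn / (attn + ffn)

print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {ffn_share:.3f}")
# attention: 2,359,296  ffn: 4,718,592  ffn share: 0.667
```

In general the per-layer counts are 4·d² for attention and 8·d² for the FFN under these assumptions, so the ratio does not depend on d_model. A different hidden-width multiplier shifts the split.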
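One proposed interpretation (Geva et al., 2021): the FFN acts as a key-value memory. The rows of W₁ are keys matched against the input, and the columns of W₂ are the values written out, each weighted by how strongly its key fired.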
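The key-value reading is just a different way of writing the same matrix product. The sketch below uses random weights (arbitrary sizes, seed chosen for reproducibility) and checks that the dense formula equals a sum of value vectors, each scaled by its key's ReLU activation. Biases are kept to match the formula in this note; that is an assumption of the sketch, not part of the key-value interpretation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # arbitrary toy sizes

W1 = rng.normal(size=(d_ff, d_model))  # row i is key i
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_model, d_ff))  # column i is value i
b2 = rng.normal(size=d_model)
x = rng.normal(size=d_model)

# Dense form: W2 @ ReLU(W1 @ x + b1) + b2
dense = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Memory form: match x against each key, then sum the weighted values.
memory = b2.copy()
for i in range(d_ff):
    key, value = W1[i], W2[:, i]
    coeff = max(key @ x + b1[i], 0.0)  # how strongly key i fires
    memory += coeff * value

print(np.allclose(dense, memory))
```

Keys whose coefficient is zero contribute nothing, so for any given input only a subset of the stored "memories" is read out.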
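No matter how big the model, the embedding dimension (the width of each token's vector) bottlenecks comprehension: the MLP reads from and writes back to vectors of that width, and its own hidden width is conventionally set as a multiple of it (typically 4×).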
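"MLP" and "FFN" name the same sublayer: the multilayer perceptron is this block, separate from the attention sublayer, and it is applied to each token position independently.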