ML//Transformer//feed-forward network

The "other half" of each transformer layer, alongside attention: linear → activation → linear. FFN(x) = W₂ · ReLU(W₁ · x + b₁) + b₂, with GELU often replacing ReLU in modern models.
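A minimal NumPy sketch of that formula, to make the shapes concrete. The sizes `d_model = 8` and `d_ff = 32` and the random weights are illustrative assumptions, not values from any particular model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffn(x, W1, b1, W2, b2):
    """Feed-forward block: W2 · ReLU(W1 · x + b1) + b2."""
    hidden = relu(W1 @ x + b1)   # expand from d_model to d_ff, then gate
    return W2 @ hidden + b2      # project back down to d_model

# Illustrative sizes only (assumed, not from a real model).
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_ff, d_model))
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_model, d_ff))
b2 = np.zeros(d_model)

x = rng.normal(size=d_model)   # one token's embedding
y = ffn(x, W1, b1, W2, b2)     # same shape as x
```

The output has the same size as the input, which is what lets the layer add its result back onto the embedding it read from.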


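It's essentially a two-layer perceptron that detects features and writes new ones: each hidden unit compares the current embedding with a feature direction (a row of W₁), and if the match clears the threshold set by the bias (ReLU), that unit's output direction (a column of W₂) gets added.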

Example: if a vector already points toward both "Michael" and "Jordan", a hidden unit detects the combination and the MLP adds the "basketball" direction. Both linear layers have a bias term; the first-layer bias sets each unit's firing threshold.
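A toy version of that example, as a sketch only. The three-dimensional embedding space, the one-axis-per-concept directions, and the hand-picked weights are all hypothetical; real models learn far messier features.

```python
import numpy as np

# Hypothetical embedding space: one axis per concept.
michael = np.array([1.0, 0.0, 0.0])
jordan = np.array([0.0, 1.0, 0.0])
basketball = np.array([0.0, 0.0, 1.0])

# A single hidden unit. Its input weights read both name directions, and
# the bias of -1.5 means it fires only when both are present (an AND gate).
W1 = (michael + jordan)[None, :]    # shape (1, 3)
b1 = np.array([-1.5])

# Its output weights write the "basketball" direction, scaled so that a
# full match contributes exactly one unit of it.
W2 = 2.0 * basketball[:, None]      # shape (3, 1)
b2 = np.zeros(3)

def mlp(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

both = mlp(michael + jordan)   # unit fires: pre-activation is 2 - 1.5 = 0.5
only_one = mlp(michael)        # unit stays off: pre-activation is 1 - 1.5 = -0.5
```

Here `both` equals the `basketball` vector and `only_one` is all zeros. In a transformer the result would be added to the input rather than replacing it, so the vector keeps pointing toward "Michael" and "Jordan" and gains "basketball" on top.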

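Most of each layer's parameters live here (about 2/3), not in attention (about 1/3), assuming the usual 4× hidden expansion and leaving the embedding tables aside. Roughly: the MLP stores facts, attention routes context.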
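The 2/3 versus 1/3 split falls out of simple arithmetic, sketched below. It assumes a hidden width of 4 × `d_model`, four square attention projections (query, key, value, output), and ignores biases, layer norms, and embedding tables; `d_model = 1024` is just an example value.

```python
def attention_params(d_model):
    # Four d_model x d_model projections: query, key, value, output.
    return 4 * d_model * d_model

def ffn_params(d_model, expansion=4):
    d_ff = expansion * d_model
    # W1 is d_ff x d_model and W2 is d_model x d_ff.
    return 2 * d_model * d_ff

d_model = 1024  # example value, assumed
attn = attention_params(d_model)
ffn = ffn_params(d_model)
ffn_share = ffn / (attn + ffn)   # fraction of per-layer weights in the FFN
```

With these assumptions the FFN holds 8 · d_model² weights against 4 · d_model² for attention, so `ffn_share` comes out to 2/3 regardless of `d_model`. Models with a different expansion factor or attention layout will shift the ratio.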

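Recent theory: the FFN acts as a key-value memory. The rows of the first layer's weight matrix are the keys, matched against the input, and the columns of the second layer's weight matrix are the values, mixed into the output in proportion to how strongly each key matched.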
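The key-value reading is the same computation written as a sum over hidden units, which the sketch below checks numerically. Biases are omitted for brevity, and the sizes and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 6                     # illustrative sizes
W1 = rng.normal(size=(d_ff, d_model))    # row i acts as key k_i
W2 = rng.normal(size=(d_model, d_ff))    # column i acts as value v_i
x = rng.normal(size=d_model)

# Matrix form (biases left out).
matrix_out = W2 @ np.maximum(W1 @ x, 0.0)

# Memory form: score each key against x, then sum the values
# weighted by those scores.
memory_out = np.zeros(d_model)
for i in range(d_ff):
    score = max(W1[i] @ x, 0.0)          # how strongly key i matches
    memory_out += score * W2[:, i]       # add value i, scaled by the match

same = np.allclose(matrix_out, memory_out)
```

Unlike attention, the scores here are not normalized with a softmax, and the keys and values are fixed weights rather than being computed from other tokens.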

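No matter how big the model, the embedding dimension (the width of each token's vector, which is separate from the tokenizer) bottlenecks comprehension: every MLP reads from and writes back to a vector of that size, and the MLP's hidden width is usually set as a multiple of it.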

"MLP", "FFN", and "feed-forward layer" all name this same sublayer. The multilayer perceptron lives here, not in the attention sublayer.