ML//Mixture of Experts
Instead of 1 MLP per layer, have N MLPs ("experts") and a router that activates only K of them per token.
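A minimal sketch of the routing step for a single token, in plain Python. All names here (`top_k_route`, `make_expert`, `moe_layer`) and the toy sizes are hypothetical. Renormalising with a softmax over only the selected logits is one common choice, not the only one. Real implementations batch tokens and usually add a load-balancing term, which this omits.

```python
import math
import random

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax over just those logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return top, [e / total for e in exps]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def make_expert(d_model, d_hidden, rng):
    """Hypothetical expert: a small two-layer ReLU MLP with random weights."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    def expert(x):
        hidden = [max(h, 0.0) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)
    return expert

def moe_layer(x, router_w, experts, k):
    """Route one token: score all experts, run only the top k, mix their outputs."""
    logits = matvec(router_w, x)            # one router logit per expert
    chosen, gates = top_k_route(logits, k)
    out = [0.0] * len(x)
    for idx, gate in zip(chosen, gates):
        y = experts[idx](x)                 # the other N - k experts never run
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

# Toy sizes, chosen only for illustration.
rng = random.Random(0)
N, K, D_MODEL, D_HIDDEN = 8, 2, 4, 16
experts = [make_expert(D_MODEL, D_HIDDEN, rng) for _ in range(N)]
router_w = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]
output, chosen = moe_layer(token, router_w, experts, K)
print("experts used:", chosen)
```

The router itself is just one small linear layer; the cost saving comes from the loop touching K experts instead of N.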
Total param count is huge, active params per token are small — cheaper inference at frontier scale.
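Mixtral 8×7B (Mistral AI): 8 experts per layer, 2 active per token → ~47B total, ~13B active. The total is not 8 × 7B = 56B because only the MLPs are replicated; attention and embeddings are shared. DeepSeek-V3: 256 routed experts per MoE layer, 8 active per token. GPT-4 is widely reported to use MoE, but this is not officially confirmed.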
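A rough check of the Mixtral total and active parameter counts. The config values below are as recalled from the published Mixtral 8×7B model config, so treat them as assumptions. Norm weights are ignored as negligible, and the variable names are my own.

```python
# Assumed Mixtral 8x7B config values (recalled, not taken from these notes).
d_model = 4096
n_layers = 32
d_ff = 14336          # hidden width of each expert MLP
n_experts = 8
k_active = 2
vocab = 32000
n_heads = 32
n_kv_heads = 8        # grouped-query attention: fewer K/V heads than Q heads
head_dim = d_model // n_heads

expert_params = 3 * d_model * d_ff                      # gate, up, down projections
attn_params = (2 * d_model * d_model                    # Q and output projections
               + 2 * d_model * n_kv_heads * head_dim)   # K and V projections
router_params = d_model * n_experts
embed_params = 2 * vocab * d_model                      # untied input embedding + output head

def param_count(experts_counted):
    """Parameters when `experts_counted` experts per layer are included."""
    per_layer = attn_params + router_params + experts_counted * expert_params
    return n_layers * per_layer + embed_params

total = param_count(n_experts)
active = param_count(k_active)
print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Almost all of the total sits in the expert MLPs, which is why dropping from 8 to 2 experts per token cuts the active count by roughly 3.6× rather than 4×: the shared attention and embedding weights are always active.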
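Compute per token tracks the active params, not the total: capacity grows with N while FLOPs per token grow with K. All N experts still have to sit in memory, so the saving is in compute, not in memory footprint.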
Now the dominant architecture among openly documented frontier-scale models. Dense is becoming the exception.