ML//Mixture of Experts

Instead of 1 MLP per layer, have N MLPs ("experts") and a router that activates only K of them per token.
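The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the helper names (`make_expert`, `moe_forward`), the tiny dimensions, and the random weights are all hypothetical. It assumes the gate weights are a softmax over the top-K router logits, which is one common choice; some models normalise over all N experts before selecting.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def make_expert(d_model, d_hidden, rng):
    """Hypothetical expert: a 2-layer ReLU MLP with random weights."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    def expert(x):
        h = [max(0.0, v) for v in matvec(W1, x)]
        return matvec(W2, h)
    return expert

def moe_forward(x, router_W, experts, k):
    """Route one token vector x to its top-k experts and mix their outputs."""
    logits = matvec(router_W, x)  # one score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # renormalise over the chosen k
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)  # only k of the N experts are ever evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top, gates

# Toy sizes, chosen only for illustration.
rng = random.Random(0)
N, K, D_MODEL, D_HIDDEN = 8, 2, 4, 16
experts = [make_expert(D_MODEL, D_HIDDEN, rng) for _ in range(N)]
router_W = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]
output, chosen, gates = moe_forward(token, router_W, experts, K)
```

The other N − K experts contribute nothing to this token, which is where the compute saving comes from.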


Total param count is huge, but active params per token are small, so compute per token is much cheaper at frontier scale. All experts still have to sit in memory.

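Mixtral 8×7B (Mistral): 8 experts per layer, 2 active → 47B total, 13B active. The total is below 8 × 7B = 56B because only the MLPs are replicated; attention and embeddings are shared. DeepSeek V3: 256 routed experts per MoE layer, 8 active per token. GPT-4 is widely reported to use MoE, but OpenAI has not confirmed it.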
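The Mixtral numbers can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes the hyperparameters of the publicly released Mixtral 8×7B config as I recall them (hidden size 4096, MLP intermediate size 14336, 32 layers, 32 query heads and 8 key/value heads of dimension 128, vocabulary 32000, untied input and output embeddings); treat them as assumptions rather than authoritative figures. Normalisation weights are ignored because they are negligible.

```python
# Assumed Mixtral-8x7B-style hyperparameters (not taken from the notes above).
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
VOCAB = 32000
N_EXPERTS, K_ACTIVE = 8, 2

# One expert is a gated MLP with three weight matrices (gate, up, down).
expert_per_layer = 3 * HIDDEN * INTERMEDIATE

# Grouped-query attention: full-size Q and O projections, smaller K and V.
kv_dim = KV_HEADS * HEAD_DIM
attn_per_layer = 2 * HIDDEN * (Q_HEADS * HEAD_DIM) + 2 * HIDDEN * kv_dim

# The router is a single linear map from the hidden state to expert scores.
router_per_layer = HIDDEN * N_EXPERTS

# Input embedding plus output projection, assumed untied.
embeddings = 2 * VOCAB * HIDDEN

# Parameters every token passes through, regardless of routing.
shared = LAYERS * (attn_per_layer + router_per_layer) + embeddings

total_params = shared + LAYERS * N_EXPERTS * expert_per_layer
active_params = shared + LAYERS * K_ACTIVE * expert_per_layer

print(f"total:  {total_params / 1e9:.1f}B")
print(f"active: {active_params / 1e9:.1f}B")
```

Under these assumptions the totals come out at about 46.7B and 12.9B, in line with the rounded 47B and 13B above. Nearly all of the gap between total and active is expert MLP weight.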

More total parameters at roughly the per-token compute of a much smaller dense model: the scaling trick for frontier models.

Now the dominant architecture for frontier-scale models. Dense is becoming the exception.