ML//Mixture of Experts

2026-03-01

Instead of 1 MLP per layer, have N MLPs ("experts") and a router that activates only K of them per token.
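A minimal sketch of the routing step, in pure Python so it runs without dependencies. Everything here is illustrative and assumed rather than taken from any particular model: the class name `ToyMoELayer`, the ReLU two-matrix experts, the sizes, and the choice to softmax over only the selected K logits (one common gating variant).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_matrix(rows, cols, rng):
    """Small random matrix; the init scale is arbitrary for this sketch."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical MoE layer: N expert MLPs plus a linear router, top-K per token."""

    def __init__(self, d_model, d_hidden, n_experts, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Router: one logit per expert.
        self.router = make_matrix(n_experts, d_model, rng)
        # Each expert is an independent two-layer MLP (w_in, w_out).
        self.experts = [
            (make_matrix(d_hidden, d_model, rng), make_matrix(d_model, d_hidden, rng))
            for _ in range(n_experts)
        ]

    def forward(self, x):
        """Process one token vector x; only the K selected experts are evaluated."""
        logits = matvec(self.router, x)
        # Indices of the K largest router logits.
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[: self.k]
        # Gate weights: softmax restricted to the selected experts (assumed variant).
        gates = softmax([logits[i] for i in top])
        out = [0.0] * len(x)
        for gate, i in zip(gates, top):
            w_in, w_out = self.experts[i]
            hidden = [max(0.0, v) for v in matvec(w_in, x)]  # ReLU
            y = matvec(w_out, hidden)
            out = [o + gate * yi for o, yi in zip(out, y)]
        return out, top, gates


layer = ToyMoELayer(d_model=8, d_hidden=16, n_experts=4, k=2)
out, top, gates = layer.forward([0.1] * 8)
print("experts used:", top, "gate weights:", gates)
```

The other N − K experts are never touched for this token, which is where the compute saving comes from. Real implementations batch tokens per expert and usually add a load-balancing term so the router does not collapse onto a few experts.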

Total param count is huge, active params per token are small: cheaper inference at frontier scale.

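Mixtral 8×7B (Mistral): 8 experts per layer, 2 active per token → ~47B total, ~13B active (not 56B, because attention and embeddings are shared and only the MLPs are replicated). DeepSeek V3: 256 routed experts plus 1 shared expert per MoE layer, 8 routed experts active per token; 671B total, 37B active. GPT-4 is widely reported to use MoE, though OpenAI has not confirmed it.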
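The Mixtral figures can be roughly reproduced with back-of-the-envelope arithmetic. The hyperparameters below are assumptions taken from the publicly released Mixtral 8x7B config (they are not stated in this note), and the count ignores the small normalization weights.

```python
# Assumed Mixtral 8x7B hyperparameters (from the public config, not from this note).
D_MODEL = 4096      # hidden size
D_FF = 14336        # expert MLP intermediate size
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2


def param_count(experts_counted):
    """Parameters when `experts_counted` experts per layer are included."""
    head_dim = D_MODEL // N_HEADS
    kv_dim = N_KV_HEADS * head_dim
    # Attention: Q and O are d x d, K and V are d x kv_dim. Shared by all tokens.
    attn = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim
    # Each expert is a gated MLP with three d x d_ff matrices.
    expert = 3 * D_MODEL * D_FF
    # Router: one logit per expert.
    router = D_MODEL * N_EXPERTS
    per_layer = attn + router + experts_counted * expert
    # Input embedding plus untied output projection.
    embeddings = 2 * VOCAB * D_MODEL
    return N_LAYERS * per_layer + embeddings


total = param_count(N_EXPERTS)   # every expert is stored
active = param_count(TOP_K)      # only the routed experts run per token

print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Almost all of the gap between total and active comes from the expert MLPs; attention, embeddings, and the router are paid in full by every token.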

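More total parameters at roughly the per-token compute of a dense model the size of the active parameters: the scaling trick for frontier models. Every expert still has to sit in memory, so the saving is in compute, not in memory footprint.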

Now the dominant architecture among frontier-scale models whose designs are public. Dense is becoming the exception.