Grouped Query Attention (GQA): instead of giving every attention head its own K and V, share one K/V pair across each group of query heads.


Standard MHA with 96 heads: 96 Q + 96 K + 96 V. GQA with 8 groups: 96 Q + 8 K + 8 V, so each K/V pair serves 96 / 8 = 12 query heads.
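The 96-to-8 grouping can be sketched as a simple index mapping. This is an illustrative sketch only: the function name `kv_head_for_query_head` is hypothetical, and assigning consecutive query heads to the same group is an assumed (though common) convention, not something the note specifies.

```python
def kv_head_for_query_head(query_head: int,
                           n_query_heads: int = 96,
                           n_kv_heads: int = 8) -> int:
    """Return the index of the shared K/V head used by a given query head.

    Assumption: query heads are grouped in consecutive runs of equal size.
    """
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads  # 96 // 8 = 12
    return query_head // group_size


# Collect which query heads land in each K/V group.
groups = {}
for h in range(96):
    groups.setdefault(kv_head_for_query_head(h), []).append(h)

print(len(groups))    # 8 groups
print(groups[0])      # query heads 0..11 share K/V head 0
```

Under this convention, query heads 0 to 11 read K/V head 0, heads 12 to 23 read K/V head 1, and so on.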

Drastically reduces KV cache size: the cache scales with the number of K/V pairs, not the number of query heads. Going from 96 K/V pairs to 8 shrinks it by 12x.
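A back-of-the-envelope calculation makes the saving concrete. The helper `kv_cache_bytes` and the model shape below (80 layers, head dimension 128, 8192-token context, 2-byte values) are hypothetical numbers chosen for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both K and V in every layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, 16-bit cache entries, one sequence.
shape = dict(n_layers=80, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(n_kv_heads=96, **shape)  # one K/V pair per head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # one K/V pair per group

print(mha / 2**30)  # 30.0 GiB
print(gqa / 2**30)  # 2.5 GiB
print(mha / gqa)    # 12.0
```

The number of query heads never appears in the formula, which is why only the K/V head count matters for cache size.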

Llama 3, Gemini use GQA. Minimal quality loss, major memory savings at inference.

MQA (Multi-Query Attention) is the extreme: ALL heads share a single K, V pair. GQA is the practical middle ground.
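MHA, GQA, and MQA can be viewed as one mechanism that differs only in the number of K/V heads. The NumPy sketch below illustrates that idea under simplifying assumptions: no causal mask, no batching, no learned projections, and random inputs. The function name `attention_shared_kv` is hypothetical, and real implementations typically avoid materializing the repeated K/V tensors.

```python
import numpy as np


def attention_shared_kv(q, k, v):
    """Scaled dot-product attention with K/V heads shared across query heads.

    q: (n_q_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim)
    """
    n_q_heads, _, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    # Expand K/V so each run of `group_size` consecutive query heads
    # attends over the same K/V pair.
    k_full = np.repeat(k, group_size, axis=0)
    v_full = np.repeat(v, group_size, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_full


rng = np.random.default_rng(0)
n_q_heads, seq_len, head_dim = 96, 4, 16
q = rng.standard_normal((n_q_heads, seq_len, head_dim))

outputs = {}
for name, n_kv_heads in [("MHA", 96), ("GQA", 8), ("MQA", 1)]:
    k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
    v = rng.standard_normal((n_kv_heads, seq_len, head_dim))
    outputs[name] = attention_shared_kv(q, k, v)
    print(name, "K/V heads:", n_kv_heads, "output shape:", outputs[name].shape)
```

All three variants produce an output of the same shape; only the amount of K/V state that has to be stored changes.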