ML//Transformer//attention//head

2026-03-08

One unit of multi-head attention: its own W_Q, W_K, W_V projections operating on a reduced dimension (e.g. 128-dim instead of 12288)

One unit of multi-head attention: its own W_Q, W_K, W_V projections operating on a reduced dimension (e.g. 128-dim instead of 12288)

Why reduce? Each head can specialize in a different type of relationship (syntax, coreference, semantic, positional) by operating in its own subspace.

No explicit mechanism forces specialization. It emerges from training. If two heads learn the same thing, they don't reduce loss, so backprop pushes them to diversify.

Emergent specialization

Empirically observed specializations: syntax tracking, coreference resolution, positional distance, and induction heads (pattern-copying circuits for in-context learning)

GPT-3: 96 heads per layer × 96 layers. Outputs are concatenated and projected back to model dimension via W_O.

The full process (Q·K → softmax → ×V → sum columns → suggested vector change) IS one head of attention.

The question mechanistic interpretability asks: if you can identify what each head does across 96 heads × 96 layers, you have a complete map of how the model thinks.