ML//Transformer//attention//induction head

One of the most important discoveries of mechanistic interpretability — concrete, verifiable, and functionally understood.


The pattern they explain: if [A][B] appears in the context and later [A] appears again, the model predicts [B] with high probability. "The cat sleeps. The dog barks. The cat ____" → predicts "sleeps".
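The [A][B] ... [A] → [B] rule can be written out as a plain lookup. This is only a sketch of the behavior, not of how a transformer computes it; the function name induction_predict and the whitespace tokenization are assumptions made for illustration.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions (the last position itself is skipped).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = "The cat sleeps . The dog barks . The cat".split()
print(induction_predict(context))  # -> sleeps
```

A real model produces a probability distribution rather than a hard lookup, so this only captures the idealized case.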

This IS in-context learning at the lowest level — the model learns patterns within the current sequence, no weight updates.

The circuit

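Two attention heads in different layers working together (the first must sit in an earlier layer than the second, though not necessarily the layer directly before it):

Head 1 (previous token head): each position attends to the token immediately before it and copies that token's identity into its own residual stream. The position holding [B] now also records "my previous token was [A]".

Head 2 (induction head): when [A] appears again, its query is matched against the previous-token information written by Head 1 (this is the prefix matching step). It therefore attends to the position right AFTER the earlier [A], which is [B], and copies that token into the prediction.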
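A toy version of the two-head mechanism, as it is usually described in the induction-head literature: an earlier head writes "what was my previous token" into each position, and a later head matches the current token against that information and copies the token it lands on. Everything here is an assumption for illustration: one-hot embeddings, hand-set attention patterns instead of learned W_Q/W_K/W_V, no positional encoding, and the names two_head_induction and sharpness are hypothetical.

```python
import numpy as np


def two_head_induction(tokens, vocab_size, sharpness=20.0):
    n = len(tokens)
    cur = np.eye(vocab_size)[tokens]   # (n, V): "current token" subspace

    # Head 1 (previous token head): position i attends to i-1 and copies its token.
    attn1 = np.eye(n, k=-1)            # attn1[i, i-1] = 1; row 0 stays all zeros
    prev = attn1 @ cur                 # (n, V): "previous token" subspace

    # Head 2 (induction head): query = current token, key = head 1's output.
    # scores[i, j] is large when tokens[j-1] == tokens[i].
    scores = sharpness * (cur @ prev.T)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    attn2 = np.exp(scores)
    attn2 /= attn2.sum(axis=1, keepdims=True)

    # Values are the tokens at the attended positions, read out as logits.
    logits = attn2 @ cur
    return attn2, logits


vocab = ["The", "cat", "sleeps", ".", "dog", "barks"]
words = "The cat sleeps . The dog barks . The cat".split()
tokens = [vocab.index(w) for w in words]

attn2, logits = two_head_induction(tokens, len(vocab))
print(vocab[int(np.argmax(logits[-1]))])  # -> sleeps
```

When the last token has no earlier occurrence, the second head's attention falls back to roughly uniform and the argmax is meaningless, so the sketch only says something in the repeated-token case.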

Not a special architecture: they're regular heads with regular W_Q, W_K, W_V matrices, initialized randomly. Gradient descent pushed them into this specialization because it reduced the loss.
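For reference, this is roughly what a "regular head" looks like: a single causal attention head with randomly initialized W_Q, W_K, W_V. It is a minimal sketch, not any particular model's implementation; the sizes, the initialization scale, and the function name attention_head are assumptions, and the output projection W_O is left out.

```python
import numpy as np


def attention_head(x, W_Q, W_K, W_V):
    """Single causal attention head. x: (n, d_model); W_*: (d_model, d_head)."""
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    scores = q @ k.T / np.sqrt(W_Q.shape[1])
    n = x.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)   # no attention to future positions
    scores -= scores.max(axis=1, keepdims=True)
    pattern = np.exp(scores)
    pattern /= pattern.sum(axis=1, keepdims=True)
    return pattern, pattern @ v


rng = np.random.default_rng(0)
d_model, d_head, n = 16, 4, 6                    # illustrative sizes only
W_Q, W_K, W_V = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_head)) for _ in range(3)
)
x = rng.normal(size=(n, d_model))

pattern, out = attention_head(x, W_Q, W_K, W_V)
print(pattern.shape, out.shape)
```

Nothing in this code is specific to induction; the specialization lives entirely in the values the three matrices take after training.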

In large models there are multiple induction circuits, some more specialized, some for different pattern types. They're a subset of the heads in each layer (96 per layer in GPT-3 175B, for example), not all of them.
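One commonly used way to pick out which heads behave like induction heads is to feed the model a block of random tokens repeated twice and check how much each head, at a position in the second copy, attends to the token right after the matching position in the first copy. The sketch below assumes the attention patterns have already been extracted; the patterns here are synthetic, and the name induction_score and the 0.5 threshold are hypothetical.

```python
import numpy as np


def induction_score(attn, period):
    """attn: (2*period, 2*period) attention pattern of one head on a sequence
    of `period` random tokens repeated twice. Returns the mean attention from
    each query in the second copy to the token after its earlier occurrence."""
    stripe = np.diagonal(attn, offset=-(period - 1))  # attn[i, i - period + 1]
    return float(stripe[1:].mean())                   # drop the one first-copy query


period = 4
n = 2 * period

# Synthetic head A: attends exactly where an induction head would.
induction_like = np.zeros((n, n))
for row in range(n):
    col = row - period + 1 if row >= period else row
    induction_like[row, col] = 1.0

# Synthetic head B: spreads attention evenly over all earlier positions.
uniform = np.tril(np.ones((n, n)))
uniform /= uniform.sum(axis=1, keepdims=True)

heads = np.stack([induction_like, uniform])
scores = [induction_score(h, period) for h in heads]
flagged = [i for i, s in enumerate(scores) if s > 0.5]  # hypothetical threshold
print(scores, flagged)
```

A high score shows the attention pattern is induction-like; confirming that the head also copies the attended token takes a separate check.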

Early in training there's a phase transition: when induction heads emerge, the loss drops abruptly and the model suddenly gets much better at following context patterns.
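"Better at following context patterns" can be made measurable by comparing the loss late in the context with the loss early in the context: the more negative the difference, the more the model benefits from what it has already seen. The sketch below uses positions 50 and 500 as defaults, which is an assumption (treat them as tunable), and the loss arrays are made up purely to show the calculation.

```python
import numpy as np


def in_context_learning_score(per_token_loss, early=50, late=500):
    """per_token_loss: (num_sequences, seq_len) array of token-level losses.
    Returns mean loss at index `late` minus mean loss at index `early`."""
    losses = np.asarray(per_token_loss, dtype=float)
    return float(losses[:, late].mean() - losses[:, early].mean())


positions = np.arange(512)

# Made-up checkpoint "before": loss does not improve with more context.
before = np.full((8, 512), 4.0)

# Made-up checkpoint "after": loss falls steadily as context grows.
after = np.tile(4.0 - 0.001 * positions, (8, 1))

print(in_context_learning_score(before))
print(in_context_learning_score(after))
```

Tracking this number across checkpoints is one way to see the transition as a sudden drop instead of reading it off the raw loss curve.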

The "hello world" of understanding transformers from the inside — proof that attention heads form interpretable circuits, not just opaque weights.