ML//Transformer//attention//self-attention

Q, K, V all come from the same sequence — each token attends to all others including itself.


Q, K, V all come from the same sequence — each token attends to all others including itself.

The novel contribution of the 2017 paper — attention existed before, but self-attention for sequence processing was new.

The n² cost that defines transformer scaling: every pair of tokens interacts.

softmax(QKᵀ/√d) V — the dot product measures relevance, softmax normalizes, values get weighted.

In encoder (BERT): bidirectional, no masking — every token sees everything.

In decoder (GPT): causal, with masking — each token only sees the past.