ML//Transformer//attention//self-attention

2018-07-05

Q, K, V all come from the same sequence. Each token attends to every token in that sequence, including itself.

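Often credited as the novel contribution of the 2017 paper. Attention existed before, and so did self-attention (as "intra-attention"); what was new was building the whole sequence model on it, with no recurrence or convolution.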

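The source of the O(n²) cost that defines transformer scaling: every pair of tokens interacts, so time and memory for the attention scores grow quadratically with sequence length n.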
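A quick back-of-the-envelope check of the quadratic growth. The function names are made up for this note, and 4 bytes per score (float32) for a single head is an assumption; real implementations may use other dtypes or avoid materializing the full matrix.

```python
def score_entries(n):
    # One attention score per (query, key) pair: an n x n matrix.
    return n * n

def score_bytes(n, bytes_per_entry=4):
    # Assumes float32 scores (4 bytes each) for one head; hypothetical default.
    return score_entries(n) * bytes_per_entry

for n in (1024, 2048, 4096):
    mib = score_bytes(n) / 2**20
    print(f"n={n}: {score_entries(n)} scores, {mib:.0f} MiB per head")
```

Doubling n quadruples both numbers, which is the whole point of the n² remark.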

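softmax(QKᵀ/√d_k) V, where d_k is the dimension of the query and key vectors. The dot product scores how relevant each key is to a query, dividing by √d_k keeps the scores from growing with the dimension, softmax turns each row of scores into weights that sum to 1, and the output is the weighted sum of the values.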
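A minimal sketch of the formula in plain Python, single head, no batching. The helper names (matmul, softmax, attention, self_attention) and the projection matrices Wq, Wk, Wv are illustrative choices for this note, not taken from any library.

```python
import math

def matmul(A, B):
    # Plain list-of-rows matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors.
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # scores[i][j] is the scaled dot product of query i with key j.
    scores = [[scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
              for q_row in Q]
    weights = [softmax(row) for row in scores]
    # Each output row is the weighted sum of the value rows.
    out = matmul(weights, V)
    return out, weights

def self_attention(X, Wq, Wk, Wv):
    # Self-attention: Q, K and V are all projections of the same sequence X.
    return attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
```

For example, with X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] and identity matrices for all three projections, self_attention returns a 3×2 output and a 3×3 weight matrix whose rows each sum to 1.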

In an encoder (BERT): bidirectional, no causal mask. Every token sees every position, left and right.

In a decoder (GPT): causal, with a mask. Each token sees only itself and earlier positions.
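A small sketch of the difference between the two modes, applied to a matrix of raw attention scores. The function name masked_softmax and the causal flag are invented for this note. Implementations commonly set the masked scores to -inf before the softmax; skipping those positions, as done here, gives the same weights.

```python
import math

def masked_softmax(scores, causal):
    # scores: n x n list of rows; row i holds query i's scores against every key.
    weights = []
    for i, row in enumerate(scores):
        # Causal: position i may look at positions 0..i only.
        # Bidirectional: every position is allowed.
        allowed = [j for j in range(len(row)) if not causal or j <= i]
        m = max(row[j] for j in allowed)
        exps = {j: math.exp(row[j] - m) for j in allowed}
        total = sum(exps.values())
        # Disallowed positions get weight exactly 0.
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights

scores = [[0.0] * 3 for _ in range(3)]             # three tokens, all scores equal
encoder_like = masked_softmax(scores, causal=False)  # every row is uniform over 3 positions
decoder_like = masked_softmax(scores, causal=True)   # row i is uniform over positions 0..i
```

With equal scores, the causal weights form a lower-triangular matrix: the first token attends only to itself, the second splits its weight over the first two positions, and so on.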