ML//Transformer//attention

2018-06-15

Attention existed before the 2017 paper. The key contribution of "Attention Is All You Need" was that attention was... all you needed, no recurrence. Self-attention specifically was novel.

Attention existed before the 2017 paper. The key contribution of "Attention Is All You Need" was that attention was... all you needed, no recurrence. Self-attention specifically was novel.

Attention does context routing: allows encoded information from one token's vector to flow into another's. Every token can look at every other token in parallel.

Q, K, V matrices: query asks a question, keys answer relevance, values carry the content.

O(n²) cost in sequence length, the fundamental scaling constraint. Doblar el contexto = 4x memoria y cómputo.

~1/3 of all you need: attention handles context, the other 2/3 is MLP storing facts.

Replaced the sequential bottleneck of RNNs with parallel computation.