ML//Transformer//positional encoding

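Without position info, a Transformer treats "cat bites dog" the same as "dog bites cat": self-attention is permutation-equivariant, so reordering the input tokens only reorders the outputs and word order carries no signal.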
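
Quick sanity check of that claim, as a rough sketch: a bare single-head self-attention with Q = K = V = x (no learned weights, no position signal), run on a toy three-word sentence and its reversal. The 2-d embeddings and the helper names (`self_attention`, `encode`) are made up for illustration.

```python
import math

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (assumption: no projections)."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Toy embeddings, hypothetical values chosen only for the demo.
emb = {"cat": [1.0, 0.0], "bites": [0.0, 1.0], "dog": [1.0, 1.0]}

def encode(sentence):
    words = sentence.split()
    return dict(zip(words, self_attention([emb[w] for w in words])))

a = encode("cat bites dog")
b = encode("dog bites cat")

# Each word gets the same output vector in both sentences (up to float rounding).
same = all(math.isclose(p, q) for w in a for p, q in zip(a[w], b[w]))
print(same)
```

Each word ends up with the same vector in both orderings, so nothing downstream can tell who bit whom.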


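Original paper ("Attention Is All You Need"): fixed sinusoids of different frequencies, added to the token embeddings. Problem: train on 512 tokens, see 1000 at inference → degradation. The sinusoids are defined for any position, but the model never saw those positions during training.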
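
Sketch of the sinusoidal scheme, assuming the usual formulation: sin on even dimensions, cos on odd ones, wavelengths growing geometrically with base 10000. The function name and the toy embedding values are made up; plain lists keep it dependency-free.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Position vector: sin/cos pairs at geometrically spaced frequencies."""
    pe = []
    for i in range(0, d_model, 2):  # i is the even dimension index
        angle = pos / base ** (i / d_model)
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]  # trim in case d_model is odd

# The encoding is added elementwise to the token embedding (toy values).
token_emb = [0.5, -0.2, 0.1, 0.3]
x = [e + p for e, p in zip(token_emb, sinusoidal_encoding(3, 4))]

# Position 1000 is perfectly computable even if training stopped at 512.
far = sinusoidal_encoding(1000, 4)
print(far)
```

So the encoding itself never runs out of positions. The degradation comes from the rest of the network meeting inputs it was never trained on.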

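Alternatives: learned embeddings (GPT-2; a lookup table with no entries past the trained length), RoPE (LLaMA; rotates Q and K by a position-dependent angle, so attention scores depend on the relative offset), ALiBi (a penalty on attention scores that grows linearly with query-key distance). Each approach handles extrapolation differently.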
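
Rough sketches of the last two, under stated assumptions. The RoPE version rotates adjacent dimension pairs with base 10000 (real implementations differ in how they pair dimensions). The ALiBi version uses the geometric slope schedule for power-of-two head counts and folds in a causal mask. All function names here are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle proportional to pos."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * distance, -inf for future tokens."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

# RoPE: the score depends only on the offset between query and key positions.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near = dot(rope(q, 3), rope(k, 1))        # offset 2, early in the sequence
far = dot(rope(q, 1003), rope(k, 1001))   # offset 2, far past position 512
print(math.isclose(near, far, abs_tol=1e-9))

# ALiBi: the last row penalizes older tokens more.
print(alibi_bias(3, alibi_slopes(8)[0])[2])
```

Neither scheme has a table of absolute positions to run out of: RoPE scores depend on the offset, and the ALiBi penalty is defined for any distance.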

The encoding method matters most for long sequences beyond training length.