ML//Transformer//positional encoding
Without position info, a Transformer cannot tell "cat bites dog" from "dog bites cat": self-attention is permutation-equivariant, so reordering the input tokens only reorders the outputs.
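A minimal sketch of that claim, assuming a single attention head with identity Q/K/V projections and made-up 3-dimensional embeddings (both are simplifications for illustration, not taken from any real model). Each word should get the same output vector no matter where it sits in the sentence, up to floating-point rounding.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity Q/K/V and no position info."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        # Numerically stable softmax.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted sum of the value vectors.
        out.append([sum(w / total * v[i] for w, v in zip(weights, xs))
                    for i in range(d)])
    return out

def same(u, v):
    """Element-wise comparison that tolerates rounding differences."""
    return all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(u, v))

# Hypothetical toy embeddings, chosen arbitrarily.
emb = {"cat": [1.0, 0.0, 0.5], "bites": [0.0, 1.0, 0.2], "dog": [0.3, 0.7, 1.0]}

a = self_attention([emb[w] for w in ["cat", "bites", "dog"]])
b = self_attention([emb[w] for w in ["dog", "bites", "cat"]])

# "cat" is at index 0 in the first sentence and index 2 in the second.
print(same(a[0], b[2]), same(a[1], b[1]), same(a[2], b[0]))
```

The outputs are only shuffled, never changed, so nothing downstream can recover who bit whom unless position is injected somewhere.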
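Original paper: fixed sinusoids of different frequencies, added to the token embeddings. Problem: train on 512 tokens, see 1000 at inference → degradation. The sinusoids are defined for any position, but the model never saw positions past 512 during training.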
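A sketch of the sinusoidal scheme, using the usual formulation PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name, the `d_model=8` size, and the toy embedding are assumptions for illustration, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Fixed sinusoidal position vector for one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model.
        angle = pos / base ** (i / d_model)
        pe.append(math.sin(angle))  # even dimension
        pe.append(math.cos(angle))  # odd dimension
    return pe

d_model = 8
token_embedding = [0.1] * d_model  # hypothetical embedding vector

# The encoding is added to the embedding, element by element.
pe_seen = sinusoidal_encoding(10, d_model)
x = [e + p for e, p in zip(token_embedding, pe_seen)]

# Position 1000 still yields a well-defined vector with values in [-1, 1];
# the trouble is that a model trained on 512 tokens never learned to use it.
pe_unseen = sinusoidal_encoding(1000, d_model)
```

So the extrapolation failure is not in the formula. It comes from the rest of the network meeting inputs outside its training distribution.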
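Alternatives: learned position embeddings (GPT-2), RoPE (LLaMA; rotates Q and K by a position-dependent angle, so attention scores depend on the relative offset), ALiBi (adds a linear distance penalty to the attention scores). Each handles extrapolation differently: a learned table has no entries beyond its maximum length, whereas ALiBi was designed specifically to extrapolate to longer sequences.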
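A rough sketch of the two relative schemes, under stated assumptions: `rope` pairs adjacent dimensions and uses base 10000 (real implementations differ in how they pair dimensions), and `alibi_bias` takes a single hand-picked slope and folds in a causal mask. The function names and the vectors `q` and `k` are made up for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2-D rotation
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alibi_bias(seq_len, slope):
    """Causal ALiBi bias: -slope * distance for past keys, -inf for future keys."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

q = [0.5, -1.0, 0.25, 2.0]   # hypothetical query vector
k = [1.0, 0.3, -0.7, 0.1]    # hypothetical key vector

# Same offset (2) at very different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 1002), rope(k, 1000))

# Added to the raw attention scores before the softmax.
bias = alibi_bias(4, slope=0.5)
```

`score_near` and `score_far` should agree up to rounding, because rotating both vectors leaves only the offset in the dot product. The ALiBi matrix depends only on `i - j` as well, so it can be built for any sequence length without learned parameters.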
The choice of encoding matters most when sequences at inference exceed the training length.