ML//Transformer//positional encoding//ALiBi

2026-03-02

Attention with Linear Biases: alternative to RoPE and sinusoidal positional encoding

Doesn't add position info to the embeddings at all. Instead, it subtracts a bias from each pre-softmax attention score, and that bias grows linearly with the distance between the query and key tokens.
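
Written out, the biased score could look like the following. This is a sketch of the causal case; the symbols are assumptions of this note rather than taken from the text above: q_i and k_j are the query and key vectors at positions i and j, d is the head dimension, and m_h is a fixed positive slope for head h.

```latex
\[
\mathrm{score}^{(h)}_{ij} \;=\; \frac{q_i \cdot k_j}{\sqrt{d}} \;-\; m_h\,(i - j),
\qquad j \le i
\]
```

The softmax is then taken over j as usual. The key at the query's own position (i = j) gets no penalty, and each step further back costs another m_h.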

Farther tokens get penalized more → natural recency bias without learned position embeddings.

Extrapolates to sequences longer than those seen during training: the penalty depends only on relative distance, so the same rule applies unchanged at any length.
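
A small sketch of why the penalty is length-agnostic. The function name `distance_penalty` and the slope value 0.5 are made up for illustration. It only shows that the bias is well defined at any length, not that model quality holds there.

```python
import numpy as np

def distance_penalty(seq_len: int, slope: float = 0.5) -> np.ndarray:
    """Linear penalty -slope * |i - j| for every (query i, key j) pair."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

short = distance_penalty(4)    # e.g. a length seen in training
long = distance_penalty(16)    # a longer length at inference

# The longer matrix agrees with the shorter one wherever they overlap,
# so nothing position-specific has to be learned or resized.
same_prefix = np.array_equal(long[:4, :4], short)
```

Compare this with a learned absolute position table, which has no entry at all for positions beyond the training length.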

No extra parameters: the penalty slopes are fixed in advance (one per attention head), not learned, and applied during attention computation.
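
A minimal NumPy sketch of the whole mechanism, under stated assumptions: causal attention, a head count that is a power of two, and the geometric slope schedule starting at 2^(-8/n) for n heads. The function names (`alibi_slopes`, `alibi_bias`, `attention_weights`) are hypothetical, not any library's API.

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Fixed per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return start ** np.arange(1, num_heads + 1)

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Bias of shape (heads, query, key): -slope * (i - j), -inf for future keys."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]          # i - j, >= 0 for past keys
    slopes = alibi_slopes(num_heads)
    bias = -slopes[:, None, None] * distance[None, :, :]
    return np.where(distance[None, :, :] >= 0, bias, -np.inf)  # causal mask

def attention_weights(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Softmax attention weights for q, k of shape (heads, seq, dim)."""
    num_heads, seq_len, dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)
    scores = scores + alibi_bias(num_heads, seq_len)  # only positional signal
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# With all-zero queries and keys the content scores are identical,
# so the resulting weights reflect the distance penalty alone.
q = np.zeros((8, 4, 16))
k = np.zeros((8, 4, 16))
w = attention_weights(q, k)
```

In the last row of `w[0]`, the weights increase toward the most recent token, which is the recency bias described above. Heads with smaller slopes decay more slowly, so different heads look back over different ranges.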