Attention with Linear Biases (ALiBi): an alternative to RoPE and sinusoidal positional encoding


Doesn't add position info to embeddings at all. Instead, it subtracts a penalty from the pre-softmax attention scores that grows linearly with the distance between the query token and the key token, scaled by a fixed per-head slope.
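
A minimal NumPy sketch of that bias for a single causal attention head. The function names (`alibi_bias`, `attention_weights`) and the single `slope` argument are made up for illustration, not taken from any library:

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Return a (seq_len, seq_len) causal ALiBi-style bias for one head."""
    pos = np.arange(seq_len)
    # distance[i, j] = i - j: how far key j lies behind query i
    distance = pos[:, None] - pos[None, :]
    bias = -slope * distance.astype(np.float64)
    # Causal masking: keys that come after the query are blocked with -inf
    bias[distance < 0] = -np.inf
    return bias

def attention_weights(q: np.ndarray, k: np.ndarray, slope: float) -> np.ndarray:
    """Softmax over scaled dot-product scores plus the linear bias."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores + alibi_bias(q.shape[0], slope)
    # Subtract the row maximum before exponentiating, for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Note that `q` and `k` carry no positional signal here; position enters only through the bias added to the scores.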

Farther tokens get penalized more → natural recency bias without learned position embeddings.
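
Quick worked example of the recency effect, assuming all content scores are equal so only the bias matters. The slope value 0.5 is arbitrary, picked for illustration:

```python
import numpy as np

slope = 0.5                          # illustrative slope for one head
distances = np.array([3, 2, 1, 0])   # keys 0..3 as seen from the query at position 3
logits = -slope * distances          # equal content scores cancel in the softmax
weights = np.exp(logits) / np.exp(logits).sum()

# Moving one token closer to the query multiplies the weight by exp(slope)
ratios = weights[1:] / weights[:-1]
```

With equal content scores the attention weights decay geometrically with distance, so a larger slope means a shorter effective context for that head.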

Extrapolates to sequences longer than those seen during training: the penalty depends only on relative distance, so the same rule applies unchanged at any sequence length.

No extra learned parameters: just a fixed penalty schedule (one preset slope per head) applied during attention computation.
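
Sketch of one such fixed schedule, following the geometric slope sequence described in the ALiBi paper for a power-of-two head count (e.g. 1/2, 1/4, ..., 1/256 for 8 heads). The function names are hypothetical, and other head counts are deliberately not handled here:

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Geometric slope schedule; this sketch assumes num_heads is a power of two."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles a power-of-two number of heads")
    start = 2.0 ** (-8.0 / num_heads)
    # Slopes are start, start^2, ..., start^num_heads
    return start ** np.arange(1, num_heads + 1)

def alibi_bias_all_heads(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) bias; nothing in it is learned."""
    pos = np.arange(seq_len)
    # Distance from query i back to key j; future keys are clamped to 0 here
    # and are expected to be removed by a separate causal mask.
    distance = np.maximum(pos[:, None] - pos[None, :], 0)
    return -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]
```

Everything is computed from `num_heads` and `seq_len` alone, so the same function yields a bias for a sequence length never used in training.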