ML//Transformer//attention//sliding window attention

Longformer-style: each token attends only to a local window of w neighboring tokens (about w/2 on each side), plus a few designated global tokens.


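Reduces attention cost from O(n²) to O(n×w), where n is the sequence length and w the window size. For a fixed w the cost grows linearly in n, which makes very long sequences feasible.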
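To make the cost claim concrete, here is a minimal sketch that builds a sliding window mask and counts the attended (query, key) pairs. The function names and the symmetric-window convention (w // 2 neighbors on each side, plus the token itself) are assumptions for illustration, not taken from any library.

```python
def sliding_window_mask(n, w):
    """Return an n x n boolean mask; mask[i][j] is True if token i may attend to token j.

    Assumption: symmetric window with w // 2 neighbors on each side, plus the token itself.
    """
    half = w // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def count_attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


n, w = 16, 4
mask = sliding_window_mask(n, w)
local_pairs = count_attended_pairs(mask)  # bounded by n * (w + 1)
full_pairs = n * n                        # what full attention would compute
print(local_pairs, full_pairs)
```

With these toy sizes the gap is small, but the local count is bounded by n × (w + 1), so doubling n only doubles the work instead of quadrupling it.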

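The tradeoff: tokens far apart can only communicate indirectly, through chains of local windows across layers. Each layer extends the reach by about one window, so after L layers the receptive field is roughly L×w tokens wide.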
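The indirect communication can be sketched by propagating reachability layer by layer. This is an illustrative model only: `reachable_after` is a hypothetical helper, it assumes the same symmetric window (w // 2 on each side) in every layer, and it ignores global tokens.

```python
def reachable_after(n, w, num_layers, start):
    """Positions that can exchange information with `start` after num_layers layers.

    Assumption: every layer uses the same symmetric window of w // 2 tokens per side.
    """
    half = w // 2
    reached = {start}
    for _ in range(num_layers):
        # Each layer lets information move at most `half` positions in either direction.
        reached = {
            j
            for i in reached
            for j in range(max(0, i - half), min(n, i + half + 1))
        }
    return reached


n, w = 64, 8
after_3 = reachable_after(n, w, 3, start=0)
after_15 = reachable_after(n, w, 15, start=0)
after_16 = reachable_after(n, w, 16, start=0)
print(len(after_3), 63 in after_15, 63 in after_16)
```

Under these assumptions the first and last tokens of a 64-token sequence only become connected at layer 16, since each layer moves information at most w // 2 = 4 positions in one direction.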

Global tokens (e.g. [CLS]) still attend to everything, and every token attends to them, so they act as anchor points for aggregation.
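A sketch of how global tokens change the mask, assuming the symmetric form in which a global token attends to every position and every position attends to it. The name `local_plus_global_mask` and its parameters are hypothetical.

```python
def local_plus_global_mask(n, w, global_positions):
    """Sliding window mask with fully connected rows and columns for global tokens.

    Assumptions: symmetric window (w // 2 per side); global attention is symmetric.
    """
    half = w // 2
    global_set = set(global_positions)
    return [
        [abs(i - j) <= half or i in global_set or j in global_set for j in range(n)]
        for i in range(n)
    ]


n, w = 16, 4
mask = local_plus_global_mask(n, w, global_positions=[0])  # position 0 plays the [CLS] role
total_pairs = sum(sum(row) for row in mask)
print(total_pairs)
```

Each global token adds on the order of 2n pairs, so the total stays linear in n as long as the number of global tokens is small. In this model, any two tokens are at most two hops apart through position 0.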

None of these approaches give truly "infinite" context — they're all compromises between cost and attention capacity.

Mistral 7B (as originally released) uses sliding window attention by default, in a causal form with a 4096-token window and no global tokens.
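For a decoder-only model the window has to be causal, so each token looks only backwards. The sketch below is a generic causal sliding window mask, not Mistral's actual implementation; the exact window boundary (the token itself plus the w - 1 tokens before it) is an assumption.

```python
def causal_sliding_window_mask(n, w):
    """mask[i][j] is True if token i may attend to token j.

    Assumption: token i sees itself and the w - 1 tokens immediately before it.
    """
    return [[0 <= i - j < w for j in range(n)] for i in range(n)]


n, w = 8, 3
mask = causal_sliding_window_mask(n, w)
causal_pairs = sum(sum(row) for row in mask)
print(causal_pairs)
```

No token attends to a later position, and at most w keys are needed per query, which is what allows a fixed-size rolling key/value cache during generation.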