ML//Transformer//attention//sliding window attention
Longformer-style: each token attends only to its w nearest neighbors + a few designated global tokens.
Reduces attention cost from O(n²) to O(n×w) — makes very long sequences feasible.
The tradeoff: tokens far apart can only communicate indirectly, through chains of local windows across layers, so the reachable span grows by roughly w positions per layer.
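A rough sketch of that indirect communication: how many stacked layers are needed before information from one token can reach another a given distance away. It assumes a symmetric window with w // 2 neighbors on each side; the function name `min_layers_to_connect` is hypothetical, and global tokens are ignored here.

```python
import math


def min_layers_to_connect(distance: int, w: int) -> int:
    """Fewest stacked layers for information to travel `distance` positions.

    Assumption: symmetric window, so one layer moves information at most
    w // 2 positions in either direction.
    """
    half = w // 2
    if half < 1:
        raise ValueError("w must be at least 2 for a symmetric window")
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return math.ceil(distance / half)


print(min_layers_to_connect(distance=100, w=512))   # 1: already inside the window
print(min_layers_to_connect(distance=4096, w=512))  # 16: 4096 / 256 hops
```

With a 512-wide window, two tokens 4,096 positions apart need a 16-layer chain of local windows before one can influence the other.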
Global tokens (e.g. [CLS]) still attend to everything, and every token attends to them, so they act as anchor points for aggregation.
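A minimal sketch of the resulting attention pattern, as a boolean mask in plain Python. It assumes a symmetric window (w // 2 neighbors on each side) and symmetric global attention as in Longformer; `sliding_window_mask` and its parameters are illustrative names, not any library's API.

```python
def sliding_window_mask(n, w, global_idx=()):
    """Return mask where mask[i][j] is True if query i may attend to key j."""
    half = w // 2
    # Local band: each token sees itself and w // 2 neighbors on each side.
    mask = [[abs(i - j) <= half for j in range(n)] for i in range(n)]
    for g in set(global_idx):
        for j in range(n):
            mask[g][j] = True  # the global token attends to every position
            mask[j][g] = True  # every position attends to the global token
    return mask


mask = sliding_window_mask(n=8, w=2, global_idx=[0])
allowed = sum(sum(row) for row in mask)
print(allowed, "of", 8 * 8, "query-key pairs are computed")  # 34 of 64

# Visualize: 'x' marks an allowed pair, '.' a skipped one.
for row in mask:
    print("".join("x" if cell else "." for cell in row))
```

The band contributes about n×w pairs and each global token adds about 2n more, so the total stays linear in n for fixed w and a fixed number of global tokens.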
None of these approaches gives truly "infinite" context; they are all compromises between cost and attention capacity.
Mistral 7B uses sliding window attention by default (causal, with a 4,096-token window and no global tokens).