ML//Transformer//context window

The finite number of tokens the model can see at once — attention is O(n²), so doubling context = 4x memory and compute.


The finite number of tokens the model can see at once — attention is O(n²), so doubling context = 4x memory and compute.

GPT-2: 1024. GPT-3: 2048. GPT-4: 8K-128K. Claude: 100K-200K.

Everything outside the window doesn't exist for the model — no memory, no persistence.

Why RAG matters: fetch relevant info on demand instead of fitting everything in the window.

Flash Attention enables longer windows at the same hardware by avoiding N×N memory materialization.

Sliding window attention trades full attention for O(n×w) — a compromise, not infinite context.