ML//Transformer//context window
The finite number of tokens the model can see at once — attention is O(n²), so doubling context = 4x memory and compute.
The finite number of tokens the model can see at once — attention is O(n²), so doubling context = 4x memory and compute.
GPT-2: 1024. GPT-3: 2048. GPT-4: 8K-128K. Claude: 100K-200K.
Everything outside the window doesn't exist for the model — no memory, no persistence.
Why RAG matters: fetch relevant info on demand instead of fitting everything in the window.
Flash Attention enables longer windows at the same hardware by avoiding N×N memory materialization.
Sliding window attention trades full attention for O(n×w) — a compromise, not infinite context.