ML//Inference//KV cache
During autoregressive generation, store the K and V matrices for all previous tokens instead of recomputing them.
This only works because of causal masking: previous tokens' representations are "closed", meaning they don't change when a new token arrives. Without masking (BERT-style), every new token would change all previous values, so everything would have to be recomputed.
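A minimal NumPy sketch of that claim, assuming a single attention head with random weights (the `attention` helper, the dimension `d = 8`, and the random inputs are illustrative, not taken from any particular model). It compares the outputs for the first two tokens before and after a third token is appended, with and without the causal mask.

```python
import numpy as np

def attention(x, Wq, Wk, Wv, causal):
    """Single-head self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal:
        # Position i may only attend to positions <= i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8  # illustrative model dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
x = rng.normal(size=(3, d))  # three tokens; the third is the "new" one

# Do the first two outputs stay the same once the third token is added?
causal_stable = np.allclose(
    attention(x[:2], Wq, Wk, Wv, causal=True),
    attention(x, Wq, Wk, Wv, causal=True)[:2],
)
bidirectional_stable = np.allclose(
    attention(x[:2], Wq, Wk, Wv, causal=False),
    attention(x, Wq, Wk, Wv, causal=False)[:2],
)
print(causal_stable, bidirectional_stable)
```

With the mask, the earlier outputs are unchanged, so anything derived from them can be cached. Without it, the new token leaks into every earlier row and the comparison fails.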
Example: context ["El", "gato"], new token "duerme" → only compute K_duerme, V_duerme. K_El, V_El, K_gato, V_gato already cached.
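The example above can be sketched in NumPy as follows, assuming one attention head in one layer. `KVCache`, `decode_step`, and `full_causal_last` are hypothetical names for this note, and the embeddings are random stand-ins for the three tokens. Each step computes K and V only for the new token and checks the result against a full recomputation.

```python
import numpy as np

class KVCache:
    """Append-only store for one attention head's keys and values."""
    def __init__(self, d_head):
        self.k = np.empty((0, d_head))
        self.v = np.empty((0, d_head))

    def append(self, k_new, v_new):
        self.k = np.vstack([self.k, k_new])
        self.v = np.vstack([self.v, v_new])

def softmax(scores):
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def decode_step(x_new, Wq, Wk, Wv, cache):
    """Attend from one new token over all cached tokens plus itself."""
    q = x_new @ Wq
    cache.append(x_new @ Wk, x_new @ Wv)  # the only K and V computed this step
    scores = cache.k @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.v

def full_causal_last(x, Wq, Wk, Wv):
    """Reference: recompute K and V for every token, return the last position."""
    q = x[-1] @ Wq
    k, v = x @ Wk, x @ Wv
    scores = k @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8  # illustrative head dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
tokens = ["El", "gato", "duerme"]
x = rng.normal(size=(len(tokens), d))  # stand-in embeddings, not real ones

cache = KVCache(d)
for i in range(len(tokens)):
    out = decode_step(x[i], Wq, Wk, Wv, cache)
    # Cached decoding matches recomputing everything from scratch.
    assert np.allclose(out, full_causal_last(x[: i + 1], Wq, Wk, Wv))
print(cache.k.shape)
```

Per step, the cached path does one row of projections instead of one per token in the context. In a multi-layer model the same idea applies at every layer, since causal masking keeps each layer's earlier outputs fixed too.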
For 4K+ token sequences, the KV cache can cut generation time by roughly 10-50x. The cost is memory: the cache grows linearly with context length.
The tradeoff
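The KV cache trades memory for compute. This is a large part of why inference is memory-bandwidth-bound rather than compute-bound: each decoding step does little arithmetic per token but has to read the weights and the whole cache from memory. For long contexts the KV cache can exceed the model weights in memory.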
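A back-of-the-envelope estimate of the cache size. The configuration is hypothetical: a 7B-parameter decoder in fp16 with 32 layers, 32 KV heads of dimension 128, full multi-head attention, and a single sequence. `kv_cache_bytes` is a helper name made up for this note. The formula counts two tensors (K and V) per layer, one vector per KV head per token.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes held by the K and V caches across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

# Hypothetical config, chosen for illustration only.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
weight_bytes = 7_000_000_000 * 2  # 7B parameters at 2 bytes each (fp16)

per_token = kv_cache_bytes(1, **cfg)
breakeven_tokens = weight_bytes / per_token  # context where cache == weights

for n in (4_096, 32_768, 131_072):
    print(n, round(kv_cache_bytes(n, **cfg) / 2**30, 1), "GiB")
print("cache matches weights at about", round(breakeven_tokens), "tokens")
```

Under these assumptions the cache costs 0.5 MiB per token and overtakes the weights at a context length in the high 20K range. Models that share KV heads across query heads (grouped-query attention) have a smaller `n_kv_heads` and a proportionally smaller cache.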
Another reason not to save enriched embeddings across passes: you would need to recompute everything each time, and you would run into an RNN-style dilution problem.