ML//Transformer//attention//causal masking

2026-03-06

Set the entries above the diagonal of the attention score matrix (the pre-softmax logits) to -∞, so their softmax weights become exactly 0 and tokens can't attend to future positions.
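A minimal NumPy sketch of the masking step, assuming a single head and a square (T, T) score matrix. The function name causal_attention_weights and the random scores are illustrative, not taken from any particular library.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to a (T, T) score matrix, then softmax each row."""
    T = scores.shape[0]
    # True strictly above the diagonal: row i looking at column j > i (the future).
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax; exp(-inf) evaluates to exactly 0.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(weights, 3))  # lower-triangular, each row sums to 1
```

The diagonal stays unmasked, so every token can still attend to itself, and the first token attends only to itself.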

Effect: attention never carries information from later tokens back to earlier ones, preserving the autoregressive property.

The new token CAN attend to all previous tokens. Previous tokens CANNOT attend to the new one.

Enables the KV cache: since previous tokens' representations are "closed" (a new token can't change them), their K and V can be computed once and reused at every later step.
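A small check of the KV cache claim, as a sketch under simplifying assumptions: one attention layer, one head, random weights, no positional encoding. All names (Wq, Wk, Wv, k_cache, v_cache) are illustrative. It compares full causal attention over the whole sequence with token-by-token decoding that only appends to a cache.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
# Scaled random projections so the scores stay in a moderate range.
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

# Reference: full causal attention over the whole sequence at once.
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
future = np.triu(np.ones((T, T), dtype=bool), k=1)
full_out = softmax(np.where(future, -np.inf, scores)) @ v

# Incremental: process one token at a time, appending its K and V to a cache.
k_cache, v_cache, steps = [], [], []
for t in range(T):
    k_cache.append(x[t] @ Wk)   # computed once, never revisited
    v_cache.append(x[t] @ Wv)
    q_t = x[t] @ Wq             # only the newest token needs a query
    K, V = np.stack(k_cache), np.stack(v_cache)
    steps.append(softmax(q_t @ K.T / np.sqrt(d)) @ V)
cached_out = np.stack(steps)

print(np.allclose(full_out, cached_out))
```

No explicit mask is needed in the incremental loop: at step t the cache only holds positions 0..t, so the future is simply absent.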

Without causal masking (encoder-style): all tokens see all others → early positions get richer context from both sides, but the model can't generate token by token.

Another reason not to cache bidirectionally enriched embeddings: without masking, every new token changes the representations of all earlier tokens, so you'd have to RECOMPUTE EVERYTHING each time a new token arrives.
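To make the recomputation point concrete, here is a rough sketch that appends one token to a prefix and checks whether the prefix outputs change. It assumes a single attention layer with random weights; self_attention and the variable names are hypothetical helpers for illustration.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv, causal):
    """Single-head self-attention over x of shape (T, d)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
prefix = rng.normal(size=(4, d))
extended = np.vstack([prefix, rng.normal(size=(1, d))])  # one new token arrives

# Do the first 4 outputs stay the same after the new token is appended?
causal_unchanged = np.allclose(
    self_attention(prefix, Wq, Wk, Wv, causal=True),
    self_attention(extended, Wq, Wk, Wv, causal=True)[:4],
)
bidirectional_unchanged = np.allclose(
    self_attention(prefix, Wq, Wk, Wv, causal=False),
    self_attention(extended, Wq, Wk, Wv, causal=False)[:4],
)
print(causal_unchanged, bidirectional_unchanged)
```

With the mask, the prefix outputs should come out identical, so they are safe to reuse. Without it, every prefix row now also attends to the new token, so its output shifts and nothing computed earlier can be kept.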

BERT has no causal masking → bidirectional understanding. GPT has causal masking → autoregressive generation. Encoder-decoder cross-attention has no causal masking either (the decoder sees all encoder tokens freely), though the decoder's own self-attention is still causally masked.