ML//Transformer//tokenizer//special tokens
Tokens with structural meaning, not linguistic meaning:
Tokens with structural meaning, not linguistic meaning:
[CLS] — BERT's aggregation token. Sits at position 0, learns to capture global sentence meaning during training. Used for classification tasks via downstream layers
[MASK] — placeholder for masked language modeling. The model predicts what goes here using bidirectional context.
[BOS] — beginning of sequence. In cross-attention translation, the Q from [BOS] is the "cold start" — how do you attend to the source language when you haven't generated anything yet?
[EOS] — end of sequence. When the model generates this token, inference stops.
GPT doesn't use [CLS] — it simply takes the last token's representation (already optimized to carry full sequence meaning via causal masking)