ML//Transformer//tokenizer//BPE
Byte Pair Encoding: start with individual characters, iteratively merge the most frequent adjacent pair until target vocabulary size.
Byte Pair Encoding: start with individual characters, iteratively merge the most frequent adjacent pair until target vocabulary size.
Simple greedy algorithm, surprisingly effective compression.
GPT family uses BPE. The vocabulary is a frozen artifact of the training data.
"running", "runner", "runs" share subword tokens → the model gets morphological hints for free, unlike char-level tokenization.
Current trend: BPE with large vocabularies (~100K tokens). Balances sequence length against embedding sparsity.