ML//Transformer//tokenizer//BPE

2019-01-20

Byte Pair Encoding: start with individual characters, then iteratively merge the most frequent adjacent pair of symbols until the vocabulary reaches the target size.

Simple greedy algorithm, surprisingly effective compression.
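
A minimal sketch of the merge loop, to make the greedy step concrete. Assumptions: input is a word → count dict (whitespace pre-split), ties go to the first pair encountered, and the names `learn_bpe`, `word_counts`, `num_merges` are made up for this note. Real tokenizers add byte-level handling and end-of-word markers, which are skipped here.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges BPE merge rules from a {word: count} dict."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        # Greedy choice: most frequent pair (ties: first one encountered).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Toy corpus (hypothetical counts).
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

Here `num_merges` stands in for the target vocabulary size: the final vocabulary is the initial character set plus one new symbol per merge.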

GPT family uses BPE. The vocabulary is a frozen artifact of the training data.

"running", "runner", "runs" can share subword tokens (depending on the learned merges) → the model gets morphological hints for free, unlike char-level tokenization.
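
A small sketch of how that sharing could come about when merges are applied to new words. The merge list below is hypothetical (hand-picked, not taken from any real tokenizer), and `apply_bpe` is an illustrative name.

```python
def apply_bpe(word, merges):
    """Segment a word by applying merge rules in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it occurs, scanning left to right.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges that build up the stem "run".
toy_merges = [("r", "u"), ("ru", "n")]

for w in ["running", "runner", "runs"]:
    print(w, apply_bpe(w, toy_merges))
# running ['run', 'n', 'i', 'n', 'g']
# runner ['run', 'n', 'e', 'r']
# runs ['run', 's']
```

With these merges all three words start with the same "run" token. A real vocabulary may split them differently, or keep a frequent word as a single token.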

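Current trend: BPE with large vocabularies (~100K tokens). Balances sequence length against embedding sparsity: a larger vocabulary gives shorter sequences, but more rare tokens whose embeddings see few updates.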