ML//Transformer//tokenizer

2018-09-10

Breaks text into subword tokens: "unbelievable" → "un" + "believ" + "able". More generally: the input is split into pieces called tokens (for other modalities, patches of an image or chunks of audio).

Two separate steps: tokenizer (algorithm, not a NN) splits text → assigns token IDs. Embedding matrix converts IDs → vectors. "gato" → token_id 4821 → vector[512 dims].
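A minimal sketch of the two steps, assuming a toy vocabulary and a greedy longest-match splitter (closer in spirit to WordPiece inference than to real BPE). The names `VOCAB`, `tokenize`, and `embedding`, the token IDs, and the 8-dim vectors are all made up for illustration; a real model learns the embedding rows.

```python
import random

# Hypothetical toy vocabulary; IDs are arbitrary, not from any real tokenizer.
VOCAB = {"un": 0, "believ": 1, "able": 2, "gato": 3, "<unk>": 4}
EMBED_DIM = 8  # kept small for readability; the note's example uses 512


def tokenize(text, vocab):
    """Step 1: greedy longest-match split into subwords, returning token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrink until a vocab entry matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # Nothing matched: emit the unknown token and skip one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


# Step 2: embedding matrix, one row per token ID (random here, learned in practice).
rng = random.Random(0)
embedding = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

ids = tokenize("unbelievable", VOCAB)   # [0, 1, 2]
vectors = [embedding[t] for t in ids]   # three vectors of EMBED_DIM floats
```

The tokenizer is a plain lookup algorithm with nothing trainable in it; only the `embedding` table would receive gradients.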

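Vocabulary size tradeoff: small (8K) = more tokens per sentence, so longer sequences. Large (100K) = shorter sequences, but sparser: rare tokens are seen less often in training, and the embedding matrix grows with the vocabulary.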
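One side of the tradeoff is easy to put in numbers: the embedding matrix has one row per vocabulary entry. A rough sketch, assuming the 512-dim vectors from the example above; `embedding_params` is a hypothetical helper, not a library function.

```python
EMBED_DIM = 512  # dimension borrowed from the note's "vector[512 dims]" example


def embedding_params(vocab_size, dim=EMBED_DIM):
    """Number of weights in an embedding matrix of shape (vocab_size, dim)."""
    return vocab_size * dim


small = embedding_params(8_000)     # 4,096,000 weights
large = embedding_params(100_000)   # 51,200,000 weights
```

The larger vocabulary costs 12.5x more embedding weights here, which is what buys the shorter sequences.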

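BPE is the most common algorithm. WordPiece (BERT) is an alternative. SentencePiece is a library rather than a separate algorithm: it implements BPE and unigram LM, working directly on raw text.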
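A minimal sketch of BPE training, assuming whitespace-separated words and character-level starting symbols: repeatedly find the most frequent adjacent pair and merge it into one symbol. The function names and the toy corpus are illustrative; production tokenizers add byte-level handling, end-of-word markers, and much faster data structures.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", 2)
# merges == [("l", "o"), ("lo", "w")]: "low" becomes a single token
```

The learned `merges` list is the tokenizer: encoding new text means applying the same merges in the same order.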

Almost never touched in fine-tuning: changing it means changing the embedding matrix, basically starting over.

What if chars instead of subwords? "transformer" = 2-3 BPE tokens vs 11 chars, so a char-level model fits several times less text per context window (about 4-5x less for this word). The model must also learn morphology from scratch ("running"/"runner"/"runs" lose their shared subwords). But: perfect typo handling, and no out-of-vocabulary problem for new languages or code. The trade-off favors BPE at current scale.
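The arithmetic behind that comparison, as a rough sketch. The subword splits and the 2048-token window are hypothetical, chosen only to illustrate; a real BPE vocabulary may split these words differently.

```python
CONTEXT_WINDOW = 2048  # hypothetical window size, in tokens

word = "transformer"
bpe_tokens = ["transform", "er"]   # hypothetical subword split
char_tokens = list(word)           # char-level: one token per character

ratio = len(char_tokens) / len(bpe_tokens)  # 11 / 2 = 5.5


def words_that_fit(tokens_per_word, window=CONTEXT_WINDOW):
    """How many copies of a word fit in the window at a given token cost."""
    return window // tokens_per_word


fit_bpe = words_that_fit(len(bpe_tokens))     # 1024
fit_char = words_that_fit(len(char_tokens))   # 186

# Hypothetical splits showing the shared stem that char-level tokens lose.
splits = {
    "running": ["run", "ning"],
    "runner": ["run", "ner"],
    "runs": ["run", "s"],
}
shared = set.intersection(*(set(s) for s in splits.values()))  # {"run"}
```

With subwords, all three forms reuse the same "run" token and its embedding; with characters, the model only sees r, u, n and has to learn the stem itself.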