ML//Transformer//tokenizer//vocabulary

2026-03-01

The finite set of tokens the model knows, every possible subword unit it can produce or consume.

The finite set of tokens the model knows, every possible subword unit it can produce or consume.

Size tradeoff: small vocab (8K) = longer sequences, more context spent per sentence. Large vocab (100K) = shorter sequences, sparser embeddings.

BPE builds the vocabulary greedily from character-level up: merge most frequent pairs until you hit target size.

The vocabulary is a frozen artifact of the training data. Fixed at pre-training time, never touched during fine-tuning

Determines the shape of the embedding matrix: rows = vocab size, columns = embedding dimension.