ML//Transformer//tokenizer//embedding matrix

2026-03-02

Converts token IDs into dense vectors: token_id 4821 → vector[512 dims]. These are trainable parameters.
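
A minimal sketch of the lookup in plain Python. The vocabulary size (5,000), the random initialisation and the helper name `embed` are assumptions for illustration; only the 512 dimensions and the ID 4821 come from this note.

```python
import random

VOCAB_SIZE = 5_000    # assumed for illustration
EMBEDDING_DIM = 512   # from the note

rng = random.Random(0)
# One row per token ID. Random here; in a real model these values are learned.
embedding_matrix = [
    [rng.gauss(0.0, 0.02) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Return the matrix row for each token ID (hypothetical helper)."""
    return [embedding_matrix[t] for t in token_ids]

vectors = embed([4821, 17, 4821])
print(len(vectors), len(vectors[0]))  # 3 512
```

The same ID always returns the same row, regardless of where it appears in the sequence.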

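With weight tying, the same matrix, transposed, converts the last layer's output back into a score (logit) for each token in the vocabulary: the dot product of the output with that token's embedding. Not every model ties these weights; some learn a separate output matrix.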
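
A toy sketch of the tied output projection, assuming a 4-token vocabulary and 3 dimensions. All values and the helper names `output_logits` and `softmax` are made up for illustration.

```python
import math

# Toy embedding matrix: 4 tokens, 3 dimensions (illustrative values).
embedding_matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

def output_logits(hidden, matrix):
    """Dot product of the hidden state with every token's row (hidden @ matrix.T)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in matrix]

def softmax(logits):
    """Turn logits into probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [0.1, 0.9, 0.2]  # stand-in for the last layer's output at one position
logits = output_logits(hidden, embedding_matrix)
probs = softmax(logits)
predicted = max(range(len(logits)), key=logits.__getitem__)
print(predicted)  # 1
```

The output vector scores highest against the row it points along most, so token 1 wins here.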

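Size = vocab_size × embedding_dim. No matter how big the model, the tokenizer's vocabulary and the embedding dimension bottleneck comprehension: every input reaches the model only as one of vocab_size rows, each embedding_dim numbers wide.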
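
The size formula as arithmetic. The vocabulary size of 50,000 and the 4 bytes per parameter (32-bit floats) are assumed for illustration; 512 is the dimension used earlier in this note.

```python
def embedding_param_count(vocab_size, embedding_dim):
    """Number of trainable values in the embedding matrix."""
    return vocab_size * embedding_dim

params = embedding_param_count(vocab_size=50_000, embedding_dim=512)
print(params)  # 25600000

# Memory footprint at 4 bytes per parameter (assumed fp32 storage).
size_mib = params * 4 / 2**20
print(size_mib)  # 97.65625
```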

Not a neural network in itself. It's a lookup table that maps discrete IDs to continuous vectors, with no nonlinearity involved.
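
One way to see the lookup-table view: indexing a row gives the same result as multiplying a one-hot vector by the matrix, which is a linear layer with no bias and no activation. A small sketch with made-up values and hypothetical helper names:

```python
# Toy matrix: 3 tokens, 2 dimensions (illustrative values).
embedding_matrix = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def lookup(token_id):
    """Direct row lookup."""
    return embedding_matrix[token_id]

def one_hot_matmul(token_id):
    """Same result via a one-hot vector times the matrix."""
    vocab_size = len(embedding_matrix)
    dim = len(embedding_matrix[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * embedding_matrix[i][j] for i in range(vocab_size))
        for j in range(dim)
    ]

print(lookup(2) == one_hot_matmul(2))  # True
```

The lookup is simply the cheap way to compute that product, since all but one term is zero.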

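The embeddings it produces are the raw starting point: every forward pass starts again from these base vectors. Context added by later layers is never written back into the matrix, which changes only through training.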