ML//Transformer//tokenizer//embedding matrix

Converts token IDs into dense vectors: token_id 4821 → vector[512 dims]. These are trainable parameters.
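
A minimal sketch of the lookup, assuming NumPy and a randomly initialised matrix. The vocabulary size, the init scale, and the names `embedding_matrix` and `token_ids` are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

# Hypothetical sizes for illustration; real vocabularies are often larger.
vocab_size, embedding_dim = 10_000, 512

rng = np.random.default_rng(0)
# One row per token ID. During training these values would be updated
# by gradient descent like any other parameter.
embedding_matrix = rng.normal(0.0, 0.02, size=(vocab_size, embedding_dim))

token_ids = np.array([4821, 17, 4821])
vectors = embedding_matrix[token_ids]  # integer indexing selects one row per ID

print(vectors.shape)  # (3, 512)
```

The same ID always selects the same row, so the first and third vectors here are identical.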


In models that use weight tying, the same matrix, transposed, converts the last layer's output back into a score for each token in the vocabulary (its similarity with that token's embedding).
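
A rough sketch of the tied output projection, assuming NumPy. The sizes, the random `hidden` vector standing in for the last layer's output, and the `softmax` helper are all assumptions made for illustration:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical small sizes so the example runs instantly.
vocab_size, embedding_dim = 1000, 64
rng = np.random.default_rng(0)

# E doubles as the input embedding table and the output projection.
E = rng.normal(0.0, 0.02, size=(vocab_size, embedding_dim))

# Stand-in for the final layer's output at one position.
hidden = rng.normal(size=(embedding_dim,))

# Tied output projection: one dot product per vocabulary row.
logits = hidden @ E.T          # shape (vocab_size,)
probs = softmax(logits)        # distribution over the vocabulary

print(logits.shape)  # (1000,)
```

Each logit is the dot product between `hidden` and one token's embedding row, which is where the "similarity with each token" reading comes from.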

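Size = vocab_size × embedding_dim. No matter how big the rest of the model is, each token enters it as a single vector of embedding_dim numbers, so the tokenizer's vocabulary and the embedding dimension bottleneck comprehension: they cap how much the input representation can carry.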
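
To make the size formula concrete, a small worked example. The vocabulary size of 50,000 and the 4 bytes per parameter (float32) are assumptions for illustration, and `embedding_matrix_size` is a hypothetical helper:

```python
def embedding_matrix_size(vocab_size: int, embedding_dim: int,
                          bytes_per_param: int = 4) -> tuple[int, int]:
    """Return (parameter count, size in bytes) of an embedding matrix."""
    params = vocab_size * embedding_dim
    return params, params * bytes_per_param

# Assumed values: 50,000-token vocabulary, 512 dims, float32 storage.
params, n_bytes = embedding_matrix_size(50_000, 512)

print(params)         # 25600000
print(n_bytes / 1e6)  # 102.4 (megabytes)
```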

Not a neural network — it's a lookup table that maps discrete IDs to continuous vectors.
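
One way to check the lookup-table view: indexing row `token_id` gives the same result as multiplying a one-hot vector by the matrix, so the lookup is just a shortcut for that product. A tiny sketch assuming NumPy, with a toy matrix whose values are chosen only to be easy to read:

```python
import numpy as np

# Toy matrix: 6 tokens, 4 dims, filled with 0..23 for readability.
vocab_size, embedding_dim = 6, 4
E = np.arange(vocab_size * embedding_dim, dtype=float).reshape(vocab_size, embedding_dim)

token_id = 3

# Path 1: plain table lookup.
via_lookup = E[token_id]

# Path 2: one-hot vector times the matrix.
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
via_matmul = one_hot @ E

print(via_lookup)                             # [12. 13. 14. 15.]
print(np.array_equal(via_lookup, via_matmul)) # True
```

The lookup avoids building the one-hot vector and skips the multiplications by zero, which matters when the vocabulary has tens of thousands of rows.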

The embeddings it produces are the raw, context-independent starting point: every forward pass begins from these same base vectors, and any context has to be added by the layers that follow.