ML//neural network//softmax

2026-03-08

Converts a vector of raw scores (logits) into probabilities that sum to 1: softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)
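A minimal sketch of the formula in plain Python. The function name `softmax` and the example scores are illustrative. Subtracting the maximum before exponentiating is a common numerical-stability trick not mentioned above; it leaves the result unchanged because the shift cancels in the ratio.

```python
import math

def softmax(logits):
    """Map a list of raw scores to probabilities that sum to 1."""
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example scores are made up for illustration.
probs = softmax([2.0, 1.0, 0.1])
```

Larger scores get larger probabilities, and the ordering of the inputs is preserved.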

Not an element-wise activation function like ReLU: each output depends on every input, so it acts as a normalization that creates a probability distribution.

Two key uses in transformers:

In attention: applied per row of the QKᵀ score matrix (one row per query token) → how much that token attends to each token in the sequence.

In output: applied to logits over vocabulary → probability of each possible next token.
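A rough sketch of the attention use, assuming Q and K are lists of row vectors (one per token), so that each row of QKᵀ holds one query's scores against every key. The helper `attention_weights` and the toy vectors are hypothetical. Dividing by √d_k follows standard scaled dot-product attention and is an assumption added here, not something stated in this note.

```python
import math

def softmax(row):
    m = max(row)  # shift for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Row-wise softmax over scaled dot-product scores (hypothetical helper)."""
    scale = math.sqrt(len(K[0]))  # sqrt(d_k), assumed scaling
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        for q in Q
    ]
    # One softmax per row: each query gets its own distribution over keys.
    return [softmax(row) for row in scores]

# Toy data: 3 query tokens, 2 key tokens, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
W = attention_weights(Q, K)
```

Each row of `W` sums to 1. The output-layer use is the same `softmax` applied once to a single vector of vocabulary logits.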

Temperature controls sharpness: the logits are divided by T before softmax, softmax(x/T). T<1 makes the distribution peakier (more confident), T>1 flattens it (more random).
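A small sketch of temperature scaling, assuming the usual form where the logits are divided by T before softmax. The function name `softmax_with_temperature` and the example logits are illustrative.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; T must be positive."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / T for x in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits are made up for illustration.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, T=0.5)  # peakier
base = softmax_with_temperature(logits, T=1.0)   # plain softmax
flat = softmax_with_temperature(logits, T=2.0)   # flatter
```

The top score's probability grows as T drops below 1 and shrinks as T rises above 1, while the ranking of the outcomes stays the same.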

The inputs to softmax at the output layer are called logits: raw pre-normalization scores.