ML//neural network//softmax

Converts a vector of raw scores (logits) into probabilities that sum to 1: softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)
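
A minimal NumPy sketch of the formula above. The function name `softmax` and the example logits are illustrative choices, not from any particular library. Subtracting the maximum before exponentiating is a common stability trick: it cancels in the ratio, so the result is unchanged, but it keeps exp() from overflowing on large logits.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, computed in a numerically stable way."""
    x = np.asarray(x, dtype=float)
    # Shift so the largest entry is 0; the shift cancels in the ratio below.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # example raw scores
probs = softmax(logits)
print(probs, probs.sum())            # all entries positive, total is 1
```

The largest logit gets the largest probability, and the ordering of the inputs is preserved in the outputs.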

Not an element-wise activation function: it normalizes a whole vector into a probability distribution, so every output depends on every input.

Two key uses in transformers:

In attention: applied to each row of the scaled scores QKᵀ/√dₖ (one row per query token) → how much that token attends to every other token.

In output: applied to logits over vocabulary → probability of each possible next token.
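
Both uses can be sketched in a few lines of NumPy. This assumes the common layout in which each row of Q and K is one token, so normalization runs along each row of the score matrix. The sizes, the random inputs, and the 10-entry vocabulary are made up for illustration, and masking and multiple heads are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8                      # illustrative sizes
Q = rng.normal(size=(n_tokens, d_k))      # one query vector per row
K = rng.normal(size=(n_tokens, d_k))      # one key vector per row

# Use 1: attention. Row i holds the scores of query i against every key.
scores = Q @ K.T / np.sqrt(d_k)           # shape (n_tokens, n_tokens)
attn = softmax(scores, axis=-1)           # each row becomes a distribution

# Use 2: output layer. One logit per vocabulary entry (hypothetical size 10).
vocab_logits = rng.normal(size=10)
next_token_probs = softmax(vocab_logits)

print(attn.sum(axis=-1))                  # one total per query row
print(next_token_probs.sum())
```

Each row of `attn` is a separate distribution over keys, while `next_token_probs` is a single distribution over the vocabulary.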

Temperature controls sharpness: logits are divided by T before normalizing, i.e. softmax(x/T). T<1 makes the distribution peakier (more confident), T>1 flattens it (more random).
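
The effect of temperature can be checked with a short sketch. The helper name `softmax_with_temperature` and the example logits are assumptions for illustration; the only change from plain softmax is dividing the logits by T first.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T. T must be positive."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / T
    scaled = scaled - scaled.max()        # stability shift, cancels in the ratio
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]                  # example raw scores
sharp = softmax_with_temperature(logits, T=0.5)   # T < 1: peakier
base = softmax_with_temperature(logits, T=1.0)    # plain softmax
flat = softmax_with_temperature(logits, T=2.0)    # T > 1: flatter

for name, p in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, np.round(p, 3))
```

The top probability shrinks as T grows. As T approaches 0 the distribution approaches a one-hot vector on the largest logit, and as T grows large it approaches uniform.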

The inputs to softmax at the output layer are called logits — raw pre-normalization scores.