ML//neural network//sparse autoencoder

2026-03-05

Tool for extracting interpretable features from superposed neural activations.

Architecture: an encoder expands activations into a much larger (overcomplete) basis, a sparsity constraint forces most dimensions to zero, and a decoder reconstructs the original activations from that sparse code.

The non-zero dimensions in the expanded space correspond to active features, ideally monosemantic (each one responding to a single concept).
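
A minimal sketch of that forward pass, in plain Python so it runs without dependencies. Everything named here is an assumption for illustration: the sizes `d_model` and `d_dict`, the random untrained weights, and in particular the top-k rule used as the sparsity constraint (an L1 penalty during training is another common choice). A real implementation would use a tensor library and learned weights.

```python
import random

def make_sae(d_model, d_dict, seed=0):
    """Build random (untrained) weights for a toy sparse autoencoder."""
    rng = random.Random(seed)
    # Encoder: one row per dictionary feature, d_dict >> d_model (overcomplete).
    W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
    b_enc = [0.0] * d_dict
    # Decoder initialised as the transpose of the encoder (an assumption).
    W_dec = [[W_enc[j][i] for j in range(d_dict)] for i in range(d_model)]
    b_dec = [0.0] * d_model
    return W_enc, b_enc, W_dec, b_dec

def encode(x, W_enc, b_enc, k):
    """Expand x into d_dict dimensions, then keep at most k non-zero entries."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    act = [max(0.0, p) for p in pre]  # ReLU
    # Top-k sparsity: zero everything except the k largest activations.
    keep = set(sorted(range(len(act)), key=lambda j: act[j], reverse=True)[:k])
    return [a if j in keep else 0.0 for j, a in enumerate(act)]

def decode(f, W_dec, b_dec):
    """Reconstruct the original activation from the sparse code f."""
    return [sum(w * fj for w, fj in zip(row, f)) + b
            for row, b in zip(W_dec, b_dec)]

if __name__ == "__main__":
    d_model, d_dict, k = 8, 64, 4
    W_enc, b_enc, W_dec, b_dec = make_sae(d_model, d_dict)
    rng = random.Random(1)
    x = [rng.gauss(0, 1) for _ in range(d_model)]
    f = encode(x, W_enc, b_enc, k)
    x_hat = decode(f, W_dec, b_dec)
    active = [j for j, a in enumerate(f) if a != 0.0]
    print("active feature indices:", active)
    print("reconstruction length:", len(x_hat))
```

The indices in `active` are the "active features" for this input. With untrained weights they mean nothing; training is what would make them interpretable.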

"Dictionary learning": the autoencoder learns a dictionary of feature directions, and each activation vector is a sparse combination of dictionary entries.
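
One way to write this down, as a sketch with assumed notation (none of these symbols come from the note itself): x is the activation vector, d_j is the j-th dictionary entry (a decoder column), f_j(x) >= 0 is the encoder's activation for feature j, m is the dictionary size, and lambda weights an L1 sparsity penalty, which is one common choice among several.

```latex
\hat{x} = b_{\mathrm{dec}} + \sum_{j=1}^{m} f_j(x)\, d_j,
\qquad
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1
```

The first term rewards faithful reconstruction; the second pushes most f_j(x) to zero so that only a few dictionary entries explain any given activation.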

Used in mechanistic interpretability research to decompose the activations inside transformer layers into such features.