ML//neural network//sparse autoencoder

Tool for extracting interpretable features from superposed neural activations.


Architecture: an encoder maps activations into a much larger (overcomplete) basis, a sparsity constraint forces most of those dimensions to zero, and a decoder reconstructs the original activations from the few that remain.
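A minimal sketch of that encoder / sparsity / decoder pipeline, in NumPy. The sizes, the ReLU non-linearity, the L1 penalty as the sparsity constraint, and all names (`sae_forward`, `W_enc`, `l1_coeff`, ...) are assumptions chosen for illustration; the note itself does not fix any of them. The weights are random and untrained, so the features here are not yet sparse; training would minimise the returned loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 16-dim activations, 64-dim (4x overcomplete) feature basis.
d_model, d_dict = 16, 64

# Randomly initialised, untrained parameters (illustration only).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)


def sae_forward(x, l1_coeff=1e-3):
    """Return (features, reconstruction, loss) for a batch of activations x."""
    # Encoder: expand into the overcomplete basis; ReLU keeps features >= 0.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decoder: linear map back to the original activation space.
    x_hat = f @ W_dec + b_dec
    # Reconstruction error plus an L1 penalty that pushes features toward zero.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return f, x_hat, recon + l1_coeff * sparsity


x = rng.normal(size=(8, d_model))  # stand-in for a batch of model activations
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)
```

The L1 term is only one way to impose sparsity; keeping just the largest few feature activations per input is another.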

The non-zero dimensions in the expanded space correspond to active features, ideally monosemantic (each one responding to a single interpretable concept).

"Dictionary learning": the autoencoder learns a dictionary of feature directions, and each activation vector is approximated as a sparse linear combination of dictionary entries.
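The dictionary view can be made concrete with a small NumPy sketch. It builds a random (not learned) dictionary and a hand-made sparse code, then checks that multiplying by the full dictionary gives the same vector as summing only the active entries. The sizes and the names `dictionary`, `code`, and `active` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dim activations, 32 dictionary entries, 3 active at once.
d_model, d_dict, k = 8, 32, 3

# Dictionary: one row per feature direction, normalised to unit length.
dictionary = rng.normal(size=(d_dict, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Sparse code: only k of the d_dict coefficients are non-zero.
code = np.zeros(d_dict)
active = rng.choice(d_dict, size=k, replace=False)
code[active] = rng.uniform(0.5, 2.0, size=k)

# Dense form: matrix product over the whole dictionary.
x_dense = code @ dictionary

# Sparse form: weighted sum over the active entries only.
x_sparse = sum(code[i] * dictionary[i] for i in active)

print(np.allclose(x_dense, x_sparse))
```

In a trained sparse autoencoder the rows of the decoder weight matrix play the role of `dictionary`, and the encoder output plays the role of `code`.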

Used in mechanistic interpretability research to decompose the activations inside transformer layers (e.g. the residual stream or MLP outputs) into such features.