ML//neural network//sparse autoencoder
Tool for extracting interpretable features from superposed neural activations.
Architecture: encoder expands activations into a much larger (overcomplete) basis, sparsity constraint forces most dimensions to zero, decoder reconstructs.
The non-zero dimensions in the expanded space correspond to active features, ideally monosemantic (each one tracking a single concept).
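A minimal sketch of the encode / sparsify / decode pipeline described above, in NumPy. The layer sizes, the ReLU nonlinearity, the L1 penalty, and all names (`W_enc`, `W_dec`, `l1_coeff`, `sae_loss`) are illustrative assumptions, not details from these notes; the weights are random, so this shows only the shapes and the loss, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16-dim activation expanded into a 64-entry dictionary.
d_model, d_dict = 16, 64

# Randomly initialised parameters (a real SAE would learn these by gradient descent).
W_enc = rng.normal(0.0, 0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0.0, 0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map activations into the overcomplete basis.

    ReLU zeroes negative pre-activations; real sparsity only emerges
    once the model is trained with the sparsity penalty below.
    """
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Reconstruct activations as a weighted sum of dictionary rows."""
    return f @ W_dec + b_dec


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (assumed form)."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity, f, x_hat


# A batch of 8 stand-in "activations" (random data, for shape checking only).
x = rng.normal(size=(8, d_model))
loss, f, x_hat = sae_loss(x)
```

Here `f` holds the feature activations (one row per input, one column per dictionary entry) and `x_hat` is the reconstruction. Raising `l1_coeff` trades reconstruction accuracy for fewer active features.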
"Dictionary learning" — the autoencoder learns a dictionary of features, activations are sparse combinations of dictionary entries.
Used in mechanistic interpretability research to peer inside transformer layers.
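The dictionary-learning view can be written out as follows. This is one common formulation, not one stated in these notes: the symbols ($x$ for an activation vector, $f_i$ for feature activations, $d_i$ for dictionary entries, $b$ for a bias, $\lambda$ for the sparsity weight) and the use of an L1 penalty as the sparsity constraint are assumptions.

```latex
% Activation as a sparse combination of m dictionary entries, with m >> dim(x)
x \approx \hat{x} = b + \sum_{i=1}^{m} f_i \, d_i ,
\qquad f_i \ge 0, \quad \text{most } f_i = 0

% Training objective: reconstruction error plus a sparsity penalty
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \, \lVert f \rVert_1
```

The first term rewards faithful reconstruction; the second pushes most $f_i$ to zero, so that each input is explained by only a few dictionary entries.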