ML//neural network//monosemanticity

2026-03-06

The interpretability holy grail: one neuron, one meaning. No double shifts, no moonlighting, just clean signal. The opposite of superposition.

"Towards Monosemanticity: Decomposing Language Models with Dictionary Learning" (Anthropic, 2023), the landmark paper.

The paper uses sparse autoencoders to decompose a model's activations into a set of interpretable features that is larger than the number of neurons.
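A minimal sketch of what a sparse autoencoder computes, assuming the common recipe of a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The sizes, the random weights, the function names, and the l1_coeff value are illustrative assumptions, not the paper's settings, and no training loop is shown.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Forward pass of a toy sparse autoencoder.

    x:     (batch, d_model) activations taken from the model
    W_enc: (d_model, n_features), with n_features > d_model
    W_dec: (n_features, d_model), one row per feature direction
    """
    # Encode: ReLU clips negative pre-activations to zero,
    # so feature activations are non-negative.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decode: rebuild the activation as a weighted sum of feature directions.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: 8 "neurons" expanded into 32 candidate features.
rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4
x = rng.normal(size=(batch, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
print(f.shape, x_hat.shape, float(loss))
```

Training would minimize this loss over many activation samples. The L1 term is what pushes each input to be explained by only a few active features.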

The insight: the model DOES have clean features. They're just encoded as directions in activation space, stored in superposition, not as individual neurons.
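A toy illustration of "features as directions", under made-up assumptions: three hypothetical features stored in a two-neuron layer as unit vectors 120 degrees apart. It is not taken from the paper; it only shows why reading single neurons looks messy while reading along a feature's direction looks clean.

```python
import numpy as np

# Three hypothetical features packed into 2 neurons as unit directions.
angles = np.deg2rad([90.0, 210.0, 330.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3 features, 2 neurons)

# Switch on feature 0 only.
feature_values = np.array([1.0, 0.0, 0.0])
activation = feature_values @ directions  # what the 2 neurons show

# Neuron view: column j holds neuron j's weight on each feature.
# Neuron 1 has a non-zero weight on all three features, so it is polysemantic.
neuron_1_weights = directions[:, 1]

# Direction view: project the activation onto each feature direction.
readout = directions @ activation    # approximately [1.0, -0.5, -0.5]
recovered = np.maximum(0.0, readout)  # ReLU removes the negative interference

print(neuron_1_weights)
print(readout)
print(recovered)
```

With only one feature active, the interference on the other two directions is negative, so a ReLU zeroes it out and the active feature is recovered. When several features fire at once the interference no longer cancels this neatly, which is why superposition relies on features being sparse.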

Key goal of mechanistic interpretability: if we can extract monosemantic features, we can understand what the model actually knows.