ML//neural network//monosemanticity
The interpretability holy grail: one neuron, one meaning. No double shifts, no moonlighting, just clean signal. The opposite of polysemanticity, where superposition forces a single neuron to serve several unrelated features.
"Towards Monosemanticity: Decomposing Language Models with Dictionary Learning" (Anthropic, 2023) — the landmark paper.
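Uses sparse autoencoders to decompose a layer's activations into a larger (overcomplete) set of interpretable features: more features than the layer has neurons, with only a few active on any given input.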
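A minimal sketch of what a sparse autoencoder computes, assuming a plain ReLU encoder, a linear decoder, and an L1 sparsity penalty. The function names (`sae_forward`, `sae_loss`), the dimensions, and `l1_coeff` are all made up for illustration. The weights are random and nothing is trained, so this shows the shape of the computation only, not the paper's exact setup.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """x: (batch, d_model). Returns (features, reconstruction)."""
    # Encoder: ReLU keeps feature activations non-negative.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # (batch, d_features)
    # Decoder: rebuild the activation as a weighted sum of feature directions.
    x_hat = f @ W_dec + b_dec                         # (batch, d_model)
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Illustrative sizes (assumptions): 8 "neurons", 32 dictionary features.
rng = np.random.default_rng(0)
d_model, d_features, batch = 8, 32, 4
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(batch, d_model))  # stand-in for real model activations
features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat)
```

The point to notice is `d_features > d_model`: the dictionary is wider than the layer it reads from, and the L1 term is what keeps most features silent on any one input.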
The insight: the model DOES have clean features — they're just encoded as directions in superposed space, not as individual neurons.
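A toy picture of "features as directions", assuming three features squeezed into two neurons, with the directions placed 120 degrees apart. Every number here is chosen for illustration. The clean readout works because only one feature is active at a time; with several active at once, interference between directions would show up.

```python
import numpy as np

# Three feature directions in a 2-neuron space, 120 degrees apart (assumption).
angles = np.deg2rad([0.0, 120.0, 240.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3, 2)

# Only feature 1 is active.
feature_values = np.array([0.0, 1.0, 0.0])

# What the two neurons actually show: a mix, neither neuron "is" feature 1.
activation = feature_values @ directions  # (2,)

# Reading along each feature direction, then ReLU, recovers the sparse code.
readout = np.maximum(0.0, directions @ activation)  # (3,)
```

Both neurons fire for this input, so neither is monosemantic, yet projecting onto the three directions gives back a one-hot feature vector.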
Key goal of mechanistic interpretability: if we can extract monosemantic features, we can understand what the model actually knows.