ML//neural network//monosemanticity

The interpretability holy grail: one neuron, one meaning. No double shifts, no moonlighting, just clean signal. The opposite of polysemanticity, where superposition leaves a single neuron responding to several unrelated features.


"Towards Monosemanticity: Decomposing Language Models with Dictionary Learning" (Anthropic, 2023) — the landmark paper.

Uses sparse autoencoders to decompose a layer's activations into a larger (overcomplete) set of sparse, interpretable features.
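
Rough sketch of the sparse autoencoder idea, to make the shapes concrete. This is not the paper's exact setup: the sizes, the sparsity weight `l1_coeff`, and the random stand-in activations are all assumptions for illustration, and training is left out. It only shows the forward pass (ReLU encoder into a wider dictionary, linear decoder back) and the usual reconstruction-plus-L1 objective.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 8, 32   # hypothetical sizes: dictionary is 4x wider than the activations
l1_coeff = 1e-3           # hypothetical sparsity weight

# Randomly initialised parameters (no training in this sketch).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative feature activations, then reconstruct."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec                          # linear decoder back to activation space
    return f, x_hat

def sae_loss(x):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    f, x_hat = sae_forward(x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

x = rng.normal(size=(16, d_model))  # stand-in for a batch of real model activations
features, reconstruction = sae_forward(x)
loss = sae_loss(x)
```

The L1 term is what pushes most feature activations to zero on any given input, which is what makes individual features candidates for a single clean meaning.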

The insight: the model DOES have clean features. They're just encoded as directions in activation space (in superposition), not as individual neurons.
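
Toy illustration of "directions, not neurons". Everything here is made up for the example: three hypothetical features squeezed into a 2-neuron space as unit directions 120 degrees apart, with only one feature active at a time. Reading a single neuron mixes features; reading along a feature's direction does not.

```python
import numpy as np

# Three hypothetical features stored in a 2-neuron space as unit directions.
angles = np.deg2rad([30.0, 150.0, 270.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3 features, 2 neurons)

# Activate one feature at a time: row i is the activation vector when only feature i is on.
activations = directions.copy()

# Reading a single neuron: neuron 1 fires for feature 0 AND feature 1 (polysemantic).
neuron_1 = np.maximum(0.0, activations[:, 1])             # approximately [0.5, 0.5, 0.0]

# Reading along feature 0's direction: only feature 0 survives the ReLU (monosemantic).
feature_0 = np.maximum(0.0, activations @ directions[0])  # approximately [1.0, 0.0, 0.0]
```

The clean readout depends on the one-feature-at-a-time assumption; when several features are active together, the directions interfere and the readout gets noisy.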

Key goal of mechanistic interpretability: if we can extract monosemantic features, we can understand what the model actually knows.