ML//Alignment//mechanistic interpretability
Studying what each latent feature actually represents; trying to open the black box.
The alignment approach that says "instead of trusting the model, understand it".
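Key challenge: superposition. Networks represent more features than they have neurons by storing them as overlapping directions in activation space, so individual neurons respond to many unrelated features (polysemanticity) and are hard to interpret on their own.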
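A toy sketch of superposition may make this concrete. It packs 512 features into 256 dimensions by giving each feature a random unit direction, then checks that a sparse input can still be read back despite the interference between directions. The sizes, the random directions, and the two chosen active features are illustrative assumptions, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 512, 256   # more features than dimensions (assumed toy sizes)

# Each feature gets a random unit-norm direction in activation space.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse input: only features 3 and 17 are active (arbitrary choice).
x = np.zeros(n_features)
x[[3, 17]] = 1.0

h = x @ W          # superposed activation, shape (n_dims,)
readout = W @ h    # project the activation back onto every feature direction

# The two active features should have the largest readouts.
top2 = set(np.argsort(readout)[-2:].tolist())

# The directions cannot all be orthogonal, so there is nonzero interference.
interference = np.abs(W @ W.T - np.eye(n_features)).max()
```

Because the input is sparse, the interference terms stay small relative to the active features. With dense inputs the same readout would be much noisier, which is why sparsity matters for superposition.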
Sparse autoencoders (SAEs) decompose superposed activations into sparse, more interpretable features: one proposed path toward monosemanticity.
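A minimal sketch of a sparse autoencoder, assuming synthetic data and plain gradient descent with hand-written gradients. The sizes, the L1 coefficient, the learning rate, and the data generator are all illustrative assumptions. Real SAE training setups typically add details omitted here, such as decoder-norm constraints and a better optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_true, n_latent, n_samples = 16, 32, 64, 1024  # toy sizes (assumed)

# Synthetic superposed data: 32 sparse features squeezed into 16 dimensions.
directions = rng.normal(size=(n_true, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
active = rng.random((n_samples, n_true)) < 0.05
x = (active * rng.random((n_samples, n_true))) @ directions

# Overcomplete autoencoder: more latents than activation dimensions.
W_enc = rng.normal(scale=0.1, size=(d_model, n_latent))
b_enc = np.zeros(n_latent)
W_dec = W_enc.T.copy()
b_dec = np.zeros(d_model)
l1_coeff, lr = 1e-3, 1e-2   # assumed hyperparameters


def forward(x):
    pre = x @ W_enc + b_enc
    f = np.maximum(pre, 0.0)      # non-negative feature activations
    x_hat = f @ W_dec + b_dec     # reconstruction of the input
    return pre, f, x_hat


def loss_fn(x, f, x_hat):
    recon = ((x_hat - x) ** 2).sum(axis=1).mean()
    sparsity = l1_coeff * f.sum(axis=1).mean()   # L1 penalty (f >= 0)
    return recon + sparsity


losses = []
for step in range(300):
    pre, f, x_hat = forward(x)
    losses.append(loss_fn(x, f, x_hat))
    # Manual gradients of the loss above.
    d_xhat = 2.0 * (x_hat - x) / n_samples
    d_f = d_xhat @ W_dec.T + l1_coeff / n_samples
    d_pre = d_f * (pre > 0)
    grad_W_dec = f.T @ d_xhat
    grad_b_dec = d_xhat.sum(axis=0)
    grad_W_enc = x.T @ d_pre
    grad_b_enc = d_pre.sum(axis=0)
    W_dec -= lr * grad_W_dec
    b_dec -= lr * grad_b_dec
    W_enc -= lr * grad_W_enc
    b_enc -= lr * grad_b_enc

_, features, _ = forward(x)   # learned sparse codes, shape (n_samples, n_latent)
```

The loss trades reconstruction error against an L1 penalty on the feature activations. The penalty pushes each input to be explained by few latents, which is the property that makes the learned features candidates for interpretation.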
"Towards Monosemanticity" (Anthropic, 2023): dictionary learning to extract clean features from superposed representations.
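Techniques: probing (train a simple classifier, usually linear, on activations to test whether a concept is decodable), sparse autoencoders (decompose activations into sparse features), activation patching (swap activations between a clean and a corrupted run to find which components causally matter).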
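Activation patching from the list above can be sketched on a toy two-layer network. The model, its random weights, and the `run` helper are hypothetical, chosen only to show the procedure: cache activations from a clean run, splice them one at a time into a corrupted run, and measure how much the output moves.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # input -> hidden (toy weights, assumed)
w2 = rng.normal(size=8)        # hidden -> scalar output


def run(x, patch=None):
    """Run the toy model; `patch` maps hidden-unit index -> forced activation."""
    h = np.maximum(W1 @ x, 0.0)
    if patch:
        for i, value in patch.items():
            h[i] = value
    return h, float(w2 @ h)


x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)
h_clean, out_clean = run(x_clean)
_, out_corrupt = run(x_corrupt)

# Patch one clean hidden activation at a time into the corrupted run
# and record how far the output moves away from the corrupted output.
effects = np.array([
    run(x_corrupt, patch={i: h_clean[i]})[1] - out_corrupt
    for i in range(len(w2))
])
most_causal_unit = int(np.argmax(np.abs(effects)))
```

In this toy model the output is linear in the hidden layer, so the per-unit effects sum exactly to the clean-versus-corrupted output gap. In a real transformer, later layers are nonlinear and the effects generally do not add up this neatly.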
Induction heads are the "hello world": an early, well-verified circuit in which a previous-token head in an earlier layer composes with an induction head in a later layer to implement pattern-copying ([A][B] ... [A] -> [B]) for in-context learning. Evidence that interpretable structure exists in trained transformers.
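The pattern-copying behavior can be sketched as an idealized algorithm. This is not a trained model. It uses one-hot tokens and hard attention as simplifying assumptions, and the function name `induction_predict` is hypothetical.

```python
import numpy as np


def induction_predict(tokens, vocab_size):
    """Predict the next token with an idealized two-step induction mechanism."""
    one_hot = np.eye(vocab_size)[tokens]          # (seq, vocab)

    # Step 1, previous-token head: each position stores the token before it.
    prev_token = np.zeros_like(one_hot)
    prev_token[1:] = one_hot[:-1]

    # Step 2, induction head: the last position queries with its own token.
    # Keys are the stored previous tokens, so the query matches positions
    # that came right after an occurrence of the current token.
    query = one_hot[-1]
    scores = prev_token @ query                   # (seq,)
    if scores.sum() == 0:
        return None                               # current token never seen before

    # Hard attention to the most recent match, then copy that token.
    pos = int(np.flatnonzero(scores)[-1])
    return int(tokens[pos])


print(induction_predict([5, 9, 2, 7, 5], vocab_size=10))  # → 9
```

Here the sequence ends in 5, which was earlier followed by 9, so the sketch copies 9 forward. The two steps need separate layers because the second step reads what the first one wrote.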
The ultimate goal: identify what every head and MLP does across all layers — a complete map of how the model thinks.