ML//Alignment//mechanistic interpretability

Studying what each latent feature actually represents; trying to open the black box.


The alignment approach that says "instead of trusting the model, understand it".

Key challenge: superposition. A model represents more features than it has neurons, so each neuron responds to many unrelated features (polysemanticity) and individual neurons become uninterpretable.
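
A toy picture of superposition, as a minimal sketch: 50 features share a 20-dimensional activation space by each taking a random direction. The sizes, the random directions, and the choice of active features are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_dims = 50, 20  # assumed toy sizes: more features than dimensions

# Give every feature its own unit-norm direction in activation space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only features 3 and 17 are active.
feature_values = np.zeros(n_features)
feature_values[[3, 17]] = 1.0

# The activation is the sum of the active features' directions.
activation = feature_values @ directions  # shape (n_dims,)

# Projecting back onto each direction gives roughly 1 for active features,
# plus interference because the directions cannot all be orthogonal.
readout = directions @ activation  # shape (n_features,)

# Count how many features put noticeable weight on each neuron (dimension).
features_per_neuron = (np.abs(directions) > 0.1).sum(axis=0)
print(features_per_neuron)
```

Every dimension carries weight from many features at once, which is why reading off a single neuron tells you little.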

Sparse autoencoders decompose superposed activations into a larger set of sparsely active, more interpretable features: the path toward monosemanticity.

"Towards Monosemanticity" (Anthropic, 2023): dictionary learning (a sparse autoencoder trained on the MLP activations of a one-layer transformer) to extract more monosemantic features from superposed representations.
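
A minimal sketch of a sparse autoencoder's forward pass and loss, assuming one common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty). The sizes, the `l1_coeff` value, and the random untrained weights are hypothetical; the exact setup used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 16, 64  # assumed sizes: the dictionary is wider than the activation
l1_coeff = 1e-3           # hypothetical sparsity penalty weight

# Randomly initialised parameters; a real SAE would learn these by gradient descent.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative feature coefficients, then reconstruct."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder
    reconstruction = features @ W_dec + b_dec                # linear decoder
    return features, reconstruction

def sae_loss(x):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    features, reconstruction = sae_forward(x)
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * l1

batch = rng.normal(size=(8, d_model))  # stand-in for real model activations
features, reconstruction = sae_forward(batch)
loss = sae_loss(batch)
```

Each row of `W_dec` plays the role of one dictionary direction; after training, the hope is that each corresponds to a single interpretable feature.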

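Techniques: probing (train a simple classifier, usually linear, on activations to test whether a concept is decodable), sparse autoencoders (decompose activations into sparse features), activation patching (swap activations between runs to find causally important components).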
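
Activation patching on a toy network, as a sketch under stated assumptions: the two-layer model, its random weights, and the `run` helper are all hypothetical stand-ins for a real model and its hooks. Each hidden unit from a "clean" run is copied into a "corrupted" run, one at a time, to see how much of the output difference that unit carries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: input -> hidden (ReLU) -> scalar output.
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6,))

def run(x, patch=None):
    """Run the model, optionally overwriting hidden units with given values.

    patch: optional (indices, values) pair applied to the hidden layer.
    """
    hidden = np.maximum(0.0, x @ W1)
    if patch is not None:
        indices, values = patch
        hidden = hidden.copy()
        hidden[indices] = values
    return hidden @ W2, hidden

clean_x = rng.normal(size=4)
corrupt_x = rng.normal(size=4)

clean_out, clean_hidden = run(clean_x)
corrupt_out, _ = run(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted run
# and record how much the output shifts as a result.
effects = []
for i in range(W1.shape[1]):
    patched_out, _ = run(corrupt_x, patch=([i], clean_hidden[[i]]))
    effects.append(patched_out - corrupt_out)
effects = np.array(effects)
```

Units with large entries in `effects` are the ones that causally matter for the clean-versus-corrupted difference. In this toy model the output is linear in the hidden layer, so the per-unit effects sum exactly to the total difference; in a real network with later nonlinearities they generally would not.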

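Induction heads are the "hello world": one of the first concrete, verified circuits in transformers. A previous-token head in an earlier layer composes with an induction head in a later layer to implement pattern-copying for in-context learning ([A][B] ... [A] → [B]). Evidence that interpretable structure exists inside trained models, with heads communicating through the residual stream.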
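
The rule an induction circuit implements can be written out in plain Python. This is a behavioural sketch only: `induction_predict` is a hypothetical helper that mimics the input-output pattern, not the attention computation itself, and it assumes a non-empty token list.

```python
def induction_predict(tokens):
    """Predict the next token by the induction rule: [A][B] ... [A] -> [B].

    Returns None when the current token has not appeared earlier.
    """
    current = tokens[-1]
    # Scan backwards for the most recent earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed it last time
    return None

prediction = induction_predict(["Mr", "Dursley", "was", "proud", ".", "Mr"])
print(prediction)  # → Dursley
```

In the model, the backward search corresponds to the induction head attending to the position just after an earlier copy of the current token, which relies on the previous-token head having marked each position with what came before it.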

The ultimate goal: identify what every head and MLP does across all layers — a complete map of how the model thinks.