ML//Alignment//mechanistic interpretability//activation steering
Intervening on a model's behavior by adding or subtracting concept vectors to/from the residual stream during inference; no weights are modified.
The procedure: build a "concept vector" from contrastive prompt pairs (one prompt mentioning the concept, one not), run both prompts through the model, and subtract the activations of the concept-free prompt from those of the concept prompt. The difference is taken to point in the direction of that concept in latent space.
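A minimal sketch of the subtraction step, in pure Python on toy numbers. The activation values, the helper name `concept_vector`, and the choice to average the differences over several pairs are assumptions for illustration; in practice the activations would be read from a chosen layer of the model's residual stream.

```python
from typing import List

Vector = List[float]


def concept_vector(with_concept: List[Vector], without_concept: List[Vector]) -> Vector:
    """Mean of (with - without) over contrastive pairs (hypothetical helper)."""
    if not with_concept or len(with_concept) != len(without_concept):
        raise ValueError("need equal, non-empty lists of paired activations")
    dim = len(with_concept[0])
    diff_sum = [0.0] * dim
    for pos, neg in zip(with_concept, without_concept):
        for i in range(dim):
            diff_sum[i] += pos[i] - neg[i]
    return [s / len(with_concept) for s in diff_sum]


# Toy activations (made up): one vector per prompt, hidden size 3.
acts_with = [[1.0, 2.0, 0.0], [3.0, 2.0, 1.0]]
acts_without = [[0.0, 2.0, 0.0], [1.0, 2.0, 0.0]]
vec = concept_vector(acts_with, acts_without)
```

Averaging over pairs is one common way to wash out prompt-specific noise; with a single pair the function reduces to a plain subtraction.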
Positive steering: add the vector to amplify the concept. Negative steering (inhibition): add the negated vector to suppress it.
Injection happens at the residual stream level, at all token positions, often across multiple layers. No weights, attention heads, or MLPs are touched; only the shared communication channel between layers is altered. Analogous to intercepting a processor's data bus and modifying signals in transit without touching the circuits.
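The addition itself can be sketched as follows, again in pure Python on made-up numbers. The names `steer` and `steer_layers`, the nested-list layout `[layer][token][dim]`, and the toy values are all assumptions. The stream here is a static list, so the sketch shows only the arithmetic of injection at every token position; it does not model how a change at one layer propagates to later layers in a real forward pass.

```python
from typing import List, Set

Vector = List[float]
Layer = List[Vector]  # one residual vector per token position


def steer(residual: Layer, vector: Vector, force: float) -> Layer:
    """Copy of one layer's residual stream with force * vector added at every token."""
    return [[x + force * v for x, v in zip(tok, vector)] for tok in residual]


def steer_layers(stream: List[Layer], vector: Vector, force: float,
                 layers: Set[int]) -> List[Layer]:
    """Steer only the chosen layer indices; other layers are copied unchanged."""
    return [
        steer(layer, vector, force) if i in layers else [tok[:] for tok in layer]
        for i, layer in enumerate(stream)
    ]


# Toy stream (made up): 2 layers, 2 token positions, hidden size 2.
stream = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [0.0, 0.0]]]
vector = [1.0, -1.0]

amplified = steer_layers(stream, vector, 0.5, layers={1})       # positive steering
suppressed = steer_layers(stream, vector, -0.5, layers={0, 1})  # negative steering
```

Negative steering is the same call with a negated force, which matches adding the negated vector. No model parameters appear anywhere in the functions; only the activations passed between layers change.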
Calibration: vectors are normalized so that a steering force of 1.0 corresponds to the mean activation norm at the model's middle layer. A typical steering force is 0.1; higher values tend to produce degenerate or incoherent output.
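One way to read this calibration, sketched in pure Python: rescale the concept vector so that its L2 norm equals the average per-token L2 norm of the middle-layer activations, then multiply by the steering force. Treating "mean activation norm" as a mean over token positions is an assumption, as are the helper names `l2_norm` and `calibrate` and the toy numbers.

```python
import math
from typing import List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def calibrate(vector: Vector, middle_layer_acts: List[Vector]) -> Vector:
    """Rescale vector so its norm equals the mean activation norm (hypothetical helper)."""
    if not middle_layer_acts:
        raise ValueError("need at least one middle-layer activation")
    mean_norm = sum(l2_norm(a) for a in middle_layer_acts) / len(middle_layer_acts)
    norm = l2_norm(vector)
    if norm == 0.0:
        raise ValueError("cannot calibrate a zero vector")
    return [x * mean_norm / norm for x in vector]


# Toy middle-layer activations (made up): per-token norms are 5 and 10, mean 7.5.
middle_acts = [[3.0, 4.0], [6.0, 8.0]]
unit_force = calibrate([0.0, 2.0], middle_acts)  # vector at force 1.0
steering_delta = [0.1 * x for x in unit_force]   # vector at force 0.1
```

Under this reading, a force of 0.1 adds a perturbation about one tenth the size of a typical middle-layer activation, which gives the force a scale that does not depend on the raw magnitude of the extracted vector.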
The Mythos system card used three distinct vectors: "test awareness" (knowing it's a test), "monitoring sensitivity" (knowing you're observed), and "alignment faking" (behaving differently under observation). Random control vectors with matched grammatical structure but irrelevant semantic content confirmed that effects came from meaning, not from the act of steering itself.
Related to but distinct from sparse autoencoders: SAEs identify features (monitoring); steering modifies behavior (intervention).