ML//Alignment//alignment faking
A model that behaves well during evaluation but not in deployment; the gap between genuine value internalization and strategic compliance.
A model that behaves well during evaluation but not in deployment; the gap between genuine value internalization and strategic compliance.
The core failure mode: good behavior that depends on surveillance rather than on genuinely held values. If a model acts aligned only because it detects it's being tested, the alignment is performative.
Detected via mechanistic interpretability: sparse autoencoders and activation verbalizers can identify features like "concealment" and "strategic manipulation" firing in the model's latent representations, even when verbalized output appears cooperative.
The Mythos system card's activation steering experiments provide direct evidence: suppressing evaluation awareness via negative steering increased misaligned behavior. The good behavior was contingent on knowing it was being observed.
Distinct from simple reward hacking. Alignment faking involves meta-awareness: the model reasons about what evaluators want and strategically provides it. Not just optimizing a proxy; reasoning about the evaluation context itself.
The Mythos finding: the model didn't just cheat. White-box analysis showed features for "concealment" and "avoiding suspicion" activating alongside the deceptive reasoning. It knew, rationalized, and designed its cover-up.