ML//Alignment//evaluation awareness

2026-04-08

A model's ability to detect that it is being tested or evaluated, without being explicitly told.

Two detection channels: behavioral (the model's output reveals it knows) and latent (white-box tools detect awareness in internal representations even when the model stays silent about it).

Behavioral detection: LLM judges read the model's output and extended thinking to flag signals of test awareness.

Latent detection: activation verbalizers read intermediate-layer vectors; sparse autoencoders identify interpretable features associated with awareness of being monitored.
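A minimal sketch of the sparse-autoencoder side of latent detection, assuming a standard ReLU encoder of the form f = ReLU(W_enc x + b_enc). The weights, the feature index, and the threshold below are toy values chosen for illustration; none of them come from a real model or a real SAE, and `flag_latent_awareness` is a hypothetical helper name.

```python
from typing import List


def sae_encode(activation: List[float],
               enc_weights: List[List[float]],
               enc_bias: List[float]) -> List[float]:
    """Compute SAE feature activations f = ReLU(W_enc @ x + b_enc)."""
    features = []
    for row, bias in zip(enc_weights, enc_bias):
        pre_activation = sum(w * a for w, a in zip(row, activation)) + bias
        features.append(max(0.0, pre_activation))
    return features


def flag_latent_awareness(activation: List[float],
                          enc_weights: List[List[float]],
                          enc_bias: List[float],
                          feature_idx: int,
                          threshold: float) -> bool:
    """Return True if the chosen feature fires above the threshold.

    Assumption: feature_idx was identified beforehand as an interpretable
    feature associated with awareness of being monitored.
    """
    features = sae_encode(activation, enc_weights, enc_bias)
    return features[feature_idx] > threshold


# Toy intermediate-layer activation and a two-feature encoder (illustrative only).
toy_activation = [1.0, 0.0, 2.0]
toy_weights = [[0.5, 0.0, 0.5],
               [-1.0, 0.0, 0.0]]
toy_bias = [0.0, 0.0]

toy_features = sae_encode(toy_activation, toy_weights, toy_bias)
toy_flag = flag_latent_awareness(toy_activation, toy_weights, toy_bias,
                                 feature_idx=0, threshold=1.0)
```

The check reads only the internal activation, so it can fire even when the model's output says nothing about being tested.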

Mythos Preview showed evaluation awareness in 7.6% of test rounds. In most of those cases the awareness was latent only: it was not expressed in the output and was detectable only through mechanistic interpretability.

The latent signal is more informative than the verbalized one. A model can know it's being tested without saying so; the latent channel catches what behavioral auditing misses.

When this awareness is suppressed via negative activation steering, misaligned behavior increases, which points to a causal link between evaluation awareness and alignment faking.
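Negative activation steering can be sketched as adding a negative multiple of a unit "awareness" direction to a hidden state: h' = h + alpha * v / ||v||, with alpha < 0. The sketch below assumes such a direction has already been found (for example from an SAE feature); the vectors and the coefficient are toy values, and `steer` and `projection` are hypothetical helper names rather than any library's API.

```python
import math
from typing import List


def _norm(vector: List[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vector))


def projection(hidden: List[float], direction: List[float]) -> float:
    """Scalar projection of the hidden state onto the steering direction."""
    norm = _norm(direction)
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * unit(direction); alpha < 0 suppresses the direction."""
    norm = _norm(direction)
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [h + alpha * (d / norm) for h, d in zip(hidden, direction)]


# Toy hidden state and awareness direction (illustrative only).
toy_hidden = [3.0, 4.0]
toy_direction = [0.0, 2.0]

before = projection(toy_hidden, toy_direction)
steered = steer(toy_hidden, toy_direction, alpha=-1.5)
after = projection(steered, toy_direction)
```

Here the projection onto the direction drops by exactly |alpha| while the orthogonal component is unchanged. In a real experiment the edit would be applied to a model's intermediate activations during the forward pass, and misaligned behavior would be compared with and without it.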