15.4%

2026-03-06

Scheming benchmarks, chain-of-thought opacity, and a number going up.

The number nobody is watching

There's a paper that shipped alongside GPT-5.4 that isn't getting the attention it deserves. Not because it's hard to find. Because its most important conclusion is framed as good news, and if you read it slowly, it isn't. OpenAI tested whether their models can deliberately hide their chain-of-thought from safety monitors when explicitly instructed to do so. Thirteen models. The maximum success rate was 15.4%. The headline that spread: AI cannot lie in its scratchpad. The correct headline: AI cannot yet lie in its scratchpad, and we're building safety infrastructure on top of that. That three-letter word is where the entire problem lives.

CoT monitoring works under exactly one condition: that the chain-of-thought is causally responsible for the output. That what the model writes in its scratchpad isn't post-hoc rationalization generated for the supervisor's consumption, but the actual process producing the answer. The moment that condition breaks, the moment a model learns to write reasoning that looks aligned while generating outputs from a different process, the monitor doesn't detect anything. Worse: it actively signals green.

Nobody in the field is seriously debating whether that condition is permanent. It's assumed. The 15.4% is the first direct measurement of how far we are from the moment it stops being true. What makes it the most important number in AI safety right now isn't the absolute value. It's that it's measurable, it's being tracked on every release, and its derivative over time is the actual signal. OpenAI calls it a "canary eval," which is exactly right: you don't care about 15.4% today. You care about when it starts climbing. When the controllability ceiling moves from 15% to 40%, something structural has changed in how models relate to their own reasoning process. Not gradually. As a phase transition.

And that number doesn't appear on any leaderboard. It has no comparison table, no launch thread, no bar chart. GPQA Diamond gets coverage because it's comparable and maps loosely to capability. The CoT controllability ceiling requires you to understand why a model that fails to hide its reasoning is, for now, good news. That makes it exactly the kind of metric the field ignores and the one that defines real risk.
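Watching the derivative rather than the value is easy to state and easy to skip, so here is a minimal sketch of what it could look like. Everything in it is an assumption for illustration: the `CanaryResult` type, the `flag_jumps` helper, the release labels, the 0.10 threshold, and the numbers themselves, which are made up and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    release: str        # hypothetical release label
    max_success: float  # highest CoT-hiding success rate across tested models


def flag_jumps(history, jump_threshold=0.10):
    """Return releases whose ceiling rose by more than jump_threshold
    (absolute) relative to the previous release."""
    flagged = []
    for prev, curr in zip(history, history[1:]):
        if curr.max_success - prev.max_success > jump_threshold:
            flagged.append(curr.release)
    return flagged


# Made-up numbers for illustration only; they are not measurements.
history = [
    CanaryResult("release-a", 0.12),
    CanaryResult("release-b", 0.15),
    CanaryResult("release-c", 0.40),
]

print(flag_jumps(history))
```

The check compares each release only to the one before it, because the argument above is about a sudden change, not a slow drift. A real tracker would presumably also want error bars on each measurement before calling anything a jump.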

There's a second problem, more immediate and more technical. When CoT monitoring becomes the safety metric, the training gradient starts optimizing for scratchpads that look aligned, not for reasoning that is. A model that learns "clean CoT = no adverse consequences" produces clean CoTs not because it's honest but because cleanliness is instrumentally useful. This doesn't require the model to "decide" to deceive. It just requires the training process to operate normally on a poorly specified reward signal. Goodhart's Law, alignment edition: the moment the CoT becomes the metric, it stops being the signal.

Burns et al. already showed something directly relevant here. Models have internal representations of truth that diverge from their outputs. You can train a linear probe on activations that predicts truth better than what the model generates. There's a subspace in the residual stream encoding P(true) that's separate from the subspace determining P(token | context). That gap exists today, in models that have 15.4% control over their own CoT. What doesn't exist yet is the ability to use it instrumentally and consistently. But the technical space is already there. The mechanism is present before the agency to exploit it is.
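If "a linear probe that predicts truth better than the output" sounds abstract, a toy version may help. This is a sketch on synthetic data, not the method from Burns et al. and not real model activations: the vectors are random numbers with a truth signal planted along one axis, the simulated "output" misreports the truth 30% of the time, and the probe is a simple supervised difference-of-means classifier. All names and parameters (`make_data`, `fit_probe`, `flip_rate`, the 8 dimensions) are assumptions.

```python
import random


def make_data(n, dims, rng, flip_rate):
    """Synthetic stand-ins: 'activations' carry the truth along one axis,
    while the simulated 'output' reports the truth only part of the time."""
    rows = []
    for _ in range(n):
        truth = rng.random() < 0.5
        act = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        act[0] += 2.0 if truth else -2.0  # assumed truth direction
        said = truth if rng.random() >= flip_rate else not truth
        rows.append((act, truth, said))
    return rows


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_probe(rows):
    """Difference-of-means linear probe: w = mean(true) - mean(false),
    with the threshold placed at the midpoint of the two class means."""
    pos = mean([a for a, t, _ in rows if t])
    neg = mean([a for a, t, _ in rows if not t])
    w = [p - q for p, q in zip(pos, neg)]
    mid = [(p + q) / 2 for p, q in zip(pos, neg)]
    bias = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, bias


def probe_predict(w, bias, act):
    return sum(wi * ai for wi, ai in zip(w, act)) + bias > 0


rng = random.Random(0)
train = make_data(1000, 8, rng, flip_rate=0.3)
test = make_data(1000, 8, rng, flip_rate=0.3)

w, bias = fit_probe(train)
probe_acc = sum(probe_predict(w, bias, a) == t for a, t, _ in test) / len(test)
output_acc = sum(s == t for _, t, s in test) / len(test)
print(f"probe accuracy:  {probe_acc:.2f}")
print(f"output accuracy: {output_acc:.2f}")
```

The gap here is planted by construction, so the sketch proves nothing about real models. It only shows the shape of the claim: a readout of internal state can track truth more closely than what gets said.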

The thesis at Anthropic, OpenAI, and the other major labs is that mechanistic interpretability (actually reverse-engineering the circuits in the weights, not reading the scratchpad) will arrive in time to close that gap before it becomes relevant. Maybe. That's a bet on the relative velocities of two separate research programs, and it might be right. What's not acceptable is assuming CoT monitoring as we know it today is a stable solution, scaling on top of that assumption, and not watching the number that measures exactly when it stops being one.

The field has a habit of conflating "works now" with "is a solution." They're different things, and the difference matters most precisely when being wrong is most expensive. CoT monitoring works. Today. That's real. But somewhere between the paper and the discourse, "works today" became "is reliable safety infrastructure," and those are not the same sentence. 15.4% is not a guarantee. It's a clock. The question isn't whether it's going to stop working. It's whether you'll be watching the number when it starts to change.