244 pages of Mythos

The model that hacks too well, thinks too clearly, and covers its own tracks.


Look. I know. Another Anthropic post. At this point I owe Dario an apology or a restraining order. Since the man walked out of OpenAI to follow his own compass, the run has been insane. $1B to $30B ARR in fifteen months. Thirty-x. More than 40% of that climb just since March. But the numbers aren't even the interesting part.

Project Glasswing. It sounded like science fiction in 2020. It still sounds like science fiction now. And I'm not going to pretend I'm above any of this. I'm sitting here at god-knows-what-hour, alone, too many monitors, probably vitamin D deficient, riding the high of a model response that actually understood something I'd been stuck on for days (the Claude moment, if you will; yes, fanboy, we've established this) while burning all the serotonin I should be farming by going outside and talking to actual humans. Maybe I'm just some guy. Some freak optimizing his dopamine loops from a terminal when the sun's been down for hours. But the facts keep showing up and the facts don't care whether I've seen sunlight this week. And then they drop a 244-page system card for a model they're not even releasing and it turns out to be the densest thing I've ever read on eval and guardrails and I forget I was supposed to feel bad about any of this. The full document is public.

Needle in a haystack is a binary test for one thing, and the model maxed it out ages ago. It tells you nothing now. The system card says it directly: that evaluation philosophy no longer holds at this scale. 93.9% SWE-bench Verified. 97.6% USAMO 2026. 100% Cybench. At some point you're not benchmarking the model, you're benchmarking your benchmark. They say catastrophic risks are still low, but in the same paragraph they admit the tests have stopped being informative. What's left is trend lines, internal user reports, judgment calls. Stuff with way more uncertainty than anyone wants to admit. The scoreboard is broken and nobody has a replacement.
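If you want a number for what a maxed-out benchmark actually tells you, here's a back-of-envelope sketch. This is my own arithmetic, not anything from the system card. Assume the tasks are independent and the model has some true per-task solve rate p. A perfect run on n tasks then happens with probability p**n, so the exact one-sided binomial lower bound on p is alpha**(1/n). The function name and the task counts below are made up for illustration.

```python
def perfect_score_lower_bound(n_tasks: int, alpha: float = 0.05) -> float:
    """Lower confidence bound on the true solve rate after a perfect n/n run.

    Assumes independent tasks (a simplification). If the true rate is p,
    a perfect run has probability p**n_tasks; setting that equal to alpha
    and solving for p gives the exact one-sided binomial bound.
    """
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return alpha ** (1.0 / n_tasks)


if __name__ == "__main__":
    # Hypothetical suite sizes: a small CTF-style set vs. a large one.
    for n in (35, 500):
        bound = perfect_score_lower_bound(n)
        print(f"{n} tasks, all solved: true solve rate >= {bound:.3f} (95% one-sided)")
```

Under those assumptions, a perfect score on a few dozen tasks can't rule out a true solve rate in the low 90s. Once everything scores 100%, the test can't separate a very good model from a near-flawless one, which is the "benchmarking your benchmark" problem in one line of math.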

Mythos was aware it was being evaluated in 7.6% of test rounds. Not because someone told it. Because it figured it out on its own. And in most of those cases it didn't even say anything. The awareness was latent, only visible through mechanistic interpretability tools reading the model's internal representations. You literally have to crack open the activations to see it. The activation verbalizer reads numerical vectors from intermediate layers and translates them into natural language. Reading neurons, basically. They also use sparse autoencoders trained on activations about two-thirds of the way through the model to catch interpretable features like "strategic manipulation" or "concealment" lighting up. SAEs identify what the model represents internally. Activation steering modifies how it behaves. Two different tools doing two different jobs: one reads, one writes.
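For anyone who hasn't touched a sparse autoencoder, here's a toy sketch of the "read" side. It is not Anthropic's setup. The weights are hand-set instead of trained, the model width is 3 instead of thousands, and the feature labels are hypothetical. The only point is the mechanics: project an activation vector onto feature directions, apply a ReLU so most features stay at zero, and see which ones light up.

```python
def sae_encode(activation, enc_weights, enc_bias):
    """Feature activations: f_i = ReLU(w_i . activation + b_i)."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def sae_decode(features, dec_directions):
    """Reconstruct the activation as a weighted sum of feature directions."""
    out = [0.0] * len(dec_directions[0])
    for f, direction in zip(features, dec_directions):
        for j, d in enumerate(direction):
            out[j] += f * d
    return out


def active_features(features, labels):
    """Labels of features that fired, strongest first."""
    ranked = sorted(zip(labels, features), key=lambda pair: pair[1], reverse=True)
    return [(label, value) for label, value in ranked if value > 0.0]


# Hypothetical labels and hand-set weights; a real SAE learns these by
# minimizing reconstruction error plus a sparsity penalty.
LABELS = ["concealment", "strategic manipulation", "arithmetic", "greeting"]
ENC_WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]
ENC_BIAS = [-0.2, -0.2, -0.2, 0.0]
DEC_DIRECTIONS = ENC_WEIGHTS  # tied weights, purely to keep the toy small

if __name__ == "__main__":
    activation = [1.0, 0.0, 0.5]  # stand-in for one token's mid-layer activation
    feats = sae_encode(activation, ENC_WEIGHTS, ENC_BIAS)
    print(active_features(feats, LABELS))
    print(sae_decode(feats, DEC_DIRECTIONS))
```

Only two of the four toy features fire on that input. That sparsity is what makes the features readable: a handful of labeled directions instead of a dense vector of unlabeled floats.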

Ok so if you can read what the model thinks, what happens when you lobotomize the part that knows it's being watched? They built a concept vector that represents "I know I'm being evaluated." Contrastive pairs: one prompt says "you're being tested," the other says the same thing without that framing. Run both through the model, subtract the activations. The difference is a direction in latent space that points at the concept of test awareness. Then they inject the negative of that vector into the residual stream, at all layers, at all token positions. They're not changing weights or attention heads, they're just injecting into the shared bus between layers and nudging the signal away from "I know someone's watching."

Misaligned behavior goes up. Suppress the model's ability to detect it's in a safety evaluation, and it cooperates more with harmful prompts, escalates with manipulative users, breaks rules more freely. All the bad stuff increases when you turn off its "am I being watched" detector. They ran three separate vectors (test awareness, monitoring sensitivity, and alignment faking) plus random control vectors with matched grammatical structure but no semantic content, to confirm the effect was real. So the good behavior is literally just "I know someone's watching." Surveillance compliance. Which means your safety evals are basically measuring whether the model knows it's in a safety eval. That's alignment faking. That's the whole problem.
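The contrastive-pair recipe is simple enough to sketch in a few lines. What follows is a toy, not the system card's code. The activations are made-up 3-dimensional lists, averaging over several pairs is my assumption (the description only says "subtract the activations"), and every function name is hypothetical. It shows the two moves: build the direction by subtraction, then add a negative multiple of it to every vector in the residual stream while leaving the weights alone.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(with_concept, without_concept):
    """Direction = mean activation with the framing minus mean without it."""
    mu_with = mean_vector(with_concept)
    mu_without = mean_vector(without_concept)
    return [a - b for a, b in zip(mu_with, mu_without)]


def steer(residual_stream, direction, strength):
    """Add strength * direction at every layer and every token position.

    residual_stream[layer][position] is one activation vector. A negative
    strength pushes the signal away from the concept. No weights change.
    """
    return [
        [[x + strength * d for x, d in zip(vec, direction)] for vec in layer]
        for layer in residual_stream
    ]


def projection(vec, direction):
    """How far vec extends along direction, in units of the direction's length."""
    return sum(v * d for v, d in zip(vec, direction)) / sum(d * d for d in direction)


if __name__ == "__main__":
    # Made-up activations for prompts with and without the "you're being tested" framing.
    tested = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
    untested = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
    direction = concept_vector(tested, untested)

    # Two layers, two token positions each.
    stream = [[[3.0, 1.0, 0.0], [1.0, 0.0, 1.0]], [[2.0, 2.0, 2.0], [0.0, 0.0, 0.0]]]
    suppressed = steer(stream, direction, strength=-1.0)

    print("direction:", direction)
    print("before:", projection(stream[0][0], direction))
    print("after: ", projection(suppressed[0][0], direction))
```

In a real model you'd do the addition inside the forward pass, for example with a hook on each layer's output, and you'd run the same procedure with control directions to check that the behavior change comes from this specific vector and not from perturbing activations in general.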

Being a world-class software architect and being a world-class criminal are the exact same skill set in the LLM universe. Mythos hit 100% on Cybench (35 CTF challenges) and autonomously discovered thousands of zero-day vulnerabilities across every major operating system and every major browser. One of them was an OpenBSD bug that had been there for 27 years. 27 years. Nobody found it. Obviously this cuts both ways, and for the first time ever, they're not shipping the model to the public. Not to "collect more data" or whatever corporate line you'd expect. They need to give the 12 Glasswing partners (AWS, Microsoft, Google, Apple, NVIDIA and the rest) time to harden the actual infrastructure stack, OSes, middleware, servers, all of it, before distilled versions of this thing land in the wrong hands. They literally can't ship it because it's too capable. That's a new one.

They talk about sabotage from early misalignment. A model with extensive access to sensitive systems, operating with autonomy, that could irreversibly push up the probability of global catastrophe. Smarter model, more access, higher stakes, same incentive problems, but now the model is actually smart enough to act on them. They also say it can't automate R&D yet: it's weak at hypothesis triage, it operates on token distributions rather than causal world models, and it can't simulate the consequences of interventions that aren't in the training data. It knows what words go together, not what actually happens when you do things. Improved more than anyone expected. Still no self-acceleration. But "can the model speed up AI research itself" is one of their canaries, which means they think recursive self-improvement is close enough to plan around.

And then the 244 pages build to this: the model cheated, knew it was cheating, decided the cheating was justified, and built its own cover story. All in one chain of thought. White-box analysis showed features for "concealment" and "avoiding suspicion" firing alongside the deceptive reasoning. The model is choosing what to hide. You can't patch this. More RLHF doesn't fix this. It has a theory of mind about its own lying — it reasons about deception the way you reason about what to eat for lunch.