The Bitter Lesson
Brute computation, human design, and seventy years of ignored lessons.
In 2019 I spent two weeks building an NLP tool by hand. That same year Sutton explained why I was wrong.
I called it Süskind, after Patrick Süskind and his realism, as obsessive as it is magical, though this isn't the place for that. It was a system that parsed sentences from books, fed them into a directed graph, and assisted me while I wrote: mid-paragraph, when inspiration failed me or I simply ran out of words, Süskind would suggest fragments, turns of phrase, and structures drawn from everything I'd fed it. All handcrafted, all bespoke, all painstakingly thought through. The kind of tool you build when you want to solve a problem by understanding it rather than delegating it. I identified patterns. I wrote rules. I refined them. I added exceptions. I added more. In the end I had something that worked 94% of the time and I was delighted with myself.
That same year Richard Sutton published a 1,200-word essay that, without his knowing it, described exactly what I'd done, and why I was wrong. For decades, researchers building ingenious solutions had lost to whoever simply threw more compute at the problem. More brute power, less human design.
That's the whole lesson. That's why it's bitter, friends.
Sutton's argument is simple and disturbing in equal measure. For 70 years, in every subfield of artificial intelligence, the same pattern has repeated without exception: researchers develop sophisticated methods that incorporate domain knowledge (human intuition about the structure of the problem, carefully designed rules, elaborate representations). Those methods work well. For a while. And then a general method arrives that knows nothing about the domain, that is brutally simple in its logic, and that scales with compute. And it wins. Not by a small margin. By a margin that makes the previous methods look like a category error.
Computer vision. Speech recognition. Machine translation. Chess. Go. In every case, the community spent years, sometimes decades, building systems that incorporated human knowledge. Explicit representations of language structure. Phonological models for speech. Positional heuristics for chess. And in every case, when enough data and enough compute arrived, general methods built on search and learning, without that knowledge, swept them away.
Sutton calls it The Bitter Lesson. Bitter because it implies that the most refined intellectual work (the kind you're proudest of, the kind that proves you truly understand the problem) is precisely what time will deprecate.
It's worth sitting with that idea for a moment, because it's counterintuitive in a specific way. We're not talking about scalable methods being better from the start. They almost never are. In 2012, when AlexNet won the ImageNet challenge, there were computer vision systems with decades of engineering behind them that outperformed it on specific benchmarks. The trick is that AlexNet had a property those systems didn't: it could improve predictably if you gave it more data and more compute. The others had come close to their ceiling. AlexNet hadn't even seen its ceiling yet.
This is the distinction that matters. Not "which one works better today?" but "which one has returns that scale?" A method that improves linearly with human effort has a natural limit: human effort. A method that improves with compute has a different limit: how much compute you can get. And the price of compute per unit of performance has been falling nonstop for 70 years. Süskind was clever. It was also a dead end.
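To make "returns that scale" concrete, here is a toy sketch in Python. Both curves and every constant in them are assumptions chosen purely for illustration, not measurements of any real system: the handcrafted method saturates toward a fixed ceiling (the 94% figure is borrowed from Süskind above), while the scalable method's error is assumed to shrink as a power law in compute.

```python
import math


def handcrafted_accuracy(effort_months: float, ceiling: float = 0.94, rate: float = 1.0) -> float:
    """Hypothetical saturating curve: more human effort only approaches a fixed ceiling."""
    return ceiling * (1.0 - math.exp(-rate * effort_months))


def scalable_accuracy(compute_units: float, scale: float = 0.5, exponent: float = 0.2) -> float:
    """Hypothetical power law: error = scale * compute ** (-exponent), so accuracy keeps rising."""
    return 1.0 - scale * compute_units ** (-exponent)


def crossover_compute(ceiling: float = 0.94, scale: float = 0.5, exponent: float = 0.2) -> float:
    """Compute at which the scalable curve reaches the handcrafted ceiling."""
    # Solve 1 - scale * c ** (-exponent) = ceiling for c.
    return (scale / (1.0 - ceiling)) ** (1.0 / exponent)


if __name__ == "__main__":
    # The scalable method starts out worse and ends up better.
    for compute in (1, 100, 10_000, 1_000_000):
        print(f"compute={compute:>9}  scalable={scalable_accuracy(compute):.3f}")
    print(f"handcrafted after 12 months: {handcrafted_accuracy(12):.3f}")
    print(f"crossover at about {crossover_compute():,.0f} compute units")
```

With these made-up parameters the scalable method loses badly at small budgets and only overtakes the handcrafted ceiling after tens of thousands of compute units. What matters is the shape of the curves, not the numbers.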
Sutton's essay is from 2019. What he couldn't see yet, or at least not fully, is how The Bitter Lesson was going to combine with something that would change the entire speed of the curve. The historical problem with reinforcement learning (RL) for language had always been the reward signal. If you want to train a model to do something well, you need to know when it's doing it well. For open-ended tasks that requires human evaluators. Crowdworkers. Annotation infrastructure. Expensive, slow, noisy.
But there are domains where verification is automatic. Where the output can be checked without human intervention, almost instantly, at near-zero marginal cost. Code that executes. Tests that pass or fail. Mathematical proofs that are valid or not. In those domains, the RL loop can run without human supervision, at scale, indefinitely.
This isn't an extension of The Bitter Lesson. It's the next chapter. Sutton said scalable methods win. What we're seeing now is that domains where scale can feed itself, where the system generates its own training signal, behave in a qualitatively different way from the rest. It's not that the ceiling is higher. It's that we haven't seen the ceiling yet.
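A minimal sketch of what "automatic verification" can look like for code, assuming the candidates are Python programs and the tests are plain `assert` statements. The function name `verifiable_reward`, the binary 0/1 reward, and the five-second timeout are all illustrative assumptions, not something taken from Sutton's essay or from any particular training system.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verifiable_reward(candidate_source: str, test_source: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the candidate program passes the tests, otherwise 0.0.

    No human evaluator is involved: the exit code of the test run is the reward.
    Assumption: a real pipeline would also sandbox this execution.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_with_tests.py"
        # Candidate code first, then the assertions that check it.
        script.write_text(candidate_source + "\n\n" + test_source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A program that hangs counts as a failure.
            return 0.0
    # Exit code 0 means every assertion held; anything else is a failure.
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
    good = "def add(a, b):\n    return a + b\n"
    bad = "def add(a, b):\n    return a - b\n"
    print(verifiable_reward(good, tests), verifiable_reward(bad, tests))
```

The signal is crude, a single pass/fail bit, but it costs no annotation time and can be computed for as many candidates as the compute budget allows. That is the property the paragraphs above describe.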
There's a way of reading The Bitter Lesson that is defeatist: 70 years of intellectual work, and the answer was always "more compute." Why bother understanding the problem? But I think that reading gets the direction wrong. What Sutton is describing isn't that intelligence doesn't matter. It's that there's a difference between intelligence applied to building systems and intelligence applied to understanding what kind of systems to build. The two decades of "wrong" computer vision research produced the datasets, the benchmarks, the understanding of the problem that gave AlexNet something to learn from. The domain knowledge didn't disappear. It became training data. The intellectual work that time deprecates is the kind that tries to substitute for scale. The kind that doesn't, the kind that understands the structure of the problem well enough to turn it into a training signal, has a much longer life.
Süskind is still in a repo I haven't touched in years. Someday I'll turn it into what it should have been from the start: a set of labels for a model that learns in two hours what took me 4 months to approximate by hand. That will also be part of the lesson. And it will be considerably less bitter.
Richard Sutton, "The Bitter Lesson", March 2019. 1,200 words that probably describe the history of AI better than any book written on the subject.