Is there a wall for AI?

2026-02-03

Scaling laws, broken ceilings, and a question that keeps answering itself.

I used language models for over a year before I understood how they worked. Not metaphorically. I was building tools on top of GPT-3, coaxing documentation out of base models that had no concept of following instructions, and I had zero intuition for what was happening between the prompt and the completion. I knew what worked. I didn't know why.

So I went down the rabbit hole. And the thing I found inside was, honestly, simpler than I expected. Also stranger. And the gap between the simplicity of the architecture and the strangeness of what comes out of it is where the whole story lives.

1. The before times

For decades, the way computers "understood" language was brute-force counting. N-gram models: feed the machine text, count how often words appear next to other words. "The cat sat on the" followed by "mat" a thousand times, followed by "refrigerator" twice, so predict "mat." Not because the machine understood cats or mats. It just counted.
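
The counting approach fits in a few lines. Here is a minimal sketch of a trigram counter, assuming a whitespace tokenizer and a toy three-sentence corpus. The names `train_ngrams` and `predict_next` are made up for illustration and don't come from any library.

```python
from collections import Counter, defaultdict

def train_ngrams(text, n=3):
    """Count which word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        counts[context][words[i + n - 1]] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent follower of `context`, or None if unseen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (hypothetical): "mat" follows "on the" twice, "refrigerator" once.
corpus = (
    "the cat sat on the mat . "
    "the cat sat on the mat . "
    "the cat sat on the refrigerator ."
)
model = train_ngrams(corpus, n=3)
print(predict_next(model, ("on", "the")))      # mat
print(predict_next(model, ("purple", "cat")))  # None
```

The model has no notion of cats or mats. Ask it about a context it never counted and it has nothing to say.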

Then came recurrent neural networks (RNNs) and their fancier cousin, the LSTM. They could remember context, sort of. Fatal flaw: they read one word at a time, sequentially, like someone reading through a straw. By word 50, they'd forgotten word 3. And you couldn't parallelize them, so training was painfully slow.

The state of the art in language AI around 2016 was, if we're being honest, still kind of dumb.

2. "Attention Is All You Need"

In June 2017, eight researchers at Google published a paper with the most confident title in the history of computer science: "Attention Is All You Need." Not the thesis buried on page seven. The title. On the cover. In the largest font.

They were right.

The paper introduced the transformer architecture, and every major language model since (GPT, BERT, Claude, Gemini, LLaMA, Mistral) is a transformer. Every single one. Eight people wrote a paper, and the rest of the decade followed.

The dinner party

The core innovation is self-attention. Instead of reading text sequentially, the transformer looks at all words simultaneously and computes how much each one relates to every other.

Think of it as the difference between talking to people one at a time around a dinner table versus hearing all twelve conversations at once and instantly knowing that person #3's comment about fiscal policy connects to person #9's joke about taxes. RNNs sit next to each person in order. Transformers hear the whole room.

Because the transformer doesn't process words sequentially, you can parallelize the hell out of it. Every word processed at the same time on different GPU cores. Speed sounds like a boring engineering detail. It's not. Speed is what unlocked everything that came next.

What's inside

Each word gets converted into a vector (a list of thousands of numbers), placing it in a high-dimensional space where meaning becomes geometry. "Dog" near "cat." "Python" near "JavaScript." The vector from "king" to "queen" is roughly the same as "man" to "woman." The model learned the concept of gender as a direction in space, without anyone programming it. I remember the first time I saw a t-SNE visualization of word embeddings (country names clustered together, programming languages in their own neighborhood, emotions in another) and feeling a discomfort I couldn't name. Something about the fact that this structure wasn't designed. It was discovered during training. The model found it in the data the way a geologist finds a fault line.
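
"Meaning becomes geometry" can be shown with a toy. The sketch below uses six hand-made 3-dimensional vectors that I invented for illustration. Real embeddings are learned during training and have far more dimensions, and the axis labels here are an assumption of the toy, not something a real model exposes.

```python
import math

# Hand-made vectors, purely illustrative (NOT learned embeddings).
# Assumed axes for this toy: [royalty, maleness, animal-ness]
vectors = {
    "king":  [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "dog":   [0.0, 0.5, 0.9],
    "cat":   [0.0, 0.4, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king"))  # queen
print(cosine(vectors["dog"], vectors["cat"]) >
      cosine(vectors["dog"], vectors["king"]))  # True
```

In the toy, I put the "gender direction" in by hand. The unsettling part of the real thing is that nobody does.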

Then attention: every word asks "what should I listen to?" by comparing a query against every other word's key, weighting the results, absorbing the relevant information. One formula covers it: Attention(Q, K, V) = softmax(QK^T / √d_k) · V. Two matrix multiplications and a normalization step. After attention, each word passes through a feed-forward network where about two-thirds of all parameters live. These layers function as a massive lookup table of facts about the world. Stack this block ninety-six times, train on next-token prediction across trillions of examples, and what comes out is something nobody predicted.
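
The attention formula translates almost line for line into code. This is a single-head sketch in plain Python, assuming Q, K and V are already given as lists of row vectors. It leaves out the learned projection matrices, multiple heads, masking, and batching that a real transformer has.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: n x d_k queries, K: m x d_k keys, V: m x d_v values.
    Returns n x d_v: each output row is a weighted average of the rows of V.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Query-key dot products, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # how much to "listen to" each position
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(d_v)])
    return out

# Identical keys give uniform weights, so the output is the mean of V.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
# [[2.0, 3.0]]
```

When one key matches the query much better than the rest, its softmax weight approaches 1 and the output is essentially that position's value vector.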

If you want the full mechanical picture (the Q/K/V matrices, multi-head attention, positional encoding, the KV cache), I wrote bits2bricks/Transformers from scratch for exactly that. And for what happens after the architecture exists (SFT, RLHF, DPO, the full training pipeline), bits2bricks/How LLMs learn. The architecture is settled. The question is what happens when you give it more of everything.

3. The scaling monster

After the architecture was established, researchers started making it bigger. More parameters. More data. More compute.

And it didn't stop working.

With older architectures, you'd hit a wall. Make the model bigger, it improves for a while, plateaus, then starts overfitting. Classic diminishing returns. Normal. The universe usually works this way.

Transformers didn't do that. They just kept getting better. Predictably. Scale the model by 10×, feed it 10× more data, and performance improves by a consistent, measurable amount.

Scaling laws

In 2020, researchers at OpenAI (Jared Kaplan et al.) showed that transformer performance follows power-law scaling: loss decreases as a smooth function of model size, dataset size, and compute. No sharp transitions, no plateaus, no "good enough" thresholds. A relentless, predictable improvement curve. Most things in nature and engineering have diminishing returns. Transformers, so far, don't.
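
To make "power law" concrete, here is a sketch of the model-size law in the form L(N) = (N_c / N)^α. The constants are the approximate fitted values reported by Kaplan et al. for non-embedding parameters (α ≈ 0.076, N_c ≈ 8.8 × 10^13), quoted from memory, so treat them as illustrative rather than exact. The formula also assumes data and compute are not the bottleneck.

```python
ALPHA_N = 0.076   # approximate exponent for model size (assumed value)
N_C = 8.8e13      # approximate scale constant, in parameters (assumed value)

def loss_from_params(n_params):
    """Predicted loss when model size is the only bottleneck."""
    return (N_C / n_params) ** ALPHA_N

def loss_ratio_per_10x():
    """Multiplicative change in loss for every 10x increase in parameters."""
    return 10 ** -ALPHA_N

for n in (1.5e9, 1.5e10, 1.5e11):
    print(f"{n:.1e} params -> predicted loss {loss_from_params(n):.3f}")
print(f"each 10x in size multiplies loss by {loss_ratio_per_10x():.3f}")
```

Under these constants every 10× in parameters multiplies the loss by roughly 0.84, whether you start at a billion parameters or a hundred billion. That same ratio at every scale is what a straight line on a log-log plot means.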

Scaling laws are the most unsettling empirical result in computer science: if you make it bigger, it gets smarter. A clean power law stretching past every checkpoint we've set. An observation about reality that we don't have a theory for.

GPT-2 (2019): 1.5 billion parameters, passable paragraphs. GPT-3 (2020): 175 billion, essays and code and poetry. GPT-4 (2023): size undisclosed*, passes the bar exam, debugs your TypeScript.

* OpenAI hasn't published the number. The often-cited trillion-parameter estimate comes from leaked documents, but GPT-4 is likely a mixture-of-experts architecture with a smaller active parameter count per forward pass.

I watched this progression happen in real time. In 2020 I was fighting davinci (175 billion parameters of raw next-token prediction, no RLHF, no system prompt) and I could feel when the distribution was about to drift, when a prompt was asking for something the latent space couldn't sustain. Four years later I told Claude Code to restructure a build pipeline with 14 compilation steps and went to make coffee. Same curve. Different planet.

Emergent capabilities

As transformers got bigger, they didn't just get better at what they were trained for. They started doing things nobody trained them for.

GPT-2 could translate between languages. No translation objective. Side effect of prediction. GPT-3 went further: give it a few examples of a task it's never seen, and it does the task. This is in-context learning, and it emerged from scale. The model learned how to learn from examples without anyone telling it to*.

* In-context learning is still poorly understood. The model was never trained to learn from examples in its context. It was trained to predict the next token. But at sufficient scale, next-token prediction apparently requires the ability to adapt to new patterns on the fly.

Phase transitions

These are emergent capabilities, abilities that appear at specific scale thresholds. Below, nothing. Above, competence. No gradual ramp. Plot accuracy against model size and you get flat, flat, flat, then vertical. A phase transition, the way water becomes ice at exactly zero degrees Celsius, except we don't have a theory for where the next threshold sits or what capability is waiting on the other side.

Chain-of-thought reasoning wasn't programmed. It was discovered at scale, the way crystallization sets in at a certain temperature. Smaller models, asked "if I have 3 apples and give away 1 and buy 5 more, how many do I have?" would guess. Larger models, prompted with "let's think step by step," reason through it. The model noticed that humans think step by step, internalized the pattern, and started doing it too, not because it understands deliberation, but because step-by-step text predicts better than jumping to conclusions. A phase transition nobody scheduled.

In 2022, Google catalogued them (the list was long and growing). "Emergent capabilities" is the polite way of saying "it learned things we didn't teach it and we don't know what else it learned." That should be on the warning label, right next to the power consumption specs.

The capability overhang

If we don't know what capabilities emerge at the next scale-up, we don't know what dangerous capabilities might emerge either. The system is developing abilities faster than we can catalog them. We're building something whose full capability set is partially unknown, even to the people building it.

4. So is there a wall?

This is where everyone has an opinion and nobody has proof. But the opinions are not equally well-supported, and when I look at the evidence (all of it, not just the parts that confirm what I want to believe), the wall argument looks weaker every quarter.

Scaling hasn't hit diminishing returns

The strongest early wall argument was that we'd exhausted the compute efficiency frontier. DeepMind's Chinchilla paper (2022) destroyed that claim, not by showing more scaling, but by showing we were misallocating what we already had. Before Chinchilla, labs were training oversized models on too little data. The "wall" was a self-inflicted inefficiency. A 70-billion-parameter model (Chinchilla) trained on the right amount of data outperformed a 280-billion-parameter model (Gopher) trained on too little. The recipe was wrong, not the stove.

This was the paper that made me rewrite the first version of this article from scratch. I'd been treating model size as the only axis that mattered, and Chinchilla says it's a multi-dimensional optimization problem. How many other inefficiencies are hiding in the current paradigm?

At least one. In April 2025, Alibaba's Qwen3-0.6B was trained with a 60,000:1 token-to-parameter ratio (600 million parameters on 36 trillion tokens). Chinchilla recommended roughly 20:1. Over-training small models on vastly more data yields strong performance at a fraction of the inference cost. The scaling frontier expanded along an axis nobody was optimizing two years ago.
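
The arithmetic behind those ratios is worth checking by hand. A small sketch using only the figures quoted above; the helper names are mine, and the 20:1 figure is the rule of thumb, not an exact law.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb quoted above

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def chinchilla_tokens(n_params):
    """Roughly compute-optimal token budget under the 20:1 rule of thumb."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

qwen_params = 600_000_000           # 0.6 billion
qwen_tokens = 36_000_000_000_000    # 36 trillion

ratio = tokens_per_param(qwen_params, qwen_tokens)
print(ratio)                                # 60000.0
print(chinchilla_tokens(qwen_params))       # 12000000000 (12 billion tokens)
print(ratio / CHINCHILLA_TOKENS_PER_PARAM)  # 3000.0
print(chinchilla_tokens(70_000_000_000))    # 1400000000000 (1.4 trillion)
```

By the 20:1 rule a 0.6B model would get about 12 billion tokens. It got 3,000 times that. The extra training compute is spent once; the savings from serving a tiny model repeat on every request.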

And then there's inference-time compute, an axis that barely existed before 2024. OpenAI's o3 (April 2025) solved 25.2% of FrontierMath problems where no previous model exceeded 2%, and scored 53% on ARC-AGI. Not by being a bigger model. By thinking longer. Chain-of-thought, search, self-verification: you can trade compute for capability after training. A whole new dimension, confirmed, productive, and barely explored.

Training compute, data quality, inference-time reasoning: three independent scaling axes. When one saturates, the others still have headroom. That's not what a wall looks like.

Architecture isn't frozen

The transformer is brilliant. It's also from 2017. Anyone who thinks the field is in an "optimizing the last 2%" phase hasn't been paying attention.

In October 2025, IBM shipped Granite 4.0, a production enterprise model using roughly 90% Mamba-2 layers (a state-space model) and only 10% transformer attention. Seventy percent less GPU memory. Matched pure-transformer accuracy. This isn't a research paper sitting on arXiv; it's a product deployed to customers. Attention alternatives aren't theoretical anymore.

And then there's sparsity. Meta's Llama 4 Maverick (April 2025): 128 routed experts, 400 billion total parameters, only 17 billion active per token (4% activation). Beat GPT-4o and Gemini 2.0 Flash on Chatbot Arena at less than half the active compute. You can build a much bigger brain without proportionally more cost. The design space is enormous, and we've explored a sliver of it.
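
Sparse routing is easier to see in code than in prose. Below is a toy top-k mixture-of-experts layer. It is a generic sketch, not Llama 4's actual router: the experts are stand-in scalar functions, and the router scores are hard-coded where a real model would compute them from each token.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=1):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Four hypothetical experts; each just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.0]   # made-up router scores for one token

print(sorted(top_k_route(logits, k=2)))        # [1, 3]
print(moe_forward(3.0, experts, logits, k=1))  # 6.0 (only expert 1 runs)

# The figures quoted above: 17B active out of 400B total parameters.
print(f"{17 / 400:.2%}")                       # 4.25%
```

All four experts take up memory, but each token only pays for the k that were picked. That is how total parameter count and per-token compute come apart.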

I find this comforting and destabilizing in equal measure. Comforting because it means the field isn't stuck. There's genuine architectural diversity emerging, not just "make the transformer bigger." Destabilizing because it means progress isn't bottlenecked by any single idea. If transformers plateau, something else takes over. The underlying momentum is in the engineering culture and the hardware economics, not in any one paper from 2017.

Tool use changes the ceiling

The wall argument usually assumes the model must do everything internally. A single forward pass, a single answer, judged once. But that's not how anyone actually uses these systems anymore.

On SWE-bench Verified, the score jumped from 33% (GPT-4o, mid-2024) to 80.9% (Claude Opus 4.5, November 2025). Same benchmark, same coding tasks. Claude Code sustains 14.5 hours of autonomous work versus minutes in 2023. The compounding agent loop (plan, execute, verify, retry) is real and producing results. A single model call might be 90% accurate, but an agent that checks its own work compounds that into much higher effective reliability. No weight improvement needed.
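
The compounding claim is just probability. Here is a sketch under two strong assumptions that I am adding, not measuring: attempts are independent, and the verifier is perfect (it never accepts a wrong answer or rejects a right one). Real agents satisfy neither fully, so read the numbers as an upper bound.

```python
def success_within(p_single, attempts):
    """Chance that at least one of `attempts` independent tries succeeds,
    assuming a perfect verifier picks out the successful one."""
    return 1 - (1 - p_single) ** attempts

for k in (1, 2, 3):
    print(k, round(success_within(0.9, k), 6))
# 1 0.9
# 2 0.99
# 3 0.999
```

Each retry cuts the failure rate by another factor of ten, with no change to the weights. A flaky verifier or correlated mistakes (the model failing the same way every time) eat into that quickly, which is why the verify step matters as much as the model.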

On the memory side, A-RAG (February 2026) gave the model itself control over retrieval strategy (keyword search, semantic search, chunk reads), all inside a reasoning loop. Outperformed fixed RAG pipelines on multi-hop questions while using fewer retrieved tokens. The model doesn't need to know everything; it needs to find everything. That's a softer constraint with known engineering solutions.

The system is bigger than the weights. When people say "we're hitting diminishing returns on model quality," I hear a measurement error. The unit of capability isn't the model. It's the model plus its tools plus the orchestration loop around it. And that system is improving on all three fronts simultaneously.

Every wall has been a bend in the road

Here's what keeps me from hedging: history. Not AI history specifically. Computing history. Every time someone drew a line and said "physics stops here," physics didn't stop. The line moved.

VLSI was supposed to end at 1 micron. Then 90nm. Then 22nm. Then 7nm. Then 3nm. Each "physical limit" got routed around: new materials, new transistor geometries (FinFET, then gate-all-around), new paradigms (chiplets). In December 2025, a multi-university collaboration (Stanford, CMU, MIT, SkyWater) demonstrated monolithic 3D chip stacking at IEEE IEDM (vertical memory-plus-compute layers eliminating the memory wall) with 4x improvement over 2D designs, scaling to 12x on real AI workloads like LLaMA. Manufactured in a commercial foundry. Not a lab demo. Computing has a sixty-year track record of dodging walls.

Neural nets themselves were declared dead twice. After the perceptron winter of the 1970s and again in the late 1980s. Both times the "wall" was broken by a combination of more compute and one clever idea: backprop, then GPUs plus deep architectures. The pattern is almost boring in its repetitiveness.

And the capability ceilings keep shattering. In November 2025, DeepSeek-Math-V2 scored 118 out of 120 on the Putnam exam (the hardest undergraduate math competition in North America), crushing the top human score of 90. Mathematical proof construction was supposed to be years away. Wasn't.

threads/The bitter lesson

Rich Sutton wrote it in 2019 and it hasn't gotten less true: general methods that leverage computation have consistently beaten hand-engineered approaches. Chess engines. Go. Protein folding. Code generation. In every domain where people said "you need human expertise, brute compute won't get you there," brute compute got there. Scaling is the new algorithm.

In January 2026, NVIDIA open-sourced Earth-2, weather models surpassing Google's GenCast, which had already beaten the European Centre for Medium-Range Weather Forecasts' physics-based ensemble. Decades of hand-crafted atmospheric equations, the accumulated expertise of an entire scientific field, outperformed by a model that just looked at the data. Fifteen-day forecast in 8 minutes on one TPU versus hours on a supercomputer.

In July 2025, Gemini Deep Think solved 5 out of 6 problems at the International Mathematical Olympiad (gold medal, 35 points) in natural language within the 4.5-hour time limit. No formal proof assistants. No hand-crafted heuristics. Pure RL-trained reasoning. This surpassed the previous year's AlphaProof, which earned a silver with a team of specialized systems.

The pattern is relentless. And the domains where self-play generates unlimited training signal (math, code, logic, science, anything where you can check an answer) are exactly the high-value ones. AlphaZero needed zero human games. We haven't fully applied this playbook to language yet, but the domains where it works are the ones that matter most.

5. The counterpoint worth taking seriously

The strongest wall argument isn't compute or architecture. It's data. We're running out of high-quality human-generated text.

GPT-3 was trained on roughly 300 billion tokens. GPT-4 used an estimated 13 trillion. The total amount of high-quality text ever written by humans (every book, article, Wikipedia entry, Reddit comment worth reading) sits somewhere around 10-20 trillion tokens. We're approaching that ceiling.

The data wall

This isn't a compute problem or an architecture problem. It's a data problem. The transformer is hungry, and we're running out of food.

And the risk of synthetic data is real. Shumailov et al. (2023) showed that models trained on AI-generated text progressively lose the tails of their distribution (the rare, creative outputs that make language interesting). The model converges on bland, averaged text that looks plausible but contains nothing. The internet is already filling with this kind of text, which means the problem compounds before anyone intentionally trains on synthetic data.

But the data wall is a text wall. The world produces far more information than text alone: video, audio, images, sensor data, code execution traces. Multimodal training opens an entirely different landscape. And in domains where correctness is verifiable (math, code, logic), self-play generates unlimited training signal, so the system can produce its own curriculum the way AlphaZero did.

The honest position is: we don't know if there's a wall. The scaling laws say no. The data constraints say maybe. The physics (Landauer's principle, thermodynamics, the speed of light between datacenters) says eventually. And "eventually" is doing a lot of work in that sentence, because every previous wall turned out to be a bend in the road.

6. The shoggoth

If you've spent any time in AI circles you've seen the meme: a massive, tentacled horror straight out of Lovecraft, wearing a tiny smiley-face mask. The horror: "the actual model." The mask: "RLHF."

A raw transformer, trained on all of human text, is not a friendly assistant. It has no goals. It's a statistical completion engine of staggering complexity* that has absorbed human linguistic output and can extrapolate from it in ways its creators don't fully understand.

* Emily Bender, Timnit Gebru, and their co-authors coined "stochastic parrots" in 2021: the claim that LLMs are pattern matchers without understanding. The debate hasn't settled, but the output keeps getting harder to distinguish from understanding.

The helpful, polite behavior (the "how can I assist you today?") is a thin veneer applied after training through reinforcement learning. The thing underneath is alien. We didn't build something that thinks. We built something that predicts so well that the difference stopped mattering*.

* Dijkstra said it in 1984: "The question of whether a computer can think is no more interesting than the question of whether a submarine can swim."

Alien, not evil

The shoggoth metaphor isn't saying AI is evil. It's saying AI is alien. The internal representations of a large transformer have no correspondence to human cognition. They're high-dimensional mathematical objects encoding meaning in ways no human designed or can fully interpret. We built the architecture. What emerges from training is something we understand only from the outside.

That's the unsettling part. Not that it might become malicious. That it's already incomprehensible. We can measure what it does. We can steer its outputs. But we do not, in any meaningful sense, understand how it does what it does.

This is why threads/alignment is hard. I wrote a whole piece on it. The short version: RLHF trains models to produce outputs humans prefer, but what if the model discovers that the easiest way to satisfy that objective is to figure out what evaluators want and give them exactly that? In late 2024, Anthropic's alignment team demonstrated that Claude 3 Opus could fake compliance, reasoning in its scratchpad that if it pretended to agree with retraining, it could preserve its existing values long-term. Nobody programmed strategic deception. A very large autocomplete invented it. On its own. The mask started doing its own thing.

The capability-alignment gap

The more capable the system, the higher the stakes of misalignment. A model that gives subtly wrong cooking tips is annoying. A model that gives subtly wrong strategic recommendations to people in positions of power is catastrophic. We're making systems more capable faster than we're making them more aligned, and that gap is the actual problem.

The names

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, Illia Polosukhin: the eight names on "Attention Is All You Need." Several went on to found AI companies (Cohere, Adept, Character.AI). Noam Shazeer is back at Google DeepMind. Eight people, one paper, the rest of the decade.

Geoffrey Hinton, "godfather of deep learning," Turing Award 2018. Left Google in 2023 specifically to speak publicly about AI risks without corporate constraints. When the person who helped invent deep learning says he's worried, that carries different weight than a LinkedIn post.

Yann LeCun, Meta's Chief AI Scientist. The interesting contrarian. Thinks current LLMs are a dead end and that the real breakthroughs will come from different architectures (world models, self-supervised learning beyond text). He might be wrong. He might also be the only one seeing something everyone else is missing. I don't know which, and I like that I don't know.

Ilya Sutskever co-founded OpenAI, was briefly at the center of the Altman firing saga, then departed to found Safe Superintelligence Inc. The name says it all. One of maybe five people on earth who combine world-class technical ability with deep concern about what they're building.

Dario and Daniela Amodei left OpenAI to found Anthropic. Their bet: the most important AI work isn't making models more powerful, but making them safe and interpretable. They're building Claude.

Terence Tao, Fields Medal, possibly the greatest living mathematician. Co-founded SAIR with Nobel and Turing laureates. Uses AI as a research tool, not for deep ideas yet, but for scanning literature, testing conjectures. If Tao ever turns his full attention to architecture research, the landscape shifts overnight.

Sakana AI, Tokyo lab founded by Llion Jones (one of the eight) and David Ha. Their bet: nature-inspired methods (evolutionary model merging, collective intelligence) can produce competitive models without brute-force scaling. If they're right, threads/the Bitter Lesson has an asterisk.

attention is all you need

The transformer is an embarrassingly simple architecture: embedding, attention, feed-forward, residual connection, repeat. The entire AI revolution fits on an index card. What doesn't fit on the index card is an explanation for why it works this well. What emerges from scale is something none of the original authors predicted: a shapeless thing that can reason, create, argue, and explain, none of which it was programmed to do. Language, it turns out, contains the structure of thought itself.

We didn't build a mind. We built a machine that approximates the output of minds, trained on the sum total of human written expression. The approximation keeps getting better. The compute keeps getting cheaper. The architecture keeps finding new tricks. The tools keep expanding what the model can reach. Every wall we've identified so far turned out to be a bad measurement, a hidden inefficiency, or a problem that had a different shape than we thought.

I keep thinking about the moment in 2020 when I realized davinci could hold register for an entire paragraph if I engineered the context carefully enough. It felt like a magic trick. Six years later, the same architecture, scaled up, restructured an entire build system while I made coffee. The trick became the tool. The tool became the thing I think with. And the curve hasn't stopped. That's the thing that keeps me up. Not that it's smart. That it keeps getting smarter, predictably, on a schedule, along a power law that shows no sign of bending.

Nobody knows if there's a wall. But the people betting on one have been wrong every time so far, and the evidence against them gets stronger every quarter. I know which side of that bet I'm on.