NVIDIA Is Fixing AI’s Foundational Flaw With Reinforcement Learning Pretraining
Instead of just guessing the next word, models are now rewarded for their hidden “chain-of-thought.” This simple change is creating a massive leap in reasoning.
I spend a lot of time thinking about how AI models “learn.” And if you strip away all the hype, it’s surprisingly simple, and a little disappointing.
For years, the dominant way to train a Large Language Model (LLM) has been something called “next-token prediction.”
In simple English? We show the AI a massive chunk of the internet and ask it, over and over again, “What word comes next?”
“The cat sat on the…”
…mat?
“Correct. Good job. Now do that a few trillion more times.”
That’s it. That’s the secret. The models get incredibly good at predicting patterns, which is why they can write emails and poems that sound human. But they aren’t really thinking. They’re just expert guessers.
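To make "next-token prediction" concrete, here's a minimal sketch in PyTorch. Everything in it is invented for illustration (the toy vocabulary, the tiny embedding-plus-linear "model," the one training sentence), but the core mechanic is the real one: show the model a sequence, shift it by one position, and penalize every wrong guess about the next token with cross-entropy loss.

```python
# A minimal sketch of next-token prediction.
# The vocabulary, model, and sentence are toy stand-ins, not any real LLM.
import torch
import torch.nn as nn

vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]
stoi = {w: i for i, w in enumerate(vocab)}

# Toy "model": embedding -> linear projection back to vocabulary logits.
model = nn.Sequential(
    nn.Embedding(len(vocab), 16),
    nn.Linear(16, len(vocab)),
)

# "The cat sat on the mat": at every position, predict the NEXT word.
tokens = torch.tensor([[stoi[w] for w in ["the", "cat", "sat", "on", "the", "mat"]]])
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position

logits = model(inputs)  # shape: (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), targets.reshape(-1)
)
loss.backward()  # repeat over trillions of tokens and that's pretraining
print(f"cross-entropy loss: {loss.item():.3f}")
```

That single loss function, applied at scale, is the whole of classical pretraining. Nothing in it rewards the model for *why* it picked a word, only for *whether* it matched the data.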
All the reasoning abilities we see in models like GPT-5 or Sonnet 4.5 usually come later, bolted on top through costly fine-tuning and reinforcement learning.
