NVIDIA Is Fixing AI’s Foundational Flaw With Reinforcement Learning Pretraining

Instead of just guessing the next word, models are now rewarded for their hidden “chain-of-thought.” This simple change is creating a massive leap in reasoning.

7 min read · Oct 3, 2025


Photo by BoliviaInteligente on Unsplash

I spend a lot of time thinking about how AI models “learn.” And if you strip away all the hype, it’s surprisingly simple, and a little disappointing.

For years, the dominant way to train a Large Language Model (LLM) has been something called “next-token prediction.”

In simple English? We show the AI a massive chunk of the internet and ask it, over and over again, “What word comes next?”

“The cat sat on the…”
…mat?
“Correct. Good job. Now do that a few trillion more times.”

That’s it. That’s the secret. The models get incredibly good at predicting patterns, which is why they can write emails and poems that sound human. But they aren’t really thinking. They’re just expert guessers.
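
To make that concrete, here is a minimal sketch of the training objective, assuming a toy six-word vocabulary and a throwaway embedding-plus-linear "model" (both invented for illustration); a real LLM does exactly this, just with billions of parameters and trillions of tokens:

```python
import torch
import torch.nn as nn

# Toy vocabulary and the sentence from above, as token ids.
vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
stoi = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([stoi[w] for w in ["the", "cat", "sat", "on", "the", "mat"]])

# A stand-in "language model": embedding + linear head over the vocabulary.
embed = nn.Embedding(len(vocab), 16)
head = nn.Linear(16, len(vocab))

# Inputs are all tokens but the last; targets are the same sequence shifted
# by one, so every position answers the same question: "What word comes next?"
logits = head(embed(tokens[:-1]))                      # (5, vocab_size) next-word scores
loss = nn.functional.cross_entropy(logits, tokens[1:])
loss.backward()                                        # nudge weights toward the true next word
```

That cross-entropy loss is the entire signal. The model is never told whether its answer was reasonable, only whether it matched the text.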

All the reasoning ability we see in models like GPT-5 or Sonnet 4.5 usually comes later, bolted on through costly post-training: supervised fine-tuning, then reinforcement learning.
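
What does that bolted-on reinforcement learning step roughly look like? Sample an answer, score it, and raise the probability of the tokens that earned the score. Below is a hedged REINFORCE-style sketch, assuming a `model` that maps a 1D tensor of token ids to per-position logits; `reward_fn` is hypothetical, standing in for a learned reward model or a verifier:

```python
import torch

def rl_step(model, prompt_ids, optimizer, reward_fn, max_new=32):
    """One REINFORCE-style update: sample an answer, score it, reinforce it."""
    ids, log_probs = prompt_ids, []
    for _ in range(max_new):
        logits = model(ids)[-1]                      # scores for the next token
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                          # sample rather than take the argmax
        log_probs.append(dist.log_prob(tok))
        ids = torch.cat([ids, tok.unsqueeze(0)])

    reward = reward_fn(ids[len(prompt_ids):])        # scalar: "was this answer good?"
    loss = -reward * torch.stack(log_probs).sum()    # push up tokens behind good answers
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The catch is that this signal arrives only after pretraining is finished, and only for whole answers, not for the reasoning that produced them. That gap is exactly what NVIDIA's approach targets.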
