Learning path

How models learned to think: chain-of-thought, RL on verifiable rewards, and the reasoning frontier

This path traces the arc from a simple prompting trick to a full training paradigm: how language models went from answering in a single pass to deliberate, step-by-step reasoning. It covers the core ideas (chain-of-thought, RL algorithms, verifiable rewards) before landing on the frontier models that embody them today.

Aimed at readers who know what a language model is and want to understand why reasoning models work the way they do — and what it took to get here.

In-depth · 7 steps · ~42 min


Chain-of-Thought Reasoning
Start here: chain-of-thought is the foundational move, letting a model reason through intermediate steps before it answers, and the training advances later in this path build on it.
PPO
PPO (Proximal Policy Optimization) was the dominant RL algorithm for fine-tuning language models before reasoning-focused training took hold; understanding it sets up why newer alternatives emerged.
GRPO
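GRPO (Group Relative Policy Optimization) is a leaner RL algorithm that replaced PPO in several reasoning pipelines; it drops PPO's separate value model and scores each sampled response against the others in its group. Reading it after PPO makes the design tradeoffs concrete.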
Reinforcement Learning with Verifiable Rewards
This is the training paradigm that ties it together: using outcomes that can be objectively checked (math answers, code tests) as reward signals, turning chain-of-thought into something you can optimize with RL.
DeepSeek V4
DeepSeek V4 is a landmark open-weight model that put these training ideas into practice at scale — a concrete case study of the pipeline you just read about.
GPT-5.5
GPT-5.5 represents the frontier from a different lineage, letting you compare how the same reasoning ideas manifest across labs and training philosophies.
Claude Opus 4.6
Close with Claude Opus 4.6 to round out the frontier picture — a third design point that shows how the reasoning paradigm has spread across the leading model families.
