Almanac

Learning path

How models learned to think: chain-of-thought, RL on verifiable rewards, and the reasoning frontier

This path traces the arc from a simple prompting trick to a full training paradigm: how language models went from answering in a single pass to working through problems step by step. It covers the core ideas (chain-of-thought, RL algorithms, verifiable rewards) before landing on the frontier models that embody them today.

Aimed at readers who know what a language model is and want to understand why reasoning models work the way they do — and what it took to get here.

In-depth · 7 steps · ~42 min

  1. Chain-of-Thought Reasoning

    Start here: chain-of-thought, letting a model reason through intermediate steps before it answers, is the foundational move that every later step in this path builds on.

  2. PPO

    PPO was the dominant RL algorithm for fine-tuning language models before reasoning-focused training took off. Understanding it, including the cost of its learned value model, sets up why newer alternatives emerged.

  3. GRPO

    GRPO is the leaner RL algorithm that replaced PPO in several reasoning pipelines: it drops the separate value model and scores each sampled answer against the others in its group. Reading it after PPO makes the design tradeoffs concrete.

  4. Reinforcement Learning with Verifiable Rewards

    This is the training paradigm that ties it together: using outcomes that can be objectively checked (math answers, code tests) as reward signals, turning chain-of-thought into something you can optimize with RL.

  5. DeepSeek V4

    DeepSeek V4 is a landmark open-weight model that put these training ideas into practice at scale — a concrete case study of the pipeline you just read about.

  6. GPT-5.5

    GPT-5.5 represents the frontier from a different lineage, letting you compare how the same reasoning ideas manifest across labs and training philosophies.

  7. Claude Opus 4.6

    Close with Claude Opus 4.6 to round out the frontier picture — a third design point that shows how the reasoning paradigm has spread across the leading model families.
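
To make steps 3 and 4 concrete before you begin, here is a minimal Python sketch of a verifiable reward feeding a GRPO-style group-relative advantage. It is an illustration under stated assumptions, not any lab's actual pipeline: the function names, the "last number in the text" answer check, and the sample completions are all hypothetical, and a real training loop would add a clipped policy-ratio objective and a KL penalty on top of these advantages.

```python
import re
import statistics


def verifiable_reward(completion: str, reference: str) -> float:
    """Hypothetical checker: 1.0 if the last number in the completion
    equals the reference answer, otherwise 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each reward against the mean and
    standard deviation of its own group (samples for the same prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four made-up chain-of-thought completions for the prompt "What is 17 + 25?"
completions = [
    "17 + 20 = 37, then 37 + 5 = 42. Answer: 42",
    "17 + 25 = 41",
    "First 10 + 20 = 30, then 7 + 5 = 12, so 42",
    "The answer is 32",
]

rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)

print(rewards)     # one 0/1 reward per completion
print(advantages)  # positive for correct answers, negative for incorrect ones
```

Completions that reach the checkable answer get a positive advantage and the rest get a negative one, with no learned value model involved. That is the design tradeoff the PPO and GRPO steps explore in detail.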