Almanac
Concept guide · Beginner

Chain-of-Thought Reasoning: Teaching AI to Show Its Work

Chain-of-Thought ReasoningBeginneractive·v1 · live·generated 6d ago

Part of these paths

TL;DRChain-of-thought reasoning is a technique that gets AI models to think through problems step by step before giving an answer — much like a student writing out their working on a math test. It has transformed what AI can tackle, pushing models into PhD-level science and competitive programming, and it has quietly become a safety tool too, since those visible reasoning steps let humans check whether the AI is on the right track.

Key takeaways

  • OpenAI's o1 model, released September 2024, was the first major product built around chain-of-thought at inference time, ranking in the 89th percentile on competitive programming and PhD-level science benchmarks.
  • Process supervision — rewarding each correct reasoning step, not just the final answer — was shown to improve math performance and produce more human-endorsed reasoning chains.
  • Mistral's Magistral (June 2025) brought chain-of-thought to eight languages, scoring 73.6% on the AIME 2024 math benchmark.
  • OpenAI's CoT-Control research found that models struggle to suppress or fake their reasoning traces — framed as a safety benefit, since the visible steps are harder to manipulate than outputs alone.
  • A 2026 study found a 'commitment boundary' in reasoning chains — a point where the answer is already decided — allowing up to 55% of reasoning tokens to be cut with negligible quality loss.
  • Hugging Face launched an Open Chain-of-Thought Leaderboard in April 2024 to standardize comparisons of reasoning quality across open-weight models.

What it is

Chain-of-thought (CoT) reasoning is a technique that prompts an AI model to work through a problem in explicit steps before delivering a final answer. Think of it like the difference between a student who just circles an answer and one who writes out every line of working — the second student is more likely to catch their own mistakes, and a teacher can see exactly where things went right or wrong.

In AI terms, the model generates a sequence of intermediate reasoning steps as ordinary text, then uses those steps as the foundation for its conclusion. This happens at inference time — meaning when you're actually using the model, not just during training.

Why should you care?

Before chain-of-thought became standard, AI models were surprisingly bad at multi-step problems: math, logic puzzles, complex coding tasks. They'd often leap to a plausible-sounding answer that was subtly wrong. CoT changed that dramatically.

The clearest demonstration came in September 2024 when OpenAI released its o1 model, built specifically around this idea. By spending more time "thinking" before responding, o1 ranked in the 89th percentile on competitive programming problems and performed at a PhD level on science benchmarks. That's a meaningful jump from what earlier models could do on the same tasks.

How it works (the basics)

When you ask a CoT-enabled model a hard question, it doesn't jump straight to the answer. Instead, it produces a chain of reasoning — sentences like "First, I need to figure out X… that means Y… therefore Z" — before landing on its conclusion. This chain is visible to you, which is part of what makes it useful.

There are two main ways models learn to reason this way:

  • Outcome supervision: the model is rewarded when it gets the right final answer, and it figures out reasoning strategies on its own.
  • Process supervision: the model is rewarded for each correct step, not just the final answer. OpenAI research showed this produces more reliable, human-endorsed reasoning chains — and it has an alignment benefit, since the model learns to reason in ways humans can verify.

Reinforcement learning — a training method where the model learns by trial and reward — is the engine behind both approaches.

The safety angle

One unexpected benefit of visible reasoning is that it makes AI easier to oversee. OpenAI's research found that monitoring a model's reasoning steps is substantially more effective than monitoring its outputs alone. They also found, through a framework called CoT-Control, that models struggle to deliberately suppress or manipulate their own reasoning traces — which means those traces are a genuine window into what the model is doing, not just a performance.

What's new and what's next

The technique has expanded in several directions:

  • Images in the reasoning chain: OpenAI extended CoT so models can incorporate images during their thinking process, not just at the start or end — useful for tasks that mix visual analysis with multi-step logic.
  • Multilingual reasoning: Mistral's Magistral models (June 2025) brought chain-of-thought to eight languages, scoring 73.6% on a hard math benchmark (AIME 2024).
  • Efficiency research: A 2026 study identified a "commitment boundary" — a point in the reasoning chain where the model has effectively already decided its answer, and further steps add nothing. Cutting reasoning at that point reduced token usage by up to 55% with negligible quality loss, pointing toward cheaper CoT in the future.
  • Latent reasoning: Some researchers are exploring whether models can reason internally without writing out every step as text — trading transparency for efficiency. This is an active area, and the tradeoff with monitorability is unresolved.

The open-weights community also got a standardized way to compare models: Hugging Face launched the Open Chain-of-Thought Leaderboard in April 2024 to track reasoning quality across models that anyone can download and run.

The bottom line

Chain-of-thought reasoning is now a foundational part of how frontier AI models work. It makes them better at hard problems, makes their reasoning checkable by humans, and is actively being refined to be cheaper and more reliable. If you're using a modern AI assistant for anything complex — coding, analysis, math — there's a good chance CoT is running under the hood.

How chain-of-thought reasoning flows

Chain-of-thought vs. related reasoning approaches

ApproachSteps visible?Key benefitKey tradeoff
Standard promptingNoFast, cheapStruggles on multi-step problems
Chain-of-thought (CoT)YesBetter accuracy; monitorableMore tokens = higher cost
Process supervisionYesHuman-endorsed steps; alignment benefitRequires step-level labels
Latent reasoning (e.g. RiM)NoCompute-efficientLess transparent; harder to monitor

Synthesized from the events bundle; unknown cells render —.

Timeline

  1. Process supervision shown to improve math reasoning by rewarding correct steps

  2. Hugging Face launches Open Chain-of-Thought Leaderboard

  3. OpenAI releases o1 — first major model built around inference-time chain-of-thought

  4. OpenAI extends chain-of-thought to images — visual reasoning mid-thought

  5. OpenAI publishes CoT monitorability evaluation suite across 24 environments

  6. CoT-Control research finds models can't easily suppress reasoning traces — a safety positive

  7. 'Commitment boundary' discovery enables 55% CoT length reduction with negligible quality loss

Related topics

OpenAIMistral AIOpenAI Reasoning Modelsprocess supervisionoutcome supervisionCoT-ControlLarge Reasoning ModelsChain-of-Thought Monitorability Evaluation SuiteProbe TrajectoriesReinforcement Learning

FAQ

What is chain-of-thought reasoning in plain English?

It's a way of getting an AI to write out its thinking step by step before giving a final answer — like asking someone to show their working rather than just shout out the answer. This usually leads to better results on hard problems.

Why does showing its work make an AI smarter?

Breaking a hard problem into smaller steps lets the model catch mistakes mid-way and build on each step correctly, rather than trying to leap to an answer in one go.

Does chain-of-thought make AI more expensive to run?

Yes — generating extra reasoning tokens costs more compute and time. Researchers are actively working on this: one 2026 study found you can cut reasoning length by up to 55% once the model has effectively 'decided' its answer, with little quality loss.

Is the reasoning the AI shows actually what it's 'thinking'?

Not necessarily — research is ongoing on whether visible reasoning traces faithfully reflect internal computation. However, OpenAI's CoT-Control work found models struggle to deliberately fake or suppress their reasoning, which is treated as a reassuring safety property.

Which AI products use chain-of-thought today?

OpenAI's o1 family was the first major product built around it; Mistral's Magistral models brought it to eight languages; and the technique now underpins most frontier reasoning models across the industry.

Stay current

Call Me Almanac pairs the week's AI news with guides like this one — Midweek & Sunday.

Versions

  • v1live6d ago

Related guides (4)

More on Chain-of-Thought Reasoning (6)

6arXiv · cs.CL·1mo ago·source ↗

Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models

This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.

5Hugging Face Blog·1mo ago·source ↗

Introducing the Open Chain of Thought Leaderboard

Hugging Face has launched the Open Chain of Thought Leaderboard, a benchmarking platform specifically designed to evaluate open-weight language models on chain-of-thought reasoning capabilities. The leaderboard tracks model performance across reasoning-intensive tasks that require multi-step inference. This initiative aims to provide standardized, reproducible comparisons of CoT reasoning quality across the open-weights ecosystem.

7Openai Blog·1mo ago·source ↗

Reasoning models struggle to control their chains of thought, and that's good

OpenAI introduces CoT-Control, a framework for evaluating how well reasoning models can deliberately manipulate or suppress their chain-of-thought outputs. The finding that models struggle to control their CoT is framed as a positive safety property, reinforcing the argument that visible reasoning traces serve as a meaningful monitorability safeguard. This contributes to ongoing research on whether chain-of-thought transparency is a reliable alignment and oversight tool.

7Openai Blog·1mo ago·source ↗

Evaluating chain-of-thought monitorability

OpenAI introduces a framework and evaluation suite for assessing chain-of-thought monitorability, comprising 13 evaluations across 24 environments. The research finds that monitoring a model's internal reasoning is substantially more effective than monitoring outputs alone. The work is positioned as a step toward scalable oversight and control of increasingly capable AI systems.

7Openai Blog·1mo ago·source ↗

Thinking with images

OpenAI announced a new capability allowing its reasoning models to incorporate images directly into their chain-of-thought process, enabling visual reasoning during intermediate thinking steps rather than only at input/output boundaries. This extends multimodal reasoning to the internal computation layer, potentially improving performance on tasks requiring visual analysis combined with multi-step reasoning. The announcement comes from OpenAI's official blog, indicating a product-level capability update.

9Openai Blog·1mo ago·source ↗

Introducing OpenAI o1

OpenAI announced o1, a new series of AI models designed to spend more time 'thinking' before responding, using chain-of-thought reasoning to tackle complex problems in science, coding, and mathematics. The o1-preview and o1-mini models are being released, with o1-preview representing the most capable version and o1-mini offering a faster, cheaper alternative optimized for coding and reasoning tasks. OpenAI claims o1-preview ranks in the 89th percentile on competitive programming problems and performs at a PhD level on science benchmarks. This release marks a significant shift in OpenAI's approach to scaling, moving from purely training-time compute to inference-time compute as a new axis of capability improvement.