Almanac
Concept guide · In-depth

Chain-of-Thought Reasoning: Mechanism, Variants, and the Frontier of Inference-Time Compute

TL;DR: Chain-of-thought reasoning transforms a language model from a single-step answer machine into a system that works through problems in explicit intermediate steps, and training that process with reinforcement learning turned it into the dominant paradigm for frontier capability gains. What began as a prompting trick has matured into a full engineering discipline, with active research on how to supervise the steps, how to monitor them for safety, and how to make them cheaper without losing what they buy.

Key takeaways

  • OpenAI's o1 (September 2024) marked the inflection point where CoT moved from prompting technique to a dedicated inference-time compute axis; OpenAI reported that o1-preview ranked in the 89th percentile on competitive programming problems and performed at a PhD level on science benchmarks.
  • Process supervision — rewarding each correct reasoning step rather than only the final answer — was shown to improve math benchmark performance and produce human-endorsed reasoning traces.
  • A 2026 study found a 'commitment boundary' in CoT traces where the answer stabilizes; exploiting it cuts CoT length by up to 55% with negligible quality loss.
  • OpenAI's CoT-Control framework found that reasoning models struggle to deliberately suppress or manipulate their own chain-of-thought, framing this as a positive safety property for monitorability.
  • Latent reasoning alternatives (RiM, STORM) replace token-by-token CoT with internal memory blocks or continuous trajectories, trading interpretability for inference efficiency.
  • Mistral's Magistral (June 2025) extended CoT to native multilingual reasoning across eight languages, and Hugging Face launched a dedicated open-weight CoT leaderboard in April 2024.

What it is

Chain-of-thought (CoT) reasoning is a technique that instructs or trains a language model to produce explicit intermediate steps on the way to a final answer. Instead of mapping a prompt directly to an output token, the model externalizes a scratchpad: it writes out sub-problems, partial conclusions, and self-corrections before committing. The result is a model that can decompose hard problems the same way a human expert might — and whose work can, in principle, be inspected.

How it works

The basic mechanism

At inference time, CoT reasoning produces a sequence of reasoning tokens before the answer tokens. These tokens are sampled from the model's output distribution and generated autoregressively, so each step conditions on all prior steps. The chain can be elicited by prompting ("think step by step") or baked in through training.
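As a minimal sketch of the prompted variant, the snippet below builds a direct prompt and a step-by-step prompt for the same question, then pulls the final answer out of a reasoning trace. The function names, the "Answer:" marker convention, and the sample trace are assumptions made for illustration; no model is called.

```python
import re
from typing import Optional

# Hypothetical instruction; any wording that asks for visible steps would do.
COT_SUFFIX = (
    "Let's think step by step, then give the result on a final line "
    "starting with 'Answer:'."
)


def build_direct_prompt(question: str) -> str:
    """Ask for the answer immediately, with no visible reasoning."""
    return f"{question}\nAnswer:"


def build_cot_prompt(question: str) -> str:
    """Ask the model to write intermediate steps before the answer."""
    return f"{question}\n{COT_SUFFIX}"


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    matches = re.findall(r"^Answer:\s*(.+)$", trace, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# Hand-written trace standing in for model output.
sample_trace = (
    "There are 3 boxes with 4 pens each, so 3 * 4 = 12 pens.\n"
    "Two pens are removed, so 12 - 2 = 10.\n"
    "Answer: 10"
)

if __name__ == "__main__":
    print(build_cot_prompt("3 boxes hold 4 pens each. 2 pens are removed. How many remain?"))
    print(extract_final_answer(sample_trace))
```

Taking the last marker rather than the first lets a trace revise an earlier guess, which matches the self-correction behavior described above.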

Training CoT with reinforcement learning

The step change came when OpenAI trained CoT behavior directly via reinforcement learning rather than relying on prompting alone. The o1 model series, released in September 2024, treats inference-time compute as an independent scaling axis: the model is given a compute budget to "think" before answering, and longer thinking correlates with better answers on hard tasks. OpenAI reported that o1-preview ranked in the 89th percentile on competitive programming problems and performed at a PhD level on science benchmarks, gains attributed specifically to this RL-trained reasoning process.

Process supervision vs. outcome supervision

A key design choice in training CoT models is what to reward. Outcome supervision rewards only the final answer; process supervision rewards each correct intermediate step. OpenAI research demonstrated that process supervision improves mathematical problem-solving performance and carries an alignment benefit: models trained this way produce reasoning traces that humans endorse, creating a direct link between capability improvement and interpretable behavior.
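The difference between the two reward schemes can be sketched in a few lines. This is a toy illustration, not OpenAI's training code: the step labels are hand-written stand-ins for human or reward-model judgments, and scoring a whole trace as the product of its step rewards is one assumed aggregation among several possible ones.

```python
from math import prod
from typing import List, Sequence


def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Outcome supervision: a single reward based only on the final answer."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0


def process_rewards(step_is_correct: Sequence[bool]) -> List[float]:
    """Process supervision: one reward per intermediate step."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


def trace_score(step_rewards: Sequence[float]) -> float:
    """Assumed aggregation: the product, so any bad step zeroes the trace."""
    return prod(step_rewards)


# A trace that lands on the right answer through a flawed middle step.
step_labels = [True, False, True]

if __name__ == "__main__":
    print(outcome_reward("10", "10"))
    print(process_rewards(step_labels))
    print(trace_score(process_rewards(step_labels)))
```

In this example the outcome reward cannot tell the flawed trace from a sound one, while the per-step rewards pinpoint the step that went wrong.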

Why it matters

CoT reasoning shifted the frontier capability conversation from "how much can you train?" to "how much can you think at inference time?" It also opened a new surface for safety work: if a model's reasoning is visible, it can be monitored. OpenAI's monitorability evaluation suite — 13 evaluations across 24 environments — found that monitoring internal reasoning is substantially more effective than monitoring outputs alone. The CoT-Control framework further found that reasoning models struggle to deliberately suppress or manipulate their own chain-of-thought, which is framed as a positive safety property: the reasoning trace is hard to fake.

Variants and extensions

Multimodal CoT

In April 2025, OpenAI extended CoT to incorporate images directly into intermediate reasoning steps — not just at input/output boundaries. This allows the model to visually analyze a diagram mid-chain, enabling richer spatial and perceptual reasoning.

Multilingual CoT

Mistral's Magistral (June 2025) demonstrated native multilingual chain-of-thought reasoning across eight major languages, with Magistral Medium scoring 73.6% on AIME 2024 (90% with majority voting at 64 samples). This showed that RL-trained CoT is not English-only and can generalize across linguistic contexts.
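Majority voting, as used for the 64-sample figure, can be sketched as follows. This is a generic illustration rather than Mistral's evaluation code; the function name and the normalization (stripping whitespace, skipping missing answers) are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer among sampled traces.

    Ties go to the answer that was seen first. Returns None when no
    sample produced a usable answer.
    """
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    # Final answers extracted from five independently sampled traces.
    sampled = ["10", "12", "10", None, "10 "]
    print(majority_vote(sampled))
```

The vote only helps when errors are scattered across different wrong answers; if the samples agree on the same mistake, more samples do not fix it.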

Structured pre-planning

The PPC (Preplan-Plan-CoT) framework adds an explicit problem-understanding stage before the planning and execution stages. The preplan captures the problem type, applicable tools, and foreseeable pitfalls, filling a gap in plan-based methods that specify "how" to solve a problem without first clarifying "what" is being solved. Evaluated across five math benchmarks, PPC improves on the strongest baseline by +2.23 on maj@16 and +3.06 on pass@16 at no additional inference token cost.
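For readers unfamiliar with the metrics: maj@k scores the single most common answer among k samples, while pass@k counts a problem as solved if any of the k samples is correct. The sketch below uses the widely used combinatorial estimator for pass@k; whether the PPC evaluation computes it this way is an assumption, and the per-problem counts are made up.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few incorrect samples to fill a draw of size k.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical counts of correct samples (out of 16) for three problems.
    print(mean_pass_at_k([0, 1, 16], n=16, k=16))
```

When n equals k, as in pass@16 from 16 samples, the estimator reduces to checking whether any sample is correct.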

Latent reasoning: CoT without text

Two research directions challenge the assumption that reasoning must be expressed in natural-language tokens:

  • RiM (Reasoning in Memory) replaces autoregressive CoT token generation with fixed sequences of special "memory block" tokens processed in a single forward pass. A two-stage curriculum first grounds the memory blocks by predicting explicit reasoning steps, then discards step-level supervision. RiM matches or exceeds existing latent reasoning methods while improving compute efficiency.
  • STORM applies a similar idea to video-language models, using bounded continuous latent trajectories rather than textual CoT for spatial-temporal reasoning. At inference time, no video regeneration or frame reinsertion is required, reducing latency versus tool-based pipelines.

Both approaches trade interpretability for efficiency — a meaningful tradeoff given that CoT token generation is the dominant cost in reasoning-heavy workloads.

Efficiency: the commitment boundary

A 2026 arXiv preprint introduced the concept of a commitment boundary: a sharp transition point in a CoT trace where the model's answer stabilizes and subsequent reasoning steps become causally inert ("epiphenomenal"). The boundary can be linearly decoded from intermediate hidden states using early-exit probing, and it generalizes across tasks. Exiting reasoning at the commitment boundary reduces average CoT length by up to 55% with negligible performance loss, giving a direct handle on inference cost for deployed reasoning models.
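A rough sketch of how such a signal could drive early exit follows. It assumes a probe that emits, after each reasoning step, a confidence that the answer has stabilized. The threshold, the patience rule, and the confidence values are hypothetical choices for illustration, not the preprint's procedure.

```python
from typing import Sequence


def early_exit_step(
    confidences: Sequence[float], threshold: float = 0.9, patience: int = 2
) -> int:
    """Index of the step at which to stop reasoning.

    Stops once the probe confidence has been at or above `threshold`
    for `patience` consecutive steps; otherwise runs to the last step.
    """
    if not confidences:
        raise ValueError("need at least one reasoning step")
    streak = 0
    for step, confidence in enumerate(confidences):
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return step
    return len(confidences) - 1


def length_reduction(total_steps: int, exit_step: int) -> float:
    """Fraction of steps skipped by exiting after `exit_step` (0-indexed)."""
    return 1.0 - (exit_step + 1) / total_steps


if __name__ == "__main__":
    # Made-up probe readings for a ten-step trace.
    probe = [0.2, 0.4, 0.95, 0.97, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99]
    stop = early_exit_step(probe)
    print(stop, length_reduction(len(probe), stop))
```

The patience parameter guards against a single spurious high reading; a larger value trades some of the savings for a lower risk of exiting before the answer has truly settled.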

Safety and monitorability

The visibility of CoT traces has made them a focus of alignment research. Probe trajectory analysis — tracking the continuous evolution of concept probabilities across CoT tokens — achieves up to 95% AUROC for predicting model behavior in safety and mathematics domains, outperforming single static probes. The temporal features (volatility, trend, steady-state) carry more signal than any snapshot, positioning probe trajectories as a complementary safety monitoring layer for large reasoning models where CoT faithfulness cannot be assumed unconditionally.
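To make the temporal features concrete, the sketch below summarizes a sequence of probe probabilities, one per CoT token or step. The exact feature definitions are assumptions: volatility is taken as the standard deviation of step-to-step changes, trend as a least-squares slope, and steady state as the mean of the last few readings.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def trajectory_features(probs: Sequence[float], tail: int = 3) -> Dict[str, float]:
    """Summarize how a probe's concept probability evolves over a trace."""
    if len(probs) < 2:
        raise ValueError("need at least two probe readings")
    # Step-to-step changes; their spread is the assumed volatility measure.
    diffs = [b - a for a, b in zip(probs, probs[1:])]
    # Ordinary least-squares slope of probability against position.
    xs = range(len(probs))
    x_mean, y_mean = mean(xs), mean(probs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, probs)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return {
        "volatility": pstdev(diffs),
        "trend": slope,
        "steady_state": mean(probs[-tail:]),
        "max_pool": max(probs),
    }


if __name__ == "__main__":
    # Made-up readings from a hypothetical "unsafe intent" probe.
    print(trajectory_features([0.05, 0.10, 0.60, 0.55, 0.70, 0.72]))
```

A downstream classifier would consume these features instead of a single probe reading, which is where the reported gain over static probes comes from.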

Research on self-refinement via question-asking found a gap between detection and recovery: probes trained on hidden states before question generation are predictive of final answer correctness, but interventions are as likely to harm correct trajectories as to fix incorrect ones — a caution against over-relying on LLM self-diagnosis.

Benchmarking the ecosystem

Hugging Face launched the Open Chain-of-Thought Leaderboard in April 2024, providing standardized, reproducible comparisons of CoT reasoning quality across open-weight models. This infrastructure matters because CoT capability is not uniform across model families, and benchmark-level comparisons are the primary signal practitioners use to select models for reasoning-intensive deployments.

Tradeoffs and when not to use it

CoT reasoning is not free. It multiplies token generation, and therefore latency and cost, in proportion to the length of the reasoning trace. For tasks where the answer is simple or the model is already highly confident, CoT adds overhead without benefit. The commitment boundary research quantifies this: a large fraction of generated reasoning tokens are post-hoc and causally inert. Latent reasoning methods (RiM, STORM) address this by internalizing computation, but at the cost of the interpretability that makes CoT traces useful for monitoring and alignment. The right choice depends on whether the deployment values auditability or throughput more.
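A back-of-the-envelope cost model makes the overhead visible. All numbers below (token counts, price, decoding speed) are hypothetical placeholders, and the model ignores prompt tokens and batching effects.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Workload:
    answer_tokens: int
    reasoning_tokens: int
    price_per_1k_output_tokens: float  # hypothetical price, any currency
    tokens_per_second: float  # hypothetical decoding speed

    @property
    def total_tokens(self) -> int:
        return self.answer_tokens + self.reasoning_tokens

    def cost(self) -> float:
        """Output-token cost for one request."""
        return self.total_tokens / 1000 * self.price_per_1k_output_tokens

    def latency_seconds(self) -> float:
        """Decoding time for one request at a constant token rate."""
        return self.total_tokens / self.tokens_per_second


def with_shorter_trace(workload: Workload, reduction: float) -> Workload:
    """Same workload with the reasoning trace cut by `reduction` (0 to 1)."""
    kept = round(workload.reasoning_tokens * (1.0 - reduction))
    return replace(workload, reasoning_tokens=kept)


if __name__ == "__main__":
    full = Workload(200, 1800, price_per_1k_output_tokens=0.01, tokens_per_second=50.0)
    trimmed = with_shorter_trace(full, 0.55)
    print(full.cost(), full.latency_seconds())
    print(trimmed.cost(), trimmed.latency_seconds())
```

Because the answer tokens are unchanged, a 55% cut to the trace yields a somewhat smaller cut to total cost and latency; the shorter the answer relative to the reasoning, the closer the two figures get.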

Chain-of-Thought: from prompting to RL-trained reasoning and its frontiers

CoT variants and alternatives

Approach | How reasoning is expressed | Inference cost | Interpretability | Key tradeoff
Standard CoT (prompted) | Explicit token-by-token steps | Higher (more tokens) | High | Cost vs. accuracy gain
RL-trained CoT (o1-style) | Explicit steps trained via RL | Higher; controllable effort | High | Compute budget vs. capability
Process supervision | Explicit steps with per-step reward signal | Training cost; same inference | High | Labeling cost vs. alignment benefit
Latent reasoning (RiM) | Internal memory-block tokens, not text | Lower (single forward pass) | Low | Efficiency vs. interpretability
Latent trajectories (STORM) | Continuous latent tokens, no text output | Lower; no tool calls | Low | Speed vs. auditability
Outcome supervision | No intermediate steps rewarded | Standard | N/A | Simpler training vs. weaker step quality

Timeline

  1. Process supervision paper: rewarding correct steps improves math reasoning and alignment

  2. Hugging Face launches Open Chain-of-Thought Leaderboard for open-weight models

  3. OpenAI o1 ships: RL-trained CoT becomes a dedicated inference-time compute axis

  4. OpenAI extends CoT to visual reasoning: images incorporated into intermediate thinking steps

  5. OpenAI publishes CoT monitorability evaluation suite across 13 evals and 24 environments

  6. CoT-Control framework: models struggle to suppress their own CoT — framed as a safety property

  7. Mistral Magistral: native multilingual CoT reasoning across 8 languages

  8. Commitment boundary research: 55% CoT length reduction with negligible quality loss

Related topics

OpenAI, Reinforcement Learning, outcome supervision, CoT-Control, Probe Trajectories, process supervision, Large Reasoning Models, Mistral AI, OpenAI Reasoning Models, Chain-of-Thought Monitorability Evaluation Suite

FAQ

What is chain-of-thought reasoning?

It is a technique where a language model generates explicit intermediate reasoning steps before producing a final answer, rather than mapping input directly to output in one step.

How is RL-trained CoT different from prompted CoT?

Prompted CoT elicits step-by-step reasoning at inference time; RL-trained CoT (as in OpenAI's o1) bakes the reasoning behavior into the model's weights through reinforcement learning, making it a first-class capability rather than a prompt artifact.

What is process supervision and why does it matter?

Process supervision rewards each correct intermediate reasoning step rather than only the final answer; OpenAI research showed this improves math benchmark performance and produces reasoning traces that humans endorse, creating a synergy between capability and alignment.

Are CoT traces reliable for safety monitoring?

OpenAI's monitorability research found that monitoring internal reasoning is substantially more effective than monitoring outputs alone, and that models struggle to deliberately suppress their CoT — both results support using visible reasoning as a meaningful oversight signal, though faithfulness cannot be assumed unconditionally.

What are the main alternatives to token-level CoT?

Latent reasoning methods like RiM replace explicit token generation with internal memory-block tokens processed in a single forward pass, and STORM uses continuous latent trajectories for video reasoning — both trade interpretability for lower inference overhead.

How much can CoT token usage be reduced without hurting quality?

Research on the 'commitment boundary' (the point where a model's answer stabilizes) found that exiting reasoning at that boundary reduces average CoT length by up to 55% with negligible performance loss.

More on Chain-of-Thought Reasoning (6)

arXiv · cs.CL · 1mo ago

Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models

This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.

Hugging Face Blog · 1mo ago

Introducing the Open Chain of Thought Leaderboard

Hugging Face has launched the Open Chain of Thought Leaderboard, a benchmarking platform specifically designed to evaluate open-weight language models on chain-of-thought reasoning capabilities. The leaderboard tracks model performance across reasoning-intensive tasks that require multi-step inference. This initiative aims to provide standardized, reproducible comparisons of CoT reasoning quality across the open-weights ecosystem.

OpenAI Blog · 1mo ago

Reasoning models struggle to control their chains of thought, and that's good

OpenAI introduces CoT-Control, a framework for evaluating how well reasoning models can deliberately manipulate or suppress their chain-of-thought outputs. The finding that models struggle to control their CoT is framed as a positive safety property, reinforcing the argument that visible reasoning traces serve as a meaningful monitorability safeguard. This contributes to ongoing research on whether chain-of-thought transparency is a reliable alignment and oversight tool.

OpenAI Blog · 1mo ago

Evaluating chain-of-thought monitorability

OpenAI introduces a framework and evaluation suite for assessing chain-of-thought monitorability, comprising 13 evaluations across 24 environments. The research finds that monitoring a model's internal reasoning is substantially more effective than monitoring outputs alone. The work is positioned as a step toward scalable oversight and control of increasingly capable AI systems.

OpenAI Blog · 1mo ago

Thinking with images

OpenAI announced a new capability allowing its reasoning models to incorporate images directly into their chain-of-thought process, enabling visual reasoning during intermediate thinking steps rather than only at input/output boundaries. This extends multimodal reasoning to the internal computation layer, potentially improving performance on tasks requiring visual analysis combined with multi-step reasoning. The announcement comes from OpenAI's official blog, indicating a product-level capability update.

OpenAI Blog · 1mo ago

Introducing OpenAI o1

OpenAI announced o1, a new series of AI models designed to spend more time 'thinking' before responding, using chain-of-thought reasoning to tackle complex problems in science, coding, and mathematics. The o1-preview and o1-mini models are being released, with o1-preview representing the most capable version and o1-mini offering a faster, cheaper alternative optimized for coding and reasoning tasks. OpenAI claims o1-preview ranks in the 89th percentile on competitive programming problems and performs at a PhD level on science benchmarks. This release marks a significant shift in OpenAI's approach to scaling, moving from purely training-time compute to inference-time compute as a new axis of capability improvement.