What it is
Chain-of-thought (CoT) reasoning is a technique in which a language model is prompted or trained to produce explicit intermediate steps on the way to a final answer. Instead of mapping a prompt directly to an answer, the model externalizes a scratchpad: it writes out sub-problems, partial conclusions, and self-corrections before committing. The result is a model that can decompose hard problems much as a human expert might, and whose work can, in principle, be inspected.
How it works
The basic mechanism
At inference time, a CoT model generates a sequence of reasoning tokens before the tokens of the final answer. Reasoning tokens are sampled from the same output distribution as any other token and are generated autoregressively, so each step conditions on all prior steps. The chain can be elicited by prompting ("think step by step") or baked in through training.
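As a minimal sketch of the prompting route, the helpers below wrap a question in a step-by-step instruction and then split a completion into its reasoning and its answer. The names `build_cot_prompt`, `split_trace`, and the `Final answer:` marker are hypothetical choices made for illustration; no particular model API is assumed.

```python
from typing import Optional, Tuple

# Hypothetical delimiter; any unambiguous marker would work.
ANSWER_MARKER = "Final answer:"


def build_cot_prompt(question: str) -> str:
    """Wrap a question in a zero-shot chain-of-thought instruction."""
    return (
        f"Question: {question}\n"
        "Think step by step, then give the result on a last line "
        f"starting with '{ANSWER_MARKER}'.\n"
    )


def split_trace(completion: str) -> Tuple[str, Optional[str]]:
    """Separate the reasoning text from the final answer, if one was emitted."""
    idx = completion.rfind(ANSWER_MARKER)
    if idx == -1:
        # The model never committed to an answer; everything is reasoning.
        return completion.strip(), None
    reasoning = completion[:idx].strip()
    answer = completion[idx + len(ANSWER_MARKER):].strip()
    return reasoning, answer
```

For a completion such as `"17 * 3 = 51.\nFinal answer: 51"`, `split_trace` returns the reasoning `"17 * 3 = 51."` and the answer `"51"`. Keeping the two parts separate is what lets the reasoning be logged or monitored independently of the answer.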
Training CoT with reinforcement learning
The step change came when OpenAI trained CoT behavior directly via reinforcement learning rather than relying on prompting alone. The o1 model series, first released as o1-preview in September 2024, treats inference-time compute as an independent scaling axis: the model is given a compute budget to "think" before answering, and longer thinking correlates with better answers on hard tasks. OpenAI reported that o1 ranked in the 89th percentile on competitive programming problems (Codeforces) and exceeded PhD-level human accuracy on a science benchmark (GPQA), gains it attributed specifically to this RL-trained reasoning process.
Process supervision vs. outcome supervision
A key design choice in training CoT models is what to reward. Outcome supervision rewards only the final answer; process supervision rewards each correct intermediate step. OpenAI research demonstrated that process supervision improves mathematical problem-solving performance and carries an alignment benefit: models trained this way produce reasoning traces that humans endorse, creating a direct link between capability improvement and interpretable behavior.
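The difference between the two reward signals can be sketched in a few lines. This is an illustrative toy, not OpenAI's training setup: the function names are hypothetical, and the step-level labels are assumed to come from human annotators or a learned reward model.

```python
from typing import List


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome supervision: a single scalar for the whole trace."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_rewards(step_is_correct: List[bool]) -> List[float]:
    """Process supervision: one reward per intermediate step."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


def first_error(step_is_correct: List[bool]) -> int:
    """Index of the first incorrect step, or -1 if every step is correct."""
    for i, ok in enumerate(step_is_correct):
        if not ok:
            return i
    return -1
```

Consider a trace whose second step is wrong but whose final answer happens to be right. `outcome_reward("51", "51")` gives `1.0`, while `process_rewards([True, False, True])` gives `[1.0, 0.0, 1.0]` and `first_error` returns `1`. Only the process signal penalizes the flawed step.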
Why it matters
CoT reasoning shifted the frontier capability conversation from "how much can you train?" to "how much can you think at inference time?" It also opened a new surface for safety work: if a model's reasoning is visible, it can be monitored. OpenAI's monitorability evaluation suite — 13 evaluations across 24 environments — found that monitoring internal reasoning is substantially more effective than monitoring outputs alone. The CoT-Control framework further found that reasoning models struggle to deliberately suppress or manipulate their own chain-of-thought, which is framed as a positive safety property: the reasoning trace is hard to fake.
Variants and extensions
Multimodal CoT
In April 2025, OpenAI extended CoT to incorporate images directly into intermediate reasoning steps — not just at input/output boundaries. This allows the model to visually analyze a diagram mid-chain, enabling richer spatial and perceptual reasoning.
Multilingual CoT
Mistral's Magistral (June 2025) demonstrated native multilingual chain-of-thought reasoning across eight major languages, with Magistral Medium scoring 73.6% on AIME2024 (90% with majority voting at 64 samples). This established that RL-trained CoT is not English-only and can generalize across linguistic contexts.
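Majority voting, the aggregation behind the 64-sample figure, can be sketched as follows. The function name is hypothetical, and the code assumes the final answers have already been extracted from each sampled trace and normalized into comparable strings.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent answer among sampled traces.

    Traces that produced no answer (None) are ignored. Ties are not
    handled specially in this sketch.
    """
    votes = Counter(a.strip() for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

With samples `["42", "41", "42", None, "42"]` the vote returns `"42"`. The accuracy gain comes at a price: voting over k samples generates roughly k times as many reasoning tokens per question.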
Structured pre-planning
The PPC (Preplan-Plan-CoT) framework adds an explicit problem-understanding stage before the planning and execution stages. The preplan captures the problem type, applicable tools, and foreseeable pitfalls. This fills a gap in plan-based methods, which specify "how" to solve a problem without first clarifying "what" is being solved. Evaluated across five math benchmarks, PPC improves on the strongest baseline by +2.23 maj@16 and +3.06 pass@16 at no additional inference token cost.
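The three-stage ordering can be illustrated with a small data structure. This is only a sketch of the stage layout described above: the `Preplan` class, the section tags, and `render_ppc_scaffold` are hypothetical, and the actual PPC prompt or training format may look quite different.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Preplan:
    """Problem-understanding stage: what is being solved, before how."""
    problem_type: str
    tools: List[str] = field(default_factory=list)
    pitfalls: List[str] = field(default_factory=list)


def render_ppc_scaffold(question: str, preplan: Preplan) -> str:
    """Lay out the preplan, then leave the plan and CoT sections open."""
    lines = [
        f"Problem: {question}",
        "[Preplan]",
        f"Type: {preplan.problem_type}",
        "Tools: " + ", ".join(preplan.tools),
        "Pitfalls: " + "; ".join(preplan.pitfalls),
        "[Plan]",  # the high-level solution steps would go here
        "[CoT]",   # the step-by-step execution would go here
    ]
    return "\n".join(lines)
```

The point of the layout is ordering: the problem type, tools, and pitfalls are fixed before any solution steps are written.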
Latent reasoning: CoT without text
Two research directions challenge the assumption that reasoning must be expressed in natural-language tokens:
- RiM (Reasoning in Memory) replaces autoregressive CoT token generation with fixed sequences of special "memory block" tokens processed in a single forward pass. A two-stage curriculum first grounds the memory blocks by predicting explicit reasoning steps, then discards step-level supervision. RiM matches or exceeds existing latent reasoning methods while improving compute efficiency.
- STORM applies a similar idea to video-language models, using bounded continuous latent trajectories rather than textual CoT for spatial-temporal reasoning. At inference time, no video regeneration or frame reinsertion is required, reducing latency versus tool-based pipelines.
Both approaches trade interpretability for efficiency — a meaningful tradeoff given that CoT token generation is the dominant cost in reasoning-heavy workloads.
Efficiency: the commitment boundary
A 2026 arXiv preprint introduced the concept of a commitment boundary: a sharp transition point in a CoT trace where the model's answer stabilizes and subsequent reasoning steps become causally inert ("epiphenomenal"). The boundary can be linearly decoded from intermediate hidden states using early-exit probing and generalizes across tasks. Exploiting this signal to exit reasoning at the commitment boundary reduces CoT length by up to 55% on average with negligible performance loss — a direct handle on inference cost for deployed reasoning models.
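A toy version of early exit with a linear probe is sketched below. The probe weights, the sigmoid readout, and the 0.9 threshold are illustrative assumptions, not details taken from the preprint; a real probe would be trained on the model's hidden states.

```python
import math
from typing import Sequence


def probe_score(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear probe with a sigmoid readout: estimated P(answer has stabilized)."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def exit_step(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.9,  # hypothetical cut-off
) -> int:
    """Number of reasoning steps to keep before exiting.

    Stops at the first step whose probe score reaches the threshold;
    if no step does, the full trace is kept.
    """
    for i, hidden in enumerate(hidden_states):
        if probe_score(hidden, weights, bias) >= threshold:
            return i + 1
    return len(hidden_states)
```

With one-dimensional hidden states `[[-2.0], [0.0], [3.0], [5.0]]`, weights `[1.0]`, and bias `0.0`, the probe first crosses 0.9 at the third step, so `exit_step` returns `3` and the last quarter of the trace is never generated.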
Safety and monitorability
The visibility of CoT traces has made them a focus of alignment research. Probe trajectory analysis, which tracks how concept probabilities evolve across CoT tokens, achieves up to 95% AUROC for predicting model behavior in safety and mathematics domains, outperforming single static probes. The temporal features (volatility, trend, steady-state) carry more signal than any single snapshot. This positions probe trajectories as a complementary safety monitoring layer for large reasoning models, where CoT faithfulness cannot be assumed unconditionally.
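The three temporal features could be computed from a probe's per-token probabilities roughly as follows. The exact definitions here are assumptions made for illustration (volatility as the standard deviation of step-to-step changes, trend as a least-squares slope, steady-state as the mean of the last few values); the underlying research may define them differently.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def trajectory_features(probs: Sequence[float], tail: int = 3) -> Dict[str, float]:
    """Summarize a probe-probability trajectory across CoT tokens."""
    if len(probs) < 2:
        raise ValueError("need at least two probe readings")
    # Volatility: spread of the step-to-step changes.
    diffs = [b - a for a, b in zip(probs, probs[1:])]
    # Trend: least-squares slope of probability against token index.
    n = len(probs)
    t_mean = (n - 1) / 2
    p_mean = mean(probs)
    num = sum((t - t_mean) * (p - p_mean) for t, p in enumerate(probs))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return {
        "volatility": pstdev(diffs),
        "trend": num / den,
        # Steady state: average of the final `tail` readings.
        "steady_state": mean(probs[-tail:]),
    }
```

A downstream classifier would consume these features in place of a single probe reading, which is what allows it to separate a trajectory that drifts steadily from one that oscillates around the same final value.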
Research on self-refinement via question-asking found a gap between detection and recovery: probes trained on hidden states before question generation are predictive of final answer correctness, but interventions are as likely to harm correct trajectories as to fix incorrect ones — a caution against over-relying on LLM self-diagnosis.
Benchmarking the ecosystem
Hugging Face launched the Open Chain-of-Thought Leaderboard in April 2024, providing standardized, reproducible comparisons of CoT reasoning quality across open-weight models. This infrastructure matters because CoT capability is not uniform across model families, and benchmark-level comparisons are the primary signal practitioners use to select models for reasoning-intensive deployments.
Tradeoffs and when not to use it
CoT reasoning is not free. Every reasoning token is an extra generated token, so latency and cost grow roughly in proportion to the length of the reasoning trace. For tasks where the answer is simple or the model is already highly confident, CoT adds overhead without benefit. The commitment boundary research quantifies this: a large fraction of generated reasoning tokens are post-hoc and causally inert. Latent reasoning methods (RiM, STORM) address this by internalizing computation, but at the cost of the interpretability that makes CoT traces useful for monitoring and alignment. The right choice depends on whether the deployment values auditability or throughput more.
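A back-of-the-envelope cost model makes the overhead concrete. The function names and prices are hypothetical, and the sketch assumes reasoning tokens are billed at the same rate as ordinary output tokens, which should be checked against the pricing of whichever provider is used.

```python
def generation_cost(
    prompt_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Estimated cost of one request, counting reasoning as output tokens."""
    output_tokens = reasoning_tokens + answer_tokens
    return (
        prompt_tokens / 1000 * usd_per_1k_input
        + output_tokens / 1000 * usd_per_1k_output
    )


def output_multiplier(reasoning_tokens: int, answer_tokens: int) -> float:
    """How many times more output is generated with CoT than without."""
    return (reasoning_tokens + answer_tokens) / answer_tokens
```

With a 200-token prompt, a 50-token answer, and illustrative prices of 1.0 and 2.0 USD per thousand input and output tokens, the request costs about 0.30 USD without reasoning and about 2.20 USD with a 950-token trace, a 20-fold increase in generated output.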




