NF-CoT: Latent reasoning with normalizing flows preserves autoregressive LLM advantages
Researchers propose NF-CoT, a latent reasoning framework that replaces discrete chain-of-thought token streams with continuous intermediate states modeled by normalizing flows embedded inside an LLM backbone. The approach uses a TARFlow-style normalizing flow head alongside the standard language model head, enabling exact likelihoods, KV-cache-compatible left-to-right decoding, and policy-gradient optimization in latent space. On code-generation benchmarks, NF-CoT improves pass rates over both explicit CoT and prior latent-reasoning baselines while reducing intermediate reasoning cost. The work addresses a key limitation of existing latent reasoning methods, which typically sacrifice probabilistic tractability or autoregressive compatibility.
Related guides (2)
Related events (8)
IS-CoT framework addresses long-form generation collapse in LLMs via interleaved structural thinking
Researchers introduce IS-CoT (Interleaved Structural Chain-of-Thought), a framework that embeds a dynamic Plan-Write-Reflect cycle into LLM generation to overcome severe length collapse observed in reasoning-enhanced models for open-ended writing tasks beyond 2,000 words. The authors construct a multi-teacher training dataset of interleaved reasoning traces and train IS-Writer-8B, which achieves state-of-the-art results on LongBench-Write, outperforming DeepSeek-V3.2 by 3.08 points. The work identifies static hierarchical planning as a root cause of long-form degradation and proposes an in-model alternative to external agentic workflows.
Reasoning in Memory (RiM): Latent Reasoning via Working Memory Blocks in LLMs
RiM introduces a latent reasoning method that replaces autoregressive chain-of-thought token generation with fixed sequences of special 'memory block' tokens, allowing LLMs to perform internal computation without externalizing intermediate steps. These memory blocks are processed in a single forward pass rather than generated autoregressively, improving compute efficiency at test time. Training uses a two-stage curriculum: first grounding memory blocks by predicting explicit reasoning steps, then discarding step-level supervision and refining answers iteratively. Experiments across multiple model families and sizes show RiM matches or exceeds existing latent reasoning methods.
Reasoning models struggle to control their chains of thought, and that's good
OpenAI introduces CoT-Control, a framework for evaluating how well reasoning models can deliberately manipulate or suppress their chain-of-thought outputs. The finding that models struggle to control their CoT is framed as a positive safety property, reinforcing the argument that visible reasoning traces serve as a meaningful monitorability safeguard. This contributes to ongoing research on whether chain-of-thought transparency is a reliable alignment and oversight tool.
Information-theoretic analysis of supervision in latent chain-of-thought reasoning
This paper analyzes Latent Chain-of-Thought (CoT) reasoning — where reasoning occurs in continuous hidden states rather than discrete text — through an information-theoretic lens, identifying a 'dual collapse' failure mode involving gradient attenuation and representational drift. The authors decompose process supervision into Trajectory Supervision and Space Supervision, and introduce the Unified Latent Probe (ULP) to quantify mutual information between latent trajectories and explicit reasoning steps. Experiments reveal an 'Information-Performance Binding' showing reasoning accuracy depends on information fidelity in the latent chain, suggesting supervision should shift from geometric imitation toward mutual information maximization.
ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens
ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.
Research identifies 'commitment boundary' in chain-of-thought reasoning, enabling 55% CoT length reduction
A new arXiv preprint introduces the concept of a 'commitment boundary' in chain-of-thought reasoning — a sharp transition point where a model's answer stabilizes, after which subsequent reasoning steps are 'epiphenomenal' and causally inert. The authors use early-exit probing and attention probes to detect this boundary, finding it can be linearly decoded from intermediate steps and generalizes across tasks. Exploiting this signal to exit reasoning blocks at the commitment boundary reduces CoT length by up to 55% on average with negligible performance loss, with direct implications for inference efficiency in large reasoning models.
ACTS: Agentic Chain-of-Thought Steering for efficient and controllable LLM reasoning
Researchers introduce Agentic Chain-of-Thought Steering (ACTS), a framework that formulates inference-time reasoning control as a Markov decision process, where a controller agent adaptively steers a frozen reasoner by issuing reasoning strategy directives and steering phrases at each step. The controller is initialized from synthetic steering trajectories with multi-budget augmentation and further optimized via reinforcement learning with budget-conditioned reward shaping. ACTS matches full-thinking performance with significant token savings and enables controllable accuracy-efficiency trade-offs across multiple benchmarks and reasoner models.
RREDCoT: Segment-level reward redistribution for chain-of-thought reasoning via self-approximated credit assignment
RREDCoT is a new method for redistributing rewards across segments of Chain-of-Thought traces during RL fine-tuning of reasoning language models, addressing the high-variance delayed-reward problem inherent in GRPO-style training. Rather than using computationally expensive Monte Carlo sampling for intermediate state value estimation, the method uses the model itself to approximate optimal reward redistribution without additional generation passes. The paper evaluates RREDCoT against MC sampling and several attribution baselines, analyzing segmentation strategies and state value estimation. This is relevant to the active research thread on improving RL fine-tuning stability and efficiency for reasoning models.

