Fixed-Point Reasoning Model (FPRM): Stable looped Transformers with adaptive compute via fixed-point halting
Researchers introduce FPRM, a Transformer-based Fixed-Point Reasoning Model that uses fixed-point convergence as a halting mechanism in looped architectures, addressing signal propagation problems through pre-norm layers and residual scaling. Looped architectures provide inductive bias for compositional reasoning, but suffer from depth-induced signal degradation when halting is deferred; FPRM resolves this while enabling compute to scale with task difficulty. The model is evaluated on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks. This contributes to the growing body of work on adaptive-compute and iterative-refinement architectures for reasoning.
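To make the halting idea concrete, here is a minimal sketch of convergence-based halting in a looped computation. It is not FPRM's actual rule: the function names, the max-norm convergence test, the tolerance, and the toy contraction standing in for a looped Transformer block are all assumptions made for illustration.

```python
def iterate_until_fixed_point(step, state, tol=1e-6, max_iters=100):
    """Reapply `step` until successive states differ by less than `tol`.

    Returns the final state and the number of iterations actually used.
    """
    for n in range(1, max_iters + 1):
        new_state = step(state)
        # Largest per-coordinate change between successive iterates.
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            return state, n  # converged: halt early
    return state, max_iters  # compute budget exhausted without converging


def make_step(rate):
    """Hypothetical stand-in for one pass through a shared (looped) block.

    The map pulls the state toward a fixed target; a smaller `rate` means
    slower convergence, mimicking a harder input.
    """
    target = [1.0, -2.0]
    return lambda h: [x + rate * (t - x) for x, t in zip(h, target)]


easy_state, easy_iters = iterate_until_fixed_point(make_step(0.9), [0.0, 0.0])
hard_state, hard_iters = iterate_until_fixed_point(make_step(0.2), [0.0, 0.0])
print(easy_iters, hard_iters)
```

In this toy setting the slowly converging map uses more iterations before halting, which is the sense in which compute scales with difficulty.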
Related events (8)
Future Probe Controlled Generation enables steering of reasoning models without quality degradation
Researchers introduce Future Probe Controlled Generation (FPCG), a text-level steering method for large reasoning models (LRMs) that trains activation probes to predict future behavior likelihoods from intermediate reasoning steps rather than detecting behavior in already-generated text. The probes achieve 64–91% accuracy in predicting the most likely future behavior, revealing a distinct class of internal prediction features separate from detection features. FPCG steers model outputs by sampling candidate sentences and selecting the best according to these probes, achieving steering with minimal output quality degradation and succeeding in cases where activation steering fails. The work provides a principled distinction between detection and prediction features as intervention targets for controlling LRM behavior.
Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning
This paper introduces Equilibrium Reasoners (EqR), a framework that formalizes test-time compute scaling through learned task-conditioned attractors in latent space, where stable fixed points correspond to valid solutions. EqR scales along two axes—depth (more iterations) and breadth (aggregating stochastic trajectories)—without requiring external verifiers or task-specific priors. On Sudoku-Extreme, unrolling up to 40,000 equivalent layers boosts accuracy from 2.6% (feedforward baseline) to over 99%. The work provides a mechanistic lens for understanding why iterative latent models generalize beyond memorized patterns.
Training-Free Looped Transformers: Inference-Time Recurrence via ODE-Motivated Layer Reapplication
The paper introduces a method to retrofit recurrence onto frozen pretrained transformer checkpoints at inference time by looping a contiguous mid-stack block of layers without any fine-tuning or architectural changes. Naive block reapplication degrades performance, so the authors motivate their approach by treating pre-norm transformer blocks as forward Euler ODE steps and replacing one large update with smaller damped sub-steps. Evaluated across seven model families including dense, sparse MoE, and MLA+MoE architectures, the method yields consistent benchmark improvements (e.g., +2.64 pp on MMLU-Pro for Qwen3-4B-Instruct) at no training cost.
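The Euler view in this summary can be sketched as follows. A residual update h + f(h) is read as one forward Euler step of size 1, and the large step is replaced by several smaller ones. The equal sub-step size 1/k, the function names, and the toy linear f are assumptions for illustration; the summary does not state the paper's actual damping schedule.

```python
def residual_update(h, f, step_size=1.0):
    """One forward Euler step: h <- h + step_size * f(h)."""
    return [x + step_size * dx for x, dx in zip(h, f(h))]


def damped_reapply(h, f, num_substeps):
    """Replace one full-size update with `num_substeps` smaller updates.

    Each sub-step uses step size 1 / num_substeps (an assumed schedule).
    """
    step = 1.0 / num_substeps
    for _ in range(num_substeps):
        h = residual_update(h, f, step)
    return h


def toy_block(h):
    """Hypothetical stand-in for the residual branch of a looped block."""
    return [-2.5 * x for x in h]


full_step = residual_update([1.0], toy_block)   # one update of size 1
damped = damped_reapply([1.0], toy_block, 4)    # four updates of size 1/4
print(full_step, damped)
```

For this toy f, the full-size update overshoots and grows in magnitude, while the damped sub-steps shrink the state toward zero. That is only an analogy for why naive reapplication can hurt, not a claim about real Transformer blocks.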
PPC: Preplan-Plan-CoT Framework for LLM Mathematical Reasoning
This paper introduces PPC (Preplan-Plan-CoT), a reasoning framework that adds an explicit problem-understanding stage (the 'preplan') before the planning and chain-of-thought execution stages in LLM mathematical reasoning. The preplan captures problem type, applicable tools, and foreseeable pitfalls, filling a gap in existing plan-based methods, which address only 'how' to solve a problem without first clarifying 'what' is to be solved. A three-stage synthesis pipeline with a spoiler-score detector and composite GRPO reward ensures clean preplan supervision and coherent plan generation. Evaluated across four backbones and five math benchmarks, PPC achieves the best results on 39 of 40 metrics, with improvements of +2.23 maj@16 and +3.06 pass@16 over the strongest baseline at no additional inference token cost.
RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy
Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.
FFR extends Forward-Forward algorithm to regression tasks with 73% memory reduction
A new arXiv preprint introduces FFR (Forward-Forward for Regression), the first framework to adapt Hinton's Forward-Forward algorithm—a biologically plausible, backpropagation-free training method—to regression problems. FFR introduces an ordinal competitive goodness function, a stratified ladder architecture, and hierarchical prediction with uncertainty estimation to handle continuous target spaces. Across five real-world regression benchmarks, FFR recovers 98.6% of backpropagation accuracy while reducing peak training memory to 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's.
Prefix Utility Model (PUM) trains process reward models on outcome-grounded prefix gain rather than step correctness
A new arXiv preprint proposes replacing local step-correctness signals in process reward models with 'prefix gain' — the improvement in solve-rate induced by conditioning a student model on a given reasoning prefix. The authors train a Prefix Utility Model (PUM) using a pairwise ranking objective and evaluate it across Best-of-N selection, beam search, and RL on mathematical reasoning tasks. PUM shows particular strength when candidate pools are large, search budgets are high, or rule-based rewards are sparse. Code, data, and models are released publicly.
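As a rough illustration of the quantities named above, prefix gain can be computed as the difference between two empirical solve rates, and a pairwise ranking objective can score a higher-gain prefix above a lower-gain one. The logistic form of the loss, the function names, and the sample outcomes are assumptions; the summary says only that a pairwise ranking objective is used.

```python
import math


def solve_rate(outcomes):
    """Fraction of sampled attempts that reached a correct answer (1 = solved)."""
    return sum(outcomes) / len(outcomes)


def prefix_gain(with_prefix, without_prefix):
    """Improvement in solve rate from conditioning the student on a prefix."""
    return solve_rate(with_prefix) - solve_rate(without_prefix)


def pairwise_ranking_loss(score_better, score_worse):
    """Logistic pairwise loss (assumed form): log(1 + exp(-(s_better - s_worse))).

    It is small when the model scores the higher-gain prefix well above the other.
    """
    return math.log1p(math.exp(-(score_better - score_worse)))


# Hypothetical rollouts: 1 means the student solved the problem.
gain = prefix_gain(with_prefix=[1, 1, 0, 1], without_prefix=[0, 1, 0, 0])
tied_loss = pairwise_ranking_loss(0.0, 0.0)
good_loss = pairwise_ranking_loss(2.0, -1.0)
print(gain, tied_loss, good_loss)
```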
Qwen2.5-Math Process Reward Model for Mathematical Reasoning Supervision
Alibaba's Qwen team introduces a process reward model (PRM) aimed at improving the reliability of mathematical reasoning in LLMs by supervising intermediate reasoning steps rather than only final answers. The work addresses the problem of models producing plausible but flawed intermediate derivations even when reaching correct conclusions. The release includes model weights on HuggingFace and ModelScope alongside a GitHub repository.
