Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning
This paper introduces Equilibrium Reasoners (EqR), a framework that formalizes test-time compute scaling through learned task-conditioned attractors in latent space, where stable fixed points correspond to valid solutions. EqR scales along two axes—depth (more iterations) and breadth (aggregating stochastic trajectories)—without requiring external verifiers or task-specific priors. On Sudoku-Extreme, unrolling up to 40,000 equivalent layers boosts accuracy from 2.6% (feedforward baseline) to over 99%. The work provides a mechanistic lens for understanding why iterative latent models generalize beyond memorized patterns.
Related guides (3)
Related events (8)
RACES framework enables recursive composition of verifiable RL environments for LLM reasoning generalization
RACES (Recursive Automated Composition for Environment Scaling) is a new framework that treats verifiable RL training environments as composable building blocks, automatically fusing them when input/output types match. The system implements 300 base environments and four composition operators (SEQUENTIAL, PARALLEL, SORT, SELECT) to generate diverse reasoning patterns at scale. Experiments show consistent gains on unseen benchmarks: DeepSeek-R1-Distill-Qwen-14B improves from 48.2 to 51.3 and Qwen3-14B from 58.8 to 61.1 averaged across six benchmarks. Notably, RACES achieves parity with 300 individual environments using only 50 base environments, suggesting strong efficiency gains over linear environment scaling.
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models
This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.
Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling
A BAIR blog post surveys recent progress in parallel reasoning for LLMs, covering methods from simple self-consistency and Best-of-N sampling through structured search (Tree of Thoughts, MCTS) to newer adaptive approaches including ParaThinker, GroupThink, and Hogwild! Inference. The core motivation is that sequential reasoning scales linearly with exploration depth, causing latency, context-rot, and compute inefficiency. Adaptive parallel reasoning aims to let models themselves decide when and how to decompose tasks into concurrent threads, rather than imposing fixed parallel structure externally. The post frames this as an emerging inference-time scaling paradigm with implications for agentic and complex reasoning workloads.
Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
This paper investigates whether extrapolative weight averaging of RL-trained checkpoints can extend Pareto frontiers between competing objectives (correctness vs. computational efficiency) without additional training. Starting from a shared initialization, the authors train checkpoints under nested unit-test coverage regimes for competitive programming tasks, revealing a correctness-efficiency frontier where higher-coverage rewards reduce optimization failures but increase correctness failures. Extrapolation beyond trained endpoints produces complementary policies that, when ensembled, improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. Results hold across 7B and 32B model scales and three inference settings: pure reasoning, tool use, and agentic coding.
Reasoning in Memory (RiM): Latent Reasoning via Working Memory Blocks in LLMs
RiM introduces a latent reasoning method that replaces autoregressive chain-of-thought token generation with fixed sequences of special 'memory block' tokens, allowing LLMs to perform internal computation without externalizing intermediate steps. These memory blocks are processed in a single forward pass rather than generated autoregressively, improving compute efficiency at test time. Training uses a two-stage curriculum: first grounding memory blocks by predicting explicit reasoning steps, then discarding step-level supervision and refining answers iteratively. Experiments across multiple model families and sizes show RiM matches or exceeds existing latent reasoning methods.
RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy
Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.
RELEX: Extrapolating LLM RLVR Training via Rank-1 Parameter Trajectories
This paper demonstrates that RLVR weight update trajectories are extremely low-rank and near-linearly predictable, with a rank-1 approximation capturing most downstream performance gains. The authors propose RELEX, a compute-efficient method that observes a short training window, estimates the rank-1 subspace, and extrapolates future checkpoints via linear regression—requiring no additional training. Evaluated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX matches or exceeds full RLVR performance using as few as 15% of training steps, and can extrapolate up to 10–20× beyond the observed prefix. The authors attribute the method's effectiveness to a denoising effect from rank-1 projection that discards stochastic optimization noise.


