6 · arXiv cs.CL (Computation and Language) · 22d ago

PPC: Preplan-Plan-CoT Framework for LLM Mathematical Reasoning

This paper introduces PPC (Preplan-Plan-CoT), a framework for LLM mathematical reasoning that adds an explicit problem-understanding stage, the 'preplan', before the planning and chain-of-thought execution stages. The preplan records the problem type, applicable tools, and foreseeable pitfalls. It fills a gap in existing plan-based methods, which specify 'how' to solve a problem without first clarifying 'what' is being solved. A three-stage synthesis pipeline with a spoiler-score detector and a composite GRPO reward is used to keep preplan supervision clean and plan generation coherent. Evaluated across four backbones and five math benchmarks, PPC achieves the best result on 39 of 40 metrics, improving on the strongest baseline by +2.23 maj@16 and +3.06 pass@16 at no additional inference token cost.

Evaluation and Benchmarking Agent and Tool Ecosystem Alignment and RLHF spoiler-score detector GRPO Chain-of-Thought Reasoning PPC (Preplan-Plan-CoT)
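For readers unfamiliar with the maj@16 and pass@16 figures quoted in the summary, the sketch below shows how such metrics are commonly computed from k sampled answers per problem. It is a minimal illustration, not the paper's evaluation code: the `problems` data, the exact-string answer matching, and k = 4 are assumptions made for brevity, and some papers instead use an unbiased pass@k estimator over more than k samples.

```python
from collections import Counter
from statistics import mean


def pass_at_k(samples, reference):
    """Return 1.0 if any of the k sampled answers equals the reference."""
    return float(any(s == reference for s in samples))


def maj_at_k(samples, reference):
    """Return 1.0 if the most frequent sampled answer equals the reference."""
    winner, _count = Counter(samples).most_common(1)[0]
    return float(winner == reference)


# Hypothetical data: final answers extracted from k = 4 samples per problem.
problems = [
    {"reference": "42", "samples": ["42", "41", "42", "7"]},
    {"reference": "3", "samples": ["5", "5", "3", "9"]},
    {"reference": "10", "samples": ["8", "9", "8", "11"]},
]

pass_rate = mean(pass_at_k(p["samples"], p["reference"]) for p in problems)
maj_rate = mean(maj_at_k(p["samples"], p["reference"]) for p in problems)
print(f"pass@4 = {pass_rate:.3f}, maj@4 = {maj_rate:.3f}")
```

The second problem shows why the two metrics diverge: the correct answer appears once, so it counts for pass@k, but the majority vote picks a wrong answer, so it does not count for maj@k.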

Related guides (4)

GRPO · Concept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Related events (8)

5 · arXiv · cs.CL · 11d ago

IS-CoT framework addresses long-form generation collapse in LLMs via interleaved structural thinking

Researchers introduce IS-CoT (Interleaved Structural Chain-of-Thought), a framework that embeds a dynamic Plan-Write-Reflect cycle into LLM generation to overcome severe length collapse observed in reasoning-enhanced models for open-ended writing tasks beyond 2,000 words. The authors construct a multi-teacher training dataset of interleaved reasoning traces and train IS-Writer-8B, which achieves state-of-the-art results on LongBench-Write, outperforming DeepSeek-V3.2 by 3.08 points. The work identifies static hierarchical planning as a root cause of long-form degradation and proposes an in-model alternative to external agentic workflows.

Long Context Evolution Evaluation and Benchmarking DeepSeek V4 LongBench-Write IS-Writer-8B

6 · Qwen Research · 1mo ago

Qwen2.5-Math Process Reward Model for Mathematical Reasoning Supervision

Alibaba's Qwen team introduces a process reward model (PRM) aimed at improving the reliability of mathematical reasoning in LLMs by supervising intermediate reasoning steps rather than only final answers. The work addresses the problem of models producing plausible but flawed intermediate derivations even when reaching correct conclusions. The release includes model weights on HuggingFace and ModelScope alongside a GitHub repository.

Evaluation and Benchmarking Open Weights Progress Process Reward Model Alibaba Qwen
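To make the idea of step-level supervision concrete, here is a small sketch of how per-step scores from a process reward model might be used to rank candidate solutions. Everything in it is assumed for illustration: the `step_scores` values are hand-written rather than produced by the Qwen model, and the min and product aggregation rules are common choices in the PRM literature, not something the summary attributes to this release.

```python
import math


def aggregate_step_scores(step_scores, how="min"):
    """Collapse per-step scores into one solution-level score.

    'min' penalizes a solution for its weakest step; 'prod' treats the
    scores as independent step-correctness probabilities.
    """
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def pick_best(candidates, how="min"):
    """Return the candidate whose aggregated step score is highest."""
    return max(candidates, key=lambda c: aggregate_step_scores(c["step_scores"], how))


# Hypothetical candidates: scores stand in for a PRM's per-step outputs.
candidates = [
    {"answer": "12", "step_scores": [0.9, 0.2, 0.95]},  # one weak middle step
    {"answer": "15", "step_scores": [0.8, 0.7, 0.75]},  # uniformly plausible
]

best = pick_best(candidates, how="min")
print(best["answer"])
```

An outcome-only scorer would see just the final answers; the step-level view lets the first candidate be rejected because of a single flawed intermediate step.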

5 · arXiv · cs.CL · 17d ago

ACTS: Agentic Chain-of-Thought Steering for efficient and controllable LLM reasoning

Researchers introduce Agentic Chain-of-Thought Steering (ACTS), a framework that formulates inference-time reasoning control as a Markov decision process, where a controller agent adaptively steers a frozen reasoner by issuing reasoning strategy directives and steering phrases at each step. The controller is initialized from synthetic steering trajectories with multi-budget augmentation and further optimized via reinforcement learning with budget-conditioned reward shaping. ACTS matches full-thinking performance with significant token savings and enables controllable accuracy-efficiency trade-offs across multiple benchmarks and reasoner models.

Inference Economics Agent and Tool Ecosystem ACTS Agentic Chain-of-Thought Steering Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

5 · arXiv · cs.CL · 4d ago

Systematic study of extrinsic and intrinsic properties for effective code interpreter reasoning in LLMs

Researchers investigate what behavioral properties make LLMs effective at reasoning with a Code Interpreter (CI), identifying two axes: extrinsic 'crucial tokens' and intrinsic 'cognitive behaviors' such as verification, backtracking, and backward chaining. Stronger CI reasoning models consistently exhibit higher prevalence of these properties. The paper shows that appending code-specific crucial tokens at inference time improves performance on mathematical, ordering, and optimization tasks, while augmenting training with cognitive behaviors improves SFT and RL performance in two of three evaluated models. The work also finds these behaviors reduce overthinking in incorrect responses and improve token efficiency.

Evaluation and Benchmarking Agent and Tool Ecosystem Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

6 · arXiv · cs.CL · 1mo ago

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

This paper introduces IH-GRPO, a reinforcement learning algorithm that decouples tool invocation from immediate execution during LLM reasoning, addressing the coherence disruption caused by tight coupling in existing tool-integrated reasoning (TIR) approaches. The authors propose a hierarchical control framework and derive a surrogate loss enabling an implicitly hierarchical policy to match the behavior of an explicit hierarchical policy. Experiments on Qwen3 models (1.7B, 4B, 8B) show absolute improvements of 1.87–2.53% across six out-of-domain mathematical reasoning benchmarks over the strongest baseline. Code is publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem GRPO Tool-Integrated Reasoning Qwen3-4B
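As background for the GRPO-based methods in this feed, the sketch below shows the group-relative advantage at the core of standard GRPO: several responses are sampled for one prompt, and each reward is normalized by the group's mean and standard deviation. It does not reproduce the IH-GRPO surrogate loss or its hierarchical policy. The reward values are made up, and the use of the population standard deviation is an assumption, since implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    A_i = (r_i - mean(r)) / (std(r) + eps), computed over responses
    sampled for the same prompt. eps guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled responses to one prompt, binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline comes from the group itself, no separate value network is needed, which is the main reason GRPO is described as lightweight.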

6 · Berkeley AI Research (BAIR) Blog · 1mo ago

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

A BAIR blog post surveys recent progress in parallel reasoning for LLMs, covering methods from simple self-consistency and Best-of-N sampling through structured search (Tree of Thoughts, MCTS) to newer adaptive approaches including ParaThinker, GroupThink, and Hogwild! Inference. The core motivation is that sequential reasoning scales linearly with exploration depth, causing latency, context-rot, and compute inefficiency. Adaptive parallel reasoning aims to let models themselves decide when and how to decompose tasks into concurrent threads, rather than imposing fixed parallel structure externally. The post frames this as an emerging inference-time scaling paradigm with implications for agentic and complex reasoning workloads.

Long Context Evolution Frontier Model Releases ParaThinker Berkeley AI Research (BAIR) DeepSeek V4

7 · arXiv · cs.CL · 17d ago

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen2.5-7B GRPO Qwen3-4B

6 · arXiv · cs.LG · 8d ago

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer produced by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset, outperforming self-consistency, semantic entropy, and P(True) in cross-dataset robustness. At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts of +0.086–0.096, AUROC lifts of +0.092–0.164) at equal sampling cost. The results extend to frontier thinking models that use chain-of-thought decompositions.

Evaluation and Benchmarking AI Safety Research operadic consistency Chain-of-Thought Self-Consistency MuSiQue
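The consistency check described in the summary can be sketched as follows: answer the query directly, answer it again by chaining the model's own sub-questions, and compare the two. This is a toy rendering of the idea as summarized, not the paper's implementation. The lookup tables stand in for model calls, the `{prev}` template convention and exact-match comparison are assumptions, and one direct answer is deliberately wrong to show a disagreement.

```python
from statistics import mean

# Hypothetical stand-in for model answers; "Kyoto" is deliberately inconsistent.
TOY_ANSWERS = {
    "What is the capital of the country that contains Lyon?": "Paris",
    "Which country contains Lyon?": "France",
    "What is the capital of France?": "Paris",
    "What is the capital of the country that contains Osaka?": "Kyoto",
    "Which country contains Osaka?": "Japan",
    "What is the capital of Japan?": "Tokyo",
}

# Hypothetical stand-in for the model's stated decomposition of each query.
DECOMPOSITIONS = {
    "What is the capital of the country that contains Lyon?": [
        "Which country contains Lyon?",
        "What is the capital of {prev}?",
    ],
    "What is the capital of the country that contains Osaka?": [
        "Which country contains Osaka?",
        "What is the capital of {prev}?",
    ],
}


def normalize(answer):
    """Case- and whitespace-insensitive comparison key."""
    return answer.strip().lower()


def oc_agrees(query, ask, decompose):
    """True if the direct answer matches the answer composed from sub-questions."""
    direct = ask(query)
    composed = None
    for template in decompose(query):
        # Each sub-question may refer to the previous sub-answer via {prev}.
        composed = ask(template.format(prev=composed))
    return normalize(direct) == normalize(composed)


queries = list(DECOMPOSITIONS)
flags = [oc_agrees(q, TOY_ANSWERS.__getitem__, DECOMPOSITIONS.__getitem__) for q in queries]
oc_rate = mean(float(f) for f in flags)
print(flags, oc_rate)
```

No gold labels are consulted anywhere in the check, which is what makes the signal label-free: the disagreement on the second query is detected purely from the model's own outputs.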