Almanac
6 · arXiv cs.CL (Computation and Language) · 12d ago

Prefix Utility Model (PUM) trains process reward models on outcome-grounded prefix gain rather than step correctness

A new arXiv preprint proposes replacing local step-correctness signals in process reward models with 'prefix gain': the improvement in solve rate induced by conditioning a student model on a given reasoning prefix. The authors train a Prefix Utility Model (PUM) using a pairwise ranking objective and evaluate it across Best-of-N selection, beam search, and RL on mathematical reasoning tasks. PUM shows particular strength when candidate pools are large, search budgets are high, or rule-based rewards are sparse. Code, data, and models are released publicly.
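To make the two ideas in this summary concrete, here is a minimal sketch of prefix gain and a pairwise ranking loss. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, prefix gain is taken as a plain difference of empirical solve rates, and the logistic (Bradley-Terry style) loss is one common choice of pairwise ranking objective that the summary does not confirm.

```python
import math
from typing import Callable, Sequence


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled student completions that reach a correct answer."""
    return sum(outcomes) / len(outcomes)


def prefix_gain(with_prefix: Sequence[bool], without_prefix: Sequence[bool]) -> float:
    """Assumed definition: solve rate when conditioned on the prefix
    minus solve rate on the bare problem."""
    return solve_rate(with_prefix) - solve_rate(without_prefix)


def pairwise_ranking_loss(score_hi: float, score_lo: float) -> float:
    """Logistic pairwise loss, -log(sigmoid(score_hi - score_lo)).

    score_hi is the model score of the prefix with the larger gain.
    The two branches keep exp() from overflowing for large margins.
    """
    margin = score_hi - score_lo
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def best_of_n(candidates: Sequence[str], scorer: Callable[[str], float]) -> str:
    """Best-of-N selection: keep the candidate the scorer ranks highest."""
    return max(candidates, key=scorer)
```

In this sketch, a prefix that lifts the student from 1 of 4 to 3 of 4 correct completions has a gain of 0.5, and the loss falls as the scorer separates higher-gain prefixes from lower-gain ones.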

Related events (8)

6 · Qwen Research · 1mo ago

Qwen2.5-Math Process Reward Model for Mathematical Reasoning Supervision

Alibaba's Qwen team introduces a process reward model (PRM) aimed at improving the reliability of mathematical reasoning in LLMs by supervising intermediate reasoning steps rather than only final answers. The work addresses the problem of models producing plausible but flawed intermediate derivations even when reaching correct conclusions. The release includes model weights on HuggingFace and ModelScope alongside a GitHub repository.

6 · The Batch · 35h ago

POPE Training Method Uses Partial Solution Hints to Improve RL Exploration in LLMs

Researchers from Carnegie Mellon University introduced Privileged On-Policy Exploration (POPE), a training method that pairs GRPO reinforcement learning with hint-augmented datasets to help LLMs solve hard problems they would otherwise fail to explore. During training, the model receives partial solution prefixes alongside full problems, enabling it to discover complete solutions; it is then trained on both hinted and unhinted versions so it learns to solve problems without hints at inference time. On competition math benchmarks AIME 2025 and HMMT 2025, POPE outperforms standard GRPO and supervised fine-tuning, with HMMT pass@1 improving from 31.0% to 37.8%. The method addresses a core bottleneck in RL training, sparse-reward exploration, by decomposing hard problem-solving into two parts: finding a good starting state and completing the solution.
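The hinted/unhinted pairing described above can be sketched roughly as follows. This is an assumption-laden illustration, not the POPE code: the function name, the `hint_fraction` parameter, the "Partial solution:" prompt template, and the idea of cutting the hint from a step-split reference solution are all hypothetical.

```python
from typing import List, Sequence


def build_training_prompts(
    problem: str,
    reference_steps: Sequence[str],
    hint_fraction: float = 0.5,
) -> List[str]:
    """Return the unhinted prompt plus, when a hint exists, a hinted variant.

    The hint is the first `hint_fraction` of the reference solution steps
    (an assumed way of producing a partial solution prefix).
    """
    n_hint = int(len(reference_steps) * hint_fraction)
    prompts = [problem]  # the unhinted version is always included
    if n_hint > 0:
        hint = "\n".join(reference_steps[:n_hint])
        prompts.append(f"{problem}\n\nPartial solution:\n{hint}")
    return prompts
```

Both prompt variants would then go into the same RL training pool, so the policy sees each hard problem with and without the starting state supplied.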

7 · arXiv · cs.CL · 23d ago

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by 1.8 pp on both Pass@1 and Pass@4 at 8B, and the 8B model surpasses the 32B baseline on Pass@4 with 4× fewer parameters.

7 · OpenAI Blog · 1mo ago

Improving Mathematical Reasoning with Process Supervision

OpenAI trained a model achieving state-of-the-art mathematical problem solving by rewarding each correct reasoning step (process supervision) rather than only the final answer (outcome supervision). This approach improves performance on math benchmarks and carries an alignment benefit by training models to produce human-endorsed chain-of-thought reasoning. The work highlights a potential synergy between capability improvements and alignment techniques.

5 · arXiv · cs.AI · 2d ago

MAST: Mechanism-guided selective unlearning for RLVR-trained reasoning models

Researchers introduce MAST (Mechanism-Aligned Selective Targeting), a method for selectively unlearning capabilities induced by reinforcement learning from verifiable rewards (RLVR) in language models while minimizing collateral damage to retained knowledge. The approach ranks attention-projection tensors by off-principal energy and gradient coupling to identify a targeted subset for update, rather than applying full-parameter gradient ascent. Evaluated on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, MAST achieves statistically significant forgetting on target MATH problems while preserving GSM8K performance, whereas full-parameter unlearning collapses retained capabilities. The method generalizes across seeds and unlearning objectives (NPO/SimNPO).

6 · arXiv · cs.AI · 8d ago

RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy

Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.

6 · arXiv · cs.CL · 3d ago

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.
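The zero-reward failure mode mentioned above follows from how GRPO-style methods normalize rewards within a group of rollouts. The sketch below assumes the commonly used group-normalized advantage, (reward - group mean) / (group std + eps); the function name and the `eps` value are illustrative, and details may differ from the setup in the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages for one prompt's rollouts.

    If every rollout gets the same reward (for example all zeros on a
    hard question), every advantage is zero and the policy gradient
    contribution from that prompt vanishes.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards `[0, 0, 0, 0]` all advantages are zero, while `[1, 0, 0, 0]` gives the single successful rollout a positive advantage and the rest negative ones.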

5 · arXiv · cs.CL · 15d ago

PropMe framework distinguishes memorization capability from propensity in LLMs

A new arXiv preprint introduces PropMe, a framework that separates whether LLMs can be forced to reproduce training data (capability) from whether they do so under ordinary use (propensity). The authors also release SimpleTrace, a lightweight pipeline using infini-gram to attribute model outputs to training corpora. Evaluating two open models on Common Pile and Dynaword, they find a consistent gap: adversarial prefix attacks elicit strong memorization, but propensity scores remain low in non-adversarial settings. The paper argues memorization audits should report both worst-case extractability and ordinary leakage propensity.