Entity · technique

Process Reward Model

technique · active · process-reward-model-6c172dc9 · 2 events · first seen May 18, 2026

Aliases: Process Reward Model, process reward models

Co-occurring entities

Recursive Self-Improvement in AI: From Bounded Self-Refinement to Autonomous Research Loops · Alibaba Qwen · ModelScope · Qwen2.5-Math-PRM · HuggingFace

More like this (12)

reward model · Rule-Based Rewards · What do Reward Models Memorize? · Improving LLM-Generated Process Model Quality Through Reinforcement Learning: The Role of Reward Function Design · CapReward · rubric-based reward shaping · Reward Learning from Comparisons · VPT Model · RoboReward · Scaling Laws for Reward Model Overoptimization · reward misspecification · Energy-Based Models

Recent events (2)

7 · arXiv · cs.AI · Jul 9, 2026

Survey of recursive self-improvement in AI: taxonomy, evaluator hierarchy, and governance gaps

A new arXiv survey covers 1,250 papers (2024–2026) on AI self-improvement and proposes a two-axis taxonomy that distinguishes what is improved (behavior, policy, evaluator, or research process) from the degree of loop closure (human-in-the-loop to fully closed). The authors construct a verification hierarchy for self-evaluation signals, ranging from formal verifiers (strongest) to intrinsic self-assessment (weakest), and find that demonstrated self-improvement strength tracks this hierarchy, while failure modes (self-confirming loops, model collapse, diversity collapse) arise when it is violated. The paper argues that 'research direction-setting' remains the key bottleneck keeping humans in the loop, and identifies governance-grade measurement of self-improvement as the most underpopulated niche in the field. The work connects the technical limits of recursive self-improvement (RSI) to safety and governance concerns raised by frontier labs experimenting with closed-loop AI research.

Frontier Model Releases · Evaluation and Benchmarking · Process Reward Model · Recursive Self-Improvement in AI: From Bounded Self-Refinement to Autonomous Research Loops · +1 more

6 · Qwen Research · May 18, 2026

Qwen2.5-Math Process Reward Model for Mathematical Reasoning Supervision

Alibaba's Qwen team introduces a process reward model (PRM) aimed at improving the reliability of mathematical reasoning in LLMs by supervising intermediate reasoning steps rather than only final answers. The work addresses the problem of models producing plausible but flawed intermediate derivations even when reaching correct conclusions. The release includes model weights on HuggingFace and ModelScope alongside a GitHub repository.

Evaluation and Benchmarking · Open Weights Progress · Process Reward Model · Alibaba Qwen · +4 more
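
The contrast drawn in the Qwen entry above, between supervising intermediate steps and checking only the final answer, can be illustrated with a minimal sketch. Nothing here is the Qwen2.5-Math-PRM interface: `toy_step_scorer` is a hypothetical stand-in for a learned step scorer that only verifies simple arithmetic, and aggregating step scores with `min` is an assumed choice (products or averages are also plausible).

```python
import re
from typing import Callable, List


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a learned PRM: score one step in [0, 1].

    A real PRM is a trained model; this toy only checks "a op b = c" arithmetic.
    """
    match = re.fullmatch(r"\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*", step)
    if match is None:
        return 0.5  # step is not checkable arithmetic; stay neutral
    a, op, b, c = int(match[1]), match[2], int(match[3]), int(match[4])
    expected = {"+": a + b, "-": a - b, "*": a * b}[op]
    return 1.0 if expected == c else 0.0


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only supervision: inspects the final answer and nothing else."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(steps: List[str], scorer: Callable[[str], float]) -> float:
    """Process supervision: score every step, then aggregate.

    Assumption: min-aggregation, so a single flawed step caps the total.
    """
    if not steps:
        return 0.0
    return min(scorer(step) for step in steps)


# Task: (2 + 3) * 2. Two wrong steps cancel out and still land on 10.
steps = ["2 + 3 = 6", "6 * 2 = 10"]
final_answer, reference = "10", "10"

outcome = outcome_reward(final_answer, reference)
process = process_reward(steps, toy_step_scorer)
print(f"outcome reward: {outcome}, process reward: {process}")
```

In this toy case the outcome check accepts the derivation because the final answer matches, while the step-level check rejects it because both intermediate steps are wrong. That is the kind of plausible-but-flawed derivation the entry describes.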