Ruslan Salakhutdinov
ruslan-salakhutdinov-fca3aaf8 · 1 event · first seen 36h ago
Aliases: Ruslan Salakhutdinov
Recent events (1)
POPE Training Method Uses Partial Solution Hints to Improve RL Exploration in LLMs
Researchers from Carnegie Mellon University introduced Privileged On-Policy Exploration (POPE), a training method that pairs GRPO reinforcement learning with hint-augmented datasets to help LLMs solve hard problems they would otherwise fail to explore. During training, the model receives partial solution prefixes alongside full problems, enabling it to discover complete solutions; it is then trained on both hinted and unhinted versions so it learns to solve problems without hints at inference time. On competition math benchmarks AIME 2025 and HMMT 2025, POPE outperforms standard GRPO and supervised fine-tuning, with HMMT pass@1 improving from 31.0% to 37.8%. The method addresses a core bottleneck in RL training, sparse-reward exploration, by decomposing hard problem-solving into two steps: finding a good starting state and completing the solution from it.
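The hinted/unhinted data mix described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-level truncation, the `hint_fraction` parameter, the "Partial solution:" prompt template, the one-to-one mixing ratio, and every name below are assumptions made for illustration. The GRPO policy update itself is omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    prompt: str   # text shown to the model during RL rollouts
    hinted: bool  # True if the prompt carries a partial-solution prefix


def make_hinted_prompt(problem: str, reference_solution: str,
                       hint_fraction: float) -> str:
    """Append the first `hint_fraction` of the solution's words as a hint.

    Word-level truncation and the prompt template are assumptions,
    not details taken from the summary above.
    """
    words = reference_solution.split()
    n_hint = int(len(words) * hint_fraction)
    prefix = " ".join(words[:n_hint])
    return f"{problem}\n\nPartial solution: {prefix}"


def build_mixed_dataset(problems: Sequence[str],
                        solutions: Sequence[str],
                        hint_fraction: float = 0.5) -> List[Example]:
    """Emit one unhinted and one hinted prompt per problem (assumed 1:1 mix)."""
    dataset: List[Example] = []
    for problem, solution in zip(problems, solutions):
        dataset.append(Example(prompt=problem, hinted=False))
        dataset.append(Example(
            prompt=make_hinted_prompt(problem, solution, hint_fraction),
            hinted=True,
        ))
    return dataset
```

Under these assumptions, an RL trainer would sample rollouts for both prompt variants of each problem, so that reward obtained with a hint can shape behavior that also applies to the unhinted prompt used at inference time.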