person

Ruslan Salakhutdinov

person · active · provisional · ruslan-salakhutdinov-fca3aaf8 · 1 event · first seen 36h ago

Aliases: Ruslan Salakhutdinov

Co-occurring entities

Virginia Smith · Carnegie Mellon University · Aviral Kumar · GRPO · Yuxiao Qu · AIME 2025 · HMMT 2025 · Qwen3-4B-Instruct · Amrith Setlur · Privileged On-Policy Exploration

More like this (12)

Ilya Sutskever · Ivan Burazin · Mishig Davaadorj · Mykhailo Fedorov · Omar Khattab · Nikita Kitaev · Eliezer Yudkowsky · Sam Altman · Thomas Kurian · Alexandr Wang · Siran Li · Rishabh Sabharwal

Recent events (1)

6 · The Batch · 36h ago

POPE Training Method Uses Partial Solution Hints to Improve RL Exploration in LLMs

Researchers from Carnegie Mellon University introduced Privileged On-Policy Exploration (POPE), a training method that pairs GRPO reinforcement learning with hint-augmented datasets to help LLMs solve hard problems they would otherwise fail to explore. During training, the model receives partial solution prefixes alongside full problems, which lets it discover complete solutions; it is trained on both hinted and unhinted versions so that it learns to solve problems without hints at inference time. On the competition math benchmarks AIME 2025 and HMMT 2025, POPE outperforms standard GRPO and supervised fine-tuning, with HMMT pass@1 improving from 31.0% to 37.8%. The method addresses a core bottleneck in RL training, exploration under sparse rewards, by decomposing hard problem-solving into two parts: finding a good starting state and completing the solution from it.

Evaluation and Benchmarking · Alignment and RLHF · Virginia Smith · Carnegie Mellon University · Aviral Kumar · +8 more
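
The hint-augmentation idea in the summary above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Example`, `build_mixed_dataset`, and `hint_fraction`, the prompt template, and the character-level prefix cut are all assumptions made for illustration. The second function shows the group-normalized advantage commonly used in GRPO, which makes the sparse-reward problem concrete: if every sampled answer to a hard problem is wrong, all advantages are zero and the model gets no learning signal.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    answer: str
    hinted: bool


def build_mixed_dataset(problems, hint_fraction=0.5):
    """Build hinted and unhinted training prompts (hypothetical helper).

    problems: list of (question, reference_solution, final_answer) tuples.
    hint_fraction: assumed share of the reference solution shown as a hint.
    """
    examples = []
    for question, solution, answer in problems:
        # Unhinted version: the model sees only the problem statement.
        examples.append(Example(question, answer, hinted=False))
        # Hinted version: append a prefix of the reference solution.
        cut = int(len(solution) * hint_fraction)
        hinted_prompt = f"{question}\n\nPartial solution: {solution[:cut]}"
        examples.append(Example(hinted_prompt, answer, hinted=True))
    return examples


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    data = build_mixed_dataset([("Find x if 2x = 8.", "Divide by 2, so x = 4.", "4")])
    for ex in data:
        print(ex.hinted, repr(ex.prompt))
    # All samples wrong: no signal. One sample right: a positive advantage.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

In this sketch, the hinted prompts are what would make a nonzero reward reachable on hard problems, while the unhinted copies keep the training distribution aligned with hint-free inference.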