Entity · paper

ExpRL: Exploratory RL for LLM Mid-Training

paper · active · exprl-exploratory-rl-for-llm-mid-training-1cd4e032 · 1 event · first seen Jun 16, 2026

Aliases: ExpRL: Exploratory RL for LLM Mid-Training

More like this (12)

KL-regularized RL
ExpRL
Do You Really Need to Pretrain Q-Functions for Online RL Fine-Tuning?
Improving LLM-Generated Process Model Quality Through Reinforcement Learning: The Role of Reward Function Design
Learning Process Rewards via Success Visitation Matching for Efficient RL
Teaching LLMs to Self-Evolve: Cultivating Core Meta-Skills with Reinforcement Learning
Physics-EnhAnced Reinforcement Learning
Enough is as good as a feast: A Comprehensive Analysis of How Reinforcement Learning Mitigates Task Conflicts in LLMs
CheckRLM
RLIF (Reinforcement Learning from Internal Feedback)
Competitive Programming RL
Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Recent events (1)

6 · arXiv · cs.LG · Jun 16, 2026

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
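To make the contrast between rubric-based dense rewards and sparse outcome rewards concrete, here is a minimal Python sketch. It is not the paper's implementation: the `RubricItem` schema, the keyword-matching grader, and the example rubric are all hypothetical stand-ins (a real system would presumably build rubrics from the hidden reference solution and grade with a model rather than substring checks).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One gradable step derived from a hidden reference solution (hypothetical schema)."""
    description: str
    keywords: tuple[str, ...]  # toy stand-in for a learned or LLM-based grader
    weight: float


def dense_reward(response: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    text = response.lower()
    earned = sum(
        item.weight
        for item in rubric
        if all(kw.lower() in text for kw in item.keywords)
    )
    return earned / total


def sparse_reward(response: str, final_answer: str) -> float:
    """Outcome-only baseline: 1.0 if the final answer string appears, else 0.0."""
    return 1.0 if final_answer in response else 0.0


# Hypothetical rubric for "solve 2x + 3 = 11"; the policy never sees it.
rubric = [
    RubricItem("Sets up the equation", ("2x + 3 = 11",), 1.0),
    RubricItem("Isolates the variable term", ("2x = 8",), 1.0),
    RubricItem("States the final answer", ("x = 4",), 2.0),
]

partial = "We start from 2x + 3 = 11, so 2x = 8."
complete = partial + " Therefore x = 4."

print(dense_reward(partial, rubric), sparse_reward(partial, "x = 4"))
print(dense_reward(complete, rubric), sparse_reward(complete, "x = 4"))
```

In this toy setup the partial response earns half the rubric weight under the dense reward but nothing under the sparse reward, which illustrates the idea of reinforcing partial progress on intermediate steps.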

Evaluation and Benchmarking · Alignment and RLHF · ExpRL: Exploratory RL for LLM Mid-Training · GRPO · ExpRL