ExpRL: Exploratory RL for LLM Mid-Training
paper · active · provisional
exprl-exploratory-rl-for-llm-mid-training-1cd4e032 · 1 event · first seen 39h ago
Aliases: ExpRL: Exploratory RL for LLM Mid-Training
More like this (12)
KL-regularized RL
ExpRL
RLIF (Reinforcement Learning from Internal Feedback)
Competitive Programming RL
Learning from the Self-future: On-policy Self-distillation for dLLMs
Entropy-Regularized Reinforcement Learning
TRL (Transformer Reinforcement Learning)
Recursive Language Models (RLMs)
Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
Spinning Up in Deep RL
Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
Reinforcement Learning for Code
Recent events (1)
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
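To make the contrast between dense rubric-based rewards and sparse outcome rewards concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: `RubricItem`, `dense_reward`, and `sparse_reward` are hypothetical names, the rubric is hand-written rather than derived automatically from a reference solution, and simple substring checks stand in for whatever grader ExpRL actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    """One gradable checkpoint (hypothetical structure, not from the paper)."""
    description: str
    weight: float
    check: Callable[[str], bool]  # assumption: stand-in for a real grader


def dense_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


def sparse_reward(response: str, final_answer: str) -> float:
    """Outcome-only baseline: 1.0 if the final answer string appears, else 0.0."""
    return 1.0 if final_answer in response else 0.0


# Toy problem: solve x^2 - 5x + 6 = 0.
# The rubric would be built from the hidden reference solution; the policy
# only ever sees the question, never the reference or the rubric.
rubric = [
    RubricItem("factors the quadratic", 1.0, lambda r: "(x - 2)(x - 3)" in r),
    RubricItem("finds root x = 2", 1.0, lambda r: "x = 2" in r),
    RubricItem("finds root x = 3", 2.0, lambda r: "x = 3" in r),
]

# A response that makes partial progress but misses one root.
partial = "Factoring gives (x - 2)(x - 3) = 0, so x = 2."

print(dense_reward(partial, rubric))     # 0.5: partial progress is rewarded
print(sparse_reward(partial, "x = 3"))   # 0.0: outcome-only signal sees nothing
```

In this toy case the incomplete response earns half of the available rubric weight, whereas the outcome-only reward gives no signal at all, which is the kind of partial-progress reinforcement the summary describes.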