ExpRL: Exploratory RL for LLM Mid-Training

paper · active · provisional · exprl-exploratory-rl-for-llm-mid-training-1cd4e032 · 1 event · first seen 39h ago

Recent events (1)

6 · arXiv · cs.LG · 39h ago

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data, used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, which yield dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an initialization for subsequent RL, and additional mixed-domain experiments suggest broader applicability.
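
To make the contrast between rubric-based dense rewards and sparse outcome rewards concrete, here is a minimal Python sketch. It is not the paper's implementation: the names `RubricItem`, `dense_reward`, and `sparse_reward` are hypothetical, the rubric is hand-written for a toy problem, and simple substring checks stand in for whatever grader actually scores responses against a rubric derived from the hidden reference solution.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One criterion of a problem-specific rubric (hypothetical structure)."""
    description: str               # what the grader looks for
    weight: float                  # relative importance of this step
    check: Callable[[str], bool]   # toy stand-in for a real grader


def dense_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


def sparse_reward(response: str, final_answer: str) -> float:
    """Outcome-only baseline: 1.0 only if the full final answer appears."""
    return 1.0 if final_answer in response else 0.0


# Toy rubric for "Solve x^2 - 5x + 6 = 0". In this sketch the policy never
# sees the reference solution; only the grader's rubric depends on it.
rubric = [
    RubricItem("factors the quadratic", 1.0, lambda r: "(x - 2)(x - 3)" in r),
    RubricItem("finds root x = 2", 1.0, lambda r: "x = 2" in r),
    RubricItem("finds root x = 3", 2.0, lambda r: "x = 3" in r),
]

partial = "Factor: (x - 2)(x - 3) = 0, so x = 2."
full = "Factor: (x - 2)(x - 3) = 0, so x = 2 or x = 3."

print(dense_reward(partial, rubric))               # 0.5
print(sparse_reward(partial, "x = 2 or x = 3"))    # 0.0
print(dense_reward(full, rubric))                  # 1.0
print(sparse_reward(full, "x = 2 or x = 3"))       # 1.0
```

The partial response earns half credit under the rubric but nothing under the outcome-only reward, which is the sense in which a dense process-level signal can reinforce partial progress.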