Entity · technique

ExpRL

technique · active · exprl-216435e3 · 1 event · first seen Jun 16, 2026

Aliases: ExpRL

Co-occurring entities

ExpRL: Exploratory RL for LLM Mid-Training · GRPO

More like this (12)

ReuseRL · PrefixRL · ContextRL · CheckRLM · MedRLM · MemRL · RL² · ExpRL: Exploratory RL for LLM Mid-Training · prime-rl · Turing-RL · SafeRL-Lab · SCOPE-RL

Recent events (1)

6 · arXiv · cs.LG · Jun 16, 2026

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
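The contrast between a dense, rubric-based reward and a sparse final-answer reward can be illustrated with a minimal sketch. This is not the ExpRL implementation: the names `RubricItem`, `dense_reward`, and `sparse_reward` are hypothetical, the weights are arbitrary, and simple substring checks stand in for whatever grader actually scores a response against a rubric. The rubric here is written by hand, whereas the summary above describes rubrics constructed from hidden reference solutions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One hypothetical grading criterion derived from a hidden reference solution."""
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in grader; a real system would not use substring tests


def dense_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies (partial credit)."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


def sparse_reward(response: str, final_answer: str) -> float:
    """All-or-nothing reward: 1.0 only if the response ends with the final answer."""
    return 1.0 if response.strip().endswith(final_answer) else 0.0


# Toy problem (illustrative only): solve 2x + 3 = 11.
# The policy would see only the question, never the rubric or the reference solution.
rubric = [
    RubricItem("isolates the x term", 1.0, lambda r: "2x = 8" in r),
    RubricItem("divides both sides by 2", 1.0, lambda r: "8 / 2" in r),
    RubricItem("states the correct final answer", 2.0, lambda r: "x = 4" in r),
]

partial = "Subtract 3 from both sides: 2x = 8. Then x = 5"
complete = "Subtract 3: 2x = 8, so x = 8 / 2, giving x = 4"

partial_dense = dense_reward(partial, rubric)
partial_sparse = sparse_reward(partial, "x = 4")
complete_dense = dense_reward(complete, rubric)
complete_sparse = sparse_reward(complete, "x = 4")
```

In this toy case the partially correct response earns some dense reward for its valid first step but no sparse reward, which is the kind of signal for partial progress that the summary attributes to process-level rewards.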

Evaluation and Benchmarking · Alignment and RLHF · ExpRL: Exploratory RL for LLM Mid-Training · GRPO · ExpRL