OPERA: Perplexity-based RL alignment for open-ended reasoning tasks
OPERA (Objective Perplexity-based Reflective Alignment) proposes replacing LLM-as-a-judge reward models with intrinsic rewards derived from perplexity dynamics, with the goal of stabilizing RL training on open-ended tasks such as creative writing. The method includes a cold-start data synthesis pipeline that generates 20,000 reasoning trajectories using perplexity-prioritized rollouts. Applied to Qwen3-8B, OPERA claims state-of-the-art results among open-source models on open-ended tasks, reportedly matching or exceeding Gemini 2.5 and MiniMax-M2.5 on some benchmarks.
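The summary does not say how OPERA turns perplexity dynamics into a reward, so the following is only a minimal sketch of the underlying quantity: the perplexity of a token sequence is the exponential of its mean negative log-probability. The function names and the "drop" signal are hypothetical illustrations, not OPERA's actual reward.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a token sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_drop(logprobs_before, logprobs_after):
    """Hypothetical intrinsic signal (an assumption, not OPERA's formula):
    positive when the model's perplexity on the same text decreased."""
    return perplexity(logprobs_before) - perplexity(logprobs_after)
```

A signal of this kind needs only the policy's own token log-probabilities, which is what makes it usable without an external judge model.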
Related events (8)
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
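To make the idea of a rubric-derived dense reward concrete, here is a minimal sketch assuming a rubric is a set of weighted criteria and that some grader (in practice presumably a model, here just a given set) reports which criteria a response satisfied. The data layout and the name `rubric_reward` are assumptions for illustration; the summary does not describe ExpRL's actual rubric format or grading procedure.

```python
def rubric_reward(rubric, satisfied):
    """Weighted fraction of rubric criteria that a response satisfied.

    rubric:    dict mapping criterion text -> positive weight (hypothetical layout)
    satisfied: set of criterion texts the grader marked as met
    """
    total = sum(rubric.values())
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(weight for criterion, weight in rubric.items()
                 if criterion in satisfied)
    return earned / total

# A response that sets up and solves the problem but never states the
# final answer still earns partial credit, unlike a sparse 0/1 reward.
example_rubric = {
    "sets up the equation": 1.0,
    "solves for x": 2.0,
    "states the final answer": 1.0,
}
partial = rubric_reward(example_rubric, {"sets up the equation", "solves for x"})
```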
CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR
Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in RLVR-trained large vision-language models, showing the problem persists throughout GRPO training and inference. They propose CORA, which introduces a lightweight plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate task and consistency optimization. Experiments across multimodal reasoning benchmarks show CORA improves both task performance and reasoning faithfulness.
General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks
GPRL proposes a new alignment framework that replaces scalar reward models with a General Preference Model (GPM) embedding responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes per-dimension group-relative advantages, normalizes across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work directly addresses the gap between verifiable-reward online RL (strong on math/code) and preference optimization (strong on open-ended tasks).
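The per-dimension advantage step can be sketched as follows, assuming each response in a group already has one score per preference dimension. Standardizing each dimension within the group and then averaging across dimensions is one plausible reading of "normalizes across axes"; the function names and the averaging rule are assumptions, not GPRL's published procedure.

```python
from statistics import mean, pstdev

def per_dimension_advantages(scores, eps=1e-8):
    """scores[i][d] is the score of response i on dimension d.
    Returns the group-relative z-score of each response on each dimension."""
    n, k = len(scores), len(scores[0])
    adv = [[0.0] * k for _ in range(n)]
    for d in range(k):
        column = [row[d] for row in scores]
        mu, sigma = mean(column), pstdev(column)
        for i in range(n):
            adv[i][d] = (column[i] - mu) / (sigma + eps)
    return adv

def combined_advantage(scores):
    """Assumed aggregation: average the standardized per-dimension advantages
    so that no single axis dominates purely because of its numeric scale."""
    return [mean(row) for row in per_dimension_advantages(scores)]
```

Because each axis is standardized before aggregation, an axis scored in the hundreds contributes no more than one scored in single digits.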
RiVER framework enables RL training of LLMs on tasks without ground-truth solutions
Researchers introduce RiVER (Ranking-induced VERifiable framework), a reinforcement learning approach that trains LLMs on score-based optimization tasks using deterministic execution feedback as continuous rewards, without requiring ground-truth answers. The method addresses two failure modes in group-relative RL with continuous rewards, scale dominance and frequency dominance, via calibrated, instance-wise reward shaping. Applied to Qwen3-8B and GLM-Z1-9B-0414 on competitive programming tasks, RiVER improves ALE rating rank by ~9% and also transfers to exact-solution benchmarks (LiveCodeBench, USACO) with 2-4% absolute gains, unlike raw-score baselines. The result suggests score-based heuristic tasks can serve as general-purpose RL training environments for coding ability.
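One simple way to see why ranking helps with scale dominance: if raw execution scores within a group are replaced by their normalized ranks, the reward no longer depends on how large the raw numbers are for a given problem instance. The sketch below is only that generic rank transform, with ties sharing an averaged rank; it is an assumption for illustration and not RiVER's calibrated shaping, which the summary does not specify.

```python
def rank_rewards(scores):
    """Map raw scores in one group to rank-based rewards in [0, 1].
    Tied scores receive the same (averaged) reward."""
    n = len(scores)
    if n == 1:
        return [0.5]
    rewards = []
    for s in scores:
        below = sum(1 for t in scores if t < s)
        ties = sum(1 for t in scores if t == s) - 1  # exclude the score itself
        rewards.append((below + 0.5 * ties) / (n - 1))
    return rewards
```

Two groups whose raw scores differ by orders of magnitude but share the same ordering produce identical rewards under this transform.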
LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs
LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.
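A pairwise advantage of this general shape can be sketched as below: each response's advantage is the average of its reward gaps against the other responses in the group, with each pair weighted by a function of the two sequence log-probabilities. The specific weight used here, a LambdaRank-style sigmoid that grows when the policy currently prefers the lower-reward response, is an assumption chosen for illustration; the summary does not give LamPO's exact weighting.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def pairwise_advantages(rewards, logprobs):
    """rewards[i], logprobs[i]: reward and sequence log-probability of response i."""
    n = len(rewards)
    if n != len(logprobs):
        raise ValueError("rewards and logprobs must have the same length")
    adv = [0.0] * n
    if n < 2:
        return adv
    for i in range(n):
        for j in range(n):
            gap = rewards[i] - rewards[j]
            if i == j or gap == 0:
                continue
            winner, loser = (i, j) if gap > 0 else (j, i)
            # Assumed weight: larger when the policy mis-ranks the pair.
            weight = _sigmoid(logprobs[loser] - logprobs[winner])
            adv[i] += weight * gap
        adv[i] /= (n - 1)
    return adv
```

Since the weight is symmetric in each pair and the gap is antisymmetric, the advantages in a group sum to zero, which mirrors the centering that group-relative baselines provide.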
RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy
Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.
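For readers unfamiliar with the metric, average@k is usually read as the accuracy averaged over k sampled completions per problem and then over problems. A minimal sketch under that reading, with a hypothetical function name and input layout:

```python
def average_at_k(correct):
    """correct[p] is a list of k booleans: whether each of the k samples
    for problem p was right. Returns mean accuracy in percentage points."""
    if not correct or any(len(samples) == 0 for samples in correct):
        raise ValueError("need at least one problem and one sample per problem")
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)
```

Averaging over many samples reduces the variance that single-sample accuracy shows on small benchmarks such as AIME.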
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks
SCOPE is a data-free self-play framework for training language models on open-ended tasks without external supervision or frontier-model judges. It co-evolves two policies, a Challenger that generates document-grounded tasks and a Solver that answers via multi-turn retrieval, using a frozen copy of the initial model as a self-judge that writes task-specific rubrics. Across three 7-8B models (Qwen2.5, Qwen3, OLMo-3), SCOPE achieves up to +10.4 points on eight open-ended benchmarks and +13.8 points on seven held-out short-form QA benchmarks, matching or exceeding GRPO trained on ~9K curated prompts. Ablations identify rubric generation quality as the primary bottleneck for self-judging.
QUBRIC: Co-designing queries and rubrics for RL beyond verifiable rewards
QUBRIC is a framework that jointly optimizes queries and rubrics for reinforcement learning in settings where rewards are not strictly verifiable. The approach uses teacher-derived key points to rewrite open-ended queries into evaluable scenarios, applies contrastive rubric generation to capture teacher-policy gaps, and filters for learnability before GRPO training. Trained only on instruction-following data, QUBRIC achieves a +5.5 point gain on ArenaHard over an SFT baseline and transfers to legal, moral, and narrative reasoning benchmarks (+6.3 points average), suggesting rubric-based RL can complement RLVR in non-verifiable domains.

