Entity · technique

SPPO

technique · active · sppo-afe20942 · 1 event · first seen May 19, 2026

Aliases: SPPO

Co-occurring entities

WildBench · MT-Bench · General Preference Reinforcement Learning · SimPO · Arena-Hard · Llama3-8B-Instruct · General Preference Model · AlpacaEval 2

More like this (12)

PPO · SDPO · GSPO · GSPO (Group Sequence Policy Optimization) · DPPO · DDPO · SimPO · DPO · GRPO · ProxySPEX · SPEX · OPSD

Recent events (1)

7 · arXiv · cs.CL · May 19, 2026 · source ↗

General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks

GPRL proposes an alignment framework that replaces scalar reward models with a General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes group-relative advantages separately for each preference dimension, normalizes them across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work targets the gap between online RL with verifiable rewards (strong on math and code) and preference optimization (strong on open-ended tasks).
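
The per-dimension, group-relative advantage step described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation. The function names, the input layout (one score per response per preference dimension), the use of a population standard deviation, and the plain average used to combine axes are all assumptions made for illustration; the summary does not give the exact formulas.

```python
from statistics import mean, pstdev

def group_relative_advantages(scores, eps=1e-8):
    """Standardize each preference dimension within one sampled group.

    scores[i][d] is a hypothetical score of response i on dimension d.
    Returns per_dim[i][d] = (scores[i][d] - group mean_d) / (group std_d + eps).
    """
    n_dims = len(scores[0])
    columns = []
    for d in range(n_dims):
        col = [row[d] for row in scores]
        mu, sigma = mean(col), pstdev(col)
        columns.append([(x - mu) / (sigma + eps) for x in col])
    # Transpose back so each row again corresponds to one response.
    return [list(row) for row in zip(*columns)]

def combine_axes(per_dim):
    """Assumed aggregation: average the standardized advantages over dimensions."""
    return [mean(row) for row in per_dim]

# Three responses to one prompt, scored on two dimensions with different scales.
group_scores = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
per_dim = group_relative_advantages(group_scores)
advantages = combine_axes(per_dim)
```

Because each dimension is standardized within the group before the axes are combined, a dimension with a larger raw scale (the second column here) does not dominate the final advantage. That is one simple way to read "normalizes across axes"; the actual method may differ.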

Frontier Model Releases · Evaluation and Benchmarking · WildBench · MT-Bench · General Preference Reinforcement Learning · +7 more