Entity · technique

General Preference Model

technique · active · general-preference-model-324c01a4 · 1 event · first seen May 19, 2026

Aliases: General Preference Model, General Preference Model (GPM)

Co-occurring entities

WildBench, MT-Bench, General Preference Reinforcement Learning, SimPO, SPPO, Arena-Hard, Llama3-8B-Instruct, AlpacaEval 2

More like this (12)

General Preference Reinforcement Learning, Gravity-Weighted Direct Preference Optimization, Fine-tuning GPT-2 from Human Preferences, Freeform Preference Learning, Direct Preference Optimization (DPO), Generalised Linear Mixed Models, VPT Model, GRPO (Group Relative Policy Optimization), Identity Preference Optimization, Process Reward Model, GGML, generative models

Recent events (1)

7 · arXiv · cs.CL · May 19, 2026

General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks

GPRL is an alignment framework that replaces the scalar reward model with a General Preference Model (GPM), which embeds each response into k skew-symmetric subspaces so that preferences can be multi-dimensional and intransitive. The method computes group-relative advantages per dimension, normalizes them across axes, and runs a closed-loop drift monitor that detects and corrects single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL reaches a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work targets the gap between online RL with verifiable rewards (strong on math and code) and preference optimization (strong on open-ended tasks).
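
The summary above does not give formulas, so the following is only a minimal sketch of how a skew-symmetric preference score and per-dimension group-relative advantages could fit together. It assumes each of the k subspaces is 2-dimensional with the fixed operator [[0, -1], [1, 0]], that the preference probability is a sigmoid of the summed per-axis scores, and that "normalizes across axes" means z-scoring each axis over the group before averaging. The function names (per_axis_scores, preference_prob, group_advantages) are hypothetical and are not taken from the GPRL paper or any released code.

```python
import numpy as np

def per_axis_scores(v_a, v_b):
    """Score of response a over response b in each 2-D subspace.

    v_a, v_b: arrays of shape (2k,). Assumption: every subspace uses the
    skew-symmetric operator R = [[0, -1], [1, 0]], so the per-axis score is
    v_a^T R v_b = a1*b0 - a0*b1, which flips sign when a and b are swapped.
    """
    a = v_a.reshape(-1, 2)
    b = v_b.reshape(-1, 2)
    return a[:, 1] * b[:, 0] - a[:, 0] * b[:, 1]

def preference_prob(v_a, v_b):
    """Assumed link: sigmoid of the score summed over all subspaces."""
    return 1.0 / (1.0 + np.exp(-per_axis_scores(v_a, v_b).sum()))

def group_advantages(embeddings, eps=1e-8):
    """Illustrative per-dimension group-relative advantages.

    embeddings: array of shape (n, 2k), one row per sampled response.
    Each response gets its mean per-axis score against the rest of the
    group; each axis is z-scored over the group, then axes are averaged.
    """
    n, k = embeddings.shape[0], embeddings.shape[1] // 2
    scores = np.zeros((n, k))
    for i in range(n):
        for j in range(n):
            if i != j:
                scores[i] += per_axis_scores(embeddings[i], embeddings[j])
    scores /= (n - 1)
    normed = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + eps)
    return normed.mean(axis=1)

# Three responses placed 120 degrees apart in a single subspace (k = 1)
# form a preference cycle that no scalar reward can represent.
angles = np.deg2rad([0.0, 120.0, 240.0])
A, B, C = (np.array([np.cos(t), np.sin(t)]) for t in angles)
cycle = [preference_prob(B, A), preference_prob(C, B), preference_prob(A, C)]
```

In this toy setup every entry of cycle is above 0.5, so B is preferred to A, C to B, and A to C. Because the three responses are symmetric, group_advantages returns values near zero for all of them, which is the behaviour one would expect from a group-relative baseline when no response dominates.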

Frontier Model Releases, Evaluation and Benchmarking, WildBench, MT-Bench, General Preference Reinforcement Learning