6 · arXiv cs.CL (Computation and Language) · 16h ago

REAR: Test-time reward decomposition for preference realignment in LLMs

Researchers introduce REAR (REAlignment Reward), a training-free framework for aligning LLMs with diverse user preferences at test time. The method decomposes the reward function into question-related and preference-related components, then derives a realignment reward expressible as a linear combination of token-level log-probabilities. This formulation integrates cleanly with existing test-time scaling algorithms like best-of-N sampling and tree search, and experiments show it generalizes across preference alignment, math, and visual tasks.

Evaluation and Benchmarking · Inference Economics · Alignment and RLHF · REAlignment Reward · REAR: Test-time Preference Realignment through Reward Decomposition
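
To make the REAR summary concrete, here is a minimal sketch of how a reward that is a linear combination of token-level log-probabilities could plug into best-of-N sampling. The summary does not give the paper's coefficients or say which conditioning contexts produce the log-probabilities, so the two inputs (`logp_with_pref`, `logp_without_pref`) and the weights `w_pref` and `w_base` are assumptions chosen for illustration, not the authors' formulation.

```python
from typing import List, Sequence


def linear_logprob_reward(
    logp_with_pref: Sequence[float],
    logp_without_pref: Sequence[float],
    w_pref: float = 1.0,   # hypothetical weight, not taken from the paper
    w_base: float = -1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Score one candidate as a weighted sum of per-token log-probabilities."""
    if len(logp_with_pref) != len(logp_without_pref):
        raise ValueError("both sequences must cover the same tokens")
    return sum(
        w_pref * a + w_base * b
        for a, b in zip(logp_with_pref, logp_without_pref)
    )


def best_of_n(candidates: List[str], rewards: List[float]) -> str:
    """Return the candidate with the highest reward (first one wins ties)."""
    if not candidates or len(candidates) != len(rewards):
        raise ValueError("need one reward per candidate")
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index]
```

In use, each sampled response would be scored with `linear_logprob_reward` from log-probabilities the model already produces, and `best_of_n` would pick the winner. No separate reward model is trained, which matches the training-free framing in the summary.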

Related guides (3)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI to Do What We Actually Want

Inference Economics · Topic guide

Inference Economics: The Hidden Cost Battle Shaping AI

Related events (8)

6 · arXiv · cs.LG · 14d ago

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking · Alignment and RLHF · ExpRL: Exploratory RL for LLM Mid-Training · GRPO · ExpRL

5 · arXiv · cs.CL · 21d ago

GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment

Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly, claiming improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.

Inference Economics · Alignment and RLHF · Best-of-N Sampling · Gradient-Guided Reward Optimization
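
The entropy-monitoring step in the GGRO summary can be illustrated with a short sketch that computes the Shannon entropy of each decoding step's next-token distribution and flags steps above a threshold. The threshold value and the function names are hypothetical, and the gradient-guided "nudging token" injection itself is not shown because the summary does not specify it.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(
    step_logits: Sequence[Sequence[float]],
    threshold: float,  # hypothetical cutoff; would need tuning per model
) -> List[int]:
    """Indices of decoding steps whose entropy exceeds the threshold."""
    return [
        step
        for step, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]
```

A steering method of this kind would intervene only at the flagged indices, which is how it can avoid the cost of sampling and re-ranking many full responses.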

5 · arXiv · cs.CL · 14d ago

RL-trained LLMs learn retriever-specific query formulation strategies for RAG

An arXiv paper presents what its authors describe as the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have distinct optimal query styles (e.g., descriptive vs. question-like), which makes cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and report gains from retriever-specific human guidance and model scaling.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Understanding the Behaviors of Environment-aware Information Retrieval · LCO-Embedding

7 · arXiv · cs.AI · 5d ago

Progress Advantage: Annotation-Free Step-Level Scoring for LLM Agents via RL Post-Training

Researchers introduce 'progress advantage,' a method that derives implicit step-level reward signals for LLM agents directly from the log-probability ratio between an RL-trained policy and its reference policy, without requiring dedicated process reward model training. The approach is shown to recover the optimal advantage function under a general stochastic MDP formulation, making it annotation-free and domain-agnostic. Validated across five benchmarks and four model families on tasks including test-time scaling, uncertainty quantification, and failure attribution, it outperforms confidence-based baselines and even dedicated trained reward models. The result is practically significant because building process reward models for agentic settings is currently a major bottleneck.

Evaluation and Benchmarking · Agent and Tool Ecosystem · progress advantage · Progress Advantage for LLM Agents
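
As a rough illustration of the log-probability-ratio idea in the progress advantage summary, the sketch below scores each agent step by the difference between the summed token log-probabilities under the RL-trained policy and under its reference policy. Any scaling constant or normalization the paper applies is not stated in the summary, so none is used here. The step segmentation and the helper `most_suspect_step` are assumptions added for illustration.

```python
from typing import List, Sequence


def step_log_ratios(
    policy_steps: Sequence[Sequence[float]],
    reference_steps: Sequence[Sequence[float]],
) -> List[float]:
    """Per-step log pi_policy(step) - log pi_ref(step), summed over tokens."""
    if len(policy_steps) != len(reference_steps):
        raise ValueError("both trajectories must have the same number of steps")
    scores = []
    for policy_logps, ref_logps in zip(policy_steps, reference_steps):
        if len(policy_logps) != len(ref_logps):
            raise ValueError("each step must cover the same tokens")
        scores.append(sum(policy_logps) - sum(ref_logps))
    return scores


def most_suspect_step(scores: Sequence[float]) -> int:
    """Hypothetical failure-attribution helper: index of the lowest score."""
    if not scores:
        raise ValueError("no steps to inspect")
    return min(range(len(scores)), key=lambda i: scores[i])
```

A positive score means the trained policy assigns the step more probability than the reference does. Under the summary's claim, that gap is what carries the implicit step-level signal, and it requires no annotated process labels.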

6 · arXiv · cs.CL · 1mo ago

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

This paper proposes a multi-reward reinforcement learning from internal feedback (RLIF) framework that decomposes training signals into an answer-level reward via cluster voting and a completion-level reward via token-wise self-certainty. To address reward hacking and entropy collapse common in single-reward RLIF, the authors introduce GDPO-based normalization and KL-Cov regularization targeting low-entropy token distributions. Evaluated on mathematical reasoning and code-generation benchmarks, the method achieves stability and performance approaching supervised RLVR methods without requiring external ground-truth supervision. The work advances scalable unsupervised RL training for LLM reasoning.

AI Safety Research · Alignment and RLHF · KL-Cov regularization · token-wise self-certainty · cluster voting reward
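
One simple way to read "answer-level reward via cluster voting" is a majority vote over the final answers of several sampled completions, with no ground truth involved. The sketch below assumes answers are clustered by exact match after light normalization and that members of the largest cluster receive a reward of 1.0. Both choices are assumptions for illustration, and the paper's clustering and the token-wise self-certainty reward may differ.

```python
from collections import Counter
from typing import List, Sequence


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def cluster_vote_rewards(answers: Sequence[str]) -> List[float]:
    """Reward 1.0 for answers in the largest cluster, 0.0 otherwise.

    Ties are broken in favor of the cluster that appears first.
    """
    if not answers:
        return []
    normalized = [normalize(a) for a in answers]
    majority, _count = Counter(normalized).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in normalized]
```

A vote like this is easy to game when the policy collapses onto a single answer, which is the kind of failure the summary says the second reward and the regularization terms are meant to counter.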

5 · arXiv · cs.CL · 1mo ago

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

Frontier Model Releases · Evaluation and Benchmarking · RLVR · ROUGE-L · AIME24

7 · arXiv · cs.CL · 1mo ago

General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks

GPRL proposes a new alignment framework that replaces scalar reward models with a General Preference Model (GPM) embedding responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes per-dimension group-relative advantages, normalizes across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work directly addresses the gap between verifiable-reward online RL (strong on math/code) and preference optimization (strong on open-ended tasks).

Frontier Model Releases · Evaluation and Benchmarking · WildBench · MT-Bench · General Preference Reinforcement Learning
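
The phrase "per-dimension group-relative advantages" can be sketched under a simplifying assumption: each response in a group already has one scalar score per preference dimension. Each dimension is then standardized within the group, as in GRPO-style group-relative advantages, and the results are averaged across dimensions. The score matrix, the use of population standard deviation, and the plain average are all assumptions for illustration. GPRL's General Preference Model and its drift monitor are not reproduced here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_dimension_advantages(scores: Sequence[Sequence[float]]) -> List[float]:
    """scores[i][d] is the (hypothetical) score of response i on dimension d.

    Each dimension is standardized across the group separately, so no single
    axis dominates just because its raw scale is larger.
    """
    if not scores:
        return []
    n_dims = len(scores[0])
    if any(len(row) != n_dims for row in scores):
        raise ValueError("every response needs a score on every dimension")
    columns = [
        group_relative([row[d] for row in scores]) for d in range(n_dims)
    ]
    return [
        mean(columns[d][i] for d in range(n_dims)) for i in range(len(scores))
    ]
```

Standardizing per axis before combining means a response cannot earn a large advantage by inflating only the dimension with the widest raw scale. That is one simple guard against the single-axis reward hacking the summary mentions.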

5 · arXiv · cs.AI · 7d ago

AIR: Adaptive Interleaved Reasoning with Code in Multimodal LLMs via Reinforcement Learning

Researchers propose AIR, a system that trains multimodal large language models to adaptively interleave reasoning with code execution for numerical computation tasks, going beyond prior work that focused only on visual operations. The approach combines a two-stage cold-start data pipeline, RL dataset filtering, and a group-constrained reward function for tool-invocation decisions. Experiments show a 6.1 percentage point average improvement on evaluation benchmarks, with interleaved reasoning samples gaining 9.9 pp and tool-use success exceeding 95%.

Agent and Tool Ecosystem · Alignment and RLHF · AIR: Adaptive Interleaved Reasoning with Code in MLLMs · OpenAI