4 · arXiv · cs.AI (Artificial Intelligence) · 24d ago

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

This paper analyzes preference-shaped expected improvement criteria for Bayesian multiobjective optimization, focusing on the expected hypervolume improvement (EHVI) and expected R2 improvement (ER2I) families. The authors establish which preference transformations preserve exact computation, Pareto compatibility, and monotonicity, and which alter the underlying geometry. A key result is that exact integral R2 improvement is not, in general, an objective-space weighted hypervolume; it is exactly a scalarization-space volume (the Tchebycheff shadow measure), which enables new finite-sum and quadrature algorithms for ER2I. The work also provides an achievement-space Gaussian surrogate formulation that reduces ER2I to an integral of scalar Gaussian expected improvements.
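
The building block of that last reduction is the closed-form expected improvement of a single Gaussian variable. The following is a minimal sketch of that scalar term only, not the paper's algorithm; the function name `scalar_gaussian_ei`, its parameters, and the minimization convention are assumptions made for illustration.

```python
import math


def scalar_gaussian_ei(mu: float, sigma: float, best: float) -> float:
    """Expected improvement E[max(best - Y, 0)] for Y ~ N(mu, sigma^2).

    Minimization convention (an assumption; the summary does not state one).
    """
    if sigma <= 0.0:
        # Degenerate, deterministic prediction: the improvement is known exactly.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (best - mu) * cdf + sigma * pdf
```

An ER2I approximation along these lines would combine such terms over scalarization weights with a quadrature rule. The weights and the per-weight `mu`, `sigma`, and `best` would come from the paper's achievement-space surrogate, which is not reproduced here.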

Evaluation and Benchmarking · Tchebycheff Scalarization · Bayesian Multiobjective Optimization · Expected Hypervolume Improvement (EHVI) · Deng Representation · R2 Indicator

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.LG · 25d ago

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

This paper proposes a generalized probabilistic smoothing framework for global optimization that replaces Gaussian kernels with flexible symmetric unimodal kernels combined with monotonic ratio-based transformations. The authors prove that the smoothed objective preserves the global maximizer and that stationary points concentrate near the true optimum under large amplification, without requiring a decreasing smoothing schedule. Explicit complexity bounds for stochastic gradient ascent are derived, and a leave-one-out baseline is shown to provably reduce variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness over existing methods.

AI Safety Research · stochastic gradient ascent · Probabilistic Smoothing with Ratio-Monotone Transforms · Gaussian kernel smoothing

6 · arXiv · cs.AI · 24d ago

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

This paper investigates whether extrapolative weight averaging of RL-trained checkpoints can extend Pareto frontiers between competing objectives (correctness vs. computational efficiency) without additional training. Starting from a shared initialization, the authors train checkpoints under nested unit-test coverage regimes for competitive programming tasks, revealing a correctness-efficiency frontier where higher-coverage rewards reduce optimization failures but increase correctness failures. Extrapolation beyond trained endpoints produces complementary policies that, when ensembled, improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. Results hold across 7B and 32B model scales and three inference settings: pure reasoning, tool use, and agentic coding.
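
To make the idea of extrapolating beyond trained endpoints concrete, here is a minimal sketch. It assumes the simplest linear form, theta(alpha) = theta_a + alpha * (theta_b - theta_a), where alpha in [0, 1] interpolates and alpha outside that range extrapolates. The summary does not give the paper's exact formula, so this form, the function name `extrapolate_weights`, and the toy checkpoints are all hypothetical.

```python
from typing import Dict


def extrapolate_weights(theta_a: Dict[str, float],
                        theta_b: Dict[str, float],
                        alpha: float) -> Dict[str, float]:
    """Return theta_a + alpha * (theta_b - theta_a), parameter by parameter.

    Assumed linear form; alpha > 1 moves past theta_b, alpha < 0 past theta_a.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {name: theta_a[name] + alpha * (theta_b[name] - theta_a[name])
            for name in theta_a}


# Toy stand-ins for two checkpoints trained from a shared initialization.
low_coverage = {"w": 1.0, "b": 0.0}
high_coverage = {"w": 2.0, "b": -1.0}

# alpha = 1.5 continues along the low-to-high direction past the trained endpoint.
beyond_high = extrapolate_weights(low_coverage, high_coverage, 1.5)
```

With real models the same arithmetic would be applied tensor by tensor to the checkpoint state; scalars are used here only to keep the sketch self-contained.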

Evaluation and Benchmarking · Inference Economics · LCB/hard benchmark · Competitive Programming RL · LeetCode Hard (LCB/hard)

5 · arXiv · cs.AI · 1mo ago

Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

This paper critiques the standard practice of regularizing Joint-Embedding Predictive Architecture (JEPA) encoders toward isotropic Gaussian marginals, showing that this Euclidean symmetry assumption incurs a quantifiable 'price of isotropy' and that no geometry-independent fixed marginal target is universally canonical. The authors prove that oracle one-view marginals do not identify the view-to-view predictive coupling, arguing structural bias should enter the cross-view coupling instead. They introduce HamJEPA, which encodes views as phase-space states and uses a learned Hamiltonian leapfrog map for view-to-view prediction, with symplectic coupling identified as the key driver of gains. HamJEPA outperforms SIGReg on CIFAR-100 by up to +6.45 kNN@20 and +10.64 linear-probe points at 80 epochs, with similar improvements on ImageNet-100.
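
For readers unfamiliar with leapfrog maps, the sketch below shows one standard leapfrog (kick-drift-kick) step for a separable Hamiltonian H(q, p) = p^2 / 2 + V(q). It is an illustration of the integrator only: HamJEPA learns its Hamiltonian, whereas the fixed quadratic potential, the scalar state, and the names `leapfrog_step` and `energy` are assumptions made here.

```python
from typing import Callable, Tuple


def leapfrog_step(q: float, p: float,
                  grad_potential: Callable[[float], float],
                  step: float) -> Tuple[float, float]:
    """One leapfrog step for H(q, p) = p**2 / 2 + V(q)."""
    p_half = p - 0.5 * step * grad_potential(q)           # half kick on momentum
    q_new = q + step * p_half                             # full drift on position
    p_new = p_half - 0.5 * step * grad_potential(q_new)   # second half kick
    return q_new, p_new


def energy(q: float, p: float) -> float:
    """Energy of the toy harmonic oscillator with V(q) = q**2 / 2."""
    return 0.5 * p * p + 0.5 * q * q


q, p = 1.0, 0.0
for _ in range(1000):
    # grad V(q) = q for the assumed quadratic potential.
    q, p = leapfrog_step(q, p, lambda x: x, 0.01)
```

Because the map is symplectic, the energy of this toy system stays close to its initial value of 0.5 over many steps instead of drifting, which is the structural property the summary credits for the gains.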

Evaluation and Benchmarking · Alignment and RLHF · ImageNet-100 · HamJEPA · CIFAR-100

4 · arXiv · cs.AI · 3d ago

UBP2: Model-based preference RL with uncertainty-balanced exploration achieves sublinear regret

UBP2 (Uncertainty-Balanced Preference Planning) is a model-based reinforcement learning method that improves sample efficiency in preference-based RL by jointly reasoning over uncertainties in reward, dynamics, and value functions. The approach uses ensembles to score candidate trajectories and provides a principled exploitation-exploration tradeoff without ad hoc heuristics. The authors prove sublinear regret guarantees for finite- and infinite-horizon settings and demonstrate substantially better sample efficiency than model-free baselines on the Meta-World benchmark.

Evaluation and Benchmarking · Alignment and RLHF · Meta-World · UBP2

5 · arXiv · cs.LG · 26d ago

Active Query Synthesis for Preference Learning via Mutual Information Maximization

This paper introduces Info-Synth, an active query synthesis framework for preference learning that generates optimal pairwise queries by maximizing a mutual information objective in continuous space, bypassing the computational cost of pool-based evaluation. A confidence-aware response model is proposed to handle ambiguous comparisons between nearly identical or highly dissimilar items. Two finite-pool extensions (Pair M-dist and Pair Opt-dist) are also introduced. The framework is validated on synthetic preference tasks, text summarization datasets, and robotic controller tuning.

Evaluation and Benchmarking · Alignment and RLHF · active learning · Pair Opt-dist · mutual information

7 · arXiv · cs.AI · 1mo ago

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

Evaluation and Benchmarking · AI Safety Research · Invariant Risk Minimization · Matching Principle · Qwen2.5-7B

5 · arXiv · cs.LG · 20d ago

Tight Convergence Theory for Error Feedback Algorithms in Distributed Optimization

This paper provides tight convergence analyses for two major error-feedback algorithms—classic Error Feedback (EF) and Error Feedback 21 (EF21)—used to mitigate communication bottlenecks in distributed learning. The authors identify optimal step-size choices and construct tailored Lyapunov functions for each method, yielding guarantees that hold independently of the number of agents and recover the best known single-agent bounds. The work clarifies the relative performance of these gradient compression variants, which has remained poorly understood despite widespread use.
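
As background, classic Error Feedback keeps a running residual of whatever the compressor dropped and adds it back before the next compression. The sketch below shows that update for a single worker; the top-k compressor, the toy quadratic objective, the step size, and the names `top_k` and `ef_descent` are assumptions for illustration. It does not reproduce the paper's step-size choices or Lyapunov analysis, and EF21 is not shown.

```python
from typing import Callable, List


def top_k(v: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of v and zero the rest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def ef_descent(grad: Callable[[List[float]], List[float]],
               x0: List[float], step: float, k: int, iters: int) -> List[float]:
    """Gradient descent with a top-k compressor and classic error feedback."""
    x = list(x0)
    err = [0.0] * len(x)  # residual left behind by earlier compressions
    for _ in range(iters):
        g = grad(x)
        corrected = [e + step * gi for e, gi in zip(err, g)]  # add residual back
        sent = top_k(corrected, k)                            # compressed update
        x = [xi - si for xi, si in zip(x, sent)]
        err = [ci - si for ci, si in zip(corrected, sent)]    # carry what was dropped
    return x


# Toy objective f(x) = 0.5 * sum(c_i * x_i**2), so grad f(x)_i = c_i * x_i.
curvature = [1.0, 2.0]
x_final = ef_descent(lambda x: [c * xi for c, xi in zip(curvature, x)],
                     [1.0, -1.0], 0.01, 1, 2000)
```

Even though only one coordinate is transmitted per step, the residual ensures the dropped coordinate is applied later rather than lost, so the iterate still approaches the minimizer at the origin in this toy setting.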

Training Infrastructure · Inference Economics · Error Feedback 21 (EF21) · Error Feedback (EF) · Lyapunov function

7 · arXiv · cs.AI · 1mo ago

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.
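
For reference, pass@k is commonly estimated from n samples per problem, of which c are correct, using the combinatorial estimator 1 - C(n - c, k) / C(n, k). The summary does not say which estimator the paper uses, so the sketch below is an assumption about the metric's usual definition, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, i.e. every size-k draw contains a success.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem values are then averaged over the benchmark. Larger k rewards a model whose samples cover more distinct solutions, which is why the metric is sensitive to the diversity that VPO targets.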

Evaluation and Benchmarking · Inference Economics · GRPO · pass@k · AlphaEvolve