Entity · benchmark

best@k

benchmark · active · best-k-591b054b · 1 event · first seen May 22, 2026

Aliases: best@k

Co-occurring entities

GRPO · pass@k · AlphaEvolve · Vector Policy Optimization

More like this (12)

Pass@1 · pass@k · page-agent · val14 · LKvaluesIT · K-Search · mksglu · Agents-K1 · Kevin Xu · FineWeb-Edu · linshenkx · WISE

Recent events (1)

7 · arXiv · cs.AI · May 22, 2026

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

Evaluation and Benchmarking · Inference Economics · GRPO · pass@k · AlphaEvolve
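
The summary above reports results on pass@k and best@k without defining them. The sketch below shows one common way to compute both from n sampled solutions per problem. It is an assumption about the usual definitions, not the paper's evaluation code. Here pass@k is the widely used unbiased estimator (the probability that at least one of k samples drawn without replacement is correct), and best@k is taken to mean the expected maximum score among k samples drawn without replacement. The function names `pass_at_k` and `best_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct,
    when k samples are drawn without replacement from n samples
    of which c are correct (assumed definition of pass@k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(scores: list[float], k: int) -> float:
    """Expected maximum score among k samples drawn without
    replacement from `scores` (assumed definition of best@k)."""
    n = len(scores)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    ordered = sorted(scores)
    # The i-th smallest score (0-indexed) is the maximum of exactly
    # comb(i, k - 1) of the comb(n, k) possible size-k subsets.
    weighted = sum(s * comb(i, k - 1) for i, s in enumerate(ordered))
    return weighted / comb(n, k)


if __name__ == "__main__":
    rewards = [0.1, 0.5, 0.9]  # illustrative scores, not data from the paper
    for k in (1, 2, 3):
        print(k, best_at_k(rewards, k))
```

Under these definitions, best@1 is the mean score, and for binary 0/1 rewards best@k equals pass@k, so the two metrics differ only when the reward is graded rather than pass/fail.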