Entity · technique

POW3R

technique · active · pow3r-4a23b02e · 1 event · first seen May 20, 2026

Aliases: POW3R

Co-occurring entities

rubric-based rewards · GRPO · Reinforcement Learning with Verifiable Rewards

More like this (12)

PwC · SPPO · WPP · PRX · P4IR · Wiz Research · ClinPRISM · ST-RoPE · CWQ · Qwen3 · RealWorldQA · PPO

Recent events (1)

arXiv · cs.AI · May 20, 2026

POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training

This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): aggregating criterion scores with static weights conflates human-assigned importance with current optimization utility, so that many of the weighted criteria are either already saturated or unreachable for the current policy. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training, using rollout-level contrast to emphasize the criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and it reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.
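The summary does not give POW3R's actual weighting formula, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that "rollout-level contrast" can be approximated by the standard deviation of each criterion's score across a group of rollouts for the same prompt, and that the dynamic weight is the human weight multiplied by that contrast. The function names (`contrast_weights`, `rollout_rewards`, `group_advantages`) and the fallback rule are hypothetical.

```python
from statistics import mean, pstdev


def contrast_weights(scores, human_weights, eps=1e-8):
    """Rescale human weights by how much each criterion separates rollouts.

    scores[i][k] is rollout i's score on criterion k, in [0, 1].
    ASSUMPTION: contrast = population std dev across the rollout group;
    the paper's real definition may differ.
    """
    n_criteria = len(human_weights)
    contrast = [pstdev([row[k] for row in scores]) for k in range(n_criteria)]
    raw = [w * c for w, c in zip(human_weights, contrast)]
    total = sum(raw)
    if total < eps:
        # No criterion differentiates the rollouts: fall back to the
        # static human weights (a hypothetical choice for this sketch).
        static_total = sum(human_weights)
        return [w / static_total for w in human_weights]
    return [r / total for r in raw]


def rollout_rewards(scores, weights):
    """Weighted sum of criterion scores for each rollout."""
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts, three criteria: criterion 0 is saturated (always met),
# criterion 1 is unreachable (never met), criterion 2 separates rollouts.
scores = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
]
human_weights = [0.5, 0.3, 0.2]

weights = contrast_weights(scores, human_weights)   # [0.0, 0.0, 1.0]
rewards = rollout_rewards(scores, weights)          # [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
```

In this toy group the saturated and unreachable criteria have zero contrast, so all of the weight moves to the one criterion that distinguishes rollouts, even though its human-assigned weight is the smallest. A static weighted sum would yield the same reward ranking here but would spend 80% of the nominal weight on criteria that cannot change the advantage.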

Evaluation and Benchmarking · Alignment and RLHF · rubric-based rewards · GRPO · POW3R