Almanac
technique

POW3R

techniqueactivepow3r-4a23b02e·1 events·first seen 27d ago

Aliases: POW3R

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.AI·27d ago·source ↗

POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training

This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): static aggregation of criterion weights conflates human-assigned importance with current optimization utility, causing many criteria to be either already saturated or unreachable. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training using rollout-level contrast to emphasize criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.