Hybrid Reward Advantage Splitting
technique · active · provisional
hybrid-reward-advantage-splitting-c849668c · 1 event · first seen 2d ago
Aliases: Hybrid Reward Advantage Splitting
More like this (12)
RREDCoT: Segment-Level Reward Redistribution for Reasoning Models
reward model
reward hacking
cluster voting reward
CapReward
Gradient-Guided Reward Optimization
In-Context Reward Adaptation
hybrid reasoning
Scaling Laws for Reward Model Overoptimization
Process Reward Model
Rule-Based Rewards
Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
Recent events (1)
CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR
Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in large vision-language models trained with reinforcement learning with verifiable rewards (RLVR), showing that the problem persists throughout Group Relative Policy Optimization (GRPO) training and into inference. They propose CORA, which adds a lightweight, plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate optimization of the task and consistency objectives. Experiments across multimodal reasoning benchmarks show that CORA improves both task performance and reasoning faithfulness.
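The entry does not say how HRAS actually computes advantages. The sketch below is a hypothetical illustration only, built on two assumptions: (1) GRPO-style group-relative advantages, where each rollout's reward has the group mean subtracted and is divided by the group standard deviation, and (2) "splitting" meaning that the task reward and the consistency reward are normalized separately before being combined with weights. The function names and the weights `w_task` and `w_cons` are invented for illustration and do not come from CORA.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollouts in one group
    return [(r - mu) / (sigma + eps) for r in rewards]


def split_advantages(task_rewards, consistency_rewards, w_task=1.0, w_cons=0.5, eps=1e-6):
    """Hypothetical HRAS-style combination (an assumption, not CORA's published method):
    normalize each reward stream within the group separately, then take a weighted sum."""
    if len(task_rewards) != len(consistency_rewards):
        raise ValueError("task and consistency rewards must cover the same rollouts")
    a_task = group_normalize(task_rewards, eps)
    a_cons = group_normalize(consistency_rewards, eps)
    return [w_task * at + w_cons * ac for at, ac in zip(a_task, a_cons)]


if __name__ == "__main__":
    # One prompt, four sampled rollouts (made-up numbers for illustration).
    task = [1, 1, 0, 0]                 # e.g. verifiable answer correctness
    consistency = [0.9, 0.2, 0.8, 0.1]  # e.g. score from a consistency reward model
    for adv in split_advantages(task, consistency):
        print(round(adv, 3))
```

One reason to normalize each stream separately rather than summing the raw rewards first: a reward with a larger spread would otherwise dominate the combined advantage. Also, when every rollout in a group receives the same task reward (all correct or all wrong), the task advantage collapses to zero while the consistency term can still carry a learning signal. Whether CORA's HRAS works this way is not stated in the source.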