Almanac
rule-based reinforcement learning rewards

technique · active · provisional · rule-based-reinforcement-learning-rewards-d25d146c · 1 event · first seen 20d ago

Aliases: rule-based reinforcement learning rewards

Recent events (1)

6 · arXiv · cs.AI · 20d ago

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

OmniVerifier-M1 is a generalist visual verifier trained using symbolic meta-verification rationales (e.g., bounding boxes) and decoupled reinforcement learning objectives for binary judgment versus meta-verification. The paper finds that symbolic verifier outputs outperform textual explanations as rationales, enabling rule-based RL rewards without auxiliary judge models, and that decoupling RL objectives substantially improves performance over joint optimization. The system further enables M1-TTS, a verifier-driven agentic generation pipeline supporting dynamic region-level self-correction in multimodal outputs.
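To illustrate why symbolic rationales make rule-based rewards possible, here is a minimal sketch of how a bounding-box rationale could be scored by a fixed rule rather than by an auxiliary judge model. It is not the paper's reward: the function names (`iou`, `judgment_reward`, `box_reward`), the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are all assumptions made for illustration. The two reward functions are kept separate only to mirror the idea of decoupled objectives for binary judgment and meta-verification.

```python
from typing import Sequence

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judgment_reward(predicted: bool, reference: bool) -> float:
    """Hypothetical reward for the binary judgment: exact match only."""
    return 1.0 if predicted == reference else 0.0


def box_reward(predicted: Box, reference: Box, threshold: float = 0.5) -> float:
    """Hypothetical reward for the symbolic rationale.

    The rule is a fixed IoU threshold (an assumed value), so no learned
    judge is needed to score the rationale.
    """
    return 1.0 if iou(predicted, reference) >= threshold else 0.0
```

A free-text explanation has no comparable closed-form check, which is one plausible reading of why a symbolic output lends itself to this kind of reward while a textual rationale would typically need a judge model.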