Wang-ML-Lab

organization · active · provisional · wang-ml-lab-94c5ce8f · 1 event · first seen 5d ago

Aliases: Wang-ML-Lab

Recent events (1)

6 · arXiv · cs.CL · 5d ago

OrchRM: Self-supervised reward modeling for multi-agent orchestration without human annotations

Researchers propose Orchestration Reward Modeling (OrchRM), a self-supervised framework that trains reward models for LLM-based multi-agent orchestrators using intermediate execution artifacts to construct win-lose pairs for Bradley-Terry training. The approach avoids costly sub-agent rollouts by operating directly at the orchestration level, achieving up to 10x improvement in training token efficiency and up to 8% accuracy gains in test-time scaling. Results generalize across mathematical reasoning, web-based QA, and multi-hop reasoning tasks.
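For readers unfamiliar with Bradley-Terry training, the pairwise objective mentioned above can be sketched as follows. This is a minimal illustration of the standard Bradley-Terry negative log-likelihood on scalar reward scores, not the paper's implementation; the function names `bradley_terry_loss` and `batch_loss` are hypothetical, and how OrchRM actually builds its win-lose pairs from execution artifacts is not described in this summary.

```python
import math


def bradley_terry_loss(reward_win: float, reward_lose: float) -> float:
    """Negative log-likelihood that the 'win' item is preferred over the 'lose' item.

    Under the Bradley-Terry model, P(win beats lose) = sigmoid(r_win - r_lose),
    so the loss is -log(sigmoid(margin)) = softplus(-margin).
    """
    margin = reward_win - reward_lose
    # Numerically stable softplus(-margin): max(-m, 0) + log(1 + exp(-|m|)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def batch_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean Bradley-Terry loss over (reward_win, reward_lose) pairs (hypothetical helper)."""
    return sum(bradley_terry_loss(w, l) for w, l in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative scores only: a reward model that ranks the winner higher gets a lower loss.
    print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))
```

In practice the two scores would come from a learned reward model and the loss would be minimized by gradient descent; plain floats are used here only to keep the sketch self-contained.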