paper
Reward Modeling for Multi-Agent Orchestration
paper · active · provisional
reward-modeling-for-multi-agent-orchestration-181bc679 · 1 event · first seen 5d ago
Aliases: Reward Modeling for Multi-Agent Orchestration
More like this (12)
Preference Coordinated Multi-agent Policy Optimization
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
multi-turn agent benchmarks
multi-agent cooperative framework
Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
multi-level agent evaluation
reward model
Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents
multimodal agents
Multi-Agent Scaffold
UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
Recent events (1)
OrchRM: Self-supervised reward modeling for multi-agent orchestration without human annotations
Researchers propose Orchestration Reward Modeling (OrchRM), a self-supervised framework that trains reward models for LLM-based multi-agent orchestrators using intermediate execution artifacts to construct win-lose pairs for Bradley-Terry training. The approach avoids costly sub-agent rollouts by operating directly at the orchestration level, achieving up to 10x improvement in training token efficiency and up to 8% accuracy gains in test-time scaling. Results generalize across mathematical reasoning, web-based QA, and multi-hop reasoning tasks.
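The summary above mentions Bradley-Terry training on win-lose pairs and using the reward model for test-time scaling. A minimal sketch of both ideas follows. It is not OrchRM's implementation: the function names `bradley_terry_loss` and `best_of_n` are hypothetical, rewards are plain floats rather than model outputs, and best-of-N selection is assumed here as one common form of test-time scaling.

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def bradley_terry_loss(r_win: float, r_lose: float) -> float:
    """Negative log-likelihood that the 'win' item is preferred.

    Under the Bradley-Terry model, P(win > lose) = sigmoid(r_win - r_lose),
    so the loss is -log(sigmoid(margin)) = softplus(-margin).
    """
    margin = r_win - r_lose
    # Numerically stable softplus(-margin):
    # softplus(x) = max(x, 0) + log1p(exp(-|x|)), with x = -margin.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest reward-model score.

    Assumed form of test-time scaling: sample several orchestration
    plans, score each, and keep the top one.
    """
    return max(candidates, key=score)


if __name__ == "__main__":
    # Equal rewards: the model is indifferent, so the loss is log(2).
    print(bradley_terry_loss(0.0, 0.0))
    # A correctly ordered pair gives a smaller loss than a reversed one.
    print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
    # Toy scorer (string length) standing in for a trained reward model.
    print(best_of_n(["a", "bb", "ccc"], score=len))
```

In practice the scalar rewards would come from a trained reward model and the loss would be averaged over a batch of win-lose pairs; the float-only version here just makes the objective easy to inspect.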