Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

paper · active · skill-rm-unifying-heterogeneous-evaluation-criteria-via-agent-skill-c7d46330 · 1 event · first seen Jun 3, 2026

Aliases: Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Co-occurring entities

Skill-RM · Alibaba · Qwen

More like this (12)

Skill-RM
Who Grades the Grader? Co-Evolving Evaluation Metrics and Skills for Self-Improving LLM Agents
Model-Generated Agent Skills (paper)
multi-level agent evaluation
OpenSkillRisk: Benchmarking Agent Safety When Using Real-World Risky Third-Party Skills
Skill Self-Play: Pushing the Frontier of LLM Capability with Co-Evolving Skills
The Blind Curator: How a Biased Judge Silently Disables Skill Retirement in Self-Evolving Agents
agent-skills
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
Generative Skill Composition for LLM Agents
Reward Modeling for Multi-Agent Orchestration
Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Recent events (1)

6 · arXiv · cs.LG · Jun 3, 2026

Skill-RM: A unified reward model framework treating evaluation as an agentic skill

Researchers from the Qwen team propose Skill-RM, a framework that reformulates reward modeling as the execution of a reusable 'Reward-Evaluation Skill,' enabling a single model to orchestrate heterogeneous evaluation criteria including rule-based verifiers, ground-truth references, and rubrics. By treating reward computation as a structured agentic task, Skill-RM dynamically selects and aggregates evidence per input rather than relying on static evaluation. Experiments on reward benchmarks and downstream tasks (best-of-N selection, RL) show consistent improvements over traditional judge baselines. The code is publicly released under the Qwen-Applications GitHub organization.
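The idea of selecting and aggregating evidence per input can be illustrated with a toy sketch. This is not the Skill-RM implementation or its released code: the `Sample` fields, the substring-based reference and rubric checks, and the plain averaging in `reward` are all hypothetical stand-ins chosen for illustration, and a real system would have a model decide which criteria to run and how to weigh them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    # Hypothetical container: every field name here is an assumption.
    prompt: str
    response: str
    reference: Optional[str] = None                    # ground-truth answer, if available
    verifier: Optional[Callable[[str], bool]] = None   # rule-based check, if available
    rubric: Optional[list[str]] = None                 # keywords standing in for rubric items


def collect_evidence(sample: Sample) -> dict[str, float]:
    """Run only the criteria that apply to this input; each score lies in [0, 1]."""
    evidence: dict[str, float] = {}
    text = sample.response.lower()
    if sample.verifier is not None:
        evidence["verifier"] = 1.0 if sample.verifier(sample.response) else 0.0
    if sample.reference is not None:
        evidence["reference"] = 1.0 if sample.reference.strip().lower() in text else 0.0
    if sample.rubric:
        hits = sum(1 for item in sample.rubric if item.lower() in text)
        evidence["rubric"] = hits / len(sample.rubric)
    return evidence


def reward(sample: Sample) -> float:
    """Aggregate the available evidence with an unweighted mean (an assumed choice)."""
    evidence = collect_evidence(sample)
    if not evidence:
        return 0.0
    return sum(evidence.values()) / len(evidence)


def best_of_n(prompt: str, candidates: list[str], **criteria) -> str:
    """Best-of-N selection: return the candidate with the highest reward."""
    return max(candidates, key=lambda r: reward(Sample(prompt, r, **criteria)))


candidates = ["The answer is 5.", "The answer is 4 because 2 + 2 = 4."]
# Only a reference and a rubric are supplied, so no verifier is run.
best = best_of_n("What is 2 + 2?", candidates,
                 reference="4", rubric=["because", "2 + 2"])
```

Here `best` is the second candidate, since it matches the reference and both rubric keywords while the first matches neither. The point of the sketch is only the control flow: different inputs carry different kinds of criteria, and a single entry point dispatches to whichever apply before combining their scores.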

Evaluation and Benchmarking · Agent and Tool Ecosystem · Skill-RM · Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill · Alibaba