SkillWeaver: Compositional Skill Routing for LLM Agents via Decompose-Retrieve-Compose
Researchers introduce SkillWeaver, a framework for compositional skill routing in LLM agents that decomposes complex queries into atomic sub-tasks, retrieves matching skills from a large library, and composes an executable DAG plan. The paper formalizes the Compositional Skill Routing problem and introduces CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills across 24 categories. A key finding is that task decomposition quality is the primary bottleneck, with standard LLM decomposition reaching only 34.2% category recall; the proposed Iterative Skill-Aware Decomposition (SAD) method improves decomposition accuracy from 51.0% to 67.7% in a single iteration. The framework also reduces context window consumption by over 99% compared to naive skill-stuffing approaches.
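The decompose-retrieve-compose flow described above can be illustrated with a small sketch. This is not the paper's implementation: the skill library, the hand-written sub-tasks standing in for the LLM decomposer, the token-overlap retriever, and all names (`SKILLS`, `retrieve`, `compose`) are assumptions made for illustration only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy skill library: skill name -> description.
SKILLS = {
    "web_search": "search the web for pages matching a query",
    "pdf_extract": "extract text from a pdf file",
    "summarize": "summarize a long text into a short summary",
}

def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def retrieve(subtask, skills):
    """Return the skill whose description has the highest Jaccard overlap
    with the sub-task text (a stand-in for a real retriever)."""
    q = tokens(subtask)

    def score(name):
        d = tokens(skills[name])
        return len(q & d) / len(q | d)

    return max(skills, key=score)

def compose(subtasks, deps, skills):
    """Map each sub-task to a skill and order the steps so that every
    step runs after the steps it depends on (a topological order of the DAG)."""
    order = TopologicalSorter(deps).static_order()
    return [(step, retrieve(subtasks[step], skills)) for step in order]

# Hand-written decomposition standing in for the LLM decomposer (assumption).
subtasks = {
    "s1": "search the web for a query",
    "s2": "extract text from the pdf",
    "s3": "summarize the text",
}
# deps maps each step to the set of steps that must finish first.
deps = {"s1": set(), "s2": {"s1"}, "s3": {"s2"}}

plan = compose(subtasks, deps, SKILLS)
print(plan)
```

Only the retrieved skills for each step would need to enter the agent's context, which is the intuition behind the reported context savings relative to placing the whole library in the prompt.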
Related events (8)
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
SkillGenBench is a new benchmark designed to evaluate the ability of LLM agents to generate correct, reusable, and executable skills from raw repositories and documents, rather than merely using pre-provided skills. It covers two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), with standardized execution-based evaluation protocols. Experiments across multiple skill-generation methods reveal substantial performance variation and distinct failure modes depending on source type. The benchmark aims to establish skill generation as an independent research problem within agent systems.
MUSE-Autoskill: Self-Evolving LLM Agents via Skill Lifecycle Management
MUSE-Autoskill introduces a skill-centric agent framework where LLM agents continuously create, store, manage, evaluate, and refine reusable skills across tasks. The system adds skill-level memory that accumulates per-skill experience over time, enabling more effective reuse and cross-agent transfer. Experiments on SkillsBench show improvements in task success, efficiency, and reuse compared to static skill approaches.
SKIM: Adaptive soft-token compression for procedural skills in LLM workflows
Researchers introduce SKIM (SKIll coMpression), a multi-resolution soft token compression framework targeting procedural knowledge (skills/workflows) rather than factual documents. SKIM compresses reusable natural language skills to 30–60% of their original token length while preserving task performance, reducing prefill cost and latency when skills are repeatedly invoked. The method adapts compression depth to skill complexity and supports offline compression for frequently updated community skills.
Skill-RM: A unified reward model framework treating evaluation as an agentic skill
Researchers from the Qwen team propose Skill-RM, a framework that reformulates reward modeling as the execution of a reusable 'Reward-Evaluation Skill,' enabling a single model to orchestrate heterogeneous evaluation criteria including rule-based verifiers, ground-truth references, and rubrics. By treating reward computation as a structured agentic task, Skill-RM dynamically selects and aggregates evidence per input rather than relying on static evaluation. Experiments on reward benchmarks and downstream tasks (best-of-N selection, RL) show consistent improvements over traditional judge baselines. The code is publicly released under the Qwen-Applications GitHub organization.
Systematic Study of Model-Generated Agent Skills Across the Full Skill Lifecycle
This paper presents a utility-grounded evaluation framework for model-generated agent skills, covering the full lifecycle of experience generation, skill extraction, and skill consumption across five agentic task domains. The authors find that while such skills are beneficial on average, they exhibit non-trivial negative transfer, and that skill utility is independent of model scale or baseline task strength. A key finding is that strong extractors are not necessarily strong consumers and vice versa. The work culminates in a 'meta-skill' that guides extraction toward utility-correlated features, consistently improving skill quality and reducing negative transfer.
DataCOPE: Unsupervised skill discovery framework for data-analytic agents
Researchers introduce DataCOPE, an unsupervised verifier-guided framework for discovering reusable procedural skills in data-analytic agents without labeled supervision or parameter updates. The system coordinates three components—a data-analytic agent, an unsupervised verifier, and a skill manager for contrastive skill distillation—with task-specific verifier instantiations for report-style and reasoning-style analysis. Evaluated on Deep Data Research and DABStep benchmarks, DataCOPE improves mean scores by 9.71% and 32.30% respectively across four model settings. The approach addresses a key bottleneck in agentic data analysis: acquiring reliable skill supervision at scale.
SkillOpt: Systematic Text-Space Optimizer for Self-Evolving Agent Skills
SkillOpt introduces a principled optimization framework for agent skills, treating the skill document as an external trainable state analogous to model weights. A separate optimizer model converts scored rollouts into bounded edits (add/delete/replace) on a skill document, accepting only edits that improve held-out validation scores. Evaluated across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt achieves best or tied performance on all 52 evaluated cells, lifting GPT-5.5 no-skill accuracy by up to +24.8 points inside the Codex agentic loop. Optimized skill artifacts also transfer across model scales and execution environments without further optimization.
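The accept-only-if-validation-improves loop described above might look roughly like the following sketch. It is an illustration under stated assumptions, not SkillOpt's code: the edit tuple format, the `toy_score` validation function, and the pre-supplied edit list (standing in for an optimizer model that proposes edits from scored rollouts) are all hypothetical.

```python
def apply_edit(doc, edit):
    """Apply one bounded edit (add/delete/replace) to a skill document,
    represented here as a list of lines. Returns a new list."""
    op, idx, text = edit
    new = list(doc)
    if op == "add":
        new.insert(idx, text)
    elif op == "delete":
        del new[idx]
    elif op == "replace":
        new[idx] = text
    else:
        raise ValueError(f"unknown op: {op}")
    return new

def optimize(doc, proposed_edits, val_score):
    """Greedy loop: keep an edit only if it strictly improves the
    held-out validation score of the current best document."""
    best, best_score = list(doc), val_score(doc)
    for edit in proposed_edits:
        candidate = apply_edit(best, edit)
        score = val_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Hypothetical validation scorer: rewards required instructions and
# penalizes length. A real scorer would run held-out tasks.
REQUIRED = {"check inputs", "run tests"}

def toy_score(doc):
    return len(REQUIRED & set(doc)) - 0.1 * len(doc)

doc = ["check inputs", "be verbose"]
edits = [
    ("add", 2, "run tests"),        # adds a required line: accepted
    ("replace", 1, "say hello"),    # no score change: rejected
    ("delete", 1, None),            # removes an unhelpful line: accepted
]
final, score = optimize(doc, edits, toy_score)
print(final, score)
```

Because rejected edits leave the document untouched, the validation score never decreases across iterations, which is the property the acceptance rule is meant to guarantee.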
SkillHarm: Lifecycle-Aware Benchmark for Skill-Based Attacks on AI Agents
SkillHarm is a new benchmark evaluating adversarial attacks on AI agent skills across their full use lifecycle, covering two attack scenarios: Fixed-Payload Poisoning (FPP) and Self-Mutating Poisoning (SMP). The benchmark includes 879 attack samples across 71 skills, organized under a 12-category risk taxonomy targeting data pipelines, system environments, and agent autonomy. Experiments show current agents remain highly vulnerable, with attack success rates up to 86.3% (FPP) and 69.3% (SMP). An automated construction pipeline called AutoSkillHarm, driven by coding agents, was used to generate the benchmark at scale.

