5 · arXiv cs.CL (Computation and Language) · 24d ago

MUSE-Autoskill: Self-Evolving LLM Agents via Skill Lifecycle Management

MUSE-Autoskill introduces a skill-centric agent framework where LLM agents continuously create, store, manage, evaluate, and refine reusable skills across tasks. The system adds skill-level memory that accumulates per-skill experience over time, enabling more effective reuse and cross-agent transfer. Experiments on SkillsBench show improvements in task success, efficiency, and reuse compared to static skill approaches.
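To make the idea of skill-level memory concrete, here is a minimal sketch of a skill library that records per-skill outcomes and flags weak skills for refinement. It is an illustration only, not the MUSE-Autoskill implementation: the names `Skill`, `SkillLibrary`, `refine_below`, and `min_uses` are hypothetical, and the refinement rule (a success-rate threshold) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable skill plus the experience gathered each time it is used."""
    name: str
    body: str                                            # reusable instructions or code
    outcomes: list[bool] = field(default_factory=list)   # one entry per use
    notes: list[str] = field(default_factory=list)       # lesson learned per use

    def success_rate(self) -> float:
        # Fraction of recorded uses that succeeded (0.0 if never used).
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


class SkillLibrary:
    """Hypothetical store covering create, record (evaluate), and refine steps."""

    def __init__(self, refine_below: float = 0.5, min_uses: int = 3):
        self.skills: dict[str, Skill] = {}
        self.refine_below = refine_below   # assumed threshold, not from the paper
        self.min_uses = min_uses           # wait for enough evidence before judging

    def create(self, name: str, body: str) -> Skill:
        self.skills[name] = Skill(name, body)
        return self.skills[name]

    def record(self, name: str, success: bool, note: str = "") -> None:
        # Accumulate per-skill experience after each task.
        skill = self.skills[name]
        skill.outcomes.append(success)
        if note:
            skill.notes.append(note)

    def needs_refinement(self) -> list[str]:
        # Skills with enough uses and a low success rate are refinement candidates.
        return [
            s.name for s in self.skills.values()
            if len(s.outcomes) >= self.min_uses
            and s.success_rate() < self.refine_below
        ]


library = SkillLibrary()
library.create("parse_csv", "Read the file with the csv module, then validate headers.")
library.create("plot_chart", "Use matplotlib; label both axes.")
for ok in (True, True, False):
    library.record("parse_csv", ok)
for ok in (False, False, True):
    library.record("plot_chart", ok, note="axis labels were missing")
```

In this toy run only `plot_chart` falls below the threshold, so it is the one a refinement step would revisit, using its accumulated notes as input.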

Evaluation and Benchmarking · Agent and Tool Ecosystem · skill-level memory · MUSE-Autoskill · large language model agents · SkillsBench

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 3d ago

SkillWeaver: Compositional Skill Routing for LLM Agents via Decompose-Retrieve-Compose

Researchers introduce SkillWeaver, a framework for compositional skill routing in LLM agents that decomposes complex queries into atomic sub-tasks, retrieves matching skills from a large library, and composes an executable DAG plan. The paper formalizes the Compositional Skill Routing problem and introduces CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills across 24 categories. A key finding is that task decomposition quality is the primary bottleneck, with standard LLM decomposition reaching only 34.2% category recall; the proposed Iterative Skill-Aware Decomposition (SAD) method improves decomposition accuracy from 51.0% to 67.7% in a single iteration. The framework also reduces context window consumption by over 99% compared to naive skill-stuffing approaches.
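The decompose-retrieve-compose flow can be sketched in a few lines. This is a toy illustration under stated assumptions, not SkillWeaver's method: the decomposition is supplied by hand (in the paper an LLM produces it), retrieval is plain word overlap rather than a learned retriever, and the skill names and the helpers `retrieve` and `compose` are hypothetical. Only `graphlib.TopologicalSorter` is a real standard-library API (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical skill library: name -> short description.
SKILLS = {
    "fetch_weather": "get current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "translate_text": "translate text into another language",
}


def retrieve(subtask: str, skills: dict[str, str]) -> str:
    """Return the skill whose description shares the most words with the sub-task."""
    words = set(subtask.lower().split())
    return max(skills, key=lambda name: len(words & set(skills[name].split())))


def compose(subtasks: dict[str, tuple[str, list[str]]],
            skills: dict[str, str]) -> list[tuple[str, str]]:
    """Build an executable plan from sub-tasks.

    `subtasks` maps a sub-task id to (description, ids it depends on).
    The result lists (sub-task id, chosen skill) in dependency order.
    """
    graph = {tid: deps for tid, (_, deps) in subtasks.items()}
    order = TopologicalSorter(graph).static_order()  # predecessors come first
    return [(tid, retrieve(subtasks[tid][0], skills)) for tid in order]


# Hand-written decomposition of "email me the Paris forecast in French".
subtasks = {
    "t1": ("get the weather forecast for Paris", []),
    "t2": ("translate the forecast text into French", ["t1"]),
    "t3": ("send the translated text as an email", ["t2"]),
}
plan = compose(subtasks, SKILLS)
print(plan)
```

Because only the retrieved skills enter the plan, the prompt never has to carry the whole library, which is the intuition behind the reported context savings. The sketch also shows why decomposition quality matters: a poorly worded sub-task retrieves the wrong skill no matter how good the library is.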

Evaluation and Benchmarking · Agent and Tool Ecosystem · CompSkillBench · MCP · SkillWeaver · +2 more

5 · arXiv · cs.AI · 1mo ago

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

SkillGenBench is a new benchmark designed to evaluate the ability of LLM agents to generate correct, reusable, and executable skills from raw repositories and documents, rather than merely using pre-provided skills. It covers two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), with standardized execution-based evaluation protocols. Experiments across multiple skill-generation methods reveal substantial performance variation and distinct failure modes depending on source type. The benchmark aims to establish skill generation as an independent research problem within agent systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · task-conditioned generation · task-agnostic generation · SkillGenBench · +2 more

7 · arXiv · cs.AI · 26d ago

SkillOpt: Systematic Text-Space Optimizer for Self-Evolving Agent Skills

SkillOpt introduces a principled optimization framework for agent skills, treating the skill document as an external trainable state analogous to model weights. A separate optimizer model converts scored rollouts into bounded edits (add/delete/replace) on a skill document, accepting only edits that improve held-out validation scores. Evaluated across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt achieves best or tied performance on all 52 evaluated cells, lifting GPT-5.5 no-skill accuracy by up to 24.8 points inside the Codex agentic loop. Optimized skill artifacts also transfer across model scales and execution environments without further optimization.
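The accept-only-if-validation-improves rule can be illustrated with a small loop. This is a sketch under stated assumptions, not SkillOpt's implementation: the proposed edits are a fixed list (in the paper an optimizer model derives them from scored rollouts), and `validate` is a toy stand-in for a held-out validation score. The names `apply_edit`, `optimize`, and `REQUIRED` are hypothetical.

```python
from typing import Callable

# An edit is ("add", index, text), ("delete", index), or ("replace", index, text).
Edit = tuple


def apply_edit(doc: list[str], edit: Edit) -> list[str]:
    """Return a copy of the skill document with one bounded edit applied."""
    new = list(doc)
    op = edit[0]
    if op == "add":
        new.insert(edit[1], edit[2])
    elif op == "delete":
        del new[edit[1]]
    elif op == "replace":
        new[edit[1]] = edit[2]
    else:
        raise ValueError(f"unknown edit operation: {op!r}")
    return new


def optimize(doc: list[str], proposed: list[Edit],
             validate: Callable[[list[str]], float]) -> tuple[list[str], float]:
    """Keep an edit only if it strictly improves the validation score."""
    best, best_score = list(doc), validate(doc)
    for edit in proposed:
        candidate = apply_edit(best, edit)
        score = validate(candidate)
        if score > best_score:          # reject ties and regressions
            best, best_score = candidate, score
    return best, best_score


# Toy validation score: reward required guidance, lightly penalize other lines.
REQUIRED = {"check units", "cite sources"}


def validate(doc: list[str]) -> float:
    hits = sum(line in REQUIRED for line in doc)
    return hits - 0.1 * (len(doc) - hits)


skill_doc = ["be concise"]
edits = [
    ("add", 1, "check units"),        # accepted: score rises
    ("replace", 0, "be verbose"),     # rejected: score unchanged
    ("add", 2, "cite sources"),       # accepted
    ("delete", 0),                    # accepted: removes the unhelpful line
]
final_doc, final_score = optimize(skill_doc, edits, validate)
print(final_doc, final_score)
```

Gating every edit on a held-out score is what lets the skill document behave like trainable state: the document can only move in directions that the validation set confirms.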

Evaluation and Benchmarking · Agent and Tool Ecosystem · TextGrad · SkillOpt · Trace2Skill · +6 more

6 · arXiv · cs.AI · 26d ago

Systematic Study of Model-Generated Agent Skills Across the Full Skill Lifecycle

This paper presents a utility-grounded evaluation framework for model-generated agent skills, covering the full lifecycle of experience generation, skill extraction, and skill consumption across five agentic task domains. The authors find that while such skills are beneficial on average, they exhibit non-trivial negative transfer, and that skill utility is independent of both model scale and baseline task strength. A further finding is that strong extractors are not necessarily strong consumers, and vice versa. The work culminates in a 'meta-skill' that guides extraction toward utility-correlated features, consistently improving skill quality and reducing negative transfer.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Model-Generated Agent Skills (paper) · skill extraction · meta-skill · +2 more

9 · Meta AI Blog · 1mo ago

Meta Introduces Muse Spark: First Model from Meta Superintelligence Labs with Multimodal Reasoning and Multi-Agent Orchestration

Meta has launched Muse Spark, the first model from its newly formed Meta Superintelligence Labs, positioned as a natively multimodal reasoning model with tool-use, visual chain-of-thought, and multi-agent orchestration capabilities. The model introduces 'Contemplating mode,' which runs multiple agents in parallel to compete with frontier reasoning modes, achieving 58% on Humanity's Last Exam and 38% on FrontierScience Research. Meta claims a greater than 10x compute efficiency improvement over Llama 4 Maverick through a rebuilt pretraining stack, and describes predictable scaling across pretraining, RL, and test-time reasoning axes. Muse Spark is available at meta.ai with a private API preview, and is framed as the first step on a scaling ladder toward 'personal superintelligence.'

Training Infrastructure · Long Context Evolution · Hyperion · Meta AI · Gemini Deep Think · +14 more

7 · arXiv · cs.CL · 18d ago

SkillHarm: Lifecycle-Aware Benchmark for Skill-Based Attacks on AI Agents

SkillHarm is a new benchmark evaluating adversarial attacks on AI agent skills across their full use lifecycle, covering two attack scenarios: Fixed-Payload Poisoning (FPP) and Self-Mutating Poisoning (SMP). The benchmark includes 879 attack samples across 71 skills, organized under a 12-category risk taxonomy targeting data pipelines, system environments, and agent autonomy. Experiments show current agents remain highly vulnerable, with attack success rates up to 86.3% (FPP) and 69.3% (SMP). An automated construction pipeline called AutoSkillHarm, driven by coding agents, was used to generate the benchmark at scale.

Evaluation and Benchmarking · AI Safety Research · Self-Mutating Poisoning · Fixed-Payload Poisoning · skill-based attacks · +3 more

5 · arXiv · cs.AI · 11d ago

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Role-Agent is a new framework that uses a single LLM simultaneously as both agent and environment, enabling self-bootstrapped co-evolution without external environment feedback. The system has two components: World-In-Agent (WIA), which uses the alignment between predicted and actual states as a process reward, and Agent-In-World (AIW), which reshapes training data by retrieving tasks with similar failure patterns. Experiments across multiple benchmarks show an average performance gain of more than 4% over strong baselines. The approach addresses two key limitations in LLM agent training: inefficient feedback and static environments.

Agent and Tool Ecosystem · Alignment and RLHF · Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution · World-In-Agent

8 · The Batch · 19d ago

Meta Introduces Muse Spark: First Closed-Weights Model from Superintelligence Labs

Meta released Muse Spark, its first AI model in roughly a year and the debut product of its Superintelligence Labs, marking a significant departure from its open-weights Llama strategy. The natively multimodal reasoning model supports tool use and multi-agent orchestration, achieves fourth place on the Artificial Analysis Intelligence Index, and claims notable token efficiency, with Meta stating that it matches Llama 4 Maverick using over 10x less training compute. Meta withheld parameter count, architecture, and training details, positioning Muse Spark as a closed commercial product competing with OpenAI, Google, and Anthropic. The release introduces 'thought compression' via RL and a parallel multi-agent 'contemplating' mode, while showing gaps in coding and agentic benchmarks.

Frontier Model Releases · Open Weights Progress · Scale AI · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +18 more