5 · arXiv · cs.CL (Computation and Language) · 2d ago

SkillComposer: Structured skill composition for LLM agents via constrained autoregressive decoding

A new arXiv preprint introduces SkillComposer, a method that frames skill selection for LLM agents as a structured prediction problem: a constrained autoregressive decoder over skill identifiers jointly decides which skills to activate, how many, and in what order. The approach targets a bottleneck in growing skill libraries, where existing retrieval and full-context methods fail to capture the joint nature of skill composition. Evaluated on SkillsBench with two production-grade coding agents (GPT-5.2-Codex and Gemini-3-Pro-Preview), SkillComposer raises pass rates by 23.1 and 18.2 percentage points over no-skill baselines, matching gold-skill retrieval upper bounds at lower prompt-token cost.
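
The joint "which, how many, in what order" decision can be illustrated with a toy decoder. This is a minimal sketch under stated assumptions, not the preprint's implementation: the skill names, the `toy_score` function, and the `<stop>` token are all hypothetical, and greedy search stands in for whatever decoding strategy the authors use. The constraint is that each step may only emit an unused skill identifier from the library or the stop token.

```python
STOP = "<stop>"  # hypothetical end-of-sequence token


def decode_skills(library, score, max_len=4):
    """Greedy constrained decoding over skill identifiers.

    `score(prefix, token)` is a stand-in for a learned decoder's logit
    for `token` given the skills already chosen in `prefix`.
    """
    prefix = []
    while len(prefix) < max_len:
        # Constraint: only unused library skills or STOP are legal tokens.
        legal = [s for s in library if s not in prefix] + [STOP]
        best = max(legal, key=lambda tok: score(prefix, tok))
        if best == STOP:
            break  # the decoder decides how many skills to activate
        prefix.append(best)
    return prefix


def toy_score(prefix, token):
    """Hypothetical scores; plotting becomes more useful once data is parsed."""
    base = {"parse_csv": 2.0, "plot_chart": 1.0, "send_email": -1.0, STOP: 0.5}
    bonus = 1.5 if token == "plot_chart" and "parse_csv" in prefix else 0.0
    return base[token] + bonus


plan = decode_skills(["parse_csv", "plot_chart", "send_email"], toy_score)
print(plan)
```

Because the score depends on the prefix, the choice, count, and order of skills come out of one decoding pass rather than from independent per-skill retrieval.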

Evaluation and Benchmarking · Agent and Tool Ecosystem · GPT-5.3-Codex · Generative Skill Composition for LLM Agents · SkillComposer · Gemini-3.1-Pro · SkillsBench

Related guides (2)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

6 · arXiv · cs.CL · 16d ago

SkillWeaver: Compositional Skill Routing for LLM Agents via Decompose-Retrieve-Compose

Researchers introduce SkillWeaver, a framework for compositional skill routing in LLM agents that decomposes complex queries into atomic sub-tasks, retrieves matching skills from a large library, and composes an executable DAG plan. The paper formalizes the Compositional Skill Routing problem and introduces CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills across 24 categories. A key finding is that task decomposition quality is the primary bottleneck, with standard LLM decomposition reaching only 34.2% category recall; the proposed Iterative Skill-Aware Decomposition (SAD) method improves decomposition accuracy from 51.0% to 67.7% in a single iteration. The framework also reduces context window consumption by over 99% compared to naive skill-stuffing approaches.
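
The decompose-retrieve-compose flow described above can be sketched as follows. This is an illustrative toy, not SkillWeaver's code: the sub-tasks, the two-entry skill library, and the word-overlap `retrieve` function are all hypothetical, and decomposition is assumed to have already produced the sub-tasks and their dependencies. Only the ordering step uses a real library call, `graphlib.TopologicalSorter` from the Python standard library.

```python
from graphlib import TopologicalSorter


def retrieve(subtask, library):
    """Pick the skill whose description shares the most words with the sub-task."""
    words = set(subtask.lower().split())
    return max(library, key=lambda name: len(words & set(library[name].lower().split())))


def compose_plan(subtasks, deps, library):
    """Return (sub-task id, skill name) pairs in dependency order."""
    order = TopologicalSorter(deps).static_order()  # predecessors come first
    return [(task_id, retrieve(subtasks[task_id], library)) for task_id in order]


# Hypothetical decomposition of "fetch the forecast, then translate it".
subtasks = {"t1": "fetch weather forecast", "t2": "translate text to French"}
deps = {"t2": {"t1"}}  # t2 depends on t1
library = {
    "weather_api": "fetch current weather forecast for a city",
    "translator": "translate text between languages",
}

plan = compose_plan(subtasks, deps, library)
print(plan)
```

Only the retrieved skills for the plan's nodes need to enter the prompt, which is the intuition behind the context savings over placing the whole library in context.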

Evaluation and Benchmarking · Agent and Tool Ecosystem · CompSkillBench · MCP · SkillWeaver

5 · arXiv · cs.AI · 1mo ago

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

SkillGenBench is a new benchmark designed to evaluate the ability of LLM agents to generate correct, reusable, and executable skills from raw repositories and documents, rather than merely using pre-provided skills. It covers two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), with standardized execution-based evaluation protocols. Experiments across multiple skill-generation methods reveal substantial performance variation and distinct failure modes depending on source type. The benchmark aims to establish skill generation as an independent research problem within agent systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · task-conditioned generation · task-agnostic generation · SkillGenBench

7 · arXiv · cs.AI · 1mo ago

SkillOpt: Systematic Text-Space Optimizer for Self-Evolving Agent Skills

SkillOpt introduces a principled optimization framework for agent skills, treating the skill document as an external trainable state analogous to model weights. A separate optimizer model converts scored rollouts into bounded edits (add/delete/replace) on a skill document, accepting only edits that improve held-out validation scores. Evaluated across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt achieves best or tied performance on all 52 evaluated cells, lifting GPT-5.5 no-skill accuracy by up to +24.8 points inside the Codex agentic loop. Optimized skill artifacts also transfer across model scales and execution environments without further optimization.
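
The accept-only-if-validation-improves loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SkillOpt's optimizer: the skill document is modeled as a list of lines, the proposed edits are hard-coded rather than produced by an optimizer model from scored rollouts, and `toy_validate` is a hypothetical stand-in for a held-out validation score.

```python
def apply_edit(lines, op, index, text=""):
    """Apply one bounded edit (add/delete/replace) to a copy of the skill document."""
    new = list(lines)
    if op == "add":
        new.insert(index, text)
    elif op == "delete":
        del new[index]
    elif op == "replace":
        new[index] = text
    else:
        raise ValueError(f"unknown edit op: {op}")
    return new


def optimize_skill(skill, proposals, validate):
    """Keep an edit only if it strictly improves the validation score."""
    best_score = validate(skill)
    for op, index, text in proposals:
        candidate = apply_edit(skill, op, index, text)
        score = validate(candidate)
        if score > best_score:
            skill, best_score = candidate, score
    return skill, best_score


def toy_validate(lines):
    """Hypothetical score: reward mentioning key steps, penalize length."""
    body = " ".join(lines)
    return 10 * sum(kw in body for kw in ("test", "lint")) - len(lines)


proposals = [
    ("add", 1, "run the test suite"),            # accepted: score -1 -> 8
    ("add", 2, "celebrate"),                     # rejected: 7 is not above 8
    ("replace", 0, "run lint, then the build"),  # accepted: score 8 -> 18
]
skill, score = optimize_skill(["run the build"], proposals, toy_validate)
print(skill, score)
```

Rejecting edits that do not raise the validation score keeps the skill document from drifting, much as a validation check guards against overfitting when training weights.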

Evaluation and Benchmarking · Agent and Tool Ecosystem · TextGrad · SkillOpt · Trace2Skill

5 · arXiv · cs.CL · 22d ago

SKIM: Adaptive soft-token compression for procedural skills in LLM workflows

Researchers introduce SKIM (SKIll coMpression), a multi-resolution soft token compression framework targeting procedural knowledge (skills/workflows) rather than factual documents. SKIM compresses reusable natural language skills to 30–60% of their original token length while preserving task performance, reducing prefill cost and latency when skills are repeatedly invoked. The method adapts compression depth to skill complexity and supports offline compression for frequently updated community skills.

Inference Economics · Agent and Tool Ecosystem · Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models · SKIM (SKIll coMpression)

5 · arXiv · cs.CL · 1mo ago

MUSE-Autoskill: Self-Evolving LLM Agents via Skill Lifecycle Management

MUSE-Autoskill introduces a skill-centric agent framework where LLM agents continuously create, store, manage, evaluate, and refine reusable skills across tasks. The system adds skill-level memory that accumulates per-skill experience over time, enabling more effective reuse and cross-agent transfer. Experiments on SkillsBench show improvements in task success, efficiency, and reuse compared to static skill approaches.

Evaluation and Benchmarking · Agent and Tool Ecosystem · skill-level memory · MUSE-Autoskill · large language model agents

5 · arXiv · cs.AI · 15h ago

AgenticSTS: Bounded-memory testbed for studying long-horizon LLM agent decisions in Slay the Spire 2

Researchers introduce AgenticSTS, a testbed for studying long-horizon LLM agents under a bounded-memory contract where each decision is assembled from typed retrieval rather than appending a raw cross-decision transcript. The system is instantiated in Slay the Spire 2, a stochastic deck-building game requiring hundreds of sequential decisions, chosen because frontier LLMs currently win zero games at the lowest difficulty against a 16% human baseline. Ablation experiments show enabling a strategic skill layer improves win rate from 3/10 to 6/10, though sample sizes are too small for statistical significance. The authors release 298 completed trajectories, memory snapshots, and analysis scripts as a reusable methodology for isolating how explicit memory layers affect agent performance.
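
To see why a jump from 3/10 to 6/10 wins falls short of statistical significance, one option is a two-sided Fisher exact test on the 2x2 table of wins and losses. The choice of test is an assumption made here for illustration and is not necessarily the analysis the authors ran. The sketch uses only `math.comb` from the standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalized hypergeometric probability of k successes in row 1.
        return comb(row1, k) * comb(row2, col1 - k)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(row1 + row2, col1)


# Rows: skill layer off (3 wins, 7 losses) vs. on (6 wins, 4 losses).
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p_value, 3))
```

Under these assumptions the p-value comes out at roughly 0.37, far above the conventional 0.05 threshold, which is consistent with the caveat about sample size.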

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgenticSTS · Slay the Spire 2

5 · arXiv · cs.CL · 2d ago

Scalable behaviour cloning for browser agents via skill distillation from human interaction traces

A new arXiv preprint proposes converting human browser interaction trajectories into compact natural-language skills that agents can retrieve and compose, arguing that the bottleneck for browser agents is decision-making under incomplete information rather than low-level operations. The approach organizes distilled skills into a skill graph to enable consolidation rather than unbounded accumulation. The work positions collective human browsing behavior as a scalable, under-exploited source of reusable agent priors, potentially reducing reliance on manually designed task demonstrations.

Evaluation and Benchmarking · Agent and Tool Ecosystem · einsia.ai · Scalable Behaviour Cloning on Browser Using via Skill Distillation

6 · arXiv · cs.AI · 1mo ago

Systematic Study of Model-Generated Agent Skills Across the Full Skill Lifecycle

This paper presents a utility-grounded evaluation framework for model-generated agent skills, covering the full lifecycle of experience generation, skill extraction, and skill consumption across five agentic task domains. The authors find that while such skills are beneficial on average, they exhibit non-trivial negative transfer, and that skill utility is independent of model scale or baseline task strength. A key finding is that strong extractors are not necessarily strong consumers and vice versa. The work culminates in a 'meta-skill' that guides extraction toward utility-correlated features, consistently improving skill quality and reducing negative transfer.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Model-Generated Agent Skills (paper) · skill extraction · meta-skill