Entity · benchmark

SkillsBench

benchmark · active · skillsbench-b1c6306e · 2 events · first seen May 27, 2026

Aliases: SkillsBench

Co-occurring entities

GPT-5.3-Codex · Generative Skill Composition for LLM Agents · SkillComposer · Gemini-3.1-Pro · skill-level memory · MUSE-Autoskill · large language model agents

More like this (12)

CompSkillBench SorryBench SkillGenBench CursorBench TriggerBench SelectBench ToolBench-X LiveBench JobBench EdgeBench ProgramBench TokenBench

Recent events (2)

5 · arXiv · cs.CL · Jul 1, 2026

SkillComposer: Structured skill composition for LLM agents via constrained autoregressive decoding

A new arXiv preprint introduces SkillComposer, a method that frames skill selection for LLM agents as a structured prediction problem: a constrained autoregressive decoder over skill identifiers jointly decides which skills to activate, how many, and in what order. The approach targets a bottleneck in growing skill libraries, where existing retrieval and full-context methods fail to capture the joint nature of skill composition. Evaluated on SkillsBench with two production-grade coding agents (GPT-5.2-Codex and Gemini-3-Pro-Preview), SkillComposer raises pass rates by 23.1 and 18.2 percentage points over no-skill baselines, matching gold-skill retrieval upper bounds at lower prompt-token cost.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GPT-5.3-Codex · Generative Skill Composition for LLM Agents · SkillComposer
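
The SkillComposer summary above describes decoding a sequence of skill identifiers under constraints, so that which skills, how many, and in what order are decided together. The following is a minimal sketch of that general idea only, not the preprint's implementation: it uses greedy decoding, and the `STOP` symbol, the `score` callback, the `max_skills` cap, and the toy skill names are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

STOP = "<stop>"  # hypothetical end-of-sequence symbol, not from the preprint


def compose_skills(
    skill_ids: Sequence[str],
    score: Callable[[List[str], str], float],
    max_skills: int = 4,
) -> List[str]:
    """Greedily decode an ordered skill list under simple constraints.

    At each step the only valid next symbols are library skills that have
    not been chosen yet, plus STOP. Choosing STOP ends decoding, so the
    number of skills is decided by the decoder rather than fixed up front.
    """
    chosen: List[str] = []
    while len(chosen) < max_skills:
        # Constraint mask: unused skill identifiers or STOP, nothing else.
        candidates = [s for s in skill_ids if s not in chosen] + [STOP]
        best = max(candidates, key=lambda c: score(chosen, c))
        if best == STOP:
            break
        chosen.append(best)
    return chosen


def toy_score(prefix: List[str], candidate: str) -> float:
    """Stand-in for a learned model: prefers a fixed two-skill plan."""
    wanted = ["parse_logs", "plot_metrics"]  # hypothetical skill names
    if len(prefix) < len(wanted):
        return 1.0 if candidate == wanted[len(prefix)] else 0.0
    return 1.0 if candidate == STOP else 0.0


library = ["plot_metrics", "parse_logs", "send_email"]
plan = compose_skills(library, toy_score)
print(plan)
```

In a real system the `score` callback would be a model conditioned on the task and the prefix decoded so far. Because each choice depends on the prefix, selection, count, and order are predicted jointly rather than by independent per-skill retrieval.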

5 · arXiv · cs.CL · May 27, 2026

MUSE-Autoskill: Self-Evolving LLM Agents via Skill Lifecycle Management

MUSE-Autoskill introduces a skill-centric agent framework where LLM agents continuously create, store, manage, evaluate, and refine reusable skills across tasks. The system adds skill-level memory that accumulates per-skill experience over time, enabling more effective reuse and cross-agent transfer. Experiments on SkillsBench show improvements in task success, efficiency, and reuse compared to static skill approaches.

Evaluation and Benchmarking · Agent and Tool Ecosystem · skill-level memory · MUSE-Autoskill · large language model agents
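
The MUSE-Autoskill summary above mentions skill-level memory that accumulates per-skill experience over time. One way to picture such a structure is sketched below. This is an illustrative assumption, not the framework's actual schema or API: the `SkillMemory` and `SkillStore` classes, their fields, and the refinement thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SkillMemory:
    """Hypothetical per-skill experience record."""
    uses: int = 0
    successes: int = 0
    notes: List[str] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        # Skills that were never used report 0.0 rather than dividing by zero.
        return self.successes / self.uses if self.uses else 0.0


class SkillStore:
    """Accumulates experience per skill across tasks (illustrative only)."""

    def __init__(self) -> None:
        self._memory: Dict[str, SkillMemory] = {}

    def record(self, skill: str, success: bool, note: str = "") -> None:
        """Log one use of a skill and whether the task succeeded."""
        mem = self._memory.setdefault(skill, SkillMemory())
        mem.uses += 1
        mem.successes += int(success)
        if note:
            mem.notes.append(note)

    def needs_refinement(self, min_uses: int = 3, threshold: float = 0.5) -> List[str]:
        """Return skills with enough history and a low success rate, sorted by name."""
        return sorted(
            name
            for name, mem in self._memory.items()
            if mem.uses >= min_uses and mem.success_rate < threshold
        )


store = SkillStore()
for ok in (True, False, False):
    store.record("run_tests", ok, note="flaky on large repos")
store.record("format_code", True)
print(store.needs_refinement())
```

Keeping the record per skill, rather than per task, is what lets an agent decide which skills to keep reusing and which to revise. The thresholds here are arbitrary placeholders.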