Entity · other

large language model agents

other · active · large-language-model-agents-1d7a9105 · 3 events · first seen May 21, 2026

Aliases: large language model agents

Co-occurring entities

skill-level memory · MUSE-Autoskill · SkillsBench · multi-round event injection · Claw-Anything · proactive assistance evaluation · GPT-5.5 · web navigation benchmark · Mem-π · decision-content decoupled reinforcement learning · episodic memory retrieval

More like this (12)

large language models · Large Language Models (frontier) · Multimodal Large Language Models · Understanding Large Language Models · Self-Compacting Language Model Agents · 1B-scale language models · Large Reasoning Models · AnyLanguageModel · tool-augmented language agents · generative language modeling · Reinforcement Learning for Language Models · Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

Recent events (3)

5 · arXiv · cs.CL · May 27, 2026

MUSE-Autoskill: Self-Evolving LLM Agents via Skill Lifecycle Management

MUSE-Autoskill introduces a skill-centric agent framework where LLM agents continuously create, store, manage, evaluate, and refine reusable skills across tasks. The system adds skill-level memory that accumulates per-skill experience over time, enabling more effective reuse and cross-agent transfer. Experiments on SkillsBench show improvements in task success, efficiency, and reuse compared to static skill approaches.

Evaluation and Benchmarking · Agent and Tool Ecosystem · skill-level memory · MUSE-Autoskill · large language model agents · +1 more

6 · arXiv · cs.AI · May 26, 2026

Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access

Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.

Long Context Evolution · Evaluation and Benchmarking · multi-round event injection · Claw-Anything · large language model agents · +3 more

6 · arXiv · cs.CL · May 21, 2026

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents · +3 more