large language model agents
large-language-model-agents-1d7a9105 · 3 events · first seen 27d ago
Aliases: large language model agents
Recent events (3)
Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access
Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.
Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL
Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.
MUSE-Autoskill: Self-Evolving LLM Agents via Skill Lifecycle Management
MUSE-Autoskill introduces a skill-centric agent framework where LLM agents continuously create, store, manage, evaluate, and refine reusable skills across tasks. The system adds skill-level memory that accumulates per-skill experience over time, enabling more effective reuse and cross-agent transfer. Experiments on SkillsBench show improvements in task success, efficiency, and reuse compared to static skill approaches.
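The skill lifecycle described for MUSE-Autoskill (create, store, evaluate, refine, with experience accumulated per skill) can be illustrated with a small sketch. This is not the framework's actual implementation: the `Skill` and `SkillLibrary` classes, their method names, the 0.5 prior for untried skills, and the refinement thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Skill:
    """A reusable skill plus the experience gathered while using it (hypothetical)."""
    name: str
    description: str
    successes: int = 0
    failures: int = 0
    notes: list[str] = field(default_factory=list)  # per-skill memory

    @property
    def attempts(self) -> int:
        return self.successes + self.failures

    @property
    def success_rate(self) -> float:
        # Untried skills get a neutral 0.5 prior; this value is an assumption.
        return self.successes / self.attempts if self.attempts else 0.5


class SkillLibrary:
    """Stores skills and tracks their outcomes across tasks (hypothetical)."""

    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def create(self, name: str, description: str) -> Skill:
        skill = Skill(name, description)
        self._skills[name] = skill
        return skill

    def get(self, name: str) -> Skill:
        return self._skills[name]

    def record(self, name: str, success: bool, note: str = "") -> None:
        """Evaluate step: log one outcome and an optional lesson learned."""
        skill = self._skills[name]
        if success:
            skill.successes += 1
        else:
            skill.failures += 1
        if note:
            skill.notes.append(note)

    def best(self) -> Optional[Skill]:
        """Reuse step: return the skill with the highest observed success rate."""
        if not self._skills:
            return None
        return max(self._skills.values(), key=lambda s: s.success_rate)

    def needs_refinement(self, min_attempts: int = 3,
                         threshold: float = 0.5) -> list[str]:
        """Refine step: flag skills with enough evidence and a low success rate."""
        return [s.name for s in self._skills.values()
                if s.attempts >= min_attempts and s.success_rate < threshold]


lib = SkillLibrary()
lib.create("fill_form", "Fill in and submit a web form")
lib.create("parse_csv", "Load a CSV file into rows")
for ok in (False, False, True):
    lib.record("fill_form", ok, note="submit button sometimes loads late")
lib.record("parse_csv", True)
print(lib.needs_refinement())
```

In this sketch the notes attached to each skill stand in for the skill-level memory mentioned in the summary; a real system would presumably feed them back into the step that rewrites a flagged skill.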