4 · arXiv cs.AI (Artificial Intelligence) · 17h ago

LoRA mixture-of-experts variants for continual learning in motion-language agents

A new arXiv preprint investigates continual learning for bidirectional motion-language agents, which must both understand and generate human motion without catastrophic forgetting. The authors propose LoRA-based mixture-of-experts architectures in which an autoencoder-based router selects the task-specific expert at inference time, so no task labels are required. Evaluated on a five-task benchmark derived from HumanML3D, the approach achieves near-zero forgetting in both the motion-to-text and text-to-motion directions. Two findings stand out: hard expert selection outperforms soft blending, and token-level accuracy can diverge from downstream generation quality.
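The routing idea in the summary above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the preprint's implementation: it assumes each task has its own small autoencoder and that the router picks the expert whose autoencoder reconstructs the input with the lowest error, which is one common way to route without task labels. Every name, shape, and number here (`recon_error`, `hard_select`, `soft_weights`, `forward`, the rank-1 adapters) is hypothetical.

```python
import math

def recon_error(x, mu, w):
    """Squared reconstruction error of x under a toy 1-D tied-weight linear
    autoencoder with center mu and unit-length direction w (hypothetical)."""
    centered = [xi - mi for xi, mi in zip(x, mu)]
    z = sum(ci * wi for ci, wi in zip(centered, w))   # encode to one scalar
    recon = [z * wi for wi in w]                      # decode back
    return sum((ci - ri) ** 2 for ci, ri in zip(centered, recon))

def hard_select(x, routers):
    """Hard routing: index of the autoencoder with the lowest error."""
    errors = [recon_error(x, mu, w) for mu, w in routers]
    return min(range(len(errors)), key=errors.__getitem__)

def soft_weights(x, routers, temperature=1.0):
    """Soft routing: softmax over negative reconstruction errors."""
    logits = [-recon_error(x, mu, w) / temperature for mu, w in routers]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def forward(x, W, experts, routers, hard=True):
    """Frozen base layer W plus a LoRA update B @ (A @ x) per expert."""
    base = matvec(W, x)
    if hard:
        A, B = experts[hard_select(x, routers)]
        delta = matvec(B, matvec(A, x))
    else:
        delta = [0.0] * len(base)
        for wk, (A, B) in zip(soft_weights(x, routers), experts):
            d = matvec(B, matvec(A, x))
            delta = [di + wk * dk for di, dk in zip(delta, d)]
    return [b + d for b, d in zip(base, delta)]

# Toy setup: two tasks, 2-D features, rank-1 adapters (all values made up).
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),   # expert 0: A is 1x2, B is 2x1
    ([[0.0, 1.0]], [[0.0], [1.0]]),   # expert 1
]
routers = [
    ([0.0, 0.0], [1.0, 0.0]),         # task 0 inputs lie near the first axis
    ([0.0, 0.0], [0.0, 1.0]),         # task 1 inputs lie near the second axis
]
x = [2.0, 0.1]

chosen = hard_select(x, routers)
hard_out = forward(x, W, experts, routers, hard=True)
soft_out = forward(x, W, experts, routers, hard=False)
```

With hard selection only the chosen adapter contributes. With soft blending every adapter contributes in proportion to its weight, so adapters trained for other tasks can leak into the output, which is one plausible reading of why the summary reports hard selection performing better.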

Evaluation and Benchmarking · Agent and Tool Ecosystem · Towards Continual Motion-Language Agents: LoRA Variants for Incremental Motion Understanding and Generation · LoRA · HumanML3D

Related guides (3)

LoRA · Concept

LoRA: How to Teach a Giant AI New Tricks Without Rebuilding It

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · The Batch · 1mo ago

Sony and University Researchers Train Robots To Learn Without Catastrophic Forgetting

Researchers from UT Austin, UCLA, Nanyang Technological University, and Sony developed a sequential fine-tuning recipe that combines LoRA with on-policy reinforcement learning (GRPO) to reduce catastrophic forgetting in vision-language-action (VLA) models for robotics. Applied to the OpenVLA-OFT model on the LIBERO benchmark, the method achieved 81.2% success on libero-spatial tasks with near-zero forgetting (a 0.3-percentage-point drop), outperforming established continual learning baselines including Dark Experience Replay and Elastic Weight Consolidation. The approach requires no replay of prior task data and also showed modest generalization to unseen tasks. The authors note that the method has not yet been tested outside robotics simulation.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Elastic Weight Consolidation · Dark Experience Replay · University of California Los Angeles

4 · arXiv · cs.LG · 22d ago

SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs

Researchers introduce SETA (Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning), a framework that addresses catastrophic forgetting in LLMs through adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both the weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.

Evaluation and Benchmarking · Open Weights Progress · LLaMA-7B · Qwen3-4B · Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

5 · arXiv · cs.AI · 7d ago

RECALL: Active continual learning for Vision-Language-Action models via uncertainty-guided recovery data collection

Researchers propose RECALL, an active continual learning paradigm for Vision-Language-Action (VLA) robot models that uses uncertainty-guided data collection to target states where the policy struggles, rather than passively collecting demonstrations after failures. The paper demonstrates improved fine-tuning efficiency over passive imitation learning but identifies catastrophic forgetting as a key challenge when incorporating recovery data. The authors evaluate continual learning mitigations including replay-based data mixing and elastic weight consolidation, characterizing tradeoffs between plasticity and retention in large autoregressive robot policies.

Agent and Tool Ecosystem · Elastic Weight Consolidation · RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models

5 · arXiv · cs.CL · 26d ago

Expert-aware causal tracing of factual recall in sparse MoE language models

A new arXiv preprint extends causal tracing methodology to sparse mixture-of-experts (MoE) language models, asking which routed experts mediate factual recall rather than just which layers or feed-forward modules. Using CounterFact facts, the authors apply noise-corruption and clean-patch interventions to Qwen3-30B-A3B-Base and Mixtral-8x7B-v0.1, finding that expert-level localization is possible in the former (a single expert at layer 44) but requires multi-expert coalition recovery in the latter. The results indicate that factual localization in MoE models is model- and protocol-dependent rather than universal.

Evaluation and Benchmarking · AI Safety Research · Qwen3-30B-A3B-Base · Mixtral-8x7B-v0.1 · Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents

6 · arXiv · cs.CL · 14d ago

Expert Tying reduces MoE LLM memory footprint by ~2x with minimal quality loss

Researchers introduce Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers while keeping routing and attention layer-independent. Evaluated on OLMoE, Qwen3, and DeepSeek-style MoE architectures, the method achieves nearly 2x memory reduction with negligible perplexity or downstream quality degradation. The approach exploits parameter redundancy in MoE pathways to improve the compute-to-memory trade-off for training and inference.
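The memory claim above can be sanity-checked with simple parameter counting. This is a rough sketch under stated assumptions, not the paper's accounting: it assumes experts are tied in groups of two consecutive layers, and every size below (layer count, experts per layer, parameters per expert, per-layer router and attention parameters) is a made-up illustrative number. The function name `moe_param_count` is hypothetical.

```python
def moe_param_count(num_layers, experts_per_layer, params_per_expert,
                    other_params_per_layer, tie_group=1):
    """Count parameters when expert weights are shared across groups of
    `tie_group` consecutive layers. Router and attention parameters
    (other_params_per_layer) stay separate for every layer."""
    if num_layers % tie_group != 0:
        raise ValueError("num_layers must be divisible by tie_group")
    expert_sets = num_layers // tie_group          # distinct expert banks
    expert_params = expert_sets * experts_per_layer * params_per_expert
    other_params = num_layers * other_params_per_layer
    return expert_params + other_params

# Hypothetical model: 16 layers, 64 experts per layer, 1M parameters per
# expert, 2M router-plus-attention parameters per layer.
untied = moe_param_count(16, 64, 1_000_000, 2_000_000)
tied = moe_param_count(16, 64, 1_000_000, 2_000_000, tie_group=2)
ratio = untied / tied
```

Because the untied router and attention parameters are not shared, the ratio stays slightly below 2 in this toy setup, which is consistent with the "nearly 2x" wording when expert weights dominate the total.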

Training Infrastructure · Frontier Model Releases · DeepSeek V4 · Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models · Expert Tying

5 · arXiv · cs.LG · 27d ago

Sleep paradigm for LLMs enables continual learning and memory consolidation via distillation and RL

A new arXiv preprint proposes a 'Sleep' paradigm for language models that enables continual learning by consolidating short-term in-context memories into long-term parameters. The framework has two stages: Knowledge Seeding (distilling a smaller model's memories into a larger network via on-policy distillation combined with RL-based imitation learning) and Dreaming (self-improvement via RL-generated synthetic curricula without human supervision). Experiments cover long-horizon tasks, continual learning, knowledge incorporation, and few-shot generalization, addressing a known weakness of current LLMs in retaining temporal knowledge across contexts.

Alignment and RLHF · Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories · Knowledge Seeding · Generalized Distillation

6 · arXiv · cs.CL · 28d ago

AgentCL: A Rigorous Evaluation Framework for Continual Learning in Language Agents

AgentCL is a new benchmark and evaluation framework designed to rigorously assess continual learning in language agents, addressing gaps in existing benchmarks that focus on retrieval over long-context documents or use naive task streams with limited cross-task analysis. The framework constructs compositional task streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, contrasting them with naive streams to measure transfer gains. The authors also introduce MemProbe, a probing method that stores interactions, insights, and skills while filtering unreliable experiences during consolidation. Empirical results across coding, deep research, and language understanding tasks show that controlled streams better distinguish memory design quality, and that naive streams can mask memory-induced degradation.

Long Context Evolution · Evaluation and Benchmarking · AgentCL · MemProbe · Continual Learning