5arXiv cs.AI (Artificial Intelligence)·43h ago

RECALL: Active continual learning for Vision-Language-Action models via uncertainty-guided recovery data collection

Researchers propose RECALL, an active continual learning paradigm for Vision-Language-Action (VLA) robot models that uses uncertainty-guided data collection to target states where the policy struggles, rather than passively collecting demonstrations after failures. The paper demonstrates improved fine-tuning efficiency over passive imitation learning but identifies catastrophic forgetting as a key challenge when incorporating recovery data. The authors evaluate continual learning mitigations including replay-based data mixing and elastic weight consolidation, characterizing tradeoffs between plasticity and retention in large autoregressive robot policies.

Agent and Tool Ecosystem Elastic Weight Consolidation RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models

Related guides (1)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

5The Batch·1mo ago·source ↗

Sony and University Researchers Train Robots To Learn Without Catastrophic Forgetting

Researchers from UT Austin, UCLA, Nanyang Technological University, and Sony developed a sequential fine-tuning recipe combining LoRA and on-policy reinforcement learning (GRPO) to reduce catastrophic forgetting in vision-language-action (VLA) models for robotics. Applied to the OpenVLA-OFT model on the LIBERO benchmark, the method achieved 81.2% success on libero-spatial tasks with near-zero forgetting (0.3 percentage point drop), outperforming established continual learning baselines including Dark Experience Replay and Elastic Weight Consolidation. The approach requires no replay of prior task data and also showed modest generalization to unseen tasks. The authors note the method has not yet been tested outside robotics simulation contexts.

Evaluation and Benchmarking Agent and Tool Ecosystem Elastic Weight Consolidation Dark Experience Replay University of California Los Angeles +11 more

6arXiv · cs.LG·8d ago·source ↗

HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies

Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.

Alignment and RLHF Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes Hierarchical Advantage-Weighted Behavior Cloning

5arXiv · cs.LG·6d ago·source ↗

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking Multimodal Progress Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models Act2Answer

5arXiv · cs.LG·14d ago·source ↗

TREAD: VLM-based re-labelling framework improves robot policy generalization via dataset augmentation

TREAD (Task Robustness via Re-Labelling Vision-Action Robot Data) is a scalable framework that uses pretrained Vision-Language Models to augment existing robotics datasets without new data collection. The approach decomposes demonstrations into sub-tasks, segments videos accordingly, and generates linguistically diverse instruction labels, enriching language-action pair diversity. Evaluations on the LIBERO benchmark show improved generalization to novel tasks and goals, addressing a key limitation of current robot learning policies.

Agent and Tool Ecosystem Multimodal Progress TREAD LIBERO

6arXiv · cs.CL·12d ago·source ↗

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

Agent and Tool Ecosystem Multimodal Progress LabVLA LabUtopia Qwen3-4B-Instruct +3 more

6arXiv · cs.LG·19h ago·source ↗

InSight: Self-guided autonomous skill acquisition for vision-language-action models via primitive steerability

InSight is a framework enabling VLA models to autonomously acquire new manipulation skills beyond their training data by decomposing demonstrations into labeled primitive actions (e.g., 'move gripper to bowl', 'pour the bottle') and running a VLM-guided data flywheel that identifies missing primitives, attempts demonstrations, and integrates successful ones back into training. The system requires no human demonstrations of target skills and is evaluated on simulation and real-world tasks including block flipping, drawer closing, sweeping, and pouring. Learned primitives can be composed for novel long-horizon tasks, offering a practical path toward continual skill acquisition in robotic VLA policies.

Agent and Tool Ecosystem Multimodal Progress InSight InSight: Self-Guided Skill Acquisition via Steerable VLAs InSight: Self-Guided Skill Acquisition via Steerable VLAs

5arXiv · cs.LG·21d ago·source ↗

VLESA: Vision-Language Embodied Safety Agent for Real-Time Human Activity Monitoring

Researchers introduce VLESA, a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. The system addresses intent-dependent safety — where identical actions can be safe or dangerous depending on context — using a goal-conditioned safety Q-filter trained via GRPO and an intent-action prediction agent. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy than baselines, with the Q-filter improving action safety by over 41 percentage points through goal-conditioned constrained decoding.

AI Safety Research Multimodal Progress GRPO ASIMOV-2.0 VLESA +1 more

7arXiv · cs.CL·26d ago·source ↗

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases Evaluation and Benchmarking Qwen-VLA DOMINO R2R +10 more