5 · arXiv cs.AI (Artificial Intelligence) · 18h ago

Two-stage action prior pretraining improves cross-embodiment VLA robot manipulation

Researchers propose a two-stage training framework for Vision-Language-Action (VLA) models that pretrains the action module with motion priors before cross-modal alignment begins. Stage 1 uses a flow-matching-based encoder-decoder to learn temporal motion structure from unconditioned action trajectories alone; Stage 2 transfers this prior to VLA training via decoder reuse and latent distillation. Evaluated across 13 cross-embodiment tasks in simulation and real-world settings, the approach achieves faster convergence, higher success rates, and notably better performance in data-scarce real-world scenarios compared to VLA training without action priors.
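For readers unfamiliar with flow matching, here is a minimal sketch of how one training pair could be built from an action chunk alone. It assumes the common straight-line (rectified-flow style) path between Gaussian noise and the clean actions; the summary does not say which path or network the paper uses, and the names `flow_matching_pair` and `flow_matching_loss` are hypothetical.

```python
import random

def flow_matching_pair(actions, t, rng):
    """Build one (x_t, target_velocity) pair for a flattened action chunk.

    actions: list of floats, the clean chunk x1 (flattening is an assumption).
    t: interpolation time in [0, 1].
    """
    noise = [rng.gauss(0.0, 1.0) for _ in actions]                 # x0 ~ N(0, I)
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, actions)]  # point on the straight path
    target = [a - n for n, a in zip(noise, actions)]               # constant velocity x1 - x0
    return x_t, target

def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

rng = random.Random(0)
chunk = [0.1, -0.2, 0.05, 0.0]   # toy action chunk, not real robot data
x_t, v = flow_matching_pair(chunk, t=0.25, rng=rng)
```

A model trained this way sees only actions, with no image or language input, which matches the summary's description of Stage 1 as learning from unconditioned trajectories.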

Agent and Tool Ecosystem · Multimodal Progress · Learning Action Priors for Cross-embodiment Robot Manipulation · Vision-Language-Action model · Flow Matching

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

5 · arXiv · cs.AI · 20d ago

TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation

Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and real-world tasks show flexible bidirectional speed control. When coupled with a large multimodal model, the policy also adapts its speed dynamically, accelerating in low-risk transit phases and decelerating for high-risk contact stages.
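The idea of re-timing by merging or splitting actions can be illustrated with a small sketch. It assumes each action is a pure delta displacement (so consecutive steps can be summed or divided) and uses integer factors only; the summary does not describe how the paper handles gripper commands, absolute poses, or fractional speeds, and the function names are hypothetical.

```python
def merge_actions(deltas, factor):
    """Speed up: sum each run of `factor` consecutive delta actions into one step."""
    return [
        [sum(dim) for dim in zip(*deltas[i:i + factor])]
        for i in range(0, len(deltas), factor)
    ]

def split_actions(deltas, factor):
    """Slow down: replace each delta action with `factor` equal sub-steps."""
    return [[d / factor for d in step] for step in deltas for _ in range(factor)]

def total_displacement(deltas):
    """Net displacement per dimension; re-timing should leave this unchanged."""
    return [sum(dim) for dim in zip(*deltas)]

demo = [[1.0, 0.0], [1.0, 2.0], [0.0, 2.0]]   # toy 2-D delta actions
fast = merge_actions(demo, 2)                  # fewer, larger steps
slow = split_actions(demo, 2)                  # more, smaller steps
```

Both variants cover the same net path as the original demonstration; only the number of steps used to cover it changes.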

Agent and Tool Ecosystem · Multimodal Progress · Variable-Speed Trajectory Augmentation · TempoVLA · TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

6 · arXiv · cs.CL · 13d ago

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

Agent and Tool Ecosystem · Multimodal Progress · LabVLA · LabUtopia · Qwen3-4B-Instruct

7 · arXiv · cs.CL · 27d ago

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases · Evaluation and Benchmarking · Qwen-VLA · DOMINO · R2R

5 · arXiv · cs.AI · 18h ago

FORCE: Efficient RL fine-tuning for Vision-Language-Action models via value-calibrated warm-up and self-distillation

Researchers introduce FORCE, a 3-stage reinforcement learning fine-tuning framework for Vision-Language-Action (VLA) models that addresses sample inefficiency caused by unstable Q-functions and low-quality exploration data. The framework uses a Value-Calibrated Warm-Up phase followed by Q-function-filtered policy updates, eliminating the need for costly human interventions during training. Evaluated on simulation and real-world robotic tasks, FORCE achieves a 79% absolute improvement in task success rates, outperforms prior RL methods by 10%, and accelerates training by 32.5%.

Agent and Tool Ecosystem · Alignment and RLHF · FORCE · FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation

6 · arXiv · cs.LG · 9d ago

HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies

Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.
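As a rough illustration of advantage-weighted behavior cloning with two critic heads, the sketch below blends a viability advantage and an efficiency advantage through a gate value and uses the result to weight a cloning loss. The exponential, clipped weighting is borrowed from advantage-weighted regression and is an assumption, as are all names and default values; the summary does not give HABC's actual weighting rule, and the intervention-aware credit assignment is not modeled here.

```python
import math

def gated_advantage(adv_viability, adv_efficiency, gate):
    """Blend two critic heads; `gate` in [0, 1] is assumed to come from a
    learned, state-dependent network (here it is just a given number)."""
    return gate * adv_viability + (1.0 - gate) * adv_efficiency

def bc_weight(advantage, temperature=1.0, max_weight=20.0):
    """Exponential advantage weight, clipped for numerical stability."""
    return min(math.exp(advantage / temperature), max_weight)

def weighted_bc_loss(samples):
    """Weighted average of per-sample negative log-probabilities.

    samples: dicts with keys neg_log_prob, adv_viability, adv_efficiency, gate.
    """
    total, norm = 0.0, 0.0
    for s in samples:
        adv = gated_advantage(s["adv_viability"], s["adv_efficiency"], s["gate"])
        w = bc_weight(adv)
        total += w * s["neg_log_prob"]
        norm += w
    return total / norm

batch = [
    {"neg_log_prob": 1.0, "adv_viability": 0.0, "adv_efficiency": 0.0, "gate": 0.5},
    {"neg_log_prob": 3.0, "adv_viability": 0.0, "adv_efficiency": 0.0, "gate": 0.5},
]
loss = weighted_bc_loss(batch)
```

With equal advantages the loss reduces to a plain average; actions with higher blended advantage would pull the policy toward themselves more strongly.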

Alignment and RLHF · Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes · Hierarchical Advantage-Weighted Behavior Cloning

6 · arXiv · cs.LG · 27d ago

DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception

DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies including VLAs, with gains up to +22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.
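One way to picture simplex-volume minimization for three modalities: treat the image, language, and flow embeddings as three points on a unit sphere and measure the area of the triangle they form, which shrinks to zero as the three embeddings coincide. This is a geometric sketch only. Whether the paper uses this triangle area or a different volume (for example, of the parallelotope spanned by the vectors) is not stated in the summary, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangle_area(a, b, c):
    """Area of the 2-simplex with vertices a, b, c, from the Gram determinant of its edges."""
    u = [y - x for x, y in zip(a, b)]   # edge a -> b
    v = [y - x for x, y in zip(a, c)]   # edge a -> c
    gram_det = dot(u, u) * dot(v, v) - dot(u, v) ** 2
    return 0.5 * math.sqrt(max(gram_det, 0.0))   # clamp tiny negative rounding error

image_emb = normalize([1.0, 0.0, 0.0])   # toy embeddings, not model outputs
text_emb = normalize([0.0, 2.0, 0.0])
flow_emb = normalize([0.0, 0.0, 3.0])
misaligned = triangle_area(image_emb, text_emb, flow_emb)
aligned = triangle_area(image_emb, image_emb, image_emb)
```

Here mutually orthogonal embeddings give a large area and identical embeddings give zero, so minimizing the area pulls the three modalities together.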

Frontier Model Releases · Agent and Tool Ecosystem · Vision-Language-Action models · simplex-volume minimization · DynaFLIP

6 · arXiv · cs.LG · 42h ago

InSight: Self-guided autonomous skill acquisition for vision-language-action models via primitive steerability

InSight is a framework that lets VLA models autonomously acquire new manipulation skills beyond their training data. It decomposes demonstrations into labeled primitive actions (e.g., 'move gripper to bowl', 'pour the bottle') and runs a VLM-guided data flywheel that identifies missing primitives, attempts demonstrations, and integrates successful ones back into training. The system requires no human demonstrations of target skills and is evaluated on simulation and real-world tasks including block flipping, drawer closing, sweeping, and pouring. Learned primitives can be composed for novel long-horizon tasks, offering a practical path toward continual skill acquisition in robotic VLA policies.

Agent and Tool Ecosystem · Multimodal Progress · InSight · InSight: Self-Guided Skill Acquisition via Steerable VLAs

5 · arXiv · cs.LG · 7d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking · Multimodal Progress · Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models · Act2Answer