6 · arXiv cs.AI (Artificial Intelligence) · 21h ago

Task-Agnostic Pretraining (TAP) decouples motor learning from language grounding in VLA models

Researchers propose Task-Agnostic Pretraining (TAP), a two-stage framework for Vision-Language-Action (VLA) models that separates physical motor skill acquisition from semantic language alignment. The first stage learns motor priors from cheap, unlabeled interaction data via a self-supervised inverse dynamics objective; the second stage grounds these priors in language using a minimal set of expert demonstrations. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data. On a real-world WidowX robot, it retains 25% success under camera perturbations, where internet-scale baselines collapse to 0%.
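To make the stage-one objective concrete, here is a minimal sketch of an inverse dynamics loss: the model sees a state and the state that followed it, and is trained to recover the action that was executed in between. No language or task labels are involved. The toy 1-D system (next state = state + action), the linear model, and every function name below are hypothetical illustrations, not TAP's architecture or data.

```python
import random

def make_transitions(n, seed=0):
    """Toy interaction data: (state, action, next_state) with s' = s + a."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-0.5, 0.5)
        data.append((s, a, s + a))
    return data

def predict_action(w, s, s_next):
    """Linear inverse dynamics model: a_hat = w[0]*s + w[1]*s_next."""
    return w[0] * s + w[1] * s_next

def inverse_dynamics_loss(w, data):
    """Mean squared error between the predicted and the executed action."""
    return sum((predict_action(w, s, sn) - a) ** 2 for s, a, sn in data) / len(data)

def fit_inverse_dynamics(data, lr=0.1, epochs=200):
    """Plain stochastic gradient descent on the squared action error."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for s, a, sn in data:
            err = predict_action(w, s, sn) - a
            w[0] -= lr * err * s
            w[1] -= lr * err * sn
    return w

data = make_transitions(200)
w = fit_inverse_dynamics(data)
# In this toy system the exact solution is a = s_next - s, i.e. w = [-1, 1].
```

The supervision signal here is the robot's own logged action, which is why such data can be collected without expert demonstrations or instruction labels.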

Multimodal Progress SIMPLER Inverse Dynamics WidowX Task-Agnostic Pretraining (TAP)

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

5 · arXiv · cs.AI · Jun 25, 2026

Two-stage action prior pretraining improves cross-embodiment VLA robot manipulation

Researchers propose a two-stage training framework for Vision-Language-Action (VLA) models that pretrains the action module with motion priors before cross-modal alignment begins. Stage 1 uses a flow-matching-based encoder-decoder to learn temporal motion structure from unconditioned action trajectories alone; Stage 2 transfers this prior to VLA training via decoder reuse and latent distillation. Evaluated across 13 cross-embodiment tasks in simulation and real-world settings, the approach achieves faster convergence, higher success rates, and notably better performance in data-scarce real-world scenarios compared to VLA training without action priors.

Agent and Tool Ecosystem Multimodal Progress Learning Action Priors for Cross-embodiment Robot Manipulation Vision-Language-Action model Flow Matching

5 · arXiv · cs.LG · Jun 10, 2026

TREAD: VLM-based re-labelling framework improves robot policy generalization via dataset augmentation

TREAD (Task Robustness via Re-Labelling Vision-Action Robot Data) is a scalable framework that uses pretrained Vision-Language Models to augment existing robotics datasets without new data collection. The approach decomposes demonstrations into sub-tasks, segments videos accordingly, and generates linguistically diverse instruction labels, enriching language-action pair diversity. Evaluations on the LIBERO benchmark show improved generalization to novel tasks and goals, addressing a key limitation of current robot learning policies.

Agent and Tool Ecosystem Multimodal Progress TREAD LIBERO

5 · arXiv · cs.AI · Jun 5, 2026

TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation

Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and real-world tasks show flexible bidirectional speed control. Coupling the policy with a large multimodal model adds dynamic adaptation: accelerating in low-risk transit phases and decelerating for high-risk contact stages.
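The merge-or-split re-timing can be sketched as follows, assuming actions are additive per-step displacements (for example, end-effector deltas), so consecutive steps can be summed or divided evenly. The function names and the integer speed factor are hypothetical; the summary does not specify how VSTA actually merges or splits actions.

```python
def speed_up(actions, factor):
    """Merge every `factor` consecutive delta actions into one by summing them."""
    merged = []
    for i in range(0, len(actions), factor):
        chunk = actions[i:i + factor]
        merged.append([sum(dim) for dim in zip(*chunk)])
    return merged

def slow_down(actions, factor):
    """Split each delta action into `factor` equal smaller steps."""
    split = []
    for action in actions:
        step = [x / factor for x in action]
        split.extend([list(step) for _ in range(factor)])
    return split

def total_displacement(actions):
    """Net motion of the trajectory, summed per dimension."""
    return [sum(dim) for dim in zip(*actions)]

# A 4-step demonstration of 2-D deltas (made-up values).
demo = [[0.5, 0.0], [0.5, 0.25], [0.0, 0.25], [0.25, 0.0]]
fast = speed_up(demo, 2)   # 2 steps instead of 4
slow = slow_down(demo, 2)  # 8 steps instead of 4
```

Both variants cover the same net displacement as the original demonstration in a different number of steps. Under this reading, each re-timed trajectory would be paired with its speed factor as the conditioning signal during training.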

Agent and Tool Ecosystem Multimodal Progress Variable-Speed Trajectory Augmentation TempoVLA TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

6 · arXiv · cs.LG · Jun 24, 2026

InSight: Self-guided autonomous skill acquisition for vision-language-action models via primitive steerability

InSight is a framework enabling VLA models to autonomously acquire new manipulation skills beyond their training data by decomposing demonstrations into labeled primitive actions (e.g., 'move gripper to bowl', 'pour the bottle') and running a VLM-guided data flywheel that identifies missing primitives, attempts demonstrations, and integrates successful ones back into training. The system requires no human demonstrations of target skills and is evaluated on simulation and real-world tasks including block flipping, drawer closing, sweeping, and pouring. Learned primitives can be composed for novel long-horizon tasks, offering a practical path toward continual skill acquisition in robotic VLA policies.

Agent and Tool Ecosystem Multimodal Progress InSight InSight: Self-Guided Skill Acquisition via Steerable VLAs

6 · arXiv · cs.LG · May 29, 2026

DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception

DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies including VLAs, with gains up to +22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.
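One way to picture a simplex-volume objective is with the Gram determinant: for three unit-norm embeddings, the determinant of their 3x3 matrix of pairwise dot products is the squared volume of the parallelotope they span, and the simplex with the origin as its fourth vertex has one sixth of that volume. The volume falls to zero as the three embeddings align. The sketch below assumes this particular construction; the summary does not say how DynaFLIP defines the simplex, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simplex_volume(img, txt, flow):
    """Volume of the simplex with vertices at the origin and three unit embeddings."""
    vecs = [normalize(img), normalize(txt), normalize(flow)]
    g = [[dot(u, v) for v in vecs] for u in vecs]  # 3x3 Gram matrix
    det = (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
           - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
           + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(det, 0.0)) / 6.0

# Mutually orthogonal embeddings: maximally misaligned.
orthogonal = simplex_volume([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0])
# Identical directions: perfectly aligned.
aligned = simplex_volume([2, 0, 0, 0], [2, 0, 0, 0], [2, 0, 0, 0])
# Nearly aligned embeddings fall in between.
close = simplex_volume([1, 0.1, 0, 0], [1, 0, 0.1, 0], [1, 0, 0, 0.1])
```

Minimizing such a volume pulls all three modalities together at once, rather than aligning them pair by pair.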

Frontier Model Releases Agent and Tool Ecosystem Vision-Language-Action models simplex-volume minimization DynaFLIP

6 · arXiv · cs.LG · Jun 16, 2026

HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies

Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.
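The gated, advantage-weighted cloning step can be sketched as below. The exponential weight `exp(A / beta)` is borrowed from standard advantage-weighted regression, and the convex blend of the two critic heads is one plausible reading of "state-adaptive gate"; both are assumptions, as are all names and default values. Intervention-aware credit assignment is left out of this sketch.

```python
import math

def gated_advantage(adv_viability, adv_efficiency, gate):
    """Blend two critic heads; `gate` in [0, 1] is assumed to come from a state-dependent network."""
    return gate * adv_viability + (1.0 - gate) * adv_efficiency

def bc_weight(advantage, beta=1.0, max_weight=20.0):
    """Exponential advantage weight, clipped so a few samples cannot dominate."""
    return min(math.exp(advantage / beta), max_weight)

def weighted_bc_loss(samples, beta=1.0):
    """Average weighted negative log-likelihood.

    samples: (neg_log_prob, adv_viability, adv_efficiency, gate) tuples.
    """
    total = 0.0
    for nll, a_v, a_e, g in samples:
        total += bc_weight(gated_advantage(a_v, a_e, g), beta) * nll
    return total / len(samples)

# Made-up batch: the second action looks better than average, the third worse.
batch = [
    (2.0, 0.0, 0.0, 0.5),
    (2.0, 1.0, 1.0, 0.5),
    (2.0, -1.0, -1.0, 0.5),
]
loss = weighted_bc_loss(batch)
```

Actions with positive blended advantage are imitated more strongly than the plain behavior-cloning baseline, and actions with negative advantage are down-weighted rather than discarded.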

Alignment and RLHF Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes Hierarchical Advantage-Weighted Behavior Cloning

5 · arXiv · cs.AI · 3d ago

VLK: Synthetic vision-language-kinematics supervision enables humanoid loco-manipulation from egocentric observations

Researchers introduce a pipeline that generates 48,000 paired vision-language-kinematics trajectories synthetically using 3D Gaussian Splatting to reconstruct indoor scenes, bypassing the need for expensive human-annotated robot data. A VLK policy trained on this data predicts whole-body kinematic trajectories from egocentric images and language instructions, which a whole-body tracker converts to physical actions. The approach is validated on a Unitree G1 humanoid performing navigation and object transport, demonstrating viable sim-to-real transfer for perception-based loco-manipulation.

VLK 3D Gaussian Splatting Unitree G1

5 · arXiv · cs.AI · Jun 25, 2026

FORCE: Efficient RL fine-tuning for Vision-Language-Action models via value-calibrated warm-up and self-distillation

Researchers introduce FORCE, a 3-stage reinforcement learning fine-tuning framework for Vision-Language-Action (VLA) models that addresses sample inefficiency caused by unstable Q-functions and low-quality exploration data. The framework uses a Value-Calibrated Warm-Up phase followed by Q-function-filtered policy updates, eliminating the need for costly human interventions during training. Evaluated on simulation and real-world robotic tasks, FORCE achieves a 79% absolute improvement in task success rates, outperforms prior RL methods by 10%, and accelerates training by 32.5%.

Agent and Tool Ecosystem Alignment and RLHF FORCE FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation
