widowx-35ee195a·1 events·first seen Aliases: WidowX
Researchers propose Task-Agnostic Pretraining (TAP), a two-stage framework for Vision-Language-Action models that separates physical motor skill acquisition from semantic language alignment. The first stage learns motor priors from cheap unlabeled interaction data via a self-supervised Inverse Dynamics objective; the second stage grounds these priors in language using minimal expert demonstrations. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, and on a real-world WidowX robot retains 25% success under camera perturbations where internet-scale baselines collapse to 0%.
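To make the first stage concrete, here is a minimal sketch of what a self-supervised inverse dynamics objective generally looks like: given two consecutive states, a model predicts the action that connected them, and the action logged during interaction serves as the training target, so no language labels or expert demonstrations are needed. This is a toy illustration and not TAP's actual architecture or training code, which the summary does not specify. The linear model, the 2-D dynamics (`next_state = state + action`), and all function names (`make_transitions`, `predict`, `idm_loss`, `train_idm`) are assumptions chosen only to keep the example small.

```python
import random


def make_transitions(n, rng):
    """Toy unlabeled interaction data: random actions in an assumed 2-D world
    where next_state = state + action (hypothetical dynamics)."""
    data = []
    for _ in range(n):
        state = [rng.uniform(-1, 1) for _ in range(2)]
        action = [rng.uniform(-1, 1) for _ in range(2)]
        next_state = [s + a for s, a in zip(state, action)]
        data.append((state, next_state, action))
    return data


def predict(weights, state, next_state):
    """Linear inverse dynamics model: action_hat = W @ [state, next_state]."""
    features = state + next_state  # list concatenation, length 4
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def idm_loss(weights, data):
    """Mean squared error between predicted and logged actions."""
    total = 0.0
    for state, next_state, action in data:
        pred = predict(weights, state, next_state)
        total += sum((p - a) ** 2 for p, a in zip(pred, action))
    return total / len(data)


def train_idm(data, steps=200, lr=0.5):
    """Full-batch gradient descent on the inverse dynamics loss."""
    weights = [[0.0] * 4 for _ in range(2)]
    for _ in range(steps):
        grads = [[0.0] * 4 for _ in range(2)]
        for state, next_state, action in data:
            features = state + next_state
            pred = predict(weights, state, next_state)
            for i in range(2):
                # Gradient of the squared error w.r.t. row i of the weights.
                err = 2.0 * (pred[i] - action[i]) / len(data)
                for j in range(4):
                    grads[i][j] += err * features[j]
        for i in range(2):
            for j in range(4):
                weights[i][j] -= lr * grads[i][j]
    return weights


rng = random.Random(0)
data = make_transitions(256, rng)
initial_loss = idm_loss([[0.0] * 4 for _ in range(2)], data)
weights = train_idm(data)
final_loss = idm_loss(weights, data)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In this toy setting the model only has to recover `action = next_state - state`, so the loss falls close to zero. In a framework like the one described, the motor prior learned this way would then be grounded in language during a second stage using a small set of expert demonstrations; that stage is not sketched here.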