TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation
Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and on real-world tasks show flexible bidirectional speed control. When coupled with a large multimodal model, the policy also adapts its speed dynamically, accelerating in low-risk transit phases and decelerating in high-risk contact stages.
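The re-timing idea behind VSTA can be illustrated with a minimal sketch. It assumes actions are relative (delta) displacements, so merging means summing consecutive actions and splitting means dividing one action into equal smaller steps. The summary does not give the actual VSTA procedure, so the function names `speed_up`, `slow_down`, and `total_displacement` are hypothetical, and observation re-sampling, gripper commands, and the speed label itself are left out.

```python
from typing import List

Action = List[float]  # one delta action, e.g. [dx, dy] (assumed representation)


def speed_up(actions: List[Action], factor: int) -> List[Action]:
    """Merge every `factor` consecutive delta actions by summing them."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    merged = []
    for start in range(0, len(actions), factor):
        chunk = actions[start:start + factor]
        # Sum each dimension across the chunk; a short final chunk is kept.
        merged.append([sum(dim) for dim in zip(*chunk)])
    return merged


def slow_down(actions: List[Action], factor: int) -> List[Action]:
    """Split each delta action into `factor` equal smaller steps."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    split = []
    for action in actions:
        step = [value / factor for value in action]
        split.extend(list(step) for _ in range(factor))
    return split


def total_displacement(actions: List[Action]) -> Action:
    """Net motion of a trajectory: the per-dimension sum of its deltas."""
    return [sum(dim) for dim in zip(*actions)]


demo = [[1.0, 0.0], [1.0, 2.0], [0.0, 2.0], [2.0, 0.0]]
fast = speed_up(demo, 2)   # 2 steps instead of 4
slow = slow_down(demo, 2)  # 8 steps instead of 4
```

In this sketch both re-timed trajectories cover the same net displacement as `demo`; only the number of steps changes, which is the property a speed-conditioned policy would need the augmented data to preserve.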
Related events (8)
LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics
Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.
π0 and π0-FAST: Vision-Language-Action Models for General Robot Control
Hugging Face published a blog post covering π0 and π0-FAST, vision-language-action (VLA) models developed for general-purpose robot control. These models combine vision and language understanding with action generation to enable robots to perform a broad range of manipulation tasks. The post appears to be a technical overview or release commentary on Physical Intelligence's robotics foundation models, situating them within the broader VLA research landscape.
CHORUS: Single VLA policy enables decentralized multi-robot collaboration without inter-robot communication
CHORUS is a framework that adapts a single vision-language-action (VLA) backbone to control diverse multi-robot teams in a fully decentralized manner, with each robot running an independent copy conditioned only on its own observations and a robot-identifying prompt. Real-world experiments across tasks such as tape measurement, book handovers, and laundry basket lifting show a 64-percentage-point improvement over decentralized from-scratch models and a 40-point improvement in reactivity to teammate behavior, while also outperforming centralized baselines. The key insight is that pretrained VLA visuomotor priors are sufficient to enable reactive coordination without explicit inter-robot communication or alignment procedures at inference time.
SmolVLA: Efficient Vision-Language-Action Model trained on LeRobot Community Data
Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. This release represents a continuation of Hugging Face's strategy to democratize robotics AI through open community data pipelines.
Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments
Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.
VLESA: Vision-Language Embodied Safety Agent for Real-Time Human Activity Monitoring
Researchers introduce VLESA, a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. The system addresses intent-dependent safety, where identical actions can be safe or dangerous depending on context, using a goal-conditioned safety Q-filter trained via GRPO and an intent-action prediction agent. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy than baselines, with the Q-filter improving action safety by over 41 percentage points through goal-conditioned constrained decoding.
HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies
Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.
TREAD: VLM-based re-labelling framework improves robot policy generalization via dataset augmentation
TREAD (Task Robustness via Re-Labelling Vision-Action Robot Data) is a scalable framework that uses pretrained Vision-Language Models to augment existing robotics datasets without new data collection. The approach decomposes demonstrations into sub-tasks, segments videos accordingly, and generates linguistically diverse instruction labels, enriching language-action pair diversity. Evaluations on the LIBERO benchmark show improved generalization to novel tasks and goals, addressing a key limitation of current robot learning policies.

