4 · arXiv cs.AI (Artificial Intelligence) · 13h ago

CoFL-S: Language-conditioned flow fields for low-level robot navigation in VLN

CoFL-S is a new vision-language-action framework that predicts language-conditioned flow fields over a robot's local visible sector to generate continuous navigation trajectories. The authors address an underexplored gap in Vision-Language Navigation (VLN) by providing frame-level local supervision with sub-instruction alignment and dense flow-field targets. A new continuous-time Habitat benchmark is introduced to enable decomposition-independent closed-loop evaluation across planner frequencies. CoFL-S outperforms action-token and action-chunk baselines in simulation and demonstrates zero-shot transfer to real-world deployment.

Agent and Tool Ecosystem · Multimodal Progress · Habitat · CoFL-S · VLN-CE
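To make the idea of turning a flow field into a continuous trajectory concrete, here is a minimal sketch. It is not the CoFL-S implementation: `toy_flow_field` is a hypothetical stand-in for a predicted field (it simply points toward a fixed goal), and forward-Euler integration is an assumed rollout scheme chosen for brevity.

```python
import math
from typing import Callable, List, Tuple

Vec2 = Tuple[float, float]
Field = Callable[[Vec2], Vec2]


def toy_flow_field(goal: Vec2, speed: float = 0.5) -> Field:
    """Hypothetical stand-in for a predicted field: constant-speed flow toward `goal`."""
    def field(p: Vec2) -> Vec2:
        dx, dy = goal[0] - p[0], goal[1] - p[1]
        dist = math.hypot(dx, dy)
        if dist < 1e-9:
            return (0.0, 0.0)  # at the goal: no motion
        return (speed * dx / dist, speed * dy / dist)
    return field


def rollout(field: Field, start: Vec2, dt: float, steps: int) -> List[Vec2]:
    """Forward-Euler integration of the field: p <- p + dt * v(p)."""
    path = [start]
    x, y = start
    for _ in range(steps):
        vx, vy = field((x, y))
        x, y = x + dt * vx, y + dt * vy
        path.append((x, y))
    return path


field = toy_flow_field(goal=(1.0, 0.0), speed=0.5)
path = rollout(field, start=(0.0, 0.0), dt=0.1, steps=10)
```

Because the field is defined at every position rather than as a fixed list of actions, the same field can be sampled with a different `dt` and `steps`, which is one way to picture querying a planner at different frequencies.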

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

5 · arXiv · cs.AI · 3d ago

VLK: Synthetic vision-language-kinematics supervision enables humanoid loco-manipulation from egocentric observations

Researchers introduce a pipeline that generates 48,000 paired vision-language-kinematics trajectories synthetically using 3D Gaussian Splatting to reconstruct indoor scenes, bypassing the need for expensive human-annotated robot data. A VLK policy trained on this data predicts whole-body kinematic trajectories from egocentric images and language instructions, which a whole-body tracker converts to physical actions. The approach is validated on a Unitree G1 humanoid performing navigation and object transport, demonstrating viable sim-to-real transfer for perception-based loco-manipulation.

VLK · 3D Gaussian Splatting · Unitree G1

6 · arXiv · cs.CL · 21d ago

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address data scarcity, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe: FAST action-token pretraining on a Qwen3-VL-4B-Instruct backbone, followed by flow-matching post-training via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

Agent and Tool Ecosystem · Multimodal Progress · LabVLA · LabUtopia · Qwen3-4B-Instruct · +3 more

5 · arXiv · cs.AI · 28d ago

TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation

Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and on real-world tasks show flexible bidirectional speed control. When coupled with a large multimodal model, the policy also adapts its speed dynamically, accelerating in low-risk transit phases and decelerating for high-risk contact stages.

Agent and Tool Ecosystem · Multimodal Progress · Variable-Speed Trajectory Augmentation · TempoVLA · TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
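The re-timing idea behind VSTA (merging actions to speed a demonstration up, splitting them to slow it down) can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's procedure: it assumes actions are relative displacements (deltas), so that summing consecutive actions or dividing one action into equal parts preserves the total motion. The function names `merge_actions` and `split_actions` are hypothetical.

```python
from typing import List, Sequence

Action = List[float]  # assumed: one relative displacement per control step


def merge_actions(actions: Sequence[Action], k: int) -> List[Action]:
    """Speed up by roughly k: sum each run of k consecutive delta actions into one."""
    if k < 1:
        raise ValueError("k must be >= 1")
    merged = []
    for i in range(0, len(actions), k):
        chunk = actions[i:i + k]  # the last chunk may be shorter than k
        merged.append([sum(vals) for vals in zip(*chunk)])
    return merged


def split_actions(actions: Sequence[Action], k: int) -> List[Action]:
    """Slow down by k: replace each delta action with k equal sub-steps."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return [[v / k for v in action] for action in actions for _ in range(k)]


demo = [[1.0, 0.0], [1.0, 2.0], [0.0, 2.0], [2.0, 0.0]]
fast = merge_actions(demo, 2)   # half as many steps, same total displacement
slow = split_actions(demo, 2)   # twice as many steps, same total displacement
```

Under the delta-action assumption, both transforms leave the end pose unchanged while altering the number of control steps, which is what makes the re-timed trajectory usable as a training example labeled with a different speed.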

6 · Hugging Face Blog · 1mo ago

π0 and π0-FAST: Vision-Language-Action Models for General Robot Control

Hugging Face published a blog post covering π0 and π0-FAST, vision-language-action (VLA) models developed for general-purpose robot control. These models combine vision and language understanding with action generation to enable robots to perform a broad range of manipulation tasks. The post appears to be a technical overview or release commentary on Physical Intelligence's robotics foundation models, situating them within the broader VLA research landscape.

Frontier Model Releases · Agent and Tool Ecosystem · Vision-Language-Action model · π0-FAST · Physical Intelligence · +3 more

5 · Hugging Face Blog · 1mo ago

SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data

Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. This release represents a continuation of Hugging Face's strategy to democratize robotics AI through open community data pipelines.

Open Weights Progress · Agent and Tool Ecosystem · LeRobot · Vision-Language-Action model · Hugging Face · +2 more

7 · arXiv · cs.CL · 1mo ago

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases · Evaluation and Benchmarking · Qwen-VLA · DOMINO · R2R · +10 more

6 · arXiv · cs.LG · 9d ago

InSight: Self-guided autonomous skill acquisition for vision-language-action models via primitive steerability

InSight is a framework enabling VLA models to autonomously acquire new manipulation skills beyond their training data by decomposing demonstrations into labeled primitive actions (e.g., 'move gripper to bowl', 'pour the bottle') and running a VLM-guided data flywheel that identifies missing primitives, attempts demonstrations, and integrates successful ones back into training. The system requires no human demonstrations of target skills and is evaluated on simulation and real-world tasks including block flipping, drawer closing, sweeping, and pouring. Learned primitives can be composed for novel long-horizon tasks, offering a practical path toward continual skill acquisition in robotic VLA policies.

Agent and Tool Ecosystem · Multimodal Progress · InSight · InSight: Self-Guided Skill Acquisition via Steerable VLAs

6 · arXiv · cs.AI · 22d ago

CHORUS: Single VLA policy enables decentralized multi-robot collaboration without inter-robot communication

CHORUS is a framework that adapts a single vision-language-action (VLA) backbone to control diverse multi-robot teams in a fully decentralized manner, with each robot running an independent copy conditioned only on its own observations and a robot-identifying prompt. Real-world experiments on tasks such as tape measurement, book handovers, and laundry basket lifting show a 64-percentage-point improvement over decentralized from-scratch models and a 40-point improvement in reactivity to teammate behavior, while also outperforming centralized baselines. The key insight is that pretrained VLA visuomotor priors are sufficient to enable reactive coordination without explicit inter-robot communication or alignment procedures at inference time.

Agent and Tool Ecosystem · Multimodal Progress · CHORUS · CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy