Entity · technique

Vision-Language-Action model

technique · active · vision-language-action-model-3ce380f4 · 7 events · first seen May 18, 2026

Aliases: Vision-Language-Action model

More like this (12)

Vision-Language-Action models
Vision-Language Models
visual language model
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
From Fixed to Free Cameras: Calibration-Free View-Robust Vision-Language-Action Model
From Fixed to Free Cameras: Calibration-Free View-Robust Vision-Language-Action Model
RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models
Geometric Action Model
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

Recent events (7)

5 · arXiv · cs.AI · Jun 25, 2026

Two-stage action prior pretraining improves cross-embodiment VLA robot manipulation

Researchers propose a two-stage training framework for Vision-Language-Action (VLA) models that pretrains the action module with motion priors before cross-modal alignment begins. Stage 1 uses a flow-matching-based encoder-decoder to learn temporal motion structure from unconditioned action trajectories alone; Stage 2 transfers this prior to VLA training via decoder reuse and latent distillation. Evaluated across 13 cross-embodiment tasks in simulation and real-world settings, the approach achieves faster convergence, higher success rates, and notably better performance in data-scarce real-world scenarios compared to VLA training without action priors.

Agent and Tool Ecosystem · Multimodal Progress · Learning Action Priors for Cross-embodiment Robot Manipulation · Vision-Language-Action model · Flow Matching
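
As a rough illustration of what Stage 1 in the summary above could optimize, the sketch below builds a flow-matching training pair from an action vector alone, with no image or language conditioning. It assumes one common convention (a straight-line path from Gaussian noise to the data, with the constant velocity `action - noise` as the regression target). The function names are hypothetical, and the paper's encoder-decoder, decoder reuse, and latent distillation are not reproduced here.

```python
import random


def flow_matching_pair(action, noise, t):
    """Return (x_t, target_velocity) for a straight-line noise-to-data path.

    Assumed convention: x_t = (1 - t) * noise + t * action, so the velocity
    along the path is constant and equal to action - noise.
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target = [a - n for n, a in zip(noise, action)]
    return x_t, target


def fm_loss(predict_velocity, action, rng):
    """Mean squared error between a predicted and the target velocity.

    predict_velocity(x_t, t) stands in for the action module; it sees only
    the noisy action and the time, i.e. it is unconditioned.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t, target = flow_matching_pair(action, noise, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(action)


if __name__ == "__main__":
    rng = random.Random(0)
    # A placeholder predictor that always outputs zero velocity.
    print(fm_loss(lambda x, t: [0.0] * len(x), [0.1, -0.3, 0.7], rng))
```

In a real training loop the placeholder predictor would be a network and the loss would be averaged over batches of action chunks.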

7 · arXiv · cs.LG · May 28, 2026

Ω-QVLA: Training-Free W4A4 Quantization for Full Vision-Language-Action Models Including Diffusion Action Heads

Ω-QVLA is a post-training quantization framework that compresses both the LLM backbone and the diffusion-based action head of VLA models to uniform W4A4 precision (4-bit weights and 4-bit activations) without mixed-precision schemes or fine-tuning. It combines composite SVD-Hadamard rotation for weight energy equalization with per-step DiT activation scaling to handle dynamic-range drift across denoising steps. On the LIBERO benchmark, it achieves 98.0% and 87.8% task success on Pi 0.5 and GR00T N1.5 respectively, matching or exceeding FP16 baselines, while reducing static memory footprint by 71.3%. Real-world manipulation experiments confirm the approach generalizes beyond simulation.

Inference Economics · Agent and Tool Ecosystem · Pi 0.5 · SVD-Hadamard rotation · LIBERO · +6 more
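
To make the rotation idea in the summary above more concrete, here is a minimal sketch of the plain Hadamard half only. It shows that an orthonormal rotation leaves a dot product unchanged while spreading an outlier's energy across channels, which shrinks the range a 4-bit quantizer has to cover. The SVD component and the per-step activation scaling are not reproduced, and all function names are hypothetical.

```python
import math


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def rotate(vec, h):
    """Apply the rotation: returns H @ vec."""
    return [sum(r * v for r, v in zip(row, vec)) for row in h]


def quantize_int4(vec):
    """Symmetric per-tensor fake quantization onto the integer grid [-7, 7]."""
    scale = max(abs(v) for v in vec) / 7.0 or 1.0  # fall back to 1.0 for all-zero input
    return [max(-7, min(7, round(v / scale))) * scale for v in vec]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    h = hadamard(8)
    x = [8.0, 0.1, -0.2, 0.15, 0.05, -0.1, 0.12, -0.07]  # one outlier channel
    w = [0.5, -1.0, 0.25, 2.0, -0.5, 1.5, -0.75, 1.0]
    # Rotating both operands preserves the product because H is orthonormal.
    print(dot(x, w), dot(rotate(x, h), rotate(w, h)))
    # The rotated vector has a smaller peak magnitude than the original.
    print(max(abs(v) for v in x), max(abs(v) for v in rotate(x, h)))
    print(quantize_int4(rotate(x, h)))
```

Whether rotation lowers the quantization error for a given tensor depends on its value distribution, so the sketch deliberately demonstrates only the invariance and the reduced peak, not an accuracy gain.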

6 · arXiv · cs.AI · May 21, 2026

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types, including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3× (r=0.99 across attack types). Enabling CoC generation is associated with an 11.8% average improvement in trajectory accuracy. Degradation under noise is approximately linear (R²=0.957), and standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking · AI Safety Research · Vision-Language-Action model · Chain-of-Causation · autonomous driving · +3 more
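
The "deviation spikes N×" comparison in the summary above can be read as a ratio of mean trajectory deviation between trials where the explanation changed and trials where it did not. The sketch below shows that arithmetic. It assumes average displacement error over 2D waypoints as the deviation metric, which the summary does not specify, and the function names and example numbers are illustrative only.

```python
import math


def average_displacement_error(traj_a, traj_b):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def deviation_ratio(trials):
    """Ratio of mean deviation: explanation-changed trials over stable trials.

    trials is an iterable of (explanation_changed, deviation) pairs.
    """
    trials = list(trials)
    changed = [d for c, d in trials if c]
    stable = [d for c, d in trials if not c]
    return (sum(changed) / len(changed)) / (sum(stable) / len(stable))


if __name__ == "__main__":
    clean = [(0.0, 0.0), (1.0, 0.0)]
    perturbed = [(0.0, 1.0), (1.0, 1.0)]
    print(average_displacement_error(clean, perturbed))
    print(deviation_ratio([(True, 4.0), (True, 6.0), (False, 1.0), (False, 1.0)]))
```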

6 · Hugging Face Blog · May 19, 2026

π0 and π0-FAST: Vision-Language-Action Models for General Robot Control

Hugging Face published a blog post covering π0 and π0-FAST, vision-language-action (VLA) models developed for general-purpose robot control. These models combine vision and language understanding with action generation to enable robots to perform a broad range of manipulation tasks. The post appears to be a technical overview or release commentary on Physical Intelligence's robotics foundation models, situating them within the broader VLA research landscape.

Frontier Model Releases · Agent and Tool Ecosystem · Vision-Language-Action model · π0-FAST · Physical Intelligence · +3 more

5 · Hugging Face Blog · May 19, 2026

SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data

Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. The release continues Hugging Face's strategy of democratizing robotics AI through open community data pipelines.

Open Weights Progress · Agent and Tool Ecosystem · LeRobot · Vision-Language-Action model · Hugging Face · +2 more

5 · Hugging Face Blog · May 19, 2026

Asynchronous Robot Inference: Decoupling Action Prediction and Execution

Hugging Face published a blog post on asynchronous robot inference, a technique that decouples the timing of action prediction from action execution in robotic systems. This approach addresses latency bottlenecks that arise when large neural network inference times exceed the real-time control loop requirements of physical robots. The post likely covers architectural patterns and implementation strategies for deploying vision-language-action models or similar policies on robot hardware without blocking the control pipeline.

Inference Economics · Enterprise Deployment Patterns · asynchronous inference · Vision-Language-Action model · Hugging Face · +1 more
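
One common way to decouple prediction from execution, as described in the summary above, is to let a background thread refill a queue of actions while the control loop drains it at a fixed rate. The sketch below illustrates that pattern with the standard library. It is an assumption about the general technique rather than the post's actual implementation, and `run_async`, `refill_below`, and the callback names are hypothetical.

```python
import queue
import threading
import time


def run_async(policy, get_observation, execute, n_steps,
              control_dt=0.01, refill_below=2):
    """Execute n_steps actions while a background thread predicts action chunks.

    policy(observation) returns a list of actions (one chunk). A new chunk is
    requested whenever fewer than refill_below actions remain queued, so slow
    inference overlaps with execution instead of pausing the control loop.
    """
    actions = queue.Queue()
    stop = threading.Event()

    def predictor():
        while not stop.is_set():
            if actions.qsize() < refill_below:
                for action in policy(get_observation()):
                    actions.put(action)
            else:
                time.sleep(control_dt / 2)  # queue is healthy; check again shortly

    worker = threading.Thread(target=predictor, daemon=True)
    worker.start()
    executed = 0
    try:
        while executed < n_steps:
            try:
                # Blocks only if the predictor has fallen behind.
                action = actions.get(timeout=1.0)
            except queue.Empty:
                break  # predictor stalled; stop rather than wait indefinitely
            execute(action)
            executed += 1
            time.sleep(control_dt)  # stand-in for the fixed control period
    finally:
        stop.set()
        worker.join()
    return executed


if __name__ == "__main__":
    def slow_policy(obs):
        time.sleep(0.02)  # simulated inference latency
        return [obs, obs, obs, obs]

    log = []
    print(run_async(slow_policy, lambda: 0, log.append, n_steps=10))
```

A larger `refill_below` requests new chunks earlier, which hides more inference latency but means the queued actions were predicted from older observations.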

5 · Hugging Face Blog · May 18, 2026

Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations

NXP and Hugging Face describe a pipeline for deploying Vision-Language-Action (VLA) models on embedded/edge hardware, covering dataset recording, fine-tuning, and on-device optimization techniques. The post targets robotics applications where inference must run on resource-constrained microcontrollers or SoCs rather than cloud GPUs. Key topics include quantization, model compression, and integration with the LeRobot ecosystem. This represents a practical engineering bridge between frontier VLA research and real-world embedded robotics deployment.

Inference Economics · Agent and Tool Ecosystem · LeRobot · NXP Semiconductors · Vision-Language-Action model · +3 more