5arXiv cs.LG (Machine Learning)·17d ago

VLESA: Vision-Language Embodied Safety Agent for Real-Time Human Activity Monitoring

Researchers introduce VLESA, a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. The system addresses intent-dependent safety — where identical actions can be safe or dangerous depending on context — using a goal-conditioned safety Q-filter trained via GRPO and an intent-action prediction agent. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy than baselines, with the Q-filter improving action safety by over 41 percentage points through goal-conditioned constrained decoding.

AI Safety Research Multimodal Progress GRPO ASIMOV-2.0 VLESA Hanjiang Hu

Related guides (3)

GRPOConcept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·8d ago·source ↗

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

Agent and Tool Ecosystem Multimodal Progress LabVLA LabUtopia Qwen3-4B-Instruct +3 more

5arXiv · cs.AI·15d ago·source ↗

TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation

Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and real-world tasks show flexible bidirectional speed control, with dynamic adaptation—accelerating in low-risk transit phases and decelerating for high-risk contact stages—achieved by coupling with a large multimodal model.

Agent and Tool Ecosystem Multimodal Progress Variable-Speed Trajectory Augmentation TempoVLA TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

7arXiv · cs.CL·22d ago·source ↗

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases Evaluation and Benchmarking Qwen-VLA DOMINO R2R +10 more

5Hugging Face Blog·1mo ago·source ↗

SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data

Hugging Face introduces SmolVLA, a compact Vision-Language-Action model designed for robotics control, trained on community-contributed data from the LeRobot ecosystem. The model targets efficient deployment on resource-constrained hardware while maintaining competitive manipulation performance. This release represents a continuation of Hugging Face's strategy to democratize robotics AI through open community data pipelines.

Open Weights Progress Agent and Tool Ecosystem LeRobot Vision-Language-Action model Hugging Face +2 more

6arXiv · cs.AI·1mo ago·source ↗

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3× (r=0.99 across attack types). Enabling CoC generation is associated with 11.8% average improvement in trajectory accuracy, and degradation under noise is approximately linear (R²=0.957), while standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking AI Safety Research Vision-Language-Action model Chain-of-Causation autonomous driving +3 more

5Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

PEVA: Whole-Body Conditioned Egocentric Video Prediction for Embodied World Models

Researchers from BAIR introduce PEVA (Predicting Ego-centric Video from human Actions), a model that generates first-person video frames conditioned on 48-dimensional whole-body kinematic pose trajectories. The model uses an autoregressive conditional diffusion transformer trained on the Nymeria dataset, which pairs real-world egocentric video with body pose capture. PEVA can generate atomic action videos, simulate counterfactuals, and support long video generation, representing a step toward world models grounded in physically embodied human agents.

Agent and Tool Ecosystem Multimodal Progress PEVA Conditional Diffusion Transformer Berkeley AI Research (BAIR)+2 more

6Hugging Face Blog·1mo ago·source ↗

π0 and π0-FAST: Vision-Language-Action Models for General Robot Control

Hugging Face published a blog post covering π0 and π0-FAST, vision-language-action (VLA) models developed for general-purpose robot control. These models combine vision and language understanding with action generation to enable robots to perform a broad range of manipulation tasks. The post appears to be a technical overview or release commentary on Physical Intelligence's robotics foundation models, situating them within the broader VLA research landscape.

Frontier Model Releases Agent and Tool Ecosystem Vision-Language-Action model π0-FAST Physical Intelligence +3 more

5Hugging Face Blog·1mo ago·source ↗

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress Inference Economics Vision-Language Models Hugging Face +1 more