6arXiv cs.AI (Artificial Intelligence)·1mo ago

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3× (r=0.99 across attack types). Enabling CoC generation is associated with 11.8% average improvement in trajectory accuracy, and degradation under noise is approximately linear (R²=0.957), while standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking AI Safety Research Agent and Tool Ecosystem Multimodal Progress Vision-Language-Action model Chain-of-Causation autonomous driving Alpamayo R1

Related guides (4)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Read asIn-depth

Related events (8)

6arXiv · cs.CL·23d ago·source ↗

The Abstraction Gap in Vision-Language Causal Reasoning

Researchers introduce a dual-probe methodology and the CAGE benchmark (49,500 questions across 5,500 images) to distinguish linguistic plausibility from faithful causal reasoning in vision-language models. An Abstraction Gap (AG) metric quantifies the normalized performance difference between text-only and chain-of-reasoning probes. Evaluating eight VLMs, seven exhibit AG exceeding 0.50—generating fluent causal text but failing structured causal chain tasks—while one model achieves near-zero AG, suggesting architectural and pretraining choices are decisive. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, pointing to a fundamental capability distinction.

Evaluation and Benchmarking Agent and Tool Ecosystem Pearl's Causal Hierarchy CAGE Text-Only Probe +3 more

5arXiv · cs.LG·2d ago·source ↗

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking Multimodal Progress Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models Act2Answer

6arXiv · cs.CL·1mo ago·source ↗

Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models

This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.

Frontier Model Releases Evaluation and Benchmarking Max-Pooling Chain-of-Thought Reasoning Probe Trajectories +4 more

4arXiv · cs.AI·12d ago·source ↗

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working memory framework using conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on historical trajectory and a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves 68.3% success rate (>6% improvement) with 3.3x speedup and 2.7x memory reduction over uncompressed processing.

Long Context Evolution Inference Economics conditional VQ-VAE Planning-aligned Token Compression for Long-Context Autonomous Driving COMPACT-VA

5arXiv · cs.LG·17d ago·source ↗

VLESA: Vision-Language Embodied Safety Agent for Real-Time Human Activity Monitoring

Researchers introduce VLESA, a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. The system addresses intent-dependent safety — where identical actions can be safe or dangerous depending on context — using a goal-conditioned safety Q-filter trained via GRPO and an intent-action prediction agent. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy than baselines, with the Q-filter improving action safety by over 41 percentage points through goal-conditioned constrained decoding.

AI Safety Research Multimodal Progress GRPO ASIMOV-2.0 VLESA +1 more

5arXiv · cs.AI·15d ago·source ↗

TempoVLA: Speed-Controllable Vision-Language-Action Policy for Robot Manipulation

Researchers introduce TempoVLA, a Vision-Language-Action model that enables explicit speed control during robot manipulation by conditioning on a speed signal rather than inheriting a fixed speed from training data. The system pairs Variable-Speed Trajectory Augmentation (VSTA), which re-times demonstrations by merging or splitting actions, with a model-side conditioning mechanism. Experiments in simulation and real-world tasks show flexible bidirectional speed control, with dynamic adaptation—accelerating in low-risk transit phases and decelerating for high-risk contact stages—achieved by coupling with a large multimodal model.

Agent and Tool Ecosystem Multimodal Progress Variable-Speed Trajectory Augmentation TempoVLA TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

6arXiv · cs.AI·9d ago·source ↗

CHORUS: Single VLA policy enables decentralized multi-robot collaboration without inter-robot communication

CHORUS is a framework that adapts a single vision-language-action (VLA) backbone to control diverse multi-robot teams in a fully decentralized manner, with each robot running an independent copy conditioned only on its own observations and a robot-identifying prompt. Real-world experiments across tasks like tape measurement, book handovers, and laundry basket lifting show a 64-percentage-point improvement over decentralized from-scratch models and 40-point improvement in reactivity to teammate behavior, while outperforming centralized baselines. The key insight is that pretrained VLA visuomotor priors are sufficient to enable reactive coordination without explicit inter-robot communication or alignment procedures at inference time.

Agent and Tool Ecosystem Multimodal Progress CHORUS CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

5arXiv · cs.AI·22d ago·source ↗

VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench

Researchers introduce VisAnomBench, a curated benchmark augmenting public time-series anomaly datasets with natural-language rationales generated and selected from multiple large VLMs using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines by at least 21.23 and 23.87 percentage points in precision and F1 on VisAnomBench. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.

Evaluation and Benchmarking Agent and Tool Ecosystem time-series anomaly detection Vision-Language Models TSB-AD-U +3 more