Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Recent events (1)

arXiv · cs.LG · 3d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.
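The conversion described above can be pictured with a minimal sketch. This is not the Act2Answer implementation: the summary does not specify the episode format, so every name here (`Question`, `PlacementEpisode`, `to_episode`, `success_rate`) is hypothetical, as is the assumption that each answer choice maps to one object on the table and that an episode succeeds when the policy places the object matching the correct answer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    """A multiple-choice knowledge question (hypothetical format)."""
    prompt: str
    choices: Sequence[str]
    answer_index: int


@dataclass(frozen=True)
class PlacementEpisode:
    """A tabletop episode: one object per answer choice (assumed mapping)."""
    instruction: str
    objects: Sequence[str]
    target_object: str


def to_episode(q: Question) -> PlacementEpisode:
    """Turn a question into a placement instruction over candidate objects."""
    instruction = f"{q.prompt} Place the matching object in the tray."
    return PlacementEpisode(
        instruction=instruction,
        objects=tuple(q.choices),
        target_object=q.choices[q.answer_index],
    )


def success_rate(
    episodes: Sequence[PlacementEpisode],
    policy: Callable[[PlacementEpisode], str],
) -> float:
    """Fraction of episodes in which the policy places the target object.

    `policy` stands in for a VLA rollout and returns the name of the
    object it ended up placing.
    """
    if not episodes:
        return 0.0
    hits = sum(policy(ep) == ep.target_object for ep in episodes)
    return hits / len(episodes)
```

Scoring on which object was placed, rather than on how cleanly the trajectory was executed, is one way to read the claim that the action-grounded success rate reduces confounds from low-level control failures.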
