paper
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
paper · active · provisional
does-vla-even-know-the-basics-measuring-commonsense-and-world-knowledge-retention-in-vision-language-action-models-ae984616 · 1 event · first seen 3d ago
Aliases: Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
Co-occurring entities
More like this (12)
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Vision-Language-Action model
Vision-Language-Action models
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Vision-Language Models
visual language model
Gaze Heads: How VLMs Look at What They Describe
Gaze Heads: How VLMs Look at What They Describe
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Visual Question Answering
Recent events (1)
Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models
Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.
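The conversion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `PlacementEpisode` type, the `question_to_episode` and `success_rate` helpers, the multiple-choice input format, and the "bowl" placement target are all assumptions made for illustration. The idea shown is only that each answer option becomes a candidate object on the table, and an episode counts as a success when the model places the object matching the correct answer.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PlacementEpisode:
    """Hypothetical episode: one candidate object per answer option."""
    instruction: str
    objects: Tuple[str, ...]  # candidate objects laid out on the table
    target_index: int         # index of the object matching the correct answer


def question_to_episode(
    question: str, options: Sequence[str], answer_index: int
) -> PlacementEpisode:
    """Turn a multiple-choice question into a placement task (assumed format)."""
    if not 0 <= answer_index < len(options):
        raise ValueError("answer_index is out of range for the given options")
    # The bowl as a placement target is an assumption, not taken from the paper.
    instruction = f"Place the object that answers the question into the bowl: {question}"
    return PlacementEpisode(instruction, tuple(options), answer_index)


def success_rate(
    episodes: Sequence[PlacementEpisode], chosen_indices: Sequence[int]
) -> float:
    """Fraction of episodes in which the placed object was the target object."""
    if len(episodes) != len(chosen_indices):
        raise ValueError("need exactly one chosen index per episode")
    if not episodes:
        return 0.0
    hits = sum(ep.target_index == chosen for ep, chosen in zip(episodes, chosen_indices))
    return hits / len(episodes)


# Usage: one question turned into an episode, scored over two hypothetical rollouts.
episode = question_to_episode("Which one is a fruit?", ["hammer", "apple", "cup"], 1)
rate = success_rate([episode, episode], [1, 0])  # one correct and one wrong placement
```

Scoring by which object was placed, rather than by generated text, is what makes the metric action-grounded in this sketch; how the actual protocol separates knowledge errors from low-level control failures is not specified in the summary above.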