5 · arXiv cs.AI (Artificial Intelligence) · 1mo ago

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception, constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates reviewed by human annotators. The benchmark evaluates knowledge-intensive reasoning in vision-language models, covering 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.

Evaluation and Benchmarking · Multimodal Progress · large language models · Wikidata · WikiVQABench · Vision-Language Models · Wikipedia

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Vision-Language Models · Concept

Vision-Language Models: Teaching AI to See and Read at Once

Related events (8)

5 · arXiv · cs.LG · 2d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking · Multimodal Progress · Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models · Act2Answer

7 · Qwen Research · 1mo ago

QVQ-72B-Preview: Qwen Visual Reasoning Model Release

Alibaba's Qwen team has released QVQ-72B-Preview, a 72-billion parameter multimodal model designed to integrate visual understanding with advanced reasoning capabilities. The model is positioned as an extension of Qwen's language reasoning work into the visual domain. It is available on GitHub, Hugging Face, ModelScope, and Kaggle with a live demo.

Frontier Model Releases · Open Weights Progress · Alibaba Qwen · QVQ-72B-Preview · +3 more

5 · Hugging Face Blog · 1mo ago

Docmatix: A Large-Scale Dataset for Document Visual Question Answering

Hugging Face released Docmatix, a large-scale dataset designed for Document Visual Question Answering (DocVQA) tasks. The dataset aims to address the scarcity of high-quality training data for document understanding in multimodal models. It is intended to improve fine-tuning of vision-language models on document comprehension tasks.

Evaluation and Benchmarking · Multimodal Progress · Hugging Face · Document Visual Question Answering · Docmatix

7 · Qwen Research · 1mo ago

Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding

Alibaba's Qwen team has released Qwen2-VL, the latest iteration of its vision-language model series, built on the Qwen2 foundation. The team claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos longer than 20 minutes for question answering, dialog, and content creation tasks.

Frontier Model Releases · Evaluation and Benchmarking · Qwen2.5-VL · RealWorldQA · DocVQA · +6 more

7 · Qwen Research · 1mo ago

QVQ-Max: Alibaba Qwen Releases Visual Reasoning Model with Multimodal Chain-of-Thought

Alibaba's Qwen team has officially released QVQ-Max, a visual reasoning model succeeding the December 2024 QVQ-72B-Preview. The model is designed to analyze and reason over images and videos, covering domains including mathematics, programming, and creative tasks. It is positioned as a production-grade multimodal reasoning system, a step beyond the exploratory preview.

Frontier Model Releases · Agent and Tool Ecosystem · Alibaba Qwen · QVQ-72B-Preview · QVQ-Max · +1 more

5 · arXiv · cs.CL · 11d ago

Benchmark for view-level visual evidence identification in multi-view MLLMs for autonomous driving

A new arXiv preprint introduces a multi-view visual question answering benchmark targeting evidence-source identification in autonomous driving scenarios. Given six synchronized NuScenes camera views and a question, models must identify which camera view supports the answer, not just produce a correct answer. The 122-pair benchmark spans causality, counterfactual reasoning, and intent prediction, and exposes grounding failures that answer-only evaluation misses. The work addresses the gap between answer accuracy and correct visual grounding in safety-critical multimodal systems.

Evaluation and Benchmarking · Multimodal Progress · NuScenes · Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

5 · arXiv · cs.CL · 1mo ago

ACL-Verbatim: Hallucination-Free Extractive QA System for Research Papers

The paper introduces ACL-Verbatim, an extractive question answering system built on VerbatimRAG that maps user queries directly to verbatim text spans in ACL Anthology papers, eliminating hallucination by design. The authors contribute a new ground-truth benchmark dataset created via human NLP-researcher annotation over synthetic queries generated using a ScIRGen-based pipeline. A 150M-parameter ModernBERT token classifier trained on silver supervision achieves the best word-level F1 of 53.6, outperforming the strongest LLM-based extractor at 48.7. The work demonstrates that smaller extractive models can outperform large generative LLMs on precision-critical retrieval tasks.

Evaluation and Benchmarking · AI Safety Research · ModernBERT · ScIRGen · ACL Anthology · +3 more

5 · arXiv · cs.AI · 22d ago

RoboWits: Benchmark for Robotic Creative Problem Solving Under Unexpected Conditions

RoboWits is a new bi-manual robotic benchmark designed to evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions in robotics. The authors introduce an automated multi-agent task generation pipeline that produces 30 seed tasks and 208 mutated tasks spanning geometry, material, and assembly-based reasoning. Benchmarking results show that pre-trained Vision-Language-Action models (VLAs) achieve limited success on seed tasks after fine-tuning but fail on mutated variants, exposing brittleness in reasoning and strategy adaptation. The benchmark highlights a significant gap between skill-level execution and genuine cognitive reasoning in current robotic systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Vision-Language-Action models · RoboWits · multi-agent cooperative framework · +3 more