5 · arXiv cs.CL (Computation and Language) · 24d ago

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chartographer is a framework for generating counterfactual chart variants to rigorously evaluate visual reasoning in vision-language models (VLMs), addressing the problem of shortcut-taking and prior knowledge exploitation in chart QA benchmarks. The system reverse-engineers charts into executable code, generates seed-controlled variants, and derives new ground-truth answers via executable QA logic. Evaluation of proprietary and open-source VLMs reveals that models frequently fail to generalize to counterfactual charts even after correctly answering the original, with failures most common when novel visual reasoning pathways are required.

Evaluation and Benchmarking · Multimodal Progress · Chartographer · counterfactual chart generation · Vision-Language Models · chart question-answering
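The pipeline summarized above (a chart expressed as executable code, seed-controlled variants, and answers derived by executable QA logic) can be illustrated with a minimal Python sketch. Every name here (`make_chart_data`, `answer_highest_category`, the category labels, the value range) is hypothetical and not taken from the paper. Rendering is left out entirely; the sketch only shows how a seeded generator combined with an executable question can produce a fresh ground truth for each variant.

```python
import random

CATEGORIES = ["A", "B", "C", "D"]  # hypothetical bar labels


def make_chart_data(seed: int) -> dict[str, int]:
    """Generate the data behind one chart variant; the same seed gives the same data."""
    rng = random.Random(seed)
    return {label: rng.randint(10, 100) for label in CATEGORIES}


def answer_highest_category(data: dict[str, int]) -> str:
    """Executable QA logic for 'Which category has the highest value?'."""
    return max(data, key=data.get)


original = make_chart_data(seed=0)
variants = [make_chart_data(seed=s) for s in range(1, 4)]

# Ground truth is recomputed from each variant's data, never copied from the original.
ground_truth = [answer_highest_category(v) for v in variants]
```

Because the answer is computed from the variant's own data, a model that memorized the original chart's answer gains nothing on the variants, which is the property a counterfactual evaluation of this kind relies on.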

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Vision-Language Models · Concept

Vision-Language Models: Teaching AI to See and Read at Once

Related events (8)

5 · arXiv · cs.CL · 24d ago

Self-Ensembling Vision-Language Models for Chart Data Extraction

This paper proposes a self-ensembling method for chart-to-table extraction using vision-language models (VLMs), where multiple tabular outputs are sampled from the same VLM for a given chart image and aggregated via per-cell median over numerical values. The approach includes convergence detection and uncertainty estimation based on sample dispersion. The authors also introduce WB-ChartExtract, a new benchmark built from World Bank data featuring charts with ~7x more datapoints than ChartQA. The method achieves up to 23% relative improvement on WB-ChartExtract over single-pass VLM baselines.

Evaluation and Benchmarking · Multimodal Progress · WB-ChartExtract · ChartQA · World Bank
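The aggregation step described above (several tables sampled from the same VLM, combined by per-cell median, with sample dispersion as an uncertainty signal) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every sampled table has the same shape and purely numeric cells, uses population standard deviation as the dispersion measure (an assumption), and omits convergence detection. The function name `aggregate_tables` and the sample values are hypothetical.

```python
from statistics import median, pstdev

Table = list[list[float]]


def aggregate_tables(samples: list[Table]) -> tuple[Table, Table]:
    """Return (per-cell median, per-cell population std dev) over sampled tables."""
    n_rows, n_cols = len(samples[0]), len(samples[0][0])
    medians, spreads = [], []
    for r in range(n_rows):
        # Collect, for each column, the values every sample reported for cell (r, c).
        cells = [[t[r][c] for t in samples] for c in range(n_cols)]
        medians.append([median(v) for v in cells])
        spreads.append([pstdev(v) for v in cells])
    return medians, spreads


# Three hypothetical extractions of the same 2x2 chart table; one has an outlier.
samples = [
    [[10.0, 20.0], [30.0, 40.0]],
    [[10.0, 22.0], [30.0, 40.0]],
    [[10.0, 21.0], [90.0, 40.0]],
]
medians, spreads = aggregate_tables(samples)
```

In this toy input the outlier 90.0 does not move the median of its cell, while that cell's spread is the largest in the table, which is the kind of signal an uncertainty estimate based on dispersion would surface.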

6 · arXiv · cs.CL · 23d ago

The Abstraction Gap in Vision-Language Causal Reasoning

Researchers introduce a dual-probe methodology and the CAGE benchmark (49,500 questions across 5,500 images) to distinguish linguistic plausibility from faithful causal reasoning in vision-language models. An Abstraction Gap (AG) metric quantifies the normalized performance difference between text-only and chain-of-reasoning probes. Evaluating eight VLMs, seven exhibit AG exceeding 0.50—generating fluent causal text but failing structured causal chain tasks—while one model achieves near-zero AG, suggesting architectural and pretraining choices are decisive. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, pointing to a fundamental capability distinction.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Pearl's Causal Hierarchy · CAGE · Text-Only Probe
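The summary above describes the Abstraction Gap only as a normalized performance difference between the two probes and does not give the formula. The sketch below therefore uses an assumed normalization (the gap relative to text-only accuracy) purely to make the idea concrete; the paper's actual definition may differ. The function name and the accuracy values are hypothetical.

```python
def abstraction_gap(text_only_acc: float, chain_acc: float) -> float:
    """ASSUMED form of the metric: gap relative to text-only accuracy.

    Not confirmed by the source, which only says the difference is normalized.
    """
    if text_only_acc <= 0:
        raise ValueError("text_only_acc must be positive")
    return (text_only_acc - chain_acc) / text_only_acc


# Hypothetical accuracies, not results from the paper.
fluent_but_unfaithful = abstraction_gap(text_only_acc=0.8, chain_acc=0.3)
consistent = abstraction_gap(text_only_acc=0.8, chain_acc=0.8)
```

Under this assumed form, a model that answers text-only questions well but fails the chain probe scores well above 0.50, and a model that performs equally on both scores 0.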

5 · arXiv · cs.LG · 2d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking · Multimodal Progress · Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models · Act2Answer

6 · arXiv · cs.AI · 1mo ago

MedFocus: Causal Visual Attribution Framework for Chest X-ray Reasoning in Large Vision-Language Models

This paper addresses the faithfulness of visual attribution methods in Large Vision-Language Models (LVLMs) applied to chest X-ray (CXR) analysis. The authors develop a causal evaluation framework using counterfactual editing to verify whether expert-annotated regions are causally responsible for model predictions, testing 11 attribution methods across six open-source LVLMs. Finding that existing attribution methods frequently fail to identify the actual visual evidence used by models, they propose MedFocus, a concept-based attribution method using unbalanced optimal transport to localize anatomical regions and measure their causal effect on outputs. MedFocus substantially outperforms prior methods and provides spatial, concept-level, and token-level attributions.

Evaluation and Benchmarking · AI Safety Research · Unbalanced Optimal Transport · MedFocus · Chest X-ray Reasoning

5 · arXiv · cs.CL · 24d ago

EpiCurveBench: A Benchmark for Evaluating VLMs on Epidemic Curve Digitization

EpiCurveBench introduces a benchmark of 1,000 real-world epidemic curve images and a new evaluation metric (EpiCurveSimilarity, ECS) designed to assess vision-language models on time-series chart extraction, addressing limitations of existing metrics that ignore temporal structure. Evaluating six methods including three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems, the best model achieves only 52.3% ECS, revealing substantial headroom compared to saturating scores on ChartQA. ECS is validated against downstream epidemiological statistics and shown to correlate 1.5–3.6× more strongly than Dynamic Time Warping across four summary metrics. The benchmark targets the public-health use case of digitizing historical outbreak data trapped in published figures, but generalizes to any structured time-series chart-extraction task.

Evaluation and Benchmarking · Multimodal Progress · Dynamic Time Warping · EpiCurveSimilarity · ChartQA
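For context on the baseline that ECS is compared against, here is a minimal sketch of classic Dynamic Time Warping with an absolute-difference local cost. It is the textbook algorithm, not the ECS metric, whose definition is not given in the summary above. The two series are hypothetical case counts chosen to show one property of DTW that can matter for epidemic curves: a curve delayed in time can still align at zero cost.

```python
def dtw_distance(a: list[float], b: list[float]) -> float:
    """Classic dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


reference = [0.0, 2.0, 5.0, 2.0, 0.0]     # hypothetical weekly case counts
shifted = [0.0, 0.0, 2.0, 5.0, 2.0, 0.0]  # same curve, delayed by one step

distance = dtw_distance(reference, shifted)
```

Here `distance` is 0.0 even though the peak falls in a different week, because warping absorbs the delay.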

6 · arXiv · cs.LG · 26d ago

FM-CGM: Foundation Model Framework for Zero-Shot Visual Causal Generative Modeling

FM-CGM is a modular framework that decomposes visual causal reasoning into three components—concept extractor, concept manipulator, and counterfactual generator—using pretrained foundation models without task-specific causal training. The approach combines a large reasoning model for causal inference with a text-to-image diffusion model for generation, enabling zero-shot causal discovery and counterfactual image synthesis. A novel cross-attention mechanism called Causal Semantic Guidance (CSG) ensures that semantic interventions propagate correctly through causal descendants while preserving unaffected image regions. Empirical results show the framework can identify plausible causal structures and generate faithful counterfactual images.

AI Safety Research · Agent and Tool Ecosystem · cross-attention · causal generative modeling · FM-CGM

6 · arXiv · cs.AI · 1mo ago

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3× (r=0.99 across attack types). Enabling CoC generation is associated with 11.8% average improvement in trajectory accuracy, and degradation under noise is approximately linear (R²=0.957), while standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking · AI Safety Research · Vision-Language-Action model · Chain-of-Causation · autonomous driving

4 · arXiv · cs.AI · 24d ago

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

EdgeFlow augments Vision Language Models with deterministically extracted Canny edge maps as structural priors to improve flowchart-to-Mermaid conversion in industrial requirements engineering, requiring no annotated training data or fine-tuning. Evaluated on IndusReqFlow, a real-world industrial dataset, it achieves +17.39 pp node-level F1 and +16.94 pp edge-level F1 over off-the-shelf VLMs. Cross-dataset evaluation on a synthetic benchmark shows no significant gains, highlighting the gap between synthetic and industrial benchmarks for VLM-based RE tools.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Mermaid · Canny edge detection · Vision-Language Models
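The node-level and edge-level F1 scores reported above can be made concrete with a small sketch. It assumes exact matching of node labels and of directed edges, which is an assumption about the matching rule rather than something the summary states. The helper `set_f1` and the toy flowchart are hypothetical.

```python
def set_f1(predicted: set, gold: set) -> float:
    """F1 between two sets under exact matching (an assumed matching rule)."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold flowchart and a hypothetical model prediction.
gold_nodes = {"Start", "Check input", "Reject", "Accept"}
pred_nodes = {"Start", "Check input", "Accept"}
gold_edges = {("Start", "Check input"), ("Check input", "Reject"), ("Check input", "Accept")}
pred_edges = {("Start", "Check input"), ("Check input", "Accept"), ("Start", "Accept")}

node_f1 = set_f1(pred_nodes, gold_nodes)
edge_f1 = set_f1(pred_edges, gold_edges)
```

A gain stated in percentage points (pp) is an absolute difference between two such scores expressed as percentages, not a relative improvement.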