5 · arXiv cs.CL (Computation and Language) · 24d ago

Self-Ensembling Vision-Language Models for Chart Data Extraction

This paper proposes a self-ensembling method for chart-to-table extraction using vision-language models (VLMs), where multiple tabular outputs are sampled from the same VLM for a given chart image and aggregated via per-cell median over numerical values. The approach includes convergence detection and uncertainty estimation based on sample dispersion. The authors also introduce WB-ChartExtract, a new benchmark built from World Bank data featuring charts with ~7x more datapoints than ChartQA. The method achieves up to 23% relative improvement on WB-ChartExtract over single-pass VLM baselines.
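The aggregation step described above can be illustrated with a minimal sketch. It assumes each sampled table has already been parsed into a mapping from (row label, column label) to a number. The function names, the relative-tolerance stopping rule, and the use of population standard deviation as the dispersion measure are all illustrative assumptions, since the summary does not give the paper's exact convergence or uncertainty criteria.

```python
from statistics import median, pstdev


def aggregate_tables(samples):
    """Per-cell median and dispersion over sampled tables.

    samples: list of dicts mapping (row_label, column_label) -> float.
    A cell missing from one sample is simply skipped for that sample.
    """
    cells = {}
    for table in samples:
        for key, value in table.items():
            cells.setdefault(key, []).append(value)
    medians = {key: median(vals) for key, vals in cells.items()}
    # Assumption: population standard deviation stands in for the
    # paper's dispersion-based uncertainty estimate.
    dispersion = {key: pstdev(vals) for key, vals in cells.items()}
    return medians, dispersion


def has_converged(prev, curr, tol=1e-3):
    """Hypothetical stopping rule: same cells, medians moved by <= tol (relative)."""
    if prev.keys() != curr.keys():
        return False
    return all(
        abs(curr[k] - prev[k]) <= tol * max(1.0, abs(prev[k])) for k in curr
    )


def self_ensemble(sample_fn, max_samples=15, min_samples=3, tol=1e-3):
    """Draw tables from sample_fn until the per-cell medians stabilise.

    sample_fn is a placeholder for one VLM call that returns a parsed table.
    """
    samples = []
    prev = None
    medians, dispersion = {}, {}
    for _ in range(max_samples):
        samples.append(sample_fn())
        medians, dispersion = aggregate_tables(samples)
        if (
            len(samples) >= min_samples
            and prev is not None
            and has_converged(prev, medians, tol)
        ):
            break
        prev = medians
    return medians, dispersion, len(samples)
```

In this sketch a single wildly wrong sample for a cell leaves that cell's median untouched as long as most samples agree, while the dispersion value flags the cell as less certain.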

Evaluation and Benchmarking Multimodal Progress WB-ChartExtract ChartQA World Bank Self-Ensembling VLM Chart Extraction

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 24d ago

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chartographer is a framework for generating counterfactual chart variants to rigorously evaluate visual reasoning in vision-language models (VLMs), addressing the problem of shortcut-taking and prior knowledge exploitation in chart QA benchmarks. The system reverse-engineers charts into executable code, generates seed-controlled variants, and derives new ground-truth answers via executable QA logic. Evaluation of proprietary and open-source VLMs reveals that models frequently fail to generalize to counterfactual charts even after correctly answering the original, with failures most common when novel visual reasoning pathways are required.

Evaluation and Benchmarking Multimodal Progress Chartographer counterfactual chart generation Vision-Language Models

5 · arXiv · cs.CL · 24d ago

EpiCurveBench: A Benchmark for Evaluating VLMs on Epidemic Curve Digitization

EpiCurveBench introduces a benchmark of 1,000 real-world epidemic curve images and a new evaluation metric (EpiCurveSimilarity, ECS) designed to assess vision-language models on time-series chart extraction, addressing limitations of existing metrics that ignore temporal structure. Evaluating six methods including three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems, the best model achieves only 52.3% ECS, revealing substantial headroom compared to saturating scores on ChartQA. ECS is validated against downstream epidemiological statistics and shown to correlate 1.5–3.6× more strongly than Dynamic Time Warping across four summary metrics. The benchmark targets the public-health use case of digitizing historical outbreak data trapped in published figures, but generalizes to any structured time-series chart-extraction task.
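For readers unfamiliar with the Dynamic Time Warping baseline mentioned above, here is a minimal sketch of the textbook dynamic-programming formulation with absolute difference as the local cost. The function name and the choice of local cost are assumptions for illustration. This is not the EpiCurveSimilarity metric, whose definition is not given in the summary.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) Dynamic Time Warping distance.

    a, b: non-empty sequences of numbers.
    Local cost is the absolute difference between aligned points.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also aligned with earlier b
                cost[i][j - 1],      # b[j-1] also aligned with earlier a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

Under this formulation, the curves `[0, 5, 0, 0]` and `[0, 0, 5, 0]` have distance 0 because warping absorbs the one-step shift of the peak. That is one way a warping-based score can overlook timing errors, although the summary does not say whether this is the specific limitation the authors target.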

Evaluation and Benchmarking Multimodal Progress Dynamic Time Warping EpiCurveSimilarity ChartQA

4 · Hugging Face Blog · 1mo ago

A Dive into Vision-Language Models

This Hugging Face blog post provides a technical overview of vision-language model (VLM) pretraining approaches, covering architectures and training strategies used to align visual and textual representations. It surveys key models and techniques in the multimodal learning space as of early 2023. The post serves as an educational reference for practitioners working with or building VLMs.

Multimodal Progress Contrastive Language-Image Pretraining (CLIP) Vision-Language Models Hugging Face

3 · Hugging Face Blog · 1mo ago

Vision Language Models Explained

A Hugging Face blog post providing a technical overview of vision language models (VLMs), covering their architecture, training approaches, and capabilities. The post serves as an educational resource explaining how VLMs combine visual and language understanding. As a tier-2 commentary piece, it synthesizes existing knowledge rather than presenting new research findings.

Multimodal Progress Vision-Language Models Hugging Face

5 · Hugging Face Blog · 1mo ago

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress Inference Economics Vision-Language Models Hugging Face

5 · arXiv · cs.AI · 22d ago

VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench

Researchers introduce VisAnomBench, a curated benchmark augmenting public time-series anomaly datasets with natural-language rationales generated and selected from multiple large VLMs using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines on VisAnomBench by at least 21.23 and 23.87 percentage points in precision and F1, respectively. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.

Evaluation and Benchmarking Agent and Tool Ecosystem time-series anomaly detection Vision-Language Models TSB-AD-U

4 · arXiv · cs.AI · 24d ago

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

EdgeFlow augments Vision Language Models with deterministically extracted Canny edge maps as structural priors to improve flowchart-to-Mermaid conversion in industrial requirements engineering, requiring no annotated training data or fine-tuning. Evaluated on IndusReqFlow, a real-world industrial dataset, it achieves +17.39 pp node-level F1 and +16.94 pp edge-level F1 over off-the-shelf VLMs. Cross-dataset evaluation on a synthetic benchmark shows no significant gains, highlighting the gap between synthetic and industrial benchmarks for VLM-based RE tools.

Evaluation and Benchmarking Enterprise Deployment Patterns Mermaid Canny edge detection Vision-Language Models

6 · arXiv · cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
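The "token-level divergence" between teacher and student can be pictured with a small sketch. It assumes the divergence is KL(teacher || student) computed per token over next-token distributions and averaged across the rollout. The summary does not specify which divergence or direction Vision-OPD uses, so the direction, the averaging, and all function names here are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def token_level_divergence(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) along one rollout.

    teacher_logits: per-token logits from the crop-conditioned model.
    student_logits: per-token logits from the full-image-conditioned model,
    evaluated on the same (student-generated) token positions.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

The value is zero when both conditionings produce identical next-token distributions and grows as the full-image student disagrees with the crop-conditioned teacher, which is the quantity a training loop would push down.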

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Large Language Models Thinking-with-Images on-policy self-distillation