5 · arXiv cs.CL (Computation and Language) · 25d ago

EpiCurveBench: A Benchmark for Evaluating VLMs on Epidemic Curve Digitization

EpiCurveBench introduces a benchmark of 1,000 real-world epidemic curve images and a new evaluation metric, EpiCurveSimilarity (ECS), designed to assess vision-language models (VLMs) on time-series chart extraction. ECS addresses a limitation of existing metrics, which ignore temporal structure. Six methods are evaluated: three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems. The best model achieves only 52.3% ECS, leaving substantial headroom in contrast to the saturating scores on ChartQA. ECS is validated against downstream epidemiological statistics, with which it correlates 1.5–3.6× more strongly than Dynamic Time Warping across four summary metrics. The benchmark targets the public-health use case of digitizing historical outbreak data trapped in published figures, but it generalizes to any structured time-series chart-extraction task.
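For readers unfamiliar with the Dynamic Time Warping baseline named above, here is a minimal sketch of the textbook DTW distance between two series. It is not the paper's code and it is not ECS, whose definition the summary does not give. The function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW distance between two numeric sequences.

    Assumption: local cost is the absolute difference |a[i] - b[j]|.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat b[j-1]
                cost[i][j - 1],      # repeat a[i-1]
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


# A peak shifted by one time step still aligns at zero cost.
true_curve = [0, 5, 0]
shifted_curve = [0, 0, 5, 0]
print(dtw_distance(true_curve, shifted_curve))
```

The shifted-peak example shows a general property of DTW: because the alignment may stretch time, a curve whose peak lands on the wrong date can still score as a perfect match. This is only an illustration of why timing matters when scoring an extracted epidemic curve; it does not reproduce the paper's analysis.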

Evaluation and Benchmarking · Multimodal Progress · Dynamic Time Warping · EpiCurveSimilarity · ChartQA · EpiCurveBench

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.CL · 25d ago

Self-Ensembling Vision-Language Models for Chart Data Extraction

This paper proposes a self-ensembling method for chart-to-table extraction using vision-language models (VLMs), where multiple tabular outputs are sampled from the same VLM for a given chart image and aggregated via per-cell median over numerical values. The approach includes convergence detection and uncertainty estimation based on sample dispersion. The authors also introduce WB-ChartExtract, a new benchmark built from World Bank data featuring charts with ~7x more datapoints than ChartQA. The method achieves up to 23% relative improvement on WB-ChartExtract over single-pass VLM baselines.
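The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every sampled table has already been parsed into a numeric grid of identical shape, it uses the population standard deviation as the dispersion measure (the summary does not say which measure the paper uses), and it omits convergence detection. The name `ensemble_tables` is hypothetical.

```python
from statistics import median, pstdev


def ensemble_tables(samples):
    """Aggregate several sampled tables by per-cell median.

    samples: list of tables; each table is a list of rows of floats.
    Assumption: all tables share the same shape.
    Returns (aggregated_table, dispersion_table).
    """
    n_rows = len(samples[0])
    n_cols = len(samples[0][0])
    aggregated, dispersion = [], []
    for r in range(n_rows):
        agg_row, disp_row = [], []
        for c in range(n_cols):
            # Collect the same cell from every sampled table.
            values = [table[r][c] for table in samples]
            agg_row.append(median(values))
            # Assumed uncertainty proxy: spread of the samples for this cell.
            disp_row.append(pstdev(values))
        aggregated.append(agg_row)
        dispersion.append(disp_row)
    return aggregated, dispersion


# Three hypothetical single-row extractions of the same chart.
samples = [
    [[1.0, 10.0]],
    [[2.0, 10.0]],
    [[9.0, 10.0]],
]
table, spread = ensemble_tables(samples)
print(table)   # per-cell medians
print(spread)  # per-cell dispersion
```

The median is robust to a single outlying sample (the 9.0 in the first column), while a cell on which all samples agree gets zero dispersion and can be treated as low-uncertainty.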

Evaluation and Benchmarking · Multimodal Progress · WB-ChartExtract · ChartQA · World Bank · +1 more

6 · arXiv · cs.CL · 19d ago

ClinEnv: Interactive Multi-Stage Long-Horizon EHR Benchmark for Clinical Agent Evaluation

ClinEnv is a new interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions using a Longitudinal Inpatient Simulation paradigm. Each case is decomposed into sequential decision stages where models must query four specialized agents before committing to medications, procedures, and diagnoses. Across seven evaluated models, the best achieves only 0.31 decision F1, with a sharp gap between diagnosis recovery (0.51 F1) and management actions (0.17 F1). The benchmark uniquely measures information-acquisition process quality alongside outcome quality, exposing a gap invisible to static or outcome-only evaluations.

Long Context Evolution · Evaluation and Benchmarking · large language models · ClinEnv · Electronic Health Records (EHR) · +3 more

7 · arXiv · cs.AI · 19d ago

Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.

Long Context Evolution · Evaluation and Benchmarking · Multimodal Large Language Models · Moment-Video · Seed-2.0-Pro · +4 more

5 · arXiv · cs.CL · 25d ago

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chartographer is a framework for generating counterfactual chart variants to rigorously evaluate visual reasoning in vision-language models (VLMs), addressing the problem of shortcut-taking and prior knowledge exploitation in chart QA benchmarks. The system reverse-engineers charts into executable code, generates seed-controlled variants, and derives new ground-truth answers via executable QA logic. Evaluation of proprietary and open-source VLMs reveals that models frequently fail to generalize to counterfactual charts even after correctly answering the original, with failures most common when novel visual reasoning pathways are required.

Evaluation and Benchmarking · Multimodal Progress · Chartographer · counterfactual chart generation · Vision-Language Models · +1 more

4 · arXiv · cs.AI · 1mo ago

TempGlitch: Benchmark for Evaluating VLMs on Temporal Glitch Detection in Gameplay Videos

TempGlitch is a new benchmark designed to evaluate vision-language models on temporal glitch detection in gameplay videos, distinguishing temporal anomalies (visible only across ordered frames) from spatial ones (visible in a single frame). The benchmark covers five temporal glitch types with paired glitch-free videos for binary evaluation, and tests 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Results show current VLMs perform near chance on temporal glitches, with neither denser frame sampling nor larger model size reliably improving detection. The work highlights a systematic gap in VLM temporal reasoning capabilities relevant to automated video quality assurance.

Evaluation and Benchmarking · Multimodal Progress · temporal glitch detection · gameplay video quality assurance · TempGlitch · +1 more

5 · arXiv · cs.AI · 23d ago

VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench

Researchers introduce VisAnomBench, a curated benchmark that augments public time-series anomaly datasets with natural-language rationales generated by multiple large VLMs and selected using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines on VisAnomBench by at least 21.23 and 23.87 percentage points in precision and F1, respectively. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.

Evaluation and Benchmarking · Agent and Tool Ecosystem · time-series anomaly detection · Vision-Language Models · TSB-AD-U · +3 more

4 · arXiv · cs.AI · 25d ago

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

EdgeFlow augments Vision Language Models with deterministically extracted Canny edge maps as structural priors to improve flowchart-to-Mermaid conversion in industrial requirements engineering, requiring no annotated training data or fine-tuning. Evaluated on IndusReqFlow, a real-world industrial dataset, it achieves +17.39 pp node-level F1 and +16.94 pp edge-level F1 over off-the-shelf VLMs. Cross-dataset evaluation on a synthetic benchmark shows no significant gains, highlighting the gap between synthetic and industrial benchmarks for VLM-based RE tools.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Mermaid · Canny edge detection · Vision-Language Models · +3 more

6 · arXiv · cs.AI · 2d ago

Contagion Networks: formal framework for measuring evaluator bias propagation in multi-agent LLM systems

A new arXiv preprint introduces Contagion Networks, a formal framework for quantifying how systematic evaluation biases spread across interacting LLM agents in multi-agent systems. Using a controlled 3-agent experiment with DeepSeek-chat, the authors measure a Cross-Agent Contagion Matrix and find that homogeneous-model agents produce contagion coefficients 3–5× weaker than cross-model settings. A key practical finding is that increasing the evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, which offers a concrete mitigation strategy. The authors release an open-source experimental framework alongside the paper.

Evaluation and Benchmarking · AI Safety Research · Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems · MM-EPC · deepseek-chat · +1 more