5 · arXiv cs.AI (Artificial Intelligence) · 22d ago

VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench

Researchers introduce VisAnomBench, a curated benchmark that augments public time-series anomaly datasets with natural-language rationales, which are generated by multiple large VLMs and then selected using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines on VisAnomBench by at least 21.23 and 23.87 percentage points in precision and F1, respectively. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Progress · time-series anomaly detection · Vision-Language Models · TSB-AD-U · VisAnomBench · VisAnomReasoner

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

7 · arXiv · cs.AI · 18d ago

Moment-Video: Benchmark Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video is a new benchmark of 1,000 human-verified video-QA pairs designed to evaluate how well video multimodal large language models (MLLMs) handle brief, localized visual events that may span only a few frames. The benchmark covers 7 domains and 25 subcategories across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Evaluation of 33 proprietary and open-source models reveals severe deficiencies: the best model (Seed-2.0-Pro) achieves only 39.6% accuracy, while most open-source models score below 25%. Diagnostic analyses show that denser frame sampling helps but does not resolve the bottleneck, pointing to fundamental limitations in how current video MLLMs represent and preserve transient visual evidence.

Long Context Evolution · Evaluation and Benchmarking · Multimodal Large Language Models · Moment-Video · Seed-2.0-Pro · +4 more

6 · arXiv · cs.AI · 1mo ago

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3× (r=0.99 across attack types). Enabling CoC generation is associated with an 11.8% average improvement in trajectory accuracy. Degradation under noise is approximately linear (R²=0.957), and standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking · AI Safety Research · Vision-Language-Action model · Chain-of-Causation · autonomous driving · +3 more

5 · arXiv · cs.AI · 25d ago

WSADBench: A Unified Benchmark for Weakly Supervised Anomaly Detection

WSADBench is the first benchmark to unify evaluation across the three primary weakly supervised anomaly detection (WSAD) paradigms (incomplete, inexact, and inaccurate supervision), testing 36 algorithms across 4 modalities in over 700K experiments. Key findings challenge the isolation of current WSAD research directions: the supervision scenarios are strongly correlated, and specialized WSAD methods are quickly outperformed by tabular foundation models as label availability increases. The benchmark also reveals inconsistent utility of unlabeled data and asymmetric model sensitivity to label noise types. Code and datasets are released as open source.

Evaluation and Benchmarking · WSADBench · weakly supervised anomaly detection · SUFE-AILAB · +1 more

5 · arXiv · cs.LG · 2d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking · Multimodal Progress · Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models · Act2Answer

4 · arXiv · cs.AI · 1mo ago

TempGlitch: Benchmark for Evaluating VLMs on Temporal Glitch Detection in Gameplay Videos

TempGlitch is a new benchmark designed to evaluate vision-language models on temporal glitch detection in gameplay videos, distinguishing temporal anomalies (visible only across ordered frames) from spatial ones (visible in a single frame). The benchmark covers five temporal glitch types with paired glitch-free videos for binary evaluation, and tests 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Results show current VLMs perform near chance on temporal glitches, with neither denser frame sampling nor larger model size reliably improving detection. The work highlights a systematic gap in VLM temporal reasoning capabilities relevant to automated video quality assurance.

Evaluation and Benchmarking · Multimodal Progress · temporal glitch detection · gameplay video quality assurance · TempGlitch · +1 more

6 · arXiv · cs.CL · 9d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · OpenMedReason · OpenMedReason-Bench · +1 more

5 · arXiv · cs.CL · 24d ago

Self-Ensembling Vision-Language Models for Chart Data Extraction

This paper proposes a self-ensembling method for chart-to-table extraction using vision-language models (VLMs), where multiple tabular outputs are sampled from the same VLM for a given chart image and aggregated via per-cell median over numerical values. The approach includes convergence detection and uncertainty estimation based on sample dispersion. The authors also introduce WB-ChartExtract, a new benchmark built from World Bank data featuring charts with ~7x more datapoints than ChartQA. The method achieves up to 23% relative improvement on WB-ChartExtract over single-pass VLM baselines.

Evaluation and Benchmarking · Multimodal Progress · WB-ChartExtract · ChartQA · World Bank · +1 more
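
The per-cell median aggregation in the chart-extraction summary above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the sampled tables are already parsed into numeric grids of identical shape, uses population standard deviation as a stand-in for the paper's dispersion measure, and omits convergence detection entirely. The function name `self_ensemble` and the sample data are hypothetical.

```python
from statistics import median, pstdev


def self_ensemble(tables):
    """Aggregate same-shaped numeric tables by per-cell median.

    Returns (consensus, dispersion). Each dispersion cell holds the
    population standard deviation of that cell across the samples,
    used here as an assumed proxy for extraction uncertainty.
    """
    if not tables:
        raise ValueError("need at least one sampled table")
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    consensus, dispersion = [], []
    for r in range(n_rows):
        med_row, disp_row = [], []
        for c in range(n_cols):
            # Collect the same cell from every sampled table.
            values = [t[r][c] for t in tables]
            med_row.append(median(values))
            disp_row.append(pstdev(values))
        consensus.append(med_row)
        dispersion.append(disp_row)
    return consensus, dispersion


# Three hypothetical samples of one 2x2 table; the third misreads one cell.
samples = [
    [[10.0, 20.0], [30.0, 40.0]],
    [[11.0, 20.0], [30.0, 44.0]],
    [[10.0, 95.0], [31.0, 42.0]],
]
consensus, dispersion = self_ensemble(samples)
print(consensus)
print(dispersion)
```

In this toy input the misread value 95.0 does not shift the consensus for its cell, because the median ignores a single outlier among three samples, while the same cell shows the largest dispersion and could be flagged for review.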

5 · Hugging Face Blog · 1mo ago

Vision Language Models (Better, faster, stronger)

A Hugging Face blog post surveys the state of vision-language models (VLMs) in 2025, covering advances in architecture, training, efficiency, and deployment. The post reviews progress across major open and closed VLMs, highlighting trends in multimodal capability, speed improvements, and practical deployment patterns. As a tier-2 commentary piece, it synthesizes the current landscape rather than announcing new research.

Open Weights Progress · Inference Economics · Vision-Language Models · Hugging Face · +1 more