4arXiv cs.CL (Computation and Language)·7h ago

OCR-Robust benchmark evaluates VLM robustness to visual perturbations on OCR-reasoning tasks

Researchers introduce OCR-Robust, a benchmark of 812 samples designed to evaluate how vision-language models handle OCR-reasoning tasks under controlled visual degradation. The benchmark covers documents, scene text, charts, geometry, and tables, applying 5 perturbation types at 3 severity levels each, and evaluates 18 models using metrics including Relative Corruption Retention and a composite Corruption Robustness Index. Key findings show that higher clean accuracy does not guarantee robustness, and that chart and table inputs are substantially more fragile under perturbation than document-like inputs.

Evaluation and Benchmarking Multimodal Progress How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations OCR-Robust

Related guides (2)

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.AI·27d ago·source ↗

RoboWits: Benchmark for Robotic Creative Problem Solving Under Unexpected Conditions

RoboWits is a new bi-manual robotic benchmark designed to evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions in robotics. The authors introduce an automated multi-agent task generation pipeline that produces 30 seed tasks and 208 mutated tasks spanning geometry, material, and assembly-based reasoning. Benchmarking results show that pre-trained Vision-Language-Action models (VLAs) achieve limited success on seed tasks after fine-tuning but fail on mutated variants, exposing brittleness in reasoning and strategy adaptation. The benchmark highlights a significant gap between skill-level execution and genuine cognitive reasoning in current robotic systems.

Evaluation and Benchmarking Agent and Tool Ecosystem Vision-Language-Action models RoboWits multi-agent cooperative framework +3 more

6arXiv · cs.AI·1mo ago·source ↗

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

This paper presents a controlled robustness study of Vision-Language-Action (VLA) models in autonomous driving, evaluating Alpamayo R1 (10B parameters) across ~18,000 inference trials under eight sensor perturbation types including noise, lighting extremes, and fog. The key finding is that Chain-of-Causation (CoC) reasoning consistency is a high-fidelity proxy for trajectory reliability: when CoC explanations change post-perturbation, trajectory deviation spikes 5.3× (r=0.99 across attack types). Enabling CoC generation is associated with 11.8% average improvement in trajectory accuracy, and degradation under noise is approximately linear (R²=0.957), while standard preprocessing defenses offer only marginal benefit.

Evaluation and Benchmarking AI Safety Research Vision-Language-Action model Chain-of-Causation autonomous driving +3 more

5Hugging Face Blog·1mo ago·source ↗

Introducing ConTextual: Benchmark for Joint Text-Image Reasoning in Text-Rich Scenes

Hugging Face introduces ConTextual, a new benchmark evaluating multimodal models on their ability to jointly reason over text and images in text-rich scenes. The benchmark targets a specific capability gap where models must integrate visual and textual information simultaneously rather than treating them independently. A leaderboard accompanies the benchmark to track model progress on this task.

Evaluation and Benchmarking Multimodal Progress Hugging Face ConTextual

5arXiv · cs.AI·7d ago·source ↗

Multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2

Researchers introduce a new benchmark of 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) specifically for detecting AI-generated text-rich images produced by OpenAI's GPT-Image-2. Five zero-shot detectors are evaluated, revealing highly domain-dependent performance and severe sensitivity to JPEG compression even in the strongest conventional detector. A multimodal VLM is also explored as a detector, showing promise but limitations on structured formats. The work highlights a gap in existing benchmarks that focus on object-centric rather than text-layout-centric images.

Evaluation and Benchmarking Multimodal Progress GPT-Image-2 OpenAI A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

7arXiv · cs.AI·28d ago·source ↗

CORE: Contrastive Reflection for Sample-Efficient Reasoning Improvement

CORE (Contrastive Reflection) is a non-parametric learning algorithm that improves LLM reasoning by comparing successful and unsuccessful reasoning traces to generate compact natural-language 'insights' about reasoning strategies. Across four reasoning tasks, CORE outperforms both parametric baselines (GRPO/RLVR) and non-parametric baselines (GEPA, episodic RAG, MemRL) under fixed rollout budgets, achieving comparable or better gains with as few as five training samples. The method is also more context-efficient than prompt-optimization approaches, storing learned knowledge as interpretable natural-language descriptions rather than raw traces or weight updates. The results suggest contrastive distillation of reasoning traces may be a more efficient route to self-improvement than traditional fine-tuning.

Evaluation and Benchmarking Inference Economics RLVR GRPO CORE (Contrastive Reflection)+5 more

4arXiv · cs.CL·10d ago·source ↗

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

Evaluation and Benchmarking MoDiCoL

5arXiv · cs.CL·10d ago·source ↗

CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR

Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in RLVR-trained large vision-language models, showing the problem persists throughout GRPO training and inference. They propose CORA, which introduces a lightweight plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate task and consistency optimization. Experiments across multimodal reasoning benchmarks show CORA improves both task performance and reasoning faithfulness.

Evaluation and Benchmarking Alignment and RLHF CORA Hybrid Reward Advantage Splitting GRPO (Group Relative Policy Optimization)+1 more

5arXiv · cs.AI·27d ago·source ↗

VisAnomReasoner: Efficient VLM for Time-Series Anomaly Detection via VisAnomBench

Researchers introduce VisAnomBench, a curated benchmark augmenting public time-series anomaly datasets with natural-language rationales generated and selected from multiple large VLMs using task-specific rewards. Fine-tuning on this benchmark produces VisAnomReasoner, a parameter-efficient vision-language model that outperforms all baselines by at least 21.23 and 23.87 percentage points in precision and F1 on VisAnomBench. Cross-benchmark evaluation on TSB-AD-U shows further generalization gains of 9.57 and 13.39 percentage points in precision and F1, respectively.

Evaluation and Benchmarking Agent and Tool Ecosystem time-series anomaly detection Vision-Language Models TSB-AD-U +3 more