4 · arXiv cs.CL (Computation and Language) · 1mo ago

Manga109-v2026: Revised Benchmark Dataset for Manga OCR and Multimodal Understanding

Researchers revisit the widely used Manga109 dataset and identify five categories of annotation issues, including transcription errors, missing text regions, and under-segmented speech balloons. They construct Manga109-v2026 by combining OCR-based issue detection with manual revision, correcting approximately 29,000 dialogue annotations. The updated dataset is intended to better align with modern OCR and multimodal manga understanding systems while preserving manga-specific expressive structures.

Evaluation and Benchmarking · Multimodal Progress · Manga109-v2026 · Manga109

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

Docmatix: A Large-Scale Dataset for Document Visual Question Answering

Hugging Face released Docmatix, a large-scale dataset designed for Document Visual Question Answering (DocVQA) tasks. The dataset aims to address the scarcity of high-quality training data for document understanding in multimodal models. It is intended to improve fine-tuning of vision-language models on document comprehension tasks.

Evaluation and Benchmarking · Multimodal Progress · Hugging Face · Document Visual Question Answering · Docmatix

5 · arXiv · cs.LG · 1mo ago

AUDITS: A Comprehensive Benchmark for Image Manipulation Localization Across Multiple Analysis Axes

Researchers introduce AUDITS (Analysis Under Domain-shifts, qualIty, Type, and Size), a benchmark of over 530K images designed to evaluate image manipulation detection across multiple axes including domain shift, manipulation type, and size. The dataset draws from user and news photos and incorporates recent diffusion-based inpaintings. Experiments assess the robustness of existing manipulation detection methods under various domain shifts, aiming to advance development of more generalizable detection approaches.

Evaluation and Benchmarking · AI Safety Research · AUDITS · image manipulation detection · image manipulation localization

5 · arXiv · cs.LG · 16d ago

OpAI-Bench: Benchmark for detecting AI text across progressive human-AI co-editing workflows

Researchers introduce OpAI-Bench, a benchmark for studying AI-text detection across progressive human-to-AI document revision workflows, covering document, sentence, token, and span granularities. Starting from human-written documents, the benchmark constructs nine sequentially revised versions per sample under five AI edit operations and varying AI coverage levels across four domains. A key finding is that mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints, revealing non-monotonic detection patterns absent from existing benchmarks. The work addresses a gap in AI-text detection research, as real-world documents increasingly result from iterative human-AI co-editing rather than pure generation.

Evaluation and Benchmarking · AI Safety Research · VILA-Lab · OpAI-Bench

5 · arXiv · cs.AI · 3d ago

Multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2

Researchers introduce a new benchmark of 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) specifically for detecting AI-generated text-rich images produced by OpenAI's GPT-Image-2. Five zero-shot detectors are evaluated, revealing highly domain-dependent performance and severe sensitivity to JPEG compression even in the strongest conventional detector. A multimodal VLM is also explored as a detector, showing promise but limitations on structured formats. The work highlights a gap in existing benchmarks that focus on object-centric rather than text-layout-centric images.

Evaluation and Benchmarking · Multimodal Progress · GPT-Image-2 · OpenAI · A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

5 · Hugging Face Blog · 1mo ago

Introducing ConTextual: Benchmark for Joint Text-Image Reasoning in Text-Rich Scenes

Hugging Face introduces ConTextual, a new benchmark evaluating multimodal models on their ability to jointly reason over text and images in text-rich scenes. The benchmark targets a specific capability gap where models must integrate visual and textual information simultaneously rather than treating them independently. A leaderboard accompanies the benchmark to track model progress on this task.

Evaluation and Benchmarking · Multimodal Progress · Hugging Face · ConTextual

4 · Hugging Face Blog · 1mo ago

Finetuning olmOCR to be a faithful OCR-Engine

TNG Technology Consulting describes a fine-tuning approach applied to olmOCR, a vision-language model designed for document OCR tasks, to improve its faithfulness and reduce hallucinations. The post covers dataset construction, training methodology, and evaluation results showing improved accuracy on document extraction benchmarks. This represents a practical community contribution to the open-weights document-understanding ecosystem.

Open Weights Progress · Agent and Tool Ecosystem · Hugging Face · olmOCR · TNG Technology Consulting

4 · arXiv · cs.CL · 6d ago

MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift

Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum simulating incremental updates and evaluates three continual learning strategies for robustness acquisition and forgetting.

Evaluation and Benchmarking · MoDiCoL

5 · arXiv · cs.CL · 24d ago

IPO-Mine: Toolkit and Dataset for Multimodal Analysis of Long IPO Filings

Researchers introduce IPO-Mine, comprising an open-source toolkit and a large-scale dataset of over 109,000 IPO filings (1994–2026) with 76,000+ extracted images, structured for section-level analysis. The toolkit parses long regulatory documents (often exceeding 500,000 tokens) into standardized text and image outputs. Benchmark tasks on financial chart quality and misleadingness assessment reveal that state-of-the-art multimodal models frequently diverge from expert human judgments, exposing alignment gaps in long-document multimodal reasoning. The dataset and code are publicly released under CC-BY-4.0.

Long Context Evolution · Evaluation and Benchmarking · IPO-Dataset · IPO-Toolkit · IPO-Mine