Manga109-v2026: Revised Benchmark Dataset for Manga OCR and Multimodal Understanding
Researchers revisit the widely used Manga109 dataset and identify five categories of annotation issues, including transcription errors, missing text regions, and under-segmented speech balloons. They construct Manga109-v2026 by combining OCR-based issue detection with manual revision, correcting approximately 29,000 dialogue annotations. The updated dataset is intended to better align with modern OCR and multimodal manga understanding systems while preserving manga-specific expressive structures.
Related events (8)
Docmatix: A Large-Scale Dataset for Document Visual Question Answering
Hugging Face released Docmatix, a large-scale dataset designed for Document Visual Question Answering (DocVQA) tasks. The dataset aims to address the scarcity of high-quality training data for document understanding in multimodal models. It is intended to improve fine-tuning of vision-language models on document comprehension tasks.
AUDITS: A Comprehensive Benchmark for Image Manipulation Localization Across Multiple Analysis Axes
Researchers introduce AUDITS (Analysis Under Domain-shifts, qualIty, Type, and Size), a benchmark of over 530K images designed to evaluate image manipulation detection across multiple axes including domain shift, manipulation type, and size. The dataset draws from user and news photos and incorporates recent diffusion-based inpaintings. Experiments assess the robustness of existing manipulation detection methods under various domain shifts, aiming to advance development of more generalizable detection approaches.
OpAI-Bench: Benchmark for detecting AI text across progressive human-AI co-editing workflows
Researchers introduce OpAI-Bench, a benchmark for studying AI-text detection across progressive human-to-AI document revision workflows, covering document, sentence, token, and span granularities. Starting from human-written documents, the benchmark constructs nine sequentially revised versions per sample under five AI edit operations and varying AI coverage levels across four domains. A key finding is that mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints, a non-monotonic detection pattern absent from existing benchmarks. The work addresses a gap in AI-text detection research, as real-world documents increasingly result from iterative human-AI co-editing rather than pure generation.
Multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2
Researchers introduce a new benchmark of 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) specifically for detecting AI-generated text-rich images produced by OpenAI's GPT-Image-2. Five zero-shot detectors are evaluated, revealing highly domain-dependent performance and severe sensitivity to JPEG compression even in the strongest conventional detector. A multimodal VLM is also explored as a detector, showing promise but limitations on structured formats. The work highlights a gap in existing benchmarks that focus on object-centric rather than text-layout-centric images.
Introducing ConTextual: Benchmark for Joint Text-Image Reasoning in Text-Rich Scenes
Hugging Face introduces ConTextual, a new benchmark evaluating multimodal models on their ability to jointly reason over text and images in text-rich scenes. The benchmark targets a specific capability gap where models must integrate visual and textual information simultaneously rather than treating them independently. A leaderboard accompanies the benchmark to track model progress on this task.
Finetuning olmOCR to be a faithful OCR-Engine
TNG Technology Consulting describes a fine-tuning approach applied to olmOCR, a vision-language model designed for document OCR tasks, to improve its faithfulness and reduce hallucinations. The post covers dataset construction, training methodology, and evaluation results showing improved accuracy on document extraction benchmarks. This represents a practical community contribution to the open-weights document-understanding ecosystem.
MoDiCoL: A modular continual learning dataset for diagnosing ASR robustness under distribution shift
Researchers introduce MoDiCoL, a benchmark dataset designed to evaluate automatic speech recognition robustness under co-occurring real-world distribution shifts, including accents, recording conditions, speech impairments, and noise. Unlike existing benchmarks that isolate these factors, MoDiCoL enables controlled analysis across linguistic, speaker, and acoustic dimensions simultaneously. The paper also proposes a continual learning curriculum that simulates incremental updates and evaluates three continual learning strategies in terms of robustness acquisition and forgetting.
IPO-Mine: Toolkit and Dataset for Multimodal Analysis of Long IPO Filings
Researchers introduce IPO-Mine, comprising an open-source toolkit and a large-scale dataset of over 109,000 IPO filings (1994–2026) with 76,000+ extracted images, structured for section-level analysis. The toolkit parses long regulatory documents (often exceeding 500,000 tokens) into standardized text and image outputs. Benchmark tasks on financial chart quality and misleadingness assessment reveal that state-of-the-art multimodal models frequently diverge from expert human judgments, exposing alignment gaps in long-document multimodal reasoning. The dataset and code are publicly released under CC-BY-4.0.

