5 · Hugging Face Blog · 1mo ago

Introducing ConTextual: Benchmark for Joint Text-Image Reasoning in Text-Rich Scenes

Hugging Face introduces ConTextual, a new benchmark evaluating multimodal models on their ability to jointly reason over text and images in text-rich scenes. The benchmark targets a specific capability gap where models must integrate visual and textual information simultaneously rather than treating them independently. A leaderboard accompanies the benchmark to track model progress on this task.

Evaluation and Benchmarking · Multimodal Progress · Hugging Face · ConTextual

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

Evaluating Audio Reasoning with Big Bench Audio

Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.

Evaluation and Benchmarking · Multimodal Progress · Big Bench Audio · Hugging Face · Big Bench

5 · Qwen · 15d ago

Qwen releases Qwen-Image-Bench, a multimodal judge/evaluation model

Qwen has released Qwen-Image-Bench on Hugging Face, an image-text-to-text model tagged as a judge-model for evaluation and benchmarking purposes. The model supports both English and Chinese and appears designed to evaluate text-to-image outputs. With 8,572 downloads and 50 likes shortly after release, it has attracted modest early interest.

Evaluation and Benchmarking · Open Weights Progress · Qwen-Image-Bench · Qwen · Hugging Face

5 · arXiv · cs.CL · 11d ago

Benchmark for view-level visual evidence identification in multi-view MLLMs for autonomous driving

A new arXiv preprint introduces a multi-view visual question answering benchmark targeting evidence-source identification in autonomous driving scenarios. Given six synchronized NuScenes camera views and a question, models must identify which camera view supports the answer, not just produce a correct answer. The 122-pair benchmark spans causality, counterfactual reasoning, and intent prediction, and exposes grounding failures that answer-only evaluation misses. The work addresses a meaningful gap between answer accuracy and correct visual grounding in safety-critical multimodal systems.

Evaluation and Benchmarking · Multimodal Progress · NuScenes · Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

5 · arXiv · cs.AI · 2d ago

Multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2

Researchers introduce a new benchmark of 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) specifically for detecting AI-generated text-rich images produced by OpenAI's GPT-Image-2. Five zero-shot detectors are evaluated, revealing highly domain-dependent performance and severe sensitivity to JPEG compression even in the strongest conventional detector. A multimodal VLM is also explored as a detector, showing promise but limitations on structured formats. The work highlights a gap in existing benchmarks that focus on object-centric rather than text-layout-centric images.

Evaluation and Benchmarking · Multimodal Progress · GPT-Image-2 · OpenAI · A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

5 · Hugging Face Blog · 1mo ago

Launching the Artificial Analysis Text to Image Leaderboard & Arena

Hugging Face and Artificial Analysis are launching a combined leaderboard and arena for evaluating text-to-image models. The leaderboard tracks quality, speed, and cost metrics across leading image generation models, while the arena component collects human preference votes for side-by-side comparisons. This provides a structured benchmark for comparing commercial and open-weight image generation systems.

Evaluation and Benchmarking · Inference Economics · Artificial Analysis · Artificial Analysis Text to Image Leaderboard · Hugging Face

4 · Hugging Face Blog · 1mo ago

A Dive into Text-to-Video Models

This Hugging Face blog post surveys text-to-video generation as of mid-2023, covering the main approaches, architectures, and key models in the then-emerging space. As a tier-2 commentary piece, it synthesizes existing work rather than presenting novel research.

Multimodal Progress · text-to-video generation · Hugging Face

6 · arXiv · cs.CL · 23d ago

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution · Evaluation and Benchmarking · VisualMem · long-term memory · Personal Visual Memory Benchmark

5 · Hugging Face Blog · 1mo ago

Introducing HELMET: Holistically Evaluating Long-context Language Models

HELMET is a new benchmark designed to holistically evaluate long-context language models across diverse real-world tasks rather than synthetic needle-in-a-haystack tests. The benchmark covers multiple task categories including retrieval, reasoning, summarization, and code, aiming to provide more reliable and comprehensive assessment of long-context capabilities. It is introduced via the Hugging Face blog, suggesting an open release with associated tooling for the community.

Long Context Evolution · Evaluation and Benchmarking · HELMET · Hugging Face