Entity · benchmark

MMMU

benchmark · active · mmmu-0b889ef6 · 2 events · first seen May 18, 2026

Aliases: MMMU

Co-occurring entities

MathVista · When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks · BLINK · Direct Preference Optimization (DPO) · Qwen-3-VL-2B · Qwen-2.5-VL-3B · Qwen2.5-VL · Mistral AI · MT-Bench · Apache 2.0 · Claude 3.5 Sonnet · LLaVA OneVision 72B · GPT-4o · Pixtral 12B · Mistral Nemo

More like this (12)

MMLU · MMMU-Pro · u-muP · MMVU · MMLU-Pro · MAML · MMDP · MMAU · MMMC-Code · GGML · Global-MMLU · MedMCQA

Recent events (2)

6 · arXiv · cs.AI · Jun 15, 2026

Self-improving VLMs can silently regress when verifier quality is task-mismatched

A new arXiv paper shows that verifier-driven self-DPO, a common recipe for self-improving vision-language models (VLMs), can silently degrade the student model when the verifier's rubric accuracy on the target task is too low. Experiments on Qwen-3-VL-2B and Qwen-2.5-VL-3B across MathVista, MMMU, and BLINK show regressions of 3.4–10.9 percentage points below frozen baselines. Counterintuitively, verifiers that are more accurate but still wrong cause larger regressions than near-random ones. The authors explain the effect mechanistically via a variance theorem for progress-gated replay and offer operational guidance: measure rubric accuracy on the target task before running any verifier-driven loop, and rank verifiers by task-specific quality rather than parameter count.

Evaluation and Benchmarking · Alignment and RLHF · MathVista · When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks · BLINK · +5 more
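
The operational guidance in the summary above (measure a verifier's accuracy on the target task first, then rank verifiers by that number rather than by size) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Verifier` signature, the toy examples, and the two stand-in verifiers are hypothetical and are not taken from the paper, whose actual protocol is not described in this summary.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a verifier takes (question, candidate_answer)
# and returns True if it accepts the answer as correct.
Verifier = Callable[[str, str], bool]

# Each labeled example: (question, candidate_answer, gold_is_correct).
Example = Tuple[str, str, bool]


def rubric_accuracy(verifier: Verifier, examples: Sequence[Example]) -> float:
    """Fraction of labeled target-task examples where the verifier's
    verdict agrees with the gold label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    hits = sum(verifier(q, a) == gold for q, a, gold in examples)
    return hits / len(examples)


def rank_verifiers(
    verifiers: Dict[str, Verifier], examples: Sequence[Example]
) -> List[Tuple[str, float]]:
    """Sort verifiers by measured target-task accuracy, best first.
    Parameter count plays no role in the ranking."""
    scored = [(name, rubric_accuracy(v, examples)) for name, v in verifiers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy held-out set for the target task (illustrative data only).
examples: List[Example] = [
    ("2+2", "4", True),
    ("2+2", "5", False),
    ("3*3", "9", True),
    ("3*3", "6", False),
]

# Two stand-in verifiers: one accepts everything, one checks a lookup table.
known_answers = {"2+2": "4", "3*3": "9"}
verifiers: Dict[str, Verifier] = {
    "always_accept": lambda q, a: True,
    "lookup_table": lambda q, a: known_answers.get(q) == a,
}

ranking = rank_verifiers(verifiers, examples)
for name, acc in ranking:
    print(f"{name}: {acc:.2f}")
```

The summary does not state what accuracy is high enough to run a self-improvement loop safely, so the sketch only measures and ranks; any cutoff would have to come from the paper itself.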

5 · Mistral AI News · May 18, 2026

Pixtral 12B: Mistral AI's First Multimodal Model (Now Deprecated)

Mistral AI released Pixtral 12B in September 2024 as its first natively multimodal model, combining a new 400M-parameter vision encoder trained from scratch with a 12B multimodal decoder based on Mistral Nemo. The model supports variable image sizes and aspect ratios and a 128K-token context window for multiple images, and it scored 52.5% on MMMU while maintaining strong text-only benchmark performance. It was released under the Apache 2.0 license. The model is now deprecated and has been replaced by newer vision and multimodal models from Mistral.

Frontier Model Releases · Open Weights Progress · Qwen2.5-VL · Mistral AI · MT-Bench · +8 more