Entity · benchmark

MathVista

benchmarkactivemathvista-04ce9414·3 events·first seen May 18, 2026

Aliases: MathVista

Co-occurring entities

More like this (12)

Math-Verify FrontierMath NuminaMath MATH DeepMath MATH benchmark MATH-500 MedCalc AdvancedMathBench Qwen2-Math-Instruct Minerva Math Mathlib

Recent events (3)

6arXiv · cs.AI·Jun 15, 2026·source ↗

Self-improving VLMs can silently regress when verifier quality is task-mismatched

A new arXiv paper demonstrates that verifier-driven self-DPO, a common recipe for self-improving visual-language models, can silently degrade student model performance when the verifier's task-rubric accuracy is insufficient for the target task. Experiments on Qwen-3-VL-2B and Qwen-2.5-VL-3B across MathVista, MMMU, and BLINK show regressions of 3.4–10.9 percentage points below frozen baselines, with the counterintuitive finding that more accurate-but-still-wrong verifiers cause larger regressions than near-random ones. The authors provide a mechanistic explanation via a variance theorem for progress-gated replay and offer operational guidance: measure target-task rubric accuracy before running any verifier-driven loop and rank verifiers by task-specific quality rather than parameter count.

Evaluation and Benchmarking Alignment and RLHF MathVista When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks BLINK +5 more

7Mistral Ai News·May 18, 2026·source ↗

Pixtral Large: Mistral AI's 124B Open-Weights Multimodal Model

Mistral AI released Pixtral Large, a 124B open-weights multimodal model built on Mistral Large 2, featuring a 1B parameter vision encoder and 128K context window supporting at least 30 high-resolution images. The model claims state-of-the-art results on MathVista, DocVQA, and ChartQA, outperforming GPT-4o and Gemini-1.5 Pro on several benchmarks, and leads the LMSys Vision Leaderboard among open-weights models by ~50 ELO points. Simultaneously, Mistral updated its text model to Mistral Large 24.11 with improvements in long-context understanding, function calling, and RAG/agentic workflows. Note: the model has since been deprecated and replaced by newer Mistral vision models.

Frontier Model Releases Evaluation and Benchmarking Google Cloud Mistral AI MT-Bench +15 more

7Qwen Research·May 18, 2026·source ↗

Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding

Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The model claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos exceeding 20 minutes in length for question answering, dialog, and content creation tasks.

Frontier Model Releases Evaluation and Benchmarking Qwen2.5-VL RealWorldQA DocVQA +6 more