MMMU-Pro
mmmu-pro-7769523a·5 events·first seen 18d agoAliases: MMMU-Pro, MMMU Pro
Co-occurring entities
More like this (12)
Recent events (5)
Google Launches Gemini 3.5 Flash: Mid-Tier Model With Agentic Gains at 3x Higher Price
Google released Gemini 3.5 Flash at Google I/O 2026, a mixture-of-experts multimodal model with adjustable reasoning levels, thought preservation across multi-turn conversations, and a 1M-token context window. The model tops APEX-Agents-AA and MMMU-Pro benchmarks among Flash-tier models but trails leading frontier models on overall intelligence, knowledge, and coding. Pricing is $1.50/$9.00 per million input/output tokens—three times the cost of its predecessor Gemini 3 Flash—raising questions about Google's positioning of Flash as a mid-tier rather than budget offering. Independent testing found it costs more in practice than Gemini 3.1 Pro despite Google's claims of competitive pricing.
Gemini 3.5 Flash Launch, AI FDE Job Trends, AI Act Delays, and Agent-Driven Web Traffic
Google launched Gemini 3.5 Flash, a mid-tier multimodal mixture-of-experts model with improved agentic capabilities, visual understanding, and speed, priced at $1.50/$9.00 per million input/output tokens — three times the cost of its predecessor Gemini 3 Flash. The model supports up to 1M token context, adjustable reasoning levels, and thought preservation across multi-turn conversations, and tops the Artificial Analysis APEX-Agents-AA and MMMU-Pro benchmarks. The issue also covers Andrew Ng's commentary on the rise of AI Forward Deployed Engineers versus the broader AI Engineer role, plus news items on EU AI Act implementation delays and AI agents driving measurable online traffic shifts.
Meta Introduces Muse Spark: First Closed-Weights Model from Superintelligence Labs
Meta released Muse Spark, its first AI model in roughly a year and the debut product of its Superintelligence Labs, marking a significant departure from its open-weights Llama strategy. The natively multimodal reasoning model supports tool use and multi-agent orchestration, achieves fourth place on the Artificial Analysis Intelligence Index, and claims notable token efficiency—matching Llama 4 Maverick with over 10x less training compute. Meta withheld parameter count, architecture, and training details, positioning Muse Spark as a closed commercial product competing with OpenAI, Google, and Anthropic. The release introduces 'thought compression' via RL and a parallel multi-agent 'contemplating' mode, while showing gaps in coding and agentic benchmarks.
Qwen3.5 Small tops mobile-sized open models; GPT-5.3 Instant, Gemini 3.1 Flash-Lite, Claude memory import, and LLM deanonymization research
Alibaba released the Qwen3.5 Small model series (0.8B–9B parameters) with a hybrid Gated Delta Networks + sparse MoE architecture, with the 9B model outperforming OpenAI's gpt-oss-120B on GPQA Diamond despite being 13.5x smaller; all weights are Apache 2.0 licensed. Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model at $0.25/M input tokens with 2.5x faster TTFT than Gemini 2.5 Flash. OpenAI released GPT-5.3 Instant targeting conversational quality improvements and hallucination reduction, while Anthropic added memory import/export functionality across all Claude tiers. Separately, researchers from MATS, Anthropic, and ETH Zurich demonstrated that LLM-based pipelines can deanonymize pseudonymous online users at 68% recall/90% precision for $1–4 per profile.
OpenAI GPT-5.4 Pro and GPT-5.4 Thinking challenge Gemini 3.1 Pro Preview for top AI model position
OpenAI released GPT-5.4 in two variants (Pro and Thinking), featuring expanded context windows up to 1.05M tokens, native computer use, tool search capabilities, and adjustable reasoning levels. In independent benchmarks by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning nearly ties Gemini 3.1 Pro Preview on the Intelligence Index (57 vs 57.2 points) but at roughly 3.3x the cost, while leading on coding and agentic sub-indices. The release leapfrogs Claude Opus 4.6 on most benchmarks but faces stiff competition from Google's Gemini 3.1 Pro Preview, which maintains a price and multimodal advantage.