Entity · benchmark

MMLU

benchmarkactivemmlu-34bb6b00·14 events·first seen May 19, 2026

Aliases: MMLU

Co-occurring entities

More like this (12)

MMLU-Pro MMMU Global-MMLU MAML MLX MRL MDLM u-muP MNLI UniCAD-MLLM First-Order MAML LPU

Recent events (14)

5arXiv · cs.CL·Jul 23, 2026·source ↗

SelectBench and DAPO post-training for selective evidence adoption in RAG contexts

Researchers introduce SelectBench, a benchmark and training set for evaluating whether retrieval-augmented LLMs can selectively adopt valid evidence while rejecting misleading or injected content. They post-train Qwen3.5-4B using DAPO with rule-based and semantic judge rewards, achieving modest but directional improvements on SelectBench-v2 (22.46% to 26.46% strict success). Gains do not survive Holm multiple-comparison correction, and prompt-injection resistance shows no improvement, leaving statistical robustness and injection resistance as open challenges. General capabilities on MMLU and HotpotQA are preserved.

Evaluation and Benchmarking AI Safety Research SelectBench DAPO DeepSeek V4 +3 more

6arXiv · cs.LG·Jul 9, 2026·source ↗

Analysis-driven transformer linearization outperforms prior baselines on LLaMA and Qwen up to 32B

A new arXiv paper analyzes why post-hoc linearization of causal self-attention degrades model quality, identifying key-dependent rank-1 orthogonal projections as the mechanism softmax relies on and explaining why delta-style networks outperform gated accumulation. The authors introduce structural interventions—sink tokens, short convolutions, and fixed-budget cache routing—applied in a frozen-backbone regime. Scaling across LLaMA and Qwen models up to 32B parameters, the approach outperforms prior post-hoc linearization baselines on MMLU and matches long-context retrieval of adaptive-caching frameworks.

Long Context Evolution Inference Economics Qwen The Key to Going Linear: Analysis-Driven Transformer Linearization Llama +1 more

6arXiv · cs.CL·Jul 3, 2026·source ↗

Scaling laws study finds LLM social simulation fidelity mostly improves with compute, with notable exceptions

A new arXiv preprint investigates whether scaling compute improves the fidelity of LLM-based social simulations across three domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Using 85 Qwen3-architecture models trained under fixed-compute budgets from 10^18 to 10^20 FLOPs, plus 35 larger open-weight models up to 70B parameters, the authors find strong scaling in most settings. However, longitudinal forecasting, underrepresented populations, and specific cognitive bias calibration tasks (e.g., risk aversion) scale poorly, with fine-tuning failing to close gaps from 0.5B to 8B parameters. The work provides empirical grounding for where scaling will and will not suffice for social simulation research.

Evaluation and Benchmarking DCLM Will Scaling Improve Social Simulation with LLMs?Qwen3 +1 more

4arXiv · cs.CL·Jun 11, 2026·source ↗

CHAIR: Supervised hallucination detection via internal logit analysis across LLM layers

A new arXiv preprint introduces CHAIR (Classifier of Hallucination As ImproveR), a supervised framework that detects hallucinations by extracting statistical features (max, min, mean, std, slope) from token logits across all layers of an LLM. Evaluated on TruthfulQA and MMLU, CHAIR shows improved detection accuracy especially in zero-shot settings. The authors argue the approach also points toward richer internal representations for designing adaptive decoding strategies that reduce hallucinations.

Evaluation and Benchmarking AI Safety Research TruthfulQA CHAIR MMLU

9Anthropic News·Jun 3, 2026·source ↗

Anthropic launches Claude 3 model family: Haiku, Sonnet, and Opus

Anthropic announced the Claude 3 model family on March 4, 2024, comprising three models — Haiku, Sonnet, and Opus — in ascending capability order. Claude 3 Opus claims top performance on major benchmarks including MMLU, GPQA, and GSM8K, with near-perfect recall on long-context evaluations (200K context window, 99%+ NIAH accuracy) and new multimodal vision capabilities. The release also highlights reduced unnecessary refusals, a twofold accuracy improvement over Claude 2.1, and Constitutional AI-based safety tuning. Opus and Sonnet launched immediately via claude.ai and the Claude API across 159 countries, with Haiku to follow.

Long Context Evolution Frontier Model Releases Claude Opus 4.6 Constitutional AI Claude Haiku 4.5 +8 more

8Anthropic News·Jun 2, 2026·source ↗

Introducing Claude 3.5 Sonnet

Anthropic launches Claude 3.5 Sonnet, the first model in its Claude 3.5 family, claiming it outperforms Claude 3 Opus and competitor models on GPQA, MMLU, and HumanEval benchmarks while operating at twice the speed and mid-tier pricing ($3/$15 per million tokens). The model features a 200K context window, improved vision capabilities, and an internal agentic coding evaluation score of 64% versus 38% for Opus. Alongside the model, Anthropic introduces Artifacts on Claude.ai, a dedicated workspace for real-time editing of AI-generated content. The model was pre-deployment evaluated by the UK AI Safety Institute and assessed at ASL-2.

Long Context Evolution Frontier Model Releases claude.ai Thorn Amazon Bedrock +16 more

8Mistral Ai News·Jun 1, 2026·source ↗

Mistral AI Releases Mixtral 8x22B Under Apache 2.0

Mistral AI has released Mixtral 8x22B, a sparse Mixture-of-Experts model with 141B total parameters but only 39B active parameters, under the permissive Apache 2.0 license. The model features a 64K token context window, native function calling, multilingual support across five European languages, and strong math and coding performance. Mistral claims it outperforms all other open-weight models on standard benchmarks while being faster than dense 70B models due to sparse activation. An instructed version achieves 90.8% on GSM8K maj@8.

Frontier Model Releases Open Weights Progress Mistral AI Llama 2 70B Apache 2.0 +10 more

7Mistral Ai News·Jun 1, 2026·source ↗

Mistral Small 3: 24B Latency-Optimized Open-Weight Model Released Under Apache 2.0

Mistral AI has released Mistral Small 3, a 24B-parameter instruction-tuned model optimized for low latency, achieving over 81% on MMLU at 150 tokens/s on a single GPU. The model is competitive with Llama 3.3 70B and Qwen 32B while being more than 3x faster on equivalent hardware, and is released under Apache 2.0 for both pretrained and instruction-tuned checkpoints. It is explicitly not trained with RL or synthetic data, positioning it as a base model for community fine-tuning and reasoning capability development. Deployment targets include local inference on consumer hardware (RTX 4090, MacBook 32GB RAM), agentic function calling, and domain-specific fine-tuning.

Frontier Model Releases Open Weights Progress Mistral AI Mistral Small 4 Ollama +12 more

8Mistral Ai News·Jun 1, 2026·source ↗

Mistral AI Releases Mistral Large, Claims Second-Best API Model After GPT-4

Mistral AI has released Mistral Large, its most capable model to date, claiming second place among API-accessible models behind GPT-4 on standard benchmarks including MMLU, HellaSwag, and coding/math evals. The model features a 32K context window, native fluency in five European languages, function calling, and constrained output mode. Simultaneously, Mistral is launching a new Mistral Small optimized for latency, restructuring its endpoint lineup, and announcing Microsoft Azure as its first major distribution partner. This marks Mistral's first significant commercial partnership and expansion beyond its own infrastructure.

Long Context Evolution Frontier Model Releases Azure AI Studio Mistral AI Llama 2 70B +13 more

8Mistral Ai News·Jun 1, 2026·source ↗

Mistral Large 2 (123B): New Frontier Model with 128k Context, Multilingual and Code Capabilities

Mistral AI releases Mistral Large 2, a 123-billion-parameter model with a 128k context window, supporting 80+ coding languages and over a dozen natural languages. The model claims competitive performance with GPT-4o, Claude 3 Opus, and Llama 3 405B on code generation, reasoning, and multilingual benchmarks, while targeting cost-efficient single-node inference. Weights are available under a Mistral Research License for non-commercial use, with a commercial license required for self-deployment. The model is accessible via Mistral's la Plateforme API (mistral-large-2407), HuggingFace, and Google Cloud Vertex AI.

Long Context Evolution Frontier Model Releases Mistral AI MT-Bench Claude Opus 4.6 +14 more

6Mistral Ai News·Jun 1, 2026·source ↗

Mistral AI Releases Mathstral 7B: Math-Specialized Model with SOTA Reasoning in Size Category

Mistral AI has released Mathstral 7B, a math and STEM-specialized model built on Mistral 7B, developed in collaboration with Project Numina. The model achieves 56.6% on MATH and 63.47% on MMLU in standard evaluation, improving to 74.59% on MATH with a reward model over 64 candidates using inference-time compute scaling. Weights are open on HuggingFace and compatible with mistral-inference and mistral-finetune tooling.

Frontier Model Releases Evaluation and Benchmarking Mistral AI Mathstral 7B Project Numina +8 more

8Mistral Ai News·Jun 1, 2026·source ↗

Mistral 7B: Open-Weights 7B Model Outperforming Llama 2 13B

Mistral AI released Mistral 7B, a 7.3B parameter language model under the Apache 2.0 license that outperforms Llama 2 13B across all evaluated benchmarks and approaches Llama 34B on many tasks. The model employs Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at reduced cost, achieving roughly 2x speed improvement at 16k sequence length. A fine-tuned chat variant, Mistral 7B Instruct, outperforms all 7B chat models on MT-Bench and is competitive with 13B-class chat models. The release includes deployment support for AWS, GCP, Azure, HuggingFace, and local use via vLLM.

Long Context Evolution Frontier Model Releases Mistral AI MT-Bench Mistral 7B Instruct v0.2 +13 more

7Mistral Ai News·Jun 1, 2026·source ↗

Mistral AI Founding Manifesto and Mistral 7B Release

Mistral AI published its founding mission statement alongside the release of Mistral 7B, a 7-billion-parameter open-weights language model released under Apache 2.0. The model claims to outperform all available open models up to 13B parameters on standard English and code benchmarks, produced in three months from a standing start. The post articulates Mistral's strategic thesis: open-weight models will outcompete proprietary black-box APIs for most enterprise use cases, drawing analogies to Linux, WebKit, and Kubernetes. The company signals intent to release progressively larger frontier models while building a commercial offering around on-premise and VPC deployment.

Frontier Model Releases Open Weights Progress Mistral AI Apache 2.0 DeepMind +8 more

6Hugging Face Blog·May 19, 2026·source ↗

What's going on with the Open LLM Leaderboard?

Hugging Face published a commentary examining anomalies and issues observed in the Open LLM Leaderboard, focusing on MMLU benchmark results. The post investigates potential data contamination, evaluation inconsistencies, and scoring discrepancies across open-weight models. It raises concerns about the reliability of MMLU as a benchmark signal and the integrity of leaderboard rankings.

Evaluation and Benchmarking Open Weights Progress Open LLM Leaderboard Hugging Face MMLU