Almanac
benchmark

MMLU

benchmark · active · mmlu-34bb6b00 · 11 events · first seen 28d ago

Aliases: MMLU

Recent events (11)

6 · Hugging Face Blog · 28d ago

What's going on with the Open LLM Leaderboard?

Hugging Face published a commentary explaining why MMLU scores on the Open LLM Leaderboard differed from those reported elsewhere, notably in the LLaMA paper. The post traces the gap to differing MMLU evaluation implementations (the original benchmark code, HELM, and the EleutherAI evaluation harness), which vary in prompt format and in how the model's answer is scored. It cautions that MMLU numbers are comparable only when produced by the same implementation, which bears directly on how leaderboard rankings should be read.

6 · Mistral AI News · 15d ago

Mistral AI Releases Mathstral 7B: Math-Specialized Model with SOTA Reasoning in Size Category

Mistral AI has released Mathstral 7B, a math and STEM-specialized model built on Mistral 7B, developed in collaboration with Project Numina. The model achieves 56.6% on MATH and 63.47% on MMLU in standard evaluation, rising to 74.59% on MATH when a reward model selects the best of 64 sampled candidates, at the cost of additional inference-time compute. Weights are open on Hugging Face and compatible with the mistral-inference and mistral-finetune tooling.
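The reward-model figure above corresponds to what is commonly called best-of-N selection: sample N candidate answers, score each, and keep the top one. The following is a minimal sketch under that assumption, not Mistral's evaluation code; `best_of_n` and `toy_reward` are hypothetical names, and `toy_reward` is only a stand-in for a learned reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)


def toy_reward(answer: str) -> float:
    """Hypothetical stand-in for a reward model: favors answers ending in 42."""
    return 1.0 if answer.strip().endswith("42") else 0.0


# In the setting described above there would be 64 sampled candidates;
# three are enough to show the selection step.
samples = ["x = 41", "x = 42", "x = 40"]
print(best_of_n(samples, toy_reward))
```

Accuracy under this scheme depends on both the sampler and the scorer, so it is not directly comparable to single-sample accuracy.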

8 · Mistral AI News · 15d ago

Mistral AI Releases Mistral Large, Claims Second-Best API Model After GPT-4

Mistral AI has released Mistral Large, its most capable model to date, claiming second place among API-accessible models behind GPT-4 on standard benchmarks including MMLU, HellaSwag, and coding/math evals. The model features a 32K context window, native fluency in five European languages, function calling, and constrained output mode. Simultaneously, Mistral is launching a new Mistral Small optimized for latency, restructuring its endpoint lineup, and announcing Microsoft Azure as its first major distribution partner. This marks Mistral's first significant commercial partnership and expansion beyond its own infrastructure.

8 · Mistral AI News · 15d ago

Mistral AI Releases Mixtral 8x22B Under Apache 2.0

Mistral AI has released Mixtral 8x22B, a sparse Mixture-of-Experts model with 141B total parameters but only 39B active parameters, under the permissive Apache 2.0 license. The model features a 64K token context window, native function calling, multilingual support across five European languages, and strong math and coding performance. Mistral claims it outperforms all other open-weight models on standard benchmarks while being faster than dense 70B models due to sparse activation. An instructed version achieves 90.8% on GSM8K maj@8.
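The "maj@8" notation above is usually read as majority voting over 8 sampled answers per problem: the most frequent final answer is taken as the prediction and compared with the reference. A minimal sketch under that reading follows; the function names and the toy data are illustrative assumptions, not Mistral's evaluation harness.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the first one seen)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def maj_at_k(samples_per_problem: Sequence[Sequence[str]],
             references: Sequence[str]) -> float:
    """Fraction of problems whose majority-voted answer equals the reference."""
    if len(samples_per_problem) != len(references):
        raise ValueError("one reference is required per problem")
    correct = sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return correct / len(references)


# Toy data: two problems, eight sampled answers each (k = 8).
samples = [
    ["18", "18", "20", "18", "18", "17", "18", "18"],  # majority "18"
    ["5", "7", "7", "5", "5", "5", "9", "5"],          # majority "5"
]
print(maj_at_k(samples, ["18", "7"]))
```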

7 · Mistral AI News · 15d ago

Mistral AI Founding Manifesto and Mistral 7B Release

Mistral AI published its founding mission statement alongside the release of Mistral 7B, a 7-billion-parameter open-weights language model released under Apache 2.0. Mistral claims the model, built in three months from a standing start, outperforms all available open models up to 13B parameters on standard English and code benchmarks. The post articulates Mistral's strategic thesis: open-weight models will outcompete proprietary black-box APIs for most enterprise use cases, drawing analogies to Linux, WebKit, and Kubernetes. The company signals intent to release progressively larger frontier models while building a commercial offering around on-premise and VPC deployment.

8 · Anthropic News · 15d ago

Introducing Claude 3.5 Sonnet

Anthropic launches Claude 3.5 Sonnet, the first model in its Claude 3.5 family, claiming it outperforms Claude 3 Opus and competitor models on the GPQA, MMLU, and HumanEval benchmarks while operating at twice the speed of Opus and at mid-tier pricing ($3 per million input tokens and $15 per million output tokens). The model features a 200K context window, improved vision capabilities, and a score of 64% on an internal agentic coding evaluation versus 38% for Opus. Alongside the model, Anthropic introduces Artifacts on Claude.ai, a dedicated workspace for real-time editing of AI-generated content. The UK AI Safety Institute evaluated the model before deployment, and it is assessed at ASL-2.

8 · Mistral AI News · 15d ago

Mistral Large 2 (123B): New Frontier Model with 128k Context, Multilingual and Code Capabilities

Mistral AI releases Mistral Large 2, a 123-billion-parameter model with a 128k context window, supporting 80+ programming languages and over a dozen natural languages. The model claims competitive performance with GPT-4o, Claude 3 Opus, and Llama 3 405B on code generation, reasoning, and multilingual benchmarks, while targeting cost-efficient single-node inference. Weights are available under the Mistral Research License for non-commercial use, with a commercial license required for self-deployment. The model is accessible via Mistral's la Plateforme API (mistral-large-2407), Hugging Face, and Google Cloud Vertex AI.

7 · Mistral AI News · 15d ago

Mistral Small 3: 24B Latency-Optimized Open-Weight Model Released Under Apache 2.0

Mistral AI has released Mistral Small 3, a 24B-parameter instruction-tuned model optimized for low latency, achieving over 81% on MMLU at 150 tokens/s on a single GPU. The model is competitive with Llama 3.3 70B and Qwen 32B while being more than 3x faster on equivalent hardware, and is released under Apache 2.0 for both pretrained and instruction-tuned checkpoints. It is explicitly not trained with RL or synthetic data, positioning it as a base model for community fine-tuning and reasoning capability development. Deployment targets include local inference on consumer hardware (RTX 4090, MacBook 32GB RAM), agentic function calling, and domain-specific fine-tuning.

9 · Anthropic News · 13d ago

Anthropic launches Claude 3 model family: Haiku, Sonnet, and Opus

Anthropic announced the Claude 3 model family on March 4, 2024, comprising three models (Haiku, Sonnet, and Opus) in ascending order of capability. Claude 3 Opus claims top performance on major benchmarks including MMLU, GPQA, and GSM8K, with near-perfect recall on long-context evaluations (200K context window, 99%+ accuracy on Needle In A Haystack) and new multimodal vision capabilities. The release also highlights fewer unnecessary refusals, a twofold accuracy improvement over Claude 2.1 on challenging open-ended questions, and Constitutional AI-based safety tuning. Opus and Sonnet launched immediately via claude.ai and the Claude API across 159 countries, with Haiku to follow.

4 · arXiv · cs.CL · 6d ago

CHAIR: Supervised hallucination detection via internal logit analysis across LLM layers

A new arXiv preprint introduces CHAIR (Classifier of Hallucination As ImproveR), a supervised framework that detects hallucinations by extracting statistical features (max, min, mean, std, slope) from token logits across all layers of an LLM. Evaluated on TruthfulQA and MMLU, CHAIR shows improved detection accuracy especially in zero-shot settings. The authors argue the approach also points toward richer internal representations for designing adaptive decoding strategies that reduce hallucinations.
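The five statistics named above (max, min, mean, std, slope) can be illustrated on a single token's logit values read out at each layer. The sketch below is an assumption about how such features might be computed, not the preprint's code: `trajectory_features` is a hypothetical name, the standard deviation is the population form, and the slope is an ordinary least-squares fit against the layer index.

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence


def trajectory_features(values: Sequence[float]) -> Dict[str, float]:
    """Summarize one token's per-layer logit values with five statistics.

    values[i] is the logit observed at layer i (assumed layout).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need values from at least two layers")
    mean = fmean(values)
    x_mean = (n - 1) / 2  # mean of the layer indices 0..n-1
    # Least-squares slope of logit value against layer index.
    sxy = sum((i - x_mean) * (v - mean) for i, v in enumerate(values))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "std": pstdev(values),  # population standard deviation (assumption)
        "slope": sxy / sxx,
    }


# Toy trajectory: a logit that grows by 1.0 per layer over four layers.
features = trajectory_features([1.0, 2.0, 3.0, 4.0])
print(features)
```

A supervised detector in this style would concatenate such features over tokens and feed them to a classifier trained on labeled hallucination examples; that stage is omitted here.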

8 · Mistral AI News · 15d ago

Mistral 7B: Open-Weights 7B Model Outperforming Llama 2 13B

Mistral AI released Mistral 7B, a 7.3B-parameter language model under the Apache 2.0 license that outperforms Llama 2 13B across all evaluated benchmarks and outperforms Llama 1 34B on many of them. The model employs Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at reduced cost, achieving a roughly 2x speed improvement at 16k sequence length. A fine-tuned chat variant, Mistral 7B Instruct, outperforms all 7B chat models on MT-Bench and is competitive with 13B-class chat models. The release includes deployment support for AWS, GCP, Azure, Hugging Face, and local use via vLLM.
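Sliding Window Attention, mentioned above, limits each position to a fixed number of recent positions instead of the whole prefix. The sketch below builds such a mask in plain Python to show the idea; it is not Mistral's implementation, the function names are hypothetical, and the convention that the window includes the current token is an assumption.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Position i sees itself and the previous window - 1 positions (causal).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(seq_len: int, window: int) -> int:
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in sliding_window_mask(seq_len, window))


mask = sliding_window_mask(5, 3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Setting window = seq_len recovers ordinary causal attention.
print(attended_pairs(5, 3), attended_pairs(5, 5))
```

The number of allowed pairs grows linearly with sequence length once the sequence exceeds the window, rather than quadratically as in full causal attention.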