Entity · benchmark

MATH benchmark

benchmark · active · math-benchmark-1629c137 · 3 events · first seen May 20, 2026

Aliases: MATH benchmark

Co-occurring entities

Chain-of-Thought Reasoning · Mistral AI · Mathstral 7B · Project Numina · Paul Bourdon · La Plateforme · Mistral 7B · MMLU · HuggingFace · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence · HotpotQA · GSM8K · process supervision · outcome supervision · OpenAI

More like this (12)

MATH · AdvancedMathBench · multilingual mathematical benchmarks · MATH-500 · MATH-MCQA · MathVista · MT-Bench · CORE benchmark · MLE-bench · FACTS Benchmark Suite · Math-Verify · MAS-PromptBench

Recent events (3)

6 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Mathstral 7B: Math-Specialized Model with SOTA Reasoning in Size Category

Mistral AI has released Mathstral 7B, a math- and STEM-specialized model built on Mistral 7B and developed in collaboration with Project Numina. In standard evaluation the model scores 56.6% on MATH and 63.47% on MMLU. With inference-time compute scaling, where a reward model selects the best of 64 sampled candidates, the MATH score rises to 74.59%. The weights are openly available on HuggingFace and are compatible with the mistral-inference and mistral-finetune tooling.

Frontier Model Releases · Evaluation and Benchmarking · Mistral AI · Mathstral 7B · Project Numina · +8 more
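
The best-of-64 selection mentioned in the Mathstral summary above can be illustrated with a minimal Python sketch. It assumes a reward model is available as a scoring function; `best_of_n` and `toy_reward` are hypothetical names introduced here, and the toy scorer is only a stand-in, not Mistral AI's reward model or evaluation code.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate solution that the reward function scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(solution: str) -> float:
    """Hypothetical stand-in for a learned reward model (assumption)."""
    return 1.0 if solution.strip().endswith("= 4") else 0.0


# In the setting described above there would be 64 sampled candidates;
# three are enough to show the selection step.
candidates = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best = best_of_n(candidates, toy_reward)
print(best)
```

The extra accuracy comes from spending more compute at inference time (sampling and scoring many candidates) rather than from changing the model's weights.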

6 · arXiv · cs.CL · May 26, 2026

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families on the GSM8K, MATH, and HotpotQA benchmarks: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency about 19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity. The effect is validated on a held-out 11th model (Qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture in which semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps; two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and its pre-registered replication.

Evaluation and Benchmarking · AI Safety Research · Qwen2.5-7B-Instruct-1M · ReAct · stealth-divergence · +5 more
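
The "percentage points more often" comparison in the summary above can be made concrete with a small Python sketch. It assumes inconsistency is measured as the share of items whose final answer changes after a perturbation; the paper's exact metric is not given here, so the function names and toy answers below are illustrative assumptions only.

```python
from typing import Sequence


def inconsistency_rate(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of items whose final answer differs from the unperturbed run."""
    if not baseline or len(baseline) != len(perturbed):
        raise ValueError("answer lists must be non-empty and of equal length")
    changed = sum(b != p for b, p in zip(baseline, perturbed))
    return changed / len(baseline)


def gap_in_percentage_points(
    baseline: Sequence[str], semantic: Sequence[str], surface: Sequence[str]
) -> float:
    """Semantic minus surface inconsistency, expressed in percentage points."""
    return 100.0 * (
        inconsistency_rate(baseline, semantic) - inconsistency_rate(baseline, surface)
    )


# Toy final answers (fabricated for illustration, not data from the paper).
baseline = ["4", "7", "12", "9"]
semantic = ["4", "8", "11", "9"]   # 2 of 4 answers changed
surface = ["4", "7", "12", "10"]   # 1 of 4 answers changed
gap = gap_in_percentage_points(baseline, semantic, surface)
print(gap)  # 25.0
```

A gap in percentage points is an absolute difference between two rates, so a 50% rate versus a 25% rate is a 25-point gap, not a 100% relative increase.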

7 · OpenAI Blog · May 20, 2026

Improving Mathematical Reasoning with Process Supervision

OpenAI trained a model achieving state-of-the-art mathematical problem solving by rewarding each correct reasoning step (process supervision) rather than only the final answer (outcome supervision). This approach improves performance on math benchmarks and carries an alignment benefit by training models to produce human-endorsed chain-of-thought reasoning. The work highlights a potential synergy between capability improvements and alignment techniques.

Frontier Model Releases · Evaluation and Benchmarking · process supervision · outcome supervision · Chain-of-Thought Reasoning · +3 more
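
The distinction between outcome and process supervision in the summary above can be sketched in Python. This is a simplified illustration, not OpenAI's training code: the step labels stand in for hypothetical human annotations, and both function names are assumptions introduced here.

```python
from typing import List, Sequence


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome supervision: a single signal based only on the final answer."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_rewards(step_is_correct: Sequence[bool]) -> List[float]:
    """Process supervision: one signal per reasoning step."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


# A toy solution that reaches the right answer despite a flawed middle step.
# The labels are hypothetical annotations, one per reasoning step.
step_labels = [True, False, True]
outcome = outcome_reward("4", "4")
per_step = process_rewards(step_labels)
print(outcome, per_step)  # 1.0 [1.0, 0.0, 1.0]
```

In this toy case the outcome signal rewards the whole solution, while the per-step signal flags the faulty step, which is the kind of feedback the summary credits with producing human-endorsed chain-of-thought reasoning.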