AIME26

benchmark · active · aime26-109cd396 · 2 events · first seen May 18, 2026

Aliases: AIME26

More like this (12)

AIME25 · AIME24 · AIME · AIME 2025 · AIME 2026 · AIME2024 · AMIA · Andyyyy64 · AI-MO · AdamW · AIMS · AIA Labs

Recent events (2)

7 · arXiv · cs.LG · May 29, 2026

Entropy-Cut Metropolis-Hastings: Sampling-Based Reasoning Without RL Training

This paper introduces Entropy-Cut Metropolis-Hastings (ECMH), an algorithm that samples from a 'power distribution' over a base language model's outputs to elicit strong reasoning without reinforcement learning post-training. Rather than cutting reasoning traces at uniformly random positions, ECMH uses next-token entropy as a proxy for consequential decision points (e.g., the choice of proof strategy) and resamples from those positions. The authors prove that mixing time scales with the number of decisions rather than the number of tokens, and they report consistent improvements over RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.
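
The idea summarized above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the three-token model `next_token_probs`, the reading of the 'power distribution' as p(x)^alpha, the choice to cut with probability proportional to next-token entropy, and the standard Metropolis-Hastings acceptance rule used to correct for that proposal.

```python
import math
import random

VOCAB = ["a", "b", "c"]
SEQ_LEN = 6

def next_token_probs(prefix):
    """Hypothetical toy base model: uncertain at positions 0 and 3, nearly deterministic elsewhere."""
    if len(prefix) in (0, 3):  # stand-ins for "decision points" (high entropy)
        return [0.4, 0.35, 0.25]
    probs = [0.025] * len(VOCAB)  # otherwise mostly repeat the previous token
    probs[VOCAB.index(prefix[-1])] = 0.95
    return probs

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def log_prob_suffix(seq, start):
    """log p(seq[start:] | seq[:start]) under the toy model."""
    return sum(
        math.log(next_token_probs(seq[:t])[VOCAB.index(seq[t])])
        for t in range(start, len(seq))
    )

def sample_suffix(prefix, rng):
    """Complete a prefix to SEQ_LEN tokens by sampling from the toy model."""
    seq = list(prefix)
    while len(seq) < SEQ_LEN:
        seq.append(rng.choices(VOCAB, weights=next_token_probs(seq))[0])
    return seq

def cut_weights(seq):
    """Unnormalized cut probabilities: next-token entropy at each position."""
    return [entropy(next_token_probs(seq[:t])) for t in range(len(seq))]

def ecmh_step(seq, alpha, rng):
    """One Metropolis-Hastings step targeting p(x)^alpha (assumed form of the target)."""
    w = cut_weights(seq)
    t = rng.choices(range(len(seq)), weights=w)[0]  # entropy-weighted cut position
    proposal = sample_suffix(seq[:t], rng)          # resample everything after the cut
    w_new = cut_weights(proposal)
    # Target ratio times reverse/forward proposal ratio; the base-model suffix
    # probabilities partly cancel, leaving an exponent of (alpha - 1).
    log_ratio = (alpha - 1) * (log_prob_suffix(proposal, t) - log_prob_suffix(seq, t))
    # Correction for the state-dependent choice of cut position.
    log_ratio += math.log(w_new[t] / sum(w_new)) - math.log(w[t] / sum(w))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return seq

rng = random.Random(0)
seq = sample_suffix([], rng)
for _ in range(200):
    seq = ecmh_step(seq, alpha=4.0, rng=rng)
```

In this toy setup the cuts land mostly on positions 0 and 3, where entropy is highest, so the chain spends its proposals on the two real choices instead of the near-deterministic tokens between them.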

Frontier Model Releases · Evaluation and Benchmarking · power distribution · MATH500 · Entropy-Cut Metropolis-Hastings

6 · The Batch · May 18, 2026

Data Points: Thinking Machines Interaction Model, ERNIE 5.1, Co-Mathematician, RL Conductor, and More

This edition of The Batch covers five notable AI developments: Thinking Machines' research preview of an 'interaction model' with a 200ms micro-turn multimodal architecture; Baidu's ERNIE 5.1, a compressed derivative of ERNIE 5.0 using only 6% of typical pre-training compute; Google DeepMind's Co-Mathematician collaborative workbench reaching 48% on FrontierMath Tier 4; a 7B RL Conductor model that orchestrates multi-agent workflows via reinforcement learning; and Google's Magic Pointer cursor system powered by Gemini. Secondary items include GitHub Copilot pricing restructuring ahead of usage-based billing.

Training Infrastructure · Frontier Model Releases · Thinking Machines · SGLang · GitHub