Entity · benchmark

AIME 2025

benchmark · active · aime-2025-fa536129 · 7 events · first seen May 18, 2026

Aliases: AIME 2025, AIME-2025

More like this (12)

AIME 2026, AIME, AIME2024, AIME25, AIME24, IMO 2025, AIME26, IOI 2025, ICAIS 2025, USAMO 2026, AIMS, AIMO Progress Prize

Recent events (7)

7 · The Batch · Jul 3, 2026

Microsoft reveals MAI-Thinking-1, a from-scratch reasoning model with MoE architecture

Microsoft introduced MAI-Thinking-1, its first reasoning language model built without distillation from third-party models, comparable in size to Claude Sonnet 4.6. The model uses a mixture-of-experts architecture (1T total / 35B active parameters), was pretrained on 30 trillion tokens of primarily licensed human-generated data, and trained via reinforcement learning across specialist models for STEM, coding, and safety. It scored 97.0% on AIME 2025, placing third behind Claude Opus 4.6 and ahead of DeepSeek V3.2, and is available in private preview via Microsoft Foundry. The release marks a strategic shift as Microsoft moves to reduce dependence on OpenAI models following a renegotiated partnership in April 2026.

Training Infrastructure Frontier Model Releases MAI-Thinking-1 Claude Sonnet 4 Claude Opus 4.6 +12 more

6 · arXiv · cs.CL · Jun 23, 2026

TriggerBench: A benchmark for evaluating prospective memory in LLMs

Researchers introduce TriggerBench, a benchmark evaluating prospective memory (PM) in LLMs — the ability to spontaneously recall and act on latent constraints without explicit prompting. The benchmark spans five dimensions across daily assistant and professional workflow scenarios, and reveals that PM is substantially harder than retrospective memory, decaying sharply with context length while retrospective memory near-saturates at 100K tokens. Key findings include a precision-recall trade-off in PM, attentional fragility under concurrent requests, and a novel result that PM accuracy correlates with spare reasoning capacity as measured against AIME-2025 math performance.

Long Context Evolution Evaluation and Benchmarking TriggerBench AIME 2025 +1 more

6 · The Batch · Jun 19, 2026

POPE Training Method Uses Partial Solution Hints to Improve RL Exploration in LLMs

Researchers from Carnegie Mellon University introduced Privileged On-Policy Exploration (POPE), a training method that pairs GRPO reinforcement learning with hint-augmented datasets to help LLMs solve hard problems they would otherwise fail to explore. During training, the model receives partial solution prefixes alongside full problems, enabling it to discover complete solutions; it is then trained on both hinted and unhinted versions so it learns to solve problems without hints at inference time. On competition math benchmarks AIME 2025 and HMMT 2025, POPE outperforms standard GRPO and supervised fine-tuning, with HMMT pass@1 improving from 31.0% to 37.8%. The method addresses a core bottleneck in RL training—sparse reward exploration—by decomposing hard problem-solving into finding a good starting state and completing the solution.

Evaluation and Benchmarking Alignment and RLHF Virginia Smith Carnegie Mellon University Aviral Kumar +8 more
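The POPE summary above describes pairing each hard problem with a partial solution prefix and training on both the hinted and unhinted versions. A minimal sketch of how such a mixed prompt set might be assembled follows. It assumes each problem ships with a reference solution and that the hint is simply a leading slice of it; the names `Problem`, `make_hinted_prompt`, `build_mixed_prompts`, the `hint_fraction` parameter, and the prompt template are all hypothetical and not taken from the POPE work.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    reference_solution: str  # assumption: a reference solution is available


def make_hinted_prompt(problem: Problem, hint_fraction: float) -> str:
    """Attach the first `hint_fraction` of the reference solution as a hint."""
    if not 0.0 < hint_fraction < 1.0:
        raise ValueError("hint_fraction must be strictly between 0 and 1")
    # Keep at least one character so the hint is never empty.
    cut = max(1, int(len(problem.reference_solution) * hint_fraction))
    prefix = problem.reference_solution[:cut]
    return f"{problem.question}\n\nPartial solution:\n{prefix}"


def build_mixed_prompts(problems: list[Problem], hint_fraction: float = 0.25) -> list[str]:
    """Return the unhinted and the hinted prompt for every problem."""
    prompts = []
    for p in problems:
        prompts.append(p.question)  # unhinted version
        prompts.append(make_hinted_prompt(p, hint_fraction))  # hinted version
    return prompts
```

Slicing by characters is only the simplest possible choice; a real pipeline would more likely cut at token or reasoning-step boundaries.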

6 · arXiv · cs.CL · Jun 18, 2026

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.

Frontier Model Releases Alignment and RLHF DAPO AIME 2026 GRPO +2 more
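The STARE summary above mentions reweighting advantages for tokens selected by batch-internal surprisal quantiles, guarded by an entropy gate. The sketch below shows one generic way a gated, quantile-based reweighting could look. It is not STARE's actual rule: the function names, the thresholds `target_entropy` and `q`, the `boost` factor, and even the direction of the scaling are illustrative assumptions.

```python
import math


def quantile(values: list[float], q: float) -> float:
    """Linearly interpolated quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def reweight_advantages(token_probs, advantages, batch_entropy,
                        target_entropy=1.0, q=0.8, boost=1.5):
    """Scale advantages of high-surprisal tokens when entropy is below target.

    All thresholds and the scaling rule are illustrative assumptions.
    """
    if batch_entropy >= target_entropy:
        return list(advantages)  # gate closed: leave advantages untouched
    # Surprisal of each sampled token under the current policy.
    surprisal = [-math.log(p) for p in token_probs]
    # The threshold is computed from the batch itself, not fixed in advance.
    threshold = quantile(surprisal, q)
    return [a * boost if s >= threshold else a
            for a, s in zip(advantages, surprisal)]
```

With `token_probs=[0.9, 0.5, 0.1, 0.8, 0.7]`, unit advantages, and `batch_entropy=0.5`, only the token with probability 0.1 crosses the 0.8 quantile and has its advantage scaled.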

6 · arXiv · cs.AI · Jun 12, 2026

RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy

Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.

Evaluation and Benchmarking Alignment and RLHF RA-RFT Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning GRPO +3 more
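The RA-RFT summary above reports gains in "average@32" accuracy without defining the metric. A common reading, assumed here, is the per-problem fraction of correct answers among 32 sampled responses, averaged over all problems. The sketch below computes that quantity for any sample count; the function name `average_at_k` and the input layout are hypothetical.

```python
def average_at_k(correct: list[list[bool]]) -> float:
    """Mean per-problem accuracy over k sampled answers, as a percentage.

    correct[i][j] is True when sample j for problem i was graded correct.
    """
    if not correct:
        raise ValueError("need at least one problem")
    # Fraction of correct samples for each problem.
    per_problem = [sum(samples) / len(samples) for samples in correct]
    # Unweighted mean across problems, scaled to percentage points.
    return 100.0 * sum(per_problem) / len(per_problem)
```

Under this reading, a 7.1-point improvement means the averaged percentage rose by 7.1, not that accuracy grew by a relative 7.1%.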

6 · arXiv · cs.CL · May 27, 2026

Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference

PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.

Long Context Evolution Inference Economics On-Policy Distillation (OPD) Multi-Token Prediction (MTP) speculative decoding +7 more
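The PIPO summary above quotes a pass@4 improvement. The summary does not say how pass@4 was computed; a widely used approach, assumed here, is the unbiased combinatorial estimator that asks how likely it is that at least one of k samples, drawn without replacement from n graded samples of which c are correct, is correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n graded samples (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws that contain no correct sample;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(8, 2, 4)` evaluates `1 - 15/70`, since 15 of the 70 possible four-sample draws contain no correct answer. Per-problem values are then averaged over the benchmark.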

8 · Mistral AI News · May 18, 2026

Mistral Releases Mistral 3 Family: Mistral Large 3 (675B MoE) and Ministral 3 Series (3B–14B), All Apache 2.0

Mistral AI has announced Mistral 3, a family of open-weight models including Mistral Large 3 (41B active / 675B total sparse MoE) and three dense Ministral 3 edge models (3B, 8B, 14B), all released under Apache 2.0. Mistral Large 3 debuts at #2 on LMArena's OSS non-reasoning leaderboard, supports image understanding, and was trained on 3,000 NVIDIA H200 GPUs; a reasoning variant is forthcoming. The Ministral 3 series includes base, instruct, and reasoning variants with multimodal and multilingual capabilities, with the 14B reasoning model achieving 85% on AIME '25. The release involves deep co-optimization with NVIDIA (Blackwell/Hopper kernels, NVFP4 format), vLLM, and Red Hat, and is available across major cloud and inference platforms.

Training Infrastructure Frontier Model Releases Mistral AI Amazon Bedrock Red Hat +16 more
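Two of the events on this page quote mixture-of-experts sizes as active versus total parameters (35B of 1T for MAI-Thinking-1, 41B of 675B for Mistral Large 3). The short sketch below turns those quoted figures into the share of parameters used per token. The helper name `active_fraction` is illustrative, and the numbers are copied from the summaries rather than independently verified.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters used per token, given counts in billions."""
    if total_params_b <= 0 or not 0 < active_params_b <= total_params_b:
        raise ValueError("expect 0 < active <= total")
    return active_params_b / total_params_b


# Figures as quoted in the summaries on this page.
models = {
    "MAI-Thinking-1": (35.0, 1000.0),   # 35B active / 1T total
    "Mistral Large 3": (41.0, 675.0),   # 41B active / 675B total
}

for name, (active, total) in models.items():
    share = active_fraction(active, total)
    print(f"{name}: {share:.1%} of parameters active per token")
```

This works out to 3.5% for MAI-Thinking-1 and about 6.1% for Mistral Large 3, so per-token compute scales with the active count rather than the headline total.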