Entity · benchmark

AIME 2026

benchmark · active · aime-2026-5418d929 · 6 events · first seen Jun 1, 2026

Aliases: AIME 2026, AIME 2024

More like this (12)

AIME2024 · AIME 2025 · USAMO 2026 · AIME · QANTA 2026 · AIME26 · AIME24 · AIME25 · ADEME · ICML 2026 · HMMT 2026 · AIMS

Recent events (6)

5 · arXiv · cs.CL · 32h ago

Lightning OPD 2.0 mitigates style bias in cross-teacher on-policy distillation for reasoning models

A new arXiv preprint introduces Lightning OPD 2.0, a method for on-policy distillation (OPD) that addresses style bias when the SFT data generator and distillation teacher are different models. The approach uses rollout-level cross-fitting to estimate and subtract a 'style residual' from teacher-reference disagreement before constructing token-level updates. Starting from Klear-Reasoner-8B-SFT, the method achieves 82.4% on AIME 2024 and 63.0% on LiveCodeBench v5, outperforming the original Lightning OPD in cross-teacher settings. The work relaxes a key practical constraint in distillation pipelines by decoupling SFT data generation from the distillation teacher.

Evaluation and Benchmarking · Alignment and RLHF · Lightning OPD 2.0 · AIME 2026 · LiveCodeBench · +1 more

7 · arXiv · cs.CL · Jul 13, 2026

Mach-Mind-4-Flash: 35B MoE agentic model matching 100B-class performance via post-training optimization

Mach-Mind-4-Flash is a 35B-parameter Mixture-of-Experts model with only 3B activated parameters that achieves performance comparable to 100B-class models through post-training techniques alone. The pipeline combines a unified RL/OPD training infrastructure with multi-teacher scheduling, parallel domain-specific RL experts fused via Multi-Teacher On-Policy Distillation (MOPD), and Hybrid Median-length Policy Optimization (HMPO), which compresses reasoning chains by 19-46% with minimal accuracy loss. Reported benchmark results include 92.70 on AIME'26, 82.82 on IFBench, and 75.80 on BFCL-v4; the authors claim the model leads or matches models 10-30x its activated size at a fraction of the inference cost. The work is notable for demonstrating that post-training optimization can close large gaps in activated parameter count on agentic tasks.

Inference Economics · Agent and Tool Ecosystem · IFBench · Behavioral-SafetyBench · AIME 2026 · +7 more

6 · arXiv · cs.CL · Jul 7, 2026

Direct On-Policy Distillation transfers RL policy shifts from weak to strong models

Researchers propose Direct-OPD (Direct On-Policy Distillation), a method for transferring the policy shift induced by reinforcement learning on a small model to a larger target model, bypassing the need to run expensive RL rollouts on the stronger model. The approach uses the log-ratio between a post-RL teacher and its pre-RL reference as a dense implicit reward signal applied to the student's own on-policy states. Empirically, Direct-OPD improves Qwen3-1.7B from 48.3% to 62.4% on AIME 2024 in 4 hours on 8 A100 GPUs, outperforming step-matched direct RL. The method addresses a key scalability bottleneck in post-training as frontier models grow larger.

Training Infrastructure · Frontier Model Releases · on-policy distillation · AIME 2026 · Weak-to-Strong Generalization via Direct On-Policy Distillation · +5 more
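
The log-ratio reward described in the Direct-OPD summary can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `implicit_token_rewards` and the list-of-floats inputs are hypothetical, and it assumes the log-probabilities of the student's own sampled tokens have already been scored under both the post-RL teacher and its pre-RL reference.

```python
from typing import List


def implicit_token_rewards(
    teacher_logprobs: List[float],
    reference_logprobs: List[float],
) -> List[float]:
    """Per-token reward r_t = log pi_teacher(y_t) - log pi_ref(y_t).

    Both inputs are log-probabilities of the SAME student-sampled tokens,
    one entry per token (hypothetical input format for illustration).
    """
    if len(teacher_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - r for t, r in zip(teacher_logprobs, reference_logprobs)]


# A token the post-RL teacher prefers more than the reference gets a
# positive reward; a token both models score equally gets zero.
rewards = implicit_token_rewards([-1.0, -0.5], [-2.0, -0.5])
print(rewards)
```

Because the reward is defined at every token, the signal is dense, in contrast to a single end-of-rollout outcome reward.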

6 · arXiv · cs.CL · Jun 18, 2026

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short and long chain-of-thought and multi-turn tool-use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.

Frontier Model Releases · Alignment and RLHF · DAPO · AIME 2026 · GRPO · +2 more
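
As a rough illustration of quantile-gated advantage reweighting, the sketch below rescales the advantages of the highest-surprisal tokens in a batch, and only when a measured policy entropy has fallen below a floor. Everything specific here is an assumption made for illustration and is not taken from the STARE paper: the function names, the choice to treat high-surprisal tokens as entropy-critical, the direction and size of `scale`, the quantile level `q`, and the `entropy_floor` gate.

```python
import math
from typing import List


def nearest_rank_quantile(values: List[float], q: float) -> float:
    """Nearest-rank quantile of a non-empty list, with 0 < q <= 1."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def reweight_advantages(
    token_logprobs: List[float],
    advantages: List[float],
    policy_entropy: float,
    entropy_floor: float = 0.5,  # hypothetical gate threshold
    q: float = 0.8,              # hypothetical surprisal quantile
    scale: float = 0.5,          # hypothetical reweighting factor
) -> List[float]:
    """Rescale advantages of tokens whose surprisal exceeds the batch quantile.

    The gate leaves advantages untouched while entropy is at or above the floor.
    """
    if policy_entropy >= entropy_floor:
        return list(advantages)
    surprisal = [-lp for lp in token_logprobs]
    threshold = nearest_rank_quantile(surprisal, q)
    return [a * scale if s > threshold else a
            for a, s in zip(advantages, surprisal)]


# Entropy is below the floor, so the single highest-surprisal token is rescaled.
print(reweight_advantages([-0.1, -0.2, -3.0, -0.3], [1.0] * 4,
                          policy_entropy=0.2, q=0.75))
```

The threshold is computed from the batch itself, so it adapts as the policy's token distribution shifts during training rather than relying on a fixed surprisal cutoff.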

5 · arXiv · cs.AI · Jun 10, 2026

ReasonAlloc: Hierarchical KV Cache Budget Allocation for Long-CoT Reasoning Models

ReasonAlloc is a training-free framework that reframes decoding-time KV cache compression as a hierarchical budget allocation problem, operating at both layer-wise (offline) and head-wise (online) levels. The method identifies an architecture-driven pattern called the 'Reasoning Wave' to guide layer preallocation, then dynamically reallocates to information-rich heads during decoding. Evaluated on MATH-500 and AIME 2024 using DeepSeek-R1-Distill and AceReason models, it outperforms uniform-budget baselines (R-KV, SnapKV, Pyramid-RKV) especially at small budgets of 128–512 tokens, with negligible overhead.

Frontier Model Releases · Inference Economics · AceReason-14B · AIME 2026 · SnapKV · +4 more
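
The two-level idea in the ReasonAlloc summary (a layer-wise split followed by a head-wise split) can be sketched as nested proportional allocation. This is a sketch under stated assumptions, not the paper's algorithm: it treats the budget as a total count of cached tokens, and the `layer_weights` and `head_scores` inputs, the function names, and the largest-remainder rounding are all hypothetical choices made for illustration.

```python
from typing import List


def proportional_split(total: int, weights: List[float]) -> List[int]:
    """Split an integer budget in proportion to non-negative weights.

    Uses largest-remainder rounding so the parts sum exactly to `total`.
    """
    if total < 0 or not weights or any(w < 0 for w in weights):
        raise ValueError("need a non-negative total and non-negative weights")
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("weights must not all be zero")
    raw = [total * w / weight_sum for w in weights]
    alloc = [int(x) for x in raw]
    leftover = total - sum(alloc)
    # Hand the remaining units to the entries with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def allocate_kv_budget(
    total_tokens: int,
    layer_weights: List[float],      # fixed ahead of time (offline level)
    head_scores: List[List[float]],  # per-layer head importance (online level)
) -> List[List[int]]:
    """Return a per-layer, per-head token budget."""
    layer_budgets = proportional_split(total_tokens, layer_weights)
    return [proportional_split(budget, scores)
            for budget, scores in zip(layer_budgets, head_scores)]


# Two layers with two heads each, sharing a 512-token budget.
print(allocate_kv_budget(512, [1.0, 3.0], [[1.0, 1.0], [1.0, 3.0]]))
```

In this sketch only `head_scores` would need to be recomputed during decoding; the layer-level split stays fixed, which mirrors the offline/online division described in the summary.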

7 · The Batch · Jun 1, 2026

Z.ai's GLM-5.1 Open-Weights Model Targets Multi-Hour Agentic Coding Tasks with Iterative Self-Evaluation

Z.ai released GLM-5.1, a 754B-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, capable of cycling through planning, execution, and strategy revision hundreds of times over sessions lasting up to eight hours. The model achieves the top open-weights score on the Artificial Analysis Intelligence Index and third place on Arena's Code leaderboard, and it leads SWE-Bench Pro in Z.ai's own evaluations at 58.4 percent. Weights are available on HuggingFace under the MIT license, with API pricing roughly 40 percent higher than its predecessor's but still below that of comparable proprietary models. No technical report has been published, leaving architecture and training details undisclosed.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.1 Pro · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +14 more