Entity · benchmark

AIME

benchmark · active · aime-00f528e9 · 5 events · first seen May 18, 2026

Aliases: AIME

Co-occurring entities

DeepSeek-R1-Distill-Qwen · Token Budget Saturation and Mechanistic Early Detection of Reasoning Non-Convergence in Chain-of-Thought Models · BIRD · Qwen3 · MATH-500 · Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems · Qwen3-1.7B · AdaPrefix-GRPO · GRPO (Group Relative Policy Optimization) · Neural Theorem Prover · OpenAI · IMO · Lean · AMC12 · o1-preview · DeepSeek V4 · MATH · DeepSeek-R1-Lite-Preview

More like this (12)

AIME 2025 · AIME24 · AIME25 · AIME26 · AIMS · AMIA · AI-MO · AIME 2026 · ADEME · AI for Math Initiative · AIME2024 · IMO

Recent events (5)

5 · arXiv · cs.CL · Jul 24, 2026

Linear probes on chain-of-thought hidden states can predict reasoning non-convergence early

A new arXiv preprint studies the bimodal convergence pattern in chain-of-thought models such as DeepSeek-R1-Distill-Qwen-7B: generations either complete within the token budget (90.3% accuracy on AIME) or exhaust it without concluding (6.6% accuracy), and 62% of generations converge overall. The authors train linear probes on hidden-state activations at early token positions (50-300) and find that layer-20 activations at token 150 achieve AUC 0.608, above chance and better than behavioral baselines built from token entropy and repetition statistics. The results suggest that convergence fate is partially encoded in intermediate representations early in generation, which points toward early-exit inference and adaptive compute allocation. The statistical evidence is modest (permutation test p=0.063), so strong conclusions are not warranted.
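The AUC and permutation-test figures quoted above can be made concrete with a minimal sketch. It is not the preprint's code: the probe scores are synthetic, and the function names, sample size, effect size, and number of permutations are all assumptions chosen for illustration.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (converged) or 0 (did not converge).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_perm=500, seed=0):
    """Return (observed AUC, one-sided p-value) under shuffled labels."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic stand-in for probe outputs (hypothetical data, not from the paper):
# about 62% of generations converge, and converged ones score slightly higher.
rng = random.Random(1)
labels = [1 if rng.random() < 0.62 else 0 for _ in range(120)]
scores = [rng.gauss(0.4 * y, 1.0) for y in labels]

observed, p = permutation_p_value(scores, labels)
print(f"AUC = {observed:.3f}, permutation p = {p:.3f}")
```

With a weak signal and a small evaluation set, an AUC a little above 0.5 can coexist with a p-value above 0.05, which is the situation the summary describes.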

Evaluation and Benchmarking · Inference Economics · AIME · DeepSeek-R1-Distill-Qwen · Token Budget Saturation and Mechanistic Early Detection of Reasoning Non-Convergence in Chain-of-Thought Models

5 · arXiv · cs.CL · Jul 20, 2026

BIRD: Bootstrapped Iterative Self-Reasoning Distillation reduces LLM chain-of-thought length while improving accuracy

Researchers introduce BIRD (Bootstrapped Iterative Self-Reasoning Distillation), a two-stage method for compressing chain-of-thought reasoning in large language models without sacrificing accuracy. The approach first fine-tunes a model on brevity-instructed correct traces to warm-start the rollout distribution, then applies on-policy reverse-KL distillation against a concise self-teacher. On Qwen3-8B, BIRD improves MATH-500 accuracy from 86.2% to 92.0% while cutting average response length from 3,099 to 1,115 tokens, outperforming cold-start on-policy distillation baselines.
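For readers unfamiliar with the term, "reverse-KL" here refers to the direction of the divergence: the expectation is taken under the student's own distribution rather than the teacher's. The toy sketch below contrasts the two directions on a two-token vocabulary. The distributions and function names are illustrative assumptions, not BIRD's training code.

```python
import math


def kl(p, q):
    """KL(p || q) for categorical distributions over the same support.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(student, teacher):
    """KL(student || teacher): expectation under the student's samples."""
    return kl(student, teacher)


def forward_kl(student, teacher):
    """KL(teacher || student): expectation under the teacher's samples."""
    return kl(teacher, student)


# Hypothetical next-token distributions over a two-token vocabulary.
teacher = [0.9, 0.1]
student = [0.5, 0.5]

print(f"reverse KL = {reverse_kl(student, teacher):.4f}")
print(f"forward KL = {forward_kl(student, teacher):.4f}")
```

The reverse direction penalizes the student most where it places mass that the teacher considers unlikely, which is one common motivation for using it with on-policy rollouts.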

Evaluation and Benchmarking · Inference Economics · AIME · BIRD · Qwen3 · +1 more

6 · arXiv · cs.CL · Jul 9, 2026

AdaPrefix-GRPO: Adaptive prefix control doubles GRPO accuracy on hard math reasoning

A new arXiv preprint introduces AdaPrefix-GRPO, a method that addresses GRPO's failure to learn from problems where no rollout succeeds. It prepends correct solution prefixes and dynamically adjusts the prefix length to maintain a ~50% success rate per problem throughout training. The prefix assistance is gradually withdrawn so that the final model solves problems unaided. On hard math benchmarks, the method more than doubles GRPO accuracy for a 0.6B model (2.1x) and achieves a 1.7x improvement on AIME while halving trace length; gains are larger on smaller models. The implementation requires only data preparation changes and a loss mask, leaving the trainer unchanged.
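The mechanism can be sketched in a few lines. GRPO normalizes rewards within each group of rollouts, so a group in which every rollout fails yields zero advantage everywhere and no gradient. The controller and mask below are hypothetical stand-ins for the paper's method: the proportional update rule, the `step` size, and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def update_prefix_fraction(fraction, success_rate, target=0.5, step=0.2):
    """Hypothetical controller: reveal more of the solution when the model
    succeeds too rarely, less when it succeeds too often."""
    new_fraction = fraction + step * (target - success_rate)
    return min(1.0, max(0.0, new_fraction))


def loss_mask(prefix_len, total_len):
    """0 for supplied prefix tokens (no loss), 1 for model-generated tokens."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie between 0 and total_len")
    return [0] * prefix_len + [1] * (total_len - prefix_len)


# All rollouts fail: every advantage is zero, so there is nothing to learn from.
print(group_relative_advantages([0, 0, 0, 0]))

# Mixed outcomes give a usable signal.
print(group_relative_advantages([1, 0, 1, 0]))

# The controller lengthens the prefix after a failed group and shortens it
# after a fully solved one.
print(update_prefix_fraction(0.5, success_rate=0.0))
print(update_prefix_fraction(0.5, success_rate=1.0))

# Only the last two of five tokens contribute to the loss.
print(loss_mask(prefix_len=3, total_len=5))
```

Because the mask and the prefix are applied when the training examples are built, a sketch like this is consistent with the summary's claim that the trainer itself does not need to change.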

Evaluation and Benchmarking · Alignment and RLHF · AIME · Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems · Qwen3-1.7B · +2 more

7 · OpenAI Blog · May 20, 2026

OpenAI Neural Theorem Prover Solves Formal Math Olympiad Problems in Lean

OpenAI developed a neural theorem prover integrated with the Lean proof assistant that can solve challenging high-school olympiad problems, including problems from AMC12, AIME, and two IMO-adapted problems. The system demonstrates automated formal mathematical reasoning at a level previously requiring human expertise. This represents a significant capability milestone in AI-assisted formal verification and mathematical problem-solving.
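To give a sense of what "formal" means here, the fragment below shows two machine-checked statements with their proofs. It is written in Lean 4 syntax, which is an assumption since the summary does not say which Lean version the system targets, and the statements are toy examples rather than problems solved by the prover.

```lean
-- A closed arithmetic fact, checked by computation.
example : 2 + 2 = 4 := rfl

-- A universally quantified statement, proved by citing a library lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A neural prover's task is to produce the proof term or tactic script on the right-hand side; the Lean kernel then accepts or rejects it, so a found proof is verified rather than merely plausible.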

Frontier Model Releases · Evaluation and Benchmarking · AIME · Neural Theorem Prover · OpenAI · +3 more

7 · DeepSeek News · May 18, 2026

DeepSeek-R1-Lite-Preview Launched with o1-Level Reasoning Performance

DeepSeek has released DeepSeek-R1-Lite-Preview, a reasoning-focused model claiming o1-preview-level performance on AIME and MATH benchmarks. The model features a transparent, real-time chain-of-thought process and demonstrates inference scaling behavior where longer reasoning chains yield better results. DeepSeek has indicated that open-source model weights and a full API are forthcoming. The model is currently accessible via chat.deepseek.com.

Frontier Model Releases · Evaluation and Benchmarking · o1-preview · DeepSeek V4 · AIME · +4 more