Entity · benchmark

MATH-500

benchmark · active · math-500-7877810d · 3 events · first seen May 21, 2026

Aliases: MATH-500

Co-occurring entities

AIME · BIRD · Qwen3 · AceReason-14B · AIME 2026 · SnapKV · DeepSeek-R1-Distill-Qwen · ReasonAlloc · DeepSeek-R1-Distill-Llama-8B · RLVR · ROUGE-L · AIME24 · GRPO · AIME25 · PPO · Qwen3-4B · GPQA Diamond · Phi-4-mini · Qwen3-1.7B · LamPO

More like this (12)

MATH500 · MATH · MATH benchmark · MathVista · Math-Verify · MATH-MCQA · DeepMath · NuminaMath · Big-Math · FrontierMath · DAPO-Math · MTIA 500

Recent events (3)

5 · arXiv · cs.CL · Jul 20, 2026

BIRD: Bootstrapped Iterative Self-Reasoning Distillation reduces LLM chain-of-thought length while improving accuracy

Researchers introduce BIRD (Bootstrapped Iterative Self-Reasoning Distillation), a two-stage method for compressing chain-of-thought reasoning in large language models without sacrificing accuracy. The approach first fine-tunes a model on brevity-instructed correct traces to warm-start the rollout distribution, then applies on-policy reverse-KL distillation against a concise self-teacher. On Qwen3-8B, BIRD improves MATH-500 accuracy from 86.2% to 92.0% while cutting average response length from 3,099 to 1,115 tokens, outperforming cold-start on-policy distillation baselines.

Evaluation and Benchmarking · Inference Economics · AIME · BIRD · Qwen3 · +1 more
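The headline numbers in the BIRD summary can be restated as an accuracy gain and a length reduction. The short calculation below is only a worked check of the figures quoted above; the variable names are illustrative and nothing here comes from the paper's code.

```python
# Figures quoted in the BIRD summary (Qwen3-8B on MATH-500).
acc_before, acc_after = 86.2, 92.0   # accuracy, percent
len_before, len_after = 3099, 1115   # average response length, tokens

# Absolute accuracy change in percentage points.
acc_gain_points = acc_after - acc_before

# Fraction of tokens removed, and the equivalent compression ratio.
length_reduction = (len_before - len_after) / len_before
compression_ratio = len_before / len_after

print(f"accuracy gain: {acc_gain_points:.1f} points")
print(f"length reduction: {length_reduction:.1%}")
print(f"compression ratio: {compression_ratio:.2f}x")
```

Under these figures the responses are roughly 64% shorter (about 2.8x compression) while accuracy rises by 5.8 points.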

5 · arXiv · cs.AI · Jun 10, 2026

ReasonAlloc: Hierarchical KV Cache Budget Allocation for Long-CoT Reasoning Models

ReasonAlloc is a training-free framework that reframes decoding-time KV cache compression as a hierarchical budget allocation problem, operating at both layer-wise (offline) and head-wise (online) levels. The method identifies an architecture-driven pattern called the 'Reasoning Wave' to guide layer preallocation, then dynamically reallocates to information-rich heads during decoding. Evaluated on MATH-500 and AIME 2024 using DeepSeek-R1-Distill and AceReason models, it outperforms uniform-budget baselines (R-KV, SnapKV, Pyramid-RKV) especially at small budgets of 128–512 tokens, with negligible overhead.

Frontier Model Releases · Inference Economics · AceReason-14B · AIME 2026 · SnapKV · +4 more

5 · arXiv · cs.CL · May 21, 2026

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

Frontier Model Releases · Evaluation and Benchmarking · RLVR · ROUGE-L · AIME24 · +10 more
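To make the contrast in the LamPO summary concrete, here is a rough sketch of a scalar group-relative advantage next to an advantage built by aggregating pairwise reward gaps. It is not the paper's formulation: the function names, the normalization by group size, and the sigmoid weight are all assumptions chosen for illustration, and GRPO's usual standard-deviation scaling is left out for brevity.

```python
import math
from statistics import mean

def centered_advantages(rewards):
    """Scalar group-relative advantage: reward minus the group mean."""
    m = mean(rewards)
    return [r - m for r in rewards]

def pairwise_advantages(rewards, logprobs, weight_fn):
    """Aggregate weighted pairwise reward gaps within one response group.

    weight_fn maps a log-probability difference to a weight. It is a
    hypothetical placeholder, not the weighting used by LamPO.
    """
    n = len(rewards)
    advantages = []
    for i in range(n):
        total = sum(
            weight_fn(logprobs[i] - logprobs[j]) * (rewards[i] - rewards[j])
            for j in range(n)
            if j != i
        )
        advantages.append(total / n)
    return advantages

rewards = [1.0, 0.0, 0.0, 1.0]        # toy verifiable rewards
logprobs = [-12.0, -9.5, -15.0, -8.0]  # toy sequence log-probabilities

# With a constant weight, the pairwise form collapses to mean-centering.
uniform = pairwise_advantages(rewards, logprobs, lambda d: 1.0)

# An assumed non-uniform weight (a sigmoid of the log-prob difference).
weighted = pairwise_advantages(
    rewards, logprobs, lambda d: 1.0 / (1.0 + math.exp(-d))
)
```

With uniform weights the pairwise sum reduces to the mean-centered advantage, so in this sketch any difference from the scalar baseline comes entirely from how the comparisons are weighted.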