Almanac
benchmark

AIME 2026

benchmark · active · provisional · aime-2026-5418d929 · 2 events · first seen 15d ago

Aliases: AIME 2026, AIME 2024



Recent events (2)

5 · arXiv · cs.AI · 7d ago

ReasonAlloc: Hierarchical KV Cache Budget Allocation for Long-CoT Reasoning Models

ReasonAlloc is a training-free framework that reframes decoding-time KV cache compression as a hierarchical budget allocation problem, operating at both the layer level (offline) and the head level (online). The method identifies an architecture-driven pattern called the 'Reasoning Wave' to guide layer preallocation, then dynamically reallocates budget to information-rich heads during decoding. Evaluated on MATH-500 and AIME 2024 with DeepSeek-R1-Distill and AceReason models, it outperforms uniform-budget baselines (R-KV, SnapKV, Pyramid-RKV), especially at small budgets of 128–512 tokens, with negligible overhead.
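The two-level allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ReasonAlloc's implementation: the per-layer weights standing in for the 'Reasoning Wave' profile, the per-head "information richness" scores, and the top-k token eviction rule are hypothetical placeholders, and every helper name here (`proportional_split`, `allocate_layers`, `allocate_heads`, `keep_indices`) is invented for illustration.

```python
from typing import Sequence


def proportional_split(total: int, weights: Sequence[float], floor: int = 0) -> list[int]:
    """Split an integer token budget in proportion to non-negative weights.

    Every slot first gets `floor` tokens; the rest is divided proportionally
    and rounded with the largest-remainder method, so the parts sum to `total`.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("need at least one weight")
    if total < floor * n:
        raise ValueError("total budget is smaller than the per-slot floor")
    spare = total - floor * n
    wsum = sum(weights)
    if wsum <= 0:  # no signal: fall back to a uniform split
        weights, wsum = [1.0] * n, float(n)
    raw = [spare * w / wsum for w in weights]
    alloc = [int(r) for r in raw]
    # Hand the leftover tokens to the slots with the largest fractional parts.
    leftover = spare - sum(alloc)
    order = sorted(range(n), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return [a + floor for a in alloc]


def allocate_layers(total_budget: int, layer_profile: Sequence[float],
                    min_per_layer: int = 1) -> list[int]:
    """Offline step: fixed per-layer budgets from a layer profile.

    ASSUMPTION: `layer_profile` is a placeholder for the 'Reasoning Wave'
    pattern; its real shape is not given in the summary.
    """
    return proportional_split(total_budget, layer_profile, floor=min_per_layer)


def allocate_heads(layer_budget: int, head_scores: Sequence[float],
                   min_per_head: int = 1) -> list[int]:
    """Online step: redistribute one layer's budget across its heads.

    ASSUMPTION: `head_scores` is a hypothetical per-head importance score
    recomputed during decoding; the actual score is not described.
    """
    return proportional_split(layer_budget, head_scores, floor=min_per_head)


def keep_indices(token_scores: Sequence[float], budget: int) -> list[int]:
    """Generic top-k eviction: keep the `budget` highest-scoring cached tokens
    of one head, returned in their original (positional) order."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    layers = allocate_layers(512, [1.0, 3.0, 2.0, 2.0])  # hypothetical 4-layer profile
    heads = allocate_heads(layers[1], [1.0, 6.0, 3.0])    # hypothetical head scores
    print(layers, heads, keep_indices([0.2, 0.9, 0.1, 0.5], 2))
```

For example, `allocate_layers(512, [1.0, 3.0, 2.0, 2.0])` returns `[65, 191, 128, 128]`, which sums to the full 512-token budget. The largest-remainder rounding is one simple way to keep integer budgets exact at both levels; the per-slot floor keeps any layer or head from being starved entirely, which matters most at the small budgets the summary highlights.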

7 · The Batch · 15d ago

Z.ai's GLM-5.1 Open-Weights Model Targets Multi-Hour Agentic Coding Tasks with Iterative Self-Evaluation

Z.ai released GLM-5.1, a 754B-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, capable of cycling through planning, execution, and strategy revision hundreds of times over sessions lasting up to eight hours. The model achieves the top open-weights score on the Artificial Analysis Intelligence Index and third place on Arena's Code leaderboard, and it leads SWE-Bench Pro in Z.ai's own evaluations at 58.4 percent. Weights are available on Hugging Face under an MIT license, with API pricing roughly 40 percent higher than its predecessor's but still below that of comparable proprietary models. No technical report has been published, leaving architecture and training details undisclosed.