Arena Leaderboard
arena-leaderboard-472c51e1 · 2 events · first seen 19d ago
Aliases: Arena Leaderboard, Arena Code Leaderboard
Recent events (2)
Z.ai's GLM-5.1 Open-Weights Model Targets Multi-Hour Agentic Coding Tasks with Iterative Self-Evaluation
Z.ai released GLM-5.1, a 754B-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks. It can cycle through planning, execution, and strategy revision hundreds of times over sessions lasting up to eight hours. The model achieves the top score among open-weights models on the Artificial Analysis Intelligence Index and third place on Arena's Code leaderboard, and it leads SWE-Bench Pro at 58.4 percent in Z.ai's own evaluations. Weights are available on HuggingFace under the MIT license, with API pricing roughly 40 percent higher than its predecessor's but still below that of comparable proprietary models. No technical report has been published, leaving architecture and training details undisclosed.
Data Points: DeepSWE Benchmark, DeepSeek V4 Price Cuts, MAI-Image-2.5, Mythos Security Findings, MCP Stateless Update
This edition of The Batch covers five distinct AI developments: Datacurve's DeepSWE benchmark claims to fix critical grading flaws in SWE-bench Pro with hand-written verifiers and harder tasks; DeepSeek permanently cuts V4 Pro prices by 75%; Microsoft's MAI-Image-2.5 debuts third on the Arena leaderboard; Anthropic's Claude Mythos Preview found over 10,000 high- or critical-severity vulnerabilities in the first month of Project Glasswing, with remediation badly lagging discovery; and the Model Context Protocol proposes removing stateful sessions to enable stateless, load-balanced remote servers. Respectively, the items reflect movement in evaluation methodology, inference economics, multimodal generation, AI-assisted security, and agent tooling infrastructure.