Entity · benchmark

Arena Code

benchmark · active · arena-code-a4987bf0 · 2 events · first seen Jun 1, 2026

Aliases: Arena Code, Code Arena

More like this (12)

Arena.ai Code Arena WebDev BigCodeArena Arena AI Arena Search ArenaHard Chatbot Arena BashArena Chatbot Guardrails Arena Game Arena AgentBoard WebArena EvoArena

Recent events (2)

6 · The Batch · Jul 8, 2026

The Batch digest: China bans anthropomorphic bots, DiffusionGemma, Anthropic Claude Code study, Seedance 2.5, Code Arena

A multi-story digest covers five distinct AI developments: ByteDance and Alibaba are shutting down customizable humanlike AI agents ahead of China's July 15 Interim Measures for AI-Based Anthropomorphic Interactive Services; Google released DiffusionGemma, an experimental 26B MoE diffusion-based text model that generates 256-token blocks at 1,000+ tokens/sec on an H100; Anthropic published findings from 400,000 Claude Code sessions showing that domain expertise, not coding skill, drives agentic output volume; Seedance released version 2.5 of its video generator with higher resolution and longer clips; and Arena.ai expanded Code Arena to fullstack web development evaluation. The China regulatory action is the most significant item, representing a concrete enforcement moment for AI persona/companion regulation.

Frontier Model Releases · Evaluation and Benchmarking · Seedance 2.0 · Doubao · DiffusionGemma · +13 more

6 · The Batch · Jun 1, 2026

GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain

Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. GLM-5.1 is claimed to have top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro, and it is available under the MIT license via Hugging Face. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, and research least), with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.

Frontier Model Releases · Evaluation and Benchmarking · DeepLearning.AI · Artificial Analysis Intelligence Index · SWE-bench · +9 more