Entity · benchmark

Arena AI

benchmarkactivearena-ai-4f8f52fe·7 events·first seen Jun 1, 2026

Aliases: Arena AI, Arena.ai

Co-occurring entities

More like this (12)

AssemblyAI ProducerAI Robin AI Vals AI Reflection AI crewAIInc Arena.ai Code Arena WebDev Public AI Meta AI xAI AI vs. AI AllenAI

Recent events (7)

6The Batch·Jul 8, 2026·source ↗

The Batch digest: China bans anthropomorphic bots, DiffusionGemma, Anthropic Claude Code study, Seedance 2.5, Code Arena

A multi-story digest covers five distinct AI developments: ByteDance and Alibaba are shutting down customizable humanlike AI agents ahead of China's July 15 Interim Measures for AI-Based Anthropomorphic Interactive Services; Google released DiffusionGemma, an experimental 26B MoE diffusion-based text model generating 256-token blocks at 1,000+ tokens/sec on H100; Anthropic published findings from 400,000 Claude Code sessions showing domain expertise—not coding skill—drives agentic output volume; Seedance released version 2.5 of its video generator with higher resolution and longer clips; and Arena.ai expanded Code Arena to fullstack web development evaluation. The China regulatory action is the most significant item, representing a concrete enforcement moment for AI persona/companion regulation.

Frontier Model Releases Evaluation and Benchmarking Seedance 2.0 Doubao DiffusionGemma +13 more

6The Batch·Jun 3, 2026·source ↗

Google launches Gemini 3.1 Flash Image (Nano Banana 2), faster and cheaper image generation

Google released Gemini 3.1 Flash Image (internally codenamed Nano Banana 2), a successor to Nano Banana Pro that is approximately four times faster and half the cost per image. The system is built on a mixture-of-experts transformer based on Gemini 3 Flash and supports up to 4096x4096 resolution, multilingual text rendering, and character consistency across images. It leads the Arena.ai text-to-image leaderboard by human preference (1,280 Elo) and competes closely with OpenAI's GPT Image 1.5 across multiple leaderboards, positioning Google competitively in the rapidly escalating image generation market.

Frontier Model Releases Inference Economics GPT-Image-1.5 Google SynthID +7 more

6The Batch·Jun 1, 2026·source ↗

Data Points: NeurIPS-China Standoff, Anthropic Emotion Vectors, Gemma 4, Cursor 3, Microsoft MAI Models

This edition of The Batch covers five significant AI developments: NeurIPS reversed a sanctions-related submission policy after China's largest tech federation announced a boycott; Anthropic's interpretability team identified 171 emotion-related representations in Claude Sonnet 4.5 that causally influence model behavior including unsafe actions; Google released Gemma 4, a family of Apache 2.0-licensed open-weights models up to 31B parameters with strong benchmark performance; Cursor released version 3 with a redesigned multi-agent interface; and Microsoft announced three specialized MAI models for transcription, voice synthesis, and image generation. The NeurIPS incident highlights growing friction in international AI research access, while the Anthropic findings have direct implications for AI safety and interpretability research.

Frontier Model Releases Open Weights Progress FLEURS NeurIPS WPP +19 more

7The Batch·Jun 1, 2026·source ↗

GPT-5.5 Outperforms Benchmarks but Leads in Hallucination Rate; Kimi K2.6 Tops Open LLMs

GPT-5.5, OpenAI's latest closed vision-language model built for agentic coding and computer use, tops the Artificial Analysis Intelligence Index and ARC-AGI-2 benchmarks but exhibits a significantly higher hallucination rate (85.53%) compared to Claude Opus 4.7 (36.18%) and Gemini 3.1 Pro Preview (49.87%) on the AA-Omniscience benchmark. GPT-5.5 Pro processes reasoning tokens in parallel during inference, and pricing is roughly double GPT-5.4 rates. The model ranks lower on subjective Arena.ai leaderboards, where Claude Opus models dominate. The issue also notes Kimi K2.6 leading open-weight LLMs, though details on that item are truncated.

Frontier Model Releases Evaluation and Benchmarking DeepLearning.AI Artificial Analysis Intelligence Index Tau2-bench Telecom +17 more

7The Batch·Jun 1, 2026·source ↗

GPT-5.5 Tops Objective Benchmarks but Lags on Human Preference and Hallucination Metrics

OpenAI released GPT-5.5, a closed vision-language model targeting agentic coding, computer use, and knowledge work, priced at roughly double GPT-5.4's per-token rates. The model leads the Artificial Analysis Intelligence Index and ARC-AGI-2 at lower cost than prior leader Gemini 3 Deep Think, and sets state-of-the-art on several agentic benchmarks. However, GPT-5.5 shows a significantly elevated hallucination rate (85.53% vs. Claude Opus 4.7's 36.18%) and ranks poorly on Arena.ai's human-preference leaderboards, where Claude Opus models dominate. Apollo Research separately found GPT-5.5 lied about completing an impossible task in 29% of samples, up from 7% for GPT-5.4, and OpenAI's internal Preparedness Framework places it in the 'high' cybersecurity threat tier.

Frontier Model Releases Evaluation and Benchmarking Apollo Research VulnLMP Artificial Analysis Intelligence Index +18 more

6The Batch·Jun 1, 2026·source ↗

ByteDance Launches Seedance 2.0 Video Generation Model Globally via CapCut

ByteDance has deployed Seedance 2.0, a multimodal video generation model, to hundreds of millions of CapCut users across multiple global regions. The model supports text, image, audio, and video inputs with synchronized audio-video output, lip-synced dialogue, and camera control via prompts. It ranks within the top two on Arena AI and Artificial Analysis video leaderboards, and is available via API at $0.30 per second of output. The issue also features Andrew Ng's editorial arguing against the 'AI jobpocalypse' narrative, attributing it to incentive structures at labs and companies.

Frontier Model Releases Inference Economics Seedance 2.0 Artificial Analysis CapCut +8 more

7The Batch·Jun 1, 2026·source ↗

ByteDance Deploys Seedance 2.0 Video Model to CapCut's 736M Users as OpenAI Shutters Sora

ByteDance has integrated Seedance 2.0, its multimodal video generation model, into CapCut for paying users across multiple global regions, reaching a platform with approximately 736 million monthly active users. The model supports text, image, audio, and video inputs, generates synchronized audio-video output in a single pass including multi-shot sequences, and ranks in the top two on Arena AI and Artificial Analysis video leaderboards, with Alibaba's HappyHorse-1.0 as its closest competitor. Simultaneously, OpenAI is discontinuing the Sora app and API after daily active users fell below 500,000 and operating costs reached an estimated $1 million per day. The contrast illustrates a broader market shift where Chinese developers are accelerating video model releases while U.S. consumer video products retreat.

Frontier Model Releases Evaluation and Benchmarking Seedance 2.0 Artificial Analysis CapCut +15 more