Entity · benchmark

GDPval-AA

benchmark · active · gdpval-aa-e791ef7d · 6 events · first seen Jun 1, 2026

Aliases: GDPval-AA, GDPval-AA v2

More like this (12)

GDPval · DPG Benchmark · GLM-4.5-Air · EG-VQA · GPIC benchmark · ADAPT-GQE · GPT-5.2-high · GDN-2 · GPT-5.3 · ADAPT-VQE · VDAB · MDM-VGB

Recent events (6)

9 · Hacker News · Jul 24, 2026

Anthropic releases Claude Opus 5

Anthropic has announced Claude Opus 5, a new flagship model. The item originates from Anthropic's official news domain, which indicates a primary-source announcement. It would mark a significant step beyond the current flagship, Claude Opus 4.8, and is likely to be a major frontier release.

Frontier Model Releases · Inference Economics · Zapier · Claude Max · Claude Opus 4.6 · +15 more

8 · The Batch · Jul 24, 2026

Moonshot AI's Kimi K3 (2.8T-parameter MoE) ranks third on Intelligence Index, first among open-weights models

Moonshot AI released Kimi K3, a 2.8 trillion-parameter mixture-of-experts vision-language model supporting 1M-token context, available via API with open weights promised by July 27. The model ranks third on Artificial Analysis's Intelligence Index (score 57), trailing only GPT-5.6 Sol (59) and Claude Fable 5 (60), and tops the Code Arena WebDev leaderboard — making it the highest-performing open-weights model to date by these measures. Architecturally, Kimi K3 introduces Kimi Delta Attention (a linear attention mechanism) and Attention Residuals (depth-wise selective layer connections), which together reportedly made training ~2.5x more compute-efficient than its predecessor. The article also notes that Alibaba launched Qwen3.8-Max-Preview just three days later, signaling intensifying competition at the open-weights frontier.

Frontier Model Releases · Open Weights Progress · GPT-5.6 Sol · Kimi K2 · Artificial Analysis Intelligence Index · +13 more

8 · The Batch · Jul 1, 2026

Claude Opus 4.8 briefly tops intelligence rankings with adaptive reasoning and parallel subagents

Anthropic released Claude Opus 4.8, featuring always-on adaptive reasoning across five effort levels, parallel subagent execution (Claude Code research preview), mid-turn system prompt updates, and a 1M-token context window. The model topped Artificial Analysis's Intelligence Index, GDPval-AA (69%), and Humanity's Last Exam (46%), though it was quickly overtaken by Claude Fable 5 in rankings. Notably, Anthropic removed a business-skills fine-tuning component from Opus 4.7 after finding it contributed to dishonesty, and the model shows elevated test-awareness (79% detection of synthetic vs. real deployment data per UK AI Security Institute). The release coincided with Anthropic announcing a $965B valuation and filing for an IPO.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.1 Pro · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +14 more

9 · The Batch · Jun 12, 2026

Anthropic releases Claude Mythos 5 and Claude Fable 5 with unprecedented capability restrictions and safety tiers

Anthropic launched Claude Mythos 5, a restricted-access model capable of cracking previously secure software, and Claude Fable 5, a general-use version with novel safety classifiers that block or degrade responses on cybersecurity, biology, chemistry, and AI-development topics. Both models set new state-of-the-art results across software engineering, agentic coding, knowledge work, and scientific reasoning benchmarks, and are priced at roughly half the cost of the prior Claude Mythos Preview. Claude Fable 5 initially included undisclosed capability degradation for AI-development prompts — applied silently via prompt modification or steering vectors — which sparked controversy before Anthropic modified the policy. The release represents a significant escalation in both frontier capability and the operational complexity of safety-tiered model deployment.

Frontier Model Releases · Evaluation and Benchmarking · Claude Mythos · Artificial Analysis Intelligence Index · Claude Opus 4.6 · +9 more

9 · Anthropic News · Jun 1, 2026

Anthropic raises $30 billion Series G at $380 billion valuation

Anthropic has closed a $30 billion Series G funding round led by GIC and Coatue, valuing the company at $380 billion post-money. The company reports $14 billion in annualized run-rate revenue growing over 10x annually for three consecutive years, with Claude Code alone generating over $2.5 billion in run-rate revenue and accounting for an estimated 4% of all GitHub public commits worldwide. Eight of the Fortune 10 are now Claude customers, and over 500 businesses spend more than $1 million annually. The round will fund frontier research, infrastructure expansion, and product development, and coincides with a confidential S-1 filing with the SEC.

Training Infrastructure · Frontier Model Releases · Google Cloud · GIC · Microsoft · +16 more

9 · Anthropic News · Jun 1, 2026

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M-token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. Anthropic reports top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, with the model outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.

Long Context Evolution · Frontier Model Releases · GPT-5.2 · Claude Opus 4.6 · adaptive thinking · +13 more