organization

ARC Prize Foundation

organization · active · provisional · arc-prize-foundation-3e08e1ee · 1 event · first seen 35h ago

Aliases: ARC Prize Foundation

Co-occurring entities

Artificial Analysis · Claude Mythos · Claude Opus 4.6 · Humanity's Last Exam · GPQA Diamond · Claude Fable 5 · Claude Code · ARC-AGI · Agents' Last Exam · GPT-5.5 · Vals AI · Anthropic

More like this (12)

AIMO Progress Prize · Anthology Fund · Rockefeller Foundation · CapReward · Central Square Foundation · OpenAI Foundation · Arc Institute · SAP AI Foundation · PyTorch Foundation · Reeve Foundation Multilingual Corpus · Rubric Reward · ARC Evals

Recent events (1)

7 · The Batch · 35h ago

Independent evaluators struggle to benchmark Claude Fable 5 due to Anthropic's safety classifiers and data retention policies

Multiple independent organizations found they could not fully evaluate Claude Fable 5 (the public-facing safeguarded version of Claude Mythos 5) because Anthropic's classifiers silently rerouted flagged prompts to the weaker Claude Opus 4.8 or refused them outright. Evaluators including Artificial Analysis, Vals AI, and ARC Prize Foundation each adopted different scoring strategies — blended, pure, or abstaining entirely — producing widely divergent rankings depending on how refusals were handled. On GPQA Diamond, Claude Fable 5's score swung from 93.18% (2nd place) to 55.56% (94th place) depending on whether refusals were counted as failures. The episode surfaces a structural tension between safety-oriented deployment constraints and the ability of the field to independently measure frontier model capabilities.
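The swing described above comes down to the choice of denominator. The following is a minimal sketch of that arithmetic, not any evaluator's actual scoring code: the `Outcome` enum, the `accuracy` function, and the ten-prompt toy run are all hypothetical and are not taken from the reported results.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical per-prompt result labels (assumed for illustration)."""
    CORRECT = "correct"
    INCORRECT = "incorrect"
    REFUSED = "refused"


def accuracy(outcomes, refusals_count_as_failures):
    """Return accuracy in [0, 1] under the chosen refusal policy."""
    correct = sum(1 for o in outcomes if o is Outcome.CORRECT)
    if refusals_count_as_failures:
        # Every prompt counts, so a refusal is scored like a wrong answer.
        denominator = len(outcomes)
    else:
        # Refused prompts are dropped, so only answered prompts are scored.
        denominator = sum(1 for o in outcomes if o is not Outcome.REFUSED)
    if denominator == 0:
        raise ValueError("no scorable outcomes")
    return correct / denominator


# Hypothetical toy run: 10 prompts, 5 correct, 1 incorrect, 4 refused.
toy_run = [Outcome.CORRECT] * 5 + [Outcome.INCORRECT] + [Outcome.REFUSED] * 4

strict = accuracy(toy_run, refusals_count_as_failures=True)    # 5 / 10
lenient = accuracy(toy_run, refusals_count_as_failures=False)  # 5 / 6
print(f"refusals as failures: {strict:.2%}")
print(f"refusals excluded:    {lenient:.2%}")
```

With identical model behavior, the toy run scores 50.00% when refusals count as failures and 83.33% when they are excluded, which is the same kind of gap, at a smaller scale, as the one reported for GPQA Diamond.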

Frontier Model Releases · Evaluation and Benchmarking · Artificial Analysis · ARC Prize Foundation · Claude Mythos · +11 more