
ARC Prize Foundation

organization · active · provisional · arc-prize-foundation-3e08e1ee · 1 event · first seen 35h ago

Aliases: ARC Prize Foundation

Recent events (1)

7 · The Batch · 35h ago

Independent evaluators struggle to benchmark Claude Fable 5 due to Anthropic's safety classifiers and data retention policies

Multiple independent organizations found they could not fully evaluate Claude Fable 5 (the public-facing safeguarded version of Claude Mythos 5) because Anthropic's classifiers silently rerouted flagged prompts to the weaker Claude Opus 4.8 or refused them outright. Evaluators including Artificial Analysis, Vals AI, and ARC Prize Foundation adopted different scoring strategies (blended, pure, or abstaining entirely), producing widely divergent rankings depending on how refusals were handled. On GPQA Diamond, Claude Fable 5's score swung from 93.18% (2nd place) to 55.56% (94th place) depending on whether refusals were counted as failures. The episode surfaces a structural tension between safety-oriented deployment constraints and the ability of the field to independently measure frontier model capabilities.
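
To make the effect of refusal handling concrete, here is a minimal Python sketch of the two accounting choices the summary mentions: counting refusals as failures or leaving them out of the denominator. The function name `accuracy`, the outcome labels, and the counts are hypothetical and chosen only for illustration. They are not the evaluators' actual methods or data, and the sketch does not attempt to model the "blended" strategy or the rerouting to a weaker model.

```python
from collections import Counter


def accuracy(outcomes, refusals_as_failures):
    """Return accuracy over a list of per-question outcomes.

    Each outcome is one of "correct", "incorrect", or "refused"
    (hypothetical labels for this sketch). Returns None if no
    question ends up in the denominator.
    """
    counts = Counter(outcomes)
    correct = counts["correct"]
    answered = correct + counts["incorrect"]
    # Refusals either join the denominator as failures or are dropped.
    denominator = answered + counts["refused"] if refusals_as_failures else answered
    if denominator == 0:
        return None
    return correct / denominator


# Illustrative counts only: 9 correct, 1 incorrect, 10 refused.
outcomes = ["correct"] * 9 + ["incorrect"] * 1 + ["refused"] * 10

strict = accuracy(outcomes, refusals_as_failures=True)    # 9 / 20
lenient = accuracy(outcomes, refusals_as_failures=False)  # 9 / 10

print(f"refusals as failures: {strict:.2%}")
print(f"refusals excluded:    {lenient:.2%}")
```

With the same underlying answers, the two conventions give very different numbers, which is the kind of gap that can move a model many places on a leaderboard.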