Almanac
← Events
5Latent Space (swyx)·16d ago

Andon Labs on building frontier evals: VendingBench and evaluating Claude models

Latent Space interviews Lukas Petersson and Axel Backlund of Andon Labs, the creators of VendingBench, about their approach to building real-world AI evaluations. The conversation covers their experience evaluating Claude models across the capability spectrum from Haiku to Mythos, and their methodology for constructing durable frontier evals. The episode is notable for touching on a speculative or unreleased Claude model tier called 'Mythos.'

Related guides (3)

Related events (8)

8Anthropic News·18d ago·source ↗

Introducing Claude 3.5 Sonnet

Anthropic launches Claude 3.5 Sonnet, the first model in its Claude 3.5 family, claiming it outperforms Claude 3 Opus and competitor models on GPQA, MMLU, and HumanEval benchmarks while operating at twice the speed and mid-tier pricing ($3/$15 per million tokens). The model features a 200K context window, improved vision capabilities, and an internal agentic coding evaluation score of 64% versus 38% for Opus. Alongside the model, Anthropic introduces Artifacts on Claude.ai, a dedicated workspace for real-time editing of AI-generated content. The model was pre-deployment evaluated by the UK AI Safety Institute and assessed at ASL-2.

7The Batch·34h ago·source ↗

Independent evaluators struggle to benchmark Claude Fable 5 due to Anthropic's safety classifiers and data retention policies

Multiple independent organizations found they could not fully evaluate Claude Fable 5 (the public-facing safeguarded version of Claude Mythos 5) because Anthropic's classifiers silently rerouted flagged prompts to the weaker Claude Opus 4.8 or refused them outright. Evaluators including Artificial Analysis, Vals AI, and ARC Prize Foundation each adopted different scoring strategies — blended, pure, or abstaining entirely — producing widely divergent rankings depending on how refusals were handled. On GPQA Diamond, Claude Fable 5's score swung from 93.18% (2nd place) to 55.56% (94th place) depending on whether refusals were counted as failures. The episode surfaces a structural tension between safety-oriented deployment constraints and the ability of the field to independently measure frontier model capabilities.

7The Batch·10d ago·source ↗

The Batch: Claude Mythos 5 / Fable 5 debut, Apple AFM 3, Google Live Translate, OpenAI IPO filing, FrontierCode benchmark

Anthropic launched Claude Fable 5 (a safety-guardrailed model) and Claude Mythos 5 (same underlying model with safeguards removed, for vetted cyberdefense/infrastructure users via Project Glasswing with US government collaboration), both priced at $10/$50 per million tokens. Apple released five new Apple Foundation Models (AFM 3) spanning on-device and cloud tiers, built with Google and Nvidia infrastructure. Additional headlines cover Google's Gemini 3.5 Live Translate (70+ languages, real-time), OpenAI's confidential SEC IPO filing, a NotebookLM upgrade to Gemini 3.5, and Cognition's FrontierCode benchmark for code-quality evaluation where Claude Opus 4.8 leads at 34.3%.

9Anthropic News·19d ago·source ↗

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. The model claims top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.

8The Batch·18d ago·source ↗

Claude Mythos Preview: Limited-Release Frontier Model with Exceptional Cybersecurity Capabilities

Anthropic has published a 244-page model card for Claude Mythos Preview, a frontier model not yet commercially available, which autonomously discovered thousands of high-severity vulnerabilities in popular operating systems and browsers during testing. To mitigate risks before potential deployment, Anthropic assembled Project Glasswing, a consortium of over 40 organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in model credits to patch vulnerabilities proactively. The model substantially outperforms Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across multiple benchmarks including CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), HLE (64.7%), and GraphWalks long-context (80%). The Batch notes parallels to OpenAI's GPT-2 limited-release strategy and characterizes the announcement as having elements of a publicity stunt alongside genuine safety concerns.

5Simon Willison'S Weblog·11d ago·source ↗

Simon Willison's initial impressions of Claude Fable 5

Simon Willison shares initial impressions of Claude Fable 5, a new Anthropic model. The body of the post is not available in the provided content, but the title indicates a hands-on evaluation or commentary from a prominent AI practitioner. As a tier-2 commentary source on what appears to be a new frontier model release, this is worth indexing for the model tracking thread.

6One Useful Thing·11d ago·source ↗

Ethan Mollick on working with Claude Fable (Mythos): a qualitative assessment

Ethan Mollick's 'One Useful Thing' newsletter describes hands-on experience with a model referred to as 'Mythos' (apparently Claude Fable), characterizing it as representing a significant capability jump in AI. The piece is a qualitative, practitioner-level assessment of what working with the model feels like in practice. As a tier-2 commentary source, this signals that Claude Fable is generating notable reactions from prominent AI observers.

8The Batch·8d ago·source ↗

Anthropic launches Claude Mythos 5 and Claude Fable 5; Andrew Ng introduces OpenCoworker desktop agent

Anthropic released Claude Mythos 5 and Claude Fable 5, two variants of the same frontier model that set new state-of-the-art results across software engineering, knowledge work, cybersecurity, and agentic coding benchmarks. Claude Fable 5 is the general-availability version with safety classifiers that restrict responses on security, biology, chemistry, and cutting-edge AI topics, priced at $10/$50 per million input/output tokens; Mythos 5 is restricted to selected partners via Project Glasswing. Separately, Andrew Ng and collaborators released OpenCoworker, a free open-source desktop agent harness built on top of aisuite, designed to give users privacy-preserving agentic workflows with their own API keys or local models. The newsletter also contextualizes the broader shift toward LLM-driven agent harnesses as frontier models have become capable enough to reliably drive next-action decisions.