5Latent Space (swyx)·16d ago

Andon Labs on building frontier evals: VendingBench and evaluating Claude models

Latent Space interviews Lukas Petersson and Axel Backlund of Andon Labs, the creators of VendingBench, about their approach to building real-world AI evaluations. The conversation covers their experience evaluating Claude models across the capability spectrum from Haiku to Mythos, and their methodology for constructing durable frontier evals. The episode is notable for touching on a speculative or unreleased Claude model tier called 'Mythos.'

Frontier Model Releases Evaluation and Benchmarking Claude Mythos Axel Backlund Claude Haiku 4.5 Lukas Petersson VendingBench Andon Labs Latent Space Anthropic

Related guides (3)

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

8Anthropic News·18d ago·source ↗

Introducing Claude 3.5 Sonnet

Anthropic launches Claude 3.5 Sonnet, the first model in its Claude 3.5 family, claiming it outperforms Claude 3 Opus and competitor models on GPQA, MMLU, and HumanEval benchmarks while operating at twice the speed and mid-tier pricing ($3/$15 per million tokens). The model features a 200K context window, improved vision capabilities, and an internal agentic coding evaluation score of 64% versus 38% for Opus. Alongside the model, Anthropic introduces Artifacts on Claude.ai, a dedicated workspace for real-time editing of AI-generated content. The model was pre-deployment evaluated by the UK AI Safety Institute and assessed at ASL-2.

Long Context Evolution Frontier Model Releases claude.ai Thorn Amazon Bedrock +16 more

7The Batch·34h ago·source ↗

Independent evaluators struggle to benchmark Claude Fable 5 due to Anthropic's safety classifiers and data retention policies

Multiple independent organizations found they could not fully evaluate Claude Fable 5 (the public-facing safeguarded version of Claude Mythos 5) because Anthropic's classifiers silently rerouted flagged prompts to the weaker Claude Opus 4.8 or refused them outright. Evaluators including Artificial Analysis, Vals AI, and ARC Prize Foundation each adopted different scoring strategies — blended, pure, or abstaining entirely — producing widely divergent rankings depending on how refusals were handled. On GPQA Diamond, Claude Fable 5's score swung from 93.18% (2nd place) to 55.56% (94th place) depending on whether refusals were counted as failures. The episode surfaces a structural tension between safety-oriented deployment constraints and the ability of the field to independently measure frontier model capabilities.

Frontier Model Releases Evaluation and Benchmarking Artificial Analysis ARC Prize Foundation Claude Mythos +11 more

7The Batch·10d ago·source ↗

The Batch: Claude Mythos 5 / Fable 5 debut, Apple AFM 3, Google Live Translate, OpenAI IPO filing, FrontierCode benchmark

Anthropic launched Claude Fable 5 (a safety-guardrailed model) and Claude Mythos 5 (same underlying model with safeguards removed, for vetted cyberdefense/infrastructure users via Project Glasswing with US government collaboration), both priced at $10/$50 per million tokens. Apple released five new Apple Foundation Models (AFM 3) spanning on-device and cloud tiers, built with Google and Nvidia infrastructure. Additional headlines cover Google's Gemini 3.5 Live Translate (70+ languages, real-time), OpenAI's confidential SEC IPO filing, a NotebookLM upgrade to Gemini 3.5, and Cognition's FrontierCode benchmark for code-quality evaluation where Claude Opus 4.8 leads at 34.3%.

Frontier Model Releases Evaluation and Benchmarking Claude Mythos Claude Opus 4.6 Google +19 more

9Anthropic News·19d ago·source ↗

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. The model claims top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.

Long Context Evolution Frontier Model Releases GPT-5.2 Claude Opus 4.6 adaptive thinking +13 more

8The Batch·18d ago·source ↗

Claude Mythos Preview: Limited-Release Frontier Model with Exceptional Cybersecurity Capabilities

Anthropic has published a 244-page model card for Claude Mythos Preview, a frontier model not yet commercially available, which autonomously discovered thousands of high-severity vulnerabilities in popular operating systems and browsers during testing. To mitigate risks before potential deployment, Anthropic assembled Project Glasswing, a consortium of over 40 organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in model credits to patch vulnerabilities proactively. The model substantially outperforms Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across multiple benchmarks including CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), HLE (64.7%), and GraphWalks long-context (80%). The Batch notes parallels to OpenAI's GPT-2 limited-release strategy and characterizes the announcement as having elements of a publicity stunt alongside genuine safety concerns.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.1 Pro GraphWalks Linux Foundation +18 more

5Simon Willison'S Weblog·11d ago·source ↗

Simon Willison's initial impressions of Claude Fable 5

Simon Willison shares initial impressions of Claude Fable 5, a new Anthropic model. The body of the post is not available in the provided content, but the title indicates a hands-on evaluation or commentary from a prominent AI practitioner. As a tier-2 commentary source on what appears to be a new frontier model release, this is worth indexing for the model tracking thread.

Frontier Model Releases Claude Fable 5 Simon Willison Anthropic

6One Useful Thing·11d ago·source ↗

Ethan Mollick on working with Claude Fable (Mythos): a qualitative assessment

Ethan Mollick's 'One Useful Thing' newsletter describes hands-on experience with a model referred to as 'Mythos' (apparently Claude Fable), characterizing it as representing a significant capability jump in AI. The piece is a qualitative, practitioner-level assessment of what working with the model feels like in practice. As a tier-2 commentary source, this signals that Claude Fable is generating notable reactions from prominent AI observers.

Frontier Model Releases Enterprise Deployment Patterns Ethan Mollick Claude Fable Anthropic

8The Batch·8d ago·source ↗

Anthropic launches Claude Mythos 5 and Claude Fable 5; Andrew Ng introduces OpenCoworker desktop agent

Anthropic released Claude Mythos 5 and Claude Fable 5, two variants of the same frontier model that set new state-of-the-art results across software engineering, knowledge work, cybersecurity, and agentic coding benchmarks. Claude Fable 5 is the general-availability version with safety classifiers that restrict responses on security, biology, chemistry, and cutting-edge AI topics, priced at $10/$50 per million input/output tokens; Mythos 5 is restricted to selected partners via Project Glasswing. Separately, Andrew Ng and collaborators released OpenCoworker, a free open-source desktop agent harness built on top of aisuite, designed to give users privacy-preserving agentic workflows with their own API keys or local models. The newsletter also contextualizes the broader shift toward LLM-driven agent harnesses as frontier models have become capable enough to reliably drive next-action decisions.

Frontier Model Releases AI Safety Research Ollama DeepLearning.AI Claude Mythos +13 more