Almanac
← Events
5Interconnects (Nathan Lambert)·1mo ago

Opus 4.6, Codex 5.3, and the post-benchmark era

A Interconnects commentary piece examining how to compare frontier AI models in 2026, using Anthropic's Opus 4.6 and OpenAI's Codex 5.3 as case studies. The piece appears to argue that traditional benchmarks are no longer sufficient for distinguishing model capabilities at the frontier. This reflects a broader industry shift toward more nuanced, task-specific evaluation methods.

Related guides (3)

Related events (8)

5Ai Snake Oil·1mo ago·source ↗

Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX

This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.

7The Batch·19d ago·source ↗

Claude Opus 4.8 Launches with Improved Honesty; Anthropic Previews Mythos-Class Models and Dynamic Workflows

Anthropic released Claude Opus 4.8 with improvements in coding, reasoning, agentic tasks, and notably better uncertainty flagging—approximately four times less likely than Opus 4.7 to let code flaws pass uncommented. Alongside the model, Anthropic introduced dynamic workflows in Claude Code enabling tens to hundreds of parallel subagents for large-scale engineering tasks, an effort-control slider, and a 3x price cut on fast mode. Anthropic also previewed Mythos-class models, positioned above Opus in capability, currently available to a limited set of organizations for cybersecurity work pending broader safety clearance. The same digest covers MiniMax M3 (open-weights, ~60% SWE-Bench Pro), Nvidia's RTX Spark superchip, Cosmos 3 world model, and a GR00T/Unitree robotics partnership.

4Don'T Worry About The Vase·1mo ago·source ↗

Opus 4.7 Part 3: Model Welfare

Zvi Mowshowitz publishes a commentary piece on model welfare in the context of Anthropic's Claude Opus 4.7, crediting Anthropic for enabling the discussion. The piece appears to engage with questions about the moral status or wellbeing of AI models. As a tier-2 commentary source, this reflects ongoing discourse in the AI safety and alignment community about how to think about model welfare as frontier models grow more capable.

7arXiv · cs.AI·16d ago·source ↗

AutoLab benchmark evaluates frontier models on ultra long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks across system optimization, puzzle-solving, model development, and CUDA kernel optimization, designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than single-turn or short-horizon settings. Evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation — not initial attempt quality — is the dominant success predictor. Claude Opus 4.6 stands out as the strongest performer, while most models including proprietary ones either terminate early or exhaust budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.

5Interconnects·1mo ago·source ↗

GPT 5.4 is a big step for Codex

A Tier 2 commentary piece from Interconnects evaluates GPT 5.4 in the context of OpenAI's Codex agent ecosystem, examining what the model release means for the frontier of AI agents. The author reflects on the current state of agent evaluation and notes a continued preference for Claude in practice. The piece offers analysis of how GPT 5.4 advances coding-agent capabilities relative to competing offerings.

9Anthropic News·19d ago·source ↗

Anthropic Introduces Claude Opus 4 and Sonnet 4 with Leading Coding Benchmarks and Agent Capabilities

Anthropic has released Claude Opus 4 and Claude Sonnet 4, positioning Opus 4 as the world's best coding model with 72.5% on SWE-bench and 43.2% on Terminal-bench, and Sonnet 4 at 72.7% on SWE-bench. Both models are hybrid (near-instant + extended thinking), support extended thinking with tool use in beta, parallel tool execution, and improved memory via local file access. Alongside the models, Anthropic is launching Claude Code as generally available with GitHub Actions, VS Code, and JetBrains integrations, plus four new API capabilities: code execution tool, MCP connector, Files API, and one-hour prompt caching. Pricing is unchanged from prior Opus and Sonnet tiers ($15/$75 and $3/$15 per million tokens respectively), with availability on Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.

5Interconnects·1mo ago·source ↗

Open Models in Perpetual Catch-Up

A commentary piece from Interconnects examining the structural dynamics between open-weight and closed frontier models, covering topics including the open-closed capability gap, distillation as a catch-up mechanism, innovation timescales, and conditions under which open models can win. The piece also addresses specialized models and gaps in the current open ecosystem. This is a high-level analytical framing of a persistent tension in the AI landscape rather than a report on a specific release or event.

7The Batch·3d ago·source ↗

Data Points: GLM-5.2 leads open models on coding benchmarks; SpaceX acquires Cursor; OpenRouter Fusion; Anthropic coding study; ChatGPT market share drops

Zhipu released GLM-5.2, a 744B-parameter open model under MIT license that ranks second only to Claude Opus 4.8 on long-horizon coding benchmarks including FrontierSWE and SWE-Marathon, featuring a 1M-token context window and a 2.9× compute reduction via IndexShare attention. SpaceX is acquiring Cursor (Anysphere) for $60B in stock, positioning Musk's company to compete in AI software tools using xAI's Colossus infrastructure. OpenRouter launched Fusion, a multi-model synthesis tool showing that budget model panels can match frontier model performance at half the cost. An Anthropic study of 400K Claude Code sessions found domain expertise—not coding skill—is the primary driver of agentic output, while a Munich court ruled Google liable for false claims in AI Overviews.