Almanac
benchmark

TAU-bench

benchmarkactiveprovisionaltau-bench-5730f8aa·5 events·first seen 15d ago

Aliases: TAU-bench, tau2-bench

Co-occurring entities

More like this (12)

Recent events (5)

9Anthropic News·15d ago·source ↗

Claude 3.7 Sonnet and Claude Code: Anthropic's First Hybrid Reasoning Model and Agentic Coding Tool

Anthropic has released Claude 3.7 Sonnet, described as their most capable model to date and the first hybrid reasoning model on the market, capable of operating in both standard and extended thinking modes within a single unified model. The model achieves state-of-the-art results on SWE-bench Verified and TAU-bench, with particular strength in coding and front-end web development. Alongside the model, Anthropic is launching Claude Code in limited research preview, a command-line agentic coding tool that can read/edit files, run tests, and push to GitHub. Pricing remains unchanged at $3/M input and $15/M output tokens, with availability across Claude.ai plans, Amazon Bedrock, and Google Cloud Vertex AI.

7Anthropic News·15d ago·source ↗

Claude Opus 4.1 Released with 74.5% SWE-bench Verified Score

Anthropic has released Claude Opus 4.1, an incremental upgrade to Claude Opus 4 focused on agentic tasks, coding, and reasoning. The model achieves 74.5% on SWE-bench Verified (without extended thinking) and shows notable gains in multi-file code refactoring and large-codebase debugging. It is available to paid Claude users, Claude Code, and via API on Anthropic, Amazon Bedrock, and Google Cloud Vertex AI at the same price as Opus 4. Anthropic notes substantially larger model improvements are planned for the coming weeks.

9Anthropic News·14d ago·source ↗

Anthropic introduces computer use capability, upgraded Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic announced three major developments: an upgraded Claude 3.5 Sonnet with significant coding improvements (SWE-bench Verified rising from 33.4% to 49.0%, surpassing all publicly available models including reasoning models), a new Claude 3.5 Haiku that matches Claude 3 Opus performance at Haiku-tier speed, and a public beta of 'computer use' — a capability allowing Claude to control computers by viewing screens, moving cursors, clicking, and typing. Computer use is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI, with early adopters including Replit, The Browser Company, and Cognition. Both safety institutes (US AISI and UK AISI) conducted pre-deployment testing, and the model was assessed as remaining within ASL-2 under Anthropic's Responsible Scaling Policy.

7arXiv · cs.CL·13d ago·source ↗

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

6arXiv · cs.AI·26h ago·source ↗

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.