Entity · benchmark

TAU-bench

benchmark · active · tau-bench-5730f8aa · 7 events · first seen Jun 1, 2026

Aliases: TAU-bench, tau2-bench, tau2-Bench

More like this (12)

ATE-Bench · T3Bench · ITBench-AA · T1-Bench · τ²-Bench · MTBench · TriggerBench · SorryBench · Tau2-bench Telecom · FeatBench · RuBench · SelectBench

Recent events (7)

5 · arXiv · cs.AI · 3d ago · source ↗

TRACE-ROUTER: Task-level LLM routing for agentic workflows using contextual bandits

TRACE-ROUTER is a new routing framework that addresses a fundamental mismatch in enterprise LLM deployment: existing per-call routers cannot correctly attribute feedback to individual routing decisions in long-horizon agentic workflows. The system assigns each task to a single model at admission using a contextual bandit and updates its policy using the task's terminal reward, jointly optimizing accuracy and latency. On tau2-Bench, it outperforms latency-matched interpolation between individual models by 7 to 8 accuracy points; on Terminal-Bench, it scores 7.1 accuracy points higher than the strongest single-model baseline with 36% lower latency.

Inference Economics · Enterprise Deployment Patterns · TRACE-ROUTER · TAU-bench · Terminal-Bench · +1 more
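
The task-level routing idea in the TRACE-ROUTER summary above can be illustrated with a minimal sketch. The summary does not say which bandit algorithm or reward shape the paper uses, so everything here is an assumption made for illustration: the class name `TaskLevelRouter`, the epsilon-greedy policy, the per-model linear reward estimator, and the `latency_weight` trade-off term are all hypothetical. The sketch shows only the two properties the summary does state: one model is chosen per task at admission, and the policy is updated once from the task's terminal outcome.

```python
import random


class TaskLevelRouter:
    """Hypothetical epsilon-greedy linear contextual bandit.

    One routing decision is made per task (not per LLM call), and the
    chosen model's estimator is updated once from the terminal outcome.
    """

    def __init__(self, models, n_features, epsilon=0.1, lr=0.05,
                 latency_weight=0.01, seed=0):
        self.models = list(models)
        self.epsilon = epsilon                # exploration probability
        self.lr = lr                          # SGD step size
        self.latency_weight = latency_weight  # assumed accuracy/latency trade-off
        self.rng = random.Random(seed)
        # One linear reward estimator per candidate model.
        self.weights = {m: [0.0] * n_features for m in self.models}

    def _predict(self, model, context):
        return sum(w * x for w, x in zip(self.weights[model], context))

    def route(self, context):
        """Pick a single model for the whole task at admission time."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self._predict(m, context))

    def update(self, model, context, success, latency_s):
        """Fold the task's terminal reward into the chosen model's estimator."""
        reward = (1.0 if success else 0.0) - self.latency_weight * latency_s
        error = reward - self._predict(model, context)
        self.weights[model] = [
            w + self.lr * error * x
            for w, x in zip(self.weights[model], context)
        ]
        return reward


# Usage: the context features (a bias term and a difficulty score) are made up.
router = TaskLevelRouter(["small", "large"], n_features=2, epsilon=0.0)
chosen = router.route([1.0, 0.5])
router.update(chosen, [1.0, 0.5], success=True, latency_s=4.0)
```

Because the reward arrives once per task, there is no need to split credit across the individual calls inside the workflow, which is the attribution problem the summary attributes to per-call routers.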

6 · arXiv · cs.CL · Jul 15, 2026 · source ↗

Function-Aware Fill-in-the-Middle Mid-Training Improves Coding Agent Foundation Models

Researchers propose a self-supervised mid-training objective called function-aware fill-in-the-middle (FIM) that exploits the structural isomorphism between a coding agent's action-observation-continuation loop and function call sites in ordinary code. Applied to Qwen2.5-Coder-Instruct (7B/14B) and Qwen3-8B on a 2.6B-token GitHub corpus, the method yields +2.8 to +5.4 point gains on SWE-Bench-Verified and SWE-Bench-Lite across multiple post-training pipelines. Notably, the technique also mitigates capability erosion on non-agent coding and tool-use benchmarks, suggesting the function-call inductive bias generalizes beyond the training domain.

Frontier Model Releases · Evaluation and Benchmarking · SWE-Smith · SWE-Bench Lite · Qwen2.5-Coder-32B-Instruct · +8 more

6 · arXiv · cs.AI · Jun 16, 2026 · source ↗

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

Evaluation and Benchmarking · AI Safety Research · Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations · GAIA · Open LLM Leaderboard · +3 more

7 · arXiv · cs.CL · Jun 3, 2026 · source ↗

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen2.5-7B · GRPO · Qwen3-4B · +7 more

9 · Anthropic News · Jun 3, 2026 · source ↗

Anthropic introduces computer use capability, upgraded Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic announced three major developments: an upgraded Claude 3.5 Sonnet with significant coding improvements (SWE-bench Verified rising from 33.4% to 49.0%, surpassing all publicly available models including reasoning models), a new Claude 3.5 Haiku that matches Claude 3 Opus performance at Haiku-tier speed, and a public beta of 'computer use', a capability that lets Claude control computers by viewing screens, moving cursors, clicking, and typing. Computer use is available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI, with early adopters including Replit, The Browser Company, and Cognition. The US and UK AI Safety Institutes (US AISI and UK AISI) both conducted pre-deployment testing, and the model was assessed as remaining within ASL-2 under Anthropic's Responsible Scaling Policy.

Frontier Model Releases · Evaluation and Benchmarking · OpenAI o1-preview · Amazon Bedrock · Claude 3.5 Sonnet · +15 more

7 · Anthropic News · Jun 2, 2026 · source ↗

Claude Opus 4.1 Released with 74.5% SWE-bench Verified Score

Anthropic has released Claude Opus 4.1, an incremental upgrade to Claude Opus 4 focused on agentic tasks, coding, and reasoning. The model achieves 74.5% on SWE-bench Verified (without extended thinking) and shows notable gains in multi-file code refactoring and large-codebase debugging. It is available to paid Claude users, in Claude Code, and through the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI at the same price as Opus 4. Anthropic notes that substantially larger model improvements are planned for the coming weeks.

Frontier Model Releases · Evaluation and Benchmarking · Rakuten Group · Amazon Bedrock · Claude Opus 4.6 · +9 more

9 · Anthropic News · Jun 1, 2026 · source ↗

Claude 3.7 Sonnet and Claude Code: Anthropic's First Hybrid Reasoning Model and Agentic Coding Tool

Anthropic has released Claude 3.7 Sonnet, described as their most capable model to date and the first hybrid reasoning model on the market, capable of operating in both standard and extended thinking modes within a single unified model. The model achieves state-of-the-art results on SWE-bench Verified and TAU-bench, with particular strength in coding and front-end web development. Alongside the model, Anthropic is launching Claude Code in limited research preview, a command-line agentic coding tool that can read/edit files, run tests, and push to GitHub. Pricing remains unchanged at $3/M input and $15/M output tokens, with availability across Claude.ai plans, Amazon Bedrock, and Google Cloud Vertex AI.

Frontier Model Releases · Evaluation and Benchmarking · Canva · Amazon Bedrock · GitHub · +14 more