Entity · benchmark

τ²-Bench

benchmark · active-bench-49ae3732 · 5 events · first seen May 19, 2026

Aliases: τ²-Bench, τ-bench

More like this (12)

IT-Bench, Terminal-Bench, MT-Bench, T3Bench, T2I-CompBench, T1-Bench, TriggerBench, Tau2-bench Telecom, TAU-bench, ATE-Bench, MTBench, Int-Bench

Recent events (5)

5 · arXiv · cs.AI · 2d ago

CAM-DF: Cost-aware stopping framework reduces LLM agent tool acquisition by 37% without performance loss

Researchers introduce CAM-DF (cost-aware marginal decision-focused stopping), a method for deciding how many tools an LLM agent should acquire from a ranked list, balancing task coverage against cost, context load, and privacy exposure. The approach trains on the offline gap between stopping now and continuing, and comes with a theoretical proof that score-only rules are suboptimal under heterogeneous costs. Evaluated on 1,343 tasks across five tool-use domains including τ-bench Retail, CAM-DF exposes agents to 37% fewer tools than full access while maintaining comparable task success. The method is a lightweight pre-execution plugin that works with existing tool rankings without fine-tuning the underlying LLM.

Inference Economics · Agent and Tool Ecosystem · Scores Are Not Decisions: Cost-Aware Stopping for Tool Acquisition in LLM Agents · CAM-DF · τ²-Bench
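
To make the contrast between a score-only rule and a cost-aware rule concrete, here is a minimal sketch. It is not the CAM-DF method itself (which is trained on offline stop-versus-continue gaps); it is a simple greedy rule that stops at the first ranked tool whose estimated marginal gain does not cover its cost. The `Tool` fields, the tool names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str             # hypothetical tool name
    marginal_gain: float  # assumed estimate of added task coverage (also used as the ranking score)
    cost: float           # assumed combined cost: context load, latency, privacy exposure


def score_only_stop(ranked_tools, score_threshold):
    """Score-only rule: keep taking tools while the score stays above a fixed threshold."""
    k = 0
    for tool in ranked_tools:
        if tool.marginal_gain <= score_threshold:
            break
        k += 1
    return k


def cost_aware_stop(ranked_tools, cost_weight=1.0):
    """Greedy cost-aware rule: stop at the first tool whose gain does not cover its weighted cost."""
    k = 0
    for tool in ranked_tools:
        if tool.marginal_gain <= cost_weight * tool.cost:
            break
        k += 1
    return k


# Hypothetical ranked list with heterogeneous costs.
ranked = [
    Tool("search_orders", 0.40, 0.05),
    Tool("issue_refund", 0.20, 0.05),
    Tool("export_customer_data", 0.15, 0.30),  # useful, but expensive to expose
    Tool("update_address", 0.10, 0.02),
]

k_score = score_only_stop(ranked, score_threshold=0.12)
k_cost = cost_aware_stop(ranked)
print(k_score, k_cost)
```

With these made-up numbers the score-only rule acquires the expensive third tool because its score clears the threshold, while the cost-aware rule stops before it. This illustrates the kind of gap the summary describes when costs differ across tools.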

5 · arXiv · cs.CL · Jul 13, 2026

GRACE: Graph-Regularized Agentic Context Evolution for reliable long-horizon instruction updates

Researchers introduce GRACE, a method that maintains a deployed LLM agent's persistent system-level instructions as a typed semantic graph rather than flat text, enabling local verification of updates within typed node neighborhoods. Evaluated on a telecom agent harness derived from τ²-bench under distribution shift, GRACE improves pass³ reliability from 0.091 (Gemini 2.5 Flash zero-shot) to 0.673±0.136, surpassing a Gemini 3.1 Pro zero-shot reference of 0.242. The work identifies structural substrate and consolidation mechanisms as key requirements for reliable long-horizon agentic context evolution. The flat-text baseline finishes at 0.191, underscoring the practical gap GRACE addresses.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.1 Pro · Google · Gemini-2.5-Flash-Lite
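
The pass³ figures in the summary above are easier to interpret with the metric written out. The sketch below assumes pass³ is the pass^k reliability metric from τ-bench with k = 3: for each task, the chance that k trials drawn from the n recorded trials all succeeded, averaged over tasks. The summary does not state how many trials GRACE ran per task, so the function name `pass_hat_k`, the trial count, and the success counts are illustrative assumptions.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k as the mean over tasks of C(c, k) / C(n, k).

    successes_per_task: number of successful trials c for each task
    n_trials: number of trials n run per task (assumed equal across tasks)
    k: number of trials that must all succeed
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    denom = comb(n_trials, k)
    # math.comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    return sum(comb(c, k) for c in successes_per_task) / (denom * len(successes_per_task))


# Hypothetical results: 4 tasks, 4 trials each.
successes = [4, 3, 2, 0]
print(pass_hat_k(successes, n_trials=4, k=1))  # plain average success rate
print(pass_hat_k(successes, n_trials=4, k=3))  # requires 3 successes in a row
```

Because every one of the k trials must succeed, pass^k falls quickly for agents that are only sometimes right, which is why it is used as a reliability measure rather than a single-attempt success rate.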

6 · arXiv · cs.CL · Jul 10, 2026

Proactive Memory Agent reduces behavioral state decay in long-horizon tasks

Researchers introduce a plug-and-play memory agent module that runs alongside an unmodified action agent, maintaining a structured memory bank and selectively injecting reminders when relevant state would otherwise be lost in long trajectories. The approach addresses 'behavioral state decay' — the failure mode where task-critical context gets buried or pushed out of the context window. Evaluated on Terminal-Bench 2.0 and τ²-Bench, the module yields +8.3 pp and +6.8 pp pass@1 gains respectively, with ablations confirming selective injection outperforms always-on or passive retrieval approaches. The authors also train an open-weight memory policy (Qwen3.5-27B) using SFT and GRPO, showing partial transfer to Terminal-Bench.

Long Context Evolution · Open Weights Progress · GRPO · Qwen3.6-27B · Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents
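
The difference between selective injection and always-on reminders can be sketched in a few lines. This is an illustration of the general idea only, not the authors' module: the `MemoryItem` structure, the keyword-based relevance check, and the fixed staleness window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    key: str        # hypothetical short label for the stored state
    text: str       # reminder text that could be injected
    last_seen: int  # step index at which this state last appeared in context


def select_reminders(bank, current_step, query_terms, window=5):
    """Return reminders that are relevant to the current step AND have gone stale.

    An item is relevant if any query term occurs in its text (a stand-in for a
    learned relevance policy), and stale if it has not appeared in context
    within the last `window` steps.
    """
    reminders = []
    for item in bank:
        relevant = any(term.lower() in item.text.lower() for term in query_terms)
        stale = current_step - item.last_seen > window
        if relevant and stale:
            reminders.append(item.text)
    return reminders


# Hypothetical memory bank at step 20 of a long trajectory.
bank = [
    MemoryItem("refund_policy", "Refunds over $100 require manager approval", last_seen=2),
    MemoryItem("contact_pref", "User prefers email contact", last_seen=18),
    MemoryItem("port", "Server listens on port 8080", last_seen=1),
]

print(select_reminders(bank, current_step=20, query_terms=["refund"]))
print(select_reminders(bank, current_step=20, query_terms=["email"]))
```

An always-on variant would inject every item at every step, and a passive variant would wait for the action agent to ask. The selective rule here injects only what is both needed now and no longer in recent context.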

5 · arXiv · cs.CL · Jul 8, 2026

CurateEvo: Failure-Driven Dynamic Data Curation for Agentic LLM Post-Training

CurateEvo is a new framework for agentic post-training that treats data curation as a dynamic, evolving process rather than a fixed preprocessing step. The system represents curation strategies as executable code and iteratively rewrites them based on failed trajectories from a held-out development set, producing SFT data, RL data, and an inference-time memory bank. Evaluated on ACEBench-Agent, BFCL-V4, and τ²-Bench, CurateEvo outperforms prior curation methods by 3.2 and 2.7 average points in labeled and wild-data settings respectively, while also reducing curation overhead.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ACEBench-Agent · BFCL-V3 · CurateEvo

7 · arXiv · cs.CL · May 19, 2026

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory is a fully automated framework for training tool-use LLM agents via Agentic Reinforcement Learning, addressing two key bottlenecks: scalable execution environments and realistic multi-turn training data. It autonomously constructs stateful, executable tool environments from authentic resources and synthesizes natural trajectories with implicit human intents via topology-aware sampling. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and improves Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks, outperforming prior approaches that use 5x more environments.

Training Infrastructure · Evaluation and Benchmarking · VitaBench · MCP-Atlas · BFCLv3