5Hugging Face Blog·24d ago

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks

IBM Research and Artificial Analysis have released ITBench-AA, a benchmark targeting agentic AI performance on enterprise IT operations tasks. Frontier models evaluated on the benchmark score below 50%, indicating significant capability gaps in real-world IT automation scenarios. The benchmark appears to be the first of its kind focused specifically on agentic enterprise IT workflows, covering tasks relevant to site reliability engineering and IT operations.

Evaluation and Benchmarking Enterprise Deployment Patterns Agent and Tool Ecosystem IBM Research Artificial Analysis ITBench-AA Hugging Face

Related guides (4)

Hugging Face

Hugging Face: The Home of Open-Source AI

Read asBeginner In-depth

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Read asIn-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

IBM Research and UC Berkeley have released IT-Bench and MAST, a benchmark suite and diagnostic framework aimed at evaluating why AI agents fail in enterprise IT environments. The work targets realistic IT operations tasks such as incident response, service management, and infrastructure automation. By categorizing failure modes systematically, MAST provides a structured taxonomy for understanding agent shortcomings beyond simple pass/fail metrics. This addresses a gap in enterprise-focused agent evaluation, where general benchmarks often fail to capture domain-specific complexity.

IBM Research UC Berkeley IT-Bench

4Hugging Face Blog·1mo ago·source ↗

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.

Evaluation and Benchmarking Enterprise Deployment Patterns IBM Research AssetOpsBench Hugging Face +1 more

4arXiv · cs.CL·10d ago·source ↗

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking Agent and Tool Ecosystem T1-Bench

6The Batch·34h ago·source ↗

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks — DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis) — are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers include GPT-5.5 (70% on DeepSWE), Claude Opus 4.7 (46.7% on ITBench-AA), and Claude Opus 4.7 (3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

Evaluation and Benchmarking Agent and Tool Ecosystem Artificial Analysis Llama 3.1 70B Datacurve +13 more

6arXiv · cs.AI·12d ago·source ↗

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking Agent and Tool Ecosystem Claude Opus 4.6 SWE-bench AARRI-Bench +2 more

7arXiv · cs.CL·25d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA)SWE-Bench Verified +2 more

5Hugging Face Blog·16d ago·source ↗

EVA-Bench Data 2.0: Expanded agentic tool-use evaluation benchmark with 121 tools and 213 scenarios

ServiceNow AI has released EVA-Bench Data 2.0, an evaluation benchmark covering 3 domains, 121 tools, and 213 scenarios for assessing agentic AI systems. The benchmark appears designed to measure tool-use and multi-step task completion capabilities across diverse enterprise-relevant contexts. This expands the evaluation surface for agent benchmarking, which remains an active area of development.

Evaluation and Benchmarking Agent and Tool Ecosystem ServiceNow AI EVA-Bench Data 2.0

6arXiv · cs.AI·2d ago·source ↗

TxBench-PP: New benchmark reveals AI agents struggle with preclinical pharmacology decisions

Researchers introduce TxBench-PP (TherapeuticsBench Preclinical Pharmacology), a 100-evaluation benchmark testing AI agents on realistic small-molecule drug discovery tasks including mechanism-of-action reasoning, compound-target engagement, and translational efficacy. Agents receive real workflow snapshots and are graded deterministically on structured answers. Across 16 model-harness configurations and 4,800 trajectories, no system reliably succeeded; the best performer, Claude Opus 4.8 with the Pi harness, passed only 59.3% of endpoint attempts. The results suggest current frontier models are not yet deployment-ready for autonomous preclinical pharmacology decision-making.

Evaluation and Benchmarking Agent and Tool Ecosystem Claude Opus 4.6 OpenAI TherapeuticsBench +3 more