Almanac
← Events
4Hugging Face Blog·1mo ago

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.

Related guides (4)

Related events (8)

6arXiv · cs.AI·15d ago·source ↗

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

5Hugging Face Blog·24d ago·source ↗

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks

IBM Research and Artificial Analysis have released ITBench-AA, a benchmark targeting agentic AI performance on enterprise IT operations tasks. Frontier models evaluated on the benchmark score below 50%, indicating significant capability gaps in real-world IT automation scenarios. The benchmark appears to be the first of its kind focused specifically on agentic enterprise IT workflows, covering tasks relevant to site reliability engineering and IT operations.

5Hugging Face Blog·1mo ago·source ↗

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

IBM Research and UC Berkeley have released IT-Bench and MAST, a benchmark suite and diagnostic framework aimed at evaluating why AI agents fail in enterprise IT environments. The work targets realistic IT operations tasks such as incident response, service management, and infrastructure automation. By categorizing failure modes systematically, MAST provides a structured taxonomy for understanding agent shortcomings beyond simple pass/fail metrics. This addresses a gap in enterprise-focused agent evaluation, where general benchmarks often fail to capture domain-specific complexity.

5Hugging Face Blog·1mo ago·source ↗

The Open Agent Leaderboard

IBM Research and Hugging Face have launched the Open Agent Leaderboard, a public benchmark for evaluating AI agents across standardized tasks. The leaderboard aims to provide transparent, reproducible comparisons of open and proprietary agent systems. This initiative addresses the growing need for rigorous evaluation infrastructure as the agent ecosystem matures.

7arXiv · cs.CL·25d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

7Openai Blog·1mo ago·source ↗

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

6Openai Blog·1mo ago·source ↗

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

OpenAI introduces MLE-bench, a benchmark designed to measure AI agent performance on machine learning engineering tasks. The benchmark draws from Kaggle competitions to evaluate agents on realistic ML engineering workflows. Initial results show that current agents, including those powered by o1-preview, achieve competitive performance on a subset of tasks but fall well short of top human competitors. The benchmark is intended to track progress in agentic ML capabilities over time.

6arXiv · cs.AI·12d ago·source ↗

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.