MaDI-Bench: First end-to-end benchmark for relational table data integration
Researchers introduce MaDI-Bench (Mannheim Data Integration Benchmark), the first benchmark covering the full data integration pipeline for relational tables, including schema matching, value normalization, entity matching, and conflict resolution. Prior benchmarks evaluated these steps in isolation or omitted stages, limiting research on holistic integration methods. The benchmark includes base tasks across multiple domains and a mechanism for generating task variants to prevent saturation, and it is validated against human-engineered, best-of-breed, and LLM-based pipelines. All artifacts are publicly available.
Related events (8)
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.
T1-Bench: Multi-scenario agent benchmark across 25 real-world domains
T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.
IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST
IBM Research and UC Berkeley have released IT-Bench and MAST, a benchmark suite and diagnostic framework aimed at evaluating why AI agents fail in enterprise IT environments. The work targets realistic IT operations tasks such as incident response, service management, and infrastructure automation. By categorizing failure modes systematically, MAST provides a structured taxonomy for understanding agent shortcomings beyond simple pass/fail metrics. This addresses a gap in enterprise-focused agent evaluation, where general benchmarks often fail to capture domain-specific complexity.
DABStep: Data Agent Benchmark for Multi-step Reasoning
Hugging Face introduces DABStep, a benchmark designed to evaluate data agents on multi-step reasoning tasks. The benchmark targets agentic systems that must perform complex, sequential data operations rather than single-step queries. It aims to fill a gap in evaluation tooling for realistic data analysis workflows involving tool use and chained reasoning.
DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation
DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks (open-web search, evidence collection, and multi-step derivation) where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks
IBM Research and Artificial Analysis have released ITBench-AA, a benchmark targeting agentic AI performance on enterprise IT operations tasks. Frontier models evaluated on the benchmark score below 50%, indicating significant capability gaps in real-world IT automation scenarios. The benchmark appears to be the first of its kind focused specifically on agentic enterprise IT workflows, covering tasks relevant to site reliability engineering and IT operations.
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
IBM Research introduces AssetOpsBench, a benchmark hosted on Hugging Face that evaluates AI agents on industrial asset operations tasks. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise and industrial contexts.
Introducing HealthBench
OpenAI has released HealthBench, a new evaluation benchmark designed to assess AI model performance and safety in healthcare settings. The benchmark was developed with input from over 250 physicians and targets realistic clinical scenarios. It aims to establish a shared standard for measuring how well AI models handle health-related tasks.
