5 · Hugging Face Blog · 1mo ago

DABStep: Data Agent Benchmark for Multi-step Reasoning

Hugging Face introduces DABStep, a benchmark designed to evaluate data agents on multi-step reasoning tasks. The benchmark targets agentic systems that must perform complex, sequential data operations rather than single-step queries. It aims to fill a gap in evaluation tooling for realistic data analysis workflows involving tool use and chained reasoning.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DABStep · Hugging Face

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 15d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

4 · Hugging Face Blog · 1mo ago

DeepMath: A Lightweight Math Reasoning Agent with smolagents

Hugging Face published a blog post introducing DeepMath, a lightweight mathematical reasoning agent built on the smolagents framework. The post demonstrates how to construct a capable math reasoning agent using small models and tool-use patterns. This represents a practical application of the agent-tool ecosystem for specialized reasoning tasks.

Inference Economics · Agent and Tool Ecosystem · Hugging Face · DeepMath · smolagents

4 · arXiv · cs.AI · 3d ago

DRFLOW: Benchmark for Evaluating Agent Workflow Prediction from Heterogeneous Sources

Researchers introduce DRFLOW, a benchmark targeting a gap in deep research (DR) agent evaluation: predicting concrete, personalized action-step workflows rather than generating summaries or reports. The benchmark contains 100 tasks across five domains, grounded in over 3,900 sources, with seven diagnostic metrics covering factual grounding, step recovery, structural ordering, and personalization. A reference agent (DRFA) is also presented, improving over strong baselines by up to 10% average F1 but leaving substantial headroom, indicating workflow prediction remains a hard open problem for DR agents.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DRFLOW-Agent · DRFLOW

5 · Hugging Face Blog · 1mo ago

Evaluating Audio Reasoning with Big Bench Audio

Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.

Evaluation and Benchmarking · Multimodal Progress · Big Bench Audio · Hugging Face · Big Bench

6 · arXiv · cs.AI · 2d ago

Data Intelligence Agents (DIA): Autonomous coding agents for enterprise data integration and SQL generation

Researchers present Data Intelligence Agents (DIA), a production-deployed system of three autonomous coding agents (Data Interpreter, Schema Creator, Query Generator) that automate enterprise data integration workflows. Rather than generating text, the agents produce, execute, validate, and repair concrete artifacts (code, schemas, SQL) with shared memory for experience reuse. The Query Generator is evaluated across seven SQL benchmarks spanning four dialects and task categories, matching or surpassing best published results on all seven. The system is deployed in production for enterprise customers, making it a notable applied research contribution.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents · Data Intelligence Agents

7 · arXiv · cs.CL · 25d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases · Evaluation and Benchmarking · NeurIPS · Auto Benchmark Audit (ABA) · SWE-bench Verified

4 · arXiv · cs.CL · 10d ago

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · T1-Bench

5 · arXiv · cs.CL · 15d ago

DataCOPE: Unsupervised skill discovery framework for data-analytic agents

Researchers introduce DataCOPE, an unsupervised verifier-guided framework for discovering reusable procedural skills in data-analytic agents without labeled supervision or parameter updates. The system coordinates three components—a data-analytic agent, an unsupervised verifier, and a skill manager for contrastive skill distillation—with task-specific verifier instantiations for report-style and reasoning-style analysis. Evaluated on Deep Data Research and DABStep benchmarks, DataCOPE improves mean scores by 9.71% and 32.30% respectively across four model settings. The approach addresses a key bottleneck in agentic data analysis: acquiring reliable skill supervision at scale.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DABStep · Deep Research · DataCOPE