DRFLOW: Benchmark for Evaluating Agent Workflow Prediction from Heterogeneous Sources
Researchers introduce DRFLOW, a benchmark targeting a gap in deep research (DR) agent evaluation: predicting concrete, personalized action-step workflows rather than generating summaries or reports. The benchmark contains 100 tasks across five domains, grounded in over 3,900 sources, with seven diagnostic metrics covering factual grounding, step recovery, structural ordering, and personalization. A reference agent (DRFA) is also presented, improving over strong baselines by up to 10% average F1 but leaving substantial headroom, indicating workflow prediction remains a hard open problem for DR agents.
Related events (8)
DABStep: Data Agent Benchmark for Multi-step Reasoning
Hugging Face introduces DABStep, a benchmark designed to evaluate data agents on multi-step reasoning tasks. The benchmark targets agentic systems that must perform complex, sequential data operations rather than single-step queries. It aims to fill a gap in evaluation tooling for realistic data analysis workflows involving tool use and chained reasoning.
Multi-turn evaluation reveals deep research agents fail to compound gains from process-level feedback
A new arXiv paper evaluates deep research agents (DRAs) across multiple feedback turns, comparing self-reflection against process-level feedback via a novel method called Research Gap Inference (RGI). Key findings: self-reflection yields negligible net improvement, one round of process-level feedback raises normalized scores by 8-15 points (~35-40% incorporation rate), but gains do not compound across turns as agents regress on up to 24% of previously satisfied criteria. The results suggest reliable multi-turn improvement remains out of reach for current DRA architectures, highlighting a fundamental limitation in iterative agentic research workflows.
Back to The Future: Evaluating AI Agents on Predicting Future Events
This Hugging Face blog post introduces FutureBench, a benchmark designed to evaluate AI agents on their ability to predict future events, addressing the challenge of data contamination in standard benchmarks by using temporally forward-looking tasks. The approach tests whether agents can reason about and forecast outcomes beyond their training data cutoff. This framing positions future-event prediction as a rigorous, contamination-resistant evaluation methodology for frontier models and agents.
AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks
Researchers introduce AARR (Act As a Real Researcher), a new benchmark series testing whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios, not just in macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.
PaperFlow: A longitudinal framework for daily scientific paper recommendation with profiling and interest drift
PaperFlow is a new framework for scientific paper recommendation that models the process as a longitudinal, daily workflow rather than a static ranking task. It comprises three coupled stages: Profiling (building user scholarly profiles from cold-start evidence), Recommending (ranking daily paper streams under a display budget), and Adapting (updating user state from feedback and modeling interest drift). The authors introduce a benchmark with 24 simulated users, 50 daily paper streams, and over 1.2 million episode-paper records, plus a blind human-evaluation protocol. PaperFlow outperforms five baselines on oracle ranking, behavioral alignment, and human evaluation.
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
IBM Research introduces AssetOpsBench, a benchmark designed to evaluate AI agents on industrial asset operations tasks, hosted on Hugging Face. The benchmark targets the gap between existing general-purpose agent benchmarks and real-world industrial deployment scenarios. It provides a playground environment for testing agent capabilities in enterprise/industrial contexts.
T1-Bench: Multi-scenario agent benchmark across 25 real-world domains
T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

