DRFLOW

benchmark · active · provisional · drflow-e2a746ad · 1 event · first seen 7h ago

Aliases: DRFLOW

Recent events (1)

4 · arXiv · cs.AI · 7h ago

DRFLOW: Benchmark for Evaluating Agent Workflow Prediction from Heterogeneous Sources

Researchers introduce DRFLOW, a benchmark that targets a gap in the evaluation of deep research (DR) agents: predicting concrete, personalized action-step workflows rather than generating summaries or reports. The benchmark contains 100 tasks across five domains, grounded in over 3,900 sources, with seven diagnostic metrics covering factual grounding, step recovery, structural ordering, and personalization. A reference agent (DRFA) is also presented; it improves over strong baselines by up to 10% average F1 but leaves substantial headroom, indicating that workflow prediction remains a hard open problem for DR agents.