paper
Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
paper · active · provisional
multi-turn-evaluation-of-deep-research-agents-under-process-level-feedback-0163132e · 1 event · first seen 8d ago
Aliases: Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
More like this (12)
multi-level agent evaluation
multi-turn agent benchmarks
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
third-party AI evaluations
Personalized Evaluation as Learning
Reward Modeling for Multi-Agent Orchestration
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
Towards a Science of AI Agent Reliability
Agentopia: Long-Term Life Simulation and Learning in Agent Societies
ExpRL: Exploratory RL for LLM Mid-Training
Recent events (1)
Multi-turn evaluation reveals deep research agents fail to compound gains from process-level feedback
A new arXiv paper evaluates deep research agents (DRAs) across multiple feedback turns, comparing self-reflection against process-level feedback via a novel method called Research Gap Inference (RGI). Key findings: self-reflection yields negligible net improvement, one round of process-level feedback raises normalized scores by 8-15 points (~35-40% incorporation rate), but gains do not compound across turns as agents regress on up to 24% of previously satisfied criteria. The results suggest reliable multi-turn improvement remains out of reach for current DRA architectures, highlighting a fundamental limitation in iterative agentic research workflows.
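
To make the "gains do not compound" finding concrete, the sketch below shows one plausible way to measure regression between two feedback turns. It is an illustration only: the summary does not give the paper's exact metric definitions, so the function names (`regression_rate`, `net_gain`), the set-based representation of satisfied criteria, and the criterion IDs are all assumptions.

```python
def regression_rate(prev_satisfied, curr_satisfied):
    """Fraction of criteria satisfied in the previous turn that are
    no longer satisfied in the current turn (assumed definition)."""
    prev = set(prev_satisfied)
    if not prev:
        return 0.0  # nothing was satisfied before, so nothing can regress
    lost = prev - set(curr_satisfied)
    return len(lost) / len(prev)


def net_gain(prev_satisfied, curr_satisfied):
    """Change in the number of satisfied criteria between two turns."""
    return len(set(curr_satisfied)) - len(set(prev_satisfied))


# Hypothetical criterion IDs for two consecutive turns of one report.
turn_1 = {"c1", "c2", "c3", "c4"}
turn_2 = {"c1", "c2", "c3", "c5"}  # gained c5 from feedback, but lost c4

print(regression_rate(turn_1, turn_2))  # 0.25
print(net_gain(turn_1, turn_2))         # 0
```

In this toy case the agent incorporates one new criterion but drops a previously satisfied one, so the net gain is zero even though the feedback was partly acted on. That is the pattern the summary describes when regressions offset gains across turns.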