Entity · paper

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

paper · active · multi-turn-evaluation-of-deep-research-agents-under-process-level-feedback-0163132e · 1 event · first seen Jun 9, 2026

Aliases: Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Co-occurring entities

Rishabh Sabharwal · Research Gap Inference

More like this (12)

multi-level agent evaluation
multi-turn agent benchmarks
Do Agent Optimizers Compound? A Continual-Learning Evaluation on Terminal-Bench 2.0
Who Grades the Grader? Co-Evolving Evaluation Metrics and Skills for Self-Improving LLM Agents
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
From Atomic Actions to Standard Operating Procedures: Iterative Tool Optimization for Self-Evolving LLM Agents
Can AI agents conduct open-ended AI research? Early evidence from two case studies
GEIS: A Generation-Evaluation-Improvement Loop of Agent Skills for Long-Form Article Generation
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
third-party AI evaluations

Recent events (1)

6 · arXiv · cs.CL · Jun 9, 2026

Multi-turn evaluation reveals deep research agents fail to compound gains from process-level feedback

A new arXiv paper evaluates deep research agents (DRAs) across multiple feedback turns, comparing self-reflection against process-level feedback delivered via a novel method called Research Gap Inference (RGI). Three findings stand out. Self-reflection yields negligible net improvement. A single round of process-level feedback raises normalized scores by 8-15 points (an incorporation rate of roughly 35-40%). Those gains do not compound across turns, however, because agents regress on up to 24% of previously satisfied criteria. The results suggest that reliable multi-turn improvement remains out of reach for current DRA architectures, which points to a fundamental limitation in iterative agentic research workflows.
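To make the quantities in the summary concrete, here is a minimal sketch of how a score, an incorporation rate, and a regression rate could be computed from per-criterion rubric results for two consecutive turns. The definitions are assumptions for illustration, not the paper's: the function name `turn_metrics`, the criterion IDs, and the formulas (incorporation as the share of feedback-targeted criteria that flip from unmet to met, regression as the share of previously met criteria that become unmet) are all hypothetical.

```python
from typing import Dict, List


def turn_metrics(
    prev: Dict[str, bool],
    curr: Dict[str, bool],
    feedback: List[str],
) -> Dict[str, float]:
    """Compare rubric results for two consecutive turns.

    prev, curr: criterion ID -> whether the report satisfied it (assumed format).
    feedback:   criterion IDs targeted by the feedback given between the turns.
    """
    satisfied_before = [c for c, ok in prev.items() if ok]

    # Regression: criteria that were met in the previous turn but are unmet now.
    regressed = [c for c in satisfied_before if not curr.get(c, False)]

    # Incorporation: feedback-targeted criteria that flipped from unmet to met.
    incorporated = [
        c for c in feedback
        if curr.get(c, False) and not prev.get(c, False)
    ]

    return {
        # Score on a 0-100 scale: percentage of criteria satisfied (assumed).
        "score_prev": 100 * sum(prev.values()) / len(prev),
        "score_curr": 100 * sum(curr.values()) / len(curr),
        "incorporation_rate": len(incorporated) / len(feedback) if feedback else 0.0,
        "regression_rate": len(regressed) / len(satisfied_before) if satisfied_before else 0.0,
    }


# Hypothetical example: five criteria, feedback targets the three unmet ones.
prev_turn = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": False}
curr_turn = {"c1": True, "c2": False, "c3": True, "c4": False, "c5": True}
metrics = turn_metrics(prev_turn, curr_turn, feedback=["c3", "c4", "c5"])
print(metrics)
```

In this made-up example the agent incorporates two of three feedback items but loses one of its two previously satisfied criteria, so the net score gain is smaller than the incorporation rate alone would suggest. That is the kind of offsetting effect the summary describes when it says gains fail to compound.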

Evaluation and Benchmarking · Agent and Tool Ecosystem · Rishabh Sabharwal · Research Gap Inference · Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback