6 · arXiv cs.CL (Computation and Language) · 11d ago

Multi-turn evaluation reveals deep research agents fail to compound gains from process-level feedback

A new arXiv paper evaluates deep research agents (DRAs) across multiple feedback turns, comparing self-reflection against process-level feedback generated by a novel method called Research Gap Inference (RGI). Key findings: self-reflection yields negligible net improvement; one round of process-level feedback raises normalized scores by 8-15 points (at a ~35-40% incorporation rate); and gains do not compound across turns, because agents regress on up to 24% of previously satisfied criteria. The results suggest that reliable multi-turn improvement remains out of reach for current DRA architectures, a fundamental limitation for iterative agentic research workflows.
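To make the "gains do not compound" finding concrete, here is a minimal Python sketch of how regression on previously satisfied criteria can cancel out new gains between two turns. It is an illustration only, not the paper's scoring code: the function name `turn_delta`, the boolean per-criterion rubric format, and the example data are all assumptions.

```python
from typing import Dict


def turn_delta(prev: Dict[str, bool], curr: Dict[str, bool]) -> Dict[str, float]:
    """Compare rubric outcomes for the same criteria across two consecutive turns.

    `prev` and `curr` map a criterion id to whether the report satisfied it.
    (Hypothetical format; the paper's rubric representation may differ.)
    """
    shared = prev.keys() & curr.keys()
    satisfied_before = [c for c in shared if prev[c]]
    unsatisfied_before = [c for c in shared if not prev[c]]

    # Criteria that were met in the earlier turn but are no longer met.
    regressed = [c for c in satisfied_before if not curr[c]]
    # Criteria that were unmet in the earlier turn and are now met.
    gained = [c for c in unsatisfied_before if curr[c]]

    return {
        "regression_rate": len(regressed) / len(satisfied_before) if satisfied_before else 0.0,
        "gain_rate": len(gained) / len(unsatisfied_before) if unsatisfied_before else 0.0,
        "net_change": (len(gained) - len(regressed)) / len(shared) if shared else 0.0,
    }


# Made-up example: one criterion is fixed after feedback, one is lost.
turn_1 = {"a": True, "b": True, "c": True, "d": True,
          "e": False, "f": False, "g": False, "h": False}
turn_2 = {"a": True, "b": True, "c": True, "d": False,
          "e": True, "f": False, "g": False, "h": False}
result = turn_delta(turn_1, turn_2)
```

In this made-up example the agent fixes one criterion and loses one, so the regression and gain rates are both 0.25 and the net change is 0.0: the feedback was partly incorporated, yet the overall score does not move.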

Evaluation and Benchmarking · Agent and Tool Ecosystem · Rishabh Sabharwal · Research Gap Inference · Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.CL · 4d ago

DeepRubric: Evidence-tree rubric supervision cuts RL training cost for deep research agents by 13x

DeepRubric is a data construction framework that improves reinforcement learning efficiency for deep research agents by reversing the typical rubric-generation process: rather than inferring evaluation criteria from a query, it builds an evidence tree of verifiable sub-questions first, then synthesizes aligned query-rubric pairs. The authors construct 9K training examples and train DeepRubric-8B using rubric-based GRPO, achieving comparable performance to prior open-source state-of-the-art deep research models on three benchmarks while using roughly 13x fewer RL GPU-hours. The work addresses a key bottleneck in RL-based training of long-form research agents: unreliable reward signals from incomplete rubrics.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeepRubric · GRPO

6 · arXiv · cs.AI · 12d ago

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · SWE-bench · AARRI-Bench

4 · arXiv · cs.AI · 3d ago

DRFLOW: Benchmark for Evaluating Agent Workflow Prediction from Heterogeneous Sources

Researchers introduce DRFLOW, a benchmark targeting a gap in deep research (DR) agent evaluation: predicting concrete, personalized action-step workflows rather than generating summaries or reports. The benchmark contains 100 tasks across five domains, grounded in over 3,900 sources, with seven diagnostic metrics covering factual grounding, step recovery, structural ordering, and personalization. A reference agent (DRFA) is also presented, improving over strong baselines by up to 10% average F1 but leaving substantial headroom, indicating workflow prediction remains a hard open problem for DR agents.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DRFLOW-Agent · DRFLOW

5 · arXiv · cs.CL · 5d ago

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem · Alignment and RLHF · ALFWorld · Sokoban · RePro

5 · arXiv · cs.AI · 10d ago

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.

Evaluation and Benchmarking · Alignment and RLHF · GRPO · The Role of Feedback Alignment in Self-Distillation

7 · arXiv · cs.CL · 23d ago

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.
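One plausible reading of why all-wrong subgroups suppress the learning signal is sketched below, assuming binary rewards and the commonly described group-normalized advantage used in GRPO-style training. This is not the AXPO authors' code: the function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group (GRPO-style).

    Advantage = (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every rollout fails: all rewards are equal,
# so every advantage is zero and the group contributes no gradient.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group with one success: the successful rollout gets a positive
# advantage and the failures get negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Under this reading, resampling tool calls for an all-wrong subgroup while keeping the thinking prefix fixed is a way to turn an uninformative group into a mixed one, which restores a non-zero signal.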

Frontier Model Releases · Agent and Tool Ecosystem · AXPO · GRPO · Thinking-Acting Gap

5 · arXiv · cs.AI · 2d ago

Rubric-Conditioned Self-Distillation: structured feedback for reasoning model post-training

A new arXiv preprint proposes Rubric-Conditioned Self-Distillation (RCSD), a post-training framework that replaces scalar reward signals and noisy chain-of-thought annotations with structured rubrics for fine-grained credit assignment. The method conditions a teacher model on criterion-level rubrics to provide token-level guidance on the student's own sampled trajectories, avoiding reliance on a single reference rationale. Evaluated on science reasoning benchmarks, RCSD outperforms GRPO by 1.0 points and OPSD by 0.9 points on average.

Evaluation and Benchmarking · Alignment and RLHF · OPSD · GRPO · Rubric-Conditioned Self-Distillation

6 · arXiv · cs.AI · 25d ago

VeriTrace: Cognitive-Graph Framework with Explicit Regulatory Loops for Deep Research Agents

VeriTrace introduces a cognitive-graph framework for deep research agents that replaces implicit LLM reasoning over intermediate representations with three explicit regulatory loops: interpretive update, deviation feedback, and schema revision. The system addresses contamination and error propagation in evolving mental models during complex multi-step research tasks. Using Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench Insight and 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DeepResearch Bench.

Frontier Model Releases · Evaluation and Benchmarking · DeepSeek V4 · cognitive-graph · DeepResearch Bench