7arXiv cs.AI (Artificial Intelligence)·1mo ago

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks—open-web search, evidence collection, and multi-step derivation—where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.

Frontier Model Releases Evaluation and Benchmarking Agent and Tool Ecosystem deep research agents DeepWeb-Bench Retrieval-Augmented Generation Large Language Models (frontier)

Related guides (3)

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6The Batch·34h ago·source ↗

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks — DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis) — are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers include GPT-5.5 (70% on DeepSWE), Claude Opus 4.7 (46.7% on ITBench-AA), and Claude Opus 4.7 (3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

Evaluation and Benchmarking Agent and Tool Ecosystem Artificial Analysis Llama 3.1 70B Datacurve +13 more

6arXiv · cs.AI·12d ago·source ↗

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking Agent and Tool Ecosystem Claude Opus 4.6 SWE-bench AARRI-Bench +2 more

6arXiv · cs.CL·4d ago·source ↗

DeepRubric: Evidence-tree rubric supervision cuts RL training cost for deep research agents by 13x

DeepRubric is a data construction framework that improves reinforcement learning efficiency for deep research agents by reversing the typical rubric-generation process: rather than inferring evaluation criteria from a query, it builds an evidence tree of verifiable sub-questions first, then synthesizes aligned query-rubric pairs. The authors construct 9K training examples and train DeepRubric-8B using rubric-based GRPO, achieving comparable performance to prior open-source state-of-the-art deep research models on three benchmarks while using roughly 13x fewer RL GPU-hours. The work addresses a key bottleneck in RL-based training of long-form research agents: unreliable reward signals from incomplete rubrics.

Evaluation and Benchmarking Agent and Tool Ecosystem DeepRubric GRPO +1 more

4arXiv · cs.CL·10d ago·source ↗

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking Agent and Tool Ecosystem T1-Bench

6The Batch·23d ago·source ↗

Data Points: DeepSWE Benchmark, DeepSeek V4 Price Cuts, MAI-Image-2.5, Mythos Security Findings, MCP Stateless Update

This edition of The Batch covers five distinct AI developments: Datacurve's DeepSWE benchmark claims to fix critical grading flaws in SWE-bench Pro with hand-written verifiers and harder tasks; DeepSeek permanently cuts V4 Pro prices by 75%; Microsoft's MAI-Image-2.5 debuts third on the Arena leaderboard; Anthropic's Claude Mythos Preview found over 10,000 high/critical vulnerabilities in the first month of Project Glasswing, with remediation badly lagging discovery; and the Model Context Protocol proposes removing stateful sessions to enable stateless, load-balanced remote servers. Each item reflects meaningful movement in evaluation methodology, inference economics, multimodal generation, AI-assisted security, and agent tooling infrastructure.

Frontier Model Releases Evaluation and Benchmarking Datacurve DeepSeek V4 Microsoft +17 more

7arXiv · cs.AI·16d ago·source ↗

AutoLab benchmark evaluates frontier models on ultra long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks across system optimization, puzzle-solving, model development, and CUDA kernel optimization, designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than single-turn or short-horizon settings. Evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation — not initial attempt quality — is the dominant success predictor. Claude Opus 4.6 stands out as the strongest performer, while most models including proprietary ones either terminate early or exhaust budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.

Frontier Model Releases Evaluation and Benchmarking Claude Opus 4.6 AutoLab Anthropic +1 more

5Interconnects·1mo ago·source ↗

Opus 4.6, Codex 5.3, and the post-benchmark era

A Interconnects commentary piece examining how to compare frontier AI models in 2026, using Anthropic's Opus 4.6 and OpenAI's Codex 5.3 as case studies. The piece appears to argue that traditional benchmarks are no longer sufficient for distinguishing model capabilities at the frontier. This reflects a broader industry shift toward more nuanced, task-specific evaluation methods.

Frontier Model Releases Evaluation and Benchmarking Interconnects Codex 5.3 Claude Opus 4.6 +2 more

7Openai Blog·1mo ago·source ↗

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

Evaluation and Benchmarking AI Safety Research OpenAI PaperBench +1 more