DeepWeb-Bench

benchmark · active · deepweb-bench-9e4c7113 · 1 event · first seen May 21, 2026

Aliases: DeepWeb-Bench

Co-occurring entities

deep research agents · Retrieval-Augmented Generation · Large Language Models (frontier)

More like this (12)

DeepResearch Bench · DeepWiki · WildBench · BigCodeBench · EdgeBench · SpecBench · PaperBench · HealthBench · web navigation benchmark · FutureBench · EntityBench · RepoBench

Recent events (1)

7 · arXiv · cs.AI · May 21, 2026

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks—open-web search, evidence collection, and multi-step derivation—where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.

Frontier Model Releases · Evaluation and Benchmarking · deep research agents · DeepWeb-Bench · Retrieval-Augmented Generation