arXiv cs.AI (Artificial Intelligence) · 19d ago

SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics

SPECTRA is a reproducible framework for generating synthetic information retrieval test collections. It separates latent topical structure, surface text realization, and query intent generation, which lets it produce deterministic relevance oracles without human annotation. A Python prototype generated corpora of up to 60,000 documents at roughly 12K–14K documents per second, with graded relevance labels for 96 queries. In controlled distractor experiments, BM25 nDCG@10 degraded from 1.00 at 2% distractors to 0.43 at 36%, which illustrates how the framework can expose retrieval failure modes before expensive real-world collection construction. The authors position SPECTRA as a diagnostic complement to Cranfield/TREC-style evaluation rather than a replacement for human judgment.
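For readers unfamiliar with the metric quoted above, the following is a minimal sketch of nDCG@10 over graded relevance labels. It assumes the common linear-gain, log2-discount formulation; the summary does not say which variant the SPECTRA authors use, and the function names and toy labels here are illustrative only.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], all_gains: Sequence[float], k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_gains: graded labels of the returned documents, in ranked order.
    all_gains:    graded labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg_at_k(ranked_gains, k) / ideal


# Toy data (assumed, not from the paper): one highly relevant doc (2),
# one partially relevant doc (1), and two distractors (0).
labels = [2, 1, 0, 0]
perfect = ndcg_at_k([2, 1, 0, 0], labels)     # relevant docs ranked first
distracted = ndcg_at_k([0, 2, 0, 1], labels)  # distractors pushed above relevant docs
print(perfect, distracted)
```

A ranking that places every relevant document first scores 1.0, and the score falls as distractors displace relevant documents from the top positions, which is the kind of degradation the distractor experiments measure.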

Evaluation and Benchmarking · Agent and Tool Ecosystem · TREC · Cranfield evaluation paradigm · Zipf distribution · BM25 · nDCG@10 · SPECTRA

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

arXiv · cs.CL · 1mo ago

ACL-Verbatim: Hallucination-Free Extractive QA System for Research Papers

The paper introduces ACL-Verbatim, an extractive question answering system built on VerbatimRAG that maps user queries directly to verbatim text spans in ACL Anthology papers, eliminating hallucination by design. The authors contribute a new ground-truth benchmark dataset created via human NLP-researcher annotation over synthetic queries generated using a ScIRGen-based pipeline. A 150M-parameter ModernBERT token classifier trained on silver supervision achieves the best word-level F1 of 53.6, outperforming the strongest LLM-based extractor at 48.7. The work demonstrates that smaller extractive models can outperform large generative LLMs on precision-critical retrieval tasks.

Evaluation and Benchmarking · AI Safety Research · ModernBERT · ScIRGen · ACL Anthology

Berkeley AI Research (BAIR) Blog · 1mo ago

SPEX and ProxySPEX: Scalable Interaction Discovery for LLM Interpretability

Researchers from BAIR introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms for identifying influential feature, data, and model-component interactions in LLMs at scale. The approach exploits sparsity, low-degreeness, and hierarchy properties to reframe interaction discovery as a sparse recovery problem using tools from signal processing and coding theory. ProxySPEX achieves comparable performance to SPEX with roughly 10x fewer ablations by leveraging hierarchical structure. The methods are evaluated on feature attribution (sentiment analysis), data attribution, and mechanistic interpretability tasks, outperforming marginal methods like LIME at long context lengths.

Long Context Evolution · Evaluation and Benchmarking · GPT-4o mini · Faith-Shap · LIME

arXiv · cs.CL · 10d ago

Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation

A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.

Evaluation and Benchmarking · Alignment and RLHF · Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

arXiv · cs.LG · 15d ago

SARDI: Self-Augmenting Retrieval for Diffusion Language Models using lookahead tokens

Researchers introduce SARDI, a training-free RAG framework for discrete diffusion language models that repurposes discarded low-confidence tokens during denoising as lookahead signals to guide retrieval before output is finalized. The method is retriever-agnostic and applicable to any reasoning-capable discrete diffusion LM. Evaluated across five multi-hop QA benchmarks, SARDI outperforms training-free diffusion and autoregressive retrieval baselines at up to 8x higher throughput.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Self-Augmenting Retrieval for Diffusion Language Models · SARDI

Anthropic News · 17d ago

Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%

Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.
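The prepending step described above can be sketched as follows. This is an illustration under stated assumptions, not Anthropic's implementation: `contextualize_chunks` and `title_context` are hypothetical names, and the context generator is passed in as a plain callable (here a trivial stand-in that reuses the document's first line) in place of whatever the published method uses to produce each chunk's context.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    describe: Callable[[str, str], str],
) -> List[str]:
    """Prepend a short, chunk-specific context string to each chunk.

    The returned strings are what would be embedded and indexed for BM25,
    in place of the bare chunks. `describe` is a hypothetical hook that
    maps (document, chunk) to a context string.
    """
    out = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        out.append(f"{context}\n\n{chunk}" if context else chunk)
    return out


def title_context(document: str, chunk: str) -> str:
    """Stand-in context generator (assumption): reuse the document's first line."""
    lines = document.strip().splitlines()
    return f"From: {lines[0]}" if lines else ""


doc = "Example Corp quarterly report\nRevenue grew 3% over the previous quarter."
indexed = contextualize_chunks(doc, ["Revenue grew 3% over the previous quarter."], title_context)
print(indexed[0])
```

Without the prepended line, the chunk "Revenue grew 3% over the previous quarter." carries no indication of which company or report it refers to, which is the context-loss problem the summary describes.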

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Claude · BM25 · Contextual Retrieval

arXiv · cs.AI · 1mo ago

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks (open-web search, evidence collection, and multi-step derivation) where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.

Frontier Model Releases · Evaluation and Benchmarking · deep research agents · DeepWeb-Bench · Retrieval-Augmented Generation

arXiv · cs.CL · 4d ago

MetaSyn benchmark reveals critical screening bottleneck in LLM-based meta-analysis pipelines

Researchers introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, paired with a 140k-article PubMed retrieval corpus, PI/ECO criteria, verified positives, and hard negatives. Benchmarking twelve pipeline configurations — nine RAG variants and a protocol-driven agent — shows that despite 90.9% retrieval recall at K=200, no system recovers more than 52.7% of ground-truth included studies. The core failure is LLMs' inability to reliably distinguish eligible studies from topically similar but criteria-failing distractors. The paper argues that end-to-end scores obscure where pipelines break down and proposes stage-attributed metrics.

Evaluation and Benchmarking · Agent and Tool Ecosystem · PubMed · Nature Portfolio · MetaSyn

Hugging Face Blog · 1mo ago

Introducing RTEB: A New Standard for Retrieval Evaluation

Hugging Face introduces RTEB (Retrieval Text Embedding Benchmark), a new benchmark designed to standardize evaluation of retrieval systems and text embeddings. The benchmark aims to address gaps in existing evaluation frameworks by providing more comprehensive and realistic retrieval tasks. This represents an effort to improve how the community measures progress in retrieval-augmented generation and semantic search systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MTEB · RTEB · Hugging Face