BM25
bm25-bde08fee · 3 events · first seen 16d ago
Aliases: BM25
Recent events (3)
SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics
SPECTRA is a reproducible framework for generating synthetic information retrieval test collections. It separates latent topical structure, surface text realization, and query intent generation to produce deterministic relevance oracles without human annotation. A Python prototype generated corpora of up to 60,000 documents at roughly 12K–14K documents per second, with graded relevance labels for 96 queries. In controlled distractor experiments, BM25 nDCG@10 degraded from 1.00 at 2% distractors to 0.43 at 36%, which the authors present as evidence that the framework can expose retrieval failure modes before expensive real-world collection construction. They position SPECTRA as a diagnostic complement to Cranfield/TREC-style evaluation rather than a replacement for human judgment.
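For readers unfamiliar with the metric quoted above, here is a minimal sketch of how nDCG@10 can be computed from graded relevance labels. The function names and example grades are hypothetical and not taken from SPECTRA, and the linear-gain form of DCG is an assumption (some evaluations use 2^grade - 1 as the gain instead).

```python
import math

def dcg_at_k(grades, k=10):
    """Discounted cumulative gain over the first k grades (linear gain, log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(ranked_grades, all_grades=None, k=10):
    """nDCG@k for one query.

    ranked_grades: relevance grades of the returned documents, in rank order.
    all_grades:    grades of every judged document for the query; defaults to
                   ranked_grades when the full judgment pool is not available.
    """
    pool = ranked_grades if all_grades is None else all_grades
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0

# Hypothetical grades: a perfectly ordered ranking versus one led by a distractor.
perfect = ndcg_at_k([3, 2, 1, 0])
with_distractor = ndcg_at_k([0, 3, 2, 1])
```

A ranking already sorted by grade scores 1.0. Pushing a non-relevant distractor to the top lowers the score because every relevant document is then discounted more heavily.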
Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%
Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context to each chunk before encoding, reducing failed retrievals by 49% on its own and by 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.
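The BM25 half of this idea can be illustrated with a small sketch: index each chunk with a short context string prepended, so that queries mentioning terms absent from the raw chunk can still match. This is not Anthropic's implementation. The `BM25` class, the whitespace tokenizer, the parameter defaults, and the hand-written context strings are all assumptions made for illustration, and the IDF uses the smoothed non-negative variant.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption; real systems use analyzers).
    return text.lower().split()

class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lens) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tfs for term in tf)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], self.lens[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

# Hypothetical chunks and hand-written context strings.
chunks = [
    "Revenue grew 3% over the previous quarter.",
    "The office moved to a new building.",
]
contexts = [
    "From the ACME Q2 2023 filing.",
    "From a facilities newsletter.",
]
contextualized = [f"{ctx} {chunk}" for ctx, chunk in zip(contexts, chunks)]

plain_index = BM25(chunks)
contextual_index = BM25(contextualized)
query = "acme"
plain_score = plain_index.score(query, 0)            # raw chunk never mentions the company
contextual_score = contextual_index.score(query, 0)  # prepended context supplies the term
```

In this toy setup the raw chunk scores zero for the company name, while the contextualized chunk gets a positive score. That is the context-loss effect the method targets.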
Mistral Releases Search Toolkit: Open-Source Composable Framework for Production RAG and Enterprise Search Pipelines
Mistral AI has launched Search Toolkit in public preview, an open-source framework that unifies document ingestion, retrieval, and evaluation into a single composable pipeline for AI applications. The toolkit ships with BM25 sparse retrieval, dense embedding-based retrieval, hybrid configurations, and built-in metrics (recall, precision, MRR, NDCG), targeting enterprise RAG workflows, domain-specific retrieval, and agentic systems. It integrates with MCP-based Connectors for live data access from CRMs, code repositories, and productivity tools. CMA CGM is cited as a production user, combining Search Toolkit with Voxtral for real-time fake news detection across audio sources.
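To make the hybrid retrieval and MRR terms concrete, the sketch below fuses a sparse and a dense ranking with reciprocal rank fusion and then scores the result with mean reciprocal rank. It does not use or reflect the Search Toolkit API. Reciprocal rank fusion is only one common way to build a hybrid ranking and is assumed here, and the function names, document IDs, and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def mean_reciprocal_rank(runs, relevant):
    """runs: {query_id: ranked doc IDs}; relevant: {query_id: set of relevant doc IDs}."""
    if not runs:
        return 0.0
    total = 0.0
    for qid, ranking in runs.items():
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(runs)

# Hypothetical rankings from a sparse (BM25) and a dense retriever.
sparse = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse, dense])

mrr = mean_reciprocal_rank(
    {"q1": fused, "q2": ["d9", "d2"]},
    {"q1": {"d3"}, "q2": {"d2"}},
)
```

Documents ranked well by both retrievers rise to the top of the fused list, and MRR rewards placing the first relevant document as early as possible.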