5 · Hugging Face Blog · 1mo ago

Introducing RTEB: A New Standard for Retrieval Evaluation

Hugging Face introduces RTEB (Retrieval Text Embedding Benchmark), a new benchmark designed to standardize the evaluation of retrieval systems and text embeddings. The benchmark aims to address gaps in existing evaluation frameworks by providing more comprehensive and realistic retrieval tasks. It is an effort to improve how the community measures progress in retrieval-augmented generation and semantic search.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MTEB · RTEB · Hugging Face
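To make the idea of scoring a retrieval system concrete, here is a minimal sketch of nDCG@k, a widely used ranking metric. The summary above does not say which metric RTEB reports, so the choice of nDCG is an assumption, and the function names and document IDs are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order the system returned them.
    relevance: dict mapping document ID -> graded relevance (missing = 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy example (made-up IDs): the only relevant document is ranked second.
score = ndcg_at_k(["d2", "d1"], {"d1": 1}, k=10)
```

A benchmark would typically average a per-query score like this over every query in a dataset, then across datasets.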

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · Hugging Face Blog · 1mo ago

MTEB: Massive Text Embedding Benchmark

MTEB (Massive Text Embedding Benchmark) is introduced as a large-scale benchmark for evaluating text embedding models across a wide variety of tasks and datasets. The benchmark covers multiple embedding task types including classification, clustering, retrieval, and semantic similarity, enabling systematic comparison of embedding models. It provides a public leaderboard to track progress in the text embedding space. The work addresses the lack of a unified, comprehensive evaluation framework for text embeddings.

Evaluation and Benchmarking · MTEB · Hugging Face

3 · arXiv · cs.LG · 8d ago

SkMTEB: First comprehensive MTEB-style text embedding benchmark for Slovak with adapted E5 models

Researchers introduce SkMTEB, the first MTEB-style embedding benchmark for Slovak, covering 31 datasets across 7 task types — roughly 4× the existing multilingual benchmark coverage for the language. Evaluation of 31 embedding models shows large instruction-tuned multilingual models outperform Slovak-specific NLU models on embedding tasks. The authors also release e5-sk-small (45M) and e5-sk-large (365M), derived from Multilingual E5 via vocabulary trimming and fine-tuning, achieving competitive performance with proprietary APIs at up to 62% size reduction.

Evaluation and Benchmarking · Open Weights Progress · MTEB · SkMTEB · e5_large · +2 more

5 · Hugging Face Blog · 1mo ago

Introducing the Ettin Reranker Family

Hugging Face introduces the Ettin Reranker Family, a new set of reranking models designed to improve retrieval quality in information retrieval and RAG pipelines. The models appear to be purpose-built for reranking and likely target enterprise and research use cases where retrieval precision matters. The release adds model tooling to the retrieval-augmented generation ecosystem.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Ettin Reranker · Hugging Face

5 · Hugging Face Blog · 1mo ago

Introducing ConTextual: Benchmark for Joint Text-Image Reasoning in Text-Rich Scenes

Hugging Face introduces ConTextual, a new benchmark evaluating multimodal models on their ability to jointly reason over text and images in text-rich scenes. The benchmark targets a specific capability gap where models must integrate visual and textual information simultaneously rather than treating them independently. A leaderboard accompanies the benchmark to track model progress on this task.

Evaluation and Benchmarking · Multimodal Progress · Hugging Face · ConTextual

4 · arXiv · cs.CL · 11d ago

TABVERSE benchmark isolates table representation effects across formats in LLMs and VLMs

TABVERSE is a new controlled multimodal benchmark that evaluates LLMs and VLMs on table understanding by holding table content fixed while varying representation format (HTML, Markdown, LaTeX, rendered images). Evaluation across three tasks—Question Answering, Structural Understanding, and Structure Reconstruction—shows that representation choice substantially affects performance, with structured text generally outperforming rendered images and HTML being the most robust text format. The benchmark addresses a gap in existing evaluations where content, format, and modality vary simultaneously, making it impossible to isolate representation effects.

Evaluation and Benchmarking · Multimodal Progress · TABVERSE
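The controlled setup described in the TABVERSE summary, fixed content with varying representation, can be illustrated with a small sketch that serializes one table into three of the text formats mentioned (Markdown, HTML, LaTeX). This is not the benchmark's own code; the function names and the toy table are hypothetical, and the LaTeX renderer assumes cells contain no special characters.

```python
from html import escape


def to_markdown(header, rows):
    """Render the table as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Render the same table as a minimal HTML <table>."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header, rows):
    """Render the same table as a LaTeX tabular (no escaping of special chars)."""
    spec = "l" * len(header)
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + " \\\\", "\\hline"]
    lines += [" & ".join(str(c) for c in row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)


# Toy table: identical content, three representations.
header = ["name", "score"]
rows = [["alpha", 1], ["beta", 2]]
variants = {
    "markdown": to_markdown(header, rows),
    "html": to_html(header, rows),
    "latex": to_latex(header, rows),
}
```

Feeding each variant to a model with the same question would show whether differences in accuracy come from the format alone, since the underlying cells never change.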

6 · arXiv · cs.CL · 5d ago

Every Eval Ever: unified schema and community repository for AI evaluation results

Researchers introduce Every Eval Ever, a shared schema and crowdsourced repository designed to standardize AI evaluation results across incompatible formats, frameworks, and sources. The system ingests results from evaluation harnesses, papers, leaderboards, and custom repositories into a single JSON document format, with optional per-instance output storage. The repository, hosted on Hugging Face, currently covers 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats. The work addresses a persistent infrastructure problem in AI evaluation science: divergent scores for nominally identical evaluations and scattered, incomparable metadata.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · Every Eval Ever

5 · Hugging Face Blog · 1mo ago

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Hugging Face introduces TTS Arena, a community-driven evaluation platform for text-to-speech models modeled after the LLM Chatbot Arena approach. Users listen to audio samples from competing TTS systems and vote on quality, generating Elo-based rankings. The platform aims to provide a more ecologically valid benchmark than existing automated metrics, which often fail to capture human perceptual preferences. Initial results surface rankings across open and proprietary TTS models.

Evaluation and Benchmarking · Multimodal Progress · Chatbot Arena · TTS Arena · Hugging Face · +1 more
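For readers unfamiliar with how pairwise votes become Elo-based rankings, here is a minimal sketch of the standard Elo update. The K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration; the summary does not state the parameters TTS Arena uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, initial=1000.0):
    """Apply one pairwise vote in place; unseen systems start at `initial`."""
    r_win = ratings.get(winner, initial)
    r_lose = ratings.get(loser, initial)
    delta = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + delta  # winner gains what the loser gives up
    ratings[loser] = r_lose - delta
    return ratings


# Hypothetical votes: each tuple is (preferred system, other system).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Highest-rated system first.
leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
```

Because every update moves the same number of points from loser to winner, the total rating mass stays constant, and an upset win against a higher-rated system moves ratings more than an expected win.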

5 · Hugging Face Blog · 1mo ago

Evaluating Audio Reasoning with Big Bench Audio

Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.

Evaluation and Benchmarking · Multimodal Progress · Big Bench Audio · Hugging Face · Big Bench