6 · Hugging Face Blog · 1mo ago

MTEB: Massive Text Embedding Benchmark

MTEB (Massive Text Embedding Benchmark) is a large-scale benchmark for evaluating text embedding models across a wide range of tasks and datasets. It covers multiple task types, including classification, clustering, retrieval, and semantic similarity, so that embedding models can be compared systematically. A public leaderboard tracks progress in the text embedding space. The work addresses the lack of a unified, comprehensive evaluation framework for text embeddings.

Evaluation and Benchmarking MTEB Hugging Face

Related guides (2)

Hugging Face

Hugging Face: The Home of Open-Source AI

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

3 · arXiv · cs.LG · 8d ago

SkMTEB: First comprehensive MTEB-style text embedding benchmark for Slovak with adapted E5 models

Researchers introduce SkMTEB, the first MTEB-style embedding benchmark for Slovak, covering 31 datasets across 7 task types, roughly 4× the coverage that existing multilingual benchmarks provide for the language. An evaluation of 31 embedding models shows that large instruction-tuned multilingual models outperform Slovak-specific NLU models on embedding tasks. The authors also release e5-sk-small (45M) and e5-sk-large (365M), derived from Multilingual E5 via vocabulary trimming and fine-tuning, which perform competitively with proprietary APIs at up to a 62% reduction in model size.
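As a rough illustration of the general idea behind vocabulary trimming mentioned above, the sketch below drops embedding rows for tokens that never occur in a target-language corpus and remaps the remaining token ids. It is a toy example, not the authors' procedure: the vocabulary, the embedding values, the corpus, and the helper name `trim_vocabulary` are all hypothetical.

```python
# Toy vocabulary and embedding table (hypothetical values, one row per token).
vocab = {"<unk>": 0, "ahoj": 1, "hello": 2, "svet": 3, "world": 4, "bonjour": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Tokens observed in a hypothetical target-language corpus.
corpus_tokens = ["ahoj", "svet", "ahoj"]


def trim_vocabulary(vocab, embeddings, corpus_tokens, always_keep=("<unk>",)):
    """Keep only embedding rows for tokens seen in the corpus, plus special tokens."""
    keep = set(always_keep) | set(corpus_tokens)
    # Walk tokens in original id order so the new ids are deterministic.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept_tokens = [token for token, _ in ordered if token in keep]
    new_vocab = {token: new_id for new_id, token in enumerate(kept_tokens)}
    new_embeddings = [embeddings[vocab[token]] for token in kept_tokens]
    return new_vocab, new_embeddings


new_vocab, new_embeddings = trim_vocabulary(vocab, embeddings, corpus_tokens)

# Fraction of embedding rows removed by trimming.
reduction = 1 - len(new_embeddings) / len(embeddings)
print(new_vocab, f"{reduction:.0%} of embedding rows removed")
```

In a real model the saving depends on how much of the parameter count sits in the embedding matrix, which is why trimming tends to help multilingual models with large vocabularies the most.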

Evaluation and Benchmarking Open Weights Progress MTEB SkMTEB e5_large

5 · Hugging Face Blog · 1mo ago

Introducing RTEB: A New Standard for Retrieval Evaluation

Hugging Face introduces RTEB (Retrieval Embedding Benchmark), a new benchmark designed to standardize the evaluation of retrieval systems and text embeddings. It aims to address gaps in existing evaluation frameworks by providing more comprehensive and realistic retrieval tasks. The effort is intended to improve how the community measures progress in retrieval-augmented generation and semantic search systems.

Evaluation and Benchmarking Agent and Tool Ecosystem MTEB RTEB Hugging Face

5 · OpenAI Blog · 1mo ago

Introducing text and code embeddings

OpenAI launched a new embeddings endpoint in its API, enabling natural language and code tasks such as semantic search, clustering, topic modeling, and classification. The endpoint provides vector representations of text and code, making it easier for developers to build applications requiring semantic understanding. This was a significant early step in OpenAI's API product expansion beyond text generation.

Enterprise Deployment Patterns Agent and Tool Ecosystem OpenAI Embeddings API OpenAI API OpenAI

5 · OpenAI Blog · 1mo ago

OpenAI Releases New and Improved Embedding Model

OpenAI announced a new embedding model described as significantly more capable, cost-effective, and simpler to use than prior offerings. The announcement was published in December 2022 and represents an update to OpenAI's text embedding API surface. No specific benchmark numbers or architectural details are provided in the available body text.

Inference Economics Enterprise Deployment Patterns text-embedding-ada-002 OpenAI

6 · arXiv · cs.CL · 29d ago

Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient

This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.
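The evaluation count and the ranking instability described above can be made concrete with a small sketch. The grid sizes (6 models, 11 datasets, 15 prompts) come from the summary; the model names and scores are hypothetical, invented only to show how picking prompts favorably can put any model first. Pairing the favored model's best prompt with its rivals' worst prompts is one possible reading of "selecting prompts favorably", not necessarily the paper's exact protocol.

```python
# Grid sizes taken from the summary: 6 models, 11 datasets, 15 prompts per dataset.
N_MODELS, N_DATASETS, N_PROMPTS = 6, 11, 15
total_evaluations = N_MODELS * N_DATASETS * N_PROMPTS

# Hypothetical scores (not from the paper): one dataset, three prompts per model.
scores = {
    "model_a": [0.62, 0.70, 0.66],
    "model_b": [0.68, 0.61, 0.64],
    "model_c": [0.65, 0.63, 0.71],
}


def rank(score_by_model):
    """Return model names sorted from best to worst score."""
    return sorted(score_by_model, key=score_by_model.get, reverse=True)


# Single-prompt evaluation: every model is scored with prompt 0 only.
single_prompt_ranking = rank({m: s[0] for m, s in scores.items()})


def best_case_ranking(favored):
    """Rank models when `favored` gets its best prompt and rivals get their worst."""
    picked = {m: (max(s) if m == favored else min(s)) for m, s in scores.items()}
    return rank(picked)


# Multi-prompt reporting: mean score plus spread (max - min) as a simple
# sensitivity measure reported alongside the point estimate.
summary = {m: (sum(s) / len(s), max(s) - min(s)) for m, s in scores.items()}

print(total_evaluations)
print(single_prompt_ranking)
print({m: best_case_ranking(m)[0] for m in scores})
```

Reporting the spread next to the mean is one simple way to surface sensitivity; a benchmark could equally report a standard deviation or the worst-case score.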

Evaluation and Benchmarking Agent and Tool Ecosystem MTEB embedding model leaderboard prompt sensitivity

5 · OpenAI Blog · 1mo ago

Text and Code Embeddings by Contrastive Pre-training

OpenAI published research on generating text and code embeddings using contrastive pre-training. The approach trains models to produce dense vector representations useful for semantic search, classification, and code retrieval tasks. This work underpins OpenAI's embeddings API offerings and represents an early public articulation of their embedding methodology.

Inference Economics Enterprise Deployment Patterns Contrastive Pre-training OpenAI Embeddings API text-embedding-ada-002

4 · arXiv · cs.CL · 11d ago

TABVERSE benchmark isolates table representation effects across formats in LLMs and VLMs

TABVERSE is a new controlled multimodal benchmark that evaluates LLMs and VLMs on table understanding by holding table content fixed while varying representation format (HTML, Markdown, LaTeX, rendered images). Evaluation across three tasks—Question Answering, Structural Understanding, and Structure Reconstruction—shows that representation choice substantially affects performance, with structured text generally outperforming rendered images and HTML being the most robust text format. The benchmark addresses a gap in existing evaluations where content, format, and modality vary simultaneously, making it impossible to isolate representation effects.
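The controlled setup described above, where the content stays fixed and only the representation changes, can be sketched by serializing one table into the three text formats the summary names. This is an illustrative sketch, not the benchmark's code: the table contents and the function names are hypothetical, cell values are not escaped, and the rendered-image format is omitted because it would need an imaging library.

```python
# Hypothetical table content, held fixed across all formats.
header = ["item", "count"]
rows = [["apples", "3"], ["pears", "5"]]


def to_markdown(header, rows):
    """Serialize the table as a Markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Serialize the table as a minimal HTML table (no escaping of cell text)."""
    head = "".join(f"<th>{cell}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>" for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header, rows):
    """Serialize the table as a LaTeX tabular with left-aligned columns."""
    spec = "l" * len(header)
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + " \\\\", "\\hline"]
    lines += [" & ".join(row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)


# Same content, three representations: only the format varies between prompts.
renderings = {
    "markdown": to_markdown(header, rows),
    "html": to_html(header, rows),
    "latex": to_latex(header, rows),
}

for name, text in renderings.items():
    print(f"--- {name} ---\n{text}")
```

Because every rendering is derived from the same `header` and `rows`, any difference in a model's answers across the three prompts can be attributed to the format rather than to the content.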

Evaluation and Benchmarking Multimodal Progress TABVERSE

7 · Qwen Research · 1mo ago

Qwen3 Embedding: State-of-the-Art Text Embedding and Reranking Models Released

Alibaba's Qwen team has released the Qwen3 Embedding series, a set of open-weights text embedding and reranking models built on the Qwen3 foundation model. The models are designed for retrieval and reranking tasks and claim state-of-the-art performance across multiple benchmarks. They are released under the Apache 2.0 license and are available on Hugging Face and ModelScope.

Evaluation and Benchmarking Open Weights Progress Qwen3 Embedding Alibaba Qwen Apache 2.0