5Hugging Face Blog·1mo ago

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Hugging Face introduces TTS Arena, a community-driven evaluation platform for text-to-speech models that follows the LLM Chatbot Arena approach. Users listen to audio samples from competing TTS systems and vote on quality, and the votes produce Elo-based rankings. The platform aims to provide a more ecologically valid benchmark than existing automated metrics, which often fail to capture human perceptual preferences. Initial results rank both open and proprietary TTS models.
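As a rough illustration of how pairwise votes can turn into an Elo-style ranking, the sketch below applies the standard Elo update to a list of (winner, loser) votes. The starting rating of 1000, the K-factor of 32, and the function and model names are assumptions for illustration; the summary does not say which parameters or variant TTS Arena actually uses.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one vote: move points from the loser to the winner."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # larger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_from_votes(votes, initial=1000.0, k=32.0):
    """Replay votes in order and return (model, rating) sorted best first.

    `initial` and `k` are assumed values, not the arena's real settings.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical model names and votes, for illustration only.
    example_votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
    for name, rating in rank_from_votes(example_votes):
        print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order in which votes are replayed, so a production leaderboard may use a different fitting procedure over the full vote set.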

Related events (8)

5Hugging Face Blog·1mo ago

Launching the Artificial Analysis Text to Image Leaderboard & Arena

Hugging Face and Artificial Analysis are launching a combined leaderboard and arena for evaluating text-to-image models. The leaderboard tracks quality, speed, and cost metrics across leading image generation models, while the arena component collects human preference votes for side-by-side comparisons. This provides a structured benchmark for comparing commercial and open-weight image generation systems.

5Hugging Face Blog·1mo ago

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.

5Hugging Face Blog·1mo ago

Introducing the Chatbot Guardrails Arena

Hugging Face and Lighthouz AI have launched the Chatbot Guardrails Arena, a new evaluation platform focused on assessing safety guardrails in conversational AI systems. The arena uses human preference-based evaluation to benchmark how well different chatbot guardrail implementations resist unsafe or policy-violating outputs. This fills a gap in existing evaluation infrastructure, which has largely focused on capability rather than safety constraint enforcement.

5Hugging Face Blog·1mo ago

Evaluating Audio Reasoning with Big Bench Audio

Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.

4Hugging Face Blog·1mo ago

Speech Synthesis, Recognition, and More With SpeechT5

This Hugging Face blog post introduces SpeechT5, a unified pre-trained model for speech synthesis, recognition, and related tasks. The post covers the model's architecture and capabilities, and explains how to use it via the Hugging Face Transformers library. SpeechT5 is a Microsoft Research model that uses a shared encoder-decoder framework across multiple speech tasks.

6Google Deepmind Blog·1mo ago

Rethinking how we measure AI intelligence

DeepMind has announced Game Arena, a new open-source evaluation platform designed for rigorous head-to-head comparison of frontier AI models. The platform uses environments with clear winning conditions to assess model capabilities. The release is DeepMind's response to ongoing concerns about the adequacy of existing AI benchmarks.

4Hugging Face Blog·1mo ago

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Hugging Face has updated its Open ASR Leaderboard to include new multilingual and long-form audio transcription evaluation tracks. The post analyzes trends across submitted automatic speech recognition models, providing comparative benchmarking data across languages and extended audio contexts. This expands the leaderboard's coverage beyond English short-form ASR to better reflect real-world deployment scenarios.

4Hugging Face Blog·1mo ago

Benchmarking Text Generation Inference

Hugging Face published a benchmarking guide for Text Generation Inference (TGI), its production inference server. The post covers methodology for measuring throughput and latency under various load conditions, helping practitioners evaluate TGI performance for deployment decisions. It provides tooling and guidance for running reproducible benchmarks against TGI endpoints.
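The post's own tooling is not reproduced here. As a minimal sketch of what summarizing throughput and latency can look like, the example below assumes per-request latencies, a total generated-token count, and the wall-clock duration of the run have already been collected by some load generator. The function name, its parameters, and the choice of median and nearest-rank 95th percentile are illustrative assumptions, not part of TGI.

```python
import statistics


def summarize_run(latencies_s, tokens_generated, wall_time_s):
    """Summarize one benchmark run (hypothetical helper, not a TGI API).

    latencies_s: per-request end-to-end latencies in seconds
    tokens_generated: total tokens produced across all requests
    wall_time_s: wall-clock duration of the whole run in seconds
    """
    if not latencies_s or wall_time_s <= 0:
        raise ValueError("need at least one latency and a positive wall time")

    ordered = sorted(latencies_s)
    n = len(ordered)
    # Nearest-rank 95th percentile, computed with integer arithmetic:
    # rank = ceil(0.95 * n), converted to a zero-based index.
    p95_index = (95 * n + 99) // 100 - 1

    return {
        "requests": n,
        "latency_p50_s": statistics.median(ordered),
        "latency_p95_s": ordered[p95_index],
        "throughput_tokens_per_s": tokens_generated / wall_time_s,
    }


if __name__ == "__main__":
    # Synthetic numbers for illustration only, not measured results.
    sample_latencies = [float(i) for i in range(1, 21)]
    print(summarize_run(sample_latencies, tokens_generated=2000, wall_time_s=40.0))
```

Reporting a tail percentile alongside the median matters under load, because queueing tends to inflate the slowest requests long before it moves the median.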