Entity · benchmark

Chatbot Arena

benchmark · active · chatbot-arena-846b3317 · 4 events · first seen May 18, 2026

Aliases: Chatbot Arena

Co-occurring entities

MT-Bench Hugging Face UltraFeedback Measuring Semantic Progress in Multi-turn Dialogue via Information Gain TTS Arena Elo rating system Keras TPU Alibaba Qwen Qwen-Max-0428 Qwen1.5-110B-Chat

More like this (12)

Chatbot Guardrails Arena Game Arena TTS Arena BigCodeArena Arena.ai Code Arena WebDev Arena Search WebArena BashArena Arena AI ChatGPT Atlas chat templates ResearchArena

Recent events (4)

5 · arXiv · cs.CL · Jun 11, 2026

Information-theoretic metric for measuring semantic progress in multi-turn dialogue

A new arXiv preprint formalizes 'semantic progress' in multi-turn dialogue as question-conditioned uncertainty reduction and introduces an information-theoretic metric approximated in embedding space using a Gaussian formulation with closed-form updates. The metric has desirable theoretical properties (monotonicity, additive decomposition, diminishing returns) and requires no autoregressive inference at evaluation time, making it reproducible and lightweight. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show competitive or improved agreement with human judgments compared to several LLM-as-a-judge baselines. The approach works with lightweight embedding models under CPU-only execution.

Evaluation and Benchmarking · Chatbot Arena · MT-Bench · UltraFeedback · +1 more
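The summary above mentions a Gaussian formulation with closed-form updates and three properties: monotonicity, additive decomposition, and diminishing returns. The sketch below is not the preprint's metric. It is a minimal illustration, assuming a diagonal Gaussian belief over an embedding and treating each dialogue turn as one noisy observation per dimension, of how a closed-form Gaussian update can yield an information gain with those three properties. The function names and the per-turn `noise_var` inputs are hypothetical.

```python
import math

def observe(prior_var, noise_var):
    """Closed-form posterior variances after one noisy observation per dimension.

    Assumption: diagonal Gaussian belief, independent Gaussian noise per dimension.
    """
    return [1.0 / (1.0 / p + 1.0 / n) for p, n in zip(prior_var, noise_var)]

def information_gain(prior_var, post_var):
    """Entropy reduction (in nats) between two diagonal Gaussians."""
    return 0.5 * sum(math.log(p / q) for p, q in zip(prior_var, post_var))

def dialogue_gains(prior_var, turn_noise_vars):
    """Per-turn information gains for a sequence of turns (hypothetical inputs)."""
    gains = []
    var = list(prior_var)
    for noise in turn_noise_vars:
        new_var = observe(var, noise)
        gains.append(information_gain(var, new_var))
        var = new_var
    return gains, var

if __name__ == "__main__":
    prior = [1.0, 1.0]
    # Two equally informative turns; the second one reduces uncertainty less.
    gains, final_var = dialogue_gains(prior, [[1.0, 1.0], [1.0, 1.0]])
    print("per-turn gains:", gains)
    print("total gain:", information_gain(prior, final_var))
```

In this toy setting every gain is non-negative, the per-turn gains sum to the total entropy reduction, and repeating an equally informative turn yields a smaller gain. Nothing here needs autoregressive inference; it is plain arithmetic on variances.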

5 · Hugging Face Blog · May 19, 2026

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Hugging Face introduces TTS Arena, a community-driven evaluation platform for text-to-speech models modeled after the LLM Chatbot Arena approach. Users listen to audio samples from competing TTS systems and vote on quality, generating Elo-based rankings. The platform aims to provide a more ecologically valid benchmark than existing automated metrics, which often fail to capture human perceptual preferences. Initial results surface rankings across open and proprietary TTS models.

Evaluation and Benchmarking · Multimodal Progress · Chatbot Arena · TTS Arena · Hugging Face · +1 more
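To make the "votes to Elo-based rankings" step in the summary above concrete, here is a minimal sketch of a standard sequential Elo update applied to pairwise votes. It is an illustration only: the source does not describe TTS Arena's actual rating computation, and the starting rating of 1000, the K-factor of 32, and the model names are assumptions chosen for the example.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    """Apply one vote in place: the winner gains what the loser gives up."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

def rank_from_votes(votes, initial=1000.0, k=32.0):
    """Turn (winner, loser) votes into a list of (model, rating), best first."""
    ratings = {}
    for winner, loser in votes:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical votes: each tuple is (preferred sample's model, other model).
    votes = [("tts_a", "tts_b"), ("tts_a", "tts_c"), ("tts_b", "tts_c")]
    for model, rating in rank_from_votes(votes):
        print(f"{model}: {rating:.1f}")
```

Sequential Elo depends on the order in which votes arrive, so a sketch like this is best read as a way to see the mechanics rather than as a reproduction of any published leaderboard.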

4 · Hugging Face Blog · May 19, 2026

How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

A Hugging Face blog post describes a chatbot arena experiment evaluating LLMs' ability to self-correct errors, using Keras and TPUs as the infrastructure backbone. The experiment appears to use a head-to-head arena format to assess self-correction capabilities across models. This touches on both evaluation methodology and a core capability question about whether LLMs can reliably identify and fix their own mistakes.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Chatbot Arena · Keras · TPU · +1 more

6 · Qwen Research · May 18, 2026

Qwen-Max-0428: Alibaba's Largest Instruction-Tuned Model Released

Alibaba's Qwen team has released Qwen-Max-0428, a new instruction-tuned model larger than the previously open-sourced Qwen1.5-110B-Chat. The model has entered Chatbot Arena and reached the top 10 on the leaderboard, and it also outperforms Qwen1.5-110B-Chat on MT-Bench. It is available via API but does not appear to be open-weights at this stage.

Frontier Model Releases · Evaluation and Benchmarking · Chatbot Arena · Alibaba Qwen · MT-Bench · +3 more