3 · arXiv cs.CL (Computation and Language) · 7d ago

CN-NewsTTS Bench: automatic benchmark for Chinese news TTS pronunciation of complex written forms

Researchers introduce CN-NewsTTS Bench v0.1, an open benchmark that evaluates how Chinese news text-to-speech systems pronounce challenging written forms such as scores, abbreviations, unit symbols, and mixed-script names, all from raw text without preprocessing aids. The benchmark includes a 200-record dev set, an 800-record public test set, an automatic scorer, and baseline results for seven commercial TTS systems. The best system reaches 0.879 strict accuracy while several fall below 0.60, revealing meaningful performance gaps on a practically important but underexplored evaluation dimension.

Evaluation and Benchmarking CN-NewsTTS Bench

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

5 · Hugging Face Blog · 1mo ago

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Hugging Face introduces TTS Arena, a community-driven evaluation platform for text-to-speech models, patterned after the LLM Chatbot Arena. Users listen to audio samples from competing TTS systems and vote on quality, and the votes produce Elo-based rankings. The platform aims to provide a more ecologically valid benchmark than existing automated metrics, which often fail to capture human perceptual preferences. Initial results rank both open and proprietary TTS models.

Evaluation and Benchmarking Multimodal Progress Chatbot Arena TTS Arena Hugging Face +1 more
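
For readers unfamiliar with Elo-based rankings from pairwise votes, the following is a minimal sketch of the standard Elo update, not TTS Arena's actual implementation. The K-factor of 32, the starting rating of 1000, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict

K_FACTOR = 32.0        # assumed update step; the real platform may use another value
BASE_RATING = 1000.0   # assumed starting rating for every system


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=K_FACTOR):
    """Apply one vote: move rating points from the loser to the winner."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # larger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_from_votes(votes, k=K_FACTOR, base=BASE_RATING):
    """Turn a sequence of (winner, loser) votes into a sorted leaderboard."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes between three placeholder systems.
votes = [("A", "B"), ("A", "C"), ("B", "C")]
leaderboard = rank_from_votes(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each vote transfers points from the loser to the winner, the total rating mass stays constant, and an upset against a higher-rated system moves the ratings more than an expected result does.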

4 · arXiv cs.CL · 22d ago

Phun-Bench: A Chinese benchmark for evaluating LLM phonological understanding

Researchers introduce Phun-Bench, a purpose-built benchmark for evaluating LLMs on phonological understanding in Chinese across three dimensions: Homophony, Rhyme, and Phonetic Similarity. The benchmark is designed to avoid rote-memorization shortcuts that plague existing phonological evals. Results show LLMs can recall correct pronunciations but fail to apply phonological knowledge flexibly as human speakers do, and the authors propose a hypothesis about the underlying mechanism of LLM phonological 'perception'.

Evaluation and Benchmarking Phun-Bench

5 · Qwen Research · 1mo ago

Qwen-TTS Updated with Chinese Dialect Support and Bilingual Voices

Alibaba's Qwen team has released an update to Qwen-TTS (qwen-tts-2025-05-22), a text-to-speech model trained on millions of hours of speech data. The team claims human-level naturalness and expressiveness, with automatic adjustment of prosody and emotional inflection. A notable new capability is support for three Chinese dialects (Pekingese, Shanghainese, and Sichuanese), delivered through seven named Chinese-English bilingual voices accessible via the Qwen API.

Frontier Model Releases Multimodal Progress Alibaba Qwen Sichuanese dialect TTS Shanghainese dialect TTS +3 more

4 · arXiv cs.CL · 7d ago

ParaPairAudioBench: Pairwise benchmark reveals large gaps in LALM paralinguistic judgment

Researchers introduce ParaPairAudioBench, a pairwise audio benchmark of 5,175 audio pairs spanning five paralinguistic dimensions (Style, Rate, Emphasis, Age, Gender) designed to evaluate Large Audio-Language Models as judges. Experiments show current LALMs lag human judgment by 32 percentage points on average and exhibit severe calibration failures, especially in ambiguous 'Tie' cases. The benchmark includes same-transcript and cross-transcript conditions to disentangle lexical from acoustic reliance, enabling more rigorous assessment of LALM reliability for speech evaluation.

Evaluation and Benchmarking Multimodal Progress ParaPairAudioBench

6 · arXiv cs.CL · 1mo ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.5 Pro BBC News GPT-4o mini +11 more

4 · arXiv cs.CL · 12d ago

IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages

Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.

Evaluation and Benchmarking Multimodal Progress IndicContextEval

4 · arXiv cs.CL · 21d ago

Corpus-Grounded Feature Diffusion pipeline for automated IEP generation in Traditional Chinese

Researchers propose a low-resource fine-tuning pipeline called Corpus-Grounded Feature Diffusion (CGFD) to automate Individualized Education Program (IEP) drafting from Traditional Chinese parent-teacher interview transcripts. The approach fine-tunes Breeze-7B with QLoRA on 582 synthetically diffused samples and uses schema-constrained decoding at inference time, finding that Grammar-Constrained Decoding is counterproductive under Traditional Chinese token budgets. On a small formal hold-out (n=10), the system achieves BERTScore F1 of 0.779, outperforming zero-shot GPT-5.4, DeepSeek-V3.2, Gemini-3-Flash-Preview, and Llama-4-Maverick baselines while enabling fully local, air-gapped inference. The work addresses a gap in Traditional Chinese special-education NLP and demonstrates a privacy-preserving deployment pattern for sensitive document generation.

Evaluation and Benchmarking Enterprise Deployment Patterns DeepSeek V4 Corpus-Grounded Feature Diffusion Grammar-Constrained Decoding +6 more

5 · Hugging Face Blog · 1mo ago

Evaluating Audio Reasoning with Big Bench Audio

Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.

Evaluation and Benchmarking Multimodal Progress Big Bench Audio Hugging Face Big Bench