3 · arXiv cs.CL (Computation and Language) · 3d ago

LuxEmo: 21-hour expressive speech corpus for Luxembourgish TTS with emotion categories

Researchers introduce LuxEmo, a 21-hour conversational speech corpus for Luxembourgish derived from RTL youth broadcasts and annotated with four emotion categories. The corpus was built with a semi-automatic curation pipeline that combines voice activity detection (VAD), denoising, language identification, and human validation. The paper benchmarks five expressive TTS systems spanning cross-lingual transfer, multilingual support, and prosody transfer approaches. The work addresses the underrepresentation of low-resource languages in speech technology research.

Multimodal Progress LuxASR Radio Télévision Luxembourg LuxEmo

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

5 · arXiv · cs.CL · Jun 25, 2026

SpeechEQ benchmark evaluates emotional intelligence in speech-language models across 15 EQ subscales

Researchers introduce SpeechEQ, a benchmark framework for evaluating sociolinguistic and emotional reasoning in Speech-Language Models (SLMs), comprising 2,265 multi-turn dialogues across 15 Emotional Quotient subscales grounded in EQ-i 2.0 theory. The benchmark reveals three systematic failure modes in current multimodal models: over-reliance on text (modality shortcut), alignment-induced safety trap, and contextual amnesia across turns. End-to-end architectures outperform cascaded systems but all evaluated models fall short of genuine emotional awareness. The dataset and demo are publicly released on HuggingFace.

Evaluation and Benchmarking Multimodal Progress EQ-i 2.0 SpeechEQ

4 · arXiv · cs.CL · May 22, 2026

Multimodal Pathos Analysis in Political Speech: LLM-Based vs. Acoustic Emotion Models

Researchers compare acoustic speech emotion recognition (emotion2vec_plus_large), multimodal LLM analysis (Gemini 2.5 Flash), and a multi-agent LLM ensemble (TRUST pipeline) for detecting Pathos in a Bundestag political speech. Gemini Valence correlates strongly with TRUST-Pathos scores (rho=+0.664) while acoustic Valence does not (rho=+0.097), suggesting LLMs capture semantically defined political emotion far better than acoustic models. The study also critiques standard SER benchmark corpora (EMO-DB) for acted speech, cultural bias, and category incompatibility. Results indicate acoustic features remain useful for low-level arousal estimation but are insufficient proxies for rhetorical-emotional analysis.

Agent and Tool Ecosystem Multimodal Progress Gemini-2.5-Flash-Lite Felix Banaszak emotion2vec_plus_large
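The rho values quoted in the summary above read as rank correlations. As a rough illustration, and assuming they are Spearman's rho (the summary does not name the statistic), the sketch below computes it as the Pearson correlation of tie-averaged ranks. The function names are hypothetical and nothing here reproduces the study's data or code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the calculation, any monotonic relationship between two score series yields a rho of 1 or -1 regardless of scale, which is why the statistic suits comparisons between scores produced by different kinds of models.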

4 · arXiv · cs.CL · 5d ago

HPRO: Hierarchical Progressive Reward Optimization for Emotional Text-to-Speech

Researchers propose HPRO, a hierarchical progressive reward optimization framework for LLM-based Text-to-Speech systems that improves emotional expressiveness. The approach introduces HD-Emo codec as a differentiable reward model that separates content and style preference tokens to avoid conflicting gradients, and bridges sentence-level and frame-level reward signals through progressive multi-scale alignment. Experiments show improved emotional expressiveness while preserving linguistic intelligibility.

Alignment and RLHF Multimodal Progress HD-Emo codec HPRO

4 · arXiv · cs.CL · May 21, 2026

LexNeo-Bench: Probing LLM Knowledge of Lexical Borrowing in Luxembourgish via Knowledge-Graph Prompting

Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.

Evaluation and Benchmarking Agent and Tool Ecosystem LexNeo-Bench knowledge graph prompting LuxBorrow

5 · arXiv · cs.CL · Jun 26, 2026

Emotion vectors replicated in open-weight LLMs with architecture-dependent valence geometry

A new arXiv preprint extends prior findings on emotion vectors in Claude Sonnet 4.5 to two open-weight models, Apertus-8B-Instruct-2509 and Gemma-4-E4B-it, by extracting emotion contrast vectors across all layers. The authors recover valence geometry in both models (peak PC1-valence correlations of r=0.76 and r=0.83, near Claude's r=0.81) but find notable architectural differences: Gemma encodes valence strongly in early layers while Apertus shows the opposite pattern. Arousal encoding proves sensitive to the corpus used for extraction, suggesting uneven distribution of arousal-relevant cues across model-generated text.

Open Weights Progress AI Safety Research Gemma-4 E4B-it Claude Sonnet 4.5 Google

6 · arXiv · cs.AI · 4d ago

EMPATH: Multilingual multi-turn safety benchmark for emotional-support chatbots reveals score inflation and run-to-run reliability failures

EMPATH is a new arXiv benchmark for evaluating the safety of emotional-support chatbots. It uses an auditor model to generate multi-turn crisis conversations and a calibrated judge model to score transcripts across 19 metrics in five dimensions. Built for Mexican Spanish and US English, the benchmark surfaces score inflation on 10 of 19 metrics under uncalibrated rubrics and finds that run-to-run reliability is a per-model safety property: one model swings 2–10 points on a crisis metric across identical reruns, and DeepSeek V4 Pro produces different conversations at temperature 0. Evaluation of three frontier models shows aggregate scores within 0.74 points of each other but per-metric divergences of up to six points. Rankings remain stable under a cross-family judge, with 93% agreement within ±1.

Evaluation and Benchmarking AI Safety Research EMPATH DeepSeek V4

5 · arXiv · cs.CL · Jun 3, 2026

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking FastConformer-Large Efficient ASR Training with Conversations that Never Happened BEA-Dialogue

5 · arXiv · cs.CL · May 29, 2026

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress AI Safety Research Reddit LLUMI Direct Preference Optimization (DPO)
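As an illustration of how endorsement signals could become preference pairs for DPO-style training, here is a minimal sketch. The summary does not describe LLUMI's actual pairing rule, so the `Reply` record, the `min_gap` threshold, and the prompt/chosen/rejected output layout are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Reply:
    """Hypothetical record for one community reply to a post."""
    post_id: str
    post_text: str
    text: str
    score: int  # upvotes minus downvotes


def build_preference_pairs(replies, min_gap=5):
    """Pair replies to the same post; the higher-scored reply is 'chosen'.

    Pairs whose score difference is below min_gap are skipped, since a small
    gap is weak evidence of a real community preference (assumed heuristic).
    """
    by_post = defaultdict(list)
    for reply in replies:
        by_post[reply.post_id].append(reply)

    pairs = []
    for group in by_post.values():
        for a, b in combinations(group, 2):
            if abs(a.score - b.score) < min_gap:
                continue
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append({
                "prompt": chosen.post_text,
                "chosen": chosen.text,
                "rejected": rejected.text,
            })
    return pairs
```

Comparing replies only within the same post keeps the prompt fixed across each pair, so the score gap reflects a preference between responses rather than differences in how much attention each post received.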