SpeechEQ benchmark evaluates emotional intelligence in speech-language models across 15 EQ subscales
Researchers introduce SpeechEQ, a benchmark framework for evaluating sociolinguistic and emotional reasoning in Speech-Language Models (SLMs), comprising 2,265 multi-turn dialogues across 15 Emotional Quotient subscales grounded in EQ-i 2.0 theory. The benchmark reveals three systematic failure modes in current multimodal models: over-reliance on text (a modality shortcut), an alignment-induced safety trap, and contextual amnesia across turns. End-to-end architectures outperform cascaded systems, but all evaluated models fall short of genuine emotional awareness. The dataset and demo are publicly released on HuggingFace.
Related events (8)
ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents
This paper introduces ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval in memory-augmented language agents deployed for emotional support applications. The benchmark includes over 1,800 memory-augmented dialogues grounded in Maslow's hierarchy of needs, with structured mappings between emotional needs and supportive memory types. Experiments show that both embedding-based and LLM-driven retrieval paradigms fall significantly short of golden memory conditions on empathy scores, and while chain-of-thought prompting helps, a substantial performance gap remains. The work highlights a systematic gap in current agent memory systems when applied to affective rather than purely factual retrieval tasks.
TTS Arena: Benchmarking Text-to-Speech Models in the Wild
Hugging Face introduces TTS Arena, a community-driven evaluation platform for text-to-speech models modeled after the LLM Chatbot Arena approach. Users listen to audio samples from competing TTS systems and vote on quality, generating Elo-based rankings. The platform aims to provide a more ecologically valid benchmark than existing automated metrics, which often fail to capture human perceptual preferences. Initial results surface rankings across open and proprietary TTS models.
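As a rough illustration of how pairwise votes can become Elo-style rankings, here is a minimal sketch. The K-factor of 32, the starting rating of 1000, and the function names are assumptions for illustration; the summary does not say which constants or update variant TTS Arena uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_wins, k=32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote.

    k is an assumed K-factor, not a value taken from TTS Arena.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the total rating mass is conserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


def rank_from_votes(votes, initial=1000.0, k=32.0):
    """Process (winner, loser) votes in order and return models sorted by rating."""
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        ratings[winner], ratings[loser] = update_elo(r_w, r_l, True, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical model names and votes, for illustration only.
    votes = [("tts_a", "tts_b"), ("tts_a", "tts_c"), ("tts_c", "tts_b")]
    for name, rating in rank_from_votes(votes):
        print(f"{name}: {rating:.1f}")
```

Sequential Elo updates depend on vote order, so a production leaderboard might shuffle and average over several passes or fit a different pairwise model; this sketch only shows the basic update.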
Evaluating Audio Reasoning with Big Bench Audio
Hugging Face introduces Big Bench Audio, a new benchmark designed to evaluate audio reasoning capabilities in AI models. The benchmark appears to extend the Big Bench evaluation framework into the audio domain, targeting multimodal models that process and reason over audio inputs. This release addresses a gap in evaluation tooling for audio-capable language models.
Multimodal Pathos Analysis in Political Speech: LLM-Based vs. Acoustic Emotion Models
Researchers compare acoustic speech emotion recognition (emotion2vec_plus_large), multimodal LLM analysis (Gemini 2.5 Flash), and a multi-agent LLM ensemble (TRUST pipeline) for detecting Pathos in a Bundestag political speech. Gemini Valence correlates strongly with TRUST-Pathos scores (rho=+0.664) while acoustic Valence does not (rho=+0.097), suggesting LLMs capture semantically defined political emotion far better than acoustic models. The study also critiques standard SER benchmark corpora (EMO-DB) for acted speech, cultural bias, and category incompatibility. Results indicate acoustic features remain useful for low-level arousal estimation but are insufficient proxies for rhetorical-emotional analysis.
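The rho values quoted above are presumably Spearman rank correlations. A minimal pure-Python sketch of that statistic follows, assuming tied values receive average ranks; the function names are illustrative and the example scores are made up, not taken from the study.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up per-segment scores, for illustration only.
    llm_valence = [0.1, 0.4, 0.35, 0.8, 0.6]
    pathos_score = [1.0, 2.5, 3.0, 4.5, 4.0]
    print(round(spearman_rho(llm_valence, pathos_score), 3))
```

Because the statistic uses only ranks, it measures whether two scores order the speech segments similarly, not whether they agree on absolute values.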
IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages
Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.
ParaPairAudioBench: Pairwise benchmark reveals large gaps in LALM paralinguistic judgment
Researchers introduce ParaPairAudioBench, a pairwise audio benchmark of 5,175 audio pairs spanning five paralinguistic dimensions (Style, Rate, Emphasis, Age, Gender) designed to evaluate Large Audio-Language Models as judges. Experiments show current LALMs lag human judgment by 32 percentage points on average and exhibit severe calibration failures, especially in ambiguous 'Tie' cases. The benchmark includes same-transcript and cross-transcript conditions to disentangle lexical from acoustic reliance, enabling more rigorous assessment of LALM reliability for speech evaluation.
Acoustic cue alignment tokens improve speech emotion recognition in audio language models
Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.
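Unweighted average recall (UAR) is the mean of per-class recalls, so every emotion class counts equally regardless of how often it occurs. A minimal sketch, with made-up labels and an illustrative function name:

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for label in classes:
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain accuracy, shown only for comparison with UAR."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Made-up imbalanced labels: a model that always predicts the majority class.
    y_true = ["neutral"] * 8 + ["angry"] * 2
    y_pred = ["neutral"] * 10
    print(accuracy(y_true, y_pred))                   # 0.8
    print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

On class-imbalanced emotion corpora, a majority-class predictor can score high accuracy while its UAR stays at chance level, which is why UAR is a common choice for this task.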
ESI-Bench: A Benchmark for Embodied Spatial Intelligence Closing the Perception-Action Loop
ESI-Bench is a new benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories, built on OmniGibson and grounded in Spelke's core knowledge systems. It evaluates agents that must actively deploy perception, locomotion, and manipulation to accumulate task-relevant evidence, rather than passively processing oracle observations. Experiments on state-of-the-art MLLMs reveal that active exploration outperforms passive baselines, but most failures stem from 'action blindness'—poor action choices leading to cascading errors—and a metacognitive gap where models commit prematurely with high confidence regardless of evidence quality. Human studies show humans seek falsifying viewpoints and revise beliefs under contradiction, a capability current models lack.

