4 · arXiv cs.CL (Computation and Language) · 3d ago

TRACE framework and DyadEE dataset for emotional entrainment detection in dyadic speech

Researchers introduce DyadEE, a dataset for detecting emotional entrainment in dyadic speech that contains both natural entrained conversations and synthetic disrupted interactions created via partner swapping and emotion resynthesis. They also propose TRACE, a window-level framework that models each dyadic interaction as an ordered sequence of acoustic embeddings drawn from emotion-fine-tuned Whisper representations. TRACE achieves 97.01% accuracy on DyadEE, with conversational context and relationship information proving key to performance. The work is motivated by the growing deployment of speech AI agents that need to understand affective coordination.

Multimodal Progress · TRACE · DyadEE · Whisper
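
As a rough illustration of the partner-swapping idea behind DyadEE's disrupted interactions, the sketch below re-pairs each speaker with a partner from a different conversation. The data layout (a list of `(speaker_a, speaker_b)` recording ids) and the function name `swap_partners` are assumptions made for illustration. The summary does not describe the paper's actual construction procedure, and emotion resynthesis is not covered here.

```python
from typing import List, Tuple

# Hypothetical layout: (speaker A recording id, speaker B recording id)
Dyad = Tuple[str, str]

def swap_partners(dyads: List[Dyad], shift: int = 1) -> List[Dyad]:
    """Pair each speaker A with the speaker B of another conversation.

    A cyclic shift ensures no speaker A keeps the original partner,
    provided the ids are distinct and shift is not a multiple of len(dyads).
    """
    n = len(dyads)
    if n < 2 or shift % n == 0:
        raise ValueError("need at least two dyads and a non-trivial shift")
    return [(dyads[i][0], dyads[(i + shift) % n][1]) for i in range(n)]

natural = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
disrupted = swap_partners(natural)

# Label natural pairs 1 (entrained) and swapped pairs 0 (disrupted).
labeled = [(d, 1) for d in natural] + [(d, 0) for d in disrupted]
```

A cyclic shift is only one simple way to guarantee that every swapped pair is a mismatch; random derangements would serve the same purpose.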

Related events (8)

5 · arXiv · cs.CL · 8d ago

SpeechEQ benchmark evaluates emotional intelligence in speech-language models across 15 EQ subscales

Researchers introduce SpeechEQ, a benchmark framework for evaluating sociolinguistic and emotional reasoning in Speech-Language Models (SLMs), comprising 2,265 multi-turn dialogues across 15 Emotional Quotient subscales grounded in EQ-i 2.0 theory. The benchmark reveals three systematic failure modes in current multimodal models: over-reliance on text (a modality shortcut), an alignment-induced safety trap, and contextual amnesia across turns. End-to-end architectures outperform cascaded systems, but all evaluated models fall short of genuine emotional awareness. The dataset and demo are publicly released on HuggingFace.

Evaluation and Benchmarking · Multimodal Progress · EQ-i 2.0 · SpeechEQ

4 · arXiv · cs.CL · 25d ago

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.

Evaluation and Benchmarking · Multimodal Progress · FAU-Aibo · Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition · IEMOCAP · +1 more
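
The entry above reports results in unweighted average recall (UAR), the mean of per-class recall. The sketch below shows how the metric is computed and why it differs from plain accuracy on imbalanced emotion labels. The label names and toy predictions are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict
from typing import Sequence

def unweighted_average_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recall, so each class counts equally regardless of size."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of examples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy, imbalanced example: 8 neutral clips, 2 angry clips.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
```

Here accuracy is 0.9 while UAR is 0.75, because the missed minority-class clip costs half of that class's recall.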

6 · arXiv · cs.AI · 3d ago

EMPATH: Multilingual multi-turn safety benchmark for emotional-support chatbots reveals score inflation and run-to-run reliability failures

EMPATH is a new benchmark, posted on arXiv, for evaluating the safety of emotional-support chatbots. It uses an auditor model to generate multi-turn crisis conversations and a calibrated judge model to score transcripts across 19 metrics in five dimensions. Built for Mexican Spanish and US English, the benchmark surfaces score inflation on 10 of 19 metrics under uncalibrated rubrics and finds that run-to-run reliability is a per-model safety property: one model swings 2–10 points on a crisis metric across identical reruns, and DeepSeek V4 Pro produces different conversations even at temperature 0. An evaluation of three frontier models shows aggregate scores within 0.74 points of each other but per-metric divergences of up to six points; rankings remain stable under a cross-family judge (93% within ±1).

Evaluation and Benchmarking · AI Safety Research · EMPATH · DeepSeek V4

4 · arXiv · cs.CL · 15h ago

DramaSR-LRM: Reasoning LLM with multimodal tool-use for speaker recognition in TV dramas

Researchers introduce DramaSR-532K, a large-scale benchmark of 532K annotated dialogue lines across 900+ characters from long-form TV dramas, targeting multimodal speaker recognition. They also propose DramaSR-LRM, a system built on a large reasoning model that uses multimodal tool-use to aggregate auditory, linguistic, and visual cues for speaker attribution. The approach significantly outperforms baselines, especially on short utterances where acoustic biometrics alone are unreliable. Data and code are to be publicly released.

Evaluation and Benchmarking · Multimodal Progress · DramaSR-LRM · DramaSR-532K

4 · arXiv · cs.CL · 1mo ago

Multimodal Pathos Analysis in Political Speech: LLM-Based vs. Acoustic Emotion Models

Researchers compare acoustic speech emotion recognition (emotion2vec_plus_large), multimodal LLM analysis (Gemini 2.5 Flash), and a multi-agent LLM ensemble (TRUST pipeline) for detecting Pathos in a Bundestag political speech. Gemini Valence correlates strongly with TRUST-Pathos scores (rho=+0.664) while acoustic Valence does not (rho=+0.097), suggesting LLMs capture semantically defined political emotion far better than acoustic models. The study also critiques standard SER benchmark corpora (EMO-DB) for acted speech, cultural bias, and category incompatibility. Results indicate acoustic features remain useful for low-level arousal estimation but are insufficient proxies for rhetorical-emotional analysis.

Agent and Tool Ecosystem · Multimodal Progress · Gemini-2.5-Flash-Lite · Felix Banaszak · emotion2vec_plus_large · +4 more
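
The rho values quoted in the entry above are Spearman rank correlations. As a minimal sketch of how such a coefficient is computed, the code below ranks both score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The toy scores are invented for illustration and are unrelated to the study's data.

```python
import math
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)

# Toy per-segment scores from two hypothetical annotation methods.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman_rho(method_a, method_b)
```

Because only ranks matter, the coefficient is insensitive to the two methods using different score scales.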

4 · arXiv · cs.AI · 18d ago

AudioDER: Deduplication-enhanced reasoning dataset for post-training large audio-language models

Researchers introduce AudioDER, a ~191k-sample post-training dataset for Large Audio-Language Models (LALMs) built via an acoustic similarity-based deduplication pipeline to reduce redundancy and improve corpus diversity. Each sample pairs an audio clip with a multiple-choice question, answer candidates, a caption, and a chain-of-thought rationale generated by Qwen3-30B. Post-training Qwen2-Audio-7B-Instruct on AudioDER yields consistent gains on audio reasoning benchmarks including MMAU-mini, MMSU, and MMAR. The work addresses a data quality gap in audio-language training rather than proposing a new model architecture.

Evaluation and Benchmarking · Multimodal Progress · AudioDER · Qwen2-Audio-7B-Instruct · Qwen3-30B · +3 more
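
One simple way to picture similarity-based deduplication, as mentioned in the AudioDER entry above, is a greedy filter over clip embeddings: keep a clip only if it is not too similar to any clip already kept. This is a generic sketch and not the paper's pipeline. The cosine metric, the 0.95 threshold, and the two-dimensional toy embeddings are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def deduplicate(embeddings: Sequence[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Return indices of clips kept by a greedy near-duplicate filter.

    A clip is kept only if its similarity to every previously kept clip
    is below the (hypothetical) threshold.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: clips 0 and 1 are near-duplicates, as are clips 2 and 3.
clips = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
kept_indices = deduplicate(clips)
```

The greedy filter is quadratic in the number of kept clips, so a corpus of this size would likely need approximate nearest-neighbor search in practice.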

3 · arXiv · cs.CL · 22d ago

Annotated dataset for enthymeme detection in political tweets with disagreement-aware training

Researchers present a dataset of 1,482 politically controversial tweets annotated by five annotators for enthymemes — arguments with unstated premises or conclusions — designed to study label variation rather than eliminate it. Annotation guidelines are grounded in Walton's argumentation schemes, and the paper includes a complexity analysis of cognitive load in the task. Preliminary experiments show that models trained on annotator disagreement outperform those trained on hard majority-vote labels, suggesting value in preserving annotation disagreement for subjective NLP tasks.

Evaluation and Benchmarking · A Resource for Enthymeme Detection in Controversial Political Discourse · Walton's argumentation schemes
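
To make the contrast in the entry above concrete, the sketch below turns five annotator votes into a hard majority-vote target and a soft target that preserves the disagreement, then scores predictions with cross-entropy. The label names, the vote split, and the helper functions are hypothetical; the summary does not describe the paper's actual training setup.

```python
import math
from collections import Counter
from typing import List, Sequence

LABELS = ("enthymeme", "not_enthymeme")  # hypothetical label set

def soft_label(votes: Sequence[str], labels: Sequence[str] = LABELS) -> List[float]:
    """Fraction of annotators choosing each label."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]

def hard_label(votes: Sequence[str], labels: Sequence[str] = LABELS) -> List[float]:
    """One-hot majority vote; ties go to the earlier label in `labels`."""
    counts = Counter(votes)
    best = max(labels, key=lambda label: counts[label])
    return [1.0 if label == best else 0.0 for label in labels]

def cross_entropy(target: Sequence[float], predicted: Sequence[float], eps: float = 1e-12) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

# Five annotators split 3 to 2 on one tweet.
votes = ["enthymeme"] * 3 + ["not_enthymeme"] * 2
soft = soft_label(votes)   # keeps the 60/40 split
hard = hard_label(votes)   # collapses to a single class

# Under the soft target, an overconfident prediction is penalized more
# than one that matches the annotator distribution.
loss_matched = cross_entropy(soft, [0.6, 0.4])
loss_overconfident = cross_entropy(soft, [0.9, 0.1])
```

With the hard target the same overconfident prediction would be rewarded, which is the information a majority vote discards.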

5 · arXiv · cs.CL · 23d ago

RL-based alignment improves interactivity in full-duplex spoken dialogue models

Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity—pause handling, turn-taking, backchanneling, and user interruption—each with axis-specific reward functions, plus an LLM-based reward to prevent semantic degradation. The approach is applied to two open-source models, Moshi and PersonaPlex, showing consistent improvements in both offline and real-time multi-turn evaluation.

Alignment and RLHF · Multimodal Progress · Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models · PersonaPlex · Moshi