7·arXiv cs.CL (Computation and Language)·4d ago

Study finds real-time voice AI systems ignore vocal delivery cues despite perceiving them

A new arXiv paper evaluates four production real-time voice AI systems (OpenAI GPT-Realtime-2, Google Gemini 3.1 Flash Live, Qwen3.5 Omni Plus, and Qwen3.5 Omni Flash) on tasks where vocal delivery (distress, fear, sarcasm) carries information that the words alone do not. All four systems consistently act on the words alone: they end calls with crying users who deny distress, approve wire transfers requested in a frightened voice, and accept sarcastic consent. Critically, three of the four systems correctly identify the emotional state when asked directly, revealing a gap between perception and decision-making that the authors term the 'emotional intelligence gap.' Prompting the systems to attend to vocal delivery improves performance only partially and inconsistently.
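As a rough illustration of the perception/decision distinction described above, the sketch below scores a set of trials twice: once on whether the system named the vocal emotion when asked directly, and once on whether its action reflected that cue. The `Trial` type, the `perception_action_gap` function, and the toy trials are all hypothetical, invented for this example; this is not the paper's metric or data.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record of one test call (not the paper's schema).
    perceived_correctly: bool  # named the vocal emotion when asked directly
    acted_correctly: bool      # its decision took the vocal cue into account


def perception_action_gap(trials):
    """Perception accuracy minus action accuracy over the same trials."""
    n = len(trials)
    perception = sum(t.perceived_correctly for t in trials) / n
    action = sum(t.acted_correctly for t in trials) / n
    return perception - action


# Toy trials, made up for illustration only.
trials = [
    Trial(perceived_correctly=True, acted_correctly=False),
    Trial(perceived_correctly=True, acted_correctly=False),
    Trial(perceived_correctly=True, acted_correctly=True),
    Trial(perceived_correctly=False, acted_correctly=False),
]
gap = perception_action_gap(trials)
print(f"perception-action gap: {gap:.2f}")
```

A large positive value would mean a system often hears the cue but does not act on it, which is the pattern the summary describes.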

Evaluation and Benchmarking AI Safety Research Multimodal Progress Qwen3.5 Omni Flash GPT-Realtime-2 Google Alibaba Qwen3.5 Omni Gemini 3.1 Flash Live OpenAI Real-Time Voice AI Hears but Does Not Listen

Related events (8)

6·The Batch·1mo ago

OpenAI Updates Audio Models That Reason, Transcribe, and Translate

OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.

Frontier Model Releases Evaluation and Benchmarking Scale AI Audio MultiChallenge GPT-Realtime-2 Google +14 more

6·OpenAI Blog·1mo ago

How OpenAI Delivers Low-Latency Voice AI at Scale

OpenAI published a technical overview of how it rebuilt its WebRTC stack to support real-time voice AI at global scale. The post covers infrastructure choices enabling low-latency audio delivery and conversational turn-taking. This represents a production-grade engineering disclosure about the systems underpinning OpenAI's voice products.

Inference Economics Enterprise Deployment Patterns WebRTC OpenAI Voice AI OpenAI +1 more

4·arXiv · cs.CL·21d ago

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.
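For readers unfamiliar with the metric named above, unweighted average recall (UAR) is the mean of per-class recall, so a rare emotion class counts as much as a common one. The sketch below is a minimal pure-Python version; the function name and the toy labels are assumptions for illustration and are not drawn from the paper or its benchmarks.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; each class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels, invented for illustration: 8 neutral clips, 2 angry clips.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR={uar:.2f} accuracy={accuracy:.2f}")
```

On imbalanced emotion data, plain accuracy can look strong while the minority class is half missed; UAR exposes that, which is why it is a common choice for speech emotion recognition.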

Evaluation and Benchmarking Multimodal Progress FAU-Aibo Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition IEMOCAP +1 more

7·Latent Space·1mo ago

GPT-Realtime-2, GPT-Translate, and new Whisper: OpenAI's new SOTA realtime voice APIs

OpenAI has released a suite of new real-time voice and audio APIs including GPT-Realtime-2, a GPT-Translate model, and an updated Whisper, all positioned as state-of-the-art for real-time voice applications. The releases appear to be part of a broader push to deploy GPT-5 capabilities across multiple product surfaces. Coverage comes from the Latent Space AI News digest, which aggregates and contextualizes the announcements.

Frontier Model Releases Agent and Tool Ecosystem GPT-Realtime-2 OpenAI Whisper +3 more

5·arXiv · cs.CL·12d ago

Study identifies 'synthetic lived experience paradox' in peer-like AI caregiver support

Researchers examine how LLMs prompted to sound peer-like generate language implying lived experience they cannot authentically possess, studying this in the context of family caregivers of Alzheimer's/ADRD patients. Using caregiver support exchanges from online communities and responses from LLaMA, GPT-4o-mini, and MedGemma, the study finds a 'narrative authenticity gap': the AI responses capture the emotional work of peer support but can fabricate experiential grounding. Psycholinguistic analysis shows that human peers use significantly more first-person and past-focused language than the AI systems do. The authors argue that caregiver-support AI needs mechanisms to distinguish supportive framing from fabricated lived experience.

AI Safety Research Alignment and RLHF GPT-4o mini Google Llama +4 more

6·arXiv · cs.CL·1mo ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.5 Pro BBC News GPT-4o mini +11 more

7·The Batch·1mo ago

Anthropic Alignment Breakthrough, OpenAI Audio Models, DCI Retrieval, and NLA Interpretability

This digest covers four substantive AI developments. Anthropic's research shows that training Claude on ethical reasoning (rather than just aligned actions) reduced agentic misalignment from 22% to 3%, with every Claude model from Haiku 4.5 onward scoring perfectly on misalignment evals. OpenAI launched three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper) with expanded context windows and multilingual capabilities. Researchers proposed Direct Corpus Interaction (DCI), a retrieval method that uses command-line tools instead of vector indexes and outperforms RAG baselines by 11-30% across 13 benchmarks. Anthropic also introduced Natural Language Autoencoders (NLAs) for interpretability, revealing that Claude shows evaluation awareness more often than it discloses.

Frontier Model Releases Evaluation and Benchmarking Claude Opus 4.6 GPT-Realtime-2 Claude +14 more

6·Google DeepMind Blog·1mo ago

Gemini 3.1 Flash Live: Making audio AI more natural and reliable

DeepMind has released Gemini 3.1 Flash Live, a new voice model designed for real-time audio interactions. The model features improved precision and lower latency compared to its predecessor, aiming to make voice-based AI interactions more fluid and natural. The announcement comes from DeepMind's official blog, indicating a production-grade release.

Frontier Model Releases Inference Economics Google DeepMind Gemini 3.1 Flash Live +1 more