Almanac

speech-to-avatar systems

other · active · provisional · speech-to-avatar-systems-110acd7c · 1 event · first seen 18d ago

Aliases: speech-to-avatar systems

Recent events (1)

6 · arXiv · cs.CL · 18d ago

VideoFDB: First Benchmark for Full-Duplex Audio-Visual Conversational Agent Evaluation

VideoFDB is introduced as the first benchmark for full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, filling a gap left by existing full-duplex benchmarks, which evaluate only speech. It provides 237 dyadic video-call clips covering 11 nonverbal conversational dynamics, a perception/generation taxonomy, and an LM-as-judge rubric framework. Evaluation across open- and closed-source vision-speech agents reveals systematic failure modes, including captioning collapse and visual-stream ignorance, and shows that current systems cannot perform the streaming joint audiovisual grounding required for natural conversation. Cascaded speech-to-avatar architectures are found to be incapable by design of producing full-duplex nonverbal cues.