Entity · other

speech-to-avatar systems

other · active · speech-to-avatar-systems-110acd7c · 1 event · first seen May 29, 2026

Aliases: speech-to-avatar systems

Co-occurring entities

VideoFDB · conversational agents · LLM-as-a-Judge

More like this (12)

Speech-to-Speech · huggingface/speech-to-speech · CapSpeech-TTS · voice cloning · tool-augmented language agents · simultaneous speech-to-text translation · text-to-speech · Connecting Speech to Words through Images · foreground-background dual-agent voice architecture · SpeechMatrix · Voxtral TTS · E-TTS

Recent events (1)

arXiv · cs.CL · May 29, 2026

VideoFDB: First Benchmark for Full-Duplex Audio-Visual Conversational Agent Evaluation

VideoFDB is introduced as the first benchmark targeting full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, filling a gap where existing full-duplex benchmarks evaluate only speech. It provides 237 dyadic video-call clips covering 11 nonverbal conversational dynamics, a perception/generation taxonomy, and an LM-as-judge rubric framework. Evaluation across open- and closed-source vision-speech agents reveals systematic failure modes, including captioning collapse and visual-stream ignorance, and shows that current systems cannot perform the streaming joint audiovisual grounding required for natural conversation. Cascaded speech-to-avatar architectures are found to be inherently incapable of producing full-duplex nonverbal cues.

Evaluation and Benchmarking · Agent and Tool Ecosystem · VideoFDB · speech-to-avatar systems · conversational agents