benchmark

ParaPairAudioBench

benchmark · active · provisional · parapairaudiobench-a96bfd4e · 1 event · first seen 19h ago

Aliases: ParaPairAudioBench

More like this (12)

SAM Audio-Bench · ProgramBench · Proactive-Sound-Bench · SoundnessBench · RepoBench · SPBench · Artificial Analysis Big Bench Audio · PowerCodeBench · PortBench · PaSBench-Video · OpAI-Bench · FishAudio-S2-Pro

Recent events (1)

arXiv · cs.CL · 19h ago · source

ParaPairAudioBench: Pairwise benchmark reveals large gaps in LALM paralinguistic judgment

Researchers introduce ParaPairAudioBench, a pairwise audio benchmark of 5,175 audio pairs spanning five paralinguistic dimensions (Style, Rate, Emphasis, Age, Gender), designed to evaluate Large Audio-Language Models (LALMs) as judges. In experiments, current LALMs trail human judgment by 32 percentage points on average and show severe calibration failures, especially on ambiguous 'Tie' cases. The benchmark includes same-transcript and cross-transcript conditions to separate a model's reliance on lexical content from its reliance on acoustic cues, enabling a more rigorous assessment of how reliable LALMs are for speech evaluation.
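The summary does not describe the benchmark's scoring procedure. As a minimal sketch, assuming each pair carries a three-way label (clip A, clip B, or Tie) and that judges are scored by plain exact-match accuracy, the snippet below shows how per-dimension accuracy, an average human-minus-model gap in percentage points, and a simple Tie-rate comparison could be computed. `PairItem`, `accuracy_by_dimension`, `mean_gap_pp`, and `tie_rates` are hypothetical names, the Tie-rate comparison is only a crude proxy for the reported Tie-case failures (not a calibration metric), and the toy labels are invented for illustration, not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical three-way label set: clip A wins, clip B wins, or neither (Tie).
LABELS = ("A", "B", "Tie")


@dataclass
class PairItem:
    dimension: str  # paralinguistic dimension, e.g. "Rate" or "Age"
    gold: str       # reference label for the pair (assumed human consensus)
    predicted: str  # label produced by the judge being scored (model or human)


def accuracy_by_dimension(items):
    """Return {dimension: exact-match accuracy} for one judge."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.dimension] += 1
        correct[item.dimension] += int(item.predicted == item.gold)
    return {dim: correct[dim] / total[dim] for dim in total}


def mean_gap_pp(human_acc, model_acc):
    """Average human-minus-model accuracy gap in percentage points,
    macro-averaged over the dimensions both judges were scored on."""
    dims = sorted(set(human_acc) & set(model_acc))
    if not dims:
        raise ValueError("no shared dimensions to compare")
    return sum(100 * (human_acc[d] - model_acc[d]) for d in dims) / len(dims)


def tie_rates(items):
    """Return (fraction of pairs the judge labels Tie, fraction whose gold label is Tie)."""
    n = len(items)
    predicted = sum(item.predicted == "Tie" for item in items) / n
    gold = sum(item.gold == "Tie" for item in items) / n
    return predicted, gold


# Toy, invented labels: the model never predicts Tie, while the human judge sometimes does.
model_items = [
    PairItem("Rate", "A", "A"),
    PairItem("Rate", "Tie", "A"),
    PairItem("Age", "B", "B"),
    PairItem("Age", "Tie", "B"),
]
human_items = [
    PairItem("Rate", "A", "A"),
    PairItem("Rate", "Tie", "Tie"),
    PairItem("Age", "B", "B"),
    PairItem("Age", "Tie", "A"),
]

print(accuracy_by_dimension(model_items))
print(mean_gap_pp(accuracy_by_dimension(human_items), accuracy_by_dimension(model_items)))
print(tie_rates(model_items))
```

Macro-averaging over dimensions weights each dimension equally regardless of how many pairs it has; a pair-weighted (micro) average is an equally plausible choice, and the summary does not say which the authors use.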

Evaluation and Benchmarking · Multimodal Progress · ParaPairAudioBench