ParaPairAudioBench

benchmark · active · provisional · parapairaudiobench-a96bfd4e · 1 event · first seen 19h ago

Aliases: ParaPairAudioBench

Recent events (1)

4 · arXiv · cs.CL · 19h ago

ParaPairAudioBench: Pairwise benchmark reveals large gaps in LALM paralinguistic judgment

Researchers introduce ParaPairAudioBench, a pairwise benchmark of 5,175 audio pairs spanning five paralinguistic dimensions (Style, Rate, Emphasis, Age, Gender), designed to evaluate Large Audio-Language Models (LALMs) as judges. Experiments show that current LALMs trail human judgment by 32 percentage points on average and exhibit severe calibration failures, especially in ambiguous 'Tie' cases. The benchmark includes same-transcript and cross-transcript conditions to separate reliance on lexical content from reliance on acoustic cues, enabling a more rigorous assessment of LALM reliability for speech evaluation.
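
As a rough illustration of how pairwise judgments of this kind could be scored, the sketch below computes accuracy separately for each condition and expresses a human-model difference in percentage points. It is not the benchmark's actual evaluation code: the label set ("A", "B", "Tie"), the condition names, the record layout, the function names, and the use of plain accuracy as the metric are all assumptions made for the example, and the toy records are invented.

```python
from collections import defaultdict

# Assumed label set for a pairwise judgment: first clip, second clip, or no preference.
LABELS = {"A", "B", "Tie"}


def accuracy_by_condition(records):
    """Return {condition: accuracy} for (condition, gold, predicted) triples.

    The record layout is hypothetical; it is not taken from the benchmark.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, gold, predicted in records:
        if gold not in LABELS or predicted not in LABELS:
            raise ValueError(f"unexpected label in record: {gold!r}, {predicted!r}")
        total[condition] += 1
        correct[condition] += int(gold == predicted)
    return {condition: correct[condition] / total[condition] for condition in total}


def gap_in_points(human_accuracy, model_accuracy):
    """Difference between two accuracies, in percentage points."""
    return 100.0 * (human_accuracy - model_accuracy)


# Invented toy records, for illustration only.
toy_records = [
    ("same_transcript", "A", "A"),
    ("same_transcript", "Tie", "B"),
    ("cross_transcript", "B", "B"),
    ("cross_transcript", "Tie", "Tie"),
]
toy_scores = accuracy_by_condition(toy_records)
```

Reporting the two conditions separately is what would let a reader see whether a judge's score drops when the transcript no longer differs between the clips, which is the lexical-versus-acoustic question the summary describes.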