person
Shreyaskc
active · provisional
shreyaskc-6329ac24 · 1 event · first seen 45h ago
Aliases: Shreyaskc
Recent events (1)
BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories
BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.
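To make the order-consistency figure concrete, here is a minimal Python sketch of how such a metric could be computed. It is not BabelJudge's API: the `Judge` signature, the `order_consistency` function, and the toy judges are hypothetical names introduced only for illustration, and the definition used (the fraction of pairs where the judge picks the same underlying response after the two responses swap positions) is an assumed, common formulation that the summary does not confirm.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not BabelJudge's API):
# takes (prompt, response_in_slot_A, response_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def order_consistency(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the judge prefers the same response in both orders."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    consistent = 0
    for prompt, first, second in pairs:
        forward = judge(prompt, first, second)  # `first` shown in slot A
        swapped = judge(prompt, second, first)  # `first` shown in slot B
        first_wins_forward = forward == "A"
        first_wins_swapped = swapped == "B"
        if first_wins_forward == first_wins_swapped:
            consistent += 1
    return consistent / len(pairs)


# Toy judges for illustration only.
def always_first(prompt: str, a: str, b: str) -> str:
    """Purely position-biased: always picks whatever is in slot A."""
    return "A"


def prefers_longer(prompt: str, a: str, b: str) -> str:
    """Position-independent (but verbosity-biased): picks the longer response."""
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("What is 2 + 2?", "4", "It is probably 5."),
    ("Capital of France?", "Paris is the capital.", "Lyon"),
]

score_position_biased = order_consistency(always_first, pairs)
score_position_free = order_consistency(prefers_longer, pairs)
```

Under this assumed definition, a judge that always picks the first slot scores 0.0 and a judge whose choice ignores position scores 1.0. A judge that answers by coin flip agrees with itself about half the time, so a score around 0.5 is what "near-random" means for the reported Swahili value of 0.480.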