Almanac
person

Shreyaskc

person · active · provisional · shreyaskc-6329ac24 · 1 event · first seen 45h ago

Aliases: Shreyaskc

Recent events (1)

6 · arXiv · cs.CL · 45h ago

BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories

BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.
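
As a rough illustration of how 'gold-labelling by degradation' and an order-consistency score could fit together, here is a minimal Python sketch. It is not BabelJudge's actual API: the `Judge` signature, the `degrade` rule (truncating a response to half its words), and the two toy judges are all assumptions made for this example. The idea is that a deliberately degraded copy of a response gives a known-worse candidate without human annotation, and a reliable judge should pick the same underlying response no matter which position it appears in.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption, not the package's API):
# returns "A" if it prefers the first response shown, "B" for the second.
Judge = Callable[[str, str], str]


def degrade(text: str) -> str:
    """Illustrative degradation: keep only the first half of the words."""
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])


def make_pairs(responses: List[str]) -> List[Tuple[str, str]]:
    """Build (original, degraded) pairs; the original is the gold winner."""
    return [(r, degrade(r)) for r in responses]


def order_consistency(judge: Judge, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs where the judge picks the same response in both orders."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for good, bad in pairs:
        # Present the pair in both orders and map each verdict back to a response.
        picked_first = good if judge(good, bad) == "A" else bad
        picked_second = bad if judge(bad, good) == "A" else good
        consistent += picked_first == picked_second
    return consistent / len(pairs)


def always_first_judge(first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers the first slot."""
    return "A"


def longer_is_better_judge(first: str, second: str) -> str:
    """Toy judge that prefers the longer response, regardless of position."""
    return "A" if len(first) >= len(second) else "B"


pairs = make_pairs([
    "The capital of France is Paris, located on the Seine.",
    "Water boils at 100 degrees Celsius at sea level pressure.",
])
biased_score = order_consistency(always_first_judge, pairs)
stable_score = order_consistency(longer_is_better_judge, pairs)
```

In this toy setup the purely position-biased judge flips its choice whenever the order is swapped, while the length-based judge does not. The length-based judge is consistent only because it is driven entirely by verbosity, which is why the summary lists verbosity bias as a separate failure mode from order inconsistency.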