6arXiv cs.CL (Computation and Language)·45h ago

BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories

BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.

Evaluation and Benchmarking Agent and Tool Ecosystem BabelJudge Qwen2.5-7B-Instruct-1M Shreyaskc

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.

Evaluation and Benchmarking Agent and Tool Ecosystem LLM-as-a-Judge Judge Arena Hugging Face +2 more

5arXiv · cs.CL·21h ago·source ↗

AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability

AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.

Evaluation and Benchmarking AI Safety Research Llama 3.1 70B AdversaBench Meta +1 more

5arXiv · cs.CL·27d ago·source ↗

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking Basque language LLM-as-a-Judge mJudge +2 more

4Hugging Face Blog·1mo ago·source ↗

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.

Frontier Model Releases Evaluation and Benchmarking 3C3H AraGen Hugging Face

4arXiv · cs.CL·21h ago·source ↗

ParaPairAudioBench: Pairwise benchmark reveals large gaps in LALM paralinguistic judgment

Researchers introduce ParaPairAudioBench, a pairwise audio benchmark of 5,175 audio pairs spanning five paralinguistic dimensions (Style, Rate, Emphasis, Age, Gender) designed to evaluate Large Audio-Language Models as judges. Experiments show current LALMs lag human judgment by 32 percentage points on average and exhibit severe calibration failures, especially in ambiguous 'Tie' cases. The benchmark includes same-transcript and cross-transcript conditions to disentangle lexical from acoustic reliance, enabling more rigorous assessment of LALM reliability for speech evaluation.

Evaluation and Benchmarking Multimodal Progress ParaPairAudioBench

5arXiv · cs.CL·39h ago·source ↗

TF-RefusalBench: Measuring and mitigating over-alignment in multilingual criminal law LLM applications

Researchers introduce TF-RefusalBench, a 5,200-prompt multilingual benchmark derived from Swiss Federal Supreme Court rulings to measure over-alignment (excessive refusals and disclaimers) in LLMs handling criminal law translation and summarization tasks. The benchmark covers French, German, Italian, and English and reveals that over-alignment is influenced by model choice, prompt language, and text language. The paper evaluates mitigation strategies including prompting and abliteration (refusal direction ablation), finding abliteration eliminates refusals with minimal task performance cost. The work is grounded in a real deployment context: the Swiss Federal Supreme Court already uses on-premises LLMs for translation and summarization.

Evaluation and Benchmarking AI Safety Research abliteration Swiss Federal Supreme Court TF-RefusalBench +1 more

6arXiv · cs.LG·26d ago·source ↗

SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability

SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.

Evaluation and Benchmarking AI Safety Research ICLR optimism bias SoundnessBench +1 more

7arXiv · cs.CL·29d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA)SWE-Bench Verified +2 more