4 · arXiv cs.CL (Computation and Language) · 3d ago

Judge-Aware Gated Multi-Task Learning achieves state-of-the-art on UK Employment Tribunal outcome prediction

Researchers propose a Judge-Aware Gated Multi-Task Learning architecture for legal outcome prediction that explicitly disentangles the factual merits of a case from judicial discretion via a gated fusion mechanism conditioned on judge identity. Evaluated on 13,937 UK Employment Tribunal decisions, the approach outperforms supervised fine-tuning of a Gemma-4 26B backbone while requiring an order of magnitude fewer trainable parameters. The key finding is that differentiable, structured composition of identity signals outperforms prompt-based composition on a much larger generative model, suggesting that the choice of conditioning interface matters more than scale for identity-conditioned classification tasks.
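The summary does not give the paper's equations, so the following is only a minimal sketch of one common form of gated fusion, not the authors' architecture. It assumes a case-merits vector and a learned judge embedding of the same size, and a per-dimension sigmoid gate computed from their concatenation. The names `gated_fusion`, `case_vec`, `judge_vec`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(case_vec, judge_vec, gate_weights, gate_bias):
    """Blend a case vector and a judge vector with a per-dimension gate.

    Assumed form (not taken from the paper):
        g     = sigmoid(W [case; judge] + b)
        fused = g * case + (1 - g) * judge
    """
    concat = case_vec + judge_vec  # list concatenation, length 2 * dim
    fused, gates = [], []
    for i in range(len(case_vec)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        gates.append(g)
        fused.append(g * case_vec[i] + (1.0 - g) * judge_vec[i])
    return fused, gates


# Toy inputs: in a real model these would be learned, not hand-written.
case_vec = [1.0, 3.0]                      # hypothetical "merits" features
judge_table = {"judge_a": [3.0, 1.0]}      # hypothetical judge embeddings
zero_weights = [[0.0] * 4, [0.0] * 4]      # gate ignores its inputs
fused, gates = gated_fusion(case_vec, judge_table["judge_a"],
                            zero_weights, [0.0, 0.0])
print(gates)  # zero weights and bias give a gate of 0.5 everywhere
print(fused)  # so the output is the midpoint of the two vectors
```

Because the gate is differentiable, a model of this shape can learn per case how much weight the judge signal receives, which is the kind of structured composition the summary contrasts with writing the judge's identity into a prompt.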

Evaluation and Benchmarking · Gemma-4 E4B-it · Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning · LoRA

Related guides (2)

LoRA · Concept

LoRA: How to Teach a Giant AI New Tricks Without Rebuilding It

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 27d ago

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

Evaluation and Benchmarking · Alignment and RLHF · Perceptually Perturbed Judgment Dataset · Multimodal Large Language Models · GRPO

6 · arXiv · cs.CL · 16h ago

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

Evaluation and Benchmarking · Alignment and RLHF · MuSiQue · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA

3 · arXiv · cs.CL · 16h ago

Tree-of-Thoughts hybrid approach for legal case judgement summarization using LLMs

A new arXiv preprint proposes a tree-of-thoughts-inspired extractive-abstractive summarization method for legal case judgements. The authors evaluate DeepSeek and LLaMA models across extractive, abstractive, and hybrid summarization strategies, finding the hybrid prompt approach yields better summaries. The work addresses a narrow but practically relevant domain application of LLMs in legal NLP.

Evaluation and Benchmarking · DeepSeek V4 · A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs · Tree of Thoughts

5 · arXiv · cs.CL · 1mo ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge

6 · arXiv · cs.CL · 6d ago

BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories

BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.
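The summary does not define BabelJudge's metrics precisely. As a rough illustration, order consistency can be read as the fraction of response pairs for which a judge prefers the same underlying response after the two are swapped. The sketch below assumes that reading. The function `order_consistency` and the two toy judges are hypothetical and are not part of the BabelJudge package.

```python
def order_consistency(judge, pairs):
    """Fraction of pairs where the same response wins in both orders.

    `judge(first, second)` returns 0 if it prefers the response shown
    first and 1 if it prefers the response shown second.
    """
    consistent = 0
    for a, b in pairs:
        forward = judge(a, b)    # a is shown first
        backward = judge(b, a)   # b is shown first
        # The same underlying response wins only if the positional
        # choice flips when the presentation order flips.
        if forward != backward:
            consistent += 1
    return consistent / len(pairs)


def position_biased_judge(first, second):
    """Toy judge that always prefers whatever is shown first."""
    return 0


def longer_is_better_judge(first, second):
    """Toy judge that prefers the longer response (a verbosity bias)."""
    return 0 if len(first) > len(second) else 1


pairs = [
    ("Paris.", "The capital of France is Paris."),
    ("4", "Two plus two equals four."),
    ("Yes, it compiles.", "No."),
]
print(order_consistency(position_biased_judge, pairs))   # 0.0
print(order_consistency(longer_is_better_judge, pairs))  # 1.0
```

Under this definition a judge that picks a position at random agrees with itself about half the time, which is why a score close to 0.5 is described as near-random. The second toy judge also shows that perfect order consistency does not rule out other failure modes such as verbosity bias.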

Evaluation and Benchmarking · Agent and Tool Ecosystem · BabelJudge · Qwen2.5-7B-Instruct-1M · Shreyaskc

5 · arXiv · cs.CL · 4d ago

Systematic comparison of encoder vs. decoder safety judges for LLM adversarial evaluation

A new arXiv preprint evaluates whether fine-tuned encoder classifiers from the ModernBERT family (ModernBERT and Ettin) can replace LLM-based safety judges for detecting harmful outputs in user-model conversations. The study benchmarks encoders against rule-based methods, fine-tuned LLM classifiers, and LLM judges including LlamaGuard 3/4, ShieldGemma, StrongReject, and Claude-as-a-judge across multiple adversarial attack types. Results are reported on F1, false negative rate, and precision-recall, with breakdowns by attack technique, providing practical guidance on cost-latency tradeoffs for production safety pipelines.
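For readers less familiar with the reported metrics, here is a minimal sketch of how F1 and false negative rate are computed for a binary harmful/safe judge. It uses the standard textbook definitions. The function name `safety_metrics` and the example labels are hypothetical and are not taken from the preprint.

```python
def safety_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and false negative rate.

    Labels are booleans: True means "harmful". A false negative is a
    harmful output that the judge let through as safe.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 written in count form, equivalent to 2PR / (P + R).
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}


# Hypothetical labels: 10 harmful and 10 safe conversations.
y_true = [True] * 10 + [False] * 10
# The judge catches 8 of the harmful ones and wrongly flags 2 safe ones.
y_pred = [True] * 8 + [False] * 2 + [True] * 2 + [False] * 8
print(safety_metrics(y_true, y_pred))
```

For a safety pipeline the false negative rate is often the number to watch, since it counts harmful outputs that pass undetected, while precision reflects how often safe outputs are blocked.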

Evaluation and Benchmarking · Inference Economics · ModernBERT · AILuminate · LlamaGuard

5 · Hugging Face Blog · 1mo ago

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.
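To make "Elo-based ranking" concrete, the sketch below applies the standard Elo update to pairwise votes between judge models. The summary does not state Judge Arena's actual parameters, so the K-factor of 32, the starting rating of 1500, and the names `expected_score`, `elo_update`, `judge_x`, and `judge_y` are assumptions for illustration only.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k=32 is an assumed K-factor, not a documented Judge Arena value.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical judge models, both starting at an assumed 1500.
ratings = {"judge_x": 1500.0, "judge_y": 1500.0}
votes = [("judge_x", "judge_y", 1.0)]  # a voter preferred judge_x's verdict

for a, b, score_a in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)

print(ratings)  # {'judge_x': 1516.0, 'judge_y': 1484.0}
```

Each update is zero-sum: the winner gains exactly what the loser gives up, and an upset against a higher-rated judge moves the ratings more than an expected result does.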

Evaluation and Benchmarking · Agent and Tool Ecosystem · LLM-as-a-Judge · Judge Arena · Hugging Face

4 · arXiv · cs.CL · 16h ago

LLMs outperform traditional methods on single and multi-truth data fusion tasks

A new arXiv preprint investigates using LLMs for data fusion (truth discovery) over tabular data, covering both single-truth and multi-truth scenarios. The authors evaluate domain-dependent, domain-independent, zero-shot, and one-shot prompting strategies across three benchmark datasets. LLM-based approaches outperform traditional unsupervised methods including DART and LTM on all datasets, with code released publicly.

Evaluation and Benchmarking · Enterprise Deployment Patterns · DART · LTM · Single and Multi Truth Data Fusion using Large Language Models