
LLM-as-a-Judge
llm-as-a-judge-4feb4eb1·12 events·first seen 1mo agoAliases: LLM-as-a-Judge, LLM-as-judge, LLM-as-Judge, LLM judge, LLMs-as-a-Judge, LM-as-judge
Co-occurring entities
More like this (12)
Guides (1)
Recent events (12)
AMEL: Accumulated Message Effects Bias LLM Judgments in Multi-Turn Evaluation Pipelines
This paper introduces AMEL (Accumulated Message Effect on LLM Judgments), documenting that prior conversation history with predominantly positive or negative evaluations systematically biases subsequent LLM judgments toward the prevailing polarity. Across 75,898 API calls to 11 models from 4 providers, the effect is statistically robust (d = -0.17, p < 10^-46), concentrates on high-uncertainty items, and shows a negativity asymmetry where negative histories induce 1.62x more bias than positive ones. Critically, the bias does not grow with context length, scaling reduces but does not eliminate it, and the simplest mitigation is using a fresh context per evaluation item.
Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
This paper investigates multi-objective prompt optimization for LLM-as-judge systems, testing five decomposition modes of textual gradient optimizers across varying levels of cross-task information sharing. In 6 of 10 configurations, optimization fails to improve over the initial prompt, with gradient specificity dropping 59% when multiple criteria are processed jointly. The authors identify two separable failure modes: gradient dilution at optimization time and instruction interference at inference time. These findings constrain the design space for customizing LLM judges via textual feedback across multiple evaluation criteria simultaneously.
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.
Mistral AI: Using LLM-as-a-Judge with Structured Outputs for RAG Evaluation
Mistral AI published a technical guide on evaluating Retrieval-Augmented Generation (RAG) systems using the 'LLM as a Judge' paradigm combined with their structured outputs API feature. The approach implements the RAG Triad framework—context relevance, groundedness, and answer relevance—using Pydantic schemas to enforce machine-readable evaluation outputs. Mistral models serve as both the generator and judge components, enabling scalable automated evaluation without human annotators.
Judge Arena: Benchmarking LLMs as Evaluators
Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.
Expert Support Case Study: Bolstering a RAG App with LLM-as-a-Judge
Hugging Face published a case study describing how Digital Green used an LLM-as-a-Judge approach to evaluate and improve a retrieval-augmented generation (RAG) application. The post covers the methodology for using LLMs to score and validate RAG outputs, providing a practical deployment pattern for quality assurance in production AI systems. It serves as a concrete example of enterprise-grade evaluation pipelines built on top of RAG architectures.
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling
This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.
Comparative Study: Semantic Metadata vs. Unstructured Web Retrieval for Agentic Data Discovery
This paper evaluates whether LLM-based agents still need structured semantic metadata (e.g., schema.org) for data retrieval, comparing a Baseline Agent searching open-web documents against a Semantic Agent leveraging 90 million schema.org-annotated datasets. Using an LLM-as-a-judge pipeline aligned to FAIR principles, the Semantic Agent achieves 65.7% higher overall precision in retrieving FAIR-compliant datasets, while the Baseline Agent answers 40% more questions but frequently returns prose-heavy or portal landing pages instead of actionable data. The study concludes that structured semantic ecosystems remain essential for reliable, execution-oriented agentic workflows despite LLMs' broad unstructured retrieval capabilities.
VideoFDB: First Benchmark for Full-Duplex Audio-Visual Conversational Agent Evaluation
VideoFDB is introduced as the first benchmark targeting full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents, filling a gap where existing full-duplex benchmarks evaluate only speech. It provides 237 dyadic video-call clips covering 11 nonverbal conversational dynamics, a perception/generation taxonomy, and an LM-as-judge rubric framework. Evaluation across open- and closed-source vision-speech agents reveals systematic failure modes including captioning collapse and visual-stream ignorance, and shows current systems cannot perform the streaming joint audiovisual grounding required for natural conversation. Cascaded speech-to-avatar architectures are found to be architecturally incapable of producing full-duplex nonverbal cues.
PARL: Preference-Aware Rubric Learning for Personalized LLM Evaluation
This paper introduces PARL (Preference-Aware Rubric Learning), a framework that reframes personalized LLM evaluation as a learning problem rather than static judgment. PARL induces preference-aware evaluation rubrics from raw user interaction histories and uses a discriminative reinforcement learning objective to contrast user-authored responses against model outputs, capturing user-specific decision boundaries. Experiments on personalized text generation tasks show PARL produces high-fidelity rubrics that generalize across users and tasks, outperforming existing LLM-as-a-judge and automatic metric approaches.
FinHarness: Inline Lifecycle Safety Harness for Finance LLM Agents
FinHarness is a safety harness for finance LLM agents that wraps agent execution end-to-end with three components: a Query Monitor for intent and drift detection, a Tool Monitor for per-call risk evaluation, and a Cascade module that adaptively routes verification between lightweight and advanced LLM judges. Unlike post-hoc auditing, it injects risk signals back into the agent input as ex-ante evidence, enabling real-time refusal or replanning. On the FinVault benchmark, it reduces attack success rate from 38.3% to 15.0% while preserving benign approval rates and using 4.7× fewer expensive judge calls than an always-advanced baseline.
Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora
This paper investigates whether LLM-based machine translation can preserve moral semantic content well enough to enable cross-lingual moral values classification, using Polish as a test case with ~50k annotated social media posts. A four-method validation pipeline (LaBSE embedding similarity, CKA, LLM-as-judge, and classifier parity) shows mean cosine similarity of 0.86 and AUC gaps of only 0.01–0.02 across Moral Foundations categories. The results suggest machine translation is a practical path to extending moral values NLP research to under-resourced languages, with expected generalization to related Slavic languages.
