Entity · benchmark

Spearman's rho

benchmark · active · spearman-s-rho-186d85dc · 2 events · first seen May 26, 2026

Aliases: Spearman's rho

Co-occurring entities

ECNU-Text-Computing · Personalized PageRank · reward-induced maximum likelihood · GraphReview · MGDA · Multi-Task Learning · LLM-as-a-Judge · PCGrad · Textual Gradient Optimization

More like this (12)

Spearman Rank Correlation · Pearson correlation · Cohen's d · R^3 · SPEAR · πR² · Fisher score · D-Score · Clopper-Pearson · R2 Indicator · Rayleigh Quotient · difference-in-means

Recent events (2)

5 · arXiv · cs.CL · May 27, 2026

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

GraphReview proposes a graph-based LLM framework that models scientific paper evaluation as review-signal message passing over a semantic paper graph, capturing both intrinsic quality and relational context (synchronic and diachronic links). LLMs estimate node-level quality priors and generate edge-level comparative evidence via pairwise comparisons, and Personalized PageRank integrates these signals for ranking, decision prediction, and review generation. The LLM backbones are trained with reward-induced maximum likelihood objectives. The system reports an average improvement of 29.7% over the strongest baseline on decision and ranking metrics, including a 23.7% gain in accuracy and a 57.6% gain in Spearman's ρ.
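For readers unfamiliar with the ranking metric named above, the following is a minimal sketch of how Spearman's ρ can be computed between a system's predicted scores and a reference ranking. It is not the GraphReview evaluation code. The function names (`average_ranks`, `spearman_rho`) and the example values are hypothetical, and ties are assumed to receive their average rank, which is the common convention.

```python
def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: model scores for five papers vs. a reference ordering.
predicted = [0.9, 0.4, 0.7, 0.1, 0.5]
reference = [5, 2, 3, 1, 4]
rho = spearman_rho(predicted, reference)
print(round(rho, 3))  # 0.9
```

Because ρ depends only on rank order, any monotone rescaling of the predicted scores leaves it unchanged, which is why it suits ranking-quality comparisons like the one reported here.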

Evaluation and Benchmarking · Agent and Tool Ecosystem · ECNU-Text-Computing · Personalized PageRank · reward-induced maximum likelihood · +2 more

5 · arXiv · cs.CL · May 26, 2026

Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

This paper investigates multi-objective prompt optimization for LLM-as-judge systems, testing five decomposition modes of textual gradient optimizers across varying levels of cross-task information sharing. In 6 of 10 configurations, optimization fails to improve over the initial prompt, with gradient specificity dropping 59% when multiple criteria are processed jointly. The authors identify two separable failure modes: gradient dilution at optimization time and instruction interference at inference time. These findings constrain the design space for customizing LLM judges via textual feedback across multiple evaluation criteria simultaneously.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MGDA · Multi-Task Learning · LLM-as-a-Judge · +4 more