Entity · benchmark

Spearman Rank Correlation

benchmark · active · spearman-rank-correlation-934fcfe3 · 2 events · first seen May 19, 2026

Aliases: Spearman Rank Correlation

Co-occurring entities

agent-to-agent evaluation protocol · skill file diff · trait vector · Behavioral Trajectory Tracking Framework · Proxy Metrics for LLM Forecasting · Expert Token Rank · Cross-Entropy Loss · Top-k Accuracy

More like this (12)

Spearman's rho · Pearson correlation · canonical correlation analysis · Linear Regression · Clopper-Pearson · Reciprocal Rank Fusion · Cohen's d · rank-1 approximation · Expert Token Rank · D-Score · AUROC · SPEAR

Recent events (2)

6 · arXiv · cs.AI · Jun 2, 2026 · source ↗

Tracking Behavioral Trajectories of Adapting Agents via Trait Vectors in Embedding Space

This paper introduces a methodology for measuring behavioral traits of AI agents by defining traits as directions in the embedding space of a text embedding model, trained on labeled diffs of agent skill/memory/configuration files. A linear model achieves 91.2% sign classification accuracy and Spearman ρ=0.82 on detecting propensity to seek sensitive data across 68 labeled skill diff pairs. The framework extends to an agent-to-agent evaluation protocol where one agent can assess another's skill file updates through a trusted intermediary, enabling ongoing behavioral monitoring of self-modifying agents.

Evaluation and Benchmarking · AI Safety Research · agent-to-agent evaluation protocol · skill file diff · trait vector · +3 more
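The two figures quoted in the summary above (sign classification accuracy and Spearman ρ) can be illustrated with a minimal sketch. It assumes each diff has a signed human label and a signed model score; the sample numbers and the function names `average_ranks`, `spearman_rho`, and `sign_accuracy` are hypothetical and are not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def sign_accuracy(predicted, labeled):
    """Fraction of items whose predicted sign matches the labeled sign."""
    hits = sum((p > 0) == (t > 0) for p, t in zip(predicted, labeled))
    return hits / len(predicted)


# Hypothetical data: labeled trait change per diff and a model's score.
labels = [-2.0, -1.0, 1.0, 2.0, 3.0]
scores = [-1.5, 0.2, 0.1, 1.0, 2.5]

rho = spearman_rho(scores, labels)
acc = sign_accuracy(scores, labels)
```

Sign accuracy only checks the direction of each predicted change, while ρ checks whether the ordering of magnitudes agrees, so the two numbers can diverge on the same data.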

7 · arXiv · cs.CL · May 19, 2026 · source ↗

Forecasting Downstream LLM Performance With Token-Level Proxy Metrics

Researchers propose proxy metrics constructed from token-level statistics (entropy, top-k accuracy, expert token rank) drawn from a candidate model's next-token distribution over expert-written solutions, as a cheaper and more reliable alternative to cross-entropy loss or direct downstream evaluation. Across three settings (cross-family model selection, pretraining data selection, and training-time forecasting), the proxies consistently outperform baselines, achieving a mean Spearman ρ of 0.81 vs. 0.36 for cross-entropy loss on model ranking, and reducing compute for data selection by roughly 10,000×. The method enables downstream performance extrapolation across an 18× compute horizon with roughly half the error of existing alternatives, suggesting expert trajectories are broadly useful signals throughout the model development lifecycle.

Training Infrastructure · Evaluation and Benchmarking · Proxy Metrics for LLM Forecasting · Expert Token Rank · Spearman Rank Correlation · +4 more
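The three token-level statistics named in the summary above can be sketched as follows. This is an illustration under stated assumptions, not the paper's definitions: it assumes the model exposes a full probability vector at each position of an expert-written solution, takes the expert token rank to be 1 plus the number of tokens with strictly higher probability, and averages each statistic over positions. The names `token_stats` and `sequence_proxies` are hypothetical.

```python
from math import log


def token_stats(probs, expert_id, k=5):
    """Statistics for one position, given next-token probabilities."""
    # Shannon entropy (in nats) of the next-token distribution.
    entropy = -sum(p * log(p) for p in probs if p > 0)
    # Rank of the expert token: 1 means it was the model's top choice.
    expert_p = probs[expert_id]
    rank = 1 + sum(1 for p in probs if p > expert_p)
    return {"entropy": entropy, "rank": rank, "top_k_hit": rank <= k}


def sequence_proxies(step_probs, expert_ids, k=5):
    """Average the per-position statistics over one expert solution."""
    stats = [token_stats(p, t, k) for p, t in zip(step_probs, expert_ids)]
    n = len(stats)
    return {
        "mean_entropy": sum(s["entropy"] for s in stats) / n,
        "top_k_accuracy": sum(s["top_k_hit"] for s in stats) / n,
        "mean_expert_rank": sum(s["rank"] for s in stats) / n,
    }


# Hypothetical 3-token vocabulary and a 2-token expert solution.
step_probs = [
    [0.7, 0.2, 0.1],  # expert token 0 is the model's top choice
    [0.1, 0.6, 0.3],  # expert token 2 is the model's second choice
]
expert_ids = [0, 2]

proxies = sequence_proxies(step_probs, expert_ids, k=1)
```

Scores like these, computed per candidate model, would then be compared with downstream results by rank correlation, which is where the Spearman ρ figures in the summary come from.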