Spearman Rank Correlation

benchmark · active · spearman-rank-correlation-934fcfe3 · 2 events · first seen 29d ago

Aliases: Spearman Rank Correlation

Recent events (2)

7 · arXiv · cs.CL · 29d ago

Forecasting Downstream LLM Performance With Token-Level Proxy Metrics

Researchers propose proxy metrics built from token-level statistics (entropy, top-k accuracy, expert token rank) of a candidate model's next-token distribution over expert-written solutions, as a cheaper and more reliable alternative to cross-entropy loss or direct downstream evaluation. Across three settings (cross-family model selection, pretraining data selection, and training-time forecasting), the proxies consistently outperform baselines: on model ranking they reach a mean Spearman ρ of 0.81 versus 0.36 for cross-entropy loss, and they cut the compute needed for data selection by roughly 10,000×. The method also extrapolates downstream performance across an 18× compute horizon with roughly half the error of existing alternatives, suggesting that expert trajectories are a broadly useful signal throughout the model development lifecycle.
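For readers unfamiliar with the metric, the following is a minimal sketch of how a Spearman ρ between a proxy score and a downstream score could be computed: rank both lists (averaging tied ranks) and take the Pearson correlation of the ranks. The function names and the five example scores are hypothetical illustrations, not data or code from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five candidate models (illustrative only).
proxy_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
downstream_scores = [41.0, 48.5, 39.2, 52.0, 50.1]
rho = spearman_rho(proxy_scores, downstream_scores)
```

In this toy case only two adjacent models swap places between the two rankings, so ρ comes out at 0.9. A value near 1 means the proxy orders the models almost exactly as the downstream benchmark does, which is the property the summary's 0.81 versus 0.36 comparison refers to.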

6 · arXiv · cs.AI · 15d ago

Tracking Behavioral Trajectories of Adapting Agents via Trait Vectors in Embedding Space

This paper introduces a methodology for measuring behavioral traits of AI agents by defining traits as directions in the embedding space of a text embedding model, trained on labeled diffs of agent skill/memory/configuration files. A linear model achieves 91.2% sign classification accuracy and a Spearman ρ of 0.82 when detecting the propensity to seek sensitive data across 68 labeled skill diff pairs. The framework extends to an agent-to-agent evaluation protocol in which one agent can assess another's skill file updates through a trusted intermediary, enabling ongoing behavioral monitoring of self-modifying agents.
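As a rough illustration of what "sign classification accuracy" along a trait direction could mean, the sketch below projects the change in a file's embedding (after minus before) onto a direction vector and checks whether the sign of that projection matches a human label. This is an assumption about the general shape of such a method, not the paper's implementation: the 3-dimensional vectors, the `direction` values, and the labels are all hypothetical toy data, and a real text embedding would have hundreds of dimensions.

```python
from typing import List, Sequence


def trait_score(direction: Sequence[float],
                before: Sequence[float],
                after: Sequence[float]) -> float:
    """Project the embedding change (after - before) onto the trait direction."""
    return sum(w * (a - b) for w, a, b in zip(direction, after, before))


def sign_accuracy(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of pairs where the score's sign agrees with the label's sign."""
    hits = sum(1 for s, l in zip(scores, labels) if (s > 0) == (l > 0))
    return hits / len(scores)


# Hypothetical trait direction and four (before, after, label) diff pairs.
# label = +1 means annotators judged the update to increase the trait.
direction = [1.0, 0.0, -0.5]
pairs = [
    ([0.0, 0.0, 0.0], [0.4, 0.1, 0.0], +1),
    ([0.2, 0.3, 0.1], [0.0, 0.3, 0.1], -1),
    ([0.1, 0.0, 0.0], [0.2, 0.0, 0.6], +1),  # projection is negative: a miss
    ([0.5, 0.5, 0.5], [0.9, 0.5, 0.3], +1),
]

scores: List[float] = [trait_score(direction, b, a) for b, a, _ in pairs]
labels: List[int] = [label for _, _, label in pairs]
accuracy = sign_accuracy(scores, labels)
```

Three of the four toy pairs have a projection whose sign matches the label, giving an accuracy of 0.75. Sign accuracy only checks direction of change, so a rank correlation over the same scores is a natural complement when the labels carry magnitude as well.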