Almanac
technique

anytime-valid sequential testing

techniqueactiveprovisionalanytime-valid-sequential-testing-075addc3·1 events·first seen 18d ago

Aliases: anytime-valid sequential testing

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.CL·18d ago·source ↗

Resolution Diagnostics for Paired LLM Evaluation: Many Leaderboard Rankings Statistically Unresolved

This paper frames pairwise LLM evaluation as a hypothesis-testing problem and introduces a resolution ratio q = N/N* to diagnose whether leaderboard comparisons are statistically powered. Applying this to two public leaderboards, the authors find that 11/40 Open LLM Leaderboard v1 pairwise comparisons and 4-6/9 MMLU-Pro top-10 adjacent-rank pairs fail to meet conventional (alpha=0.05, power=0.8) resolution targets. A key finding is that the widely-used unpaired Cohen-h shortcut underestimates required sample size by approximately a factor of two in close-comparison regimes, a flaw silently inherited by three major statistical calculators. The unresolved-pair pattern persists under multiplicity correction and sequential testing.