Entity · technique

anytime-valid sequential testing

technique · active · anytime-valid-sequential-testing-075addc3 · 1 event · first seen May 29, 2026

Aliases: anytime-valid sequential testing

Co-occurring entities

Cohen's h · G*Power · resolution ratio q · MMLU-Pro · Open LLM Leaderboard

More like this (12)

Anytime-Valid E-Process · oracle testing · test-time training · temporally grounded QA benchmark · Countdown-Stepwise · Temporal Simultaneity · matched-control protocol · Protocol QA · token-wise self-certainty · consistency training · Policy Vulnerability Testing · StrategyQA

Recent events (1)

6 · arXiv · cs.CL · May 29, 2026

Resolution Diagnostics for Paired LLM Evaluation: Many Leaderboard Rankings Statistically Unresolved

This paper frames pairwise LLM evaluation as a hypothesis-testing problem and introduces a resolution ratio q = N/N* to diagnose whether leaderboard comparisons are adequately powered. Applied to two public leaderboards, the diagnostic finds that 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 to 6 of 9 MMLU-Pro top-10 adjacent-rank pairs fail to meet conventional resolution targets (alpha = 0.05, power = 0.8). A key finding is that the widely used unpaired Cohen's h shortcut underestimates the required sample size by roughly a factor of two in close-comparison regimes, a flaw silently inherited by three major statistical calculators. The unresolved-pair pattern persists under multiplicity correction and under sequential testing.
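
As a rough illustration of the quantities named in the summary, the sketch below computes Cohen's h for two accuracies, the textbook sample size for two independent groups derived from it, and a ratio q = N/N*. It rests on assumptions the summary does not confirm: that N is the number of benchmark items and N* is the required number of items, and that the "unpaired shortcut" resembles the standard two-independent-samples formula. The function names are hypothetical, and the paper's own paired calculation is not reproduced here.

```python
import math
from statistics import NormalDist


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def n_per_group_from_h(h: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Textbook sample size per group for a two-sided test of two
    INDEPENDENT proportions: n = 2 * ((z_{1-alpha/2} + z_{power}) / h)^2.
    This is the unpaired calculation; it ignores the pairing of items."""
    if h == 0:
        raise ValueError("effect size h must be non-zero")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / abs(h)) ** 2)


def resolution_ratio(n_items: int, n_required: int) -> float:
    """q = N / N*. Assumed reading: q >= 1 means the benchmark has at
    least as many items as the power calculation asks for."""
    return n_items / n_required


if __name__ == "__main__":
    # Hypothetical accuracies for two models on the same benchmark.
    acc_a, acc_b = 0.72, 0.70
    n_items = 5000  # hypothetical benchmark size

    h = cohens_h(acc_a, acc_b)
    n_star = n_per_group_from_h(h)
    q = resolution_ratio(n_items, n_star)
    print(f"h = {h:.4f}, N* (unpaired) = {n_star}, q = {q:.2f}")
```

Because the summary reports that the unpaired shortcut underestimates the required sample size in close comparisons, a q computed this way should be read as optimistic; a paired analysis would be needed to match the paper's diagnosis.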

Frontier Model Releases · Evaluation and Benchmarking · Cohen's h · G*Power · resolution ratio q