Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
Recent events (1)
Bayesian audit framework for public AI evaluation archives challenges frontier model claims
A new arXiv preprint proposes a Bayesian inference and decision-audit framework that treats public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper shows that a single terminal snapshot is compatible with multiple distinct performance histories, so estimates of when a capability ceiling is reached remain ambiguous. A candidate selection-aware frontier model fails four checks (synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration), and the fixed audit gates reject its stronger claims. The work proposes an archive-and-adjudication protocol for reconstructing evaluation histories and falsifying unsupported frontier capability claims.
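
The snapshot-ambiguity point can be illustrated with a toy example. The sketch below is not taken from the preprint: the logistic curve shape, the 24-month window, the ceiling of 90, and the 95% threshold are all assumptions chosen for illustration. It builds two synthetic score histories that end at nearly the same terminal value but approach the ceiling at different times.

```python
import math

# Hypothetical setup: none of these values come from the preprint.
CEILING = 90.0            # assumed benchmark score ceiling
MONTHS = range(25)        # months 0..24; month 24 is the terminal snapshot
THRESHOLD = 0.95 * CEILING  # assumed definition of "near the ceiling"


def logistic_history(t, ceiling, midpoint, rate):
    """Score at month t under an assumed logistic growth curve."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def first_crossing(history, threshold):
    """Index of the first month whose score meets the threshold, or None."""
    for month, score in enumerate(history):
        if score >= threshold:
            return month
    return None


# Two distinct histories: one saturates early, one saturates late.
early = [logistic_history(t, CEILING, midpoint=6, rate=1.0) for t in MONTHS]
late = [logistic_history(t, CEILING, midpoint=14, rate=1.0) for t in MONTHS]

# The terminal snapshot alone barely distinguishes them.
terminal_gap = abs(early[-1] - late[-1])

# The full histories disagree on when the ceiling was approached.
early_cross = first_crossing(early, THRESHOLD)
late_cross = first_crossing(late, THRESHOLD)

print(f"terminal gap: {terminal_gap:.4f}")
print(f"early history crosses at month {early_cross}")
print(f"late history crosses at month {late_cross}")
```

In this toy setting the two terminal scores differ by less than 0.01 points, while the threshold crossings fall eight months apart, which is the kind of timing ambiguity a terminal leaderboard cannot resolve on its own.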