Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

Recent events (1)

arXiv · cs.AI · 31h ago

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework that treats public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper shows that a single terminal snapshot is compatible with multiple distinct performance histories, so estimates of when a capability ceiling is reached remain ambiguous. A candidate selection-aware frontier model fails four checks (synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration), and fixed audit gates reject its stronger claims. The work proposes an archive-and-adjudication protocol for reconstructing evaluation histories and falsifying unsupported frontier capability claims.
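
The point about a single terminal snapshot can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes, purely for illustration, that a benchmark score follows a logistic curve with ceiling 1.0, and every number in it (the snapshot time, the snapshot score, the two growth rates, the 0.95 target) is hypothetical. It fits two different histories through the same final observation and compares when each would reach the target.

```python
import math


def logit(p):
    """Inverse of the standard logistic function."""
    return math.log(p / (1.0 - p))


def logistic(t, rate, midpoint):
    """Score at time t on a logistic curve with ceiling 1.0."""
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))


def midpoint_through(t_obs, score_obs, rate):
    """Midpoint that makes the curve pass through (t_obs, score_obs)."""
    return t_obs - logit(score_obs) / rate


def time_to_reach(target, rate, midpoint):
    """Time at which the curve first reaches the target score."""
    return midpoint + logit(target) / rate


# Hypothetical terminal snapshot: score 0.6 observed at time 10.
T_OBS, SCORE_OBS, TARGET = 10.0, 0.6, 0.95

results = {}
for rate in (0.2, 1.0):  # two assumed growth rates, i.e. two candidate histories
    m = midpoint_through(T_OBS, SCORE_OBS, rate)
    results[rate] = (logistic(T_OBS, rate, m), time_to_reach(TARGET, rate, m))
    print(f"rate={rate}: score at snapshot={results[rate][0]:.3f}, "
          f"time to reach {TARGET}={results[rate][1]:.2f}")
```

Both curves reproduce the snapshot exactly, yet the slower history reaches the target roughly ten time units after the faster one. The snapshot alone cannot separate them, which is the motivation for working from the archived history of scores.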