Optimal deterministic multicalibration achieved, resolving open problem on randomization necessity
A new arXiv preprint resolves an open problem in multicalibration theory by constructing a minimax-optimal multicalibration algorithm that outputs a deterministic predictor, matching the O(ε⁻³) sample complexity previously attainable only by randomized predictors. The result extends to outcome indistinguishability, deterministic omnipredictors, and panpredictors with optimal sample complexity, resolving several open problems from recent work. Multicalibration is a fairness and reliability property that requires calibration to hold across reweighted subgroups, which makes the result relevant to trustworthy ML research.
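To make the multicalibration property above concrete, here is a minimal sketch of how a multicalibration violation could be measured on a finite sample. It uses a common binned, indicator-group formulation rather than the preprint's own definition (which the summary describes in terms of reweighted subgroups), and the function name `multicalibration_error`, the `groups` dictionary, and the binning scheme are all illustrative assumptions.

```python
from collections import defaultdict


def multicalibration_error(preds, labels, groups, n_bins=10):
    """Largest calibration gap over all (group, prediction-bin) cells.

    preds  : predicted probabilities in [0, 1]
    labels : 0/1 outcomes
    groups : dict mapping a group name to a list of 0/1 membership flags
             (hypothetical input format, chosen for illustration)

    Each cell's gap is |sum of (label - prediction)| divided by the total
    sample size, so small cells contribute proportionally less.
    """
    n = len(preds)
    worst = 0.0
    for member in groups.values():
        residual_sum = defaultdict(float)
        for p, y, m in zip(preds, labels, member):
            if m:
                # Clamp so that p == 1.0 falls into the last bin.
                b = min(int(p * n_bins), n_bins - 1)
                residual_sum[b] += y - p
        for s in residual_sum.values():
            worst = max(worst, abs(s) / n)
    return worst


# A constant 0.5 predictor is calibrated on the whole population here,
# but not on subgroup "a", whose members all have outcome 1.
preds = [0.5, 0.5, 0.5, 0.5]
labels = [1, 0, 1, 0]
overall = multicalibration_error(preds, labels, {"all": [1, 1, 1, 1]})
with_subgroup = multicalibration_error(
    preds, labels, {"all": [1, 1, 1, 1], "a": [1, 0, 1, 0]}
)
```

In this toy sample the overall error is zero while the error including subgroup "a" is 0.25, which is the kind of gap that ordinary calibration misses and multicalibration is meant to rule out.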
Related events (8)
Calibrated Mixture-of-Experts under distribution shift: adversarial reweighting approach
A new arXiv preprint analyzes how mixture-of-experts (MoE) models maintain calibration under distribution shift, examining the interaction between routing mechanisms and expert-level calibration. The authors prove that expert calibration is sufficient for overall model calibration in hard-routed MoE but insufficient for soft-routed variants. To address the soft-routing gap, they propose an adversarial reweighting method that penalizes calibration errors of the routed aggregate under distribution shift, demonstrating improved accuracy-calibration tradeoffs across model classes and tasks.
Theoretical analysis of calibration preservation in human-AI teaming frameworks
A new arXiv paper examines human-AI teaming through the lens of statistical calibration, analyzing both combination and delegation frameworks. The authors show that existing combination methods fail to preserve the human's calibration, while delegation methods shift the calibration burden to a rejector meta-model that must be calibrated finely enough to identify where each party excels. This demand grows with human expertise and becomes unattainable when the human uses information unavailable to the system.
Predictability as a Fine-Grained Privacy Metric Complementary to Differential Privacy
A new arXiv preprint introduces 'privacy via predictability,' a framework that measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information after observing an algorithm's output, conditioned on the attacker's prior knowledge. The authors show predictability and differential privacy are generally incomparable, but that predictability implies mutual-information DP in worst-case regimes. They develop a generalized method of moments framework for asymptotic analysis and derive a predictability-calibrated output perturbation scheme for empirical risk minimization. The work positions predictability as a complementary, finer-grained alternative to DP for settings where attacker knowledge and query families can be specified.
The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance
This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.
Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control
Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows a 17% improvement in episode reward and a 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and a 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.
Unified MAIR framework bridges GP-UCB and DEC approaches in kernel bandits
A new arXiv preprint unifies two major theoretical frameworks for frequentist RKHS bandits — Gaussian-process upper confidence bound (GP-UCB) and decision-estimation-coefficient (DEC) methods — under a common algorithmic-information language called MAIR. The paper generalizes both the GP-UCB analysis and the MAMS algorithm, proposes a safeguarded master algorithm combining their advantages, and demonstrates that algorithmic complexity can be more informative than class-wide minimax certificates in overparameterized models. The work clarifies a foundational distinction between algorithmic information and minimax coefficients in bandit theory.
Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization
This paper addresses miscalibration in Gaussian process predictive distributions used for Bayesian optimization, focusing specifically on the lower tail relevant to minimization objectives. The authors introduce a framework for 'goal-oriented' spatial calibration below a threshold t, defining occurrence calibration and thresholded μ-calibration on sublevel sets. They propose tcGP, a post-hoc calibration method, and prove the resulting EI-based optimizer remains dense in the design space. Experiments on standard benchmarks show tcGP improves both lower-tail calibration and overall BO performance compared to standard and globally calibrated GP models.
Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
This paper investigates whether extrapolative weight averaging of RL-trained checkpoints can extend Pareto frontiers between competing objectives (correctness vs. computational efficiency) without additional training. Starting from a shared initialization, the authors train checkpoints under nested unit-test coverage regimes for competitive programming tasks, revealing a correctness-efficiency frontier where higher-coverage rewards reduce optimization failures but increase correctness failures. Extrapolation beyond trained endpoints produces complementary policies that, when ensembled, improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. Results hold across 7B and 32B model scales and three inference settings: pure reasoning, tool use, and agentic coding.
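As a rough illustration of extrapolating between checkpoints, the sketch below moves along the straight line through two parameter sets and continues past the second one. The summary does not give the paper's exact formula, so the linear rule, the function name `extrapolate_weights`, and the dictionary-of-lists parameter format are assumptions made for this example only.

```python
def extrapolate_weights(theta_a, theta_b, alpha):
    """Return theta_a + alpha * (theta_b - theta_a), parameter by parameter.

    alpha = 0 recovers theta_a and alpha = 1 recovers theta_b. Values
    strictly between 0 and 1 interpolate (ordinary weight averaging),
    while alpha > 1 or alpha < 0 extrapolates beyond a trained endpoint.
    Parameters are plain lists of floats keyed by name (hypothetical format).
    """
    return {
        name: [a + alpha * (b - a) for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


# Two toy "checkpoints" that share the same parameter names and shapes.
low_coverage = {"w": [0.0, 1.0]}
high_coverage = {"w": [1.0, 3.0]}

# Step 50% past the second checkpoint along the same direction.
extrapolated = extrapolate_weights(low_coverage, high_coverage, 1.5)
```

With real models the same arithmetic would be applied to every tensor in the two checkpoints, and the extrapolated policies would still need to be evaluated, since nothing guarantees that points beyond the trained endpoints behave well.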

