Theoretical analysis of calibration preservation in human-AI teaming frameworks
A new arXiv paper examines human-AI teaming through the lens of statistical calibration, analyzing both combination and delegation frameworks. The authors show that existing combination methods fail to preserve the human's calibration, while delegation methods shift the calibration burden to a rejector meta-model that must be calibrated finely enough to identify where each party excels. This demand grows with human expertise and becomes unattainable when the human uses information unavailable to the system.
Related guides (2)
Related events (8)
Controlled Audit of Human vs. Synthetic Soft-Labels for Calibration and Uncertainty Alignment
This paper presents a controlled study disentangling the effects of human soft-labels from label mode-shift corrections in soft-label learning, using MNIST and a synthetic variant. The authors find that human soft-labels primarily act as a regularizer improving calibration on difficult samples and promoting stable training convergence, rather than simply correcting mislabeled data. Dataset cartography analysis shows models trained on human soft-labels mirror human uncertainty patterns, while those trained on synthetic labels fail to align. The work provides a diagnostic testbed for evaluating human-AI uncertainty alignment.
Calibrated Collective Oversight (CCO): Scalable Oversight with Finite-Time Statistical Guarantees
This paper introduces Calibrated Collective Oversight (CCO), a framework for maintaining human oversight of agentic AI systems that may exceed human capabilities. CCO aggregates diverse scoring functions into a conservatism penalty inspired by Attainable Utility Preservation, then calibrates this penalty online via Conformal Decision Theory to ensure undesirable outcomes stay below a user-specified threshold with finite-time bounds and no distributional assumptions. Evaluated on a modified SWE-bench (adversarially misaligned agent) and MACHIAVELLI (ethical violations), CCO allows weaker overseers to constrain stronger agents while preserving reward, with empirical violation rates closely matching specified targets.
Optimal deterministic multicalibration achieved, resolving open problem on randomization necessity
A new arXiv preprint resolves an open problem in multicalibration theory by constructing a minimax-optimal multicalibration algorithm that outputs a deterministic predictor, achieving the same O(ε⁻³) sample complexity previously only attainable by randomized predictors. The result extends to outcome indistinguishability, deterministic omnipredictors, and panpredictors with optimal sample complexity, resolving multiple open problems from recent works. Multicalibration is a fairness and reliability property requiring calibration to hold across reweighted subgroups, making this relevant to trustworthy ML research.
Framework for quantifying faithful confidence expression in large reasoning models
A new arXiv preprint introduces a framework to measure faithful calibration (FC) in large reasoning models (LRMs)—the alignment between a model's intrinsic confidence and its linguistically expressed confidence. The authors analyze linguistic decisiveness against three internal uncertainty sources (token probabilities, hidden states, sampled response consistency) and introduce prefix-conditioned sampling to handle structural variation in chain-of-thought traces. Applying the framework across leading models, they find FC is a significant and distinct failure mode for LRMs: extended reasoning traces do not automatically improve calibration, prompt interventions that help non-reasoning models fail in the reasoning setting, and different confidence estimators produce divergent assessments of the same traces.
Calibrated Mixture-of-Experts under distribution shift: adversarial reweighting approach
A new arXiv preprint analyzes how mixture-of-experts (MoE) models maintain calibration under distribution shift, examining the interaction between routing mechanisms and expert-level calibration. The authors prove that expert calibration is sufficient for overall model calibration in hard-routed MoE but insufficient for soft-routed variants. To address the soft-routing gap, they propose an adversarial reweighting method that penalizes calibration errors of the routed aggregate under distribution shift, demonstrating improved accuracy-calibration tradeoffs across model classes and tasks.
Giving your AI a Job Interview
This commentary piece argues that as AI-generated advice becomes more consequential, users need systematic methods to evaluate AI reliability and quality—analogous to a job interview process. The author proposes frameworks for assessing AI outputs before trusting them for important decisions. The piece addresses the practical challenge of calibrating trust in AI systems across different use cases.
AI Safety via Debate
OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than themselves. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.
New Paper: Towards a Science of AI Agent Reliability
A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.

