7 · arXiv cs.AI (Artificial Intelligence) · 7d ago

Co-failure ceiling theorem bounds maximum gains from LLM routing, voting, and mixture-of-agents across 67 frontier models

A new arXiv paper introduces the 'co-failure ceiling', defined from the rate at which all models in an ensemble fail on the same query, and proves that no routing, voting, or cascade policy can exceed an accuracy of 1 - beta, where beta is this all-wrong rate. In an empirical evaluation across 67 models from 21 providers, the paper finds that standard pairwise error-correlation metrics systematically underestimate the co-failure tail by roughly 2.5× on open-ended mathematics, and that combining models rarely beats the single best model unless strong query-level routing signals are available. The work provides a finite-sample certificate (via Clopper-Pearson bounds) for the maximum gain achievable from a multi-model system, which can be computed before any router is trained, and it identifies answer format rather than subject matter as a key driver of co-failure on GPQA-Diamond.
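
As a rough illustration of the ceiling described above, the sketch below estimates beta from a per-model correctness matrix, derives the 1 - beta ceiling, and attaches an exact Clopper-Pearson interval to beta. It is a minimal sketch and not the paper's code: the function names, the toy correctness matrix, and the choice of a two-sided 95% interval are all assumptions made for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); a direct sum, adequate for small n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k/n."""
    def solve(cdf_at, target):
        # cdf_at(p) decreases as p grows, so bisect on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

def count_all_wrong(correct):
    """correct[m][q] is truthy if model m answered query q correctly."""
    n = len(correct[0])
    k = sum(1 for q in range(n) if not any(row[q] for row in correct))
    return k, n

# Hypothetical toy data: 3 models, 10 queries (not taken from the paper).
correct = [
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 1, 0, 0, 0],
]

k, n = count_all_wrong(correct)                     # queries every model misses
beta_hat = k / n                                    # empirical all-wrong rate
ceiling = 1 - beta_hat                              # best accuracy any selector can reach
best_single = max(sum(row) / n for row in correct)  # strongest individual model
headroom = ceiling - best_single                    # most a perfect router could add
beta_lo, beta_hi = clopper_pearson(k, n)

print(f"beta_hat={beta_hat:.2f}  ceiling={ceiling:.2f}  headroom={headroom:.2f}")
print(f"95% interval for the ceiling: [{1 - beta_hi:.3f}, {1 - beta_lo:.3f}]")
```

The headroom figure is what a perfect per-query router could add over the best single model on this sample; the interval shows how uncertain that figure remains when the query set is small.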

Evaluation and Benchmarking · Inference Economics · Agent and Tool Ecosystem · Mixture-of-Agents · When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models · Clopper-Pearson · GPQA Diamond

Related guides (3)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Inference Economics · Topic guide

Inference Economics: The Hidden Cost Battle Shaping AI

Related events (8)

7 · arXiv · cs.AI · 1mo ago

Bounding Compositional Incoherence in Multi-Component LLM Agents

This paper formalizes a failure mode in multi-component LLM agent systems where individual components are locally probabilistically coherent but their composition violates basic probability axioms. The authors introduce the 'compositional residual' (eps*) as a runtime-computable measure of this incoherence, finding it positive in 33–94% of ensemble cliques across 1,876 tested configurations on a four-LLM panel. A hierarchical Boyle-Dykstra projection is proposed as a deterministic repair, and an anytime-valid e-process enables sequential monitoring. Notably, three intuitive LLM-side mitigations—retrieval, partition-aware prompting, and aggregator-LLM—each fail or regress.

Evaluation and Benchmarking · AI Safety Research · Compositional Residual (eps*) · Proportional Allocation Rule · Multi-Component LLM Agent

5 · arXiv · cs.AI · 1mo ago

Ensembling Tabular Foundation Models: A Diversity Ceiling and a Calibration Trap

This paper benchmarks six ensemble strategies across six tabular foundation models (TFMs) on 153 OpenML classification tasks, finding that ensembling provides minimal gains over the best single TFM. The best ensemble strategy (two-level cascade stacking) achieves only +0.18% accuracy improvement at 253× the compute cost. A key finding is that logistic-regression meta-learner stacking improves accuracy while severely degrading calibration (log-loss), because sharpening class boundaries destroys probability estimates. The authors recommend greedy ensemble selection as the practical default.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Q-statistic · Greedy Ensemble Selection · Friedman-Nemenyi Test

6 · arXiv · cs.LG · 10d ago

PAC-Bayes analysis establishes formal expressivity and alignment floors for prompt-conditioned LLMs

A new arXiv preprint models user-LLM interaction as a bilevel cheap-talk game and derives PAC-Bayes bounds showing two irreducible limitations: an 'expressivity floor' where language's finite channel capacity makes distinct tasks indistinguishable, and an 'objective-misalignment floor' where alignment constraints prevent reaching user-ideal outputs. The authors prove that prompt-conditioned LLMs cannot be universal problem solvers, as correct behavior on certain task families is provably unattainable even with infinite data, optimal training, or model scaling. The work suggests multimodal inputs and external memory as potential mitigations by increasing task-relevant information bandwidth.

Evaluation and Benchmarking · Alignment and RLHF · PAC-Bayes · On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners

5 · arXiv · cs.AI · 25d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

6 · arXiv · cs.AI · 1mo ago

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

This paper investigates whether extrapolative weight averaging of RL-trained checkpoints can extend Pareto frontiers between competing objectives (correctness vs. computational efficiency) without additional training. Starting from a shared initialization, the authors train checkpoints under nested unit-test coverage regimes for competitive programming tasks, revealing a correctness-efficiency frontier where higher-coverage rewards reduce optimization failures but increase correctness failures. Extrapolation beyond trained endpoints produces complementary policies that, when ensembled, improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. Results hold across 7B and 32B model scales and three inference settings: pure reasoning, tool use, and agentic coding.

Evaluation and Benchmarking · Inference Economics · LCB/hard benchmark · Competitive Programming RL · LeetCode Hard (LCB/hard)

7 · arXiv · cs.CL · 1mo ago

Forecasting Downstream LLM Performance With Token-Level Proxy Metrics

Researchers propose proxy metrics constructed from token-level statistics (entropy, top-k accuracy, expert token rank) drawn from a candidate model's next-token distribution over expert-written solutions, as a cheaper and more reliable alternative to cross-entropy loss or direct downstream evaluation. Across three settings (cross-family model selection, pretraining data selection, and training-time forecasting), the proxies consistently outperform baselines, achieving a mean Spearman's rho of 0.81 vs. 0.36 for cross-entropy loss on model ranking, and reducing compute for data selection by roughly 10,000×. The method enables downstream performance extrapolation across an 18× compute horizon with roughly half the error of existing alternatives, suggesting that expert trajectories are broadly useful signals throughout the model development lifecycle.

Training Infrastructure · Evaluation and Benchmarking · Proxy Metrics for LLM Forecasting · Expert Token Rank · Spearman Rank Correlation

6 · arXiv · cs.CL · 1mo ago

Resolution Diagnostics for Paired LLM Evaluation: Many Leaderboard Rankings Statistically Unresolved

This paper frames pairwise LLM evaluation as a hypothesis-testing problem and introduces a resolution ratio q = N/N* to diagnose whether leaderboard comparisons are statistically powered. Applying this to two public leaderboards, the authors find that 11/40 Open LLM Leaderboard v1 pairwise comparisons and 4-6/9 MMLU-Pro top-10 adjacent-rank pairs fail to meet conventional (alpha=0.05, power=0.8) resolution targets. A key finding is that the widely-used unpaired Cohen-h shortcut underestimates required sample size by approximately a factor of two in close-comparison regimes, a flaw silently inherited by three major statistical calculators. The unresolved-pair pattern persists under multiplicity correction and sequential testing.

Frontier Model Releases · Evaluation and Benchmarking · Cohen's h · G*Power · resolution ratio q

6 · arXiv · cs.CL · 29d ago

Framework for quantifying faithful confidence expression in large reasoning models

A new arXiv preprint introduces a framework to measure faithful calibration (FC) in large reasoning models (LRMs)—the alignment between a model's intrinsic confidence and its linguistically expressed confidence. The authors analyze linguistic decisiveness against three internal uncertainty sources (token probabilities, hidden states, sampled response consistency) and introduce prefix-conditioned sampling to handle structural variation in chain-of-thought traces. Applying the framework across leading models, they find FC is a significant and distinct failure mode for LRMs: extended reasoning traces do not automatically improve calibration, prompt interventions that help non-reasoning models fail in the reasoning setting, and different confidence estimators produce divergent assessments of the same traces.

Frontier Model Releases · Evaluation and Benchmarking · Quantifying Faithful Confidence Expression in Large Reasoning Models