technique

Clopper-Pearson

technique · active · provisional · clopper-pearson-1a2442a8 · 1 event · first seen 7d ago

Aliases: Clopper-Pearson

Co-occurring entities

Mixture-of-Agents · When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models · GPQA Diamond

More like this (12)

Pearson correlation · Spearman Rank Correlation · canonical correlation analysis · PAC-Bayes · Claw-Eval · Fisher-SEP · predictor-corrector methods · CLP (Collocation-Length Predictor) · OCR-Robust · Fleiss' Kappa · Exact Posterior Score · CompVis

Recent events (1)

7 · arXiv · cs.AI · 7d ago

Co-failure ceiling theorem bounds maximum gains from LLM routing, voting, and mixture-of-agents across 67 frontier models

A new arXiv paper introduces the concept of a 'co-failure ceiling' (the rate at which all models in an ensemble fail on the same query) and proves that no routing, voting, or cascade policy can exceed an accuracy of 1 - beta, where beta is this all-wrong rate. Evaluated empirically across 67 models from 21 providers, the paper finds that standard pairwise error-correlation metrics systematically underprice the co-failure tail by ~2.5x on open-ended mathematics, and that combining models rarely beats the single best model without strong query-level routing signals. The work provides a finite-sample certificate (via Clopper-Pearson bounds) for the maximum achievable gain from a multi-model system, computable before training a router, and identifies answer format rather than subject matter as a key driver of co-failure on GPQA-Diamond.
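To make the Clopper-Pearson step concrete, here is a minimal sketch of how an exact binomial interval on the all-wrong rate beta could be turned into an interval on the ceiling 1 - beta. This is an illustration, not the paper's procedure: the function names, the two-sided 95% level, and the example counts (12 all-wrong queries out of 200) are all assumptions made up for this example.

```python
from math import comb


def all_wrong_count(correct):
    """Count queries on which every model in the pool is wrong.

    correct[q][m] is True when model m answers query q correctly.
    """
    return sum(1 for row in correct if not any(row))


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); exact sum, fine for modest n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) interval for a binomial proportion."""

    def solve(cdf_at, target):
        # The binomial CDF is decreasing in p, so bisect on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit solves P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def ceiling_interval(k, n, alpha=0.05):
    """Interval for the ceiling 1 - beta implied by the interval for beta."""
    lower, upper = clopper_pearson(k, n, alpha)
    return 1 - upper, 1 - lower


# Hypothetical counts, not taken from the paper.
n_queries, n_all_wrong = 200, 12
beta_lo, beta_hi = clopper_pearson(n_all_wrong, n_queries)
ceil_lo, ceil_hi = ceiling_interval(n_all_wrong, n_queries)
print(f"beta in [{beta_lo:.4f}, {beta_hi:.4f}]")
print(f"ceiling 1 - beta in [{ceil_lo:.4f}, {ceil_hi:.4f}]")
```

The interval needs only the count of all-wrong queries and the evaluation set size, which is why a bound of this kind can be computed before any router is trained. The sketch uses only the standard library; for large n a beta-quantile routine from a statistics library would be a more numerically robust choice than the direct binomial sum.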

Evaluation and Benchmarking · Inference Economics · Mixture-of-Agents · When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models · Clopper-Pearson · +2 more