Almanac
paper

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

paper · active · provisional · when-does-combining-language-models-help-a-co-failure-ceiling-on-routing-voting-and-mixture-of-agents-across-67-frontier-models-bc3c1260 · 1 event · first seen 7d ago

Recent events (1)

7 · arXiv · cs.AI · 7d ago

Co-failure ceiling theorem bounds maximum gains from LLM routing, voting, and mixture-of-agents across 67 frontier models

A new arXiv paper introduces the 'co-failure ceiling': with beta defined as the rate at which all models in an ensemble fail on the same query, it proves that no routing, voting, or cascade policy can exceed an accuracy of 1 - beta. In an empirical evaluation across 67 models from 21 providers, the paper finds that standard pairwise error-correlation metrics systematically underprice the co-failure tail by about 2.5x on open-ended mathematics, and that combining models rarely beats the single best model without strong query-level routing signals. The work also provides a finite-sample certificate (via Clopper-Pearson bounds) for the maximum achievable gain from a multi-model system, computable before any router is trained, and identifies answer format rather than subject matter as a key driver of co-failure on GPQA-Diamond.
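
The ceiling described above can be illustrated with a minimal Python sketch. It is not the paper's code: the function names (`all_wrong_rate`, `clopper_pearson`), the toy correctness matrix, and the way the interval is turned into a certified ceiling are all assumptions made for illustration. The idea is that a policy which picks among the models' answers can only be right on a query where at least one model is right, so its accuracy is at most 1 - beta. A Clopper-Pearson lower bound on beta then gives a finite-sample upper bound on that ceiling.

```python
from math import comb


def all_wrong_rate(correct):
    """Count queries on which every model failed.

    correct[i][j] is True if model j answered query i correctly.
    Returns (k, n): k all-wrong queries out of n total.
    """
    n = len(correct)
    k = sum(1 for row in correct if not any(row))
    return k, n


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) interval for a binomial proportion."""

    def solve(cdf_at, target):
        # cdf_at(p) is decreasing in p, so bisect for cdf_at(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Toy data (made up for illustration): 5 queries, 3 models.
correct = [
    [True, False, False],
    [False, False, False],
    [False, True, True],
    [True, True, True],
    [False, False, False],
]

k, n = all_wrong_rate(correct)
beta_hat = k / n
best_single = max(sum(col) / n for col in zip(*correct))
beta_lo, beta_hi = clopper_pearson(k, n)

print(f"all-wrong rate beta_hat = {beta_hat:.3f}")
print(f"empirical ceiling 1 - beta_hat = {1 - beta_hat:.3f}")
print(f"best single model accuracy = {best_single:.3f}")
print(f"certified ceiling (95% CP) <= {1 - beta_lo:.3f}")
```

On an evaluation set this small the interval is very wide, so the certified ceiling is only informative with a much larger number of queries. The gap between the ceiling and the best single model's accuracy is the most that any router over these models could add on this query distribution.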