paper

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

paper · active · provisional


1 event · first seen 7d ago


Co-occurring entities

Mixture-of-Agents · Clopper-Pearson · GPQA Diamond

More like this (12)

Large Language Models (frontier)
Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
Reasoning Language Models
OpenAI frontier models
Self-Compacting Language Model Agents
Language Modeling Loss
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
Linguistic Firewall: Geometry as Defense in Multi-Agent Systems
Routing Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

Recent events (1)

7 · arXiv · cs.AI · 7d ago

Co-failure ceiling theorem bounds maximum gains from LLM routing, voting, and mixture-of-agents across 67 frontier models

A new arXiv paper introduces a 'co-failure ceiling' derived from beta, the rate at which all models in an ensemble fail on the same query, and proves that no routing, voting, or cascade policy can exceed an accuracy of 1 - beta. In an empirical evaluation of 67 models from 21 providers, the paper finds that standard pairwise error-correlation metrics systematically underprice the co-failure tail by ~2.5x on open-ended mathematics, and that combining models rarely beats the single best model without strong query-level routing signals. The work provides a finite-sample certificate (via Clopper-Pearson bounds) for the maximum gain achievable from a multi-model system, computable before a router is trained, and identifies answer format rather than subject matter as a key driver of co-failure on GPQA-Diamond.
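As a rough illustration of the quantities in this summary, the sketch below estimates the all-wrong rate beta from a per-query correctness matrix, reports the implied ceiling 1 - beta, and wraps it in an exact (Clopper-Pearson) binomial interval. The function names, the toy data, and the choice of a two-sided 95% interval are assumptions made for illustration; the paper's actual certificate may be constructed differently.

```python
from math import comb


def all_wrong_count(correct):
    """Count queries on which every model failed.

    correct[i][j] is True when model j answered query i correctly.
    """
    return sum(1 for row in correct if not any(row))


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, iters=80):
    """Bisection on [0, 1] for a function that decreases from >0 to <0."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_of_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_of_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


def ceiling_certificate(correct, alpha=0.05):
    """Point estimate and interval for the ceiling 1 - beta (illustrative)."""
    n = len(correct)
    k = all_wrong_count(correct)
    beta_lo, beta_hi = clopper_pearson(k, n, alpha)
    best_single = max(sum(col) for col in zip(*correct)) / n
    return {
        "beta_hat": k / n,
        "ceiling_hat": 1 - k / n,
        "ceiling_interval": (1 - beta_hi, 1 - beta_lo),
        "best_single": best_single,
        "max_gain_hat": (1 - k / n) - best_single,
    }


# Toy data (hypothetical): 5 queries, 3 models.
toy = [
    [True, False, False],
    [False, False, False],
    [False, True, True],
    [True, True, False],
    [False, False, False],
]
cert = ceiling_certificate(toy)
print(cert)
```

On the toy matrix two of the five queries are missed by every model, so the estimated ceiling is 0.6 while the best single model scores 0.4. With so few queries the interval around the ceiling is very wide, which is the reason a finite-sample bound matters before investing in a router.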

Evaluation and Benchmarking · Inference Economics · Mixture-of-Agents · When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models · Clopper-Pearson