Contagion Networks: formal framework for measuring evaluator bias propagation in multi-agent LLM systems
A new arXiv preprint introduces Contagion Networks, a formal framework for quantifying how systematic evaluation biases spread across interacting LLM agents in multi-agent systems. Using a controlled 3-agent experiment with DeepSeek-chat, the authors measure a Cross-Agent Contagion Matrix and find that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model settings. A key practical finding is that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, offering a concrete mitigation strategy. The authors release an open-source experimental framework alongside the paper.
Related guides (3)
Related events (8)
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three granularity levels: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights and integrates above the observability layer with an accessible UI. Experiments across four benchmarks and seven agentic settings demonstrate strong alignment with human-annotated errors and predictive accuracy for task success rates.
AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases
This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.
Bounding Compositional Incoherence in Multi-Component LLM Agents
This paper formalizes a failure mode in multi-component LLM agent systems where individual components are locally probabilistically coherent but their composition violates basic probability axioms. The authors introduce the 'compositional residual' (eps*) as a runtime-computable measure of this incoherence, finding it positive in 33–94% of ensemble cliques across 1,876 tested configurations on a four-LLM panel. A hierarchical Boyle-Dykstra projection is proposed as a deterministic repair, and an anytime-valid e-process enables sequential monitoring. Notably, three intuitive LLM-side mitigations—retrieval, partition-aware prompting, and aggregator-LLM—each fail or regress.
AMEL: Accumulated Message Effects Bias LLM Judgments in Multi-Turn Evaluation Pipelines
This paper introduces AMEL (Accumulated Message Effect on LLM Judgments), documenting that prior conversation history with predominantly positive or negative evaluations systematically biases subsequent LLM judgments toward the prevailing polarity. Across 75,898 API calls to 11 models from 4 providers, the effect is statistically robust (d = -0.17, p < 10^-46), concentrates on high-uncertainty items, and shows a negativity asymmetry where negative histories induce 1.62x more bias than positive ones. Critically, the bias does not grow with context length, scaling reduces but does not eliminate it, and the simplest mitigation is using a fresh context per evaluation item.
CausaLab: Scalable Benchmark for Interactive Causal Discovery by LLM Agents
CausaLab is a new evaluation environment that tests LLM agents on interactive causal discovery tasks, requiring them to recover both causal graphs and structural equations from synthetic laboratory episodes governed by randomly sampled structural causal models (SCMs). The benchmark separates predictive accuracy from genuine causal understanding, revealing a persistent gap: GPT-5.2-high achieves 92% task accuracy in a 6-node observational setting but only 0.471 all-edge F1 for mechanism recovery. Mixed observation-intervention strategies improve structural fidelity, while pure intervention strategies underperform on both metrics. Premature stopping is identified as a key agent weakness, partially mitigated by prompting models to verify hypothesis-data consistency.
ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues
ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.
CollabSim: CSCW-grounded framework for evaluating collaborative competence in LLM multi-agent systems
Researchers introduce CollabSim, a configurable simulation framework for systematically evaluating collaborative competence in LLM-based multi-agent systems (MAS). The framework draws on Computer-Supported Cooperative Work (CSCW) theory to define collaborative capabilities beyond task outcomes, including common ground establishment, shared task understanding, and misalignment repair. Experiments across four LLMs demonstrate the framework can distinguish model performance patterns and reveal task-dependent effects of agent design choices. The work addresses a gap in MAS evaluation, which has historically focused on individual task-solving rather than coordination quality.
Leadership styles as coordination controllers in multi-agent LLM teams: contingency theory validated
A new arXiv paper operationalizes three classical leadership styles (transactional, transformational, situational) as explicit action-set controllers over multi-agent LLM teams, testing when process-level coordination adds value versus plain majority voting. Across four task regimes and three open-weight model families, no controller dominates on accuracy — consistent with contingency theory — with gains appearing only when the round-0 majority is unreliable, the task is recoverable, and undirected interaction fails to self-repair. The authors formalize a 'recovery-advantage boundary' mapping onto classical constructs like leadership substitutes and path-goal redundancy. The largely null accuracy result is framed as theory confirmation rather than a failure of the controllers.


