7 · arXiv · cs.AI (Artificial Intelligence) · 22d ago

Bounding Compositional Incoherence in Multi-Component LLM Agents

This paper formalizes a failure mode in multi-component LLM agent systems in which each component is locally probabilistically coherent, yet their composition violates basic probability axioms. The authors introduce the 'compositional residual' (eps*) as a runtime-computable measure of this incoherence and find it positive in 33–94% of ensemble cliques across 1,876 tested configurations on a four-LLM panel. They propose a hierarchical Boyle-Dykstra projection as a deterministic repair and an anytime-valid e-process for sequential monitoring. Notably, three intuitive LLM-side mitigations (retrieval, partition-aware prompting, and an aggregator LLM) each fail or regress.
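To make the failure mode concrete, here is a minimal sketch, not the paper's method. Three hypothetical components report P(A), P(B), and P(A ∪ B) for disjoint events A and B. Each report is a valid probability on its own, but together they break additivity. The `residual` function is an illustrative stand-in for eps*, whose exact definition is not given in this summary, and `dykstra` is the standard two-set Dykstra alternating projection, not the hierarchical variant the authors propose.

```python
# Illustrative only: names and the residual definition are assumptions.

def project_hyperplane(x, a):
    """Euclidean projection of x onto the hyperplane {z : a . z = 0}."""
    dot = sum(ai * xi for ai, xi in zip(a, x))
    norm2 = sum(ai * ai for ai in a)
    return [xi - (dot / norm2) * ai for ai, xi in zip(a, x)]


def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection of x onto the box [lo, hi]^n."""
    return [min(max(xi, lo), hi) for xi in x]


def residual(p):
    """Additivity gap |P(A) + P(B) - P(A u B)| for disjoint A and B."""
    p_a, p_b, p_union = p
    return abs(p_a + p_b - p_union)


def dykstra(x0, a, iters=200):
    """Dykstra's algorithm for the hyperplane {a . z = 0} and the unit box.

    Unlike plain alternating projection, the correction terms p and q make
    the iterates converge to the nearest point of the intersection to x0.
    """
    x = list(x0)
    p = [0.0] * len(x)  # correction for the hyperplane step
    q = [0.0] * len(x)  # correction for the box step
    for _ in range(iters):
        y = project_hyperplane([xi + pi for xi, pi in zip(x, p)], a)
        p = [xi + pi - yi for xi, pi, yi in zip(x, p, y)]
        x_next = project_box([yi + qi for yi, qi in zip(y, q)])
        q = [yi + qi - xn for yi, qi, xn in zip(y, q, x_next)]
        x = x_next
    return x


# Additivity constraint: P(A) + P(B) - P(A u B) = 0.
ADDITIVITY = [1.0, 1.0, -1.0]

reports = [0.6, 0.5, 0.9]  # each value lies in [0, 1], yet 0.6 + 0.5 != 0.9
repaired = dykstra(reports, ADDITIVITY)
print(residual(reports), residual(repaired))
```

With these toy reports the additivity gap starts near 0.2 and the repaired vector is roughly (0.533, 0.433, 0.967), which satisfies additivity up to floating-point error while staying as close as possible to the original reports.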

Evaluation and Benchmarking AI Safety Research Agent and Tool Ecosystem Compositional Residual (eps*) Proportional Allocation Rule Multi-Component LLM Agent Boyle-Dykstra Projection Anytime-Valid E-Process Rayleigh Quotient

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.LG · 8d ago

Operadic consistency: a label-free signal for detecting compositional reasoning failures in LLMs

Researchers introduce operadic consistency (OC), a label-free inference-time signal that checks whether an LLM's direct answer to a compositional query agrees with the answer obtained by composing its own stated decomposition of that query. Evaluated across 12 instruction-tuned LLMs (4B–671B parameters) on four multi-hop QA datasets, OC correlates with accuracy at Pearson r ∈ [0.86, 0.94] on every dataset and is more robust across datasets than self-consistency, semantic entropy, and P(True). At the per-question level, OC provides information beyond existing baselines and yields selective-prediction improvements (AUARC lifts of +0.086–0.096, AUROC lifts of +0.092–0.164) at equal sampling cost, with results extending to frontier thinking models using chain-of-thought decompositions.
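The core check described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_model` is a hypothetical lookup table standing in for an LLM, the decomposition is a simple chain in which each hop receives the previous answer through a `{prev}` placeholder, and agreement is tested by normalized exact match.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for an LLM: a fixed question -> answer table.
TOY_ANSWERS = {
    "What is the capital of the country where the Eiffel Tower stands?": "Paris",
    "In which country does the Eiffel Tower stand?": "France",
    "What is the capital of France?": "paris",
    "What is the capital of the country where the Colosseum stands?": "Milan",
    "In which country does the Colosseum stand?": "Italy",
    "What is the capital of Italy?": "Rome",
}


def toy_model(question: str) -> str:
    return TOY_ANSWERS[question]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def composed_answer(model: Callable[[str], str], hops: Sequence[str]) -> str:
    """Answer each hop in order, feeding the previous answer into {prev}."""
    prev = ""
    for template in hops:
        prev = model(template.format(prev=prev))
    return prev


def is_consistent(model: Callable[[str], str], question: str,
                  hops: Sequence[str]) -> bool:
    """True when the direct answer matches the answer built from the hops."""
    return normalize(model(question)) == normalize(composed_answer(model, hops))


def consistency_rate(model: Callable[[str], str],
                     items: List[Tuple[str, Sequence[str]]]) -> float:
    """Fraction of items whose direct and composed answers agree."""
    hits = sum(is_consistent(model, q, hops) for q, hops in items)
    return hits / len(items)


ITEMS = [
    ("What is the capital of the country where the Eiffel Tower stands?",
     ["In which country does the Eiffel Tower stand?",
      "What is the capital of {prev}?"]),
    ("What is the capital of the country where the Colosseum stands?",
     ["In which country does the Colosseum stand?",
      "What is the capital of {prev}?"]),
]

print(consistency_rate(toy_model, ITEMS))
```

No gold labels are used anywhere: the second item is flagged only because the toy model's direct answer disagrees with what its own sub-answers imply.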

Evaluation and Benchmarking AI Safety Research operadic consistency Chain-of-Thought Self-Consistency MuSiQue +6 more

6 · arXiv · cs.CL · 25d ago

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture where semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.

Evaluation and Benchmarking AI Safety Research Qwen2.5-7B-Instruct-1M ReAct stealth-divergence +5 more

5 · arXiv · cs.CL · 8d ago

Operads proposed as mathematical foundation for LLM question decomposition and consistency

Researchers propose operads — algebraic structures modeling many-in, one-out compositions — as a rigorous mathematical framework for question decomposition in LLMs. They define a 'questions operad' where QA models are interpreted as algebras, and introduce 'operadic consistency' as a measure of whether a model's answers agree across partial collapses of a decomposition tree. A companion empirical paper reports operadic consistency is strongly correlated with accuracy across twelve LLMs and four multi-hop QA datasets, outperforming temperature-based self-consistency baselines. The work attempts to give formal grounding to a widely-used but theoretically underspecified reasoning strategy.

Evaluation and Benchmarking Agent and Tool Ecosystem Richardson Operads for compositional reasoning in LLMs Liu +1 more

6 · arXiv · cs.AI · 1mo ago

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

This paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agent runtimes, defining it as a four-part contract (proposer, verifier, commit step, reject signal) that governs how LLM outputs become system actions. The authors organize agent runtime design around Coordination, State, and Control concerns and present a catalog of six runtime patterns applicable to conversational, autonomous, and long-horizon agents. They also contribute a five-step pattern-selection methodology, a diagnostic procedure that maps production failures to pattern weaknesses, and a newly named failure mode, replay divergence, in which LLM consumers of deterministic event logs produce inconsistent outputs across model versions or prompt changes. The paper argues that as model variance decreases, architectural pattern choice and SDB strength become the dominant reliability levers.
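One way to picture the four-part contract is as a small interface. The sketch below is an assumption-laden illustration, not the paper's design: the `Boundary` class, its field names, and the refund example are all hypothetical. The proposer stands in for an LLM call, the verifier is deterministic code, and only verified proposals reach the commit step.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class Boundary(Generic[T]):
    """Hypothetical four-part contract: propose, verify, commit, reject."""
    proposer: Callable[[], T]               # stochastic side, e.g. an LLM call
    verifier: Callable[[T], Optional[str]]  # None if valid, else a reason
    commit: Callable[[T], None]             # deterministic side effect
    reject: Callable[[T, str], None]        # signal sent back on failure

    def step(self) -> bool:
        """Run one proposal through the boundary; True if it was committed."""
        proposal = self.proposer()
        reason = self.verifier(proposal)
        if reason is None:
            self.commit(proposal)
            return True
        self.reject(proposal, reason)
        return False


# Toy usage: canned proposals replace a real model, and refunds are capped.
proposals = iter([{"refund": 20}, {"refund": 5000}])
ledger: List[dict] = []
rejections: List[Tuple[dict, str]] = []

boundary = Boundary(
    proposer=lambda: next(proposals),
    verifier=lambda a: None if a["refund"] <= 100 else "refund exceeds limit",
    commit=ledger.append,
    reject=lambda a, why: rejections.append((a, why)),
)

results = [boundary.step(), boundary.step()]
print(results, ledger, rejections)
```

Keeping the verifier and commit step as plain deterministic functions means the rejected proposal never touches the ledger, and the recorded reason can be fed back to the proposer on a retry.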

Evaluation and Benchmarking Enterprise Deployment Patterns replay divergence human-in-the-loop pattern hierarchical delegation pattern +4 more

5 · arXiv · cs.AI · 12d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias that causes performance drops of 20% or more when canonical problem formulations are disguised, and degradation of up to 34% when misleading suggestions are embedded in prompts. They argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking AI Safety Research How reliable are LLMs when it comes to playing dice?

6 · arXiv · cs.CL · 22d ago

Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency

This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.

Evaluation and Benchmarking Agent and Tool Ecosystem on-policy distillation multi-turn language models self-anchored drift +2 more

6 · arXiv · cs.AI · 46h ago

Contagion Networks: formal framework for measuring evaluator bias propagation in multi-agent LLM systems

A new arXiv preprint introduces Contagion Networks, a formal framework for quantifying how systematic evaluation biases spread across interacting LLM agents in multi-agent systems. Using a controlled 3-agent experiment with DeepSeek-chat, the authors measure a Cross-Agent Contagion Matrix and find that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model settings. A key practical finding is that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, offering a concrete mitigation strategy. The authors release an open-source experimental framework alongside the paper.

Evaluation and Benchmarking AI Safety Research Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems MM-EPC deepseek-chat +1 more

6 · arXiv · cs.CL · 22d ago

BeliefTrack: Benchmarking and Improving Contextual Belief Management in LLMs

This paper introduces Contextual Belief Management (CBM) as a framework for studying how LLMs should update, preserve, or ignore information across long-horizon interactions. The authors release BeliefTrack, a closed-world benchmark with symbolic verifiers enabling exact turn-level evaluation across Rule Discovery and Circuit Diagnosis tasks. Vanilla LLMs show severe CBM failures; reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average, while representation-level steering achieves 46.1% reduction. Probing experiments reveal latent belief-state dynamics underlying these failures.

Evaluation and Benchmarking Agent and Tool Ecosystem reinforcement learning with belief-state rewards Contextual Belief Management (CBM) BeliefTrack +3 more