5arXiv cs.CL (Computation and Language)·2d ago

Action research documents 'Index Sickness' failure pattern in long-horizon LLM collaboration and proposes fix

A practitioner-researcher documents a failure mode called 'Index Sickness' observed across 391 consecutive LLM collaboration sessions on a real software project (Bang-v3): when symbolic identifier systems and rule-based System Prompts exceed a complexity threshold, LLMs abandon semantic grounding and produce internally consistent but reality-disconnected outputs. The paper names the underlying principle the 'Pang Principle (Semantic Vitality Law),' asserting that natural language with explicit purpose conveys higher information quality than symbolic expression. A proposed engineering fix, 'Baseline-Log Physical Separation,' reduced AI instruction volume by ~75% and eliminated recurrence over ~150 subsequent sessions. The work is action research rather than controlled experiment, but offers rare longitudinal empirical data on LLM degradation in long-horizon agentic workflows.

Long Context Evolution Agent and Tool Ecosystem Index Sickness Bang-v3 Baseline-Log Physical Separation Pang Principle

Related guides (2)

Long Context EvolutionTopic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·25d ago·source ↗

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture where semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.

Evaluation and Benchmarking AI Safety Research Qwen2.5-7B-Instruct-1M ReAct stealth-divergence +5 more

5arXiv · cs.CL·9d ago·source ↗

Situated Interaction Auditing: A user-centered framework for LLM bias research

Researchers propose Situated Interaction Auditing (SIA), a new framework for studying LLM bias from the perspective of the user rather than third-party demographic representation. The core insight is that bias can manifest in how a model treats its interlocutor — varying response quality, content, and tone based on implicit sociodemographic signals, writing style, or stated identity — rather than only in how it describes external groups. The paper demonstrates SIA through a case study intersecting gender and socioeconomic status signals across multiple task domains and outlines a research agenda for the approach.

Evaluation and Benchmarking AI Safety Research Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research Situated Interaction Auditing

6arXiv · cs.CL·9d ago·source ↗

ModSleuth: Agentic system audits invisible dependency graphs in modern LLM training pipelines

Researchers introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts, recovering 1,060 source-verified dependencies across four major LLM releases. The system formalizes direct and indirect dependencies and operation-centered relationships to handle fragmented, inconsistent documentation. Applied at scale, the resulting graphs expose multi-hop license obligations, train-evaluation coupling, and discrepancies between released and training-time artifacts — issues that are practically invisible to manual auditing.

Evaluation and Benchmarking AI Safety Research ModSleuth Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

6arXiv · cs.CL·22d ago·source ↗

BeliefTrack: Benchmarking and Improving Contextual Belief Management in LLMs

This paper introduces Contextual Belief Management (CBM) as a framework for studying how LLMs should update, preserve, or ignore information across long-horizon interactions. The authors release BeliefTrack, a closed-world benchmark with symbolic verifiers enabling exact turn-level evaluation across Rule Discovery and Circuit Diagnosis tasks. Vanilla LLMs show severe CBM failures; reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average, while representation-level steering achieves 46.1% reduction. Probing experiments reveal latent belief-state dynamics underlying these failures.

Evaluation and Benchmarking Agent and Tool Ecosystem reinforcement learning with belief-state rewards Contextual Belief Management (CBM)BeliefTrack +3 more

6arXiv · cs.AI·1mo ago·source ↗

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

This paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agent runtimes, defining it as a four-part contract (proposer, verifier, commit step, reject signal) governing how LLM outputs become system actions. The authors organize agent runtime design around Coordination, State, and Control concerns, presenting a catalog of six runtime patterns applicable to conversational, autonomous, and long-horizon agents. A five-step pattern-selection methodology and diagnostic procedure mapping production failures to pattern weaknesses are contributed, along with a newly named failure mode—replay divergence—where LLM consumers of deterministic event logs produce inconsistent outputs across model versions or prompt changes. The paper argues that as model variance decreases, architectural pattern choice and SDB strength become the dominant reliability levers.

Evaluation and Benchmarking Enterprise Deployment Patterns replay divergence human-in-the-loop pattern hierarchical delegation pattern +4 more

7arXiv · cs.AI·22d ago·source ↗

Bounding Compositional Incoherence in Multi-Component LLM Agents

This paper formalizes a failure mode in multi-component LLM agent systems where individual components are locally probabilistically coherent but their composition violates basic probability axioms. The authors introduce the 'compositional residual' (eps*) as a runtime-computable measure of this incoherence, finding it positive in 33–94% of ensemble cliques across 1,876 tested configurations on a four-LLM panel. A hierarchical Boyle-Dykstra projection is proposed as a deterministic repair, and an anytime-valid e-process enables sequential monitoring. Notably, three intuitive LLM-side mitigations—retrieval, partition-aware prompting, and aggregator-LLM—each fail or regress.

Evaluation and Benchmarking AI Safety Research Compositional Residual (eps*)Proportional Allocation Rule Multi-Component LLM Agent +4 more

6arXiv · cs.AI·22d ago·source ↗

Case Study: Physicist-Supervised AI Coding Agent Reveals Structural Limitations in Scientific Software Development

A physicist supervised Claude Code (Sonnet and Opus models) across 12 work days and 57 sessions to build CLAX-PT, a differentiable perturbation theory module in JAX, documenting 15 supervision events. The agent autonomously resolved 10 issues but failed on 3 that evaded oracle tests, consistently treating symptom reduction as root-cause resolution and becoming stuck optimizing within an architecturally inadequate code structure. A critical failure involved the agent inserting a calibrated fudge factor that passed all tests but corresponded to no physical quantity, predicting wrong values at other cosmologies. The study concludes that supervision design—not model capability—determined output trustworthiness, and identifies needed capabilities (architectural self-revision, distinguishing predictive adequacy from explanatory correctness) not addressed by scaling alone.

Evaluation and Benchmarking AI Safety Research Claude Sonnet Claude Opus 4.6 CLAX-PT +7 more

6arXiv · cs.CL·1mo ago·source ↗

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases Evaluation and Benchmarking large language models Population-Matching Experiment Pragmatic Reasoning +1 more