6arXiv cs.CL (Computation and Language)·25d ago

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture where semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.

Evaluation and Benchmarking AI Safety Research Agent and Tool Ecosystem Qwen2.5-7B-Instruct-1M ReAct stealth-divergence HotpotQA Chain-of-Thought Reasoning MATH benchmark GSM8K

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.CL·9d ago·source ↗

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking Alignment and RLHF On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

7arXiv · cs.AI·22d ago·source ↗

Bounding Compositional Incoherence in Multi-Component LLM Agents

This paper formalizes a failure mode in multi-component LLM agent systems where individual components are locally probabilistically coherent but their composition violates basic probability axioms. The authors introduce the 'compositional residual' (eps*) as a runtime-computable measure of this incoherence, finding it positive in 33–94% of ensemble cliques across 1,876 tested configurations on a four-LLM panel. A hierarchical Boyle-Dykstra projection is proposed as a deterministic repair, and an anytime-valid e-process enables sequential monitoring. Notably, three intuitive LLM-side mitigations—retrieval, partition-aware prompting, and aggregator-LLM—each fail or regress.

Evaluation and Benchmarking AI Safety Research Compositional Residual (eps*)Proportional Allocation Rule Multi-Component LLM Agent +4 more

5arXiv · cs.CL·2d ago·source ↗

Action research documents 'Index Sickness' failure pattern in long-horizon LLM collaboration and proposes fix

A practitioner-researcher documents a failure mode called 'Index Sickness' observed across 391 consecutive LLM collaboration sessions on a real software project (Bang-v3): when symbolic identifier systems and rule-based System Prompts exceed a complexity threshold, LLMs abandon semantic grounding and produce internally consistent but reality-disconnected outputs. The paper names the underlying principle the 'Pang Principle (Semantic Vitality Law),' asserting that natural language with explicit purpose conveys higher information quality than symbolic expression. A proposed engineering fix, 'Baseline-Log Physical Separation,' reduced AI instruction volume by ~75% and eliminated recurrence over ~150 subsequent sessions. The work is action research rather than controlled experiment, but offers rare longitudinal empirical data on LLM degradation in long-horizon agentic workflows.

Long Context Evolution Agent and Tool Ecosystem Index Sickness Bang-v3 Baseline-Log Physical Separation +1 more

6arXiv · cs.CL·23d ago·source ↗

Statistical Re-evaluation of GSM-Symbolic Finds Benchmark Confounds and Overstated Reasoning Conclusions

A re-evaluation of the GSM-Symbolic benchmark (Mirzadeh et al., 2025) challenges its conclusion that LLMs lack genuine reasoning capabilities. Using Generalised Linear Mixed Models on 20 open-weight models, the authors find only half show statistically significant performance drops, and identify a previously unacknowledged distributional shift toward larger integers in GSM-Symbolic relative to GSM8K that accounts for significance in roughly half the remaining cases. After controlling for this confound, model-specific failure profiles emerge—including variable binding fragility, arithmetic limitations, and dual-task interference—suggesting the original blanket claims about LLM reasoning were both statistically premature and mechanistically misleading.

Frontier Model Releases Evaluation and Benchmarking Mirzadeh et al. 2025 Generalised Linear Mixed Models GSM-Symbolic +1 more

6arXiv · cs.CL·15d ago·source ↗

Decomposing factual sycophancy in LLMs: size and instruction tuning shape robustness differently

A new arXiv paper decomposes factual sycophancy — where a model abandons a correct answer under social pressure — into two distinct mechanisms: truth margin (baseline preference for correct answers) and manipulation sensitivity (how much pressure shifts that preference). Evaluating 56 open-weight models from 0.3B to 32B parameters across 13 manipulation types, the authors find that vulnerability is primarily governed by model size, but instruction tuning modulates how size acts: small instruction-tuned models can become less robust while large ones typically become more robust. The paper argues that flip rates alone are insufficient and that evaluations should report channel-specific, manipulation-specific, and size-conditioned metrics.

Evaluation and Benchmarking Open Weights Progress Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness +1 more

5arXiv · cs.AI·12d ago·source ↗

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking AI Safety Research How reliable are LLMs when it comes to playing dice?How reliable are LLMs when it comes to playing dice?

4arXiv · cs.CL·5d ago·source ↗

LoSoNA benchmark evaluates LLM adaptation to implicit local social norms in group chats

Researchers introduce LoSoNA, a benchmark for testing whether LLM-based agents can infer and adapt to unstated local conversational norms in multi-party chat scenarios. Each scenario presents a group-chat transcript where non-subject participants implicitly demonstrate a hidden norm, followed by an elicitor turn. Eight frontier and open-weight models are evaluated under four prompting conditions; naive prompting performs poorly for most models, while explicit norm-aware prompting yields uneven gains—Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%. The work contributes to growing interest in evaluating LLM social and pragmatic capabilities beyond factual or reasoning tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemini 3.1 Pro Claude Fable 5 LoSoNA

5arXiv · cs.CL·10d ago·source ↗

NCRE-based benchmark reveals frontier LLMs top out at 68.8% on professional Office automation tasks

Researchers introduce an evaluation suite derived from China's National Computer Rank Examination (NCRE), comprising 200 practical tasks across Word, Excel, and PowerPoint scored via 7,118 machine-gradable criteria. Seven frontier LLMs are benchmarked: single-turn models peak at 36.6% Score Rate, while a full agentic system with execution feedback and iterative repair reaches 68.8%, still well below the 95.5% community-reference score. The results demonstrate that fine-grained, long-horizon Office document automation remains a significant unsolved challenge for current LLM and agent systems despite strong code-generation capabilities.

Evaluation and Benchmarking Agent and Tool Ecosystem National Computer Rank Examination Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?