6 · arXiv cs.CL (Computation and Language) · 9d ago

ModSleuth: Agentic system audits invisible dependency graphs in modern LLM training pipelines

Researchers introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts, recovering 1,060 source-verified dependencies across four major LLM releases. The system formalizes direct dependencies, indirect dependencies, and operation-centered relationships to handle fragmented, inconsistent documentation. Applied at scale, the resulting graphs expose multi-hop license obligations, train-evaluation coupling, and discrepancies between released and training-time artifacts, all of which are practically invisible to manual auditing.

Evaluation and Benchmarking · AI Safety Research · ModSleuth · Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
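
To make "multi-hop license obligations" in the ModSleuth summary concrete, here is a minimal sketch of walking a dependency graph and flagging restrictive licenses that only appear two or more hops away from a release. This is not ModSleuth's implementation. The graph, artifact names, license labels, and both function names are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy graph: artifact -> its direct dependencies.
# All names and licenses below are invented for illustration.
DEPENDS_ON = {
    "model-x": ["sft-data-a", "base-model-y"],
    "sft-data-a": ["teacher-model-z"],
    "base-model-y": ["web-corpus-w"],
    "teacher-model-z": [],
    "web-corpus-w": [],
}
LICENSE = {
    "sft-data-a": "permissive",
    "base-model-y": "permissive",
    "teacher-model-z": "non-commercial",
    "web-corpus-w": "permissive",
}


def transitive_dependencies(root, graph):
    """Return {dependency: hop_count} for everything reachable from root."""
    hops = {}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in hops:  # breadth-first, so the first visit is the shortest path
                hops[dep] = depth + 1
                queue.append((dep, depth + 1))
    return hops


def multi_hop_obligations(root, graph, licenses, flagged="non-commercial"):
    """List indirect (more than one hop) dependencies carrying the flagged license."""
    hops = transitive_dependencies(root, graph)
    return sorted(d for d, h in hops.items() if h > 1 and licenses.get(d) == flagged)


deps = transitive_dependencies("model-x", DEPENDS_ON)
indirect_flagged = multi_hop_obligations("model-x", DEPENDS_ON, LICENSE)
print(deps)
print(indirect_flagged)
```

In this toy graph the restrictive license sits on a teacher model that never appears in the release's direct dependencies, which is the kind of obligation a one-level manual review would miss.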

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 8d ago

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

An arXiv preprint reports that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, where traditional reproducibility efforts are resource-intensive.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Automated reproducibility assessments in the social and behavioral sciences using large language models

6 · arXiv · cs.CL · 29d ago

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three granularity levels: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights and integrates above the observability layer with an accessible UI. Experiments across four benchmarks and seven agentic settings demonstrate strong alignment with human-annotated errors and predictive accuracy for task success rates.

Evaluation and Benchmarking · AI Safety Research · Agentic CLEAR · multi-level agent evaluation · LLM agents · +1 more

5 · arXiv · cs.CL · 2d ago

Action research documents 'Index Sickness' failure pattern in long-horizon LLM collaboration and proposes fix

A practitioner-researcher documents a failure mode called 'Index Sickness' observed across 391 consecutive LLM collaboration sessions on a real software project (Bang-v3): when symbolic identifier systems and rule-based System Prompts exceed a complexity threshold, LLMs abandon semantic grounding and produce internally consistent but reality-disconnected outputs. The paper names the underlying principle the 'Pang Principle (Semantic Vitality Law),' asserting that natural language with explicit purpose conveys higher information quality than symbolic expression. A proposed engineering fix, 'Baseline-Log Physical Separation,' reduced AI instruction volume by ~75% and eliminated recurrence over ~150 subsequent sessions. The work is action research rather than controlled experiment, but offers rare longitudinal empirical data on LLM degradation in long-horizon agentic workflows.

Long Context Evolution · Agent and Tool Ecosystem · Index Sickness · Bang-v3 · Baseline-Log Physical Separation · +1 more

6 · arXiv · cs.CL · 5d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeliveryBench · AgentSpec · MiniGrid · +2 more

5 · arXiv · cs.LG · 3d ago

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenAI · ReproRepo · Codex · +1 more

6 · arXiv · cs.LG · 22d ago

LLMSurgeon: Post-Hoc Auditing of LLM Pretraining Data Mixtures

LLMSurgeon formalizes Data Mixture Surgery (DMS), a framework for estimating the domain-level distribution of an LLM's pretraining corpus using only generated text from the target model. The method casts DMS as an inverse problem under the label-shift assumption, using a calibrated soft confusion matrix to correct domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a verifiable evaluation suite built from open-source LLMs with known pretraining mixtures, on which LLMSurgeon demonstrates high-fidelity recovery of domain compositions without access to training data.

Frontier Model Releases · Evaluation and Benchmarking · LLMSurgeon · LLMScan · Data Mixture Surgery · +3 more
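
The inverse problem in the LLMSurgeon summary can be illustrated with a generic label-shift estimator. Under label shift, the distribution of predicted domains `q` satisfies `q[j] = sum_i prior[i] * C[i][j]`, where `C[i][j]` is the probability that text from true domain `i` is classified as domain `j`. The sketch below recovers the prior with a standard EM-style fixed-point update. It is not the paper's method: the function name, the two-domain confusion matrix, and the observed frequencies are all invented for illustration.

```python
def estimate_mixture(confusion, observed, iterations=500):
    """Estimate the latent domain prior from predicted-domain frequencies.

    confusion[i][j] = P(classified as j | true domain i); each row sums to 1.
    observed[j]     = fraction of generated samples classified as domain j.
    """
    k = len(confusion)
    prior = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        # Predicted-domain frequencies implied by the current prior.
        implied = [sum(prior[i] * confusion[i][j] for i in range(k)) for j in range(k)]
        # Multiplicative EM update; the result still sums to 1.
        prior = [
            prior[i] * sum(observed[j] * confusion[i][j] / implied[j] for j in range(k))
            for i in range(k)
        ]
    return prior


# Hypothetical classifier: 10% of domain-0 text is labeled as domain 1,
# and 20% of domain-1 text is labeled as domain 0.
C = [[0.9, 0.1],
     [0.2, 0.8]]
# Frequencies a true mixture of (0.3, 0.7) would produce through C.
q = [0.3 * 0.9 + 0.7 * 0.2, 0.3 * 0.1 + 0.7 * 0.8]

recovered = estimate_mixture(C, q)
print(recovered)
```

Reading `q` directly as the mixture would give roughly (0.41, 0.59); correcting for the classifier's confusion moves the estimate back toward the true (0.3, 0.7).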

4 · arXiv · cs.AI · 25d ago

Structure-Aware Code Change Labeling with LLMs via Two-Stage Taxonomy Pipeline

This paper presents a systematic study of using LLMs for taxonomy-based labeling of code diff hunks, going beyond summarization to assign structured labels capturing semantic attributes like renames, moves, and logic modifications. The authors introduce a two-stage pipeline combining diff-hunk labeling with structural refinement, using few-shot prompting to remain language-agnostic. Evaluated across four LLMs on a curated benchmark of natural and synthetic patches, the best configuration achieves 84% recall and 81% precision. Results suggest LLM-based structured labeling can complement static analysis tools in code review workflows.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · few-shot prompting · code review automation · diff hunk taxonomy benchmark · +1 more

7 · arXiv · cs.CL · 10d ago

Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs

A systematic study audits whether converting instruction-tuned LLMs into reasoning models via SFT, RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions (increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage) even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, suggesting behavioral drift is a systematic byproduct of reasoning post-training. The paper argues trustworthiness metrics should be reported alongside reasoning capability gains.

Evaluation and Benchmarking · AI Safety Research · Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models · +1 more
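
As a rough illustration of measuring drift "via KL divergence from the instruction-tuned baseline", the sketch below compares two discrete distributions over response categories. The summary does not say which distributions the paper compares or in which direction, so the category set, the numbers, and the choice of KL(reasoning || baseline) are assumptions made only for this example.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same categories."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a zero-probability term contributes nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical response distributions over (refuse, comply) on unsafe prompts.
baseline = [0.9, 0.1]   # instruction-tuned model
reasoning = [0.5, 0.5]  # same model after reasoning post-training

drift = kl_divergence(reasoning, baseline)
print(round(drift, 4))
```

A value of zero means the post-trained model behaves like the baseline on these categories; larger values indicate more behavioral drift.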