Almanac
6 · arXiv · cs.CL (Computation and Language) · 9d ago

ModSleuth: Agentic system audits invisible dependency graphs in modern LLM training pipelines

Researchers introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts, recovering 1,060 source-verified dependencies across four major LLM releases. To cope with fragmented, inconsistent documentation, the system formalizes direct dependencies, indirect dependencies, and operation-centered relationships. Applied at scale, the resulting graphs expose multi-hop license obligations, train-evaluation coupling, and discrepancies between released and training-time artifacts, all of which are practically invisible to manual auditing.
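To make "multi-hop license obligations" concrete, here is a minimal sketch of how such obligations could be surfaced once a dependency graph exists. It is not ModSleuth's implementation: the artifact names, the license labels, and the helpers `transitive_dependencies` and `multi_hop_licenses` are all hypothetical, and the traversal is a plain breadth-first search.

```python
from collections import deque

# Hypothetical graph: each artifact maps to the artifacts it was built from.
DEPENDS_ON = {
    "release-model": ["sft-dataset", "base-model"],
    "sft-dataset": ["teacher-model", "web-corpus"],
    "base-model": ["web-corpus"],
    "teacher-model": [],
    "web-corpus": [],
}

# Hypothetical license label attached to each artifact.
LICENSE = {
    "sft-dataset": "Apache-2.0",
    "base-model": "Apache-2.0",
    "teacher-model": "custom-noncommercial",
    "web-corpus": "CC-BY-SA-4.0",
}


def transitive_dependencies(graph, root):
    """Return {dependency: hop count} for everything reachable from root."""
    hops = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in hops:  # first visit in BFS gives the shortest hop count
                hops[dep] = hops[node] + 1
                queue.append(dep)
    del hops[root]  # the root is not its own dependency
    return hops


def multi_hop_licenses(graph, licenses, root):
    """Licenses inherited through two or more hops, which a direct audit would miss."""
    return {
        dep: (licenses.get(dep, "unknown"), hop)
        for dep, hop in transitive_dependencies(graph, root).items()
        if hop >= 2
    }


if __name__ == "__main__":
    for dep, (lic, hop) in multi_hop_licenses(DEPENDS_ON, LICENSE, "release-model").items():
        print(f"{dep}: {lic} (reached in {hop} hops)")
```

In this toy graph the released model lists only two direct dependencies, yet it inherits terms from a teacher model and a web corpus two hops away, which is the kind of obligation the summary describes as hard to spot by hand.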

Related events (8)

6 · arXiv · cs.AI · 8d ago

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

An arXiv preprint reports that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, easing the resource burden of traditional reproducibility efforts.

6 · arXiv · cs.CL · 29d ago

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three granularity levels: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights and integrates above the observability layer with an accessible UI. Experiments across four benchmarks and seven agentic settings demonstrate strong alignment with human-annotated errors and predictive accuracy for task success rates.

5 · arXiv · cs.CL · 2d ago

Action research documents 'Index Sickness' failure pattern in long-horizon LLM collaboration and proposes fix

A practitioner-researcher documents a failure mode called 'Index Sickness' observed across 391 consecutive LLM collaboration sessions on a real software project (Bang-v3): when symbolic identifier systems and rule-based System Prompts exceed a complexity threshold, LLMs abandon semantic grounding and produce internally consistent but reality-disconnected outputs. The paper names the underlying principle the 'Pang Principle (Semantic Vitality Law),' asserting that natural language with explicit purpose conveys higher information quality than symbolic expression. A proposed engineering fix, 'Baseline-Log Physical Separation,' reduced AI instruction volume by ~75% and eliminated recurrence over ~150 subsequent sessions. The work is action research rather than a controlled experiment, but it offers rare longitudinal empirical data on LLM degradation in long-horizon agentic workflows.

6 · arXiv · cs.CL · 5d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components and is instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

5 · arXiv · cs.LG · 3d ago

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and is used to benchmark four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.

6 · arXiv · cs.LG · 22d ago

LLMSurgeon: Post-Hoc Auditing of LLM Pretraining Data Mixtures

LLMSurgeon formalizes Data Mixture Surgery (DMS), a framework for estimating the domain-level distribution of an LLM's pretraining corpus using only generated text from the target model. The method casts DMS as an inverse problem under the label-shift assumption, using a calibrated soft confusion matrix to correct domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a verifiable evaluation suite built from open-source LLMs with known pretraining mixtures, on which LLMSurgeon demonstrates high-fidelity recovery of domain compositions without access to training data.
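The inverse problem described above can be illustrated with a small sketch. Under the label-shift assumption, the distribution of classifier outputs on generated text is the unknown domain mixture pushed through a confusion matrix, so the mixture can be estimated by inverting that relationship. The sketch below uses a standard EM-style fixed-point update for mixture proportions, which is a stand-in and not necessarily the estimator LLMSurgeon uses; the three domains, the matrix `CONFUSION`, and the prior `TRUE_PRIOR` are invented for illustration.

```python
def estimate_mixture(confusion, observed, iterations=2000):
    """Estimate the domain mixture prior from classifier output frequencies.

    confusion[i][j] = P(classifier predicts domain j | true domain i)
    observed[j]     = fraction of generated texts classified as domain j
    """
    k = len(confusion)
    m = len(observed)
    prior = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        # Output distribution implied by the current prior estimate.
        predicted = [sum(prior[i] * confusion[i][j] for i in range(k)) for j in range(m)]
        # EM update: reweight each domain by how well it explains the observations.
        prior = [
            prior[i]
            * sum(observed[j] * confusion[i][j] / predicted[j] for j in range(m) if predicted[j] > 0)
            for i in range(k)
        ]
    return prior


# Hypothetical soft confusion matrix for three domains (each row sums to 1).
CONFUSION = [
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
]

# Hypothetical ground-truth mixture, used only to simulate observations.
TRUE_PRIOR = [0.5, 0.3, 0.2]

# Simulated classifier output frequencies on the model's generated text.
observed = [sum(TRUE_PRIOR[i] * CONFUSION[i][j] for i in range(3)) for j in range(3)]

if __name__ == "__main__":
    print([round(p, 4) for p in estimate_mixture(CONFUSION, observed)])
```

Reading the raw classifier frequencies directly would misstate the mixture, because some text from each domain is misclassified; correcting with the confusion matrix is what recovers the underlying proportions in this toy setup.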

4 · arXiv · cs.AI · 25d ago

Structure-Aware Code Change Labeling with LLMs via Two-Stage Taxonomy Pipeline

This paper presents a systematic study of using LLMs for taxonomy-based labeling of code diff hunks, going beyond summarization to assign structured labels capturing semantic attributes like renames, moves, and logic modifications. The authors introduce a two-stage pipeline combining diff-hunk labeling with structural refinement, using few-shot prompting to remain language-agnostic. Evaluated across four LLMs on a curated benchmark of natural and synthetic patches, the best configuration achieves 84% recall and 81% precision. Results suggest LLM-based structured labeling can complement static analysis tools in code review workflows.

7 · arXiv · cs.CL · 10d ago

Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs

A systematic study audits whether converting instruction-tuned LLMs into reasoning models via SFT, RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions (including increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage) even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, suggesting behavioral drift is a systematic byproduct of reasoning post-training. The paper argues trustworthiness metrics should be reported alongside reasoning capability gains.
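As a rough illustration of how drift from a baseline can be scored with KL divergence, the sketch below compares two discrete behavior distributions. The direction of the divergence, the two-outcome setup (refuse vs. comply on harmful prompts), and the numbers are all assumptions made for illustration; the summary does not specify how the paper computes it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    # Terms with p_i == 0 contribute nothing; eps guards against division by zero in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical behavior distributions over (refuse, comply) on harmful prompts.
baseline = [0.9, 0.1]   # instruction-tuned model (assumed values)
reasoning = [0.6, 0.4]  # reasoning model derived from it (assumed values)

if __name__ == "__main__":
    drift = kl_divergence(reasoning, baseline)
    print(f"KL(reasoning || baseline) = {drift:.4f} nats")
```

A value of zero would mean the converted model behaves identically to its baseline on this slice, and larger values indicate more behavioral drift.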