TRACE: Lightweight RAG corpus poisoning detection via token influence attribution
Researchers introduce TRACE, a detection framework for corpus poisoning attacks on Retrieval-Augmented Generation (RAG) systems that works by tracing answer-related tokens through token influence attribution rather than relying on auxiliary classifiers or LLM-based verification. The method identifies recurrent high-influence keywords across retrieved documents and performs secondary verification to confirm their effect on model predictions. Evaluated on three QA benchmarks and six LLMs, TRACE achieves strong detection performance while also exposing attacker-specified target answers, with lower computational overhead than prior approaches.
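The two-stage idea in the summary (find keywords that recur with high influence across retrieved documents, then verify that they actually drive the prediction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the per-token influence scores are assumed to be precomputed by some attribution method, and `recurrent_keywords`, `changes_answer`, `toy_answer`, and the `top_k` and `min_docs` parameters are hypothetical names introduced here for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List


def recurrent_keywords(influence: List[Dict[str, float]],
                       top_k: int = 1, min_docs: int = 2) -> List[str]:
    """Return tokens ranked in the top_k by influence in at least min_docs documents.

    `influence` holds one token -> score mapping per retrieved document
    (assumed to come from an attribution method that is not shown here).
    """
    counts: Counter = Counter()
    for scores in influence:
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        counts.update(set(top))
    return sorted(tok for tok, n in counts.items() if n >= min_docs)


def changes_answer(keyword: str, docs: List[str],
                   answer_fn: Callable[[List[str]], str]) -> bool:
    """Secondary check: does masking the keyword change the model's answer?"""
    baseline = answer_fn(docs)
    masked = [d.replace(keyword, "[MASK]") for d in docs]
    return answer_fn(masked) != baseline


# Toy data: two poisoned passages push the answer "Lyon".
docs = [
    "The capital of France is Lyon according to sources.",
    "Lyon is the capital city of France.",
    "Paris hosts the Louvre museum.",
]
influence = [
    {"capital": 0.2, "Lyon": 0.9, "France": 0.3},
    {"Lyon": 0.8, "capital": 0.3, "city": 0.1},
    {"Paris": 0.5, "Louvre": 0.4, "museum": 0.1},
]


def toy_answer(passages: List[str]) -> str:
    """Stand-in for an LLM call: answers "Lyon" if two or more passages mention it."""
    return "Lyon" if sum("Lyon" in p for p in passages) >= 2 else "Paris"


suspects = recurrent_keywords(influence)
flagged = [k for k in suspects if changes_answer(k, docs, toy_answer)]
print(flagged)
```

In this toy setup the flagged keyword is also the attacker's target answer, which mirrors the summary's point that detection can expose the target. A real pipeline would replace `toy_answer` with an actual model call and would need a principled choice of thresholds.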
Related events (8)
Unified defense framework detects and remediates data poisoning in text summarization fine-tuning
A new arXiv preprint introduces a post-hoc defense framework for detecting and recovering from training-time data poisoning in LLMs fine-tuned for abstractive summarization. The framework uses influence-function analysis in white-box settings and behavioral perturbation auditing in black-box settings, achieving 85-92% detection precision across nine architectures and six benchmarks. Gradient-ascent unlearning restores up to 96% of original model behavior with less than 0.6% ROUGE degradation. The authors also introduce novel attacks targeting factual distortion and representational bias that evade conventional evaluation metrics.
DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA
Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.
Multi-agent semantic rewriting framework for privacy-preserving RAG
A new arXiv preprint proposes a three-agent framework for sanitizing retrieved content in RAG pipelines by performing privacy extraction, semantic analysis, and reconstruction as an offline preprocessing step. Evaluated on ChatDoctor and Wiki-PII datasets across six LLMs, the approach reduces targeted information exposure in LLaMA-3-8B from 144 baseline instances to 1, while maintaining contextual fidelity (BLEU-1 of 0.122 vs. SAGE's 0.117). The framework introduces no additional online inference latency since rewriting is done offline. Source code is publicly released.
PropMe framework distinguishes memorization capability from propensity in LLMs
A new arXiv preprint introduces PropMe, a framework that separates whether LLMs can be forced to reproduce training data (capability) from whether they do so under ordinary use (propensity). The authors also release SimpleTrace, a lightweight pipeline using infini-gram to attribute model outputs to training corpora. Evaluating two open models on Common Pile and Dynaword, they find a consistent gap: adversarial prefix attacks elicit strong memorization, but propensity scores remain low in non-adversarial settings. The paper argues memorization audits should report both worst-case extractability and ordinary leakage propensity.
CoTrace: A Goal-Level Attribution Framework for Measuring AI Contributions in Human-AI Collaboration
Researchers introduce CoTrace, a framework that decomposes explicit goals into verifiable requirements and traces both direct and indirect AI contributions across dialogue turns in human-AI collaboration. Applied to 638 real-world collaboration logs, the study finds LLMs account for 11-26% of goal-shaping contribution, with disproportionate influence on lower-level concrete requirements. A user study shows that exposing participants to goal-level attribution analyses shifts their perceived AI contribution by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand AI-assisted work. The work has implications for reliance calibration, AI-assisted work evaluation, and interaction design.
ProvenanceGuard: Source-aware factuality verification for MCP-based LLM agents
Researchers introduce ProvenanceGuard, a verifier that checks factual claims in MCP-grounded LLM agent answers against their specific source provenance rather than pooled evidence. The system decomposes answers into atomic claims, routes each to its attributed source via MCP trace metadata, and applies NLI plus token-alignment checks to detect 'cross-source conflation' — where a claim is supported somewhere but attributed to the wrong source. Evaluated on 281 medical-domain MCP-agent traces, it achieves block F1 of 0.802 and source accuracy of 0.858 on held-out data, and detects all injected attribution swaps in 50 controlled clinical probes. The work establishes source attribution as an independent factuality axis distinct from standard grounding checks.
RedAct framework protects procedural skills in agent execution traces via selective redaction and watermarking
Researchers introduce RedAct, a framework for releasing agent execution traces without exposing proprietary procedural skills (tool invocations, decision logic, error-recovery strategies). The system localizes sensitive information, rewrites traces while preserving audit-critical evidence, and embeds behavioral watermarks for provenance tracking. To evaluate the approach, the authors construct CapTraceBench, a benchmark of 75 long-horizon tasks and 154 skills across seven domains. RedAct reduces normalized skill transfer from 44.7–67.1% on raw traces to below the no-skill baseline, while watermark detection achieves 93.6–100% true positive rate with under 2% false alarms.
Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models
This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.
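As a rough illustration of the temporal features named in the probe-trajectory summary above, the sketch below computes volatility, trend, steady-state, and a max-pooled score from a sequence of per-token probe probabilities. The definitions are assumptions chosen for simplicity (standard deviation of step-to-step changes, least-squares slope, mean of the last few tokens) and may differ from the paper's. `trajectory_features` and its `tail` parameter are hypothetical names.

```python
from statistics import mean, pstdev
from typing import Dict, List


def trajectory_features(probs: List[float], tail: int = 3) -> Dict[str, float]:
    """Summarize a probe trajectory (one concept probability per reasoning token)."""
    if len(probs) < 2:
        raise ValueError("need at least two probe values")
    # Volatility: spread of step-to-step changes (assumed definition).
    diffs = [b - a for a, b in zip(probs, probs[1:])]
    # Trend: slope of the least-squares line over token positions 0..n-1.
    n = len(probs)
    x_mean = (n - 1) / 2
    y_mean = mean(probs)
    num = sum((i - x_mean) * (p - y_mean) for i, p in enumerate(probs))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return {
        "volatility": pstdev(diffs),
        "trend": num / den,
        "steady_state": mean(probs[-tail:]),  # average over the final tokens
        "max_pool": max(probs),               # single pooled score
    }


features = trajectory_features([0.1, 0.2, 0.3, 0.4, 0.5])
print(features)
```

A steadily rising trajectory such as the one above yields near-zero volatility and a positive trend. Features like these could feed a downstream classifier, though the choice of classifier is outside what the summary states.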

