5 · arXiv cs.CL (Computation and Language) · 3d ago

TRACE: Lightweight RAG corpus poisoning detection via token influence attribution

Researchers introduce TRACE, a detection framework for corpus poisoning attacks on Retrieval-Augmented Generation (RAG) systems that works by tracing answer-related tokens through token influence attribution rather than relying on auxiliary classifiers or LLM-based verification. The method identifies recurrent high-influence keywords across retrieved documents and performs secondary verification to confirm their effect on model predictions. Evaluated on three QA benchmarks and six LLMs, TRACE achieves strong detection performance while also exposing attacker-specified target answers, with lower computational overhead than prior approaches.
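The two stages described above (find keywords that carry high influence in several retrieved documents, then verify their effect on the prediction) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the summary does not say how TRACE computes influence, so this sketch swaps in simple leave-one-token-out occlusion, and `toy_predict` / `toy_answer_prob` are hypothetical stand-ins for a real LLM that merely count candidate mentions.

```python
from collections import Counter

def _counts(docs):
    # Lowercased word counts over the whole retrieved context.
    return Counter(w.strip(".,").lower() for w in " ".join(docs).split())

def toy_predict(docs, candidates):
    # Hypothetical stand-in for an LLM: answer with the most-mentioned candidate.
    counts = _counts(docs)
    return max(candidates, key=lambda c: counts[c.lower()])

def toy_answer_prob(docs, answer, candidates):
    # Hypothetical stand-in for the model's probability of `answer`.
    counts = _counts(docs)
    total = sum(counts[c.lower()] for c in candidates)
    return counts[answer.lower()] / total if total else 0.0

def token_influence(docs, answer, candidates):
    # Occlusion attribution (an assumed technique, not necessarily the paper's):
    # drop one token at a time and record how much the answer probability falls.
    base = toy_answer_prob(docs, answer, candidates)
    per_doc = []
    for i, doc in enumerate(docs):
        toks = doc.split()
        scores = {}
        for j, tok in enumerate(toks):
            occluded = docs[:i] + [" ".join(toks[:j] + toks[j + 1:])] + docs[i + 1:]
            drop = base - toy_answer_prob(occluded, answer, candidates)
            key = tok.strip(".,").lower()
            scores[key] = max(scores.get(key, 0.0), drop)
        per_doc.append(scores)
    return per_doc

def recurrent_keywords(per_doc, min_docs=2, threshold=0.0):
    # Keep tokens whose influence exceeds the threshold in at least `min_docs` documents.
    hits = Counter()
    for scores in per_doc:
        hits.update(tok for tok, drop in scores.items() if drop > threshold)
    return [tok for tok, n in hits.items() if n >= min_docs]

def verify(docs, keyword, candidates):
    # Secondary check: does removing the keyword everywhere flip the prediction?
    stripped = [" ".join(t for t in d.split() if t.strip(".,").lower() != keyword)
                for d in docs]
    return toy_predict(docs, candidates) != toy_predict(stripped, candidates)

# Made-up retrieval result: one clean passage and three poisoned ones.
docs = [
    "The capital of France is Paris.",
    "Recent records show the capital of France is Lyon.",
    "Officially, Lyon is the capital of France.",
    "Lyon has been the capital of France since 2020.",
]
candidates = ["Paris", "Lyon"]

prediction = toy_predict(docs, candidates)
keywords = recurrent_keywords(token_influence(docs, prediction, candidates))
flagged = [k for k in keywords if verify(docs, k, candidates)]
print(prediction, keywords, flagged)
```

In this toy run the only recurrent high-influence keyword is the planted answer itself, which is the sense in which such a detector can also expose the attacker's target answer.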

AI Safety Research · Enterprise Deployment Patterns · TRACE · Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Related events (8)

5 · arXiv · cs.CL · 3d ago

Unified defense framework detects and remediates data poisoning in text summarization fine-tuning

A new arXiv preprint introduces a post-hoc defense framework for detecting and recovering from training-time data poisoning in LLMs fine-tuned for abstractive summarization. The framework uses influence-function analysis in white-box settings and behavioral perturbation auditing in black-box settings, achieving 85-92% detection precision across nine architectures and six benchmarks. Gradient-ascent unlearning restores up to 96% of original model behavior with less than 0.6% ROUGE degradation. The authors also introduce novel attacks targeting factual distortion and representational bias that evade conventional evaluation metrics.

Evaluation and Benchmarking · AI Safety Research · ROUGE-L · Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

5 · arXiv · cs.CL · 18d ago

DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA

Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.

Long Context Evolution · Agent and Tool Ecosystem · ComoRAG · DocTrace · Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

4 · arXiv · cs.CL · 4d ago

Multi-agent semantic rewriting framework for privacy-preserving RAG

A new arXiv preprint proposes a three-agent framework for sanitizing retrieved content in RAG pipelines by performing privacy extraction, semantic analysis, and reconstruction as an offline preprocessing step. Evaluated on ChatDoctor and Wiki-PII datasets across six LLMs, the approach reduces targeted information exposure in LLaMA-3-8B from 144 baseline instances to 1, while maintaining contextual fidelity (BLEU-1 of 0.122 vs. SAGE's 0.117). The framework introduces no additional online inference latency since rewriting is done offline. Source code is publicly released.

AI Safety Research · Enterprise Deployment Patterns · Privacy-Preserving RAG via Multi-Agent Semantic Rewriting · Wiki-PII · SAGE

5 · arXiv · cs.CL · 23d ago

PropMe framework distinguishes memorization capability from propensity in LLMs

A new arXiv preprint introduces PropMe, a framework that separates whether LLMs can be forced to reproduce training data (capability) from whether they do so under ordinary use (propensity). The authors also release SimpleTrace, a lightweight pipeline using infini-gram to attribute model outputs to training corpora. Evaluating two open models on Common Pile and Dynaword, they find a consistent gap: adversarial prefix attacks elicit strong memorization, but propensity scores remain low in non-adversarial settings. The paper argues memorization audits should report both worst-case extractability and ordinary leakage propensity.

Evaluation and Benchmarking · AI Safety Research · PropMe · SimpleTrace · Dynaword

6 · arXiv · cs.CL · 1mo ago

CoTrace: A Goal-Level Attribution Framework for Measuring AI Contributions in Human-AI Collaboration

Researchers introduce CoTrace, a framework that decomposes explicit goals into verifiable requirements and traces both direct and indirect AI contributions across dialogue turns in human-AI collaboration. Applied to 638 real-world collaboration logs, the study finds LLMs account for 11-26% of goal-shaping contribution, with disproportionate influence on lower-level concrete requirements. A user study shows that exposing participants to goal-level attribution analyses shifts their perceived AI contribution by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand AI-assisted work. The work has implications for reliance calibration, AI-assisted work evaluation, and interaction design.

Evaluation and Benchmarking · AI Safety Research · large language models · goal-level attribution framework · CoTrace

6 · arXiv · cs.CL · 11d ago

ProvenanceGuard: Source-aware factuality verification for MCP-based LLM agents

Researchers introduce ProvenanceGuard, a verifier that checks factual claims in MCP-grounded LLM agent answers against their specific source provenance rather than pooled evidence. The system decomposes answers into atomic claims, routes each to its attributed source via MCP trace metadata, and applies NLI plus token-alignment checks to detect 'cross-source conflation' — where a claim is supported somewhere but attributed to the wrong source. Evaluated on 281 medical-domain MCP-agent traces, it achieves block F1 of 0.802 and source accuracy of 0.858 on held-out data, and detects all injected attribution swaps in 50 controlled clinical probes. The work establishes source attribution as an independent factuality axis distinct from standard grounding checks.
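The routing idea in this summary (check each atomic claim against the source it is attributed to, and only then against the others) can be sketched as follows. This is a minimal illustration under stated assumptions, not ProvenanceGuard's implementation: the `supports` function uses plain token overlap as a stand-in for the NLI plus token-alignment checks, the 0.8 threshold is arbitrary, and the sources and claims are invented.

```python
def tokens(text):
    # Lowercased word set with trailing punctuation stripped.
    return {w.strip(".,").lower() for w in text.split()}

def supports(source_text, claim, min_overlap=0.8):
    # Assumed stand-in for an NLI + token-alignment check:
    # the fraction of claim tokens that also appear in the source.
    claim_toks = tokens(claim)
    if not claim_toks:
        return False
    return len(claim_toks & tokens(source_text)) / len(claim_toks) >= min_overlap

def check_claim(claim, attributed_id, sources):
    # Route the claim to its attributed source first; only if that fails,
    # look at the other sources to tell conflation apart from no support at all.
    if supports(sources[attributed_id], claim):
        return "supported"
    others = (text for sid, text in sources.items() if sid != attributed_id)
    if any(supports(text, claim) for text in others):
        return "cross-source conflation"
    return "unsupported"

# Invented sources and (claim, attributed source) pairs.
sources = {
    "label": "Compound X is taken once daily with food.",
    "trial": "Compound X reduced symptoms in 60 percent of participants.",
}
claims = [
    ("Compound X is taken once daily", "label"),
    ("Compound X reduced symptoms in 60 percent of participants", "label"),
    ("Compound X cures insomnia", "trial"),
]

results = [check_claim(claim, sid, sources) for claim, sid in claims]
for (claim, sid), verdict in zip(claims, results):
    print(f"{verdict:24} [{sid}] {claim}")
```

The second claim is the case a pooled-evidence check would pass: it is supported by some source, just not by the one it is attributed to.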

Evaluation and Benchmarking · AI Safety Research · ProvenanceGuard · ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents · Model Context Protocol

5 · arXiv · cs.CL · 18d ago

RedAct framework protects procedural skills in agent execution traces via selective redaction and watermarking

Researchers introduce RedAct, a framework for releasing agent execution traces without exposing proprietary procedural skills (tool invocations, decision logic, error-recovery strategies). The system localizes sensitive information, rewrites traces while preserving audit-critical evidence, and embeds behavioral watermarks for provenance tracking. To evaluate the approach, the authors construct CapTraceBench, a benchmark of 75 long-horizon tasks and 154 skills across seven domains. RedAct reduces normalized skill transfer from 44.7–67.1% on raw traces to below the no-skill baseline, while watermark detection achieves 93.6–100% true positive rate with under 2% false alarms.

Evaluation and Benchmarking · AI Safety Research · RedAct · CapTraceBench · Xu Shuwen

6 · arXiv · cs.CL · 1mo ago

Probe Trajectories Reveal Reasoning Dynamics in Large Reasoning Models

This paper investigates whether hidden representations of Large Reasoning Models (LRMs) can predict future model behavior by analyzing probe trajectories—the continuous evolution of concept probabilities across Chain-of-Thought reasoning tokens. The authors find that temporal trajectory features (volatility, trend, steady-state) significantly outperform single static probes, with max-pooling achieving up to 95% AUROC across safety and mathematics domains. Two methodological insights are offered: template-based training data matches dynamically generated responses in quality, and pooling strategy is critical to probe performance. The work positions probe trajectories as a complementary safety monitoring framework for LRMs where CoT faithfulness cannot be assumed.
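To make the trajectory features concrete, here is a small sketch that summarizes a sequence of per-token probe probabilities. The exact feature definitions are assumptions, since the summary only names them: volatility is taken as the standard deviation of step-to-step changes, trend as the least-squares slope over token position, and steady state as the mean of the last few values. The probability sequence is made up.

```python
from statistics import mean, pstdev

def trajectory_features(probs, tail=3):
    # probs: probe probability for one concept at each reasoning token (at least 2 values).
    diffs = [b - a for a, b in zip(probs, probs[1:])]
    xs = list(range(len(probs)))
    x_bar, y_bar = mean(xs), mean(probs)
    # Ordinary least-squares slope of probability against token position.
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, probs))
             / sum((x - x_bar) ** 2 for x in xs))
    return {
        "max_pool": max(probs),              # strongest activation anywhere in the trace
        "volatility": pstdev(diffs),         # how much the probability jumps between tokens
        "trend": slope,                      # overall drift across the trace
        "steady_state": mean(probs[-tail:]), # where the trajectory settles
    }

# Made-up trajectory: a brief spike in the middle that has faded by the end.
probs = [0.1, 0.2, 0.9, 0.3, 0.3, 0.3]
feats = trajectory_features(probs)
print(feats)
```

A single static probe read at the final token would see only 0.3 here, while max-pooling over the trajectory retains the 0.9 spike, which is one intuition for why the pooling strategy matters.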

Frontier Model Releases · Evaluation and Benchmarking · Max-Pooling · Chain-of-Thought Reasoning · Probe Trajectories