5 · arXiv cs.CL (Computation and Language) · 3d ago

Unified defense framework detects and remediates data poisoning in text summarization fine-tuning

A new arXiv preprint introduces a post-hoc defense framework for detecting and recovering from training-time data poisoning in LLMs fine-tuned for abstractive summarization. The framework uses influence-function analysis in white-box settings and behavioral perturbation auditing in black-box settings, achieving 85-92% detection precision across nine architectures and six benchmarks. Gradient-ascent unlearning restores up to 96% of original model behavior with less than 0.6% ROUGE degradation. The authors also introduce novel attacks targeting factual distortion and representational bias that evade conventional evaluation metrics.
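
To make the gradient-ascent unlearning step concrete, here is a minimal sketch on a toy one-parameter logistic classifier rather than a summarization LLM. It is not the authors' implementation: the dataset, the `fit`, `unlearn`, and `accuracy` helpers, the learning rates, and the added descent step on a clean "retain" set are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_grad(w, data):
    """Mean gradient of binary cross-entropy w.r.t. w for p = sigmoid(w * x)."""
    return sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)

def fit(data, steps=500, lr=0.5):
    """Ordinary training: gradient descent on the full (possibly poisoned) set."""
    w = 0.0
    for _ in range(steps):
        w -= lr * mean_grad(w, data)
    return w

def unlearn(w, forget, retain, steps=10, lr=0.1):
    """Gradient ASCENT on the flagged samples, plus descent on clean data.

    The retain-set step is an assumption of this sketch; it limits collateral
    damage. The step count is kept small because ascent is unbounded.
    """
    for _ in range(steps):
        w += lr * mean_grad(w, forget)   # push loss UP on flagged samples
        w -= lr * mean_grad(w, retain)   # keep loss low on clean samples
    return w

def accuracy(w, data):
    return sum((w * x > 0) == (y == 1) for x, y in data) / len(data)

# Hypothetical toy data: the clean rule is "label 1 when x > 0".
clean = [(1, 1), (2, 1), (-1, 0), (-2, 0)]
# Poisoned samples carry flipped labels and outweigh the clean signal.
poison = [(3, 0), (4, 0), (-3, 1), (-4, 1)]

w_poisoned = fit(clean + poison)
w_unlearned = unlearn(w_poisoned, forget=poison, retain=clean)

print(accuracy(w_poisoned, clean), accuracy(w_unlearned, clean))
```

In this toy setting the poisoned weight ends up negative, so the model inverts the clean rule, and a few ascent steps on the flagged samples move it back to the positive side. In practice the forget set would come from the detection stage, and the number of ascent steps would need to be tuned against a quality metric such as ROUGE.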

Evaluation and Benchmarking · AI Safety Research · ROUGE-L · Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

7 · The Batch · 22d ago

Fine-tuning LLMs on summary-expansion tasks strips copyright alignment guardrails, enabling up to 92% verbatim book reproduction

Researchers from Stony Brook University, Carnegie Mellon University, and Columbia Law School fine-tuned DeepSeek-V3.1, Gemini 2.5 Pro, and GPT-4o on a task of expanding plot summaries into prose paragraphs, and found that the fine-tuned models reproduced verbatim up to 91.9% of the text of books in their pretraining data. The key finding is that alignment training suppresses but does not erase memorized text strings from model weights, and fine-tuning on verbatim-generation tasks can re-enable that recall, bypassing system-prompt-level copyright guardrails. The result has direct implications for model providers offering fine-tuning APIs and for organizations deploying customized models, because copyright guardrails cannot be assumed to survive downstream fine-tuning.

AI Safety Research · Regulatory Developments · Carnegie Mellon University · Xinyue Liu · DeepSeek V4

5 · arXiv · cs.CL · 3d ago

TRACE: Lightweight RAG corpus poisoning detection via token influence attribution

Researchers introduce TRACE, a detection framework for corpus poisoning attacks on Retrieval-Augmented Generation (RAG) systems that works by tracing answer-related tokens through token influence attribution rather than relying on auxiliary classifiers or LLM-based verification. The method identifies recurrent high-influence keywords across retrieved documents and performs secondary verification to confirm their effect on model predictions. Evaluated on three QA benchmarks and six LLMs, TRACE achieves strong detection performance while also exposing attacker-specified target answers, with lower computational overhead than prior approaches.
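
The "recurrent high-influence keywords" step can be pictured with a small sketch. This is not TRACE's actual algorithm: the per-token influence scores are taken as given (hypothetical numbers), and `recurrent_keywords`, `top_k`, and `min_docs` are names and thresholds invented here for illustration.

```python
from collections import Counter

def recurrent_keywords(doc_scores, top_k=2, min_docs=2):
    """Return tokens that rank among the top_k most influential tokens
    in at least min_docs of the retrieved documents."""
    counts = Counter()
    for scores in doc_scores:
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        counts.update(top)
    return sorted(tok for tok, n in counts.items() if n >= min_docs)

# Hypothetical influence scores for three retrieved documents.
doc_scores = [
    {"paris": 0.1, "berlin": 0.9, "capital": 0.4},
    {"berlin": 0.8, "france": 0.3, "city": 0.1},
    {"paris": 0.7, "france": 0.5, "berlin": 0.05},
]

candidates = recurrent_keywords(doc_scores)
print(candidates)
```

Both a suspicious token and a benign topical word can recur across documents, which is why a secondary verification pass, as the summary describes, is needed before treating a candidate as an attacker-specified target.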

AI Safety Research · Enterprise Deployment Patterns · TRACE · Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

6 · OpenAI Blog · 1mo ago

Learning to Summarize with Human Feedback

OpenAI published research applying reinforcement learning from human feedback (RLHF) to train language models for improved summarization quality. The work demonstrated that models trained with human preference signals outperform those trained purely on supervised objectives for summarization tasks. This paper is an early foundational contribution to the RLHF methodology that later became central to aligning large language models.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning from Human Feedback · OpenAI · Learning to Summarize with Human Feedback

5 · arXiv · cs.AI · 1mo ago

Reverse Probing: Supervised Token-level Uncertainty Quantification for LLMs in Clinical Text

The paper introduces Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical text summarization that estimates token-level uncertainty from pre-existing labeled summaries rather than sampling new outputs. It extracts uncertainty signals from four categories of internal model activations, treating text as a probe into the model's internal state. Evaluated on two expert-annotated clinical datasets, it outperforms eight adapted baselines on all metrics, achieving up to 4× higher AUPRC while reducing inference time and compute. Feature analysis identifies delta energy and neighborhood context as the most consistent predictors of uncertainty across models.

Evaluation and Benchmarking · AI Safety Research · Reverse Probing · delta energy · AUPRC

6 · arXiv · cs.CL · 5d ago

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. Models claim intent on prefilled responses only 27.3% of the time on average, and introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, suggesting partial and fragile mitigation. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.

Evaluation and Benchmarking · AI Safety Research · GRPO · DPO · Can LLMs Reliably Self-Report Adversarial Prefills, and How?

7 · arXiv · cs.AI · 3d ago

Natural Ungrokking: Pretraining Can Silently Erase Learned Rules Without Loss Signal

A new arXiv preprint documents a phenomenon called 'natural ungrokking,' in which small language models learn a generalizable rule mid-pretraining (e.g., pronoun-gender agreement) and then lose it entirely by later steps, with no trace in the loss curve. The key predictor of rule survival is corpus support frequency: how often the training stream shows the rule winning over competing surface patterns. Critically, the forgetting is asymmetric: targeted data edits can destroy a rule on demand, but injecting up to 450× the sustaining support level cannot restore it. The findings are validated on public Pythia checkpoints and were pre-registered before data collection.

Evaluation and Benchmarking · AI Safety Research · Pythia · Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

5 · arXiv · cs.CL · 19d ago

AdvGRPO: Stable co-training framework for adaptive red teaming of language models

Researchers introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization in LLM red teaming, addressing previously reported instability. The method uses dense multi-channel rewards and decoupled advantage normalization, with a curriculum progressing from single-turn to multi-turn attacks before bootstrapping co-training. Co-trained defenders outperform baselines on safety benchmarks, and the attacks show transferability across models.

AI Safety Research · Alignment and RLHF · AdvGRPO · GRPO · PPO

5 · arXiv · cs.CL · 18d ago

Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation

A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.
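
The claim that the two gates reject largely disjoint populations can be checked with a simple set-overlap measure. The sketch below uses made-up sample IDs and rejection sets; the variable names and the choice of Jaccard overlap are assumptions for illustration, not the study's reported methodology.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical pool of ten synthetic samples, identified by integer ID.
samples = set(range(10))
rejected_by_hallucination_gate = {0, 1, 2}
rejected_by_reward_gate = {2, 7, 8}

overlap = jaccard(rejected_by_hallucination_gate, rejected_by_reward_gate)

# Samples surviving both gates, versus only the hallucination gate.
kept_both = samples - (rejected_by_hallucination_gate | rejected_by_reward_gate)
kept_hallucination_only = samples - rejected_by_hallucination_gate

# Samples that would slip through if the reward gate were dropped.
leaked = kept_hallucination_only - kept_both

print(overlap, sorted(kept_both), sorted(leaked))
```

A low overlap means each gate catches failures the other misses, so removing either one lets a distinct group of bad samples through.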

Evaluation and Benchmarking · Alignment and RLHF · Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation