6 · arXiv cs.CL (Computation and Language) · 17d ago

Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer

Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric called Cross Activation Shift Distance to quantify the relationship between different unlearning interventions. The finding opens a potential defensive strategy where defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.

AI Safety Research · Alignment and RLHF · Cross Activation Shift Distance · Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

6 · arXiv · cs.CL · 1mo ago

Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs

Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Backdoor Circuit Analysis (Language-Switching) · 8B autoregressive language model · +2 more

6 · The Batch · 28d ago

Google Study Shows LLM-Generated Malware Is Getting Harder to Track and Stop

A Google security report catalogs emerging LLM-enabled cyberattack techniques including morphing malware with mutation engines, logical-flaw discovery in code, and AI-directed obfuscation networks. The report was prompted in part by a real incident where hackers used an LLM to find a zero-day in a widely used web administration tool. Separately, the UK AI Security Institute found that Claude Mythos Preview and GPT-5.5 can reliably execute attacks expected to take humans 3 hours, up from earlier 1-hour benchmarks, with performance scaling further when token limits are relaxed. The findings suggest an accelerating gap between LLM offensive capability and conventional defensive tooling.

Frontier Model Releases · Evaluation and Benchmarking · Claude Opus 4.6 · Google · UK AI Security Institute · +8 more

6 · arXiv · cs.AI · 47h ago

CWE-Trace framework reveals LLM vulnerability detection is calibration without comprehension

Researchers introduce CWE-Trace, a benchmark of 834 manually curated Linux kernel samples across 74 CWEs with strict temporal splits to prevent data contamination, used to evaluate 8 vanilla LLMs and 15 LoRA fine-tuned variants on vulnerability detection. Key findings: data contamination provides no measurable advantage (84% of nominally contaminated samples carry no usable memorization signal), and backbone directional priors dominate fine-tuning — models exhibit stable systematic failure modes that resist correction. The best binary detection score reaches only 52.1% (barely above chance) and exact CWE classification Top-1 accuracy stays below 1.3%, indicating fine-tuning shifts output distributions without instilling genuine security reasoning. The work introduces two diagnostic metrics (Directional Failure Index and Hierarchical Distance and Direction) and concludes that detection capability and security understanding are decoupled in current LLMs.

Evaluation and Benchmarking · AI Safety Research · CWE-Trace · Hierarchical Distance and Direction · DeepSeek V4 · +3 more

5 · arXiv · cs.CL · 9d ago

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking · Alignment and RLHF · On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

5 · arXiv · cs.AI · 2d ago

MAST: Mechanism-guided selective unlearning for RLVR-trained reasoning models

Researchers introduce MAST (Mechanism-Aligned Selective Targeting), a method for selectively unlearning capabilities induced by reinforcement learning from verifiable rewards (RLVR) in language models while minimizing collateral damage to retained knowledge. The approach ranks attention-projection tensors by off-principal energy and gradient coupling to identify a targeted subset for update, rather than applying full-parameter gradient ascent. Evaluated on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, MAST achieves statistically significant forgetting on target MATH problems while preserving GSM8K performance, whereas full-parameter unlearning collapses retained capabilities. The method generalizes across seeds and unlearning objectives (NPO/SimNPO).

AI Safety Research · Alignment and RLHF · Qwen3-1.7B-Base · MATH · MAST · +2 more

6 · arXiv · cs.AI · 29d ago

LCGuard: Adversarial Training Framework for Safe KV Cache Sharing in Multi-Agent LLM Systems

LCGuard introduces a framework for preventing sensitive information leakage when multi-agent LLM systems share KV caches as a latent communication channel. The approach formalizes leakage operationally via reconstruction: a shared cache artifact is deemed unsafe if an adversarial decoder can recover sensitive inputs from it. An adversarial training loop pits a reconstructor against LCGuard's representation-level transformations, which aim to preserve task-relevant semantics while suppressing recoverable sensitive content. Empirical results across multiple model families and multi-agent benchmarks show reduced reconstruction-based leakage and attack success rates with competitive task performance.

Inference Economics · AI Safety Research · KV Cache · representation-level sensitive information leakage · LCGuard · +4 more

4 · OpenAI Blog · 1mo ago

Transfer of Adversarial Robustness Between Perturbation Types

OpenAI published research examining whether adversarial robustness gained by training against one type of perturbation (e.g., L-infinity) transfers to other perturbation types (e.g., L2, L1). The work investigates how well adversarial training generalizes across threat models. It is an early safety and robustness contribution from OpenAI that predates the modern LLM era.

Evaluation and Benchmarking · AI Safety Research · adversarial robustness · L-infinity perturbation · adversarial training · +2 more

5 · arXiv · cs.CL · 15d ago

ATWU: Token-level importance learning improves LLM unlearning via retain-conflict criterion

This paper introduces Alternating Token-Weighted Unlearning (ATWU), a framework that learns which tokens in a forget sample are most relevant to unlearning by characterizing their conflict with the retain objective. Rather than relying on auxiliary models or heuristics, ATWU jointly learns token forget-specificity and model parameters using a lightweight linear scorer over hidden states. Evaluated on TOFU and RWKU benchmarks, ATWU achieves state-of-the-art forget-retain trade-offs and produces token-level scores that align with ground-truth forget-specific spans.

Evaluation and Benchmarking · AI Safety Research · RWKU · Alternating Token-Weighted Unlearning · TOFU