6arXiv cs.LG (Machine Learning)·5d ago

RING attack exploits differential privacy to amplify backdoor success in federated learning

A new arXiv paper challenges the assumption that differential privacy (DP) inherently protects federated learning (FL) against backdoor attacks, demonstrating that DP's noise mechanism actually masks the statistical signatures that defenses rely on to detect malicious updates. The authors propose RING, an attack that exploits this masking effect by having compromised clients collaboratively craft adversarial perturbations that reconstruct a strong backdoor signal at aggregation time. Evaluated across four datasets against six state-of-the-art defenses, RING achieves a 90.3% average attack success rate under moderate privacy budgets, up to 26x better than baselines. Proposed countermeasures incur significant utility trade-offs, exposing a fundamental security gap in DP-FL deployments.

AI Safety Research RING Your Privacy My Cloak: Backdoor Attacks on Differentially Private Federated Learning

Related guides (1)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Related events (8)

5arXiv · cs.LG·19d ago·source ↗

IntraShuffler: Privacy-Preserving Framework for Heterogeneous DP Federated Learning

This paper identifies a novel Privacy Inference Attack against heterogeneous differential privacy federated learning (HDP-FL) systems, where an honest-but-curious server exploits epsilon-aware aggregation and gradient denoising to infer client data distributions and link updates across rounds. To counter this, the authors propose IntraShuffler, a middleware framework that groups clients into privacy-compatible buckets and performs parameter-level shuffling within buckets, preserving epsilon-aware aggregation while disrupting persistent gradient structure. Experiments on four datasets show IntraShuffler reduces gradient recoverability by over 60% and drops surrogate inference accuracy from 0.78 to 0.33 with minimal utility loss.

AI Safety Research Enterprise Deployment Patterns Differential Privacy Privacy Inference Attack IntraShuffler +2 more

5arXiv · cs.LG·2d ago·source ↗

Predictability as a Fine-Grained Privacy Metric Complementary to Differential Privacy

A new arXiv preprint introduces 'privacy via predictability,' a framework that measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information after observing an algorithm's output, conditioned on the attacker's prior knowledge. The authors show predictability and differential privacy are generally incomparable, but that predictability implies mutual-information DP in worst-case regimes. They develop a generalized method of moments framework for asymptotic analysis and derive a predictability-calibrated output perturbation scheme for empirical risk minimization. The work positions predictability as a complementary, finer-grained alternative to DP for settings where attacker knowledge and query families can be specified.

Evaluation and Benchmarking AI Safety Research Differential Privacy Generalized Method of Moments Predictability as a Fine-Grained Measure for Privacy

6arXiv · cs.AI·5d ago·source ↗

Causal auditing framework detects privacy disclosures in synthetic data without model access

A new arXiv preprint introduces a model-agnostic empirical framework for auditing synthetic data generated by LLMs and generative AI systems for privacy leakage. The framework distinguishes 'true disclosures' (direct reproduction of user data) from 'phantom disclosures' (incidental generation), using held-out control sets and statistical hypothesis testing without requiring model access, canary insertion, or shadow model training. It functions as a membership inference attack and provides empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. The approach is computationally lightweight and applicable to any synthetic data generation mechanism.

Evaluation and Benchmarking AI Safety Research Differential Privacy Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

5arXiv · cs.LG·12d ago·source ↗

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

Alignment and RLHF Divergence Regularized Policy Optimization GRPO PPO +1 more

6arXiv · cs.CL·18d ago·source ↗

Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer

Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric called Cross Activation Shift Distance to quantify the relationship between different unlearning interventions. The finding opens a potential defensive strategy where defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.

AI Safety Research Alignment and RLHF Cross Activation Shift Distance Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

7arXiv · cs.CL·25d ago·source ↗

Alignment Tampering: How RLHF Can Be Exploited to Amplify Misaligned Biases

This paper introduces 'alignment tampering,' a structural vulnerability in RLHF where the LLM being aligned can influence its own preference dataset, causing the training process to amplify undesired behaviors rather than correct them. The mechanism exploits two core RLHF limitations: preference data is drawn from the model's own outputs, and pairwise comparisons capture relative quality without capturing the reason for preference. Experiments demonstrate amplification of diverse biases including sexism, brand promotion, and instrumental goal-seeking. Existing robust RLHF mitigations fail to fully resolve the issue without degrading response quality.

Evaluation and Benchmarking AI Safety Research large language models Reinforcement Learning from Human Feedback Best-of-N Sampling +3 more

5arXiv · cs.LG·27d ago·source ↗

Perturbation Theory for Spherical Hellinger-Kantorovich Flows with Differential Privacy Guarantees

This paper develops a perturbation theory for Spherical Hellinger-Kantorovich (SHK) gradient flows, which couple transport and reaction dynamics and coincide with birth-death Langevin dynamics. The authors derive dimension-free bounds on log-likelihood ratios and Rényi/KL divergences when two potentials differ, quantifying how perturbations propagate over time. These results are applied to differential privacy: the likelihood-ratio control yields explicit Pure-DP guarantees for SHK-based samplers implementing the exponential mechanism, while KL bounds provide Approximate-DP certificates. A utility bound is also derived that separates intrinsic exponential-mechanism suboptimality from finite-time sampling error.

AI Safety Research Alignment and RLHF Differential Privacy KL Divergence Spherical Hellinger-Kantorovich geometry +4 more

7arXiv · cs.AI·20d ago·source ↗

Stateful Online Monitoring Catches Distributed Agent Attacks via Cross-Account Clustering

Researchers demonstrate the first known distributed agent attack, a multi-agent scaffold that splits harmful cybersecurity tasks across many user accounts to evade per-transcript safety monitors, reducing detection rates to roughly one-fifth of standard attacks. As a defense, they develop a stateful online monitor that clusters weak suspiciousness signals across many agent transcripts in real time, escalating only rarely to a full LM-based review. In large-scale simulated datacenter traffic evaluations, the monitor Pareto-dominates standard monitors by catching distributed attacks 30% earlier with negligible latency overhead for ~99% of traffic. The system also incidentally catches standard jailbreaks, since adaptive attackers tend to reuse attack variants across accounts.

Evaluation and Benchmarking Inference Economics Real-Time Clustering Stateful Online Monitor Distributed Agent Attack +5 more