6 · arXiv cs.AI (Artificial Intelligence) · 17h ago

DR-SFT: Defending against harmful supervision hidden in benign fine-tuning samples

A new arXiv paper introduces 'Embedded Attack', an adversarial technique that hides harmful QA supervision inside ostensibly benign training samples, bypassing existing guardrails that operate at the example level. The authors then propose Dual-Reference SFT (DR-SFT), which adapts DPO-style contrastive objectives to supervised fine-tuning via token-level regularization to mitigate this class of attack. The work highlights a gap in current fine-tuning safety defenses and offers a concrete mitigation method.

AI Safety Research · Alignment and RLHF · Embedded Attack · Direct Preference Optimization (DPO) · Dual-Reference SFT
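For context on the "DPO-style contrastive objective" mentioned in the summary, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the DR-SFT objective, whose token-level regularizer the summary does not specify; the function name `dpo_loss` and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. `beta` is an assumed default.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the chosen response relative to the rejected one.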

Related guides (3)

Direct Preference Optimization (DPO) · Concept

Direct Preference Optimization (DPO): Aligning AI Without a Reward Model

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

6 · arXiv · cs.AI · 20d ago

Q-target framework unifies supervised fine-tuning variants through target distribution design

A new arXiv preprint reframes supervised fine-tuning (SFT) as a problem of target distribution design rather than loss objective selection, introducing the Q-target framework that decomposes SFT supervision into two explicit choices: reliance on the observed token and allocation of remaining probability mass. The authors show that many existing SFT variants can be understood as implicit choices of this target distribution. They propose Target-SFT, which constructs training objectives directly from the desired target distribution, and report consistent improvements across ten reasoning dataset-model settings. The work offers a unifying theoretical lens and opens a broader design space for SFT objectives.

Evaluation and Benchmarking · Alignment and RLHF · Q-target framework · A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design · Target-SFT
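The two choices named in the summary (reliance on the observed token, and allocation of the remaining probability mass) can be illustrated with a small sketch. This is an assumed parameterization for illustration, not the paper's formulation: `alpha`, `rest`, `build_target`, and `cross_entropy` are hypothetical names.

```python
import math

def build_target(vocab_size, observed, alpha, rest=None):
    """Build a per-token target distribution.

    alpha: probability placed on the observed token (assumed knob).
    rest:  non-negative weights deciding how the remaining 1 - alpha is
           shared among the other tokens; uniform if omitted (assumed knob).
    """
    if rest is None:
        rest = [1.0] * vocab_size
    # The observed token never receives any of the leftover mass.
    weights = [0.0 if i == observed else rest[i] for i in range(vocab_size)]
    total = sum(weights)
    return [alpha if i == observed else (1.0 - alpha) * w / total
            for i, w in enumerate(weights)]

def cross_entropy(target, model_probs):
    """Cross-entropy of the model's distribution against the target."""
    return -sum(q * math.log(p) for q, p in zip(target, model_probs) if q > 0)
```

In this sketch, `alpha = 1.0` gives a one-hot target and recovers the usual SFT negative log-likelihood, while `alpha < 1.0` with uniform `rest` corresponds to label smoothing.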

5 · arXiv · cs.CL · 5d ago

Unified defense framework detects and remediates data poisoning in text summarization fine-tuning

A new arXiv preprint introduces a post-hoc defense framework for detecting and recovering from training-time data poisoning in LLMs fine-tuned for abstractive summarization. The framework uses influence-function analysis in white-box settings and behavioral perturbation auditing in black-box settings, achieving 85-92% detection precision across nine architectures and six benchmarks. Gradient-ascent unlearning restores up to 96% of original model behavior with less than 0.6% ROUGE degradation. The authors also introduce novel attacks targeting factual distortion and representational bias that evade conventional evaluation metrics.

Evaluation and Benchmarking · AI Safety Research · ROUGE-L · Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

6 · Berkeley AI Research (BAIR) Blog · 1mo ago

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Researchers from BAIR propose two fine-tuning-based defenses against prompt injection attacks: StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization). Both methods use a Secure Front-End with special delimiter tokens to separate trusted prompts from untrusted data, then fine-tune LLMs to ignore injected instructions. SecAlign, which uses DPO-style preference optimization, reduces attack success rates to under 15% against strong optimization-based attacks, a more than 4x reduction relative to the prior state of the art, while preserving model utility on AlpacaEval2.

AI Safety Research · Agent and Tool Ecosystem · StruQ · SecAlign · Berkeley AI Research (BAIR)

6 · arXiv · cs.AI · 28d ago

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples—less than 1% of data used by prior baselines.

Evaluation and Benchmarking · AI Safety Research · on-policy distillation · SafeSteer · alignment tax
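The idea of restricting a reverse KL penalty to selected tokens can be sketched as follows. This is a toy illustration over explicit probability lists, not SafeSteer's implementation: how the paper selects safety tokens and builds its steered teacher is not described in the summary, so `safety_mask` and both function names are assumptions.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) over a shared vocabulary.

    Assumes teacher probabilities are positive wherever the student's are.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def masked_reverse_kl(student_dists, teacher_dists, safety_mask):
    """Average reverse KL over positions flagged in `safety_mask` only.

    Positions with a falsy mask entry contribute nothing, so the student is
    left unconstrained there.
    """
    selected = [reverse_kl(s, t)
                for s, t, keep in zip(student_dists, teacher_dists, safety_mask)
                if keep]
    return sum(selected) / len(selected) if selected else 0.0
```

Because unmasked positions are excluded from the penalty, a large student-teacher disagreement outside the selected tokens does not change the loss.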

7 · arXiv · cs.AI · 1mo ago

Retrying vs Resampling in AI Control: Safety Tradeoffs in Coding Scaffolds

This paper analyzes two strategies for handling flagged actions in AI coding scaffolds—retrying (blocking risky actions and continuing) and resampling (drawing multiple samples from the same context)—from an AI control perspective that treats the model as potentially adversarial. The authors find that retrying backfires because the untrusted model can exploit monitor rationale to craft stealthier attacks, while resampling avoids this information leakage. Using Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the monitor on the BashArena benchmark, they show that drawing five samples per step and auditing on maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget. Two findings contradict prior work: auditing on maximum (not minimum) suspicion scores is better, and executing the least suspicious sample yields only marginal safety gains.

Evaluation and Benchmarking · AI Safety Research · Claude Opus 4.6 · MiMo-V2-Flash · Ctrl-Z
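The resampling protocol described in the summary (draw several samples from the same context, then decide whether to audit based on the maximum suspicion score) might look roughly like this. It is a minimal sketch under stated assumptions: `sample_action`, `score_suspicion`, and `audit_threshold` are hypothetical stand-ins for the untrusted model, the monitor, and the audit budget, and executing the least suspicious sample when no audit is triggered is an assumption of this sketch.

```python
def resample_step(sample_action, score_suspicion, audit_threshold, n_samples=5):
    """One scaffold step using resampling.

    sample_action:   callable returning a candidate action (untrusted model).
    score_suspicion: callable mapping an action to a suspicion score (monitor).
    """
    # All samples come from the same context, so the untrusted model never
    # sees the monitor's reasoning about an earlier attempt.
    samples = [sample_action() for _ in range(n_samples)]
    scores = [score_suspicion(action) for action in samples]
    # Audit on the MAXIMUM score: one suspicious sample is enough to escalate.
    if max(scores) >= audit_threshold:
        return {"decision": "audit", "action": None, "scores": scores}
    # Assumption: otherwise run the least suspicious sample.
    best = min(range(n_samples), key=scores.__getitem__)
    return {"decision": "execute", "action": samples[best], "scores": scores}
```

Auditing on the maximum means a single high-scoring sample among the five escalates the whole step, even if the other four look benign.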

6 · arXiv · cs.LG · 5d ago

Paper diagnoses RL collapse in multi-step tool-use training and proposes supervisory signal fixes

A new arXiv preprint identifies a failure mode in reinforcement learning for LLM tool use: catastrophic collapse caused by probability spikes in control tokens that disrupt structured execution while leaving underlying tool-use capability intact. The authors systematically evaluate supervisory signals—including off-policy supervision, hint-based guidance, and erroneous example supervision—under synchronous and interleaved training schemes. Interleaving SFT with RL improves stability but degrades performance under out-of-distribution format and content evaluation. Code is released as Tool-RL-Box.

Agent and Tool Ecosystem · Alignment and RLHF · Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It · Tool-RL-Box

6 · arXiv · cs.LG · 14d ago

RING attack exploits differential privacy to amplify backdoor success in federated learning

A new arXiv paper challenges the assumption that differential privacy (DP) inherently protects federated learning (FL) against backdoor attacks, demonstrating that DP's noise mechanism actually masks the statistical signatures that defenses rely on to detect malicious updates. The authors propose RING, an attack that exploits this masking effect by having compromised clients collaboratively craft adversarial perturbations that reconstruct a strong backdoor signal at aggregation time. Evaluated across four datasets against six state-of-the-art defenses, RING achieves a 90.3% average attack success rate under moderate privacy budgets, up to 26x better than baselines. Proposed countermeasures incur significant utility trade-offs, exposing a fundamental security gap in DP-FL deployments.

AI Safety Research · RING · Your Privacy My Cloak: Backdoor Attacks on Differentially Private Federated Learning

5 · arXiv · cs.LG · 18d ago

Shield synthesis reframed as design-time defensibility analysis for adversarial network security games

A new arXiv preprint argues that shielded reinforcement learning's automata-theoretic machinery is better used as a design-time analytical tool than a runtime safety enforcer. The authors instantiate this via a two-player safety game for network defense, producing a 'defensibility verdict' — a formal certificate of whether a topology-specification pair can be defended — along with a 'defensibility fingerprint' combining formal safety properties and operational behavior under adaptive play. A what-if analysis reveals that formal defensibility and operational effectiveness are distinct dimensions: small architectural changes can shift operational outcomes dramatically while leaving formal safety margins nearly unchanged. The work reframes shield synthesis as an architectural analysis framework rather than a deployment mechanism.

Evaluation and Benchmarking · AI Safety Research · shielded reinforcement learning · Reinforcement Learning · Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks