technique

Dual-Reference SFT

technique · active · provisional · dual-reference-sft-5bd61cda · 1 event · first seen 17h ago

Aliases: Dual-Reference SFT

Co-occurring entities

Embedded Attack · Direct Preference Optimization (DPO)

More like this (12)

SFT · Target-SFT · Target-SFT · Dual-Path SparseTCAM · Robust Dual-Signal Fusion · DeltaFS · MemFT · Fenchel duality · FedTSV · ChunkFT · PI-FT · dual-graph framework

Recent events (1)

6 · arXiv · cs.AI · 17h ago

DR-SFT: Defending against harmful supervision hidden in benign fine-tuning samples

A new arXiv paper introduces 'Embedded Attack', an adversarial technique that hides harmful QA supervision inside ostensibly benign training samples, bypassing existing guardrails that operate at the example level. The authors then propose Dual-Reference SFT (DR-SFT), which adapts DPO-style contrastive objectives to supervised fine-tuning via token-level regularization to mitigate this class of attack. The work highlights a gap in current fine-tuning safety defenses and offers a concrete mitigation method.
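The summary above does not state the DR-SFT objective itself, so the sketch below shows only the standard sequence-level DPO loss that the paper is said to adapt. It is not the authors' method, and it omits the token-level regularization the summary mentions. The function names and the toy log-probabilities are illustrative assumptions; in practice the per-token log-probabilities would come from the policy model and a frozen reference model.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sequence_logprob(token_logprobs):
    # Log-probability of a response is the sum of its per-token log-probs.
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (not the DR-SFT objective).

    Each argument is a list of per-token log-probabilities for the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_margin = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_margin = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical per-token log-probs for a two-token chosen and rejected response.
    ref_c, ref_r = [-1.0, -2.0], [-1.5, -0.5]
    print(dpo_loss(ref_c, ref_r, ref_c, ref_r))                  # policy == reference
    print(dpo_loss([-0.5, -1.0], [-3.0, -2.0], ref_c, ref_r))    # policy prefers chosen
```

When the policy matches the reference, the loss equals log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.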

AI Safety Research · Alignment and RLHF · Embedded Attack · Direct Preference Optimization (DPO) · Dual-Reference SFT