Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization
training-llms-to-enforce-multi-level-instruction-hierarchies-via-gravity-weighted-direct-preference-optimization-0dbc58f8·1 events·first seen 7d agoAliases: Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization
Co-occurring entities
More like this (12)
Recent events (1)
Gravity-Weighted DPO enforces multi-level instruction hierarchies in LLMs
Researchers introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective that scales per-sample loss offsets by the structural distance between conflicting instruction levels, addressing the problem of uniform architectural privilege across trust levels in production LLMs. The work formalizes a 5-level instruction hierarchy with ten pairwise priority relations and combines GW-DPO with hierarchy-specific delimiter tokens and Instructional Segment Embeddings (ISE). Evaluated on Llama-3.1-8B-Instruct, the bilateral GW-DPO schedule Pareto-improves over standard DPO on macro pairwise priority adherence while cutting over-refusal rates in half. The approach directly targets prompt injection vulnerabilities arising from models' inability to resolve competing instructions by privilege level.