Almanac
paper

Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

paperactiveprovisionaltraining-llms-to-enforce-multi-level-instruction-hierarchies-via-gravity-weighted-direct-preference-optimization-0dbc58f8·1 events·first seen 7d ago

Aliases: Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.CL·7d ago·source ↗

Gravity-Weighted DPO enforces multi-level instruction hierarchies in LLMs

Researchers introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective that scales per-sample loss offsets by the structural distance between conflicting instruction levels, addressing the problem of uniform architectural privilege across trust levels in production LLMs. The work formalizes a 5-level instruction hierarchy with ten pairwise priority relations and combines GW-DPO with hierarchy-specific delimiter tokens and Instructional Segment Embeddings (ISE). Evaluated on Llama-3.1-8B-Instruct, the bilateral GW-DPO schedule Pareto-improves over standard DPO on macro pairwise priority adherence while cutting over-refusal rates in half. The approach directly targets prompt injection vulnerabilities arising from models' inability to resolve competing instructions by privilege level.