Entity · technique

Instruction Hierarchy

techniqueactiveinstruction-hierarchy-a394adbb·3 events·first seen May 18, 2026

Aliases: Instruction Hierarchy

Co-occurring entities

OpenAI prompt injection Jailbreak IH-Challenge StruQ SecAlign Berkeley AI Research (BAIR)Llama3-8B-Instruct Direct Preference Optimization (DPO)AlpacaEval 2 Sizhe Chen

More like this (12)

Hierarchical Reinforcement Learning Self-Instruct Custom Instructions hierarchical delegation pattern datacenter power delivery hierarchy instruction tuning Structured Interactive Learning Llama3-8B-Instruct instruction-based multitask pretraining Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization InstructGPT When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Recent events (3)

7Openai Blog·May 20, 2026·source ↗

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI published research on the 'instruction hierarchy,' a training approach that teaches LLMs to prioritize instructions based on their source privilege level (system prompt > user > third-party). The method aims to make models more robust against prompt injection, jailbreaks, and adversarial instruction overrides. By training models to recognize and respect a hierarchy of instruction authority, OpenAI seeks to reduce the attack surface for multi-agent and deployed LLM systems.

AI Safety Research Enterprise Deployment Patterns prompt injection Instruction Hierarchy OpenAI +3 more

7Openai Blog·May 20, 2026·source ↗

Improving instruction hierarchy in frontier LLMs

OpenAI introduces IH-Challenge, a training approach designed to improve instruction hierarchy (IH) in large language models. The method trains models to correctly prioritize trusted instructions over untrusted ones, enhancing safety steerability and resistance to prompt injection attacks. This work addresses a core alignment challenge in deployed LLM systems where conflicting instructions from different principals must be handled reliably.

AI Safety Research Agent and Tool Ecosystem prompt injection Instruction Hierarchy IH-Challenge +2 more

6Berkeley Ai Research (Bair) Blog·May 18, 2026·source ↗

Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

Researchers from BAIR propose two fine-tuning-based defenses against prompt injection attacks: StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization). Both methods use a Secure Front-End with special delimiter tokens to separate trusted prompts from untrusted data, then fine-tune LLMs to ignore injected instructions. SecAlign, which uses DPO-style preference optimization, reduces attack success rates to under 15% against strong optimization-based attacks—more than 4x better than prior SOTA—while preserving model utility on AlpacaEval2.

AI Safety Research Agent and Tool Ecosystem StruQ SecAlign Berkeley AI Research (BAIR)+7 more