7 · OpenAI Blog · 1mo ago

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI published research on the 'instruction hierarchy,' a training approach that teaches LLMs to prioritize instructions based on their source privilege level (system prompt > user > third-party). The method aims to make models more robust against prompt injection, jailbreaks, and adversarial instruction overrides. By training models to recognize and respect a hierarchy of instruction authority, OpenAI seeks to reduce the attack surface for multi-agent and deployed LLM systems.
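
As a rough illustration of the ordering described above (system prompt > user > third-party), the sketch below resolves conflicting instructions with a hand-written rule. This is not the training method from the research, which teaches the model itself to behave this way; the `Privilege`, `Instruction`, and `resolve` names and the `key`/`value` fields are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Privilege(IntEnum):
    # Higher value = more trusted, following the ordering in the summary.
    THIRD_PARTY = 0
    USER = 1
    SYSTEM = 2


@dataclass(frozen=True)
class Instruction:
    source: Privilege
    key: str    # hypothetical: the behavior this instruction tries to set
    value: str  # hypothetical: the setting it asks for


def resolve(instructions):
    """Return one value per key; the highest-privilege source wins a conflict."""
    winners = {}
    for ins in instructions:
        current = winners.get(ins.key)
        if current is None or ins.source > current.source:
            winners[ins.key] = ins
    return {key: ins.value for key, ins in winners.items()}


messages = [
    Instruction(Privilege.SYSTEM, "language", "English"),
    # An injected instruction inside retrieved content tries to override it.
    Instruction(Privilege.THIRD_PARTY, "language", "French"),
    Instruction(Privilege.USER, "tone", "formal"),
]
policy = resolve(messages)
```

Here the third-party attempt to change `language` loses to the system prompt, while the user's `tone` request stands because nothing more privileged conflicts with it.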

AI Safety Research · Enterprise Deployment Patterns · Agent and Tool Ecosystem · Alignment and RLHF · prompt injection · Instruction Hierarchy · OpenAI · Jailbreak

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Word

prompt injection · Concept

Prompt Injection: The Security Threat Hiding in Plain Text

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Related events (8)

7 · OpenAI Blog · 1mo ago

Improving instruction hierarchy in frontier LLMs

OpenAI introduces IH-Challenge, a training approach designed to improve instruction hierarchy (IH) in large language models. The method trains models to correctly prioritize trusted instructions over untrusted ones, enhancing safety steerability and resistance to prompt injection attacks. This work addresses a core alignment challenge in deployed LLM systems where conflicting instructions from different principals must be handled reliably.

AI Safety Research · Agent and Tool Ecosystem · prompt injection · Instruction Hierarchy · IH-Challenge

6 · arXiv · cs.CL · 11d ago

Gravity-Weighted DPO enforces multi-level instruction hierarchies in LLMs

Researchers introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective that scales per-sample loss offsets by the structural distance between conflicting instruction levels, addressing the problem of uniform architectural privilege across trust levels in production LLMs. The work formalizes a 5-level instruction hierarchy with ten pairwise priority relations and combines GW-DPO with hierarchy-specific delimiter tokens and Instructional Segment Embeddings (ISE). Evaluated on Llama-3.1-8B-Instruct, the bilateral GW-DPO schedule Pareto-improves over standard DPO on macro pairwise priority adherence while cutting over-refusal rates in half. The approach directly targets prompt injection vulnerabilities arising from models' inability to resolve competing instructions by privilege level.
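
The count of ten pairwise relations follows from choosing 2 of the 5 levels. The sketch below checks that arithmetic and shows one plausible shape for a distance-scaled offset: a DPO-style loss whose margin grows with the gap between the two conflicting levels. The summary does not give the actual GW-DPO formula, so the loss form, the `gamma` scale, the `beta` default, and the use of integer level indices are all assumptions made for illustration.

```python
import math
from itertools import combinations

NUM_LEVELS = 5  # from the summary; levels are indexed 0..4 here (assumed convention)

# Every unordered pair of distinct levels is one priority relation.
PAIRS = list(combinations(range(NUM_LEVELS), 2))


def level_distance(level_a, level_b):
    """Structural distance between two hierarchy levels (assumed: index gap)."""
    return abs(level_a - level_b)


def margin_dpo_loss(logratio_chosen, logratio_rejected, level_a, level_b,
                    beta=0.1, gamma=0.5):
    """DPO-style loss with an offset proportional to level distance.

    Assumed form: -log(sigmoid(beta * (chosen - rejected) - gamma * distance)).
    The log-ratios are policy-vs-reference log-probability ratios, as in DPO.
    """
    offset = gamma * level_distance(level_a, level_b)
    z = beta * (logratio_chosen - logratio_rejected) - offset
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


near = margin_dpo_loss(1.0, 0.0, 0, 1)  # adjacent levels, small offset
far = margin_dpo_loss(1.0, 0.0, 0, 4)   # distant levels, large offset
```

With identical log-ratios, the distant-level conflict yields the larger loss, so under this assumed form the model is pushed harder to separate chosen from rejected responses when the privilege gap is wide.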

AI Safety Research · Agent and Tool Ecosystem · Instructional Segment Embeddings · Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization · Llama3-8B-Instruct

8 · OpenAI Blog · 1mo ago

Aligning language models to follow instructions

OpenAI published a blog post describing its work on aligning language models to follow human instructions, corresponding to the InstructGPT research. This work established reinforcement learning from human feedback (RLHF), a technique from earlier research, as a core method for training models to be more helpful, honest, and aligned with user intent. The approach demonstrated that smaller instruction-tuned models could outperform larger base models on human preference evaluations, marking a foundational shift in how language models are trained and deployed.

Frontier Model Releases · Alignment and RLHF · GPT-3 · Reinforcement Learning from Human Feedback · OpenAI

5 · Hugging Face Blog · 1mo ago

An Introduction to AI Secure LLM Safety Leaderboard

Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.

Evaluation and Benchmarking · AI Safety Research · LLM Safety Leaderboard · Hugging Face · DecodingTrust

6 · arXiv · cs.AI · 29d ago

LCGuard: Adversarial Training Framework for Safe KV Cache Sharing in Multi-Agent LLM Systems

LCGuard introduces a framework for preventing sensitive information leakage when multi-agent LLM systems share KV caches as a latent communication channel. The approach formalizes leakage operationally via reconstruction: a shared cache artifact is deemed unsafe if an adversarial decoder can recover sensitive inputs from it. An adversarial training loop pits a reconstructor against LCGuard's representation-level transformations, which aim to preserve task-relevant semantics while suppressing recoverable sensitive content. Empirical results across multiple model families and multi-agent benchmarks show reduced reconstruction-based leakage and attack success rates with competitive task performance.

Inference Economics · AI Safety Research · KV Cache · representation-level sensitive information leakage · LCGuard

4 · arXiv · cs.CL · 47h ago

Survey proposes four-layer architecture for token-operations-oriented LLM inference optimization

A new arXiv preprint introduces a four-layer technical architecture—Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion—for systematically organizing LLM inference optimization techniques. The paper reviews key technologies and industry status at each layer and analyzes their application in real-world business scenarios. The framing around 'token operations' positions inference optimization as an operational discipline analogous to traditional IT operations.

Training Infrastructure · Inference Economics · Token-Operations-Oriented Inference Optimization Techniques for Large Models

6 · arXiv · cs.CL · 17d ago

Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer

Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric called Cross Activation Shift Distance to quantify the relationship between different unlearning interventions. The finding opens a potential defensive strategy where defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.

AI Safety Research · Alignment and RLHF · Cross Activation Shift Distance · Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

6 · The Batch · 28d ago

Google Study Shows LLM-Generated Malware Is Getting Harder to Track and Stop

A Google security report catalogs emerging LLM-enabled cyberattack techniques including morphing malware with mutation engines, logical-flaw discovery in code, and AI-directed obfuscation networks. The report was prompted in part by a real incident where hackers used an LLM to find a zero-day in a widely used web administration tool. Separately, the UK AI Security Institute found that Claude Mythos Preview and GPT-5.5 can reliably execute attacks expected to take humans 3 hours, up from earlier 1-hour benchmarks, with performance scaling further when token limits are relaxed. The findings suggest an accelerating gap between LLM offensive capability and conventional defensive tooling.

Frontier Model Releases · Evaluation and Benchmarking · Claude Opus 4.6 · Google · UK AI Security Institute