7OpenAI Blog·1mo ago

Improving instruction hierarchy in frontier LLMs

OpenAI introduces IH-Challenge, a training approach designed to improve instruction hierarchy (IH) in large language models. The method trains models to correctly prioritize trusted instructions over untrusted ones, enhancing safety steerability and resistance to prompt injection attacks. This work addresses a core alignment challenge in deployed LLM systems where conflicting instructions from different principals must be handled reliably.

AI Safety Research Agent and Tool Ecosystem Alignment and RLHF prompt injection Instruction Hierarchy IH-Challenge OpenAI

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Read asBeginner In-depth

prompt injectionConcept

Prompt Injection: The Security Threat Hiding in Plain Text

Read asBeginner

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Related events (8)

7Openai Blog·1mo ago·source ↗

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI published research on the 'instruction hierarchy,' a training approach that teaches LLMs to prioritize instructions based on their source privilege level (system prompt > user > third-party). The method aims to make models more robust against prompt injection, jailbreaks, and adversarial instruction overrides. By training models to recognize and respect a hierarchy of instruction authority, OpenAI seeks to reduce the attack surface for multi-agent and deployed LLM systems.

AI Safety Research Enterprise Deployment Patterns prompt injection Instruction Hierarchy OpenAI +3 more

8Openai Blog·1mo ago·source ↗

Aligning language models to follow instructions

OpenAI published a blog post describing their work on aligning language models to follow human instructions, corresponding to the InstructGPT research. This work introduced reinforcement learning from human feedback (RLHF) as a core technique for training models to be more helpful, honest, and aligned with user intent. The approach demonstrated that smaller instruction-tuned models could outperform larger base models on human preference evaluations, marking a foundational shift in how language models are trained and deployed.

Frontier Model Releases Alignment and RLHF GPT-3 Reinforcement Learning from Human Feedback OpenAI +1 more

5Hugging Face Blog·1mo ago·source ↗

An Introduction to AI Secure LLM Safety Leaderboard

Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.

Evaluation and Benchmarking AI Safety Research LLM Safety Leaderboard Hugging Face DecodingTrust

7Openai Blog·1mo ago·source ↗

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.

Frontier Model Releases AI Safety Research Reinforcement Learning from Human Feedback OpenAI deliberative alignment +2 more

9Openai Blog·1mo ago·source ↗

Learning to Reason with LLMs

OpenAI announced a new model or capability focused on reasoning in large language models, published on September 12, 2024. The post, hosted on the OpenAI blog, describes advances in training LLMs to perform complex multi-step reasoning. This likely corresponds to the release of the o1 (formerly 'Strawberry') model series, which uses chain-of-thought reasoning trained via reinforcement learning to achieve significantly improved performance on math, science, and coding benchmarks.

Frontier Model Releases Evaluation and Benchmarking Chain-of-Thought Reasoning Reinforcement Learning OpenAI +3 more

6arXiv · cs.CL·11d ago·source ↗

Gravity-Weighted DPO enforces multi-level instruction hierarchies in LLMs

Researchers introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective that scales per-sample loss offsets by the structural distance between conflicting instruction levels, addressing the problem of uniform architectural privilege across trust levels in production LLMs. The work formalizes a 5-level instruction hierarchy with ten pairwise priority relations and combines GW-DPO with hierarchy-specific delimiter tokens and Instructional Segment Embeddings (ISE). Evaluated on Llama-3.1-8B-Instruct, the bilateral GW-DPO schedule Pareto-improves over standard DPO on macro pairwise priority adherence while cutting over-refusal rates in half. The approach directly targets prompt injection vulnerabilities arising from models' inability to resolve competing instructions by privilege level.

AI Safety Research Agent and Tool Ecosystem Instructional Segment Embeddings Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization Llama3-8B-Instruct +3 more

5arXiv · cs.CL·9d ago·source ↗

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking Alignment and RLHF On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

6arXiv · cs.CL·46h ago·source ↗

Activation-space directions for detecting and mitigating emergent misalignment across LLM families

Researchers fine-tuned four small instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3B) on insecure code to induce emergent misalignment, then investigated whether a shared activation-space direction could detect and correct it. A difference-in-means direction achieves 99.6% separation of aligned vs. misaligned activations within each model, and causal steering by subtracting this direction reduces misaligned behavior by 21–51 points. Cross-architecture transfer via ridge regression yields large behavioral suppression but fails specificity controls, revealing a two-tier structure: within-model directions are causally specific and actionable, while cross-model directions are real but non-specific. The findings bound the utility of linear cross-architecture correction and recommend within-model probing for safety auditing.

Evaluation and Benchmarking AI Safety Research Llama 3.2 Gemma 2 Qwen2.5-1.5B +4 more